Regular expression to find unescaped double quotes in a CSV file

https://stackoverflow.com/questions/1601780

csv
regex

05-07-2019

Question

What regular expression would find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?

Not a match:

"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"

Match:

"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"

Solution

Try this:

(?m)""(?![ \t]*(,|$))

Explanation:

(?m)       // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
""         // match two successive double quotes
(?!        // start negative look ahead
  [ \t]*   //   zero or more spaces or tabs
  (        //   open group 1
    ,      //     match a comma 
    |      //     OR
    $      //     the end of the line or string
  )        //   close group 1
)          // stop negative look ahead

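So, in plain English: "match two successive double quotes, but only if they are NOT followed by a comma or the end of the line, with optional spaces and tabs in between".

(i) in addition to being the normal start-of-string and end-of-string meta characters.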
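
As a quick check, the pattern can be run against the sample lines from the question. This is a minimal sketch in Ruby, not part of the original answer; the constant `UNESCAPED_PAIR` and the method `unescaped_pair?` are names made up for illustration. In Ruby, `^` and `$` already match at line boundaries, so the `(?m)` flag is left out (in Ruby it would only change how `.` behaves).

```ruby
# Pattern from the answer above, minus (?m): Ruby's ^ and $ are always line anchors.
UNESCAPED_PAIR = /""(?![ \t]*(?:,|$))/

# Hypothetical helper: true when the line contains an unescaped pair of quotes.
def unescaped_pair?(line)
  line.match?(UNESCAPED_PAIR)
end

# Sample lines taken from the question.
expected_clean = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"']
expected_bad   = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"']

(expected_clean + expected_bad).each do |line|
  puts "#{line}: #{unescaped_pair?(line)}"
end
```

The lines in `expected_clean` should print `false` and the lines in `expected_bad` should print `true`.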

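Other tips

Because of the complexity of the problem, the solution depends on the engine you are using. Solving it requires look-behind and look-ahead, and engines do not all support these in the same way.

My answer uses the Ruby engine. The check itself is just ONE regex, but I include the whole code here to explain it better.

Note that, because of the Ruby regex engine (or the limits of my knowledge), an optional pattern inside a look-ahead/look-behind is not possible. Spaces before and after a comma are therefore a small problem, which I handle in a preprocessing step.

Here is my code: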

org_texts = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
]

org_texts.each do |org_text|
  # Preprocessing: remove spaces around the commas that separate columns.
  # Needed only if a valid comma may have spaces before or after it.
  org_text = org_text.gsub(/" *, *"/, '","')

  # Replace every valid character (a non-quote or a valid quote) with '-'.
  res_text = org_text.gsub(/[^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)"/, '-')
  # Alternative form: /[^"]|(^|(?<=,)|(?<=\\))"|"($|(?=,))/
  # The alternatives, joined by | ("or"):
  # [^"]      ===> a non-quote
  # ^"        ===> opening quote at the start of the line
  # "$        ===> closing quote at the end of the line
  # (?<=,)"   ===> quote just after a comma
  # "(?=,)    ===> quote just before a comma
  # (?<=\\)"  ===> quote escaped with a backslash

  # Show where the invalid, unescaped quotes are (marked with '^').
  puts org_text
  puts res_text.gsub('"', '^')

  # The actual check. Use only this if you don't need to know which quote is unescaped.
  # The line is clean when it consists of valid characters from start to end (^...$);
  # =~ then returns 0, so is_match is true only when an unescaped quote exists.
  is_match = (org_text =~ /^([^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)")*$/) != 0

  puts "#{org_text}: #{is_match}"
  puts
  puts
end

Running the code prints:

"asdf","asdf"
-------------
"asdf","asdf": false


"","asdf"
---------
"","asdf": false


"asdf",""
---------
"asdf","": false


"adsf","","asdf"
----------------
"adsf","","asdf": false


"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true


"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true


"asdf",""""
--------^^-
"asdf","""": true

I hope this gives you an idea that you can carry over to other engines and languages.

".*"(\n|(".*",)*)

I think this should work...

For single-line matches:

^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"

Or for multi-line:

(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"

Edit/Note: Depending on the regex engine used, you could use lookbehinds and other features to make the regex leaner. But this should work fine in most regex engines.
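
One possible way to use the single-line pattern is to test each line of a multi-line string separately and report which lines contain an offending column. The sketch below assumes Ruby; `ROW_WITH_BAD_COLUMN`, `offending_line_numbers`, and the sample data are illustrative names and values, not part of the original answer.

```ruby
# Single-line pattern from the answer above: zero or more clean quoted
# columns, then a quoted column that contains "" somewhere inside it.
ROW_WITH_BAD_COLUMN = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/

# Hypothetical helper: returns the 1-based numbers of the offending lines.
def offending_line_numbers(text)
  numbers = []
  text.each_line.with_index(1) do |line, number|
    # chomp drops the trailing newline so each line is tested on its own
    numbers << number if line.chomp.match?(ROW_WITH_BAD_COLUMN)
  end
  numbers
end

sample = <<~CSV
  "asdf","asdf"
  "", "asdf"
  "asdf""asdf", "asdf"
  "asdf", """"
CSV

p offending_line_numbers(sample)
```

Testing line by line sidesteps the `(^|\r\n)` prefix of the multi-line variant, at the cost of not supporting quoted columns that span several lines.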

Try this regular expression:

"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"

It matches any quoted string that contains at least one pair of unescaped double quotes.
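
Because this pattern is not anchored, it can also be used to pull out the offending columns themselves. A minimal Ruby sketch, assuming the pattern is used unchanged; `QUOTED_WITH_PAIR` and `columns_with_pair` are hypothetical names chosen for this example.

```ruby
# Pattern from the answer above: one quoted string that contains at least
# one "" pair; backslash escapes (\.) are skipped over.
QUOTED_WITH_PAIR = /"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/

# Hypothetical helper: returns every quoted column that contains a "" pair.
# The pattern has no capture groups, so scan returns the full matches.
def columns_with_pair(line)
  line.scan(QUOTED_WITH_PAIR)
end

p columns_with_pair('"asdf""asdf", "asdf"')
p columns_with_pair('"asdf", ""')
```

For the first sample line this should return only the first column; for the second it should return an empty array, since an empty column `""` is not an unescaped pair.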

License: CC-BY-SA with attribution

Not affiliated with StackOverflow