Regular expression to find unescaped double quotes in CSV file

https://stackoverflow.com/questions/1601780

csv
regex

05-07-2019

Question

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?

Not a match:

"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"

Match:

"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"

Solution

Try this:

(?m)""(?![ \t]*(,|$))

Explanation:

(?m)       // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
""         // match two successive double quotes
(?!        // start negative look ahead
  [ \t]*   //   zero or more spaces or tabs
  (        //   open group 1
    ,      //     match a comma 
    |      //     OR
    $      //     the end of the line or string
  )        //   close group 1
)          // stop negative look ahead

So, in plain English: "match two successive double quotes, but only if they are NOT followed by a comma or the end of the line, optionally with spaces or tabs in between".

(i) in addition to their normal meaning as the start-of-string and end-of-string metacharacters.
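
As a rough check, the pattern above can be run against the sample lines from the question. This is only a sketch in Ruby: the helper name unescaped_pair? is made up for illustration, the (?m) prefix is dropped because Ruby's ^ and $ already work per line, and the inner group is written as non-capturing.

```ruby
# Sketch: the lookahead pattern from the answer, applied to one line at a time.
# (?m) is omitted on purpose: in Ruby, ^ and $ already match at line
# boundaries, and (?m) would only change how . treats newlines.
def unescaped_pair?(line)
  line.match?(/""(?![ \t]*(?:,|$))/)
end

samples = [
  '"asdf","asdf"',        # question says: not a match
  '"", "asdf"',           # not a match
  '"asdf", ""',           # not a match
  '"adsf", "", "asdf"',   # not a match
  '"asdf""asdf", "asdf"', # match
  '"asdf", """asdf"""',   # match
  '"asdf", """"'          # match
]

samples.each do |line|
  puts "#{line} => #{unescaped_pair?(line)}"
end
```

Under these assumptions the first four lines should report false and the last three true, which is the split the question asks for.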

OTHER TIPS

Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires lookbehind and lookahead, and engines do not all support these in the same way.

My answer uses the Ruby engine. The check itself is a single regex, but I include the whole program here to explain it better.

NOTE that, due to the Ruby regex engine (or the limits of my knowledge), a lookahead/lookbehind with optional parts is not possible. So I first have to deal with a small problem: spaces before and after a comma.

Here is my code:

orgTexts = [
    '"asdf","asdf"',
    '"", "asdf"',
    '"asdf", ""',
    '"adsf", "", "asdf"',
    '"asdf""asdf", "asdf"',
    '"asdf", """asdf"""',
    '"asdf", """"'
]

orgTexts.each{|orgText|
    # Preprocessing: remove spaces before and after each separating comma.
    # This is needed if the input may have spaces around a valid comma.
    orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')

    # Detect valid character (non-quote and valid quote)
    resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
    # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
    # [^\"]       ===> a non-quote character
    # |           ===> or
    # ^\"         ===> a quote at the beginning
    # |           ===> or
    # \"$         ===> a quote at the end
    # |           ===> or
    # (?<=,)\"    ===> a quote just after a comma
    # |           ===> or
    # \"(?=,)     ===> a quote just before a comma
    # |           ===> or
    # (?<=\\\\)\" ===> an escaped quote

    # This part shows the invalid (unescaped) quotes
    puts orgText
    puts resText.gsub(Regexp.new('"'), '^')

    # This part determines whether there are any unescaped quotes.
    # This is the actual matching; use it alone if you don't need to know which quote is unescaped.
    isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/) != 0).to_s
    # Basically, the whole text from start to end (^...$) must consist only of valid characters.
    # =~ returns 0 when it does, so any other result means an unescaped quote was found.

    puts orgText + ": " + isMatch
    puts
    puts
}

When executed the code prints:

"asdf","asdf"
-------------
"asdf","asdf": false


"","asdf"
---------
"","asdf": false


"asdf",""
---------
"asdf","": false


"adsf","","asdf"
----------------
"adsf","","asdf": false


"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true


"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true


"asdf",""""
--------^^-
"asdf","""": true

I hope this gives you some ideas that you can use with other engines and languages.

".*"(\n|(".*",)*)

should work, I guess...

For single-line matches:

^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"

or for multi-line:

(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"

Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.
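
To illustrate how the single-line pattern might be used on a whole file, here is a small Ruby sketch that applies it line by line and collects the numbers of the offending lines. The helper name lines_with_doubled_quotes and the sample text are assumptions made for this example; they are not part of the answer.

```ruby
# Hypothetical helper: returns the 1-based numbers of the lines in which a
# quoted field contains a doubled quote, using the single-line pattern above.
def lines_with_doubled_quotes(text)
  pattern = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/
  hits = []
  text.each_line.with_index(1) do |line, number|
    hits << number if line.match?(pattern)
  end
  hits
end

# Sample input built from the lines in the question.
csv_text = <<~CSV
  "asdf","asdf"
  "asdf""asdf", "asdf"
  "asdf", ""
  "asdf", """"
CSV

p lines_with_doubled_quotes(csv_text)
```

Checking one line at a time avoids the \r\n handling of the multi-line variant, at the cost of assuming that no field contains an embedded line break.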

Try this regular expression:

"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"

That will match any quoted string with at least one pair of unescaped double quotes.
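
Because this pattern matches the whole quoted field, it can also be used to pull out the offending fields rather than only flag the line. A minimal Ruby sketch, with the helper name fields_with_doubled_quotes invented for illustration:

```ruby
# Hypothetical helper: returns every quoted field in the line that contains
# at least one doubled quote, using the pattern from the answer above.
def fields_with_doubled_quotes(line)
  line.scan(/"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/)
end

puts fields_with_doubled_quotes('"asdf""asdf", "asdf"')  # the field "asdf""asdf"
puts fields_with_doubled_quotes('"asdf", """asdf"""')    # the field """asdf"""
p fields_with_doubled_quotes('"asdf", ""')               # no such field
```

Since the pattern contains no capturing groups, scan returns the full text of each match, so the result is a plain array of field strings.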

Licensed under: CC-BY-SA with attribution