Regular expression to find unescaped double quotes in a CSV file

https://stackoverflow.com/questions/1601780

csv
regex

05-07-2019

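Question

What regular expression would find sets of 2 unescaped double quotes contained in columns that are delimited by double quotes in a CSV file?

Not a match: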

"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"

Match:

"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"

Solution

Try this:

(?m)""(?![ \t]*(,|$))

Explanation:

(?m)       // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
""         // match two successive double quotes
(?!        // start negative look ahead
  [ \t]*   //   zero or more spaces or tabs
  (        //   open group 1
    ,      //     match a comma 
    |      //     OR
    $      //     the end of the line or string
  )        //   close group 1
)          // stop negative look ahead

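So, in plain English: "match two successive double quotes, but only if they are not followed by a comma or the end of the line, with optional spaces and tabs in between".

(i) in addition to being the normal start-of-string and end-of-string metacharacters.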
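
As a quick check, the pattern above can be run against the sample lines from the question. This is only a minimal Ruby sketch, not part of the original answer; the names `UNESCAPED_PAIR` and `unescaped_pair?` are assumptions made for illustration. In Ruby, `^` and `$` already match at line boundaries, so the `(?m)` flag is left out (in Ruby that flag only changes how `.` treats newlines).

```ruby
# Hypothetical names; the pattern is the one from the answer above, minus (?m),
# which Ruby does not need because $ always matches at the end of a line.
UNESCAPED_PAIR = /""(?![ \t]*(?:,|$))/

# Returns true when the line contains two successive double quotes that are
# not followed (after optional spaces/tabs) by a comma or the end of the line.
def unescaped_pair?(line)
  line.match?(UNESCAPED_PAIR)
end

samples = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
]

samples.each do |line|
  puts "#{line}: #{unescaped_pair?(line)}"
end
```

With the question's data, the first four lines should report `false` and the last three `true`.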

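Other answers

Because of the complexity of the problem, the solution depends on the engine you are using. Solving it requires lookbehind and lookahead, and each engine handles those differently.

My answer uses the Ruby engine. The check itself is a single regex, but I include the complete code here to explain it better.

Note that, due to the Ruby regex engine (or the limits of my knowledge), an optional lookahead/lookbehind is not possible. So I need a small workaround for spaces before and after a comma: they are removed in a preprocessing step.

Here is my code: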

orgTexts = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
]

orgTexts.each { |orgText|
  # Preprocessing: eliminate spaces before and after a comma.
  # Needed if there may be spaces before and after a valid comma.
  orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')

  # Replace every valid character (non-quote or valid quote) with '-'
  resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
  # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
  # [^\"]       ===> a non-quote
  # |           ===> or
  # ^\"         ===> beginning quote
  # |           ===> or
  # \"$         ===> ending quote
  # |           ===> or
  # (?<=,)\"    ===> quote just after a comma
  # \"(?=,)     ===> quote just before a comma
  # (?<=\\\\)\" ===> escaped quote

  # This part shows the invalid, non-escaped quotes
  puts orgText
  puts resText.gsub(Regexp.new('"'), '^')

  # This part determines whether there are non-escaped quotes.
  # This is the actual matching; use it alone if you don't need to know which quote is unescaped.
  isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/) != 0).to_s
  # Basically, it matches from start to end (^...$) only if every character is valid

  puts orgText + ": " + isMatch
  puts
}

Running the code prints:

"asdf","asdf"
-------------
"asdf","asdf": false

"","asdf"
---------
"","asdf": false

"asdf",""
---------
"asdf","": false

"adsf","","asdf"
----------------
"adsf","","asdf": false

"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true

"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true

"asdf",""""
--------^^-
"asdf","""": true

I hope this gives you some ideas that you can apply with other engines and languages.
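
The masking idea in this answer (replace every valid character, then see which quotes survive) can also be packaged as a small helper that returns the positions of the offending quotes. The following is a hedged Ruby sketch rather than the answerer's code: `VALID_CHAR` and `unescaped_quote_positions` are hypothetical names, and the reported indices refer to the text after spaces around separating commas have been removed.

```ruby
# Characters treated as valid: any non-quote, a quote at the start or end of
# the line, a quote next to a comma, or a quote preceded by a backslash.
VALID_CHAR = /[^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)"/

# Hypothetical helper: returns the indices (in the normalized text) of the
# quotes that VALID_CHAR does not accept.
def unescaped_quote_positions(text)
  normalized = text.gsub(/" *, *"/, '","')   # drop spaces around separators
  masked = normalized.gsub(VALID_CHAR, '-')  # only invalid quotes survive
  masked.each_char.with_index.select { |char, _| char == '"' }.map(&:last)
end

p unescaped_quote_positions('"asdf""asdf", "asdf"')  # => [5, 6]
p unescaped_quote_positions('"asdf", """asdf"""')    # => [8, 9, 14, 15]
p unescaped_quote_positions('"asdf","asdf"')         # => []
```

An empty array plays the role of `false` in the output above; a non-empty one says where the `^` markers would be printed.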

".*"(\n|(".*",)*)

should work, I guess...

For a single-line match:

^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"

Or multi-line:

(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"

Edit/Note: Depending on the regex engine used, you could use lookbehinds and other features to make the regex leaner. But this should work in most regex engines.
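
One way to use the single-line pattern on a whole file is to apply it line by line and collect the numbers of the lines that match. The Ruby sketch below assumes that approach; `DOUBLED_QUOTE_LINE` and `offending_line_numbers` are illustrative names, not part of the answer.

```ruby
# Single-line pattern from the answer above.
DOUBLED_QUOTE_LINE = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/

# Hypothetical helper: returns the 1-based numbers of the lines that contain
# a quoted column with an embedded pair of double quotes.
def offending_line_numbers(csv_text)
  csv_text.each_line(chomp: true).with_index(1)
          .select { |line, _| line.match?(DOUBLED_QUOTE_LINE) }
          .map(&:last)
end

csv_text = <<~CSV
  "asdf","asdf"
  "", "asdf"
  "asdf", ""
  "adsf", "", "asdf"
  "asdf""asdf", "asdf"
  "asdf", """asdf"""
  "asdf", """"
CSV

p offending_line_numbers(csv_text)  # => [5, 6, 7]
```

Checking each line separately sidesteps the question of whether `\s*` or `[^"]*` may run across a line break, which is what the multi-line variant handles by excluding `\r` and `\n`.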

Try this regular expression:

"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"

This will match any quoted string containing at least one pair of unescaped double quotes.
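
Because this pattern matches the whole quoted string, it can be used to pull out the offending column rather than just flag the line. A minimal Ruby sketch, assuming the hypothetical names `QUOTED_WITH_PAIR` and `first_field_with_pair`:

```ruby
# Pattern from the answer above: a quoted string that contains at least one "" pair.
QUOTED_WITH_PAIR = /"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/

# Hypothetical helper: returns the first quoted column containing a doubled
# quote, or nil when the line has none.
def first_field_with_pair(line)
  line[QUOTED_WITH_PAIR]
end

puts first_field_with_pair('"asdf""asdf", "asdf"')  # prints "asdf""asdf"
puts first_field_with_pair('"asdf", """asdf"""')    # prints """asdf"""
p first_field_with_pair('"asdf","asdf"')            # prints nil
```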

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow