I came across the following regex at work. What does it do?

,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))

To understand it, I split it into the following parts:

  • , = Match a literal ,

  • (?= = Followed by

  • (?:[^\"]*\"[^\"]*\")* = Any run of characters other than ", followed by ", followed by any run of characters other than ", followed by ", with the whole group repeated zero or more times. For example, 1111"aaaaa"

  • (?![^\"]*\") = BUT not followed by any run of characters other than " and then a "

In other words, my first reading was: match a , that is followed by something like 11111"111" or by ""

The expression is used here simply to tokenize a string separated by ,, but I assume the author built it for something more general.

Can anyone provide a simpler explanation than the above?

The expression is passed to boost::regex().

UPDATE: Actually, it is searching for commas (,) with the following constraints:

  1. It is okay for an even number of " to follow the comma

  2. BUT it is NOT okay for an odd number of " (for example, a single ") to follow the comma

For example, consider the string: a, "h,w", 23

The first , is matched because an even number of " follow it (the two around "h,w").

The second , (the one inside "h,w") is NOT matched: only a single " follows it, so the negative lookahead (?![^\"]*\") fails.

Finally, the last , is matched because no " follows it at all.

The final result is two matches: the first and the last comma.
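
A quick way to check the walkthrough above is to list where the pattern matches. This is a minimal sketch using Python's re module rather than boost::regex (an assumption made for brevity; both support lookaheads), and comma_positions is a hypothetical helper name, not something from the original code.

```python
import re

# Same pattern as in the question; \" is simply an escaped " here.
PATTERN = re.compile(r',(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))')

def comma_positions(text):
    """Return the index of every comma the pattern matches."""
    return [m.start() for m in PATTERN.finditer(text)]

# Index 1 is the first comma and index 8 is the last one;
# the comma at index 5 (inside "h,w") is skipped.
print(comma_positions('a, "h,w", 23'))  # [1, 8]
```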

Solution

It looks like it will match any comma, but only if there are an even number of " characters after that comma.

, - A comma.

(?= - Followed by...

(?:[^\"]*\"[^\"]*\")* - Any string ending in a " mark and containing an even total number of " marks, or the empty string,

(?![^\"]*\") - and there is no other " mark later on in the string.

) - Closes the (?= lookahead.
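
The same breakdown can be written as an annotated pattern. This is only a sketch in Python's re module with the re.VERBOSE flag (not boost::regex, which the question uses); the name COMMA_OUTSIDE_QUOTES is made up for illustration, and the backslashes before " are dropped because \" and " match the same character in a Python regex.

```python
import re

COMMA_OUTSIDE_QUOTES = re.compile(r'''
    ,                      # a literal comma
    (?=                    # followed by...
        (?:[^"]*"[^"]*")*  # zero or more chunks, each containing exactly two " marks
        (?![^"]*")         # ...with no further " mark after those chunks
    )                      # end of the lookahead
''', re.VERBOSE)

text = '25,"Hello, world!","More text",123.45'
matches = [m.start() for m in COMMA_OUTSIDE_QUOTES.finditer(text)]
print(len(matches))  # 3
```

The lookahead consumes nothing, so each match is just the comma itself.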

This could be useful if we already know that the entire input string has an even total of " characters, there's no such thing as nesting or escaping quotes, and commas between quote marks should not be treated as delimiters. For example, given the input

25,"Hello, world!","More text",123.45

the regex should not match the comma between Hello and world, but should match the other three commas.
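
As a rough illustration of that use, here is a sketch that splits the example line on the matched commas. It uses Python's re module as a stand-in for boost::regex, and split_fields is a hypothetical helper name. The second call shows what happens when the "even total of quotes" assumption does not hold.

```python
import re

# Same pattern as in the question; \" is simply an escaped " here.
PATTERN = re.compile(r',(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))')

def split_fields(line):
    """Split on commas that have an even number of " after them."""
    return PATTERN.split(line)

print(split_fields('25,"Hello, world!","More text",123.45'))
# ['25', '"Hello, world!"', '"More text"', '123.45']

# With an unbalanced quote the even-count assumption breaks down:
# the first comma is skipped and the one after the lone " is matched.
print(split_fields('a,"b,c'))
# ['a,"b', 'c']
```

Note that the quote marks stay in the resulting fields; stripping them would be a separate step.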

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow