Question

I'm trying to extract information from rc-files. In these files, " characters in strings are escaped by doubling them (""), analogous to C# verbatim strings. Is there a way to extract the string?

For example, if I have the following string "this is a ""test""" I would like to obtain this is a ""test"". It also must be non-greedy (very important).

I've tried the following regular expression:

"(?<text>[^""]*(""(.|""|[^"])*)*)"

However, the performance was awful. I've based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php

Does anybody have an idea how to handle this with a regular expression?

Solution

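You've got some nested repetition quantifiers there. That can be catastrophic for performance, because on backtracking the engine may try a huge number of ways to split the same text between the inner and outer repetitions.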

Try something like this:

(?<=")(?:[^"]|"")*(?=")

Each repetition can now only consume either two quotes at once or a single non-quote character. The lookbehind and lookahead assert that the actual match is preceded and followed by a quote.

This also gets you around having to capture anything. Your desired result will simply be the full string you want (without the outer quotes).

I do not assert that the outer quotes are not doubled, because if they were, there would be no way to distinguish them from an empty string anyway.
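
For what it's worth, here is a minimal sketch of how the lookaround pattern could be used. It is written in Python as an assumption (the question does not say which regex engine is in use), and the helper name extract_inner is made up for illustration. It only looks for the first quoted string in a line.

```python
import re

# Lookaround version of the pattern: the outer quotes are asserted,
# not consumed, so the full match is the inner text itself.
INNER = re.compile(r'(?<=")(?:[^"]|"")*(?=")')

def extract_inner(line):
    """Return the contents of the first quoted string (still escaped), or None.

    Hypothetical helper for illustration only.
    """
    m = INNER.search(line)
    return m.group(0) if m else None

print(extract_inner('"this is a ""test"""'))
```

The doubled quotes are still present in the result, which is what the question asked for.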

OTHER TIPS

This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:

"Some ""escaped"" quotes"

"Some " + "escaped" + " quotes"

So this is all you need to match it:

(?:"[^"]*")+

You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").
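
A rough sketch of those steps (match, strip the outer quotes, unescape) might look like the following. Python is an assumption here, since the question does not name a language, and the function name extract_string is hypothetical.

```python
import re

# One or more simple "..." literals run together.
QUOTED = re.compile(r'(?:"[^"]*")+')

def extract_string(line):
    """Return the first quoted string in line, unescaped, or None.

    Hypothetical helper for illustration only.
    """
    m = QUOTED.search(line)
    if m is None:
        return None
    literal = m.group(0)            # includes the outer quotes
    inner = literal[1:-1]           # separate step: strip the outer quotes
    return inner.replace('""', '"') # separate step: unescape doubled quotes

print(extract_string('CAPTION "Some ""escaped"" quotes"'))
```

If you want to keep the doubled quotes, as in the question, just skip the final replace.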

I don't know if this is better or worse than m.buettner's (I'm guessing not, he seems to know his stuff), but I thought I'd throw it out there for critique.

"(([^"]+(""[^"]+"")*)*)"

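Try this: (?<=^")(.*?"{2}.*?"{2})(?="$). It may be faster than the previous two and should work without bugs, as long as the input looks like your example: a single quoted string on its own (the pattern is anchored with ^ and $), with doubled quotes inside and at the end of the content.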

  • Match a " beginning the string
  • Multiple times match a non-" or two "
  • Match a " ending the string

"([^"]|(""))*?"

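One thing worth checking with the lazy quantifier in the pattern above: when it is used to search inside a longer line, it can stop at the first quote that lets the match succeed, which is the first half of a doubled quote. A quick sketch in Python (the language and the greedy variant are my assumptions, not part of the answer above):

```python
import re

s = '"this is a ""test"""'

# Pattern as posted, with the lazy quantifier.
LAZY = re.compile(r'"([^"]|(""))*?"')
# Assumed variant: same three steps, but with a greedy quantifier.
GREEDY = re.compile(r'"(?:[^"]|"")*"')

lazy_match = LAZY.search(s).group(0)
greedy_match = GREEDY.search(s).group(0)

print(lazy_match)    # stops at the first half of the doubled quote
print(greedy_match)  # covers the whole literal
```

Since each step consumes either a non-quote or a pair of quotes, the greedy version cannot run past the closing quote, so laziness is not needed for the non-greedy requirement in the question.
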
Licensed under: CC-BY-SA with attribution