Why does StackOverflowError occur?
In reference implementation of Pattern
class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion1 when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError
when the input string is long enough.
1 Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points (which basically converts recursive solution to iterative solution with stack).
Solution
The following regex should work:
"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"
Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.
The raw regex when you print the string out. My explanation will be based on the raw regex.
(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++
Explanation
Since you only care about what the whole regex matches, all the capturing groups (pattern)
has been turned into non-capturing group (?:pattern)
for efficiency.
The first alternative "(?:[^"\\]++|\\.)*+"
matches a quoted string.
The second alternative [^ "]++
matches a sequence of character that doesn't contain space and double quote "
.
(?:
" # Double quote
(?:
[^"\\]++ # A sequence of characters that are not " and \
| # OR
\\. # Escape sequence: \ followed by any character (except line terminators)
)*+ # Match 0 or more of the sequences above (allows empty string)
" # Double quote
|
[^ "]++
)++
Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern
class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError
will not occur.
I remove the need for backtracking by writing the regex so that it matches the correct string on first try:
Since
[^"\\]
excludes the\
, there is no way we can "steal" a\
from an escaping sequence, or "steal" a"
and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here[^"\\]++
. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.Since both
[^"\\]++
and\\.
can't "steal" a"
and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here(?:[^"\\]++|\\.)*+
[^ "]
can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.Since
"(?:[^"\\]++|\\.)*+"
and[^ "]++
can't mess up the match of each other, we can make the outer most quantifier possessive.
Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$
against inputs such as b:ab:a
. The first iteration will match b:ab
, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a
and then successfully match the whole string.