Question

I'm trying to extract quoted strings using a regular expression.

String regexp = "('([^\\\\']+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*'|\"([^\\\\\"]+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*\")";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(source); 
while (m.find()) {
    String newElement = m.group(1);
    //...
}

It works well, but if the source text contains

' onkeyup="this.value = this.value.replace (/\D/, \'\')">'

the program goes into an endless loop.

How can I correctly get this string?

For example, I have this text (PHP code):

'qty'=>'<input type="text" maxlength="3" class="qty_text" id='.$key.' value ='

The result should be

'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='

Solution

Your regex seems to work fine when presented with a string it can match; it's when it can't match that it appears to hang. (In this case it's the \D that makes it choke, because \D isn't one of the escapes your regex allows.) That regex is also much more complicated than it needs to be: you're trying to match string literals, not validate them. Here's the quintessential regex for a string literal in C-style languages:

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"

...and the single-quoted version, for languages that support that style:

'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'

It uses Friedl's "unrolled loop" technique for maximum efficiency. Here's the Java code for it, as generated by RegexBuddy 4:

Pattern regex = Pattern.compile(
    "\"[^\"\\\\\r\n]*(?:\\\\.[^\"\\\\\r\n]*)*\"|'[^'\\\\\r\n]*(?:\\\\.[^'\\\\\r\n]*)*'"
);
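
As a rough sketch of how that pattern behaves on the two inputs from the question, the snippet below collects every match into a list. The class name QuotedStringDemo and the helper findQuoted are illustrative names, not part of the original answer. The line breaks in the character classes are written as the regex escapes \r and \n instead of raw characters, which should be equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringDemo {
    // Unrolled-loop pattern for double- or single-quoted literals.
    // \\r and \\n reach the regex engine as escapes, not raw line breaks.
    private static final Pattern QUOTED = Pattern.compile(
        "\"[^\"\\\\\\r\\n]*(?:\\\\.[^\"\\\\\\r\\n]*)*\""
        + "|'[^'\\\\\\r\\n]*(?:\\\\.[^'\\\\\\r\\n]*)*'");

    // Hypothetical helper: returns every quoted literal found in source.
    public static List<String> findQuoted(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String php = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
        // \\' in Java source is a literal backslash followed by a quote.
        String js = "' onkeyup=\"this.value = this.value.replace (/\\D/, \\'\\')\">'";
        System.out.println(findQuoted(php));
        System.out.println(findQuoted(js));
    }
}
```

The PHP line should yield the three literals asked for in the question, and the onkeyup line should come back as a single literal, because the pattern accepts any backslash-escaped character.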

OTHER TIPS

Maybe I misunderstand the principle, but that looks rather trivial now that you added the example.

Consider this for instance:

String input = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
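// Note: in a Java string literal, \' is simply ', so this test input ends in '' rather than the \'\' from the question.
String otherInput = "' onkeyup=\"this.value = this.value.replace (/\\D/, \'\')\">'";
// Match everything from one single quote to the next one (quotes included),
// using a reluctant quantifier.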
Pattern p = Pattern.compile("'.+?'");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println(m.group());
}
m = p.matcher(otherInput);
System.out.println();
while (m.find()) {
    System.out.println(m.group());
}

Output:

'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='

' onkeyup="this.value = this.value.replace (/\D/, '
')">'

See the Java Pattern documentation for more detailed explanations.

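The character classes that match neither backslashes nor quotes shouldn't be followed by a +. Remove those + quantifiers to fix the hang, which was due to catastrophic backtracking: with a + nested inside the outer *, a run of ordinary characters can be split up in exponentially many ways once the overall match fails.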

Also, your original regex wasn't recognizing \D as a valid backslash escape - therefore the string constant in your test input containing \D wasn't being matched. If you make the rules of your regex more liberal to recognize any character immediately following a backslash as part of the string constant, it will behave the way you expect.

"('([^\\\\']|\\\\.)*'|\"([^\\\\\"]|\\\\.)*\")"

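To illustrate the simplified pattern above, here is a small sketch that runs it on the problematic onkeyup string and on an unterminated literal. The class name LiberalEscapeDemo, the helper findQuoted, and the unterminated test input are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiberalEscapeDemo {
    // One ordinary character at a time, or a backslash followed by any
    // character; no + inside the repeated group.
    private static final Pattern QUOTED = Pattern.compile(
        "('([^\\\\']|\\\\.)*'|\"([^\\\\\"]|\\\\.)*\")");

    // Hypothetical helper: returns group 1 of every match in source.
    public static List<String> findQuoted(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(source);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The input from the question, including \D and the escaped quotes.
        String js = "' onkeyup=\"this.value = this.value.replace (/\\D/, \\'\\')\">'";
        System.out.println(findQuoted(js));

        // Made-up unterminated literal: nothing matches, and the search
        // gives up after a linear amount of backtracking.
        StringBuilder unterminated = new StringBuilder("'");
        for (int i = 0; i < 200; i++) {
            unterminated.append('a');
        }
        System.out.println(findQuoted(unterminated.toString()));
    }
}
```

The second call is the interesting one: with the nested + removed there is only one way to split the run of ordinary characters, so a failed match should return promptly.
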
You can do it all in one line using split() with the right regex:

String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");

There's a reasonable amount of regex kung fu going on here, so I'll break it down:

  • The delimiter is the text between a closing quote and the next opening quote, but it cannot include the quotes themselves because split() consumes the delimiter. A look-behind (?<=') and a look-ahead (?='), which are non-consuming, are therefore used to match the quotes instead of literal quotes in the regex.
  • A reluctant match .*? for the characters between the quotes ensures that it stops at the next quote (instead of matching through to the last quote).
  • I added an alternative for end of input to the look-ahead, (?='|$), in case there's no trailing close quote.
  • Saving the best for last, the key to making this all work is the negative look-behind (?<!\\G.), which means "don't match right after the end of the previous match" and ensures the next match advances past the end of the previous delimiter; without it you would end up with just the quote characters in your array. \G matches the end of the previous match, but also matches the start of input for the first match, so it neatly handles not matching on the first quote. That makes the delimiter sit between an even and an odd quote instead of between an odd and an even one, as it would otherwise.
  • To cater for the input's first character not being a quote, you need to strip off the leading characters before splitting; that's why the replaceAll() is needed.

Here's some test code using your sample input:

String source = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");
System.out.println(Arrays.toString(array));

Output:

['qty', '<input type="text" maxlength="3" class="qty_text" id=', ' value =']
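
As a further sketch, the same one-liner can be wrapped in a method and tried on an input that does not start with a quote, which is the case the replaceAll() step is there for. The class name SplitQuotedDemo, the method quotedParts, and the sample input are made up for this illustration.

```java
import java.util.Arrays;

public class SplitQuotedDemo {
    // Hypothetical wrapper around the split() one-liner shown above.
    public static String[] quotedParts(String source) {
        return source
            .replaceAll("^[^']+", "")              // drop everything before the first quote
            .split("(?<!\\G.)(?<=').*?(?='|$)");   // split on the text between quoted strings
    }

    public static void main(String[] args) {
        // Made-up PHP-like input with leading and trailing non-quoted text.
        String source = "$a = 'x' . 'y';";
        System.out.println(Arrays.toString(quotedParts(source)));
    }
}
```

Here the leading "$a = " is stripped first, so the first quote sits at the start of the input where \G can protect it; the trailing ";" is consumed as a delimiter through the $ alternative.
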
Licensed under: CC-BY-SA with attribution