Question

I want to construct a regex that matches either ' or ", then matches other characters, and ends when the same kind of quote that opened the match is encountered again. This seems simple enough to solve with a backreference at the end; here is my regex (it's in Java, so mind the extra escape characters such as the \ before the "):

private static String seekerTwo = "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)";

This code will successfully deal with things such as:

"hello my name is bob"
'i live in bethnal green'

The trouble comes when I have a String like this:

"hello this seat 'may be taken' already"

Using the above regex on it fails on the initial part as soon as the ' is encountered; the regex then carries on and successfully matches 'may be taken'. That is obviously insufficient: I need the whole String to be matched.

What I'm thinking is that I need a way to allow the type of quotation mark that was NOT matched by the first group, by including it in the character set of the 3rd group. However, I know of no way to do this. Is there some sort of sneaky NOT-backreference function? Something I can use to reference the character that the 1st group did NOT match? Or otherwise some other solution to my predicament?

Solution

This can be done using negative lookahead assertions. The following solution even takes into account that you could escape a quote inside a string:

(["'])(?:\\.|(?!\1).)*\1

Explanation:

(["'])    # Match and remember a quote.
(?:       # Either match...
 \\.      # an escaped character
|         # or
 (?!\1)   # (unless that character is identical to the quote character in \1)
 .        # any character
)*        # any number of times.
\1        # Match the corresponding quote.

This correctly matches "hello this seat 'may be taken' already" or "hello this seat \"may be taken\" already".

In Java, with all the backslashes:

Pattern regex = Pattern.compile(
    "([\"'])   # Match and remember a quote.\n" +
    "(?:       # Either match...\n" +
    " \\\\.    # an escaped character\n" +
    "|         # or\n" +
    " (?!\\1)  # (unless that character is identical to the matched quote char)\n" +
    " .        # any character\n" +
    ")*        # any number of times.\n" +
    "\\1       # Match the corresponding quote", 
    Pattern.COMMENTS);
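
As a quick check, here is a minimal sketch that runs the same pattern (written on one line, without COMMENTS mode) against the strings from the question. The class name QuotedStringDemo and the helper findQuoted are made up for illustration. Note that . does not match line breaks by default, so this sketch assumes single-line input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringDemo {
    // Regex: (["'])(?:\\.|(?!\1).)*\1  (same as above, escaped for a Java string literal)
    static final Pattern QUOTED = Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // Hypothetical helper: returns every quoted string in the input, quotes included.
    static List<String> findQuoted(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Double-quoted string containing single quotes: one match, the whole string.
        System.out.println(findQuoted("\"hello this seat 'may be taken' already\""));
        // Escaped double quotes inside a double-quoted string: still one match.
        System.out.println(findQuoted("\"hello this seat \\\"may be taken\\\" already\""));
        // Plain single-quoted string.
        System.out.println(findQuoted("'i live in bethnal green'"));
    }
}
```

m.group(1) holds the quote character that opened the match, should you need to know which kind it was.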

OTHER TIPS

Tim's solution works fairly well if you can use lookaround (which Java does support). But if you should find yourself using a language or tool that does not support lookaround, you could simply match both cases (double quoted strings and single quoted strings) separately:

"(\\"|[^"])*"|'(\\'|[^'])*'

This handles each quote type in its own alternative, but whichever alternative succeeds is returned as the whole match.
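
For reference, a minimal sketch of how that alternation might look in Java, with the backslashes doubled for the string literal. The class name AlternationDemo and the helper findQuoted are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Regex: "(\\"|[^"])*"|'(\\'|[^'])*'  (no lookaround, no backreference)
    static final Pattern QUOTED = Pattern.compile("\"(\\\\\"|[^\"])*\"|'(\\\\'|[^'])*'");

    // Hypothetical helper: returns every quoted string in the input, quotes included.
    static List<String> findQuoted(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group()); // group() is the whole match, whichever alternative hit
        }
        return result;
    }

    public static void main(String[] args) {
        // The opposite quote type is allowed inside each kind of string.
        System.out.println(findQuoted("\"hello this seat 'may be taken' already\""));
        System.out.println(findQuoted("'i live in \"bethnal\" green'"));
        // Two separate strings of different kinds on one line.
        System.out.println(findQuoted("x \"a\" y 'b'"));
    }
}
```

Since each alternative hard-codes its own quote character, there is no group telling you which kind matched; checking the first character of m.group() is one way to find out.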


HOWEVER

Both expressions can fall prey to at least one eventuality: apostrophes in ordinary prose. If you don't look closely, you may think there should be two matches in this excerpt:

He turned to get on his bike. "I'll see you later, when I'm done with all this" he said, looking back for a moment before starting his journey. As he entered the street, one of the city's trolleys collided with Mike's bicycle. "Oh my!" exclaimed an onlooker.

...but there are three matches, not two:

"I'll see you later, when I'm done with all this"
's trolleys collided with Mike'
"Oh my!"

and this excerpt contains only ONE match:

The fight wasn't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn't love you!"

can you find that one? :D

't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn'

I would recommend (if you are up for using lookaround) that you consider doing some extra checking, such as a positive lookbehind for whitespace or the start of the input before the first quote, to make sure you don't match things like 's trolleys collided with Mike'. I wouldn't put much money on any solution without a lot of testing first, though. Adding (?<=\s|^) to the beginning of either expression will avoid the above cases, i.e.:

(?<=\s|^)(["'])(?:\\.|(?!\1).)*\1                    #based on Tim's

or

(?<=\s|^)("(\\"|[^"])*"|'(\\'|[^'])*')               #based on my alternative
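
To see the difference the lookbehind makes, here is a small sketch comparing the lookahead-based expression with and without the guard on part of the first excerpt. The names GuardedQuoteDemo, PLAIN, GUARDED and findAll are made up for illustration. Keep in mind that the guard is only a heuristic: it would also reject a quote that directly follows something like ( or =.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedQuoteDemo {
    // Regex: (["'])(?:\\.|(?!\1).)*\1
    static final Pattern PLAIN = Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");
    // Same regex, but the opening quote must follow whitespace or the start of the input.
    static final Pattern GUARDED = Pattern.compile("(?<=\\s|^)([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // Hypothetical helper: collects every match of the given pattern.
    static List<String> findAll(Pattern p, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "one of the city's trolleys collided with Mike's bicycle. "
                + "\"Oh my!\" exclaimed an onlooker.";
        // Unguarded: the two apostrophes pair up into a bogus match before "Oh my!".
        System.out.println(findAll(PLAIN, text));
        // Guarded: the apostrophes follow letters, so only "Oh my!" is reported.
        System.out.println(findAll(GUARDED, text));
    }
}
```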

I'm not sure how efficient lookaround is compared to non-lookaround, so the two above may be equivalent, or one may be more efficient than the other (?)

Licensed under: CC-BY-SA with attribution