Question

I am trying to capture matches between two strings.

For example, I am looking for all text that appears between Q and XYZ, using the "soonest" match (not continuing to expand outwards). This string:

circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ

Should return:

Q SOMETEXT XYZ

But instead, it returns:

Q hello there Q SOMETEXT XYZ

Here is the expression I'm using: Q.*?XYZ

It's going too far back to the left. It's working fine on the right side when I use the question mark after the asterisk. How can I do the same for the left side, and stop once I hit that first left Q, making it work the same as the right side works? I've tried question marks and other symbols from http://msdn.microsoft.com/en-us/library/az24scfc.aspx, but there's something I'm just not figuring out.

I'm a regex novice, so any help on this would be appreciated!

Solution

Well, the non-greedy match is working: from the point where the match starts, it takes the shortest string that satisfies the regex. The thing you have to remember is that a regex engine works left to right. It starts at the first Q it finds, then takes the fewest characters it can before an XYZ. If you want it not to go past any Qs, you have to use a negated character class:

Q[^Q]*?XYZ
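To see the difference on the string from the question, here is a minimal sketch. It assumes Python's re module purely for illustration (the question links to the .NET docs, and these two patterns should behave the same way there); the variable names are made up.

```python
import re

text = "circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ"

# Lazy dot: the match starts at the leftmost Q and stops at the nearest XYZ.
lazy = re.search(r"Q.*?XYZ", text).group()

# Negated class: the body may not contain another Q, so the attempt that
# starts at the first Q fails and the engine retries from the second Q.
negated = re.search(r"Q[^Q]*?XYZ", text).group()

print(lazy)     # Q hello there Q SOMETEXT XYZ
print(negated)  # Q SOMETEXT XYZ
```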

[^Q] matches any one character that is not a Q. Mind that this only works when the opening delimiter is a single character. If your opening delimiter is multiple characters, you have to do it a different way. Why? Well, take the delimiter 'PQR' and the string

foo PQR bar XYZ 

If you take the regex from before and extend the character class to:

PQR[^PQR]*?XYZ

then you'll get

'PQR bar XYZ'

As you expected. But if your string is

foo PQR Party Time! XYZ 

You'll get no matches, because the 'P' in 'Party' is one of the excluded characters. [] delineates a "character class", which matches exactly one character. You define the set of characters it accepts simply by listing them.

th[ae]n

will match both 'than' and 'then', but not 'thin'. Placing a caret ('^') at the beginning negates the class, meaning "match anything but these characters". So by writing [^PQR], rather than saying "not 'PQR'", you're saying "not 'P', 'Q', or 'R'". You can still use this if you want, but only if you're 100% sure that the characters from your delimiter will only appear in your delimiter. If that's the case, it's enough to negate only the first character of your delimiter, and you can use greedy matching, which is typically faster. Bear in mind that the greedy version runs on to the last XYZ that comes before the next P, so keep the *? if you need the nearest XYZ. The greedy regex would be:

PQR[^P]*XYZ 
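A quick way to check these claims is sketched below, again assuming Python's re module. The first two input strings come from the answer; the third ("foo PQR bar XYZ baz XYZ ") is a made-up sample added only to show how far the greedy version reaches.

```python
import re

# A character class matches exactly one character out of the listed set.
class_hits = [w for w in ("than", "then", "thin") if re.fullmatch(r"th[ae]n", w)]

# [^PQR] excludes the single characters P, Q and R, not the string "PQR".
simple = re.search(r"PQR[^PQR]*?XYZ", "foo PQR bar XYZ ")
party = re.search(r"PQR[^PQR]*?XYZ", "foo PQR Party Time! XYZ ")  # blocked by the P in "Party"

# Negating only the first delimiter character, with a greedy quantifier
# (hypothetical input with two closing delimiters).
greedy = re.search(r"PQR[^P]*XYZ", "foo PQR bar XYZ baz XYZ ")

print(class_hits)      # ['than', 'then']
print(simple.group())  # PQR bar XYZ
print(party)           # None
print(greedy.group())  # PQR bar XYZ baz XYZ
```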

But, if you can't make that guarantee, then match with:

PQR(?:.(?!PQR))*?XYZ

Regex has no direct way to say "match anything except this string", the way a negated class says "anything except these characters", so you have to use a negative lookahead.

(?!PQR)

is just such a lookahead. It means "assert that the text starting at this position does not match the inner regex", without consuming any characters, so

.(?!PQR)

matches any character not followed by PQR. Wrap that in a group so that you can lazily repeat it,

(.(?!PQR))*?

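and you have a match for "string that doesn't contain my delimiter". (One edge case: the lookahead is only checked after each character is consumed, so a PQR that immediately follows the opening PQR slips through. Putting the lookahead first, as in (?:(?!PQR).)*?, closes that gap.) The only other thing I did was add a ?: to make it a non-capturing group.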

(?:.(?!PQR))*?

Depending on the language you use to parse your regex, it may try to pass back every matched group individually (useful for find and replace). This keeps it from doing that.
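Here is a small sketch of the lookahead version in action, assuming Python's re module; the sample strings are made up for illustration. It also shows what the capturing group would hand back, and compares the pattern with a lookahead-first variant (not part of the original answer) on an input where the opening delimiter appears twice in a row.

```python
import re

# Hypothetical sample: two opening delimiters before the first XYZ.
text = "foo PQR one PQR two XYZ three XYZ"

# The attempt from the first PQR fails, because the body would have to
# consume a character that is followed by PQR; the second PQR succeeds.
lookahead = re.search(r"PQR(?:.(?!PQR))*?XYZ", text).group()

# With a capturing group, findall returns the group (its last repetition)
# instead of the whole match; the non-capturing form returns the match.
capturing = re.findall(r"PQR(.(?!PQR))*?XYZ", text)
non_capturing = re.findall(r"PQR(?:.(?!PQR))*?XYZ", text)

# Edge case: an opening delimiter immediately followed by another one.
edge = "PQRPQR b XYZ"
after_dot = re.search(r"PQR(?:.(?!PQR))*?XYZ", edge).group()
before_dot = re.search(r"PQR(?:(?!PQR).)*?XYZ", edge).group()

print(lookahead)      # PQR two XYZ
print(capturing)      # [' ']
print(non_capturing)  # ['PQR two XYZ']
print(after_dot)      # PQRPQR b XYZ
print(before_dot)     # PQR b XYZ
```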

Happy regexing!

OTHER TIPS

Greediness (or laziness) only affects where the match ends. The start is always the leftmost position at which a match is possible.

To make the expression only match from the last Q before XYZ, make it not match Q between them:

Q[^Q]*?XYZ
Licensed under: CC-BY-SA with attribution