Question

I am learning regex and I found the following pattern:

q(?=u)i

If I try to match quit, it fails: q matches q and u matches u (so the lookahead is valid), but then the regex backtracks in the word quit and the character u is compared again, this time with i. The match fails.

I don't see any word matching this pattern. Is there any case? Or is this structure (pattern - lookahead - rest of pattern) useful?


Solution 2

No, I have not seen any string that matches the pattern you mentioned, and none can: the character right after q would have to be both u and i. Such a structure can still be used (though it is admittedly a bit weird), for example:

q(?=.*t)u

This regex matches qu only when a t appears somewhere later on the same line. This means that question and quit will match, but quasar will not. In this case, an equivalent and (in my opinion) more readable regex, qu(?=.*t), could be used instead.
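As a quick check of the claim above, here is a minimal sketch in Python (an assumption on my part; the answer does not name a language). It runs both spellings of the pattern over the three example words; the variable names are only illustrative.

```python
import re

words = ["question", "quit", "quasar"]

# Lookahead placed between q and u, as in the answer.
inside = re.compile(r"q(?=.*t)u")
# Lookahead moved to the end, the more readable variant.
after = re.compile(r"qu(?=.*t)")

# For each word: (matches first pattern, matches second pattern)
results = {w: (bool(inside.search(w)), bool(after.search(w))) for w in words}

for word, flags in results.items():
    print(word, flags)
```

Both patterns agree on every word here, which is consistent with them being equivalent.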

OTHER TIPS

The regex lookaround syntax is zero width.

What this means is that it matches but doesn't move the cursor, so in your pattern:

  • q matches "q"; the cursor moves to "u".
  • (?=u) matches "u"; the cursor stays at "u".
  • i does not match "u", so the pattern fails.

Note that the pattern doesn't backtrack here. The cursor never moved past "u", because the lookaround assertion is zero width.
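The steps above can be reproduced with a small sketch, here in Python's re module (my choice of language, not something the question specifies). It shows that the lookahead consumes nothing, so the token after it is tested against the same "u".

```python
import re

# The lookahead checks the next character but consumes nothing:
# only "q" ends up inside the match.
m = re.search(r"q(?=u)", "quit")
lookahead_span = m.span()

# The token after the lookahead is tested against the same "u".
no_match = re.search(r"q(?=u)i", "quit")   # i vs. "u": fails
qu_match = re.search(r"q(?=u)u", "quit")   # u vs. "u": succeeds

print(lookahead_span)     # (0, 1)
print(no_match)           # None
print(qu_match.group())   # qu
```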

The structure is very useful if you want to match a pattern that contains, say, "at least one X" in a series of letters. For example:

[a-z]{4}[1-9]{3}(?=.*X)[a-zA-Z]{5}

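This says: four lowercase letters, followed by three digits (1 to 9), followed by five letters of any case, with at least one "X". Note that the lookahead scans the rest of the line, so unless the pattern is anchored at the end, the "X" may also appear after the five letters.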
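A minimal sketch of how this pattern behaves, written in Python as an assumption (the answer is language-neutral) with made-up sample strings. Using re.fullmatch forces the five letters to be the end of the input, which is one way to make sure the "X" is among them.

```python
import re

pattern = re.compile(r"[a-z]{4}[1-9]{3}(?=.*X)[a-zA-Z]{5}")

# Hypothetical sample inputs, chosen only for illustration.
with_x = pattern.fullmatch("abcd123abXde")      # X among the five letters
without_x = pattern.fullmatch("abcd123abcde")   # no X anywhere

# Unanchored search: the X sits after the five letters, yet the
# lookahead still succeeds because .* scans the rest of the line.
trailing_x = pattern.search("abcd123abcdeX")

print(with_x is not None)    # True
print(without_x)             # None
print(trailing_x.group())    # abcd123abcde
```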

I would say that if the pattern after (or before) the lookahead ((?=...)) is fixed, the regex doesn't really make much sense. For example:

foo(?=bar)fixed

But if that part is dynamic rather than fixed, the structure can be useful. See this example:

kent$  echo "fooququuuxxxxxxx"|grep -Po 'q(?=uu).*' 
quuuxxxxxxx

kent$  echo "fooququuuxxxxxxx"|grep -Po 'q(?=u).*' 
ququuuxxxxxxx

In the above example, only the lookahead is different, yet you get a different match result.
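For readers without GNU grep's -P option, the same comparison can be sketched in Python (an assumption; the answer itself only uses grep). The input string is the one from the commands above.

```python
import re

text = "fooququuuxxxxxxx"

# Requires two u's after the q, so the first q (followed by "uq") is skipped.
double_u = re.search(r"q(?=uu).*", text).group()

# Requires only one u, so the match starts at the first q.
single_u = re.search(r"q(?=u).*", text).group()

print(double_u)   # quuuxxxxxxx
print(single_u)   # ququuuxxxxxxx
```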

Applying the use case structure presented in this question and the response by Boris the Spider, I came up with this solution:

$detail = '[code]<!doctype>
<html>
  <head></head>
  <body><p>My Regex script</p></body>
</html>[/code]';

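function regex($detail)
{
    // preg_match() returns 1 on a match and 0 otherwise; cast the result to bool.
    return (bool) preg_match('#^\[code](?=.*(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>))\[/code]#si', $detail);
}
// var_dump() shows the boolean explicitly; echo would print nothing for false.
var_dump(regex($detail)); // bool(false): this first attempt does not match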

Taking a look inside the regex engine, this is what is happening. When ^\[code](?=.*(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>))\[/code] is applied to $detail above, \[code] matches [code], and the html section of the code matches (<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>) inside the lookahead.

The match from the lookahead is discarded, so the engine steps back to the position right after [code], where the html section begins. The lookahead was successful, so the engine continues with \[/code]. But \[/code] cannot match the html text found at that position, so this match attempt fails.

However, this regex syntax works:

#^\[code](?=.*(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>\[/code]))#si

The regex engine simply says: "an opening [code] tag, followed by any characters, then a pair of html tags with a closing [/code] tag right after them".
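To compare the two variants side by side, here is a minimal sketch in Python rather than PHP (an assumption on my part, relying on Python's re module treating these constructs the same way as PCRE). The names failing and working are only illustrative labels; the patterns and the input are copied from the post.

```python
import re

detail = (
    "[code]<!doctype>\n"
    "<html>\n"
    "  <head></head>\n"
    "  <body><p>My Regex script</p></body>\n"
    "</html>[/code]"
)

flags = re.S | re.I  # same meaning as the "si" modifiers in the PHP pattern

# [/code] placed after the lookahead: it is tried right after "[code]".
failing = re.search(
    r"^\[code](?=.*(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>))\[/code]", detail, flags
)

# [/code] placed inside the lookahead: it is tried after the closing html tag.
working = re.search(
    r"^\[code](?=.*(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>\[/code]))", detail, flags
)

print(failing)            # None
print(working.group())    # [code]
print(working.group(2))   # html
```

Note that the working pattern only consumes the opening [code] tag; everything else is checked by the zero-width lookahead.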

I hope this helps explain the use case pattern match (quit) more.

Licensed under: CC-BY-SA with attribution