Question

RegEx:

<span style='.+?'>TheTextToFind</span>

HTML:

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span></span>

Why does the match include this?

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED



Solution

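The regex engine always finds the leftmost match, that is, the match that starts at the earliest possible position in the input. The lazy quantifier only controls how far the match extends from that starting position; it does not move the start. That's why you get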

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span>

as the match (basically the whole input, minus the last </span>).
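A quick way to check this is a minimal sketch in Python, assuming the `re` module behaves like the engine in the question for this pattern (the variable names `html` and `lazy` are only illustrative):

```python
import re

# The input from the question, split over two lines for readability.
html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")
m = lazy.search(html)

# The match starts at the first "<span style='" in the input (index 0),
# not at the inner span.
print(m.start())

# The match is the whole input minus the final "</span>".
print(m.group(0) == html[:-len("</span>")])
```

The start position is fixed first; laziness is only applied after that.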

To steer the engine in the right direction, assume that > never appears inside the attribute value. The following regex then matches what you want:

<span style='[^>]+'>TheTextToFind</span>

This works because, under the assumption above, [^>]+ cannot cross the > that closes a tag, so an attempt starting at the outer span fails and the engine moves on to the inner one.
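As a rough check of the [^>]+ version, here is a small Python sketch using the same input (again assuming Python's `re`; the names `html` and `strict` are illustrative only):

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

strict = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")
m = strict.search(html)

# The attempt at index 0 fails because [^>]+ cannot consume the ">" that
# closes the outer tag, so the engine retries further right.
print(m.start() == html.index("<span", 1))
print(m.group(0))
```

The match now begins at the second <span, which is the inner element.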

However, I hope you are not doing this as part of a program that extracts information from an HTML page. Use an HTML parser for that purpose.
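For illustration, one possible parser-based approach is sketched below with Python's standard-library `html.parser`. The class name `SpanFinder` and its attributes are hypothetical, and the rule "text that is not inside a nested span" is an assumption about what should count as a span's text; a real project may prefer a full DOM library.

```python
from html.parser import HTMLParser

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")


class SpanFinder(HTMLParser):
    """Collects the attributes of every <span> whose text equals `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # one (attrs, text_parts) entry per open <span>
        self.found = []   # attribute dicts of the matching spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append((dict(attrs), []))

    def handle_data(self, data):
        # Text is credited to the innermost open span only.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            attrs, parts = self.stack.pop()
            if "".join(parts).strip() == self.target:
                self.found.append(attrs)


finder = SpanFinder("TheTextToFind")
finder.feed(html)
finder.close()
print(finder.found)
```

Because the parser tracks nesting itself, the outer span is never confused with the inner one, whatever characters appear in the attribute values.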


To understand why the original regex matches the way it does, note that .+? expands one character at a time (the engine backtracks into it) until the rest of the pattern, the sequel '>TheTextToFind</span>, can match.

# Matching .+?
# Since +? is lazy, it first matches . once (to fulfill the minimum repetition) and
# increases the number of repetitions each time the sequel fails to match
<span style='f                        # FAIL. Can't match closing '
<span style='fo                       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;        # PROCEED. But FAIL later, since can't match T in The
<span style='font-size:11.0pt;'       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;'>DON   # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style=
                                      # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
                                      # PROCEED. MATCH FOUND.

As you can see, .+? is tried at increasing lengths until it matches font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;, which allows the sequel '>TheTextToFind</span> to match.
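The expansion can be imitated by hand. The following Python sketch is a simplified model rather than how a real engine is implemented: it fixes the start at index 0, grows the text consumed by .+? one character at a time, and stops at the first length where the sequel matches. The names `prefix`, `sequel` and `consumed` are illustrative.

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

prefix = "<span style='"
sequel = re.compile(r"'>TheTextToFind</span>")

start = len(prefix)   # .+? begins right after the prefix matched at index 0
consumed = None
for n in range(1, len(html) - start + 1):
    # Try the sequel immediately after the n characters taken by .+?
    if sequel.match(html, start + n):
        consumed = html[start:start + n]
        break

print(consumed)

# Cross-check against the real lazy pattern using a capturing group.
real = re.search(r"<span style='(.+?)'>TheTextToFind</span>", html)
print(real.group(1) == consumed)
```

The first length at which the sequel succeeds already spans both opening tags, which is why the outer span ends up in the match.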

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow