Question

I'm looking to find all fee codes in a page. The codes are 5 digits, with an optional single letter at the beginning. I have this currently, which is working great.

preg_match_all("/\b([a-zA-Z])?\d{5}\b/", $content, $matches);

My problem is I need to exclude any that occur within the 'title' attribute of a link.

<a href="#" title="Sample Fee – also see B11023">G14015</a>

I want to match on the G14015, but not B11023.

Any suggestions? Much appreciated.

Solution

Based on your comments clarifying that the fee codes you want are never found inside a tag, I'd suggest a two-pass solution. First, remove all tags by replacing each one with a single space. Then search the result for the fee codes.

$content = preg_replace("/<[^>]+>/", " ", $content);
preg_match_all("/\b[A-Za-z]?\d{5}\b/", $content, $matches);

This assumes no stray < or > is present.


Of course, the usual warning applies: regex is not a reliable way to parse HTML or XML.
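
To make the two passes concrete, here is a minimal sketch run against the sample link from the question. The extra bare code (12345) and the variable name $text are my additions for illustration, and the letter is treated as optional, as the question describes.

```php
<?php
// Sample link from the question; " and 12345" is added to show a code without a letter.
$content = '<a href="#" title="Sample Fee – also see B11023">G14015</a> and 12345';

// Pass 1: replace every tag with a space, which also discards attribute values.
$text = preg_replace('/<[^>]+>/', ' ', $content);

// Pass 2: optional leading letter, then exactly five digits, on word boundaries.
preg_match_all('/\b[A-Za-z]?\d{5}\b/', $text, $matches);

print_r($matches[0]);
```

On this input the result should contain G14015 and 12345, while B11023 is dropped along with the tag that held it.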

OTHER TIPS

PHP has (*SKIP)(*FAIL) Magic

Resurrecting this question because it has a simple solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

With all the usual warnings about using regex to parse HTML in mind, here is a simple way to do it.

We can solve it with a single, simple regex:

(?i)<[^>]+(*SKIP)(*F)|[a-z]?\d{5}

The left side of the alternation | matches each tag from its opening < up to (but not including) the closing >, then deliberately fails. (*SKIP) tells the engine not to retry at any position inside the text it just consumed, so the search resumes after the tag's contents. The right side matches the pattern you want, and we know those matches are the right ones because they were not consumed by the expression on the left.

Sample Code

$regex = '~(?i)<[^>]+(*SKIP)(*F)|[a-z]?\d{5}~';
preg_match_all($regex, $yourstring, $matches);
print_r($matches[0]);
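
As a quick check, here is the same idea run against the sample link from the question. This is a sketch, not part of the original answer: the \b word boundaries are carried over from the question's own pattern (an assumption on my part, so that a five-digit run inside a longer number is not picked up), and $html and the trailing 12345 are made up for illustration.

```php
<?php
// Sample link from the question; " and 12345" is added to show a code without a letter.
$html = '<a href="#" title="Sample Fee – also see B11023">G14015</a> and 12345';

// Left alternative: consume a tag's contents, then skip past them and fail.
// Right alternative: optional letter plus five digits, with \b added as an assumption.
$regex = '~(?i)<[^>]+(*SKIP)(*F)|\b[a-z]?\d{5}\b~';

preg_match_all($regex, $html, $matches);
print_r($matches[0]);
```

This should print G14015 and 12345 only; B11023 sits inside the tag, so the left alternative swallows it before the right one gets a chance to see it.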

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow