Improving my regular expression skills

https://stackoverflow.com/questions/1121007

13-09-2019

Question

I've been wanting to improve my regex skills for quite some time now, and "Mastering Regular Expressions" was recommended quite a few times, so I bought it and have been reading it over the past day or so.

I have created the following regular expression:

^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) \(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$

It matches the first two links below but ignores the two enclosed by an <i> tag, and it extracts the id, title and type.

<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>

Although it works, it seems fairly long for something so simple. Could it be improved?
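
For reference, the behaviour described above can be checked with a minimal Python sketch. Python is an assumption here, since the question does not name a language, and `ORIGINAL`, `SAMPLES` and `extract_original` are illustrative names. Each sample link is tested on its own line because the pattern is anchored with ^ and $.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r'^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) '
    r'\(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$'
)

# The four sample links from the question, one per entry.
SAMPLES = [
    '<a href="/site.php?id=6321">site 1 title (type 1)</a>',
    '<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>',
    '<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>',
    '<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>',
]


def extract_original(lines):
    """Return (id, title, type) for every line the pattern matches."""
    results = []
    for line in lines:
        m = ORIGINAL.match(line)
        if m:
            results.append(m.groups())
    return results


if __name__ == "__main__":
    for row in extract_original(SAMPLES):
        print(row)
```

Only the plain link and the <b>-wrapped link produce a row; the two links wrapped in <i> are skipped.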

Solution

If you're writing screen scrapers then, as Whilliham rightfully mentions, a DOM parser might be just as suitable as a regex, since HTML is a lot more forgiving than a regex.

The regex below is not shortened by much, but it is a bit more forgiving:

Removed the start-of-string and end-of-string anchors; did you really need them?
Added a negative lookbehind to make sure <a> is not preceded by <i>.
Used \d instead of [0-9], which is a tad cleaner.
You had the type limited to 2 to 10 characters; I changed it to 2 or more.
Removed the checks for the closing tags; they carry no contextual meaning for your screen scraper (presumably).

(?<!<i>)<a href="/site.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)
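
As a rough check, the revised pattern can be run over the whole sample at once, since it is no longer anchored to the start and end of a line. This is a sketch in Python (an assumption, as the thread does not name a language); `REVISED`, `HTML` and `extract_links` are illustrative names.

```python
import re

# The revised pattern, copied as written above. Note that the "." in
# "site.php" is unescaped here, so it matches any character; writing it
# as "\." would be stricter.
REVISED = re.compile(
    r'(?<!<i>)<a href="/site.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)'
)

# The sample markup from the question as one multi-line string.
HTML = """\
<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>
"""


def extract_links(text):
    """Return a list of (id, title, type) tuples found anywhere in text."""
    return REVISED.findall(text)


if __name__ == "__main__":
    for row in extract_links(HTML):
        print(row)
```

The lookbehind only rejects an <a> that comes immediately after <i>; whitespace or another tag between the two would let the link through, which is one reason the pattern is more forgiving but also less precise.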

OTHER TIPS

Short of using character classes (\d for 0-9 etc.) I don't see that the regular expression in question could be shortened much; however...

As a side note, it is worth mentioning that parsing HTML with regular expressions is hazardous at best; when dealing with HTML (and to a lesser extent XML), DOM tools are generally better suited.
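
To give an idea of what a parser-based approach might look like, here is a minimal sketch using Python's standard-library `html.parser`. Strictly speaking this is an event-based parser rather than a DOM tool, and it is used here only because it needs no extra installation. The class and function names (`LinkCollector`, `extract_links_with_parser`) are hypothetical, and the rule "skip links inside <i>" is taken from the question. A small regex is still used, but only to split the link text into title and type.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Splits link text such as "site 1 title (type 1)" into title and type.
TEXT_PATTERN = re.compile(r"(.*?) \(([ a-z\d]{2,})\)")


class LinkCollector(HTMLParser):
    """Collects (id, title, type) for <a> tags that are not inside <i>."""

    def __init__(self):
        super().__init__()
        self.italic_depth = 0     # number of <i> elements currently open
        self.current_href = None  # href of the <a> being read, if any
        self.text_parts = []      # text chunks seen inside that <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1
        elif tag == "a" and self.italic_depth == 0:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "i" and self.italic_depth > 0:
            self.italic_depth -= 1
        elif tag == "a" and self.current_href is not None:
            self._finish_link()

    def _finish_link(self):
        # Read the id from the query string instead of matching it by regex.
        query = parse_qs(urlparse(self.current_href).query)
        m = TEXT_PATTERN.fullmatch("".join(self.text_parts))
        if "id" in query and m:
            self.links.append((query["id"][0], m.group(1), m.group(2)))
        self.current_href = None


def extract_links_with_parser(html_text):
    """Return (id, title, type) tuples for links not wrapped in <i>."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = (
        '<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>\n'
        '<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>\n'
    )
    print(extract_links_with_parser(sample))
```

Because the parser tracks open <i> elements rather than the characters just before <a>, extra whitespace or attributes in the markup should not change which links are picked up.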

Licensed under: CC-BY-SA with attribution
