Question

I've been wanting to improve my regex skills for quite some time now, and "Mastering Regular Expressions" was recommended quite a few times, so I bought it and have been reading it over the past day or so.

I have created the following regular expression:

^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) \(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$

It matches the first two links below but ignores the two enclosed in an <i> tag. It extracts the id, title and type.

<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>

Although it works, it seems fairly long for something so simple. Could it be improved?

Solution

If you're writing screen scrapers then, as Whilliham rightly mentions, a DOM parser may be a more suitable tool than regex, since HTML is a lot more forgiving than a regex is.

The version below isn't shortened by much, but it is a bit more forgiving:

  • Removed the start-of-string and end-of-string anchors; did you really need them?
  • Added a negative lookbehind to make sure <a> is not immediately preceded by <i>.
  • Used the \d shorthand instead of [0-9], which is a tad cleaner.
  • You had the type limited to 2 to 10 characters; I changed it to 2 or more.
  • Removed the checks for the end tags; they serve no contextual purpose for your screen scraper (presumably).

(?<!<i>)<a href="/site\.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)
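
To check the pattern against the four sample links from the question, here is a minimal sketch in Python (the question doesn't name a language, so that choice is an assumption). The names LINK_RE, SAMPLE and extract_links are made up for the example, and the dot in site.php is escaped on the assumption that a literal dot is intended.

```python
import re

# Pattern from the answer above; the dot in "site.php" is escaped here
# on the assumption that a literal dot is what was intended.
LINK_RE = re.compile(
    r'(?<!<i>)<a href="/site\.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)'
)

# The four sample links from the question.
SAMPLE = """\
<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>
"""


def extract_links(html):
    """Return (id, title, type) tuples for links not directly preceded by <i>."""
    return LINK_RE.findall(html)


if __name__ == "__main__":
    for site_id, title, site_type in extract_links(SAMPLE):
        print(site_id, title, site_type)
```

Note that the lookbehind only looks at the three characters directly before <a, so something like "<i> <a ..." (with whitespace in between) would still match. That is one of the reasons the pattern is fragile on real-world HTML.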

OTHER TIPS

Short of using shorthand character classes (\d for [0-9], etc.), I don't see that the regular expression in question could be shortened much; however...

As a side note, it is worth mentioning that parsing HTML with regular expressions is hazardous at best; when dealing with HTML (and to a lesser extent XML), DOM tools are generally better suited.
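
As a rough illustration of the parser-based approach, here is a sketch using Python's standard-library html.parser. It is an event-based parser rather than a full DOM, but it shows the same idea: track which tags are open instead of pattern-matching the raw markup. The language choice and the names LinkCollector, ID_RE, TEXT_RE and extract_links are assumptions made up for this example, not something from the question.

```python
import re
from html.parser import HTMLParser

# Pulls the numeric id out of an href such as "/site.php?id=6321".
ID_RE = re.compile(r'^/site\.php\?id=(\d+)$')
# Splits link text such as "site 1 title (type 1)" into title and type.
TEXT_RE = re.compile(r'(.*?) \(([ a-z\d]{2,})\)$')


class LinkCollector(HTMLParser):
    """Collects (id, title, type) for <a> tags that are not inside an <i>."""

    def __init__(self):
        super().__init__()
        self.italic_depth = 0    # number of <i> elements currently open
        self.current_id = None   # id of the <a> being read, if any
        self.text_parts = []     # text chunks seen inside the current <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1
        elif tag == "a" and self.italic_depth == 0:
            match = ID_RE.match(dict(attrs).get("href") or "")
            if match:
                self.current_id = match.group(1)
                self.text_parts = []

    def handle_data(self, data):
        if self.current_id is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "i" and self.italic_depth > 0:
            self.italic_depth -= 1
        elif tag == "a" and self.current_id is not None:
            match = TEXT_RE.match("".join(self.text_parts).strip())
            if match:
                self.links.append(
                    (self.current_id, match.group(1), match.group(2))
                )
            self.current_id = None


def extract_links(html):
    """Parse html and return the collected (id, title, type) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links on the four sample links from the question should return only the first two. Unlike a lookbehind, this also skips a link when there is whitespace or another tag between the <i> and the <a>.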

Licensed under: CC-BY-SA with attribution