Question

I need to get the href attribute values of all 'a' elements in an HTML file, but only the values longer than a specific length. I came up with something like this:

<a.*href\s*=\s*"(?<link>.{15,})".*>

But it does not work correctly. Any suggestions?

Solution

Here are a couple of ways to avoid capturing more than one attribute inside the tag:

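Try making the quantifier non-greedy: {15,}? instead of {15,}. This way the match stops at the first closing double quote that comes after at least 15 characters, instead of running on to the last double quote and capturing other attributes inside the <a> tag. Note that an href shorter than 15 characters can still make the capture spill over into the next attribute.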
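
To make the difference concrete, here is a minimal sketch in Python. The sample tags are made up for illustration, and Python spells named groups (?P<link>...) rather than (?<link>...), so the pattern is adapted accordingly; treat it as a demonstration of the quantifier behaviour, not a complete link extractor.

```python
import re

# Hypothetical sample tags, not taken from the question.
long_tag = '<a href="http://example.com/some/page" title="Example page">link</a>'
short_tag = '<a href="/x" title="A much longer title">short</a>'

# Python writes named groups as (?P<name>...) instead of (?<name>...).
greedy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,})".*>')
lazy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,}?)".*>')

# Greedy: the capture runs on to the last double quote in the tag.
greedy_link = greedy.search(long_tag).group("link")

# Non-greedy: the capture stops at the first quote after 15+ characters.
lazy_link = lazy.search(long_tag).group("link")

# Caveat: with a short href, even the non-greedy capture crosses into
# the title attribute, because it must consume at least 15 characters.
short_link = lazy.search(short_tag).group("link")

print(greedy_link)
print(lazy_link)
print(short_link)
```

The last case shows why the non-greedy quantifier alone is probably not enough when short href values appear in the input.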

A better option is to replace the catch-all . in front of the quantifier with something more restrictive. Try a negated character class: for example, [^\s]{15,} (equivalent to \S{15,}) matches at least 15 consecutive non-whitespace characters.
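
As a rough sketch of the character-class approach, again in Python and with made-up sample lines (one tag per line), the pattern below keeps the long href and skips the short one. The names pattern, lines and links are only illustrative.

```python
import re

# Negated class in place of the catch-all dot; [^\s] is the same as \S.
# Python writes named groups as (?P<name>...) instead of (?<name>...).
pattern = re.compile(r'<a.*href\s*=\s*"(?P<link>[^\s]{15,})".*>')

# Hypothetical sample input, one tag per line.
lines = [
    '<a href="http://example.com/some/page" title="Example page">long</a>',
    '<a href="/x" title="A much longer title">short</a>',
]

links = []
for line in lines:
    match = pattern.search(line)
    if match:
        links.append(match.group("link"))

print(links)
```

The short href is rejected because the run of non-whitespace characters after the opening quote ends at the first space. Since [^\s] still allows a double quote, attributes written with no whitespace between them could slip through; a class such as [^\s"] would be stricter, though that variant is an assumption and not part of the answer above.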

Both of these methods have worked for me so far, but remember that URLs in the wild can be very messy and even malformed, so you aren't guaranteed to catch everything. The more you know about the markup of your target site, the more reliable the pattern will be.

Licensed under: CC-BY-SA with attribution