I need to get the href attribute values (only those longer than a specific length) of all 'a' elements in an HTML file. I came up with something like this:

<a.*href\s*=\s*"(?<link>.{15,})".*>

But it does not work correctly. Any suggestions?

Solution

Here are a couple of ways to avoid capturing more than one attribute inside the tag:

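Try making the quantifier non-greedy: {15,}? instead of {15,}. This way it stops at the first double quote that follows at least 15 characters, instead of running on to a later quote and capturing more attributes inside the <a> tag. Note that if the href value is shorter than 15 characters, the lazy version can still read past its closing quote to reach the minimum length.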
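To illustrate the difference, here is a minimal sketch in Python. The sample tag is made up for the example, and Python spells named groups as (?P<link>...) rather than (?<link>...), so the patterns are adapted accordingly; the behaviour of the quantifiers is the same.

```python
import re

# Hypothetical sample tag (not from the question); the href is longer than 15 characters.
html = '<a href="http://example.com/page" class="nav">Home</a>'

# The original pattern, with Python's (?P<name>...) named-group syntax.
greedy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,})".*>')
# Same pattern with a lazy quantifier.
lazy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,}?)".*>')

greedy_link = greedy.search(html).group("link")
lazy_link = lazy.search(html).group("link")

print(greedy_link)  # http://example.com/page" class="nav
print(lazy_link)    # http://example.com/page
```

The greedy version runs to the last double quote in the tag, while the lazy version stops at the first quote it can.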

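A better option is to replace the catch-all . in front of the quantifier with something more restrictive. Try a negated character class: for example, [^\s]{15,} (equivalent to \S{15,}) looks for at least 15 consecutive non-whitespace characters, so the match cannot cross the space that separates attributes. If your values are always double-quoted, [^"]{15,} is stricter still, because it cannot cross the closing quote at all.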
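As a rough sketch of the character-class approach, again in Python (so the named group is written (?P<link>...)), with two made-up sample tags: one whose href is long enough and one whose href is too short.

```python
import re

# Negated character class: the captured value may not contain whitespace.
pattern = re.compile(r'<a.*href\s*=\s*"(?P<link>[^\s]{15,})".*>')

# Hypothetical sample tags, chosen only for illustration.
long_tag = '<a href="http://example.com/page" class="nav">Home</a>'
short_tag = '<a href="/x" class="navigation-link">Home</a>'

long_match = pattern.search(long_tag)
short_match = pattern.search(short_tag)

print(long_match.group("link"))  # http://example.com/page
print(short_match)               # None
```

The short href is rejected because the space after its closing quote ends the run of non-whitespace characters before the 15-character minimum is reached. With a bare . in place of the class, the same tag would match and capture part of the class attribute as well.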

Both of these methods have worked for me so far, but remember that URLs in the wild can be very messy and even malformed, so you aren't guaranteed to catch everything. The more you know about your target site, the better you can tune the pattern.

Licensed under: CC-BY-SA with attribution