Question

I need to get the href attribute values (those longer than a specific length) of all 'a' elements in an HTML file. I wrote something like this:

<a.*href\s*=\s*"(?<link>.{15,})".*>

But it does not work correctly. Any suggestions?

Solution

Here are a couple of ways to avoid capturing more than one attribute inside the tag:

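Try making the quantifier non-greedy: use {15,}? instead of {15,}. The match then stops at the first closing double quote it finds after the 15th character, instead of running on into other attributes inside the <a> tag. Note that if the href value is shorter than 15 characters, the lazy quantifier will still run past its closing quote to reach the minimum length.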
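To illustrate the difference, here is a minimal sketch in Python. It assumes Python's `re` module, which writes the named group as `(?P<link>...)` rather than the `(?<link>...)` form used in the question; the sample tag and the variable names are made up for the example.

```python
import re

# Hypothetical sample tag with a second attribute after href.
html = '<a href="http://example.com/some/long/path" class="nav">x</a>'

# The pattern from the question, with Python's named-group syntax.
greedy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,})".*>')

# Same pattern, but with a lazy (non-greedy) quantifier.
lazy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,}?)".*>')

greedy_link = greedy.search(html).group("link")
lazy_link = lazy.search(html).group("link")

print(greedy_link)  # runs on into the class attribute
print(lazy_link)    # stops at the quote that closes href
```

With this input the greedy version captures `http://example.com/some/long/path" class="nav`, while the lazy version captures only the URL.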

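A better option is to replace that catch-all . in front of the quantifier with something more restrictive. Try a negated character class: for example, [^\s]{15,} will look for at least 15 consecutive non-whitespace characters. Excluding the quote itself with [^"]{15,} is stricter still, because the match can then never cross the closing double quote.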
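A short Python sketch of why the character class is the safer choice. The sample tags and names are hypothetical, and the `[^"]` pattern is a variation on the `[^\s]` suggestion rather than something the question asked for. Python spells the named group `(?P<link>...)`.

```python
import re

# Hypothetical tags: one href shorter than 15 characters, one longer.
short_tag = '<a href="/x" title="a long title here">short</a>'
long_tag = '<a href="http://example.com/some/long/path" class="nav">long</a>'

# Lazy catch-all: still crosses the closing quote to reach 15 characters.
lazy = re.compile(r'<a.*href\s*=\s*"(?P<link>.{15,}?)".*>')

# At least 15 consecutive non-whitespace characters.
non_space = re.compile(r'<a.*href\s*=\s*"(?P<link>[^\s]{15,})".*>')

# Assumed stricter variant: anything except a double quote.
non_quote = re.compile(r'<a.*href\s*=\s*"(?P<link>[^"]{15,})".*>')

lazy_short = lazy.search(short_tag).group("link")
non_space_short = non_space.search(short_tag)
non_quote_short = non_quote.search(short_tag)
non_space_long = non_space.search(long_tag).group("link")
non_quote_long = non_quote.search(long_tag).group("link")

print(lazy_short)        # overshoots into the title attribute
print(non_space_short)   # None: the short href is rejected
print(non_quote_long)    # the long href on its own
```

On the short tag the lazy pattern returns `/x" title="a long title here`, which is not a URL at all, while both character-class patterns correctly find nothing.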

Both of these methods have worked for me so far, but remember that URLs can be very messy and even malformed in the wild, so you aren't guaranteed to catch everything. The more you know about your target site, the better.
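As a rough illustration of how markup in the wild can slip past a pattern like this, the Python sketch below runs the non-whitespace pattern against three hypothetical tags that differ only in how the href value is quoted. The sample tags are invented for the example.

```python
import re

# Non-whitespace pattern, with Python's (?P<link>...) named-group syntax.
pattern = re.compile(r'<a.*href\s*=\s*"(?P<link>[^\s]{15,})".*>')

# Hypothetical tags using three common quoting styles.
samples = {
    "double": '<a href="http://example.com/page/one">1</a>',
    "single": "<a href='http://example.com/page/two'>2</a>",
    "unquoted": "<a href=http://example.com/page/three>3</a>",
}

# Record which quoting styles the pattern actually matches.
matched = {name: pattern.search(tag) is not None for name, tag in samples.items()}
print(matched)
```

Only the double-quoted tag matches, because the pattern requires a literal double quote after the equals sign; single-quoted and unquoted values are silently skipped.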

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow