Question

I'm trying to craft a regex that returns only the href attributes of <link> tags.

Why does this regex return all hrefs, including those of <a> tags?

    (?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+
    <link rel="stylesheet" rev="stylesheet" 
    href="idlecore-tidied.css?T_2_5_0_228" media="screen">
    <a href="anotherurl">Slash Boxes</a>

thank you


Solution

Either

/(?<=<link\b[^<>]*?)\bhref\s*=\s*(?:"[^"]*"|'[^']*'|\S+)/

or

/<link\b[^<>]*?\b(href\s*=\s*(?:"[^"]*"|'[^']*'|\S+))/

The main difference is [^<>]*? instead of .*?. A negated character class cannot match past the closing > of the <link> tag, so the search does not continue into other tags.
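As a rough illustration of that difference, here is a minimal Python sketch. Python's re module does not accept variable-length lookbehind, so it uses the capture-group form. The sample markup (a <link> with no href followed by an <a>) is made up for the demonstration, and the names loose and scoped are only illustrative.

```python
import re

# Hypothetical input: the <link> tag has no href, the <a> tag does.
html = '<link rel="stylesheet" media="screen">\n<a href="anotherurl">Slash Boxes</a>'

# .*? (with DOTALL) is free to run past the ">" that closes the <link> tag.
loose = re.compile(r'''<link\s+.*?(href\s*=\s*['"][^'"]+['"])''', re.S)

# [^<>]*? cannot cross a tag boundary, so it stays inside the <link> tag.
scoped = re.compile(r'''<link\b[^<>]*?\b(href\s*=\s*(?:"[^"]*"|'[^']*'|\S+))''')

leaked = loose.findall(html)       # picks up the href of the <a> tag
contained = scoped.findall(html)   # finds nothing, as intended

print(leaked)
print(contained)
```

With this input, the loose pattern reports the <a> tag's href while the scoped one reports no match.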

OTHER TIPS

Avoid lookbehind for such a simple case: just match what you need and capture what you want to get.

I got good results with <link\s+[^>]*(href\s*=\s*(['"]).*?\2) in The Regex Coach with s and g options.
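For readers without The Regex Coach, the same pattern can be tried in Python as a sketch. The assumption here is that the s option corresponds to re.S (dot matches newline) and that the g option corresponds to iterating over all matches with finditer. The helper name link_hrefs and the second <link> tag in the sample are invented for the example.

```python
import re

# Same pattern as above; group 1 is the whole href attribute,
# group 2 is the quote character reused by the \2 backreference.
LINK_HREF = re.compile(r'''<link\s+[^>]*(href\s*=\s*(['"]).*?\2)''', re.S)

def link_hrefs(html):
    """Return the href attribute text of every <link> tag found."""
    return [m.group(1) for m in LINK_HREF.finditer(html)]

sample = (
    '<link rel="stylesheet" rev="stylesheet" \n'
    'href="idlecore-tidied.css?T_2_5_0_228" media="screen">\n'
    '<a href="anotherurl">Slash Boxes</a>\n'
    "<link rel='icon' href='favicon.ico'>"
)

print(link_hrefs(sample))
```

Only the two <link> hrefs are returned; the <a> href is skipped because every match has to start at a <link tag.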

/(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+[^>]*>/

I'm a little shaky on the back-references myself, so I left that in there. This regex, though:

/(<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+[^>]*>/

...works in my JavaScript test.

(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+

works with Expresso (I think Expresso runs on the .NET regex-engine). You could even refine this a bit more to match the closing ' or ":

(?<=<link\s+.*?)href\s*=\s*([\'\"])[^\'\"]+(\1)

Perhaps your regex-engine doesn't work with lookbehind assertions. A workaround would be

(?:<link\s+.*?)(href\s*=\s*([\'\"])[^\'\"]+(\2))

Your match will then be in the captured group 1.
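Python's re module is one engine where this applies, so it makes a convenient test bed. The sketch below first shows that the lookbehind version is rejected at compile time, then uses the workaround and reads group 1. re.S is added as an assumption so that .*? can cross the line break inside the sample <link> tag.

```python
import re

sample = (
    '<link rel="stylesheet" rev="stylesheet" \n'
    'href="idlecore-tidied.css?T_2_5_0_228" media="screen">\n'
    '<a href="anotherurl">Slash Boxes</a>'
)

# Python's re only allows fixed-width lookbehind, so this fails to compile.
try:
    re.compile(r'''(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+''')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# The workaround: match the <link prefix normally, capture the href part.
workaround = re.compile(
    r'''(?:<link\s+.*?)(href\s*=\s*([\'\"])[^\'\"]+(\2))''', re.S
)
found = [m.group(1) for m in workaround.finditer(sample)]

print(lookbehind_ok)
print(found)
```

Note that this keeps .*? from the pattern above, so it shares the tag-crossing weakness when a <link> tag has no href of its own.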

What regex flavor are you using? Perl, for one, doesn't support variable-length lookbehind. Where that's an option, I'd choose (edited to implement the very good idea from MizardX):

(?<=<link\b[^<>]*?)href\s*=\s*(['"])(?:(?!\1).)+\1

as a first approximation. That way the choice of quote character (' or ") will be matched. The same for a language without support for (variable-length) lookbehind:

(?:<link\b[^<>]*?)(href\s*=\s*(['"])(?:(?!\2).)+\2)

\1 will contain your match.
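To see what the (?:(?!\2).)+ part buys over [^'"]+, here is a small Python comparison using the capture-group form. The sample tag, with an apostrophe inside a double-quoted value, is a made-up edge case, and the names strict and tempered are only labels for this sketch.

```python
import re

# Hypothetical tag whose double-quoted value contains a single quote.
tag = '''<link rel="stylesheet" href="it's-here.css">'''

# [^'"]+ stops at either quote character, so the closing \2 is never reached.
strict = re.compile(r'''<link\b[^<>]*?(href\s*=\s*(['"])[^'"]+\2)''')

# (?:(?!\2).)+ only stops at the quote character that opened the value.
tempered = re.compile(r'''<link\b[^<>]*?(href\s*=\s*(['"])(?:(?!\2).)+\2)''')

strict_match = strict.search(tag)
tempered_match = tempered.search(tag)

print(strict_match)
print(tempered_match.group(1))
```

The strict pattern finds nothing here, while the tempered one captures the full attribute, apostrophe included.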

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow