Regular expression for parsing links from a webpage?

https://stackoverflow.com/questions/6173

08-06-2019

Question

I'm looking for a .NET regular expression to extract all the URLs from a webpage, but I haven't found one comprehensive enough to cover all the different ways you can specify a link.

And a side question:

Is there one regex to rule them all? Or am I better off using a series of less complicated regular expressions and making multiple passes against the raw HTML? (Speed vs. Maintainability)

Solution

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

I took this from regexlib.com

[editor's note: the {1} quantifier has no real function in this regex and can be dropped]
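
To see what this pattern actually captures, here is a minimal sketch in Python's re module (not .NET; the pattern syntax used here should carry over unchanged). The {1} is dropped, and the function name and sample text are made up for illustration. Note that \S+ runs to the next whitespace, so trailing punctuation ends up inside the match.

```python
import re

# Pattern from the answer above, minus the redundant {1} and the
# unnecessary backslashes before ':'.
URL_PATTERN = re.compile(r"((mailto:|(news|(ht|f)tp(s?))://)\S+)")

def find_urls(text):
    """Return every run that starts with a known scheme and ends at whitespace."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]

# Hypothetical sample text.
sample = "Write to mailto:bob@example.com or visit https://example.com/a, then ftp://example.com/f."
print(find_urls(sample))
# The comma and the full stop are swallowed by \S+:
# ['mailto:bob@example.com', 'https://example.com/a,', 'ftp://example.com/f.']
```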

OTHER TIPS

From the RegexBuddy library:

URL: Find in full text

The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL. The character classes only list A-Z, so the pattern has to be run with the case-insensitive option.

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]
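
A quick sketch of this pattern in use, written in Python rather than .NET (re.IGNORECASE plays the role of RegexOptions.IgnoreCase). The function name and the sample sentence are assumptions for the demo.

```python
import re

# The RegexBuddy pattern, compiled case-insensitively because its
# character classes only list A-Z.
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

def urls_in_text(text):
    """Return full URL matches; trailing , . ; : ! ? are left out of the match."""
    return [m.group(0) for m in URL_IN_TEXT.finditer(text)]

# Hypothetical sample text.
sample = "See http://example.com/a?b=1, or ftp://example.com/file.txt."
print(urls_in_text(sample))
# ['http://example.com/a?b=1', 'ftp://example.com/file.txt']
```

The last character class is what backs the match off the trailing comma and full stop.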

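With Html Agility Pack, you can skip the regex and query the parsed document instead:

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Response.Write(link.Attributes["href"].Value);
    }
}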
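
For comparison, the same parse-then-query idea can be sketched with Python's standard-library html.parser. This is not Html Agility Pack and it has no XPath; the LinkCollector class and extract_links function are hypothetical names for this example only.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

# Hypothetical sample: double-quoted, single-quoted and unquoted hrefs.
sample = """<p><a href="/one">1</a> <a href='two.html'>2</a> <a href=three>3</a> <a name="x">no href</a></p>"""
print(extract_links(sample))
# ['/one', 'two.html', 'three']
```

The point of the parser route is that quoting style and attribute order stop mattering, which is where most of the regexes on this page fall down.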

Look at the URI specification; it could help you a lot. As far as performance goes, you can extract all the HTTP links in a modest web page without much trouble. By modest I definitely do not mean a single page that holds an entire HTML manual, like the ELisp manual. Performance is also a touchy topic. My advice is to measure first, and then decide whether to extract all the links with one single regex or with several simpler ones.
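
One rough way to do that measurement, sketched in Python with timeit (in .NET you would reach for something like Stopwatch instead). The patterns, function names and generated page are all assumptions chosen for the demo, and the timings depend entirely on your input and regex engine, so treat this as a template rather than a result.

```python
import re
import timeit

# Hypothetical patterns: one combined regex versus one simple regex per scheme.
TAIL = r"""[^\s"'<>]+"""
SINGLE = re.compile(r"(?:mailto:|(?:https?|ftp)://)" + TAIL)
PASSES = [
    re.compile(r"mailto:" + TAIL),
    re.compile(r"https?://" + TAIL),
    re.compile(r"ftp://" + TAIL),
]

def one_pass(html):
    return SINGLE.findall(html)

def multi_pass(html):
    found = []
    for pattern in PASSES:
        found.extend(pattern.findall(html))
    return found

# Synthetic page: 500 copies of a line with three links.
page = (
    '<a href="http://example.com/a">a</a> '
    '<a href="mailto:bob@example.com">b</a> '
    '<a href="ftp://example.com/f">f</a>\n'
) * 500

# Both strategies should find the same links before timing means anything.
assert sorted(one_pass(page)) == sorted(multi_pass(page))

for fn in (one_pass, multi_pass):
    seconds = timeit.timeit(lambda: fn(page), number=20)
    print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```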

http://gbiv.com/protocols/uri/rfc/rfc3986.html

All quoted HTTP and MAILTO links:

(["'])(mailto:|http:).*?\1

All links, including relative ones, that are called by href or src.

#Matches things in single or double quotes, but not the quotes themselves
(?<=(["']))((?<=href=['"])|(?<=src=['"])).*?(?=\1)

#Matches things in either double or single quotes, including the quotes.
(["'])((?<=href=")|(?<=src=")).*?\1

The second one will only get you links that use double quotes, however.
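
If the lookbehinds feel fragile, a simplified variant captures the opening quote and backreferences it, then reads the value from a group. This is my own rewrite and not one of the patterns above, sketched in Python rather than .NET; the names and sample markup are made up.

```python
import re

# Simplified variant (an assumption, not the answer's pattern):
# group 1 is the opening quote, group 2 is the attribute value.
ATTR_URL = re.compile(r"""(?:href|src)\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)

def href_and_src_values(html):
    """Return quoted href/src values, for either quote style."""
    return [m.group(2) for m in ATTR_URL.finditer(html)]

# Hypothetical sample, including a double quote inside a single-quoted value.
sample = """<a href="/about">About</a> <img src='logo.png'> <a href='it"s.html'>odd</a>"""
print(href_and_src_values(sample))
# ['/about', 'logo.png', 'it"s.html']
```

Like the patterns above, this still misses unquoted attribute values.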

I don't have time to try and think of a regex that probably won't work, but I wanted to comment that you should most definitely break up your regex, at least if it gets to this level of ugliness:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
....*SNIP*....
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

(this supposedly matches email addresses)

Edit: I can't even fit it on one post it's so nasty....

This will capture the URLs from all a tags as long as the author of the HTML used quotes:

<a[^>]+href="([^"]+)"[^>]*>
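
A small Python sketch (not .NET) showing both what this pattern catches and what it misses. The sample markup is invented for the demo.

```python
import re

# The pattern from above, unchanged.
A_HREF = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>')

# Hypothetical sample: a double-quoted href, a single-quoted one,
# and an upper-case tag.
sample = (
    '<a class="nav" href="/home">Home</a> '
    "<a href='/single'>S</a> "
    '<A HREF="/upper">U</A>'
)
print(A_HREF.findall(sample))
# ['/home']
```

Only the first link is found: the single-quoted href fails the quote requirement, and the upper-case tag is skipped unless you add the case-insensitive option.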


URL's? As in images/scripts/css/etc.?

%href="([^"]*)"%

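To extract URLs from ANY text (not only HTML), using the characters that RFC 3986 (http://tools.ietf.org/html/rfc3986) allows in a URI. The backslashes are doubled because the pattern is written as a string literal: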

(http\\://[:/?#\\[\\]@!%$&'()*+,;=a-zA-Z0-9._\\-~]+)
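
Here is the same pattern tried out in Python (not .NET), written as a raw string so the backslashes are not doubled. The sample text is hypothetical. Because the class includes every character RFC 3986 allows, closing brackets and full stops that follow a URL in running text are matched too.

```python
import re

# Same character class as above, as a Python raw string.
RFC3986_URL = re.compile(r"(http://[:/?#\[\]@!%$&'()*+,;=a-zA-Z0-9._\-~]+)")

# Hypothetical sample text.
sample = "Docs at http://example.com/a_b?q=1#frag (see http://example.com/x)."
print(RFC3986_URL.findall(sample))
# ['http://example.com/a_b?q=1#frag', 'http://example.com/x).']
```

Note that the pattern only covers http://, so https and other schemes need to be added if you want them.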

Licensed under: CC-BY-SA with attribution
