extract links regex c#
26-10-2019
Question
I've been trying to solve this problem for the last two hours, but I can't find a solution.
I need to extract links from an HTML file. There are 100+ links, but only 25 of them are valid. Valid links are placed inside
<td><a href=" (link) ">
First, I had (and still have) a problem with double quotes inside verbatim strings, so I replaced the verbatim string with a "normal" string, where I can write \" for ". The problem is that the regex I have written doesn't work:
Match LinksTemp = Regex.Match(
htmlCode,
"<td><a href=\"(.*)\">",
RegexOptions.IgnoreCase);
I get <td><a href="http://www.google.com"> as output instead of http://www.google.com.
Does anyone know how I can solve this problem, and how I can use double quotes inside verbatim strings (for example @" <>"das"sa ")?
Solution
In a verbatim string, escape a double quote by doubling it: @"some""test"
Regex sample (the lazy .*? stops at the first closing quote): "<a href=\"(.*?)\">"
var match = Regex.Match(html, "<td><a href=\"(.*?)\">",
RegexOptions.Singleline);
// Groups[1] holds only the text captured by (.*?), not the whole match
var url = match.Groups[1].Value;
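To make the two points above concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that the regular and verbatim spellings produce the same pattern, that match.Value is the whole matched text while Groups[1] is only the captured link, and how the greedy (.*) from the question overshoots when a line contains more than one link.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class QuoteEscapingDemo
{
    // The same pattern written twice: \" in a regular string, "" in a verbatim string.
    public const string RegularPattern = "<td><a href=\"(.*?)\">";
    public const string VerbatimPattern = @"<td><a href=""(.*?)"">";

    // The greedy variant from the question, kept for comparison.
    public const string GreedyPattern = @"<td><a href=""(.*)"">";

    // Returns the first captured link, or an empty string when nothing matches.
    public static string FirstLink(string html, string pattern)
    {
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string one = "<td><a href=\"http://www.google.com\">";
        string two = "<td><a href=\"a.html\">A</a></td><td><a href=\"b.html\">B</a></td>";

        Console.WriteLine(RegularPattern == VerbatimPattern);        // True
        Console.WriteLine(Regex.Match(one, VerbatimPattern).Value);  // whole match, tag included
        Console.WriteLine(FirstLink(one, VerbatimPattern));          // http://www.google.com
        Console.WriteLine(FirstLink(two, VerbatimPattern));          // a.html
        Console.WriteLine(FirstLink(two, GreedyPattern));            // runs on to the last "> on the line
    }
}
```

The output in the question most likely came from reading match.Value (or calling ToString() on the match) rather than Groups[1].Value.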
You may also want to use Regex.Matches(...) instead of Regex.Match(...): Regex.Matches returns all matches, whereas Regex.Match returns only the first one.
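As a rough sketch of the Regex.Matches suggestion applied to the question's requirement (only links that sit directly inside a <td>), something like the following could work. The class name, the sample HTML, and the example.com-style URLs are invented for illustration, and allowing whitespace between <td> and <a> is an assumption about the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TableLinkExtractor
{
    // Verbatim string, so each " in the pattern is written as "".
    // \s* tolerates whitespace between <td> and <a> (an assumption about the input).
    private static readonly Regex TdLink = new Regex(
        @"<td>\s*<a href=""(.*?)"">",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Collects the href value of every <td><a href="..."> occurrence.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in TdLink.Matches(html))
        {
            links.Add(m.Groups[1].Value.Trim());
        }
        return links;
    }

    public static void Main()
    {
        string html =
            "<td><a href=\"http://a.example/\">A</a></td>" +
            "<p><a href=\"http://skip.example/\">not in a cell</a></p>" +
            "<td> <a href=\"http://b.example/\">B</a></td>";

        foreach (string link in Extract(html))
        {
            Console.WriteLine(link); // prints the two <td> links, skips the <p> one
        }
    }
}
```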
OTHER TIPS
If you want to get every link, you can use code like this:
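// Requires: using System.Text.RegularExpressions; and, for MessageBox, using System.Windows.Forms;
string htmlCode = "<td><a href=\" www.aa.pl \"><td> <a href=\" www.cos.com \"><td>";
// Lazy (.*?) captures each href value separately
Regex r = new Regex( "<a href=\"(.*?)\">", RegexOptions.IgnoreCase );
MatchCollection mc = r.Matches(htmlCode);
foreach ( Match m1 in mc ) {
// Groups[1] is the captured link (here it still includes the surrounding spaces)
MessageBox.Show( m1.Groups[1].ToString() );
}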
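The pattern above only matches when href is the first attribute and uses double quotes. As a hedged sketch of a slightly more tolerant variant (the class name and sample input are made up, and this is still not a substitute for a real HTML parser), a backreference can accept either quote style:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class QuoteTolerantLinks
{
    // Group 1 captures the opening quote (" or '), \1 requires the same quote to close.
    // Group 2 is the link itself. [^>]*? allows other attributes before href.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\s+[^>]*?href\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            links.Add(m.Groups[2].Value.Trim()); // Trim drops stray spaces around the URL
        }
        return links;
    }

    public static void Main()
    {
        string html =
            "<td><a href='http://www.google.com'>" +
            "<td><a class=\"x\" href=\" www.cos.com \">";

        foreach (string link in Extract(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Every extra case handled this way makes the pattern harder to read, which is the usual argument for switching to an HTML parser.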
Why not parse this with an HTML parser instead? HtmlAgilityPack is a good and fast HTML parser. Example:
string HTML = "<td><a href='http://www.google.com'>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HTML);
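// Select every <a> element that has an href attribute.
// SelectNodes returns null when nothing matches, so check for null in real code.
HtmlNodeCollection a = doc.DocumentNode.SelectNodes("//a[@href]");
string url = a[0].GetAttributeValue("href", null);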
Console.WriteLine(url);
Console.ReadLine();
You need to import the library with using HtmlAgilityPack; (it is available as a NuGet package).