extract links regex c#

https://stackoverflow.com/questions/6313033

26-10-2019

Question

I've been trying to solve this problem for the last two hours, but I can't seem to find a solution.

I need to extract links from an HTML file. There are 100+ links, but only 25 of them are valid.

Valid links are placed inside

<td><a href=" (link) ">

First, I had (and still have) a problem with double quotes inside verbatim strings. So I replaced the verbatim string with a "normal" string so that I can use \" for ", but the problem is that the regex I have written doesn't work

Match LinksTemp = Regex.Match(
                              htmlCode,
                              "<td><a href=\"(.*)\">",
                              RegexOptions.IgnoreCase);

as I get <td><a href="http://www.google.com"> as output instead of http://www.google.com

Does anyone know how I can solve this problem, and how I can use double quotes inside verbatim strings (for example, @" <>"das"sa ")?

Solution

In a verbatim string, escape a double quote by doubling it: @"some""test"
Regex sample: "<a href=\"(.*?)\">"

    var match = Regex.Match(html, "<td><a href=\"(.*?)\">", RegexOptions.Singleline);
    // Groups[1] holds only the captured URL; match.Value would be the whole matched tag.
    var url = match.Groups[1].Value;

You may also want to use Regex.Matches(...) instead of Regex.Match(...), since Match returns only the first match and Matches returns all of them.
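
To tie the points above together, here is a minimal sketch that writes the same pattern as a verbatim string (with doubled quotes) and collects every captured URL with Regex.Matches. The class name LinkExtractor, the method ExtractLinks, and the sample HTML are made up for illustration and are not from the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Verbatim string: each literal double quote is written as "".
    // This is the same pattern as "<td><a href=\"(.*?)\">" in a regular string.
    private const string Pattern = @"<td><a href=""(.*?)"">";

    // Hypothetical helper: returns the href of every <a> that directly follows a <td>.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            // m.Value is the whole matched tag; Groups[1] is only the URL.
            links.Add(m.Groups[1].Value.Trim());
        }
        return links;
    }

    public static void Main()
    {
        // Assumed sample input: two links inside <td> and one outside it.
        string html = "<td><a href=\"http://www.google.com\">Google</a></td>"
                    + "<p><a href=\"http://example.com/ignored\">skip</a></p>"
                    + "<td><a href=\"http://example.org\">Example</a></td>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

With this sample input the program prints http://www.google.com and http://example.org; the link inside the <p> is skipped because the pattern requires the <td> prefix.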

OTHER TIPS

If you want to get every link, you can simply use code like this:

string htmlCode = "<td><a href=\" www.aa.pl \"><td> <a href=\" www.cos.com \"><td>";
Regex r = new Regex( "<a href=\"(.*?)\">", RegexOptions.IgnoreCase );
MatchCollection mc = r.Matches(htmlCode);

foreach ( Match m1 in mc ) {                
   MessageBox.Show( m1.Groups[1].ToString() );
}

Why not parse this with an HTML parser? HtmlAgilityPack is a good and fast HTML parser. Example:

    string HTML = "<td><a href='http://www.google.com'>";

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(HTML);

    // Select every <a> element that has an href attribute.
    // Note: SelectNodes returns null when nothing matches.
    HtmlNodeCollection a = doc.DocumentNode.SelectNodes("//a[@href]");

    string url = a[0].GetAttributeValue("href", null);

    Console.WriteLine(url);
    Console.ReadLine();

You need to add using HtmlAgilityPack; at the top of the file.

Licensed under: CC-BY-SA with attribution
