Question

OK, I've got this code:

public static string ScreenScrape(string url)
{
    System.Net.WebRequest request = System.Net.WebRequest.Create(url);
    // set properties of the request
    using (System.Net.WebResponse response = request.GetResponse())
    {
        using (System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

Now I want to filter the text to get the div class="comment" elements. Is there another option other than using regular expressions, or is that the only way?

thanks


Solution

You need to use the HTML Agility Pack.

For example:

// Requires the HtmlAgilityPack NuGet package and a `using System.Linq;` directive.
var doc = new HtmlWeb().Load(url);
var comments = doc.DocumentNode
                  .Descendants("div")
                  .Where(div => div.GetAttributeValue("class", "") == "comment");

Note that this won't find <div class="OtherClass comment">; if you're looking for that as well, you can call IndexOf on the class value (or split it, as sketched below) instead of comparing for equality.
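
For example, a minimal sketch of that multi-class case, assuming the same `doc` variable as above: it splits the class attribute on whitespace rather than calling IndexOf, so a substring match such as "comments" won't be picked up by accident.

// Splits the class attribute into individual class names before checking for "comment".
var allComments = doc.DocumentNode
    .Descendants("div")
    .Where(div => div.GetAttributeValue("class", "")
                     .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                     .Contains("comment"));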

OTHER TIPS

HtmlAgilityPack is just a package that lets you manipulate HTML files; however, if you want to do screen scraping, Selenium WebDriver with PhantomJS is a better solution. PhantomJS is a headless web browser, so it is really fast. Moreover, it has far better functionality than the HTML Agility Pack. There is a short course on this topic.
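
For reference, a rough sketch of that approach in C#, assuming the Selenium.WebDriver and PhantomJS NuGet packages are installed, the PhantomJS executable is on the PATH, and `url` is the page to scrape (note that PhantomJS support has since been deprecated in newer Selenium releases):

// Launches a headless PhantomJS browser, loads the page, and prints the text of
// every div whose class list includes "comment".
using (var driver = new OpenQA.Selenium.PhantomJS.PhantomJSDriver())
{
    driver.Navigate().GoToUrl(url);
    foreach (var comment in driver.FindElements(OpenQA.Selenium.By.CssSelector("div.comment")))
    {
        System.Console.WriteLine(comment.Text);
    }
}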

You shouldn't use regular expressions to parse HTML - they are the wrong tool for the job, as HTML is too complex for them.
You should use an HTML parser.
See also: Looking for C# HTML parser

Your first port of call should be the HTML Agility Pack.

Regular expressions are the classical way to parse this kind of input in non-.NET languages.

Additionally, if you can normalize this to an XML variant (e.g. XHTML), you can use XPath to query and retrieve the required nodes.
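
For example, a minimal sketch using LINQ to XML's XPath extensions, assuming `xhtml` holds a well-formed XHTML string (such as the result of ScreenScrape after normalization); the local-name() test is used because XHTML elements live in a namespace:

using System.Xml.Linq;
using System.Xml.XPath;

// `xhtml` is assumed to be well-formed XHTML produced by your normalization step.
var xdoc = XDocument.Parse(xhtml);
var comments = xdoc.XPathSelectElements(
    "//*[local-name()='div' and @class='comment']");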

What you do not want to do is implement your own parser.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow