Question

I have this HTML page: http://pastebin.com/ewN5NZis

I want to use HtmlAgilityPack to obtain this result:

List 1: Title1, Title2
List 2: John, Antony
List 3: 29/04/14, 28/04/14

I want to store the data in 3 different lists.

I'm trying with:

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.LoadHtml(html);

        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//tr"))
        {
            res += node.InnerHtml;
        }

Is it right that the res variable now holds all the tags of the document? What do I need to do next in order to obtain the 3 lists?

Thanks..


Solution

Grabbing all the raw markup at once is not recommended: you would then have to split it apart yourself, which is painful and error-prone.

Try this instead: select each <td> by its specific class and read InnerText rather than InnerHtml:

List<string> topicList = new List<string>();
List<string> authorList = new List<string>();
List<string> lastPostList = new List<string>();

// Note: SelectNodes returns null (not an empty collection) when nothing matches,
// so check for null first if the markup might not contain these cells.
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//td[@class='topic starter']"))
{
    topicList.Add(node.InnerText);
}
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//td[@class='author']"))
{
    authorList.Add(node.InnerText);
}
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//td[@class='lastpost']"))
{
    // This also includes the author of the last post (e.g. Antony 24/10/09).
    lastPostList.Add(node.InnerText);
}

If you need the last poster and the date as separate values, you can use the string's Split() method.
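
For example, a minimal sketch of that split, assuming the cell text looks like "Antony 24/10/09" with the date as the last whitespace-separated token. The class name LastPostParser and the helper SplitLastPost are made up for illustration and are not part of HtmlAgilityPack:

```csharp
using System;

static class LastPostParser
{
    // Hypothetical helper: turns "Antony 24/10/09" into ("Antony", "24/10/09").
    // Assumption: the date is the last whitespace-separated token of the cell text.
    public static (string Author, string Date) SplitLastPost(string cellText)
    {
        // Passing a null separator array makes Split break on any whitespace,
        // and RemoveEmptyEntries drops the gaps left by repeated spaces or newlines.
        string[] parts = cellText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
            return ("", "");

        string date = parts[parts.Length - 1];
        // Everything before the date is treated as the author name.
        string author = string.Join(" ", parts, 0, parts.Length - 1);
        return (author, date);
    }
}
```

Inside the last loop you could then call SplitLastPost(node.InnerText) and add only the Date part to lastPostList. If the real cells put the date somewhere other than the end, the helper would need adjusting.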

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow