How do I read HTML Document in C# given that I have the webpage source stored in a string variable?

https://stackoverflow.com/questions/8266607

08-03-2021
|

Question

I have tried to do this on my own but couldn't.

I have an html document, and I'm trying to extract the addresses for all the pictures in it into a c# collection and I'm not sure of the syntax. I'm using HTMLAgilityPack... Here is what I have so far. Please advise.

The HTML Code is the following:

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

And the c# code is the following:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

I tried to use the HtmlAgilityPack Documentation but this pack lacks examples and the information about its methods and classes are really hard for me to understand without examples.

Solution

try this, sorry if it will not be buildable, I have overwritten our code to your situation

List<string> result = new List<string>();
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[@src]"))
{
    HtmlAttribute att = link.Attributes["src"];

    string temp = att.Value;
    string urlValue;
    do
    {
        urlValue = temp;
        temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
    } while (temp != urlValue);

    result.Add(temp);
}

OTHER TIPS

You can use the overload of Load which takes a TextReader:

document.Load(new StringReader(text));

(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)

In this line:

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

you are selecting the <div> node, not the <img> nodes under it. Try this to select those img nodes:

HtmlNodeCollection linkNodes = document.DocumentNode
     .SelectNodes("//div[@id='myWeb123']/img");

As for the selection syntax, it's identical to XPath as used in XML. So search for XPath if you want examples of the selection.

In this case:

the leading / starts searching from the root of the document (instead of from some "currect node")
the // means that the next match can be at any depth instead of directly under the root
div[@id='myWeb123'] searches for a <div> node with an attribute 'id' that has value 'myWeb123'
the /img searches for an img node directly under the matched div node.

Using Xpath like this will be expensive if the page size grows. Best is to deserialize the html to an object. You also dont need to use the Htmlagility reference that you are using. Load the HTML using streamreader and the use Xmlserializer Use XSD tool , first to convert to xsd and then generate a class from the xsd tool

1)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c /language:CS c:\xtest.xml

Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.xsd'.

2)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c  xtest.xsd
Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.cs'.

Import this class to your solution

html col = new html();
StreamReader reader = new StreamReader("c:\\test.html"); 
XmlSerializer ser = new XmlSerializer(typeof(html));
col = (html)ser.Deserialize(reader);

The col object then will contain all the src of the img tags in one shot.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow