Question

I'm trying to download a page's HTML so I can parse it, using as little bandwidth as possible. This is part of my code:

    // webReq is declared elsewhere (e.g. WebRequest webReq;)
    if (!String.IsNullOrEmpty(siteAddress))
    {
        webReq = WebRequest.Create(siteAddress);

        string html;
        // The using blocks close the response, stream and reader,
        // even if reading throws.
        using (WebResponse webRes = webReq.GetResponse())
        using (Stream streamResponse = webRes.GetResponseStream())
        using (StreamReader streamRead = new StreamReader(streamResponse))
        {
            html = streamRead.ReadToEnd().Trim();
        }

        StringReader sr = new StringReader(html);

        HtmlAgilityPack.HtmlDocument hDoc = new HtmlAgilityPack.HtmlDocument();
        hDoc.Load(sr);
    }

Can someone confirm that retrieving the response only provides the text response, and that no images are downloaded as well? What about when loading it with the HtmlAgilityPack Load method?


Solution

When using WebClient, WebRequest, or HtmlAgilityPack, only the HTML is downloaded. None of them fetch the images, scripts, or stylesheets that the page refers to.

If you want the images (or other resources), you have to search the downloaded document for the image URLs and issue the requests for them yourself.
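As a rough illustration of that step, here is a minimal sketch that lists the image URLs found in HTML you have already downloaded. The class name ImageUrlFinder, the method FindImageUrls, and the pageUri parameter are made up for this example; it assumes the HtmlAgilityPack package is referenced and that the images you care about are plain img tags with a src attribute.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ImageUrlFinder
{
    // Returns the absolute URL of every <img src="..."> in the given HTML.
    // Nothing is downloaded here; the HTML is only parsed in memory.
    public static List<string> FindImageUrls(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
            return urls;

        foreach (HtmlNode img in nodes)
        {
            string src = img.GetAttributeValue("src", "");
            if (src.Length == 0)
                continue;

            // Resolve relative paths such as "/a.png" against the page address.
            Uri absolute;
            if (Uri.TryCreate(pageUri, src, out absolute))
                urls.Add(absolute.AbsoluteUri);
        }

        return urls;
    }
}
```

Each returned URL would then need its own request, for example with WebClient.DownloadFile. Until you make those requests, no image data is transferred.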

If you want to experiment a bit, the WebBrowser control could be something to look at. From its Document property you can read the Images property and then download all the images yourself.

What do you want to do?

OTHER TIPS

You are downloading the HTML source of the page, not the whole site. That is a big difference.

See How to use HTML Agility Pack and also this one

Licensed under: CC-BY-SA with attribution