Question

I want to change specific text in a bunch of HTML files and leave the rest of their code unchanged. I decided to use the Html Agility Pack, so I wrote code like this:

        string Url = @"http://www.example.com";
        HtmlWeb web = new HtmlWeb();
        web.UserAgent = @"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36";
        HtmlDocument doc = web.Load(Url);
        doc.Save("a.html");

But the problem is that the saved source differs from the original. Is there a way to prevent the source from being changed? Or is there another way to move through the DOM and change just specific things (like in Chrome Developer Tools, where you can later save the page as HTML, but automatically)?

----------- EDIT --------

For example, this can be seen on eBay. I can't post a link because it would be advertising, but if you try this code on any item listing you will see what's going on.

---------- EDIT2 --------

It seems that eBay uses iframes, and HAP can't handle them. The and tags inside them are removed, which is probably why the saved page differs so much.


Solution

HtmlAgilityPack (HAP) will not necessarily write out the same HTML it reads. If you check the HAP source, you'll see that saving (the WriteTo method) writes out the parsed nodes rather than the original text. If the server sends invalid HTML, HAP cleans it up as part of parsing, so the saved file can differ from what was downloaded.
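
One way to see this for a given page is to parse a string and compare it with what HAP writes back. The following is only a sketch: the class name RoundTripCheck, the helper RoundTrip, and the sample markup are made up for illustration, and the exact output depends on the HAP version and its option flags.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class RoundTripCheck
{
    // Parses the markup with HAP, then serializes the parsed node tree
    // back to a string (Save writes the nodes, not the original text).
    public static string RoundTrip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Hypothetical sample: invalid HTML with unclosed <p> tags.
        string original = "<div><p>one<p>two</div>";
        string saved = RoundTrip(original);

        // Report whether the round trip preserved the markup exactly.
        Console.WriteLine(saved == original ? "identical" : "changed by HAP");
        Console.WriteLine(saved);
    }
}
```

If the two strings differ for your page, saving through HAP alone will not reproduce the original file.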

If you need to keep the original, download it with WebClient.DownloadString, save that string yourself, and load the saved content with HAP.
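
A minimal sketch of that approach, assuming the page can be fetched with a plain GET. The class name RawThenParse, the helper SaveRawAndParse, the file name original.html, and the shortened User-Agent value are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

public static class RawThenParse
{
    // Writes the downloaded text to disk as-is, then parses the same string
    // with HAP. The file on disk never passes through HAP's serializer.
    public static HtmlDocument SaveRawAndParse(string rawHtml, string rawPath)
    {
        File.WriteAllText(rawPath, rawHtml);
        var doc = new HtmlDocument();
        doc.LoadHtml(rawHtml);
        return doc;
    }

    public static void Main()
    {
        using (var client = new WebClient())
        {
            // Assumed User-Agent value; use whatever the target site expects.
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            string raw = client.DownloadString("http://www.example.com");

            HtmlDocument doc = SaveRawAndParse(raw, "original.html");

            // Query the parsed copy; original.html stays untouched.
            HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(title != null ? title.InnerText : "(no title)");
        }
    }
}
```

Note that DownloadString decodes the response into a string, so the saved file keeps the text but not necessarily the original byte encoding. If exact bytes matter, WebClient.DownloadFile may be a better fit.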

OTHER TIPS

I have been using HtmlAgilityPack a lot lately, but I have never experienced that issue.

What I do is the following:

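// Requires: using System.Net; using HtmlAgilityPack;
var wc = new WebClient();
// Download the page as a plain string, untouched by HAP
var html = wc.DownloadString(@"http://www.example.com");
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes is a method of HtmlNode, so query from the document's root node
var nodes = doc.DocumentNode.SelectNodes("//XPath/Query");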

Does that change the HTML content?

Licensed under: CC-BY-SA with attribution