Question

I have a piece of HTML that I'm trying to parse with HtmlAgilityPack. Here's the part I'm interested in (sorry for using a picture, but it's cleaner and shows my point more clearly):

[Image: screenshot of the HTML, with the div id = content highlighted]

What I'm trying to do is very simple, but I can't figure it out. I want to select the div with id = content that is highlighted in the image. To do this with HtmlAgilityPack in C#, I'm using:

HtmlDocument doc = new HtmlDocument(); //creating HtmlAgilityPack document
doc.LoadHtml(htmlstring); //loading html

var content = doc.DocumentNode.SelectSingleNode("//div[@id='content']"); //running XPATH

The problem is that the last statement does select the div I mention above, but it's incomplete. Instead of containing all the children shown in the image, it contains only one child, the first div with id = item. The same XPath expression, run in Chrome with XPath Helper, selects the correct div with all its children. I can't tell whether I'm using HtmlAgilityPack incorrectly or my XPath expression is wrong. Can anyone give me a hint?


Solution

Well, you've got some messed up HTML to deal with there. Every one of those items contains two malformed <a> tags.

One is missing its > at the end of its start tag:

<div id="covershot"><a href="http://www.cineblog01.tv/the-thirteenth-tale-subita-2013/" target="_self" <p><img src="http://www.locandinebest.net/imgk/The_Thirteenth_Tale_2013.jpg"></p>

The other stops dead after <a class=" and has no closing tag:

<td><div><a class="<div class="fblike_button" style="margin: 10px 0;"><iframe src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.cineblog01.tv%2Fthe-thirteenth-tale-subita-2013%2F&amp;layout=button_count&amp;show_faces=false&amp;width=150&amp;action=like&amp;colorscheme=dark" scrolling="no" frameborder="0" allowTransparency="true" style="border:none; overflow:hidden; width:150px; height:20px"></iframe></div> </div> </td>

I'm guessing that's causing some problems for the parser. Have you tried selecting the wrapper or contentwrapper divs to see if it's putting the missing divs inside them?

You might try to fix these problems with some string replacement to see if that gets it to parse correctly:

htmlstring = htmlstring.Replace("target=\"_self\" <", "target=\"_self\" ><")
                       .Replace("<a class=\"<", "<");
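
The same replacement can be wrapped in a small helper so it is easy to try on a sample before feeding the result to doc.LoadHtml. This is only a sketch: the class name HtmlRepair, the method name FixMalformedAnchors, and the shortened sample string are made up for illustration, and it assumes the two broken patterns always appear exactly as quoted above.

```csharp
using System;

public static class HtmlRepair
{
    // Patches the two malformed <a> patterns quoted above before parsing.
    // Hypothetical helper: it only handles these two exact patterns.
    public static string FixMalformedAnchors(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        return html
            // Close the start tag that is missing its '>'.
            .Replace("target=\"_self\" <", "target=\"_self\" ><")
            // Drop the truncated <a class=" fragment and keep the tag after it.
            .Replace("<a class=\"<", "<");
    }

    public static void Main()
    {
        // Shortened, made-up sample modelled on the two snippets above.
        string broken =
            "<div id=\"covershot\"><a href=\"#\" target=\"_self\" <p><img src=\"x.jpg\"></p>" +
            "<td><div><a class=\"<div class=\"fblike_button\"></div> </div> </td>";

        // Print the patched markup so it can be inspected before parsing.
        Console.WriteLine(FixMalformedAnchors(broken));
    }
}
```

Note that this only repairs the start tags. The first quoted snippet shows no closing </a>, so the parser may still have to recover from that on its own, and it is worth re-running the XPath query on the patched string to check whether all the item divs now end up under the content div.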
Licensed under: CC-BY-SA with attribution