Question

I am trying to get all the text nodes of an element, including those of its children, but for some reason it is giving me the entire document's HTML.

This is what I've come up with:

HtmlAgilityPack.HtmlNode el = htmlDoc.DocumentNode.SelectSingleNode("(//div[@class='TableContainer'])[" + index + "]");
if (el != null)
{
    foreach (HtmlNode node in el.SelectNodes("//text()"))
    {
        Debug.WriteLine("text=" + node.InnerText.Replace("&nbsp;", " "));
    }
}

It prints a text= line for every text node in the whole document. I'm sure there's something wrong with the //text(), which is a snippet I found here on SO, but I don't know another way of doing it, and I've been going crazy with it.

Solution

You should use a relative XPath expression, that is, one relative to your el context node:

HtmlAgilityPack.HtmlNode el = htmlDoc.DocumentNode.SelectSingleNode("(//div[@class='TableContainer'])[" + index + "]");
if (el != null)
{
    foreach (HtmlNode node in el.SelectNodes(".//text()"))
    {
        Debug.WriteLine("text=" + node.InnerText.Replace("&nbsp;", " "));
    }
}

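"//text()" selects all descendant text nodes of the document root node, even when it is evaluated from el, because a path that starts with "/" is absolute. The leading "." in ".//text()" is what anchors the search at the context node.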

See the Location Paths and Abbreviated Syntax sections of the XPath specification for details:

  • //para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node

  • .//para selects the para element descendants of the context node
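
To see the difference side by side, here is a minimal sketch that runs both expressions from the same context node. It assumes the HtmlAgilityPack package is referenced; the class name TextNodeDemo, the helper Texts, and the sample HTML are made up for illustration. It also uses HtmlEntity.DeEntitize rather than the string replacement in the snippets above, which is a swapped-in choice, not what the original code does.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextNodeDemo
{
    // Hypothetical helper: evaluates xpath from the given context node and
    // returns the trimmed, non-empty text of every matched node.
    public static List<string> Texts(HtmlNode context, string xpath)
    {
        var result = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &nbsp; and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample document: one paragraph outside the div, two elements inside it.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body><p>outside</p>" +
            "<div class='TableContainer'><span>inside</span> <b>also inside</b></div>" +
            "</body></html>");

        HtmlNode el = doc.DocumentNode.SelectSingleNode("(//div[@class='TableContainer'])[1]");

        // Absolute path: starts at the document root, so text outside el is matched too.
        Console.WriteLine("//text()  : " + string.Join(" | ", Texts(el, "//text()")));

        // Relative path: only text nodes that are descendants of el.
        Console.WriteLine(".//text() : " + string.Join(" | ", Texts(el, ".//text()")));
    }
}
```

The null check matters in the original loop as well: if el contains no text nodes at all, iterating directly over the result of SelectNodes would throw.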

Licensed under: CC-BY-SA with attribution