Question

I have been looking at similar questions and searching online, but I cannot seem to find a solution. What I am trying to do is select all the DOM elements in document order and then put them into an ArrayList or a similar collection.

Currently I have:

public void Parse()
    {
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

        // There are various options, set as needed
        //htmlDoc.OptionFixNestedTags = true;

        // filePath is a path to a file containing the html
        htmlDoc.Load("Test.html");

        // Use htmlDoc.LoadHtml(htmlString) to load from a string (was htmlDoc.LoadXML(xmlString))

        // ParseErrors is a collection containing any errors from the Load statement
        if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
        {
            Console.WriteLine("There was an error parsing the HTML file");
        }
        else
        {
            if (htmlDoc.DocumentNode != null)
            {
                Console.WriteLine("document node not null");
                //HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

                foreach (HtmlNode node in htmlDoc.DocumentNode.Descendants())
                {
                    Console.WriteLine(node.Name);
                }
            }
        }
    }

The code outputs the name of each node (html, title, image, etc.), but it outputs the closing tags as "#text". I assume this is because the tags start with a "/". How can I get a proper readout of all the DOM elements?

Solution 2

"#text" is the name of text nodes. Closing tags are not represented as separate nodes in the DOM at all.

<div><span>foo</span> bar</div>

will give you a tree like:

div
   span
      #text:foo
   #text:bar
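
To see this tree for yourself, here is a minimal sketch that walks the parsed document recursively and prints one indented line per node. The class name TreeDump and the helpers Lines and Collect are made up for this illustration; only LoadHtml, DocumentNode, ChildNodes, NodeType, Name and InnerText come from HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TreeDump
{
    // Hypothetical helper: parses the HTML and returns one indented line per node.
    public static List<string> Lines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        // DocumentNode is the synthetic "#document" root, so start from its children.
        foreach (HtmlNode child in doc.DocumentNode.ChildNodes)
            Collect(child, 0, lines);
        return lines;
    }

    // Adds a line for this node, then recurses into its children one level deeper.
    static void Collect(HtmlNode node, int depth, List<string> lines)
    {
        string indent = new string(' ', depth * 3);
        if (node.NodeType == HtmlNodeType.Text)
            lines.Add(indent + node.Name + ":" + node.InnerText); // text nodes are named "#text"
        else
            lines.Add(indent + node.Name);

        foreach (HtmlNode child in node.ChildNodes)
            Collect(child, depth + 1, lines);
    }

    static void Main()
    {
        foreach (string line in Lines("<div><span>foo</span> bar</div>"))
            Console.WriteLine(line);
    }
}
```

Note that nothing is ever emitted for </span> or </div>: the parser uses closing tags only to decide where an element ends. The text node for " bar" keeps its leading space, since the text is stored as written.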

OTHER TIPS

I suspect the #text nodes that you saw are line breaks rather than closing tags. For example, this HTML input:

<div>
    <a href="http://example.org"></a>
</div>

run through your code will output:

div
#text   <- line break between <div> and <a>
a
#text  <- line break between </a> and </div>

You can use this XPath query instead to get all elements while excluding plain text nodes (and so skipping those unnecessary line breaks):

foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//*"))
{
    Console.WriteLine(node.Name);
}

That XPath means: select every element node in the document, whatever its name (*). Text nodes and comments are not elements, so they are not matched.
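
If you would rather avoid XPath, a rough equivalent is to keep using Descendants() and filter on the node type, which also gives you the list the question asks for. This is only a sketch: ElementLister and ElementNames are names invented here, and it uses a List<string> in place of an ArrayList. One thing to keep in mind with the XPath version is that SelectNodes returns null rather than an empty collection when nothing matches, so it needs a null check on unknown input. Descendants() does not have that problem.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ElementLister
{
    // Hypothetical helper: returns the names of all element nodes in document
    // order, skipping text nodes (#text) and comments (#comment).
    public static List<string> ElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants()                                  // every node below the root, in document order
                  .Where(n => n.NodeType == HtmlNodeType.Element) // keep elements only
                  .Select(n => n.Name)
                  .ToList();
    }
}

class Program
{
    static void Main()
    {
        string html = "<div>\n    <a href=\"http://example.org\"></a>\n</div>";
        foreach (string name in ElementLister.ElementNames(html))
            Console.WriteLine(name);
    }
}
```

For the sample input above this should print div and then a, with no #text lines in between. If you need the nodes themselves and not just their names, drop the Select call and return a List<HtmlNode>.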

Licensed under: CC-BY-SA with attribution