"#text" is name of text nodes and closing tags are not represented as anything unique in the DOM.
<div><span>foo</span> bar</div>
Will give you tree like
div
span
#text:foo
#text:bar
Question
I have been looking at similar questions and searching online, but I cannot seem to find a solution. What I am trying to do is select all the DOM elements in order ( etc.) and then put them into an ArrayList or something similar.
Currently I have:
public void Parse()
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    // There are various options, set as needed
    //htmlDoc.OptionFixNestedTags = true;
    // filePath is a path to a file containing the html
    htmlDoc.Load("Test.html");
    // Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString))
    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {
        Console.WriteLine("There was an error parsing the HTML file");
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            htmlDoc.DocumentNode.Descendants();
            Console.WriteLine("document node not null");
            //HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
            foreach (HtmlNode node in htmlDoc.DocumentNode.Descendants())
            {
                Console.WriteLine(node.Name);
            }
        }
    }
}
The code outputs the name of each node (html, title, image, etc.), but it outputs the closing tags as "#text". I assume this is because the tags start with a "/". How can I get a proper readout of all the DOM elements?
Solution 2
"#text" is the name of a text node; closing tags are not represented as separate nodes in the DOM. For example:
<div><span>foo</span> bar</div>
will give you a tree like:
div
    span
        #text:foo
    #text:bar
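To check this against your own document, here is a minimal sketch that prints each node's type next to its name. It assumes the HtmlAgilityPack package is referenced; the class name and the inline HTML string (used in place of Test.html) are only for illustration.

```csharp
using System;
using HtmlAgilityPack;

class NodeTypeDemo
{
    static void Main()
    {
        var htmlDoc = new HtmlDocument();
        // Inline sample instead of a file, purely for illustration
        htmlDoc.LoadHtml("<div><span>foo</span> bar</div>");

        foreach (HtmlNode node in htmlDoc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Text nodes are named "#text"; show their content in quotes
                Console.WriteLine(node.Name + " (Text): \"" + node.InnerText + "\"");
            }
            else
            {
                // Elements and comments: show the name and the node type
                Console.WriteLine(node.Name + " (" + node.NodeType + ")");
            }
        }
    }
}
```

Putting the text in quotes makes it easy to spot "#text" nodes that contain nothing but whitespace.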
Other tips
I suspect the #text nodes you saw are line breaks rather than closing tags. For example, take this HTML input:
<div>
<a href="http://example.org"></a>
</div>
Running your code on it will output:
div
#text <- line break between <div> and <a>
a
#text <- line break between </a> and </div>
You can use this XPath query instead to get all elements that aren't plain text nodes (skipping those unnecessary line breaks):
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    Console.WriteLine(node.Name);
}
That XPath means: select every element node anywhere in the document, whatever its name (*). Text nodes are not elements, so they are left out.
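Since the original goal was to put the elements into a list in document order, here is a rough sketch that collects the element names into a List<string>. It assumes the HtmlAgilityPack package is referenced; the class name, variable names, and inline HTML are made up for illustration. As far as I know, SelectNodes returns null rather than an empty collection when nothing matches, so the sketch guards against that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class ElementListDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Same sample as above, including the line breaks between tags
        doc.LoadHtml("<div>\n<a href=\"http://example.org\"></a>\n</div>");

        var names = new List<string>();

        // "//*" matches element nodes only, so whitespace text nodes are skipped
        HtmlNodeCollection elements = doc.DocumentNode.SelectNodes("//*");
        if (elements != null) // null (not empty) when there are no matches
        {
            foreach (HtmlNode node in elements)
            {
                names.Add(node.Name);
            }
        }

        Console.WriteLine(string.Join(", ", names));
    }
}
```

If you would rather avoid XPath, filtering Descendants() on node.NodeType == HtmlNodeType.Element should give the same set of nodes.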