Question

I want to parse HTML (assume it is well-formed XML, converted via Tidy) and get all the text nodes (that is, the visible nodes inside the body tag) together with their location in the XML file. By location I mean the text position in the flat XML file.

Solution

XmlTextReader implements IXmlLineInfo, which exposes LineNumber and LinePosition for the current node. The docs for IXmlLineInfo give an example of reading an XML file and reporting the location of each node.

EDIT: For those saying it's irrelevant, it may well be irrelevant to the XML - but quite possibly not to a human. If you're trying to tell people where to look in the XML for particular bits, it can be very helpful to report line numbers and positions.

OTHER TIPS

The SAX API for reading XML (which almost all XML tools implement) passes a Locator to the ContentHandler, which lets you get the line and character (column) number of the current event:

int getColumnNumber()
    Return the column number where the current document event ends.
int getLineNumber()
    Return the line number where the current document event ends.

(I missed the requirement for C#. The example above is for Java, but I will try to find the corresponding C# interface.)

The event can be a run of character data, which is the text-node case the question asks about.
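As a rough illustration of the Locator approach in Java, the sketch below records the line and column reported when each run of character data inside the body element is delivered. The class name TextNodeLocator and the method locate are made up for this example, and it assumes the Tidy output has no DOCTYPE that the parser would try to fetch. Keep in mind that a parser may split one text node across several characters() calls, and that the Locator reports where the event ends, not where it starts.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class TextNodeLocator {

    // Returns one "line:column text" entry per non-blank run of
    // character data found inside <body>.
    static List<String> locate(String xml) throws Exception {
        List<String> hits = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;   // supplied by the parser, may stay null
            private int bodyDepth = 0; // > 0 while we are inside <body>

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                    bodyDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (bodyDepth > 0) {
                    bodyDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (bodyDepth > 0 && !text.isEmpty() && locator != null) {
                    // Position where this character event ENDS.
                    hits.add(locator.getLineNumber() + ":"
                            + locator.getColumnNumber() + " " + text);
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return hits;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>T</title></head><body>\n"
                + "<p>Hello</p>\n<p>World</p>\n</body></html>";
        for (String hit : locate(xml)) {
            System.out.println(hit);
        }
    }
}
```

To get the start position instead, one option is to remember the Locator values from the preceding event and use those as an approximation.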

SAX for .NET is described in: http://saxdotnet.sourceforge.net/

You should not rely on text position in an XML file (whitespace between elements is usually insignificant, and tools that reformat the file will change it). What you can (and should) do is use XPath to identify the nodes you are interested in, and then take the text out of those nodes. If you're interested in just the text nodes, then the query "//text()" will grab all of them.
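A minimal Java sketch of the "//text()" query, using the XPath support that ships with the JDK. The class name TextNodeQuery and the method textNodes are invented for this example. Skipping whitespace-only nodes is an extra step added here, not something the query does by itself. Note that a standard DOM does not keep line numbers, so this gives you the text but not its position.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class TextNodeQuery {

    // Returns the trimmed value of every non-blank text node in the document.
    static List<String> textNodes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // "//text()" selects every text node, as suggested above.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text()", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String value = nodes.item(i).getNodeValue().trim();
            if (!value.isEmpty()) {
                texts.add(value); // skip whitespace-only nodes (our own choice)
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                textNodes("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

To limit the result to visible content, a narrower query such as "//body//text()" could be used in place of "//text()".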

Licensed under: CC-BY-SA with attribution