Question

So regular expressions may have side effects. What, then, is the preferred method of getting the start and end character positions of all HTML tags in a document? Parsing libraries such as Jsoup and NekoHTML don't seem to provide this information. Even XMLLocator doesn't seem to apply, since it only provides the end of the current document event.

I'm not interested in the type or name of a tag, any of its attributes, or stripping anything out of the text. I just want to know where the tags start and where they end.

For purposes of this question, it can be assumed that the source HTML is valid.

Solution

I was curious myself, so I found this parser: http://jericho.htmlparser.net/

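import java.io.IOException;
import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public void testJericho() throws IOException {
    Source source = new Source(new URL("http://example.com/"));
    // getAllElements() already includes nested elements, so recursing into
    // children as well would print every nested element more than once.
    List<Element> elementList = source.getAllElements();
    for (Element element : elementList) {
        printElement(element);
    }
}

// An Element spans from the start of its start tag to the end of its end tag.
// For the positions of individual tags, iterate source.getAllTags() instead.
public void printElement(Element element) {
    // Offsets are character positions in the source; getEnd() is exclusive.
    System.out.println(element.getName() + " start: " + element.getBegin());
    System.out.println(element.getName() + " end: " + element.getEnd());
}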
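
If pulling in a parser is not an option, a small hand-written scanner can also report tag offsets. The sketch below is not part of Jericho or any other library; the class name TagOffsets and the method findTags are made up for illustration. It leans on the question's assumption that the HTML is valid: every '<' in the input opens a tag or a comment, and a '>' inside a quoted attribute value does not close the tag. It treats a whole comment as one span and does not special-case script or style content, so a '<' inside inline JavaScript would be misread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, shown only as a sketch of the idea.
public class TagOffsets {

    /** Returns {start, end} offset pairs for each tag; end is exclusive. */
    public static List<int[]> findTags(String html) {
        List<int[]> spans = new ArrayList<>();
        int n = html.length();
        int i = 0;
        while (i < n) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {
                break; // no more tags
            }
            int end;
            if (html.startsWith("<!--", lt)) {
                // A comment runs to the next "-->", whatever it contains.
                int close = html.indexOf("-->", lt + 4);
                end = (close < 0) ? n : close + 3;
            } else {
                end = scanToTagEnd(html, lt + 1);
            }
            spans.add(new int[] {lt, end});
            i = end;
        }
        return spans;
    }

    // Finds the offset just past the closing '>' of a tag, skipping any '>'
    // that appears inside a single- or double-quoted attribute value.
    private static int scanToTagEnd(String html, int from) {
        char quote = 0;
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (c == '>') {
                return i + 1;
            }
        }
        return html.length(); // unterminated tag: run to end of input
    }
}
```

For the input <p class="a>b">hi</p> this yields two spans, 0 to 15 for the start tag and 17 to 21 for the end tag. For real-world documents a parser such as Jericho remains the safer choice.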
Licensed under: CC-BY-SA with attribution