So regular expressions may have side effects. What, then, is the preferred method of getting the start and end character positions of all HTML tags in a document? Parsing libraries such as Jsoup and NekoHTML don't seem to provide this information. Even XMLLocator doesn't seem to apply, since it only provides the end of the current document event.

I'm not interested in the type or name of a tag, any of its attributes, or stripping anything out of the text. I just want to know where the tags start and where they end.

For purposes of this question, it can be assumed that the source HTML is valid.


Solution

I was curious myself, so I found this parser: http://jericho.htmlparser.net/

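import java.io.IOException;
import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public void testJericho() throws IOException {
    Source source = new Source(new URL("http://example.com/"));
    // getAllElements() already returns every element, nested ones included.
    List<Element> elementList = source.getAllElements();
    for (Element element : elementList) {
        printElement(element);
    }
}

public void printElement(Element element) {
    // No recursion into getChildElements() here: the caller already visits
    // nested elements, so recursing would print them more than once.
    // getBegin()/getEnd() are character offsets into the source; the span runs
    // from the start of the element's start tag to the end of its end tag.
    System.out.println(element.getName() + " start: " + element.getBegin());
    System.out.println(element.getName() + " end: " + element.getEnd());
}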
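
Note that the offsets above describe whole elements. If you want each individual tag instead, Jericho's element.getStartTag() and element.getEndTag() should give you per-tag segments (the end tag can be missing for void elements). To make it concrete what "start" and "end" of a tag mean, here is a minimal dependency-free sketch. It is not a replacement for a real parser: the class name TagOffsets and the method findTags are made up for illustration, and it leans on the question's assumption that the HTML is valid. In particular it assumes every literal '<' in text is escaped, it reports <!DOCTYPE ...> as a tag, and it does not special-case script or style content.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Jericho or any library.
public class TagOffsets {

    /**
     * Returns {begin, end} pairs for each tag. begin is the index of '<' and
     * end is one past the closing '>', so html.substring(begin, end) is the tag.
     * Assumes valid HTML in which a '<' in text content is always escaped.
     */
    public static List<int[]> findTags(String html) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {
                break; // no more tags
            }
            if (html.startsWith("<!--", lt)) {
                // Skip comments whole so markup inside them is not reported.
                int close = html.indexOf("-->", lt + 4);
                if (close < 0) {
                    break; // unterminated comment
                }
                i = close + 3;
                continue;
            }
            // Look for the closing '>', ignoring any '>' inside a quoted
            // attribute value.
            int j = lt + 1;
            char quote = 0;
            while (j < html.length()) {
                char c = html.charAt(j);
                if (quote != 0) {
                    if (c == quote) {
                        quote = 0; // closing quote
                    }
                } else if (c == '"' || c == '\'') {
                    quote = c; // opening quote
                } else if (c == '>') {
                    break;
                }
                j++;
            }
            if (j >= html.length()) {
                break; // unterminated tag
            }
            spans.add(new int[] {lt, j + 1});
            i = j + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        String html = "<p class=\"a>b\">Hi <!-- <i> --><b>there</b></p>";
        for (int[] span : findTags(html)) {
            System.out.println(span[0] + ".." + span[1] + "  "
                    + html.substring(span[0], span[1]));
        }
    }
}
```

The quote tracking is there because a plain search for the next '>' breaks on attribute values such as class="a>b", which is one of the ways a naive regex goes wrong. For anything beyond well-formed input I would still use the parser.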
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow