Java Html Parser that supports XPath Axes?

https://stackoverflow.com/questions/19525643

01-07-2022

Question

Following is a fragment of an HTML document in which I need to associate the "title" cell (e.g. FILE_BYTES_WRITTEN) with the text() of the first succeeding <td>.

The following XPath works great in Python with lxml:

//td[text()='FILE_BYTES_WRITTEN']/following-sibling::td

The doc fragment:

   <td>HDFS_BYTES_READ</td>
   <td align="right">4,825</td>
   <td align="right">0</td>
   <td align="right">4,825</td>
 </tr>

   <tr>

   <td>FILE_BYTES_WRITTEN</td>
   <td align="right">415,881</td>
   <td align="right">48,133</td>
   <td align="right">464,014</td>
 </tr>

   <tr>

   <td>HDFS_BYTES_WRITTEN</td>
   <td align="right">98,580,205</td>
   <td align="right">2,010</td>
   <td align="right">98,582,215</td>
 </tr>

But when I try to do this in Java I am having less success. I am not sure whether any Java HTML parser supports XPath axes such as following-sibling. I am presently using HtmlCleaner.

Solution 2

As a preamble: I will indeed look at HtmlUnit, as suggested by @Sage.

In the meantime, I have come up with the following solution:

a) HtmlCleaner actually has a DomSerializer that converts its cleaned TagNode tree into a standard org.w3c.dom.Document (effectively XHTML):

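public static Document toXhtml(String html) throws ParserConfigurationException {
    // Parse the (possibly malformed) HTML into HtmlCleaner's own tree.
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode tagNode = cleaner.clean(html);
    // Convert that tree into a W3C DOM Document that standard XPath tools accept.
    DomSerializer domSerializer = new DomSerializer(new CleanerProperties());
    return domSerializer.createDOM(tagNode);
}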

b) Once we have that DOM (XHTML) document, there are plenty of options for evaluating the XPath against it, for example Xalan.
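
As a minimal sketch of step b), the following assumes the JDK's built-in javax.xml.xpath API instead of calling Xalan directly (the answer only names Xalan as one option). The class name XPathAxesDemo and the helpers parse and followingCells are hypothetical names chosen for illustration. To keep the example self-contained, it parses a well-formed copy of the table rows with the JDK parser; in the real workflow the Document would come from the toXhtml method above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAxesDemo {

    // Hypothetical stand-in for toXhtml(): parses an already well-formed snippet.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of every <td> that follows the cell whose text equals title.
    // Assumption: title contains no single quote, since it is spliced into the expression.
    public static List<String> followingCells(Document doc, String title) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//td[text()='" + title + "']/following-sibling::td";
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            cells.add(nodes.item(i).getTextContent());
        }
        return cells;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<table>"
                + "<tr><td>HDFS_BYTES_READ</td><td align=\"right\">4,825</td>"
                + "<td align=\"right\">0</td><td align=\"right\">4,825</td></tr>"
                + "<tr><td>FILE_BYTES_WRITTEN</td><td align=\"right\">415,881</td>"
                + "<td align=\"right\">48,133</td><td align=\"right\">464,014</td></tr>"
                + "</table>";
        // Print each cell that follows the FILE_BYTES_WRITTEN title cell.
        for (String cell : followingCells(parse(xml), "FILE_BYTES_WRITTEN")) {
            System.out.println(cell);
        }
    }
}
```

To get only the first succeeding cell, as the question asks, appending [1] to the expression (following-sibling::td[1]) should restrict the result to the nearest sibling.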

Other tips

You can look into HtmlUnit, which has a nice getByXPath() function. It is a GUI-less browser. Have a look at its examples.

Another one that I use for parsing, and like the most, is Jsoup, which has a powerful select(query) function that makes these things easy. Note that its queries are CSS-style selectors rather than XPath. Check out the documentation of its Selector class; you will find everything you need there.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow