HTMLUnit HtmlPage.getBody() returns null even though the response contains a <body> tag

https://stackoverflow.com/questions/21162020

htmlunit

28-09-2022
Question

I'm using HTMLUnit in Java to extract information from a website. I ran into a strange problem where the page is not fully parsed into the DOM tree. After the following call:

HtmlPage lineHours = (HtmlPage) _webClient.getTopLevelWindows().get(1).getEnclosedPage();

Watching the expression lineHours.asXml() gives the following (... marks omitted sensitive data):

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head>
    <script ...>
    </script>
  </head>
</html>

Printing lineHours.getWebResponse().getContentAsString(), however, gives the following:

<html>
  <head>
    <script ...>
    </script>
  </head>
</html>
<body>
  <div> ...

In short, the body tag is not parsed into the DOM tree, and therefore all XPath queries and helper methods such as HtmlPage.getBody() fail. Note that in the raw response the <body> element comes after the closing </html> tag. In a regular browser the page renders fine. Any ideas? Thanks, Tomer

Solution

This was eventually solved by parsing the response with a Xerces parser instead and retrieving the required data from the DOM tree it produced.
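The answer does not say how Xerces was invoked, so the following is only a minimal sketch of one possible reading. It assumes the raw response (the string returned by getContentAsString()) is well-formed XML apart from the markup that trails the closing </html> tag, and that the tag is lowercase as in the question. It uses the JDK's built-in JAXP parser, which is derived from Xerces. The class name RawResponseParser and the helper moveTrailingMarkupInsideHtml are hypothetical names made up for this illustration. A strict XML parser rejects content after the root element, so the helper first moves that content back inside <html>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RawResponseParser {

    // Hypothetical helper: moves any markup that follows the last "</html>"
    // back inside the root element, so the document has a single root.
    // Assumes a lowercase closing tag, as in the question's output.
    static String moveTrailingMarkupInsideHtml(String raw) {
        String closeTag = "</html>";
        int close = raw.lastIndexOf(closeTag);
        if (close < 0) {
            return raw; // nothing to repair
        }
        String before = raw.substring(0, close);
        String after = raw.substring(close + closeTag.length());
        return before + after + closeTag;
    }

    // Parses the repaired markup with the JDK's default (Xerces-derived) parser.
    // Assumption: the markup is otherwise well-formed XML; typical real-world
    // HTML is not, and would need a tolerant HTML parser instead.
    static Document parse(String raw) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        String repaired = moveTrailingMarkupInsideHtml(raw);
        return builder.parse(new InputSource(new StringReader(repaired)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for lineHours.getWebResponse().getContentAsString()
        String raw = "<html><head><script></script></head></html>\n"
                   + "<body><div>hours</div></body>";
        Document doc = parse(raw);
        Node body = doc.getElementsByTagName("body").item(0);
        System.out.println(body.getTextContent()); // prints "hours"
    }
}
```

In the setting of the question, the string passed to parse would be the raw response content, and the returned org.w3c.dom.Document could then be queried with the standard javax.xml.xpath API in place of the HtmlPage helpers.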

Licensed under: CC-BY-SA with attribution
