This was eventually solved by parsing the DOM tree using a Xerces parser and retrieving the result from it.
HTMLUnit HtmlPage.getBody() returns null even though the response contains a <body> tag
28-09-2022
Question
I'm using HtmlUnit in Java to extract information from a website. I ran into a strange phenomenon where the page is not fully parsed into the DOM tree. After the following call:
HtmlPage lineHours = (HtmlPage) _webClient.getTopLevelWindows().get(1).getEnclosedPage();
Watching the expression lineHours.asXml() gives the following (... marks omitted sensitive data):
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
<script ...>
</script>
</head>
</html>
Printing lineHours.getWebResponse().getContentAsString(), however, gives the following:
<html>
<head>
<script ...>
</script>
</head>
</html>
<body>
<div> ...
In short, the body tag is not parsed into the DOM tree, and therefore all XPath queries and helper methods such as HtmlPage.getBody() fail. In a regular browser the page renders well. Any ideas? Thanks, Tomer
Solution
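The fix reported for this question was to parse the markup with a Xerces parser instead of relying on the DOM that HtmlPage built. The poster's actual code is not shown, so the following is only a rough sketch of that idea under several assumptions. In the raw response above, <body> comes after the closing </html> tag, so the sketch first moves that premature </html> to the end of the string. It then parses the result with the JDK's built-in JAXP parser, which is derived from Xerces. The class BodyRecovery and the helpers repair and parseRepaired are hypothetical names. The approach only works if the repaired markup is well-formed XML; ordinary, sloppier HTML would need a lenient HTML parser instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of HtmlUnit or Xerces.
class BodyRecovery {

    // Moves a premature </html> to the end so that <body> ends up inside <html>.
    static String repair(String raw) {
        String closeTag = "</html>";
        int close = raw.indexOf(closeTag);
        if (close < 0) {
            return raw; // no closing tag, nothing to repair
        }
        String tail = raw.substring(close + closeTag.length());
        if (tail.trim().isEmpty()) {
            return raw; // </html> is already the last tag
        }
        return raw.substring(0, close) + tail + closeTag;
    }

    // Parses the repaired markup with the JDK's JAXP parser (Xerces-derived).
    // Assumption: the repaired markup is well-formed XML.
    static Document parseRepaired(String raw) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(repair(raw))));
        } catch (Exception e) {
            throw new IllegalStateException(
                    "Markup is not well-formed XML even after repair", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for lineHours.getWebResponse().getContentAsString()
        String raw = "<html><head><script></script></head></html>"
                + "<body><div>hours</div></body>";
        Document doc = parseRepaired(raw);
        Element body = (Element) doc.getElementsByTagName("body").item(0);
        System.out.println(body.getTextContent()); // prints "hours"
    }
}
```

In the setup from the question, the string passed to parseRepaired would be the value of lineHours.getWebResponse().getContentAsString(). XPath queries could then be run against the returned org.w3c.dom.Document via javax.xml.xpath instead of against the HtmlPage.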
Licensed under: CC-BY-SA with attribution