Getting access to the original HTML in HtmlUnit HtmlElement?

https://stackoverflow.com/questions/23518671

17-07-2023

Question

I am using HtmlUnit to read content from a web site.

Everything works perfectly up to the point where I read the content with:

  HtmlDivision div = page.getHtmlElementById("my-id");

Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?

I am not willing to switch from HtmlUnit to something else, as the web site expects the client to run JavaScript, and HtmlUnit seems capable of doing what is required.

Answer

If by "original HTML" you mean the markup as HtmlUnit has already parsed and formatted it, then you can use div.asXml(). If you are really looking for the original HTML the server sent for that element, you won't find a way to get it from the element (at least up to v2.14).

As a workaround, you could get the whole text of the page as the server sent it, as described in this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
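Once you have the raw page text (in HtmlUnit this is typically page.getWebResponse().getContentAsString()), you still have to cut the div out of it yourself. Below is a rough sketch of one way to do that with plain string scanning. The class RawDivExtractor and the method innerHtmlOfDiv are hypothetical names made up for this illustration and are not part of HtmlUnit. It assumes the id attribute is quoted and the div tags are balanced, and it will be fooled by "<div" appearing inside comments or scripts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: a naive scanner over raw HTML text.
final class RawDivExtractor {

    /**
     * Returns the markup between the opening div tag carrying the given id and
     * its matching closing tag, exactly as it appears in rawHtml.
     * Returns null if the div is not found or the div tags are unbalanced.
     */
    static String innerHtmlOfDiv(String rawHtml, String id) {
        // Locate the opening tag of the div with the requested (quoted) id.
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\sid\\s*=\\s*([\"'])" + Pattern.quote(id) + "\\1[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher start = open.matcher(rawHtml);
        if (!start.find()) {
            return null;
        }
        int contentStart = start.end();

        // Walk over every later opening or closing div tag, tracking nesting depth.
        Matcher tags = Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE)
                .matcher(rawHtml);
        int depth = 1;
        int pos = contentStart;
        while (tags.find(pos)) {
            depth += tags.group(1).isEmpty() ? 1 : -1;
            if (depth == 0) {
                // Found the closing tag that matches our opening tag.
                return rawHtml.substring(contentStart, tags.start());
            }
            pos = tags.end();
        }
        return null; // ran out of input before the div was closed
    }

    public static void main(String[] args) {
        // In real use, rawHtml would be the response body obtained from HtmlUnit.
        String rawHtml = "<body><div id=\"my-id\"><p>Hi</p><div>nested</div></div></body>";
        System.out.println(innerHtmlOfDiv(rawHtml, "my-id"));
    }
}
```

Keep in mind that the raw text is what the server sent before any JavaScript ran, so it can differ from what div.asXml() shows for the same element.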

As a side note, you should probably think twice about why you need the HTML code. HtmlUnit lets you extract the data from the markup, so there shouldn't be any need to store the source code rather than the information contained in it. Just my 2 cents.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow