How to retrieve HTML page in proper encoding using Java?

https://stackoverflow.com/questions/1255730

12-09-2019

Question

How can I read an HTTP stream containing an HTML page using the page's own encoding?

Here is the code fragment I use to get the HTTP stream. InputStreamReader takes an optional encoding argument, but I have no idea how to obtain the encoding.

URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
BufferedReader d = new BufferedReader(new InputStreamReader(is));

Solution

Retrieving a web page is a reasonably complicated process, which is why libraries such as HttpClient exist. Unless you have a really compelling reason not to, my advice is to use HttpClient.

OTHER TIPS

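When the connection is established through

URLConnection conn = url.openConnection();

you can read the character set the server declared from the Content-Type header via conn.getContentType(), which returns a value such as "text/html; charset=UTF-8". Note that conn.getContentEncoding() returns the Content-Encoding header, which names a compression scheme such as gzip rather than a character set. Pass the charset name extracted from the Content-Type value to InputStreamReader, so the code looks like

BufferedReader d = new BufferedReader(new InputStreamReader(is, charsetName));

where charsetName is the String taken from the charset parameter of that header.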
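As a rough illustration of reading the encoding from the HTTP headers, the sketch below extracts the charset parameter from a Content-Type value such as the one returned by URLConnection.getContentType(). The class name ContentTypeCharset, the helper charsetOf, and the choice of a caller-supplied fallback are assumptions made for this example, not part of the original answers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, named for this example only.
class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, has no charset parameter, or
    // names a charset this JVM does not support.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The parameter value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With the question's variables, this could be used as new InputStreamReader(is, ContentTypeCharset.charsetOf(conn.getContentType(), StandardCharsets.UTF_8)). The UTF-8 fallback is only a guess for pages whose server sends no charset.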

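The short answer is to read the charset from the Content-Type header, available via URLConnection.getContentType(); URLConnection.getContentEncoding() reports the Content-Encoding header (for example gzip), not the character set. The right answer is what cletus suggests: use an appropriate third-party library unless you have a compelling reason not to.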

I had a very similar problem to solve recently. Like the other answers, I also started playing around with HttpClient et al. However, those libraries require that you know the encoding of the file you want to download up front. Otherwise, converting the retrieved HTML file yields unreadable characters.

This approach won't work when the encoding of the HTML file is specified only in the HTML file itself. Depending on the HTML version, the encoding can be specified in many different ways, such as an XML header or two different meta tag elements in the head. If you follow this approach, you would need to:

Download the file and parse its HTML content to figure out the encoding.
Download the file a second time, this time specifying the proper encoding.
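To make the two steps above concrete, here is a minimal sketch that avoids the second download by keeping the raw bytes in memory: it scans the start of the document for a meta charset declaration and then decodes the same bytes with the detected charset. The class MetaCharsetSniffer, its methods, the 1024-byte scan window, and the regular expression are all assumptions made for illustration. It deliberately ignores byte order marks and XML headers, so it is not a complete encoding detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class MetaCharsetSniffer {

    // Matches both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Step 1: inspect the first bytes of the page for a declared charset.
    static Charset sniff(byte[] body, Charset fallback) {
        int limit = Math.min(body.length, 1024);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // stays readable whatever the real encoding is.
        String head = new String(body, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: fall through to the fallback.
            }
        }
        return fallback;
    }

    // Step 2: decode the already downloaded bytes with the detected charset.
    static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page, StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

On Java 9 or later, the byte array could come from the question's stream via is.readAllBytes(). Even this small sketch has to guess at quoting and attribute order, which hints at why hand-rolled detection is fragile.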

Parsing HTML content for encoding declarations is especially error-prone. Instead, I suggest you rely on a library like JSoup, which will do the job for you. So instead of downloading the file via HttpClient, use JSoup to retrieve the file for you. In addition, JSoup provides a nice API to access different parts of the HTML page directly (e.g. the page title).

Licensed under: CC-BY-SA with attribution