Question

I've been searching online and trying to understand this. I am parsing some HTML files that are encoded in ISO-8859-1. Once parsed, I want all the output to be in the standard Java encoding (UTF-something).

Here is how I do this:

currentDocument = Jsoup.parse(new File("thing.htm"), "ISO-8859-1");
Element elt = currentDocument.getElementById("bim");
String title = elt.select("h1,h2,h3,h4,h5,h6").first().text();
System.out.println(title);

The string in the file is:

G18 Legemiddeløkonomi – pasientens venn eller fiende

The output is:

G18?Legemiddel?konomi ? pasientens venn eller fiende

I guess I'm doing something wrong somewhere, since I know this is possible with Jsoup; I just don't know what it is. By the way, I'm on Mac OS X. Can somebody help me?

Thx

Solution

OK, so after investigating further, and thanks to @Esailija, I found that my console output wasn't being encoded as UTF-8, which was solved by:

PrintStream stdout = new PrintStream(System.out, true, "UTF-8"); 
System.setOut(stdout);

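I also used: currentDocument.outputSettings().charset("UTF-8"); but I am not sure this is useful. As far as I can tell, that setting only affects how Jsoup serializes the document (for example via html()), not the String returned by text(), so it is probably not needed here.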
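
To see why the characters turned into "?", here is a minimal sketch that needs no Jsoup. It prints the same text through a PrintStream using two different charsets and captures the bytes. The class name ConsoleEncodingDemo and the helper printWith are made up for illustration, and US-ASCII is only an assumed stand-in for whatever default charset my console stream was actually using:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup or of the original code.
class ConsoleEncodingDemo {

    // Prints text through a PrintStream that encodes with the given charset
    // and returns the raw bytes that would have reached the console.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // \u00f8 is the o-with-stroke and \u2013 is the en dash from the title.
        String title = "Legemiddel\u00f8konomi \u2013 pasientens venn";

        // Assumption: US-ASCII stands in for a non-UTF-8 default charset.
        // Characters it cannot represent are replaced with '?'.
        byte[] ascii = printWith(title, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // With UTF-8 every character survives the round trip.
        byte[] utf8 = printWith(title, "UTF-8");
        System.out.println(title.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

So the String that Jsoup returned was fine all along; the characters were only lost when they were encoded on the way to the console.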

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow