Question

I am having trouble with special characters when the charset is iso-8859-1. The same code works fine with UTF-8, so I do not understand what I am doing wrong.

Here is the code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());

Here is the output:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jo&atilde;ozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulh&otilde;es</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">In&aacute;cio Loiola</span> 

Here is how it should be:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>

How can I fix it?


Solution

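Setting the escape mode to EscapeMode.xhtml makes jsoup escape only the core XML entities, so accented characters are printed as-is instead of as named entities such as &atilde;. Try this code: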

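  import org.jsoup.nodes.Entities.EscapeMode;  // EscapeMode is nested inside Entities

  File input = new File("/users/marcioapf/example.html");
  Document doc = Jsoup.parse(input, "iso-8859-1", "");
  // Escape only the core XML entities; accented characters are left unescaped
  doc.outputSettings().escapeMode(EscapeMode.xhtml);
  Elements elements = doc.select("span.DEPUTADO");
  System.out.println(elements.toString());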
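
As a side check that is not part of the original answer, the sketch below uses only the Java standard library to confirm that the accented characters from the expected output can be represented in ISO-8859-1, which is why they do not need to be written as entities for that charset. The class name Latin1Check and the method fitsLatin1 are made up for illustration, and the euro sign is added only as a counterexample.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: true when every character of text can be encoded in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        // A fresh encoder per call, since CharsetEncoder instances keep internal state.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String[] samples = {
            "Jo\u00e3ozinho Pereira",  // a with tilde
            "Isnaldo Bulh\u00f5es",    // o with tilde
            "In\u00e1cio Loiola",      // a with acute accent
            "\u20ac"                   // euro sign, not part of ISO-8859-1
        };
        for (String s : samples) {
            System.out.println(s + " -> " + fitsLatin1(s));
        }
    }
}
```

The three names report true and the euro sign reports false. Whether the characters then display correctly on the console still depends on the terminal's own encoding.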
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow