Java HtmlCleaner: Does not handle extended ascii characters

https://stackoverflow.com/questions/10622907

09-06-2021

Question

I'm using HtmlCleaner to clean an HTML file that contains characters such as '€' (ASCII decimal 128) and '™' (ASCII decimal 153), that is, characters from the extended ASCII table.

HtmlCleaner does not handle those characters and replaces each of them with '?' (ASCII decimal 63).

Is there any flag I can set in HtmlCleaner so that it processes those characters correctly?

Thanks in advance.

EDIT: The variable "encoding" is "iso-8859-1", just like the source file encoding.

    try {
        System.out.print("Parsing and cleaning:" + fileStr);
        URL url = new File(this.fileStr).toURI().toURL();
        // create an instance of HtmlCleaner
        HtmlCleaner cleaner = new HtmlCleaner();
        // default properties
        CleanerProperties props = cleaner.getProperties();
        // do parsing
        TagNode tagNode = new HtmlCleaner(props).clean(url);
        // serialize to XML file
        new PrettyXmlSerializer(props).writeToFile(tagNode, fileStr,
                encoding);
        System.out.println("Output: " + fileStr);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

I've just figured this out. The line:

    TagNode tagNode = new HtmlCleaner(props).clean(url);

should be replaced by:

    TagNode tagNode = new HtmlCleaner(props).clean(url, encoding);

where 'encoding' is the name (as a string) of the charset used by the content at the source URL.

Thank you!
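
To see why the charset argument matters, here is a minimal sketch that uses only the JDK, not HtmlCleaner itself. It assumes the parser fell back to UTF-8 when no charset was given (an assumption; the actual default depends on the HtmlCleaner version and configuration) and that the file stores '€' as the single byte 128, which is the windows-1252 mapping. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Decode raw bytes under an assumed source charset, then re-encode them
    // for output. Hypothetical helper, not part of HtmlCleaner.
    static byte[] transcode(byte[] raw, Charset assumedSource, Charset target) {
        String text = new String(raw, assumedSource);
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0x80}; // byte 128, as in the question
        Charset cp1252 = Charset.forName("windows-1252");

        // Wrong assumption: a lone 0x80 is not valid UTF-8, so it decodes to
        // U+FFFD, which ISO-8859-1 cannot represent and encodes as '?' (63).
        byte[] lost = transcode(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(lost[0]); // 63

        // Same charset on both sides: the byte survives unchanged.
        byte[] kept = transcode(euro, cp1252, cp1252);
        System.out.println(kept[0] & 0xFF); // 128

        // Decoded as windows-1252, the byte is the euro sign U+20AC.
        System.out.println(new String(euro, cp1252).equals("\u20AC")); // true
    }
}
```

Note that strict ISO-8859-1 assigns bytes 128 and 153 to control characters, so a file in which those bytes display as '€' and '™' is probably windows-1252 rather than ISO-8859-1.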

Solution

Did you try setting the charset?
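
As a rough illustration of what "setting the charset" means when reading a file, the sketch below names the charset explicitly instead of relying on a default. It uses only JDK classes; the class name, method names, and the choice of windows-1252 are assumptions for the example, not HtmlCleaner API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetRead {

    // Read a whole file as text, naming the charset instead of using a default.
    static String readText(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write bytes 128 and 153 (the values from the question) to a temporary
    // file and read them back as windows-1252.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".html");
            try {
                Files.write(tmp, new byte[] {(byte) 0x80, (byte) 0x99});
                return readText(tmp, Charset.forName("windows-1252"));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+20AC is the euro sign, U+2122 the trade mark sign.
        System.out.println(demo().equals("\u20AC\u2122")); // true
    }
}
```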

Licensed under: CC-BY-SA with attribution
