Question

Our website runs user input through HtmlTidy to clean it. In the process it also converts umlauts, which causes problems for our international subscribers. Is there an HtmlTidy option that prevents this?

I tried every available value for CharacterEncoding, but none of them seems to work.

Solution

Simply provide an output encoding (input encoding is optional) in the configuration file:

input-encoding: win1252
output-encoding: latin1

For an overview of available encodings, look at the output-encoding documentation.
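
If the site serves its pages as UTF-8, a variant along these lines may be the simplest fix. This is only a sketch: it assumes that both the submitted text and the rendered page really are UTF-8, which the question does not say. The `char-encoding` option sets the input and output encoding together, and since UTF-8 can represent umlauts directly, Tidy should have no reason to replace them with entities.

```
// Assumption: user input arrives as UTF-8 and the page is served as UTF-8.
// char-encoding applies to both input and output.
char-encoding: utf8
```

If the input and the page use different encodings, set `input-encoding` and `output-encoding` separately, as shown above.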

EDIT: So you're using the .NET bindings. The settings are the very same:

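// Needs System.IO for FileStream, plus the namespace of your HtmlTidy .NET wrapper.
using (FileStream input = new FileStream("in.html", FileMode.Open))
{
    Document d = new Document(input);

    // Use the encoding the input really has and the one the page is served in.
    d.InputCharacterEncoding = EncodingType.Utf8;
    d.OutputCharacterEncoding = EncodingType.Win1252;
    d.CleanAndRepair();

    d.Save("out.html");
}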

With the correct encodings set, you will get the correct result, without &uuml; and the like.

Licensed under: CC-BY-SA with attribution