Reconstructing Windows-1252 characters from data incorrectly saved as UTF-8

https://stackoverflow.com/questions/8613829

28-03-2021
|

Question

I'm dealing with data that has been sampled using Java HtmlUnit. The webpage used Windows-1252 encoding but the response was retrieved as if the page was encoded as UTF-8 (ie when getContentAsString on the HtmlUnit WebResponse object was invoked, UTF-8 encoding was specified rather than deferring to the encoding specified in the server response). Is there any way to reverse this process to reconstruct the original Windows-1252 data from the incorrectly labelled UTF-8 character data?

Most other questions on this topic are concerned with identifying the type of file or converting from one stream type to another for characters correctly encoded in the first place. That is not the case here. I don't believe utilities such as iconv will work because they expect the streams to have been correctly persisted in their source encoding to begin with.

Solution

Probably not. If Windows-1252-encoded text gets mistaken for UTF-8, all non-ASCII codepoints would be damaged, because of the way UTF-8 deals with those codepoints. Only if you are very very lucky, and all non-ASCII codepoints come in pairs or triplets that, by pure chance, convert to real Unicode codepoints, you can reverse the process.

But you're pretty much out of luck.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow