Question

We have a CMS which has several thousand text/html files in it. It turns out that users have been uploading text/html files using various character encodings (UTF-8, UTF-8 with BOM, windows-1252, ISO-8859-1).

When these files are read in and written to the response, our CMS's framework forces charset=UTF-8 on the response's Content-Type header.

Because of this, any non-UTF-8 content is displayed to the user with mangled characters (question marks, black diamonds, etc.) when there isn't a correct character translation from the "native" encoding to UTF-8. There is also no metadata attached to these documents that indicates their charset. As far as I know, the only way to tell what charset they are is to open them in a text-rendering app (Firefox, Notepad++, etc.) and eyeball the content to see whether it looks right.

Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that's above my head.

Thoughts on how to best approach the problem?

Thanks

Solution

You can use ICU4J's CharsetDetector.
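
A minimal sketch of how that might look, assuming the ICU4J jar (com.ibm.icu.text) is on the classpath; the confidence cutoff of 50 is an arbitrary assumption you would tune against a sample of your own files:

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DetectAndConvert {

    // Re-encode one file to UTF-8 using ICU4J's charset detection.
    static void toUtf8(Path source, Path target) throws IOException {
        byte[] raw = Files.readAllBytes(source);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect(); // best guess; null if nothing matched

        if (match == null || match.getConfidence() < 50) {
            throw new IOException("Could not reliably detect charset for " + source);
        }

        // getString() decodes the original bytes with the detected charset.
        String text = match.getString();
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

CharsetMatch also exposes getName() and getConfidence(), and detectAll() returns every candidate ranked by confidence, which is handy for routing low-confidence files to manual review instead of converting them blindly.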

OTHER TIPS

Try to decode it as UTF-8. If that fails, look for \x92 and decode as CP1252 if it is present; otherwise, decode as Latin-1.
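
A rough sketch of that fallback chain (strict decoding is what makes the UTF-8 attempt fail instead of silently substituting U+FFFD; "windows-1252" is not a charset the JDK is required to ship, but it is present on standard JREs):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {

    // Decode as UTF-8 if the bytes are valid UTF-8; otherwise fall back to
    // windows-1252 when the CP1252 "right single quote" byte 0x92 is present,
    // and to ISO-8859-1 as a last resort.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail loudly
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // instead of replacing
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            Charset fallback = containsByte(raw, (byte) 0x92)
                    ? Charset.forName("windows-1252")
                    : StandardCharsets.ISO_8859_1;
            return new String(raw, fallback);
        }
    }

    private static boolean containsByte(byte[] raw, byte wanted) {
        for (byte b : raw) {
            if (b == wanted) {
                return true;
            }
        }
        return false;
    }
}
```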

In general, there is no way to tell. The byte sequence 63 61 66 C3 A9 is equally valid as "cafÃ©" in windows-1252, "caf├⌐" in IBM437, or "café" in UTF-8. The last is statistically the most likely, though.
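
That ambiguity is easy to demonstrate; the same five bytes decode without error under each of these charsets (a small illustration, omitting IBM437 since its availability depends on the JVM's installed charsets):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AmbiguityDemo {
    public static void main(String[] args) {
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

        // All of these succeed; only a human (or a statistical model)
        // can say which result was intended.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));          // café
        System.out.println(new String(bytes, Charset.forName("windows-1252"))); // cafÃ©
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));     // cafÃ©
    }
}
```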

If you don't want to deal with statistical methods, an approach that works much of the time is to assume that anything that looks like UTF-8 is, and that anything else is in windows-1252.

Or if UTF-16 is a possibility, look for FE FF or FF FE at the beginning of the file.
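
Checking for a byte-order mark is just a comparison of the leading bytes; a quick sketch, which also covers the UTF-8 BOM since the question mentions "UTF-8 with BOM" uploads:

```java
import java.util.Optional;

public class BomSniffer {

    // Returns the charset implied by a byte-order mark, if one is present.
    static Optional<String> charsetFromBom(byte[] raw) {
        if (raw.length >= 2 && raw[0] == (byte) 0xFE && raw[1] == (byte) 0xFF) {
            return Optional.of("UTF-16BE");
        }
        if (raw.length >= 2 && raw[0] == (byte) 0xFF && raw[1] == (byte) 0xFE) {
            return Optional.of("UTF-16LE");
        }
        if (raw.length >= 3 && raw[0] == (byte) 0xEF && raw[1] == (byte) 0xBB && raw[2] == (byte) 0xBF) {
            return Optional.of("UTF-8");
        }
        return Optional.empty();
    }
}
```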

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow