Question

I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding that comes defined with the data (if any).

For a while I have used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I recently ran into problems: I noticed that it doesn't support Scandinavian encodings (for example ISO-8859-1). On top of that, it takes a huge amount of time/CPU/memory to get a result: roughly 40 seconds for a 2 MB text file.
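
Roughly what I'm doing looks like this (a minimal sketch rather than my actual code; the file name is just a placeholder):

import chardet

# Read the whole file and let chardet guess the encoding.
with open("name.txt", "rb") as f:
    raw = f.read()

result = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, ...}
print(result["encoding"], result["confidence"])

# Decode with the guess and re-encode as UTF-8.
text = raw.decode(result["encoding"] or "latin-1", errors="replace")
utf8_bytes = text.encode("utf-8")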

I tried just using the standard Linux file command:

file -bi name.txt

With all my files so far it has given me a 100% correct result, in about 0.1 seconds for a 2 MB file, and it supports Scandinavian character encodings as well.
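
For completeness, this is roughly how I call it from Python (a sketch; the parsing assumes output of the form "text/plain; charset=iso-8859-1"):

import subprocess

def detect_with_file(path):
    # Run `file -bi` and pull the charset out of output such as
    # "text/plain; charset=iso-8859-1".
    out = subprocess.check_output(["file", "-bi", path]).decode("ascii", "replace")
    for part in out.strip().split(";"):
        part = part.strip()
        if part.startswith("charset="):
            return part[len("charset="):]
    return None

encoding = detect_with_file("name.txt")
with open("name.txt", "rb") as f:
    text = f.read().decode(encoding or "utf-8", errors="replace")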

So, I guess the advantages of using file are clear. What are the downsides? Am I missing something?


Solution

Old MS-DOS and Windows-formatted files can be detected as unknown-8bit instead of ISO-8859-X, because their encodings are not completely standard. Chardet, by contrast, will make an educated guess and report a confidence value.

http://www.faqs.org/faqs/internationalization/iso-8859-1-charset/

If you won't be handling old, exotic, out-of-standard text files, I think you can use file -i without many problems.
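
If you do need a safety net for those cases, one option is to try file first and fall back to chardet only when file reports unknown-8bit or binary. This is just a sketch: the 0.5 confidence threshold and the ISO-8859-1 default are arbitrary choices on my part, not anything the tools mandate.

import subprocess
import chardet

def guess_encoding(path):
    # Ask `file -bi` first; it is fast and usually good enough.
    out = subprocess.check_output(["file", "-bi", path]).decode("ascii", "replace")
    charset = None
    for part in out.strip().split(";"):
        part = part.strip()
        if part.startswith("charset="):
            charset = part[len("charset="):]
    if charset and charset not in ("unknown-8bit", "binary"):
        return charset
    # Otherwise fall back to chardet's educated guess.
    with open(path, "rb") as f:
        guess = chardet.detect(f.read())
    if guess["encoding"] and guess["confidence"] > 0.5:  # arbitrary threshold
        return guess["encoding"]
    return "iso-8859-1"  # arbitrary default for stubborn legacy files

print(guess_encoding("name.txt"))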

OTHER TIPS

I have found "chared" (http://code.google.com/p/chared/) to be pretty accurate. You can even train new encoding detectors for languages that are not yet supported.

It might be a good alternative when chardet starts acting up.
