Question

I have a collection of files encoded in ANSI or UTF-16LE. I would like Python to open the files using the correct encoding. The problem is that the ANSI files do not raise any sort of exception when decoded as UTF-16LE, and vice versa.

Is there a straightforward way to open up the files using the correct file encoding?

Solution

Use the chardet library to detect the encoding.
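A minimal sketch of that approach (the file name is a placeholder; chardet.detect can return None for the encoding if it has no guess):

import chardet

with open("mystery.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"])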

OTHER TIPS

You can check for the BOM at the beginning of the file to tell whether it's UTF.

Then decode accordingly (using one of the standard encodings).
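A minimal sketch of that check, assuming the only two possibilities are windows-1252 and UTF-16LE with a BOM (the file name is a placeholder):

import codecs

with open("mystery.txt", "rb") as f:
    raw = f.read()

if raw.startswith(codecs.BOM_UTF16_LE):
    # the "utf-16" codec consumes the BOM automatically; "utf-16-le" would keep it
    text = raw.decode("utf-16")
else:
    text = raw.decode("cp1252")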

EDIT: Or, maybe, try s.decode('ascii') on your string (given s is the variable name). If it throws UnicodeDecodeError, then decode it as 'utf_16_le'.
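As a sketch (note this only separates the two cases cleanly if the ANSI files happen to be pure ASCII; a windows-1252 byte above 0x7F would also fail the ASCII decode):

try:
    text = raw.decode("ascii")
except UnicodeDecodeError:
    text = raw.decode("utf_16_le")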

What's in the files? If it's plain text in a Latin-based alphabet, almost every other byte in the UTF-16LE files will be zero. In the windows-1252 files, on the other hand, I wouldn't expect to see any zeros at all. For example, here's “Hello” in windows-1252:

93 48 65 6C 6C 6F 94

...and in UTF-16LE:

1C 20 48 00 65 00 6C 00 6C 00 6F 00 1D 20

Aside from the curly quotes, each character maps to the same value, with the addition of a trailing zero byte. In fact, that's true for every character in the ISO-8859-1 character set (windows-1252 extends ISO-8859-1 to add mappings for several printing characters—like curly quotes—to replace the control characters in the range 0x80..0x9F).
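You can verify that property directly; this snippet assumes the text is representable in ISO-8859-1:

s = "Héllo"                      # every character fits in ISO-8859-1
latin1 = s.encode("latin-1")     # 48 E9 6C 6C 6F
utf16le = s.encode("utf-16-le")  # 48 00 E9 00 6C 00 6C 00 6F 00
assert utf16le == b"".join(bytes([b, 0]) for b in latin1)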

If you know all the files are either windows-1252 or UTF-16LE, a quick scan for zeroes should be all you need to figure out which is which. There's a good reason why chardet is so slow and complex, but in this case I think you can get away with quick and dirty.
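A quick-and-dirty sketch of that scan (the function and file names are hypothetical):

def sniff_encoding(raw: bytes) -> str:
    # UTF-16LE text in a Latin-based alphabet is full of zero bytes;
    # a windows-1252 file should contain none.
    return "utf-16-le" if b"\x00" in raw else "cp1252"

with open("mystery.txt", "rb") as f:
    raw = f.read()
text = raw.decode(sniff_encoding(raw))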

Licensed under: CC-BY-SA with attribution