Question

What should I use to read text files whose encoding I don't know (ASCII or Unicode)?

Is there some class that auto-detects the encoding?

Solution

I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read as ISO-8859-15, because ASCII is a subset of it. Worse, some files are valid in two different encodings and mean different things in each. So you need to get this information by some other means. In many cases it is a good approach to simply assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE environment variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
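The "assume UTF-8, otherwise treat the bytes leniently" approach can be sketched as follows. The helper name `decode_lenient` is illustrative, not a standard API; the ISO-8859-15 fallback is an assumption that fits the answer's example, since every byte sequence is decodable in that encoding:

```python
def decode_lenient(data: bytes) -> str:
    """Assume UTF-8 first; fall back to ISO-8859-15, in which
    every byte sequence is decodable (possibly as mojibake)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-15")
```

If you truly do not process the content, skip the decoding entirely and use `open(path, "rb")`.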

OTHER TIPS

This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used as a guess, however: read the first "page" (512 bytes or so), then, in the following order:

  1. See if the block starts with a BOM in one of the Unicode formats
  2. Look at the first four bytes. If they contain `'\0'`, you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:
     • `'\0'`, other, `'\0'`, other → UTF-16BE
     • other, `'\0'`, other, `'\0'` → UTF-16LE
     • `'\0'`, `'\0'`, `'\0'`, other → UTF-32BE
     • other, `'\0'`, `'\0'`, `'\0'` → UTF-32LE
  3. Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.
  4. Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point, you use the previous heuristic).
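The heuristics above can be sketched like this (the name `guess_encoding` and the exact return labels are illustrative; as the answer says, the result is a guess, not a guarantee):

```python
import codecs

def guess_encoding(block: bytes) -> str:
    """Guess the encoding of the first block of a file using the
    BOM, null-byte, and top-bit heuristics described above."""
    # 1. Byte order marks. Longer marks first: the UTF-16-LE BOM
    #    is a prefix of the UTF-32-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if block.startswith(bom):
            return name
    # 2. Null-byte patterns in the first four bytes.
    head = block[:4]
    if len(head) == 4:
        nulls = tuple(b == 0 for b in head)
        if nulls == (True, False, True, False):
            return "utf-16-be"
        if nulls == (False, True, False, True):
            return "utf-16-le"
        if nulls == (True, True, True, False):
            return "utf-32-be"
        if nulls == (False, True, True, True):
            return "utf-32-le"
    # 3. Bytes with the top bit set: legal UTF-8, or assume ISO 8859-1.
    #    (A real implementation should tolerate a multi-byte character
    #    truncated at the end of the block.)
    if any(b & 0x80 for b in block):
        try:
            block.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "iso-8859-1"
    # 4. Pure seven-bit content: assume ASCII.
    return "ascii"
```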

But as I said, it's not 100% sure.


One (brute-force) way of doing this can be:

  • Build a list of candidate encodings (only ISO codepages and Unicode)
  • Iterate over all considered encodings
  • Decode the bytes using this encoding
  • Encode the result back from Unicode
  • Compare the round-tripped bytes with the original, checking for errors
  • If there are no errors, remember the candidate that produced the fewest characters
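The steps above can be sketched as follows. The candidate list and the function name are assumptions for illustration; extend the list with whatever codepages you expect to see. Since all successful candidates reproduce the same bytes, the tie-breaker here is the fewest decoded characters (a wrong multi-byte guess tends to inflate the text):

```python
CANDIDATES = ["utf-8", "utf-16", "utf-32", "iso-8859-1", "iso-8859-15"]

def brute_force_guess(data: bytes):
    """Try each candidate encoding; keep the one that round-trips
    the bytes exactly and yields the fewest code points."""
    best = None
    for name in CANDIDATES:
        try:
            text = data.decode(name)      # decode with the candidate
        except (UnicodeDecodeError, UnicodeError):
            continue                      # errors: candidate ruled out
        if text.encode(name) != data:     # round trip must reproduce
            continue                      # the original bytes exactly
        if best is None or len(text) < len(best[1]):
            best = (name, text)
    return best[0] if best is not None else None
```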

Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

If you are sure that your incoming encoding is ANSI or Unicode, you can also check for a byte order mark (BOM). But be aware that this is not foolproof: a BOM is optional, and some non-UTF files happen to start with the same bytes.
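For reference, a minimal BOM check might look like this (the name `sniff_bom` is illustrative; the BOM byte values come from the `codecs` module and are standard):

```python
import codecs

# Longer marks must be tested first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
BOMS = [
    ("utf-8", codecs.BOM_UTF8),          # EF BB BF
    ("utf-32-be", codecs.BOM_UTF32_BE),  # 00 00 FE FF
    ("utf-32-le", codecs.BOM_UTF32_LE),  # FF FE 00 00
    ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
    ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None."""
    for name, bom in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM present -- which is common, hence not foolproof
```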

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow