How to detect the character encoding of a text file?

https://stackoverflow.com/questions/4520184

12-10-2019
|

Question

I try to detect which character encoding is used in my file.

I try with this code to get the standard encoding

public static Encoding GetFileEncoding(string srcFile)
    {
      // *** Use Default of Encoding.Default (Ansi CodePage)
      Encoding enc = Encoding.Default;

      // *** Detect byte order mark if any - otherwise assume default
      byte[] buffer = new byte[5];
      FileStream file = new FileStream(srcFile, FileMode.Open);
      file.Read(buffer, 0, 5);
      file.Close();

      if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
      else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
      else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
      else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
      else if (buffer[0] == 0xFE && buffer[1] == 0xFF)      
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);      
      else if (buffer[0] == 0xFF && buffer[1] == 0xFE)      
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);


      return enc;
    }

My five first byte are 60, 118, 56, 46 and 49.

Is there a chart that shows which encoding matches those five first bytes?

Solution

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires a “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

OTHER TIPS

If you want to pursue a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does the BOM detection automatically first, and then tries to differentiate between Unicode encodings without BOM, vs some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .Net).

As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, the best is to provide some form of interactivity with the user if at all possible, because there simply is no 100% detection rate possible!

Use StreamReader and direct it to detect the encoding for you:

using (var reader = new System.IO.StreamReader(path, true))
{
    var currentEncoding = reader.CurrentEncoding;
}

And use Code Page Identifiers https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx in order to switch logic depending on it.

Several answers are here but nobody has posted usefull code.

Here is my code that detects all encodings that Microsoft detects in Framework 4 in the StreamReader class.

Obviously you must call this function immediately after opening the stream before reading anything else from the stream because the BOM are the first bytes in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek you must write a more complicated code that returns a Byte buffer with the bytes that have already been read but that are not BOM.

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}

Yes, there is one here: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding.

You should read this: How can I detect the encoding/codepage of a text file

If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without BOM) or any of the single byte encodings like ASCII, ANSI, ISO-8859-1 etc.

I use Ude that is a C# port of Mozilla Universal Charset Detector. It is easy to use and gives some really good results.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow