Question

Possible Duplicate:
Detect file encoding in PHP

How can I determine a file's encoding with PHP?

Solution

Detecting the encoding is really hard for every 8-bit character set except UTF-8 (which is detectable because not every byte sequence is valid UTF-8), and it usually requires semantic knowledge of the text whose encoding is to be detected.

Think about it: a plain text file is just a sequence of bytes with no encoding information attached. Any single byte could mean anything, so to have a chance at detecting the encoding you have to look at each byte in the context of the bytes around it and apply heuristics based on the languages the text is likely to be written in.

For 8-bit character sets, though, you can never be sure.

For an example of such heuristics going wrong, see:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

With some 16-bit encodings you have a better chance, because the text may start with a byte order mark or, for mostly ASCII text, have every second byte set to 0.

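If you just want to detect UTF-8, you can either use mb_detect_encoding (covered in another answer), or you can use this handy little function. Note that it reports whether the string contains at least one well-formed multi-byte UTF-8 sequence; it does not validate the whole string, and it returns false for pure ASCII:

function isUTF8($string){
    // The x modifier lets the pattern span several lines and carry # comments.
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]                  # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]             # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]             # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}          # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}              # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}          # plane 16
    )+%xs', $string) === 1;
}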
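
The pattern-based function above only looks for a multi-byte sequence somewhere in the input. If what you need is whole-string validation, a minimal sketch could rely on mb_check_encoding instead. This assumes the mbstring extension is loaded; the name isStrictUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: true only if the entire string is well-formed UTF-8.
function isStrictUtf8(string $bytes): bool
{
    // mb_check_encoding() checks the whole byte string against the named
    // encoding, so a single invalid byte anywhere makes it return false.
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(isStrictUtf8("caf\xC3\xA9")); // "café" encoded as UTF-8
var_dump(isStrictUtf8("caf\xE9"));     // "café" encoded as ISO-8859-1
var_dump(isStrictUtf8("plain ascii")); // ASCII is a subset of UTF-8
```

Unlike the regular expression, this treats pure ASCII as valid UTF-8, which may or may not be what you want.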

OTHER TIPS

mb_detect_encoding should be able to do the job.

http://us.php.net/manual/en/function.mb-detect-encoding.php

In its default setup, it'll only detect ASCII, UTF-8, and a few Japanese JIS variants. It can be configured to detect more encodings if you specify them manually. If the content is valid both as ASCII and as UTF-8, which of the two is reported depends on the order of the candidate list.
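
Specifying the candidates manually might look like the sketch below. It assumes the mbstring extension is loaded; guessEncoding and the candidate list are illustrative choices, not a recommendation.

```php
<?php
// Hypothetical helper: guess among a caller-supplied list of encodings.
function guessEncoding(string $bytes, array $candidates): ?string
{
    // The third argument enables strict mode: when the bytes are not valid
    // in any candidate, false is returned instead of the closest match.
    $result = mb_detect_encoding($bytes, $candidates, true);
    return $result === false ? null : $result;
}

$candidates = ['UTF-8', 'ISO-8859-1'];
echo guessEncoding("caf\xC3\xA9", $candidates), "\n"; // valid UTF-8 input
echo guessEncoding("caf\xE9", $candidates), "\n";     // invalid as UTF-8, so only ISO-8859-1 remains
```

Keep in mind that almost any byte string is valid in a single-byte encoding such as ISO-8859-1, so putting one in the list turns it into a catch-all rather than a real detection.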

You can't really, unless the file is kind enough to tell you somewhere inside it.

For example, HTML files are meant to contain a content-type meta tag near the top, so that the web browser knows which encoding is used, e.g.:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

or

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

There are methods that try to guess by looking at the file and spotting byte sequences that suggest certain encodings, but these are really only guessing.
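
When the file does declare its encoding, reading the declaration is straightforward. The following is a rough sketch for the HTML case; declaredHtmlCharset is a hypothetical name, and the pattern is deliberately loose rather than a real HTML parser.

```php
<?php
// Hypothetical helper: return the charset named in a meta tag, or null.
function declaredHtmlCharset(string $html): ?string
{
    // Matches the http-equiv form shown above as well as the shorter
    // <meta charset="..."> form used in HTML5 documents.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $html, $matches) === 1) {
        return strtolower($matches[1]);
    }
    return null;
}

$html = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />';
echo declaredHtmlCharset($html), "\n";
```

The declaration is only a claim made by the author of the file, so it can still be wrong.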

You can use the fread() function to look at the first few bytes of the file for the "magic number", and then map that magic number against a list of known magic numbers for file types.
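
For text encodings, the closest thing to a magic number is a byte order mark. A minimal sketch of the fread() approach, assuming you only care about the Unicode byte order marks, could look like this (detectBom is a hypothetical name):

```php
<?php
// Hypothetical helper: report the encoding implied by a byte order mark,
// or null when the file cannot be read or does not start with one.
function detectBom(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $head = (string) fread($handle, 4); // the longest mark is 4 bytes
    fclose($handle);

    // The 4-byte UTF-32 marks come first because the UTF-32LE mark
    // begins with the same two bytes as the UTF-16LE mark.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

Many files, UTF-8 ones in particular, carry no byte order mark at all, so a null result says nothing about the encoding.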

BlackAura's suggestion is very good, IMHO.

Another option is to call file(1) on the file in question using system() or the like. Often, it is able to tell you the encoding as well. It should be available in any sane UNIX environment.
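
A sketch of the file(1) approach follows. It assumes a file implementation that supports the --mime-encoding option and that exec() is not disabled in your PHP configuration; fileCommandEncoding is a hypothetical name.

```php
<?php
// Hypothetical helper: ask file(1) for its encoding guess, or null on failure.
function fileCommandEncoding(string $path): ?string
{
    $output = [];
    $status = 1;
    // -b omits the file name from the output; --mime-encoding limits the
    // output to the charset guess. escapeshellarg() guards the path.
    $command = 'file -b --mime-encoding ' . escapeshellarg($path) . ' 2>/dev/null';
    exec($command, $output, $status);
    if ($status !== 0 || !isset($output[0])) {
        return null;
    }
    return trim($output[0]);
}
```

The result is still a heuristic guess by file(1), so treat it as a hint rather than a fact.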

Licensed under: CC-BY-SA with attribution