Question

How would you tell an archive (jar/rar/etc.) file from a textual (xml/txt, encoding-independent) one?

Solution

There's no guaranteed way, but here are a couple of possibilities:

1) Look for a header on the file. Unfortunately, headers are file-specific, so while you might be able to find out that it's a RAR file, you won't get the more generic answer of whether it's text or binary.

2) Count character bytes versus non-character bytes. Text files consist mostly of letters, digits, punctuation and whitespace, while binary files - especially compressed ones like rar, zip, and such - tend to have all byte values more evenly represented.

3) Look for a regularly repeating pattern of newlines.
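
As a rough illustration of possibilities 1 and 2, here is a minimal sketch. The class and method names are made up for this example. The three signatures are the usual leading bytes of ZIP/JAR, RAR and gzip files; the entropy function measures how evenly the byte values are spread (0 to 8 bits per byte). Where to put the cut-off between "text" and "compressed" is an assumption you would have to tune on your own files.

```java
import java.nio.charset.StandardCharsets;

class SniffSketch {
    // Leading signatures ("magic numbers") of a few common archive formats.
    private static final byte[][] ARCHIVE_MAGIC = {
        {0x50, 0x4B, 0x03, 0x04},             // ZIP, and therefore JAR ("PK" 0x03 0x04)
        {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07}, // RAR ("Rar!" 0x1A 0x07)
        {0x1F, (byte) 0x8B},                  // gzip
    };

    /** Possibility 1: does the sample start with one of the known signatures? */
    static boolean hasArchiveHeader(byte[] sample) {
        for (byte[] magic : ARCHIVE_MAGIC) {
            if (startsWith(sample, magic)) return true;
        }
        return false;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Possibility 2: Shannon entropy of the byte histogram, in bits per byte. */
    static double entropy(byte[] sample) {
        if (sample.length == 0) return 0.0;
        int[] counts = new int[256];
        for (byte b : sample) counts[b & 0xFF]++; // mask: Java bytes are signed
        double bits = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / sample.length;
            bits -= p * Math.log(p) / Math.log(2);
        }
        return bits;
    }

    public static void main(String[] args) {
        byte[] text = "The quick brown fox jumps over the lazy dog.\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println("text has archive header: " + hasArchiveHeader(text));
        System.out.println("zip has archive header:  " + hasArchiveHeader(zipStart));
        System.out.println("text entropy (bits/byte): " + entropy(text));
    }
}
```

Keep in mind that the entropy of a short sample is capped by its length (n bytes cannot exceed log2(n) bits per byte), so compare samples of similar size.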

OTHER TIPS

Run file -bi {filename}. If the output starts with 'text/', the file is text; otherwise it's binary. ;-)
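
If you need that check from Java, something along these lines might do. It is only a sketch: the class and method names are invented here, and it assumes a `file` binary that accepts `-bi` is on the PATH (which won't be true on every platform).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class FileCommandSketch {
    /** True if a line of `file -bi` output, e.g. "text/plain; charset=us-ascii", names a text type. */
    static boolean isTextMime(String output) {
        return output != null && output.trim().startsWith("text/");
    }

    /** Runs `file -bi <path>` and inspects the first line of its output. */
    static boolean isTextFile(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("file", "-bi", path)
                .redirectErrorStream(true)
                .start();
        String firstLine;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = r.readLine(); // null if the command printed nothing
        }
        p.waitFor();
        return isTextMime(firstLine);
    }
}
```

Some textual formats may be reported under an application/ type rather than text/, so you may want to whitelist a few of those as well.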

I wrote this one. It's a bit simpler, but for Latin-script languages it should work fine once you adjust the ratio.

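// needs java.io.File, java.io.FileInputStream, java.io.FileNotFoundException, java.io.IOException

/**
 *  Guess whether the given file is binary by sampling its first 1024 bytes.
 *  Any byte under 0x09 (including NUL) means binary; otherwise the share of
 *  non-ASCII bytes decides.
 */
public static boolean isBinaryFile(File f) throws FileNotFoundException, IOException {
    byte[] data = new byte[1024];
    int read;
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream in = new FileInputStream(f)) {
        read = in.read(data);   // bytes actually read; -1 for an empty file
    }

    int ascii = 0;
    int other = 0;

    for(int i = 0; i < read; i++) {
        // Mask to 0..255: Java bytes are signed, so without this every byte
        // >= 0x80 is negative and would wrongly trip the "< 0x09" check.
        int b = data[i] & 0xFF;
        if( b < 0x09 ) return true;

        if( b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D ) ascii++;
        else if( b >= 0x20  &&  b <= 0x7E ) ascii++;
        else other++;
    }

    if( other == 0 ) return false;

    // Binary only if more than 95% of the sampled bytes are non-ASCII; adjust the ratio as needed.
    return 100 * other / (ascii + other) > 95;
}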

Have a look at the jMimeMagic library.

jMimeMagic is a Java library for determining the MIME type of files or streams.

With Java 7 or later you can use Files.probeContentType, which returns null when the type cannot be determined: http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)

boolean isBinaryFile(File f) throws IOException {
    String type = Files.probeContentType(f.toPath());
    if (type == null) {
        // type couldn't be determined, assume binary
        return true;
    } else if (type.startsWith("text")) {
        return false;
    } else {
        // type isn't text
        return true;
    }
}

I used this code and it works for English and German text pretty well:

private boolean isTextFile(String filePath) throws Exception {
    File f = new File(filePath);
    if(!f.exists())
        return false;
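    byte[] data = new byte[1000];   // sample at most the first 1000 bytes
    int n;
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream in = new FileInputStream(f)) {
        n = in.read(data);   // bytes actually read; -1 for an empty file
    }
    // ISO-8859-1 maps every byte to exactly one char, so decoding never fails
    String s = new String(data, 0, Math.max(n, 0), "ISO-8859-1");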
    String s2 = s.replaceAll(
            "[a-zA-Z0-9ßöäü\\.\\*!\"§\\$\\%&/()=\\?@~'#:,;\\"+
            "+><\\|\\[\\]\\{\\}\\^°²³\\\\ \\n\\r\\t_\\-`´âêîô"+
            "ÂÊÔÎáéíóàèìòÁÉÍÓÀÈÌÒ©‰¢£¥€±¿»«¼½¾™ª]", "");
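    // s2 now holds only the characters that are NOT in the text set above

    // an empty sample would make the ratio 0/0 (NaN), so decide explicitly
    if (s.isEmpty())
        return false;

    double d = (double)(s.length() - s2.length()) / (double)(s.length());
    // fraction of sampled characters that count as text
    return d > 0.95;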
}

If the file consists of the bytes 0x09 (tab), 0x0A (line feed), 0x0C (form feed), 0x0D (carriage return), or 0x20 through 0x7E, then it's probably ASCII text.

If the file contains any other ASCII control character, 0x00 through 0x1F excluding the four above, then it's probably binary data.

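UTF-8 text follows a very specific pattern for any bytes with the high-order bit set (a lead byte followed by one to three continuation bytes in the range 0x80 through 0xBF), but single-byte encodings like ISO-8859-1 do not. UTF-16 text frequently contains the null byte (0x00), but for mostly-ASCII content only in every other position.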

You'd need a weaker heuristic for anything else.
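
The three checks described above could be sketched like this. The class and method names are invented for the example; the UTF-8 check simply leans on the JDK's strict decoder rather than hand-parsing the byte pattern.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingSniffSketch {
    /** Only tab, LF, FF, CR and the printable range 0x20 through 0x7E. */
    static boolean isAsciiText(byte[] sample) {
        for (byte raw : sample) {
            int b = raw & 0xFF; // mask: Java bytes are signed
            boolean whitespace = b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D;
            boolean printable = b >= 0x20 && b <= 0x7E;
            if (!whitespace && !printable) return false;
        }
        return true;
    }

    /** Strict UTF-8 decode: any malformed byte sequence makes it fail. */
    static boolean isValidUtf8(byte[] sample) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** UTF-16 hint: NUL bytes occur, but only at even offsets or only at odd offsets. */
    static boolean hasUtf16NulPattern(byte[] sample) {
        int even = 0, odd = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) even++; else odd++;
            }
        }
        return (even > 0) != (odd > 0);
    }
}
```

Two caveats: control characters such as NUL are technically valid UTF-8, so isValidUtf8 alone does not rule out binary data, and a sample cut off in the middle of a multi-byte character will be rejected even if the file as a whole is fine.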

Just to let you know, I've chosen quite a different path. In my case, there are only 2 types of files, and chances are high that any given file is a binary one. So:

  1. presume that file is binary, try doing what's supposed to be done (e.g. deserialize)
  2. catch exception
  3. treat file as textual
  4. if that fails, something is wrong with file itself
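
In code, the four steps might look roughly like the sketch below. It assumes the binary files are Java-serialized objects and the textual ones are UTF-8; both are assumptions for the example, as are the class and method names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BinaryFirstLoaderSketch {
    /**
     * Returns the deserialized object for a binary file, or the list of
     * lines for a text file. Throws IOException if neither attempt works.
     */
    static Object load(Path file) throws IOException {
        // Step 1: presume the file is binary and try to deserialize it.
        try (InputStream in = Files.newInputStream(file);
             ObjectInputStream objects = new ObjectInputStream(in)) {
            return objects.readObject();
        } catch (IOException | ClassNotFoundException | RuntimeException binaryFailure) {
            // Steps 2 and 3: the binary attempt failed, so treat the file as text.
            // Step 4: readAllLines throws on unreadable files or malformed UTF-8,
            // which is the signal that something is wrong with the file itself.
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        }
    }
}
```

Only do this with files you trust: deserializing data from an untrusted source is a known security risk.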

You could try the DROID tool.

Licensed under: CC-BY-SA with attribution