Question

Is there a native method in R to test if a file on disk is an ASCII text file, or a binary file? Similar to the file command in Linux, but a method that will work cross platform?

The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.


Solution

If all you care about is whether the file is ASCII or binary...

Well, first up: definitions. All files are binary at some level:

is.binary <- function(file){
  if(system.type() != "quantum computer"){
    return(TRUE) 
  }else{
    return(cat=alive&dead)
  }
}

ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.

So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.

Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.

Nowadays when people ask if a file is binary, what they are normally asking is "Does any byte in this file have its highest bit set?". You can check that in R by reading the file through a raw connection as unsigned integers and testing the largest value. Something like:

is.binary <- function(filepath, max = 1000){
  f <- file(filepath, "rb", raw = TRUE)
  on.exit(close(f))            # make sure the connection is closed
  b <- readBin(f, "int", n = max, size = 1, signed = FALSE)
  max(b) > 127                 # any byte >= 128 has its high bit set
}

This will by default test at most the first 1000 bytes of the file. I think the file command does something similar.

You may want to refine the test to allow only printable character codes, plus whitespace such as tab, line feed, and carriage return, and any other codes you consider plausible in your non-binary files...
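As a sketch of that refinement (the set of "plausible" bytes below is an assumption; adjust it to taste):

```r
# Sketch: treat a file as text only if every sampled byte is printable
# ASCII (32-126) or common whitespace (tab, LF, CR). The whitelist is
# an assumption -- widen or narrow it for your own data.
is.binary <- function(filepath, max = 1000){
  f <- file(filepath, "rb", raw = TRUE)
  on.exit(close(f))
  b <- readBin(f, "int", n = max, size = 1, signed = FALSE)
  plausible <- c(9L, 10L, 13L, 32L:126L)  # tab, LF, CR, printable ASCII
  !all(b %in% plausible)
}
```

This flags NUL bytes and other control characters as binary, which the simple high-bit test would let through.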

OTHER TIPS

Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why file extensions are used to signal content type.

I looked into that years ago---and as I recall, the file(1) program actually reads the first few header bytes of a file and compares them against entries in a lookup table (its "magic" database). Sounds like a good candidate for an add-on package to me.
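A minimal sketch of that magic-byte idea (the table below holds just a few well-known signatures and is nowhere near as complete as file(1)'s database):

```r
# Sketch of file(1)-style detection: read the first few bytes of the
# file and compare them against a tiny table of known magic numbers.
sniff.type <- function(filepath){
  magic <- list(
    PNG  = as.raw(c(0x89, 0x50, 0x4e, 0x47)),  # "\x89PNG"
    PDF  = charToRaw("%PDF"),
    GZIP = as.raw(c(0x1f, 0x8b))
  )
  f <- file(filepath, "rb", raw = TRUE)
  on.exit(close(f))
  header <- readBin(f, "raw", n = 8)
  for(type in names(magic)){
    sig <- magic[[type]]
    if(length(header) >= length(sig) &&
       identical(header[seq_along(sig)], sig)) return(type)
  }
  "unknown"
}
```

A real implementation would need offsets, masks, and hundreds of entries, which is exactly what the magic(5) database provides.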

The example section of the manual for ?raw uses this:

isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
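Note that this operates on a string, not a file, so to test a file you would read its contents first. A quick usage sketch (the `file.isASCII` wrapper is my own, and it does not handle files with embedded NUL bytes):

```r
# isASCII, from the examples in ?raw: TRUE if every byte is below 128.
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))

# Hypothetical wrapper: slurp the whole file as bytes and test it.
file.isASCII <- function(filepath){
  isASCII(readChar(filepath, file.info(filepath)$size, useBytes = TRUE))
}
```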
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow