Question

I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:

read.table(myfile, header = TRUE, sep = "\t")

To my surprise, the data was corrupted, but in a subtle way. Numeric values had changed seemingly at random while the overall layout looked normal, so it took me a while to notice the problem, which I now assume is caused by the byte order mark (BOM).

This is not a new issue of course; they address it briefly here, and recommend using

read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")

However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:

read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")

In either case, I have to do some extra work to replace the column names with the first row, but only after I remove whichever version of the BOM appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
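
One way to check whether a file really starts with a UTF-8 BOM is to look at its first three bytes, which should be EF BB BF. The sketch below is only an illustration: has_utf8_bom is a hypothetical helper name, and the two temporary files with made-up data stand in for the real SAS export.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Two small temporary files standing in for the real export (made-up data).
body <- charToRaw("id\tvalue\n1\t2.5\n")
with_bom <- tempfile(fileext = ".txt")
without_bom <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), body), with_bom)
writeBin(body, without_bom)

bom_found <- has_utf8_bom(with_bom)        # file written with a BOM
bom_missing <- !has_utf8_bom(without_bom)  # file written without one
```

Calling has_utf8_bom(myfile) on the actual export would show whether the BOM is present before any decoding takes place.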

Isn't there a simple way to just remove the BOM and use read.table without any special arguments?

Update for @Joe: The SAS that I used:

FILENAME myfile 'C:\Documents ... file.txt'  encoding="utf-8";
proc export data=lib.sastable
  outfile=myfile
  dbms=tab  replace;
  putnames=yes;
run;

Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but the last few digits of the numbers in the first column get corrupted. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?

Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
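
One possible explanation, which is a guess and not something confirmed in this thread: if the first column holds very long numbers (more than about 15 to 16 significant digits), read.table converts them to doubles, which cannot store every digit exactly, while colClasses = "character" keeps them as text. A minimal sketch with a made-up 20-digit ID:

```r
# Made-up data: a 20-digit identifier, too long to be stored exactly as a double.
txt <- "id\tvalue\n12345678901234567890\t1\n"

# Default type conversion turns the id column into a double.
as_num <- read.table(text = txt, header = TRUE, sep = "\t")

# Reading every column as character keeps the digits exactly as written.
as_chr <- read.table(text = txt, header = TRUE, sep = "\t",
                     colClasses = "character")

# Compare the two; the trailing digits of the numeric version may differ.
format(as_num$id, scientific = FALSE)
as_chr$id
```

If this is what is happening, the corruption has nothing to do with the BOM, which would match the header row being fine while the trailing digits change.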


Solution

As per your link, it looks like it works for me with:

read.table('c:\\temp\\testfile.txt', fileEncoding = 'UTF-8-BOM', header = TRUE, sep = '\t')

Note the -BOM suffix in the file encoding.
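
To try this without the original SAS export, here is a small sketch that writes a temporary file starting with a BOM and reads it back. The file contents and the column names id and value are made up for illustration.

```r
# Write a small tab-delimited file that starts with a UTF-8 BOM (made-up data).
path <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("id\tvalue\n1\t2.5\n2\t3.5\n")), path)

# "UTF-8-BOM" asks R to drop a leading BOM while decoding the file.
dat <- read.table(path, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t")
names(dat)
```

With this encoding the first column name should come back without any BOM residue, so no renaming step is needed afterwards.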

This is in section 2.1, Variations on read.table, in the R documentation. Under item 12, Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).

OTHER TIPS

Or you can set the SAS system option NOBOMFILE (options NOBOMFILE;) to write a UTF-8 file without the BOM.

Licensed under: CC-BY-SA with attribution