coerce single-byte ascii from a text file

https://stackoverflow.com/questions/18245047

24-06-2022

Question

I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in with read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally contain a wonky multibyte character, often enough to be annoying. For example, instead of a "U", the data file has whatever the system assigns to Unicode U+F8FF. (On OS X that's an Apple symbol, but I'm not sure whether that is a cross-platform standard.) When that happens, I get an error like this:

invalid multibyte string at 'NTY <20> MAINE
000008 [...]

That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)

I'd like to do all the coding in R, and I'm just not sure how to coerce the text to single-byte. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?

Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?

Any help much appreciated!

Solution

What does the output of the file command say about your data file?

/tmp >file a.txt b.txt 
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
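file only reports a guess for the file as a whole. To see which lines actually contain non-ASCII bytes, something along these lines may help. This is a rough sketch, not a tested recipe for your data: the sample file and its contents are made up for illustration, and it assumes a grep that supports -a (GNU and BSD grep both do).

```shell
# Hypothetical sample: line 2 has U+F8FF (UTF-8 bytes EF A3 BF) where "U" should be.
tmpdir=$(mktemp -d)
printf 'COUNTY OF MAINE\nCO\357\243\277NTY OF MAINE\n' > "$tmpdir/sample.txt"

# LC_ALL=C makes grep treat every byte as its own character, so the bracket
# expression matches any byte that is neither printable ASCII nor whitespace.
# -a forces text mode; -n prefixes each match with its line number.
bad_lines=$(LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$tmpdir/sample.txt" | cut -d: -f1)

# Print the numbers of the offending lines, one per line.
echo "$bad_lines"
```

Dropping the cut stage prints the offending lines themselves, which makes it easier to see what each stray character was supposed to be.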

You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:

# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt 
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt 
He said, ?hello!?

Now, with iconv you can try to convert it to ascii:

/tmp >iconv -f windows-1252 -t ascii a.txt 
He said, 
iconv: a.txt:1:9: cannot convert

This fails because the Windows 1252 quotes have no direct ASCII equivalent. Instead, you can tell iconv to do a transliteration:

/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt  > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt 
He said, "hello!"
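Transliteration can only pick a generic substitute, and it has no way of knowing that U+F8FF was meant to be a "U". If the bad character is always the same one, a byte-level replacement before the file reaches R may be simpler. A minimal sketch, assuming the file is UTF-8 and that every U+F8FF stands for "U" as the question describes (the file names and sample line are hypothetical):

```shell
# Hypothetical sample line with U+F8FF (UTF-8 bytes EF A3 BF) in place of "U".
tmpdir=$(mktemp -d)
printf 'CO\357\243\277NTY OF MAINE\n' > "$tmpdir/in.txt"

# Build the three-byte sequence with printf; octal escapes are portable.
apple=$(printf '\357\243\277')

# LC_ALL=C makes sed work byte by byte, so the sequence is matched literally
# and each occurrence is replaced with a single ASCII "U".
LC_ALL=C sed "s/$apple/U/g" "$tmpdir/in.txt" > "$tmpdir/fixed.txt"

fixed=$(cat "$tmpdir/fixed.txt")
echo "$fixed"
```

Because each three-byte sequence becomes one byte, the fixed-width columns line up again, which matters for a fixed-width reader such as read.fwf(). Anything else that is still non-ASCII afterwards can go through iconv with //TRANSLIT.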

There might be a way to do this using R's IO layer, but I don't know R.

Hope that helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow