Read csv data file in R

https://stackoverflow.com/questions/16911343

30-05-2022

Question

I am using read.table to read a data file and got the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'true'

I know this means there is an error somewhere in my data file. The problem is how to find it: the error message does not say which row has the issue, so it is hard to track down. Alternatively, how can I skip these rows?

Here's my R code:

data <- read.csv("/home/jianfezhang/prod/conversion_yaap/data/part-r-00000",
                 sep = "\t",
                 col.names = c("site",
                               "treatment",
                               "mode",
                               "segment",
                               "source",
                               "itemId",
                               "leaf_categ_id",
                               "condition_id",
                               "auct_type_code",
                               "start_price_lstg_curncy",
                               "bin_price_lstg_curncy",
                               "start_price_variance",
                               "start_price_mean",
                               "start_price_media",
                               "bin_price_variance",
                               "bin_price_mean",
                               "bin_price_media",
                               "is_sold"),
                 colClasses = c(rep("factor", 5), "numeric", rep("factor", 3),
                                rep("numeric", 8), "factor"))

Solution

The error you get is caused by the colClasses argument: some values in the file do not match the data types you specified.

When I run into this, it usually turns out that I miscounted in the colClasses argument; for example, the correct value might be

colClasses=c(rep("factor",5),"numeric", rep("factor",4), rep("numeric",7),"factor")

instead of the values you specified. You can check this by carefully comparing the first few lines of your file with the data types you specified.
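One way to make that comparison is to read only the first few rows without colClasses and look at the types R infers. This is a minimal sketch: the three-column sample written to a temporary file is made up, and header = FALSE is an assumption about the file layout.

```r
# Hypothetical sample standing in for the real file: tab-separated, no header.
path <- tempfile()
writeLines(c("US\t101\t9.99",
             "DE\t102\t19.50",
             "UK\t103\ttrue"), path)

# Read a few rows with no colClasses and let R guess the column types.
head_rows <- read.csv(path, sep = "\t", header = FALSE, nrows = 3,
                      stringsAsFactors = FALSE)

# One class name per column; compare these against the intended colClasses.
guessed <- sapply(head_rows, class)
print(guessed)
```

A column you expected to be numeric but that is reported as character contains at least one non-numeric value in the rows read. Increasing nrows widens the check, at the cost of reading more of the file.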

If that does not solve it, some column contains a value of a type you do not expect. A simple but slow approach is to remove the colClasses argument and read the whole file without type restrictions, adding stringsAsFactors=FALSE so that text columns come in as character rather than factor. That read should succeed.

Then convert the columns one at a time, for example

data$itemId <- as.numeric(data$itemId)

and then check the result for NA values, which summary(data$itemId) reports. If there are NA values, which(is.na(data$itemId)) returns their row numbers, so you can check in the original file whether each NA is genuine or points to a data problem.
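The read-then-convert steps above can be sketched end to end as follows. The sample data and the three column names are made up for illustration. Using colClasses = "character" to force every column to text is a small variation on simply dropping colClasses, chosen here so that the original text of each value stays available for comparison.

```r
# Hypothetical sample standing in for the real file: tab-separated, no header.
path <- tempfile()
writeLines(c("US\t101\t9.99",
             "DE\t102\ttrue",
             "UK\t103\t19.50"), path)

# Read every column as text so the read itself cannot fail on a bad value.
data <- read.csv(path, sep = "\t", header = FALSE,
                 col.names = c("site", "itemId", "start_price_mean"),
                 colClasses = "character")

# Convert one column; values that are not numbers become NA (with a warning).
converted <- suppressWarnings(as.numeric(data$start_price_mean))

# Rows that had a value in the file but failed to convert.
bad_rows <- which(is.na(converted) & !is.na(data$start_price_mean))
print(bad_rows)
print(data[bad_rows, ])
```

Comparing against the original text column separates values that were already missing in the file from values that could not be parsed.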

Most of the time you will be able to narrow down your problem this way.

If your file has a lot of columns, however, this quickly becomes a lot of work.
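For many columns, the per-column check can be wrapped in a loop. The sketch below assumes the data was read with every column as character. The helper name find_bad_rows and the sample data frame are hypothetical and are not part of base R or of the original code.

```r
# Hypothetical helper: for each column expected to be numeric, return the
# row numbers whose text value does not parse as a number.
find_bad_rows <- function(df, numeric_cols) {
  lapply(df[numeric_cols], function(col) {
    converted <- suppressWarnings(as.numeric(col))
    which(is.na(converted) & !is.na(col))
  })
}

# Small made-up data frame with every column held as character.
df <- data.frame(site   = c("US", "DE", "UK"),
                 itemId = c("101", "x02", "103"),
                 price  = c("9.99", "true", "19.50"),
                 stringsAsFactors = FALSE)

bad <- find_bad_rows(df, c("itemId", "price"))
print(bad)

# Keep only the rows that converted cleanly in every checked column.
keep  <- setdiff(seq_len(nrow(df)), unlist(bad))
clean <- df[keep, ]
```

Selecting the rows to keep with setdiff, rather than using negative indexing, avoids an empty result when no bad rows are found.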

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow