How can I specify only some colClasses in sqldf file.format?

https://stackoverflow.com/questions/17721645

03-06-2022

Question

I have some CSV files with problematic columns for sqldf, causing some numeric columns to be classed as character. How can I just specify the classes for those columns, and not every column? There are many columns, and I don't necessarily want to have to specify the class for all of them.

Most of the values in these problem columns are zeros, so sqldf reads the columns as integer when they are actually of numeric (real) type. Note that read.csv assigns the classes correctly. I'm not clever enough to generate a suitable data set with the right properties (first 50 values zero, then a value of, say, 1.45 in the 51st row), but here's an example call to load the data:

df <- read.csv.sql("data.dat", sql="select * from file",  
                   file.format=list(colClasses=c("attr4"="numeric")))

which returns this error:

Error in sqldf(sql, envir = p, file.format = file.format, dbname = dbname,  :
   formal argument "file.format" matched by multiple actual arguments

Can I somehow use another read.table call to work out the data types? Can I read all columns in as character, and then convert some to numeric? Only a small number of columns are character, and it would be easier to specify those than all of the numeric columns. I have come up with this ugly partial solution, but it still fails on the final line with the same error message:

df.head <- read.csv("data.dat", nrows=10)
classes <- lapply(df.head, class)  # also fails to get classes correct
classes <- replace(classes, classes=="integer", "numeric")
df <- read.csv.sql("data.dat", sql="select * from file",  
                   file.format=list(colClasses=classes))

Solution

Take a closer look at the documentation for read.csv.sql, specifically at the argument nrows:

nrows: Number of rows used to determine column types. It defaults to 50. Using -1 causes it to use all rows for determining column types.
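As a minimal sketch of what that looks like in practice, the snippet below writes a small made-up file shaped like the one described in the question (50 zeros followed by 1.45) and reads it with nrows = -1. The temporary file and its column names are assumptions for illustration only, and the call is guarded so it only runs where sqldf is installed.

```r
# Only run the sketch if the sqldf package is available.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Made-up data: attr4 is 0 for the first 50 rows and 1.45 in row 51.
  path <- tempfile(fileext = ".csv")
  sample_data <- data.frame(id = 1:51, attr4 = c(rep(0, 50), 1.45))
  write.csv(sample_data, path, row.names = FALSE, quote = FALSE)

  # nrows = -1 asks read.csv.sql to inspect every row, not just the
  # first 50, when deciding the column types.
  df <- read.csv.sql(path, sql = "select * from file", nrows = -1)
  str(df)
}
```

Scanning every row costs an extra pass over the file, so on very large files this may be noticeably slower than the default.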

Another thing you'll notice from the documentation for read.csv.sql and sqldf is that there is no colClasses parameter. If you read the file.format documentation in sqldf, you'll see that the entries of the file.format list are not passed to read.table but to sqliteImportFile, which has no understanding of R's data types. The error itself appears to come from read.csv.sql building its own file.format list and passing it on to sqldf, so a second file.format supplied through ... is matched twice.

If you don't want to change the nrows parameter, you could read the entire data frame as character and then use whatever method you like to work out which column should have which class. Either way, you cannot know whether a column that looks like integer is really integer or numeric until you have read the whole column. Also, if speed is really a problem here, you may want to consider moving away from CSV files.
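The read-as-character-then-convert idea can be sketched roughly as follows. This uses base R's read.csv for the reading step so that the example is self-contained; the same conversion step could in principle be applied to a data frame obtained another way. The sample file, its column names, and the helper to_numeric_if_clean are all hypothetical and exist only for illustration.

```r
# Made-up data shaped like the file in the question:
# attr4 is 0 for the first 50 rows and 1.45 in row 51.
path <- tempfile(fileext = ".csv")
sample_data <- data.frame(
  id    = sprintf("id%02d", 1:51),
  attr4 = c(rep(0, 50), 1.45)
)
write.csv(sample_data, path, row.names = FALSE, quote = FALSE)

# Read every column as character so no type is guessed from the first rows.
raw <- read.csv(path, colClasses = "character")

# Hypothetical helper: convert a column to numeric only if every
# non-NA value parses as a number; otherwise leave it as character.
to_numeric_if_clean <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  if (all(is.na(parsed) == is.na(x))) parsed else x
}

df <- raw
df[] <- lapply(raw, to_numeric_if_clean)
str(df)
```

Because the decision is made on the whole column, a column of zeros with a single 1.45 further down ends up as numeric, while columns containing any non-numeric text stay character.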

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow