Question

I'm trying to read in a large (3.7 million rows, 180 columns) dataset into R, using the ff package. There are several data types in the dataset - factor, logical, and numeric.

The problem is when reading in numeric variables. For example, one of my columns is:

TotalBeforeTax
126.9
88.0
124.5
90.9
...

When I try reading the data in, the following error is thrown:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"126.90000"'

I tried declaring the column as integer (it's already declared as numeric) using the colClasses argument, but to no avail. I also tried changing it to "a real" (whatever that is supposed to mean); it then starts reading in the data, but at some point throws:

Error in methods::as(data[[i]], colClasses[i]) : 
  no method or default for coercing “character” to “a real”

(My guess is, because it comes across an NA and doesn't know what to do with it.)

The funny thing is, if I declare the column as a factor, everything reads in nicely.

What gives?


Solution

OK, so I managed to solve this using a primitive workaround. First, split the .csv file using a csv file splitter application. Then, execute the following code:

## Folder holding the split .csv files, and the name prefix the pieces share.

sourceDir <- "split_files_folder"
sourceFile <- file.path(sourceDir, "common_name_of_split_files")

## Number of split pieces. This must be a number, not a string;
## 10 is only a placeholder, replace it with your own count.

pieces <- 10

## Destination folder and name of the tab-delimited output file.

destDir <- "destination_folder"
destFile <- file.path(destDir, "datafile.txt")

## Read each piece (named <prefix>_1.csv, <prefix>_2.csv, ...) and append it
## to the output file. Only the first piece writes the header row.

for (i in seq_len(pieces))
{
  temp <- read.csv(file = paste0(sourceFile, "_", i, ".csv"))
  write.table(temp, file = destFile, append = (i > 1), quote = FALSE,
              sep = "\t", row.names = FALSE, col.names = (i == 1))
}

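And voilà! You've got one huge tab-delimited text file. Because it is written with quote = FALSE, the numeric values are no longer wrapped in double quotes.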
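To see the effect on a small scale, here is a minimal sketch that runs the same merge loop on two toy pieces and checks the result. The file names (part_1.csv, part_2.csv), the Paid column, and the values are made up for illustration; only base R is used.

```r
# Toy stand-ins for the split files (hypothetical names and values).
sourceDir <- tempfile("split_")
dir.create(sourceDir)
writeLines(c("TotalBeforeTax,Paid", '"126.90000",TRUE', '"88.00000",FALSE'),
           file.path(sourceDir, "part_1.csv"))
writeLines(c("TotalBeforeTax,Paid", '"124.50000",TRUE', '"90.90000",NA'),
           file.path(sourceDir, "part_2.csv"))
destFile <- file.path(sourceDir, "datafile.txt")

# Same idea as the loop above: the first piece writes the header,
# later pieces are appended without it.
for (i in 1:2) {
  temp <- read.csv(file.path(sourceDir, paste0("part_", i, ".csv")))
  write.table(temp, file = destFile, append = (i > 1), quote = FALSE,
              sep = "\t", row.names = FALSE, col.names = (i == 1))
}

# Read the merged file back and inspect the column classes.
merged <- read.delim(destFile)
str(merged)

# Check whether any double quotes survived in the output file.
has_quotes <- any(grepl('"', readLines(destFile), fixed = TRUE))
print(has_quotes)
```

On the real data it is probably worth running the same str() check on the first few thousand rows (read.delim has an nrows argument) before handing the file to ff.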

Other tips

Solution 1

You could try laf_to_ffdf from the ffbase package. Something like:

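library(LaF)
library(ffbase)

## column_types takes LaF type names, one per column:
## "double", "integer", "categorical" or "string".
con <- laf_open_csv("yourcsvfile.csv",
  column_names = [a character vector with column names],
  column_types = [a character vector with column types],
  dec = ".", sep = ",", skip = 1)

## Named dat rather than ffdf so it does not mask the ffdf() function.
dat <- laf_to_ffdf(con)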

Or if you want to detect the types automatically:

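library(LaF)
library(ffbase)

## header = TRUE because the file starts with a row of column names
## (detect_dm_csv assumes no header by default).
m <- detect_dm_csv("yourcsvfile.csv", header = TRUE)
con <- laf_open(m)
dat <- laf_to_ffdf(con)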

Solution 2

Use a column class of character for the offending column and cast the column to numeric in the transFUN argument of read.csv.ffdf:

ffdf <- read.csv.ffdf([your regular arguments], transFUN = function(d) {
  d$offendingcolumn <- as.numeric(d$offendingcolumn)
  d
})

The problem seems to be that the number 126.90000 is wrapped in double quotes ("126.90000") in the file, so scan() cannot parse it as a number. So maybe you should first read the variable as character, then remove all unwanted characters, and finally convert the variable to numeric.
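A minimal sketch of that idea in base R, on a made-up four-row sample built from the values in the question (csv_text is a hypothetical stand-in for the real file). It first forces the column to numeric to reproduce the failure, then reads it as character and converts it:

```r
# Hypothetical miniature of the problem file: numbers wrapped in double quotes.
csv_text <- paste(
  "TotalBeforeTax",
  '"126.90000"', '"88.00000"', '"124.50000"', '"90.90000"',
  sep = "\n"
)

# Forcing the column to numeric fails: quotes are only stripped from
# fields that are read as character.
strict <- tryCatch(
  read.csv(text = csv_text, colClasses = "numeric"),
  error = function(e) conditionMessage(e)
)
print(strict)

# Reading the column as character succeeds.
raw <- read.csv(text = csv_text, colClasses = "character")

# Strip any double quotes that might remain, then convert to numeric.
cleaned <- gsub('"', "", raw$TotalBeforeTax, fixed = TRUE)
raw$TotalBeforeTax <- as.numeric(cleaned)
str(raw)
```

With read.csv.ffdf, the two cleaning lines would go inside the transFUN function shown above, so that each chunk is converted before it is stored.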

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow