Reading big data in R with read.big.matrix

https://stackoverflow.com/questions/12725603

05-07-2021

Question

I am trying to read a data set of dimension 3131875*5 into R using read.big.matrix. My data has both character and numeric columns, including a date variable. The command I am using is

as1 <- read.big.matrix("C:/Documents and Settings/Arundhati.Mukherjee/My Documents/Arundhati/big data/MB07_Arundhati/sample2.txt",
                       header=TRUE, 
                       backingfile="session.bin",
                       descriptorfile="session.desc",
                       type = NA)

However, type = NA is not accepted in this case, and I get the following error:

Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type,  : 
  Problem creating filebacked matrix.
In addition: Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("C:/Documents and Settings/Arundhati.Mukherjee/My Documents/Arundhati/big data/MB07_Arundhati/sample2.txt",  :
  Because type was not specified, we chose double based on the first line of data.

I need to know what the type should be here. I also tried other options such as type = "double", but that gives the same error.

Please help me.

Solution

From ?read.big.matrix:

Files must contain only one atomic type (all integer, for example).

Therefore, you won't be able to read in data with combinations of character, numeric, integer, date, etc. You could do some work on the file, for instance using a different program to convert the character variables to integer representations (like converting to a factor in R).
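As a rough sketch of that preprocessing done in R itself: the snippet below builds a small hypothetical stand-in for the input file (the column names id, region, date and amount and the comma separator are assumptions, not taken from the question), turns the date into a number of days, replaces the character column with integer codes via factor(), and writes an all-numeric file. It assumes the data fits in memory as a data frame for this one-off step; a larger file would need to be processed in chunks.

```r
# Hypothetical stand-in for sample2.txt; column names and separator are assumptions.
src <- tempfile(fileext = ".txt")
writeLines(c("id,region,date,amount",
             "1,north,2012-01-05,10.5",
             "2,south,2012-01-06,20.0",
             "3,north,2012-01-07,7.25"), src)

# Read as an ordinary data frame, keeping text columns as character.
df <- read.csv(src, stringsAsFactors = FALSE)

# Store the date as days since 1970-01-01 so the column becomes numeric.
df$date <- as.numeric(as.Date(df$date))

# Replace each remaining character column with integer codes,
# keeping the levels so the codes can be mapped back to labels later.
lookups <- list()
for (col in names(df)) {
  if (is.character(df[[col]])) {
    f <- factor(df[[col]])
    lookups[[col]] <- levels(f)
    df[[col]] <- as.integer(f)
  }
}

# Write the all-numeric version of the file.
num_file <- tempfile(fileext = ".txt")
write.table(df, num_file, sep = ",", row.names = FALSE, quote = FALSE)

# The numeric file now contains a single atomic type, so it can be read
# as a file-backed big.matrix (only attempted if bigmemory is installed).
if (requireNamespace("bigmemory", quietly = TRUE)) {
  as1 <- bigmemory::read.big.matrix(num_file, header = TRUE, sep = ",",
                                    type = "double",
                                    backingfile = "session.bin",
                                    descriptorfile = "session.desc",
                                    backingpath = tempdir())
}
```

The lookups list is what lets you translate a code such as 1 back to its original label; without it the integer codes carry no meaning on their own.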

EDIT:

On the bigmemory website there is an example of preprocessing data with a Python script that converts character information to integers. The script is written for a specific dataset, but you could use it as a guideline for your own data.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow