Question

I have a huge CSV file with a numeric column containing big integer values. A sample is below.

0, 0, 11536375, 0, 1152921504606846976, 75962, 258238559    
1, 0, 11536375, 1, 1152921504606846977, 609189, 1515555074
2, 0, 11536375, 2, 1152921504606846978, 609189, 1530344731

I'm trying to read columns 1, 3, and 5:7 into an R data frame. I decided to use sqldf for efficiency, and because I already use it to read other data sources. The problem is that sqldf truncates column 5 to 1.152922e+18. This column is essentially an index that I need to join with another data frame, so I need the exact value. I don't think the nrows argument will help here; the values really are larger than what base R can represent exactly. I thought the int64 package might help, but it has been archived. Any suggestions on how I can read big integers in sqldf?

I used scan as a workaround, specifying column 5 as a string. I get the full value now, but it is inefficient when used in joins/merges. If reading it as a string is the only way out, can I achieve this in sqldf? "what" and "colClasses" are not supported by sqldf, so how can I specify that column 5 should be treated as a string?
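
For reference, this is roughly the scan workaround (no header row assumed; the file name and column names are placeholders). Column 5 is read as character so the full value survives:

# read all seven columns, forcing column 5 (e) to character
cols <- scan("bigfile.csv", sep = ",", strip.white = TRUE,
             what = list(a = integer(), b = integer(), c = integer(), d = integer(),
                         e = character(), f = integer(), g = integer()))
DF <- data.frame(cols, stringsAsFactors = FALSE)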


Solution

Try this:

library(sqldf)

# create test data
Lines <- "a, b, c, d, e, f, g
0, 0, 11536375, 0, 1152921504606846976, 75962, 258238559    
1, 0, 11536375, 1, 1152921504606846977, 609189, 1515555074
2, 0, 11536375, 2, 1152921504606846978, 609189, 1530344731
"
cat(Lines, file = "testFile.dat")

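# read the file, casting column e to text so its full value is preserved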
DF <- read.csv.sql("testFile.dat", sql = 
  "select a, b, c, d, cast(e as text) e, f, g from file")

giving:

> DF
  a b        c d                    e      f          g
1 0 0 11536375 0  1152921504606846976  75962  258238559
2 1 0 11536375 1  1152921504606846977 609189 1515555074
3 2 0 11536375 2  1152921504606846978 609189 1530344731
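
Since e comes back as character (SQLite text maps to character in R), it can be joined against another data frame as-is; the second data frame below is made up purely to illustrate:

other <- data.frame(e = c("1152921504606846976", "1152921504606846978"),
                    label = c("x", "y"), stringsAsFactors = FALSE)
merge(DF, other, by = "e")

The join matches on the full string value, so no precision is lost.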