Question

I have 5+ million rows in R with just 4 columns per row. One of the columns is the date when the data was collected. An example:

Date        PI   SSC    GC
2/11/2013 0.52  0.89   4.2
2/11/2013 0.56  0.45  12.0
2/11/2013 0.49  0.89  13.1
2/11/2013 0.59  0.47   4.8 
2/11/2013 0.61  0.58  12.3

I would like to know if there is a way to read ONLY the rows corresponding to certain dates, rather than having to read all 5 million rows and then subset. For example, all rows corresponding to the date 2/11/2013 (I do not know how many of them are in the file). Also, in case it is of any help, the class of the Date column is factor.


Solution

Although this is not exactly an answer to what the user asks, 5 million rows is not really too much to read. Of course base R's read.table will be very slow, but fread from the data.table package is fast enough. Here are the benchmarks:

tbl <- read.table(header=T, stringsAsFactors=F, text='Date        PI   SSC    GC
2/11/2013 0.52  0.89   4.2')

require(data.table)
# create a huge data.table with 5 million rows to write to a temp file
bigtbl <- rbindlist( lapply(1:(5*1e6), function(x) tbl))
write.table(bigtbl, row.names=F, quote=F, file="temp.txt")


# benchmark: read the 5-million-row file back using fread
system.time(bigtbl2 <- fread('temp.txt'))

## Read 5000000 rows and 4 (of 4) columns from 0.116 GB file in 00:00:11
##  user  system elapsed 
##  10.76    0.08   10.86 
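
Once the table is in memory, the subsetting step the question hoped to avoid is itself a fast one-liner (note that fread reads the Date column as character, not factor):

# all rows collected on a single date
feb11 <- bigtbl2[Date == "2/11/2013"]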

Of course, memory size may still be a concern, but in this case it is still only 153 MB:

> tables()
     NAME         NROW  MB COLS           KEY
[1,] bigtbl2 5,000,000 153 Date,PI,SSC,GC    
Total: 153MB
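
That said, to answer the literal question: you can push the date filtering outside R entirely, so that only the matching rows are ever parsed. This is a sketch, assuming a Unix-like system with grep available and a data.table version recent enough to have fread's cmd argument (temp.txt is the file written above):

require(data.table)

# read just the header line to recover the column names
hdr <- names(fread("temp.txt", nrows = 0))

# grep filters the file before R ever sees it; fread runs the shell
# command and parses only the matching lines (which have no header)
one_day <- fread(cmd = "grep '^2/11/2013 ' temp.txt", header = FALSE)
setnames(one_day, hdr)

Since grep streams the file, this stays cheap even when the file is far larger than the subset you actually want.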

If you are going to read this data frequently, it makes sense to save it in a standard RData file using the save function and read it back using load.
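
For example (a minimal sketch, reusing bigtbl2 from above and a hypothetical temp.RData file name):

# one-time cost: save the already-parsed table in binary form
save(bigtbl2, file = "temp.RData")

# later sessions: load() restores bigtbl2 into the workspace,
# skipping the text-parsing step entirely
load("temp.RData")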

Licensed under: CC-BY-SA with attribution