Question

I have a large data set (several GB) that I have to process before I analyse it. I tried creating a connection, which allows me to loop through the large dataset and extract chunks at a time. This allows me to quarantine data that satisfies some conditions.

My problem is that I am not able to create an indicator for the connection that flags when the end of the dataset has been reached, so that I can execute close(con). Moreover, for the first chunk of extracted data I have to skip 17 lines, since the file contains a header that R is not able to read.

A manual attempt that works:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)    
data<-read.table(con,nrows=1000,skip=0,header=FALSE)    
.    
.    
.    
till end of dataset

Since I want to avoid manually keying in the command above until I reach the end of the dataset, I attempted to write a loop to automate the process, but it was unsuccessful.

My attempt with loops that failed:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)        
if (nrow(rval)==0) {    
  con <<-NULL    
  close(con)    
  }else{    
    if(nrow(rval)!=0){    
    con <<-file(description=filename, open="r")    
    data<-read.table(conn,nrows=1000,skip=0,header=FALSE)      
  }}    

Solution

Looks like you're on the right track. Just open the connection once (you don't need <<-, plain <- is fine) and use a larger chunk size so that R's vectorized operations can process each chunk efficiently, along the lines of

filename <- "nameoffile.txt"
nrows <- 1000000
con <- file(description=filename,open="r")    
## N.B.: skip = 17 from original prob.! Usually not needed (thx @Moody_Mudskipper)
data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## process chunk 'data' here, then...
    ## ...read next chunk
    if (nrow(data) != nrows)   # partial chunk, so it was the final one
        break
    data <- tryCatch({
        read.table(con, nrows=nrows, skip=0, header=FALSE)
    }, error=function(err) {
       ## matching condition message only works when message is not translated
       if (identical(conditionMessage(err), "no lines available in input"))
          data.frame()
       else stop(err)
    })
}
close(con)    
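
The "process chunk" placeholder is where the quarantining described in the question would go. As a minimal sketch (the column V3, the threshold, and the output file name are made-up placeholders, not from the original post), you could define a helper and call process_chunk(data) inside the loop:

process_chunk <- function(chunk) {
    ## hypothetical rule: quarantine rows whose third column exceeds 100
    flagged <- chunk[chunk$V3 > 100, ]
    if (nrow(flagged) > 0)
        write.table(flagged, file="quarantined.txt",
                    append=TRUE, row.names=FALSE, col.names=FALSE)
}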

Iterating over the file seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting a read at the end of the file.
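
If matching the connection's error message feels fragile (as the comment in the code notes, it breaks when messages are translated), a base-R alternative, shown here only as a sketch and not part of the original answer, is to read raw lines with readLines(), which returns a zero-length vector at end of file, and parse each batch with read.table(text=):

con <- file("nameoffile.txt", open="r")
junk <- readLines(con, n=17)                 # read and discard the 17 header lines once
repeat {
    lines <- readLines(con, n=1000000)       # character(0) signals end of file
    if (length(lines) == 0)
        break
    chunk <- read.table(text=lines, header=FALSE)
    ## process 'chunk' here
}
close(con)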

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow