Question

I am using ff with R because I have a huge dataset (around 16 GB) to work with. As a test case, I read around 1M records from the file and wrote them out as an ff database.

system.time(te3 <- read.csv.ffdf(file="testdata.csv", sep = ",", header=TRUE, first.rows=10000, next.rows=50000, colClasses=c("numeric","numeric","numeric","numeric")))

I have uploaded the resulting file (te3) here: http://bit.ly/1c8pXqt

I tried to do a simple calculation to create a new variable:

ffdfwith(te3, {odfips <- ofips*100000 + dfips})

I get the following error (there are no missing records) which has flummoxed me:

Error in if (by < 1) stop("'by' must be > 0") : missing value where TRUE/FALSE needed
In addition: Warning message: In chunk.default(from = 1L, to = 1000000L, by = 2293760000, maxindex = 1000000L) : NAs introduced by coercion

Any insights will be appreciated. Also, related to FF, is it possible to use standard R packages such as MCMC (I need to use the inverse gamma function) with FF databases?

TIA,

Krishnan


Solution

Adding an extra variable to an ffdf is basic, but there are several ways to reach the same goal; see below. I downloaded your zip file from http://bit.ly/1c8pXqt and unzipped it.

require(ffbase)
load.ffdf(dir="/home/janw/Desktop/stackoverflow/ffdb")

## ffdfwith and with both evaluate the expression chunkwise
te3$odfips <- ffdfwith(te3, ofips*100000 + dfips)
te3$odfips <- with(te3, ofips*100000 + dfips)
## It is better to restrict the ffdf to the columns used in the expression;
## otherwise the other columns are also loaded into RAM, which is not needed.
## This speeds things up.
te3$odfips <- ffdfwith(te3[c("ofips","dfips")], ofips*100000 + dfips)
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips)
## ffdfwith looks at options("ffbatchbytes") and works out how many rows of your ffdf
## fit in one batch without exceeding options("ffbatchbytes"), and hence RAM.
## So this variable is created in chunks.
## If you want to set the chunk size yourself, you can e.g. pass the by argument
## to with, which passes it on to ?chunk. E.g. below, the variable is created
## in chunks of 100000 records.
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips, by = 100000)

## As the Ops * and + are implemented in ffbase for ff vectors you can also do this:
te3$odfips <- te3$ofips * 100000 + te3$dfips
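
To make the idea of chunkwise evaluation concrete, here is a minimal base R sketch. It is not how ff implements it; it uses a plain data.frame as a stand-in for the ffdf, the values are made up, and the chunk size `by` is a hypothetical choice. Only the column names ofips and dfips come from the question.

```r
## Plain data.frame standing in for the ffdf (values are made up for illustration)
df <- data.frame(ofips = c(1001, 1003, 6037, 36061, 48201),
                 dfips = c(1003, 6037, 36061, 48201, 1001))

by <- 2                                  # hypothetical chunk size (rows per batch)
starts <- seq(1, nrow(df), by = by)      # first row of each chunk
odfips <- numeric(nrow(df))              # result vector, filled chunk by chunk

for (s in starts) {
  idx <- s:min(s + by - 1, nrow(df))     # row indices of the current chunk
  odfips[idx] <- df$ofips[idx] * 100000 + df$dfips[idx]
}

## The chunked result matches the all-at-once computation
identical(odfips, df$ofips * 100000 + df$dfips)
```

With ff, only the rows of the current chunk need to be in RAM at any time, which is the point of processing in batches.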

Why you are getting this error is unclear to me; I don't get it myself. Maybe you have set options("ffbatchbytes") to a very low amount?
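
One thing that can be checked from the messages alone: the by = 2293760000 reported in the warning is larger than R's maximum integer (2147483647). The base R sketch below shows that coercing such a value to integer gives NA, and that an `if` on NA produces the same error text. It assumes, without having checked ff's source, that chunk coerces `by` to integer before the `by < 1` test.

```r
## Base R only; the value is copied from the warning in the question
by <- 2293760000
by > .Machine$integer.max            # does not fit in a 32-bit integer

## Coercion to integer yields NA (with an "NAs introduced by coercion" warning)
by_int <- suppressWarnings(as.integer(by))
is.na(by_int)

## A check of the same form as in the error message then fails on NA
msg <- tryCatch(
  {
    if (by_int < 1) stop("'by' must be > 0")
    "ok"
  },
  error = function(e) conditionMessage(e)
)
msg
```

If that assumption holds, passing an explicit, moderate `by` (as in the with(..., by = 100000) call above) or checking the value of getOption("ffbatchbytes") would be reasonable things to try.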

The MCMC question is too vague to answer.

Licensed under: CC-BY-SA with attribution