Question

When I need to process raw data or generate a large amount of synthetic data, I use PyTables in Python and loop over the rows, "appending" each row to a table, so I do not have to know the size of the table ex ante. For example:

import tables

class test(tables.IsDescription):
    col1 = tables.Int32Col()
    col2 = tables.Int32Col()

# open_file/create_table in PyTables >= 3.0 (openFile/createTable before that)
hdf5_a = tables.open_file('test.hdf5', 'a')
table = hdf5_a.create_table('/', 'test', test)

for i in range(10):
    table.row['col1'] = i
    table.row['col2'] = i * 10
    table.row.append()

table.flush()
hdf5_a.close()

I need to do the same thing with R. Basically I want:

  1. generate synthetic data
  2. append data on the fly to a binary file on the disk
  3. later use this data without loading the whole thing to the memory

It seems packages such as ff and bigmemory should be useful for this, but the examples I saw were a bit different from my needs. Are there any code snippets that do something like this in R? I think a simple code example would be very helpful.


Solution

First, a function for generating some data:

gendata <- function() {
  n <- 1E3
  data.frame(a = 1:n, b = rnorm(n), c = sample(letters, n, replace=TRUE))
}

ff + ffbase

For ff the following pattern can be used:

library(ffbase)

dat <- NULL
for (i in seq_len(10)) {
  d <- gendata()
  # ffdfappend creates the on-disk ffdf on the first call (dat is NULL)
  # and appends to it on subsequent calls
  dat <- ffdfappend(dat, d)
}
save.ffdf(dat, dir="./test")
save.ffdf(dat, dir="./test")

The data can be loaded again by using load.ffdf("./test"); this restores the ffdf object while the data itself stays on disk.
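As a minimal sketch (assuming the ./test directory created above), reloading and subsetting might look like this; only the rows you index are read into memory:

```r
library(ffbase)

# restore the ffdf saved with save.ffdf(); recreates the object `dat`
# in the current environment, with the data still memory-mapped on disk
load.ffdf(dir = "./test")

nrow(dat)          # total number of rows, without loading everything
slice <- dat[1:5, ]  # pulls only these rows into a regular data.frame
```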

CSV

For text/CSV files the following pattern can be used:

con <- file("test.csv", "wt")
first_block <- TRUE
for (i in seq_len(10)) {
  d <- gendata()
  # writing to an open connection appends; emit the header only once
  write.table(d, file=con, sep=",", row.names=FALSE, col.names=first_block)
  first_block <- FALSE
}
close(con)
close(con)

To use the data, you will first have to import it into ff or bigmemory, or you can access it read-only using LaF.
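A hedged sketch of the LaF route, assuming the test.csv written above (column types must be declared up front, and skip = 1 skips the header row):

```r
library(LaF)

# open the file without reading it; nothing is loaded yet
laf <- laf_open_csv("test.csv",
                    column_types = c("integer", "double", "string"),
                    column_names = c("a", "b", "c"),
                    skip = 1)

d   <- laf[1:100, ]      # read only the first 100 rows
blk <- next_block(laf)   # or iterate over the file block by block
```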

OTHER TIPS

The approach would be the same in R -- open a file for writing, append chunks, close the file. If you're familiar with HDF5 then rhdf5 is one option. Section 3.3 of the package vignette contains an explicit example of iterating to create the file. The key to doing this efficiently is to write in chunks -- multiple rows at a time, to make use of R's vectorisation -- rather than one line at a time, although writing single lines also works.
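That pattern might be sketched as follows for a numeric matrix (exact arguments are best checked against the rhdf5 vignette; a data frame with mixed column types would need one dataset per column or a compound type):

```r
library(rhdf5)  # Bioconductor package

h5file <- "test.h5"
h5createFile(h5file)
# extensible two-column numeric dataset: H5Sunlimited() lets the row
# dimension grow, and data are stored in 1000-row chunks
h5createDataset(h5file, "test", dims = c(0, 2),
                maxdims = c(H5Sunlimited(), 2),
                storage.mode = "double", chunk = c(1000, 2))

nrows <- 0
for (i in seq_len(10)) {
  d <- cbind(1:1000, rnorm(1000))               # one chunk of new rows
  h5set_extent(h5file, "test", c(nrows + nrow(d), 2))  # grow the dataset
  h5write(d, h5file, "test",
          index = list(nrows + seq_len(nrow(d)), 1:2)) # write just the new rows
  nrows <- nrows + nrow(d)
}
h5closeAll()
```

Later, `h5read(h5file, "test", index = list(1:100, 1:2))` reads only the requested slice, without loading the whole dataset.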

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow