Question

Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with tricks such as scan(..., what="numeric", nmax=5000). Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)? Or would I likely be wasting my time for little to no performance gain because of the expected interface overhead? I'm not sure what is currently limiting the speed (intrinsic machine performance, or something else?). It's a task I typically repeat many times a day, and the file format is always the same: 1000 columns, around 5000 rows.

Here's a sample file to play with, if needed.

nr <- 5000
nc <- 1000

m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)

cat(m[1, -1], "\n", file = "test.txt") # first line is shorter
write.table(m[-1, ], file = "test.txt", append=TRUE,
            row.names = FALSE, col.names = FALSE)
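
For reference, a minimal sketch of a scan()-based read of this sample file (the shorter first line is read separately; the column count is assumed from the layout above):

header <- scan("test.txt", nlines = 1, quiet = TRUE)   # 999 values on the first line
body   <- scan("test.txt", skip = 1, quiet = TRUE)     # remaining rows, whitespace-separated
m2     <- matrix(body, ncol = 1000, byrow = TRUE)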

Update: I tried read.csv.sql and also Armadillo's load("test.txt", arma::raw_ascii); both were slower than the scan() solution.


Solution

I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), so to get it you'll need to install from the latest source on R-Forge. See here.
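
Once a data.table build that includes fread is installed, reading the sample file could look roughly like this (a sketch only; the shorter first line of the sample file may need skip = 1 or similar handling):

library(data.table)
dt  <- fread("test.txt", header = FALSE)  # separator and column types are auto-detected
mat <- as.matrix(dt)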

OTHER TIPS

Please bear in mind that I'm not an R expert, but maybe the concept applies here too: reading binary data is usually much faster than parsing text files. If your source files don't change frequently (e.g. you are running different versions of your script/program on the same data), read them via scan() once and store them in a binary format (the manual has a chapter on exporting binary files). From there on you can modify your program to read the binary input.
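
A sketch of that idea (the file names here are made up for illustration): parse the text file once with scan(), cache the values as raw doubles with writeBin(), and read the cache back with readBin() on later runs.

vals <- scan("test.txt", quiet = TRUE)      # slow text parse, done once
writeBin(vals, "test.bin")                  # cache as native binary doubles

# later runs: read the binary cache instead of re-parsing the text
vals <- readBin("test.bin", what = "double",
                n = file.info("test.bin")$size / 8)   # 8 bytes per double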

@Rcpp: scan() and friends are likely to call a native implementation (like fscanf()), so writing your own file-reading functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).

Hi Baptiste,

Data input/output is a huge topic, so big that R ships with its own manual on the subject, R Data Import/Export.

R's basic functions can be slow because they are so generic. If you know your format, you can easily write yourself a faster import adapter. If you also know your dimensions, it is even easier, as you need only one memory allocation.

Edit: As a first approximation, I would write a C++ ten-liner: open the file, read a line, break it into tokens, and assign to a vector<vector<double>> or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
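
A rough sketch of that ten-liner, wrapped with Rcpp so it can be called from R. It is purely illustrative (no error handling, everything read into a single flat vector), not a tuned implementation:

library(Rcpp)

cppFunction(includes = "#include <fstream>\n#include <vector>", code = '
NumericVector read_doubles(std::string path) {
    std::ifstream in(path.c_str());
    std::vector<double> values;
    values.reserve(5000000);               // rough size hint for ~5e6 values
    double v;
    while (in >> v) values.push_back(v);   // operator>> skips spaces and newlines
    return wrap(values);
}')

x <- read_doubles("test.txt")              # then reshape in R as needed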

I once had a neat little csv reader class in C++ based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful.

You can then squeeze performance as you see fit.

Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.

Yes, you almost certainly can create something faster than read.csv()/scan(). However, there are some existing tricks for high-performance file reading that already let you go much faster, so anything you write would be competing against those.

As Mathias alluded to, if your files don't change very often, you can cache them by calling save(), then restore them with load(). (Make sure to use ascii = FALSE, since reading binary files is quicker.)
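
A minimal sketch of that caching pattern, using the matrix m from the question's sample code:

save(m, file = "test.RData", ascii = FALSE)   # binary serialization (the default)
load("test.RData")                            # restores m far faster than re-parsing text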

Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.
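
With the sqldf package (Gabor's read.csv.sql, which the question reports trying), that route could look like this; sep = " " is an assumption matching the space-separated sample file:

library(sqldf)
df <- read.csv.sql("test.txt", header = FALSE, sep = " ")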

Thirdly, you can use the HadoopStreaming package to take advantage of Hadoop's file-reading capabilities.

For more thoughts on these techniques, see Quickly reading very large tables as dataframes in R.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow