Most efficient way of exporting large (3.9 mill obs) data.frames to text file? [duplicate]

StackOverflow https://stackoverflow.com//questions/9703068

Question

I have a fairly large data frame in R that I would like to export to SPSS. This file caused me hours of headaches trying to import it into R in the first place, but I eventually succeeded using read.fwf() with the options comment.char = "%" (a character not appearing in the file) and fill = TRUE (it was a fixed-width ASCII file where some rows lacked all variables, which caused error messages).
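
For reference, the import call looked roughly like this (the field widths and file name below are placeholders, not the real values):

    # Sketch of the import described above; replace the widths and file
    # name with the actual layout of the fixed-width file.
    widths <- c(10, 8, 8, 12)          # placeholder field widths
    df <- read.fwf(
      "input.dat",
      widths       = widths,
      colClasses   = "character",      # all variables are character
      comment.char = "%",              # a character never appearing in the file
      fill         = TRUE              # tolerate rows lacking trailing fields
    )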

Anyway, my data frame currently consists of 3.9 million observations and 48 variables (all character). I can write it to file fairly quickly by splitting it into four sets of 1 million observations with df2 <- df[1:1000000,] followed by write.table(df2) etc., but I can't write the entire file in one sweep without the computer locking up and needing a hard reset to come back.
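
The current workaround looks roughly like this (file names are just for illustration):

    # Four separate writes, each to its own file.
    write.table(df[1:1000000, ],       "out_part1.txt")
    write.table(df[1000001:2000000, ], "out_part2.txt")
    write.table(df[2000001:3000000, ], "out_part3.txt")
    write.table(df[3000001:3900000, ], "out_part4.txt")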

After hearing anecdotal stories for years about how R is unsuited for large datasets, this is the first time I have actually encountered a problem of this kind. I wonder whether there are other approaches (low-level "dumping" of the file directly to disk?) or whether there is some package unknown to me that can handle export of large files of this type efficiently?

Solution

At a guess, your machine is short on RAM, and so R is having to use the swap file, which slows things down. If you are being paid to code, then buying more RAM will probably be cheaper than you writing new code.

That said, there are some possibilities. You can export the file to a database and then use that database's facility for writing to a text file. JD Long's answer to this question tells you how to read in files in this way; it shouldn't be too difficult to reverse the process. Alternatively the bigmemory and ff packages (as mentioned by Davy) could be used for writing such files.
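
As an illustration of the database route, here is a minimal sketch assuming the DBI and RSQLite packages (these are not named in the answer); the text export itself would then be done with the database's own tools, for example the sqlite3 command-line shell:

    library(DBI)

    # Push the data frame into an on-disk SQLite database; RSQLite streams
    # it to disk rather than building the whole text file in memory.
    con <- dbConnect(RSQLite::SQLite(), "big_export.sqlite")
    dbWriteTable(con, "mydata", df)
    dbDisconnect(con)

    # The text export is then done outside R, e.g.:
    #   sqlite3 -header -csv big_export.sqlite "SELECT * FROM mydata;" > output.csv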

OTHER TIPS

1) If your file is all character strings, then write.table() saves it much faster if you first convert it to a matrix.

2) Also write it out in chunks of, say, 1,000,000 rows, but always to the same file, using the argument append = TRUE (see the sketch after this list).
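
A minimal sketch combining both tips (the file name and chunk size are illustrative):

    # Convert once to a character matrix (tip 1), then append 1,000,000-row
    # chunks to the same file (tip 2).
    m      <- as.matrix(df)
    chunk  <- 1000000L
    starts <- seq(1L, nrow(m), by = chunk)

    for (i in seq_along(starts)) {
      rows <- starts[i]:min(starts[i] + chunk - 1L, nrow(m))
      write.table(m[rows, , drop = FALSE], "output.txt",
                  append    = i > 1,    # overwrite on the first chunk only
                  col.names = i == 1,   # write the header once
                  row.names = FALSE)
    }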

Update

After extensive work by Matt Dowle parallelizing and adding other efficiency improvements, fwrite is now as much as 15x faster than write.csv. See the linked answer for more.


Now data.table has an fwrite function contributed by Otto Seiskari which seems to be about twice as fast as write.csv in general. See here for some benchmarks.

library(data.table) 
fwrite(DF, "output.csv")

Note that row names are excluded, since the data.table type makes no use of them.

Though I only use it to read very large files (10+ GB), I believe the ff package has functions for writing extremely large data frames.
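
A rough sketch of what that could look like (an assumption on my part; check the ff documentation for the exact interface of as.ffdf and write.table.ffdf):

    library(ff)

    # ff stores character data as factors, so convert character columns first
    # (an assumption about this data; adjust for your columns).
    df[] <- lapply(df, factor)

    # Move the data frame into an on-disk ffdf object, then let
    # write.table.ffdf stream it out to a text file chunk by chunk.
    fdf <- as.ffdf(df)
    write.table.ffdf(fdf, file = "output.txt")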

Well, as the answer with really large files and R often is, it's best to offload this kind of work to a database. SPSS has ODBC connectivity, and the RODBC package provides an interface from R to SQL.
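
A minimal sketch of that route, assuming an ODBC data source is already configured (the DSN and table names here are illustrative):

    library(RODBC)

    # Open a connection to a pre-configured ODBC data source and push the
    # data frame into a table; SPSS (or anything else with ODBC access)
    # can then read it from the database directly.
    ch <- odbcConnect("mydsn")
    sqlSave(ch, df, tablename = "mydata", rownames = FALSE, fast = TRUE)
    odbcClose(ch)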

I note that, in the process of checking my information, I have been scooped.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow