Question

I have been working with large datasets lately (more than 400,000 lines). So far I have been using the xts format, which worked fine for "small" datasets of a few tens of thousands of elements.

Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into an xts object.

It is my understanding that R should be able to hold vectors with up to 2^32-1 elements (or 2^64-1, depending on the version). Hence, I came to the conclusion that xts might have some limitation of its own, but I could not find the answer in the documentation (maybe I was a bit overconfident about my understanding of the theoretically possible vector size).

To sum up, I would like to know:

  1. Does xts indeed have a size limitation?
  2. What do you think is the smartest way to handle large time series? (I was thinking about splitting the analysis into several smaller datasets.)
  3. I don't get an error message; R simply shuts down automatically. Is this known behavior?

Summary

  1. The same as for R itself, and it depends on the build you are running (32-bit vs. 64-bit); either way the limit is extremely large (see the checks below).
  2. Chunking the data is indeed a good idea, but it is not strictly needed here.
  3. The crash came from a bug in R 2.11.0 that was fixed in R 2.11.1: there was a problem with long date vectors (here, the indexes of the xts object).
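
As a quick sanity check on points 1 and 3, a minimal sketch you can run on your own machine (nothing here is specific to xts):

    ## Which R build you are running (the crash was fixed in 2.11.1),
    ## and the vector-length cap that applies.
    R.version.string                  # e.g. "R version 2.11.1 (2010-05-31)"
    .Machine$integer.max              # 2147483647 == 2^31 - 1, the index limit
    8 * .Machine$integer.max / 2^30   # ~16: GB needed for one full-length double vector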

Solution

Regarding your first two questions, my $0.02:

  1. Yes, there is a limit of 2^31-1 elements for R vectors. It comes from the 32-bit integer indexing logic, which reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Search the r-devel list archives for details; this has come up before. The xts package does not impose any additional restriction.

  2. Yes, splitting things into manageable chunks is the smartest approach (see the sketch below). I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or I keep my data sets sane).
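
For illustration, a minimal sketch of that chunked approach, assuming made-up data and a placeholder per-chunk computation; split.xts does the splitting by calendar period:

    library(xts)

    ## Hypothetical data standing in for your 400k-row series.
    set.seed(1)
    idx <- seq(as.POSIXct("2010-01-01"), by = "min", length.out = 4e5)
    x   <- xts(rnorm(4e5), order.by = idx)

    ## Split into monthly chunks and analyse each piece separately,
    ## so no single working object has to be huge.
    chunks  <- split(x, f = "months")   # split.xts returns a list of xts pieces
    results <- lapply(chunks, function(ch)
      c(n = nrow(ch), mean = mean(coredata(ch)), sd = sd(coredata(ch))))
    do.call(rbind, results)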

There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and confirm that you really need all 400k rows at once.
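
If the crash happens while pulling rows out of the database, one such approach is to fetch in batches via DBI rather than in one giant query. A sketch, with made-up table and column names ('prices', 'ts', 'value') and SQLite standing in for whatever backend you actually use:

    library(DBI)
    library(xts)

    con <- dbConnect(RSQLite::SQLite(), "prices.db")   # hypothetical database
    res <- dbSendQuery(con, "SELECT ts, value FROM prices ORDER BY ts")

    pieces <- list()
    while (!dbHasCompleted(res)) {
      batch <- dbFetch(res, n = 50000)                 # 50k rows per round trip
      pieces[[length(pieces) + 1L]] <-
        xts(batch$value, order.by = as.POSIXct(batch$ts, origin = "1970-01-01"))
    }
    dbClearResult(res)
    dbDisconnect(con)

    big <- do.call(rbind, pieces)   # rbind.xts keeps observations in time order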

Licensed under: CC-BY-SA with attribution