Question

Are there any rules of thumb for knowing when R will have trouble handling a given dataset in RAM (given a PC configuration)?

For example, I have heard that one rule of thumb is to allow 8 bytes per cell. So if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB, and hence on most home computers we would probably have to store the data on disk and access it in chunks.
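(For reference, this is the back-of-the-envelope arithmetic I have in mind, assuming every cell is stored as an 8-byte double:)

# 1,000,000 rows x 1,000 columns x 8 bytes per double, in binary gigabytes
1e6 * 1e3 * 8 / 2^30  # ~7.45 GB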

Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations like data tidying, data visualisation and some analysis (regression).

PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.

Solution

The memory footprint of some vectors at different sizes, in bytes.

n <- c(1, 1e3, 1e6)
names(n) <- n
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers                                 = integer(n),
        Floats                                   = numeric(n),
        Logicals                                 = logical(n),
        "Empty strings"                          = character(n),
        "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
        "Factor of empty strings"                = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
        Raw                                      = raw(n),
        "Empty list"                             = vector("list", n)
      ),
      object.size
    )
  }
)

Some values differ between 64-bit and 32-bit R (a quick way to check which build you are running follows the listings below).

## Under 64-bit R
##                                          1   1000     1e+06
## Integers                                48   4040   4000040
## Floats                                  48   8040   8000040
## Logicals                                48   4040   4000040
## Empty strings                           96   8088   8000088
## Identical strings, nchar=100           216   8208   8000208
## Distinct strings, nchar=100            216 176040 176000040
## Factor of empty strings                464   4456   4000456
## Factor of identical strings, nchar=100 584   4576   4000576
## Factor of distinct strings, nchar=100  584 180400 180000400
## Raw                                     48   1040   1000040
## Empty list                              48   8040   8000040

## Under 32-bit R
##                                          1   1000     1e+06
## Integers                                32   4024   4000024
## Floats                                  32   8024   8000024
## Logicals                                32   4024   4000024
## Empty strings                           64   4056   4000056
## Identical strings, nchar=100           184   4176   4000176
## Distinct strings, nchar=100            184 156024 156000024
## Factor of empty strings                272   4264   4000264
## Factor of identical strings, nchar=100 392   4384   4000384
## Factor of distinct strings, nchar=100  392 160224 160000224
## Raw                                     32   1024   1000024
## Empty list                              32   4024   4000024
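If you are not sure which build you are running, the pointer size reported by .Machine is a quick check: 8 bytes under 64-bit R, 4 bytes under 32-bit R.

.Machine$sizeof.pointer  # 8 under 64-bit R, 4 under 32-bit R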

Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
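A minimal illustration of that point (exact byte counts will vary slightly between R versions):

x <- sample(c("treatment", "control"), 1e5, replace = TRUE)
object.size(x)          # ~800 KB: one 8-byte pointer per element into the string cache
object.size(factor(x))  # ~400 KB: one 4-byte integer code per element, plus 2 levels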

OTHER TIPS

The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes of overhead to store information about the vector, plus 8 bytes for each element. You can use the object.size() function to see this:

object.size(numeric())  # an empty vector (40 bytes)  
object.size(c(1))       # 48 bytes
object.size(c(1.2, 4))  # 56 bytes

You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (which is to be expected, since they are just vectors with a dim attribute).

object.size(matrix())           # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2))  # 216 bytes
object.size(matrix(1:6, 3, 2))  # 232 bytes (2 * 8 more after adding 2 elements)

Data frames are more complicated (they have more attributes than a simple vector), so they grow faster:

object.size(data.frame())                  # 560 bytes
object.size(data.frame(x = 1))             # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5))  # 840 bytes

A good reference on memory is Hadley Wickham's Advanced R Programming.

All of this said, remember that in order to do analyses in R you need some cushion in memory, because R will often copy the data you are working on.
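One way to see that copying happen is tracemem(), which reports whenever R duplicates an object (it only works if your R build has memory profiling enabled, as the CRAN binaries do). A rough sketch:

df <- data.frame(x = runif(1e6), y = runif(1e6))
tracemem(df)       # start reporting copies of df
df$x <- df$x * 2   # replacing a column prints a tracemem[...] message for each copy
untracemem(df)     # stop tracing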

I cannot really answer your question fully, and I strongly suspect that several factors will affect what works in practice, but if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the R Internals documentation.

You will see that the amount of memory required depends on the type of data being held. If you are talking about numbers, these would typically be integer or numeric/real data. These correspond to the R internal types INTSXP and REALSXP respectively, which are described as follows:

INTSXP

length, truelength followed by a block of C ints (which are 32 bits on all R platforms).

REALSXP

length, truelength followed by a block of C doubles

A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset exclusively containing numeric values. Similarly, with integer data, each element would occupy 4 bytes.
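A quick check of those per-element sizes, using vectors long enough that the fixed header is negligible:

object.size(integer(1e6))  # ~4 MB: 4 bytes per element
object.size(numeric(1e6))  # ~8 MB: 8 bytes per element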

Trying to sum up the answers (please correct me if I am wrong):

If we do not want to underestimate the memory needed, a safe estimate (one that will almost surely overestimate) seems to be 40 bytes per column plus 8 bytes per cell, multiplied by a "cushion factor" (around 3) to allow for data copying while tidying, plotting and analysing.

In a function:

howMuchRAM <- function(ncol, nrow, cushion = 3) {
  # 40 bytes of overhead per column
  colBytes <- ncol * 40

  # 8 bytes per cell
  cellBytes <- ncol * nrow * 8

  # estimated size of the object itself
  objectSize <- colBytes + cellBytes

  # RAM needed, allowing for copies made while tidying, plotting and analysing
  RAM <- objectSize * cushion

  cat(
    "Your dataset will have up to", format(objectSize / 2^20, digits = 1),
    "MB and you will probably need", format(RAM / 2^30, digits = 1),
    "GB of RAM to deal with it.\n"
  )

  invisible(list(
    object.size = objectSize, RAM = RAM,
    ncol = ncol, nrow = nrow, cushion = cushion
  ))
}

So in the case of a 1,000,000 × 1,000 data frame:

howMuchRAM(ncol=1000,nrow=1000000)

Your dataset will have up to 7629 MB and you will probably need 22 GB of RAM to deal with it.

But as we can see in the answers, object sizes vary by type, and vectors whose cells are not all unique (factors with repeated levels, for example) take less space, so this estimate should be quite conservative.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow