Question

I need help speeding up a small piece of code. I have a data.frame "df" and would like to create new columns and fill them with given values. Here is a sample of how I do it:

df <- as.data.frame(1:20)

a <- c(31:50)
b <- c(201:220)

df[c("A","B")] <- c(a, b) 

The problem is that my data is big (several million rows) and this takes more time than expected, so I think there must be a better way. Any ideas? Thank you!


Solution

Extending a data.frame (or any object) causes R to copy the whole object when you add a new column. The data.table package offers some great performance features built on top of the data.frame model. Among other things, it lets you add columns in place. See the code below for a simple demo:

library(data.table)
a2 <- data.table(x = 1:10)
a2[, y := 21:30]   ## creates y inside a2 without copying it
summary(a2)        ## works just like it does for a data.frame

The resulting object (a data.table) will play nicely with (almost) all code that uses data.frame. It has an alternative syntax for most operations, which are performed much more efficiently. It's worth spending some time looking into.


If you'd like to add multiple columns, then:

a2[, `:=`(y=21:30, z=31:40)]
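To check the in-place claim on your own data, one option is to compare the table's memory address before and after the assignment. This is a minimal sketch using data.table's address() helper; the object name a3 is made up for illustration and is not part of the original answer.

```r
library(data.table)

# Hypothetical example table (a3 is an illustrative name)
a3 <- data.table(x = 1:10)

addr_before <- address(a3)          # memory address of the table itself
a3[, `:=`(y = 21:30, z = 31:40)]    # add two columns by reference
addr_after <- address(a3)

# TRUE here means the table was modified in place, not replaced by a copy
identical(addr_before, addr_after)
```

The same check can be run against a plain data.frame assignment if you want to compare the two behaviours on your R version.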

Edit: @Thell has taken the time to prepare benchmarks of different methods for extending a data.frame. They suggest that, at least on R 3.1.0, plain data.frame assignment is faster despite the copying. Keep this in mind as an alternative and see which one works best for your code.

OTHER TIPS

You stated you have 'some million' rows, so here is an excerpt of benchmarks with 3 columns of 10 million rows.

R 3.0.3 (on 32bit Celeron system w/ 2GB memory)

## Unit: microseconds
##           expr     min      lq  median      uq     max neval
##    dt.addC(dt)   35.38   56.03   64.82   67.77   185.2   100
##     df.add(df)  181.43  214.80  221.42  229.81   366.6   100
##    dt.addB(dt) 2359.54 2457.09 2513.11 2577.00  6398.0   100
##    dt.addA(dt) 2913.74 2995.64 3047.29 3125.82  6791.1   100

R 3.1.0 (on 64bit Haswell i7 w/ 24 GB memory)

## Unit: microseconds
##           expr       min        lq    median        uq       max neval
##     df.add(df)     10.25     30.74     33.36     48.53     84.25   100
##    dt.addC(dt)  27120.45  27563.79  27990.22  29642.46  87637.63   100
##    dt.addB(dt)  38452.71  39018.90  46225.69  50142.46 130893.53   100
##    dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17   100

Note: The difference between data.frame and data.table on 3.1.0 can be explained by the new way that R 3.1.0 handles assignments. Arun (one of the data.table authors) explains it in this chat log.
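As a rough illustration of that change (this is an assumption about what the chat log covers, since it is not reproduced here): from R 3.1.0 on, adding a column to a data.frame makes only a shallow copy, so the existing column vectors are reused rather than duplicated. The sketch below checks this with data.table's address(); the names df2, x and y are illustrative.

```r
library(data.table)   # only used here for address()

df2 <- data.frame(x = 1:10)     # hypothetical example data
col_before <- address(df2$x)    # address of the existing column vector

df2$y <- 21:30                  # base-R column assignment

col_after <- address(df2$x)

# On R >= 3.1.0 this is expected to be TRUE: column x was not duplicated
identical(col_before, col_after)
```

If the existing columns are not duplicated, the cost of the assignment no longer grows with the number of rows already in the data.frame, which would be consistent with the near-constant df.add timings on 3.1.0.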


df.add (a common base way to add columns to a data.frame)

df$b <- b.vals
df$c <- c.vals

dt.addA (the common base data.frame method applied to data.table)

dt$b <- b.vals
dt$c <- c.vals

dt.addB (a common data.table way to add columns)

dt[,`:=`(b=b.vals, c=c.vals)]

dt.addC (another data.table method of setting values, from Arun)

## to reduce the overhead due to `[.data.table` on small data.frames.
set(dt, j="b", value=b.vals)
set(dt, j="c", value=c.vals)
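The original benchmark script is not shown, so the following is only a sketch of how the four variants could be wrapped and timed. It assumes the microbenchmark package listed in the session info; the fill values, the starting column a and the row count n are made up for illustration.

```r
library(data.table)
library(microbenchmark)

n <- 1e4                      # assumed row count; adjust to match your data
b.vals <- seq_len(n) + 1L     # assumed fill values
c.vals <- seq_len(n) + 2L

df <- data.frame(a = seq_len(n))
dt <- data.table(a = seq_len(n))

# Wrappers named after the labels used in the benchmark tables
df.add  <- function(df) { df$b <- b.vals; df$c <- c.vals; df }
dt.addA <- function(dt) { dt$b <- b.vals; dt$c <- c.vals; dt }
dt.addB <- function(dt) dt[, `:=`(b = b.vals, c = c.vals)]
dt.addC <- function(dt) {
  set(dt, j = "b", value = b.vals)
  set(dt, j = "c", value = c.vals)
}

res <- microbenchmark(df.add(df), dt.addA(dt), dt.addB(dt), dt.addC(dt),
                      times = 100)
print(res)
```

Note that dt.addB and dt.addC modify dt by reference, so after their first run they overwrite existing columns instead of adding new ones, while df.add and dt.addA only change a local copy inside the function. Keep that in mind when interpreting timings from a harness like this.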

Benchmarks for other data set sizes

R 3.1.0 on i7 System

# Test @ 1,000
## Unit: microseconds
##           expr      min      lq  median      uq      max neval
##    dt.addC(dt)    6.007   10.38   11.71   12.50    20.79   100
##     df.add(df)   11.534   19.49   20.57   21.32   940.63   100
##    dt.addB(dt)  326.166  344.85  351.43  365.47  1412.86   100
##    dt.addA(dt)  798.777  850.47  867.60  888.23  1935.20   100

##            test relative
## 1    df.add(df)        1
## 4   dt.addC(dt)        1
## 3   dt.addB(dt)       35
## 2   dt.addA(dt)       87

# Test @ 10,000
## Unit: microseconds
##           expr     min      lq  median      uq     max neval
##    dt.addC(dt)   11.13   17.88   19.20   20.80   988.9   100
##     df.add(df)   10.97   20.56   22.65   24.94    41.1   100
##    dt.addB(dt)  333.17  364.15  389.87  419.08  1347.0   100
##    dt.addA(dt)  823.99  875.88  897.10 1076.90 29233.1   100

##            test relative
## 1    df.add(df)        1
## 4   dt.addC(dt)        1
## 3   dt.addB(dt)       19
## 2   dt.addA(dt)       50

# Test @ 10,000,000
## Unit: microseconds
##           expr       min        lq    median        uq       max neval
##     df.add(df)     10.25     30.74     33.36     48.53     84.25   100
##    dt.addC(dt)  27120.45  27563.79  27990.22  29642.46  87637.63   100
##    dt.addB(dt)  38452.71  39018.90  46225.69  50142.46 130893.53   100
##    dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17   100

##            test relative
## 1    df.add(df)        1
## 4   dt.addC(dt)     1536
## 3   dt.addB(dt)     2213
## 2   dt.addA(dt)    11667

R 3.0.3 on Celeron System

# Test @ 1,000
## Unit: microseconds
##           expr     min      lq  median      uq     max neval
##    dt.addC(dt)   55.78   82.58   94.48   96.14   176.1   100
##     df.add(df)  182.65  215.36  220.10  225.03   361.6   100
##    dt.addB(dt) 2699.10 2774.61 2827.34 2894.23  3442.2   100
##    dt.addA(dt) 5259.89 6066.00 6122.37 6231.50 10265.9   100

##            test relative
## 4   dt.addC(dt)    1.000
## 1    df.add(df)    2.889
## 3   dt.addB(dt)   32.444
## 6 dfadd2dtB(dt)   69.667
## 2   dt.addA(dt)   69.889
## 5 dfadd2dtA(dt)   96.000

# Test @ 10,000
## Unit: microseconds
##           expr    min     lq median      uq   max neval
##    dt.addC(dt)  134.0  162.8  168.7   185.8  4135   100
##     df.add(df)  576.7  616.4  633.7   663.2 72749   100
##    dt.addB(dt) 2789.8 2932.6 2993.0  3054.7  6702   100
##    dt.addA(dt) 5400.6 6701.5 6819.0 10079.2 11518   100

##            test relative
## 4   dt.addC(dt)    1.000
## 1    df.add(df)    8.143
## 3   dt.addB(dt)   14.619
## 2   dt.addA(dt)   34.286
## 6 dfadd2dtB(dt)   34.381
## 5 dfadd2dtA(dt)   53.810

# Test @ 10,000,000
## Unit: milliseconds
##           expr    min     lq median     uq    max neval
##    dt.addC(dt)  121.1  146.2  147.2  161.8  303.8   100
##    dt.addB(dt)  197.7  225.4  228.0  270.2  380.7   100
##     df.add(df)  767.8  823.5  857.0  938.2 1156.9   100
##    dt.addA(dt)  709.6 1071.9 1112.6 1170.1 1343.9   100

##            test relative
## 4   dt.addC(dt)    1.000
## 3   dt.addB(dt)    1.566
## 1    df.add(df)    6.172
## 2   dt.addA(dt)    7.594

System/Session Info...

Intel® Core™ i7-4700MQ Processor
24 GB

## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rbenchmark_1.0.0     microbenchmark_1.3-0 data.table_1.9.2    
## 

## "Linux" "3.11.0-19-generic" "x86_64"


Intel(R) Celeron(R) CPU 2.53GHz
2 GB

## R version 3.0.3 (2014-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rbenchmark_1.0.0     microbenchmark_1.3-0 data.table_1.9.2    
## [4] knitr_1.5           
## 

## "Linux" "3.2.0-60-generic-pae" "i686"

Why don't you simply do the following:

df <- data.frame(x = 1:20)
df$a <- 31:50
df$b <- 201:220
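If you would rather add both columns in one statement, as in the question, a small variation is to assign a list to the new column names. This is a sketch in base R only, and it makes no claim about speed relative to the two separate assignments.

```r
df <- data.frame(x = 1:20)

# Assign several new columns at once; each list element becomes one column
df[c("a", "b")] <- list(31:50, 201:220)

str(df)
```

Using a list keeps each column's type intact, whereas combining the values with c() first would coerce them to a single common type.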

There's an excellent ebook called "R Fundamentals and Graphics" which will give you a solid understanding of the basics of R and its graphical features.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow