Question

I have a very large data frame (150,000,000 rows) with a format like this:

df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))

pnr is a person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?

The following is a solution, but it is slow because of the merge() step:

prns = as.data.frame(unique(df$pnr))  # one row per person
names(prns)[1] = "pnr"
prns$s = rbinom(nrow(prns), 1, 0.1)   # flag roughly 10% of the persons

df = merge(df, prns)                  # the slow step
df2 = df[df$s == 1, ]

Solution

I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).

I'll also show some right and wrong ways to do things :-)

Here's the sample data.

library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))

We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.

system.time(setDT(DT))
#    user  system elapsed 
#   0.001   0.000   0.001 

## Setting the key takes some time, but is worth it
system.time(setkey(DT, person)) 
#    user  system elapsed 
#   0.620   0.025   0.646

I can't think of a more efficient way to select your "person" values than the following, so I've removed this step from the benchmarks; it is common to all approaches.

## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)

For convenience, the different tests are presented as functions...

## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% B, ]
}

## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% B), ]
}

## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% B)
}

## The "wrong" way to do this with "data.table"
fun3a <- function() {
  DT[which(person %in% B)]
}

## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
  DT[J(B)]
}
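
As a side note, not included in the benchmark below: with a more recent "data.table" (roughly 1.9.6 or later) you can also join without setting a key by passing the on argument. A minimal sketch, with fun3c as a purely illustrative name:

## Keyless join on "person"; assumes data.table >= 1.9.6 for `on`
fun3c <- function() {
  DT[.(B), on = "person"]
}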

Now, we can benchmark:

## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
#     expr       min        lq    median        uq       max neval
#  fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220    20
#  fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169    20
#   fun2() 381.78909 394.16716 395.49341 399.01202 417.79044    20
#  fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128    20
#  fun3b()  28.77801  28.91648  29.01535  29.37596  42.34043    20

Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.


summary() shows that the results are the same. (The row order of the "data.table" result would be different, since that table has been sorted.)

summary(fun1a())
#      person         value         
#  Min.   :  16   Min.   :0.000002  
#  1st Qu.:2424   1st Qu.:0.250988  
#  Median :5075   Median :0.500259  
#  Mean   :4958   Mean   :0.500349  
#  3rd Qu.:7434   3rd Qu.:0.749601  
#  Max.   :9973   Max.   :1.000000  

summary(fun2())
#      person         value         
#  Min.   :  16   Min.   :0.000002  
#  1st Qu.:2424   1st Qu.:0.250988  
#  Median :5075   Median :0.500259  
#  Mean   :4958   Mean   :0.500349  
#  3rd Qu.:7434   3rd Qu.:0.749601  
#  Max.   :9973   Max.   :1.000000  

summary(fun3b())
#      person         value         
#  Min.   :  16   Min.   :0.000002  
#  1st Qu.:2424   1st Qu.:0.250988  
#  Median :5075   Median :0.500259  
#  Mean   :4958   Mean   :0.500349  
#  3rd Qu.:7434   3rd Qu.:0.749601  
#  Max.   :9973   Max.   :1.000000 
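
As an extra check, not part of the original comparison: you can also confirm that the selections themselves match once the rows are put in the same order. A minimal sketch; this should return TRUE if the results agree:

r1 <- fun1a()
r3 <- as.data.frame(fun3b())
## Put both results in the same row order, then compare values only
all.equal(r1[order(r1$person, r1$value), ],
          r3[order(r3$person, r3$value), ],
          check.attributes = FALSE)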

Other tips

In base R, to sample 10% of the rows, rounding the row count up:

> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
##    pnr         x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4  504 0.7825318

or rounding the row count down:

> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
##    pnr           x
## 43 530 0.449985180
## 35 524 0.996350657
## 2  502 0.499871966
## 25 518 0.005199058

or to sample 10% of the pnr column, rounding up:

> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514 

ADD:

If you're looking to sample 10% of the people (unique pnr IDs) and return those people along with their respective data, I think you want:

> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
##    pnr         x
## 1  502 0.7630667
## 2  502 0.4998720
## 3  502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472

PS: I would wait for a dplyr answer. It will likely be quicker on 15 million rows.
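
If you do go the dplyr route, here is a minimal sketch of the person-level sampling, assuming a recent dplyr (1.0.0 or later, which provides slice_sample()); the name sampled_pnr is purely illustrative:

library(dplyr)

## Sample 10% of the unique person ids, then keep every row for those people
sampled_pnr <- df %>%
  distinct(pnr) %>%
  slice_sample(prop = 0.1) %>%
  pull(pnr)

df2 <- filter(df, pnr %in% sampled_pnr)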

If you don't necessarily need a thoroughly random sample, you could do

filter(df, pnr %% 10 == 0)

which takes every 10th person (and you can get 10 different samples by changing == 0 to == 1, and so on). You could make this properly random by re-allocating the IDs at random; for your toy example that is fairly trivial with sample(15)[(df$pnr - 500)/2], but reversing the mapping of pnr onto a set suitable for sample() may be less easy in the real-world case. A sketch of the idea follows below.
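
For the toy data above, a small sketch of that re-allocation idea (the seed is set here only to make the sketch reproducible; rand_id is an illustrative name):

set.seed(42)                               # reproducibility of this sketch only
rand_id <- sample(15)[(df$pnr - 500) / 2]  # give each of the 15 people a random new id in 1:15
df2 <- df[rand_id %% 10 == 0, ]            # keep people whose new id is a multiple of 10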

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow