Question
I have a data frame made up of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). The data frame looks like:
ID Year Temp ph
1 P1 1996 11.3 6.80
2 P1 1996 9.7 6.90
3 P1 1997 9.8 7.10
...
2000 P2 1997 10.5 6.90
2001 P2 1997 9.9 7.00
2002 P2 1997 10.0 6.93
I want to take 500 random rows for every ID (so 500 for P1, 500 for P2, ...) and create a new df. I tried:
new_df <- df[df$ID %in% sample(unique(df$ID), 500), ]
But it randomly picks one ID, whereas I need 500 random rows for every ID.
Solution 2
Try this:
library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
OTHER TIPS
This is available as the slice_sample function in dplyr:
library(dplyr)
new_df <- df %>% group_by(ID) %>% slice_sample(n=500)
In older versions of dplyr (before 1.0.0), the function was called sample_n, which has since been superseded.
Here is one approach in base R.
First, the prerequisite sample data to work with:
set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))
mydf
# ID X1 X2 X3
# 1 1 -0.6264538 -0.04493361 1.35867955
# 2 1 0.1836433 -0.01619026 -0.10278773
# 3 1 -0.8356286 0.94383621 0.38767161
# 4 1 1.5952808 0.82122120 -0.05380504
# 5 1 0.3295078 0.59390132 -1.37705956
# 6 2 -0.8204684 0.91897737 -0.41499456
# 7 2 0.4874291 0.78213630 -0.39428995
# 8 2 0.7383247 0.07456498 -0.05931340
# 9 2 0.5757814 -1.98935170 1.10002537
# 10 2 -0.3053884 0.61982575 0.76317575
# 11 3 1.5117812 -0.05612874 -0.16452360
# 12 3 0.3898432 -0.15579551 -0.25336168
# 13 3 -0.6212406 -1.47075238 0.69696338
# 14 3 -2.2146999 -0.47815006 0.55666320
# 15 3 1.1249309 0.41794156 -0.68875569
Second, the sampling:
do.call(rbind,
        lapply(split(mydf, mydf$ID),
               function(x) x[sample(nrow(x), 3), ]))
# ID X1 X2 X3
# 1.2 1 0.1836433 -0.01619026 -0.1027877
# 1.1 1 -0.6264538 -0.04493361 1.3586796
# 1.5 1 0.3295078 0.59390132 -1.3770596
# 2.10 2 -0.3053884 0.61982575 0.7631757
# 2.9 2 0.5757814 -1.98935170 1.1000254
# 2.8 2 0.7383247 0.07456498 -0.0593134
# 3.13 3 -0.6212406 -1.47075238 0.6969634
# 3.12 3 0.3898432 -0.15579551 -0.2533617
# 3.15 3 1.1249309 0.41794156 -0.6887557
There is also strata from the sampling package, which is convenient when you want to sample different sizes from each group:
# install.packages("sampling")
library(sampling)
set.seed(1)
x <- strata(mydf, "ID", size = c(2, 3, 2), method = "srswor")
getdata(mydf, x)
# X1 X2 X3 ID ID_unit Prob Stratum
# 2 0.1836433 -0.01619026 -0.1027877 1 2 0.4 1
# 5 0.3295078 0.59390132 -1.3770596 1 5 0.4 1
# 6 -0.8204684 0.91897737 -0.4149946 2 6 0.6 2
# 8 0.7383247 0.07456498 -0.0593134 2 8 0.6 2
# 9 0.5757814 -1.98935170 1.1000254 2 9 0.6 2
# 14 -2.2146999 -0.47815006 0.5566632 3 14 0.4 3
# 15 1.1249309 0.41794156 -0.6887557 3 15 0.4 3
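For comparison, here is a minimal base R sketch of the same idea, assuming the per-group sizes are supplied as a named vector keyed by ID. The names sample_by_size and sizes are made up for illustration and do not come from any package:

```r
# Toy data in the same shape as mydf above: 3 IDs with 5 rows each
set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))

# Hypothetical helper: draw sizes[[id]] rows without replacement from each group
sample_by_size <- function(df, group, sizes) {
  pieces <- split(df, df[[group]])
  picked <- lapply(names(pieces), function(id) {
    g <- pieces[[id]]
    k <- min(sizes[[id]], nrow(g))  # never ask for more rows than the group has
    g[sample(nrow(g), k), , drop = FALSE]
  })
  do.call(rbind, picked)
}

sizes <- c("1" = 2, "2" = 3, "3" = 2)  # assumed sizes, keyed by ID
res <- sample_by_size(mydf, "ID", sizes)
table(res$ID)
```

The min() guard means a group smaller than its requested size is returned whole instead of raising an error.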
In case you have big datasets, a data.table solution could go like this:
library(data.table)
# Generate 26 mil rows random data
set.seed(2019)
dt <- data.table(c1 = sample(length(LETTERS)*10^6),
c2 = sample(LETTERS, replace = TRUE))
# For each letter, sample 500 rows
dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]
# We indeed sampled 500 rows for each letter
dt_sample[, .N, by = c2][order(c2)]
#> c2 N
#> 1: A 500
#> 2: D 500
#> 3: G 500
#> 4: I 500
#> 5: M 500
#> 6: N 500
#> 7: O 500
#> 8: P 500
#> 9: Q 500
#> 10: R 500
#> 11: S 500
#> 12: T 500
#> 13: U 500
#> 14: V 500
#> 15: W 500
#> 16: Y 500
#> 17: Z 500
Created on 2019-04-23 by the reprex package (v0.2.1)
If your data is unbalanced, in the sense that some groups have fewer rows than the desired sample size, guard against this by capping the sample size at min(500, .N) (see "sample random rows within each group in a data.table"). For example:
dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
An approach for when one of the IDs has fewer than 500 rows. Here I used the mtcars data set, with n = 8 standing in for 500:
n <- 8
df <- mtcars
df$ID <- df$cyl
FUN <- function(x, n) {
if (length(x) <= n) return(x)
x[x %in% sample(x, n)]
}
df[unlist(lapply(split(seq_len(nrow(df)), df$ID), FUN, n = n)), ]
Here's an elegant solution based on data.table. You can randomly draw IDs from a panel data set (balanced or unbalanced) in three simple steps:
Step 1: Store unique IDs from your original data set in a vector (my data set is called "main" and the identifier is called "id"):
ids <- unique(main$id)
Step 2: Randomly draw IDs from the vector from step 1. In the example below, I randomly draw 50 IDs from the vector "ids" and store them in the new vector "draw":
draw <- sample(ids, 50)
Step 3: Subset rows in your original data set based on matches with the IDs drawn in step 2.
rsample <- main[main$id %in% draw, ]
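Note that these three steps draw whole IDs rather than a fixed number of rows within each ID. A small sketch on made-up data shows the effect; the panel below is hypothetical, built so that id k has exactly k rows:

```r
# Toy unbalanced panel: 6 ids with 1 to 6 rows each (hypothetical data)
set.seed(42)
main <- data.frame(id = rep(1:6, times = 1:6), value = rnorm(21))

ids <- unique(main$id)
draw <- sample(ids, 3)                 # draw 3 ids, not 3 rows
rsample <- main[main$id %in% draw, ]

# Every row of a drawn id is kept; rows of other ids are dropped entirely
table(rsample$id)
```

If the goal is a fixed number of rows per ID, as in the question, one of the per-group approaches is the better fit.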
mydata1 is your original data (not tested):
mydata2 <- split(mydata1, mydata1$ID)
names(mydata2) <- paste0("mydata2", seq_along(mydata2))
mysample <- Map(function(x) x[sample(nrow(x), size = 500, replace = FALSE), ], mydata2)
library(plyr) # for rbinding the mysample
ldply(mysample)
This is not a very elegant solution, but it may work.
library(data.table)
df <- as.data.table(df)
f <- list()
# Sample 500 rows from each ID in turn and collect the pieces in a named list
for (i in as.character(unique(df$ID))) {
  f[[i]] <- df[ID == i][sample(.N, 500)]
}
dfnew <- rbindlist(f)
library(data.table) #1
df <- data.table(df) #2
df[,group_num := sample(2,.N,replace = TRUE,prob = c(500,.N-500)/.N),by = "ID"] #3
df_sample = df[group_num == 1,] #4
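Note that line #3 assigns each row to the sample with probability 500/.N, so it returns roughly, not exactly, 500 rows per ID. To get exactly 500, you can change lines #3 and #4 to: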
df[,random_num := sample(.N,.N),by="ID"]
df_sample = df[random_num <=500,]
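The same random-rank idea can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is only an illustration on made-up data; n = 5 stands in for 500, and the toy data frame is hypothetical:

```r
# Toy data: three IDs with 10, 4 and 7 rows (hypothetical)
set.seed(123)
df <- data.frame(ID = rep(c("P1", "P2", "P3"), times = c(10, 4, 7)),
                 Temp = round(rnorm(21, mean = 10), 1))

n <- 5  # rows wanted per ID (500 in the question)

# Give each row a random rank within its ID, then keep ranks 1..n
random_num <- ave(seq_len(nrow(df)), df$ID,
                  FUN = function(i) sample(length(i)))
df_sample <- df[random_num <= n, ]

table(df_sample$ID)
```

Because the ranks within a group run from 1 to the group size, an ID with fewer than n rows is simply kept whole, so no separate guard is needed.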