The bottleneck seems to be in `split`. When simulated on 200 groups with 150,000 observations each, `split` takes 50 seconds out of the total 54 seconds.
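For reference, the base-R pattern assumed here is roughly the sketch below, where `test` is the data.frame generated in the benchmark further down; your exact code may of course differ:

## Sketch of the assumed base-R approach; split() on the full
## data.frame is the step that dominates the runtime.
## Note: split() orders groups by sorted level, so the pairing may
## differ from the appearance-order pairing used below.
s.base <- split(test, test$letters)
notIn.base <- mapply(function(x, y)
                       sum(!x$numbers %in% y$numbers),
                     x = s.base[-length(s.base)], y = s.base[-1L])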
The `split` step can be made drastically faster using `data.table`, as follows:
## test is a data.table here; .SD holds each group's subset of rows,
## so this collects one data.table per group into a list (like split)
s.test <- test[, list(list(.SD)), by=letters]$V1
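To see what this returns, here's a tiny made-up example; `s` ends up as a plain list with one data.table per group, just as `split` would give:

## Toy illustration (hypothetical data) of the idiom above
library(data.table)
dt <- data.table(letters = c("a", "a", "b"), numbers = 1:3)
s  <- dt[, list(list(.SD)), by = letters]$V1
s[[1]]  ## data.table with numbers 1, 2 (group "a")
s[[2]]  ## data.table with numbers 3    (group "b")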
Here's a benchmark on data of your dimensions using `data.table` + `mapply`:
## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters = sample(paste0("id", 1:k), n*k, TRUE),
                   numbers = sample(1e6, n*k, TRUE),
                   stringsAsFactors = FALSE)
require(data.table) ## latest CRAN version is v1.9.2
setDT(test) ## convert to data.table by reference (no copy)
system.time({
  s.test <- test[, list(list(.SD)), by=letters]$V1  ## the fast split
  setattr(s.test, 'names', unique(test$letters))    ## name list elements by reference
  notIn <- mapply(function(x, y)
                    sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
                  x = names(s.test)[1:199], y = names(s.test)[2:200])
})
## user system elapsed
## 4.840 1.643 6.624
That's about a 7.5x speedup on data of your largest dimensions. Would this be sufficient?
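As an aside, if the number of groups ever changes, the hardcoded 1:199 / 2:200 indices can be avoided; a small sketch of the same adjacent-group comparison:

## Same comparison without hardcoding k = 200
nm <- names(s.test)
notIn <- mapply(function(x, y)
                  sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
                x = head(nm, -1L), y = tail(nm, -1L))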