TL;DR
Use comboGrid from RcppAlgos:
library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
The Details
I recently came across the question "R - Expand Grid Without Duplicates", and as I was searching for duplicates, I found this question. The question there isn't exactly a duplicate, as it is a bit more general and has additional restrictions, which @Ferdinand.kraft shed some light on.
It should be noted that many of the solutions here make use of some sort of combination function. The expand.grid function returns the Cartesian product, which is fundamentally different.
The Cartesian product operates on multiple objects which may or may not be the same. Generally speaking, combination functions are applied to a single vector. The same can be said about permutation functions.
Using combination/permutation functions will only produce results comparable to expand.grid if the vectors supplied are identical. As a very simple example, consider v1 = 1:3, v2 = 2:4.
With expand.grid, we see that rows 3 and 5 are duplicates of each other once order is ignored:
expand.grid(1:3, 2:4)
Var1 Var2
1 1 2
2 2 2
3 3 2
4 1 3
5 2 3
6 3 3
7 1 4
8 2 4
9 3 4
Using combn doesn't quite get us to the solution, since it omits the rows (2, 2) and (3, 3):
t(combn(unique(c(1:3, 2:4)), 2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
And with repeats allowed using gtools, we generate too many:
gtools::combinations(4, 2, v = unique(c(1:3, 2:4)), repeats.allowed = TRUE)
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 1 4
[5,] 2 2
[6,] 2 3
[7,] 2 4
[8,] 3 3
[9,] 3 4
[10,] 4 4
In fact, we generate results that are not even in the Cartesian product (i.e. the expand.grid solution), namely (1, 1) and (4, 4).
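To make this concrete, here is a small base R check (my own sketch, not part of gtools or RcppAlgos) that lists which unordered pairs drawn from the union of v1 and v2 cannot be formed by taking one element from each vector. The names cand, feasible, and not_in_product are purely illustrative.

```r
v1 <- 1:3
v2 <- 2:4
u <- sort(unique(c(v1, v2)))

# All unordered pairs with repetition from the union of both vectors
cand <- expand.grid(a = u, b = u)
cand <- cand[cand$a <= cand$b, ]

# A pair is feasible if one element can come from v1 and the other from v2
feasible <- (cand$a %in% v1 & cand$b %in% v2) |
            (cand$b %in% v1 & cand$a %in% v2)

# Leaves only the pairs (1, 1) and (4, 4)
not_in_product <- cand[!feasible, ]
not_in_product
```

The value 1 appears only in v1 and 4 appears only in v2, so neither can be paired with itself.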
We need a solution that creates the following:
Var1 Var2
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 2
[5,] 2 3
[6,] 2 4
[7,] 3 3
[8,] 3 4
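For small inputs, that target can be reproduced in base R by brute force: build the full Cartesian product, sort within each row, and drop duplicate rows. This is only a reference sketch for checking results. It materializes the whole product first, which is exactly the cost a dedicated function should avoid, and sorting within rows assumes all vectors share a comparable type.

```r
v1 <- 1:3
v2 <- 2:4

# Full Cartesian product as a matrix (9 rows)
grid <- as.matrix(expand.grid(v1, v2))

# Sort each row so that (3, 2) and (2, 3) become the same row
sorted <- t(apply(grid, 1, sort))

# Drop repeated rows, then order the result for readability
res <- unique(sorted)
res <- res[order(res[, 1], res[, 2]), , drop = FALSE]
res
```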
I authored the package RcppAlgos, and in the latest release, v2.4.3, there is a function comboGrid which addresses this very problem. It is general, flexible, and fast.
First, to answer the specific question raised by the OP:
library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
And as @Ferdinand.kraft points out, sometimes the output may need to have duplicates excluded within a given row. For that, we use repetition = FALSE:
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), repetition = FALSE)
Var1 Var2
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"
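Because both vectors are identical here, this particular result coincides with plain combinations of a single vector, so base R's combn offers a quick cross-check. This is a sanity check of the output, not a description of how comboGrid works internally.

```r
v <- c("aa", "ab", "cc")

# Choose 2 of the 3 strings without repetition.
# combn returns one combination per column, so transpose to get one per row.
pairs <- t(combn(v, 2))
pairs
```

The equivalence breaks down as soon as the input vectors differ, which is the case comboGrid is designed for.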
comboGrid is also very general. It can be applied to multiple vectors:
comboGrid(rep(list(c("aa", "ab", "cc")), 3))
Var1 Var2 Var3
[1,] "aa" "aa" "aa"
[2,] "aa" "aa" "ab"
[3,] "aa" "aa" "cc"
[4,] "aa" "ab" "ab"
[5,] "aa" "ab" "cc"
[6,] "aa" "cc" "cc"
[7,] "ab" "ab" "ab"
[8,] "ab" "ab" "cc"
[9,] "ab" "cc" "cc"
[10,] "cc" "cc" "cc"
It doesn't need the vectors to be identical:
comboGrid(1:3, 2:4)
Var1 Var2
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 2
[5,] 2 3
[6,] 2 4
[7,] 3 3
[8,] 3 4
And can be applied to vectors of various types:
set.seed(123)
my_range <- 3:15
mixed_types <- list(
int1 = sample(15, sample(my_range, 1)),
int2 = sample(15, sample(my_range, 1)),
char1 = sample(LETTERS, sample(my_range, 1)),
char2 = sample(LETTERS, sample(my_range, 1))
)
dim(expand.grid(mixed_types))
[1] 1950 4
dim(comboGrid(mixed_types, repetition = FALSE))
[1] 1595 4
dim(comboGrid(mixed_types, repetition = TRUE))
[1] 1770 4
The algorithm employed avoids generating the entirety of the Cartesian product and subsequently removing dupes. Ultimately, we create a hash table using the fundamental theorem of arithmetic along with deduplication, as pointed out by user2357112 supports Monica in the answer to "Picking unordered combinations from pools with overlap". All of this, together with the fact that it is written in C++, means that it is fast and memory efficient:
pools = list(c(1, 10, 14, 6),
c(7, 2, 4, 8, 3, 11, 12),
c(11, 3, 13, 4, 15, 8, 6, 5),
c(10, 1, 3, 2, 9, 5, 7),
c(1, 5, 10, 3, 8, 14),
c(15, 3, 7, 10, 4, 5, 8, 6),
c(14, 9, 11, 15),
c(7, 6, 13, 14, 10, 11, 9, 4),
c(6, 3, 2, 14, 7, 12, 9),
c(6, 11, 2, 5, 15, 7))
system.time(combCarts <- comboGrid(pools))
user system elapsed
0.929 0.062 0.992
nrow(combCarts)
[1] 1205740
## Small object created
print(object.size(combCarts), units = "Mb")
92 Mb
system.time(cartProd <- expand.grid(pools))
user system elapsed
8.477 2.895 11.461
prod(lengths(pools))
[1] 101154816
## Very large object created
print(object.size(cartProd), units = "Mb")
7717.5 Mb
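To give a feel for the prime-based keying idea mentioned above, here is a toy base R sketch. It is not the C++ implementation in RcppAlgos: it assumes a hard-coded prime per distinct value, and unlike the real algorithm it builds the full Cartesian product first. The names vals, primes, and key are illustrative only.

```r
v1 <- 1:3
v2 <- 2:4

# Map each distinct value to its own prime (hard-coded for this toy case)
vals <- sort(unique(c(v1, v2)))          # 1 2 3 4
primes <- c(2, 3, 5, 7)
names(primes) <- vals

grid <- expand.grid(Var1 = v1, Var2 = v2)

# Key of a row = product of the primes of its entries.
# By unique factorization, two rows share a key exactly when they
# contain the same values, regardless of order.
key <- unname(primes[as.character(grid$Var1)] *
              primes[as.character(grid$Var2)])

# Keep the first row seen for each key
res <- grid[!duplicated(key), ]
nrow(res)
```

Rows (3, 2) and (2, 3) both map to the key 5 * 3 = 15, so only the first of them is kept, leaving 8 rows. The rows are not sorted within themselves here, so (3, 2) appears where the earlier output shows (2, 3).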