Question

The R function expand.grid returns all possible combination between the elements of supplied parameters. e.g.

> expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
4   aa   ab
5   ab   ab
6   cc   ab
7   aa   cc
8   ab   cc
9   cc   cc

Do you know an efficient way to get directly (so without any row comparison after expand.grid) only the 'unique' combinations between the supplied vectors? The output will be

  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

EDIT the combination of each element with itself could be eventually discarded from the answer. I don't actually need it in my program even though (mathematically) aa aa would be one (regular) unique combination between one element of Var1 and another of var2.

The solution needs to produce pairs of elements from both vectors (i.e. one from each of the input vectors - so that it could be applied to more than 2 inputs)

Was it helpful?

Solution

How about using outer? But this particular function concatenates them into one character string.

outer( c("aa", "ab", "cc"), c("aa", "ab", "cc") , "paste" )
#     [,1]    [,2]    [,3]   
#[1,] "aa aa" "aa ab" "aa cc"
#[2,] "ab aa" "ab ab" "ab cc"
#[3,] "cc aa" "cc ab" "cc cc"

You can also use combn on the unique elements of the two vectors if you don't want the repeating elements (e.g. aa aa)

vals <- c( c("aa", "ab", "cc"), c("aa", "ab", "cc") )
vals <- unique( vals )
combn( vals , 2 )
#     [,1] [,2] [,3]
#[1,] "aa" "aa" "ab"
#[2,] "ab" "cc" "cc"

OTHER TIPS

In base R, you can use this:

expand.grid.unique <- function(x, y, include.equals=FALSE)
{
    x <- unique(x)

    y <- unique(y)

    g <- function(i)
    {
        z <- setdiff(y, x[seq_len(i-include.equals)])

        if(length(z)) cbind(x[i], z, deparse.level=0)
    }

    do.call(rbind, lapply(seq_along(x), g))
}

Results:

> x <- c("aa", "ab", "cc")
> y <- c("aa", "ab", "cc")

> expand.grid.unique(x, y)
     [,1] [,2]
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

> expand.grid.unique(x, y, include.equals=TRUE)
     [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

If the two vectors are the same, there's the combinations function in the gtools package:

library(gtools)
combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = TRUE)

#      [,1] [,2]
# [1,] "aa" "aa"
# [2,] "aa" "ab"
# [3,] "aa" "cc"
# [4,] "ab" "ab"
# [5,] "ab" "cc"
# [6,] "cc" "cc"

And without "aa" "aa", etc.

combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = FALSE)

The previous answers were lacking a way to get a specific result, namely to keep the self-pairs but remove the ones with different orders. The gtools package has two functions for these purposes, combinations and permutations. According to this website:

  • When the order doesn't matter, it is a Combination.
  • When the order does matter it is a Permutation.

In both cases, we have the decision to make of whether repetitions are allowed or not, and correspondingly, both functions have a repeats.allowed argument, yielding 4 combinations (deliciously meta!). It's worth going over each of these. I simplified the vector to single letters for ease of understanding.

Permutations with repetition

The most expansive option is to allow both self-relations and differently ordered options:

> permutations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
      [,1] [,2]
 [1,] "a"  "a" 
 [2,] "a"  "b" 
 [3,] "a"  "c" 
 [4,] "b"  "a" 
 [5,] "b"  "b" 
 [6,] "b"  "c" 
 [7,] "c"  "a" 
 [8,] "c"  "b" 
 [9,] "c"  "c" 

which gives us 9 options. This value can be found from the simple formula n^r i.e. 3^2=9. This is the Cartesian product/join for users familiar with SQL.

There are two ways to limit this: 1) remove self-relations (disallow repetitions), or 2) remove differently ordered options (i.e. combinations).

Combinations with repetitions

If we want to remove differently ordered options, we use:

> combinations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "b" 
[5,] "b"  "c" 
[6,] "c"  "c" 

which gives us 6 options. The formula for this value is (r+n-1)!/(r!*(n-1)!) i.e. (2+3-1)!/(2!*(3-1)!)=4!/(2*2!)=24/4=6.

Permutations without repetition

If instead we want to disallow repetitions, we use:

> permutations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "a" 
[4,] "b"  "c" 
[5,] "c"  "a" 
[6,] "c"  "b" 

which also gives us 6 options, but different ones! The number of options is the same as above but it's a coincidence. The value can be found from the formula n!/(n-r)! i.e. (3*2*1)/(3-2)!=6/1!=6.

Combinations without repetitions

The most limiting is when we want neither self-relations/repetitions or differently ordered options, in which case we use:

> combinations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c" 

which gives us only 3 options. The number of options can be calculated from the rather complex formula n!/(r!(n-r)!) i.e. 3*2*1/(2*1*(3-2)!)=6/(2*1!)=6/2=3.

Try:

factors <- c("a", "b", "c")

all.combos <- t(combn(factors,2))

     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c"

This will not include duplicates of each factor (e.g. "a" "a"), but you can add those on easily if needed.

dup.combos <- cbind(factors,factors)

     factors factors
[1,] "a"     "a"    
[2,] "b"     "b"    
[3,] "c"     "c"   

all.combos <- rbind(all.combos,dup.combos)

     factors factors
[1,] "a"     "b"    
[2,] "a"     "c"    
[3,] "b"     "c"    
[4,] "a"     "a"    
[5,] "b"     "b"    
[6,] "c"     "c" 

You can use a "greater than" operation to filter redundant combinations. This works with both numeric and character vectors.

> grid <- expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), stringsAsFactors = F)
> grid[grid$Var1 >= grid$Var2, ]
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

This shouldn't slow down your code too much. If you're expanding vectors containing larger elements (e.g. two lists of dataframes), I recommend using numeric indices that refer to the original vectors.

TL;DR

Use comboGrid from RcppAlgos:

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

The Details

I recently came across this question R - Expand Grid Without Duplicates and as I was searching for duplicates, I found this question. The question there isn't exactly a duplicate, as it is a bit more general and has additional restrictions which @Ferdinand.kraft shined some light on.

It should be noted that many of the solutions here make use of some sort of combination function. The expand.grid function returns the Cartesian product which is fundamentally different.

The Cartesian product operates on multiple objects which may or may not be the same. Generally speaking, combination functions are applied to a single vector. The same can be said about permutation functions.

Using combination/permutation functions will only produce comparable results to expand.grid if the vectors supplied are identical. As a very simple example, consider v1 = 1:3, v2 = 2:4.

With expand.grid, we see that rows 3 and 5 are duplicates:

expand.grid(1:3, 2:4)
  Var1 Var2
1    1    2
2    2    2
3    3    2
4    1    3
5    2    3
6    3    3
7    1    4
8    2    4
9    3    4

Using combn doesn't quite get us to the solution:

t(combn(unique(c(1:3, 2:4)), 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    3
[5,]    2    4
[6,]    3    4

And with repeats using gtools, we generate too many:

gtools::combinations(4, 2, v = unique(c(1:3, 2:4)), repeats.allowed = TRUE)
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    2    2
 [6,]    2    3
 [7,]    2    4
 [8,]    3    3
 [9,]    3    4
[10,]    4    4

In fact we generate results that are not even in the cartesian product (i.e. expand.grid solution).

We need a solution that creates the following:

     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

I authored the package RcppAlgos and in the latest release v2.4.3, there is a function comboGrid which addresses this very problem. It is very general, flexible, and is fast.

First, to answer the specific question raised by the OP:

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

And as, @Ferdinand.kraft points out, sometimes the output may need to have duplicates excluded in a given row. For that, we use repetition = FALSE:

comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), repetition = FALSE)
     Var1 Var2
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

comboGrid is also very general. It can be applied to multiple vectors:

comboGrid(rep(list(c("aa", "ab", "cc")), 3))
      Var1 Var2 Var3
 [1,] "aa" "aa" "aa"
 [2,] "aa" "aa" "ab"
 [3,] "aa" "aa" "cc"
 [4,] "aa" "ab" "ab"
 [5,] "aa" "ab" "cc"
 [6,] "aa" "cc" "cc"
 [7,] "ab" "ab" "ab"
 [8,] "ab" "ab" "cc"
 [9,] "ab" "cc" "cc"
[10,] "cc" "cc" "cc"

Doesn't need the vectors to be identical:

comboGrid(1:3, 2:4)
     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

And can be applied to vectors of various types:

set.seed(123)
my_range <- 3:15
mixed_types <- list(
    int1 = sample(15, sample(my_range, 1)),
    int2 = sample(15, sample(my_range, 1)),
    char1 = sample(LETTERS, sample(my_range, 1)),
    char2 = sample(LETTERS, sample(my_range, 1))
)

dim(expand.grid(mixed_types))
[1] 1950    4

dim(comboGrid(mixed_types, repetition = FALSE))
[1] 1595    4

dim(comboGrid(mixed_types, repetition = TRUE))
[1] 1770    4

The algorithm employed avoids generating the entirety of the Cartesian product and subsequently removing dupes. Ultimately, we create a hash table using the Fundamental theorem of arithmetic along with deduplication as pointed out by user2357112 supports Monica in the answer to Picking unordered combinations from pools with overlap. All of this together with the fact that it is written in C++ means that it is fast and memory efficient:

pools = list(c(1, 10, 14, 6),
             c(7, 2, 4, 8, 3, 11, 12),
             c(11, 3, 13, 4, 15, 8, 6, 5),
             c(10, 1, 3, 2, 9, 5,  7),
             c(1, 5, 10, 3, 8, 14),
             c(15, 3, 7, 10, 4, 5, 8, 6),
             c(14, 9, 11, 15),
             c(7, 6, 13, 14, 10, 11, 9, 4),
             c(6,  3,  2, 14,  7, 12,  9),
             c(6, 11,  2,  5, 15,  7))
             
system.time(combCarts <- comboGrid(pools))
   user  system elapsed 
  0.929   0.062   0.992

nrow(combCarts)
[1] 1205740

## Small object created
print(object.size(combCarts), unit = "Mb")
92 Mb
  
system.time(cartProd <- expand.grid(pools))
   user  system elapsed 
  8.477   2.895  11.461 
  
prod(lengths(pools))
[1] 101154816

## Very large object created
print(object.size(cartProd), unit = "Mb")
7717.5 Mb

here's a very ugly version that worked for me on a similar problem.

AHP_code = letters[1:10] 
 temp. <- expand.grid(AHP_code, AHP_code, stringsAsFactors = FALSE)
  temp. <- temp.[temp.$Var1 != temp.$Var2, ] # remove AA, BB, CC, etc. 
  temp.$combo <- NA 
  for(i in 1:nrow(temp.)){  # vectorizing this gave me weird results, loop worked fine. 
    temp.$combo[i] <- paste0(sort(as.character(temp.[i, 1:2])), collapse = "")
  }
  temp. <- temp.[!duplicated(temp.$combo),]
  temp. 

USING SORT

Just for fun, one can in principle also remove duplicates from expand.grid by combining sort and unique.

unique(t(apply(expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc")), 1, sort)))

This gives:

    [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top