How should I count the number of unique rows in a 'binary' matrix?

Question 1

Building on Kevin's answer, here is a C++11 version using a slightly different approach:

List rowCounts_2(IntegerMatrix x) {
  int n = x.nrow() ;
  int nc = x.ncol() ;
  std::vector<int> hashes(n) ;
  for( int k=0, pow=1; k<nc; k++, pow*=2){
    IntegerMatrix::Column column = x.column(k) ;

    std::transform( column.begin(), column.end(), hashes.begin(), hashes.begin(), [=]( int v, int h ){
        return h + pow*v ;
    }) ;
  }

  using Pair = std::pair<int,int> ;
  std::unordered_map<int, Pair> map_counts ;

  for( int i=0; i<n; i++){
    Pair& p = map_counts[ hashes[i] ] ;
    if( p.first == 0){
      p.first = i+1 ; // using directly 1-based index
    }
    p.second++ ;
  }

  int nres = map_counts.size() ;
  IntegerVector idx(nres), counts(nres) ;
  auto it=map_counts.begin() ;
  for( int i=0; i<nres; i++, ++it){
    idx[i] = it->second.first ;
    counts[i] = it->second.second ;
  }

  return List::create( _["counts"] = counts, _["idx"] = idx );
}

The idea is to trade memory for speed. The first change is that I'm allocating and filling a std::vector<int> to host the hashes. Doing this allows me to traverse the input matrix column by column which is more efficient.

Once this is done, I'm training a hash map of pairs (index, counts) std::unordered_map<int, std::pair<int,int>>. The key of the map is the hash, the value is a pair (index, count).

Then I just have to traverse the hash map and collect the results. The results don't appear in ascending order of idx (it is easy to do it if we really want that).

I get these results for n=1e5 and n=1e7.

> m <- matrix(sample(0:1, 1e+05, TRUE), ncol = 10)

> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: microseconds
           expr      min       lq    median        uq       max neval
   rowCounts(m) 1194.536 1201.273 1213.1450 1231.7295  1286.458   100
  rowCountsR(m)  575.004  933.637  962.8720  981.6015 23678.451   100
 rowCounts_2(m)  421.744  429.118  442.5095  455.2510   530.261   100

> m <- matrix(sample(0:1, 1e+07, TRUE), ncol = 10)

> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: milliseconds
           expr      min       lq   median        uq       max neval
   rowCounts(m) 97.22727 98.02716 98.56641 100.42262 102.07661   100
  rowCountsR(m) 57.44635 59.46188 69.34481  73.89541 100.43032   100
 rowCounts_2(m) 22.95741 23.38186 23.78068  24.16814  27.44125   100

Taking advantage of threading helps further. Below is how the time is split between 4 threads on my machine. See the code in this gist.

enter image description here

Here are benchmarks with the last version too:

> microbenchmark(rowCountsR(m), rowCounts_1(m), rowCounts_2(m), rowCounts_3(m,4))
Unit: milliseconds
              expr       min        lq    median        uq       max neval
     rowCountsR(m)  93.67895 127.58762 127.81847 128.03472 151.54455   100
    rowCounts_1(m) 120.47675 120.89169 121.31227 122.86422 137.86543   100
    rowCounts_2(m)  28.88102  29.68101  29.83790  29.97112  38.14453   100
 rowCounts_3(m, 4)  12.50059  12.68981  12.87712  13.10425  17.21966   100

Question 2

We can take advantage of the structure of our matrix to count the number of unique rows in a nice way. Because the values are all 0 and 1, we can define a 'hash' function that maps each row to a unique integer value, and then count those hashes.

The hash function we will implement is identical to the following R code:

hash <- function(x) sum(x * 2^(0:(length(x)-1)))

where x is an integer vector of 0s and 1s, representing a row of a matrix.

In my solution, because I'm using C++ and there is no associative container that maintains insertion order (in the standard library), I use both a std::map<int, int> to count hashes of each row, and a std::vector<int> to track the order in which hashes are inserted.

Because of the restriction of number of columns <= 20, we can compute the hashed values and store in an int, but to be safe for larger matrices one should store the hashes in a double (because overflow would occur with n > 31)

With that in mind, we can write a solution:

#include <Rcpp.h>
using namespace Rcpp;

inline int hash(IntegerMatrix::Row x) {
  int n = x.size();
  int hash = 0;
  for (int j=0; j < n; ++j) {
    hash += x[j] << j;
  }
  return hash;
}

// [[Rcpp::export]]
List rowCounts(IntegerMatrix x) {

  int nrow = x.nrow();

  typedef std::map<int, int> map_t;

  map_t counts;

  // keep track of insertion order with a separate vector
  std::vector<int> ordered_hashes;
  std::vector<int> insertion_order;

  ordered_hashes.reserve(nrow);
  insertion_order.reserve(nrow);

  for (int i=0; i < nrow; ++i) {
    IntegerMatrix::Row row = x(i, _);
    int hashed_row = hash(row);
    if (!counts[hashed_row]) {
      ordered_hashes.push_back(hashed_row);
      insertion_order.push_back(i);
    }
    ++counts[hashed_row];
  }

  // fill the 'counts' portion of the output
  int n = counts.size();
  IntegerVector output = no_init(n);
  for (int i=0; i < n; ++i) {
    output[i] = counts[ ordered_hashes[i] ];
  }

  // fill the 'idx' portion of the output
  IntegerVector idx  = no_init(n);
  for (int i=0; i < n; ++i) {
    idx[i] = insertion_order[i] + 1; // 0 to 1-based indexing
  }

  return List::create(
    _["counts"] = output,
    _["idx"] = idx
  );

}

/*** R
set.seed(123)
m <- matrix( sample(0:1, 10, TRUE), nrow=5 )
rowCounts(m)
m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )
str(rowCounts(m))

## Compare it to a close-ish R solution
microbenchmark( times=5,
  rowCounts(m),
  table(do.call(paste, as.data.frame(m)))
)
*/

Calling sourceCpp on this gives me:

> Rcpp::sourceCpp('rowCounts.cpp')
> set.seed(123)
> m <- matrix( sample(0:1, 10, TRUE), nrow=5 )
> m
     [,1] [,2]
[1,]    0    0
[2,]    1    1
[3,]    0    1
[4,]    1    1
[5,]    1    0

> rowCounts(m)
$counts
[1] 1 2 1 1

$idx
[1] 1 2 3 5

> m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )
> str(rowCounts(m))
List of 2
 $ counts: int [1:32] 602 640 635 624 638 621 622 615 633 592 ...
 $ idx   : int [1:32] 1 2 3 4 5 6 7 8 9 10 ...

> microbenchmark( times=5,
+   rowCounts(m),
+   table(do.call(paste, as.data.frame(m)))
+ )
Unit: milliseconds
                                    expr      min        lq    median        uq       max neval
                            rowCounts(m)  1.14732  1.150512  1.172886  1.183854  1.184235     5
 table(do.call(paste, as.data.frame(m))) 22.95222 23.146423 23.607649 24.455728 24.953177     5

Question 3

I was curious how a pure R solution would perform:

set.seed(123)
m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )

rowCountsR <- function(x) {
  ## calculate hash
  h <- m %*% matrix(2^(0:(ncol(x)-1)), ncol=1)
  i <- which(!duplicated(h))
  counts <- tabulate(h+1)
  counts[order(h[i])] <- counts
  list(counts=counts, idx=i)
}

library("rbenchmark")
benchmark(rowCounts(m), rowCountsR(m))
#            test replications elapsed relative user.self sys.self user.child sys.child
# 1  rowCounts(m)          100   0.189    1.000     0.188        0          0         0
# 2 rowCountsR(m)          100   0.258    1.365     0.256        0          0         0

Edit: more columns, thanks @Arun for pointing this out.

set.seed(123)
m <- matrix( sample(0:1, 1e7, TRUE), ncol=10)
benchmark(rowCounts(m), rowCountsR(m), replications=100)
#           test replications elapsed relative user.self sys.self user.child sys.child
#1  rowCounts(m)          100  20.659    1.077    20.533    0.024          0         0
#2 rowCountsR(m)          100  19.183    1.000    15.641    3.408          0         0