Question

Very simple question and I regret having to ask it on SO. I'm hoping a creative mind out there has a better solution than I have thought of.

I want to find the differences between adjacent values of a matrix where the value for "type" is the same. For the below example, I would want a vector with the values 2,6,1.

mat
     value type
[1,]  5   A
[2,]  7   A
[3,]  1   B
[4,]  2   C
[5,]  8   C
[6,]  9   C

I have tried implementing it in two ways, but both of them are quite slow:

  • Approach 1: If type(row 1) = type(row 2), then find value(row 2) - value(row 1), then increment by one. If type(row 1) != type(row 2), then increment by 2.

  • Approach 2: Loop through each different type and find all differences.

There are about 500,000 different "types" and 5 million different rows. Can anyone think of a more efficient way? I am programming in the language R, where there is already a function to find the differences for me (see: ?diff).

Was it helpful?

Solution

Since you say you have too many rows to do this, I'll suggest a data.table solution:

require(data.table)
DT <- data.table(df) # where `df` is your data.frame

DT[, diff(value), by=type]$V1
# [1] 2 6 1

Simulating this code on a data of your dimensions:

It takes close to 20 seconds (bottleneck should be calls to diff) on data of your dimensions.

require(data.table)
set.seed(45)
types <- sapply(1:5e5, function(x) paste0(sample(letters, 5, TRUE), collapse=""))

DT <- data.table(value=sample(100, 5e6, TRUE), type=sample(types, 5e6, TRUE))
system.time(t1 <- DT[, diff(value), by=type]$V1)
#   user  system elapsed 
# 18.610   0.238  19.166 

To compare against the other answer with tapply:

system.time(t2 <- tapply(DT[["value"]], DT[["type"]], diff))
#   user  system elapsed 
# 48.471   0.664  51.673 

Also, tapply orders the results by type where as data.table without key will preserve the original order.


Edit: Following @eddi's comment:

> system.time(t3 <- DT[, value[-1]-value[-.N], by=type]$V1)
#  user  system elapsed 
# 6.221   0.195   6.641 

There's a 3x improvement by removing the call to diff. Thanks @eddi.

OTHER TIPS

Try:

tapply(mat[["value"]], mat[["type"]], diff)

Then you can unlist your result to get it in a tidier format.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top