Question

The data set is similar to this:

library(data.table)
uid <- c("a","a","a","b","b","b","c","c","c")
date <- c(2001,2002,2003)
DT <- data.table(id=uid, year=rep(date,3), value= c(1,3,2,1:6))

Q1

Now I want to find the ids whose "value" column increases year by year. What I want looks like this: for b and c, value increases every year.

4:  b 2001     1
5:  b 2002     2
6:  b 2003     3
7:  c 2001     4
8:  c 2002     5
9:  c 2003     6

In the real data, the recorded time span differs for each id.

Besides, I want to calculate, for a given id, in how many years the value increases.

   ID  V1
1: a   1
2: b   2
3: c   2

Thanks a lot if you have any ideas about this. I prefer a data.table solution because the computation needs to be fast.

Solution 2

For your first question, if the rows are not already sorted, I'd use setkey on id, year to sort them (rather than using base::order, as it's very slow). id is included in the key so that the results also come out in the order you expect for question 2.

setkey(DT, id, year)
DT[, if (.N == 1L || 
        ( .N > 1 && all(value[2:.N]-value[1:(.N-1)] > 0) )
     ) .SD, 
by=list(id)]

   id year value
1:  b 2001     1
2:  b 2002     2
3:  b 2003     3
4:  c 2001     4
5:  c 2002     5
6:  c 2003     6
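
Since the question mentions that the real data covers a different time span for each id, here is a minimal sketch of the same expression on unsorted, uneven data. The table DT2 and its values are made up for illustration and are not from the question.

```r
library(data.table)

# Hypothetical data: unsorted rows and a different number of years per id
DT2 <- data.table(id    = c("x", "y", "x", "z", "y", "x"),
                  year  = c(2003, 2001, 2001, 2005, 2002, 2002),
                  value = c(9, 5, 7, 1, 4, 8))

setkey(DT2, id, year)  # sorts DT2 by reference on id, then year

# Same filter as above: keep a group if it has one row or rises every year
out <- DT2[, if (.N == 1L ||
                 (.N > 1 && all(value[2:.N] - value[1:(.N - 1)] > 0))
                ) .SD,
           by = list(id)]
out
```

Here x rises every year and z has a single row, so both are kept, while y drops from 5 to 4 and is removed.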

For your second question:

DT[, if (.N == 1L) 1L else sum(value[2:.N]-value[1:(.N-1)] > 0), by=list(id)]
   id V1
1:  a  1
2:  b  2
3:  c  2

I explicitly subtract the 1st through (n-1)th values from the 2nd through last (.N) values because diff, being an S3 generic, takes time to dispatch to the right method (here, diff.default), so it is much faster to write the subtraction directly in j.
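
As a quick sanity check, the manual subtraction can be compared with diff() on the example data. This is only a sketch: it uses negative indexing (value[-1L] and value[-.N]) as an equivalent way to write the subtraction, and the column names n_manual and n_diff are made up for illustration.

```r
library(data.table)

# Example data from the question
DT <- data.table(id    = rep(c("a", "b", "c"), each = 3),
                 year  = rep(c(2001, 2002, 2003), 3),
                 value = c(1, 3, 2, 1:6))
setkey(DT, id, year)

# Count year-over-year increases two ways: manual subtraction vs. diff()
# value[-1L] drops the first element, value[-.N] drops the last one
res <- DT[, .(n_manual = sum(value[-1L] - value[-.N] > 0),
              n_diff   = sum(diff(value) > 0)),
          by = id]
res
```

Both columns should agree for every id. Note that the negative-index form returns 0 for an id with a single row, whereas the expression above returns 1L in that case.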

OTHER TIPS

I think this does what you want:

DT[order(year)][, sum(diff(value) > 0), by=id]

produces:

   id V1
1:  a  1
2:  b  2
3:  c  2

This assumes you have at most one value per year.
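
A minimal sketch of how that assumption could be checked, and how the same diff() idea might be applied to the first question. The names dups and increasing are illustrative only.

```r
library(data.table)

# Example data from the question
DT <- data.table(id    = rep(c("a", "b", "c"), each = 3),
                 year  = rep(c(2001, 2002, 2003), 3),
                 value = c(1, 3, 2, 1:6))

# Check the assumption: list any (id, year) pair that occurs more than once
dups <- DT[, .N, by = .(id, year)][N > 1L]

# First question via diff(): keep only ids whose value rises every year.
# all(logical(0)) is TRUE, so an id with a single row would be kept as well.
increasing <- DT[order(id, year)][, if (all(diff(value) > 0)) .SD, by = id]
increasing
```

If dups has zero rows, each id has at most one value per year and the counts above can be read as year-over-year increases.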

Licensed under: CC-BY-SA with attribution