Question

What is the best way to do a Cartesian join with the roll-forward feature, so that the roll is applied separately to each series in the joining table (here, each value of f) rather than to the table as a whole?

Best explained with an example:

library(data.table)
A = data.table(x = c(1,2,3,4,5), y = letters[1:5])
B = data.table(x = c(1,2,3,1,4), f = c("Alice","Alice","Alice", "Bob","Bob"), z = 101:105)
setkey(B,x)
C = B[A, roll = TRUE, allow.cartesian=TRUE, rollends = FALSE]

A
B
C[f == "Alice"]
C[f == "Bob"]
C

So we have the two starting tables:

> A
   x y
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
> B
   x     f   z
1: 1 Alice 101
2: 1   Bob 104
3: 2 Alice 102
4: 3 Alice 103
5: 4   Bob 105

I want to join these so that every x value in A gets both an Alice row and a Bob row, rolling the previous value forward if either is missing (but not rolling past the end of a series). My current attempt doesn't quite work:

> C[f == "Alice"]
   x     f   z y
1: 1 Alice 101 a
2: 2 Alice 102 b
3: 3 Alice 103 c
> C[f == "Bob"]
   x   f   z y
1: 1 Bob 104 a
2: 4 Bob 105 d
> C
   x     f   z y
1: 1 Alice 101 a
2: 1   Bob 104 a
3: 2 Alice 102 b
4: 3 Alice 103 c
5: 4   Bob 105 d
6: 5    NA  NA e

Because Alice has an exact match at x = 2 and x = 3, the join never rolls Bob's data forward for those values. I need the extra rows for Bob, so the result I want is:

> C[f == "Alice"]
   x     f   z y
1: 1 Alice 101 a
2: 2 Alice 102 b
3: 3 Alice 103 c
> C[f == "Bob"]
   x   f   z y
1: 1 Bob 104 a
2: 2 Bob 104 b  # THESE ROWS ARE MISSING
3: 3 Bob 104 c  # THESE ROWS ARE MISSING
4: 4 Bob 105 d
> C
   x     f   z y
1: 1 Alice 101 a
2: 1   Bob 104 a
3: 2 Alice 102 b
4: 2   Bob 104 b  # THESE ROWS ARE MISSING
5: 3 Alice 103 c
6: 3   Bob 104 c  # THESE ROWS ARE MISSING
7: 4   Bob 105 d
8: 5    NA  NA e

Solution

Here you go:

setkey(B, f, x)

setkey(B[CJ(unique(f), unique(x)), allow.cartesian = TRUE,
         roll = TRUE, rollends = c(FALSE, FALSE)], x)[A, allow.cartesian = TRUE]
#   x     f   z y
#1: 1 Alice 101 a
#2: 1   Bob 104 a
#3: 2 Alice 102 b
#4: 2   Bob 104 b
#5: 3 Alice 103 c
#6: 3   Bob 104 c
#7: 4 Alice  NA d
#8: 4   Bob 105 d
#9: 5    NA  NA e

You can then filter out the NA rows to suit your needs.
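
As a minimal sketch of that filtering step, the same join can be split into named intermediate steps and the rows with a missing z dropped afterwards. The names grid, rolled, res and filtered are only illustrative and are not part of the original answer. Building the grid outside the join with B$f and B$x is a stylistic assumption, intended to be equivalent to the CJ(unique(f), unique(x)) call above.

```r
library(data.table)

A = data.table(x = c(1,2,3,4,5), y = letters[1:5])
B = data.table(x = c(1,2,3,1,4), f = c("Alice","Alice","Alice","Bob","Bob"), z = 101:105)

# Key B by series first, then by x, so the roll happens within each f
setkey(B, f, x)

# Every (f, x) combination present in B (hypothetical helper table)
grid = CJ(f = unique(B$f), x = unique(B$x))

# Roll z forward within each f, without rolling past either end of a series
rolled = B[grid, roll = TRUE, rollends = c(FALSE, FALSE)]

# Re-key by x only and attach the columns of A
setkey(rolled, x)
res = rolled[A, allow.cartesian = TRUE]

# Keep only the rows where a value of z was actually found
filtered = res[!is.na(z)]
filtered
```

Filtering on z also removes the x = 5 row, where both f and z are NA. If that row should be kept, a condition such as res[!is.na(z) | is.na(f)] may be closer to what is needed.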

OTHER TIPS

I also found an alternative way of doing this. I've accepted the other answer as that produces a result more closely matching that requested by the question, but this might also be useful for some people. The difference is what happens at the end of the series.

C = B[, .SD[A, roll = TRUE, rollends = FALSE], by = f]
setkey(C, x)

> C
        f x   z y
 1: Alice 1 101 a
 2:   Bob 1 104 a
 3: Alice 2 102 b
 4:   Bob 2 104 b
 5: Alice 3 103 c
 6:   Bob 3 104 c
 7: Alice 4  NA d
 8:   Bob 4 105 d
 9: Alice 5  NA e
10:   Bob 5  NA e

Rows 9 and 10 are the only difference (apart from column order); in eddi's answer they are combined into a single row, with NA in both f and z.

This solution is also slightly slower than eddi's when I tested on bigger data.tables (although both are quite fast).

Licensed under: CC-BY-SA with attribution