I have two data frames in R that I would like to merge conditionally on id and day. For each id-day row in left, the matching row from right should be as recent as possible, but at least three days old.

If right has no match for an id-day pair in left, I would still like to keep that row. My study has two parts, so I don't want to drop id-day observations just because they are incomplete.

Can I do this in one sqldf step? My current approach requires an additional base R merge.

library(sqldf)

left <- data.frame(id=rep(1:5, each=10),
                   day=rep(1:10, times=5),
                   x=rnorm(5*10))
right <- data.frame(id=rep(1:2, each=21),
                   day=rep(-10:10, times=2),
                   y=rnorm(2*21))
combined <- sqldf("SELECT L.id, L.day, L.x, R.y
                  FROM left L LEFT OUTER JOIN right R
                  ON (L.id = R.id)
                  WHERE ((L.day - R.day) >= 3)
                  GROUP BY L.id, L.day
                  HAVING (R.day = MAX(R.day))")
combined                  

combined.2 <- merge(left, combined, all=TRUE)
combined.2 

Solution

Try nesting the select statements like this:

sqldf("SELECT * from left
       LEFT JOIN (SELECT id, L.day, L.x, R.y
                  FROM left L LEFT OUTER JOIN right R
                  USING (id)
                  WHERE ((L.day - R.day) >= 3)
                  GROUP BY L.id, L.day
                  HAVING (R.day = MAX(R.day))) 
       USING (id, day, x)")

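This can also be done in a single select, as follows. Putting the three-day condition in the ON clause rather than in WHERE keeps the left rows that have no match, with y set to NA. The query also relies on a SQLite extension to SQL: when max() is used in an aggregate query, the other columns on the same result row are guaranteed to come from the same original row as the maximum.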

sqldf("select max(R.day) as maxRday, L.*, R.y
  from left L left outer join right R
  on L.id = R.id and L.day - R.day >= 3
  group by L.id, L.day")[-1]
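
The max() behaviour described above can be checked on a small table. This is only a sketch: the data frame toy and its values are made up for illustration and are not part of the question's data. Here y is a bare column (neither aggregated nor listed in GROUP BY), so SQLite should fill it from the row that holds the maximum day.

```r
library(sqldf)

# Hypothetical toy table (not from the question): two rows per id,
# with y labelling each row so we can see which one is picked.
toy <- data.frame(id  = c(1, 1, 2, 2),
                  day = c(1, 5, 7, 3),
                  y   = c("a", "b", "c", "d"))

# y is a bare column; because max(day) appears in the select list,
# SQLite takes y from the row where day is largest within each id.
res <- sqldf("select id, max(day) as maxday, y
              from toy
              group by id
              order by id")
res
```

For this toy table the y column should come back as "b" for id 1 and "c" for id 2, i.e. the labels of the rows with day 5 and day 7. Other SQL back ends generally reject or handle such a query differently, so this shortcut should be treated as SQLite-specific.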

Other tips

With version 1.9.8 (on CRAN 25 Nov 2016), data.table gained the ability to perform non-equi joins. This feature wasn't available in 2014 when bartektartanus promised to post a data.table answer.

Now, in 2020, after a delay of six years, here is a data.table answer:

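library(data.table)
# Shift right's day by 3 so that join_day <= day means "at least three days old"
setDT(right)[, join_day := day + 3L][
  # Non-equi join; for each left row (.EACHI) keep the last matching right row,
  # which is the most recent one as long as right is ordered by day within id
  setDT(left), on = .(id, join_day <= day), .(x = last(x), y = last(y)), by = .EACHI][
    , setnames(.SD, "join_day", "day")]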

which returns

    id day  x  y
 1:  1   1  1  2
 2:  1   2  2  2
 3:  1   3  3  3
 4:  2   1  4  6
 5:  2   2  5  7
 6:  2   3  6  7
 7:  3   1  7 13
 8:  3   2  8 14
 9:  3   3  9 15
10:  4   1 10 NA
11:  4   2 11 NA
12:  4   3 12 NA

for the modified sample data

left <- data.table(id = rep(1:4, each=3),
                   day = rep(1:3, times=4),
                   x = 1:(3*4))
right <- data.table(id = c(rep(1:2, each=5L), rep(3, 9L)),
                    day = c(seq(-4L, 4L, 2L), seq(-3L, 5L, 2L), -4:4))[, y := seq_along(id)]

where

left
    id day  x
 1:  1   1  1
 2:  1   2  2
 3:  1   3  3
 4:  2   1  4
 5:  2   2  5
 6:  2   3  6
 7:  3   1  7
 8:  3   2  8
 9:  3   3  9
10:  4   1 10
11:  4   2 11
12:  4   3 12

and

right
    id day  y
 1:  1  -4  1
 2:  1  -2  2
 3:  1   0  3
 4:  1   2  4
 5:  1   4  5
 6:  2  -3  6
 7:  2  -1  7
 8:  2   1  8
 9:  2   3  9
10:  2   5 10
11:  3  -4 11
12:  3  -3 12
13:  3  -2 13
14:  3  -1 14
15:  3   0 15
16:  3   1 16
17:  3   2 17
18:  3   3 18
19:  3   4 19
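
As a cross-check of the result table above, the matching rule can also be written out row by row in base R. This is a slow sketch meant only for verifying small inputs, not a replacement for the join. The names left_df, right_df, latest_y and min_age are hypothetical and introduced here for illustration; the data are the modified sample data rebuilt as plain data frames.

```r
# Modified sample data from above, rebuilt as plain data frames
left_df  <- data.frame(id  = rep(1:4, each = 3),
                       day = rep(1:3, times = 4),
                       x   = 1:12)
right_df <- data.frame(id  = c(rep(1:2, each = 5), rep(3, 9)),
                       day = c(seq(-4, 4, 2), seq(-3, 5, 2), -4:4))
right_df$y <- seq_len(nrow(right_df))

# latest_y() is a hypothetical helper: for one left row, return y from the
# most recent right row with the same id that is at least min_age days older,
# or NA when no such row exists.
latest_y <- function(id, day, min_age = 3) {
  ok <- right_df$id == id & right_df$day <= day - min_age
  if (!any(ok)) return(NA_integer_)
  cand <- right_df[ok, ]
  cand$y[which.max(cand$day)]
}

check <- left_df
check$y <- mapply(latest_y, left_df$id, left_df$day)
check
```

Because the helper picks the candidate with the largest day explicitly, it does not depend on the row order of right_df. Rows for id 4 get NA, since right_df has no rows for that id.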
Licensed under: CC-BY-SA with attribution