Question

I have merged two datasets. Set X contains an ID and a startDate, set Y an ID and an endDate. Both have the ID as key. I want to match every startDate with its corresponding endDate (provided it is bigger than the startDate). The problem is that IDs may appear multiple times in a given set, but not always as many times in the same set. Hence, a startDate can get matched up with multiple entries of endDate, and vice versa. This is basically the code I have:

require(data.table)

startDate = c(1,2,3,1)
IDX = c(1,2,2,3)
endDate = c(2,3,4)
IDY = c(1,1,2)

X = data.table(startDate,IDX)
Y= data.table(endDate,IDY)

setkey(X,IDX)
setkey(Y,IDY)
DT = X[Y,allow.cartesian = TRUE][endDate>startDate]   

Now I would like to conditionally remove duplicated entries from the set DT. Corresponding to the keys "ID" and "startDate" I want to only keep the duplicated entry with the lowest endDate. For every duplicated entry of "ID" and "endDate" I want to do the same, but instead keep the entry that has the highest startDate.

DT looks like this:

   IDX startDate endDate
1:   1         1       2
2:   1         1       3
3:   2         2       4
4:   2         3       4

There are two duplicates in this new data table. 1 and 2 are duplicates of eachother, with a different endDate. Only entry 1 has to stay (since it has the closest endDate to the startDate). 3 and 4 are also duplicates of eachother, with a different startDate. Here, entry 4 has to stay (since it has the closes startDate to the endDate). So the required output is

   IDX startDate endDate
1:   1         1       2
2:   2         3       4

I don't know how to achieve this with duplicated(DT), since it does not necessarily give the specific duplicate that I want. Anyone got a suggestion? Moreover, can I perhaps already solve this during the "join" instead of solving it afterwards?

Was it helpful?

Solution

I don't think this can be done in a single join, but perhaps the following can work:

Y[, startDate := endDate]
setkey(X, IDX, startDate)
setkey(Y, IDY, startDate)

Y[X, roll = -Inf][, list(startDate = startDate[.N]), by = list(IDY, endDate)]
#   IDY endDate startDate
#1:   1       2         1
#2:   2       4         3
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top