How to do multi-key lookups using data.table?

https://stackoverflow.com/questions/21838361

12-10-2022

Question

I am using data.table to do some repeated lookups on a large dataset (45M rows, 4 int columns).

Here is what I want to do.

library(data.table)
# generate some data, u's can show up in multiple s's
d1 <- data.table(u=rep(1:500,2), s=round(runif(1000,1,100),0))
setkey(d1, u, s)

# for each u, I want to lookup all their s's
us <- d1[J(u=1), "s", with=FALSE]
# for each of the s's in the above data.table, 
#   I want to lookup other u's from the parent data.table d1

# DOESN'T WORK:
otherus <- d1[J(s = us), "u", with=FALSE]   

# THIS WORKS but takes a really long time on my large dataset:
otherus <- merge(d1, us, by='s')

Merge works for my purpose, but since my 'd1' >>> 'us', it takes a long time. At first I thought I might be using merge from base R, but based on the docs it looks like the data.table merge method is dispatched if the class of the first argument to merge is a data.table.

I am still getting used to data.table J() syntax. Is there a niftier way to accomplish this?

Thanks in advance.

No correct solution

OTHER TIPS

You can change the key for that purpose.

setkey(d1,s,u)

After that command, the rows are sorted by s and then by u, so all u values for the same s value are grouped together.

        u   s
   1:  20   1
   2:  35   1
   3:  36   1
   4:  87   1
   5: 123   1
  ---        
 996: 208 100
 997: 262 100
 998: 352 100
 999: 430 100
1000: 455 100
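
With s as the first key column, the lookup from the question can be written as a keyed join on s. This is only a minimal sketch: the seed and the name s_vals are my own assumptions for illustration, and the first step uses a plain vector scan because the table is no longer keyed on u.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
d1 <- data.table(u = rep(1:500, 2), s = round(runif(1000, 1, 100), 0))
setkey(d1, s, u)  # s first, so joins match on s

# s values that belong to u == 1 (vector scan; s_vals is a hypothetical name)
s_vals <- unique(d1[u == 1L, s])

# keyed join on the first key column: all rows whose s is one of s_vals
otherus <- d1[J(s_vals)]
```

The result contains u == 1 itself as well; filter it out afterwards if only the other u's are wanted.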

Operations performed on the groups defined by the key columns are usually very fast, e.g.

d1[,mean(u),keyby='s']

If you need fast aggregation for both groupings, u and s, you could store two instances of the data.table. Because setkey() modifies its argument by reference, the second instance has to be a real copy (made with copy()), not just another name for d1. For one you use setkey(d1,u,s) and for the other setkey(d1,s,u). If you want to perform operations quickly on the groups defined by the values of u, use the former data.table, otherwise the latter.
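
A rough sketch of the two-instance idea, applied to the repeated lookup from the question. The names by_u, by_s and find_other_us are hypothetical, and the seed is only there for reproducibility; keeping two copies doubles the memory, which matters for a 45M-row table.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed for reproducibility
d1 <- data.table(u = rep(1:500, 2), s = round(runif(1000, 1, 100), 0))

# two independent copies, each with its own key order
by_u <- setkey(copy(d1), u, s)
by_s <- setkey(copy(d1), s, u)

# hypothetical helper: all u's that share at least one s with `user`
find_other_us <- function(user) {
  s_vals <- unique(by_u[J(user), s, nomatch = 0L])  # keyed lookup on u
  by_s[J(s_vals), unique(u), nomatch = 0L]          # keyed lookup on s
}

others <- find_other_us(1L)
```

Neither lookup re-sorts anything, so the function can be called repeatedly for different users.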

Will the following work?

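library(data.table)
d1 <- data.table(u=rep(1:500,2), s=round(runif(1000,1,100),0))
setkey(d1, u, s)
us <- d1[J(u=1), "s"]              # s values for u == 1
otherus <- merge(d1, us, by='s')   # reference result from the question

setkey(d1,s)                       # re-key on s so that d1[us] joins on s
otherus2 <- d1[us]
# identical() also compares column order and attributes such as the key,
# so it may return FALSE even when both results hold the same rows
identical(otherus2, otherus)

setkey(d1, u, s)                   # restore the original key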
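
If your data.table version supports the on= argument (1.9.6 or later, as far as I know), the same join can be done without switching the key back and forth. This is just a sketch under that assumption, with a fixed seed added for reproducibility.

```r
library(data.table)

set.seed(7)  # assumption: fixed seed for reproducibility
d1 <- data.table(u = rep(1:500, 2), s = round(runif(1000, 1, 100), 0))

# distinct s values for u == 1, kept as a one-column data.table
us <- unique(d1[u == 1L, .(s)])

# ad hoc join on s; no key is set or changed on d1
otherus <- d1[us, on = "s"]
```

The result has the same columns as d1 and is ordered by the rows of us, not by any key of d1.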

Licensed under: CC-BY-SA with attribution
