Question

I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?

Here is the simple example (derived from this thread: Join of two data.tables fails).

# The data of interest.
(DT <- data.table(id    = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75), 
                  key   = "id"))

     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75

# Lookup table.
(lookup <- data.table(id      = 1153:1160, 
                      version = c(1,1,3,4,2,1,1,2), 
                      yr      = rep(2006, 4), 
                      key     = "id"))

     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006

# The desired table.  Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]

     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006

The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:

# Merge data.frames: works just fine
long.merged         <- merge(temp.versions, temp.3561, by = "id")

# Convert the data.frames to data.tables
DTtemp.3561         <- as.data.table(temp.3561)
DTtemp.versions     <- as.data.table(temp.versions)

# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged       <- merge(DTtemp.versions, DTtemp.3561, by = "id")

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate 
key values in i, each of which join to the same group in x over and over again. If that's ok, 
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the 
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. 
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.

DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).

DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).

Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
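One way to test that expectation before joining is to compare the number of distinct key values against the row count on each side; any gap means the key has duplicates. A minimal sketch, using a hypothetical miniature table standing in for DTtemp.versions (the data below is made up for illustration):

```r
library(data.table)

# Hypothetical miniature lookup table with one duplicated id,
# standing in for DTtemp.versions.
versions <- data.table(id      = c(1153L, 1154L, 1154L, 1155L),
                       version = c(1, 1, 2, 3),
                       key     = "id")

# Distinct ids versus rows: a mismatch means the key has duplicates.
uniqueN(versions$id)       # 3 distinct ids across 4 rows
nrow(versions)             # 4

# List the offending key values explicitly.
versions[duplicated(id)]   # the second id = 1154 row
```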

Likewise

# Same error message, but with 12,055,777 observations
altDTlong.merged   <- DTtemp.3561[DTtemp.versions]

# Same error message, but with 11,277,332 observations
alt2DTlong.merged  <- DTtemp.versions[DTtemp.3561]

Including allow.cartesian = TRUE and nomatch = 0 doesn't drop the "excess" observations.

Oddly, if I truncate the dataset of interest to have 10 observations, merge() works fine on both data.frames and data.tables.

# Merge short DF: works just fine
short.3561         <- temp.3561[-(11:7946667),]
short.merged       <- merge(temp.versions, short.3561, by = "id")

# Merge short DT
DTshort.3561       <- data.table(short.3561, key = "id")
DTshort.merged     <- merge(DTtemp.versions, DTshort.3561, by = "id")

I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?

Solution

Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?

Taking your question directly: the error message

Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...

states that the result of your join has more rows than max(nrow(x), nrow(i)), which is more than an ordinary one-row-per-observation lookup produces. This means the lookup table's key contains duplicate values: each duplicated id matches every corresponding row in your data, so the join multiplies rows instead of adding one row per observation.

If that doesn't answer your question, please restate it.
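To see the mechanism concretely, here is a toy join in which a duplicated lookup key inflates the result, and in which deduplicating the lookup with unique() restores one output row per observation. This is a sketch with made-up tables, assuming a reasonably recent data.table (the on= argument and unique(..., by=) require version 1.9.6/1.9.8 or later):

```r
library(data.table)

# Observations: two rows share id = 1.
DT <- data.table(id = c(1L, 1L, 2L), price = c(10, 20, 30), key = "id")

# Lookup with a duplicated id = 1: each id = 1 observation matches twice.
lookup <- data.table(id = c(1L, 1L, 2L), yr = c(2006, 2007, 2006), key = "id")

# 2 obs x 2 lookup rows + 1 obs x 1 lookup row = 5 rows, not 3.
nrow(lookup[DT, allow.cartesian = TRUE])

# Deduplicating the lookup key first gives one row per observation again.
nrow(unique(lookup, by = "id")[DT, on = "id"])   # 3
```

Whether deduplicating is the right fix depends on which version/yr row should win for a duplicated id; unique() simply keeps the first occurrence.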

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow