How to pair|group and then join|merge two data.tables?

https://stackoverflow.com/questions/23230788

07-07-2023

Question

I'm diving into the world of data.table and so far I enjoy the syntax, as I find I can do a lot more while writing a lot less. It is a bit exotic at times, however.

Here's one thing I need to figure out--I know how to do joins, such as x[y], but what I need to do is a bit more complex (but still pretty simple!).

Our sales database suffers from many variations of the same rep's name, so I keep a separate list that tells me when two names are actually the same rep. The sales table might have one or two versions of a particular rep's name (usually it's just the first one, but sometimes someone's name was misspelled for the first few months of the year and then corrected).

I'll provide two sample data.tables that I want to combine. I don't know how to get the result I want, but I will also write out what I want to happen.

library(data.table)

DT1 <- data.table(name=c("Bob Smith", "Robert Smith", "Mary Stone", "Maryanne Stone", "Jason Hasberg"),
                  sales=c(12, 15, 23, 10, 11))
DT2 <- data.table(correctname=c("Bob Smith", "Maryanne Stone", "Jason Hasberg"),
                  namechoice1=c("Robert Smith", "Mary Stone", "Jason Hasberg"),
                  namechoice2=c("Bob Smith", "Maryanne Stone", NA))

DT1

             name sales
1:      Bob Smith    12
2:   Robert Smith    15
3:     Mary Stone    23
4: Maryanne Stone    10
5:  Jason Hasberg    11

DT2

      correctname   namechoice1    namechoice2
1:      Bob Smith  Robert Smith      Bob Smith
2: Maryanne Stone    Mary Stone Maryanne Stone
3:  Jason Hasberg Jason Hasberg             NA

So in ENGLISH: if name in DT1 matches either namechoice1 or namechoice2, use correctname for that line item, then sum the sales of all the name variants under that correct name.

(watch out, I threw in a NA for Jason as very often the name doesn't need correcting)

Expected result:

      correctname   sales
1:      Bob Smith      27
2: Maryanne Stone      33
3:  Jason Hasberg      11

I'm hoping for an answer in as few lines as possible, but perhaps some further subsetting is needed before the final sum can be calculated.

Looking forward to your answers, THANK YOU!!

Solution

You need to melt your name map table into long format so that you have one row per alias, with each row also containing the correct name. Then you can just join on the alias and aggregate on the correct name:

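# Reshape to long format: one row per alias, dropping the NA placeholders
DT2.new <- melt(DT2, id.vars="correctname")[!is.na(value), list(correctname, value)]
# Key on the alias so that DT1's first column (name) joins against it
setkey(DT2.new, value)
# Join, then total the sales per correct name
DT2.new[DT1][, sum(sales), by=correctname]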

Produces:

      correctname V1
1:      Bob Smith 27
2: Maryanne Stone 33
3:  Jason Hasberg 11
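
For comparison, here is a minimal sketch of the same melt, join and aggregate steps written with an explicit `on=` join, so that no key has to be set and the summed column is named `sales` as in the expected result. The names `alias_map`, `alias` and `result` are illustrative choices, not part of the original answer, and the `unique()` call is an added precaution against an alias being listed twice for the same rep.

```r
library(data.table)

DT1 <- data.table(name  = c("Bob Smith", "Robert Smith", "Mary Stone",
                            "Maryanne Stone", "Jason Hasberg"),
                  sales = c(12, 15, 23, 10, 11))
DT2 <- data.table(correctname = c("Bob Smith", "Maryanne Stone", "Jason Hasberg"),
                  namechoice1 = c("Robert Smith", "Mary Stone", "Jason Hasberg"),
                  namechoice2 = c("Bob Smith", "Maryanne Stone", NA))

# Long format: one row per (correctname, alias) pair; na.rm drops the NA aliases
alias_map <- unique(
  melt(DT2, id.vars = "correctname", value.name = "alias",
       na.rm = TRUE)[, .(correctname, alias)]
)

# Look up each sales row by its alias, then sum the sales per correct name
result <- alias_map[DT1, on = .(alias = name)][, .(sales = sum(sales)), by = correctname]
result
```

Because the join is driven by DT1, the groups come out in the order in which each rep first appears in DT1.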

Note that the better way to store your aliases is in the long format of DT2.new. Among other things, this lets each person have a different number of aliases, whereas the wide format needs as many alias columns as the employee with the most aliases.
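
As a rough illustration of that point, the sketch below keeps the alias table in long format from the start. The third alias "Rob Smith" and the sample sales rows are made up for this example, and the names `aliases`, `sales_dt` and `totals` are hypothetical.

```r
library(data.table)

# Alias table stored directly in long format: one row per alias.
# "Rob Smith" is a made-up third alias, added only for illustration.
aliases <- data.table(
  correctname = c("Bob Smith", "Bob Smith", "Bob Smith",
                  "Maryanne Stone", "Maryanne Stone", "Jason Hasberg"),
  alias       = c("Bob Smith", "Robert Smith", "Rob Smith",
                  "Maryanne Stone", "Mary Stone", "Jason Hasberg")
)

# Hypothetical sales rows that use some of those spellings
sales_dt <- data.table(name  = c("Bob Smith", "Rob Smith", "Mary Stone", "Jason Hasberg"),
                       sales = c(12, 5, 23, 11))

# Same join-and-aggregate step as before; no reshaping is needed
totals <- aliases[sales_dt, on = .(alias = name)][, .(sales = sum(sales)), by = correctname]
totals
```

Adding another alias for a rep is then a matter of appending one row, with no new column and no NA padding for everyone else.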

Licensed under: CC-BY-SA with attribution
