1a) sqldf with multiple statements. Try this:
library(sqldf)
dfB_s <- sqldf("select distinct * from dfB order by Name, Color")
dfB_g <- sqldf("select Name, group_concat(Color) Color
                from dfB_s
                group by Name")
sqldf("select *
       from dfA
       left join dfB_g using (Name)")
1b) sqldf with one statement. Alternatively, do it all in one statement:
sqldf("select *
       from dfA
       left join
         (select Name, group_concat(Color) Color
          from
            (select distinct * from dfB order by Name, Color)
          group by Name)
       using (Name)")
Either of these gives:
     Name Age          Color
1     Ben  13 Blue,Green,Red
2    Gary  20            Red
3    John   5           <NA>
4 Michael  57   Black,Yellow
2) without packages. Without sqldf, it can be done in base R like this:
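# de-duplicate first, then sort using the de-duplicated frame's own columns
dfB_u <- unique(dfB)
dfB_s <- dfB_u[order(dfB_u$Name, dfB_u$Color), ]
# collapse the colors of each Name into a single string
dfB_g <- aggregate(Color ~ Name, dfB_s, toString)
merge(dfA, dfB_g, all.x = TRUE, by = "Name")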
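For anyone who wants to try the base R approach without the question's data, here is a minimal sketch. The contents of dfA and dfB below are assumptions reconstructed from the output shown earlier (including one duplicated row in dfB), not the original data. Note that toString separates values with ", " whereas group_concat uses "," by default.

```r
# Hypothetical inputs, guessed from the printed result; the real dfA and dfB
# come from the question.
dfA <- data.frame(Name = c("Ben", "Gary", "John", "Michael"),
                  Age = c(13, 20, 5, 57),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Name = c("Ben", "Ben", "Ben", "Ben", "Gary", "Michael", "Michael"),
                  Color = c("Red", "Blue", "Green", "Red", "Red", "Yellow", "Black"),
                  stringsAsFactors = FALSE)

# Drop duplicate rows, then sort the de-duplicated frame by Name and Color
dfB_u <- unique(dfB)
dfB_s <- dfB_u[order(dfB_u$Name, dfB_u$Color), ]

# One row per Name, colors pasted together with ", "
dfB_g <- aggregate(Color ~ Name, dfB_s, toString)

# Left join: every row of dfA is kept, unmatched names get NA
res <- merge(dfA, dfB_g, all.x = TRUE, by = "Name")
print(res)
```

If the separator should match the sqldf output exactly, function(x) paste(x, collapse = ",") can be passed to aggregate in place of toString.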
3) data.table. If speed is an issue, you might want to try data.table:
library(data.table)
unique(data.table(dfB, key = "Name,Color"))[
, toString(Color), by = Name][
data.table(dfA)]
giving:
      Name               V1 Age
1:     Ben Blue, Green, Red  13
2:    Gary              Red  20
3:    John               NA   5
4: Michael    Black, Yellow  57
4) dplyr. And here is a dplyr solution:
library(dplyr)
dfA %>%
  left_join(dfB %>%
              unique() %>%
              arrange(Name, Color) %>%
              group_by(Name) %>%
              summarise(Color = toString(Color)),
            by = "Name")