R - join data frames RDB-style, converting multiple entries from one frame into a single entry (string) in the other

StackOverflow https://stackoverflow.com/questions/23463040

  •  15-07-2023

Question

Sorry in advance for the long post.

Although I managed to solve this with a for loop, I have a feeling sqldf would be more efficient, but I have not been able to get it right so far.

My first data frame has a unique identifier (Name). It is roughly 1000x5, but in the spirit of this:

Name <- c('Ben','Gary','John','Michael')
Age  <- c(13,20,5,57)
dfA  <- as.data.frame(cbind(Name,Age))

dfA
>        Name Age
>   1     Ben  13
>   2    Gary  20
>   3    John   5
>   4 Michael  57

My second data frame does NOT have a unique key and is about 5000x5, but looks generally like this:

Name   <- c('Ben','Ben','Ben','Gary','Michael','Michael','Michael')
Color  <- c('Blue','Red','Green','Red','Yellow','Yellow','Black')
Other.Entries <- c('180','200','150','100','70','200','130')
dfB   <- as.data.frame(cbind(Name,Color))

dfB
>     Name  Color  Other_Entries(not.related)
>1     Ben   Blue   180
>2     Ben    Red   180
>3     Ben  Green   150
>4    Gary    Red   100
>5 Michael Yellow   70
>6 Michael Yellow   200
>7 Michael  Black   130

Notice that Colors can be repeated for a given Name, and that not all Names appear.

My desired output is to:

  1. Retrieve the Colors for each Name in data frame B (removing duplicates, possibly sorted alphabetically)

  2. Convert these few Colors to a string (by using function "toString" for example)

  3. Add the string as a new entry in the first data frame

At first, when I used the for loop, I created a new data frame with an empty column like this

dfCombined <- dfA
dfCombined["Color"] <- NA

... and iterated over all rows, querying the second data frame for each one.

But perhaps this is unnecessary if something cleverer is available.
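For reference, the loop described above might look roughly like the following. This is only a minimal sketch of the approach, not the exact code used: it assumes the columns are plain character vectors (hence `stringsAsFactors = FALSE`) and rebuilds the small example frames so that it runs on its own.

```r
# Example frames from the question, built with character columns (an assumption)
dfA <- data.frame(Name = c("Ben", "Gary", "John", "Michael"),
                  Age  = c(13, 20, 5, 57),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Name  = c("Ben", "Ben", "Ben", "Gary",
                            "Michael", "Michael", "Michael"),
                  Color = c("Blue", "Red", "Green", "Red",
                            "Yellow", "Yellow", "Black"),
                  stringsAsFactors = FALSE)

dfCombined <- dfA
dfCombined$Color <- NA_character_

# For every Name in dfA, look up its Colors in dfB
for (i in seq_len(nrow(dfCombined))) {
  cols <- dfB$Color[dfB$Name == dfCombined$Name[i]]
  if (length(cols) > 0) {
    # drop duplicates, sort alphabetically, collapse into one string
    dfCombined$Color[i] <- toString(sort(unique(cols)))
  }
}

dfCombined
```

Names with no match in dfB (John here) keep NA in the Color column.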

The end result should be:

dfCombined
>     Name Age    Color
>1     Ben  13   Blue, Green, Red
>2    Gary  20   Red
>3    John   5
>4 Michael  57   Black, Yellow

Any suggestions?

Solution

1a) sqldf with multiple statements. Try this:

library(sqldf)

dfB_s <- sqldf("select distinct * from dfB order by Name, Color")
dfB_g <- sqldf("select Name, group_concat(Color) Color 
                from  dfB_s
                group by Name")
sqldf("select * 
       from dfA 
       left join dfB_g using (Name)")

1b) sqldf with one statement. Or all in one:

sqldf("select * 
       from dfA
       left join
             (select Name, group_concat(Color) Color 
             from 
                 (select distinct * from dfB order by Name, Color)
             group by Name)
       using (Name)")

Either of these gives:

     Name Age          Color
1     Ben  13 Blue,Green,Red
2    Gary  20            Red
3    John   5           <NA>
4 Michael  57   Black,Yellow
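The output above separates the colors with a bare comma, whereas the question asked for ", ". SQLite's group_concat accepts an optional second argument as the separator, so a variant along these lines should produce the spaced form. This is a sketch that assumes the sqldf package is installed with its default SQLite backend; note that SQLite does not formally guarantee the concatenation order, although the ordered subquery behaves as expected in practice.

```r
library(sqldf)

# Example frames from the question
dfA <- data.frame(Name = c("Ben", "Gary", "John", "Michael"),
                  Age  = c(13, 20, 5, 57))
dfB <- data.frame(Name  = c("Ben", "Ben", "Ben", "Gary",
                            "Michael", "Michael", "Michael"),
                  Color = c("Blue", "Red", "Green", "Red",
                            "Yellow", "Yellow", "Black"))

# Same query as 1b), with ', ' passed as the group_concat separator
res <- sqldf("select *
              from dfA
              left join
                   (select Name, group_concat(Color, ', ') Color
                    from
                        (select distinct * from dfB order by Name, Color)
                    group by Name)
              using (Name)")
res
```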

2) Without packages. Without sqldf it would be done like this:

dfB_u <- unique(dfB)                                # drop duplicate rows first
dfB_s <- dfB_u[order(dfB_u$Name, dfB_u$Color), ]    # then sort the de-duplicated rows
dfB_g <- aggregate(Color ~ Name, dfB_s, toString)
merge(dfA, dfB_g, all.x = TRUE, by = "Name")

3) data.table. If speed is the issue you might want to try data.table:

library(data.table)

unique(data.table(dfB, key = "Name,Color"))[
           , toString(Color), by = Name][
           data.table(dfA)]

giving:

      Name               V1 Age
1:     Ben Blue, Green, Red  13
2:    Gary              Red  20
3:    John               NA   5
4: Michael    Black, Yellow  57
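In the result above the collapsed colors land in a column called V1. If a column named Color is preferred, one possible variant is sketched below. It assumes a reasonably recent data.table (one that supports the `on =` join argument) and rebuilds the example frames directly as data.tables; the names A, B, B_g and res are only illustrative.

```r
library(data.table)

# Example frames from the question, as data.tables
A <- data.table(Name = c("Ben", "Gary", "John", "Michael"),
                Age  = c(13, 20, 5, 57))
B <- data.table(Name  = c("Ben", "Ben", "Ben", "Gary",
                          "Michael", "Michael", "Michael"),
                Color = c("Blue", "Red", "Green", "Red",
                          "Yellow", "Yellow", "Black"))

# De-duplicate, sort, and collapse the colors per Name into a named column
B_g <- unique(B)[order(Name, Color), .(Color = toString(Color)), by = Name]

# Join onto A, keeping every row of A (John gets NA)
res <- B_g[A, on = "Name"]
res
```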

4) dplyr. And here is a dplyr solution:

library(dplyr)

# %>% replaces the %.% operator, which is no longer available in dplyr
dfA %>% 
   left_join(dfB %>%
                 unique() %>%
                 arrange(Name, Color) %>% 
                 group_by(Name) %>% 
                 summarise(Color = toString(Color)),
             by = "Name")

ADDED other solutions. Fixed some errors.

Other tips

To batch-process this, do it in real code. Pseudocode: pull each name, run a while loop over its colors, load them into an array variable (for example $array = array("foo", "bar", "hello", "world"); var_dump($array);), then run an insert into a new table for each name.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow