dplyr group_by and summarize for two df's with same column name

https://stackoverflow.com/questions/23502523

16-07-2023

Question

Suppose you have the following two data.frames:

set.seed(1)
x <- letters[1:10]
df1 <- data.frame(x)
z <- rnorm(20,100,10)
df2 <- data.frame(x,z)

(Note that both data frames have a column named "x".)

You want to sum df2$z for each group of "x" in df1, like this:

df1 %.%
  group_by(x) %.%
  summarize(
    z = sum(df2$z[df2$x == x]) 
   )

This returns the error "invalid indextype integer" (translated).

But when I change the name of column "x" in any one of the two dfs, it works:

df2 <- data.frame(x1 = x,z) #column is now named "x1", it would also work if the name was changed in df1

df1 %.%
   group_by(x) %.%
   summarize(
     z = sum(df2$z[df2$x1 == x]) 
   )

#   x        z
#1  a 208.8533
#2  b 205.7349
#3  c 185.4313
#4  d 193.8058
#5  e 214.5444
#6  f 191.3460
#7  g 204.7124
#8  h 216.8216
#9  i 213.9700
#10 j 202.8851

I can imagine many situations where two data frames share a column name (an "ID" column, for example), so this could be a problem unless there is a simple way around it.

Did I miss something? There may be other ways to get the same result for this example, but I'm interested in understanding whether this is possible in dplyr (or, if not, why not).

(The two data frames don't necessarily need to have the same unique "x" values as in this example.)

Solution

Following the comment from @beginneR, I'm guessing it'd be something like:

inner_join(df1, df2) %.% group_by(x) %.% summarise(z=sum(z))

Joining by: "x"
Source: local data frame [10 x 2]

   x        z
1  a 208.8533
2  b 205.7349
3  c 185.4313
4  d 193.8058
5  e 214.5444
6  f 191.3460
7  g 204.7124
8  h 216.8216
9  i 213.9700
10 j 202.8851
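
For readers on a more recent dplyr: `%.%` was the package's early chaining operator and was later replaced by `%>%`. A minimal sketch of the same join-then-summarise idea with the newer pipe, assuming a dplyr version that provides `%>%` and accepts an explicit `by` argument (passing `by = "x"` also avoids the "Joining by" message):

```r
library(dplyr)

# Same data as in the question
set.seed(1)
x <- letters[1:10]
df1 <- data.frame(x)
z <- rnorm(20, 100, 10)
df2 <- data.frame(x, z)

# Join on the shared column first, then aggregate z per group of x.
# After the join there is only one "x" column, so the name clash disappears.
result <- inner_join(df1, df2, by = "x") %>%
  group_by(x) %>%
  summarise(z = sum(z))

print(result)
```

Because the join happens before the grouping, `summarise()` only ever sees one data frame, so there is no ambiguity about which "x" is meant.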

OTHER TIPS

You can try:

df2 %.% filter(x %in% df1$x) %.% group_by(x) %.% summarise(sum(z))

hth
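
The filter approach also covers the case mentioned in the question where the two data frames do not share the same set of "x" values. A small sketch with the newer `%>%` pipe, using a hypothetical df1 that contains only half of the groups, and `semi_join()` as an alternative way to express the same filtering:

```r
library(dplyr)

set.seed(1)
# Hypothetical variation: df1 holds only a subset of the groups in df2
df1 <- data.frame(x = letters[1:5])
df2 <- data.frame(x = letters[1:10], z = rnorm(20, 100, 10))

# filter() + %in% keeps only the rows of df2 whose x appears in df1
by_filter <- df2 %>%
  filter(x %in% df1$x) %>%
  group_by(x) %>%
  summarise(z = sum(z))

# semi_join() expresses the same row filtering as a join on "x"
by_semi <- df2 %>%
  semi_join(df1, by = "x") %>%
  group_by(x) %>%
  summarise(z = sum(z))

print(by_filter)
print(by_semi)
```

Both versions should return one row per group present in df1. Naming the summary column (`z = sum(z)`) keeps the output column from being labelled `sum(z)`.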

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow