Remove rows from data frame if values added together is less than x

https://stackoverflow.com/questions/20643361

19-09-2022

Question

I have the following data frame, call it df, which consists of three columns: "Scene," "Name," and "Appearances." For each "Name", I would like to total the "Appearances" values over every row in which that name occurs and divide the total by the number of rows in which the name occurs (in other words, take the mean of "Appearances" per name). Then I want to remove from df all the rows for which that per-name mean is less than 2.

So for example, here in df, every row would be tossed out except John's and Hitler's, whose values are calculated as (2+2)/2=2 and (4+1)/2=2.5.

Scene      Name   Appearances 
112       Hamlet         1  
113       Zyklon         1 
114       Hitler         4  
115  Chamberlain         1  
115       Hitler         1  
117       Gospel         1  
117         John         2  
117      Deussen         1  
118        Plato         1 
118         John         2  
118        Hegel         1  
119      Cankara         1  
120        Freud         1  
121        Freud         1  
122  Petersbourg         1
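
To reproduce the example, df can be rebuilt along these lines. This is only a sketch: the column types (numeric Scene and Appearances, character Name) are assumed from the printout above rather than stated anywhere in the original.

```r
# Rebuild the example data frame shown above.
# Assumption: Scene and Appearances are numeric, Name is character.
df <- data.frame(
  Scene = c(112, 113, 114, 115, 115, 117, 117, 117,
            118, 118, 118, 119, 120, 121, 122),
  Name = c("Hamlet", "Zyklon", "Hitler", "Chamberlain", "Hitler",
           "Gospel", "John", "Deussen", "Plato", "John",
           "Hegel", "Cankara", "Freud", "Freud", "Petersbourg"),
  Appearances = c(1, 1, 4, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

str(df)
```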

I have tried a couple of things, using multiplication instead, but both are mathematically wrong and return errors.

First, I tried to turn df into a two-way table and delete the entries belonging to an infrequent name:

removeinfreqs <- function(df){
x <- table(df$Name, df$Appearances)
d<-df[(df$Name %in% names * df$Appearances)/df$Name %in% names(x[x >= 3]), ]
d
}

but I got an error: "Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments"

I tried the same sort of thing with the subset command:

df_less<-subset(df, df$Name %in% names * df$Appearances/df$Name %in% names >= 3)

But I get the same error: "Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments"

I have very little experience working with data frames in R. How can I perform this operation? Any help is greatly appreciated.

Solution

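First, calculate the mean Appearances value for each Name. ave() returns one value per row (the mean of that row's group), so the result lines up with the rows of df: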

meanAp <- with(df, ave(Appearances, Name, FUN = mean))

Second, keep the rows whose group mean is at least 2:

df[meanAp >= 2, ]

#    Scene   Name Appearances
# 3    114 Hitler           4
# 5    115 Hitler           1
# 7    117   John           2
# 10   118   John           2
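
To see why this works, the two steps can be run end to end on a rebuilt copy of the data. The sketch below assumes the column types from the question's printout (numeric Scene and Appearances, character Name). The tapply() call and the names group_means and kept are not part of the answer above; they are added only to inspect the per-name means.

```r
# Assumption: df is rebuilt from the question's printout.
df <- data.frame(
  Scene = c(112, 113, 114, 115, 115, 117, 117, 117,
            118, 118, 118, 119, 120, 121, 122),
  Name = c("Hamlet", "Zyklon", "Hitler", "Chamberlain", "Hitler",
           "Gospel", "John", "Deussen", "Plato", "John",
           "Hegel", "Cankara", "Freud", "Freud", "Petersbourg"),
  Appearances = c(1, 1, 4, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# One mean per name (illustration only): Hitler is (4 + 1) / 2, John is (2 + 2) / 2.
group_means <- tapply(df$Appearances, df$Name, mean)
print(group_means[c("Hitler", "John", "Freud")])

# ave() repeats each group's mean on every row of that group,
# so meanAp has the same length as df and can index it directly.
meanAp <- ave(df$Appearances, df$Name, FUN = mean)
stopifnot(length(meanAp) == nrow(df))

# Keep only rows whose per-name mean is at least 2.
kept <- df[meanAp >= 2, ]
print(kept)
```

Note that Freud appears twice but has a mean of 1, so both of those rows are dropped; the filter depends on the mean, not on how often a name occurs.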

OTHER TIPS

Here's an alternative with "data.table":

library(data.table)
DT <- data.table(df)

DT[, if(mean(Appearances) >= 2) .SD, by = Name]
#      Name Scene Appearances
# 1: Hitler   114           4
# 2: Hitler   115           1
# 3:   John   117           2
# 4:   John   118           2

(Hat tip to @thelatemail/@mnel.)

Licensed under: CC-BY-SA with attribution
