Merge duplicates and assign value of highest frequency (except neutrals!) in R

https://stackoverflow.com/questions/22295877

12-06-2023
|

Question

i posted a very similar question but i need to change conditions. i have a data.frame full with multiple entries. the columns are "no", "article", and "class" ("p"=positive, "n"=negative, "x"=neutral). it looks like this:

no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "earnings went up.", "massive layoff.", "they moved their offices.", "Mr. X joined the company.", "class action filed.", "accident in warehouse.", "blabla one.", "blabla two.", "blabla three.", "blabla four.", "blabla five.")
class <- c("p","p","n","x","x","n","n","x","p","p","n","p")

mydf <- data.frame(no, article, class)
mydf

#    no                   article class
# 1   3         earnings went up.     p
# 2   3         earnings went up.     p
# 3   5           massive layoff.     n
# 4   5 they moved their offices.     x
# 5   5 Mr. X joined the company.     x
# 6  24       class action filed.     n
# 7  24    accident in warehouse.     n
# 8  35               blabla one.     x
# 9  35               blabla two.     p
# 10 41             blabla three.     p
# 11 41              blabla four.     n
# 12 41              blabla five.     p

i want to get rid of the multiple entries. the articles of the multiple entries should be merged, but only if the articles are NOT the same! then, i want the class with highest frequency to be assigned except for "x". "x" means neutral, so if there is e.g. a duplicate with "x", "p" i still want "p" to be assigned. if there's "n", "x" --> "n" should be assigned. same with other multiple entries. if there's equal frequency of "p" and "n" --> "x" should be assigned.

# examples:
# "p", "x"      --> "p"
# "p", "n"      --> "x" 
# "x", "n", "x" --> "n" 
# "p", "n", "p" --> "p"  

# the resulting data.frame should look like this:

#    no                                                            article  class
# 1   3                                                   earnings went up.     p
# 2   5 massive layoff. they moved their offices. Mr. X joined the company.     n
# 3  24                          class action filed. accident in warehouse.     n
# 4  35                                             blabla one. blabla two.     p
# 5  41                                           blabla four. blabla five.     p

in my old question articles were merged even if they were the same, and the class with highest frequency was assigned ("x", "n", "p" treated the same). if there was no highest frequency, "x" was assigned. helpful approaches were:

library(qdap)
df2 <- with(mydf, sentCombine(article, no))

df2$class <- df2$no %l% vect2df(c(tapply(mydf[, 3], mydf[, 1], function(x){
tab <- table(x)
ifelse(sum(tab %in% max(tab)) > 1, "x", names(tab)[max(tab) == tab])
})))

i tried to change this code but i know too little about how to write functions and about qdap to really understand this.

Solution

How about this with dplyr

require(dplyr) # for aggregation

getclass<-function(class){
  n.n<-length(class[class=="n"])
  n.p<-length(class[class=="p"])
  ret<-"x"                         # return x, unless
  if(n.n>n.p)ret<-"n"              # there are more n's than p's (return p)
  if(n.n<n.p)ret<-"p"              # or more p's than n's (return n)
  return(ret)
}

group_by(mydf,no) %.%
  summarise(article=paste0(unique(article),collapse=" "),class=getclass(class))

Source: local data frame [5 x 3]

  no                                                             article class
1  3                                                   earnings went up.     p
2  5 massive layoff. they moved their offices. Mr. X joined the company.     n
3 24                          class action filed. accident in warehouse.     n
4 35                                             blabla one. blabla two.     p
5 41                             blabla three. blabla four. blabla five.     p

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow