How to combine values in duplicate rows and assign the most frequent value from another column in R?

StackOverflow https://stackoverflow.com/questions/22281985

Question

I've got a data.frame full of duplicates, triplicates and so on. It looks like this:

no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "costs were reduced.", "massive layoff.", "they moved their offices.", "Mr. X joined the company.", "class action filed.", "accident in warehouse.", "blabla one.", "blabla two.", "blabla three.", "blabla four.", "blabla five.")
class <- c("p","p","n","x","x","n","n","x","p","p","x","p")

mydf <- data.frame(no, article, class)
mydf

#    no                   article class
# 1   3         earnings went up.     p
# 2   3       costs were reduced.     p
# 3   5           massive layoff.     n
# 4   5 they moved their offices.     x
# 5   5 Mr. X joined the company.     x
# 6  24       class action filed.     n
# 7  24    accident in warehouse.     n
# 8  35               blabla one.     x
# 9  35               blabla two.     p
# 10 41             blabla three.     p
# 11 41              blabla four.     x
# 12 41              blabla five.     p

Now, for each "no", I want to merge all of its articles into a single string and assign the class with the highest frequency. If no single class has the highest frequency (that is, there is a tie), I want the class "x" to be assigned.

The new data frame should look like this:

#    no                                                            article  class
# 1   3                               earnings went up. costs were reduced.     p
# 2   5 massive layoff. they moved their offices. Mr. X joined the company.     x
# 3  24                          class action filed. accident in warehouse.     n
# 4  35                                             blabla one. blabla two.     x
# 5  41                             blabla three. blabla four. blabla five.     p

How can I do that?

Solution

An approach with plyr:

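# Return the most frequent value of x, or "x" if the top frequency is tied
myfun <- function(x) {
  tab <- table(x)            # frequency of each class in the group
  idx <- max(tab) == tab     # TRUE for every class that reaches the maximum
  if (sum(idx) > 1) 
    "x"                      # more than one class at the maximum: tie
  else
    names(tab)[idx]          # exactly one winner
}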

library(plyr)
ddply(mydf, .(no), summarise,
      article = paste(article, collapse = " "),
      class = myfun(class))

Result:

  no                                                             article class
1  3                               earnings went up. costs were reduced.     p
2  5 massive layoff. they moved their offices. Mr. X joined the company.     x
3 24                          class action filed. accident in warehouse.     n
4 35                                             blabla one. blabla two.     x
5 41                             blabla three. blabla four. blabla five.     p
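
To see how the tie rule behaves, the helper can be called on a few small vectors. This is only an illustrative sketch: the function body is the one from the answer, and the input vectors are made up to mirror the groups in mydf.

```r
# Same helper as in the answer: most frequent value, or "x" on a tie.
myfun <- function(x) {
  tab <- table(x)
  idx <- max(tab) == tab
  if (sum(idx) > 1) "x" else names(tab)[idx]
}

r1 <- myfun(c("p", "p"))       # one class only         -> "p"
r2 <- myfun(c("n", "x", "x"))  # "x" wins 2 to 1        -> "x"
r3 <- myfun(c("x", "p"))       # tie, fallback          -> "x"
r4 <- myfun(c("n", "p"))       # tie without any "x"    -> "x"
print(c(r1, r2, r3, r4))
```

Note that a tie between two classes other than "x" also yields "x", which matches the rule stated in the question.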

OTHER TIPS

Using the qdap package:

library(qdap)
df2 <- with(mydf, sentCombine(article, no))

df2$class <- df2$no %l% vect2df(c(tapply(mydf[, 3], mydf[, 1], function(x){
    tab <- table(x)
    ifelse(sum(tab %in% max(tab)) > 1, "x", names(tab)[max(tab) == tab])
})))

df2

##   no                                                            text.var class
## 1  3                               earnings went up. costs were reduced.     p
## 2  5 massive layoff. they moved their offices. Mr. X joined the company.     x
## 3 24                          class action filed. accident in warehouse.     n
## 4 35                                             blabla one. blabla two.     x
## 5 41                             blabla three. blabla four. blabla five.     p
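
If neither plyr nor qdap is available, roughly the same result can be obtained with base R alone. The following is a minimal sketch, not taken from either answer: it rebuilds mydf from the question, and the helper name top_class is made up for this example.

```r
# Data from the question
no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "costs were reduced.", "massive layoff.",
             "they moved their offices.", "Mr. X joined the company.",
             "class action filed.", "accident in warehouse.", "blabla one.",
             "blabla two.", "blabla three.", "blabla four.", "blabla five.")
class <- c("p", "p", "n", "x", "x", "n", "n", "x", "p", "p", "x", "p")
mydf <- data.frame(no, article, class, stringsAsFactors = FALSE)

# Hypothetical helper: the single most frequent value, or "x" on a tie
top_class <- function(x) {
  tab <- table(x)
  winners <- names(tab)[tab == max(tab)]
  if (length(winners) == 1) winners else "x"
}

# tapply() groups by the sorted unique values of `no`,
# so sort(unique(...)) lines up with its output
res <- data.frame(
  no      = sort(unique(mydf$no)),
  article = as.vector(tapply(mydf$article, mydf$no, paste, collapse = " ")),
  class   = as.vector(tapply(mydf$class, mydf$no, top_class)),
  stringsAsFactors = FALSE
)
print(res)
```

Here the text column keeps the name article rather than text.var, and no extra package needs to be installed.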
Licensed under: CC-BY-SA with attribution