Finding the count of each word in a classification using stringr

https://stackoverflow.com/questions/23374738

r
stringr

12-07-2023

Question

I am trying to match two sets of words against a number of strings. The two sets of words are car and school, and using the stringr package I've set it up to match any instance of a word from either car or school.

library(stringr)
car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
car_match <- str_c(car, collapse = "|")
school_match <- str_c(school, collapse = "|")
df <- data.frame(keyword=c("He drives a Honda", 
                           "He goes to Ohio State", 
                           "He likes Ford and goes to Ohio State"))
df

main <- function(df) {
  df$car <- as.numeric(str_detect(df$keyword, car_match))
  df$school <- as.numeric(str_detect(df$keyword, school_match))
  df
}
main(df)

> main(df)
                               keyword car school
1                    He drives a Honda   1      0
2                He goes to Ohio State   0      1
3 He likes Ford and goes to Ohio State   1      1

Great, that works.

Now, I want to go back and see if I can easily get a count of the frequency for each word within the car and school 'buckets.'

So it should look as follows:

Car        Freq
Honda      1
Chevy      0 
Toyota     0
Ford       1

school     Freq
Michigan    0
Ohio State  2
Missouri    0

Because Honda, which is in the car classification, appears once, it has a frequency count of one. Likewise, Ohio State, which is in the school classification and appears twice, has a frequency of two.

Can anyone help me go from classification matching to finding the frequency of each word within the classification?

I could probably go back and set each word in car as its own str_c and match that way, but I'd like to find a "simpler" route.

Solution

Perhaps something like this:

sapply(car, function(x) sum(str_count(df$keyword, x)))
# Honda  Chevy Toyota   Ford 
#     1      0      0      1 

sapply(school, function(x) sum(str_count(df$keyword, x)))
# Michigan Ohio State   Missouri 
#        0          2          0
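
If you want the result as a two-column table like the one in the question, rather than a named vector, something along these lines might work. This is only a sketch: count_terms() is a hypothetical helper name of my own, not part of stringr, and wrapping each term in fixed() is my assumption that the terms should be matched literally rather than as regular expressions.

```r
library(stringr)

car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
df <- data.frame(keyword = c("He drives a Honda",
                             "He goes to Ohio State",
                             "He likes Ford and goes to Ohio State"))

# count_terms() is a hypothetical helper (not a stringr function).
# For each term, count its occurrences in every string and sum them.
count_terms <- function(terms, text) {
  # fixed() makes stringr treat the term as a literal string, not a regex
  freq <- vapply(terms,
                 function(x) sum(str_count(text, fixed(x))),
                 numeric(1))
  data.frame(term = terms, Freq = unname(freq), stringsAsFactors = FALSE)
}

car_freq <- count_terms(car, df$keyword)
school_freq <- count_terms(school, df$keyword)

car_freq
school_freq
```

Note that this counts substring matches, so a term would also be counted when it appears inside a longer word.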

Other tips

You can use the qdap package to do this task as follows:

library(qdap)
key <- list(
    car = c("Honda", "Chevy", "Toyota", "Ford"),
    school = c("Michigan", "Ohio State", "Missouri")
)

(out <- with(df, termco(keyword, keyword, key, elim.old = FALSE)))
counts(out)

##                                keyword word.count Honda Chevy Toyota Ford Michigan Ohio State Missouri car school
## 1                    He drives a Honda          4     1     0      0    0        0          0        0   1      0
## 2                He goes to Ohio State          5     0     0      0    0        0          1        0   0      1
## 3 He likes Ford and goes to Ohio State          8     0     0      0    1        0          1        0   1      1

colSums(counts(out)[, -1])

## word.count      Honda      Chevy     Toyota       Ford   Michigan Ohio State   Missouri        car     school 
##         17          1          0          0          1          0          2          0          2          2
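
To get from those column totals to one table per bucket, you could index the named vector by the terms in each element of key. A rough sketch in base R follows; the totals vector is copied by hand from the colSums() output above so that the snippet runs without qdap installed, and by_bucket is just an illustrative name.

```r
key <- list(
  car = c("Honda", "Chevy", "Toyota", "Ford"),
  school = c("Michigan", "Ohio State", "Missouri")
)

# Totals copied from the colSums() output above (assumption: in a real
# session you would use colSums(counts(out)[, -1]) directly instead)
totals <- c(word.count = 17, Honda = 1, Chevy = 0, Toyota = 0, Ford = 1,
            Michigan = 0, "Ohio State" = 2, Missouri = 0, car = 2, school = 2)

# Index the named vector by each bucket's terms to split it per classification
by_bucket <- lapply(key, function(terms) totals[terms])

by_bucket$car
by_bucket$school
```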

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow