Find Similar Word Pattern for Factor Variable using R

https://stackoverflow.com/questions/23151453

05-07-2023

Question

I have data with 10,000 observations and a single variable named Com, a factor with 3000 levels. I'm trying to find values of Com that follow a similar pattern and combine them into one level, so that I can analyse them later. The structure (str) of Data is shown below:

> str(Data)
 'data.frame':   10000 obs. of  1 variable:
  $ Com: Factor w/ 3000 levels

Example: Frequency of Com:

> Frequency<-data.frame(Com=c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++", "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING", "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++","PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"), Freq=c(2,3,3,1,2,5,6,2,1,3,4,5))
> Frequency
                                 Com   Freq
1                  C/C++ PROGRAMMING      2
2                 C; C++ PROGRAMMING      3
3                    C++ PROGRAMMING      3
4                                C++      1
5                    PROGRAMMING C++      2
6                                  C      5
7                      C PROGRAMMING      6
8                 C, C++ PROGRAMMING      2
9              PROGRAMMING IN C; C++      1
10             PROGRAMMINGS IN C/C++      3
11              PROGRAMMING IN C/C++      4
12        PROGRAMMING (C, C++, CUDA)      5       # Just add one more situation

I want the result of Frequency to be:

> Frequency
                                 Com   Freq
1                  C/C++ PROGRAMMING     15
2                    C++ PROGRAMMING      6
3                      C PROGRAMMING     11
4         PROGRAMMING (C, C++, CUDA)      5

I could recode the levels of Com by hand to do this. However, the variable has 3000 levels, and going through them one by one would take a long time.

So, is there a way to do this that doesn't take so much time? I have tried looking at "Pattern matching and replacement in R", but still can't solve the problem.

Thanks in advance.

Solution

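You can do this in a few steps using regular expressions. Apply the most specific pattern first; the replacement labels are lower case, so the later patterns do not match them again:

# dat is assumed to be the Frequency data frame from the question
dat$Com <- as.character(dat$Com)   # assigning a label that is not an existing factor level would give NA
dat$Freq <- as.numeric(dat$Freq)
dat$Com[grep('C.*C\\+\\+', dat$Com)] <- 'ccplusplus'   # a C followed later by C++
dat$Com[grep('C\\+\\+', dat$Com)] <- 'cplusplus'        # remaining values that mention C++
dat$Com[grep('C', dat$Com)] <- 'c'                      # remaining values that mention C
tapply(dat$Freq,dat$Com,sum)

# c ccplusplus  cplusplus 
# 11         15          6

These totals are for the first eleven rows of the example. "PROGRAMMING (C, C++, CUDA)" also matches the first pattern, so with that row included ccplusplus becomes 20 unless the CUDA value is given its own label before the other patterns run.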
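With 3000 levels it may be easier to keep the patterns in a table than in a sequence of assignments. The following is a minimal sketch of that idea in base R, not the answerer's code: the rules table, the classify helper, the group labels and the extra CUDA rule are all assumptions made for illustration.

```r
# Example data from the question; stringsAsFactors = FALSE keeps Com as character.
Frequency <- data.frame(
  Com = c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++",
          "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING",
          "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++",
          "PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"),
  Freq = c(2, 3, 3, 1, 2, 5, 6, 2, 1, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical rule table: most specific pattern first, first match wins.
rules <- data.frame(
  pattern = c("CUDA", "C.*C\\+\\+", "C\\+\\+", "C"),
  label   = c("C/C++/CUDA", "C/C++", "C++", "C"),
  stringsAsFactors = FALSE
)

# Give each string the label of the first rule whose pattern matches it.
# Strings that match no rule stay NA.
classify <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(rules))) {
    hit <- is.na(out) & grepl(rules$pattern[i], x)
    out[hit] <- rules$label[i]
  }
  out
}

Frequency$Group <- classify(Frequency$Com, rules)
result <- tapply(Frequency$Freq, Frequency$Group, sum)
print(result)
```

Because the first matching rule wins, the order of the rows in rules carries the logic. The bare "C" pattern matches any value containing a capital C, so on the real data it would probably need to be tightened.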

OTHER TIPS

With the stringr package you can use str_detect: think about what the values you want to group together have in common and test for that. It is probably tedious work, but R cannot read your mind and work out what you consider "similar".

An example:

library(stringr)

df$Com_grouped <- NA_character_

# fixed() matches "C++" literally; as a regex, "C[++]" only means "C followed by one +"
is_c_only <- str_detect(df$Com, "C") & !str_detect(df$Com, fixed("C++"))
df$Com_grouped <- ifelse(is_c_only, "C PROGRAMMING", df$Com_grouped)

Finally, tapply(df$Freq, df$Com_grouped, sum) gives the frequency for each group.
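One way to make "what the values have in common" explicit is to reduce each value to the set of language names it mentions and group on that set. The sketch below uses base R only and is an illustration, not part of the original answers: the vocab vector and the key_for helper are hypothetical, and the approach assumes you can list the tokens that matter.

```r
com <- c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++",
         "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING",
         "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++",
         "PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)")
freq <- c(2, 3, 3, 1, 2, 5, 6, 2, 1, 3, 4, 5)

# Hypothetical vocabulary: the tokens that carry meaning. Everything else
# (PROGRAMMING, IN, punctuation) is treated as noise.
vocab <- c("C", "C++", "CUDA")

# Split each string on runs of characters that are not letters, digits,
# '+' or '#', then keep only the vocabulary tokens, in vocabulary order.
# A string with no vocabulary token gets the empty key "".
key_for <- function(x, vocab) {
  tokens <- strsplit(x, "[^A-Za-z0-9+#]+")
  vapply(tokens,
         function(tk) paste(vocab[vocab %in% tk], collapse = ", "),
         character(1))
}

keys <- key_for(com, vocab)
totals <- tapply(freq, keys, sum)
print(totals)
```

On the example data this yields four groups that correspond to the four rows of the desired result, with the CUDA value kept separate. The work then shifts from writing one pattern per group to maintaining the vocabulary.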

Licensed under: CC-BY-SA with attribution
