Using apply() function to update the factor levels of multiple columns of a data frame in R

StackOverflow https://stackoverflow.com/questions/22979309

  •  30-06-2023

Question

Straight to the question. Say I have the following data frame:

> head(temp)
  Gender Age Agegroup
2   Male  63      61+
3   Male  60    50-60
4   Male  55    50-60
5   Male  36    30-39
7   Male  39    30-39
8   Male  63      61+

Calling a summary function:

> summary(temp)
    Gender            Age         Agegroup     
 Male  :864692   Min.   :25.00   25-29:     0  
 Female:     0   1st Qu.:35.00   30-39:205237  
                 Median :45.00   40-49:235622  
                 Mean   :44.48   50-60:250977  
                 3rd Qu.:54.00   61+  : 68807  
                 Max.   :64.00   

As you can see, there are zero observations for the Female level of Gender and the 25-29 level of Agegroup, so I don't need those levels. I remove them using the following code:

temp$Gender<-factor(temp$Gender)
temp$Agegroup<-factor(temp$Agegroup)

My question is: how would I use one of the apply functions to execute the code I used to remove the levels? Something that would look like:

# Pseudo code just to illustrate my purpose
temp[,c(1,3)]<-apply(temp[,c(1,3)],FUN=factor)

It would be handy when I need to update the levels of many columns of a data frame. Thanks. Let me know if you need more clarification.

Solution

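You're looking for droplevels. Called on a data frame, it drops the unused levels from every factor column and leaves the other columns unchanged.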

Here's some sample data similar to yours:

set.seed(1)
mydf <- data.frame(A = factor(rep("M", 5), levels = c("M", "F")),
                   B = sample(20:50, 5, TRUE))
mydf$C <- cut(mydf$B, seq(0, 80, 10))
mydf
#   A  B       C
# 1 M 28 (20,30]
# 2 M 31 (30,40]
# 3 M 37 (30,40]
# 4 M 48 (40,50]
# 5 M 26 (20,30]
summary(mydf)
#  A           B            C    
#  M:5   Min.   :26   (20,30]:2  
#  F:0   1st Qu.:28   (30,40]:2  
#        Median :31   (40,50]:1  
#        Mean   :34   (0,10] :0  
#        3rd Qu.:37   (10,20]:0  
#        Max.   :48   (50,60]:0  
#                     (Other):0

Now, let's use droplevels and see what happens:

mydf2 <- droplevels(mydf)
summary(mydf2)
#  A           B            C    
#  M:5   Min.   :26   (20,30]:2  
#        1st Qu.:28   (30,40]:2  
#        Median :31   (40,50]:1  
#        Mean   :34              
#        3rd Qu.:37              
#        Max.   :48         
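
To limit this to particular columns, as in the question's pseudo code, one option is to call droplevels on a subset of the data frame and assign the result back. The following is only a minimal sketch: the small `temp` data frame is a made-up stand-in for the asker's data, built so that Female and some age groups have no observations.

```r
# Hypothetical stand-in for the data frame in the question (not the real data)
temp <- data.frame(
  Gender   = factor(c("Male", "Male", "Male"), levels = c("Male", "Female")),
  Age      = c(63, 60, 36),
  Agegroup = factor(c("61+", "50-60", "30-39"),
                    levels = c("25-29", "30-39", "40-49", "50-60", "61+"))
)

# Drop unused levels in columns 1 and 3 only; column 2 is not touched
temp[c(1, 3)] <- droplevels(temp[c(1, 3)])

levels(temp$Gender)    # only the levels that actually occur remain
levels(temp$Agegroup)
```

Since droplevels already ignores non-factor columns, selecting columns like this is only needed when some other factor column should keep its unused levels.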

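If you really want to use an *apply approach, you can use lapply as follows. Note that calling factor on every column also converts the numeric column B to a factor, as the summary shows: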

mydf3 <- mydf                    ## Create a copy of your original just in case
mydf3[] <- lapply(mydf3, factor)
summary(mydf3)
#  A      B           C    
#  M:5   26:1   (20,30]:2  
#        28:1   (30,40]:2  
#        31:1   (40,50]:1  
#        37:1              
#        48:1                   
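
Because factor applied to every column also turns numeric columns into factors, it may be preferable to restrict lapply to the columns that are already factors. A minimal sketch, assuming a data frame shaped like the sample above; `mydf4` and `is_fac` are names introduced here for illustration, and the values of B are typed in by hand rather than sampled:

```r
# Sample data with the same structure as mydf (B values entered by hand)
mydf4 <- data.frame(A = factor(rep("M", 5), levels = c("M", "F")),
                    B = c(28, 31, 37, 48, 26))
mydf4$C <- cut(mydf4$B, seq(0, 80, 10))

# Logical vector marking which columns are factors
is_fac <- vapply(mydf4, is.factor, logical(1))

# Re-create only the factor columns, which discards their unused levels
mydf4[is_fac] <- lapply(mydf4[is_fac], factor)

str(mydf4)
```

Here B stays numeric, while A and C keep only the levels that occur in the data.
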
Licensed under: CC-BY-SA with attribution