How to recognize if there are any categorical variables that should be encoded as factors in a data.frame? [closed]

StackOverflow https://stackoverflow.com/questions/21869156

  •  13-10-2022

Question

I have the following data.frame. How can I tell whether it contains any categorical variables that should be encoded as factors?

  YEAR  PBE  CBE  PPO  CPO  PFO DINC  CFO RDINC RFP
1  1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9  68.5 877
2  1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1  69.6 899
3  1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9  70.2 883
4  1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9  71.9 884
5  1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1  75.2 895
6  1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7  68.3 874
7  1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0  64.0 791
8  1932 79.0 46.0 49.5 69.7 42.8 31.4 87.8  53.9 733
9  1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0  53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1  58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3  63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5  70.5 845
13 1937 73.0 54.4 62.2 55.0 52.1 44.5 90.4  72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6  67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8  73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5  77.6 798
17 1941 56.0 60.0 43.9 67.4 52.2 56.3 97.5  89.5 830

Is this a correct answer?

Yes! factor(beef$PBE) has 14 levels, factor(beef$PPO) has 16 levels, and factor(beef$CFO) has 15 levels; the rest cannot be encoded as factors because each has the full 17 levels.
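The level counts quoted above can be checked without building a factor for each column. The following is a minimal sketch, assuming the data.frame is called beef as in the question; only three of its columns are retyped here by hand so that the snippet runs on its own. A count of distinct values is only a hint and does not by itself say whether a column is categorical.

```r
# Three columns copied from the table in the question (beef is the
# name the question uses; the other columns are left out for brevity).
beef <- data.frame(
  YEAR = 1925:1941,
  PBE  = c(59.7, 59.7, 63.0, 71.0, 71.0, 74.2, 72.1, 79.0, 73.1, 70.2,
           82.2, 68.4, 73.0, 70.2, 67.8, 63.4, 56.0),
  CFO  = c(90.9, 92.1, 90.9, 90.9, 91.1, 90.7, 90.0, 87.8, 88.0, 89.1,
           87.3, 90.5, 90.4, 90.6, 93.8, 95.5, 97.5)
)

# Number of distinct values in each column
n_distinct <- vapply(beef, function(col) length(unique(col)), integer(1))
n_distinct

# Share of rows carrying a distinct value; a share close to 1 points
# to a continuous measurement rather than a category
round(n_distinct / nrow(beef), 2)
```

Here PBE and CFO repeat a value only by coincidence: almost every row still has its own value, which is what one would expect from continuous measurements.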


Solution 2

Spacedman makes some very good points: in general it is not desirable to create factors from numeric data willy-nilly. For visualization or some modeling approaches it can be useful, though. I use the utility function below to replace columns of a data.frame that have few distinct entries with (ordered) factors; an example follows it:

make_factors <- function(data, max_levels=15) {
    # Convert all columns in <data> that are not already factors
    # and that have at most <max_levels> distinct values into factors.
    # If the column is numeric, it becomes an ordered factor.

    stopifnot(is.data.frame(data))
    for(n in names(data)){
        if(!is.factor(data[[n]]) && 
                length(unique(data[[n]])) <= max_levels) {
            data[[n]] <- if(!is.numeric(data[[n]])){
                 as.factor(data[[n]])
            } else {
                 ordered(data[[n]])
            }    
        }
    }
    data
}


# create a dataset with one numeric column <foo> with few distinct entries
# and one character column <baz> with few distinct entries:
data <- iris
data <- within(data, {
     foo <- round(iris[, 1])
     baz <- as.character(foo)
})   


table(data$foo)

## 4  5  6  7  8 
## 5 47 68 24  6 

str(data)

## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ baz         : chr  "5" "5" "5" "5" ...
##  $ foo         : num  5 5 5 5 5 5 5 5 4 5 ...

str(make_factors(data))

## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ baz         : Factor w/ 5 levels "4","5","6","7",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ foo         : Ord.factor w/ 5 levels "4"<"5"<"6"<"7"<..: 2 2 2 2 2 2 2 2 1 2 ...

Other tips

Your exact question is: "How can I recognize if there is any categorical variable that should be encoded as factors in a data.frame?". Use of "should" here is the crucial part. Why do we encode things as factors anyway?

Factors are used when data are restricted to a number of discrete levels, such as "red", "orange", or "green" for the colour of a traffic light (yes, some countries have "red+orange" or "flashing orange" as well; add these to the levels of your factor). An ordered factor is used when data are categorical but have a defined order, such as "small", "medium", "large", or "extra large".
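As a small illustration of the difference between the two kinds of factor, here is a sketch using the traffic-light and size examples from the paragraph above. The variable names light and size and the sample values are made up for the example.

```r
# Unordered factor: a fixed set of levels with no inherent ranking
light <- factor(c("red", "green", "orange", "red"),
                levels = c("red", "orange", "green"))
levels(light)
is.ordered(light)

# Ordered factor: the order of the levels is part of the data
size <- factor(c("medium", "small", "extra large", "large"),
               levels = c("small", "medium", "large", "extra large"),
               ordered = TRUE)
is.ordered(size)

# Comparisons follow the level order, not alphabetical order
size < "large"
```

The comparison in the last line is only defined for the ordered factor; R returns NA with a warning if the same is tried on light.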

If your data are numbers, they should most likely remain numbers, unless they are already a numerical coding for an underlying category (e.g. 1=male, 2=female). There are very few reasons to convert anything else numeric into factors, unless you have only a few values and statistical analysis using categorical methods makes more sense than continuous numerical methods.
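For the case where numbers are just codes for categories, the conversion could look like the sketch below. The vector sex_code and its values are hypothetical; the 1=male, 2=female coding is the one mentioned above.

```r
# Hypothetical column in which 1 and 2 are codes, not quantities
sex_code <- c(1, 2, 2, 1, 2)

# Map each code to a labelled level
sex <- factor(sex_code, levels = c(1, 2), labels = c("male", "female"))
sex
table(sex)

# Arithmetic on the raw codes runs without complaint but means nothing,
# which is one reason to encode such a column as a factor
mean(sex_code)
```

None of the columns in the question's table looks like a coding of this kind, which is why they are better left numeric.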

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow