Question

Suppose my dataset is a 100 x 3 matrix filled with categorical variables. I would like to do binary classification on the response variable. Let's make up a dataset with the following code:

set.seed(2013)
y <- as.factor(round(runif(n=100,min=0,max=1),0))
var1 <- rep(c("red","blue","yellow","green"),each=25)
var2 <- rep(c("shortest","short","tall","tallest"),25)
df <- data.frame(y,var1,var2)

The data looks like this:

> head(df)
  y var1     var2
1 0  red shortest
2 1  red    short
3 1  red     tall
4 1  red  tallest
5 0  red shortest
6 1  red    short

I tried random forest and AdaBoost on this data with two different approaches. The first approach is to use the data as it is:

> library(randomForest)
> randomForest(y~var1+var2,data=df,ntree=500)

Call:
 randomForest(formula = y ~ var1 + var2, data = df, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 44%
Confusion matrix:
   0  1 class.error
0 29 22   0.4313725
1 22 27   0.4489796

----------------------------------------------------
> library(ada)
> ada(y~var1+var2,data=df)

Call:
ada(y ~ var1 + var2, data = df)

Loss: exponential Method: discrete   Iteration: 50 

Final Confusion Matrix for Data:
          Final Prediction
True value  0  1
         0 34 17
         1 16 33

Train Error: 0.33 

Out-Of-Bag Error:  0.33  iteration= 11 

Additional Estimates of number of iterations:

train.err1 train.kap1 
        10         16 

The second approach is to transform the dataset into wide format and treat each category as a variable. The reason I am doing this is that my actual dataset has 500+ levels in var1 and var2, and as a result, each tree partition always divides those 500 categories into just 2 groups. A lot of information is lost that way. To transform the data:

id <- 1:100
library(reshape2)
tmp1 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var1,fun.aggregate=length)
tmp2 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var2,fun.aggregate=length)
df2 <- merge(tmp1,tmp2,by=c("id","y"))

The new data looks like this:

> head(df2)
   id y blue green red yellow short shortest tall tallest
1   1 0    0     0   2      0     0        2    0       0
2  10 1    0     0   2      0     2        0    0       0
3 100 0    0     2   0      0     0        0    0       2
4  11 0    0     0   2      0     0        0    2       0
5  12 0    0     0   2      0     0        0    0       2
6  13 1    0     0   2      0     0        2    0       0

I apply random forest and AdaBoost to this new dataset:

> library(randomForest)
> randomForest(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2,ntree=500)

Call:
 randomForest(formula = y ~ blue + green + red + yellow + short +      shortest + tall + tallest, data = df2, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 39%
Confusion matrix:
   0  1 class.error
0 32 19   0.3725490
1 20 29   0.4081633

----------------------------------------------------
> library(ada)
> ada(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2)
Call:
ada(y ~ blue + green + red + yellow + short + shortest + tall + 
tallest, data = df2)

Loss: exponential Method: discrete   Iteration: 50 

Final Confusion Matrix for Data:
          Final Prediction
True value  0  1
         0 36 15
         1 20 29

Train Error: 0.35 

Out-Of-Bag Error:  0.33  iteration= 26 

Additional Estimates of number of iterations:

train.err1 train.kap1 
         5         10 

The results from the two approaches are different. The difference becomes more obvious as we introduce more levels in each variable, i.e., var1 and var2. My question is: since we are using exactly the same data, why are the results different? How should we interpret the results from both approaches? Which is more reliable?


Solution

While these two models look identical, they are fundamentally different from one another. In the second model, you implicitly include the possibility that a given observation may have multiple colors and multiple heights. The correct choice between the two formulations depends on the characteristics of your real-world observations: if the categories are mutually exclusive (i.e., each observation has exactly one color and one height), the first formulation is the right one to use. However, if an observation may be both blue and green, or any other combination of colors, you may use the second formulation. Judging from your original data, the first one seems most appropriate (how would an observation have multiple heights?).
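To make this concrete, here is a minimal sketch (with made-up rows, not drawn from your data) of what each encoding can and cannot express:

# Factor encoding: a single column, so each observation gets exactly one color.
factor_row <- data.frame(var1 = factor("red",
                                       levels = c("blue", "green", "red", "yellow")))

# Indicator encoding: one column per color, so nothing prevents an
# observation from being blue and green at the same time.
multi_color_row <- data.frame(blue = 1, green = 1, red = 0, yellow = 0)

The factor column simply cannot represent the second row, which is why the two datasets are not interchangeable even though they were derived from the same observations.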

Also, why did you code your indicator columns in df2 as 0s and 2s instead of 0/1? I wonder if that has any impact on the fit, depending on whether the columns are coded as factors or as numeric.
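For what it's worth, the 2s look like double-counts from the melt/dcast aggregation (each id contributes two molten rows). If you want clean 0/1 indicators, one possible route (a sketch, not your original pipeline) is model.matrix:

# Build one 0/1 indicator column per level; the "- 1" drops the
# intercept so no level of the factor is left out.
color_ind  <- model.matrix(~ var1 - 1, data = df)   # blue/green/red/yellow
height_ind <- model.matrix(~ var2 - 1, data = df)   # short/shortest/tall/tallest
df3 <- data.frame(y = df$y, color_ind, height_ind)
head(df3)

For threshold-based tree splits, a 0/2 column should partition the data exactly as a 0/1 column does, but 0/1 is the conventional and less error-prone coding.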

Licensed under: CC-BY-SA with attribution