Question

I am trying to analyse the influence of two independent variables (say x1 and x2) on a binary dependent variable y. When I use rpart or information gain, the results suggest that x1 is more important than x2 (the tree splits on x1 first, and information.gain is also larger for x1). However, when I fit glm on y ~ x1 + x2, x2 is highly significant and x1 is not. Can anyone explain why this happens, and which result I should use? Thanks!


Solution

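Trees and logistic regression follow different algorithms and measure different things, so it is entirely possible for them to give different results. A tree ranks a variable by how well a threshold split on it separates the classes, whereas the glm significance test asks whether a variable has a linear effect on the log-odds after adjusting for the other variable.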

Such a disagreement often suggests one or both of the following:

  1. x1 and x2 are correlated.
  2. Neither x1 nor x2 is a strong predictor of y.
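A quick way to check the first point is to look at the correlation between the predictors and to compare a one-variable fit with the joint fit. The sketch below uses simulated stand-in data (the data frame `d`, the coefficients, and the seed are all assumptions for illustration, not your data): y is generated from x2 only, and x1 is merely correlated with x2.

```r
# Simulated stand-in data; replace `d` with your own data frame.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)      # x2 is built to correlate with x1
y  <- rbinom(n, 1, plogis(x2))       # y is generated from x2 only
d  <- data.frame(y = y, x1 = x1, x2 = x2)

# Pairwise correlation between the two predictors
r <- cor(d$x1, d$x2)
r

# x1 on its own vs. x1 adjusted for x2
fit_x1   <- glm(y ~ x1,      data = d, family = binomial)
fit_both <- glm(y ~ x1 + x2, data = d, family = binomial)

coef(summary(fit_x1))     # marginal effect of x1
coef(summary(fit_both))   # effect of x1 after adjusting for x2
```

If x1 looks significant on its own but loses significance once x2 is in the model, the two variables are likely carrying overlapping information, and the glm is only reporting what x1 adds beyond x2.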

If they are correlated, you can use PCA, or a similar technique, to replace them with uncorrelated components. Otherwise, which model to use depends on your data. You can use the training-testing set methodology: fit both models on a training set, compare their predictions on a held-out test set, and go with the model that gives the better fit.
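Both suggestions can be sketched roughly as follows. This is only an illustration on simulated data: the 70/30 split, the 0.5 cut-off, the seed, and the use of accuracy as the comparison metric are assumptions you may want to change.

```r
library(rpart)

# Simulated stand-in data; replace `d` with your own data frame.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x2))
d  <- data.frame(y = factor(y), x1 = x1, x2 = x2)

# 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# PCA on the training predictors: the component scores are uncorrelated
pca    <- prcomp(train[, c("x1", "x2")], center = TRUE, scale. = TRUE)
pc_cor <- cor(pca$x[, 1], pca$x[, 2])   # numerically zero

# Fit both models on the training set only
glm_fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
tree_fit <- rpart(y ~ x1 + x2, data = train, method = "class")

# Predict on the held-out test set
glm_prob  <- predict(glm_fit, newdata = test, type = "response")
glm_pred  <- ifelse(glm_prob > 0.5, "1", "0")
tree_pred <- as.character(predict(tree_fit, newdata = test, type = "class"))

# Test-set accuracy of each model
acc_glm  <- mean(glm_pred  == as.character(test$y))
acc_tree <- mean(tree_pred == as.character(test$y))
c(glm = acc_glm, tree = acc_tree)
```

Accuracy is the simplest metric to compare; if your classes are imbalanced, something like AUC or log-loss on the test set may be a fairer comparison.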

Just remember that single trees are noisy: small changes in the data can change which variable is split on first. A random forest (randomForest package) may be a better model.
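A minimal random forest sketch, again on simulated stand-in data (the data, the seed, and `ntree = 500` are assumptions for illustration):

```r
library(randomForest)

# Simulated stand-in data; replace `d` with your own data frame.
set.seed(7)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
y  <- factor(rbinom(n, 1, plogis(1.5 * x2)))   # factor response => classification
d  <- data.frame(y = y, x1 = x1, x2 = x2)

# importance = TRUE also computes permutation-based importance
rf <- randomForest(y ~ x1 + x2, data = d, ntree = 500, importance = TRUE)

print(rf)                # includes the out-of-bag (OOB) error estimate
imp <- importance(rf)    # per-variable importance measures
imp
```

Keep in mind that forest importance measures can also share credit between correlated predictors, so they are worth reading alongside the correlation check rather than instead of it.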

Licensed under: CC-BY-SA with attribution