Tree and logistic regression give different results [closed]

https://stackoverflow.com/questions/22176686

03-06-2023

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

This question does not appear to be about programming within the scope defined in the help center.

Closed 9 years ago.

I am trying to analyse the influence of two independent variables (say x1 and x2) on a binary dependent variable y. When I use rpart or information gain, the result shows that x1 is more important than x2 (the tree splits on x1 first, and information.gain is also larger for x1). But when I use glm on y ~ x1 + x2, the result shows that x2 is highly significant and x1 is not significant. Can anyone help explain the reason, and which one I should use? Thanks!

Solution

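Trees and logistic regression follow different algorithms, so it is entirely possible for them to give different results. The first split of a tree (like information gain) scores each variable on its own, while each glm coefficient measures the effect of one variable with the other held fixed.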

A disagreement like this usually implies one of the following:

x1 and x2 are correlated.
Neither x1 nor x2 is a good predictor of y.
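
Both possibilities can be checked directly. The following is a minimal sketch on simulated data, not the asker's data: the sample size, the coefficients, and the choice to make y depend on x2 only are assumptions made purely for illustration. It measures the correlation between the predictors and then puts the glm coefficient table next to the rpart variable importance.

```r
library(rpart)  # recommended package, shipped with standard R installations

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # assumption: x2 is built to correlate with x1
y  <- rbinom(n, 1, plogis(x2))        # assumption: y depends on x2 only
d  <- data.frame(y = factor(y), x1 = x1, x2 = x2)

# 1. How strongly are the two predictors correlated?
r <- cor(d$x1, d$x2)
print(r)

# 2. Logistic regression: each coefficient is adjusted for the other predictor
fit_glm <- glm(y ~ x1 + x2, data = d, family = binomial)
coefs   <- summary(fit_glm)$coefficients
print(coefs)

# 3. Classification tree: splits are chosen one variable at a time
fit_tree <- rpart(y ~ x1 + x2, data = d, method = "class")
print(fit_tree$variable.importance)   # NULL if the tree made no split
```

If the correlation is high, a variable can look useful in the tree simply because it stands in for the other one, while glm attributes the effect to whichever variable explains y once both are in the model.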

If they are correlated, use PCA, or a similar technique, to replace them with uncorrelated components. Otherwise, which model to use depends on your data. You can split the data into a training set and a test set, fit both models on the training set, and go with the one that predicts the test set better.
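
A rough sketch of that workflow follows, again on simulated data. The 70/30 split, the 0.5 classification threshold, and the data-generating coefficients are all assumptions for illustration; accuracy is used as the comparison metric only because it is the simplest one.

```r
library(rpart)

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)                    # correlated predictors (assumption)
d  <- data.frame(y  = factor(rbinom(n, 1, plogis(x1 + x2))),
                 x1 = x1, x2 = x2)

# Hold out 30% of the rows for testing
test_idx <- sample(n, size = 0.3 * n)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# PCA on the training predictors: the component scores are uncorrelated
pca    <- prcomp(train[, c("x1", "x2")], center = TRUE, scale. = TRUE)
pc_cor <- cor(pca$x[, 1], pca$x[, 2])                  # numerically zero

# Fit both models on the training set only
fit_glm  <- glm(y ~ x1 + x2, data = train, family = binomial)
fit_tree <- rpart(y ~ x1 + x2, data = train, method = "class")

# Predict classes for the held-out rows
p_glm     <- predict(fit_glm, newdata = test, type = "response")
pred_glm  <- ifelse(p_glm > 0.5, "1", "0")
pred_tree <- as.character(predict(fit_tree, newdata = test, type = "class"))

# Compare test-set accuracy
acc_glm  <- mean(pred_glm  == as.character(test$y))
acc_tree <- mean(pred_tree == as.character(test$y))
print(c(glm = acc_glm, tree = acc_tree))
```

A single split can be lucky or unlucky, so repeating the split several times (or using cross-validation) gives a steadier comparison.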

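Just remember that single trees are noisy: a small change in the data can change which variable is chosen for the first split. A random forest (randomForest package), which averages many trees, may be a better model.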
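
One way to see this instability is to refit the tree on bootstrap resamples and count which variable ends up at the root. The sketch below does that on simulated data (the data-generating setup and the 50 resamples are assumptions), and then fits a random forest only if the randomForest package happens to be installed.

```r
library(rpart)

set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)                    # correlated predictors (assumption)
d  <- data.frame(y  = factor(rbinom(n, 1, plogis(0.5 * x1 + 0.5 * x2))),
                 x1 = x1, x2 = x2)

# Refit the tree on bootstrap resamples and record the root split variable
root_var <- replicate(50, {
  boot <- d[sample(n, replace = TRUE), ]
  fit  <- rpart(y ~ x1 + x2, data = boot, method = "class")
  as.character(fit$frame$var[1])                       # "<leaf>" if no split was made
})
root_counts <- table(root_var)
print(root_counts)

# Optional: a random forest averages over many such trees
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(y ~ x1 + x2, data = d, ntree = 500)
  print(randomForest::importance(rf))
}
```

If the root variable flips between x1 and x2 across resamples, the order of splits in any single tree should not be read as a firm ranking of the two variables.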

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow