Question

I am trying to analyse the influence of two independent variables (say x1 and x2) on a binary dependent variable y. When I use rpart or information gain, the result shows that x1 is more important than x2 (the tree splits on x1 first, and information.gain is also larger for x1). But when I fit glm on y ~ x1 + x2, the result shows that x2 is highly significant and x1 is not significant. Can anyone explain the reason, and tell me which result I should use? Thanks!

Solution

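Because trees and logistic regression follow different algorithms, it is entirely possible that they will give different results. A tree (and information gain) ranks each variable by how well it separates y on its own, while the p-values from glm measure what each variable contributes once the other is already in the model.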

A disagreement like this usually implies one of the following:

  1. x1 and x2 are correlated.
  2. Neither x1 nor x2 is a good predictor of y.
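A quick way to check both points is to look at the correlation between the predictors and then put the two fits side by side. This is only a minimal sketch on simulated data: the data frame `df`, the seed, and the coefficients are made up for illustration and are not from the question.

```r
library(rpart)

# Simulated stand-in for the real data (an assumption, not the asker's data):
# x2 is built from x1, so the two predictors are correlated by construction.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y  <- rbinom(n, 1, plogis(1.5 * x2))
df <- data.frame(y = factor(y), x1 = x1, x2 = x2)

# Point 1: how strongly are the predictors correlated?
cor(df$x1, df$x2)

# Tree view: importance is accumulated over primary and surrogate splits
tree_fit <- rpart(y ~ x1 + x2, data = df, method = "class")
tree_fit$variable.importance

# GLM view: each coefficient is tested with the other variable in the model
glm_fit <- glm(y ~ x1 + x2, data = df, family = binomial)
summary(glm_fit)$coefficients
```

If the correlation is high, the tree's importance ranking and the glm p-values are answering different questions, so a disagreement between them is not surprising.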

If they are correlated, you can use PCA, or a similar technique, to replace them with uncorrelated components. Otherwise, which model to use depends on your data. Split the data into a training set and a test set, fit both models on the training set, and go with the one that predicts the test set better.
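The train/test comparison could look roughly like this. It is a sketch under assumptions: the data are simulated, the 70/30 split and the 0.5 cutoff are arbitrary choices, and accuracy is just one possible metric.

```r
library(rpart)

# Simulated stand-in for the real data (an assumption for illustration)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
df <- data.frame(y = factor(rbinom(n, 1, plogis(1.5 * x2))), x1 = x1, x2 = x2)

# Hold out 30% of the rows as a test set
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit both models on the training rows only
glm_fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
tree_fit <- rpart(y ~ x1 + x2, data = train, method = "class")

# Predict classes on the held-out rows
glm_prob  <- predict(glm_fit, newdata = test, type = "response")
glm_pred  <- ifelse(glm_prob > 0.5, "1", "0")
tree_pred <- as.character(predict(tree_fit, newdata = test, type = "class"))

# Share of test rows classified correctly by each model
acc_glm  <- mean(glm_pred == as.character(test$y))
acc_tree <- mean(tree_pred == as.character(test$y))
c(glm = acc_glm, tree = acc_tree)
```

With a single split the result depends on which rows land in the test set, so repeating it with several seeds (or using cross-validation) gives a steadier comparison.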

Just remember that single trees are noisy: a small change in the data can change which variable is split on first. A random forest (randomForest package) may be a better model.
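For reference, a random forest fit might be sketched as follows. The data are again simulated, and `ntree = 500` is simply the package default written out, not a tuned value.

```r
library(randomForest)

# Simulated stand-in for the real data (an assumption for illustration)
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
df <- data.frame(y = factor(rbinom(n, 1, plogis(1.5 * x2))), x1 = x1, x2 = x2)

# y is a factor, so randomForest fits a classification forest
rf_fit <- randomForest(y ~ x1 + x2, data = df, ntree = 500, importance = TRUE)

rf_fit              # printing the fit shows the out-of-bag error estimate
importance(rf_fit)  # per-variable importance, averaged over all trees
```

Because the importance is averaged over many trees grown on bootstrap samples, it is less sensitive to small changes in the data than the first split of a single rpart tree.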

License: CC-BY-SA with attribution
Not affiliated with StackOverflow