Question

I am working on a classification problem. I have a dataset containing an equal number of categorical and continuous variables. How will I know which technique to use: a decision tree or logistic regression?

Is it right to assume that logistic regression is more suitable for continuous variables, and a decision tree more suitable for a mix of continuous and categorical variables?

Solution

Long story short: do what @untitledprogrammer said, try both models and cross-validate to help pick one.

Both decision trees (depending on the implementation, e.g. C4.5) and logistic regression should be able to handle continuous and categorical data just fine. For logistic regression, you'll want to dummy code your categorical variables.
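As a minimal sketch of the dummy-coding step, here is one common way to do it with pandas (the column names and toy values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with one categorical and one continuous feature.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "height": [1.2, 3.4, 2.2, 0.9],
})

# One-hot (dummy) encode the categorical column. drop_first=True avoids
# perfect collinearity among the dummies, which matters for logistic
# regression; the continuous column is left untouched.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded.columns.tolist())
```

Tree implementations like scikit-learn's also expect numeric input, so the same encoding is often used there as well, though trees don't need `drop_first`.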

As @untitledprogrammer mentioned, it's difficult to know a priori which technique will be better based simply on the types of features you have, continuous or otherwise. It really depends on your specific problem and the data you have. (See No Free Lunch Theorem)

You'll want to keep in mind, though, that a logistic regression model searches for a single linear decision boundary in your feature space, whereas a decision tree essentially partitions your feature space into rectangular regions using axis-aligned splits. The net effect is a non-linear decision boundary, possibly made up of many segments.

This is nice when your data points aren't easily separated by a single hyperplane, but on the other hand, decision trees are so flexible that they can be prone to overfitting. To combat this, you can try pruning. Logistic regression tends to be less susceptible (but not immune!) to overfitting.
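To make the pruning suggestion concrete, here is a sketch using scikit-learn's cost-complexity pruning (`ccp_alpha`); the dataset and the alpha value are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree grows until its leaves are pure, often overfitting.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 penalizes tree size, collapsing weak subtrees.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```

In practice you would tune `ccp_alpha` by cross-validation rather than picking a value by hand.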

Lastly, another thing to consider is that decision trees can automatically take into account interactions between variables, e.g. $xy$ if you have two independent features $x$ and $y$. With logistic regression, you'll have to manually add those interaction terms yourself.
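A quick sketch of adding that interaction term manually, using scikit-learn's `PolynomialFeatures` (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two independent features x and y, one row per sample.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# interaction_only=True adds only the cross term x*y,
# not the squared terms x^2 and y^2.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X)  # columns: x, y, x*y
print(X_interact)
```

The transformed matrix can then be fed to `LogisticRegression` like any other feature matrix.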

So you have to ask yourself:

  • what kind of decision boundary makes more sense in your particular problem?
  • how do you want to balance bias and variance?
  • are there interactions between your features?

Of course, it's always a good idea to just try both models and do cross-validation. This will help you find out which one is more likely to have better generalization error.
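The try-both-and-cross-validate advice can be sketched in a few lines with scikit-learn; the dataset here is an arbitrary built-in stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Scaling helps logistic regression converge; trees don't need it.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validated accuracy for each model on the same data.
logreg_scores = cross_val_score(logreg, X, y, cv=10)
tree_scores = cross_val_score(tree, X, y, cv=10)
print(f"logistic regression: {logreg_scores.mean():.3f}")
print(f"decision tree:       {tree_scores.mean():.3f}")
```

Whichever model scores better across the held-out folds is the one more likely to generalize to new data.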

OTHER TIPS

Try both logistic regression and decision trees. Compare the performance of each technique using 10-fold cross-validation and stick with the one that scores better. It would be difficult to judge which method is a better fit just from knowing that your dataset contains continuous and/or categorical variables.

It really depends on the structure of the underlying distribution of your data. If you have strong reason to believe that the data approximate a Bernoulli distribution, multinomial logistic regression will perform well and give you interpretable results. However, if nonlinear structures exist in the underlying distribution, you should seriously consider a nonparametric method.

While you could use a decision tree as your nonparametric method, you might also consider looking into a random forest. This essentially builds a large number of individual decision trees from subsets of the data, and the final classification is the aggregated vote of all the trees. A random forest also gives you an idea of the share each predictor variable contributes to the response.
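As a sketch of that last point, scikit-learn's `RandomForestClassifier` exposes a `feature_importances_` attribute that serves as a rough measure of each predictor's contribution (the dataset is again an arbitrary built-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Importances sum to 1; a higher value means the feature was used
# in more informative splits across the ensemble.
for name, imp in sorted(zip(data.feature_names, forest.feature_importances_),
                        key=lambda t: t[1], reverse=True)[:3]:
    print(f"{name}: {imp:.3f}")
```

Keep in mind that impurity-based importances can be biased toward high-cardinality features; permutation importance is a common alternative check.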

Another factor to keep in mind is interpretability. If you are just trying to classify data, then you probably don't care about the underlying relationships between explanatory and response variables. However, if you are at all interested in interpretability, multinomial logistic regression is much easier to interpret. Parametric methods in general, because they make assumptions about the underlying distribution, yield more intuitively interpretable relationships.

Whether you need to discretize continuous variables for a decision tree depends on the algorithm: some older ones (e.g. ID3) require categorical inputs, but most modern implementations (e.g. C4.5, CART) handle continuous variables directly by splitting on thresholds.

One more thing: logistic regression outputs a probability of class membership, which is useful when you need a probability estimate rather than just a hard class label.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange