Question

While reading the Ensemble methods page in the scikit-learn docs, I came across this statement:

bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).

But searching on Google always returns information about decision trees in general.

  1. I'd like to know the details of the two kinds of trees mentioned in the doc: what do fully developed and shallow mean?

Update:

  1. About why bagging works best with fully developed trees and why boosting works best with shallow ones.

    At first, I thought that complex models (e.g., fully developed decision trees) meant that a data set with a complex format is what is called a fully developed decision tree.

    After reading the quote above more than 20 times, together with rapaio's answer, I realize my poor English led me down the wrong road (a misunderstanding). I also misread shallow as shadow, which confused me for a long time. Now I understand the meanings of fully developed and shallow.

    I think the quote is saying that bagging works best with models (already trained) whose algorithm is complex, while boosting only needs simple models. Both bagging and boosting need many estimators, e.g. n_estimators=100 in the scikit-learn examples.

    If n_estimators=100:

    • bagging needs 100 fully developed decision tree estimators (models)
    • boosting needs 100 shallow decision tree estimators (models)

    Are my thoughts right? (A minimal scikit-learn sketch of this setup follows after this list.) I hope my update can help other non-native speakers.

  2. e.g. means for example, so other models can be used for bagging and boosting. What about changing the model to an SVM or something else? Or do both of them need a tree-based model?
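Below is a minimal scikit-learn sketch of the setup described in point 1, purely as an illustration: the data set, the parameter values and the choice of AdaBoost for the boosting side are my own assumptions, and in older scikit-learn releases the `estimator` argument is called `base_estimator`.

```python
# Bagging with 100 fully developed trees vs. boosting with 100 shallow trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: each of the 100 estimators is a fully developed tree (max_depth=None).
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),
    n_estimators=100,
    random_state=0,
)

# Boosting: each of the 100 estimators is a shallow tree (a depth-1 stump,
# which is also AdaBoost's default base learner).
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```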


Solution

[Later edit - rephrased everything]

Types of trees

A shallow tree is a small tree (in most cases it has a small depth). A fully grown tree is a big tree (in most cases it has a large depth).

Suppose you have a training set whose data follows a non-linear structure (here, a noisy sine wave).

[Figure: synthetic sample fitted with 2 regression trees. The red line is a tree with depth 3, the blue line a tree with depth 10.]
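A short sketch that reproduces a figure of this kind (the exact data behind the original plot is unknown; a noisy sine wave and these noise and depth settings are assumptions):

```python
# Fit a shallow (depth 3) and a deeper (depth 10) regression tree to a noisy sine.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)                 # inputs in [0, 5)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # noisy sine targets

X_plot = np.linspace(0, 5, 500).reshape(-1, 1)
shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)    # shallow tree
deep = DecisionTreeRegressor(max_depth=10).fit(X, y)      # (almost) fully grown tree

plt.scatter(X, y, s=10, color="gray", label="training sample")
plt.plot(X_plot, shallow.predict(X_plot), color="red", label="depth 3")
plt.plot(X_plot, deep.predict(X_plot), color="blue", label="depth 10")
plt.legend()
plt.show()
```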

Bias-variance decomposition as a way to see the learning error

From the bias-variance decomposition we know that the learning error has 3 components:

$$Err = \text{Bias}^2 + \text{Var} + \epsilon$$

Bias is the error produced by the fitted model when it is not capable of representing the true function; it is in general associated with underfitting.

Var is the error produced by the fitted model due to the sampled data; it describes how unstable the model is if the training data changes, and it is in general associated with overfitting.

$\epsilon$ is the irreducible error which envelops the true function; this part can't be learned.

Model complexity

Considering our shallow tree, we can say that the model has low variance, since changing the sample does not change the model much; it would take many changed data points before it becomes unstable. At the same time, it has high bias, since it really can't represent the sine function, which is the true model. We can also say that it has low complexity: it can be described by 3 constants on 3 regions.

The fully grown tree, by contrast, has low bias. It is very complex, since it can only be described using many regions and many constants on those regions; this is why it has low bias. The complexity of the model also drives the variance up: in some regions at least, a single point sampled differently can change the shape of the fitted tree.

As a general rule of thumb, when a model has low bias it also has high variance, and when it has low variance it has high bias. This is not always true, but it happens very often, and the intuition behind it is sound: when you get close to the points in the sample you learn the patterns but also learn the errors in the sample; when you stay far away from the sample you are very stable, since you do not incorporate the sample's errors.
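A rough simulation of this trade-off (my own setup, not part of the original answer): repeatedly re-sample a noisy-sine training set, fit a tree of a given depth, and estimate the squared bias and the variance of its predictions at fixed test points.

```python
# Estimate bias^2 and variance of shallow vs. deep regression trees on a noisy sine.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x_test = np.linspace(0, 5, 100)[:, None]
true_f = np.sin(x_test).ravel()              # the true function at the test points

def bias_variance(depth, n_repeats=200, n_train=200, noise=0.2):
    preds = np.empty((n_repeats, len(x_test)))
    for i in range(n_repeats):
        X = 5 * rng.rand(n_train, 1)                              # fresh training sample
        y = np.sin(X).ravel() + rng.normal(scale=noise, size=n_train)
        preds[i] = DecisionTreeRegressor(max_depth=depth).fit(X, y).predict(x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)  # squared bias of the average prediction
    var = np.mean(preds.var(axis=0))                     # variance across resampled fits
    return bias2, var

for depth in (3, 10):
    b2, v = bias_variance(depth)
    print(f"depth={depth:2d}  bias^2={b2:.4f}  variance={v:.4f}")
```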

How can we build ensembles using those kinds of trees?

Bagging

The statistical idea behind bagging is bootstrapping, a statistical procedure for evaluating the error of a statistic. Assuming you have a sample and you want to evaluate the error of a statistical estimate, the bootstrap procedure allows you to approximate the distribution of the estimator. But a tree is only a simple function of the sample: "split the space into regions and predict with the average", a statistic. Thus if one builds multiple trees from bootstrap samples and averages them, the trees can be treated as roughly i.i.d. and the same principle works to reduce variance. Because of that, bagging allows one to reduce the variance without affecting the bias too much.

Why does it need full-depth trees? The reason is simple, though perhaps not so obvious at first sight: it needs learners with high variance in order to have variance worth reducing, and the procedure does not affect the bias. If the underlying models have low variance and high bias, bagging will slightly reduce the already small variance and do nothing for the bias.
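A sketch of that argument on the noisy-sine problem (the data set and numbers are illustrative only; in older scikit-learn releases the `estimator` argument is called `base_estimator`): bagging 100 fully developed trees recovers a lot of accuracy by averaging away their variance, while bagging already-stable shallow trees changes little, because their error is mostly bias.

```python
# Compare a single tree vs. a bagged ensemble, for full-depth and shallow base trees.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = 5 * rng.rand(300, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)
X_test = 5 * rng.rand(1000, 1)
y_test = np.sin(X_test).ravel() + rng.normal(scale=0.2, size=1000)

for depth in (None, 3):   # None = fully developed tree, 3 = shallow tree
    single = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    bagged = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=depth),
        n_estimators=100,
        random_state=0,
    ).fit(X, y)
    print(f"max_depth={depth}: single tree R^2 = {single.score(X_test, y_test):.3f}, "
          f"bagged R^2 = {bagged.score(X_test, y_test):.3f}")
```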

Boosting

How does boosting work? Many compare boosting with model averaging, but the comparison is flawed. The idea of boosting is that the ensemble is built by an iterative procedure. It is true that the final classification rule looks like a weighted average of some weak models, but the point is that those models were built one after another; it has nothing to do with bagging and how it works. The $k$-th tree is built using information learned from all previous $k-1$ trees.

So we start with a weak classifier fitted to the data, where all the points have the same importance. That importance can be changed by weights, as in AdaBoost, or by residuals, as in gradient boosting; it really does not matter. The next weak classifier does not treat all points in the same way: those previously classified correctly get less importance than those classified incorrectly. The consequence is that the ensemble enriches its complexity, i.e. its ability to reproduce more complex surfaces. This translates into a reduction of the bias, since the model can get closer to the data. The reverse intuition follows as well: if the classifier already has low bias, what will happen when I boost it? Probably an unbearable overfit, that's all.
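To make the iterative part concrete, here is a bare-bones "boosting by residuals" loop written by hand (the depth, learning rate and number of rounds are arbitrary choices, and this is only a sketch of the idea, not a full gradient boosting implementation): each shallow tree is fitted to the residuals left by all previous trees, so the ensemble's bias shrinks step by step.

```python
# Fit each new shallow tree to the residuals of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = 5 * rng.rand(300, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from the zero model
trees = []

for k in range(100):
    residual = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # the k-th tree builds on what the
                                                   # previous k-1 trees already learned

def predict(X_new):
    # Ensemble prediction = scaled sum of all the weak trees.
    return learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - prediction) ** 2))
```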

Which one is better?

There is no clear winner; it depends too much on the data set and on other parameters. Bagging, for example, rarely hurts: it may turn out to be useless, but it usually does not hurt performance. Boosting, on the other hand, can lead to overfitting, because it can eventually get too close to the data.

There is a lot of literature saying that when the irreducible error is high, bagging is much better and boosting does not make much progress.

Can we decrease both variance and bias?

Pure bagging and boosting approaches each serve a single purpose: either reduce variance or reduce bias. However, modern implementations have changed various things in how those approaches work. Subsampling can be used in boosting, and it seems to also work towards reducing the variance. There are bagging procedures which take some ideas from boosting, for example iterative bagging (or adaptive bagging) published by Breiman. So the answer is yes, it is possible.
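One concrete example of mixing the two ideas is stochastic gradient boosting, where each boosting step sees only a random subsample of the training rows. The data set and parameters below are my own illustrative choices; whether the subsampling actually helps depends on the problem.

```python
# Plain vs. stochastic gradient boosting (subsample < 1 adds bagging-like randomness).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

plain = GradientBoostingRegressor(n_estimators=200, random_state=0)
stochastic = GradientBoostingRegressor(n_estimators=200, subsample=0.5, random_state=0)

print("plain     :", cross_val_score(plain, X, y, cv=5).mean())
print("stochastic:", cross_val_score(stochastic, X, y, cv=5).mean())
```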

Can we use learners other than trees?

Of course. You will often see boosting used with other learners. I have also read some papers on bagging SVMs or other learners, and I tried to bag some SVMs myself, but without much success. Trees are the preferred way, however, for a very simple reason: they are simple to build, simple to adapt and their effect is simple to control. My personal opinion is that the last word has not yet been said regarding ensembles of trees.
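In scikit-learn, bagging a non-tree learner only means passing a different base estimator. A minimal sketch with an SVM follows, as an illustration rather than a recommendation (the data set and parameters are assumptions; in older releases the argument is `base_estimator`).

```python
# Bagging an SVM: any estimator can be plugged into BaggingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagged_svm = BaggingClassifier(
    estimator=SVC(kernel="rbf"),   # any base learner works here, not only trees
    n_estimators=20,
    max_samples=0.8,               # each SVM sees a random 80% of the rows
    random_state=0,
)
print("bagged SVM accuracy:", cross_val_score(bagged_svm, X, y, cv=5).mean())
```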

PS: a last note on the number of weak classifiers: this depends entirely on the complexity of the data set combined with the complexity of the learner. There is no recipe. Often 20 of them are enough to capture most of the information, and additional ones give only tiny improvements.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange