Question

What are the general assumptions of a Random Forest model? I could not find them by searching online. For example, for a linear regression model, the limitations/assumptions are:

  1. It may not work well when there is a non-linear relationship between the dependent and independent variables.
  2. It may not work if the independent variables in the model are linearly related (multicollinearity). Therefore one has to remove correlated variables with some other technique.
  3. It assumes that model errors are uncorrelated and have constant variance (no heteroscedasticity).

Are there any assumptions/limitations along similar lines?


Solution

Reproducing the accepted answer from CrossValidated here, as it is the most complete and best answer to this question.

In order to understand this, remember the "ingredients" of a random forest classifier (there are some modifications, but this is the general pipeline):

  1. At each step of building an individual tree, we find the best split of the data.
  2. While building a tree, we use not the whole dataset but a bootstrap sample.
  3. We aggregate the individual tree outputs by averaging (steps 2 and 3 together form the more general bagging procedure; see the sketch after this list).
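
For reference, here is a minimal from-scratch sketch of that pipeline, assuming scikit-learn trees and simple majority voting over binary labels. It is illustrative only; a real random forest (e.g. sklearn.ensemble.RandomForestClassifier) additionally samples a random subset of features at each split.

```python
# Minimal sketch of the bagging pipeline described above (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # step 2: bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # step 1: greedy best splits
    return trees

def predict_bagged(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # step 3: aggregate individual outputs
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote for 0/1 labels
```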

Consider the first point. It is not always possible to find the best split. For example, in the following dataset every single split will leave exactly one object misclassified.

[Figure: example of a dataset with no best split]

And I think that exactly this point can be confusing: indeed, the behaviour of an individual split is somewhat similar to the behaviour of a Naive Bayes classifier: if the variables are dependent, no single split works well for a decision tree, and the Naive Bayes classifier fails as well (as a reminder: independence of the variables is the main assumption we make in a Naive Bayes classifier; all other assumptions come from the probabilistic model that we choose).
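
As a quick illustration of that analogy, here is a minimal sketch using an assumed XOR-style dataset (a stand-in, not the answer's exact data): Gaussian Naive Bayes cannot separate two dependent binary features, for much the same reason a single split cannot.

```python
# Sketch: Naive Bayes fails when the label depends on the interaction of features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR-style labels: features are dependent given the class

nb = GaussianNB().fit(X, y)
print(nb.score(X, y))  # expected ~0.5: per-feature marginals are identical across classes
```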

But here comes the great advantage of decision trees: we take any split and continue splitting further, and in the following splits we will find a perfect separation (in red in the figure).

[Figure: example of the resulting decision boundary]
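
A minimal sketch of this "keep splitting" advantage, again on an assumed XOR-style dataset standing in for the one in the figures:

```python
# A single split cannot separate the classes; a follow-up split can.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)  # one split only
deep = DecisionTreeClassifier(max_depth=2).fit(X, y)   # split, then split again

print(stump.score(X, y))  # expected ~0.5: any single split leaves points misclassified
print(deep.score(X, y))   # expected 1.0: the follow-up splits separate perfectly
```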

And as we have no probabilistic model, just binary splits, we don't need to make any assumptions at all.

That was about decision trees, but it also applies to Random Forest. The difference is that for Random Forest we additionally use bootstrap aggregation. It has no model underneath, and the only assumption it relies on is that the sampling is representative. But this is usually a common assumption. For example, if one class consists of two components, and in our dataset one component is represented by 100 samples while the other is represented by 1 sample, then most individual decision trees will probably see only the first component, and the Random Forest will misclassify the second one.

[Figure: example of a weakly represented second component]
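
A quick way to see the representativeness assumption at work: a bootstrap sample of size n leaves out any specific training point with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.37, so roughly a third of the trees never even see the lone sample from the second component. A small simulation, with the counts 100 and 1 taken from the example above:

```python
# Simulate how often the single rare sample is missing from a bootstrap sample.
# With n = 101 points, P(miss) = (1 - 1/n)**n ≈ 0.366.
import numpy as np

rng = np.random.default_rng(0)
n = 101          # 100 samples from the first component + 1 from the second
rare_idx = 100   # index of the lone second-component sample
trials = 10_000

misses = sum(rare_idx not in rng.integers(0, n, size=n) for _ in range(trials))
print(misses / trials)  # ≈ 0.37: many trees never see the rare component at all
```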

Licensed under: CC-BY-SA with attribution