Question

I frequently use Random Forest, Regularized Random Forest, Guided Random Forest, and similar tree models.

The size of the data that I'm dealing with has grown beyond what I can work around using HPC and parallelism. It's typically large due to row length (observations) not columns (features). The data is also often not normally distributed.

I have to make a choice between:

  1. Running a small number of trees (i.e., 50 or fewer) with either the complete data or a relatively large, comparable sample
  2. Running several times the number of trees, but with a correspondingly scaled-down sample size

There are work-arounds for any one case -- for instance, I can run some ad hoc tests to see which I think will work better -- but what I'm wondering is whether there is good theoretical (or robust empirical) reasoning either to guide the choice of one approach over the other or to describe the tradeoff being made.

In other words, I'm hoping that someone more comfortable with the math, statistics, and theory underlying this (type of) algorithm can offer some generalizable insight.


Solution

I would recommend using a combination of both options #1 and #2.

You could first tune your hyper-parameters to find out how far you can reduce the number of trees before the model's predictions start to deteriorate on the test set.

This is because mtry, the number of features randomly sampled as candidates at each split, is the main hyper-parameter that affects the model's accuracy. Since the ensemble average converges as the number of trees grows, the number of trees can usually be reduced to a point where performance is barely affected. Hence, you need to iterate and find the limit below which too few trees no longer produce a strong enough ensemble. A random forest works best with more base learners because averaging the individual trees' outputs reduces variance.
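
For illustration, here is a minimal sketch (assuming Python and scikit-learn, where max_features plays the role of mtry) of growing one forest incrementally and watching the out-of-bag score to spot the point where adding more trees stops paying off:

```python
# Minimal sketch: grow a forest incrementally and track the OOB score
# to find the smallest number of trees that still performs well.
# X, y stand in for your own feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

# warm_start lets us add trees to the same forest instead of refitting from scratch.
rf = RandomForestClassifier(
    n_estimators=25,
    warm_start=True,
    oob_score=True,       # out-of-bag samples act as a free validation set
    max_features="sqrt",  # the mtry analogue in scikit-learn
    n_jobs=-1,
    random_state=0,
)

for n_trees in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    print(f"{n_trees:4d} trees -> OOB accuracy {rf.oob_score_:.4f}")
# Once the OOB score plateaus, extra trees mostly cost time, not accuracy.
```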

It is not clear from your description whether you're using the Random Forest for a classification or a regression problem. If it is a classification problem and your data set is imbalanced in terms of the ratio of positive to negative classes, you could reduce the size of the training set by under-sampling the majority class to bring it closer to a 1:1 ratio. Since you have a large number of records, such class-based sampling can improve accuracy while also reducing the amount of data used for training.
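
As a sketch of that under-sampling step (plain NumPy, with illustrative names; X and y stand for your features and binary labels):

```python
# Minimal sketch of random under-sampling of the majority class.
import numpy as np

rng = np.random.default_rng(0)

def undersample_majority(X, y, ratio=1.0):
    """Keep all minority-class rows plus a random subset of majority rows,
    so that the majority:minority ratio is at most `ratio`:1."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]

    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)

    n_keep = min(len(maj_idx), int(ratio * len(min_idx)))
    keep_maj = rng.choice(maj_idx, size=n_keep, replace=False)

    idx = np.concatenate([min_idx, keep_maj])
    rng.shuffle(idx)  # avoid class-ordered rows
    return X[idx], y[idx]

# Usage: X_small, y_small = undersample_majority(X, y, ratio=1.0)
```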

Additionally, once you have a fine-tuned Random Forest with good performance, you could also consider dropping the features that the algorithm ranks as least important on the OOB samples. This would reduce the time taken to train the model.
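
One way to sketch this in scikit-learn is to rank features with permutation importance on a held-out split (a stand-in for the OOB-based importance described above) and retrain on the surviving subset; the threshold and dataset here are purely illustrative:

```python
# Minimal sketch: rank features by permutation importance and drop the weakest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=40, n_informative=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

result = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=0, n_jobs=-1)

# Keep features whose mean importance is clearly above zero, then retrain.
keep = result.importances_mean > 0.001
print(f"Keeping {keep.sum()} of {X.shape[1]} features")

rf_small = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf_small.fit(X_train[:, keep], y_train)
print("Validation accuracy:", rf_small.score(X_val[:, keep], y_val))
```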

Licensed under: CC-BY-SA with attribution