Question

I have an unbalanced dataset with 3 classes: 60% class 1, 38% class 2, and 2% class 3.

I don't want to generate more examples of class 3, and I cannot collect more examples of class 3.

The problem is that I need to choose between Random Forest and Extra Trees (this is homework) and explain why I chose one of them.

So I chose the Random Forest classifier, but I am not sure whether my assumptions are right.

I chose it because the splits in Extra Trees are random, so the probability of picking a split that isolates examples of class 3 is low, and because I think (this is the real question) that Random Forest is more high-variance than Extra Trees, which could be more useful since the higher variance might help with the unbalanced dataset.

So, are these two assumptions correct, especially the last one? Did I choose correctly in picking Random Forest over Extra Trees?

Thanks


Solution

Both Random Forest and Extra Trees sample a random subset of features at each split point, but Random Forest is greedy: it searches for the optimal threshold at each node, whereas Extra Trees selects the split threshold at random.

I would choose Random Forest because it is more likely to create a split point that accounts for the imbalanced class, whereas Extra Trees might keep splitting a subset of the data over and over without ever separating out class 3, due to its random split points.
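If you want to check this empirically, here is a minimal sketch using scikit-learn's RandomForestClassifier and ExtraTreesClassifier on a synthetic 3-class dataset with roughly the same 60/38/2 class split; the data and parameters are illustrative assumptions, not your actual dataset, so results may differ on your problem.

```python
# Sketch: compare per-class recall of Random Forest vs Extra Trees
# on a synthetic imbalanced dataset (~60/38/2). Purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic 3-class data mimicking the 60/38/2 imbalance
X, y = make_classification(
    n_samples=5000,
    n_classes=3,
    n_informative=6,
    weights=[0.60, 0.38, 0.02],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

for name, clf in [
    ("RandomForest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("ExtraTrees", ExtraTreesClassifier(n_estimators=200, random_state=0)),
]:
    clf.fit(X_train, y_train)
    print(name)
    # Per-class recall is what matters for the rare class (label 2 here)
    print(classification_report(y_test, clf.predict(X_test), digits=3))
```

Pay attention to the recall of the rare class in the report rather than overall accuracy, since accuracy is dominated by the two majority classes.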

Licensed under: CC-BY-SA with attribution