Question

I wanted a much faster random forest classifier than the one from Weka, so I first tried the C++ Shark implementation (result: little speed improvement and a drop in correctly classified instances) and then tested Python's scikit-learn. Many websites and papers report that Weka performs poorly compared to scikit-learn, WiseRF, etc.

After my first try with a forest of 100 trees:

Training time: Weka ~170 s vs. scikit-learn ~31 s
Prediction on the same test set: Weka ~90% correctly classified vs. scikit-learn ~45%!

=> Scikit-learn's RF runs fast but classifies very badly on this first try.

I tuned the parameters of scikit-learn's RandomForestClassifier and managed to get a score close to 70%, but scikit-learn's speed dropped nearly to Weka's level (bootstrap=False, min_samples_leaf=3, min_samples_split=1, criterion='entropy', max_features=40, max_depth=6). I have many missing values, and scikit-learn does not handle them out of the box, so I tried several strategies (all of the Imputer strategies, skipping instances with missing values, replacing them with 0 or with extreme values) and reached 75%.
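For reference, the imputation strategies mentioned above can be tried with a few lines; here is a minimal sketch using SimpleImputer (the current name of the old Imputer class) on a toy array, not the OP's data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# Replace missing values with each column's mean; other strategies are
# 'median', 'most_frequent', and 'constant' (e.g. fill_value=0 or an
# extreme value).
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
```

Which strategy works best is data-dependent, which is why trying several (as above) is a reasonable approach.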

So at this stage scikit-learn's RandomForestClassifier scores 75% (compared to 90% with Weka) and builds the model in 78 s (using 6 cores, versus 170 s on a single core with Weka). I am very surprised by these results. I also tested ExtraTrees, which performs very well in terms of speed but still only reaches about 75% correct classification.
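The RandomForestClassifier/ExtraTreesClassifier comparison described above is easy to reproduce on synthetic data; here is a minimal sketch (the dataset and all parameters are illustrative, not the OP's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with roughly the same number of features as the OP's data
X, y = make_classification(n_samples=2000, n_features=100,
                           n_informative=20, random_state=0)

scores = {}
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, n_jobs=-1, random_state=0)
    scores[Model.__name__] = cross_val_score(clf, X, y, cv=3).mean()
```

ExtraTrees is typically faster because it draws split thresholds at random instead of searching for the best one, usually at little or no cost in accuracy.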

Do you have any idea what I am missing?

My data: ~100 features, ~100,000 instances, missing values, classification task (price forecast).


Solution

Wrapping up the discussion in the comments to make StackOverflow mark this question as answered:

Apparently the OP was able to reach comparable accuracy by dropping samples with missing values and grid-searching optimal hyper-parameter values with GridSearchCV.
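That recipe (drop incomplete samples, then grid-search) can be sketched as follows; the dataset, the injected missing values, and the parameter grid are all illustrative stand-ins for the OP's data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.05] = np.nan  # simulate missing values

# Drop every sample that contains at least one missing value
mask = ~np.isnan(X).any(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Illustrative grid; the OP would search over the parameters that mattered
# for their data (max_features, min_samples_leaf, max_depth, ...)
param_grid = {'max_features': ['sqrt', None],
              'min_samples_leaf': [1, 3]}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X_clean, y_clean)
```

Dropping rows costs data (here a substantial fraction of the samples), so it only pays off when enough complete instances remain, as was apparently the case for the OP.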

One-hot encoding the categorical features apparently did not impact the outcome much in this case.
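For completeness, one-hot encoding turns each categorical value into its own 0/1 column; a minimal sketch on a hypothetical categorical column (not the OP's features):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# handle_unknown='ignore' maps unseen categories at transform time to all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(colors).toarray()
# Each of the three categories becomes its own 0/1 column
```

Tree ensembles can often split usefully on integer-coded categories as well, which may be why one-hot encoding made little difference here.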

OTHER TIPS

I also saw a huge performance difference between the Weka and scikit-learn random forest implementations with the same data and (as far as I could tell) the same configuration. After trying every solution I could think of, I noticed the cause was actually pretty simple: Weka shuffles the data by default, but scikit-learn does not. Even after setting Weka's option to use the data as ordered, the difference remained. So here is how I handled it: use random_state=1 (the default seed in Weka) and shuffle=True in the scikit-learn cross-validator, and bootstrap=True in the classifier. This produces quite similar results to Weka. E.g.:

from sklearn import ensemble
from sklearn.model_selection import StratifiedKFold, GridSearchCV

classifier = ensemble.RandomForestClassifier(
    n_estimators=300, max_depth=30, min_samples_leaf=1,
    min_samples_split=2,  # must be >= 2 in current scikit-learn
    random_state=1, bootstrap=True, criterion='entropy', n_jobs=-1)

cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=1)
grid_search = GridSearchCV(classifier, param_grid=param_grid, cv=cv)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow