Question

I've built a toy Random Forest model in R (using the German Credit dataset from the caret package), exported it to PMML 4.0, and deployed it onto Hadoop using the Cascading Pattern library.
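A minimal sketch of that workflow, assuming the randomForest and pmml packages were used for training and export (the seed, tree count, and file name are illustrative):

```r
library(caret)          # provides data(GermanCredit)
library(randomForest)
library(pmml)
library(XML)            # saveXML()

data(GermanCredit)

# Class is the Good/Bad outcome column in caret's GermanCredit data
set.seed(42)
fit <- randomForest(Class ~ ., data = GermanCredit, ntree = 500)

# Export the forest as PMML and write it to disk for deployment
saveXML(pmml(fit), file = "german_credit_rf.pmml")
```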

I've run into an issue where Cascading Pattern scores the same data differently (in a binary classification problem) than the same model does in R: out of 200 observations, 2 are assigned a different class.

Why is this? Could it be due to a difference in the implementation of Random Forests?
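For reference, the mismatch count can be obtained with something like the sketch below; the holdout data frame and the Hadoop-side score file are hypothetical placeholders for the actual data.

```r
# Compare R-side predictions with the classes produced by Cascading Pattern
r_pred      <- predict(fit, newdata = holdout)                     # R-side classes
hadoop_pred <- read.csv("cascading_pattern_scores.csv")$predicted  # scored on Hadoop

table(R = r_pred, Hadoop = hadoop_pred)   # cross-tabulate the two scorings
sum(as.character(r_pred) != hadoop_pred)  # observations scored differently
```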


Solution

The difference appears to be due to how R and Cascading Pattern (as well as Openscoring, which I tried later) handle ties in the tree voting: when an even number of trees is grown (say, 500) and exactly half of them classify an application as Good while the other half classify it as Bad, the implementations break the tie differently. I solved it by growing an odd number of trees (501).
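A minimal sketch of that fix, reusing the training code from the question; the vote inspection at the end is just one way to confirm that no exact 50/50 splits remain.

```r
# Grow an odd number of trees so majority voting can never tie between Good and Bad
set.seed(42)
fit_odd <- randomForest(Class ~ ., data = GermanCredit, ntree = 501)

# Per-class vote counts for each observation (501 votes in total per row)
votes <- predict(fit_odd, newdata = GermanCredit, type = "vote", norm.votes = FALSE)
any(votes[, "Good"] == votes[, "Bad"])   # FALSE: a tie is impossible with odd ntree

# Re-export the PMML for deployment
saveXML(pmml(fit_odd), file = "german_credit_rf_501.pmml")
```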

OTHER TIPS

I think the most likely explanation is that the two libraries do not support PMML's TreeModel in quite the same way. Perhaps one only supports a subset of its features and silently ignores the ones it does not understand, which could lead to different scores.
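One hedged way to investigate is to list which elements the exported PMML actually uses and compare that against what each scoring engine documents; the file name matches the sketch above.

```r
library(XML)

doc   <- xmlParse("german_credit_rf.pmml")
elems <- sapply(getNodeSet(doc, "//*"), xmlName)  # every element in the document

# Frequency of each PMML element; check the engine docs for unsupported ones
sort(table(elems), decreasing = TRUE)
```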

I'd also double-check that the upstream parsing code is the same in both cases. Maybe a missing value is treated differently before the data reaches the model.
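A quick sanity check on the R side, assuming the raw input is available as a data frame, is to confirm that both pipelines see the same missing values before scoring:

```r
# Per-column and per-row missing-value counts on the R side;
# compare against whatever the Hadoop-side parsing reports
colSums(is.na(GermanCredit))
sum(!complete.cases(GermanCredit))
```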

Decision Trees are unstable learners and very sensitive to changes in the input parameters.

Licensed under: CC-BY-SA with attribution