Question

I am trying to use a random forest model (regression type) as a substitute for a logistic regression model. I am using the randomForest package in R. I want to understand the meaning of the variable importance measures (%IncMSE and IncNodePurity) through an example.

Suppose I have a population of 100 employees, 30 of whom left the company. Suppose that in a particular decision tree the population is split by an attribute (say, location) into two nodes. One node contains 50 employees, 10 of whom left the company, and the other contains 50 employees, 20 of whom left. Can someone demonstrate the calculation of %IncMSE and IncNodePurity? (If another decision tree is required for averages etc., please assume one.)

This may look like a duplicate question, but I could not find a worked-out example.


Solution

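MSE is a measure of the error of the overall regression model, $\frac{1}{n}\sum_{i=1}^{n}\|y_i-\hat y_i\|^2$, where $y_i$ is the observed value and $\hat y_i$ is the model's prediction for observation $i$.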

If an important variable is replaced with random noise, you would expect the MSE on the corrupted data to increase. IncMSE (the increase in MSE) for a particular variable is how much the MSE rises when that variable's values are randomly permuted, which breaks its relationship with the response while leaving everything else untouched.

This is usually computed on the out-of-bag data, i.e. the observations a given tree did not see during training.
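As a rough illustration with the numbers from the question, the sketch below treats the single location split as the whole model and permutes the location column. It is a minimal sketch under loud assumptions: the tree, the `leaf_prediction` table and the helper names are hypothetical, the same 100 employees stand in for the out-of-bag sample, and the result is the raw increase in MSE. The randomForest package additionally averages over trees and normalizes the differences, which is not reproduced here.

```python
import random

# Outcome per employee: 1 = left the company, 0 = stayed.
# Location A: 50 employees, 10 left. Location B: 50 employees, 20 left.
location = ["A"] * 50 + ["B"] * 50
left = [1] * 10 + [0] * 40 + [1] * 20 + [0] * 30

# Hypothetical one-split tree: each leaf predicts the mean outcome of its node.
leaf_prediction = {"A": 10 / 50, "B": 20 / 50}


def mse(locations, outcomes):
    """Mean squared error of the one-split tree on the given rows."""
    errors = [(y - leaf_prediction[loc]) ** 2 for loc, y in zip(locations, outcomes)]
    return sum(errors) / len(errors)


def mean_mse_increase(n_permutations=2000, seed=0):
    """Average rise in MSE when the location column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(location, left)
    total = 0.0
    for _ in range(n_permutations):
        shuffled = location[:]      # copy, so the original column is untouched
        rng.shuffle(shuffled)       # break the link between location and outcome
        total += mse(shuffled, left) - baseline
    return baseline, total / n_permutations


baseline_mse, inc_mse = mean_mse_increase()
print(f"baseline MSE: {baseline_mse:.3f}")
print(f"mean increase after permuting location: {inc_mse:.3f}")
```

The baseline MSE is (8 + 12) / 100 = 0.20. After shuffling, each employee is equally likely to receive the 0.2 or the 0.4 prediction, so the expected MSE is (30 × 0.50 + 70 × 0.10) / 100 = 0.22, and the average increase should land near 0.02. A variable the tree never splits on would show an increase of zero.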


Node purity is a measure of how homogeneous a node is. An example of an impurity measure is information entropy, i.e. $-p_1\log p_1-p_0\log p_0$ if there are two classes; the lower the entropy, the purer the node. For regression models, node impurity is usually taken as the variance of the response within a node.

Every time you split a node, you do so to make the new nodes more homogeneous, hence the purity increases.

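IncNodePurity of a variable is the weighted average of the increase in purity from each split in which this variable was used, with the node population as the weight. For regression, weighting the variance by the node population gives the residual sum of squares (RSS), so each split contributes its decrease in RSS; in randomForest these contributions are totalled per variable and averaged over all trees.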
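For the split described in the question, the contribution of location to this measure can be worked out by hand. The sketch below assumes the impurity is the residual sum of squares around the node mean (variance times node size) and that the outcome is coded 1 for "left" and 0 for "stayed"; the helper name `rss` is hypothetical, and a real forest would add up such contributions over every split on location and average them across trees.

```python
def rss(n_total, n_left):
    """Residual sum of squares of 0/1 outcomes around the node mean."""
    mean = n_left / n_total
    # n_left observations equal 1, the remaining (n_total - n_left) equal 0.
    return n_left * (1 - mean) ** 2 + (n_total - n_left) * mean ** 2


parent = rss(100, 30)    # all 100 employees, 30 left
child_a = rss(50, 10)    # first location node: 50 employees, 10 left
child_b = rss(50, 20)    # second location node: 50 employees, 20 left
decrease = parent - (child_a + child_b)

print(f"parent RSS: {parent:.2f}")
print(f"children RSS: {child_a:.2f} + {child_b:.2f}")
print(f"impurity decrease from the location split: {decrease:.2f}")
```

The parent node has RSS 100 × 0.3 × 0.7 = 21, the children have 50 × 0.2 × 0.8 = 8 and 50 × 0.4 × 0.6 = 12, so this one split adds 21 − 20 = 1 to the node purity attributed to location in this tree.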

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange