Why is the number of samples smaller than the number of values in my decision tree?

https://datascience.stackexchange.com/questions/12325

16-10-2019

Question

I'm using scikit-learn's RandomForestClassifier for a classification problem. When taking a closer look at one of the trees, I noticed that the number of samples at the root was 662, but there were 507 instances of the first class and 545 of the second. What's going on, or did I misunderstand something? Is the number of samples actually the number of unique samples, and since I used bootstrap aggregation, were many samples chosen multiple times?

Solution

Yes, the samples figure seems to show the number of unique samples. The extra counts in the class values come from samples that the bootstrap drew more than once.
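One way to check this on a fitted forest is to compare two root-node attributes of a tree. The following is a minimal sketch, assuming synthetic data from make_classification rather than the data in the question, and assuming (as appears to be the case) that scikit-learn applies the bootstrap by weighting each sample by how often it was drawn. Under that assumption, n_node_samples counts distinct samples and weighted_n_node_samples counts draws.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the data from the question).
X, y = make_classification(n_samples=1052, random_state=0)

forest = RandomForestClassifier(n_estimators=5, bootstrap=True, random_state=0)
forest.fit(X, y)

# Low-level tree structure of the first tree in the forest.
tree = forest.estimators_[0].tree_

# Node 0 is the root.
unique_at_root = int(tree.n_node_samples[0])             # distinct training samples
draws_at_root = float(tree.weighted_n_node_samples[0])   # draws, duplicates included

print(unique_at_root, draws_at_root)
```

If the assumption holds, the second number equals the size of the training set and the first is noticeably smaller.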

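There's the 0.632 rule: when you have N items and take a random sample of size N with replacement (as bootstrap does), on average only about 63.2% of the N items end up in the sample, and the remaining draws are duplicates. This is because each item is missed by all N draws with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N.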

That roughly matches what you've seen: 0.632 * (507 + 545) ≈ 665 unique samples, close to the 662 shown at the root.
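The 63.2% figure does not have to be taken on faith. A small sketch, using only the counts from the question, computes the exact expected number of distinct items for N = 1052 and compares the fraction with its large-N limit 1 - 1/e (the variable names are just illustrative):

```python
import math

n = 507 + 545  # total number of draws, taken from the class counts in the question

# Probability that a given item is never picked in n draws with replacement.
p_never_drawn = (1 - 1 / n) ** n

# Expected number of distinct items in the bootstrap sample.
expected_unique = n * (1 - p_never_drawn)

# Large-n limit of the unique fraction.
limit_fraction = 1 - math.exp(-1)

print(round(expected_unique, 1), round(limit_fraction, 3))
```

For 1052 draws the expected count comes out at about 665, so a root with 662 unique samples is entirely plausible.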

You can also check this empirically with some Python code:

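import numpy as np

samples = np.arange(507 + 545)

# Draw as many items as there are samples, with replacement (a bootstrap sample)
bootstrap_sample = np.random.choice(samples, size=len(samples), replace=True)

# Count how many distinct samples were drawn at least once
print(len(np.unique(bootstrap_sample)))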

The result varies from run to run but stays close to 665.

Licensed under: CC-BY-SA with attribution
