Question

Suppose one partitions the data into training/validation/test sets before applying some classification algorithm, and the training set turns out not to contain all the class labels present in the complete dataset. Say, some records with label "x" appear only in the validation set and not in the training set.

Is this a valid partitioning? It can have several consequences: the confusion matrix would no longer be square, and any error evaluated during the algorithm would be affected by labels that were never seen in the training set.

A second question: is it common for partitioning algorithms to handle this issue and split the data so that the training set contains all existing labels?


Solution

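This is what stratified sampling is supposed to solve: the data is partitioned separately within each class, so every subset keeps roughly the same class proportions as the complete dataset.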

https://en.wikipedia.org/wiki/Stratified_sampling
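
As a rough illustration of the idea, here is a minimal sketch in plain Python. The function name `stratified_split`, the 60/20/20 fractions, and the rule that every class puts at least one record into the training set are assumptions made for this example, not part of any particular library. In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists, split class by class."""
    rng = random.Random(seed)

    # Group record indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        # Assumption: each class contributes at least one record to training.
        n_train = max(1, round(fractions[0] * n))
        n_val = min(n - n_train, round(fractions[1] * n))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        # Whatever is left over goes to the test set.
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Hypothetical toy dataset with a rare label "x".
labels = ["a"] * 10 + ["b"] * 5 + ["x"] * 2
train, val, test = stratified_split(labels)
train_labels = {labels[i] for i in train}
```

With this sketch, every label occurs in the training set. A class with very few records (such as "x" above) still cannot appear in all three subsets, so the validation or test set may miss it.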

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow