Pregunta

Suppose that one partitions the data to training/validation/test sets for further application of some classification algorithm, and it happens that training set doesn't contain all class labels that were present in the complete dataset - say some records with label "x" appear only in validation set and not in the training.

Is this the valid partitioning? The above can have many consequences like confusion matrix would be no longer square, also during the algorithm we may evaluate an error and this would be affected by unseen labels in training set.

The second question is following: is it common for partitioning algorithms to take care about above issue and partition the data in the way that training set has all existing labels?

¿Fue útil?

Solución

This is what stratified sampling is supposed to solve.

https://en.wikipedia.org/wiki/Stratified_sampling

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top