Question

I currently have a dataset in which approximately 5% of the points are labelled and 95% are unlabelled. I would like to label an unlabelled point only if I am very confident in the prediction, and leave the rest as NaN. Personally I would like to use a random forest, but I am not sure if that is possible. I assume I am going to have to use some generative model?

One of the reasons I would like to do this is that the known points do not contain all the labels, so I would like to classify as many of the unknown points as possible before using unsupervised learning on the rest.

Is there a library I could use?

Solution

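Most scikit-learn classifiers, including RandomForestClassifier, can output class probabilities:

clf.predict_proba(X)

From those probabilities you can choose a confidence threshold: assign a label only when the highest class probability reaches it, and leave the remaining points as NaN.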
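As a rough illustration of the thresholding idea, here is a minimal sketch using a random forest. The synthetic data from make_classification, the 5% labelling mask, and the 0.9 threshold are all assumptions made for the example, not values from the question; substitute your own features and labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 1000 points, 3 classes.
X, y_true = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Hide roughly 95% of the labels; NaN marks an unlabelled point.
rng = np.random.default_rng(0)
unlabelled = rng.random(len(y_true)) > 0.05
labelled = ~unlabelled
y = y_true.astype(float)
y[unlabelled] = np.nan

# Train only on the labelled subset.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[labelled], y[labelled].astype(int))

# Class probabilities for the unlabelled points.
proba = clf.predict_proba(X[unlabelled])
confidence = proba.max(axis=1)                  # highest class probability
predicted = clf.classes_[proba.argmax(axis=1)]  # most probable class

# Keep a prediction only when it is confident enough (assumed threshold).
THRESHOLD = 0.9
y_filled = y.copy()
y_filled[unlabelled] = np.where(confidence >= THRESHOLD, predicted, np.nan)

print("newly labelled:", int((~np.isnan(y_filled)).sum() - labelled.sum()))
print("still NaN:", int(np.isnan(y_filled).sum()))
```

Two caveats on this sketch. The classifier can only assign classes it saw during training, so points belonging to a label missing from the known set may still receive a confident but wrong prediction. Random forest probabilities are also not guaranteed to be well calibrated, so the threshold is best chosen by checking accuracy on held-out labelled points.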

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange