Semi-supervised learning: only classify points with confidence above a threshold
21-10-2020
Question
I currently have a dataset with approximately 5% labelled points and 95% unlabelled. I would like to label some of the unlabelled points only when I am very confident, and leave the rest as NaN. I would prefer to use a random forest, but I am not sure whether that is possible. Do I have to use a generative model instead?
One reason for doing this is that the known points do not cover all of the labels, so I would like to classify as many of the unknown points as possible before using unsupervised learning on the rest.
Is there a library I could use?
Solution
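Most sklearn classifiers, including RandomForestClassifier, can output class probabilities:
clf.predict_proba(X)
From those probabilities you can choose a threshold and assign a label only to the points whose highest class probability is above it, leaving the rest unlabelled.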
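A minimal sketch of that idea, assuming a random forest and synthetic data from make_classification in place of the real dataset. The 0.9 cut-off and the variable names are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): 1000 points, 3 classes.
X, y_true = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Keep roughly 5% of the labels; mark the rest as unlabelled with NaN.
rng = np.random.default_rng(0)
labelled = rng.random(len(y_true)) < 0.05
y = np.where(labelled, y_true, np.nan)

# Fit on the labelled points only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[labelled], y_true[labelled])

# Class probabilities for the unlabelled points,
# one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X[~labelled])
confidence = proba.max(axis=1)
predicted = clf.classes_[proba.argmax(axis=1)]

THRESHOLD = 0.9  # hypothetical cut-off; tune it on held-out labelled data
confident = confidence >= THRESHOLD

# Fill in only the confident predictions; everything else stays NaN.
y_filled = y.copy()
unlabelled_idx = np.flatnonzero(~labelled)
y_filled[unlabelled_idx[confident]] = predicted[confident]

print(f"newly labelled: {confident.sum()} of {len(unlabelled_idx)}")
```

Note that a classifier trained this way can only assign classes it saw in the labelled set, so a point from an unseen class may still receive a high probability for a known class. Random forest probabilities are also not necessarily well calibrated, so it is worth checking the threshold against held-out labelled points before trusting it.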
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange