Question

I have a dataset of patient records, but I do not know whether each patient is +ve for cancer or not. So I do not have labels in my dataset.

Now I can run a machine learning model such as clustering to generate labels.

For example, I can run clustering to split the patients into two groups based on similarity and find out who belongs to the +ve class and who to the -ve class.

Of course, we cannot sit and manually review every patient's data to know whether they are actually +ve for cancer or not.

So when we generate labels via machine learning models like clustering above, is it a recommended approach?

Is it used in industries/real time where people don't have ground truth and only rely on labels based on ML models?

How can we trust these labels generated?

If the labels came from a human, I would know they can be trusted. But how do we trust these labels?

Are things like this being used in Industries and how do they tackle the trust issue?

Solution

"So when we generate labels via machine learning models like clustering above, is it a recommended approach?" Only if the data really separates into two highly distinct clusters. This is highly unlikely, especially for complicated, high-dimensional datasets. One reason is that clustering algorithms never see a label, so they are generally weaker than supervised algorithms at recovering a specific class boundary. If you can find a good representation (have a look at Bengio's work on representation learning), i.e. highly discriminative embeddings, then it might work.
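One rough way to check whether the data "really separates into two clusters" is to cluster it and look at a separation measure such as the silhouette score. The snippet below is only a sketch under assumptions: it uses scikit-learn, k-means as the clustering algorithm, and synthetic blobs as stand-ins for patient features (none of this comes from the question). A high score says the groups are geometrically distinct; it does not say that the groups correspond to cancer vs. no cancer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def cluster_separation(X, n_clusters=2, seed=0):
    """Cluster X with k-means and return (cluster ids, silhouette score)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = model.fit_predict(X)
    return cluster_ids, silhouette_score(X, cluster_ids)


# Synthetic stand-ins for patient features (hypothetical, NOT real data).
X_separated, _ = make_blobs(
    n_samples=300, centers=[[0.0] * 5, [8.0] * 5], cluster_std=1.0, random_state=0
)
X_overlapping, _ = make_blobs(
    n_samples=300, centers=[[0.0] * 5, [0.5] * 5], cluster_std=1.0, random_state=0
)

_, score_separated = cluster_separation(X_separated)
_, score_overlapping = cluster_separation(X_overlapping)

# Silhouette is in [-1, 1]; values near 1 mean tight, well-separated clusters.
print(f"well separated: {score_separated:.2f}")
print(f"overlapping:    {score_overlapping:.2f}")
```

If the score on the real data looks more like the overlapping case, the cluster ids are probably not usable as labels without a better representation.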

"Is it used in industries/real time where people don't have ground truth and only rely on labels based on ML models?" It's an option. One can definitely try it, but should not rely on it.

"How can we trust these labels generated?" As long as you can validate them against an out-of-fold (held-out) set that has ground-truth labels, or have humans inspect the clusters, there is no problem.
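The validation step can be sketched as follows, assuming a small subset of records has been reviewed by a human. Each cluster is mapped to the majority ground-truth label found in it, and the agreement rate is then computed on that reviewed subset. The function names and the example values are hypothetical, made up for illustration.

```python
from collections import Counter


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster id to the majority ground-truth label seen in it."""
    votes = {}
    for cid, label in zip(cluster_ids, true_labels):
        votes.setdefault(cid, Counter())[label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


def agreement(cluster_ids, true_labels):
    """Fraction of reviewed records whose cluster-derived label matches the truth."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    hits = sum(mapping[cid] == label for cid, label in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Hypothetical reviewed subset: cluster id per record and the human-verified label.
reviewed_clusters = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
reviewed_labels = ["neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg", "pos", "pos"]

mapping = map_clusters_to_labels(reviewed_clusters, reviewed_labels)
score = agreement(reviewed_clusters, reviewed_labels)
print(mapping)  # {0: 'neg', 1: 'pos'}
print(score)    # 0.8
```

Because the mapping and the score come from the same reviewed records, the number is somewhat optimistic; with enough reviewed records, one part can fix the mapping and another part can score it. With heavily imbalanced classes, both clusters may map to the same label, which is itself a sign that the clusters are not tracking the class of interest.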

"Are things like this being used in Industries and how do they tackle the trust issue?" It's one of the possible solutions, but personally I always try transfer learning first. Especially for problems like yours, chances are there is already a pretrained model. The only other thing you need is a labeling tool and about 1000 labeled samples (labeling them takes a couple of hours, but it's worth it). Have a look at this tool.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange