Is using unsupervised learning to set up supervised classification reasonable?

https://datascience.stackexchange.com/questions/74636

11-12-2020

Question

I've got a biological dataset describing genes. There are thousands of these genes to sort through, so if ML can rank them, I will know which ones should go into the lab for functional research first. Currently, I make the labels for supervised classification of these genes based on their known biology. For example, some genes interact with drugs related to a disease, so I label them 'most likely to cause the disease', and this goes down to a final, fourth label of 'unlikely to cause the disease'. It seems impossible for labels made this way to be unbiased, since I'm making all the decisions. So I'm wondering whether I can compare my decisions against how an unsupervised model would group the data (e.g. I've got 4 labels, but if the model finds 5 groups, does that show how far off I potentially am?).

Would it also be possible to use unsupervised learning to create the labels by itself, or would this be unreliable too, since you can't know why it's grouping certain genes together? Or would doing this step alone actually make the supervised step redundant anyway?

Solution

Is using unsupervised learning to set up supervised classification reasonable?

Absolutely. This is a common strategy in ML. As you said yourself, structure that comes from the data itself depends less on your own labelling decisions, although it still reflects which features you chose and how you scaled them.
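One way to make the comparison between hand-made labels and clusters concrete is to tabulate how the two groupings overlap and compute a chance-corrected agreement score. The sketch below is only an illustration: the gene labels and cluster ids are made-up toy data, the function names are hypothetical, and the adjusted Rand index is my choice of agreement measure, not something the question specifies (scikit-learn's `adjusted_rand_score` computes the same quantity).

```python
from collections import Counter
from math import comb


def contingency(manual_labels, cluster_ids):
    """Count how many genes fall in each (manual label, cluster) pair."""
    return Counter(zip(manual_labels, cluster_ids))


def adjusted_rand_index(manual_labels, cluster_ids):
    """Chance-corrected agreement between two groupings (1.0 = identical)."""
    n = len(manual_labels)
    if n < 2:
        return 1.0  # nothing to compare with fewer than two genes
    cells = contingency(manual_labels, cluster_ids)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(manual_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(cluster_ids).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one single group on both sides
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: 8 genes, 4 hand-made labels, 5 clusters from some model.
manual = ["most", "most", "likely", "likely",
          "possible", "possible", "unlikely", "unlikely"]
clusters = [0, 0, 1, 1, 2, 3, 4, 4]

for (label, cluster), count in sorted(contingency(manual, clusters).items()):
    print(f"label={label!r:12} cluster={cluster} genes={count}")

ari = adjusted_rand_index(manual, clusters)
print("adjusted Rand index:", round(ari, 3))
```

A score near 1.0 means the clusters largely reproduce the manual labels; a score near 0 means the agreement is about what random grouping would give. The table shows which label was split or merged, which is often more informative than the number of clusters alone.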

Would it also be possible to use unsupervised learning to create the labels?

Technically yes. However, some clustering techniques (k-means, for example) require you to specify the number of clusters up front, which doesn't help when the number of groups is exactly what you are unsure about. As you said, if you can cluster the data points in a satisfactory manner, you don't need supervised learning anymore. Also, if your scenario requires you to understand what differentiates the clusters, you may be out of luck, depending on which clusters come out: they are not always interpretable.
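When the algorithm does need the number of clusters, a common workaround is to try several values and keep the one with the best internal quality score. The following is a minimal sketch under stated assumptions: it uses scikit-learn's `KMeans` and `silhouette_score`, and the feature matrix is synthetic data from `make_blobs` standing in for real gene features. Neither the algorithm nor the score is prescribed by the answer above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a gene feature matrix (rows = genes, columns = features).
# Three well-separated groups are generated on purpose for this illustration.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for a range of candidate cluster counts and score each result.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

On real gene data the silhouette curve is rarely this clear-cut, so the chosen value is better read as a hint than as the true number of groups.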

What I would suggest is turning your classification problem into a regression problem: 1.0 could mean most likely to cause the disease and 0.0 least likely. This way, you don't have to worry about how many labels you need in the first place.
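As a rough sketch of the regression idea, the existing ordinal labels can be mapped to scores in [0, 1], a regressor fitted on them, and unlabelled genes ranked by predicted score. Everything specific here is an assumption for illustration: the two middle label names, the evenly spaced scores, the random synthetic features, and the choice of scikit-learn's `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical mapping from the four hand-made labels to target scores.
# Only the first and last label names come from the question; the spacing
# of the scores is an assumption, not something the data dictates.
label_to_score = {
    "most likely": 1.0,
    "probable": 2 / 3,
    "possible": 1 / 3,
    "unlikely": 0.0,
}

rng = np.random.default_rng(0)
names = list(label_to_score)

# Synthetic stand-in data: 40 labelled genes and 10 unlabelled ones, 5 features each.
X_labelled = rng.normal(size=(40, 5))
labels = [names[i] for i in rng.integers(0, len(names), size=40)]
y = np.array([label_to_score[label] for label in labels])
X_unlabelled = rng.normal(size=(10, 5))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_labelled, y)

# Higher predicted score = higher priority for functional research.
predicted = model.predict(X_unlabelled)
ranking = np.argsort(-predicted)
for position, gene_index in enumerate(ranking, start=1):
    print(f"{position:2d}. gene {gene_index} score={predicted[gene_index]:.2f}")
```

Because the features here are random noise, the printed ranking is meaningless; the point is only the shape of the workflow, where the output is a continuous priority score instead of one of a fixed number of classes.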

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange