supervised learning and labels

https://datascience.stackexchange.com/questions/9573

16-10-2019

Question

On this wiki page, I came across the following sentence.

When data is not labeled, a supervised learning is not possible, and an unsupervised learning is required

I cannot figure out why supervised learning is not possible in that case.

Appreciate any help to resolve this ambiguity.

Solution

The main difference between supervised and unsupervised learning is the following:

In supervised learning you have a set of labelled data, meaning that you have the values of both the inputs and the outputs. What you try to achieve with machine learning is to approximate the true relationship between them, which is usually called the model. There are many different machine learning algorithms that let you obtain a model from the data. The objective is to predict the output for a new input once you have the model.
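As a minimal sketch of that idea, the snippet below fits a straight line to a handful of labelled pairs and then predicts the output for an unseen input. The data, the choice of least squares, and the helper names `fit_line` and `predict` are all assumptions made for illustration; any supervised algorithm would play the same role.

```python
# Labelled data: every input x comes with a known output y.
# The values are made up for illustration and follow y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
outputs = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Training needs both inputs and outputs; without the outputs
# there is nothing for fit_line to fit against.
model = fit_line(inputs, outputs)
print(predict(model, 10.0))
```

Note that `fit_line` cannot even be called without `outputs`, which is the practical reason labels are required for training.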

In unsupervised learning your data is not labelled: you have the inputs but not the outputs. The objective is to find some kind of pattern in the data, for example groups or clusters of points that appear to belong together. Here you also have to obtain a model, and once you have it you can assign a new input to one of the groups the model has found.
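A rough sketch of the clustering case, assuming one-dimensional inputs and exactly two groups: a tiny k-means loop finds two centroids from unlabelled points, and a new input is then assigned to the nearest one. The data and the helper names `nearest` and `two_means` are hypothetical, and k-means is only one of many clustering methods.

```python
# Unlabelled data: inputs only, no outputs. Values are made up for illustration.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


def two_means(xs, iterations=10):
    """Tiny 1-D k-means with k=2, initialised at the smallest and largest point."""
    centroids = [min(xs), max(xs)]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            groups[nearest(centroids, x)].append(x)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


centroids = two_means(points)
# The "output" for a new input is just the index of the cluster it falls into.
print(nearest(centroids, 1.1), nearest(centroids, 4.7))
```

The cluster indices carry no meaning beyond "group 0" and "group 1"; giving them a name would require outside knowledge, which is exactly what labels provide.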

Finally, going back to your question: if you don't have labels, you cannot use supervised learning and have to use unsupervised learning instead.

OTHER TIPS

That sentence is misleading. Here's a better way to look at it:

Whether a problem is supervised or unsupervised depends on the nature of the problem you're trying to solve. In a supervised learning problem there is some ground truth you want the algorithm to predict. The ground truth can be a discrete label (classification) or a value in a continuous domain (regression). An unsupervised learning problem, on the other hand, doesn't try to "predict" a label or value. Rather, it tries to learn a better representation or structure of the data. Clustering and dimensionality reduction are both examples of unsupervised learning problems.

Now, in order to train a supervised learning algorithm, you do need to provide it with the ground truth. A lack of labeled data does NOT make the problem unsupervised; it only means that you have to spend the effort to obtain the labeled data you need, or else you can't train your algorithm. In practice, it is often unrealistic or too expensive to obtain labels or target values for all the data you have. Therefore, there is also a class of semi-supervised algorithms that perform supervised learning using both labeled and unlabeled data, when certain assumptions apply.
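To make the semi-supervised case concrete, here is a small sketch of self-training, which is one common technique and not something the answer itself prescribes. It assumes that inputs close to each other share a label: a nearest-mean classifier is built from two labeled points, used to pseudo-label the unlabeled points, and then refitted on both. The data and the helper names `class_means`, `classify`, and `self_train` are hypothetical.

```python
# A few labeled examples (input, label) and many unlabeled inputs.
# All values are made up for illustration.
labelled = [(1.0, "a"), (5.0, "b")]
unlabelled = [0.7, 1.4, 1.9, 4.2, 4.6, 5.5]


def class_means(pairs):
    """Mean input value per label."""
    sums, counts = {}, {}
    for x, label in pairs:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(means, x):
    """Label whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def self_train(labelled_pairs, unlabelled_xs, rounds=3):
    """Refit the class means using pseudo-labels for the unlabeled inputs."""
    means = class_means(labelled_pairs)
    for _ in range(rounds):
        # Assumption: nearby points share a label, so the current
        # model's guesses are treated as if they were real labels.
        pseudo = [(x, classify(means, x)) for x in unlabelled_xs]
        means = class_means(labelled_pairs + pseudo)
    return means


means = self_train(labelled, unlabelled)
print(classify(means, 2.5), classify(means, 3.6))
```

The problem is still a supervised one (predict "a" or "b"); the unlabeled points only help refine the model, and they only help if the closeness assumption actually holds for the data.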

In short, whether a problem is supervised or not depends on the nature of the problem. Some problems require labeled data in order to train a learning algorithm and some do not, but having labeled data or not should NOT change the nature of the problem you're trying to solve.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange