Question

I came to this question after reading that using PCA to reduce overfitting is bad practice, the reason being that PCA does not consider labels/output classes, so regularization is always preferred.

That reasoning seems perfectly valid for supervised learning.

What about unsupervised learning, where we don't have any labels at all? Two questions:

  • Is overfitting a problem in unsupervised learning?
  • If yes, can we use PCA to prevent overfitting? Is that a good practice?

Solution 2

Here is a summary of what I have researched so far:

Fundamentally:

Model: A model is a set of rules that fits/represents the trends in the supplied data.

Overfitting: Overfitting, in the general sense, is modeling the noise/randomness in a sample along with its signal, to the point where the noise affects the model's results.

With these fundamentals at hand, the intuition is: WHEN YOU FIT, THERE IS A CHANCE THAT YOU CAN OVERFIT. That is, whenever you can model what is REQUIRED, there is a good chance you will also model what is NOT REQUIRED.

So, YES, OVERFITTING IS POSSIBLE IN UNSUPERVISED LEARNING.
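
For a concrete feel of this, here is a minimal sketch (assuming scikit-learn is available; the blob dataset and the choice of 30 clusters are invented for illustration). A k-means model with far too many clusters carves up the training sample's noise, and its held-out silhouette score suffers:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic data with 3 real clusters, no labels used for fitting.
X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.5, random_state=0)
X_train, X_test = train_test_split(X, random_state=0)

for k in (3, 30):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    labels = km.predict(X_test)
    print(k, silhouette_score(X_test, labels))
# Typically k=3 scores clearly higher: the 30-cluster model has fit
# the training sample's noise and generalises worse.
```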

Can PCA be used to reduce overfitting in unsupervised learning?

Supervised learning uses labels as the measure of comparison: two samples (feature sets, feature vectors, or whatever jargon one prefers) are compared with respect to their labels to identify patterns. PCA, however, does not consider labels at all. Removing data with PCA is therefore not preferred in the supervised setting, as it may discard directions whose features carry little variance but whose labels carry real information.

So PCA is not recommended for reducing overfitting in supervised learning. You can use it, if at all, at the risk of losing label-relevant information from your data.
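
To make that risk tangible, here is a hedged toy example (the two-feature setup is invented for illustration, assuming NumPy and scikit-learn): the class signal lives in a low-variance feature, so projecting onto the single top principal component throws the label-relevant information away:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
x_big = rng.normal(scale=10.0, size=n)               # high variance, label-irrelevant
x_small = np.where(rng.random(n) < 0.5, -1.0, 1.0)   # tiny variance, defines the class
y = (x_small > 0).astype(int)
X = np.column_stack([x_big, x_small])

# PCA keeps the high-variance axis and drops the discriminative one.
Z = PCA(n_components=1).fit_transform(X)
print(abs(np.corrcoef(Z[:, 0], y)[0, 1]))  # ~0: the kept component ignores the label
```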

Unsupervised learning has no labels; instead, it compares samples directly with one another to identify patterns.

Basically, there is NO case where the features carry too little information but the labels carry more, because labels don't exist. So PCA can help you reduce dimensionality, as it tends to discard directions that don't add much information.

Having said that, it is NOT guaranteed to reduce overfitting.

But it's worth a try. If noise is dominant in the data, the model will treat that noise as a definite pattern and abstract it, which boils down to modeling your specific sample rather than the underlying process. Trimming the low-variance directions with PCA removes much of that noise before it can be modeled.

So yes, PCA can help you reduce overfitting. As for whether it is good practice:

I haven't come across an article or argument that discourages it for unsupervised learning. All things considered, PCA does seem a practical approach to reducing overfitting in unsupervised learning, with little loss of the information that matters.
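
As a sketch of that claim (the pipeline below is an assumed setup, not a guarantee): blob data padded with pure-noise dimensions, clustered with and without a PCA step. The true labels are used only to score the two runs afterwards, never for fitting:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X, y_true = make_blobs(n_samples=600, centers=3, n_features=5, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(600, 45))])  # add 45 pure-noise dimensions

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_noisy)
Z = PCA(n_components=5).fit_transform(X_noisy)        # keep the main variance directions
denoised = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

print("raw:", adjusted_rand_score(y_true, raw))
print("pca:", adjusted_rand_score(y_true, denoised))  # often noticeably higher
```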

OTHER TIPS

Overfitting happens when the model fits the training dataset more than it fits the underlying distribution. In a way, it models the specific sample rather than producing a more general model of the phenomenon or underlying process.

This can be illustrated with Bayesian methods. If I use Naive Bayes, I have a simple model that might not fit the dataset or the distribution particularly well, but it is of low complexity.

Now suppose we use a very large Bayesian network. It might end up unable to gain any further insight into the distribution, and instead spend its complexity modeling the specific dataset (or even pure noise).

So, overfitting is possible in unsupervised learning.
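
A minimal probabilistic sketch of the same point (assuming scikit-learn; the mixture sizes are arbitrary): a Gaussian mixture with far too many components keeps improving on the training sample while its held-out log-likelihood degrades, with no labels involved anywhere:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
X_train, X_test = train_test_split(X, random_state=0)

for k in (3, 50):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # score() is the average per-sample log-likelihood.
    print(k, "train:", gmm.score(X_train), "test:", gmm.score(X_test))
# Train likelihood keeps rising with k; test likelihood typically peaks
# near the true number of components and then falls: overfitting.
```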

In PCA we start with a model the size of the dataset. We make assumptions about how the data behaves and use them to shrink the model, removing the parts that don't explain the main factors of variation. Since we reduce the model size, one might expect it to always help.

However, we face problems. First, a model the size of the dataset is extremely large (at that size you can model any dataset), so compressing it a little is not enough.

Another problem is that our assumptions may not be correct. Then we will have a smaller model, but one that isn't aligned with the distribution; in that case it might still overfit, or simply fail to fit at all.

Even so, PCA aims to reduce dimensionality, which leads to a smaller model and possibly a lower chance of overfitting. So, in case the distribution fits the PCA assumptions, it should help.
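
In practice, one common (hedged) way to act on this is to keep only enough components to explain a chosen fraction of the variance; the 95% threshold below is an assumption, not a rule:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images
pca = PCA(n_components=0.95).fit(X)   # a float in (0, 1) targets explained variance
print(pca.n_components_, "of", X.shape[1], "dimensions retained")
```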

To summarize, overfitting is possible in unsupervised learning too. PCA might help with it, on suitable data.

Licensed under: CC-BY-SA with attribution