Question

I have a classification problem and want to reduce the number of features from 30 to 4. I'm wondering why I get better classification results when I apply correlation-based feature selection (CFS) first and then PCA, compared with applying PCA alone. Note also that the information loss in the second approach (PCA alone) is 0.2 (variance covered: 0.8), while in the first it is 0.4 (variance covered: 0.6)!

Thank you in advance

Solution

PCA simply finds more compact ways of representing correlated data. PCA does not explicitly compact the data in order to better explain the target variable. In some cases, most of your inputs might be correlated with each other but have minimal relevance to your target variable. That's probably what is happening in your case.

Consider a toy example. Let's say I want to predict stock prices, and I'm given four predictors:

  1. Year-over-year earnings growth (relevant)
  2. Percent chance of rain (irrelevant)
  3. Humidity (irrelevant)
  4. Temperature (irrelevant)

If I apply PCA to this data set, the first principal component would relate to weather, since 75% of the predictors are weather-related (and presumably correlated with one another). Is this principal component relevant? It's not.
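To make the toy example concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the three weather variables are generated from one shared latent factor, the target (`price_change`) depends only on earnings growth, and scikit-learn is used for the PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# One relevant predictor, independent of the weather.
earnings = rng.normal(size=n)

# Three weather predictors that share a single latent factor (assumption),
# so they are strongly correlated with one another.
weather = rng.normal(size=n)
rain = weather + 0.3 * rng.normal(size=n)
humidity = weather + 0.3 * rng.normal(size=n)
temperature = -weather + 0.3 * rng.normal(size=n)

X = np.column_stack([earnings, rain, humidity, temperature])

# Hypothetical target: driven by earnings growth only.
price_change = 2.0 * earnings + 0.5 * rng.normal(size=n)

# Standardize, then fit PCA without ever looking at the target.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)

# Loadings of the first component on [earnings, rain, humidity, temperature].
pc1_loadings = pca.components_[0]

# How strongly each component tracks the target.
corr_pc1_target = np.corrcoef(scores[:, 0], price_change)[0, 1]
corr_pc2_target = np.corrcoef(scores[:, 1], price_change)[0, 1]

print("PC1 loadings:", np.round(pc1_loadings, 2))
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 2))
print("corr(PC1, target):", round(corr_pc1_target, 2))
print("corr(PC2, target):", round(corr_pc2_target, 2))
```

In this setup the first component explains most of the variance yet barely tracks the target, while the lower-variance second component carries the signal. High variance coverage and predictive relevance are different things.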

The two options you've highlighted boil down to using CFS or not using it. The option that uses CFS does better because it explicitly selects variables that have relevance to the target variable.
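The comparison can be sketched roughly as follows. This is not the asker's data or pipeline: the data set is synthetic (4 informative features plus 26 irrelevant ones arranged in 5 correlated blocks), and `SelectKBest` with an ANOVA F-score is used as a simple supervised stand-in for CFS, which it is not. The choice of `k=8` is likewise arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# 4 informative features: class-dependent mean shift plus unit noise.
informative = (2 * y - 1)[:, None] + rng.normal(size=(n, 4))

# 26 irrelevant features in 5 blocks; features in a block share one latent
# factor, so they are highly correlated with each other but not with y.
latent = rng.normal(size=(n, 5))
block_of_feature = np.arange(26) % 5
irrelevant = latent[:, block_of_feature] + 0.3 * rng.normal(size=(n, 26))

X = np.hstack([informative, irrelevant])  # 30 features in total
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Option A: PCA alone, 30 features -> 4 components.
pca_only = make_pipeline(
    StandardScaler(),
    PCA(n_components=4),
    LogisticRegression(),
)

# Option B: supervised selection first (stand-in for CFS), then PCA to 4.
select_then_pca = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=8),
    PCA(n_components=4),
    LogisticRegression(),
)

acc_pca_only = pca_only.fit(X_train, y_train).score(X_test, y_test)
acc_select_then_pca = select_then_pca.fit(X_train, y_train).score(X_test, y_test)

print("PCA only:        ", round(acc_pca_only, 3))
print("selection + PCA: ", round(acc_select_then_pca, 3))
```

Here PCA alone spends all four components on the large irrelevant blocks, whereas the selection step discards most of them before PCA runs. Real data will rarely be this clear-cut, but the mechanism is the same.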

Other tips

If you have a classification problem, you should use LDA (linear discriminant analysis) instead of PCA. PCA ignores classes, whereas LDA is class-aware.

For example, if your data is 2D and you apply PCA to it, you get:

[Figure not available: two classes in 2D, before and after projection with PCA]

So before PCA, the classes were perfectly linearly separable, but after PCA they are not separable at all. I'm not saying this happens in your case, but it could be.
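A small synthetic sketch of this situation, assuming scikit-learn: two classes that differ only along a low-variance axis, while a high-variance axis carries no class information. The data, the scales, and the use of logistic regression to score each 1D projection are all illustrative choices, not taken from the original example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per_class = 500
y = np.repeat([0, 1], n_per_class)

# Long axis: large variance, identical distribution for both classes.
x_long = rng.normal(scale=5.0, size=2 * n_per_class)

# Short axis: small variance, but this is where the classes differ.
x_short = np.where(y == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=2 * n_per_class)

X = np.column_stack([x_long, x_short])

# PCA keeps the direction of maximum variance and never sees y.
X_pca = PCA(n_components=1).fit_transform(X)

# LDA keeps the direction that best separates the classes.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# Training accuracy of a linear classifier on each 1D projection.
acc_pca = LogisticRegression().fit(X_pca, y).score(X_pca, y)
acc_lda = LogisticRegression().fit(X_lda, y).score(X_lda, y)

print("accuracy after PCA:", round(acc_pca, 3))
print("accuracy after LDA:", round(acc_lda, 3))
```

The PCA projection lands near chance because it keeps the long axis, while the LDA projection stays almost perfectly separable.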

Correlated variables should be removed before applying PCA, since together they tend to exaggerate the effect they express. CFS selects subsets of variables that are correlated with the class but largely uncorrelated with each other.

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange