Question

I have a classification problem with 30 features that I want to reduce to 4. Why do I get better classification results when I first apply correlation-based feature selection (CFS) and then PCA, compared with applying PCA alone? Note that PCA alone discards less variance (0.2 lost, 0.8 retained) than CFS followed by PCA (0.4 lost, 0.6 retained).

Thank you in advance

Solution

PCA simply finds more compact ways of representing correlated data. PCA does not explicitly compact the data in order to better explain the target variable. In some cases, most of your inputs might be correlated with each other but have minimal relevance to your target variable. That's probably what is happening in your case.

Consider a toy example. Let's say I want to predict stock prices and I'm given four predictors:

  1. Year-over-year earnings growth (relevant)
  2. Percent chance of rain (irrelevant)
  3. Humidity (irrelevant)
  4. Temperature (irrelevant)

If I apply PCA to this data set, the first principal component will mostly reflect the weather, because three of the four predictors (75%) are weather-related and presumably correlated with one another. Is this principal component relevant to stock prices? It's not.
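The toy example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data is synthetic, the three weather variables are generated from one shared latent factor, earnings growth is generated independently, and the first principal component is found by power iteration on the correlation matrix rather than with a PCA library.

```python
import math
import random

random.seed(0)
names = ["earnings", "rain", "humidity", "temperature"]

def make_row():
    # Hypothetical data: the weather variables share one latent factor,
    # earnings growth is independent of it.
    weather = random.gauss(0, 1)
    earnings = random.gauss(0, 1)
    rain = weather + 0.3 * random.gauss(0, 1)
    humidity = weather + 0.3 * random.gauss(0, 1)
    temperature = -weather + 0.3 * random.gauss(0, 1)
    return [earnings, rain, humidity, temperature]

def standardize(rows):
    # Center each column and scale it to unit variance.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    # Covariance of already-centered rows (here: the correlation matrix).
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    # Power iteration: converges to the eigenvector with the largest eigenvalue.
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rows = [make_row() for _ in range(500)]
cov = covariance(standardize(rows))
pc1 = first_component(cov)
loadings = dict(zip(names, pc1))

# Share of total variance captured by the first component.
d = len(names)
eigenvalue = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(d) for j in range(d))
explained_ratio = eigenvalue / sum(cov[i][i] for i in range(d))

for name in names:
    print(f"{name:12s} loading on PC1: {loadings[name]:+.2f}")
print(f"variance explained by PC1: {explained_ratio:.2f}")
```

Under these assumptions the first component captures most of the variance while giving the only relevant predictor a loading close to zero, so keeping only that component would throw away nearly all the useful signal.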

The two options you've highlighted boil down to using CFS or not using it. The option that uses CFS does better because it explicitly selects variables that have relevance to the target variable.

Other tips

If you have a classification problem, you should use LDA instead of PCA. PCA ignores class labels, whereas LDA is class-aware.

For example, if your data is 2D and you apply PCA, you can get the following:

[Figure: two classes in 2D, before and after PCA]

So before PCA, the classes were perfectly linearly separable, but after PCA they are not separable at all. I'm not saying this happens in your case, but it could be.
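A rough sketch of this effect in plain Python follows. It is not the data behind the figure: the two classes are made up so that they differ only along y while most of the variance lies along x, and both directions are computed with closed-form 2x2 formulas rather than a library. The LDA direction is the standard Fisher discriminant (inverse within-class scatter times the difference of class means).

```python
import math

# Hypothetical two-class data: classes differ only in y, variance is mostly in x.
class_a = [(float(x), 0.5 + 0.1 * math.sin(x)) for x in range(-10, 11)]
class_b = [(float(x), -0.5 + 0.1 * math.sin(x)) for x in range(-10, 11)]

def mean(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def scatter(points):
    # Sums of squares and cross-products about the mean: (sxx, sxy, syy).
    mx, my = mean(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxx, sxy, syy

def pca_direction(points):
    # First principal axis of 2D data; class labels are never used.
    sxx, sxy, syy = scatter(points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def lda_direction(a, b):
    # Fisher discriminant: solve S_w w = (mean_a - mean_b) with a 2x2 inverse.
    axx, axy, ayy = scatter(a)
    bxx, bxy, byy = scatter(b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy ** 2
    (ax, ay), (bx, by) = mean(a), mean(b)
    dx, dy = ax - bx, ay - by
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)

def project(points, direction):
    return [x * direction[0] + y * direction[1] for x, y in points]

def separable(p, q):
    # In 1D, two classes are linearly separable if their ranges do not overlap.
    return max(p) < min(q) or max(q) < min(p)

pc1 = pca_direction(class_a + class_b)
w = lda_direction(class_a, class_b)
pca_separable = separable(project(class_a, pc1), project(class_b, pc1))
lda_separable = separable(project(class_a, w), project(class_b, w))

print("separable on first principal component:", pca_separable)
print("separable on LDA direction:", lda_separable)
```

With this made-up data the one-dimensional PCA projection mixes the two classes completely, while the one-dimensional LDA projection keeps them apart. Real data will usually sit somewhere between these two extremes.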

Correlated variables should be removed before applying PCA, because together they tend to exaggerate the effect they express. CFS selects subsets of variables that are correlated with the target but largely uncorrelated with one another.
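To make the CFS idea concrete, here is a minimal sketch of the subset "merit" score that CFS is usually described with: k times the mean feature-target correlation, divided by sqrt(k + k(k-1) times the mean feature-feature correlation). Using absolute Pearson correlation as the correlation measure is an assumption made here for simplicity (actual CFS implementations may use other measures), and the feature values and names are invented.

```python
import math
from itertools import combinations

def pearson(u, v):
    # Plain Pearson correlation coefficient of two equal-length sequences.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def cfs_merit(features, target):
    # Merit rises with feature-target correlation and falls with redundancy.
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(f, g)) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical data: f2 is a rescaled copy of f1, f3 carries partly new information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [1, 2, 1, 2, 3, 4, 3, 4]
f2 = [2 * x for x in f1]
f3 = [1, 1, 2, 2, 3, 3, 4, 4]

merit_single = cfs_merit([f1], target)
merit_redundant = cfs_merit([f1, f2], target)
merit_diverse = cfs_merit([f1, f3], target)

print(f"f1 alone:       {merit_single:.3f}")
print(f"f1 + duplicate: {merit_redundant:.3f}")
print(f"f1 + f3:        {merit_diverse:.3f}")
```

In this example, adding a perfectly correlated duplicate leaves the merit unchanged, while adding an equally relevant but less redundant feature raises it. That is the sense in which CFS prefers relevant, mutually uncorrelated variables.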

Licensed under: CC-BY-SA with attribution