Selezione funzionalità e PCA

https://datascience.stackexchange.com/questions/12250

16-10-2019
|

Domanda

Ho un problema di classificazione. Io voglio ridurre il numero di funzioni a 4 (ho 30). Mi chiedo il motivo per cui ho il migliore risultato nella classificazione quando uso la selezione delle funzioni di correlazione base (CFS) e poi impiego PCA in confronto con un solo impiegando PCA (che quest'ultima sia peggiore del primo). Si dovrebbe anche menzionare che la perdita di dati nel secondo approccio (solo PCA) copertura 0,2-varianza: 0,8- e nel primo è 0.4 -variance rivestita: 0.6

Grazie in anticipo

Soluzione

PCA simply finds more compact ways of representing correlated data. PCA does not explicitly compact the data in order to better explain the target variable. In some cases, most of your inputs might be correlated with each other but have minimal relevance to your target variable. That's probably what is happening in your case.

Consider a toy example. Lets say I want to predict stock prices. Say I'm given four predictors:

Year-over-year earnings growth (relevant)
Percent chance of rain (irrelevant)
Humidity (irrelevant)
Temperature (irrelevant)

If I apply PCA to this data set, the first principle component would relate to weather since 75% of the predictors are weather related. Is this principle component relevant? It's not.

The two options you've highlighted boil down to using CFS or not using it. The option that uses CFS does better because it explicitly selects variables that have relevance to the target variable.

Altri suggerimenti

If you have a classification problem, you should you LDA instead of PCA. PCA ignores classes, whereas LDA is class-aware.

For example, if your data is 2D and you use PCA in the following example, you get:

So before PCA, the classes were perfectly linearly separable, but after PCA they are not separable at all. I'm not saying this happens in your case, but it could be.

Correlated Variables should be removed from PCA, as the variables together tend to exaggerate the effect they are expressing. CFS selects uncorrelated subsets of variables.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a datascience.stackexchange