Question

I am trying to do PCA for dimension reduction in WEKA (Classification Problem).

I have 200 attributes in my data and close to 2100 rows.

Here are the steps that I follow:

  • Import csv file in WEKA explorer

  • In the Preprocess tab, apply Normalize to the data (to bring all attributes into the range [0,1])

  • Then apply PCA.

    • In the options for PCA there is a centerData option. If it is set to False, PCA is calculated using the correlation matrix after standardizing the data (correct me if I am wrong); if it is set to True, PCA is calculated using the covariance matrix.

My questions are:

  1. Should I normalize the data before applying PCA or not? I tried PCA both with and without normalizing first and I get different results, so I am confused.
  2. Should I standardize the data (bring the mean to 0) and then apply PCA?

Which value should I select for the centerData option in WEKA's PCA in either case?

Solution

This question has been answered in part here: PCA first or normalization first?

To answer your questions directly:

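Whether to normalize is your choice, but it interacts with the centerData setting. If you set centerData=TRUE and do not normalize or standardize your data first, attributes with large values (and therefore large variances) will have a greater influence on the PCA. Rescaling the attributes changes their variances, which is why the covariance-based results differ before and after normalizing. If you set centerData=FALSE, Weka standardizes the data for you.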
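
To illustrate why normalizing changes the result in one mode but not the other, here is a minimal sketch in plain Python (not Weka code). The two attributes, income and age, are made-up toy data, and the helper names are hypothetical. It only shows the underlying arithmetic: min-max scaling to [0,1] changes the covariance between two attributes but leaves their correlation untouched.

```python
from statistics import mean, pstdev

def covariance(x, y):
    """Population covariance of two equally long lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (pstdev(x) * pstdev(y))

def min_max(x):
    """Rescale a list to the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(a - lo) / (hi - lo) for a in x]

# Toy data (assumed for illustration): two attributes on very different scales.
income = [32000.0, 45000.0, 51000.0, 60000.0, 75000.0]
age = [25.0, 31.0, 38.0, 42.0, 55.0]

raw_cov = covariance(income, age)
norm_cov = covariance(min_max(income), min_max(age))
raw_corr = correlation(income, age)
norm_corr = correlation(min_max(income), min_max(age))

print("covariance  raw vs normalized:", raw_cov, norm_cov)
print("correlation raw vs normalized:", raw_corr, norm_corr)
```

The covariance changes by several orders of magnitude after normalizing, while the correlation stays the same up to floating-point error. So a covariance-based PCA is expected to depend on the pre-scaling, while a correlation-based PCA should not.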

And just to confirm your suspicions, in Weka, centerData does the following:

centerData=TRUE

  • Centers your data (it does not normalize or standardize, so if you want that, you need to do it before applying PCA)
  • PCA is performed with the covariance matrix

centerData=FALSE

  • PCA is performed with the correlation matrix (data is standardized by the method)
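
The difference between the two modes can be sketched on two attributes, where the leading eigenvector of a 2x2 symmetric matrix has a closed form. This is a plain-Python illustration under assumed toy data, not Weka's implementation, and all function names are hypothetical. Here "covariance mode" means centering only, and "correlation mode" means standardizing each attribute first, so that the covariance matrix of the standardized data is the correlation matrix.

```python
import math
from statistics import mean, pstdev

def cov_matrix(x, y):
    """Entries (a, b, c) of the population covariance matrix [[a, b], [b, c]]."""
    mx, my = mean(x), mean(y)
    n = len(x)
    a = sum((u - mx) ** 2 for u in x) / n
    c = sum((v - my) ** 2 for v in y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / n
    return a, b, c

def first_component(a, b, c):
    """Unit-length leading eigenvector of [[a, b], [b, c]], assuming b != 0."""
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = b, lam - a          # solves (a - lam) * vx + b * vy = 0
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(u - m) / s for u in x]

# Toy data (assumed for illustration): attributes on very different scales.
income = [32000.0, 45000.0, 51000.0, 60000.0, 75000.0]
age = [25.0, 31.0, 38.0, 42.0, 55.0]

# Covariance mode: the data is only centered (done inside cov_matrix).
cov_pc1 = first_component(*cov_matrix(income, age))

# Correlation mode: standardize first, then take the covariance matrix.
corr_pc1 = first_component(*cov_matrix(standardize(income), standardize(age)))

print("first component, covariance matrix: ", cov_pc1)
print("first component, correlation matrix:", corr_pc1)
```

In the covariance case the first component points almost entirely along income, the attribute with the larger variance. In the correlation case both attributes receive equal weight, which is the behaviour described for centerData=FALSE.
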
Licensed under: CC-BY-SA with attribution