Question

Please let me know if this is the right place to ask this (or if any of my tags are wrong) or if I need to write this any differently.

Do I use the mean vector from my training set to center my testing set when dimension reducing for classification?

I am using the principal component analysis procedure to reduce the dimensions of the training set. I build the classifier. Then, before I classify the feature vectors from the test set, during the centering part of the dimension reduction, do I use the same mean vector from the training set, do I take the mean vector of the testing set and subtract that from the test set, or do I take the mean vector of the union of the training and test set and subtract that from the test set?

If the third option, does that mean I was also supposed to use the union of the training and testing sets to center the training set as well? No (for the sake of generalizing to other testing sets), right?

Also, even though I am pretty sure the answer will be the same as above, can you please let me know whether the same is true for using the covariance matrix from the training set to get an eigenvector matrix and multiplying its inverse (transpose?) by the test set to reduce it. Or do we use the testing set, or the union of the two, to get the covariance matrix and then the eigenvector matrix to multiply by the testing set?

Please let me know if any of the premises are wrong. This is my first time.


Solution

Do I use the mean vector from my training set to center my testing set when dimension reducing for classification?: Yes.

The test set must not be combined with the training set at any step of computing the reduced-dimension space. The characteristics of the final space are determined by the training set alone, and the test set simply follows: in particular, the mean-centering step uses the training mean.
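A minimal sketch of this centering step, using NumPy with made-up data (the array names and sizes are illustrative, not from your setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))   # hypothetical training data, n_samples x n_features
X_test = rng.normal(size=(20, 5))     # hypothetical test data

mu = X_train.mean(axis=0)             # mean vector computed on the training set ONLY
X_train_centered = X_train - mu
X_test_centered = X_test - mu         # the test set is centered with the *training* mean
```

Note that `X_test_centered` will generally not have exactly zero column means, and that is fine: only the training set defines the center of the space.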

You just compute the final eigenvector matrix $E$ (which is $d\times d$ at first, where $d$ is the dimensionality of the data, and becomes $d_{reduced}\times d$ after keeping the top eigenvectors), and then your test data $D$ ($n\times d$) is simply multiplied by that matrix to obtain the test data in the reduced space ($D^{'}$):

$$D_{n\times d}\times E^{T} = D^{'}_{n\times d_{reduced}}$$

where the dimensionality of $E^{T}$ is $d\times d_{reduced}$, and $T$ denotes the matrix transpose (you mentioned the inverse, which is wrong).

NOTE: Depending on how you arrange samples in your data matrix, the matrix product will look different. Do not get confused if you see different conventions in the literature. The standard layout of data, assumed above as well, is $n_{samples}\times n_{features}$: each row is a sample and each column is a dimension.
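The full pipeline above can be sketched with NumPy's eigendecomposition; everything (covariance, eigenvectors, projection) comes from the training set, and the test set is only centered and projected. The data and `d_reduced` here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 5))   # hypothetical training data, n x d
X_test = rng.normal(size=(20, 5))     # hypothetical test data
d_reduced = 2                         # number of components to keep

mu = X_train.mean(axis=0)             # training mean only
Xc = X_train - mu

C = np.cov(Xc, rowvar=False)          # d x d covariance of the training set

# eigh returns eigenvalues in ascending order, so sort descending
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
E = eigvecs[:, order[:d_reduced]].T   # d_reduced x d: top eigenvectors as rows

# project the test data: (n x d) @ (d x d_reduced) = n x d_reduced
D_prime = (X_test - mu) @ E.T
```

Note `@ E.T`, not a matrix inverse: since the eigenvectors of a symmetric covariance matrix are orthonormal, the transpose is all that is needed for the projection.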

I hope this helped. Feel free to comment if you have any questions.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange