PCA - what do I do with its results?

https://datascience.stackexchange.com/questions/72949

10-12-2020

Question

I have a data set with more than 20 features, and I applied PCA:

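import numpy as np
import matplotlib.pyplot as plt

# M is a scikit-learn PCA object created earlier (its definition is not shown here)
M.fit_transform(all_data)
variance = M.explained_variance_ratio_  # fraction of variance per component
var = np.cumsum(np.round(variance, decimals=3) * 100)  # cumulative % of variance
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(30, 102.5)
plt.plot(var, marker="s")
plt.show()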

Printing the var variable, I get

array([ 89., 100., 100., 100., 100., 100., 100., 100., 100., 100.])

I understand this tells us that the variance is explained by 2 features.

So I ran it again, this time with 2 components:

from sklearn.decomposition import PCA
M = PCA(n_components = 2)
X = M.fit_transform(all_data)
plt.scatter(X[:,0],X[:,1])

And this gives a "random looking plot". I understand that the data was changed during the PCA process.

What can I do with this information? How will this help me understand the data?

Is it useful per se? Is it useful as a preparation method for other methods? Which ones can I try?

Solution

What can I do with this information?
- You can do a lot with this data: you can visualize it, or you can use the projected vectors for prediction or regression, whatever the task at hand. However, PCA has a few limitations that you need to keep in mind. For example, it can be very memory intensive, so you need a lot of RAM to run PCA on large datasets.
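
As a rough sketch of using the PCA output for prediction, the snippet below feeds two principal components into a classifier. The synthetic dataset, the scaling step, and the choice of logistic regression are all assumptions made for illustration; they are not part of the question or the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `all_data` (assumption: 20 numeric features, binary target).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, project onto 2 principal components,
# then fit a classifier on the projected data.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy using 2 components: {accuracy:.2f}")
```

Putting PCA inside the pipeline means the components are learned from the training split only, so the test score is not contaminated by the test data.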
How will this help me understand the data?
- You can visualize the data as in the image at http://www.nlpca.org/pca_principal_component_analysis.html.

In that image, the data points separate clearly into different clusters. When your own projection looks like that, you can apply K-Means to get the cluster centres, and then use those centres to investigate further and find additional insights.
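
The K-Means step described above could look roughly like this. The data here is synthetic (generated with three hidden groups), and the value of `n_clusters` is an assumption that matches that synthetic data; on a real dataset the number of clusters has to be chosen separately.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for `all_data`: 300 samples, 20 features, 3 hidden groups.
data, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

# Project onto the first two principal components, as in the question.
X2 = PCA(n_components=2).fit_transform(data)

# Cluster the 2-D projection. n_clusters=3 is known here only because
# the data was generated with 3 groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X2)
centres = kmeans.cluster_centers_

print(centres)  # one (x, y) centre per cluster, in component space
```

Passing `c=labels` to the `plt.scatter(X[:,0], X[:,1])` call from the question colours each point by its cluster, which makes it easier to see whether the "random looking plot" has any structure.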
1. Is it useful per se?
  - PCA is a dimensionality reduction technique, and it can be very memory intensive.
  - If you have the required memory, you can often reduce the number of features by 50-80% while still retaining a good amount of information. For example, 100 features might be reduced to 20-30 components that carry most of the information.
  - Before performing PCA, it's important to check whether the matrix computation fits in the RAM that you have; otherwise, you can look into iterative variants of PCA (for example, incremental PCA, which processes the data in batches).
2. Is it useful as a preparation method for other methods?
  - It is very useful as a preparation method for clustering and visualization.
  - Please refer to this link: https://qiita.com/bmj0114/items/db9145a707cb6ed13201
3. Which ones can I try?
  - You can try the example given in the link above.
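
On the memory point made earlier, here is a minimal sketch of a batch-wise alternative. It assumes that the "Iterative - PCA" mentioned in the answer refers to something like scikit-learn's `IncrementalPCA`; the random data, batch count, and number of components are placeholders.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Random stand-in for a dataset that would be too large to decompose at once.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))

ipca = IncrementalPCA(n_components=5)

# Feed the data in 10 batches of 100 rows; only one batch is processed at a time.
# Each batch must contain at least n_components rows.
for batch in np.array_split(data, 10):
    ipca.partial_fit(batch)

reduced = ipca.transform(data)
print(reduced.shape)  # (1000, 5)
print(ipca.explained_variance_ratio_.sum())
```

In a real out-of-core setting the batches would be read from disk one at a time instead of being split from an in-memory array.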

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange