PCA - what do I do with its results?
Question
I have a data set with more than 20 features, and I applied PCA:
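import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumed setup: the post does not show how M was created; 10 components matches the output below
M = PCA(n_components=10)
M.fit_transform(all_data)
variance = M.explained_variance_ratio_
# Cumulative share of variance explained, in percent
var = np.cumsum(np.round(variance, decimals=3) * 100)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Components')
plt.title('PCA Analysis')
plt.ylim(30, 102.5)
plt.plot(var, marker="s")
plt.show()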
Printing the var variable, I get
array([ 89., 100., 100., 100., 100., 100., 100., 100., 100., 100.])
I understand this to mean that the variance is explained by 2 components.
So I ran it again, now with 2 components:
from sklearn.decomposition import PCA
# Keep only the first 2 principal components
M = PCA(n_components=2)
X = M.fit_transform(all_data)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
And this gives a "random looking plot". I understand that the data was changed during the PCA process.
What can I do with this information? How will this help me understand the data?
Is it useful per se? Is it useful as a preparation method for other methods? Which ones can I try?
Solution
What can I do with this information?
- You can do a lot with this data. You can visualize it, or use the component vectors as inputs for classification, regression, or whatever the task at hand is. However, PCA has a few restrictions to keep in mind. For example, it can be very memory intensive, so you need a lot of RAM to run PCA on large data sets.
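As a rough illustration of using the components for prediction, here is a minimal sketch. It assumes a labelled classification task, which the question does not mention; the synthetic data from make_classification, the scaling step, the choice of 5 components, and the LogisticRegression model are all placeholders, not anything from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 500 rows, 20 features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, project onto 5 principal components, then classify.
# Putting PCA inside the pipeline means it is fitted on the training split only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_kept = model.named_steps["pca"].n_components_
print("components kept:", n_kept)
print("test accuracy:", accuracy)
```

Scaling before PCA is a common choice when features are on different scales, because otherwise the feature with the largest variance tends to dominate the first component.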
How will this help me understand the data?
- You can visualize the data like this (image is taken from http://www.nlpca.org/pca_principal_component_analysis.html):
With reference to the above image, you can see that the data points can be clearly separated into different clusters. Using this, you can apply K-Means to get the cluster centres, and then investigate those centres to find additional insights.
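The K-Means step described above could look roughly like this. It is a sketch under assumptions: all_data is replaced by synthetic blobs, and the choice of 3 clusters is arbitrary; with real data you would have to pick the number of clusters yourself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for all_data (assumption): 300 rows, 20 features, 3 groups.
all_data, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

# Project onto the first 2 principal components, as in the question.
pca = PCA(n_components=2)
X = pca.fit_transform(all_data)

# Cluster in the reduced 2-D space; n_clusters=3 is an arbitrary choice here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centres_2d = kmeans.cluster_centers_

# Map the cluster centres back to the original feature space so they can be
# read in terms of the original features.
centres_original = pca.inverse_transform(centres_2d)
print(centres_2d.shape, centres_original.shape)
```

Passing c=labels to plt.scatter(X[:, 0], X[:, 1], c=labels) would colour the scatter plot from the question by cluster.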
Is it useful per se?
- PCA is a dimensionality reduction technique that can be very memory intensive.
- If you have the required memory, you can often reduce the number of features by 50-80% while still retaining a good amount of information. For example, you might reduce 100 features to 20-30 components that hold most of the information.
- While performing PCA, it's important to check whether the matrix computation fits in the RAM that you have. If it doesn't, you can check out incremental PCA, which processes the data in batches (for example, IncrementalPCA in scikit-learn).
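The points above can be sketched as follows: first let PCA pick the number of components needed to reach a variance target, then fit the same number of components in batches with scikit-learn's IncrementalPCA for the case where the data does not fit in memory. The synthetic data, the 95% target, and the 10 batches are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 1000 rows, 100 correlated features
# generated from 10 latent factors plus a little noise.
latent = rng.normal(size=(1000, 10))
mixing = rng.normal(size=(10, 100))
data = latent @ mixing + 0.1 * rng.normal(size=(1000, 100))

# A float n_components between 0 and 1 asks PCA to keep the smallest number
# of components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(data)
k = pca.n_components_
print("components kept:", k, "of", data.shape[1])

# IncrementalPCA sees the data one batch at a time, so only a single batch
# has to be held in memory during fitting.
ipca = IncrementalPCA(n_components=k)
for batch in np.array_split(data, 10):
    ipca.partial_fit(batch)
reduced_incremental = ipca.transform(data)
print(reduced_incremental.shape)
```

With real data the batches would typically be read from disk one at a time instead of being split from an array that is already in memory.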
Is it useful as a preparation method for other methods?
- It is very useful as a preparation method for clustering and visualization.
- Please refer to this link: https://qiita.com/bmj0114/items/db9145a707cb6ed13201
Which ones can I try?
- You can try the example given in the link above.