Question

I have a data set with more than 20 features, and I applied PCA:

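import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumption: M was created with 10 components, matching the 10 values printed below
M = PCA(n_components=10)
M.fit_transform(all_data)
variance = M.explained_variance_ratio_
# Cumulative percentage of variance explained by the first k components
var = np.cumsum(np.round(variance, decimals=3)*100)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(30,102.5)
plt.plot(var, marker="s")
plt.show()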

Printing the var variable, I get

array([ 89., 100., 100., 100., 100., 100., 100., 100., 100., 100.])

I understand this to mean that nearly all of the variance is explained by the first 2 components.

So I ran it again, this time with 2 components:

from sklearn.decomposition import PCA
M = PCA(n_components = 2)
X = M.fit_transform(all_data)
plt.scatter(X[:,0],X[:,1])

And this gives a "random looking plot". I understand that the data was changed during the PCA process.

What can I do with this information? How will this help me understand the data?

Is it useful per se? Is it useful as a preparation method for other methods? Which ones can I try?

Solution

  1. What can I do with this information?

    • You can do a lot with this data: visualize it, or use the projected vectors as inputs for prediction or regression, depending on the task at hand. However, PCA has a few restrictions to keep in mind. For example, it is very memory intensive, so you need a lot of RAM to run it on certain data sets.
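As a rough illustration of using the projected vectors for regression, here is a minimal sketch. The data and the target y are synthetic stand-ins (assumptions, not the asker's all_data), and chaining scaling, PCA and a linear model in a pipeline is just one reasonable setup, not the only one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for all_data: 200 samples, 20 features driven by 2 hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
all_data = latent @ rng.normal(size=(2, 20)) + 0.05 * rng.normal(size=(200, 20))

# Hypothetical target that depends on the same hidden factors
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=200)

# Scale the features, project onto 2 principal components, then fit a linear regression
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(all_data, y)

# R^2 on the training data; use a held-out split for a real evaluation
r2 = model.score(all_data, y)
print(f"R^2 on training data: {r2:.3f}")
```

Putting PCA inside the pipeline means the projection is learned only from the data passed to fit, which keeps cross-validation honest.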
  2. How will this help me understand the data?

[Image: PCA Visualization]

    • With reference to the image above, the data points separate clearly into distinct clusters. Given a projection like this, you can apply K-Means to obtain the cluster centres, then investigate those centres to find additional insights.
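A minimal sketch of the K-Means step on a 2-component projection. The data comes from make_blobs as a synthetic stand-in for all_data, and n_clusters=3 is an assumption; on real data you would pick it by looking at the scatter plot or with a score such as the silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for all_data: 300 samples, 20 features, 3 well-separated groups
all_data, _ = make_blobs(n_samples=300, n_features=20, centers=3,
                         cluster_std=1.0, random_state=0)

# Project onto the first 2 principal components, as in the question
X = PCA(n_components=2).fit_transform(all_data)

# n_clusters=3 is an assumption chosen to match the synthetic data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Cluster centres in the 2-D PCA space
centres = kmeans.cluster_centers_
print(centres)
```

If the scatter plot looks random, as in the question, K-Means will still return centres, but they may not correspond to meaningful groups.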

  3. Is it useful per se?

    • PCA is a dimensionality reduction technique, and it is very memory intensive.
    • If you have the required memory, you can often reduce the number of features by 50-80% while still retaining a good amount of information. For example, 100 features can be reduced to the 20-30 components that carry the most information.
    • Before performing PCA, check whether the matrix computation fits in the RAM you have; otherwise, look into Iterative PCA.
  4. Is it useful as a preparation method for other methods?

  5. Which ones can I try?

    • You can try the example given in the link above.
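The points about reducing the feature count and about limited RAM can be sketched as follows. The data is a synthetic stand-in, the 95% threshold and the batch size of 200 are arbitrary assumptions, and scikit-learn's IncrementalPCA is used here as one batch-wise variant; it may or may not be what the answer means by Iterative PCA.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Synthetic stand-in: 1000 samples, 100 features driven by 10 hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 10))
all_data = latent @ rng.normal(size=(10, 100)) + 0.1 * rng.normal(size=(1000, 100))

# A float n_components between 0 and 1 keeps as many components as needed
# to reach that fraction of explained variance (0.95 is an assumed threshold)
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(all_data)
print(reduced.shape[1], "components kept out of", all_data.shape[1])

# IncrementalPCA learns the components batch by batch, so the whole
# matrix does not have to be decomposed in one go
ipca = IncrementalPCA(n_components=10)
for start in range(0, all_data.shape[0], 200):
    ipca.partial_fit(all_data[start:start + 200])
reduced_inc = ipca.transform(all_data)
print(reduced_inc.shape)
```

In a real out-of-core setting each batch would be read from disk inside the loop instead of being sliced from an array already in memory.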
Licensed under: CC-BY-SA with attribution