How is PCA different from subspace clustering, and how do we extract the variables responsible for the first PCA component?

datascience.stackexchange https://datascience.stackexchange.com/questions/18067

  •  22-10-2019

Question

New update:

I understand that the principal components capture the directions of highest variance, but I would like to know how to extract the key original variables that are responsible for those high-variance components.

Ideally, a simple example would help.

This is my code:

```
# Implementing PCA for visualizing the clusters after K-means
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Interpret 3-cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# Plot clusters
'''The new variables, called canonical variables, are ordered in terms of the
proportion of variance in the clustering variables that is accounted for by
each of the canonical variables. So the first canonical variable will account
for the largest proportion of the variance, the second canonical variable for
the next largest proportion, and so on. Usually, the majority of the variance
in the clustering variables will be accounted for by the first couple of
canonical variables, and those are the variables that we can plot.'''

pca_2 = PCA(2)  # Selecting 2 components
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
```

Observations are more spread out, indicating less correlation among the observations and higher within-cluster variance.
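For reference, the contribution of each original variable to a component can be read off the PCA loadings. A minimal sketch, assuming `clus_train` is a pandas DataFrame with named columns (the actual column names are not shown in the post):

```
# Minimal sketch (not from the original post): inspect the PCA loadings to see
# which original variables drive each component. Assumes clus_train is a
# pandas DataFrame with named columns, as in the code above.
import pandas as pd

# pca_2.components_ has shape (n_components, n_features); each row holds the
# weights (loadings) of the original variables in that component.
loadings = pd.DataFrame(
    pca_2.components_,
    columns=clus_train.columns,
    index=['PC1', 'PC2'],
)

# Variables with the largest absolute loading contribute most to PC1.
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head(10))

# Share of total variance explained by each component.
print(pca_2.explained_variance_ratio_)
```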

Solution

Reducing the dimensionality of a dataset with PCA does not only benefit humans trying to look at the data in a graspable number of dimensions. It is also useful for training machine learning algorithms on a smaller set of dimensions, both to reduce the complexity of the data and the computational cost of fitting the model.
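As an illustration of that second use, here is a minimal sketch of training a classifier on PCA-reduced features; the digits dataset and the 95% variance threshold are arbitrary choices for the example, not anything from the original post:

```
# Minimal sketch (illustrative only): reduce dimensionality with PCA, then
# train a classifier on the reduced features.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep enough components to explain 95% of the variance (a placeholder choice).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print("components kept:", model.named_steps['pca'].n_components_)
print("test accuracy:", model.score(X_test, y_test))
```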

OTHER TIPS

PCA is a very common technique, so you might want to search around. It is widely used for data visualisation, but it has many other uses.

For instance, suppose you want to fit a linear regression on average income. You have collected 500+ predictors, but many of them are correlated, such as:

  • How much tax the person paid last year
  • How much tax the person paid the year before
  • How much tax the person paid three years ago
  • ....

Those predictors are highly correlated and might cause modelling problems in your linear model. A very common technique is to use PCA to reduce them to a small set of orthogonal principal components, which you can then use to build your model.

https://stats.stackexchange.com/questions/22665/how-to-use-principal-components-as-predictors-in-glm
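A minimal sketch of that idea (often called principal component regression) on synthetic, deliberately correlated predictors; the data generation and the choice of 5 components are placeholders, not part of the original answer:

```
# Minimal sketch (illustrative only): regress on a few orthogonal principal
# components instead of many highly correlated raw predictors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 500, 50
latent = rng.normal(size=(n, 3))                      # a few underlying factors
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))  # 50 correlated predictors
y = latent @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Standardise, keep the first few orthogonal components, then fit the regression.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)

print("R^2 on training data:", pcr.score(X, y))
```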

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange