Question

I posted my question on Stack Overflow, but someone there suggested I try it here, so that's what I'm doing now. :)

OK, first about my data. I have a word-bigram frequency matrix (1100 x 100658, dtype=int), where the first 5 columns contain information about the document. So every row is a document and every column is a word bigram like (of-the, on-the, and-that, ...). I want to visualize the data, but before I do that, I want to reduce its dimensionality. So I thought I'd do that with PCA from sklearn. First I set the column labels with

myPandaDataFrame.columns = word_bi_grams

then I deleted some of the document columns, because I want to see what kind of information I can get if I only look at proficiency:

del existing_df['SUBSET']
del existing_df['PROMPT']
del existing_df['L1']
del existing_df['ESSAYID']

Then I set the proficiency column to be the index (note that `set_index` is a DataFrame method, not a method of `.columns`, and `drop=True` is already the default):

myPandaDataFrame.set_index('PROFICIENCY', inplace=True)

and then I did this

import pandas
from sklearn.decomposition import PCA

x = 500
pcax = PCA(n_components=x)
pcax.fit(myPandaDataFrame)
# Project the documents onto the principal components.
existing_2dx = pcax.transform(myPandaDataFrame)
existing_df_2dx = pandas.DataFrame(existing_2dx)
existing_df_2dx.index = myPandaDataFrame.index
existing_df_2dx.columns = ['PC{0}'.format(i) for i in range(x)]

But with this implementation I can only set n_components to a maximum of 1100, which is the number of documents (rows). This makes me suspicious. I tried a couple of examples / tutorials, but I can't get it right, so I hope someone can help me find out what I'm doing wrong. I would also be very happy about a good example / tutorial that addresses my problem. Thank you.

With best regards.


Solution

Given m rows of n columns, it's natural to think of the data as n-dimensional. However, the inherent dimension d of the data may be lower: d <= n, where d is the rank of the m x n matrix formed from the data. The dimensionality of the data can be reduced to d with no loss of information at all. The same actually goes for rows, which is less intuitive but true: d <= m. So it always makes sense to reduce dimensionality to something <= d, since nothing is lost; we typically reduce much further. This is why PCA won't let you keep more components than the number of rows.
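A minimal sketch of this bound, using a small stand-in matrix (the shapes here are made up for illustration, not taken from the question's 1100 x 100658 data):

```python
import numpy as np
from sklearn.decomposition import PCA

# A tall-and-thin analogue of the bigram matrix: 20 documents, 50 bigram columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(20, 50)).astype(float)

# The rank d is bounded by both dimensions: d <= min(m, n) = 20.
print(np.linalg.matrix_rank(X) <= min(X.shape))  # True

# sklearn's PCA accepts at most min(n_samples, n_features) components...
PCA(n_components=min(X.shape)).fit(X)  # 20 components: fine

# ...and rejects anything larger, which is exactly the cap the asker hit.
try:
    PCA(n_components=min(X.shape) + 1).fit(X)
except ValueError as err:
    print('rejected:', err)
```

With 1100 documents and 100658 bigram columns, min(m, n) = 1100, so 1100 is the hard ceiling on n_components regardless of how many columns the matrix has.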

Licensed under: CC-BY-SA with attribution