Question

I am trying to apply TruncatedSVD.fit_transform() to the sparse matrix produced by TfidfVectorizer in scikit-learn:

    tsv = TruncatedSVD(n_components=10000, algorithm='randomized', n_iterations=5)
    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                          analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                          use_idf=1, smooth_idf=1, sublinear_tf=1)
    tfv.fit(text)
    text = tfv.transform(text)
    tsv.fit(text)

but it fails with:

    ValueError: array is too big

What other approaches can I use for dimensionality reduction?


Solution

I am pretty sure that the problem is:

    tsv = TruncatedSVD(n_components=10000...

You have 10000 components in your SVD. If you have an m x n data matrix, the SVD produces factor matrices of shape m x n_components and n_components x n, and these are dense even when the input data is sparse. Those matrices are probably too big.

I copied your code and ran it on Kaggle Hashtag data (which is what I think this is from), and at 300 components, Python was using up to 1 GB. At 10000, you'd use about 30 times that.
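As a back-of-the-envelope check, the dense factors grow linearly with n_components, which is where the "about 30 times" figure comes from. The matrix shape below is an illustrative assumption, not taken from the question; plug in your own text.shape:

    # Dense factors produced by TruncatedSVD on an m x n tf-idf matrix:
    #   transformed data is m x k, components_ is k x n (k = n_components).
    # m and n here are assumed values for illustration only.
    m, n = 100_000, 500_000          # documents x tf-idf features (assumed)
    bytes_per_value = 8              # float64
    to_gb = lambda cells: cells * bytes_per_value / 1024**3

    for k in (300, 10_000):
        print(k, f"~{to_gb(m * k + k * n):.1f} GB of dense SVD output")
    # Memory scales linearly with k, so going from 300 to 10000 components
    # multiplies it by roughly 10000 / 300 ~ 33.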

Incidentally, what you are doing here is latent semantic analysis, and that isn't likely to benefit from this many components. Somewhere in the range of 50-300 should capture everything that matters.
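A minimal sketch of what that might look like; n_components=200 is just a placeholder in the 50-300 range, and text stands for the corpus from the question:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfv = TfidfVectorizer(min_df=3, strip_accents='unicode', analyzer='word',
                          token_pattern=r'\w{1,}', ngram_range=(1, 2),
                          use_idf=True, smooth_idf=True, sublinear_tf=True)
    X = tfv.fit_transform(text)              # sparse tf-idf matrix

    tsv = TruncatedSVD(n_components=200, algorithm='randomized')
    X_lsa = tsv.fit_transform(X)             # dense (n_docs, 200) array
    print(X_lsa.shape)
    print(tsv.explained_variance_ratio_.sum())   # variance kept by 200 dims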

OTHER TIPS

You may be getting this error because you are using 32-bit Python; try switching to 64-bit. Another approach to dimensionality reduction for sparse matrices is RandomizedPCA, which is PCA computed with a randomized SVD.
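For reference, a hedged sketch of that second suggestion: RandomizedPCA only exists in older scikit-learn releases (it was removed around version 0.20), and on sparse input it skipped centering, so it behaved much like TruncatedSVD, which is what current releases recommend instead:

    # Sketch only: fall back to TruncatedSVD when RandomizedPCA is unavailable.
    try:
        from sklearn.decomposition import RandomizedPCA    # older scikit-learn releases
        reducer = RandomizedPCA(n_components=200)
    except ImportError:
        from sklearn.decomposition import TruncatedSVD     # modern equivalent for sparse input
        reducer = TruncatedSVD(n_components=200, algorithm='randomized')

    X_reduced = reducer.fit_transform(X)    # X: the sparse tf-idf matrix from above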

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow