Question

I need to store word co-occurrence counts in several 14000x10000 matrices. Since I know the matrices will be sparse and I do not have enough RAM to store all of them as dense matrices, I am storing them as scipy.sparse matrices.

I have found that the most efficient way to gather the counts is with Counter objects. Now I need to transfer the counts from the Counter objects to the sparse matrices, but this takes too long: it currently takes on the order of 18 hours to populate the matrices.

The code I'm using is roughly as follows:

# Copy every (row, column) pair from the Counters into the sparse matrix.
for word_ind1, word1 in enumerate(wordlist1):
    for word_ind2, word2 in enumerate(wordlist2):
        word_counts[word_ind2, word_ind1] = word_counters[word1][word2]

Here, word_counts is a scipy.sparse.lil_matrix object, word_counters is a dictionary mapping each word to a Counter, and wordlist1 and wordlist2 are lists of strings.

Is there any way to do this more efficiently?

Solution

You're using a LIL matrix, which (unfortunately) inserts in time linear in the number of elements already stored in a row, so building the matrix element by element this way takes quadratic time. Try a DOK matrix instead; it uses a hash table for storage, so each insertion is roughly constant time.
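
For illustration, here is a minimal sketch of that approach. The data and the names wordlist1, wordlist2, and word_counters are hypothetical stand-ins for the question's variables, and it additionally iterates only over the nonzero Counter entries rather than every (row, column) pair:

from collections import Counter
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-ins for the question's data.
wordlist1 = ["cat", "dog"]
wordlist2 = ["runs", "sleeps", "barks"]
word_counters = {
    "cat": Counter({"runs": 2, "sleeps": 5}),
    "dog": Counter({"barks": 7, "runs": 1}),
}

# Map words to matrix indices once, up front.
col_index = {w: j for j, w in enumerate(wordlist1)}
row_index = {w: i for i, w in enumerate(wordlist2)}

# DOK insertion is a hash-table write, and only nonzero counts are visited.
word_counts = sp.dok_matrix((len(wordlist2), len(wordlist1)), dtype=np.int64)
for word1, counter in word_counters.items():
    for word2, count in counter.items():
        word_counts[row_index[word2], col_index[word1]] = count

# Convert to CSR once construction is done, for fast arithmetic afterwards.
word_counts = word_counts.tocsr()
print(word_counts.toarray())

Skipping the zero entries matters as much as the matrix format here: the original loop touches all 14000 x 10000 cells per matrix, while this one touches only the counts that actually exist.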

However, if you're interested in boolean term occurrences, then computing the co-occurrence matrix is much faster if you have a sparse term-document matrix. Let A be such a matrix of shape (n_documents, n_terms); then the co-occurrence matrix is

A.T * A
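
As a small sketch with a hypothetical term-document matrix (note that for scipy sparse matrices, * denotes matrix multiplication; A.T @ A is the equivalent with the modern operator):

import numpy as np
import scipy.sparse as sp

# Hypothetical boolean term-document matrix: 3 documents (rows), 4 terms (columns).
A = sp.csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=np.int64))

# cooccurrence[i, j] = number of documents containing both term i and term j;
# the diagonal holds each term's document frequency.
cooccurrence = A.T * A
print(cooccurrence.toarray())

Because A is sparse, the product is computed over stored entries only, so this avoids the per-cell Python loop entirely.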