Question

I am trying to create a very large sparse matrix with shape (447957347, 5027974), containing 3,289,288,566 stored elements.

But when I create a csr_matrix using scipy.sparse, it returns something like this:

<447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>'
    with -1005678730 stored elements in Compressed Sparse Row format>

The source code for creating the matrix is:

import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array(a, dtype=np.uint32)    # a is a Python array('L') containing the row pointer information
indices = np.array(b, dtype=np.uint32)   # b is a Python array('L') containing the column index information
data = np.ones((len(indices),), dtype=np.uint32)
test = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 5027974), dtype=np.uint32)

I also found that when I convert a 3-billion-element Python array to a NumPy array, it raises an error:

ValueError: setting an array element with a sequence

But when I create three 1-billion-element Python arrays, convert each to a NumPy array, and then concatenate them, it works fine.
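For reference, a sketch of that chunked conversion (the helper name and chunk size are illustrative, not from the original code):

import numpy as np

def to_numpy_chunked(a, chunk=10**9, dtype=np.uint32):
    # Convert a huge Python array in slices, then concatenate the
    # pieces, mirroring the workaround described above.
    parts = [np.array(a[i:i + chunk], dtype=dtype)
             for i in range(0, len(a), chunk)]
    return np.concatenate(parts)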

I'm confused.

Solution

You are using an older version of SciPy. In the original implementation of sparse matrices, indices were stored in int32 variables, even on 64-bit systems. Even if you define them to be uint32, as you did, they get cast to int32 internally. So whenever your matrix has more than 2^31 - 1 non-zero entries, as is the case here, the indexing overflows and lots of bad things happen. Note that in your case the weird negative number of elements is explained by:

>>> np.int32(np.int64(3289288566))
-1005678730
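Equivalently, in plain Python: casting to int32 keeps only the low 32 bits interpreted as a signed value, so the count wraps around modulo 2^32:

>>> 3289288566 - 2**32
-1005678730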

The good news is that this has already been fixed. I think this is the relevant PR, although there were some more fixes after that one. In any case, if you use the latest release candidate of SciPy 0.14, your problem should be gone.
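As a quick sanity check (a sketch; test, data, indices, and indptr are the names from the question), you can verify the installed version and confirm that the stored-element count no longer overflows:

import scipy
print(scipy.__version__)   # needs to be >= 0.14 for 64-bit sparse indices

# After rebuilding the matrix on a fixed SciPy:
# test = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 5027974))
# assert test.nnz == len(data) and test.nnz > 0, "int32 index overflow"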

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow