Question

I have a scipy.sparse.csr.csr_matrix that represents words in a document, and a list of lists where each entry holds the categories for the corresponding row of the matrix.

The problem I am having is that I need to randomly select N rows from the data.

So if my matrix looks like this

[1:3 2:3 4:4]
[1:5 2:5 5:4]

and my list of lists looked like this

((20,40) (80,50))  

and I needed to sample 1 row, I could end up with this

[1:3 2:3 4:4]
((20,40))

I have searched the scipy documentation but I cannot find a way to generate a new csr matrix using a list of indexes.

Solution

You can simply index a csr matrix by using a list of indices. First we create a matrix, and look at it:

>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse Row format>

>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]

Of course, we can easily look at just the first row:

>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>

>>> print(m[0].toarray())
[[0 0 1 0]]

But we can also look at the first and third row at once using the list [0,2] as an index:

>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
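
As a small illustration of the list indexing shown above, the rows come back in the order the indices are listed, and the result is still a sparse matrix. This is only a sketch that reuses the same example matrix; the name reordered is made up for illustration:

```python
from scipy.sparse import csr_matrix, issparse

m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])

# Ask for the third row first, then the first row.
reordered = m[[2, 0]]

# The rows appear in the requested order, not in their original order.
print(reordered.toarray())

# The selection is still sparse; nothing was converted to a dense array.
print(issparse(reordered), reordered.shape)
```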

Now you can generate N random indices with no repeats (no replacement) using numpy's choice:

import numpy as np
i = np.random.choice(m.shape[0], N, replace=False)
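
If you want the draw to be reproducible, one option is NumPy's newer Generator interface (np.random.default_rng, available in NumPy 1.17 and later). The sketch below is not part of the original answer: n_rows stands in for m.shape[0], and the seed value 0 is arbitrary. Sampling without replacement needs N to be no larger than the number of rows, so the sketch checks that first:

```python
import numpy as np

n_rows = 3  # stands in for m.shape[0]
N = 2       # number of rows to sample

# Without replacement, we cannot draw more rows than exist.
if N > n_rows:
    raise ValueError("N must not exceed the number of rows")

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
i = rng.choice(n_rows, size=N, replace=False)

print(i)
```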

Then you can grab those rows from your original matrix m:

sub_m = m[i]

To grab the matching entries from your categories list of lists, first convert it to an array; then you can index it with the same index array i:

sub_c = np.asarray(categories)[i]

If you want to have a list of lists back, just use:

sub_c.tolist()

or, if what you really have/want is a tuple of tuples, I think you have to do it manually:

tuple(map(tuple, sub_c))
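
Putting the steps together, a minimal end-to-end sketch might look like the following. The helper name sample_rows, its seed parameter, and the example categories are hypothetical and not from the original question. It picks the categories with a list comprehension rather than np.asarray, which also works when the inner lists have different lengths and may not convert cleanly to an array:

```python
import numpy as np
from scipy.sparse import csr_matrix


def sample_rows(matrix, categories, n, seed=None):
    """Return n randomly chosen rows of `matrix`, their categories, and the indices used."""
    if len(categories) != matrix.shape[0]:
        raise ValueError("categories must have one entry per matrix row")
    rng = np.random.default_rng(seed)
    # Draw n distinct row indices.
    idx = rng.choice(matrix.shape[0], size=n, replace=False)
    sub_matrix = matrix[idx]
    # Select the matching category lists with the same indices, in the same order.
    sub_categories = [categories[k] for k in idx]
    return sub_matrix, sub_categories, idx


m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])
categories = [[20, 40], [80, 50], [10]]  # made-up example data

sub_m, sub_c, idx = sample_rows(m, categories, 2, seed=0)
print(sub_m.toarray())
print(sub_c)
```

Returning the indices as well makes it easy to check afterwards which original rows were chosen.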
Licensed under: CC-BY-SA with attribution