You can simply index a csr matrix by using a list of indices. First we create a matrix, and look at it:
>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]
Of course, we can easily look at just the first row:
>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
>>> print(m[0].toarray())
[[0 0 1 0]]
But we can also look at the first and third rows at once by using the list [0,2] as an index:
>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
Now you can generate N random indices with no repeats (sampling without replacement) using numpy's np.random.choice:
i = np.random.choice(np.arange(m.shape[0]), N, replace=False)
Then you can grab the rows at those indices from your original matrix m:
sub_m = m[i]
To grab the matching entries from your categories list of lists, you must first convert it to an array; then you can index it with the index array i:
sub_c = np.asarray(categories)[i]
If you want to have a list of lists back, just use:
sub_c.tolist()
or, if what you really have/want is a tuple of tuples, I think you have to do it manually:
tuple(map(tuple, sub_c))
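
Putting the steps together, here is a minimal sketch of the whole procedure. The helper name sample_rows and the example categories data are hypothetical, made up for illustration. It also indexes categories with a list comprehension instead of np.asarray, which is an alternative to the approach above that should behave the same for regular data and avoids building an array when the inner lists have different lengths:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sample_rows(m, categories, n):
    """Pick n distinct row indices; return them with the matching rows and categories.

    sample_rows is a hypothetical helper name, not part of numpy or scipy.
    """
    # n distinct row indices, drawn without replacement
    i = np.random.choice(m.shape[0], n, replace=False)
    # Index the sparse matrix with the index array to keep only those rows
    sub_m = m[i]
    # Pick the matching categories in the same order as the sampled rows
    sub_c = [categories[k] for k in i]
    return i, sub_m, sub_c

# Example data (made up for illustration)
m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])
categories = [["a"], ["b", "c"], ["d"]]

i, sub_m, sub_c = sample_rows(m, categories, 2)
print(i)
print(sub_m.toarray())
print(sub_c)
```

Because the indices are random, the printed rows differ from run to run, but sub_m and sub_c always stay aligned with each other.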