How can I remove a column from a sparse matrix efficiently?

https://stackoverflow.com/questions/2368544

24-09-2019
|

Question

If I am using the sparse.lil_matrix format, how can I remove a column from the matrix easily and efficiently?

Solution

I've been wanting this myself and in truth there isn't a great built-in way to do it yet. Here's a way to do it. I chose to make a subclass of lil_matrix and add the remove_col function. If you want, you can instead add the removecol function to the lil_matrix class in your lib/site-packages/scipy/sparse/lil.py file. Here's the code:

from scipy import sparse
from bisect import bisect_left

class lil2(sparse.lil_matrix):
    def removecol(self,j):
        if j < 0:
            j += self.shape[1]

        if j < 0 or j >= self.shape[1]:
            raise IndexError('column index out of bounds')

        rows = self.rows
        data = self.data
        for i in xrange(self.shape[0]):
            pos = bisect_left(rows[i], j)
            if pos == len(rows[i]):
                continue
            elif rows[i][pos] == j:
                rows[i].pop(pos)
                data[i].pop(pos)
                if pos == len(rows[i]):
                    continue
            for pos2 in xrange(pos,len(rows[i])):
                rows[i][pos2] -= 1

        self._shape = (self._shape[0],self._shape[1]-1)

I have tried it out and don't see any bugs. I certainly think that it is better than slicing the column out, which just creates a new matrix as far as I know.

I decided to make a removerow function as well, but I don't think that it is as good as removecol. I'm limited by not being able to remove one row from an ndarray in the way that I would like. Here is removerow which can be added to the above class

    def removerow(self,i):
        if i < 0:
            i += self.shape[0]

        if i < 0 or i >= self.shape[0]:
            raise IndexError('row index out of bounds')

        self.rows = numpy.delete(self.rows,i,0)
        self.data = numpy.delete(self.data,i,0)
        self._shape = (self._shape[0]-1,self.shape[1])

Perhaps I should submit these functions to the Scipy repository.

OTHER TIPS

Much simpler and faster. You might not even need the conversion to csr, but I just know for sure that it works with csr sparse matrices and converting between shouldn't be an issue.

from scipy import sparse

x_new = sparse.lil_matrix(sparse.csr_matrix(x)[:,col_list])

For a sparse csr matrix (X) and a list of indices to drop (index_to_drop):

to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))    
new_X = X[:,to_keep]

It is easy to convert lil_matrices to csr_matrices. Check tocsr() in lil_matrix documentation

Note however that going from csr to lil matrices using tolil() is expensive. So, this choice is good when you do not require to have your matrix in lil format.

I'm new to python so my answer is probably wrong, but I was wondering why something like the following won't be efficient?

Lets say your lil_matrix is called mat and that you want to remove the i-th column:

mat=hstack( [ mat[:,0:i] , mat[:,i+1:] ] )

Now the matrix will turn to a coo_matrix after that but you can turn it back to lil_matrix.

Ok, I understand that this will have to create the two matrices inside the hstack before it does the assignment to the mat variable so it would be like having the original matrix plus one more at the same time but I guess if the sparsity is big enough then I think there shouldn't be any memory problems (since memory (and time) is the whole reason of using sparse matrices).


def removecols(W, col_list):
        if min(col_list) = W.shape[1]:
                raise IndexError('column index out of bounds')
        rows = W.rows
        data = W.data
        for i in xrange(M.shape[0]):
            for j in col_list:
                pos = bisect_left(rows[i], j)
                if pos == len(rows[i]):
                        continue
                elif rows[i][pos] == j:
                        rows[i].pop(pos)
                        data[i].pop(pos)
                        if pos == len(rows[i]):
                                continue
                for pos2 in xrange(pos,len(rows[i])):
                        rows[i][pos2] -= 1
        W._shape = (W._shape[0], W._shape[1]-len(col_list))
        return W

Just rewrote your code to work with col_list as input - maybe this will be helpful for somebody.

By looking at the notes for each sparse matrix, specifically in our case is csc matrix it has the following advantages as listed in the documentation [1]

efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
efficient column slicing
fast matrix vector products (CSR, BSR may be faster)

If you have the column indices you want to remove, just use slicing. For removing rows use csr matrix since it is efficient in row slicing

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow