Question

I have a list of 100k items and each item has a list of indices. I am trying to put this into a boolean sparse matrix for vector multiplication. My code isn't running as fast as I would like, so I am looking for performance tips or maybe alternative approaches for getting this data into a matrix.

import numpy as np
from scipy.sparse import coo_matrix

rows = []
cols = []
for i, item in enumerate(items):
    indices = item.getIndices()
    rows += [i]*len(indices)
    cols += indices

data = np.ones(len(rows), dtype='?')
mat = coo_matrix((data, (rows, cols)), shape=(len(items), totalIndices), dtype='?')
mat = mat.tocsr()

There wind up being 800k entries in the rows/cols lists, and just extending those lists seems to take up 16% and 13% of the build time, respectively. Converting to the coo_matrix then takes 12%, and the enumeration takes 13%. I got these stats from line_profiler, and I am using Python 3.3.
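
For reference, here is roughly how these per-line stats can be collected with line_profiler's explicit API, assuming the build loop above is wrapped in a function (called build_matrix here just for illustration):

from line_profiler import LineProfiler

lp = LineProfiler()
profiled_build = lp(build_matrix)          # wrap the (hypothetical) build function
mat = profiled_build(items, totalIndices)  # run it once on the real data
lp.print_stats()                           # prints per-line hits, time, and % time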

Solution

The best I can do is:

import numpy as np
from scipy import sparse

def foo3(items, totalIndices):
    N = len(items)
    cols = []
    cnts = []
    for item in items:
        indices = item.getIndices()
        cols += indices
        cnts.append(len(indices))        # number of entries contributed by this row
    rows = np.arange(N).repeat(cnts)     # main change: build the row indices in one vectorized call
    data = np.ones(rows.shape, dtype=bool)
    mat = sparse.coo_matrix((data, (rows, cols)), shape=(N, totalIndices))
    mat = mat.tocsr()
    return mat

For 100,000 items it's only about a 50% increase in speed.
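
For what it's worth, a rough way to check that kind of figure on synthetic data could look like this; the Item class and the index distribution below are made up purely for illustration:

import timeit
import numpy as np

class Item:
    # Stand-in for the real item class; it just stores a list of indices.
    def __init__(self, indices):
        self._indices = indices
    def getIndices(self):
        return self._indices

totalIndices = 10000
items = [Item(np.random.randint(0, totalIndices, size=8).tolist())
         for _ in range(100000)]          # ~800k entries in total

print(timeit.timeit(lambda: foo3(items, totalIndices), number=3))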

Other tips

A lot of sparse matrix algorithms run through the data twice: once to figure out the size of the sparse matrix, and once to fill it in with the right values. So it may be worth trying something like this:

import numpy as np

# First pass: count the total number of nonzero entries.
total_len = 0
for item in items:
    total_len += len(item.getIndices())

# Preallocate the row and column index arrays.
rows = np.empty((total_len,), dtype=np.int32)
cols = np.empty((total_len,), dtype=np.int32)

# Second pass: fill the preallocated arrays in place.
total_len = 0
for i, item in enumerate(items):
    indices = item.getIndices()
    len_ = len(indices)
    rows[total_len:total_len + len_] = i
    cols[total_len:total_len + len_] = indices
    total_len += len_

This is followed by the same conversion you are currently doing. You can also build the CSR matrix directly, avoiding the COO one, which will save some time as well. After the first pass to find out the total size, you would do:

from scipy.sparse import csr_matrix

# indptr[i]:indptr[i+1] delimits the column indices belonging to row i.
indptr = np.empty((len(items) + 1,), dtype=np.int32)
indptr[0] = 0
indices = np.empty((total_len,), dtype=np.int32)

for i, item in enumerate(items):
    item_indices = item.getIndices()
    len_ = len(item_indices)
    indptr[i+1] = indptr[i] + len_
    indices[indptr[i]:indptr[i+1]] = item_indices

data = np.ones((total_len,), dtype=bool)
mat = csr_matrix((data, indices, indptr), shape=(len(items), totalIndices))
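
Since the stated goal is vector multiplication, using the resulting boolean CSR matrix is then just the usual sparse dot product; a minimal illustration, with a random vector standing in for the real one:

v = np.random.rand(totalIndices)
result = mat.dot(v)      # or mat @ v; the boolean matrix is upcast to float
# result[i] is the sum of v over the indices of item i
assert result.shape == (len(items),)
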
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow