Question

I have a list of 100k items and each item has a list of indices. I am trying to put this into a boolean sparse matrix for vector multiplication. My code isn't running as fast as I would like, so I am looking for performance tips or maybe alternative approaches for getting this data into a matrix.

import numpy as np
from scipy.sparse import coo_matrix

rows = []
cols = []
for i, item in enumerate(items):
    indices = item.getIndices()
    rows += [i]*len(indices)
    cols += indices

data = np.ones(len(rows), dtype='?')
mat = coo_matrix((data, (rows, cols)), shape=(len(items), totalIndices), dtype='?')
mat = mat.tocsr()

There wind up being 800k entries in the rows/cols lists, and just extending those lists seems to take up 16% and 13% of the build time, respectively. Converting to the coo_matrix then takes 12%, and the enumeration takes 13%. I got these stats from line_profiler, and I am using Python 3.3.
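
For reference, here is roughly how these per-line stats can be collected with line_profiler's explicit API, assuming the build loop above is wrapped in a function (called build_matrix here just for illustration):

from line_profiler import LineProfiler

lp = LineProfiler()
profiled_build = lp(build_matrix)          # wrap the (hypothetical) build function
mat = profiled_build(items, totalIndices)  # run it once on the real data
lp.print_stats()                           # prints per-line hits, time, and % time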

Solution

The best I can do is:

import numpy as np
from scipy import sparse

def foo3(items, totalIndices):
    N = len(items)
    cols = []
    cnts = []
    for item in items:
        indices = item.getIndices()
        cols += indices
        cnts.append(len(indices))        # number of entries contributed by this row
    rows = np.arange(N).repeat(cnts)     # main change: build the row indices in one vectorized call
    data = np.ones(rows.shape, dtype=bool)
    mat = sparse.coo_matrix((data, (rows, cols)), shape=(N, totalIndices))
    mat = mat.tocsr()
    return mat

For 100,000 items it's only about a 50% increase in speed.
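
For what it's worth, a rough way to check that kind of figure on synthetic data could look like this; the Item class and the index distribution below are made up purely for illustration:

import timeit
import numpy as np

class Item:
    # Stand-in for the real item class; it just stores a list of indices.
    def __init__(self, indices):
        self._indices = indices
    def getIndices(self):
        return self._indices

totalIndices = 10000
items = [Item(np.random.randint(0, totalIndices, size=8).tolist())
         for _ in range(100000)]          # ~800k entries in total

print(timeit.timeit(lambda: foo3(items, totalIndices), number=3))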

Other tips

A lot of sparse matrix algorithms run through the data twice: once to figure out the size of the sparse matrix, and once to fill it in with the right values. So it may be worth trying something like this:

import numpy as np

# First pass: count the total number of nonzero entries.
total_len = 0
for item in items:
    total_len += len(item.getIndices())

# Preallocate the row and column index arrays.
rows = np.empty((total_len,), dtype=np.int32)
cols = np.empty((total_len,), dtype=np.int32)

# Second pass: fill the preallocated arrays in place.
total_len = 0
for i, item in enumerate(items):
    indices = item.getIndices()
    len_ = len(indices)
    rows[total_len:total_len + len_] = i
    cols[total_len:total_len + len_] = indices
    total_len += len_

This is followed by the same conversion you are currently doing. You can also build the CSR matrix directly, avoiding the COO one, which will save some time as well. After the first pass to find out the total size, you would do:

from scipy.sparse import csr_matrix

# indptr[i]:indptr[i+1] delimits the column indices belonging to row i.
indptr = np.empty((len(items) + 1,), dtype=np.int32)
indptr[0] = 0
indices = np.empty((total_len,), dtype=np.int32)

for i, item in enumerate(items):
    item_indices = item.getIndices()
    len_ = len(item_indices)
    indptr[i+1] = indptr[i] + len_
    indices[indptr[i]:indptr[i+1]] = item_indices

data = np.ones((total_len,), dtype=bool)
mat = csr_matrix((data, indices, indptr), shape=(len(items), totalIndices))
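
Since the stated goal is vector multiplication, using the resulting boolean CSR matrix is then just the usual sparse dot product; a minimal illustration, with a random vector standing in for the real one:

v = np.random.rand(totalIndices)
result = mat.dot(v)      # or mat @ v; the boolean matrix is upcast to float
# result[i] is the sum of v over the indices of item i
assert result.shape == (len(items),)
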
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow