There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix
and adding it to your total. I took a couple of timings:
rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix
and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop