Norm of sparse Python vectors
03-07-2021
Question
Is it possible to efficiently obtain the norm of a sparse vector in Python?
I tried the following:
from scipy import sparse
from numpy.linalg import norm
import numpy as np
vector1 = sparse.csr_matrix([0.0 for i in range(4000000)], dtype=np.float64)
#just to test I set a few points to a value higher than 0
vector1[ (0, 10) ] = 5
vector1[ (0, 1500) ] = 80
vector1[ (0, 2000000) ] = 6
n = norm(vector1)
but then I get the error:
ValueError: dimension mismatch
The norm function only works with dense arrays, which is probably why the csr_matrix fails, but I couldn't find another way of computing the norm efficiently. One possible solution would be to compute:
norm(asarray(vector1.todense()))
but that defeats the purpose of using sparse vectors in the first place. As a last resort I could iterate over each element of the vector and compute the norm manually, but since efficiency really matters here I was looking for something faster and easier to implement.
Thanks in advance for any help!
EDIT: I tried everything that was suggested and the best solution is:
np.sqrt((vector1.data ** 2).sum())
from Dougal. The Cython solution is also very good, and it does better as the number of nonzero elements grows. Thanks everyone for the help!
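For reference, a runnable sketch of that expression, building the vector directly from its three nonzero entries instead of densifying a 4M-element list first:

```python
import numpy as np
from scipy import sparse

# Construct the sparse row vector from its nonzero entries only.
vector1 = sparse.csr_matrix(
    (np.array([5.0, 80.0, 6.0]),           # values
     (np.zeros(3, dtype=int),              # row indices
      np.array([10, 1500, 2000000]))),     # column indices
    shape=(1, 4000000),
)

# Only the stored nonzeros are touched, so this is fast.
norm1 = np.sqrt((vector1.data ** 2).sum())
print(norm1)  # sqrt(5**2 + 80**2 + 6**2)
```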
Solution
- I hope you are not really initializing and setting elements like that; those warnings are raised for a reason, and a 4M-element temporary list proves you have plenty of resources left ;).
- Calculating a norm by hand is very simple: just use the underlying data, vector1.data, directly. You can also use things like vector1.multiply(vector1) plus .sum(), or vector1.dot(vector1.T), but as Dougal pointed out, those can be much slower for this simple case.
- I guess you want to do more, but if you only want vector norms, going through sparse matrices seems like a lot of unnecessary work.
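A quick sketch comparing those options, assuming vector1 is the 1×4M CSR row vector from the question:

```python
import numpy as np
from scipy import sparse

# Same vector as in the question, built from its nonzeros.
vector1 = sparse.csr_matrix(
    (np.array([5.0, 80.0, 6.0]),
     (np.zeros(3, dtype=int), np.array([10, 1500, 2000000]))),
    shape=(1, 4000000),
)

# Directly on the stored nonzeros (fastest):
n_data = np.sqrt((vector1.data ** 2).sum())

# Elementwise product plus sum:
n_mul = np.sqrt(vector1.multiply(vector1).sum())

# Inner product; the result is a 1x1 sparse matrix:
n_dot = np.sqrt(vector1.dot(vector1.T).toarray()[0, 0])
```

All three agree; they differ only in how much sparse-matrix machinery they go through.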
OTHER TIPS
I just had the same problem here, and I implemented a function in Cython to increase the speed of this simple operation. I tested it with a 4M sparse vector of doubles with 100k nonzero elements: the method using sqrt(vector.multiply(vector).sum()) took 874 µs and my function 205 µs.
# sparseLib.pyx
#cython: boundscheck=False
from libc.math cimport sqrt

cpdef double sparseNorm2(double [:] data) nogil:
    cdef long i
    cdef double value = 0.0
    for i in range(data.shape[0]):
        value += data[i] * data[i]
    return sqrt(value)
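The Cython routine only needs the array of nonzero values, so calling it is just a matter of passing vector1.data after compiling the .pyx file. A pure-Python equivalent of what the loop computes, useful as a cross-check, would be:

```python
import numpy as np

def sparse_norm2_reference(data):
    # Same computation as the Cython loop: sum of squares, then sqrt.
    value = 0.0
    for x in data:
        value += x * x
    return value ** 0.5

data = np.array([5.0, 80.0, 6.0])  # the nonzeros from the question
print(sparse_norm2_reference(data))  # equals np.sqrt((data ** 2).sum())
```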
I don't think your initialization is doing what you think it is.
For norm to work, you need a dense NumPy array, not a sparse matrix. And if you are trying to make a square array with 4 million elements, you want to do
csr_matrix((2000, 2000), dtype=np.float64)
Full documentation for initialization is in the scipy reference.
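A short sketch of that constructor, for reference; passing a shape tuple allocates an empty matrix with no stored elements:

```python
import numpy as np
from scipy import sparse

# An empty 2000x2000 CSR matrix: nothing is stored until you assign values.
m = sparse.csr_matrix((2000, 2000), dtype=np.float64)
print(m.shape, m.nnz)  # (2000, 2000) 0
```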