Question

Is it possible to efficiently compute the norm of a sparse vector in Python?

I tried the following:

import numpy as np
from scipy import sparse
from numpy.linalg import norm

vector1 = sparse.csr_matrix([0 for i in range(4000000)], dtype=np.float64)

# just to test, I set a few entries to nonzero values

vector1[0, 10] = 5
vector1[0, 1500] = 80
vector1[0, 2000000] = 6

n = norm(vector1)

but then I get the error:

ValueError: dimension mismatch

The norm function only works with NumPy arrays, which is presumably why the csr_matrix fails, but I haven't found another way of computing the norm efficiently. One possible workaround is to compute:

norm(np.asarray(vector1.todense()))

but that defeats the purpose of using sparse vectors in the first place. As a last resort I could iterate over each element of the vector and compute the norm manually, but since efficiency really matters here I was looking for something faster and easier to implement.

Thanks in advance for any help!

EDIT: I tried everything that was suggested, and the best solution, from Dougal, is:

np.sqrt((vector1.data ** 2).sum())

The Cython solution is also very good, and it scales better as the number of nonzero elements in the vector grows. Thanks everyone for the help!
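
For reference, here is a minimal self-contained sketch of that approach; the construction via (data, (row, col)) is my own illustration, chosen because assigning into an existing csr_matrix is slow:

import numpy as np
from scipy import sparse

# build the sparse row vector directly from its nonzero entries
vals = [5.0, 80.0, 6.0]
cols = [10, 1500, 2000000]
vector1 = sparse.csr_matrix((vals, ([0, 0, 0], cols)), shape=(1, 4000000))

# .data holds only the stored nonzeros, so this is O(nnz), not O(n)
n = np.sqrt((vector1.data ** 2).sum())
print(n)  # ~80.38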


Solution

  1. I hope you are not really initializing and setting elements like that; those efficiency warnings are raised for a reason, and a 4M-element temporary list proves you have plenty of resources left ;).
  2. Calculating a norm by hand is very simple: just use the underlying data array, vector1.data, directly. You can also use things like vector1.multiply(vector1) plus .sum(), or vector1.dot(vector1.T), but as Dougal pointed out, those can be much slower for this simple case (see the sketch after this list).
  3. I guess you want to do more, but if you only need vector norms, going through sparse matrices seems like a lot of unnecessary work.
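
As a rough sketch of point 2, all three formulations below compute the same value; the test vector built with scipy.sparse.random is just for illustration:

import numpy as np
from scipy import sparse

# random 1 x 4M row vector with about 100k stored nonzeros
vector1 = sparse.random(1, 4000000, density=0.025, format='csr')

# fastest: work on the dense array of stored nonzeros directly
norm_data = np.sqrt((vector1.data ** 2).sum())

# elementwise product of the sparse vector with itself, then sum
norm_multiply = np.sqrt(vector1.multiply(vector1).sum())

# inner product with the transpose; the result is a 1x1 sparse matrix
norm_dot = np.sqrt(vector1.dot(vector1.T)[0, 0])

assert np.allclose([norm_multiply, norm_dot], norm_data)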

OTHER TIPS

I just had the same problem here; I implemented a function in Cython to speed up this simple operation. I tested it with a 4M-element sparse vector of doubles with 100k nonzero elements: the sqrt(vector.multiply(vector).sum()) approach took 874 µs, while my function took 205 µs.

# sparseLib.pyx
#cython: boundscheck=False
from libc.math cimport sqrt

# Euclidean norm of a sparse vector, computed from its array of
# stored nonzeros (e.g. vector.data); runs without the GIL.
cpdef double sparseNorm2(double [:] data) nogil:
  cdef long i
  cdef double value = 0.0
  for i in range(data.shape[0]):
    value += data[i] * data[i]
  return sqrt(value)
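
A minimal sketch of how it might be called from Python, assuming the module has been compiled (here via pyximport; a regular setup.py build works just as well):

import pyximport; pyximport.install()
import numpy as np
from scipy import sparse

import sparseLib  # the compiled Cython module above

# build the sparse vector from its nonzero entries directly
vals, cols = [5.0, 80.0, 6.0], [10, 1500, 2000000]
vector1 = sparse.csr_matrix((vals, ([0, 0, 0], cols)), shape=(1, 4000000))

# pass only the stored nonzeros; the zeros never need to be touched
n = sparseLib.sparseNorm2(vector1.data)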

I don't think your initialization is doing what you think it is.

For norm to work, you need a square array. If you are trying to make a square array with 4 million elements, you want to do:

sparse.csr_matrix((2000, 2000), dtype=np.float64)

Full documentation for csr_matrix initialization is in the SciPy docs.
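
As a rough sketch of that shape-based constructor, this builds an empty 2000 x 2000 matrix and fills a few entries without ever allocating a 4M-element temporary list (lil_matrix is used for the assignments because it handles incremental writes efficiently):

import numpy as np
from scipy import sparse

# empty 2000 x 2000 matrix; no dense 4M-element temporary is built
m = sparse.lil_matrix((2000, 2000), dtype=np.float64)
m[0, 10] = 5.0
m[1, 1500] = 80.0

# convert to CSR once the structure is fixed, for fast arithmetic
m = m.tocsr()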
