Efficient modification of the rows of a sparse matrix in Java

Question 1

If you're not tied to la4j, the Vectorz package (which I maintain) has some tools to do these kinds of operations very efficiently. This is possible because of two features:

Sparse storage of data
Lightweight mutable "views", so you can mutate rows of a matrix as vectors in-place

The strategy I would use is:

Create a VectorMatrixMN to store a matrix as a collection of sparse vector rows
Use a SparseIndexedVector for each row, which is an efficient format for mostly-zero data

Normalising the rows of the matrix can then be done with the following code:

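VectorMatrixMN m = ....

// Normalise each row so that its elements sum to 1
for (int i=0; i<SIZE; i++) {
    AVector row=m.getRow(i);       // mutable view of row i, not a copy
    double sum=row.elementSum();
    if (sum>0) {
        row.divide(sum);           // scales the row in place
    } else {
        // no positive sum: replace the row with a uniform distribution
        m.setRow(i, new RepeatedElementVector(SIZE,1.0/SIZE));
    }
}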

Note that row.divide(sum) modifies the row in place, so rows with a positive sum don't need anything like "setRow" to get the data back into the matrix. Only rows whose sum is not positive are replaced, with a uniform row.
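To illustrate why no write-back is needed, here is a minimal plain-Java sketch of the "mutable view" idea. It does not use Vectorz and is not how Vectorz is implemented: RowViewSketch, SparseRow and RowMatrix are hypothetical names made up for this illustration, and the storage layout (parallel index/value arrays) is an assumption. The only point is that getRow hands back the stored row object rather than a copy, so dividing the row changes the matrix directly.

```java
public class RowViewSketch {

    /** Hypothetical sparse row: parallel arrays of column indices and non-zero values. */
    static final class SparseRow {
        final int[] indices;
        final double[] values;

        SparseRow(int[] indices, double[] values) {
            this.indices = indices;
            this.values = values;
        }

        /** Sums only the stored (non-zero) values. */
        double elementSum() {
            double sum = 0.0;
            for (double v : values) sum += v;
            return sum;
        }

        /** Divides every stored value by d, in place. */
        void divide(double d) {
            for (int k = 0; k < values.length; k++) values[k] /= d;
        }
    }

    /** Hypothetical matrix: getRow returns the stored row itself, not a copy. */
    static final class RowMatrix {
        final SparseRow[] rows;

        RowMatrix(SparseRow[] rows) {
            this.rows = rows;
        }

        SparseRow getRow(int i) {
            return rows[i];
        }
    }

    public static void main(String[] args) {
        RowMatrix m = new RowMatrix(new SparseRow[] {
            new SparseRow(new int[] {0, 3}, new double[] {1.0, 3.0})
        });

        SparseRow row = m.getRow(0);
        row.divide(row.elementSum());   // no setRow call afterwards

        // The matrix sees the change because row is the stored object.
        System.out.println(m.getRow(0).values[0]); // prints 0.25
        System.out.println(m.getRow(0).values[1]); // prints 0.75
    }
}
```

The work done per row is proportional to the number of stored values, not to the row length, which is where the benefit of sparsity comes from.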

Using this configuration with a 32,000 x 32,000 sparse matrix and 100 non-zero values per row, I timed this code at less than 32ms to normalise the whole matrix. That is about 10ns per non-zero element, or about 0.03ns per matrix element, so you are clearly getting big benefits by exploiting the sparsity.

You could also optionally use a ZeroVector for rows that are all-zero (these will be even faster, but impose some extra constraints since ZeroVectors are immutable).

EDIT:

I've coded a complete example that demonstrates using sparse matrices for a use case very similar to this question:

https://github.com/mikera/vectorz/blob/develop/src/test/java/example/SparseMatrix.java

Question 2

I'm the author of the la4j library. I do see several places for improvement in your code. Here is my advice:

Calling getRow (as well as setRow) is always a bad idea, especially for sparse matrices, since it makes a full copy of the matrix row. I would suggest avoiding such calls. Without getRow/setRow, the code should look like this:

SparseMatrix t = new CRSMatrix(n, n);

double uniformWeight = 1.0 / n; // used when the rowSum is zero
for (int i = 0; i < n; i++) {
  double rowSum = t.foldRow(i, Matrices.asSumAccumulator(0.0));
  if (rowSum > 0.0) {
    MatrixFunction divider = Matrices.asDivFunction(rowSum);
    for (int j = 0; j < n; j++) {
      // TODO: I should probably think about `updateRow` method
      //       in order to avoid this loop
      t.update(i, j, divider);
    }
  } else {
    for (int j = 0; j < n; j++) {
      t.set(i, j, uniformWeight);
    }
  }
}

Please try this. I haven't compiled it, but it should work.

Update

Using a boolean array to keep track of the uniform rows (those where every entry has the same value) is a fantastic idea. The main bottleneck here is the loop:

for (int j = 0; j < n; j++) {
  t.set(i, j, uniformWeight);
}

Here we're completely ruining the performance and footprint of the sparse matrix, since we assign the same value to every entry in the row. So I would say that combining these two ideas, avoiding getRow/setRow plus an extra array of flags (I would use a BitSet instead, as it is much more efficient in terms of footprint), should give you awesome performance.
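The flag idea can be sketched in plain Java, independent of la4j. This is only an illustration under stated assumptions: the matrix is square, each row is modelled as a hypothetical column-to-value map rather than a real la4j matrix, and UniformRowFlags, normaliseRows and get are names invented here. Rows without a positive sum are never filled; they are just marked in a BitSet, and reads consult the flag first.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniformRowFlags {

    /**
     * Normalises every row with a positive sum in place. Rows without a
     * positive sum are emptied and flagged in the returned BitSet instead
     * of being filled with n identical values.
     */
    static BitSet normaliseRows(List<Map<Integer, Double>> rows) {
        BitSet uniform = new BitSet(rows.size());
        for (int i = 0; i < rows.size(); i++) {
            Map<Integer, Double> row = rows.get(i);
            double sum = 0.0;
            for (double v : row.values()) sum += v;
            if (sum > 0.0) {
                final double s = sum;
                row.replaceAll((col, v) -> v / s); // touches stored entries only
            } else {
                row.clear();      // drop any stored entries
                uniform.set(i);   // one bit instead of n stored values
            }
        }
        return uniform;
    }

    /** Reads element (i, j), treating flagged rows as uniform rows of 1/n. */
    static double get(List<Map<Integer, Double>> rows, BitSet uniform, int i, int j) {
        if (uniform.get(i)) return 1.0 / rows.size(); // assumes an n x n matrix
        return rows.get(i).getOrDefault(j, 0.0);
    }

    public static void main(String[] args) {
        int n = 4;
        List<Map<Integer, Double>> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) rows.add(new HashMap<>());
        rows.get(0).put(1, 2.0);
        rows.get(0).put(3, 6.0);

        BitSet uniform = normaliseRows(rows);

        System.out.println(get(rows, uniform, 0, 1)); // prints 0.25
        System.out.println(get(rows, uniform, 0, 3)); // prints 0.75
        System.out.println(get(rows, uniform, 2, 0)); // prints 0.25 (flagged row)
    }
}
```

The trade-off is that every consumer of the matrix (for example a matrix-vector product) has to check the flags too, so the uniform rows stay implicit everywhere.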

Thank you for using the la4j library, and please report any performance or functional issues to the mailing list or the GitHub page. All references are available here: http://la4j.org.