Question

I have a large term-document matrix and want to use the non-negative matrix factorization function MATLAB offers, nnmf(). The problem is that after the 1st iteration the memory usage rises rapidly and hits the ceiling (my system has 6 GB), while the CPU usage drops very low (about 1%-5%). The whole system behaves as if it has crashed, and only after a very long wait does the second iteration finish. (Note that many more iterations are needed to get good results.)

Question:

If someone has experience with this, or has run nnmf() on matrices even larger than mine, I would really like to know how they actually overcame the problem described.

Also: I have done this with a smaller matrix (about 7000x1800) and had no problems. I use sparse matrices because a term-document matrix has a lot of zero elements, which reduces the storage needed. For example, in my case the term-document matrix has 14608 * 18828 = 275039424 elements, of which sum(sum(spa~=0)) = 1312582 are non-zero (a small sketch of this setup follows the listing below):

>> whos
Name          Size                    Bytes  Class     Attributes

full      14608x18828            2200315392  double              
spa       14608x18828              21151944  double    sparse    
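For reference, here is a minimal sketch of such a setup, with a random stand-in for the real term counts and the density assumed at roughly 0.5% non-zero, as in the figures above:

% Minimal sketch: a random stand-in for the term-document matrix,
% just to illustrate sparse storage and the density check.
nTerms = 14608; nDocs = 18828;         % dimensions from the question
spa = sprand(nTerms, nDocs, 0.005);    % ~0.5% non-zeros, as in the real data
density = nnz(spa) / numel(spa)        % fraction of non-zero entries
whos spa                               % bytes grow with nnz, not with numel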

Solution 2

Something that finally worked:

I checked the nnmf.m file (the algorithm implementation provided by MATLAB) and tried to understand the code. There is one variable called 'd' which is computed as d = a - w*h; it is a full matrix with the same dimensions as 'a' (i.e. the large term-document matrix), as the listing and the small illustration after it show:

Name             Size                    Bytes  Class      Attributes
a            14608x18828              21151944  double     sparse    
d            14608x18828            2200315392  double               
...
h                4x18828                602496  double               
h0               4x18828                602496  double               
...
w            14608x4                    467456  double               
w0           14608x4                    467456  double   
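To see why 'd' blows up: w*h is a dense matrix, and subtracting a dense matrix from a sparse one produces a full result in MATLAB. At the full 14608x18828 size that is 14608*18828*8 ≈ 2.2 GB, matching the Bytes column above. A small sketch with hypothetical, scaled-down dimensions:

% Hypothetical small-scale illustration: sparse minus dense is dense.
a = sprand(1000, 1200, 0.005);   % stand-in for the sparse term-document matrix
k = 4;                           % factorization rank, as in the whos listing
w = rand(1000, k);  h = rand(k, 1200);
d = a - w*h;                     % w*h is full, so d comes out full
issparse(d)                      % returns logical 0 (false)
numel(d) * 8                     % bytes needed: 8 per double element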

To save some memory, I used clear to remove this matrix once it is no longer needed. Part of the old nnmf.m file:

d = a - w*h;
dnorm = sqrt(sum(sum(d.^2))/nm);
dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
delta = max(dw,dh);

was replaced with this new one:

d = a - w*h;
dnorm = sqrt(sum(sum(d.^2))/nm);
clear d;
dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
delta = max(dw,dh);

clear d was added there because d is never used after that point. For the term-document matrix in question, this change made the factorization run without memory problems.
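A further, hedged variation (not part of the original fix): if even one temporary copy of d is too much, dnorm can be accumulated over column blocks so the full residual is never materialized. This is only a sketch; the block size is an arbitrary assumption and nm is the element count already used by nnmf.m above:

% Hypothetical alternative (not from the original answer): accumulate the
% squared residual over column blocks instead of forming d at all.
blk = 2000;                               % columns per block (assumed)
sse = 0;                                  % running sum of squared errors
for j = 1:blk:size(a,2)
    cols = j:min(j+blk-1, size(a,2));
    dblk = a(:,cols) - w*h(:,cols);       % dense, but only 14608-by-blk
    sse = sse + sum(dblk(:).^2);
end
dnorm = sqrt(sse/nm);                     % same value nnmf.m computes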

OTHER TIPS

I think we've all watched too many episodes of Star Trek. Our computers are not infinitely fast, with an infinite amount of memory. We get spoiled with the expectation that we can do virtually any computation we want, with nearly immediate results. Just because you want to work with factorizations of huge matrices does not mean you will be able to do so. Get more memory to work with problems of this size. Or work on smaller problems.

The matrices you describe are not even terribly sparse, and their factorizations will be essentially completely full. Sparse matrices are rarely of value unless the number of non-zero values is a small fraction of the total. A sparse matrix that has only 50% zero values is a waste of the tool. For example,

>> A = randi([0 1],100, 100);
>> B = sparse(A);
>> whos
  Name        Size             Bytes  Class     Attributes

  A         100x100            80000  double              
  B         100x100            81480  double    sparse    

See that here A is 50% zero, yet the sparse form actually takes MORE memory to store than the full version! Remember that you need to store not only the non-zero elements, but their location too.
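For a rough feel for where those bytes go: MATLAB stores a real sparse matrix in compressed sparse column form, so (assuming 64-bit indices) it costs about 8 bytes per stored value, 8 bytes per row index, and 8 bytes per column pointer:

% Approximate storage for an m-by-n real sparse matrix (64-bit MATLAB):
% values + row indices + column pointers.
nz = nnz(B);                                  % about 5000 here (50% of 100*100)
approxSparseBytes = 16*nz + 8*(size(B,2)+1)   % close to the 81480 whos reported
fullBytes = 8*numel(A)                        % 80000 bytes for the full matrix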

Ok, so sparse storage was not that efficient. But surely you gain in efficiency of operations? Not even there. Using Steve Eddins' timeit function to make the comparison, I get these results:

>> timeit(@() A*A)
ans =
   7.3604e-05

>> timeit(@() B*B)
ans =
    0.0014884

So the sparse multiply was considerably slower than a full multiply.

Essentially a sparse matrix that is only 50% zero is wasting the capabilities of the sparse form. Had the matrix been far more sparse, the results would have been different.
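To illustrate that last sentence, here is a small comparison at a density closer to the one in the question (around 0.5% non-zero), using hypothetical sizes; exact timings will vary by machine, but in this regime the sparse form does pay off:

% At ~0.5% density the sparse form wins on both storage and multiply time.
C = sprand(2000, 2000, 0.005);   % hypothetical size, question-like density
F = full(C);
whos C F                         % sparse bytes are a small fraction of full
timeit(@() C*C)                  % sparse-sparse multiply
timeit(@() F*F)                  % dense multiply, typically much slower here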

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow