Question

Both the Divide & Conquer (D&C) solution and the naive solution for matrix multiplication are implemented "in-place" in C, so there is no dynamic memory allocation at all.

As we have learned, both solutions have the same time complexity, O(n^3). They also share the same space complexity, since both are implemented in-place. So how can one be so much faster than the other?
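
For the record, the cubic bound really is the same for both: the naive triple loop performs n*n*n multiply-adds, and the D&C recursion makes 8 recursive calls on half-size sub-matrices with only constant extra work per call, i.e. T(n) = 8*T(n/2) + O(1), which solves to O(n^(log2 8)) = O(n^3) by the master theorem.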

I used clock_gettime to obtain the timings.
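
The measurement is roughly like this (a simplified sketch, not the exact test harness; CLOCK_MONOTONIC is just one reasonable clock choice here, and TimeOneRun is only an illustrative name):

#include <time.h>

/* Simplified timing sketch: elapsed nanoseconds of one multiplication run,
   measured with the monotonic clock. The hour:minute:second:millisec:microsec:nanosec
   format in the logs below is just this value broken down into fields. */
long long TimeOneRun(int const *mat1, int const *mat2, int *matrix, int n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ClassicalMulti(mat1, mat2, matrix, n);   /* or DCMulti(mat1, mat2, matrix, 0, 0, n, n) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (long long)(t1.tv_nsec - t0.tv_nsec);
}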

Running under Cygwin on Windows 7 on a Core i7 laptop, the D&C solution is surprisingly much faster than the naive solution (redundant log lines removed):

Edited:

"algo0" indicates Naive solution, while "algo1" indicates D&C solution.

"len" indicates the width & height of the matrix. And the matrix is NxN matrix.

"00:00:00:000:003:421" means: "hour:minute:second:millisec:microsec:nanosec".

[alg0]time cost[0, len=00000002]: 00:00:00:000:003:421 (malloc_cnt=0)
[alg1]time cost[0, len=00000002]: 00:00:00:000:000:855 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[1, len=00000004]: 00:00:00:000:001:711 (malloc_cnt=0)
[alg1]time cost[1, len=00000004]: 00:00:00:000:001:711 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[2, len=00000008]: 00:00:00:000:009:408 (malloc_cnt=0)
[alg1]time cost[2, len=00000008]: 00:00:00:000:008:553 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[3, len=00000016]: 00:00:00:000:070:134 (malloc_cnt=0)
[alg1]time cost[3, len=00000016]: 00:00:00:000:065:858 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[4, len=00000032]: 00:00:00:000:564:066 (malloc_cnt=0)
[alg1]time cost[4, len=00000032]: 00:00:00:000:520:873 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[5, len=00000064]: 00:00:00:004:667:337 (malloc_cnt=0)
[alg1]time cost[5, len=00000064]: 00:00:00:004:340:188 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[6, len=00000128]: 00:00:00:009:662:680 (malloc_cnt=0)
[alg1]time cost[6, len=00000128]: 00:00:00:008:139:403 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[7, len=00000256]: 00:00:00:080:031:116 (malloc_cnt=0)
[alg1]time cost[7, len=00000256]: 00:00:00:065:395:329 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[8, len=00000512]: 00:00:00:836:392:576 (malloc_cnt=0)
[alg1]time cost[8, len=00000512]: 00:00:00:533:799:924 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[9, len=00001024]: 00:00:09:942:086:780 (malloc_cnt=0)
[alg1]time cost[9, len=00001024]: 00:00:04:307:021:362 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[10, len=00002048]: 00:02:53:413:046:992 (malloc_cnt=0)
[alg1]time cost[10, len=00002048]: 00:00:35:588:289:832 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[11, len=00004096]: 00:25:46:154:930:041 (malloc_cnt=0)
[alg1]time cost[11, len=00004096]: 00:04:38:196:205:661 (malloc_cnt=0)

Even on a Raspberry Pi, which has only a single ARM core, the results are similar (again with redundant data removed):

[alg0]time cost[0, len=00000002]: 00:00:00:000:005:999 (malloc_cnt=0)
[alg1]time cost[0, len=00000002]: 00:00:00:000:051:997 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[1, len=00000004]: 00:00:00:000:004:999 (malloc_cnt=0)
[alg1]time cost[1, len=00000004]: 00:00:00:000:008:000 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[2, len=00000008]: 00:00:00:000:014:999 (malloc_cnt=0)
[alg1]time cost[2, len=00000008]: 00:00:00:000:023:999 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[3, len=00000016]: 00:00:00:000:077:996 (malloc_cnt=0)
[alg1]time cost[3, len=00000016]: 00:00:00:000:157:991 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[4, len=00000032]: 00:00:00:000:559:972 (malloc_cnt=0)
[alg1]time cost[4, len=00000032]: 00:00:00:001:248:936 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[5, len=00000064]: 00:00:00:005:862:700 (malloc_cnt=0)
[alg1]time cost[5, len=00000064]: 00:00:00:010:739:450 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[6, len=00000128]: 00:00:00:169:060:336 (malloc_cnt=0)
[alg1]time cost[6, len=00000128]: 00:00:00:090:290:373 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[7, len=00000256]: 00:00:03:207:909:599 (malloc_cnt=0)
[alg1]time cost[7, len=00000256]: 00:00:00:771:870:443 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[8, len=00000512]: 00:00:35:725:494:551 (malloc_cnt=0)
[alg1]time cost[8, len=00000512]: 00:00:08:139:712:988 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[9, len=00001024]: 00:06:29:762:101:314 (malloc_cnt=0)
[alg1]time cost[9, len=00001024]: 00:01:50:964:568:907 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[10, len=00002048]: 00:52:03:950:717:474 (malloc_cnt=0)
[alg1]time cost[10, len=00002048]: 00:14:19:222:020:444 (malloc_cnt=0)

My first guess is that it must be some optimization done by GCC. But exactly how?

The code for both solutions follows. Naive solution:

void ClassicalMulti(int const * const mat1,
                    int const * const mat2,
                    int * const matrix,
                    const int n) {
    if (!mat1 || !mat2 || !matrix || n<=0) {
        printf("ClassicalMulti: Invalid Input\n");
        return;
    }

    int cnt, row, col;

    /* Accumulates into 'matrix', which is assumed to be zero-initialized. */
    for (row=0;row<n;++row) {
        for (col=0;col<n;++col) {
            for (cnt=0;cnt<n;++cnt) {
                matrix[row*n+col] += mat1[row*n+cnt] * mat2[cnt*n+col];
            }
        }
    }
}

Divide and Conquer solution:

void DCMulti(int const * const mat1,
             int const * const mat2,
             int * const matrix,
             const int p1,
             const int p2,
             const int pn,
             const int n) {
    if (!mat1 || !mat2 || !matrix || n<2 || p1<0 || p2 <0 || pn<2) {
        printf("DCMulti: Invalid Input\n");
        return;
    }

    if (pn == 2) {
        /* Base case: multiply the 2x2 block of mat1 starting at offset p1 with
           the 2x2 block of mat2 starting at offset p2, accumulating the result
           into the corresponding 2x2 block of 'matrix'. */
        int pos = (p1/n)*n + p2%n;
        matrix[pos]     += mat1[p1]*mat2[p2] + mat1[p1+1]*mat2[p2+n];
        matrix[pos+1]   += mat1[p1]*mat2[p2+1] + mat1[p1+1]*mat2[p2+1+n];
        matrix[pos+n]   += mat1[p1+n]*mat2[p2] + mat1[p1+1+n]*mat2[p2+n];
        matrix[pos+1+n] += mat1[p1+n]*mat2[p2+1] + mat1[p1+1+n]*mat2[p2+1+n];
    } else {
        /* a..d are the offsets of the four pn/2 x pn/2 quadrants of mat1's block
           (top-left, top-right, bottom-left, bottom-right); e..h are the same for
           mat2's block. The eight recursive calls below accumulate all quadrant
           products into 'matrix'. */
        int a = p1;
        int b = p1 + pn/2;
        int c = p1 + pn*n/2;
        int d = p1 + pn*(n+1)/2;
        int e = p2;
        int f = p2 + pn/2;
        int g = p2 + pn*n/2;
        int h = p2 + pn*(n+1)/2;
        DCMulti(mat1, mat2, matrix, a, e, pn/2, n);   // a*e
        DCMulti(mat1, mat2, matrix, b, g, pn/2, n);   // b*g
        DCMulti(mat1, mat2, matrix, a, f, pn/2, n);   // a*f
        DCMulti(mat1, mat2, matrix, b, h, pn/2, n);   // b*h 
        DCMulti(mat1, mat2, matrix, c, e, pn/2, n);   // c*e 
        DCMulti(mat1, mat2, matrix, d, g, pn/2, n);   // d*g 
        DCMulti(mat1, mat2, matrix, c, f, pn/2, n);   // c*f 
        DCMulti(mat1, mat2, matrix, d, h, pn/2, n);   // d*h 
    }
}

Solution

The difference between these two approaches comes down to their memory access patterns, i.e. cache locality. For large matrices in particular, the rows compete for the same cache lines, and the penalty for cache misses grows rapidly. In the end the D&C strategy pays off, even though a globally better approach would be to divide the problem into, say, 8x8 blocks -- a technique called loop tiling. (Not surprisingly, matrix multiplication is presented as the archetypal example in the Wikipedia article on loop tiling...)
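
As an illustration, a tiled version of the same kernel could look roughly like this (the function name TiledMulti is mine, BS = 8 simply follows the 8x8 suggestion above and would normally be tuned for the target cache; like the code in the question, it assumes 'matrix' is zero-initialized):

#define BS 8   /* tile size -- a parameter to tune for the target cache */

/* Same arithmetic as ClassicalMulti, but iterating over BS x BS tiles so that
   the blocks of mat1, mat2 and matrix touched by the inner loops stay in cache. */
void TiledMulti(int const * const mat1,
                int const * const mat2,
                int * const matrix,
                const int n) {
    int ii, kk, jj, i, k, j;
    for (ii = 0; ii < n; ii += BS)
        for (kk = 0; kk < n; kk += BS)
            for (jj = 0; jj < n; jj += BS)
                for (i = ii; i < ii + BS && i < n; ++i)
                    for (k = kk; k < kk + BS && k < n; ++k) {
                        const int a = mat1[i*n + k];
                        for (j = jj; j < jj + BS && j < n; ++j)
                            matrix[i*n + j] += a * mat2[k*n + j];
                    }
}

Note that with this loop order the innermost loop walks both matrix and mat2 sequentially, instead of the fixed-destination, strided-mat2 pattern of the original inner loop.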

OTHER TIPS

There are many different ways to do naive matrix multiplication. Some are much faster than others.

You might try transposing the second matrix before doing what you're doing, so that the inner loop reads both operands sequentially in memory. Aki Suikhonen correctly notes that tiling can be beneficial. Another thing to try is breaking up the enormous load-store chain: each iteration of your inner loop adds a number to the same location. The CPU will probably elide almost all of the loads, but it still has to wait for the previous addition to complete before the next one can begin. Try keeping more than one accumulator and adding them together at the end; unrolling the inner loop makes this easier to code up.
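
A rough sketch of combining the transpose with multiple accumulators might look like this (the name TransposedMulti, the caller-provided scratch buffer mat2t, and the assumption that n is a multiple of 4 are all illustrative choices, not the only way to do it):

/* Sketch: transpose mat2 into mat2t first, so the inner loop reads both operands
   sequentially, and keep four independent accumulators so consecutive additions
   do not all depend on the previous one. Assumes n is a multiple of 4 and that
   'matrix' is zero-initialized, as in the original code. */
void TransposedMulti(int const * const mat1,
                     int const * const mat2,
                     int * const mat2t,   /* caller-provided n*n scratch buffer */
                     int * const matrix,
                     const int n) {
    int row, col, cnt;

    for (row = 0; row < n; ++row)         /* mat2t = transpose of mat2 */
        for (col = 0; col < n; ++col)
            mat2t[col*n + row] = mat2[row*n + col];

    for (row = 0; row < n; ++row) {
        for (col = 0; col < n; ++col) {
            int s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* independent accumulators */
            for (cnt = 0; cnt < n; cnt += 4) {
                s0 += mat1[row*n + cnt]     * mat2t[col*n + cnt];
                s1 += mat1[row*n + cnt + 1] * mat2t[col*n + cnt + 1];
                s2 += mat1[row*n + cnt + 2] * mat2t[col*n + cnt + 2];
                s3 += mat1[row*n + cnt + 3] * mat2t[col*n + cnt + 3];
            }
            matrix[row*n + col] += s0 + s1 + s2 + s3;
        }
    }
}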

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow