How to improve performance with caching technique

Question 1

This is an interesting case. The compiler did a poor job of loop invariant hoisting in the two inner loops. Namely, the two inner for-loop checks the following condition in each iteration:

(j+1)*B) < N ? ((j+1)*B) : N

and

(i+1)*B) < N ? ((i+1)*B) : N

The calculation and branching are both expensive; but they are actually loop invariant for the two inner for-loops. Once manually hoisting them out of the two inner for-loops, I was able to get the cache optimized version to perform better than the unoptimized version (10% when N==524288, 30% when N=1048576).

Here is the modified code (simple change really, look for u1, u2):

//break array data in N/B blocks, ib is index for i cached block and jb is index for j strided cached block
//each i block is compared with the j block, (which j block is always after the i block) 
for (i = 0; i < num_blocks; i++){
    for (j = i; j < num_blocks; j++){
        int u1 =  (((j+1)*B) < N ? ((j+1)*B) : N);
        int u2 =  (((i+1)*B) < N ? ((i+1)*B) : N);
        //reads the moving frame block to compare with the i cached block
        for (jb = j * B; jb < u1 ; jb++){
            //avoid float comparisons that occur when i block = j block
            //Register Allocated
            regx = P[jb].x;
            regy = P[jb].y;
            for (i == j ? (ib = jb + 1) : (ib = i * B); ib < u2; ib++){
                //calculate distance of current points
                if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                        (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                    min_dist = distance;
                    p1 = &P[ib];
                    p2 = &P[jb];
                }
            }
        }
    }
}

Question 2

Tiling may be an old concept, but it's still very relevant today. In your original piece of code, for each i, you may be able to reuse most of the P[j] elements while still cached, but only if the length of the inner loop was small enough to fit there. The actual size should be determined by which cache level you want to target for tiling - the L1 would provide best performance as it's fastest, but as it's also the smallest you'd need small blocks and the tiling overhead may be too much. The L2 allows bigger tiles but sligthly reduced performance, and so on.

Note that you don't need to use 2d tiling here, this is not matrix multiplication - you're traversing over the same array. You could simply tile the inner loop since it's the one overflowing the cache, once you've done that - the outer loop (i) can run all the way to the end on the current block of cached inner-loop elements. There's actually no sense in 2d tiling since no one is going to reuse the elements of the outer loop (as opposed to matrix mul)

So, assuming Point is 64 bit large, you can fit 512 such elements of the array safely in your 32k L1, or 4096 elements in your 256k L2. you'll have to miss once for P[i] on each block if i is out of bounds of the current j block, but that's negligible.

By the way - this explanation may still be obsolete, since a sufficiently good compiler might try to do all this for you. It's quite complicated though, so i'm a bit skeptic any of the common ones would even try, but it should be easy to prove here that reordering is safe. One might argue of course that a "sufficiently good compiler" is a paradox, but that's off topic...