Question

I have an O(N^4) image-processing loop, and after profiling it with Intel VTune 2013 I see that the number of instructions retired has decreased drastically. I need help understanding this behavior on a multicore architecture. (I'm using an Intel Xeon X5365, which has 8 cores with an L2 cache shared between every 2 cores.) The number of branch mispredictions has also increased drastically!

Edit: A sample of my non-unrolled code is shown below:

for(imageNo =0; imageNo<496;imageNo++){
for (unsigned int k=0; k<256; k++)
{
    double z = O_L + (double)k * R_L;
    for (unsigned int j=0; j<256; j++)
    {
        double y = O_L + (double)j * R_L;

        for (unsigned int i=0; i<256; i++)
        {
            double x[1] = {O_L + (double)i * R_L} ;             
            double w_n =  (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11])  ;
            double u_n =  ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
            double v_n =  ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);                      

            for(int loop=0; loop<1;loop++)
            {
                px_x[loop] = (int) floor(u_n);
                px_y[loop] = (int) floor(v_n);
                alpha[loop] = u_n - px_x[loop] ;
                beta[loop]  = v_n - px_y[loop] ;
            }
///////////////////(i,j) pixels ///////////////////////////////
            if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
                pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
            else
                pixel_1[0] = 0.0;

            if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
                pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
            else
                pixel_1[2] = 0.0;

/////////////////// (i+1, j) pixels/////////////////////////    

            if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
            else
                pixel_1[1] = 0.0;

            if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
            else
                pixel_1[3] = 0.0;

            pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0]  +  (1.0 - alpha[0]) * beta[0]  * pixel_1[1]
                  +  alpha[0] * (1.0 - beta[0]) * pixel_1[2]  +  alpha[0] *  beta[0]  * pixel_1[3];

            f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
        } // end of i loop
    } // end of j loop
} // end of k loop
} // end of imageNo loop

I'm unrolling the innermost loop by 4 iterations. (You will get a general idea of how I stripped the loop: basically I created arrays of 4 elements and filled the respective values into them; a rough sketch is shown below.) Doing the math, I'm reducing the total number of iterations by 75%. Say there are 4 loop-handling instructions for every iteration (load i, inc i, cmp i, jle loop); the total number of instructions after unrolling should then be reduced by (256-64)*4*256*256*496 = 24.96G.
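Roughly, the unrolled inner loop looks like this (a sketch only, not the exact code; px_x, px_y, alpha and beta are assumed to be 4-element arrays):

for (unsigned int i = 0; i < 256; i += 4)
{
    double x[4], w_n[4], u_n[4], v_n[4];
    for (int loop = 0; loop < 4; loop++)
    {
        x[loop]     = O_L + (double)(i + loop) * R_L;
        w_n[loop]   = A_n[2] * x[loop] + A_n[5] * y + A_n[8] * z + A_n[11];
        u_n[loop]   = (A_n[0] * x[loop] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n[loop];
        v_n[loop]   = (A_n[1] * x[loop] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n[loop];
        px_x[loop]  = (int) floor(u_n[loop]);
        px_y[loop]  = (int) floor(v_n[loop]);
        alpha[loop] = u_n[loop] - px_x[loop];
        beta[loop]  = v_n[loop] - px_y[loop];
    }
    // ...the four pixel_1[] range tests, the bilinear blend and the
    // accumulation into f_L are then repeated once per loop index...
}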

The profiled results are:

Before unrolling: instructions retired: 3.1603T;    branch mispredictions: 96 million
After unrolling:  instructions retired: 2.642240T;  branch mispredictions: 144 million

The number of instructions retired decreased by 518.06G, and I have no clue how this is happening. I would appreciate any help regarding this (even a remote possibility for its occurrence). Also, I would like to know why the branch mispredictions are increasing. Thanks in advance!

Solution

It is not clear where gcc would be reducing the number of instructions. It is possible that increased register pressure is encouraging gcc to use load+operate instructions (the same number of primitive operations but fewer instructions). With the unrolling, the index for f_L also only needs to be incremented once per unrolled iteration rather than once per element, but that would only save 6.2G (3*64*256*256*496) instructions. (By the way, the loop overhead should only be three instructions per iteration, since i should remain in a register.)

The following pseudo-assembly (for a RISC-like ISA) using a two-way unrolling shows how an increment can be saved:

// the address of f_L[k * L * L + j * L + i] is in r1
// (float)(1.0 / (w_n * w_n) * pix_1) results are in f1 and f2
load-single f9 [r1];    // load float at address in r1 to register f9
add-single f9 f9 f1;    // f9 = f9 + f1
store-single [r1] f9;   // store float in f9 to address in r1
load-single f10 4[r1];  // load float at address of r1+4 to f10
add-single f10 f10 f2;  // f10 = f10 + f2
store-single 4[r1] f10; // store float in f10 to address of r1+4
add r1 r1 #8;           // increase the address by 8 bytes

The trace of two iterations of the non-unrolled version would look more like:

load-single f9 [r1];  // load float at address of r1 to f9
add-single f9 f9 f2;  // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4;         // increase the address by 4 bytes
...
load-single f9 [r1];  // load float at address of r1 to f9
add-single f9 f9 f2;  // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4;         // increase the address by 4 bytes

Because memory-addressing instructions commonly include adding an immediate offset (Itanium is an unusual exception), and pipelines are not generally implemented to optimize the case where the immediate is zero, using a non-zero immediate offset is generally "free". (It certainly reduces the number of instructions, 7 vs. 8 in this case, and it generally also improves performance.)

With respect to branch prediction, according to Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers (PDF), the Core2 microarchitecture's branch predictor uses an 8-bit global history. This means that it tracks the outcomes of the last 8 branches and uses these 8 bits (along with bits from the instruction address) to index a table. This allows correlations between nearby branches to be recognized.
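As a rough illustration of how such a predictor works, here is a toy gshare-style model (illustration only, not Core2's actual design): an 8-bit shift register of recent branch outcomes is combined with bits of the branch address to select a 2-bit saturating counter, so branches whose outcomes are correlated within the 8-branch window influence each other's predictions.

#include <stdint.h>
#include <stdbool.h>

static uint8_t history;            /* outcomes of the last 8 branches, 1 bit each */
static uint8_t counters[1 << 12];  /* 2-bit saturating counters, values 0..3      */

static bool predict(uintptr_t pc)
{
    unsigned idx = ((unsigned)(pc >> 2) ^ history) & ((1 << 12) - 1);
    return counters[idx] >= 2;     /* counter of 2 or 3 means "predict taken"     */
}

static void update(uintptr_t pc, bool taken)
{
    unsigned idx = ((unsigned)(pc >> 2) ^ history) & ((1 << 12) - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    history = (uint8_t)((history << 1) | (taken ? 1 : 0));  /* shift in the outcome */
}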

For your code, the branch corresponding to, e.g., the 8th previous branch is not the same branch in each iteration (since short-circuiting is used), so it is not easy to conceptualize how well correlations would be recognized.

Some correlations in branches are obvious. If px_x[0]>=0 is true, px_x[0]+1>=0 will also be true. If px_x[0] <(int)threadCopy[0].S_x is true, then px_x[0]+1 <(int)threadCopy[0].S_x is likely to be true.

If the unrolling is done such that px_x[n] is tested for all four values of n then these correlations would be pushed farther away so that the results are not used by the branch predictor.

Some optimization possibilities

Although you did not ask about any optimization possibilities, I am going to offer some avenues for exploration.

First, for the branches, if not being strictly portable is OK, the test x>=0 && x<y can be simplified to (unsigned)x<(unsigned)y. This is not strictly portable because, e.g., a machine could theoretically represent negative numbers in a sign-magnitude format with the most significant bit as the sign and negative indicated by a zero bit. For the common representations of signed integers, such a reinterpreting cast will work as long as y is a positive signed integer since a negative x value will have the most significant bit set and so be larger than y interpreted as an unsigned integer.
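A minimal illustration of the transformation (x is a signed index, y a positive signed bound such as S_x; in_range is just a placeholder for the guarded work):

/* original two-branch range check */
if (x >= 0 && x < y)
    in_range = 1;

/* single-comparison form: a negative x reinterpreted as unsigned becomes a
   very large value, so it fails the comparison just as the x >= 0 test would */
if ((unsigned)x < (unsigned)y)
    in_range = 1;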

Second, the number of branches can be significantly reduced by using the 100% correlations for either px_x or px_y:

if ((unsigned) px_y[0]<(unsigned int)threadCopy[0].S_y)
{
    if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
        pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
    else
        pixel_1[0] = 0.0;
    if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
        pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
    else
        pixel_1[2] = 0.0;
}
if ((unsigned)px_y[0]+1<(unsigned int)threadCopy[0].S_y)
{
    if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
        pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
    else
        pixel_1[1] = 0.0;
    if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
        pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
    else
        pixel_1[3] = 0.0;
}

(If the above section of code is replicated for unrolling, it should probably be replicated as a block rather than interleaving tests for different px_x and px_y values to allow the px_y branch to be near the px_y+1 branch and the first px_x branch to be near the other px_x branch and the px_x+1 branches.)

Another possible optimization is changing the calculation of w_n into a calculation of its reciprocal. This would change a multiply and three divisions into four multiplies and one division. Division is much more expensive than multiplication. In addition, calculating an approximate reciprocal is much more SIMD-friendly since there are usually reciprocal estimate instructions that provide a starting point which can be refined by the Newton-Raphson method.
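Concretely, the scalar version of this transformation might look like the following sketch (based on the loop body above; the reciprocal-estimate-plus-Newton-Raphson refinement is not shown):

double w_n   = A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11];
double inv_w = 1.0 / w_n;   /* the single remaining division */
double u_n   = (A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) * inv_w;
double v_n   = (A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) * inv_w;
...
f_L[k * L * L + j * L + i] += (float)(inv_w * inv_w * pix_1);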

If even worse obfuscation of the code is acceptable, you might consider changing code like double y = O_L + (double)j * R_L; into double y = O_L; ... y += R_L;. (I ran a test, and gcc does not seem to recognize this optimization, probably because of the use of floating point and the cast to double.) Thus:

for (int imageNo = 0; imageNo < 496; imageNo++) {
    double z = O_L;
    for (unsigned int k = 0; k < 256; k++)
    {
        double y = O_L;
        for (unsigned int j = 0; j < 256; j++)
        {
            double x[1]; x[0] = O_L;
            for (unsigned int i = 0; i < 256; i++)
            {
                ...
                x[0] += R_L;
            } // end of i loop
            y += R_L;
        } // end of j loop
        z += R_L;
    } // end of k loop
} // end of imageNo loop

I am guessing that this would only modestly improve performance, so the obfuscation cost would be high relative to the benefit.

Another change that might be worth trying is incorporating some of the pix_1 calculation into the sections that conditionally set pixel_1[] values. This would significantly obfuscate the code and might not have much benefit. In addition, it might make autovectorization by the compiler more difficult. (Since the values are conditionally set to the appropriate I_n or to zero, a SIMD comparison could set each element to -1 or 0, and a simple AND with the I_n value would provide the correct value. In this case, the overhead of forming the I_n vector would probably not be worthwhile given that Core2 only supports 2-wide double-precision SIMD, but with gather support or even just a longer vector the tradeoffs might change.)

However, this change would increase the size of basic blocks and reduce the amount of computation when any of px_x or px_y is out of range (I am guessing this is uncommon, so the benefit would be very small at best).

double pix_1 = 0.0;
double alpha_diff = 1.0 - alpha[0];
if ((unsigned)px_y[0]<(unsigned int)threadCopy[0].S_y)
{
    double beta_diff = 1.0 - beta[0];
    if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
        pix_1 += alpha_diff * beta_diff
              * threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
    // no need for an else: pix_1 is already zeroed, and not adding the
    // pixel_1[0] term is the same as zeroing pixel_1[0]
    if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
        pix_1 += alpha[0] * beta_diff
              * threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
}
if ((unsigned)px_y[0]+1<(unsigned int)threadCopy[0].S_y)
{
    if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
        pix_1 += alpha_diff * beta[0]
              * threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
    if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
        pix_1 += alpha[0] * beta[0]
              * threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
}

Ideally, code like yours would be vectorized, but I do not know how to get gcc to recognize the opportunities, how to express the opportunities using intrinsics, nor whether significant effort at manually vectorizing this code would be worthwhile with an SIMD width of only two.

I am not a programmer (just someone who likes learning and thinking about computer architecture), and I have a significant inclination toward micro-optimization (as is clear from the above), so the above proposals should be considered in that light.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow