Question

I'm in the process of optimizing the encoding step of a C++ library called PackJPG

I've profiled the code with Intel VTune and found that the current bottleneck is the following function in the arithmetic coder that PackJPG uses:

void aricoder::encode( symbol* s )
{   
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh - clow) + 1);
    cstep = delta_plus_one / s->scale;
    chigh = clow + ( cstep * s->high_count ) - 1;
    clow  = clow + ( cstep * s->low_count );

    // e3 scaling is performed for speed and to avoid underflows
    // if both low and high are either in the lower half or in the higher half,
    // one bit can be safely shifted out
    while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
        if ( chigh < CODER_LIMIT050 ) { // both high and low are below the middle, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow  &= CODER_LIMIT050 - 1;
            chigh &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits
            write_nrbits_as_zeros();
        }
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
        ++nrbits;
        clow  &= CODER_LIMIT025 - 1;
        chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }
}

This function seems to borrow some ideas from: http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf. I've managed to optimize the function somewhat (primarily by speeding up the bit writing) but now I'm stuck.

Right now the biggest bottleneck seems to be the division at the beginning. This screenshot from VTune shows the timing results as well as the generated assembly (the blue assembly on the right corresponds to the source line selected on the left).

s->scale is not necessarily a power of two, so the division can't be replaced with a simple bit shift.

The code was compiled with MSVC (from Visual Studio 2013) with the following settings:

/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch" 

Any ideas on how to optimize this further?

UPDATE 1 I've now tried all suggestions so far and this is the fastest version now:

void aricoder::encode( symbol* s )
{   
    unsigned int clow_copy = clow;
    unsigned int chigh_copy = chigh;
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
    register unsigned int cstep = delta_plus_one / s->scale;

    chigh_copy = clow_copy + (cstep * s->high_count) - 1;
    clow_copy = clow_copy + (cstep * s->low_count);

    // e3 scaling is performed for speed and to avoid underflows
    // if both low and high are either in the lower half or in the higher half,
    // one bit can be safely shifted out
    while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
        if (chigh_copy < CODER_LIMIT050) {  // both high and low are below the middle, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow_copy &= CODER_LIMIT050 - 1;
            chigh_copy &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits
            write_nrbits_as_zeros();
        }
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
        ++nrbits;
        clow_copy &= CODER_LIMIT025 - 1;
        chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }
    clow = clow_copy;
    chigh = chigh_copy;
}

Here are the updated VTune results for this version. The new version includes the following changes:

  • Avoid one branch by using & instead of && in the last while loop (that trick did not help in the first loop).
  • Copy the class fields to local variables.

The following suggestions unfortunately did not improve performance:

  • Replacing the first while loop with a switch with goto statements.
  • Using fixed point arithmetic for division (it created rounding errors).
  • Doing a switch on s->scale and doing bit-shifts instead of division for even powers of 2.

@example suggested that it's not the division that's slow but the memory access for one of the operands of the division. That seems to be correct. According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?


Solution

According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

The way we organize data directly impacts performance, because it determines data locality and hence how the cache behaves. A program should therefore access memory as linearly as possible and avoid indirect reads/writes through pointer-based data structures. The cache rewards this: the probability that a given access hits L1 becomes significantly higher.
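As a minimal illustration of that point (not PackJPG code), summing the same values from a contiguous vector and from a node-based list contrasts linear access with pointer chasing:

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: elements are adjacent in memory, so the hardware
// prefetcher can stream them through L1 almost for free.
uint64_t sum_linear(const std::vector<uint32_t>& v)
{
    return std::accumulate(v.begin(), v.end(), uint64_t{0});
}

// Node-based storage: every element is a separate allocation reached
// through a pointer, so each step is a potential cache miss.
uint64_t sum_chasing(const std::list<uint32_t>& l)
{
    return std::accumulate(l.begin(), l.end(), uint64_t{0});
}
```

Both functions compute the same result; only the memory access pattern differs, and on large inputs that difference dominates the running time.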

Looking at your code and the VTune report, the most important data is the argument passed to this particular function; its data members are read repeatedly inside it:

void aricoder::encode( symbol* s )

The function reads the following data members of this object:

s->scale
s->high_count
s->low_count

From the VTune reports we can see that the three memory accesses have different timings. This indicates that the members live at different offsets within the object, and that accessing one of them (s->high_count) goes out of the L1 cache and therefore takes longer, since the data first has to be brought in. s->low_count benefits from this, because by then it is already in L1. From these data I can offer the following points:

  1. Put your most frequently accessed data members at the top of the object, in its hot zone. That gives the best chance that they all fit into the object's first cache line. So try to reorganize the object's memory layout according to how its members are accessed. I assume you are not dealing with a virtual table in this object, as vtable pointers are not cache-friendly.

  2. It is possible that your overall program is organized such that, around the execution of this function, the L1 cache is already full, so the program has to go to L2, and that transition costs extra CPU cycles (the spike). In this scenario I do not think we can do much; this is a limitation of the machine, and in some sense we are already stretching the boundary by dealing with such low-level details.

  3. Your object s seems to be a POD type, so access to it is linear. That is good, and there is no scope for improvement there. However, the way it is allocated can still affect the cache: if it is allocated afresh every time, that can hurt while executing the current function.
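Point 1 could be sketched like this; the member names come from the accesses shown earlier, while the member types and the rest of the layout are assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reordering of the symbol struct: the three members that
// encode() reads on every call go first, so they sit together at the
// start of the object and are likely to share one 64-byte cache line.
struct symbol {
    uint32_t scale;       // read for the division
    uint32_t low_count;   // read for the new low bound
    uint32_t high_count;  // read for the new high bound
    // ... colder members (context tables, statistics, ...) go below ...
};

// Sanity check: the hot members fit well inside one cache line.
static_assert(offsetof(symbol, high_count) + sizeof(uint32_t) <= 64,
              "hot members should share a cache line");
```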

Apart from that, I think we should also refer to the following SO posts, which discuss these concepts (data cache / instruction cache) in great detail and contain links with in-depth analysis:

What is "cache-friendly" code?

How to write instruction cache friendly program in c++?

I suggest you read those posts. They are really helpful for understanding the internals of these concepts, even though they may not help you optimize this particular piece of code. Maybe your program is already optimized and there is very little left to do here :).

OTHER TIPS

This is not a complete answer. The code below demonstrates the use of fixed-point arithmetic to perform fast integer division; the technique is widely used in DSP and signal processing. Note that it makes sense as an optimization only if 'scale' changes infrequently. Also, for small values of 'scale', the code could be rewritten to use uint32_t for the intermediate result.

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
   uint32_t scale;
   uint32_t scale_inv;
   uint32_t delta_plus_one;
   uint32_t val0, val1;
   uint64_t tmp;

   scale = 5;
   delta_plus_one = 44533;

   /* Place this line in the 'scale' setter function */
   scale_inv = 0x80000000 / scale;

   /* Original expression */
   val0 = (delta_plus_one / scale);

   /* Division using multiplication uint64_t by uint32_t,
      using uint64_t as intermediate result */
   tmp = (uint64_t)(delta_plus_one) * scale_inv;
   /* shift right to produce result */
   val1 = tmp >> 31;

   printf("val0 = %u; val1 = %u\n", val0, val1);
   return 0;
}

To start off, CODER_LIMIT050 is a stupid name, made especially stupid by the coexistence of CODER_LIMIT025 and CODER_LIMIT075. Other than that, you probably don't want to use short-circuit logic when there are no side effects anyway, so the second while statement can be:

while ( ( clow >= CODER_LIMIT025 ) & ( chigh < CODER_LIMIT075 ) )

The first while block can be further optimized to collapse the 3 possible branching statements per iteration into one:

start:
switch ( ( clow >= CODER_LIMIT050 ) | ( ( chigh < CODER_LIMIT050 ) << 1 ) )
{
default: break;

case 2: // chigh < CODER_LIMIT050: both are in the lower half, shift out a 0
    write_zero ( );
    write_nrbits_as_one ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;

case 3: // unreachable: chigh >= clow, so both conditions can never hold at once
case 1: // clow >= CODER_LIMIT050: both are in the upper half, shift out a 1
    write_one ( );
    clow &= CODER_LIMIT050 - 1;
    chigh &= CODER_LIMIT050 - 1;
    write_nrbits_as_zeros ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;
}

If you want to optimize away the division by s->scale, ask yourself just how variable it really is. If there are only a few possible cases, then template them out. Once it is a compile-time constant, the compiler can replace the division with a bit shift if the divisor is a power of two, or otherwise with a multiplication by a precomputed fixed-point reciprocal (the standard strength reduction for constant divisors).
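The templating idea could be sketched as follows; `step`, `cstep_for`, and the dispatched scale values are hypothetical, for illustration only. With the divisor known at compile time, the compiler emits no division instruction, only shifts and multiplies:

```cpp
#include <cstdint>

// Division with a compile-time divisor: the compiler strength-reduces
// this to a shift (powers of two) or a multiply-plus-shift (otherwise).
template <uint32_t SCALE>
static inline uint32_t step(uint32_t delta_plus_one)
{
    return delta_plus_one / SCALE;
}

// Hypothetical dispatcher for a handful of known scale values.
static uint32_t cstep_for(uint32_t delta_plus_one, uint32_t scale)
{
    switch (scale) {
    case 16: return step<16>(delta_plus_one); // becomes a shift
    case 36: return step<36>(delta_plus_one); // becomes mul + shift
    default: return delta_plus_one / scale;   // generic runtime division
    }
}
```

This only pays off if the branch on scale is cheaper than the division it replaces, so it is worth profiling with the actual distribution of scale values.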

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow