Question

I'm in the process of optimizing the encoding step of a C++ library called PackJPG

I've profiled the code with Intel VTune and found that the current bottleneck is the following function in the arithmetic coder that PackJPG uses:

void aricoder::encode( symbol* s )
{   
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh - clow) + 1);
    cstep = delta_plus_one / s->scale;
    chigh = clow + ( cstep * s->high_count ) - 1;
    clow  = clow + ( cstep * s->low_count );

    // e3 scaling is performed for speed and to avoid underflows
    // if both low and high are either in the lower half or in the higher half,
    // one bit can be safely shifted out
    while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
        if ( chigh < CODER_LIMIT050 ) { // both high and low are below the middle, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow  &= CODER_LIMIT050 - 1;
            chigh &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits
            write_nrbits_as_zeros();
        }
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
        ++nrbits;
        clow  &= CODER_LIMIT025 - 1;
        chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }
}

This function seems to borrow some ideas from: http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf. I've managed to optimize the function somewhat (primarily by speeding up the bit writing) but now I'm stuck.

Right now the biggest bottleneck seems to be the division at the beginning. This screenshot from VTune shows the timing results as well as the generated assembly (the blue assembly on the right corresponds to the source line selected on the left).

s->scale is not necessarily a power of two, so the division can't be replaced with a simple bit shift.

The code was compiled with MSVC (from Visual Studio 2013) with the following settings:

/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch" 

Any ideas on how to optimize this further?

UPDATE 1 I've now tried all suggestions so far and this is the fastest version now:

void aricoder::encode( symbol* s )
{   
    unsigned int clow_copy = clow;
    unsigned int chigh_copy = chigh;
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
    register unsigned int cstep = delta_plus_one / s->scale;

    chigh_copy = clow_copy + (cstep * s->high_count) - 1;
    clow_copy = clow_copy + (cstep * s->low_count);

    // e3 scaling is performed for speed and to avoid underflows
    // if both low and high are either in the lower half or in the higher half,
    // one bit can be safely shifted out
    while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
        if (chigh_copy < CODER_LIMIT050) {  // both high and low are below the middle, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow_copy &= CODER_LIMIT050 - 1;
            chigh_copy &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits
            write_nrbits_as_zeros();
        }
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
        ++nrbits;
        clow_copy &= CODER_LIMIT025 - 1;
        chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }
    clow = clow_copy;
    chigh = chigh_copy;
}

Here are the updated VTune results for this version. The new version includes the following changes:

  • Avoid one branch by using & instead of && in the last while loop (that trick did not help in the first loop).
  • Copy the class fields to local variables.

The following suggestions unfortunately did not improve performance:

  • Replacing the first while loop with a switch with goto statements.
  • Using fixed point arithmetic for division (it created rounding errors).
  • Doing a switch on s->scale and doing bit-shifts instead of division for even powers of 2.

@example suggested that it's not the division that's slow but the memory access for one of the operands of the division. That seems to be correct. According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?


Solution

According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

The way we organize data directly impacts performance, because it determines data locality and hence how the cache behaves. A program should therefore access memory as linearly as possible and avoid indirect reads/writes through pointer-based data structures. The cache rewards this: the probability that a given access hits L1 becomes significantly higher.
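As a minimal illustration of that point (not PackJPG code), summing the same values from a contiguous vector and from a node-based list contrasts linear access with pointer chasing:

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: elements are adjacent in memory, so the hardware
// prefetcher can stream them through L1 almost for free.
uint64_t sum_linear(const std::vector<uint32_t>& v)
{
    return std::accumulate(v.begin(), v.end(), uint64_t{0});
}

// Node-based storage: every element is a separate allocation reached
// through a pointer, so each step is a potential cache miss.
uint64_t sum_chasing(const std::list<uint32_t>& l)
{
    return std::accumulate(l.begin(), l.end(), uint64_t{0});
}
```

Both functions compute the same result; only the memory access pattern differs, and on large inputs that difference dominates the running time.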

Looking at your code and the VTune report, the most important data is the argument passed to this particular function; its data members are read repeatedly inside it:

void aricoder::encode( symbol* s )

The function reads the following data members of this object:

s->scale
s->high_count
s->low_count

From the VTune reports we can see that the three memory accesses have different timings. This indicates that the members live at different offsets within the object, and that accessing one of them (s->high_count) goes out of the L1 cache and therefore takes longer, since the data first has to be brought in. s->low_count benefits from this, because by then it is already in L1. From these data I can offer the following points:

  1. Put your most frequently accessed data members at the top of the object, in its hot zone. That gives the best chance that they all fit into the object's first cache line. So try to reorganize the object's memory layout according to how its members are accessed. I assume you are not dealing with a virtual table in this object, as vtable pointers are not cache-friendly.

  2. It is possible that your overall program is organized such that, around the execution of this function, the L1 cache is already full, so the program has to go to L2, and that transition costs extra CPU cycles (the spike). In this scenario I do not think we can do much; this is a limitation of the machine, and in some sense we are already stretching the boundary by dealing with such low-level details.

  3. Your object s seems to be a POD type, so access to it is linear. That is good, and there is no scope for improvement there. However, the way it is allocated can still affect the cache: if it is allocated afresh every time, that can hurt while executing the current function.
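Point 1 could be sketched like this; the member names come from the accesses shown earlier, while the member types and the rest of the layout are assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reordering of the symbol struct: the three members that
// encode() reads on every call go first, so they sit together at the
// start of the object and are likely to share one 64-byte cache line.
struct symbol {
    uint32_t scale;       // read for the division
    uint32_t low_count;   // read for the new low bound
    uint32_t high_count;  // read for the new high bound
    // ... colder members (context tables, statistics, ...) go below ...
};

// Sanity check: the hot members fit well inside one cache line.
static_assert(offsetof(symbol, high_count) + sizeof(uint32_t) <= 64,
              "hot members should share a cache line");
```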

Apart from that, I think we should also refer to the following SO posts, which discuss these concepts (data cache / instruction cache) in great detail and contain links with in-depth analysis:

What is "cache-friendly" code?

How to write instruction cache friendly program in c++?

I suggest you read those posts. They are really helpful for understanding the internals of these concepts, even though they may not help you optimize this particular piece of code. Maybe your program is already optimized and there is very little left to do here :).

OTHER TIPS

This is not a complete answer. The code below demonstrates the use of fixed-point arithmetic to perform fast integer division; the technique is widely used in DSP and signal processing. Note that it makes sense as an optimization only if 'scale' changes infrequently. Also, for small values of 'scale', the code could be rewritten to use uint32_t for the intermediate result.

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
   uint32_t scale;
   uint32_t scale_inv;
   uint32_t delta_plus_one;
   uint32_t val0, val1;
   uint64_t tmp;

   scale = 5;
   delta_plus_one = 44533;

   /* Place this line in the 'scale' setter function */
   scale_inv = 0x80000000 / scale;

   /* Original expression */
   val0 = (delta_plus_one / scale);

   /* Division using multiplication uint64_t by uint32_t,
      using uint64_t as intermediate result */
   tmp = (uint64_t)(delta_plus_one) * scale_inv;
   /* shift right to produce result */
   val1 = tmp >> 31;

   printf("val0 = %u; val1 = %u\n", val0, val1);
   return 0;
}

To start off, CODER_LIMIT050 is a stupid name, made especially stupid by the coexistence of CODER_LIMIT025 and CODER_LIMIT075. Other than that, you probably don't want to use short-circuit logic when there are no side effects anyway, so the second while statement can be:

while ( ( clow >= CODER_LIMIT025 ) & ( chigh < CODER_LIMIT075 ) )

The first while block can be further optimized to collapse the 3 possible branching statements per iteration into one:

start:
switch ( ( clow >= CODER_LIMIT050 ) | ( ( chigh < CODER_LIMIT050 ) << 1 ) )
{
default: break;

case 2: // chigh < CODER_LIMIT050: both are in the lower half, shift out a 0
    write_zero ( );
    write_nrbits_as_one ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;

case 3: // unreachable: chigh >= clow, so both conditions can never hold at once
case 1: // clow >= CODER_LIMIT050: both are in the upper half, shift out a 1
    write_one ( );
    clow &= CODER_LIMIT050 - 1;
    chigh &= CODER_LIMIT050 - 1;
    write_nrbits_as_zeros ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;
}

If you want to optimize away the division by s->scale, ask yourself just how variable it really is. If there are only a few possible cases, then template them out. Once it is a compile-time constant, the compiler can replace the division with a bit shift if the divisor is a power of two, or otherwise with a multiplication by a precomputed fixed-point reciprocal (the standard strength reduction for constant divisors).
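The templating idea could be sketched as follows; `step`, `cstep_for`, and the dispatched scale values are hypothetical, for illustration only. With the divisor known at compile time, the compiler emits no division instruction, only shifts and multiplies:

```cpp
#include <cstdint>

// Division with a compile-time divisor: the compiler strength-reduces
// this to a shift (powers of two) or a multiply-plus-shift (otherwise).
template <uint32_t SCALE>
static inline uint32_t step(uint32_t delta_plus_one)
{
    return delta_plus_one / SCALE;
}

// Hypothetical dispatcher for a handful of known scale values.
static uint32_t cstep_for(uint32_t delta_plus_one, uint32_t scale)
{
    switch (scale) {
    case 16: return step<16>(delta_plus_one); // becomes a shift
    case 36: return step<36>(delta_plus_one); // becomes mul + shift
    default: return delta_plus_one / scale;   // generic runtime division
    }
}
```

This only pays off if the branch on scale is cheaper than the division it replaces, so it is worth profiling with the actual distribution of scale values.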

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow