According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?
The way we organize data directly impacts performance: data locality, and therefore how the cache behaves, depends on it. To benefit from the cache, a program should access memory as linearly as possible and avoid indirect reads/writes (pointer-based data structures). The cache mechanism handles linear access very well, because the probability that the next piece of data is already in L1 is significantly higher.
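To make the contrast concrete, here is a small sketch (not from your code) comparing a contiguous traversal with a pointer-chasing one. The names and data are illustrative only:

```cpp
#include <numeric>
#include <vector>

// Cache-friendly: elements are contiguous, so the hardware prefetcher
// can stream them through L1 ahead of the loop.
long sum_contiguous(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

// Cache-hostile: each step is a dependent pointer load, and the nodes
// may be scattered across the heap, so every hop can miss the cache.
struct Node {
    int value;
    Node* next;
};

long sum_chased(const Node* n) {
    long total = 0;
    for (; n != nullptr; n = n->next)
        total += n->value;
    return total;
}
```

Both functions compute the same sum; the difference is purely in the memory access pattern the cache sees.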
Looking at your code and the VTune report, the most important data appears to be the argument passed to this particular function. Several data members of this object are read within it:
void aricoder::encode( symbol* s )
The function accesses the following data members of this object:
s->scale
s->high_count
s->low_count
From the VTune report we can see that these three memory accesses have different timings. This suggests the members sit at different offsets within the object: accessing one of them (s->high_count) goes beyond what is already in L1, so it pays the cost of pulling a new cache line in, and s->low_count then benefits because that line is now resident in L1. From this I can suggest the following points:
Put your most frequently accessed data members in the hot zone of the object, i.e. at the top. That way there is a better chance that all of them fit into the object's first cache line. In general, try to re-organize the object's memory layout according to how its members are accessed. I assume this object has no virtual table, since vtable pointers are not cache-friendly.
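A sketch of what that layout could look like. The struct name, member types, and the "cold" members are my assumptions, not your actual symbol type; offsetof lets you verify where the hot members land:

```cpp
#include <cstddef>

// Hypothetical layout: the three hot members read by encode() are
// grouped at the front so they share the object's first cache line.
struct symbol_hot_first {
    unsigned scale;       // hot: read on every encode() call
    unsigned high_count;  // hot
    unsigned low_count;   // hot
    // ...colder, rarely touched members follow...
    char bookkeeping[48];
};

// With the hot members first, all three fit comfortably inside a
// typical 64-byte cache line; this can be checked at compile time.
static_assert(offsetof(symbol_hot_first, low_count) + sizeof(unsigned) <= 64,
              "hot members should fit in one cache line");
```

The same idea applies to your real symbol type: measure which members the hot path touches, then move them to the front together.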
It is also possible that your overall program is organized such that, around this point (i.e. the execution of this function), L1 is already full, so the access has to go to L2, and that transition costs extra CPU cycles (the spike). In that scenario I do not think we can do much; this is a limitation of the machine, and in some sense we would be stretching our boundaries and dealing with very low-level details.
Your object s seems to be a POD type, so access to its members is already linear. That is good, and there is little scope for improvement in the type itself. However, the way it is allocated can still affect the cache: if a fresh object is allocated on every call, that allocation can hurt performance while this function executes.
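If that is the case, one option is to reuse a single object across calls so its cache line stays warm. A minimal sketch, assuming a POD symbol and a driver loop; the names and the fill logic are illustrative stand-ins for your model code, not your actual aricoder interface:

```cpp
struct symbol {
    unsigned scale;
    unsigned high_count;
    unsigned low_count;
};

// Hypothetical driver: one stack-allocated symbol is reused for every
// iteration (instead of 'new symbol' per call), so its cache line
// stays hot across iterations.
unsigned encode_all(int n) {
    symbol s{};          // single reusable object on the stack
    unsigned acc = 0;
    for (int i = 0; i < n; ++i) {
        s.scale = 10;                 // the model would fill these in
        s.low_count = i % 10;
        s.high_count = s.low_count + 1;
        acc += s.high_count;          // stands in for aricoder::encode(&s)
    }
    return acc;
}
```

The point is only the allocation pattern: filling one resident object in place, rather than handing encode() a freshly allocated one each time.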
Apart from that, I suggest referring to the following SO posts, which discuss these concepts (data cache / instruction cache) in great detail and also link to in-depth analysis and further information:
What is "cache-friendly" code?
How to write instruction cache friendly program in c++?
These posts will be really helpful for understanding the internals of these concepts, even if they do not directly help you optimize your current piece of code. Maybe your program is already optimized and there is very little left to do :).