Question

This is probably language agnostic, but I'm asking from a C++ background.

I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:

const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;

There's a neat trick: if the buffer's size is a power of two, ANDing the write and read pointers with size-1 gives an efficient, branchless rollover, like so:

// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);

If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:

// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;

Since the reset occurs rather rarely, this should be easy for any branch predictor.

This assumes, though, that the pointers need to advance forward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.

// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
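For concreteness, the count-down variant can be packaged like this (SIZE and the function name are illustrative stand-ins, not from the original snippet):

```c
#include <stdint.h>

#define SIZE 10u                      /* illustrative, any size > 0 works */

static uint8_t buffer[SIZE];
static uint8_t read_pointer = SIZE;   /* start one past the top */

/* Read the next element, walking from high index down to 0; the wrap
   check compares against zero, which the decrement already computed. */
static uint8_t ring_read(void) {
    uint8_t value = buffer[--read_pointer];
    if (read_pointer == 0) read_pointer = SIZE;
    return value;
}
```

After a full pass the pointer lands back on SIZE, so the reset branch is taken only once per lap.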

So, TL;DR: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward), or is this a valid optimization?


Solution

You have an 8-bit AVR with a cache? And branch prediction?

How does forward or backward matter as far as caches are concerned? A hit or miss on a cache can happen anywhere within the cache line: beginning, middle, end, random, sequential, it doesn't matter. You can work from the back to the front of a cache line or from the front to the back at the same cost (all other things held constant). The first miss causes a fill; then that line is in the cache and you can access any of its items, in any pattern, at a lower latency until it is evicted.

On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to size and align a circular buffer such that you can mask. There is no cache, and instruction fetches are painful because they likely come from a flash that may be slower than the processor clock, so you do want to reduce the instructions executed, or make the execution a little more deterministic (the same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
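As a sketch of that advice (the capacity and names here are illustrative, not from the answer), rounding the buffer up to a power of two turns the wrap into a single AND:

```c
#include <stdint.h>

#define RB_SIZE 16u               /* must be a power of two */
#define RB_MASK (RB_SIZE - 1u)    /* 0x0F */

static uint8_t rb_data[RB_SIZE];
static uint8_t rb_write;

/* Store a byte and advance: the AND replaces the compare-and-branch,
   so every call executes the same instruction sequence. */
static void rb_put(uint8_t value) {
    rb_data[rb_write] = value;
    rb_write = (rb_write + 1u) & RB_MASK;
}
```

Padding a 10-element buffer up to 16 bytes "wastes" 6 bytes, but buys the deterministic wrap the answer is arguing for.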

TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?

The cache doesn't care: a miss on any item in the line causes a fill, and once the line is in the cache, any pattern of access (random, sequential forward or backward, or just pounding on the same address) takes less time by being in faster memory, until the line is evicted. Evictions won't come from neighboring cache lines; they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher or lower address, the cost is the same.

OTHER TIPS

Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward)

Why do you think that you will have a cache miss? You will have a cache miss only if you access memory that is not in the cache, whether you got there moving forward or backward.

There are a number of points which need clarification:

  1. That size needs to be loaded each and every time: it is const, therefore immutable, and the compiler can fold it into the generated code instead of reloading it on every iteration.
  2. That your code is correct. For example, with a 0-based index (as used in C/C++ for array access) the value 0 is a valid pointer into the buffer, and the value size is not. Similarly, there is no need to XOR when you could simply assign 0; equally, a modulo operator will work (write_pointer = (write_pointer + 1) % size).
  3. What happens in the general case with virtual memory (i.e. logically adjacent addresses might be scattered all over the physical memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts).
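A small sketch of point 2 (the capacity name and value are illustrative): the modulo form wraps correctly for any capacity, and with a compile-time constant size the compiler reduces it to cheap arithmetic rather than a runtime load:

```c
#include <stdint.h>

enum { CAPACITY = 10 };   /* illustrative, non-power-of-two capacity */

/* Advance an index with wrap-around; valid indices are 0..CAPACITY-1. */
static uint8_t advance(uint8_t index) {
    return (uint8_t)((index + 1u) % CAPACITY);
}
```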

In short: this is the kind of optimisation that leads to more foot-related injuries than genuine performance improvements. Additionally, it is almost certainly the case that you would get much, much better gains using vectorised code (SIMD).

EDIT: And in interpreted or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and the like at all. At which point the question is: how much of a difference is there, really, between loading size and comparing versus comparing with 0?

As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.

Thus, your question is filled with various strange assumptions:

  • First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here; you will have to disassemble the code and see which yields the fewer CPU instructions.

    Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipeline and that's it. It seems quite unlikely that you have much in the way of branch prediction, and very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.

  • As for marching backwards through memory, it is unlikely to have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison against zero instead of a value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.

The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard of something called floating point. This is what will make your program efficient; you are unlikely to find any better way to manually optimize your code.
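To illustrate that advice (the function is a made-up example, not from the answer): keeping counters and accumulators in uint8_t lets an 8-bit core do each operation in one register-sized step, where a 32-bit accumulator would expand into multi-byte instruction sequences:

```c
#include <stdint.h>

/* Sum a small byte array with an 8-bit accumulator; the result wraps
   modulo 256, which is often acceptable for simple checksums on an MCU. */
static uint8_t checksum8(const uint8_t *data, uint8_t len) {
    uint8_t acc = 0;
    for (uint8_t i = 0; i < len; i++)
        acc = (uint8_t)(acc + data[i]);
    return acc;
}
```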

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow