Question

Say that you want to rotate something through 360 steps, 100 times. You now have a choice: pre-calculate the 360 sin and cos values once and then reuse the stored values 100 times, or calculate sin and cos every time.

As far as I understand, on modern CPUs you're better off just calculating sin and cos each time, because the CPU is many times faster than memory access anyway. However, would this still hold on a slower CPU like the one used in an RPi (not version 2), or is it slow enough that pre-calculating the values is better? A Raspberry Pi has a 700 MHz single-core processor.

Solution

As far as I understand on modern CPUs you're better off just calculating sin and cos each time because the CPU is many times faster than memory access anyway.

Sure, a main memory access costs many CPU cycles, during which the CPU sits idle waiting for data.

But that completely ignores that modern CPUs have memory caches (some have 3 levels of cache on the die, commonly referred to as L1, L2 and L3). Cache access is much faster than main memory access (though L1 is faster than L2, which is faster than L3).

If your program is designed well enough and with cache locality in mind, a calculated local variable (the result of your sin and cos calculations, for instance) will be fetched from cache and not main memory (or might even reside in a CPU register). This will be faster than re-calculating.


To see which one is faster, you would need to try both approaches in a real-world scenario (that is, in your actual application) and time them.

OTHER TIPS

I ran some tests on my RPi and came to these conclusions:

  • Looping 100 times over a float LUT of size 360 (i.e. 100 full rotations at 1 deg resolution, or 360*100 sin lookups and 360*100 cos lookups), the LUT is approx. 10-20 times faster than calling sinf and cosf directly in the loop.

  • Using a double LUT of the same size makes the LUT approx. 40 times faster than calling sin and cos directly in the loop.

Some things to note:

  • The LUT is small enough to fit in the 16 KB L1 cache of the RPi. Larger LUTs might not end up in the L1 cache, and the result might be radically different.
  • I only used a small placeholder calculation in the loop to prevent the compiler from optimising it away. With a more complicated inner loop, the L1 cache might be used up.
  • Given that this is sin and cos, it's actually enough to create, for example, a sin LUT and use it for the cos values as well (with a 90 deg offset), saving some memory.

Most modern languages will perform table lookups rather than calculate sin and cos values. If the value falls between two table entries, the two are averaged together (a weighted average).

If I'm not mistaken, the same should be true for calling sin/cos in a Raspberry Pi program, though it is still technically performing a table lookup that you could otherwise avoid by pre-calculating into an array.

My guess is that pre-calculating rotation values only to use them once would not be faster, and you'd be using up a bit of memory. If you have to perform another rotation, the difference would likely be marginal. However, my advice would be to try it to be sure.

Licensed under: CC-BY-SA with attribution