Prefetch instructions on ARM

https://stackoverflow.com/questions/82415

01-07-2019

Question

Newer ARM processors include the PLD and PLI instructions.

I'm writing tight inner loops (in C++) that have a non-sequential memory access pattern, but one that my code fully understands. I would anticipate a substantial speedup if I could prefetch the next location while processing the current one, and it should be quick enough to try that the experiment is worthwhile.

I'm using new, expensive compilers from ARM, and they don't seem to be emitting PLD instructions anywhere, let alone in the particular loop that I care about.

How can I include explicit prefetch instructions in my C++ code?

Solution

There should be some compiler-specific features for this. There is no standard way to do it in C/C++. Check your compiler's Compiler Reference Guide. For RealView Compiler see this or this.
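As a rough sketch of what such a compiler-specific feature can look like: GCC and Clang provide `__builtin_prefetch`, and the armcc/RealView toolchain documents a `__pld` intrinsic. The macro `PREFETCH_READ`, the function `sum_in_order`, and the index-table layout below are made up for illustration and are not taken from the question; check your own compiler's reference guide before relying on either intrinsic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper macro: picks a prefetch intrinsic per compiler.
#if defined(__CC_ARM)
  // armcc / RealView: __pld intrinsic (assumed available for the chosen target).
  #define PREFETCH_READ(addr) __pld(addr)
#elif defined(__GNUC__) || defined(__clang__)
  // GCC / Clang: second argument 0 = read, third argument 3 = keep in all cache levels.
  #define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
  // Unknown compiler: compile to nothing.
  #define PREFETCH_READ(addr) ((void)0)
#endif

// Sums data[order[i]] for every i. The access pattern is non-sequential,
// but the next address is known one iteration ahead, so it can be hinted.
// Assumes every index in `order` is a valid index into `data`.
long sum_in_order(const std::vector<int>& data,
                  const std::vector<std::size_t>& order) {
    long total = 0;
    const std::size_t n = order.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 < n) {
            PREFETCH_READ(&data[order[i + 1]]);  // hint the next location
        }
        total += data[order[i]];                 // process the current one
    }
    return total;
}
```

A prefetch is only a hint, so the result is the same with or without it. Whether it helps depends on the core and on how far ahead the hint is issued, so it is worth measuring with a prefetch distance of more than one iteration as well.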

OTHER TIPS

If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. Depending on the data structures involved in your loop, you should be able to use inline assembly. It is even better if you can unroll any piece of your loop (such as the parts that make the access non-sequential).
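A smaller step than writing the whole loop in assembler is to issue just the prefetch through inline assembly. The sketch below assumes GCC-style extended asm (also accepted by Clang). The helper `prefetch_hint`, the `Node` type, and `sum_list` are hypothetical names for illustration. The 32-bit branch emits `pld` and the AArch64 branch emits `prfm pldl1keep`; on any other target the helper does nothing.

```cpp
#include <cstddef>

// Hypothetical helper: issues a data prefetch hint for `addr`.
static inline void prefetch_hint(const void* addr) {
#if defined(__GNUC__) && defined(__arm__)
    // 32-bit ARM: PLD (the assembler rejects it if the target lacks PLD).
    __asm__ volatile("pld [%0]" : : "r"(addr));
#elif defined(__GNUC__) && defined(__aarch64__)
    // AArch64 equivalent: prefetch for load into L1, keep policy.
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(addr));
#else
    (void)addr;  // not an ARM target or not a GCC-style compiler: no-op
#endif
}

// Illustrative non-sequential structure: a singly linked list.
struct Node {
    int value;
    const Node* next;
};

// Walks the list, hinting the next node while the current one is processed.
long sum_list(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr) {
            prefetch_hint(n->next);
        }
        total += n->value;
    }
    return total;
}
```

With so little work per node the hint has little time to take effect. In a real loop the prefetch would target an address several steps ahead, if the code can compute it.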

At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if by default the compiler is targeted to ARM7, you're never going to see the PLD instruction.

It is not outside the realm of possibility that other optimizations like software pipelining and loop unrolling may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the extra instructions. I would even go so far as to say that this is the case more often than not for tight inner loops, which tend to have few instructions and little control flow. Is your compiler doing these kinds of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and to evaluate more quantitatively whether prefetching would help.
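To make the software pipelining idea concrete, here is a hand-written sketch in which the load for the next iteration is started before the work on the current value. The `process` function is a hypothetical stand-in for the real per-element work, and `sum_pipelined` is an illustrative name. An optimizing compiler may already schedule the loop this way, or may reorder this version, so treat it as an illustration of the transformation rather than a guaranteed win.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-element computation.
static inline long process(int x) {
    return static_cast<long>(x) * x + 1;
}

// Manually software-pipelined loop: the load of element i is issued
// before the computation on element i-1, so the load latency can
// overlap with useful work. Assumes every index in `order` is valid.
long sum_pipelined(const std::vector<int>& data,
                   const std::vector<std::size_t>& order) {
    const std::size_t n = order.size();
    if (n == 0) {
        return 0;
    }
    long total = 0;
    int current = data[order[0]];         // prologue: first load
    for (std::size_t i = 1; i < n; ++i) {
        const int next = data[order[i]];  // start the next load early
        total += process(current);        // work on the value already loaded
        current = next;
    }
    total += process(current);            // epilogue: last element
    return total;
}
```

Unlike a prefetch hint, this adds no extra instructions to the loop body beyond a register move, which is the instruction-cache argument made above.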

Licensed under: CC-BY-SA with attribution
