Assuming ALL the data used in `A()` fits in the cache, you should see improvement in `B()` too.

However, you can also end up reading data into the cache that isn't being used at all. That serves no purpose, and it keeps the memory bus busy when it could be loading data that IS needed, which matters if your access pattern is as sporadic as you say. By all means give it a try, but don't expect it to magically work effectively. It usually takes a bit of "tuning", particularly with regard to "how far ahead of where you are right now do you prefetch the data".
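As a rough sketch of what that tuning knob looks like, here is a loop that prefetches a fixed distance ahead of the current element. `__builtin_prefetch` is the GCC/Clang intrinsic; the distance of 16 elements is purely a guess you would adjust for your own hardware and access pattern, and `sum_with_prefetch` is just a made-up example function:

```c
#include <stddef.h>

/* How many elements ahead to prefetch. This is the value you tune:
 * too small and the data isn't there yet, too large and you evict
 * lines you still need. 16 is an arbitrary starting point. */
#define PREFETCH_DIST 16

long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint to the CPU to start loading a future element.
         * Args: address, 0 = read, 1 = low temporal locality. */
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 1);
        total += data[i];
    }
    return total;
}
```

Note that `__builtin_prefetch` is only a hint: the compiler and CPU are free to ignore it, and on a sufficiently sporadic pattern the prefetched line may be evicted before you reach it, which is exactly the "busy bus for nothing" problem above.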
Depending on the exact behaviour of `A()` and `B()`, other tricks may help. For example, if you are switching between reads and writes, reading from one region and writing to a completely different one, then batching up the writes into a "holding area" which is then copied to RAM in one go is often a good plan. Make the holding area something like 1/8 to 1/4 of the L1 cache.
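A minimal sketch of that holding-area idea, under the assumption of a 32 KiB L1 data cache (so a 4 KiB buffer sits at the 1/8 mark). The names `holding_area`, `put`, and `flush` are all invented for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Assumed 1/8 of a 32 KiB L1 data cache -- size to YOUR cache. */
#define HOLD_SIZE 4096

typedef struct {
    unsigned char buf[HOLD_SIZE]; /* cache-resident staging buffer */
    size_t used;                  /* bytes currently staged */
    unsigned char *dest;          /* where flushed bytes land in RAM */
} holding_area;

/* Copy the staged bytes out in one contiguous burst. */
static void flush(holding_area *h)
{
    memcpy(h->dest, h->buf, h->used);
    h->dest += h->used;
    h->used = 0;
}

/* Stage a small write; spill to RAM only when the buffer fills. */
static void put(holding_area *h, const void *p, size_t len)
{
    if (h->used + len > HOLD_SIZE)
        flush(h);
    memcpy(h->buf + h->used, p, len);
    h->used += len;
}
```

The point of the pattern is that the scattered small writes all land in a buffer that stays hot in L1, and the destination region is only touched in large sequential copies, which the memory system handles far better than interleaved read/write traffic.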
[Caveat: I've got absolutely no experience at all with the PowerPC architecture, but I have used cache prefetching and other memory optimisation techniques in my work with x86 processors, with some success at times and less at others.]