Should a software cache improve performance on a NUMA machine?

Question

Some NUMA machines do have local cache. If you have a multi-socket Opteron or Xeon system, each socket is a NUMA domain with multiple levels of cache, some shared between cores and some not. At least for Intel chips since Nehalem, all of those caches can hold data from remote memory. This helps performance on 2-8 socket systems, and it continues to be a benefit on larger systems built on longer-range cache-coherent interconnects like NumaConnect or SGI NUMAlink.

With that said, if you're stuck on a non-coherent system, you'll need to narrow down a bunch of other parameters before a yes/no answer is possible. How expensive is each state transition in your software coherency protocol? How often are those transitions happening for a trace of an app you're concerned about? If transitions are cheap enough or lines stay resident long enough, then sure, it could help... but that depends on the implementation, the underlying architecture and the behavior of the app itself.
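To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It replays an address trace through a small LRU software cache to estimate a hit rate, then compares the average access cost against going to remote memory every time. Everything in it is an assumption for illustration: the function names, the 64-byte line size, the LRU policy, and the idea that every miss costs exactly one remote access plus one protocol transition. A real software coherency protocol would also pay for invalidations triggered by other nodes, which this ignores.

```python
from collections import OrderedDict


def hit_rate(trace, capacity_lines, line_bytes=64):
    """Fraction of accesses in `trace` (a list of byte addresses) that hit
    a hypothetical LRU software cache holding `capacity_lines` lines."""
    cache = OrderedDict()  # line number -> True, ordered oldest to newest
    hits = 0
    for addr in trace:
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(trace) if trace else 0.0


def avg_access_ns(h, local_ns, remote_ns, transition_ns):
    """Average cost per access with the software cache in place.
    Assumes each miss pays one remote access plus one protocol transition."""
    return h * local_ns + (1.0 - h) * (remote_ns + transition_ns)


def break_even_hit_rate(local_ns, remote_ns, transition_ns):
    """Hit rate at which the cached average equals plain remote access.
    Derived from h*local + (1-h)*(remote+transition) = remote."""
    denom = remote_ns + transition_ns - local_ns
    if denom <= 0:
        raise ValueError("expected remote_ns + transition_ns > local_ns")
    return transition_ns / denom


if __name__ == "__main__":
    # Made-up latencies, purely for illustration.
    local_ns, remote_ns, transition_ns = 10.0, 100.0, 200.0
    trace = [0, 8, 64, 0, 128, 64]  # toy trace of byte addresses
    h = hit_rate(trace, capacity_lines=2)
    print("hit rate:", h)
    print("avg with cache:", avg_access_ns(h, local_ns, remote_ns, transition_ns))
    print("break-even hit rate:", break_even_hit_rate(local_ns, remote_ns, transition_ns))
```

With numbers like these, the cache only pays off if the measured hit rate sits above the break-even value, which is why you need both the transition cost and a trace from the actual app before you can say yes or no.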

Here's a group experimenting with some related performance issues: http://www.lfbs.rwth-aachen.de/content/17.html. You might also find some interesting work on the Cell BE architecture used in the PlayStation 3, for example: http://researcher.ibm.com/files/us-alexe/paper-gonzalez-pact08.pdf.