Minimizing inter-core communication in a NUMA architecture

Question

The Nehalem processor uses QuickPath Interconnect (QPI) for communication between processors (also called nodes or packages). In a NUMA system each node has its own local memory, which is also accessible to the other nodes in the system. When the working set of a program fits in the L1 cache and is read-only, it doesn't matter much which NUMA node owns the memory.

Communication between NUMA nodes is necessary when a core gets a cache miss and the memory is owned by another node. However, this doesn't mean that accessing memory owned by another node is always slower. It depends on whether the other node has the line cached in the cache that sits in front of its local memory, which Intel calls the Last Level Cache (LLC). Access by a core to memory local to its node is faster than access to memory owned by another node, but only if the access misses in the LLC on both nodes. A hit in the LLC on another node is faster than going to memory on the local node, because memory is so much slower than the CPU and QPI is optimized for this sort of communication.

Most systems don't bother trying to reduce inter-processor communication because, as you can imagine, it is not an easy problem: it requires setting the affinity of each thread to a core and setting the affinity of that thread's memory working set to the local memory of the same node. You can read more about this in Ulrich Drepper's paper; search for NUMA. In the paper Drepper refers to QPI as the Common System Interface (CSI), which was Intel's name for QPI before it was announced.
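
The two affinity steps mentioned above (pinning a thread to a core, then keeping its working set in that node's local memory) can be roughly sketched as follows. This is only a minimal illustration, assuming Linux and its default "first touch" memory policy, under which a page is normally placed on the node of the CPU that first writes to it. The helper names `pin_to_cpu` and `alloc_local` are made up for this example, and the code does not verify where the pages actually end up.

```python
import mmap
import os

PAGE = mmap.PAGESIZE


def pin_to_cpu(cpu):
    """Restrict the caller (pid 0) to a single CPU and return the new mask.

    Linux only: os.sched_setaffinity is not available on other platforms.
    """
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


def alloc_local(size):
    """Allocate an anonymous mapping and write to every page.

    Assumption: with the default Linux policy, the write fault on each page
    makes the kernel allocate it on the node of the CPU we are running on.
    """
    buf = mmap.mmap(-1, size)
    for offset in range(0, size, PAGE):
        buf[offset] = 0  # touching the page forces it to be allocated
    return buf


if __name__ == "__main__":
    # Pick any CPU we are currently allowed to run on.
    cpu = sorted(os.sched_getaffinity(0))[0]
    pin_to_cpu(cpu)
    working_set = alloc_local(16 * PAGE)
    print("pinned to CPU", cpu, "with", len(working_set), "bytes touched")
```

The order matters in this sketch: the thread is pinned first and the memory is touched afterwards, so the pages are faulted in while running on the chosen core. For explicit control rather than relying on first touch, tools such as `numactl` or the libnuma library are the usual route.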