Question

I have a dual-socket machine with 8 cores in total, that is, each processor has 4 cores. I haven't read its specification completely, but I think a separate memory bank is attached to each processor in a ccNUMA fashion, so accessing the other processor's memory bank is relatively slow. I also suppose each processor has its own L3 cache.

Now my question is: what is the fastest way to share data between the two processors? Simple shared memory will suffer from the ccNUMA and cache-coherency overhead. Is there a way to do this that is very fast?


Solution

That depends heavily on what you're trying to implement. From what I've seen, it's usually possible to do better with a very tightly managed shared-memory approach than by resorting to MPI, because shared memory lets you do a lot more.

However, it's harder to go completely wrong with MPI, since there's a lot less guesswork about why X performs well or not.

Here are some common approaches using shared memory:

Read-only data: If the data is small enough, it might be best to duplicate it across all the nodes, so that each processor reads its own local copy.

If your memory accesses have extremely high spatial locality that doesn't "migrate" around, organize your data so that each "group" of spatially local accesses lives on the same node.
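One way to get that kind of placement, as a rough sketch: initialise the data with the same static partitioning you later compute with. This assumes an OS with a first-touch page placement policy (the usual default on Linux), OpenMP enabled at compile time (for example `-fopenmp` with GCC), and threads pinned to cores (for example via `OMP_PROC_BIND`). The function names `alloc_first_touch` and `sum_partitioned` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and initialise them in parallel.
 * malloc only reserves address space for large requests; under a
 * first-touch policy (assumed here) a page is typically placed on the
 * node of the thread that first writes to it. */
double *alloc_first_touch(size_t n)
{
    double *data = malloc(n * sizeof *data);
    if (data == NULL)
        return NULL;

    /* schedule(static) gives each thread one contiguous index range. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        data[i] = (double)i;

    return data;
}

/* Sum the array using the same static partitioning as the
 * initialisation, so each thread mostly reads pages it touched first. */
double sum_partitioned(const double *data, size_t n)
{
    assert(data != NULL || n == 0);

    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += data[i];

    return total;
}
```

Without OpenMP the pragmas are ignored and the code simply runs serially, so it stays correct either way. Whether the placement actually helps depends on the thread count and pinning staying the same between the two loops, so measure before relying on it.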

If your memory access pattern exhibits high temporal locality, but not enough spatial locality to fit into cache, then consider copying the data into a local buffer. Once the work is done, copy it back. This lets you keep the same structure of the program.
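The copy-in, work, copy-back idea can be sketched roughly like this. It assumes the calling thread is pinned to one node, so that the freshly allocated buffer typically ends up in that node's memory. `process_block_locally` and `smooth_repeatedly` are hypothetical names, and the smoothing loop is only a stand-in for whatever work revisits the same elements many times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for real work with high temporal locality: each pass
 * rereads and rewrites every element of the buffer. */
static void smooth_repeatedly(double *buf, size_t n, int passes)
{
    for (int p = 0; p < passes; p++)
        for (size_t i = 1; i < n; i++)
            buf[i] = 0.5 * (buf[i] + buf[i - 1]);
}

/* Copy shared[offset .. offset+count) into a private buffer, do all the
 * repeated work there, then write the result back in one pass.
 * Returns 0 on success, -1 if the buffer could not be allocated. */
int process_block_locally(double *shared, size_t offset, size_t count,
                          int passes)
{
    assert(shared != NULL);
    if (count == 0)
        return 0;

    double *local = malloc(count * sizeof *local);
    if (local == NULL)
        return -1;

    /* One streaming read of the (possibly remote) shared data. */
    memcpy(local, shared + offset, count * sizeof *local);

    /* All repeated accesses now hit the private copy. */
    smooth_repeatedly(local, count, passes);

    /* One streaming write back to the shared structure. */
    memcpy(shared + offset, local, count * sizeof *local);

    free(local);
    return 0;
}
```

The rest of the program keeps operating on `shared` as before; only the hot section is redirected to the private copy. This pays off only when the work done per element clearly outweighs the two copies.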

EDIT: Consider adding the "NUMA" tag to your question.

OTHER TIPS

Both OpenMP and Open MPI allow data to be shared across multiple CPUs. I would imagine that using one of these APIs is likely to be faster than anything you implement yourself. Which one to use depends on the exact nature of what you are trying to implement.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow