Question

I am tuning the performance of my parallel Java program, and I am curious about the effects of the hardware architecture.

Given a machine with two CPU sockets, each one with a quad-core Intel Xeon CPU, then:

  • How do the two CPUs communicate, and how fast?
  • How fast do two cores on the same chip communicate?
  • Are the four cores on the same chip equivalent in terms of communication and memory access?

Solution

1) How do the two CPUs communicate, how fast would they communicate?

Most of the time they communicate through memory, or through the nearest shared level of the memory hierarchy. (System memory counts as a shared level on both SMP and NUMA systems; on NUMA, part of it is accessed through the memory controller of another chip, so "non-uniform" simply means that some accesses are slower.)

2) How fast would two cores on the same chip communicate?

Cores on the same chip usually share an L2 or L3 cache. Cores on different chips communicate through memory, or through cache-to-cache transfers driven by the cache coherency protocol.

So in case 1 (different chips), the bandwidth for passing data between CPUs will be close to that of plain memory reads and writes. In case 2 (same chip), it can be higher, up to the read/write speed of the shared cache.

Communication latency will be several hundred CPU cycles in case 1 and several dozen in case 2.
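To get a rough feel for these latencies on a particular machine, two threads can bounce a shared flag back and forth and time the round trip. The following is only a sketch: the class name PingPong and the method measureRoundTripNanos are made up for illustration, Thread.onSpinWait() assumes Java 9 or later, and the JVM and the OS decide which cores the two threads land on, so the number reflects whatever placement they happened to get.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PingPong {
    // Shared flag both threads spin on; every hand-off moves its cache line between cores.
    private static final AtomicInteger turn = new AtomicInteger(0);

    /** Returns the average round-trip time in nanoseconds over the given number of round trips. */
    public static double measureRoundTripNanos(int roundTrips) {
        turn.set(0);
        Thread responder = new Thread(() -> {
            for (int i = 0; i < roundTrips; i++) {
                while (turn.get() != 1) {
                    Thread.onSpinWait(); // busy-wait for the "ping"
                }
                turn.set(0);             // answer with the "pong"
            }
        });
        responder.start();

        long start = System.nanoTime();
        for (int i = 0; i < roundTrips; i++) {
            turn.set(1);                 // send the "ping"
            while (turn.get() != 0) {
                Thread.onSpinWait();     // busy-wait for the "pong"
            }
        }
        long elapsed = System.nanoTime() - start;

        try {
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (double) elapsed / roundTrips;
    }

    public static void main(String[] args) {
        measureRoundTripNanos(100_000); // warm-up so the JIT compiles the loops
        System.out.printf("average round trip: %.1f ns%n", measureRoundTripNanos(1_000_000));
    }
}
```

To compare the same-chip and cross-chip cases, the process would have to be restricted to chosen cores from outside the JVM, for example with an OS tool such as taskset on Linux. The result is a wall-clock time in nanoseconds, so converting it to CPU cycles requires knowing the clock frequency.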

3) Are the four cores on the same chip equivalent in terms of communicating or memory accessing?

All four cores of the same chip usually have an equivalent distance to RAM, but this depends on the chip architecture and implementation; for example, some older Intel multicore chips were really two chips packed into a single package.

OTHER TIPS

How to schedule threads to cores for close-to-optimal memory performance depends on the memory access pattern, and doing so is usually not worth the trouble. If your program is in Java, you are probably not going to have the level of control required to get close to optimal performance.
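Standard Java has no API for pinning a thread to a core, so the part that stays under the program's control is the access pattern itself. One common approach, sketched here under that assumption, is to give each worker its own contiguous slice of the data so that threads do not keep touching the same cache lines. The class name ChunkedSum and the method parallelSum are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    /** Sums the array with the given number of worker threads (must be at least 1). */
    public static long parallelSum(long[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                // Each task reads one contiguous range [from, to) and nothing else.
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(() -> {
                    long sum = 0; // thread-local accumulator, no shared writes
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // combine the partial sums once at the end
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Where the threads actually run is still up to the OS scheduler; the sketch only limits how much data the threads share.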

Modern CPUs have integrated memory controllers, and modern multi-socket systems have distributed memory. This is called Non-Uniform Memory Access (NUMA).

In modern multi-socket Intel processors, communication between sockets is done with the QuickPath Interconnect (QPI).

QPI is the Intel architecture that specifies how this works. AMD's equivalent is HyperTransport. You can learn more about the various architectures here: System Architecture

An access to memory that misses in the Level 1 data cache might be serviced by the Level 2 cache (in the same socket), or by what Intel calls the "Last Level Cache" (LLC), which would be in the socket that has the memory controller for that memory address. A hit in the LLC of another socket could take a few tens of processor cycles, which is still much faster than accessing DRAM (more than one hundred processor cycles).
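The cache levels described above can be made visible with a pointer-chasing loop: each load depends on the previous one, so the time per step approximates the access latency of whichever level the working set fits into. This is a rough sketch rather than a calibrated benchmark; the class name PointerChase, the method names, the seed, and the chosen working-set sizes are all assumptions made for illustration.

```java
import java.util.Random;

public class PointerChase {
    /** Builds one random cycle over 0..n-1 (Sattolo's algorithm); next[i] is the index visited after i. */
    public static int[] randomCycle(int n, long seed) {
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            next[i] = i;
        }
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i); // j in [0, i), which yields a single cycle over all elements
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    /** Follows the chain from index 0 for the given number of steps and returns the final index. */
    public static int chase(int[] next, long steps) {
        int idx = 0;
        for (long s = 0; s < steps; s++) {
            idx = next[idx]; // each load depends on the previous one
        }
        return idx;
    }

    /** Returns the average time per dependent load, in nanoseconds. */
    public static double nanosPerAccess(int elements, long steps) {
        int[] next = randomCycle(elements, 42L);
        chase(next, steps); // warm-up pass for the JIT and the caches
        long start = System.nanoTime();
        int end = chase(next, steps);
        long elapsed = System.nanoTime() - start;
        if (end < 0) {
            throw new AssertionError("unreachable"); // keeps the result live so the loop is not optimized away
        }
        return (double) elapsed / steps;
    }

    public static void main(String[] args) {
        // Working sets from 16 KiB to 64 MiB (4 bytes per int).
        for (int kib = 16; kib <= 64 * 1024; kib *= 4) {
            int elements = kib * 1024 / 4;
            System.out.printf("%6d KiB: %.2f ns per access%n",
                    kib, nanosPerAccess(elements, 20_000_000L));
        }
    }
}
```

The time per access would be expected to rise in steps as the working set outgrows each cache level. On a NUMA machine the largest sizes also depend on which socket's memory the array was allocated from, which plain Java does not control.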

Licensed under: CC-BY-SA with attribution