Question

I am having some real trouble finding this info online; I'm in uni on Monday so I could use the library then, but the sooner the better. When a system has multicore processors, does each processor take a thread from the first process in the ready queue, or does it take one from the first and one from the second? Also, just to check: threads will be sent to and fetched from the cores concurrently by the OS, right? If anyone could point me in the right direction resource-wise, that would be great!

Solution

The key thing is to appreciate what the machine's architecture actually is.

A "core" is a CPU with cache with a connection to the system memory. Most machine architectures are Symmetric Multi-Processing, meaning that the system memory is equally accessible by all cores in the system.

Most operating systems run a scheduler thread on each core (Linux does). Each scheduler has a list of threads it is responsible for, and it will run them to the best of its ability on the core that it controls. The rules it uses to choose which thread to run next are round robin, priority based, and so on; i.e. all the normal scheduling rules. So far it is just like the scheduler you would find in a single-core computer. To some extent each scheduler is independent of all the others.
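To make the round-robin case concrete, here is a minimal, hypothetical sketch in C of one core's run queue. Real schedulers (Linux's, for example) track far more per-thread state and support priorities; the names and the three-thread ring are invented for illustration:

    #include <stdio.h>

    /* Hypothetical per-thread descriptor; a real kernel's
     * (e.g. Linux's task_struct) holds far more state. */
    struct thread {
        int tid;
        struct thread *next;   /* circular run-queue link */
    };

    /* One run queue per core; "current" is the thread now running. */
    struct runqueue {
        struct thread *current;
    };

    /* Round robin: the next thread in the ring gets the core. */
    static struct thread *pick_next(struct runqueue *rq)
    {
        rq->current = rq->current->next;
        return rq->current;
    }

    int main(void)
    {
        struct thread t1 = { 1, NULL }, t2 = { 2, NULL }, t3 = { 3, NULL };
        t1.next = &t2; t2.next = &t3; t3.next = &t1;   /* ring of three */

        struct runqueue rq = { &t1 };
        for (int tick = 0; tick < 6; tick++)           /* six timer ticks */
            printf("tick %d: run thread %d\n", tick, pick_next(&rq)->tid);
        return 0;
    }

Each core would own one such run queue and call something like pick_next on every timer tick.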

However, this is an SMP environment, meaning that it really doesn't matter which core runs which thread. This is because all the cores can see all the memory, and the code and data for every thread in the entire system is stored in that single memory.

So the schedulers talk amongst themselves to help each other out. A scheduler with too many threads to run can pass a thread over to a scheduler whose core is under-utilised; they are load balancing within the machine. "Pass a thread over" means copying the data structure that describes the thread (thread ID, which data, which code).
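As a rough illustration, here is a hypothetical C sketch of that hand-over: two per-core run queues and a balance step that moves thread descriptors from the busy queue to the idle one. A real kernel would hold locks around both queues (both cores touch them) and use far smarter heuristics:

    #include <stdio.h>

    /* Hypothetical descriptor: enough state that another core's
     * scheduler can resume the thread. */
    struct thread {
        int tid;
        struct thread *next;
    };

    /* A singly linked run queue per core. */
    struct runqueue {
        struct thread *head;
        int length;
    };

    static void enqueue(struct runqueue *rq, struct thread *t)
    {
        t->next = rq->head;
        rq->head = t;
        rq->length++;
    }

    static struct thread *dequeue(struct runqueue *rq)
    {
        struct thread *t = rq->head;
        if (t) { rq->head = t->next; rq->length--; }
        return t;
    }

    /* Load balancing: while one queue is at least two threads longer
     * than the other, hand a descriptor across. */
    static void balance(struct runqueue *busy, struct runqueue *idle)
    {
        while (busy->length > idle->length + 1) {
            struct thread *t = dequeue(busy);
            enqueue(idle, t);
            printf("migrated thread %d\n", t->tid);
        }
    }

    int main(void)
    {
        struct thread t[4] = { {1}, {2}, {3}, {4} };
        struct runqueue core0 = {0}, core1 = {0};
        for (int i = 0; i < 4; i++)
            enqueue(&core0, &t[i]);      /* core 0 starts overloaded */

        balance(&core0, &core1);
        printf("core0: %d threads, core1: %d threads\n",
               core0.length, core1.length);
        return 0;
    }

Note that nothing is copied except the descriptor itself: the thread's code and data stay where they are in the shared memory, which is exactly why the hand-over is cheap on SMP.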

So that's about it. As the only communication between cores is via memory, it all relies on an effective mutual exclusion system being available, which is something the hardware has to provide, typically in the form of atomic read-modify-write instructions such as test-and-set or compare-and-swap.
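For instance, a spinlock can be built directly on the hardware's atomic test-and-set. This sketch uses the GCC/Clang __atomic builtins (compile with -pthread); it shows the principle, not production locking code:

    #include <pthread.h>
    #include <stdio.h>

    static char lock = 0;        /* 0 = free, nonzero = held */
    static long counter = 0;     /* shared state the lock protects */

    static void spin_lock(char *l)
    {
        /* Atomically set the flag; loop while someone else holds it. */
        while (__atomic_test_and_set(l, __ATOMIC_ACQUIRE))
            ;  /* spin */
    }

    static void spin_unlock(char *l)
    {
        __atomic_clear(l, __ATOMIC_RELEASE);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock(&lock);
            counter++;           /* critical section */
            spin_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expect 200000)\n", counter);
        return 0;
    }

Without the lock the two cores would interleave their read-increment-write sequences and lose updates; the atomic instruction is what makes the mutual exclusion work across cores.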

The Difficulty

So I've painted a very simple picture, but in practice the memory is not perfectly symmetrical. SMP these days is synthesised on top of point-to-point interconnects such as HyperTransport (AMD) and QPI (Intel).

Long gone are the days when cores really did have equal access to the system memory at the electronic level. At the very lowest layer of their architecture AMD are purely NUMA, and Intel nearly so.

Nowadays a core has to send a request to other cores over a high-speed serial link (HyperTransport or QPI) asking them to send data that they've got in their attached memory. Intel and AMD have done a good job of making it look convincingly like SMP in the general case, but it's not perfect. Data in memory attached to a different core takes longer to get hold of. It's insanely complex - the cores are now nodes on a network - but it's what they've had to do to get improved performance.

So schedulers take that into account when choosing which core should run which thread. They will try to place a thread on a core that is close to the memory holding the data that the thread is working on.
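Applications can give the same hint themselves. On Linux, the libnuma library lets a program allocate memory on a specific node and run on a core attached to that node; the sketch below (link with -lnuma) keeps a thread and its buffer together on node 0, so its accesses never cross the interconnect. Node 0 and the buffer size are arbitrary choices for illustration:

    /* Linux-specific sketch using libnuma (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return 1;
        }

        int node = 0;                       /* arbitrary node for the demo */
        size_t size = 64 * 1024 * 1024;

        /* Allocate memory from node 0's local bank... */
        char *buf = numa_alloc_onnode(size, node);
        if (!buf) return 1;

        /* ...and run this thread on a CPU attached to node 0,
         * so the accesses below all stay local. */
        numa_run_on_node(node);

        memset(buf, 0, size);               /* local accesses only */
        printf("touched %zu MiB on node %d\n", size >> 20, node);

        numa_free(buf, size);
        return 0;
    }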

The Future, Again

If the world's software ecosystem could be weaned off SMP the hardware guys would be able to save a lot of space on the silicon, and we would have faster more efficient systems. This has been done before; Transputers were a good attempt at a strictly NUMA architecture.

NUMA and Communicating Sequential Processes (CSP) would today make it far easier to write multi-threaded software that scales well and runs more efficiently than today's SMP shared-memory behemoths.

SMP was in effect a cheap and nasty way of bringing together multiple cores, and the cost in terms of software development difficulties and inefficient hardware has been very high.
