How to pin threads to cores with predetermined memory pool objects? (80-core Nehalem architecture, 2 TB RAM)

Question 1

You may be underestimating the issue: there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating system level to get things set up the way you want. The relevant concepts are "CPU affinity" and "memory affinity", and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, these affinities are normally handled by an MPI library such as Open MPI. You might want to consider using one and letting it manage your processes. The interface between operating system, MPI library and Python can be provided by the mpi4py package.

You also need to get your concepts of threads, processes and the OS settings straight. For the CPU time scheduler, a thread is a task to be scheduled and can therefore have an individual affinity. I have mostly seen affinity masks applied to entire processes, i.e. to all threads within one process, but on Linux the mask is a per-thread attribute: sched_setaffinity accepts a thread ID, and pthread_setaffinity_np does the same for pthreads. For controlling memory access, NUMA (non-uniform memory access) is the right keyword and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
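For what it is worth, Linux treats the affinity mask as a per-thread attribute, so a single thread can be pinned from Python with the standard library alone. The following is a minimal sketch, assuming Linux and Python 3.8 or later; the helper names `pin_current_thread` and `demo` are made up for illustration. It pins a worker thread to one CPU and checks that the main thread keeps its original mask.

```python
import os
import threading


def pin_current_thread(cpu):
    """Pin only the calling thread to one CPU and return its new mask.

    Assumption: Linux, where os.sched_setaffinity accepts a kernel
    thread ID (as returned by threading.get_native_id) in place of a PID.
    """
    tid = threading.get_native_id()
    os.sched_setaffinity(tid, {cpu})
    return os.sched_getaffinity(tid)


def demo():
    """Pin a worker thread and report the masks of worker and main thread."""
    allowed = os.sched_getaffinity(0)   # mask of the calling (main) thread
    target = min(allowed)               # any CPU we are allowed to run on
    result = {"allowed": allowed}

    def worker():
        result["worker"] = pin_current_thread(target)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()

    # The main thread was not touched, so its mask should be unchanged.
    result["main"] = os.sched_getaffinity(0)
    return result
```

Calling `demo()` returns a dict with the worker's mask reduced to a single CPU and the main thread's mask left as it was. Note that this controls only where the thread runs; where its memory is allocated is a separate NUMA policy question.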

In any case, you should read up on affinity. The Open MPI FAQ on CPU/memory affinity is a good place to start: http://www.open-mpi.de/faq/?category=tuning#paffinity-defs

If you want to achieve your goal without an MPI library, look into the packages util-linux or schedutils and numactl of your Linux distribution. They provide useful command-line tools such as taskset, which you could call from within Python, for example, to set affinity masks for certain process IDs.
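Calling these tools from Python could look roughly like the sketch below. The helper names (`taskset_command`, `numactl_command`, `run_if_available`) are hypothetical; the flags used are the long options of util-linux `taskset` and of `numactl`. Commands are built as argument lists so they can be inspected before anything is executed.

```python
import shutil
import subprocess


def taskset_command(pid, cpus, all_threads=True):
    """Build a taskset call that restricts an existing PID to the given CPUs."""
    cmd = ["taskset"]
    if all_threads:
        cmd.append("--all-tasks")  # apply to every thread of the process
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    cmd += ["--cpu-list", "--pid", cpu_list, str(pid)]
    return cmd


def numactl_command(node, argv):
    """Build a numactl call that starts argv with CPUs and memory on one node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *argv]


def run_if_available(cmd):
    """Run the command if its executable is installed, otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_if_available(taskset_command(os.getpid(), {0, 1}))` would restrict the current process to CPUs 0 and 1 (with `os` imported), while `numactl_command(0, ["python", "worker.py"])` builds a command that launches a worker script, here an assumed `worker.py`, bound to NUMA node 0.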

This article vividly describes how an MPI library can help with your issue:

http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/

This SO answer describes how to inspect your hardware architecture: https://stackoverflow.com/a/11761943/145400
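Independently of the tools in that answer, the NUMA layout can also be read from Python. This is a rough sketch, assuming a Linux kernel that exposes NUMA nodes under `/sys/devices/system/node`; the function names `parse_cpulist` and `numa_nodes` are made up for illustration. It returns an empty dict where that directory does not exist.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map each NUMA node ID to the CPUs that belong to it."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"(\d+)$", path).group(1))
        try:
            with open(os.path.join(path, "cpulist")) as fh:
                nodes[node_id] = parse_cpulist(fh.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes
```

The resulting mapping tells you which cores share a memory controller, which is the information needed to decide which worker should be pinned next to which memory pool.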

Generally, I wonder whether the machine you are using is the right one for the task, or whether you are optimizing at the wrong end. If you are messaging within one machine and hitting memory bandwidth limits, I am not sure ZMQ (through TCP/IP, right?) is the right tool for the messaging at all. That brings us back to MPI, the message passing interface designed for HPC applications.

Question 2

I wonder whether this might be amenable to Pyro (Python Remote Objects). It may be worth investigating, but unfortunately I do not have access to such hardware to try it.

As explained in the documentation, while Pyro is often used to distribute work across multiple machines on a network, it can also be used to share processing between cores on a single machine.

On a lower level Pyro is just a form of inter-process communication. So everywhere you would otherwise have used a more primitive form of IPC (such as plain TCP/IP sockets) between Python components, you could consider to use Pyro instead.

While Pyro may add some overhead, it may well speed things up overall and should make the code more maintainable.