You might underestimate the issue, there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating system level to get things set up the way you want. You want to work with so-called "CPU affinity" and "memory affinity" and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, the named "affinities" are normally handled by an MPI library, such as Open MPI. You might want to consider using one and let your different processes be handled by that MPI library. The interface between operating system, MPI library and Python can be provided by the mpi4py package.
You also need to get your concept of threads and processes and the OS setting straight. While for the CPU time scheduler, a thread is a task to be scheduled and therefore theoretically could have an individual affinity, I am only aware of affinity masks for entire processes, i.e. for all threads within one process. For controlling memory access, NUMA (non-uniform memory access) is the right keyword and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
In any case, you need to read articles about the affinity topic and might want to start reading in the Open MPI FAQs about CPU/memory affinity: http://www.open-mpi.de/faq/?category=tuning#paffinity-defs
In case you want to achieve your goal without using an MPI library, look into the packages util-linux
or schedutils
and numactl
of your Linux distribution in order to get useful commandline tools such as taskset
, which you could e.g. call from within Python in order to set affinity masks for certain process IDs.
This article seems to vividly describe how an MPI library can be helpful with your issue:
http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/
This SO answer describes how you bisect your hardware architecture: https://stackoverflow.com/a/11761943/145400
Generally, I am wondering if the machine you are applying is the right one for the task or if you maybe are optimizing at the wrong end. If you are messaging within one machine and hitting memory bandwidth limits, I am not sure if ZMQ (through TCP/IP, right?) is the right tool at all to perform the messaging. Coming back to MPI, the message passing interface for HPC applications...