Harvesting the power of highly-parallel computers with python scientific code [closed]

Question 1

First of all, I would like to stress that the problem that Uri described in his question is indeed faced by many people doing scientific computing. It may be not easy to see if you work with a developed code base that has a well defined scope - things do not change as fast as in scientific computing or data analysis. This page has an excellent description why one would like to have a simple solution for parallelizing pieces of code.

So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!

Question 2

I might be wrong, but simply using GNU command line utilities, like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:

cat inputs | parallel ./job.py --pipe > results 2> exceptions

This will execute job.py for every line of inputs in parallel, output successful results into results, and failed ones to exceptions. A lot of examples of usage (also for scientific Python scripts) can be found in this Biostars thread.

And, for completeness, Parallel documentation.

Question 3

If you with "have access to a ~100 CPUs machines" mean that you have access to 100 machines each having multiple CPUs and in case you want a system that is generic enough for different kinds of applications, then the best possible (and in my opinion only) solution is to have a management layer between your resources and your job input. This is nothing Python-specific at all, it is applicable in a much more general sense. Such a layer manages the computing resources, assigns tasks to single resource units and monitors the entire system for you. You make use of the resources via a well-defined interface as provided by the management system. Such as management system is usually called "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting each of them up is not trivial at all, but also your request is not trivial.

Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.

You realize that you don't get around the effort in properly implementing such a resource pool. But what you can do is totally isolate this effort form your application layer. You once implement such a managed resource pool in a generic fashion, ready to be used by any application from a common interface. This interface is usually implemented at the command line level by providing job submission, monitoring, deletion, ... commands. It is up to you to define what a job is and which resources it should consume. It is up to the job queueing system to assign your job to specific machines and it is up to the (properly configured) operating system and MPI library to make sure that the communication between machines is working.

In case you need to use hardware distributed among multiple machines for one single application and assuming that the machines can talk to each other via TCP/IP, there are Python-based solutions implementing so to say less general job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging / queueing instance such as Redis or ZeroMQ).

Question 4

Usually you write the code iteratively, as a script which perform some computation.

This makes me think you'd really like ipython notebooks

A notebook is a file that has a structure which is a mix between a document and an interactive python interpreter. As you edit the python parts of the document they can be executed and the output embedded in the document. It's really good programming where you're exploring the problem space, and want to make notes as you go.

It's also heavily integrated with matplotlib, so you can display graphs inline. You can embed Latex math inline, and many media objects types such as pictures and video.

Here's a basic example, and a flashier one

Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.

Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.

This makes me think you'd really like ipython clusters

iPython clusters allow you to run parallel programs across multiple machines. Programs can either be SIMD (which sound like your case) or MIMD style. Programs can be edited and debugged interactively.

There were several talks about iPython at the recent SciPy event. Going onto PyVideo.org and searching gives numerous videos, including:

I not watched all of these, but they're probably a good starting point.