Suspend, save to disk, restart long jobs on a supercomputer with PBS

https://stackoverflow.com/questions/22097575

18-10-2022

Question

I need to be able to suspend a "running script", have the OS save its state to disk, and later resume it by reading that state and continuing exactly from where it left off. The system is a 12-core compute node with 48GB of shared memory, running Linux. I have no admin rights and I log in remotely using ssh. The scripts and the executables they run do not use a GUI; it's all command line, and as far as I know they don't explicitly require network or sockets.

By "running script" (or "pipeline") I mean a bash script or a perl script, or a combination of the two, which spawns some C/C++ executables, possibly using OpenMP parallelisation, or spawns executables in parallel using gnu-parallel. So we are not talking about a single executable but a sequence of executables running either in parallel or in sequence, using implicit parallelisation over 12 cores with a common memory, glued together by several unix commands (e.g. awk).

I need to suspend and restart the pipeline because the scheduler (MOAB) kills, as a matter of system rules, all jobs running longer than 24h. The idea is to suspend a job and re-queue it. This technique is perfectly legitimate.

Modifying the executables' source code so that they all save state and later resume from it is not practical, as it would mean modifying several open-source executables to accept a 'save-state-and-suspend' signal: say ImageMagick's 'convert', or even a 'grep', a 'sed', an 'awk' and also perl! On top of that, one executable is closed-source, so no source code is available.

So, I believe I am in a situation where one (the only?) practical option would be to run my 'script/pipeline' in a so-called sandbox environment, e.g. QEMU (an emulator), which can hopefully be sent a signal to 'hibernate', save the state of all currently running programs within it by just saving the whole memory and cpu state to disk (48GB not a problem) and suspending.

I am not an expert in any of the above, so pardon my terminology or anything I say that is not valid. I am only sketching.

To recap: I am asking any of you with experience for a solution to suspending and restarting complex script jobs under Linux without resorting to modifying code to 'save state'. This solution should also be relatively computationally efficient, i.e. not end up wasting a lot of supercomputer power on running the emulator.

If you believe that the QEMU solution I described above is OK, then please, if you can, give some example of how to start with that, i.e. create an emulator Linux image from public ISOs, load the image, run the 'script', tell the emulator to 'suspend/hibernate' after 20h, and then resume the emulator by reading its state from the saved suspend state. All this, ideally, from the command line or via a script.

Any other solutions, as long as they are practical (for the given setting) are welcomed.

Please note: I have no admin rights but can install things in my homedir and have lots of harddisk space. Also, the programs do not use GUI, it's all command line, and as far as I know don't require explicitly network or sockets.

A positive side effect of the emulator solution would be that any such "pipeline" could be distributed to any OS (e.g. mac or win) where the 'sandbox'/emulator is implemented, without the complex process of recompiling everything and installing gnu-utils, bash, boost, etc. I find myself stuck in this situation many times.

thanks for your help, bliako.

Solution

I'm not sure which version of PBS you're using, but TORQUE offers integration with Berkeley Lab Checkpoint/Restart (BLCR). The most important requirement for BLCR is that all the nodes run exactly the same OS image. Setting it up is fairly involved, and is documented in the TORQUE docs.

Essentially, the pbs_mom daemons are configured to use BLCR, and whenever you stop a job the daemon uses BLCR to take a snapshot of the OS internal data structures that describe the exact state of the process, so that the same process can later be restarted from exactly the same point.
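As a rough illustration of the checkpoint/restart cycle described above, here is a minimal Python sketch that only builds the BLCR command lines and works out when to checkpoint; it does not run anything. It assumes BLCR's command-line tools (cr_run, cr_checkpoint, cr_restart) are already installed on the node, and the option names used here should be checked against the local man pages. The script name, PID and context file name are hypothetical, and the 24h/20h figures come from the question. With the TORQUE integration, pbs_mom would drive these steps for you rather than a hand-written wrapper.

```python
import shlex

WALLTIME_LIMIT_H = 24   # scheduler limit mentioned in the question
CHECKPOINT_AT_H = 20    # checkpoint time suggested in the question


def launch_cmd(script):
    """Command to start the pipeline under BLCR so it can be checkpointed."""
    return ["cr_run", "bash", script]


def checkpoint_cmd(pid, context_file):
    """Command to save the whole process tree rooted at pid, then terminate it.

    Assumption: --tree, --term and --file behave as described in the
    cr_checkpoint man page of the installed BLCR version.
    """
    return ["cr_checkpoint", "--tree", "--term", "--file", context_file, str(pid)]


def restart_cmd(context_file):
    """Command to resume the saved process tree in a later (re-queued) job."""
    return ["cr_restart", context_file]


def seconds_until_checkpoint(elapsed_s, checkpoint_at_h=CHECKPOINT_AT_H):
    """Seconds left before the checkpoint should be taken (never negative)."""
    return max(0, checkpoint_at_h * 3600 - elapsed_s)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    steps = [
        launch_cmd("pipeline.sh"),
        checkpoint_cmd(12345, "pipeline.ckpt"),
        restart_cmd("pipeline.ckpt"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
    print("checkpoint in", seconds_until_checkpoint(0), "seconds")
```

Checkpointing at 20h leaves a four-hour margin before the 24h limit for writing the context file to disk and re-queuing the job.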

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow