Question

I am running processes in parallel, and I need to create a database for each CPU process to write to. I only want as many databases as there are CPUs assigned on each server, so that the 100 jobs are written to 3 databases that can be merged afterwards.

Is there a worker id number or core id that I can use to identify each worker?

def workerProcess(job):
    db_path = os.path.join(r'c:\temp\db', workerid)  # workerid: whatever identifies this worker
    if workerDBexist(db_path):
        # process job into this database
        pass
    else:
        # first time this 'worker/core' is used: make the DB, then process
        makeDB(db_path)

import pp
ppservers = ()
ncpus = 3
job_server = pp.Server(ncpus, ppservers=ppservers)

for work in workItems:  # the 100 work items
    job_server.submit(workerProcess, (work,))

Solution

As far as I know, pp doesn't have any such feature in its API.

If you used the stdlib modules instead, your life would be a lot easier. For example, multiprocessing.Pool takes an initializer argument, which you could use to set up a database for each process; that database would then be available as a variable that each task could use.
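For example, a rough sketch of that approach (this is not your pp code; it assumes Python 3 and pretends the per-worker databases are sqlite3 files under c:\temp\db with a made-up results table, just for illustration):

import multiprocessing
import os
import sqlite3

worker_db = None  # set once per worker process by the initializer below

def init_worker(db_dir):
    # Runs once in each worker process: open (or create) that process's own database.
    global worker_db
    path = os.path.join(db_dir, 'worker{}.db'.format(os.getpid()))
    worker_db = sqlite3.connect(path)
    worker_db.execute('CREATE TABLE IF NOT EXISTS results (job TEXT, result TEXT)')

def worker_process(job):
    result = str(job)  # placeholder for the real work
    with worker_db:    # commits the insert on success
        worker_db.execute('INSERT INTO results VALUES (?, ?)', (str(job), result))

if __name__ == '__main__':
    db_dir = r'c:\temp\db'
    os.makedirs(db_dir, exist_ok=True)
    pool = multiprocessing.Pool(3, initializer=init_worker, initargs=(db_dir,))
    pool.map(worker_process, range(100))   # the 100 work items
    pool.close()
    pool.join()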

However, there is a relatively easy workaround.

Each process has a unique (at least while it's running) process ID.* In Python, you can access the process ID of the current process with os.getpid(). So, in each task, you can do something like this:

dbname = 'database{}'.format(os.getpid())

Then use dbname to open/create the database. I don't know whether by "database" you mean a dbm file, a sqlite3 file, a database on a MySQL server, or what. You may need to, e.g., create a tempfile.TemporaryDirectory in the parent, pass it to all of the children, and have them os.path.join it to the dbname (so after all the children are done, you can grab everything in os.listdir(the_temp_dir)).
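Roughly like this, say (a sketch only: it leans on pp's modules argument to make os importable in the workers, uses tempfile.mkdtemp rather than TemporaryDirectory to keep the cleanup explicit, and leaves the actual database work as a comment):

import os
import tempfile
import pp

def worker_process(job, db_dir):
    # One database file per worker process, named by that process's pid.
    dbname = os.path.join(db_dir, 'database{}'.format(os.getpid()))
    # ... open/create dbname with whatever database library you're using,
    # process `job`, and write the results into it ...
    return dbname

if __name__ == '__main__':
    temp_dir = tempfile.mkdtemp()          # shared directory all the children write into
    job_server = pp.Server(3, ppservers=())
    jobs = [job_server.submit(worker_process, (work, temp_dir), modules=('os',))
            for work in range(100)]
    for job in jobs:
        job()                              # calling a pp job waits for it to finish
    databases = sorted(os.listdir(temp_dir))
    # ... merge the files in `databases` into one, then clean up temp_dir ...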


The problem with this is that if pp.Server restarts one of the processes, you'll end up with 4 databases instead of 3. Probably not a huge deal, but your code should deal with that possibility. (IIRC, pp.Server usually doesn't restart the processes unless you pass restart=True, but it may do so if, e.g., one of them crashes.)

But what if (as seems to be the case) you're actually running each task in a brand-new process, rather than using a pool of 3 processes? Well, then you're going to end up with as many databases as there are processes, which probably isn't what you want. Your real problem here is that you're not using a pool of 3 processes, which is what you ought to fix. But are there other ways you could get what you want? Maybe.

For example, let's say you created three locks, one for each database, maybe as lockfiles. Then, each task could do this pseudocode:

for i, lockfile in enumerate(lockfiles):
    try:
        with lockfile:
            do stuff with databases[i]
            break
    except AlreadyLockedError:
        pass
else:
    assert False, "oops, couldn't get any of the locks"
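Concretely, that might look like this with create-if-absent lockfiles (Python 3; the databases and lockfiles lists, and the work done while holding the lock, are placeholders):

import contextlib
import os

@contextlib.contextmanager
def try_lockfile(path):
    # O_EXCL makes the create fail atomically if the lockfile already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)

def worker_process(job, databases, lockfiles):
    for database, lockfile in zip(databases, lockfiles):
        try:
            with try_lockfile(lockfile):
                # ... do stuff with `database` here ...
                break
        except FileExistsError:
            continue                      # someone else holds this one; try the next
    else:
        raise RuntimeError("oops, couldn't get any of the locks")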

If you can actually lock the databases themselves (with an flock, or with some API for the relevant database, etc.) things are even easier: just try to connect to them in turn until one of them succeeds.
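For instance, if those databases happened to be sqlite3 files, a sketch of that idea could lean on sqlite's own locking (BEGIN EXCLUSIVE fails quickly with "database is locked" if another connection holds the file):

import sqlite3

def grab_a_database(db_paths):
    # Try each file in turn; return a connection that holds an exclusive lock.
    for path in db_paths:
        conn = sqlite3.connect(path, timeout=0.1, isolation_level=None)
        try:
            conn.execute('BEGIN EXCLUSIVE')
            return conn   # caller does its writes, then conn.execute('COMMIT') and conn.close()
        except sqlite3.OperationalError:
            conn.close()  # locked by another worker; try the next one
    raise RuntimeError("couldn't lock any of the databases")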

As long as your code isn't actually segfaulting or the like,** and you never run more than 3 tasks at a time, there's no way all 3 lockfiles can be held by other tasks, so you're guaranteed to get one.


* This isn't quite true, but it's true enough for your purposes. For example, on Windows, each process has a unique HANDLE, and if you ask for its pid one will be generated if it didn't already have one. And on some *nixes, each thread has a unique thread ID, and the process's pid is the thread ID of the first thread. And so on. But as far as your code can tell, each of your processes has a unique pid, which is what matters.

** Even if your code is crashing, you can deal with that; it's just more complicated. For example, use pidfiles instead of empty lockfiles. Get a read lock on the pidfile, then try to upgrade to a write lock. If that fails, read the pid from the file and check whether any such process exists (e.g., on *nix, if os.kill(pid, 0) raises, there is no such process); if it doesn't exist, forcibly break the lock. Either way, now you've got a write lock, so write your pid to the file.
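On *nix, the liveness check in that recipe might look like this (Python 3 exception names; Windows needs a different check):

import os

def pid_is_alive(pid):
    # Signal 0 performs the existence/permission check without sending anything.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the lock is stale, break it
    except PermissionError:
        return True    # it exists, it just belongs to another user
    return True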

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow