Question

I am observing some extremely odd behaviour in Python. Consider the following code:

from multiprocessing import Process
import scipy

def test():
    pass

for i in range(1000):
    p1 = Process(target=test)
    p1.start()
    p1.join()
    print i

When I run strace -f on this I get the following segment from the loop:

clone(Process 19706 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b23afde1970) = 19706
[pid 19706] set_robust_list(0x2b23afde1980, 0x18) = 0
[pid 18673] wait4(19706, Process 18673 suspended
 <unfinished ...>
[pid 19706] stat("/apps/python/2.7.1/lib/python2.7/multiprocessing/random", 0x7fff041fc150) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/randommodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.py", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.pyc", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/dev/urandom", O_RDONLY) = 3
[pid 19706] read(3, "\3\204g\362\260\324:]\337F0n\n\377\317\343", 16) = 16
[pid 19706] close(3)                    = 0
[pid 19706] open("/dev/null", O_RDONLY) = 3
[pid 19706] fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
[pid 19706] exit_group(0)               = ?
Process 18673 resumed
Process 19706 detached

What's up with all that junk about searching around the filesystem for 'random'? I really want to avoid this because I am running quite a lot of processes with this structure in parallel on a cluster, and looping quite fast, and this kind of filesystem activity is clogging up the filesystem metadata server, or so the cluster admins tell me.

If I remove the "import scipy" command then this problem goes away:

clone(Process 23081 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b42ec15e970) = 23081
[pid 23081] set_robust_list(0x2b42ec15e980, 0x18) = 0
[pid 22052] wait4(23081, Process 22052 suspended
 <unfinished ...>
[pid 23081] open("/dev/null", O_RDONLY) = 3
[pid 23081] fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
[pid 23081] exit_group(0)               = ?
Process 22052 resumed
Process 23081 detached

but I need scipy in my real code so I can't just get rid of it. Or maybe I can, but that would be a pain.

Does anyone have any idea why I am seeing this behaviour? In case it is a quirk of some version of something, I am running the following:

python: 2.7.1, multiprocessing: 0.70a1, scipy: 0.9.0

Actually, since I just realised it may be system dependent, I ran the same code on my laptop and had no problem (i.e. output equivalent to the second case). On the laptop I am running:

python: 2.6.5, multiprocessing: 0.70a1, scipy: 0.10.0

Perhaps it is a problem or bug in the earlier version of scipy that has been fixed? My searches for anything like this have turned up nothing. Even if it IS the problem, it is not so easy to change versions of scipy on the cluster, although I can probably get the cluster admins to build the newer version if needed.

Is this likely to be the problem?


Solution

This is not because of Windows or the __main__ module. Nor is this how Python likes doing business. And, if you will re-check, I think you will find that it is a behavior of Python 2.6 and not 2.7 unless you are running a modified 2.7.

You are entirely correct that the issue stems from the random-module initialization step in the multiprocessing.forking module — which is designed to prevent your process, when it forks to produce n workers, from creating workers that all step forward through exactly the same series of pseudo-random numbers (which could compromise security if, for example, they were all negotiating SSL connections using those numbers):

        if 'random' in sys.modules:
            import random
            random.seed()

But the key here is to realize that the above import statement ought to be a no-op from a system-call point of view, because if a module name is already present as a key in the sys.modules dictionary then import simply returns the value that it finds there without trying to go load anything from the filesystem:

>>> import sys
>>> sys.modules['fake'] = 'Not even a module'
>>> import fake
>>> fake
'Not even a module'

The if statement quoted above, therefore, is specifically trying to prevent the expense of an extra import in the case that the random module has not even been loaded. When you do the experiment without scipy loaded up, the if statement body never even fires.
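
As an aside, here is a minimal demonstration of the hazard that guard exists to prevent (a sketch for POSIX systems; the seed value 42 is arbitrary). Forked children inherit the parent's RNG state, so without reseeding they all march through identical pseudo-random sequences:

import os
import random

random.seed(42)     # the parent establishes some RNG state; 42 is arbitrary

for _ in range(2):
    pid = os.fork()
    if pid == 0:
        # Each child inherits an identical copy of the parent's RNG state,
        # so without a fresh random.seed() here both children print the
        # same value.
        print os.getpid(), random.random()
        os._exit(0)
    os.waitpid(pid, 0)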

So what is the problem?

The problem is that older versions of Python before 2.7 let you mean two different things by saying import foo in a module that lives inside of a package: you might be attempting a relative import of the_package.foo, or you might be attempting an import of the top-level package foo. See PEP 328 for the details on why this ambiguous and expensive behavior has now been changed in more recent versions of Python:

http://legacy.python.org/dev/peps/pep-0328/

With this background, you can review your strace output and notice something that no one has yet mentioned in the answers here: the stat() and open() system calls listed are not trying to import the module random but the non-existent module named multiprocessing.random!

This is the crucial reason that an additional import is being attempted even though random is already listed in sys.modules — because before Python 2.6 is allowed to fall back to the assumption that the import statement is really aiming to import random, it has to eliminate the possibility that it is instead attempting a relative import of multiprocessing.random since the import statement appears in the code of the multiprocessing.forking sub-module.
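
To make that concrete, here is a hypothetical sketch; the package name mypkg and module name worker.py are invented purely for illustration:

# Layout:
#
#     mypkg/
#         __init__.py
#         worker.py
#
# Inside mypkg/worker.py, under Python 2 semantics without the future import:

import random    # first probes mypkg/random.so, mypkg/randommodule.so,
                 # mypkg/random.py and mypkg/random.pyc -- the very
                 # stat()/open() calls visible in the strace -- and only
                 # then falls back to the cached top-level random module

# Putting this at the top of worker.py opts in to PEP 328 semantics and
# skips the relative probe entirely:
#
#     from __future__ import absolute_import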

The programmer ought really to have said sys.modules['random'].seed() instead of trying a fresh import to spare you those extra system calls. But hopefully you will not be troubled long by this behavior, once you have the chance to upgrade to a more recent version of Python.
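
In code, that suggested fix amounts to something like this minimal sketch:

import sys

# Reseed via the module object the interpreter has already cached; no import
# machinery runs, so no stat() or open() calls touch the filesystem.
if 'random' in sys.modules:
    sys.modules['random'].seed()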

OTHER TIPS

This is what Python does when importing a module. There is nothing wrong with it. After the first access, things will be in the filesystem cache anyway, so it's pretty unlikely that this is causing any issues.

Python checks every directory on sys.path (which includes any entries from PYTHONPATH) for each valid filename a module with the given name could have. A similar thing happens when you run a compiled program which uses dynamic libraries: the dynamic linker will also search various locations for each library until it finds it.
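
You can watch one round of this search from within Python itself; here is a quick sketch using the Python 2 imp module (the printed path will vary by installation):

import imp

# find_module walks sys.path and, in each directory, tries the candidate
# filenames seen in the strace (random.so, randommodule.so, random.py,
# random.pyc) until one opens successfully.
f, pathname, description = imp.find_module('random')
print pathname    # e.g. /usr/lib/python2.7/random.py
if f is not None:
    f.close()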

OK, so it seems like ThiefMaster is perfectly correct and nothing is going wrong, although I still don't like it and am going to avoid it. But first, this is what is happening. In multiprocessing.forking the following occurs:

class Popen(object):

    def __init__(self, process_obj):
        sys.stdout.flush()
        sys.stderr.flush()
        self.returncode = None

        self.pid = os.fork()
        if self.pid == 0:
            if 'random' in sys.modules:
                import random
                random.seed()
            code = process_obj._bootstrap()
            sys.stdout.flush()
            sys.stderr.flush()
            os._exit(code)

So if 'random' is in sys.modules then it indeed imports random and uses it to generate a new random seed. I suppose it might be nice for some applications to have this done automatically but I certainly wouldn't have expected it. Perhaps there is a good reason for doing it but I don't need it done.

Since my multiprocessing needs are very simple I am just doing the fork myself now:

    childpid = os.fork()
    if childpid == 0:
        ...run code...
        os._exit(0)
    else:
        os.waitpid(childpid, 0)

and of course this does no importing, so I get no searching for anything. It is also possible to make the searches go away by subclassing the appropriate bits of multiprocessing and just cutting out the 'import'. I don't know why the search wasn't happening on my laptop, given that I was running the same version of multiprocessing.
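
Put together with the loop from the question, the whole workaround looks roughly like this sketch, with test() standing in for the real work:

import os

def test():
    pass    # stand-in for the real work

for i in range(1000):
    childpid = os.fork()
    if childpid == 0:
        test()                     # child: do the work...
        os._exit(0)                # ...then exit without cleanup or imports
    else:
        os.waitpid(childpid, 0)    # parent: reap the child before looping
        print i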

What is your operating system? I'm guessing it's Windows. As ThiefMaster noted, the behavior is normal, but the reason you're getting it on every loop iteration is probably because multiprocessing imports the __main__ module on Windows. Try protecting your loop inside an if __name__ == "__main__" block.
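
For reference, a guarded version of the loop from the question would look like this:

from multiprocessing import Process

def test():
    pass

if __name__ == "__main__":
    # On Windows, multiprocessing re-imports the main module in each child
    # process; the guard keeps the loop from re-running during that import.
    for i in range(1000):
        p1 = Process(target=test)
        p1.start()
        p1.join()
        print i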

Licensed under: CC-BY-SA with attribution