Question

I've encountered some strange application behaviour while interacting with a database from multiple processes. I'm using Linux.

I have my own implementation of QueryExecutor, which uses a single connection during its lifetime:

class QueryExecutor(object):
    def __init__(self, db_conf):
        self._db_config = db_conf
        self._conn = self._get_connection()

    def execute_query(self, query):
        # some code
        ...
    # some more code

_QUERY_EXECUTOR = None

def query_executor():
    global _QUERY_EXECUTOR
    if _QUERY_EXECUTOR is None:
        _QUERY_EXECUTOR = QueryExecutor(some_db_config)
    return _QUERY_EXECUTOR

The QueryExecutor instance is never modified after instantiation.

Initially there is only one process, which from time to time forks (os.fork()) several times. The new processes are workers that perform some tasks and then exit. Each worker calls query_executor() so it can execute SQL queries.

I have found that SQL queries often return wrong results (it seems that sometimes a query's result is delivered to the wrong process). The only sensible explanation is that all processes share the same SQL connection (according to the MySQLdb docs: threadsafety = 1 — threads may share the module, but not connections).

I wonder which OS mechanism leads to this situation. As far as I know, on Linux when a process forks, the parent's pages are not copied for the child; they are shared by both processes until one of them tries to modify a page (copy-on-write). As I mentioned, the QueryExecutor object remains unmodified after creation. I suspect this is why all processes use the same QueryExecutor instance and hence the same SQL connection.
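The kernel-level sharing in question can be observed directly without any database: a file descriptor opened before fork() refers to one shared open file description, so a read in the child moves the parent's file position too. A minimal Linux demonstration (standard library only, no MySQLdb involved):

```python
import os
import tempfile

# Create a file with known contents, then open it BEFORE forking so
# parent and child share one open file description (and its offset).
fd_tmp, path = tempfile.mkstemp()
os.write(fd_tmp, b"abcdef")
os.close(fd_tmp)

fd = os.open(path, os.O_RDONLY)

pid = os.fork()
if pid == 0:
    # Child: consume two bytes, advancing the shared offset.
    os.read(fd, 2)
    os._exit(0)

os.waitpid(pid, 0)
# Parent: this read starts at offset 2, because the offset lives in the
# shared open file description, not separately in each process.
data = os.read(fd, 2)
os.close(fd)
os.unlink(path)
print(data)  # b'cd' -- the child's read moved the parent's position
```

The same applies to the socket underlying a database connection, which is why reads from two processes interleave.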

Am I right or do I miss something? Do you have any suggestions?

Thanks in advance!

Grzegorz


Solution

The root of the problem is that fork() creates an exact, independent copy of the process, but the two processes share open files, sockets and pipes. That's why data written by the MySQL server can be read [correctly] from only one process: if two processes both make requests and read responses over the same connection, they will quite likely mess up each other's work. This has nothing to do with "multithreading", because in the multithreaded case there is a single process with a few threads of execution that share data and can coordinate.
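The "response delivered to the wrong process" symptom can be reproduced deterministically with a socketpair standing in for the database connection (the REPLY-ONE/REPLY-TWO payloads are invented for the sketch):

```python
import os
import socket

# Simulate a DB connection: "server" writes replies, "client" is the
# connection object the application holds and inherits across fork().
client, server = socket.socketpair()

server.sendall(b"REPLY-ONE")  # reply intended for the parent's query

pid = os.fork()
if pid == 0:
    # The forked worker inherits the SAME socket and consumes the
    # reply that was meant for the parent.
    client.recv(9)
    os._exit(0)

os.waitpid(pid, 0)
server.sendall(b"REPLY-TWO")  # reply intended for the worker's query
got = client.recv(9)          # ...but the parent receives it instead
client.close()
server.close()
print(got)
```

The parent asked first but ends up reading the second reply: exactly the cross-process result mix-up described in the question.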

The correct way to use fork() is to close (or re-open) all file-handle-like objects right after forking in all but one copy of the process, or at least to avoid using them from multiple processes.
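One way to apply this to the QueryExecutor from the question is to remember the pid that created the connection and reconnect lazily whenever the executor finds itself in a different (forked) process. This is a sketch, not the author's code; `_open_connection` is a placeholder for whatever really opens the connection (e.g. MySQLdb.connect):

```python
import os

class QueryExecutor(object):
    """Fork-aware sketch: each process lazily gets its own connection."""

    def __init__(self, db_conf):
        self._db_config = db_conf
        self._pid = os.getpid()
        self._conn = self._open_connection()

    def _open_connection(self):
        # Placeholder for MySQLdb.connect(**self._db_config) or similar.
        return object()

    def _ensure_own_connection(self):
        # After a fork the inherited connection is shared with the
        # parent, so a child must open a fresh one before using it.
        if os.getpid() != self._pid:
            self._conn = self._open_connection()
            self._pid = os.getpid()

    def execute_query(self, query):
        self._ensure_own_connection()
        # ... run the query on self._conn ...
```

With this check in place the module-level singleton can stay as it is: each worker reuses the object but never the parent's socket.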

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow