Question

Here is what we have currently:

  1. We retrieve a cached Django model instance; the cache key includes the model name and the instance id. Django's standard memcached backend is used. This lookup is part of a common routine used very widely, not only in Celery.
  2. Sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance, and even a same-model-different-id case has appeared. We catch this by checking that the model name and id match the cache key (see the sketch after this list).
  3. The bug appears only in the context of three of our Celery tasks and never reproduces in the Python shell or in other Celery tasks. UPD: it appears only under long-running, CPU- and RAM-intensive tasks.
  4. The cache stores the correct value (we checked that manually at the moment the bug appeared).
  5. Calling the same task again with the same arguments may not reproduce the issue, although the probability is much higher then, so bug appearances tend to "group" within the same period of time.
  6. Restarting Celery solves the issue for a random period of time (minutes to weeks).
  7. *NEW* This is not connected with memory overflow: we always have at least 2 GB of free RAM when it happens.
  8. *NEW* We have cache_instance = cache.get_cache("cache_entry") in static code. During investigation I found that at the moment the bug happens, cache_instance.get(key) returns the wrong value, although get_cache("cache_entry").get(key) on the next line returns the correct one. This means either the bug disappears too quickly, or the cache_instance object somehow gets corrupted. Isn't the cache instance object returned by Django's cache thread-safe?
  9. *NEW* We logged a very strange case: as another wrong object from the cache, we got a model instance without its id set. This means the instance was never saved to the DB and therefore could never have been cached. (I hope.)
  10. *NEW* At least one MemoryError was logged during those days.
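
For reference, the check from point 2 looks roughly like this (a simplified sketch, not our exact code; get_cached_instance and the key format are illustrative):

from django.core.cache import cache

def get_cached_instance(model_cls, instance_id):
    # The key encodes the model name and id, e.g. "Article:42".
    key = "{}:{}".format(model_cls.__name__, instance_id)
    value = cache.get(key)
    # Sanity check: the cached value must be an instance of the
    # expected model with the expected id; otherwise treat it as
    # a miss and log the mismatch.
    if not isinstance(value, model_cls) or value.pk != instance_id:
        return None
    return value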

I know, all of this sounds like some sort of magic. Really, any ideas on how this is possible, or how to debug it, would be very much appreciated.

PS: My current assumption is that this is connected with multiprocessing: the cache instance is created in static code before the worker processes are forked, so all workers end up sharing the same socket. (Does that sound plausible?)


Solution 2

This had been bugging me for a while until I found this question and answer. I just want to add some things I've learned.

You can easily reproduce this problem with a local memcached instance:

from django.core.cache import cache
import os

def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            print("Unexpected response {} in process {}. Attempt {}/5".format(
                    value, pid, x + 1))
    os._exit(0)

# Touching the cache before forking creates the memcached socket in
# the parent, so every child inherits (and shares) the same socket.
cache.set("access cache", "before fork")
for x in range(5):
    if os.fork() == 0:
        write_read_test()

What you can do is close the cache client, as Django does in the request_finished signal:

https://github.com/django/django/blob/master/django/core/cache/__init__.py#L128

If you call cache.close() after the fork, everything works as expected.
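
For example, in the reproduction script above, the child process can drop the inherited connection before touching the cache:

for x in range(5):
    if os.fork() == 0:
        cache.close()  # drop the socket inherited from the parent
        write_read_test()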

For Celery, you could connect to a signal that is fired after the worker process is forked and execute cache.close() there.
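
A minimal sketch of that (worker_process_init is the signal Celery dispatches in each pool process after it is forked; the handler name is just an example):

from celery.signals import worker_process_init
from django.core.cache import cache

@worker_process_init.connect
def close_inherited_cache_connection(**kwargs):
    # Drop the memcached socket inherited from the parent process;
    # this worker will open its own connection on first use.
    cache.close()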

This also affects gunicorn when --preload is active and the cache is initialized before the workers are forked.

For gunicorn, you could use post_fork in your gunicorn configuration:

def post_fork(server, worker):
    # Runs in every worker process right after the master forks it.
    from django.core.cache import cache
    cache.close()
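
This hook goes in the configuration file you pass to gunicorn, e.g. gunicorn -c gunicorn_conf.py myproject.wsgi (the file and module names here are just examples).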

OTHER TIPS

Solved it finally:

  1. Celery has a dynamic scaling feature: it can add or kill workers according to load.
  2. It adds workers by forking an existing one.
  3. Open sockets and files are copied into the forked process, so both processes share them. This leads to a race condition in which one process reads a response intended for the other: it's possible that one process reads the response meant for the second one, and vice versa.
  4. The object imported via from django.core.cache import cache stores a pre-connected memcached socket. Don't use it when your process can be dynamically forked, and don't use stored connections, pools, and the like.
  5. OR store them keyed by the current PID, and check the PID each time you access the cache (see the sketch below).
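
A minimal sketch of that PID guard (illustrative, not from the original answer; it assumes the backend reconnects lazily after close(), as Django's memcached backend does):

import os
from django.core.cache import cache

_owner_pid = None

def process_local_cache():
    # If the stored connection was created in another process (i.e.
    # inherited through a fork), drop it; the backend reconnects
    # lazily on the next get/set.
    global _owner_pid
    if _owner_pid != os.getpid():
        cache.close()
        _owner_pid = os.getpid()
    return cache

# usage: value = process_local_cache().get(key)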