Why does default_storage.exists() with the django-storages S3Boto backend cause a memory error with a large S3 bucket?

StackOverflow https://stackoverflow.com/questions/21120825

Question

I am experiencing what looks like a memory leak with django-storages using the S3Boto backend when running default_storage.exists().

I'm following the docs here: http://django-storages.readthedocs.org/en/latest/backends/amazon-S3.html

Here is the relevant part of my settings file:

    DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'

Here is what I do to repeat the issue:

    ./manage.py shell

    from django.core.files.storage import default_storage

    # Check default storage is right
    default_storage.connection
    >>> S3Connection:s3.amazonaws.com

    # Check I can write to a file
    file = default_storage.open('storage_test_2014', 'w')
    file.write("does this work?")
    file.close()
    file2 = default_storage.open('storage_test_2014', 'r')
    file2.read()
    >>> 'does this work?'

    # Run the exists command
    default_storage.exists("asdfjkl")  # This file doesn't exist - but the same thing
    # happens no matter what I put here - even if I put 'storage_test_2014'

    # Memory usage of the python process creeps up over the next 45 seconds,
    # until it nears 100%; the iPython shell then crashes
    >>> Killed

The only potential issue I've thought of is that my S3 bucket has 93,000 items in it - I'm wondering if .exists() is just downloading the whole list of files in order to check? If so, surely there must be another way? Unfortunately sorl-thumbnail calls this .exists() method when generating a new thumbnail, which makes thumbnail generation extremely slow.

Solution

Update (Jan 23, 2017)

To avoid this, you can simply pass preload_metadata=False when creating the Storage, or set AWS_PRELOAD_METADATA = False in your settings.

Thanks @r3mot for this suggestion in the comments.
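
As a minimal sketch of that settings-level fix (assuming the same backend as in the question):

    # settings.py
    DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
    AWS_PRELOAD_METADATA = False  # don't build the in-memory bucket listing

or, when constructing a storage instance yourself, per the note above:

    from storages.backends.s3boto import S3BotoStorage

    storage = S3BotoStorage(preload_metadata=False)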

Original Answer

In fact, it's because S3BotoStorage.exists makes a call to S3BotoStorage.entries, which is as follows:

    @property
    def entries(self):
        """
        Get the locally cached files for the bucket.
        """
        if self.preload_metadata and not self._entries:
            self._entries = dict((self._decode_name(entry.key), entry)
                                for entry in self.bucket.list(prefix=self.location))
        return self._entries
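
With preload_metadata enabled, the first call to exists therefore lists every key under self.location and builds a dict of boto Key objects in memory - all 93,000 of them in this bucket. The lookup path looks roughly like this (a paraphrase for illustration, not the verbatim backend source):

    def exists(self, name):
        name = self._normalize_name(self._clean_name(name))
        if self.entries:
            # Cache populated: a pure in-memory membership test
            return name in self.entries
        # Cache disabled: a single request for just this key
        k = self.bucket.new_key(self._encode_name(name))
        return k.exists()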

The best way to handle this situation would be to subclass S3BotoStorage as follows:

    from storages.backends.s3boto import S3BotoStorage, parse_ts_extended


    class MyS3BotoStorage(S3BotoStorage):
        def exists(self, name):
            # Ask S3 about this one key rather than listing the whole bucket
            name = self._normalize_name(self._clean_name(name))
            k = self.bucket.new_key(self._encode_name(name))
            return k.exists()

        def size(self, name):
            name = self._normalize_name(self._clean_name(name))
            return self.bucket.get_key(self._encode_name(name)).size

        def modified_time(self, name):
            name = self._normalize_name(self._clean_name(name))
            k = self.bucket.get_key(self._encode_name(name))
            return parse_ts_extended(k.last_modified)

You'll just have to put this subclass in one of your app's modules and reference it via its dotted path in your settings module. The only drawback is that each call to any of the three overridden methods results in a web request, which might not be a big deal.
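
For example, assuming the subclass above lives in a (hypothetical) myapp/storage.py:

    # settings.py
    DEFAULT_FILE_STORAGE = 'myapp.storage.MyS3BotoStorage'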

Licensed under: CC-BY-SA with attribution