Question

Say there is a dict variable that grows very large during runtime, up into the millions of key:value pairs.

Does this variable get stored in RAM, effectively using up all the available memory and slowing down the rest of the system?

Asking the interpreter to display the entire dict is a bad idea, but would it be okay as long as one key is accessed at a time?

Solution

Yes, the dict will be stored in the process memory. So if it gets large enough that there's not enough room in the system RAM, then you can expect to see massive slowdown as the system starts swapping memory to and from disk.

Others have said that a few million items shouldn't pose a problem; I'm not so sure. The dict overhead itself (before counting the memory taken by the keys and values) is significant. For Python 2.6 or later, sys.getsizeof gives some useful information about how much RAM various Python structures take up. Some quick results, from Python 2.6 on a 64-bit OS X machine:

>>> from sys import getsizeof
>>> getsizeof(dict((n, 0) for n in range(5462)))/5462.
144.03368729403149
>>> getsizeof(dict((n, 0) for n in range(5461)))/5461.
36.053470060428495

So the dict overhead varies between 36 bytes per item and 144 bytes per item on this machine (the exact value depending on how full the dictionary's internal hash table is; here 5461 = 2**14//3 is one of the thresholds where the internal hash table is enlarged). And that's before adding the overhead for the dict items themselves; if they're all short strings (6 characters or less, say) then that still adds another >= 80 bytes per item (possibly less if many different keys share the same value).
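
If you want a figure that includes the keys and values too, not just the table itself, you can build a rough estimate out of getsizeof. The helper below is a sketch of mine, not part of the original answer; it only counts shallow sizes and counts shared objects once per occurrence, so treat the result as approximate:

from sys import getsizeof

def rough_dict_size(d):
    # Hash-table overhead plus the shallow size of every key and value.
    total = getsizeof(d)
    for key, value in d.items():
        total += getsizeof(key) + getsizeof(value)
    return total

d = dict((str(n), n) for n in range(100000))
print(rough_dict_size(d) // len(d))  # approximate bytes per item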

So it wouldn't take that many million dict items to exhaust RAM on a typical machine.
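
To put a number on that (my arithmetic, using the worst-case figures measured above):

bytes_per_item = 144 + 80            # worst-case table overhead + short-string key and value
items_per_gb = (2 ** 30) // bytes_per_item
print(items_per_gb)                  # about 4.8 million items per GB of RAM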

OTHER TIPS

The main concern with the millions of items is not the dictionary itself so much as how much space each of these items takes up. Still, unless you're doing something weird, they should probably fit.

If you've got a dict with millions of keys, though, you're probably doing something wrong. You should do one or both of:

  1. Figure out what data structure you should actually be using, because a single dict is probably not the right answer. Exactly what this would be depends on what you're doing.

  2. Use a database. Your Python should come with the sqlite3 module, so that's a start; see the sketch below.
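
For the sqlite3 route, here is a minimal sketch of a disk-backed key-value table standing in for the dict. The file name, table name, and helper functions are illustrative choices of mine, not anything prescribed by sqlite3:

import sqlite3

conn = sqlite3.connect("kv.db")  # data lives on disk, not in process RAM
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    # INSERT OR REPLACE mimics dict assignment: overwrite on a duplicate key.
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row is not None else None

put("spam", "eggs")
print(get("spam"))  # eggs
conn.commit()

Single-key lookups go through SQLite's index on the primary key, so access stays fast without holding everything in memory.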

Yes, a Python dict is stored in RAM. A few million keys isn't an issue for modern computers, however. If you need more and more data and RAM is running out, consider using a real database. Options include a relational DB like SQLite (built into Python via the sqlite3 module, by the way) or a key-value store like Redis.
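
If you go the Redis route instead, the sketch below assumes the third-party redis-py client (pip install redis) and a Redis server running on localhost; neither is part of the original answer:

import redis  # third-party client: pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server
r.set("spam", "eggs")  # dict-like assignment, stored by the server
print(r.get("spam"))   # b'eggs' -- values come back as bytes on Python 3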

It makes little sense to display millions of items in the interpreter, but accessing a single element should still be very efficient.
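
To convince yourself of that, you can time a single-key lookup against a large dict with the stdlib timeit module (the sizes and the key here are arbitrary):

import timeit

setup = "d = dict((n, 0) for n in range(10**6))"
total = timeit.timeit("d[500000]", setup=setup, number=10**6)
print(total)  # seconds for a million lookups; typically a fraction of a second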

As far as I know, Python's dict is a highly optimized hash table, so you are likely to get good memory efficiency and performance out of the box. Whether the whole thing stays in RAM or gets committed to swap is up to your OS and depends on how much RAM you have. What I'd say is best is to just try it:

a = {}
for i in xrange(10 * 10**6):  # xrange is Python 2; use range on Python 3
    a[i] = i                  # ten million integer keys and values

How does this look when you run it? It takes about 350 MB on my system, which should be manageable, to say the least.
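
If you'd rather read off a number than watch a process monitor, the stdlib resource module (Unix-only) reports the peak resident set size after the experiment. Note that ru_maxrss is in kilobytes on Linux but in bytes on OS X:

import resource

a = {}
for i in range(10 * 10**6):  # use xrange on Python 2 to avoid building a list
    a[i] = i

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(peak)  # kilobytes on Linux, bytes on OS X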
