Question

I'm using a dictionary to hold a large number of objects, and have a string name for each of them. To be specific, here is my code:

from itertools import product
for (i,j,k) in product(range(N),range(M),range(K)):
    var_name='x_'+'_'+str(i)+str(j)+'_'+str(k)
    var_dict[var_name] = f(var_name,other_params)
print len(var_dict)

f(...) returns an object. In my code N=363, M=500, and K=2. So I expect 363000 entries in the dictionary. But when I check the length of var_dict, it is 330860!

(Pdb) len(var_dict)
330860

Here are my questions:

  1. Is there any explanation for that? E.g. is there any limit for the number of items that built-in hash table of python can address?

  2. What can I do to solve this problem?

Était-ce utile?

La solution

The problem is here:

str(i)+str(j)

This does not produce unique identifiers. For example, the value set when i=1 and j=11 will be overwritten by the value set when i=11 and j=1 (there are many more instances as well).

You can fix the problem by inserting some delimiter character between the two numbers (such as an underscore like you have between j and k).

Autres conseils

Access times for a string key in a python dictionary is in the order of 1 microsecond (1s / 1000 / 1000).

The time taken increases slightly dependent on the number of entries in the dictionary, possibly with something like a log(N) scaling.

Performance degrades significantly for dictionaries larger than 2^26 = 67,108,864. It takes 30x longer to read from a dictionary of size 2^27 = 134,217,728, and 9000x longer for a dictionary of size 2^28 = 268,435,456. My computer ran out of memory before reaching 2^29.

Therefore, the practical answer to your question of the maximum size of a dictionary in python is:

2^26 = 67,108,864

>>> for i in range(1,sys.maxsize):
...   key = str(i)
...   d[key] = key
...   if math.log2(i) % 1 == 0: 
...     time_start = time.perf_counter()
...     value = d[key]
...     time_taken = time.perf_counter() - time_start
...     print(time_taken*1000*1000, i)
... 
0.682000063534360 1
0.521999936609063 2
0.394000153391971 4
0.365999994755839 8
0.424000063503626 16
0.380000074073905 32
0.365000005331239 64
0.447000047643086 128
0.413999941883957 256
0.481999904877739 512
0.641000042378436 1024
0.906999957805965 2048
0.616000079389778 4096
0.995999926090007 8192
1.115000031859381 16384
1.142999963121838 32768
1.144999941971036 65536
1.156000053015304 131072
1.231999931405880 262144
1.225999994858284 524288
1.196000084746629 1048576
1.308000037170131 2097152
1.232000158779556 4194304
1.314999963142327 8388608
1.178000047730165 16777216
1.179000037154764 33554432
1.669000084802974 67108864
33.22600014143973 134217728
9655.005000013261 268435456
Killed: 9

You don't have a delimiter between i and j in your constructed strings, so tuples like (12, 1, 0) and (1, 21, 0) produce the same name. If possible, don't make names for these things at all; just use the numbers directly:

var_dict[i, j, k] = f(i, j, k, other_params)

If f really needs to take a string, change the name construction to put a delimiter between i and j:

var_name = 'x_{}_{}_{}'.format(i, j, k)

and if possible, use the tuple as a dict key even if f needs a string:

var_dict[i, j, k] = f(var_name, other_params)

No size of limitation on dict

d = {}
for i in xrange(999999):
    d[i] = i
len(d)

It prints

999999
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top