I'm reading each line of a file into both a list and a dict,

with open("../data/title/pruned2_titleonly.txt", 'rb') as f_titles:
    titles_lst = f_titles.read().split('\n')
assert titles_lst[-1] == ''
titles_lst.pop() # remove the last element, an empty string

titles_dict = {}
with open("../data/title/pruned2_titleonly.txt", 'rb') as f_titles:
    for i,line in enumerate(f_titles):
        titles_dict[i] = line

and I'm testing the performance by accessing each item in the list/dict in random order:

n = len(titles_lst)
a = np.random.permutation(n)

%%time
for i in xrange(10):
    t = []
    for b in a:
        t.append(titles_lst[b])
    del t

>>> CPU times: user 18.2 s, sys: 60 ms, total: 18.2 s
>>> Wall time: 18.1 s

%%time
for i in xrange(10):
    t = []
    for b in a:
        t.append(titles_dict[b])
    del t

>>> CPU times: user 41 s, sys: 208 ms, total: 41.2 s
>>> Wall time: 40.9 s

The above result seems to imply that dictionaries are not as efficient as lists for lookup tables, even though list lookups are O(n) while dict lookups are O(1). I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't...

%timeit titles_lst[n/2]
>>> 10000000 loops, best of 3: 81 ns per loop

%timeit titles_dict[n/2]
>>> 10000000 loops, best of 3: 120 ns per loop

What is the deal? If it's important to note, I am using Python 2.7.6 Anaconda distribution under Ubuntu 12.04, and I built NumPy under Intel MKL.


Solution

The above result seems to imply that dictionaries are not as efficient as lists for lookup tables, even though list lookups are O(n) while dict lookups are O(1). I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't...

It's not true that list lookups are O(N), in the sense of "getting an item", which is the sense your code tests. Determining where (if at all) an element exists can be O(N): both somelist.index(someval_not_in_the_list) and someval_not_in_the_list in somelist have to scan over each element. Try comparing x in somelist with x in somedict to see a major difference.
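As an illustration (hypothetical names and sizes, not the asker's data), a membership test that misses has to scan the whole list, but only hashes the key once for the dict:

```python
import timeit

# Hypothetical demo: a failed membership test scans every list element (O(n)),
# but only hashes the key once for a dict (O(1) on average).
n = 100000
somelist = list(range(n))
somedict = dict.fromkeys(somelist)
missing = -1  # not present: worst case for the list scan

t_list = timeit.timeit(lambda: missing in somelist, number=100)
t_dict = timeit.timeit(lambda: missing in somedict, number=100)
print("list 'in': %.4fs  dict 'in': %.6fs" % (t_list, t_dict))
```

On any machine the list version should be slower by orders of magnitude, because each miss walks all n elements.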

But simply accessing somelist[index] is O(1) (see the Time Complexity page). And the coefficient is probably going to be smaller than in the case of a dictionary, also O(1), because you don't have to hash the key.

Other tips

What is the deal?

The reason for this behaviour is that for dictionary access, the index (key) is hashed, and the hash is then used to access the value in a hash table. In a list, the index is a direct mapping to the respective slot in the list, which is essentially a calculated offset into memory.
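A rough sketch of that extra work (illustrative only; absolute numbers depend on the machine): the dict path pays for hash() plus a probe into the hash table, while the list path is a single computed offset:

```python
import timeit

# Illustrative timings (machine-dependent): list indexing is a computed
# offset; dict indexing hashes the key first, then probes the hash table.
seq = list(range(1000))
table = dict(enumerate(seq))

t_list = min(timeit.repeat(lambda: seq[500], number=100000, repeat=5))
t_dict = min(timeit.repeat(lambda: table[500], number=100000, repeat=5))
t_hash = min(timeit.repeat(lambda: hash(500), number=100000, repeat=5))
print("list: %.6fs  dict: %.6fs  hash alone: %.6fs" % (t_list, t_dict, t_hash))
```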

See implementation of dictionaries and lists, respectively.

The above result seems to imply that dictionaries are not as efficient as lists for lookup tables,

No, we cannot conclude that. What your experiment shows is one particular instance, comparing numeric index access in a list to hashed-key access in a hash table. Obviously there is more work to be done to access the dictionary (namely, hash the index).

I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't...

I also ran some tests, see below. Before we go into details, note that all that O(1) states is that the access time is independent of the size of the respective collection object. To quote Wikipedia:

An algorithm is said to be constant time (also written as O(1) time) if the value of T(n) is bounded by a value that does not depend on the size of the input.

In other words, given some collection object x, all access to this object will be independent of the size of the object. Note, however, that O(1) does not mean that all accesses will have exactly the same running time. Again to quote Wikipedia:

Despite the name "constant time", the running time does not have to be independent of the problem size, but an upper bound for the running time has to be bounded independently of the problem size.

The key word here is bounded. We can verify that the access time is within reasonable bounds by slightly modifying your program. Note that I am generating random input instead of building the list and dict objects from a file. Also I have simplified the code to test access only (your code builds a new list for each round, which may change the scale of the result).

import hashlib  # the sha module is deprecated; hashlib provides the same digests
from timeit import timeit
import numpy as np

N = 1000
# generate random input of size N
titles_lst = [hashlib.sha1(str(x)).digest() for x in range(N)]
titles_dict = dict(enumerate(titles_lst))
# permute the access pattern
n = len(titles_lst)
a = np.random.permutation(n)

def access_list():
    x = titles_lst[b]

def access_dict():
    x = titles_dict[b]

# run measurements for dictionary
X = []
C = 1000
for i in range(C):
    for b in a:
        X.append(timeit(access_dict, number=C))
print "dict std=%f avg=%f" % (np.std(X), np.mean(X))

# run measurements for list
X = []
for i in range(C):
    for b in a:
        X.append(timeit(access_list, number=C))
print "list std=%f avg=%f" % (np.std(X), np.mean(X))

On my 2.7 GHz Mac mini this yields the following results.

N=100 C=1000 dict std=0.000001 avg=0.000002
N=100 C=1000 list std=0.000001 avg=0.000001

N=1000 C=1000 dict std=0.000001 avg=0.000002
N=1000 C=1000 list std=0.000001 avg=0.000001

N=10000 C=1000 dict std=0.000001 avg=0.000002
N=10000 C=1000 list std=0.000001 avg=0.000001

Clearly, the tests confirm that both dictionaries and lists do have O(1) access times. They also confirm that dictionary lookups are, on average, slower than list accesses in this particular case, using a numeric index/key.

list[] is getting something from a specific known location in the list - O(1)

dict[] is searching the dictionary using a key - O(1) on average

If you want to compare searching, try searching a list - O(n):

[x for x in somelist if x == val]

to see the real difference.
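A minimal sketch of that comparison (hypothetical sizes; the list comprehension stands in for a linear search by value):

```python
import timeit

# Searching a list by value walks its elements (O(n)); the dict finds the
# same value by hashing its key (O(1) on average).
n = 50000
somelist = list(range(n))
somedict = {v: v for v in somelist}
val = n - 1  # near the end, so the scan walks almost the whole list

t_scan = timeit.timeit(lambda: [x for x in somelist if x == val], number=50)
t_dict = timeit.timeit(lambda: somedict[val], number=50)
print("scan: %.4fs  dict: %.6fs" % (t_scan, t_dict))
```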
