Question

I ran into a problem where I have to go through proxy logs to see if users have visited a list of sites.

I wrote a small script to read all proxy logs, matching the visited host against the list:

for proxyfile in proxyfiles:
    for line in proxyfile.readlines():
        if line[4] in hosts_list:
            print line

the hosts_file is large, we are talking about ~10000 hosts, and I noticed the searching took longer than expected.

I wrote a small test:

import random, time
test_list = [x for x in range(10000)]
test_dict = dict(zip(test_list, [True for x in range(10000)]))

def test(test_obj):
 s_time = time.time()
 for i in range(10000):
  random.randint(0,10000) in test_obj
 d_time = time.time() - s_time
 return d_time

print "list:", test(test_list)
print "dict:",test(test_dict)

the result are the following:

list: 5.58524107933
dict: 0.195574045181

So, to my question. Is it possible to perform this search in a more convenient way? Creating a dictionary of a list seems like a hack, as I want to search for they key and not the value it contains.

Was it helpful?

Solution

"as I want to search for they key and not the value it contains" => then just use set

OTHER TIPS

I agree that you should use a dictionary for such a thing, set on a newer python, and consider moving to a newer python than 2.2 if that's possible for your application.

But, if your list is in sorted order, you can use the bisect module to quickly search through the list to find elements. Not quite as fast as the dict, but pretty close.

import random, time
import bisect

class BisectContainsList(list):
    def __contains__(self, elem):
        i = bisect.bisect_left(self, elem)
        if i != len(self) and self[i] == elem:
            return True
        return False

test_list = [x for x in range(10000)]
test_dict = dict(zip(test_list, [True for x in range(10000)]))
test_blist = BisectContainsList(test_list)

def test(test_obj):
 s_time = time.time()
 for i in range(10000):
  random.randint(0,10000) in test_obj
 d_time = time.time() - s_time
 return d_time

print "list:", test(test_list)
print "dict:", test(test_dict)
print "blist", test(test_blist)

for (tested on 2.7):

list: 1.19566082954
dict: 0.0248260498047
blist 0.0598628520966

If your list is sorted you can use the bisect module with this helper function:

def sorted_list_contains(alist, item):
    i = bisect.bisect_left(alist, item)
    return i != len(alist) and alist[i] == item

edit: I did not see Matt Anderson's answer using bisect when I posted this. I'll leave this up as an alternative implementation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top