Question

I am using the bisect module to search for and insert sha256 hashes into a list.

I have about 8,000,000 items to search and add to, and they are stored in an sqlite database, I want to read them into the list so I can search them quicker.

The issue I have is inserting items into the list using bisect to find the right insertion point is quite slow. It takes about 700 seconds to complete all 8,000,000 items.

It only takes about 90 seconds to create an index in the sqlite database in ascending order, and then about 60 seconds to insert these into the list sequentially.

The trouble is when I do this the bisect search for some items fail, but if I sequentially search the item for the hash it is actually there.

So it appears as though the order provided by the database isnt quite the same as the order provided when using bisect to get the index position.

Any ideas why this would be? it would be really useful to be able to pre-sort the list before relying on bisect.

UPDATE.... Based on a comment, I should explain that I have a custom class that behaves like a list, that packs the hashes in a bytearray to save memory. Here is my Class

class Hashlist():

def __init__(self, hashLen):
    self.__hashLen = hashLen
    self.__hashlist = bytearray()
    self.__num_items = 0

def __getitem__(self, index):
    if index >= len(self) or index < 0: 
        print index
        raise IndexError("hash index out of range")
        return 
    return str(self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen])

def __setitem__(self, index, data):
    if index > len(self) or index < 0: 
        raise IndexError("hash index out of range")
        return 
    if index == len(self):
        self.__hashlist.extend(data)
    else:
        self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data

def insert(self, index, data):
    oldlen = len(self.__hashlist)/self.__hashLen
    if index > oldlen  or index < 0:
        raise IndexError("trying to insert past next element")
        return
    if index == oldlen:
        self.__hashlist.extend(data)
    else:
        # move the data
        if self.__hashLen == 1:
            self.__hashlist.append(chr(0))
            orig_data = str(self.__hashlist[(index):(len(self.__hashlist)-1)])
            self.__hashlist[(index + 1)*self.__hashLen:(len(self.__hashlist))*self.__hashLen] = orig_data
            #replace existing data
            self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data
        else:
            orig_data = str(self.__hashlist[(index*self.__hashLen):(len(self.__hashlist) -1)*self.__hashLen])
            self.__hashlist[(index + 1)*self.__hashLen:(len(self.__hashlist))*self.__hashLen] = orig_data
            #replace existing data
            self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data

Thanks

Dean

Was it helpful?

Solution

If they're stored in a SQL database, an index doesn't guarantee that results are returned in "sorted" order - you have to be explicit by using "order by".

Also, if you're doing that many inserts then I wouldn't use bisect, instead sort/merge.

# Add new to old and sort the whole lot...
old_hash_list.extend(new_hash_list)
old_hash_list.sort()

# Assuming new is already sorted than create new list of merged
import heapq
old_and_new = list(heapq.merge(old_hash_list, sorted(new_hash_list)))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top