My goal is to create a database of MD5 hashes, and then check another list of hashes against it to see if they exist in the database.
I have a working solution using anydbm
, which if you are unfamiliar works exactly like a dictionary in python, but you don't have to load the whole thing in memory at once. Creation of the db is rather slow (100 million entries in roughly 2 and a half hours) and retrieval speed is moderate (100000 entries in 1 to 2 seconds). Checking if the hash exists is as simple as if hash in dbm
.
For a performance improvement, I attempted to make a working solution with sqlite. Creation speed is blazing fast, creating all 100 million entries in a few minutes. But retrieval took upwards of 15 seconds for one entry. This is unreasonable!
I'm not a SQL pro, so I'm wondering if I'm just using commands which overly complicate things.
They are as follows:
creation of table: c.execute('''create table keys(id integer not null primary key autoincrement, hash text, alert text) ''')
adding entries(in a loop): c.execute('''insert into keys(hash, alert) values (?,?) ''', (hash, "1"))
retrieval(also looped):
c.execute('''select * from keys where hash = ? ''', (hash,))
hits = c.fetchall()
numhits += len(hits)