Question

I was thinking that Python's native DBM should be quite a bit faster than NoSQL databases such as Tokyo Cabinet, MongoDB, etc., since Python DBM has fewer features and options, i.e. it is a simpler system. So I tested it with a very simple write/read example:

#!/usr/bin/python
import time
t = time.time()
import anydbm
count = 0
while count < 1000:
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()
    db = anydbm.open("dbm2", "r")
    print "db['1']: ", db["1"]
    print "%.3f" % (time.time() - t)
    db.close()
    count = count + 1

Read/Write: 1.3s Read: 0.3s Write: 1.0s

These values are at least 5 times faster with MongoDB. Is this really the performance to expect from Python DBM?


Solution

Python doesn't have a single built-in DBM engine. Its anydbm module is a wrapper that delegates to whichever DBM-style library is available on the system, such as Berkeley DB or GNU dbm, with a slow pure-Python fallback (dumbdbm).
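To see which backend anydbm has actually picked on a given machine, the standard library's whichdb module can inspect the resulting file. A quick check (Python 2, reusing the file name from the question) might look like this:

import anydbm
import whichdb

# create a small database, then ask which DBM backend produced it
db = anydbm.open("dbm2", "c")
db["1"] = "something"
db.close()

# prints e.g. 'dbhash', 'gdbm' or 'dumbdbm', depending on what is installed
print whichdb.whichdb("dbm2")

The backend that gets chosen makes a real difference to the timings, so it is worth knowing which one you are actually benchmarking.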

Python's dictionary implementation is really fast for key-value storage, but it is not persistent. If you need high-performance runtime key-value lookups, you may find a dictionary better, and you can manage persistence with something like cPickle or shelve. If startup time (and, if you modify the data, shutdown time) matters more to you than runtime access speed, then something like DBM would be better.
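As a rough sketch of that approach (the file name and data here are just placeholders): keep the working data in a plain dict and use shelve only to load it at startup and dump it at shutdown.

import shelve

# hypothetical startup: pull everything into an in-memory dict for fast lookups
s = shelve.open("app_data", "c")
data = dict(s)
s.close()

# ... work on `data` at dictionary speed ...
data["1"] = "something"

# hypothetical shutdown: write the whole dict back in one pass
s = shelve.open("app_data", "c")
s.update(data)
s.close()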

In your evaluation, the main loop includes both the dbm open calls and the lookups. Opening a DBM to store one value, closing it, and re-opening it before looking the value up is a pretty unrealistic use case, and the slow results are what you would typically see when managing a persistent data store in such a manner; it is quite inefficient.

Depending on your requirements, if you need fast lookups and don't care too much about startup times, DBM might be a solution - but to benchmark it, include only the writes and reads in the loop! Something like the following might be suitable:

import anydbm
from random import random
import time

# open DBM outside of the timed loops
db = anydbm.open("dbm2", "c")

max_records = 100000

# only time read and write operations
t = time.time()

# create some records
for i in range(max_records):
  db[str(i)] = 'x'

# do some random reads
for i in range(max_records):
  x = db[str(int(random() * max_records))]

time_taken = time.time() - t
print "Took %0.3f seconds, %0.5f microseconds / record" % (time_taken, (time_taken * 1000000) / max_records)

db.close()

OTHER TIPS

Embedded key-value stores are fast enough in Python 3. Take the native dict as a baseline, say:

for k in random.choices(auList,k=100000000):
    a=auDict[k]
CPU times: user 1min 6s, sys: 1.07 s, total: 1min 7s
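(The snippets in this answer refer to auList, a list of keys, and auDict, a dict mapping each key to a list of values, neither of which is shown. A hypothetical, much smaller stand-in could be built like this:)

import random  # used by the benchmark snippets; random.choices needs Python 3.6+

# hypothetical stand-ins for the data used in these benchmarks:
# integer keys, each mapped to a small list of string values
auList = list(range(1000000))
auDict = {k: ["value%d" % i for i in range(5)] for k in auList}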

GDBM does not fare badly against this:

%%time
with db.open("AuDictJson.gdbm",'r') as d:
    for k in random.choices(auList,k=100000000):   
        a=d[str(k)]   
CPU times: user 2min 44s, sys: 1.31 s, total: 2min 45s
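(The GDBM snippets assume something like import dbm.gnu as db, and that the file was built beforehand with JSON-encoded values, roughly as follows; the file name comes from the answer, auDict from the hypothetical setup above:)

import dbm.gnu as db
import json

# one-off build of the read-only file used in the benchmark above
with db.open("AuDictJson.gdbm", "c") as d:
    for k, v in auDict.items():
        d[str(k)] = json.dumps(v)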

And even a specialized precompiled table, such as keyvi for JSON-serialized lists, can do almost the same:

%%time
d = keyvi.Dictionary("AuDictJson.keyvi")
for k in random.choices(auList,k=100000000):   
    a=d[str(k)].GetValue()
CPU times: user 7min 45s, sys: 1.48 s, total: 7min 47s

Generally, an embedded database, especially when it is read-only and single-user, should always be expected to beat an external one, because of the overhead of the sockets and semaphores needed to access the resource. On the other hand, if your program is a service that already has some external I/O bottleneck (say, you are writing a web service), the overhead of accessing the resource may be unimportant.

That said, you can see some advantage in using external databases if they provide extra services. For Redis, consider the union of sets:

%%time
for j in range(1000):
    k=r.sunion(('s'+str(k) for k in random.choices(auList,k=10000)))  
CPU times: user 2min 24s, sys: 758 ms, total: 2min 25s
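(Here r would be a redis-py client, and the sets would have been loaded beforehand, roughly like this; the key naming simply mirrors the 's' + id scheme used above:)

import redis

r = redis.Redis()  # assumes a local Redis server on the default port

# load one set per key, mirroring the 's<id>' naming used in the benchmark
for k, values in auDict.items():
    r.sadd('s' + str(k), *values)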

The same task with gdbm is within the same order of magnitude. Although Redis is still about five times slower, it is not so slow as to rule it out:

%%time 
with db.open("AuDictPSV.gdbm",'r') as d:
    for j in range(1000):
        a=set()
        for k in random.choices(auList,k=10000):
            a.update(d[str(k)].split(b'|'))
CPU times: user 33.6 s, sys: 3.5 ms, total: 33.6 s
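(Again a hedged sketch: AuDictPSV.gdbm would simply hold the same lists as pipe-separated strings, built along these lines:)

import dbm.gnu as db

# store each value list as a single pipe-separated string
with db.open("AuDictPSV.gdbm", "c") as d:
    for k, v in auDict.items():
        d[str(k)] = '|'.join(v)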

By using Redis in this case, you get the full functionality of a database, beyond a simple datastore. Of course, with a lot of clients saturating it, or with a lot of single gets, it is going to perform poorly compared to the embedded resource.

As for gdbm's competition, a 2014 benchmark by Charles Leifer shows that it can outperform Kyoto Cabinet for reads and roughly tie for writes, and that one could consider LevelDB and RocksDB as more advanced alternatives.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow