Question

I have 400 million lines of unique key-value data that I would like to have available for quick look-ups in a script. I am wondering what would be a slick way of doing this. I considered the options below, but I am not sure whether there is a way to keep the dictionary on disk so that it does not use a lot of memory except during dictionary creation.

  1. Pickled dictionary object: not sure whether this is an optimal solution for my problem
  2. NoSQL-type databases: ideally I want something with minimal dependence on third-party software, and the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too; maybe it will convince me.

Please let me know if anything is not clear.

Thanks! -Abhi


Solution

If you want to persist a large dictionary, you are basically looking at a database.

Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
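A minimal sketch of that approach (assuming integer keys and values, as described in the question; the file, table, and column names are just placeholders):

import sqlite3

# One table mapping an integer key to an integer value, stored in a file on disk.
conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

# Insert a few pairs; executemany avoids a Python call per row.
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", [(1, 100), (2, 200)])
conn.commit()

# A point lookup goes through the primary-key index, so it stays fast
# even when the table is far larger than RAM.
row = conn.execute("SELECT v FROM kv WHERE k = ?", (1,)).fetchone()
print(row[0] if row else None)

conn.close()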

OTHER TIPS

No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.

From the docs https://docs.python.org/3/library/dbm.html

import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.

I would try this before any of the more exotic options; note that loading a pickled dictionary pulls everything back into memory.

Cheers

Tim

In principle, the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve takes care of pickling/unpickling the values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
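A minimal sketch of that (keys stringified, since shelve requires string keys; the file name is just an example):

import shelve

# Keys must be strings; values can be any picklable object.
with shelve.open("lookup.shelf") as db:
    db["12345"] = 67890            # store a value under a stringified key
    print(db.get("12345"))         # single-key lookup from disk
    print("99999" in db)           # membership test without loading everything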

Your data size sounds huge, so you should do some testing, but shelve/BDB is probably up to it.

Note: the bsddb module has been deprecated, so shelve may not support BDB hashes in the future.

Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.

  1. Install redis-server
  2. Start redis server
  3. Install the redis Python package (pip install redis)
  4. Profit.

import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)

The above assumes a file of values like:

key1=value1
key2=value2
etc=etc

Modify the insertion script to suit your needs.
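For hundreds of millions of keys, issuing one SET per network round trip will be slow; a pipelined variant of the same loop (a sketch, assuming the same "key=value" file format as above) batches the writes:

import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    pipe = ds.pipeline()
    for i, line in enumerate(fh, 1):
        k, _, v = line.strip().partition("=")
        pipe.set(k, v)
        if i % 10000 == 0:    # flush in batches to keep memory bounded
            pipe.execute()
    pipe.execute()            # flush any remaining commands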


import redis
ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to look up keys:
for mykey in special_key_list:
    val = ds.get(mykey)

Why I like Redis:

  1. Configurable persistence options
  2. Blazingly fast
  3. Offers more than just key / value pairs (other data types)
  4. @antirez

I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.

This is the sort of problem for which databases were invented. You are thinking "NoSQL", but an SQL database would work too. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.

What are the performance characteristics of sqlite with very large database files?
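If you go this route, bulk loading is the slow part. A sketch of one way to load the pairs in chunks (assuming numeric keys and values, one "key=value" pair per line as in the Redis example above; the PRAGMAs trade durability during the load for speed, and the file names are placeholders):

import sqlite3
from itertools import islice

conn = sqlite3.connect("kv.db")
# Loosen durability settings while bulk loading; restore them afterwards if needed.
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

def pairs(fh):
    # One "key=value" pair per line.
    for line in fh:
        k, _, v = line.strip().partition("=")
        yield int(k), int(v)

with open("your_text_file.txt") as fh:
    it = pairs(fh)
    while True:
        chunk = list(islice(it, 100_000))
        if not chunk:
            break
        conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", chunk)
        conn.commit()

conn.close()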

I personally use LMDB and its Python binding for a DB of a few million records. It is extremely fast, even for a database larger than the RAM. It's embedded in the process, so no server is needed. Dependencies are managed with pip.

The only downside is that you have to specify the maximum size of the DB up front. LMDB will mmap a file of this size. If it's too small, inserting new data will raise an error; too large, and you create a sparse file.
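A minimal sketch with the lmdb package (the map_size below is just an example value; keys and values must be bytes):

import lmdb

# map_size is the maximum database size in bytes (here ~10 GB; pick your own).
env = lmdb.open("kv.lmdb", map_size=10 * 1024**3)

# Writes go through a write transaction.
with env.begin(write=True) as txn:
    txn.put(b"12345", b"67890")

# The file is memory-mapped, so reads don't pull the whole DB into RAM.
with env.begin() as txn:
    print(txn.get(b"12345"))

env.close()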

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow