Question

I have a very large list of dictionaries (GBs in size) obtained from an API. I would like to use it as a lookup table for other functions. There are several object persistence methods in Python, but which would you recommend for storing lists of dictionaries on disk for easy referencing and lookup? The list looks something like this:

[
  {
    "library_id": "7",
    "set_id": "80344779",
    "description": "Very long description 1 ...",
    "value": "1"
  },
  {
    "library_id": "22",
    "set_id": "80344779",
    "description": "Very long description 2 ...",
    "value": "1"
  },
  {
    "library_id": "24",
    "set_id": "80344779",
    "description": "Very long description 3 ...",
    "value": "8"
  },
  ...
]

Solution 2

Your data seems to be regular, i.e. the dicts' keys do not vary, right? One could simply use a document-based solution like MongoDB, but I think a simple SQL-based database might be more efficient and is easier to implement.

Alternatives would be the pickle module (not recommended for really large objects, as they are loaded into memory all at once) or shelve, which builds on top of pickle but is more efficient with large files, as far as I know (the data is not loaded into memory all at once). The benefit of shelve is its syntax, which mimics Python's dict syntax and should be easy to use (see the link). And there is no need to set up a MongoDB or MySQL database (which might get complicated, at least on Windows). Both pickle and shelve are part of the standard library.
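For instance, a minimal shelve sketch (the file name and key scheme are made up for illustration; data stands for the list of dicts from the API):

import shelve

# Write each dict to a shelf file on disk;
# shelve keys must be strings, so combine the two id fields
with shelve.open('api_data.db') as shelf:
    for item in data:
        shelf['{}:{}'.format(item['library_id'], item['set_id'])] = item

# Later (even in another process), look up a single record
# without loading the whole list into memory
with shelve.open('api_data.db') as shelf:
    record = shelf['22:80344779']
    print(record['description'])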

You might also check the dataset package and its easy-to-use interface. It uses an SQLite database under the hood.
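A rough sketch with dataset (the file name is illustrative; data is again the list of dicts from the API):

import dataset

# Connect to (and create) an SQLite file;
# the table and its columns are created on the fly
db = dataset.connect('sqlite:///api_data.db')
table = db['api_data']

# Insert the whole list of dicts in one call
table.insert_many(data)

# Look up rows by column values, dict-style
row = table.find_one(library_id='22', set_id='80344779')
print(row['description'])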

If you're dealing with huge files (say, > 2 GB), I would not stick to dataset or shelve, but use more mature solutions like sqlalchemy (plus a MySQL database) or MongoDB and its Python interface (PyMongo).
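For the MongoDB route, a minimal PyMongo sketch might look like the following (it assumes a local mongod instance; the database and collection names are made up for illustration):

from pymongo import MongoClient, ASCENDING

client = MongoClient()  # assumes mongod running on localhost:27017
col = client['api_db']['api_data']

# Insert the list of dicts as documents and index the lookup fields
col.insert_many(data)
col.create_index([('library_id', ASCENDING), ('set_id', ASCENDING)])

# Look up one document by the indexed fields
doc = col.find_one({'library_id': '22', 'set_id': '80344779'})
print(doc['description'])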

OTHER TIPS

One way would be to create a model class (using Django models, https://docs.djangoproject.com/en/dev/topics/db/models/) whose fields match the keys of your dictionaries, and save each dict as a model object.

Something like:

from django.db import models

class MyDict(models.Model):
    library_id = models.CharField(max_length=30)
    set_id = models.CharField(max_length=30)
    description = models.TextField()  # descriptions can be long
    value = models.CharField(max_length=30)

You can make library_id the primary key if it is unique; this will speed up lookups by library_id.
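For instance, a sketch of that variant (illustrative only; it assumes every library_id in the data really is unique):

class MyDict(models.Model):
    # library_id doubles as the primary key, so pk lookups hit the index
    library_id = models.CharField(max_length=30, primary_key=True)
    set_id = models.CharField(max_length=30)
    description = models.TextField()
    value = models.CharField(max_length=30)

# Save one dict from the list, then look it up by primary key
MyDict.objects.create(**item)  # item is one dict from the API list
record = MyDict.objects.get(pk="22")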

You could also use Google App Engine's NDB API for the same purpose (if you are hosting on Google App Engine): https://developers.google.com/appengine/docs/python/ndb/

As the other answers indicate, it is worth looking into the packaged database solutions. If you want portability, you can easily create an SQLite database from Python. Assuming your data comes from the API as a list of dictionaries like the one listed above, a minimal working example looks like this:

import sqlite3

# Create a database in memory; in practice you would save to disk,
# e.g. sqlite3.connect('api_data.db')
conn = sqlite3.connect(':memory:')

# Read in the data [omitted for brevity]

cmd_create_table = '''
CREATE TABLE api_data (
 set_id      INTEGER,
 library_id  INTEGER,
 description TEXT,
 value       INTEGER);
CREATE INDEX idx_api ON api_data (library_id, set_id);
'''
conn.executescript(cmd_create_table)

cmd_insert = '''INSERT INTO api_data VALUES (?,?,?,?)'''
keys = ["set_id", "library_id", "description", "value"]

for item in data:
    val = [item[k] for k in keys]
    conn.execute(cmd_insert, val)

def lookup(library_id, set_id):
    # Parameterized query: safer than string formatting
    # (no SQL injection or quoting problems)
    cmd_find = 'SELECT * FROM api_data WHERE library_id=? AND set_id=?'
    return conn.execute(cmd_find, (library_id, set_id)).fetchall()

print(lookup(22, 80344779))

>>> [(80344779, 22, 'Very long description 2 ...', 1)]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow