Question

I am dealing with a large protein sequence FASTA file (>8 GB), and my idea is to create a dictionary where the key and value will be the protein ID and sequence, respectively.

For now I am able to build the dictionary and dump it with pickle, then open it with cPickle (I read that pickle is faster for dumping data and cPickle is faster for loading it). But the main problem here is time: building the dictionary and dumping it takes too much time and memory (the PC has 8 GB of memory).

Is there any faster option available to deal with large files in Python?

Here is my Python code to create the dictionary and dump the data:

from Bio import SeqIO
import pickle, sys

fastaSeq = {}
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id
        uniID = header.split('_')[1]
        seqs = str(seq_record.seq)
        fastaSeq[uniID] = seqs

f = open('uniref90.obj', 'wb')
pickle.dump(fastaSeq, f, pickle.HIGHEST_PROTOCOL)
f.close()

Loading the dictionary and doing some task in a separate Python program:

import cPickle as pickle
seq_dict = pickle.load(open("uniref90.obj", "rb"))
for skey in seq_dict.keys():
    # doing something

Solution

Databases are your friend, my son.

import sqlite3
from Bio import SeqIO

db = sqlite3.connect("./db")

c = db.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS map (k text unique, v text)''')
db.commit()


def keys(db):
    cursor = db.cursor()
    return cursor.execute("""SELECT k FROM map""").fetchall()


def get(key, db, default=None):
    cursor = db.cursor()
    result = cursor.execute("""SELECT v FROM map WHERE k = ?""", (key,)).fetchone()
    if result is None:
        return default
    return result[0]


def save(key, value, db):
    cursor = db.cursor()
    cursor.execute("""INSERT INTO map VALUES (?,?)""", (key, value))
    db.commit()


with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id
        uniID = header.split('_')[1]
        seqs = str(seq_record.seq)
        save(uniID, seqs, db)
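
With the table populated, lookups can then be done on demand without holding the whole dataset in memory. A minimal usage sketch of the helpers above (the ID is just a placeholder, not taken from the real data):

seq = get('P12345', db)          # returns None if the ID is absent
if seq is not None:
    print(len(seq))

for (k,) in keys(db):            # keys() returns rows, i.e. 1-tuples
    pass                         # do something with k, or get(k, db)

db.close()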

OTHER TIPS

cPickle is fastest by far for generic data structures.
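
For instance, a minimal dump/load round trip with the highest pickle protocol might look like this (Python 2 shown, matching the code above; on Python 3 plain pickle already uses the C implementation):

import cPickle as pickle

with open('uniref90.obj', 'wb') as f:
    pickle.dump(fastaSeq, f, pickle.HIGHEST_PROTOCOL)

with open('uniref90.obj', 'rb') as f:
    seq_dict = pickle.load(f)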

Depending on the constraints imposed by your data structure, try ujson, which is one of the faster JSON implementations; note that you may need to pass some flags to preserve the accuracy of floating-point values.
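
A minimal sketch, assuming the data is a plain dict of strings (like fastaSeq above) and that the ujson package is installed; double_precision is the kind of flag meant here for controlling floating-point accuracy:

import ujson

with open('uniref90.json', 'w') as f:
    f.write(ujson.dumps(fastaSeq, double_precision=15))

with open('uniref90.json') as f:
    seq_dict = ujson.loads(f.read())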

If your data is flat, then the numpy.load variants will be faster still. There are format options, ranging from plain text to binary data.
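
For flat, numeric data that could look like the sketch below (protein strings do not really fit this case, so it is purely illustrative):

import numpy as np

arr = np.arange(1000, dtype=np.float64)

np.save('data.npy', arr)         # binary .npy: compact and fast to load
arr_bin = np.load('data.npy')

np.savetxt('data.txt', arr)      # plain text: readable, but slower and larger
arr_txt = np.loadtxt('data.txt')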

Finally, numpy.memmap allows you to access binary data without "loading" it per se; data will only reach RAM when you access it. That should be fastest by far.
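
A minimal memory-mapping sketch with an assumed dtype and shape; only the slices you actually touch are read from disk:

import numpy as np

# create a binary file to map (in practice this file already exists)
np.arange(1000, dtype=np.float32).tofile('data.bin')

# map it read-only; nothing is loaded until elements are accessed
mm = np.memmap('data.bin', dtype=np.float32, mode='r', shape=(1000,))
print(mm[100:110])  # only this slice is paged into RAM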

If you are stuck with dictionaries, the equivalent is the shelve module.
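
That maps quite directly onto the original use case: a persistent, dict-like store of ID to sequence that never has to fit in RAM at once. A minimal sketch (the file name and lookup ID are arbitrary):

import shelve
from Bio import SeqIO

store = shelve.open('uniref90.shelf')
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        uniID = seq_record.id.split('_')[1]
        store[uniID] = str(seq_record.seq)
store.close()

# later, in another program, look sequences up on demand
store = shelve.open('uniref90.shelf')
seq = store.get('P12345')  # placeholder ID
store.close()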

Well, your problem is really how to deal with a large dataset, not necessarily with Python. The thing is, whatever you try, you just don't want to load more than 8 GB into memory, because it will make your application swap, and every time you try to access an element you'll get page-ins/page-outs from RAM.

The way I'd do it is to split your dataset into multiple files that can be easily pickled, assigning each record to a file with a hash function on the key, so that the split dataset is fast to load. Then you only load the file that matches the hash of the key you're looking for, which is quick, and extract the value for that key.
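
A rough sketch of that approach: stream the FASTA records into a fixed number of bucket files chosen by a stable hash (the built-in hash() is randomized between Python runs, so zlib.crc32 is used here so that the writer and the reader agree on the buckets); the bucket count is an arbitrary assumption:

import pickle
import zlib
from Bio import SeqIO

N_BUCKETS = 64  # assumed; tune to your data size

def bucket_of(key):
    # stable across runs, unlike the built-in hash()
    return zlib.crc32(key.encode('utf-8')) % N_BUCKETS

# writing: stream records into one pickle file per bucket
outfiles = [open('uniref90_%02d.pkl' % i, 'wb') for i in range(N_BUCKETS)]
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        uniID = seq_record.id.split('_')[1]
        pickle.dump((uniID, str(seq_record.seq)),
                    outfiles[bucket_of(uniID)],
                    pickle.HIGHEST_PROTOCOL)
for f in outfiles:
    f.close()

# reading: scan only the one small bucket that can contain the key
def lookup(key):
    with open('uniref90_%02d.pkl' % bucket_of(key), 'rb') as f:
        while True:
            try:
                k, v = pickle.load(f)
            except EOFError:
                return None
            if k == key:
                return v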

And finally, always use cPickle, which is a C implementation of pickle and should be faster for both loads and dumps (though I've never run benchmarks personally).

N.B.: this solution is something of a poor man's database. Of course, @freakish's suggestion and @Jakob's solution of using a database is a really good idea. That said, the approach I'm suggesting is a common way of distributing datasets across network nodes, to spread the storage across several servers.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow