Question

I am dealing with a large protein sequence FASTA file (>8 GB), and my idea is to create a dictionary where the key and value will be the protein ID and sequence, respectively.

For now I am able to build the dictionary and dump it with pickle, then open it with cPickle (I read that pickle is faster for dumping data and cPickle is faster for loading it). But the main problem here is time: building the dictionary and dumping it takes too much time and memory (the PC has 8 GB of memory).

Is there any faster option available to deal with large files in Python?

Here is my Python code to create the dictionary and dump the data:

from Bio import SeqIO
import pickle, sys

fastaSeq = {}
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id
        uniID = header.split('_')[1]
        seqs = str(seq_record.seq)
        fastaSeq[uniID] = seqs

f = open('uniref90.obj', 'wb')
pickle.dump(fastaSeq, f, pickle.HIGHEST_PROTOCOL)
f.close()

Loading the dictionary and doing some task in a separate Python program:

import cPickle as pickle
seq_dict = pickle.load(open("uniref90.obj", "rb"))
for skey in seq_dict.keys():
    # doing something

Solution

Databases are your friend, my son.

import sqlite3
from Bio import SeqIO

db = sqlite3.connect("./db")

c = db.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS map (k text unique, v text)''')
db.commit()


def keys(db):
    cursor = db.cursor()
    return cursor.execute("""SELECT k FROM map""").fetchall()


def get(key, db, default=None):
    cursor = db.cursor()
    result = cursor.execute("""SELECT v FROM map WHERE k = ?""", (key,)).fetchone()
    if result is None:
        return default
    return result[0]


def save(key, value, db):
    cursor = db.cursor()
    cursor.execute("""INSERT INTO map VALUES (?,?)""", (key, value))
    db.commit()


with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id
        uniID = header.split('_')[1]
        seqs = str(seq_record.seq)
        save(uniID, seqs, db)
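
With the table populated, lookups can then be done on demand without holding the whole dataset in memory. A minimal usage sketch of the helpers above (the ID is just a placeholder, not taken from the real data):

seq = get('P12345', db)          # returns None if the ID is absent
if seq is not None:
    print(len(seq))

for (k,) in keys(db):            # keys() returns rows, i.e. 1-tuples
    pass                         # do something with k, or get(k, db)

db.close()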

OTHER TIPS

cPickle is fastest by far for generic data structures.
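
For instance, a minimal dump/load round trip with the highest pickle protocol might look like this (Python 2 shown, matching the code above; on Python 3 plain pickle already uses the C implementation):

import cPickle as pickle

with open('uniref90.obj', 'wb') as f:
    pickle.dump(fastaSeq, f, pickle.HIGHEST_PROTOCOL)

with open('uniref90.obj', 'rb') as f:
    seq_dict = pickle.load(f)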

Depending on the constraints imposed by your data structure, try ujson, which is one of the faster JSON implementations; note that you may need to pass some flags to preserve the accuracy of floating-point values.
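
A minimal sketch, assuming the data is a plain dict of strings (like fastaSeq above) and that the ujson package is installed; double_precision is the kind of flag meant here for controlling floating-point accuracy:

import ujson

with open('uniref90.json', 'w') as f:
    f.write(ujson.dumps(fastaSeq, double_precision=15))

with open('uniref90.json') as f:
    seq_dict = ujson.loads(f.read())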

If your data is flat, then the numpy.load variants will be faster still. There are format options, ranging from plain text to binary data.
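
For flat, numeric data that could look like the sketch below (protein strings do not really fit this case, so it is purely illustrative):

import numpy as np

arr = np.arange(1000, dtype=np.float64)

np.save('data.npy', arr)         # binary .npy: compact and fast to load
arr_bin = np.load('data.npy')

np.savetxt('data.txt', arr)      # plain text: readable, but slower and larger
arr_txt = np.loadtxt('data.txt')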

Finally, numpy.memmap allows you to access binary data without "loading" it per se; data will only reach RAM when you access it. That should be fastest by far.
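
A minimal memory-mapping sketch with an assumed dtype and shape; only the slices you actually touch are read from disk:

import numpy as np

# create a binary file to map (in practice this file already exists)
np.arange(1000, dtype=np.float32).tofile('data.bin')

# map it read-only; nothing is loaded until elements are accessed
mm = np.memmap('data.bin', dtype=np.float32, mode='r', shape=(1000,))
print(mm[100:110])  # only this slice is paged into RAM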

If you are stuck with dictionaries, the equivalent is the shelve module.
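
That maps quite directly onto the original use case: a persistent, dict-like store of ID to sequence that never has to fit in RAM at once. A minimal sketch (the file name and lookup ID are arbitrary):

import shelve
from Bio import SeqIO

store = shelve.open('uniref90.shelf')
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        uniID = seq_record.id.split('_')[1]
        store[uniID] = str(seq_record.seq)
store.close()

# later, in another program, look sequences up on demand
store = shelve.open('uniref90.shelf')
seq = store.get('P12345')  # placeholder ID
store.close()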

Well, your problem is really how to deal with a large dataset, not necessarily with Python. The thing is, whatever you try, you just don't want to load more than 8 GB into memory, because it will make your application swap, and every time you try to access an element you'll get page-ins/page-outs from RAM.

The way I'd do it is to split your dataset into multiple files that can be easily pickled, assigning each record to a file with a hash function on the key, so that the split dataset is fast to load. Then you only load the file that matches the hash of the key you're looking for, which is quick, and extract the value for that key.
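
A rough sketch of that approach: stream the FASTA records into a fixed number of bucket files chosen by a stable hash (the built-in hash() is randomized between Python runs, so zlib.crc32 is used here so that the writer and the reader agree on the buckets); the bucket count is an arbitrary assumption:

import pickle
import zlib
from Bio import SeqIO

N_BUCKETS = 64  # assumed; tune to your data size

def bucket_of(key):
    # stable across runs, unlike the built-in hash()
    return zlib.crc32(key.encode('utf-8')) % N_BUCKETS

# writing: stream records into one pickle file per bucket
outfiles = [open('uniref90_%02d.pkl' % i, 'wb') for i in range(N_BUCKETS)]
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        uniID = seq_record.id.split('_')[1]
        pickle.dump((uniID, str(seq_record.seq)),
                    outfiles[bucket_of(uniID)],
                    pickle.HIGHEST_PROTOCOL)
for f in outfiles:
    f.close()

# reading: scan only the one small bucket that can contain the key
def lookup(key):
    with open('uniref90_%02d.pkl' % bucket_of(key), 'rb') as f:
        while True:
            try:
                k, v = pickle.load(f)
            except EOFError:
                return None
            if k == key:
                return v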

And finally, always use cPickle, which is a C implementation of pickle and should be faster for both loads and dumps (though I've never run benchmarks personally).

N.B.: this solution is something of a poor man's database. Of course, @freakish's suggestion and @Jakob's solution of using a database is a really good idea. That said, the approach I'm suggesting is a common way of distributing datasets across network nodes, to spread the storage across several servers.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow