Question

My question regards list/set membership search efficiency. I want to compare a small set of genomic kmers (strings of nucleotides) to a very large list of kmers to test for membership. The large list could run into the gigabytes, as this algorithm is designed for large eukaryotic genomes.

The smaller lists will only be in the KB range, but there could be millions of them! Clearly, I am in dire need of an efficient way to search this large list.

According to similar questions I have seen, I shouldn't convert the large list to a set as it would cost too much memory. I have been using the smaller lists as sets, but it doesn't save me a noticeable amount of time.
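A minimal sketch of the trade-off described above: converting the *small* lists to sets doesn't help much, because the expensive lookup happens on the large side. The kmers and sizes here are placeholders, not real data.

```python
# Membership cost lives on the side you look things up in (illustrative sizes).
large_list = ["ACGT", "CGTA", "GTAC"] * 1000   # stand-in for the big kmer list
small_set = {"TACG", "ACGT"}                   # one of the many small kmer sets

# Scanning the list is O(len(large_list)) per query kmer:
hits_slow = [k for k in small_set if k in large_list]

# A one-time set build makes each lookup O(1) on average,
# at the price of holding the whole structure in memory:
large_set = set(large_list)
hits_fast = [k for k in small_set if k in large_set]

print(sorted(hits_slow))  # ['ACGT']
print(sorted(hits_fast))  # ['ACGT']
```

This is exactly the tension in the question: a Python set over the large list is fast but memory-hungry, which is what motivates more compact structures such as tries or Bloom filters below.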

Finally, once the script is done, it will be designed to run on large memory machines usually used for large genomic algorithms.

Thank You


Solution

Biopython has a trie implementation for exactly this purpose. It behaves like a dict with string keys (note: Bio.trie was removed in recent Biopython releases, so this requires an older version):

from Bio import trie
tr = trie.trie()
tr["ACGT"] = 1        # insert a kmer
print("ACGT" in tr)   # membership test

OTHER TIPS

A trie is likely a good data structure for the genomics, but you could also consider a BloomFilter (a probabilistic data structure that can provide amazingly space efficient set membership testing -- this is how some spell checkers store large dictionaries of valid words for example).
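To make the Bloom filter idea concrete, here is a minimal, self-contained sketch in pure Python using double hashing via hashlib. The class name, sizes, and kmers are illustrative choices, not a production-tuned implementation; real use would size the bit array and hash count from the expected number of kmers and the acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: num_hashes hash positions per item
    set in a shared bit array. Parameters here are illustrative only."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item: str):
        # Derive num_hashes positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical kmers for illustration.
bf = BloomFilter(num_bits=10_000, num_hashes=4)
for kmer in ("ACGTACGT", "TTTTAAAA", "GGGGCCCC"):
    bf.add(kmer)

print("ACGTACGT" in bf)  # True: added items are always reported present
print("ACGTACGA" in bf)  # almost certainly False (tiny false-positive rate)
```

The key design point for the genomics use case: the filter stores only bits, never the kmers themselves, so a multi-GB kmer list can be represented in a small fraction of the space. Because false positives are possible, a hit can be confirmed against the original list if exact answers are required.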

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow