Question

I have data that needs to stay in the exact order in which it was entered (genome sequencing), and I want to search approximately one billion nodes of around 18 members each to locate patterns.

Speed is obviously an issue with a data set this large, and I don't have anything I can currently use as a discrete key, since the point of the search is to locate and isolate (but not remove) duplicates.

I'm looking for an algorithm that can go through the data in a relatively short amount of time to locate these patterns and similarities. I can work out the regular expressions for the comparisons, but I'm not sure how to make the search faster than O(n).

Any help would be appreciated.

Thanks


Solution

  • What you want is probably called "de novo assembly".
  • One approach would be to compute N-mers and use them to build an index.
  • N-mers become even more important if you need partial matches / mismatches.
  • If "billion" means 1e9, Python might be too slow.
  • Also note that 18 bases × 2 bits = 36 bits of information are enough to enumerate them. That is just over 32 bits and fits comfortably into 64 bits, so hashing / bit-fiddling might be an option (see the sketch below).
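As an illustration of the last two points, here is a minimal C++ sketch (not part of the original answer) that packs each 18-base read into a 64-bit key using 2 bits per base and counts occurrences with a hash map to flag duplicates. The helper names (`baseCode`, `encode18mer`) and the A/C/G/T-only alphabet are assumptions made for the example.

```cpp
// Minimal sketch: pack an 18-mer into a 64-bit key (2 bits per base)
// and count occurrences with a hash map to flag duplicates.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Map a base to its 2-bit code; returns UINT64_MAX for unexpected characters.
static uint64_t baseCode(char b) {
    switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return UINT64_MAX;
    }
}

// Encode an 18-base string into the low 36 bits of a uint64_t.
static bool encode18mer(const std::string& s, uint64_t& key) {
    if (s.size() != 18) return false;
    key = 0;
    for (char b : s) {
        uint64_t c = baseCode(b);
        if (c == UINT64_MAX) return false;  // skip reads containing N or other symbols
        key = (key << 2) | c;
    }
    return true;
}

int main() {
    // Toy input; in practice this would stream ~1e9 reads from disk.
    std::vector<std::string> reads = {
        "ACGTACGTACGTACGTAC",
        "TTTTGGGGCCCCAAAATT",
        "ACGTACGTACGTACGTAC",  // duplicate of the first read
    };

    std::unordered_map<uint64_t, uint32_t> counts;
    counts.reserve(reads.size());

    for (const auto& r : reads) {
        uint64_t key;
        if (encode18mer(r, key)) ++counts[key];
    }

    // Report keys seen more than once, i.e. the duplicates to isolate.
    for (const auto& [key, n] : counts) {
        if (n > 1)
            std::cout << std::hex << key << std::dec << " occurs " << n << " times\n";
    }
    return 0;
}
```

Because each key fits in a machine word, a single pass over the data stays O(n), but every duplicate lookup afterwards is O(1) on average, which is roughly what an N-mer index buys you without building a full assembly pipeline.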
Licensed under: CC-BY-SA with attribution