Fast algorithm for comparing unsorted data
02-07-2021
Question
I have data that needs to stay in the exact order it was entered (genome sequencing), and I want to search approximately one billion nodes of around 18 members each to locate patterns.
Obviously, speed is an issue with a data set this large, and I don't currently have anything I can use as a discrete key, since the whole point of the search is to locate and isolate (but not remove) duplicates.
I'm looking for an algorithm that can go through the data in a relatively short amount of time to locate these patterns and similarities. I can work out the regular expressions for the comparison, but I'm not sure how to get a search faster than O(n).
Any help would be appreciated.
Thanks
Solution
- What you want is probably called "de novo assembly".
- One approach is to compute N-mers (all length-N substrings) and use them as keys in an index; see the first sketch below.
- N-mers become even more important if you need partial matches / mismatches.
- If "billion" here means 1e9, pure Python might be too slow for the inner loop.
- Also note that 18 bases × 2 bits per base = 36 bits of information are enough to enumerate all possible 18-mers. That is just over 32 bits and fits comfortably into 64 bits, so hashing / bit-fiddling on packed integers is an option; see the second sketch below.
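A minimal sketch of the N-mer index idea in Python. It assumes the input is one long A/C/G/T string and that N = 18; both are assumptions, since the question doesn't specify the data layout. Building the index is a single O(n) pass, after which each duplicate lookup is O(1) on average:

```python
from collections import defaultdict

def index_kmers(sequence, k=18):
    """Map every k-mer in `sequence` to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

# Usage: duplicates are exactly the k-mers that occur at more than one position.
seq = "ACGTACGTGGTTACGTACGTGGTTAA"
idx = index_kmers(seq, k=8)  # k=8 keeps the toy example short
dups = {kmer: positions for kmer, positions in idx.items() if len(positions) > 1}
print(dups)
```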
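And a sketch of the 2-bit packing idea: each base maps to 2 bits, so an 18-mer packs into a 36-bit integer that fits in a native 64-bit word, and a rolling window keeps the per-position cost at O(1). The base-to-bits mapping and the helper names here are illustrative choices, not fixed by the answer:

```python
from collections import defaultdict

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base (18 bases -> 36 bits)."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_TO_BITS[base]
    return code

def rolling_codes(sequence, k=18):
    """Yield (position, packed code) for every k-mer in `sequence`.

    Uses a rolling 2-bit window so each step is O(1) instead of
    re-encoding the full k-mer from scratch."""
    mask = (1 << (2 * k)) - 1          # keep only the low 2k bits
    code = 0
    for i, base in enumerate(sequence):
        code = ((code << 2) | BASE_TO_BITS[base]) & mask
        if i >= k - 1:
            yield i - k + 1, code

# Usage: integer keys hash faster and take less memory than 18-char strings.
seq = "ACGTACGTGGTTACGTACGTGGTTAA"
index = defaultdict(list)
for pos, code in rolling_codes(seq, k=8):   # k=8 keeps the toy example short
    index[code].append(pos)
assert encode_kmer(seq[:8]) == next(rolling_codes(seq, k=8))[1]  # sanity check
print({code: positions for code, positions in index.items() if len(positions) > 1})
```

At 1e9 positions even the packed-integer version strains a Python dict, which is the "python might be too weak" point above; the same scheme carries over directly to a faster language or a disk-backed key-value store.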
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow