Question

I have data that needs to stay in the exact order in which it was entered (genome sequencing), and I want to search approximately one billion nodes of around 18 members each to locate patterns.

Speed is obviously an issue with a data set this large, and I don't have anything I can currently use as a discrete key, since the point of the search is to locate and isolate (but not remove) duplicates.

I'm looking for an algorithm that can go through the data in a relatively short amount of time to locate these patterns and similarities. I can work out the regular expressions for the comparisons, but I'm not sure how to make the search faster than O(n).

Any help would be appreciated.

Thanks


Solution

  • What you want is probably called "de novo assembly".
  • One approach would be to compute N-mers and use them to build an index.
  • N-mers become even more important if you need partial matches / mismatches.
  • If "billion" means 1e9, Python might be too slow.
  • Also note that 18 bases × 2 bits = 36 bits of information are enough to enumerate them. That is just over 32 bits and fits comfortably into 64 bits, so hashing / bit-fiddling might be an option (see the sketch below).
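As an illustration of the last two points, here is a minimal C++ sketch (not part of the original answer) that packs each 18-base read into a 64-bit key using 2 bits per base and counts occurrences with a hash map to flag duplicates. The helper names (`baseCode`, `encode18mer`) and the A/C/G/T-only alphabet are assumptions made for the example.

```cpp
// Minimal sketch: pack an 18-mer into a 64-bit key (2 bits per base)
// and count occurrences with a hash map to flag duplicates.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Map a base to its 2-bit code; returns UINT64_MAX for unexpected characters.
static uint64_t baseCode(char b) {
    switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return UINT64_MAX;
    }
}

// Encode an 18-base string into the low 36 bits of a uint64_t.
static bool encode18mer(const std::string& s, uint64_t& key) {
    if (s.size() != 18) return false;
    key = 0;
    for (char b : s) {
        uint64_t c = baseCode(b);
        if (c == UINT64_MAX) return false;  // skip reads containing N or other symbols
        key = (key << 2) | c;
    }
    return true;
}

int main() {
    // Toy input; in practice this would stream ~1e9 reads from disk.
    std::vector<std::string> reads = {
        "ACGTACGTACGTACGTAC",
        "TTTTGGGGCCCCAAAATT",
        "ACGTACGTACGTACGTAC",  // duplicate of the first read
    };

    std::unordered_map<uint64_t, uint32_t> counts;
    counts.reserve(reads.size());

    for (const auto& r : reads) {
        uint64_t key;
        if (encode18mer(r, key)) ++counts[key];
    }

    // Report keys seen more than once, i.e. the duplicates to isolate.
    for (const auto& [key, n] : counts) {
        if (n > 1)
            std::cout << std::hex << key << std::dec << " occurs " << n << " times\n";
    }
    return 0;
}
```

Because each key fits in a machine word, a single pass over the data stays O(n), but every duplicate lookup afterwards is O(1) on average, which is roughly what an N-mer index buys you without building a full assembly pipeline.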
Licensed under: CC-BY-SA with attribution