Question

I have a huge list (N = ~1 million) of strings, each 100 characters long, that I'm trying to find the overlaps between. For instance, one string might be

XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)

I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)

read in all strings to array

create empty NxN matrix

compare each string to every string with a higher array index (to avoid redoing comparisons)

Write longest overlap to matrix

There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on, this method takes days.

In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a quality score below the threshold), and every other character is a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.

I'd love any suggestions you can give for a more efficient method; I'm working in C++, but pseudocode, ideas, or code in another language would also be very helpful.

Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed.

//part that compares them all to each other
for (int j = 0; j < counter; j++)          //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //...write the longest overlap into the matrix...
    }

//boring stuff

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(const string& str1, const string& str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < (int)str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //stop at the end of either string to avoid reading out of bounds
        while (offset < (int)str1.length() && j + offset < (int)str2.length()
               && str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }

        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings

Solution

Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means about N^2/2 comparisons (with N = 10^6 that is roughly 5 * 10^11 pairs), and you will need half a terabyte of memory even if you store just one byte per string pair.

However, I assume what you are really interested in is long matches: pairs that have more than, say, 20 or 30 or even more than 80 characters in common. You probably don't really want to know whether two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.

What I'd try if I were you - still without knowing whether that fits your application - is the following (a rough code sketch of these steps appears after the list):

1) From each string, extract the largest substring(s) that make sense. I guess you want to ignore the 'X's at the start and end entirely, and if the "readable" parts are broken up by long runs of 'X's, it probably makes sense to treat those readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" question depends on your data and application, which I don't really know.

2) Make a list of these longest substrings, together with the number of occurrences of each substring, and order this list by string length, longest first. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example)

AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC    1
......

3) Now, from the top to the bottom of the list:

a) Set the "current string" to the topmost string on the list.

b) If the occurrence count next to the current string is > 1, you have found a match. Search your original strings for the substring if you haven't stored the indexes, and mark the match.

c) Compare the current string with all strings of the same length, to find matches where some characters are X.

d) Remove the first character from the current string. If the resulting string is already in your table, increase its occurrence counter by one; otherwise enter it into the table.

e) Repeat 3d with the last, instead of the first, character removed from the current string.

f) Remove the current string from the list.

g) Repeat from 3a) until you run out of computing time, or until your remaining strings become too short to be interesting.
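To make the bookkeeping concrete, here is a rough C++ sketch of steps 1-3. The helper readableParts, the 20-character cut-off minInteresting, and the longest-first priority queue are my own assumptions, and the X-tolerant comparison of step 3c is omitted:

#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed cut-off below which matches are no longer interesting.
const std::size_t minInteresting = 20;

// Step 1 (assumed rule): keep the maximal runs of non-'X' characters
// that are long enough to be worth looking at.
std::vector<std::string> readableParts(const std::string& s)
{
    std::vector<std::string> parts;
    std::string cur;
    for (char c : s)
    {
        if (c != 'X') { cur += c; continue; }
        if (cur.size() >= minInteresting)
            parts.push_back(cur);
        cur.clear();
    }
    if (cur.size() >= minInteresting)
        parts.push_back(cur);
    return parts;
}

int main()
{
    std::vector<std::string> reads = { /* ... the N input strings ... */ };

    // Step 2: occurrence count per substring, plus a longest-first worklist.
    std::unordered_map<std::string, int> count;
    auto shorterFirst = [](const std::string& a, const std::string& b)
                        { return a.size() < b.size(); };
    std::priority_queue<std::string, std::vector<std::string>,
                        decltype(shorterFirst)> work(shorterFirst);

    for (const std::string& r : reads)
        for (const std::string& p : readableParts(r))
            if (++count[p] == 1)
                work.push(p);

    // Step 3: always take the longest remaining string, report collisions,
    // then re-enter it with the first and the last character removed.
    std::set<std::string> done;
    while (!work.empty())
    {
        std::string cur = work.top();
        work.pop();
        if (!done.insert(cur).second)        // 3f: already handled
            continue;
        if (cur.size() < minInteresting)     // 3g: too short to be interesting
            break;

        if (count[cur] > 1)                  // 3b: at least two reads share it
            std::cout << "overlap candidate: " << cur << '\n';

        // 3c (comparing against same-length strings with 'X' wildcards)
        // is omitted from this sketch.

        // 3d/3e: shorten by one character at each end and re-enter.
        for (const std::string& next : { cur.substr(1),
                                         cur.substr(0, cur.size() - 1) })
            if (++count[next] == 1)
                work.push(next);
    }
}

The priority queue just enforces the "topmost = longest remaining" order from step 3a; a plain list kept sorted by length would do the same job.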

Whether this is a better algorithm depends very much on your data and on which comparisons you're really interested in. If your data is very random/you have very few matches, it will probably take longer than your original idea, but it might let you find the interesting parts first and skip the less interesting ones.

OTHER TIPS

I don't see many ways around the fact that you need to compare each string with every other string, including shifting them, and that is by itself extremely time-consuming; a computing cluster seems like the best approach.

The only thing I see to improve is the string comparison itself: replace A, C, T, G and X with bit patterns:

  • A = 0x01
  • C = 0x02
  • T = 0x04
  • G = 0x08
  • X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (the packing might not be worth it, but it's still an option to investigate), and then compare positions quickly with an AND operation: two positions are compatible exactly when the AND is non-zero, so you 'just' have to count how many consecutive non-zero values you get. That is really just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
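A minimal sketch of that idea, assuming each string is encoded once up front and the offset loop from determineBestOverlap stays unchanged; only the inner character test changes, and the two-bases-per-byte packing is left out:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Assumed encoding, as listed above: one bit per base, all four bits for 'X'.
inline std::uint8_t encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;   // 'X' is compatible with any base
    }
}

std::vector<std::uint8_t> encode(const std::string& s)
{
    std::vector<std::uint8_t> out(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
        out[i] = encodeBase(s[i]);
    return out;
}

// Longest run of positions where the two encoded strings are compatible,
// at a fixed alignment: one AND per position replaces the character
// comparison plus the 'X' test.
std::size_t longestCompatibleRun(const std::vector<std::uint8_t>& a,
                                 const std::vector<std::uint8_t>& b)
{
    std::size_t best = 0, run = 0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        if (a[i] & b[i]) { ++run; best = std::max(best, run); }
        else             run = 0;
    }
    return best;
}

Packing two bases per byte (or several per machine word) would let you AND more than one position at a time, at the cost of fiddlier indexing.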

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow