Question

I have a huge list (N = ~1 million) of strings, each 100 characters long, that I'm trying to find the overlaps between. For instance, one string might be

XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)

I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)

read in all strings to array

create empty NxN matrix

compare each string to every string with a higher array index (to avoid redoing comparisons)

Write longest overlap to matrix

There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on, this method takes days.

In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a quality score below the threshold), and every other character is a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.

I'd love any suggestions you can give for a more efficient method; I'm working in C++, but pseudocode, ideas, or code in another language would also be very helpful.

Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed.

//part that compares them all to each other
for (int j = 0; j < counter; j++)          //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //...write the longest overlap into the matrix...
    }

//boring stuff

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(const string& str1, const string& str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < (int)str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //stop at the end of either string to avoid reading out of bounds
        while (offset < (int)str1.length() && j + offset < (int)str2.length()
               && str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }

        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings

Solution

Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means about N^2/2 comparisons (with N = 10^6 that is roughly 5 * 10^11 pairs), and you will need half a terabyte of memory even if you store just one byte per string pair.

However, I assume what you are really interested in is long matches: pairs that have more than, say, 20 or 30 or even more than 80 characters in common. You probably don't really want to know whether two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.

What I'd try if I were you - still without knowing whether that fits your application - is the following (a rough code sketch of these steps appears after the list):

1) From each string, extract the largest substring(s) that make sense. I guess you want to ignore the 'X's at the start and end entirely, and if the "readable" parts are broken up by long runs of 'X's, it probably makes sense to treat those readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" question depends on your data and application, which I don't really know.

2) Make a list of these longest substrings, together with the number of occurrences of each substring, and order this list by string length, longest first. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example)

AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC    1
......

3) Now, from the top to the bottom of the list:

a) Set the "current string" to the topmost string on the list.

b) If the occurrence count next to the current string is > 1, you have found a match. Search your original strings for the substring if you haven't stored the indexes, and mark the match.

c) Compare the current string with all strings of the same length, to find matches where some characters are X.

d) Remove the first character from the current string. If the resulting string is already in your table, increase its occurrence counter by one; otherwise enter it into the table.

e) Repeat 3d with the last, instead of the first, character removed from the current string.

f) Remove the current string from the list.

g) Repeat from 3a) until you run out of computing time, or until your remaining strings become too short to be interesting.
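To make the bookkeeping concrete, here is a rough C++ sketch of steps 1-3. The helper readableParts, the 20-character cut-off minInteresting, and the longest-first priority queue are my own assumptions, and the X-tolerant comparison of step 3c is omitted:

#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed cut-off below which matches are no longer interesting.
const std::size_t minInteresting = 20;

// Step 1 (assumed rule): keep the maximal runs of non-'X' characters
// that are long enough to be worth looking at.
std::vector<std::string> readableParts(const std::string& s)
{
    std::vector<std::string> parts;
    std::string cur;
    for (char c : s)
    {
        if (c != 'X') { cur += c; continue; }
        if (cur.size() >= minInteresting)
            parts.push_back(cur);
        cur.clear();
    }
    if (cur.size() >= minInteresting)
        parts.push_back(cur);
    return parts;
}

int main()
{
    std::vector<std::string> reads = { /* ... the N input strings ... */ };

    // Step 2: occurrence count per substring, plus a longest-first worklist.
    std::unordered_map<std::string, int> count;
    auto shorterFirst = [](const std::string& a, const std::string& b)
                        { return a.size() < b.size(); };
    std::priority_queue<std::string, std::vector<std::string>,
                        decltype(shorterFirst)> work(shorterFirst);

    for (const std::string& r : reads)
        for (const std::string& p : readableParts(r))
            if (++count[p] == 1)
                work.push(p);

    // Step 3: always take the longest remaining string, report collisions,
    // then re-enter it with the first and the last character removed.
    std::set<std::string> done;
    while (!work.empty())
    {
        std::string cur = work.top();
        work.pop();
        if (!done.insert(cur).second)        // 3f: already handled
            continue;
        if (cur.size() < minInteresting)     // 3g: too short to be interesting
            break;

        if (count[cur] > 1)                  // 3b: at least two reads share it
            std::cout << "overlap candidate: " << cur << '\n';

        // 3c (comparing against same-length strings with 'X' wildcards)
        // is omitted from this sketch.

        // 3d/3e: shorten by one character at each end and re-enter.
        for (const std::string& next : { cur.substr(1),
                                         cur.substr(0, cur.size() - 1) })
            if (++count[next] == 1)
                work.push(next);
    }
}

The priority queue just enforces the "topmost = longest remaining" order from step 3a; a plain list kept sorted by length would do the same job.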

Whether this is a better algorithm depends very much on your data and on which comparisons you're really interested in. If your data is very random/you have very few matches, it will probably take longer than your original idea, but it might let you find the interesting parts first and skip the less interesting ones.

OTHER TIPS

I don't see many ways around the fact that you need to compare each string with every other string, including shifting them, and that is by itself extremely time-consuming; a computing cluster seems like the best approach.

The only thing I see to improve is the string comparison itself: replace A, C, T, G and X with bit patterns:

  • A = 0x01
  • C = 0x02
  • T = 0x04
  • G = 0x08
  • X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (the packing might not be worth it, but it's still an option to investigate), and then compare positions quickly with an AND operation: two positions are compatible exactly when the AND is non-zero, so you 'just' have to count how many consecutive non-zero values you get. That is really just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
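A minimal sketch of that idea, assuming each string is encoded once up front and the offset loop from determineBestOverlap stays unchanged; only the inner character test changes, and the two-bases-per-byte packing is left out:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Assumed encoding, as listed above: one bit per base, all four bits for 'X'.
inline std::uint8_t encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;   // 'X' is compatible with any base
    }
}

std::vector<std::uint8_t> encode(const std::string& s)
{
    std::vector<std::uint8_t> out(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
        out[i] = encodeBase(s[i]);
    return out;
}

// Longest run of positions where the two encoded strings are compatible,
// at a fixed alignment: one AND per position replaces the character
// comparison plus the 'X' test.
std::size_t longestCompatibleRun(const std::vector<std::uint8_t>& a,
                                 const std::vector<std::uint8_t>& b)
{
    std::size_t best = 0, run = 0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        if (a[i] & b[i]) { ++run; best = std::max(best, run); }
        else             run = 0;
    }
    return best;
}

Packing two bases per byte (or several per machine word) would let you AND more than one position at a time, at the cost of fiddlier indexing.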

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow