Question

I would like to calculate the number of pairwise differences between a long list of sequences, and put it back into a matrix form.

I have a few hundred genetic sequences, and each sequence is already aligned and has the same length (about 300 characters). I'm not looking for one of the edit distance algorithms (hamming's, leveinstein's, etc) but instead would like to get the number of absolute differences between two sequences. The sequences would have to be compared at each character position.

For example,

Sequence 1: "GAT-ACA"
Sequence 2: "AT-GCGA"
Number of differences: 6

(The dash is there to allow the sequences to be aligned, and my sequences may also include dashes).

Would there be any efficient way to do this using python (or other language), with a short computing time? I also asked this question in R, initially intending to do it that way, but it turned out too slow to be feasible to apply to several hundred sequences.

Thank you!

Was it helpful?

Solution

If you want to calculate the matrix that displays the differences between the pairs you can do it like this:

import numpy as np

def get_difference(x,y):
    return sum(ele_x != ele_y for ele_x, ele_y in zip(x,y))

my_list = ['abcde','abcwe','zbfwe']
n = len(my_list)

my_array = np.zeros((n,n))
#
for i, ele_1 in enumerate(my_list):
    for j, ele_2 in enumerate(my_list):
        if j >= i:
            break # Since the matrix is symmetrical we don't need to
                  # calculate everything
        difference = get_difference(ele_1, ele_2)  
        my_array[i, j] = difference
        my_array[j, i] = difference

Result:

>>> my_array
array([[ 0.,  1.,  3.],
       [ 1.,  0.,  2.],
       [ 3.,  2.,  0.]])

The resulting matrix (OK array) shows the differences between the pairs.

OTHER TIPS

Zip the strings into tuples - then use sum to count the boolean Trues (i.e. non-matching points).

s1 = "GAT-ACA"
s2 = "AT-GCGA"

print sum(a != b for (a,b) in zip(s1, s2))

If I understand well, just:

diff = lambda seq1, seq2: sum(seq1[i]!=seq2[i] for i in range(len(seq1)))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top