Comparing and visualising groups of sequences

https://stackoverflow.com/questions/12436348

string
matlab
sequence
edit-distance
markov-chains

02-07-2021

Question

I have two groups, A and B, of strings over the letters "AGTE", and I'd like to find some way of comparing them to see whether they are statistically similar. Group A contains real-world observations; group B contains predictions. There are 400 or so strings in each group, e.g.:

**A**
GTAATEGTTTEAAA
TTEAGE
...

**B**
AGTEAAAAGT
TAT
GGATEAATGGGTEAATG
....

I'd also like to be able to visualise these in some way, mainly for presentation purposes. Do you have any ideas how I might do that?

Solution

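I'd suggest computing the Levenshtein (edit) distance between each pair of strings; you can then plot these inter-string distances. The distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other, so larger values indicate strings that are more dissimilar.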
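
If you do want to write it yourself, here is a minimal sketch of the standard dynamic-programming approach with unit costs. The function names `levenshteinDistance` and `pairwiseLevenshtein` are my own, not from any toolbox, and I'm assuming the strings are stored as a cell array of character vectors. Each function would go in its own .m file.

```matlab
function d = levenshteinDistance(s, t)
% LEVENSHTEINDISTANCE  Edit distance between two character vectors.
%   Counts the minimum number of single-character insertions, deletions
%   and substitutions needed to turn s into t (every edit costs 1).
    m = numel(s);
    n = numel(t);
    % prev(j+1) is the distance between the first i-1 characters of s
    % and the first j characters of t; row 0 is just 0..n insertions.
    prev = 0:n;
    for i = 1:m
        curr = zeros(1, n + 1);
        curr(1) = i;                          % delete the first i chars of s
        for j = 1:n
            cost = double(s(i) ~= t(j));      % 0 if the characters match
            curr(j + 1) = min([prev(j + 1) + 1, ...   % deletion
                               curr(j) + 1, ...       % insertion
                               prev(j) + cost]);      % substitution or match
        end
        prev = curr;
    end
    d = prev(n + 1);
end

function D = pairwiseLevenshtein(strs)
% PAIRWISELEVENSHTEIN  Symmetric matrix of distances between all strings.
%   strs is assumed to be a cell array of character vectors.
    n = numel(strs);
    D = zeros(n);
    for i = 1:n
        for j = i + 1:n
            D(i, j) = levenshteinDistance(strs{i}, strs{j});
            D(j, i) = D(i, j);                % distance is symmetric
        end
    end
end
```

Assuming `A` and `B` are cell arrays holding your two groups, `D = pairwiseLevenshtein([A(:); B(:)])` would give one matrix with group A in the first rows and columns and group B in the rest. With 400 or so strings per group that is roughly 320,000 pairs, so plain loops may take a little while.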

If you don't want to implement the Levenshtein distance calculation yourself, there are several submissions on the MATLAB File Exchange that do it.
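
For the plotting step, one option (a sketch only, and just one of many ways to present this) is a heat map of the distance matrix next to histograms of the within-group and between-group distances. The function name `plotGroupDistances` is hypothetical. It assumes you already have a symmetric pairwise distance matrix `D`, however it was computed, with the `nA` strings of group A in the first rows and columns and group B in the rest.

```matlab
function [withinA, withinB, between] = plotGroupDistances(D, nA)
% PLOTGROUPDISTANCES  Visualise a pairwise distance matrix for two groups.
%   D  : symmetric N-by-N distance matrix. Rows/columns 1..nA are assumed
%        to be group A and rows/columns nA+1..N group B.
%   Returns the distances used in the histograms.
    N  = size(D, 1);
    nB = N - nA;

    % Left panel: the full distance matrix as an image.
    figure;
    subplot(1, 2, 1);
    imagesc(D);
    axis square;
    colorbar;
    hold on;
    % White lines mark the boundary between group A and group B.
    plot([nA nA] + 0.5, [0.5 N + 0.5], 'w-');
    plot([0.5 N + 0.5], [nA nA] + 0.5, 'w-');
    hold off;
    title('Pairwise Levenshtein distance');
    xlabel('String index');
    ylabel('String index');

    % Within-group distances: upper triangle only, so each pair counts once.
    DA = D(1:nA, 1:nA);
    DB = D(nA + 1:end, nA + 1:end);
    withinA = DA(triu(true(nA), 1));
    withinB = DB(triu(true(nB), 1));
    % Between-group distances: every A string against every B string.
    between = D(1:nA, nA + 1:end);
    between = between(:);

    % Right panel: overlaid histograms with one bin per integer distance.
    subplot(1, 2, 2);
    histogram(withinA, 'BinMethod', 'integers', 'Normalization', 'probability');
    hold on;
    histogram(withinB, 'BinMethod', 'integers', 'Normalization', 'probability');
    histogram(between, 'BinMethod', 'integers', 'Normalization', 'probability');
    hold off;
    legend('within A', 'within B', 'A vs B');
    xlabel('Levenshtein distance');
    ylabel('Fraction of pairs');
end
```

Roughly speaking, if the predictions resemble the observations you would expect the "A vs B" histogram to look much like the two within-group ones, and the off-diagonal blocks of the heat map to look like the diagonal blocks. Treat that as a visual check rather than a formal test. Also keep in mind that longer strings tend to have larger distances simply because of their length.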

Licensed under: CC-BY-SA with attribution
