Question

I have two groups, A and B, of strings over the letters "AGTE", and I'd like to find some way of comparing them to see whether they are statistically similar. Group A contains real-world observations; group B contains predictions. There are 400 or so strings in each group, e.g.:

**A**
GTAATEGTTTEAAA
TTEAGE
...

**B**
AGTEAAAAGT
TAT
GGATEAATGGGTEAATG
....

I'd also like to be able to visualise these in some way, mainly for presentation purposes. Do you have any ideas how I might do that?


Solution

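I'd suggest computing the Levenshtein (edit) distance between the strings, i.e. the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. You can then plot these inter-string distances. Larger values indicate strings that are more dissimilar.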

If you don't want to implement the Levenshtein distance calculation yourself, check out these submissions on file exchange.
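If you would rather see what the calculation involves, here is a minimal sketch in Python. The choice of Python is an assumption, since the question does not name a language, and the names `levenshtein`, `distance_matrix`, `group_a` and `group_b` are hypothetical, made up for this illustration. It uses the standard two-row dynamic programming approach and then builds the matrix of distances between every observed string and every predicted string, using the sample strings from the question.

```python
def levenshtein(s, t):
    """Return the edit distance between strings s and t.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn s into t.
    """
    # Keep the shorter string in the inner loop to use less memory.
    if len(s) < len(t):
        s, t = t, s

    # previous[j] holds the distance between the first i-1 characters
    # of s and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete a character from s
                current[j - 1] + 1,      # insert a character into s
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def distance_matrix(group_a, group_b):
    """Hypothetical helper: rows are strings in A, columns are strings in B."""
    return [[levenshtein(a, b) for b in group_b] for a in group_a]


# Sample strings taken from the question (the full groups have ~400 each).
group_a = ["GTAATEGTTTEAAA", "TTEAGE"]
group_b = ["AGTEAAAAGT", "TAT", "GGATEAATGGGTEAATG"]

matrix = distance_matrix(group_a, group_b)
for row in matrix:
    print(row)
```

One possible way to visualise the result is to draw `matrix` as a heat map with whatever plotting tool you already use, so that blocks of small distances stand out. With around 400 strings per group this is roughly 160,000 distance computations, which should be manageable for strings of this length.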

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow