Algorithm to solve Local Alignment

https://stackoverflow.com//questions/22015325

21-12-2019

Question

Local alignment between X and Y, with at least one column aligning a C to a W.

Given two sequences X of length n and Y of length m, we are looking for a highest-scoring local alignment (i.e., an alignment between a substring X' of X and a substring Y' of Y) that has at least one column in which a C from X' is aligned to a W from Y' (if such an alignment exists). As scoring model, we use a substitution matrix s and linear gap penalties with parameter d.

Write code to solve the problem efficiently. If you use dynamic programming, it suffices to give the equations for computing the entries in the dynamic programming matrices, and to specify where traceback starts and ends.

My Solution:

I've taken two sequences, "HCEA" and "HWEA", and tried to solve the question. Here is my code. Have I fulfilled what the question asks? If I am wrong, kindly tell me where I have gone wrong so that I can modify my code.

Also, is there any other way to solve the question? If so, can anyone post pseudocode or an algorithm, so that I can code it?

public class Q1 {

    public static void main(String[] args) {
        //  Input Protein Sequences 
        String seq1 = "HCEA";  
        String seq2 = "HWEA";

        //  Array to store the score
        int[][] T = new int[seq1.length() + 1][seq2.length() + 1];

        //  initialize seq1
        for (int i = 0; i <= seq1.length(); i++) {
            T[i][0] = i;
        }

        //  Initialize seq2
        for (int i = 0; i <= seq2.length(); i++) {
            T[0][i] = i;
        }

        //  Compute the matrix score
        for (int i = 1; i <= seq1.length(); i++) {
            for (int j = 1; j <= seq2.length(); j++) {
                if ((seq1.charAt(i - 1) == seq2.charAt(j - 1))
                        || (seq1.charAt(i - 1) == 'C') && (seq2.charAt(j - 1) == 'W')) {
                    T[i][j] = T[i - 1][j - 1];
                } else {
                    T[i][j] = Math.min(T[i - 1][j], T[i][j - 1]) + 1;
                }
            }
        }

        //  Strings to store the aligned sequences
        StringBuilder alignedSeq1 = new StringBuilder();
        StringBuilder alignedSeq2 = new StringBuilder();

        //  Build for sequences 1 & 2 from the matrix score
        for (int i = seq1.length(), j = seq2.length(); i > 0 || j > 0;) {
            if (i > 0 && T[i][j] == T[i - 1][j] + 1) {
                alignedSeq1.append(seq1.charAt(--i));
                alignedSeq2.append("-");
            } else if (j > 0 && T[i][j] == T[i][j - 1] + 1) {
                alignedSeq2.append(seq2.charAt(--j));
                alignedSeq1.append("-");
            } else if (i > 0 && j > 0 && T[i][j] == T[i - 1][j - 1]) {
                alignedSeq1.append(seq1.charAt(--i));
                alignedSeq2.append(seq2.charAt(--j));
            }
        }

        //  Display the aligned sequence
        System.out.println(alignedSeq1.reverse().toString());
        System.out.println(alignedSeq2.reverse().toString());
    }
}

@Shole The following are two questions and answers from my solved worksheet.

Aligning a suffix of X to a prefix of Y

Given two sequences X and Y, we are looking for a highest-scoring alignment between any suffix of X and any prefix of Y. As a scoring model, we use a substitution matrix s and linear gap penalties with parameter d. Give an efficient algorithm to solve this problem optimally in time O(nm), where n is the length of X and m is the length of Y. If you use a dynamic programming approach, it suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends.

Solution: Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:

F[0][0]=0
  for i = 1..n: F[i][0]=0
  for j = 1..m: F[0][j]=-j*d, P[0][j]=L
  for i = 1..n, j = 1..m:
      F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
      P[i][j] = D, T or L according to which of the three expressions above is the maximum

Once we have computed F and P, we find the largest value in the bottom row of the matrix F. Let F[n][j0] be that largest value. We start traceback at F[n][j0] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
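The recurrence and traceback rule above can be sketched in Java roughly as follows. This is only a minimal sketch: the class name, the Result holder and the scoring function s (+1 for a match, -1 for a mismatch) are assumptions made for illustration and stand in for whatever substitution matrix the worksheet intends.

```java
final class SuffixPrefixAlignment {

    /** Aligned strings plus the score of the best suffix/prefix alignment. */
    static final class Result {
        final String alignedX, alignedY;
        final int score;
        Result(String alignedX, String alignedY, int score) {
            this.alignedX = alignedX; this.alignedY = alignedY; this.score = score;
        }
    }

    // Assumption: stand-in for the substitution matrix s (+1 match, -1 mismatch).
    static int s(char a, char b) {
        return a == b ? 1 : -1;
    }

    static Result align(String x, String y, int d) {
        int n = x.length(), m = y.length();
        int[][] f = new int[n + 1][m + 1];   // F[i][0] stays 0: a prefix of X may be skipped for free
        char[][] p = new char[n + 1][m + 1]; // traceback: 'D' diagonal, 'T' top, 'L' left
        for (int j = 1; j <= m; j++) {
            f[0][j] = -j * d;
            p[0][j] = 'L';
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = f[i - 1][j - 1] + s(x.charAt(i - 1), y.charAt(j - 1));
                int top = f[i - 1][j] - d;
                int left = f[i][j - 1] - d;
                f[i][j] = diag; p[i][j] = 'D';
                if (top > f[i][j]) { f[i][j] = top; p[i][j] = 'T'; }
                if (left > f[i][j]) { f[i][j] = left; p[i][j] = 'L'; }
            }
        }
        // Traceback starts at the largest value in the bottom row ...
        int j0 = 0;
        for (int j = 1; j <= m; j++) {
            if (f[n][j] > f[n][j0]) j0 = j;
        }
        // ... and ends as soon as the first column is reached.
        StringBuilder ax = new StringBuilder(), ay = new StringBuilder();
        int i = n, j = j0;
        while (j > 0) {
            char move = p[i][j];
            if (move == 'D') { ax.append(x.charAt(--i)); ay.append(y.charAt(--j)); }
            else if (move == 'T') { ax.append(x.charAt(--i)); ay.append('-'); }
            else { ax.append('-'); ay.append(y.charAt(--j)); }
        }
        return new Result(ax.reverse().toString(), ay.reverse().toString(), f[n][j0]);
    }

    public static void main(String[] args) {
        Result r = align("AAACGT", "CGTTT", 2);
        System.out.println(r.alignedX + "\n" + r.alignedY + "\nscore = " + r.score);
    }
}
```

With the assumed scoring and d = 2, the suffix "CGT" of "AAACGT" is aligned to the prefix "CGT" of "CGTTT" with score 3.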

Aligning Y to a substring of X, without gaps in Y

Given a string X of length n and a string Y of length m, we want to compute a highest-scoring alignment of Y to any substring of X, with the extra constraint that we are not allowed to insert any gaps into Y. In other words, the output is an alignment of a substring X' of X with the string Y, such that the score of the alignment is the largest possible (among all choices of X') and such that the alignment does not introduce any gaps into Y (but may introduce gaps into X'). As a scoring model, we again use a substitution matrix s and linear gap penalties with parameter d. Give an efficient dynamic programming algorithm that solves this problem optimally in polynomial time. It suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends. What is the running time of your algorithm?

Solution:

Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j, such that the alignment does not insert gaps in Y. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:

F[0][0]=0
  for i = 1..n: F[i][0]=0
  for j = 1..m: F[0][j]=-j*d, P[0][j]=L
  for i = 1..n, j = 1..m:
      F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i][j-1]-d }
      P[i][j] = D or L according to which of the two expressions above is the maximum

Once we have computed F and P, we find the largest value in the rightmost column of the matrix F. Let F[i0][m] be that largest value. We start traceback at F[i0][m] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
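A minimal Java sketch of this second recurrence is given below. Only two moves remain (diagonal and left), the traceback starts at the maximum of the rightmost column, and it stops at the first column. The class name, the Result holder and the scoring function s (+1 match, -1 mismatch) are assumptions for illustration, not part of the worksheet.

```java
final class NoGapInYAlignment {

    /** Aligned strings plus the score of the best alignment of Y to a substring of X. */
    static final class Result {
        final String alignedX, alignedY;
        final int score;
        Result(String alignedX, String alignedY, int score) {
            this.alignedX = alignedX; this.alignedY = alignedY; this.score = score;
        }
    }

    // Assumption: stand-in for the substitution matrix s (+1 match, -1 mismatch).
    static int s(char a, char b) {
        return a == b ? 1 : -1;
    }

    static Result align(String x, String y, int d) {
        int n = x.length(), m = y.length();
        int[][] f = new int[n + 1][m + 1];   // F[i][0] stays 0: X' may start anywhere in X
        char[][] p = new char[n + 1][m + 1]; // traceback: 'D' diagonal or 'L' left only
        for (int j = 1; j <= m; j++) {
            f[0][j] = -j * d;
            p[0][j] = 'L';
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = f[i - 1][j - 1] + s(x.charAt(i - 1), y.charAt(j - 1));
                int left = f[i][j - 1] - d; // Y[j-1] faces a gap in X'
                // There is no "top" move: it would put a gap into Y.
                if (diag >= left) { f[i][j] = diag; p[i][j] = 'D'; }
                else { f[i][j] = left; p[i][j] = 'L'; }
            }
        }
        // Traceback starts at the largest value in the rightmost column ...
        int i0 = 0;
        for (int i = 1; i <= n; i++) {
            if (f[i][m] > f[i0][m]) i0 = i;
        }
        // ... and ends when the first column is reached.
        StringBuilder ax = new StringBuilder(), ay = new StringBuilder();
        int i = i0, j = m;
        while (j > 0) {
            if (p[i][j] == 'D') { ax.append(x.charAt(--i)); ay.append(y.charAt(--j)); }
            else { ax.append('-'); ay.append(y.charAt(--j)); }
        }
        return new Result(ax.reverse().toString(), ay.reverse().toString(), f[i0][m]);
    }

    public static void main(String[] args) {
        Result r = align("ACT", "ACGT", 1);
        System.out.println(r.alignedX + "\n" + r.alignedY + "\nscore = " + r.score);
    }
}
```

Every step of the traceback consumes one character of Y, so the aligned Y never contains a gap, while X' may (for example "AC-T" against "ACGT" in the call above).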

I hope this gives you some idea of what I really need.

Solution

I think it's quite easy to find resources, or even the answer, with Google, as the first search result is already a thorough DP solution.

However, I appreciate that you would like to think over the solution by yourself and are requesting some hints.

Before I give out some of the hints, I would like to say something about designing a DP solution (I assume you know this can be solved by a DP solution)

A DP solution basically consists of four parts:

1. DP state: you have to define the physical meaning of one state yourself, e.g. a[i] := the money the i-th person has; a[i][j] := the number of TV programmes between time i and time j; etc.

2. Transition equations

3. Initial state / base case

4. How to query the answer, e.g. is the answer a[n]? Or is the answer max(a[i])?

Just some 2 cents on a DP solution, let's go back to the question :)

Here are some hints I can think of:

1. What is the DP state? How many dimensions are enough to define such a state? Since you are solving a problem much like the common substring problem (on 2 strings), 1 dimension seems too few and 3 dimensions seem too many, right?

2. As mentioned in point 1, this problem is very similar to the common substring problem, so maybe you should have a look at such problems to get some ideas: LCS, LIS, Edit Distance, etc.
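For readers who want to check their own attempt against something concrete, here is one possible way to act on these hints. It is not taken from the answer itself and is only a sketch under assumptions. It keeps two matrices indexed by (i, j): F0[i][j] is the usual local alignment score ending at (i, j), and F1[i][j] is the best score of an alignment ending at (i, j) that already contains a column aligning C (from X) to W (from Y). The scoring function s (+2 match, -1 mismatch) and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.OptionalInt;

final class ConstrainedLocalAlignment {

    // Marks "no alignment with a C/W column ends here".
    static final int NEG_INF = Integer.MIN_VALUE / 2;

    // Assumption: stand-in for the substitution matrix s (+2 match, -1 mismatch).
    static int s(char a, char b) {
        return a == b ? 2 : -1;
    }

    /** Best score of a local alignment with at least one C/W column, or empty if none exists. */
    static OptionalInt bestScore(String x, String y, int d) {
        int n = x.length(), m = y.length();
        int[][] f0 = new int[n + 1][m + 1]; // unconstrained local alignment, borders are 0
        int[][] f1 = new int[n + 1][m + 1]; // constrained layer, borders are NEG_INF
        for (int[] row : f1) Arrays.fill(row, NEG_INF);
        int best = NEG_INF;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = x.charAt(i - 1), b = y.charAt(j - 1);
                int sub = s(a, b);
                f0[i][j] = Math.max(0, Math.max(f0[i - 1][j - 1] + sub,
                        Math.max(f0[i - 1][j] - d, f0[i][j - 1] - d)));
                // Extend an alignment that already has its C/W column (no reset to 0 here).
                int v = Math.max(f1[i - 1][j - 1] + sub,
                        Math.max(f1[i - 1][j] - d, f1[i][j - 1] - d));
                // Or create the C/W column now, on top of an unconstrained alignment.
                if (a == 'C' && b == 'W') v = Math.max(v, f0[i - 1][j - 1] + sub);
                if (v < NEG_INF / 2) v = NEG_INF; // keep "impossible" cells normalised
                f1[i][j] = v;
                best = Math.max(best, v);
            }
        }
        // A traceback would start at the cell of f1 holding 'best', follow the moves in f1,
        // switch to f0 at the C/W column that was created, and end where f0 reaches 0.
        return best == NEG_INF ? OptionalInt.empty() : OptionalInt.of(best);
    }

    public static void main(String[] args) {
        System.out.println(bestScore("HCEA", "HWEA", 2));
        System.out.println(bestScore("HEA", "HWEA", 2));
    }
}
```

The state is still two-dimensional in (i, j); the second matrix only records whether the required column has been used yet, so the running time stays O(nm).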

Supplement part: not directly related to the OP

DP is easy to learn but hard to master. I know very little about it and really cannot share much. I think "Introduction to Algorithms" is a fairly standard book to start with, and you can find many resources, especially PPT/PDF tutorials from colleges and universities, to learn some basic examples of DP. (Learning these examples is useful, as I'll explain below.)

A problem can be solved by many different DP solutions, and some of them are much better (lower time or space complexity) thanks to a well-defined DP state.

So how do you design a better DP state, or even get the sense that a problem can be solved by DP? I would say it's a matter of experience and knowledge. There is a set of "well-known" DP problems, and I would say many other DP problems can be solved by modifying them a bit. Here is a post I just got accepted about another DP problem; as stated in that post, that problem is very similar to a "well-known" problem named "matrix chain multiplication". So you cannot do much about the "experience" part, as there is no shortcut, yet you can work on the "knowledge" part by studying these standard DP problems first.

Lastly, let's go back to your original question to illustrate my point of view:

As I knew the LCS problem before, I have a sense that a similar problem may be solvable by designing a similar DP state and transition equation. The state s(i,j) := the optimal cost for A(1..i) and B(1..j), given two strings A and B.
What is "optimal" depends on the question, and how to achieve this "optimal" value in each state is determined by the transition equation.
With this state defined, it's easy to see the final answer I would like to query is simply s(len(A), len(B)).
Base case? s(0,0) = 0! We can't really do much with two empty strings, right?
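To make the four components concrete, here is a small sketch using the LCS problem mentioned above, where "optimal" means the length of the longest common subsequence. The class and method names are just for illustration.

```java
final class LcsLength {

    static int lcs(String a, String b) {
        // 1. State: s[i][j] = LCS length of A(1..i) and B(1..j).
        int[][] s = new int[a.length() + 1][b.length() + 1];
        // 3. Base case: s[0][j] = s[i][0] = 0 (Java zero-initialises the array).
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // 2. Transition: extend a match, or drop one character from A or from B.
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    s[i][j] = s[i - 1][j - 1] + 1;
                } else {
                    s[i][j] = Math.max(s[i - 1][j], s[i][j - 1]);
                }
            }
        }
        // 4. Query: the answer is s(len(A), len(B)).
        return s[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(lcs("HCEA", "HWEA")); // prints 3 (H, E, A)
    }
}
```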

So with the knowledge I got, I have a rough thought on the 4 main components of designing a DP solution. I know it's a bit long but I hope it helps, cheers.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow