The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D. Wunsch.It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
We consider all possible pairs of residue from two sequences (this gives rise to a 2D matrix representation).
We will have two matrices: the score matrix and traceback matrix.
The Needleman-Wunsch algorithm consists of three steps:
Initialisation of the score matrix
Calculation of scores and filling the traceback matrix
Deducing the alignment from the traceback matrix
Consider The Simple Example
The alignment of two sequences SEND and AND with the BLOSUM62 substitution matrix and gap opening penalty of 10 (no gap extension):
-AND score: +1
A-ND score: +3 the best
AN-D score: -3
AND- score: -8
The Score And Traceback Matrices
The cells of the score matrix are labelled C(i; j) where i = 1; 2; :::; N and j = 1; 2; :::; M
The score matrix cells are filled by row starting from the cell C(2; 2)
The score of any cell C(i; j) is the maximum of:
qdiag = C(i 1; j 1) + S(i; j)
qup = C(i 1; j) + g
qleft = C(i; j 1) + g
where S(i; j) is the substitution score for letters i and j, and g is the gap penalty.
Scoring A Pictorial Representation
The value of the cell C(i; j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells:
The Needleman-Wunsch Progression
1. The first step is to calculate the value of C(2; 2):
2. The calculation for the cell C(2; 2):
qdiag = C(1; 1) + S(S; A) = 0 + 1 = 1
qup= C(1; 2) + g = 10 + (10) = 20
qleft= C(2; 1) + g = 10 + (10) = 20
Where C(1; 1), C(1; 2), and C(2; 1) are read from the score matrix, and S(S; A) is the score for the S <-> A taken from the BLOSUM62 matrix.
3. For the score matrix C(2; 2) =qdiag which is 1. The corresponding cell of the traceback matrix is "diag":
4. The next step is to calculate C(2; 3):
5. The calculation for the cell C(2; 3)
qdiag = C(1; 2) + S(E; A) = 10 + 1 = 11
qup= C(1; 3) + g = 20 + (10) = 30
qleft= C(2; 2) + g = 1 + (10) = 9
6. Thus C(2; 3) = 9 and the corresponding cell of the traceback matrix is "left".
7. After all cells are lled, the score and traceback matrices are: