
Sequence alignment

Sequence alignment is concerned with the relationships between biological sequences (e.g. protein sequences or DNA sequences). Two major types exist: pairwise and multiple sequence alignments.

Table of contents
1 Pairwise alignment
2 Multiple alignment
3 Algorithms
4 Needleman-Wunsch
5 Smith-Waterman
6 Software
7 SSearch
8 BLAST
9 Fasta
10 Clustal
11 See also

Pairwise alignment

Pairwise sequence alignment methods find the best-matching local (piecewise) or global alignments between two protein (amino acid) or DNA (nucleic acid) sequences.

Typically, the purpose is to find homologues (relatives) of a gene or gene product in a database of known examples. This information is useful for answering a variety of biological questions. The most important application of pairwise alignment methods is characterising sequences of unknown structure or function by relating them to known sequences. Another important use of these techniques is in studies of molecular evolution.

Global Alignment

A global alignment between two sequences is an alignment in which all of the characters in both sequences participate. Global alignments are useful mostly for finding closely related sequences. As these sequences are also easily identified by local alignment methods, global alignment is now used less often as a search technique. Further, several complications of molecular evolution (such as domain shuffling - see below) limit the usefulness of these methods.

Local Alignment

Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence (e.g. positions 20-40 of sequence A might align with positions 50-70 of sequence B).

This is a more flexible technique than global alignment. It has the advantage that related regions which appear in a different order in the two proteins (a result of domain shuffling) can still be identified as related. This is not possible with global alignment methods.

Significance of Alignments

Two important issues for sequence alignment are:

  1. How is the best alignment between two sequences (or regions of sequence) chosen?
  2. How are alignments between the query and the (many!) sequences in the database ranked according to their biological significance?

It is important to realise that the actual biological meaning of any alignment can never be absolutely guaranteed. However, statistical methods can be used to assess the likelihood of finding an alignment between two regions (or sequences) by chance, given the size of the database and its composition.

The two questions are related. The first can be addressed by developing a model of how likely particular changes between characters in the sequences are. There are many ways to do this, none of which is clearly superior overall. These models are derived empirically from related sequences and are expressed as substitution matrices. The algorithms named below use these matrices to give each possible alignment between two sequences a score, and they return the highest-scoring alignments. The biological quality of the alignments therefore depends upon the evolutionary model used to generate the score.
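To make the scoring step concrete, here is a minimal sketch of how a substitution model turns an already-aligned pair of sequences into a score. The scores are toy values chosen for illustration (they are not a published matrix), and the names `substitution_score`, `score_alignment` and `GAP_PENALTY` are hypothetical.

```python
# Toy DNA substitution model. All values are illustrative assumptions.
PURINES = {"A", "G"}
GAP_PENALTY = -3


def substitution_score(x: str, y: str) -> int:
    """Score one aligned pair of bases."""
    if x == y:
        return 2                      # identical bases
    if (x in PURINES) == (y in PURINES):
        return -1                     # transition (A<->G or C<->T)
    return -2                         # transversion


def score_alignment(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equal-length aligned rows ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP_PENALTY
        else:
            total += substitution_score(x, y)
    return total


print(score_alignment("GATTACA", "GACTA-A"))
```

An alignment algorithm searches over all possible placements of gaps for the pair of rows that maximises this kind of score; changing the matrix or the gap penalty can change which alignment wins.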

The second question is purely statistical. Extensive work has established a few firm theoretical results and many approximations. It is now generally accepted that the scores of alignments between random sequences follow an extreme value distribution. Pairwise alignment programs such as BLAST use simulation methods to estimate the parameters of this distribution for a particular search setup (the query, database, substitution matrix and certain other settings). Each alignment can then be given a statistical significance value, from which possible relationships between sequences can be judged.
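As a rough illustration of how an extreme value distribution yields a significance value, the sketch below uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m the query length and n the total database length. The text above does not name this formula, so treat it as an assumed instance of the general idea. The parameter values in the example call are placeholders, and the function names are hypothetical.

```python
import math


def expected_chance_hits(score: float, query_len: int, db_len: int,
                         lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least `score`.

    `lam` and `k` are the distribution parameters that a search program
    would estimate for its scoring system; here they are plain inputs.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value_from_e(e_value: float) -> float:
    """Probability of seeing at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Placeholder parameters, for illustration only.
e = expected_chance_hits(score=80, query_len=300, db_len=10_000_000,
                         lam=0.3, k=0.1)
print(e, p_value_from_e(e))
```

Note how the expected count grows linearly with database size but falls exponentially with score, which is why the same raw score is less significant in a larger database.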

Multiple alignment

Multiple alignment is an extension of pairwise alignment to incorporate several sequences. Several methods exist, one of the most popular being the progressive alignment strategy used by the CLUSTAL family of programs. Instead of searching a database, multiple alignment methods take a few sequences and find the regions common to them all. Multiple alignments are typically used in cladistics as a basis for building phylogenetic trees, and for creating sequence profiles which can be used to search sequence databases for more distant relatives. The two most popular methods for remote-homologue detection, PSI-BLAST and Hidden Markov model (HMM) based methods, both work on this principle.

Many strategies exist. However, as yet no computationally practical algorithm has been described which is guaranteed to find the best possible alignment of many sequences.

Algorithms

Needleman-Wunsch

Pairwise. Global alignment only.
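
A minimal sketch of the Needleman-Wunsch dynamic-programming approach, assuming a simple match/mismatch score and a linear gap penalty rather than a full substitution matrix. The default scores and the function name `needleman_wunsch` are illustrative choices, not taken from any particular implementation.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap         # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap         # b[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback always starts in the final cell and runs back to the origin, every character of both sequences appears in the result, which is what makes the alignment global.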

Smith-Waterman

Pairwise. Local alignment.
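
A minimal sketch of Smith-Waterman local alignment, again assuming simple match/mismatch scores and a linear gap penalty. The default scores and the function name `smith_waterman` are illustrative assumptions. Compared with a global dynamic-programming alignment, cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                       # start a fresh alignment here
                          h[i - 1][j - 1] + s,     # extend with a match/mismatch
                          h[i - 1][j] + gap,       # gap in b
                          h[i][j - 1] + gap)       # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTTACGTGGG", "CCACGTAA"))
```

In the example call only the shared region is reported; the unrelated flanking characters of both sequences are left out of the alignment.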

Software

SSearch

Implements the standard Smith-Waterman algorithm. Considerably slower than the more modern BLAST and FASTA methods.

BLAST

(Stands for Basic Local Alignment Search Tool)

Pairwise local search. Uses a number of heuristics to run much faster than the original Smith-Waterman algorithm, at the cost of no longer guaranteeing the optimal alignment.

Blast Server at the NCBI

Fasta

Pairwise local search. Superseded by BLAST.

Clustal

Progressive multiple alignment method. Comes in several varieties (ClustalW, ClustalX etc.)

See also