Sequence alignment
Pairwise alignment
Pairwise sequence alignment methods are concerned with finding the best-matching piecewise (local) or global alignments of protein / amino acid or dna / nucleic acid sequences.
Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples. This information is useful for answering a variety of biological questions. The most important application of pairwise alignment methods is for identifying sequences of unknown structure/function. Another important use of these techniques is in studies of molecular evolution.
This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.
It is important to realise that the actual biological meaning of any alignment can never be absolutely guaranteed. However, statistical methods can be used to assess the liklihood of finding an alignment between two regions (or sequences) by chance, given the size of the database and its composition.
The two questions are related to each other. The first can be addressed by developing a model of how likely certain changes between characters in the sequences are. There are lots of ways to do this, none of which is generally superior. These models are derived empirically using related sequences, and are expressed as substitution matrices. These matrices are used by the algorithms named below to give each possible alignment between two sequences a score. The highest-scoring alignments possible are generated by the algorithm. The actual biological quality of the alignments then depends upon the evolutionary model used to generate the score.
The second question is purely statistical. A lot of work on the part of a lot of people has determined a few hard theoretical rules and many approximations. It is now generally accepted that the scores of alignments between random sequences follow the extreme value distribution. Pairwise alignment programs such as BLAST use simulation methods to estimate the parameters of this distribution for a particular parameter set (consisting of the query, database, substitution matrix and certain other parameters). Alignments can then be given a statistical significance value, allowing judgements on possible relationships between sequences to be inferred.
The downside of Framesearch is that it is extremely CPU-
intensive. Practical use of Framesearch typically requires
either a large Linux farm or specialized hardware that is
specifically optimized for running dynamic-programming
algorithms on a large scale; several companies sell such
systems.
Pairwise local search. Uses a number of methods to increase the speed of the original Smith-Waterman algorithm.
Blast Server at the NCBI
Global Alignment
A global alignment between two sequences is an alignment in which all of the characters in both sequences participate in the alignment.
Global alignments are useful mostly for finding closely-related sequences. As these sequences are also easily identified by local alignment methods global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling - see below) which prevent these methods from being useful. Local Alignment
Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence (e.g. positions 20-40 of sequence A might align with positions 50-70 of sequence B). Significance of Alignments
Two important issues for sequence alignment are:
The basic assumption in the application of alignment algorithms is in the mechanism of molecular evolution. DNA by virtue of its semi-conservative duplication mechanism carries over material from generation to generation.
There are errors that are occasionaly introduced in the duplication and also mutations that can be introduced between duplication events. Additionally there are viruses and other mechanisms that move sequences from location to location on the chromosome as well as between individuals. The consequence of this is that these sequence matches indicate the origin and evolution of the carriers of the genetic material. Based on assumptions on the probabilities of these change events, estimates can be made on on the time at which sequences diverged from a common ancestor or the time required for changing one sequence into another. The value and nature of the probabilities associated with these change events are however much debated with one school of thought assuming the simpler (an application of Occams razor to biology) constant rate and the other school assuming short evolutionary periods where changes were extremely high. Many changes are evolutionarily non-significant in that they do not affect (that is, neutral, neither positive or negative) the selective advantage (or fitness) of the carrier. This is the basis of Kimura's Neutral theory. This is partly supported by examples such as the Globin protein sequence which shows normal functionality in-spite of several changes at various bases but loses functionality only when changes occur at critical sequence locations.Structural Alignment
One way to potentially validate sequence alignments is to use structural alignments where possible. Because protein structure is more conserved through evolution than sequence (as shown by Cyrus Chothia and Arthur Lesk in 1986), structural alignments can be more reliable over long evolutioanry distances (when sequence similarity has diverged beyond the point which simple sequence comparison can detect). It is worth noting that structural alignments although useful are totally artifactual.Multiple alignment
Multiple alignment is an extension of pairwise alignment to incorporate several sequences. Several methods for this exist, one of the most popular being the progressive alignment strategy as used by the CLUSTAL family of programs. Instead of searching a database, multiple alignment methods take a few sequences and find common regions between them all. This is typically used in cladistics as a method for building phylogenetic trees, as well as for creating sequence profiles which can be used to search sequence database for more distant relatives (the two most popular methods for remote-homologue detection, PSI-BLAST and Hidden Markov model (HMM) based methods both work on this principle). Algorithms
Needleman-Wunsch
Pairwise.
Global alignment only. Smith-Waterman
Pairwise.
Local or global alignment.Framesearch
An extension of Smith-Waterman for pairwise alignment between
a protein sequence and a nucleotide sequence. Unlike
six-frame-translated Blast or Smith-Waterman, which work
by generating all six possible translations of each
nucleotide sequence, then using those translations
for a protein search, Framesearch dynamically considers
every possible single-nucleotide insertion or deletion (indel)
to generate the translation that best matches the protein
sequence. This allows dramatically better alignments
when the nucleotide sequence contains many indels, such
as is often the case with single-read EST sequences or
with very early versions of genomic sequences.Software
SSearch
Implements the standard Smith-Waterman algorithm. Considerably slower than the more modern BLAST and FASTA methods. However, Smith-Waterman remains the
gold standard for protein-protein or nucleotide-nucleotide pairwise
alignment; its speed can be improved by the use of either specialized
hardware or large Linux farms.BLAST
(Stands for Basic Local Alignment Search Tool)
This method makes use of precomputed hash tables which are like index tables to short length sequences. The index is precomputed. When a search sequence is given, the sub-sequences are looked up using the index to reduce the amount of time and searching involved. Several parameters need to be provided to make this method faster or more accurate. Once patterns that match the search sequence are made, more accurate and intensive algorithms may be applied.FASTA
Pairwise local search. Superseded by BLAST.Clustal
Progressive multiple alignment method.
Comes in several varieties (ClustalW, ClustalX etc.)See also