Overview
The human genome project has spelled out the 3 billion
or so base pairs that make up human DNA. The science of bioinformatics tries
to make sense out of such data. Bioinformatics sifts through data to find out
about protein structure, molecular evolution, and gene function. It includes
topics such as protein molecular dynamics, protein threading, protein homology
modeling, protein structure prediction, phylogenetic trees, pairwise sequence
alignment, multiple sequence alignment, fragment and map assembly, RNA secondary
structure prediction, and integration of molecular biology databases. One area
within bioinformatics is pairwise sequence alignment. When the sequences of
two proteins are very similar, it is relatively easy to line them up. However,
the homology of distantly related proteins may be difficult to recognize. In
addition, unrelated proteins and distantly related proteins may have similar
levels of sequence identity. Proteins whose sequence identity falls between
15 and 25 percent are said to be in the twilight zone. Sophisticated alignment
algorithms are required to distinguish distantly related proteins from unrelated
proteins whose similarity falls in the twilight zone by chance.
Section 2
We can start to understand the alignment process by constructing
a dot matrix for two sequences. The sequence of one polypeptide is placed horizontally
and the sequence of a second polypeptide is placed vertically. Click on the
dots in the resulting matrix wherever the residues are identical and then click
Submit when you're finished. Peptides with similar sequences
have many dots along the diagonal and a shift in position wherever one peptide
has a gap relative to the other. Dissimilar peptides have only a few dots along
the diagonal. To quantify peptide similarity, an alignment score may be calculated.
In one scheme, 10 is added for every identity except for cysteine, which counts
20; 25 is subtracted for every gap. Dividing by the number of residues in the
shorter peptide and multiplying by 100 normalizes the score. The normalized
alignment score of a perfect match in the absence of cysteine residues or gaps
is 1000. The normalized alignment score of these
two peptides is 591. While this score is relatively
high, its significance is low because the peptides are so short.
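The dot matrix and the scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own implementation: the function names, the use of "-" to mark gap positions, and the choice to count each run of consecutive gap characters as a single gap are all assumptions.

```python
def dot_matrix(seq_a, seq_b):
    """Return a matrix with 1 wherever residues are identical.

    Columns follow seq_a (horizontal); rows follow seq_b (vertical).
    """
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def normalized_alignment_score(aligned_a, aligned_b):
    """Score two pre-aligned peptides of equal length ("-" marks a gap).

    Scheme from the text: +10 per identity, +20 for an identical cysteine,
    -25 per gap, then divide by the shorter peptide length and multiply by 100.
    Assumption: a run of consecutive "-" in one peptide counts as one gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned peptides must have the same length")
    raw = 0
    gap_side = None  # which peptide the current gap run is in, if any
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != gap_side:  # a new gap starts here
                raw -= 25
            gap_side = side
            continue
        gap_side = None
        if a == b:
            raw += 20 if a == "C" else 10
    shorter = min(len(aligned_a.replace("-", "")),
                  len(aligned_b.replace("-", "")))
    return 100 * raw / shorter


# A perfect match without cysteine or gaps scores 1000, as stated above.
print(normalized_alignment_score("AGLV", "AGLV"))  # 1000.0
```

The peptides in the usage line are made up for illustration; they are not the peptides shown in the tutorial.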
Section 3
Dot matrix alignments are useful when sequences share
high levels of identity. However, most alignments require computerized statistical
analysis. For example, computerized statistical alignments weigh the likelihood
of residue substitutions based on the genetic code: amino acid changes requiring
only a single base change are more likely than those requiring three base changes.
In addition, the likelihood of a substitution being accepted through the process
of natural selection must be considered. The physical similarity of the amino
acid substitution is part of such alignment schemes. A conservative change of
leucine to isoleucine is more likely to be accepted than the replacement of
leucine with tryptophan. One family of matrices, called PAM for Percentage of
Accepted point Mutations, was introduced by Margaret Dayhoff in 1978. Dayhoff
constructed phylogenetic trees for closely related proteins and determined the
amino acid replacement frequency over an evolutionary period. The evolutionary
period corresponding to 1 accepted mutation per 100 amino acids is a PAM unit.
There is no association between PAM units and time because different proteins
evolve at different rates. The number associated with each PAM matrix represents
the evolutionary distance, expressed as the number of accepted mutations per
100 amino acids. A PAM 250 matrix corresponds to
250 accepted mutations per 100 residues. Since
the same residue can change more than once, two sequences with this level of
mutation will still have about 20% identical residues. Recall
that this level of sequence similarity is in the twilight zone, the region where
it is difficult to distinguish distantly related sequences from unrelated
sequences.
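The genetic-code argument made earlier in this section can be checked directly. The sketch below builds the standard genetic code and reports the minimum number of base changes needed to convert a codon for one amino acid into a codon for another. The function name and the one-letter amino acid codes used as arguments are illustrative choices, not part of the tutorial.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def min_base_changes(aa_from, aa_to):
    """Smallest number of single-base differences between any pair of codons."""
    codons_from = [c for c, aa in CODON_TABLE.items() if aa == aa_from]
    codons_to = [c for c, aa in CODON_TABLE.items() if aa == aa_to]
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_from for c2 in codons_to)


print(min_base_changes("L", "I"))  # leucine -> isoleucine: 1
print(min_base_changes("F", "K"))  # phenylalanine -> lysine: 3
print(min_base_changes("L", "W"))  # leucine -> tryptophan: 1
```

Leucine to tryptophan also needs only one base change, so the genetic code alone does not explain why that replacement is rarer than leucine to isoleucine. That difference comes from the second factor described above, whether the substitution is accepted by natural selection.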
Section 4
The PAM 250 log odds substitution matrix with a scale
factor is displayed. These are weighted values that can be used rather than
dots in an alignment matrix. The diagonal values indicate the replacement odds
of the amino acids. The amino acid residues least likely to be replaced have
the largest diagonal values. Select the two residues least likely to be replaced
by clicking on their replacement odds and then click Submit. The other matrix
values indicate the exchange odds of the amino acid residues. The residue pairs
least likely to exchange have the smallest values. For example, glycine and
proline, with an exchange value of -1, are less likely to exchange than are
glycine and alanine, with an exchange value of 1. In the PAM matrix, the amino
acids are arranged so that residues likely to exchange are near each other.
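To make the log odds values concrete, the sketch below converts between an odds ratio and a scaled log odds score. It assumes a scale factor of 10 and a base-10 logarithm, the convention usually attributed to Dayhoff's matrices; the tutorial itself does not state the scale factor, and the function names are illustrative.

```python
import math

SCALE = 10  # assumed scale factor; not stated in the tutorial


def log_odds_score(odds_ratio, scale=SCALE):
    """Scaled, rounded log odds: odds_ratio is observed / expected-by-chance."""
    return round(scale * math.log10(odds_ratio))


def odds_from_score(score, scale=SCALE):
    """Invert the scaling to recover the approximate odds ratio."""
    return 10 ** (score / scale)


# Positive scores mean the pair is seen more often than chance predicts,
# negative scores mean it is seen less often.
for score in (12, 2, 1, -1, -5):
    print(score, round(odds_from_score(score), 2))
```

Under this assumption, an exchange value of 1 (glycine and alanine) corresponds to odds of about 1.26 relative to chance, and a value of -1 (glycine and proline) to about 0.79.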
Click on the different classifications of the amino acids to see what influence
physical properties have on exchange groupings. A comparison matrix of two short
peptides is shown. Rather than dots, weighted values from the PAM 250 matrix
are used. Fill in the missing values using the PAM 250 log odds substitution
matrix and then click submit. The log odds values entered from the PAM 250 matrix
are used as alignment scores. The best sequence alignment is found by maximizing
the sum of these values. An algorithm, such as that formulated by Needleman
and Wunsch, may be used to determine the best alignment. The Needleman-Wunsch
algorithm transforms the matrix. The lower right corner of the matrix corresponds
to the peptides' C-termini. The log odds value of asparagine, N, exchanging
with aspartic acid, D, is 2. The next residue in the peptide, cysteine, may
pair with cysteine. The log odds value of this alignment is 12. The 2 is added
to the 12 and the 12 is replaced with the sum, 14. Less likely is the pairing
of cysteine with lysine, K. The log odds value of this alignment is -5.
The 2 is added to the -5 and the -5 is replaced with the sum, -3.
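The transformation step just described can be sketched as follows. This is one reading of the procedure, modeled on the original Needleman-Wunsch formulation: working from the lower right corner, each cell is replaced by its own value plus the largest transformed value reachable one row down and any number of columns to the right, or one column to the right and any number of rows down. Gap penalties are not applied inside the matrix here. The function name is hypothetical, and in the toy matrix only the values 12, 2, and -5 (cysteine with lysine) come from the text; the remaining entries are placeholders to be checked against a real PAM 250 table.

```python
def transform(scores):
    """Return the transformed matrix; the input matrix is left unchanged."""
    rows, cols = len(scores), len(scores[0])
    t = [row[:] for row in scores]
    # The last row and last column keep their original values.
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            # Cells in the next row, to the right of column j.
            next_row = t[i + 1][j + 1:]
            # Cells in the next column, below row i + 1.
            next_col = [t[k][j + 1] for k in range(i + 2, rows)]
            t[i][j] = scores[i][j] + max(next_row + next_col)
    return t


# Rows: C, N (end of one peptide). Columns: K, C, D (end of the other).
toy = [[-5, 12, -5],   # C/K = -5 and C/C = 12 are from the text
       [1, -4, 2]]     # N/D = 2 is from the text; 1 and -4 are placeholders
print(transform(toy))  # [[-3, 14, -5], [1, -4, 2]]
```

The output reproduces the two sums worked through above: 12 becomes 14 and -5 becomes -3.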
This process is repeated to yield the transformed matrix. The best sequence
alignment is found by connecting numbers in the transformed matrix. The maximum
value at or near the N-terminus is connected to the next largest value, always
moving down and to the right. Determine the best sequence alignment by clicking
on the matrix values. Start at the maximum value of the transformed matrix,
which is 41, and continue downward and to the right. Click submit
when finished. Two sequence alignments are possible. To determine if one is
better, an alignment score is calculated. The overall alignment score is the
maximum value of the transformed matrix, 41. Gap penalties are assigned. Gap
penalties are calculated using the equation a + bk, where a is the penalty for
opening the gap, k is the length of the gap in residues, and b is the penalty
for extending the gap by one residue. Empirical studies suggest the best values
to use for the PAM 250 matrix are a = -8 and b = -2. Each
alignment has two gaps; the gap penalties reduce both scores to 19. Since
both alignments have the same score, they are equally probable.
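The gap penalty arithmetic can be sketched as below, using a = -8 and b = -2 from the text. The function names are illustrative, and the gap lengths in the example (one gap of 1 residue and one of 2 residues) are an assumption: the text gives only the number of gaps and the final score, and these lengths happen to reproduce the reduction from 41 to 19.

```python
GAP_OPEN = -8    # a: penalty for opening a gap
GAP_EXTEND = -2  # b: penalty for each residue in the gap


def gap_penalty(length, a=GAP_OPEN, b=GAP_EXTEND):
    """Penalty a + b*k for a single gap of k residues."""
    return a + b * length


def penalized_score(raw_score, gap_lengths):
    """Add the (negative) penalty of every gap to the raw alignment score."""
    return raw_score + sum(gap_penalty(k) for k in gap_lengths)


# Assumed gap lengths: (-8 - 2) + (-8 - 4) = -22, and 41 - 22 = 19.
print(penalized_score(41, [1, 2]))  # 19
```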
Section 5
Because of the rapidly expanding databases and modern
computers, sophisticated alignment programs are now used. One program available
over the web is BLAST, for Basic Local Alignment Search Tool. The BLAST algorithm
detects alignments embedded in otherwise unrelated proteins, called local alignments,
as well as global alignments. Global alignments match sequences across their
complete lengths. Click on the BLAST button to enter a mock-up of the BLAST
site. Note that when you're visiting the real site, you can click on the
BLAST course and BLAST tutorial for additional information. To search for matches
to a short polypeptide, click on "Search for short nearly exact matches" under
Protein BLAST. This program compares an amino acid sequence against a protein
sequence database. The peptide sequence is entered into the search window. Since
the entire sequence will be searched, there is no need to enter anything into
the set sequence boxes. The database chosen is nr, for all non-redundant entries.
The nr database is a good choice for a comprehensive search. A CD search, which
searches for conserved domains, is not selected. Click on the BLAST button to
initiate the search. Click Format to check the results. BLAST found 108 possible
matches to the query sequence. However, only one alignment score
is over 40 bits. A bit is the basic unit of information in a binary numbering
system. Bit scores are normalized so alignment scores from different searches
can be compared. The normalized bit score is used to calculate the statistical
significance of the alignment, the Expected or E value. The E value decreases
exponentially as the bit score increases. The E value describes the number of
expected hits by chance when searching the given database. For example, an E
value of 5 means that one would expect to find, by chance, 5 alignments with
a similar bit score. The E value for the alignment with a bit score of 49 is
3 × 10^-6. This alignment is unlikely to have occurred by chance. Other programs
such as FASTA have different search philosophies and may yield different results,
especially for sequence similarities in the twilight region. Regardless of the
method used, sequence alignments can provide valuable clues about evolutionary
relationships and the function of conserved residues among proteins. Bioinformatics
has made possible the sophisticated mining of data that allows sequence identity
in the twilight zone to be recognized.
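The exponential relationship between bit score and E value described in this section can be sketched with the commonly cited relation E = m × n × 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. The function name and the lengths in the example are hypothetical; they are not the values behind the search shown in the tutorial, and real BLAST output uses effective lengths that are corrected for edge effects.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# Hypothetical search space: a 15-residue query against 100 million residues.
for bits in (30, 40, 49, 50):
    print(bits, e_value(bits, 15, 100_000_000))
```

Each additional bit halves the E value, which is why a modest increase in bit score makes a chance match far less likely.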