HELPING STUDENTS
   
   
 

Overview

The Human Genome Project has spelled out the 3 billion or so base pairs that make up human DNA. The science of bioinformatics tries to make sense of such data. Bioinformatics sifts through data to learn about protein structure, molecular evolution, and gene function. It includes topics such as protein molecular dynamics, protein threading, protein homology modeling, protein structure prediction, phylogenetic trees, pairwise sequence alignment, multiple sequence alignment, fragment and map assembly, RNA secondary structure prediction, and integration of molecular biology databases. One area within bioinformatics is pairwise sequence alignment. When the sequences of two proteins are very similar, it’s relatively easy to line them up. However, the homology of distantly related proteins may be difficult to recognize. In addition, unrelated proteins and distantly related proteins may have similar levels of sequence identity, typically between 15 and 25 percent. Proteins with sequence identity in this range are said to be in the twilight zone. Sophisticated alignment algorithms are required to distinguish distantly related proteins from unrelated proteins whose similarity falls in the twilight zone by chance.


Section 2

We can start to understand the alignment process by constructing a dot matrix for two sequences. The sequence of one polypeptide is placed horizontally and the sequence of a second polypeptide is placed vertically. Click on the resulting matrix to place a dot wherever the residues are identical, and then click ‘submit’ when you’re finished. Peptides with similar sequences have many dots along the diagonal, with a shift in position wherever one peptide has a gap relative to the other. Dissimilar peptides have only a few dots along the diagonal. To quantify peptide similarity, an alignment score may be calculated. In one scheme, 10 is added for every identity except for cysteine, which counts 20; 25 is subtracted for every gap. Dividing by the number of residues in the shorter peptide and multiplying by 100 normalizes the score. The normalized alignment score of a perfect match in the absence of cysteine residues or gaps is 1000. The normalized alignment score of these two peptides is 591. While this score is relatively high, its significance is low because the peptides are so short.
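The dot matrix and the scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names and the two example peptides are hypothetical, and the caller is assumed to have already counted identities and gaps from a chosen alignment.

```python
def dot_matrix(horizontal, vertical):
    """Return a grid that is True wherever the two residues are identical."""
    return [[h == v for h in horizontal] for v in vertical]


def normalized_score(identities, cysteine_identities, gaps, shorter_length):
    """+10 per identity (+20 if the identity is a cysteine), -25 per gap,
    divided by the length of the shorter peptide and multiplied by 100.
    `identities` is the total count, including cysteine identities."""
    raw = (10 * (identities - cysteine_identities)
           + 20 * cysteine_identities
           - 25 * gaps)
    return 100 * raw / shorter_length


# Hypothetical peptides, chosen only to show a shifted diagonal.
for row in dot_matrix("ACDEFG", "ACDFG"):
    print("".join("*" if cell else "." for cell in row))

# A perfect 10-residue match without cysteines or gaps scores 1000.
print(normalized_score(identities=10, cysteine_identities=0, gaps=0, shorter_length=10))
```

Because the score is divided by the length of the shorter peptide, a perfect match always normalizes to 1000 whatever its length, which is why the significance of the score has to be judged separately.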


Section 3

Dot matrix alignments are useful when sequences share high levels of identity. However, most alignments require computerized statistical analysis. For example, computerized statistical alignments weigh the likelihood of residue substitutions based on the genetic code: amino acid changes requiring only a single base change are more likely than those requiring three base changes. In addition, the likelihood of a substitution being accepted through the process of natural selection must be considered. The physical similarity of the exchanged amino acids is part of such alignment schemes. A conservative change of leucine to isoleucine is more likely to be accepted than the replacement of leucine with tryptophan. One family of matrices, called PAM for Percentage of Accepted point Mutations, was introduced by Margaret Dayhoff in 1978. Dayhoff constructed phylogenetic trees for closely related proteins and determined the amino acid replacement frequency over an evolutionary period. The evolutionary period corresponding to 1 accepted mutation per 100 amino acids is a PAM unit. There is no fixed correspondence between PAM units and time because different proteins evolve at different rates. The number associated with each PAM matrix represents the evolutionary distance over which that number of accepted mutations occurs per 100 amino acids. A PAM 250 matrix corresponds to 250 accepted mutations per 100 residues. Since the same residue can change more than once, two sequences with this level of mutation will have about 20% identical residues. Recall that this level of sequence similarity is in the twilight zone, the region where it’s difficult to distinguish distantly related sequences from unrelated sequences.
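The genetic-code part of this argument can be made concrete by counting the minimum number of base changes needed to turn a codon for one amino acid into a codon for another. The sketch below assumes the standard genetic code; the helper names are hypothetical, and it illustrates only the base-change count, not how a PAM matrix is derived.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_base_changes(aa1, aa2):
    """Smallest number of single-base changes between any codon pair."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(min_base_changes("L", "I"))  # leucine -> isoleucine
print(min_base_changes("L", "W"))  # leucine -> tryptophan
print(min_base_changes("F", "K"))  # phenylalanine -> lysine
```

Leucine to isoleucine and leucine to tryptophan each need only a single base change (for example CTT to ATT, and TTG to TGG), so the genetic code alone does not separate them. The difference in how often they are accepted comes from the physical similarity of the residues, which is the second factor the text describes.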


Section 4

The PAM 250 log odds substitution matrix with a scale factor is displayed. These are weighted values that can be used rather than dots in an alignment matrix. The diagonal values indicate the replacement odds of the amino acids. The amino acid residues least likely to be replaced have the largest diagonal values. Select the two residues least likely to be replaced by clicking on their replacement odds and then click Submit. The other matrix values indicate the exchange odds of the amino acid residues. The residue pairs least likely to exchange have the smallest values. For example, glycine and proline, with an exchange value of -1, are less likely to exchange than are glycine and alanine, with an exchange value of 1. In the PAM matrix, the amino acids are arranged so that residues likely to exchange are near each other. Click on the different classifications of the amino acids to see what influence physical properties have on exchange groupings.

A comparison matrix of two short peptides is shown. Rather than dots, weighted values from the PAM 250 matrix are used. Fill in the missing values using the PAM 250 log odds substitution matrix and then click submit. The log odds values entered from the PAM 250 matrix are used as alignment scores. The best sequence alignment is found by maximizing the sum of these values.

An algorithm, such as that formulated by Needleman and Wunsch, may be used to determine the best alignment. The Needleman-Wunsch algorithm transforms the matrix, starting from the lower right corner, which corresponds to the peptides’ C termini. The log odds value of asparagine, N, exchanging with aspartic acid, D, is 2. The next residue in the peptide, cysteine, may pair with cysteine. The log odds value of this alignment is 12. The 2 is added to the 12 and the 12 is replaced with the sum, 14. Less likely is the pairing of cysteine with lysine, K. The log odds value of this alignment is -5. The 2 is added to the -5 and the -5 is replaced with the sum, -3.
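The transformation step can be sketched as follows. The text does not spell out exactly which cells may contribute to a sum, so this sketch assumes the rule from the original Needleman-Wunsch formulation: each cell receives the largest already-transformed value found in the next row to its right or in the next column below it. The function name and the small matrix are hypothetical. Only the values 2, 12, and -5 come from the text; the other three entries are illustrative and should be checked against an actual PAM 250 table.

```python
def transform(scores):
    """Needleman-Wunsch style transformation, working from the lower right.

    Cells in the last row and last column keep their original values.
    Every other cell (i, j) is increased by the largest transformed value
    in row i+1 (columns to the right of j) or column j+1 (rows below i).
    """
    rows, cols = len(scores), len(scores[0])
    t = [row[:] for row in scores]  # copy so the input is left untouched
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            next_row = t[i + 1][j + 1:]
            next_col = [t[k][j + 1] for k in range(i + 2, rows)]
            t[i][j] += max(next_row + next_col)
    return t


# Rows: the last two residues of one peptide (C, N).
# Columns: the last three residues of the other (K, C, D).
# 12 (C/C), -5 (C/K) and 2 (N/D) are quoted in the text; the rest are assumed.
scores = [
    [-5, 12, -5],
    [1, -4, 2],
]
print(transform(scores))  # [[-3, 14, -5], [1, -4, 2]]
```

With these values the cysteine-cysteine cell becomes 12 + 2 = 14 and the cysteine-lysine cell becomes -5 + 2 = -3, matching the arithmetic in the walkthrough.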
This process is repeated to yield the transformed matrix. The best sequence alignment is found by connecting numbers in the transformed matrix. The maximum value at or near the N-terminus is connected to the next largest value, always moving down and to the right. Determine the best sequence alignment by clicking on the matrix values. Start at the maximum value of the transformed matrix, which is 41, and continue downward and to the right. Click ‘submit’ when finished. Two sequence alignments are possible. To determine if one is better, an alignment score is calculated. The overall alignment score before gaps are taken into account is the maximum value of the transformed matrix, 41. Gap penalties are then assigned using the equation a + bk, where a is the penalty for opening the gap, k is the length of the gap in residues, and b is the penalty for extending the gap by one residue. Empirical studies suggest the best values to use for the PAM 250 matrix are a = -8 and b = -2. Each alignment has two gaps; the gap penalties reduce both scores to 19. Since both alignments have the same score, they are equally probable.
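The gap penalty equation a + bk is simple enough to write out directly. In the sketch below the function names are hypothetical, and the gap lengths in the example are an assumption: the text says only that each alignment has two gaps, and lengths of 1 and 2 residues are used here because they happen to reproduce the stated final score of 19.

```python
def gap_penalty(k, a=-8, b=-2):
    """Penalty for one gap of k residues: opening cost a plus b per residue."""
    return a + b * k


def penalized_score(raw_score, gap_lengths, a=-8, b=-2):
    """Add the (negative) penalty of every gap to the raw alignment score."""
    return raw_score + sum(gap_penalty(k, a, b) for k in gap_lengths)


print(gap_penalty(1))                # -10
print(gap_penalty(2))                # -12
print(penalized_score(41, [1, 2]))   # 19, under the assumed gap lengths
```

Separating the opening cost from the extension cost means that one long gap is penalized less than several short gaps covering the same number of residues.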


Section 5

Because of rapidly expanding databases and modern computers, sophisticated alignment programs are now used. One program available over the web is BLAST, for Basic Local Alignment Search Tool. The BLAST algorithm detects alignments embedded in otherwise unrelated proteins, called local alignments, as well as global alignments. Global alignments match sequences across their complete lengths. Click on the BLAST button to enter a mock-up of the BLAST site. Note that when you’re visiting the real site, you can click on the BLAST course and BLAST tutorial for additional information. To search for matches to a short polypeptide, click on Search for short nearly exact matches under Protein BLAST. This program compares an amino acid sequence against a protein sequence database. The peptide sequence is entered into the search window. Since the entire sequence will be searched, there is no need to enter anything into the set sequence boxes. The database chosen is nr, for all non-redundant entries. The nr database is a good choice for a comprehensive search. A CD search, which searches for conserved domains, is not selected. Click on the BLAST button to initiate the search. Click Format to check the results. BLAST found 108 possible matches to the query sequence. However, only one alignment score is over 40 bits. A bit is the basic unit of information in a binary numbering system. Bit scores are normalized so alignment scores from different searches can be compared. The normalized bit score is used to calculate the statistical significance of the alignment, the Expected or E value. The E value decreases exponentially as the bit score increases. The E value describes the number of hits expected by chance when searching the given database. For example, an E value of 5 means that one would expect to find, by chance, 5 alignments with a similar bit score. The E value for the alignment with a bit score of 49 is 3 × 10⁻⁶.
This alignment is unlikely to have occurred by chance. Other programs such as FASTA have different search philosophies and may yield different results, especially for sequence similarities in the twilight zone. Regardless of the method used, sequence alignments can provide valuable clues about evolutionary relationships and the function of conserved residues among proteins. Bioinformatics has made possible the sophisticated mining of data that allows sequence identity in the twilight zone to be recognized.
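The exponential relationship between bit score and E value can be illustrated with the relation usually given for BLAST statistics, E = m × n × 2^(-S), where S is the bit score, m is the query length, and n is the database length. The text does not state this formula, so the sketch below rests on that assumption. The function name and the lengths in the example are hypothetical, and real searches use corrected "effective" lengths, so the numbers will not reproduce the tutorial's result.

```python
def expected_hits(bit_score, query_length, database_length):
    """E value under the assumed relation E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 12-residue query against 10**8 residues.
for score in (30, 40, 49):
    print(score, expected_hits(score, 12, 10**8))
```

Each additional bit halves the E value, so an alignment scoring 49 bits is expected by chance roughly 500 times less often than one scoring 40 bits in the same search.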

 
 