Background. The native tertiary structure of a protein is a complex, three-dimensional arrangement of one or more chains of amino acids. Information about tertiary structures is needed to understand how proteins function, how they interact with each other, and how they bind other molecules such as DNA. Structural information may also indicate how alteration of particular residues in the structure could lead to changed stability, specificity, reactivity or interactivity. Knowing the tertiary structure of a receptor protein is a prerequisite for the rational design of molecules which will bind with high specificity and affinity.
Despite the development of increasingly powerful algorithms, reliable prediction of protein tertiary structure from knowledge of its primary sequence remains one of the most important unsolved problems in structural biochemistry. All structure prediction methods rely in some way on experimental data. The structural databases now contain files of coordinates for tens of thousands of protein structures. However, there are only a few hundred unique protein folds represented by these structures. It has been suggested that the total number of unique protein folds is possibly much larger than this (about 1000 unique folds). Thus, a possible limitation to any prediction method may be that the portion of "protein conformational space" that is sampled by known tertiary structures is still too small. The NIH is attempting to address this problem by funding "protein structure factories" that will use robotics and other technologies to accelerate the production of tertiary structures (see http://www.nigms.nih.gov/funding/psi.html). The goal is to produce 10,000 new structures over the next 10 years.
It is a reasonable assumption that if two protein sequences are similar (homologous) then the tertiary folds taken up by the proteins will be similar. Thus, one approach to protein structure prediction is to compare the sequence of a protein whose primary sequence is known, but for which there is no tertiary structure, to the sequences of other proteins for which tertiary structures are available. Adjustment of a known structure to accommodate the amino acids of the unknown structure is made by modeling methods or some prediction algorithm. The success of this modeling-by-homology approach depends critically on the degree to which two sequences are homologous. If the homology of two sequences is ~90% or better, this approach can produce predicted structures that are as accurate as those determined by crystallography. If the homology is less than ~25%, the alignment of the two sequences itself becomes a major problem and the reliability of a structure prediction is low.
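The percentages quoted above are normally computed from a pairwise alignment. The following is a minimal Python sketch, assuming the two sequences have already been aligned to equal length with "-" marking gaps. The function names, the choice of denominator (gap-free aligned columns) and the toy sequences are illustrative assumptions; other conventions for the denominator give somewhat different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def modeling_outlook(identity: float) -> str:
    """Rough classification using the ~90% and ~25% figures from the text."""
    if identity >= 90.0:
        return "high"
    if identity < 25.0:
        return "low"
    return "intermediate"


# Toy aligned pair (hypothetical sequences, not protein A or BPTI).
toy_a = "ACDEFGHIK-LM"
toy_b = "ACDQFGHIKWLM"
print(round(percent_identity(toy_a, toy_b), 1),
      modeling_outlook(percent_identity(toy_a, toy_b)))
```

The two thresholds are only the approximate figures given in the paragraph above, not sharp cutoffs.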
An alternative approach to structure prediction from primary sequence is to start from the family of known protein folds and determine if the sequence of the new protein is consistent with any of these. This notion is sometimes termed inverse folding or threading. The quality of the fit of a sequence to a particular structure may be evaluated using energy functions, the solvent accessibility of residues or various statistical measures that are based on the environments and conformations of specific amino acids in known tertiary structures. Such approaches will fail if the conformation of the test sequence is not represented in the family of known protein folds.
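The idea of scoring how well a sequence fits a known fold can be illustrated with a toy example. The sketch below is not the MATCHMAKER or Godzik-Skolnick method: it assumes a fold is reduced to a string of buried ("b") and exposed ("e") sites, slides the sequence along it without gaps, and rewards hydrophobic residues on buried sites using the Kyte-Doolittle hydropathy scale. The function name, the burial encoding and the example strings are hypothetical. Real threading programs use statistical potentials and allow gaps.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def best_gapless_fit(sequence: str, burial: str):
    """Return (score, offset) of the best gapless placement, or None.

    burial is a string of 'b' (buried) and 'e' (exposed) template sites.
    A residue's hydropathy is added on buried sites and subtracted on
    exposed sites, so hydrophobic-in / polar-out placements score high.
    """
    best = None
    for offset in range(len(sequence) - len(burial) + 1):
        window = sequence[offset:offset + len(burial)]
        score = sum(KD[res] if site == "b" else -KD[res]
                    for res, site in zip(window, burial))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical 4-site template and 6-residue sequence.
print(best_gapless_fit("KDLIKD", "bbee"))
```

As the paragraph above notes, a scheme like this can only recognize folds that are already in the template library.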
SYBYL provides facilities for carrying out both homology modeling and inverse folding structure predictions. The program COMPOSER implements a homology modeling procedure developed by Johnson et al. (1994) and described in detail in their paper. The program MATCHMAKER provides a version of the inverse folding technique developed by Godzik and Skolnick (1992). The program GENEFOLD uses both homology methods and threading methods to provide predictions that are claimed to be more accurate than those obtained by use of either technology alone. The programs are accessed under the BIOPOLYMER menu. Manuals for the programs have been printed and are available on the shelves at the front of Rm 1153. There are also on-line tutorials for the programs available through TriposBookshelf.
This exercise. Suppose that a small protein (A) with the sequence VREVCSEQAETGPCRAAAGMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSA is isolated from the brain of an aging chemistry professor who is showing symptoms of Alzheimer's disease. (Don't laugh--you'll be an aging something in 40 years!)
(1) Find ten proteins with known tertiary structures that have sequences reasonably homologous to the sequence of A.
(2) You will probably find one protein (call it AA) that is nearly 100% homologous to A. Compare the sequence of A to this protein, noting the nature of any insertions or deletions in A compared to AA. Download the PDB file for AA for use later.
(3) The sequence of bovine pancreatic trypsin inhibitor (BPTI) is reasonably homologous to A. Starting with the known crystal structure of BPTI, predict the tertiary structure of protein A. Minimize the potential energy of the structure you have created using the AMBER forcefield. Write the coordinates of the predicted structure in PDB format into your Exer5 directory.
(4) Compare the predicted tertiary structure for A to that of AA by aligning the two molecules. Print a hardcopy of the aligned structures.
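For step (2), a first look at where A and AA differ can be had with Python's standard difflib module. This is a hedged sketch: difflib performs generic string matching with no substitution matrix or gap penalties, so it is a stand-in for a real alignment program and is only adequate when the two sequences are nearly identical. The function name and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def sequence_differences(seq_a: str, seq_aa: str):
    """List differences in A relative to the reference AA.

    Each entry is (kind, n_before, aa_segment, a_segment), where
    n_before is the number of AA residues preceding the change and
    kind is 'insert' (extra residues in A), 'delete' (residues of AA
    missing from A) or 'replace' (substituted residues).
    """
    matcher = SequenceMatcher(None, seq_aa, seq_a, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        differences.append((tag, i1, seq_aa[i1:i2], seq_a[j1:j2]))
    return differences


# Toy sequences: A carries two extra residues after position 3 of AA.
print(sequence_differences("ACDXXEFG", "ACDEFG"))
```

The positions reported this way should still be checked against the alignment produced by the search program you use.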
Report. (a) List 10 proteins that are homologous to protein A. (b) Briefly discuss the biochemical role or importance of protein AA. (c) Describe the homology of A and AA, indicating %homology and the positions and natures of sequence insertions and deletions. (d) Provide a complete path to the predicted tertiary structure of A (the file written under step 3 above). (e) Provide a brief description of the method(s) and procedure(s) you used in getting the predicted structure of A. (f) Turn in the hardcopy of the aligned A and AA structures, indicating on the printout the location of the sequence differences. (It may take more than one hardcopy to make these differences clear.)
Extra credit: Calculate and report the RMSD of the corresponding α-carbon atoms of the structures of A and AA.
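The RMSD over N paired atoms is the square root of the mean squared distance between corresponding atoms. A minimal Python sketch follows, under two loud assumptions: the two structures have already been superimposed (as in step 4), and corresponding α-carbons carry the same chain identifier and residue number in both files. The second assumption will generally not hold across an insertion or deletion, in which case the pairing must be taken from the sequence alignment instead. The function names are hypothetical; the column positions follow the fixed-column PDB ATOM record.

```python
import math


def ca_coordinates(pdb_text: str) -> dict:
    """Map (chain, residue number) -> (x, y, z) for alpha-carbon atoms."""
    coords = {}
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16; " CA " is the alpha carbon.
        if line.startswith("ATOM") and line[12:16] == " CA ":
            key = (line[21], int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.setdefault(key, xyz)  # keep first alternate location
    return coords


def ca_rmsd(pdb_a: str, pdb_b: str) -> float:
    """RMSD over alpha carbons present in both (pre-aligned) structures."""
    a, b = ca_coordinates(pdb_a), ca_coordinates(pdb_b)
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("no alpha carbons in common")
    total = sum((p - q) ** 2
                for key in shared for p, q in zip(a[key], b[key]))
    return math.sqrt(total / len(shared))


def atom_line(serial, res_name, chain, res_seq, x, y, z):
    """Format a minimal CA record in PDB fixed columns (demo helper)."""
    return "ATOM  %5d  CA  %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, res_name, chain, res_seq, x, y, z)


# Two-residue toy structures displaced by 1 Angstrom along z.
model = "\n".join([atom_line(1, "VAL", "A", 1, 0, 0, 0),
                   atom_line(2, "ARG", "A", 2, 1, 0, 0)])
reference = "\n".join([atom_line(1, "VAL", "A", 1, 0, 0, 1),
                       atom_line(2, "ARG", "A", 2, 1, 0, 1)])
print(ca_rmsd(model, reference))
```

This sketch does not perform the superposition itself; that is left to the alignment done in SYBYL or another program.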
Suggestions and comments.
[1] Take a little while to plan a strategy for completing this exercise. There is no one right way to do it and reasonable people will have different strategies.
[2] There are many resources available on the web for homology modeling, inverse folding and sequence alignment. Consider the use of BLASTP for sequence searching.
[3] No particular procedure for predicting the structure of A is required or recommended. It is possible to model protein A from the BPTI structure by one-by-one sidechain replacements, followed by energy minimization. However, COMPOSER or MATCHMAKER may give you results more rapidly.
[4] Recall that hydrogens have to be added to PDB coordinates to get a complete protein structure.
[5] You probably will want to remove water and other non-protein atoms from the PDB files you use.
[6] Please return any program manuals borrowed to Rm 1153 when you are finished with them.
[7] The report for this exercise is due the last day of class, March 15.
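Suggestion [5] above can also be carried out outside SYBYL. The following Python sketch keeps only ATOM records belonging to the 20 standard amino acids, plus TER and END records, and drops everything else (header records, waters, ions, ligands). It is an illustration under stated assumptions: the function name is hypothetical, and the filter would also discard modified residues stored as HETATM records, so inspect the file before relying on it. It does not add hydrogens (suggestion [4]); that still has to be done in the modeling program.

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def strip_non_protein(pdb_text: str) -> str:
    """Return PDB text containing only standard-residue ATOM, TER and END lines."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        # Residue name occupies columns 18-20 of an ATOM record.
        if record == "ATOM" and line[17:20] in STANDARD_RESIDUES:
            kept.append(line)
        elif record in ("TER", "END"):
            kept.append(line)
    return "\n".join(kept) + "\n"
```

A typical use would be to read the downloaded file with open(...).read(), pass the text through strip_non_protein, and write the result to a new file so that the original download is left untouched.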
References
Johnson, M. S., Srinivasan, N., Sowdhamini, R., & Blundell, T. L. (1994). Knowledge-Based Protein Modeling. Crit. Rev. Biochem. Mol. Biol. 29, 1-68.
Godzik, A., & Skolnick, J. (1992). Sequence-structure matching in globular proteins: Application to supersecondary and tertiary structure determination. Proc. Natl. Acad. Sci. U. S. A. 89, 12098-12102.