Feature Story | 19-Jun-2001

Protein prediction tool has good prospects

DOE/Oak Ridge National Laboratory

The international competition to predict the three-dimensional (3D) structures of 43 proteins using computational tools was intense. Of the 123 groups competing in the fourth Critical Assessment of Techniques for Protein Structure Prediction (CASP-4), held from June through September 2000, an ORNL group placed sixth, putting it in the top 5%. ORNL also placed ahead of all other Department of Energy national laboratories in the contest.

The actual structures of the 43 target proteins were determined experimentally by nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, and the data were unpublished at the time of the competition. "The computational groups were provided with the identity and order of amino acids making up each protein and the length of the one-dimensional amino-acid sequence," says Ying Xu, leader of the Computational Protein Structure Group in the Computational Biology Section of ORNL's Life Sciences Division. "From this information our team predicted protein structure." Other team members were Dong Xu, Oakley Crawford, and Phil LoCascio.

Motivating this competition is biologists' search for the function of individual proteins that work together in "protein machines" to form an organism or keep it alive. Biologists also want to understand how these functions are performed at the molecular level. Proteins often do their work by docking with another protein. Because the function of a protein is related to its shape, it is essential to learn the 3D structure of each protein. Using the details of a protein's shape, a chemical compound can be custom designed to fit precisely in the protein, like a hand in a glove, blocking or enhancing the protein's activity. In principle, this approach could yield a highly effective drug with no side effects for each individual.

"The demand for rapid protein structure determination will grow drastically because information that could be used for rational drug design is becoming available rapidly," Xu says. "Traditional experimental methods for determining protein structure may not be able to keep up with the pace at which amino-acid sequences are being generated. Computational techniques, in conjunction with experimental methods, could more rapidly determine protein structures on a genome scale."

For the CASP-4 competition, the ORNL researchers used a computer package that they developed and continue to improve. It is called the Protein Structure Prediction and Evaluation Computer Toolkit (PROSPECT) and is one of only a few dozen protein-threading computer programs in the world.

"In the CASP competition, you get a 0 if you fail to identify the correct structural template," Xu says. "You get a 4 if your alignment between the target protein and the template is perfect. You get scores of 1 to 3 depending on how close you are to being correct. The scores are added up for all 43 proteins. We recognized two-thirds of the correct templates, which is the most among all the competing teams, and one-third of our alignments were off."

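The scoring scheme Xu describes can be illustrated with a minimal sketch. The function names and the example scores below are hypothetical, and the sketch assumes only what the quote states: each target earns 0 to 4 points and the points are summed. Treating any nonzero score as a recognized template is an assumption drawn from the quote, not a documented CASP rule.

```python
from typing import Sequence

MAX_SCORE_PER_TARGET = 4  # per the quote: 4 means a perfect alignment


def total_score(per_target_scores: Sequence[int]) -> int:
    """Sum the per-target scores, rejecting values outside 0..4."""
    total = 0
    for score in per_target_scores:
        if not 0 <= score <= MAX_SCORE_PER_TARGET:
            raise ValueError(f"score out of range: {score}")
        total += score
    return total


def templates_recognized(per_target_scores: Sequence[int]) -> int:
    """Count targets with a nonzero score (assumed to mean the
    correct template was identified)."""
    return sum(1 for score in per_target_scores if score > 0)


# Hypothetical scores for six targets, not actual CASP-4 results.
example_scores = [4, 0, 2, 3, 0, 4]
print(total_score(example_scores))           # 13
print(templates_recognized(example_scores))  # 4
```
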
Recently, the ORNL team attended a conference in Asilomar, California, and learned how other teams did in CASP-4 compared with PROSPECT.

"Some 10,000 protein structures have been determined experimentally out of the 100,000 or so proteins believed to exist, and the information is stored in the Protein Data Bank," Xu says. "To keep up with the production rate at which protein sequences are being generated by the genome projects, computational methods are clearly needed. Structure predictions have been made by the conventional ab initio technique in which a supercomputer is used to predict how an amino-acid chain can fold itself into a final shape based on first principles of physics and chemistry. Unfortunately, it takes weeks to months to predict the structure of even the smallest protein using this approach and the prediction reliability is poor."

The ORNL group uses template-based methods of protein structure prediction. These methods rely on experimentally determined 3D structures in the Protein Data Bank. The ORNL group uses PROSPECT to do "protein threading," in which a string of amino acids is computationally aligned along different protein templates—like an embroidery thread drawn through a printed design—to determine which template gives the best fit. In a perfect alignment, the amino-acid atoms are at their preferred lowest energy levels and are compatible with neighboring atoms and the protein's environment.
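
The idea of threading can be sketched in a few lines of code. The example below is a toy illustration, not PROSPECT's algorithm: it assumes each template is reduced to a string of "B" (buried) and "E" (exposed) positions, rewards hydrophobic residues in buried positions and polar residues in exposed ones, and uses a standard global-alignment dynamic program with a fixed gap penalty. The residue grouping, the scores, and all names are assumptions made for illustration; real threading programs use far richer energy functions.

```python
from typing import Dict

# Assumed, simplified set of hydrophobic residues (one-letter codes).
HYDROPHOBIC = set("AVLIMFWC")


def fit(residue: str, environment: str) -> int:
    """Return +1 if the residue suits the template position, else -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread_score(sequence: str, template_env: str, gap: int = -2) -> int:
    """Best global alignment score of a sequence against one template."""
    n, m = len(sequence), len(template_env)
    # dp[i][j]: best score for the first i residues vs. first j positions
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + fit(sequence[i - 1], template_env[j - 1]),
                dp[i - 1][j] + gap,  # residue left unaligned
                dp[i][j - 1] + gap,  # template position skipped
            )
    return dp[n][m]


def best_template(sequence: str, templates: Dict[str, str]) -> str:
    """Name of the template with the highest threading score."""
    return max(templates, key=lambda name: thread_score(sequence, templates[name]))


# Hypothetical four-residue sequence and two hypothetical templates.
templates = {"fold_a": "BEBE", "fold_b": "EBEB"}
print(best_template("LKVE", templates))  # fold_a
```

In this toy case the sequence alternates hydrophobic and polar residues, so it fits the template whose buried and exposed positions alternate in the same order.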

The ORNL group also uses homology modeling to fine-tune the predicted structure. In this technique, if two amino-acid sequences are similar and one sequence has a known structure, researchers can use this information to help determine the structure of the unknown protein sequence. By calculating the detailed forces between atoms and adjusting the final predicted structure to minimize the atoms' energies, the researchers computationally tweak the predicted structure of the target protein to make it energetically more favorable.

"It is believed that about 1000 unique protein structural folds exist in nature and that many proteins share each of these unique structural folds," Xu says. "Some 600 unique protein structures have been determined experimentally. Once the 1000 unique structural folds are determined by NMR and X-ray crystallography, the rest of the 100,000 protein structures can be accurately modeled computationally."

Xu's group, which has four staff researchers and two postdoctoral scientists, is involved in the National Institutes of Health's Structural Genome Initiative, which is dedicated to finding the structures of 100,000 human proteins. As part of this effort, NIH has funded seven pilot centers for experimentally determining protein structures. They include centers at DOE's Argonne, Brookhaven, Lawrence Berkeley, and Los Alamos national laboratories. At Lawrence Berkeley, David Eisenberg, a pioneer in protein threading, is trying to determine the structures of proteins in the genome of the rod-shaped bacterium that causes tuberculosis.

"A new trend in structure prediction is the incorporation of partial experimental data as constraints in the computation process, to make structure prediction closer in accuracy to the experimental structures," Xu says. "PROSPECT is ideally suited for incorporating data from local researchers and from these pilot centers. The data include measurements of distances between amino acids and information on which amino acids tend to be found on the surface of a protein and which don't."

Recently, the ORNL group modeled a protein complex using PROSPECT and experimental data provided by Cynthia Peterson, a University of Tennessee researcher who has identified a number of disulfide bonds between amino acids in certain parts of the protein. PROSPECT is also being used to incorporate experimental data provided by Greg Hurst and Jim Stephenson of ORNL's Organic Mass Spectrometry Group. They are using electrospray ionization ion trap mass spectrometry and a cross-linking chemical to determine the distance between two amino acids, both lysines, in a protein. Their initial studies found that the lysines linked by this chemical of a known length are 4 angstroms apart. (See Protein Identification by Mass Spectrometry.)
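
One simple way to picture how such distance measurements constrain a model is to check a predicted structure against them. The sketch below is an illustration under stated assumptions, not a description of how PROSPECT uses the data: it assumes each residue is represented by a single 3D coordinate in angstroms and each experimental measurement by a maximum allowed distance between two residues. The function name, residue numbers, coordinates, and limits are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Restraint = Tuple[int, int, float]  # (residue i, residue j, max distance)


def violated_restraints(
    coords: Dict[int, Point], restraints: List[Restraint]
) -> List[Tuple[int, int, float]]:
    """Return (i, j, actual distance) for each restraint the model breaks."""
    violations = []
    for i, j, max_dist in restraints:
        distance = math.dist(coords[i], coords[j])
        if distance > max_dist:
            violations.append((i, j, distance))
    return violations


# Hypothetical model coordinates (angstroms) for three residues.
model = {10: (0.0, 0.0, 0.0), 42: (3.0, 4.0, 0.0), 77: (0.0, 0.0, 20.0)}
# Hypothetical restraints: both pairs should lie within 6 angstroms.
limits = [(10, 42, 6.0), (10, 77, 6.0)]
print(violated_restraints(model, limits))  # [(10, 77, 20.0)]
```

A candidate structure that breaks many restraints could be ranked lower or discarded, which is one plausible way partial experimental data can narrow the set of acceptable predictions.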

Because NMR data provide distances between amino acids in a protein, the ORNL group will gladly accept partial NMR data from the pilot centers, data that would otherwise go unused because they are insufficient to determine a whole protein structure. "This amount of data is good enough to help PROSPECT reliably predict a protein structure," Xu says. He notes that the uncertainty in a determined protein structure is 1 to 1.5 angstroms for X-ray crystallography, 2 to 2.5 angstroms for NMR, and about 4 angstroms for computer modeling.

The ORNL group is focused not only on predicting protein structures more accurately but also on doing it much faster than current computational techniques allow. "We now use PROSPECT and 20 other computational tools to determine protein structures in a semi-automatic fashion," Xu says. "Using the IBM RS/6000 SP supercomputer at DOE's Center for Computational Sciences at ORNL, we can now thread 100 or more proteins a day against 2000 possible template structures. We are seeking funding to develop software to build an expert system and an automated protein structure pipeline to run on the IBM supercomputer. The expert system will mimic the human decision-making process to automate the computational tools. Our goal is to predict about 100 protein structures a day.

"If the proposal is funded, our first project will be to predict the structure of proteins of the Prochlorococcus marinus genome, a bacterium with about 1600 genes. We hope to show that we can predict these protein structures."

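The figures Xu quotes imply some simple arithmetic, sketched below. The function names are hypothetical, and the estimate assumes one protein per gene and a steady rate of 100 proteins a day, neither of which the article states as fact.

```python
import math


def threading_runs_per_day(proteins_per_day: int, templates: int) -> int:
    """Number of sequence-to-template threadings performed in a day."""
    return proteins_per_day * templates


def days_to_cover(n_proteins: int, proteins_per_day: int) -> int:
    """Whole days needed to process every protein at a steady rate."""
    return math.ceil(n_proteins / proteins_per_day)


# 100 proteins a day against 2000 templates, as quoted above.
print(threading_runs_per_day(100, 2000))  # 200000
# About 1600 genes, assuming one protein per gene.
print(days_to_cover(1600, 100))           # 16
```
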
In the next three years, the ORNL group expects to perform computer simulations that will be of interest to the pharmaceutical industry. This work could enable more rapid design of drugs that are safe and effective.

"To do this, you have to know whether a ligand, which is a group of molecules typical of a new drug, will dock with a particular protein to inhibit or stimulate its activity," Xu says. "We will be doing computer modeling to determine whether and how a ligand binds with various proteins to cause a healing or harmful effect."

PROSPECT is a copyrighted computer program. It is being used by over 20 academic organizations, including MIT, Columbia University, the University of Michigan, and the University of Texas. Millennium Pharmaceuticals is interested in licensing the program. ORNL's Technology Transfer and Economic Development Directorate seeks to license this computer toolkit for commercial use because its recent successes suggest it has very good prospects.

###