News Release 20-Dec-2000

Computer generates comparative gene maps

Peer-Reviewed Publication

Cornell University

ITHACA, N.Y. -- Comparing the genomes of two related species of a plant or animal often helps to locate important genes that have been identified in one species but not in another, and can provide clues about how both species evolved from a common ancestor. But making these "comparative gene maps" has been a slow process, something biologists do by hand over weeks, months or years, using data painstakingly collected in "wet labs" and analyzed with software designed to interpret only one map at a time.

Now, Cornell University researchers have come up with a way to do the comparison step in a few hours on a computer. In early tests, a computer-generated comparison of the genomes of rice and maize (corn) closely matched a similar map made by hand, and even suggested some relationships that had not shown up in the handmade map.

Debra Goldberg, Cornell graduate student in applied mathematics, developed the new method in collaboration with Susan McCouch, Cornell professor of plant breeding, and Jon Kleinberg, Cornell assistant professor of computer science. Goldberg described their work at the Gene Order Dynamics, Comparative Maps and Multigene Families (DCAF) workshop held in September in Sainte-Adèle, Quebec, and will present a later version at the Plant and Animal Genome IX conference in San Diego in January. Their paper, "Algorithms for Constructing Comparative Maps," appears in Comparative Genomics (David Sankoff and Joseph H. Nadeau, Eds., Kluwer Academic Publishers, 2000). A software implementation of the new method soon will be available to geneticists.

"The point of this isn't just to compare rice and corn, but to be able to do it with any two species," Goldberg says. "Ideally we'd like to be able to find new evolutionary pathways."

Every so often, as reproductive cells divide, genes and segments of chromosomes get shuffled around. One chromosome meets another and pieces of DNA are moved or swapped. If those particular cells then happen to be involved in reproduction, the new arrangement is passed on to the next generation and may spread through the population. It doesn't happen very often, but over evolutionary time scales many such events show up. Related species descended from a common ancestor have many genes in common, but they occur in different arrangements. A strand of DNA that used to be on chromosome 2 in some common ancestor ends up on chromosome 10, in between two pieces that used to belong to ancestral chromosomes 3 and 5. The relocated genes often continue to do the same jobs, and often several genes move together, retaining their ancestral order along a segment of DNA.

By comparing genomes, scientists can trace the evolutionary paths, and there are immediate practical applications. If it's known that genes A and B are near each other in the rice genome, and the location of gene A in maize also is known, then a comparative map could help locate gene B in maize. In plant breeding, such a discovery could help to breed corn with better disease resistance or improved nutritional value. In medicine, clues from the genome of the mouse are being used to help find genes associated with human diseases.
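The gene A and gene B reasoning above can be illustrated with a minimal sketch. It is not the researchers' software: the function name candidate_chromosomes, the toy gene positions and the 5-unit window are all assumptions made up for the example.

```python
from typing import Dict, Set, Tuple

# (chromosome number, map position); the position units are an assumption.
Position = Tuple[int, float]

def candidate_chromosomes(gene: str,
                          base_map: Dict[str, Position],
                          target_map: Dict[str, Position],
                          window: float = 5.0) -> Set[int]:
    """Suggest target chromosomes for `gene` from its mapped base-genome neighbors."""
    chrom, pos = base_map[gene]
    hits: Set[int] = set()
    for other, (o_chrom, o_pos) in base_map.items():
        is_neighbor = other != gene and o_chrom == chrom and abs(o_pos - pos) <= window
        # Only neighbors already located in the target genome can give a hint.
        if is_neighbor and other in target_map:
            hits.add(target_map[other][0])
    return hits

# Hypothetical toy data: A and B are close together in rice; only A is mapped in maize.
rice = {"A": (1, 10.0), "B": (1, 12.5), "C": (4, 11.0)}
maize = {"A": (8, 40.0), "C": (2, 5.0)}
suggestion = candidate_chromosomes("B", rice, maize)
```

With this toy data the only hint for gene B comes from gene A, so the suggestion is maize chromosome 8. A real comparative map would narrow the search to a segment, not just a chromosome.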

The idea of comparative mapping is to align genes in the order they are found along the chromosomes of the first or "base" species with those found in the same order on a single chromosome of the second or "target" species. The raw data consists of ordered lists of the genes and gene markers of both species that have been identified in "wet lab" experiments.

At the simplest level, a computer could look at each gene or marker of the base species, find where it is (on which arm of which chromosome) on the target genome, and label it accordingly. But geneticists want to step back to get a larger view, identifying segments of the base genome that contain arrays of genes that also are found together on the target genome. The catch is what McCouch calls "noise" in the data: the target genome can contain long arrays of genes that look like those on the base genome except that there are a few extra genes here and there that come from somewhere else in the genome. How does the computer decide whether or not to ignore the out-of-place genes? When are two similar linear arrays of genes close enough to be called a match?
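The "simplest level" step described above, looking up each base-species marker in the target genome and labeling it, might look like the following sketch. The marker names, chromosome numbers and arm codes are invented for illustration, and label_markers is a hypothetical helper, not part of the researchers' software.

```python
from typing import Dict, List, Optional, Tuple

Location = Tuple[int, str]  # (target chromosome, arm); "L"/"S" arm codes are assumed

# Hypothetical toy data: markers in the order they occur on one base chromosome.
base_order: List[str] = ["m1", "m2", "m3", "m4", "m5"]
target_location: Dict[str, Location] = {
    "m1": (3, "L"),
    "m2": (3, "L"),
    "m3": (8, "S"),  # an out-of-place marker, the kind of "noise" described above
    "m4": (3, "L"),
}

def label_markers(order: List[str],
                  locations: Dict[str, Location]) -> List[Tuple[str, Optional[Location]]]:
    """Pair each base marker with its target location, or None if it is unmapped."""
    return [(marker, locations.get(marker)) for marker in order]

labels = label_markers(base_order, target_location)
```

The labeled list alone does not say whether m3 should start a new segment or be ignored, which is the question the rest of the method addresses.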

In early stages of the work, Goldberg applied constraints, called "penalties," both for out-of-place genes and for breaks between segments. The computer was directed to minimize both the number of segments it created and the number of out-of-place genes in each segment. Although the approach was promising, when applied to a comparison between rice and maize it still didn't produce a map close enough to one made by hand, Goldberg says. Among other things, the computer often introduced too few breaks where a small part of one sequence appeared in the middle of another.
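One way to picture the penalty idea is a small dynamic program that charges a fixed cost for every out-of-place gene and for every break, then picks the cheapest labeling. This is a sketch under assumptions, not the algorithm from the paper: the function name segment, the penalty values and the use of plain chromosome labels are all illustrative choices.

```python
from typing import Dict, List, Sequence, Tuple

def segment(labels: Sequence[str],
            mismatch_penalty: float = 1.0,
            break_penalty: float = 2.0) -> Tuple[float, List[str]]:
    """Assign a segment label to every marker, minimizing the total penalty."""
    if not labels:
        return 0.0, []
    candidates = sorted(set(labels))
    # cost[c]: cheapest labeling of the markers seen so far that ends in segment c.
    cost: Dict[str, float] = {
        c: (0.0 if c == labels[0] else mismatch_penalty) for c in candidates
    }
    back: List[Dict[str, str]] = []
    for lab in labels[1:]:
        best_prev = min(candidates, key=lambda c: cost[c])
        new_cost: Dict[str, float] = {}
        choice: Dict[str, str] = {}
        for c in candidates:
            stay = cost[c]                             # continue the same segment
            switch = cost[best_prev] + break_penalty   # start a new segment here
            if stay <= switch:
                new_cost[c], choice[c] = stay, c
            else:
                new_cost[c], choice[c] = switch, best_prev
            if c != lab:
                new_cost[c] += mismatch_penalty        # marker is out of place in c
        back.append(choice)
        cost = new_cost
    last = min(candidates, key=lambda c: cost[c])
    total = cost[last]
    path = [last]
    for choice in reversed(back):                      # walk the choices backwards
        last = choice[last]
        path.append(last)
    path.reverse()
    return total, path

noisy = segment(["3", "3", "8", "3", "3"])
two_blocks = segment(["3", "3", "3", "8", "8", "8", "8"])
```

With these assumed penalties, the single stray "8" in the first list is absorbed into one segment, while the second list is split once. A fixed break penalty also absorbs short but genuine insertions, which matches the "too few breaks" problem reported above.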

So, Goldberg added a procedure that remembers the labels of genes as it goes along, making decisions about what sequences go together on the basis of an overall trend rather than considering just one gene at a time. Based on the sequence it remembered, the computer was allowed to reduce the penalties for breaks between segments. In other words, if a small but meaningful sequence of out-of-place genes appeared in the middle of another matching sequence, it would be marked as a separate segment. But if just a few out-of-place genes turned up and didn't have a meaningful relationship, the overall sequence still would be listed as a single segment.

In computer-science terms, the label for each gene is pushed onto a stack in memory, and popped back off when it gets to be too unlikely. This procedure, the researchers say in their paper, draws on computer methods for parsing sentences in natural-language processing, in which a program remembers words until the end of a sentence and only then decides what the sentence means.
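The push-and-pop idea can be sketched in a few lines. This is a loose illustration only, not the procedure from the paper: it replaces the paper's notion of a label becoming "too unlikely" with a simple assumed rule (a run shorter than min_run is treated as noise), and the name stack_segment is hypothetical.

```python
from typing import List, Sequence, Tuple

def stack_segment(labels: Sequence[str], min_run: int = 2) -> List[str]:
    """Keep runs of at least min_run markers as segments; fold shorter runs into the enclosing one."""
    assigned = list(labels)
    stack: List[Tuple[str, List[int]]] = []  # (label, marker indices) for each open segment

    def close_top() -> None:
        # Pop the most recent segment; a short run becomes noise in its parent.
        _, indices = stack.pop()
        if len(indices) < min_run and stack:
            enclosing = stack[-1][0]
            for i in indices:
                assigned[i] = enclosing

    for i, lab in enumerate(labels):
        if any(entry[0] == lab for entry in stack):
            # The label was remembered earlier: close everything opened since then.
            while stack[-1][0] != lab:
                close_top()
            stack[-1][1].append(i)
        else:
            stack.append((lab, [i]))  # remember a new, tentative segment
    while stack:
        close_top()
    return assigned

single_stray = stack_segment(["3", "3", "8", "3", "3"])
short_insert = stack_segment(["3", "3", "8", "8", "3", "3"])
```

In this sketch a lone out-of-place marker is folded back into the surrounding segment, while a run of two keeps its own label as a separate segment nested inside the outer one.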

Each chromosome in a living organism consists of two adjacent arms, and the algorithm also was modified to give special consideration to related orders of genes that appear on different arms of the same chromosome. In some cases biologists know which chromosome a gene is on, but not on which arm, so special consideration also was given to those "ambiguous" genes.

The researchers tested their computer method by comparing a computer-generated comparative map of rice and maize with a handmade map prepared in 1999 by William A. Wilson (a postdoctoral fellow in the Department of Plant Breeding at Cornell and now in private industry) and several colleagues at Cornell and Iowa State University. The computer mapping done by Goldberg was based on Wilson's original data. The results, the researchers say, were remarkably similar, although in their paper they note some minor differences. They also point out that handmade maps usually are made with reference to additional information that biologists hold in their memories, such as the order of genes along the chromosomes of other related species.

The computer also found a small "footprint" of an ancestral chromosome in maize that did not turn up in the handmade map, McCouch says. This will be investigated further in the lab, she says.

Besides rice and maize, the algorithm has been tested on a comparison between the mouse and human genomes. "It appears to work well in both cases," McCouch says. "It is certainly our intention to present this algorithm as a replacement for the construction of hand-crafted comparative maps."

###

Related World Wide Web sites: The following sites provide additional information on this news release. Some might not be part of the Cornell University community, and Cornell has no control over their content or availability.

o Cornell Center for Applied Mathematics: http://www.cam.cornell.edu/

o Susan McCouch home page: http://www.plbr.cornell.edu/PBBweb/McCouch.html

o Jon Kleinberg home page: http://www.cs.cornell.edu/annual_report/99-00/Kleinberg.htm
