BETHESDA, MD – MARCH 20, 2014 – The massive genome of the loblolly pine—around seven times bigger than the human genome—is the largest genome sequenced to date and the most complete conifer genome sequence ever published. This achievement marks the first big test of a new analysis method that can speed up genome assembly by compressing the raw sequence data 100-fold.
The draft genome is described in the March 2014 issue of GENETICS and in the journal Genome Biology.
Loblolly pine is the most commercially important tree species in the United States and the source of most American paper products. The tree is also being developed as a feedstock for biofuel. The genome sequence will help scientists breed improved varieties and understand the evolution and diversity of plants.
But the enormous size of the pine's genome had been an obstacle to sequencing efforts until recently. "It's a huge genome. But the challenge isn't just collecting all the sequence data. The problem is assembling that sequence into order," said David Neale, a professor of plant sciences at the University of California, Davis, who led the loblolly pine genome project and is an author on the GENETICS and Genome Biology articles.
Modern genome sequencing methods make it relatively easy to read the individual "letters" in DNA, but only in short fragments. In the case of the loblolly, 16 billion separate fragments had to be fit back together—a computational puzzle called genome assembly.
"We were able to assemble the human genome, but it was close to the limit of our ability; seven times bigger was just too much," said Steven Salzberg, professor of medicine and biostatistics at Johns Hopkins University, one of the directors of the loblolly genome assembly team, who was also an author on the papers.
The scale of the problem can be compared to shredding thousands of copies of the same book and then trying to read the story. "You have this big pile of tiny pieces and now you have to reassemble the book," Salzberg said.
The key to the solution was using a new method to pre-process the gargantuan pile of sequence data so that it could all fit within the working memory of a single supercomputer. The method, developed by researchers at the University of Maryland, compiles many overlapping fragments of sequence into much larger chunks, then throws away all the redundant information. Eliminating the redundancies leaves the computer with 100 times less sequence data to deal with.
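For readers who want a feel for the idea, the short Python sketch below illustrates the general principle of merging overlapping fragments and discarding duplicates. It is a toy illustration only, not the team's software: the function names, the fixed word length k, and the rule "extend a fragment only while exactly one continuation exists" are assumptions made for this example, and it ignores sequencing errors and the reverse DNA strand.

```python
# Toy illustration (hypothetical names, not the project's actual software):
# extend every read while its continuation is unambiguous, then keep each
# resulting longer chunk only once.

def kmers(seq, k):
    """Yield every substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_kmer_set(reads, k):
    """Collect all k-letter words that occur in any read."""
    table = set()
    for read in reads:
        table.update(kmers(read, k))
    return table

def extend_right(seq, table, k):
    """Append letters while exactly one known k-mer continues the sequence."""
    seen = set(kmers(seq, k))  # guards against looping forever around a cycle
    while True:
        suffix = seq[-(k - 1):]
        nxt = [suffix + b for b in "ACGT" if suffix + b in table]
        if len(nxt) != 1 or nxt[0] in seen:
            return seq
        seen.add(nxt[0])
        seq += nxt[0][-1]

def extend_left(seq, table, k):
    """Prepend letters while exactly one known k-mer precedes the sequence."""
    seen = set(kmers(seq, k))
    while True:
        prefix = seq[:k - 1]
        prev = [b + prefix for b in "ACGT" if b + prefix in table]
        if len(prev) != 1 or prev[0] in seen:
            return seq
        seen.add(prev[0])
        seq = prev[0][0] + seq

def super_reads(reads, k):
    """Return the set of distinct extended chunks (assumes k >= 2)."""
    table = build_kmer_set(reads, k)
    return {
        extend_left(extend_right(read, table, k), table, k)
        for read in reads
        if len(read) >= k  # reads shorter than k carry no k-mer and are skipped
    }

if __name__ == "__main__":
    genome = "ATGGCGTACCTTAGGCAAT"  # made-up sequence for the demo
    reads = [genome[i:i + 8] for i in range(len(genome) - 7)]  # overlapping 8-letter reads
    chunks = super_reads(reads, k=5)
    print(sum(len(r) for r in reads), "letters of reads ->",
          sum(len(c) for c in chunks), "letters of chunks")
```

In this made-up example every overlapping read grows into the same longer chunk, so only one copy needs to be kept. Real data contain repeats and errors, which is where the extension rule stops and several separate chunks remain.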
This approach allowed the team to assemble a much more complete genome sequence than the draft assemblies of two other conifer species reported last year. "The size of the pieces of consecutive sequence that we assembled are orders of magnitude larger than what's been previously published," said Neale. As a result, the loblolly sequence can serve as a high-quality "reference" genome that should considerably speed up future conifer genome projects.
The loblolly genome data have also been freely available throughout the project, with public releases starting back in June 2012. "Our project has had great benefits to the community long before publication," said Neale.
The new sequence confirmed that the loblolly genome is so large because it is crammed full of invasive DNA elements that copied themselves around the genome. Approximately 82% of the genome is made up of these and other repetitive fragments of sequence.
The genome also revealed the location of genes that may be involved in fighting off pathogens, which will help scientists understand more about disease resistance in pines.
"The megagenomes of conifers are a challenge to sequence. Thanks to the important innovations described in these articles, the draft genome of the loblolly pine is not only the largest ever assembled, its quality is impressive. It paves the way for assembly of even larger genomes," said Mark Johnston, Editor-in-Chief of the journal GENETICS.
"Loblolly pine plays an important role in American forestry. Now that we've unlocked its genetic secrets, loblolly pine will take on even greater importance as we look for new sources of biomass to drive our nation's bioeconomy and ways to increase carbon sequestration and mitigate climate change," said Sonny Ramaswamy, director of USDA's National Institute of Food and Agriculture (NIFA), which funded the research. "I applaud the research team for their efforts as their work truly represents the science needed to bring about solutions to some of our greatest challenges."
The loblolly genome project was led by a team at the University of California, Davis, and the assembly stages were led by Johns Hopkins University and the University of Maryland. Other collaborating institutions include Indiana University, Bloomington; Texas A&M University; Children's Hospital Oakland Research Institute; and Washington State University.
FUNDING: This work was supported in part by the US Department of Agriculture's National Institute of Food and Agriculture through its flagship competitive grants program, the Agriculture and Food Research Initiative.
CITATIONS:
A. Zimin, K.A. Stevens, M. Crepeau, A. Holtz-Morris, M. Koriabine, G. Marçais, D. Puiu, M. Roberts, J.L. Wegrzyn, P.J. de Jong, D.B. Neale, S.L. Salzberg, J.A. Yorke, and C.H. Langley. Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics March 2014 196: 875-890; doi: 10.1534/genetics.113.159715 http://www.genetics.org/content/196/3/875
J.L. Wegrzyn, J.D. Liechty, K.A. Stevens, L. Wu, C.A. Loopstra, H. Vasquez-Gross, W.M. Dougherty, B.Y. Lin, J.J. Zieve, P.J. Martínez-García, C. Holt, M. Yandell, A. Zimin, J.A. Yorke, M. Crepeau, D. Puiu, S.L. Salzberg, P.J. de Jong, K. Mockaitis, D. Main, C.H. Langley, and D.B. Neale. Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation. Genetics March 2014 196: 891-909; doi: 10.1534/genetics.113.159996 http://www.genetics.org/content/196/3/891
D.B. Neale, J.L. Wegrzyn, K.A. Stevens, A.V. Zimin, D. Puiu, M.W. Crepeau, C. Cardeno, M. Koriabine, A.E. Holtz-Morris, J.D. Liechty, P.J. Martínez-García, H.A. Vasquez-Gross, B.Y. Lin, J.J. Zieve, W.M. Dougherty, S. Fuentes-Soriano, L. Wu, D. Gilbert, G. Marçais, M. Roberts, C. Holt, M. Yandell, J.M. Davis, K. Smith, J.F.D. Dean, W.W. Lorenz, R.W. Whetten, R. Sederoff, N. Wheeler, P.E. McGuire, D. Main, C.A. Loopstra, K. Mockaitis, P.J. de Jong, J.A. Yorke, S.L. Salzberg, and C.H. Langley. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biology 2014, 15:R59 http://genomebiology.com/2014/15/3/R59
ABOUT GENETICS: Since 1916, GENETICS has published high-quality, original research on a range of topics bearing on inheritance, including population and evolutionary genetics, complex traits, developmental and behavioral genetics, cellular genetics, gene expression, genome integrity and transmission, and genome and systems biology. A peer-reviewed and peer-edited publication of the Genetics Society of America, GENETICS is one of the world's most cited journals in genetics and heredity.
ABOUT GSA: Founded in 1931, the Genetics Society of America (GSA) is the professional scientific society for genetics researchers and educators. The Society's more than 5,000 members worldwide work to deepen our understanding of the living world by advancing the field of genetics, from the molecular to the population level. GSA promotes research and fosters communication through a number of GSA-sponsored conferences, including regular meetings that focus on particular model organisms. GSA publishes two peer-reviewed, peer-edited scholarly journals: GENETICS, which has published high-quality original research across the breadth of the field since 1916, and G3: Genes|Genomes|Genetics, an open-access journal launched in 2011 to disseminate high-quality foundational research in genetics and genomics. The Society also has a deep commitment to education and fostering the next generation of scholars in the field. For more information about GSA, please visit http://www.genetics-gsa.org. Also follow GSA on Facebook at facebook.com/GeneticsGSA and on Twitter at @GeneticsGSA.