Meet the genome's puzzle masters

By Haley Bridger, Communications

June 26, 2008

Image by Maria Nemchuk and Bang Wong, Broad Communications

With faster, cheaper DNA sequencing tools springing up, more data than ever are flowing into the world of genomics. But before results can be interpreted and meaningful conclusions reached, the raw data, fresh from the sequencing machines, must first go through a special kind of "pasteurization." The key component isn't heat, though; it's computational expertise. At the Broad Institute, analysts with backgrounds in computation and mathematics make up the expanding Computational Research and Development (CRD) group. They use their analytical skills and experience in computational problem solving to get massive volumes of sequence data into a form that helps researchers answer important biological questions.

Remarkably, most CRD members do not have a degree in computational biology or genomics. Instead, they bring a strong background in computation, which they use to define problems and support Broad scientists. "We have to understand some of the biology, but we learn that on the job," said CRD director David Jaffe, a former math professor turned computational researcher. "It's all about analyzing data, inventing algorithms, and turning them into code that will work on huge data sets. Finding people with really amazing computational skills is what's important."

The ability to apply computation to new situations is also handy, as the CRD group has to keep pace with the proliferation of new sequencing systems. On top of the three systems the Broad already runs, two fundamentally different ones will become available this year, and others are waiting in the wings. CRD will help Broad researchers get the most from these new systems. "We've got boatloads of people banging on our doors saying, 'We want to do this, we want to do that,'" said Jaffe. "But each technology has its own idiosyncrasies, as do the biology projects that utilize it. In each case, understanding how to think about the technology, and the biology, and how to make them work together can be a project unto itself."

The Computational Research and Development group works tirelessly to tackle data. Photo by Maria Nemchuk, Broad Communications

Understanding the genome, one piece at a time

When genomes are sequenced, they are first chopped into shorter, more manageable bits. The data produced from these short stretches of DNA, known as sequence reads, must be carefully pieced back together, like a jigsaw puzzle, in order to see an organism's genome as a coherent whole. This strategy underpins the CRD group's work in assembling large, mammalian genomes, small viral genomes, and everything in between.
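To make the jigsaw idea concrete, here is a minimal sketch of one textbook approach, greedy overlap merging: repeatedly join the two reads whose ends overlap the most. It is a toy illustration only, not the CRD group's software; the function names, the `min_len` cutoff, and the example reads are all assumptions made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; return the separate pieces
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads that tile a short made-up sequence.
toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
assembled = greedy_assemble(toy_reads)
```

Real assemblers must also cope with sequencing errors, reads from both DNA strands, and repeats, none of which this sketch handles.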

One project that the CRD group is working on involves comparing DNA sequences from many strains of Mycobacterium tuberculosis, the bacterium that causes tuberculosis. The goal is to determine how these sample genomes differ from an already decoded M. tuberculosis genome — a so-called "reference" genome. If piecing together a genome is like assembling a puzzle, the reference genome is like the picture on the puzzle's box. It provides a rough image of what the completed puzzle should look like, helping researchers to figure out how all the little reads fit into the big picture. Once this genome assembly step is complete, the team works to pick out the spots where the genomes differ from one another.
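The "picture on the box" idea can be sketched in a few lines: slide each read along the reference, keep the position with the fewest mismatches, and record where the read disagrees with the reference. This is a deliberately naive illustration under assumed names and an assumed `max_mismatches` cutoff; it is not the method used for the tuberculosis project, and the sequences are invented.

```python
def best_position(read: str, reference: str):
    """Return (position, mismatches) for the best ungapped placement of a read."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for r, g in zip(read, window) if r != g)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def differences(reads, reference: str, max_mismatches: int = 2):
    """Collect (position, reference_base, sample_base) for well-placed reads."""
    found = set()
    for read in reads:
        pos, mm = best_position(read, reference)
        if pos < 0 or mm > max_mismatches:
            continue  # read does not fit the reference well enough
        for offset, base in enumerate(read):
            if base != reference[pos + offset]:
                found.add((pos + offset, reference[pos + offset], base))
    return sorted(found)


# Invented reference, plus reads from a sample that differs at position 6.
reference = "ATGCGTACGTTAGC"
sample_reads = ["ATGCGTTC", "GTTCGTTA", "TCGTTAGC"]
spots = differences(sample_reads, reference)
```

In this toy case all three reads agree that the sample carries a T where the reference has an A, so a single difference is reported.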

New technologies can make this genome puzzle even harder. Since the new sequencing techniques generate shorter snippets of the genetic code than the earlier method, genomes are represented in much smaller, more numerous pieces. "The short reads come in much larger buckets, which can help compensate for their reduced length," said Iain MacCallum, a physicist who formerly studied optics.

MacCallum leads a group of computational biologists who use an algorithm called ALLPATHS to assemble genomes without using a reference genome to help. Imagine a jigsaw puzzle of many tiny and similar pieces that can fit together in different ways, but only one way is correct — and there isn't even a rough picture of the completed puzzle to use as a guide. A strategy MacCallum's group uses to make sense of the reads is called localization. Just as a puzzler might dump out a box of pieces and then divide them into piles based on color or pattern, the localization technique allows computational biologists to group together the reads that have come from a similar part of the genome. They begin by fitting different regions of the puzzle together and then try to connect them.

One of the natural enemies of assembly is repetition in the genome — repeating segments of DNA that contain nearly identical sequence. If each read is a puzzle piece, then these repetitive stretches are like the sky pieces: all have the same appearance and it can be difficult to tell where the uniform stretches begin and end.
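A short sketch shows why the "sky pieces" are troublesome. A read drawn from a repeated stretch matches the genome in more than one place, so its true origin cannot be decided from the read alone. The helper names and the example sequence below are invented for illustration.

```python
from collections import Counter


def repeated_kmers(genome: str, k: int):
    """Return every length-k substring that occurs more than once, with its count."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def placements(read: str, genome: str):
    """Return every position where the read matches the genome exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


# Invented genome containing the same 7-base stretch twice.
genome = "GATTACAGGGATTACATC"
repeats = repeated_kmers(genome, 5)
ambiguous = placements("ATTAC", genome)   # read from inside the repeat
unique = placements("ACAGG", genome)      # read spanning the repeat's edge
```

The read taken from inside the repeat fits in two places, while the read that reaches past the repeat's edge fits in only one. That is one reason longer reads make repeats easier to resolve.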

The CRD group wants to eventually use new sequencing technologies to assemble huge, mammalian genomes that have never before been sequenced — so-called de novo assembly. But for now, the researchers are still perfecting their strategies. "We are trying to use different technologies at the same time and we are trying to get all these new techniques to work together so that we can assemble big genomes," said Sante Gnerre, a mathematician who leads this effort. "But this is a very, very hard problem."

Finding the differences

While some CRD researchers concentrate on genome assembly, Jared Maguire looks for the genetic variations that may be vital to understanding human disease. But comparing genomes can be just as tricky as assembling them. Routine sequencing errors can often be mistaken for bona fide genetic differences, unless the appropriate computational fixes are developed and applied. "The new technology is much cheaper and you get more data, but the data often has more errors," said Maguire, a computer scientist who previously worked in natural language processing. "But there's so much more data that it outweighs the shortcomings."
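One simple way to see how volume can offset errors: when many reads cover the same position, a stray error shows up in only a few of them, while a genuine difference shows up in nearly all. The sketch below applies that idea with a plain majority threshold. The function name, the `min_depth` and `min_fraction` values, and the data are assumptions for illustration, not the fixes the article's researchers actually developed.

```python
from collections import Counter


def call_variants(pileup, reference: str, min_depth: int = 4,
                  min_fraction: float = 0.8):
    """Call a difference where most reads agree on a non-reference base.

    `pileup` maps a reference position to the string of bases that the
    reads covering that position reported.
    """
    calls = []
    for pos, bases in sorted(pileup.items()):
        if len(bases) < min_depth:
            continue  # too few reads to trust either way
        base, count = Counter(bases).most_common(1)[0]
        if base != reference[pos] and count / len(bases) >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls


reference = "ACGT"
pileup = {
    0: "AAAAAT",  # one stray T among five As: treated as a sequencing error
    1: "GGGGGG",  # every read says G where the reference says C
    2: "GG",      # disagrees with the reference, but only two reads
    3: "TTTT",    # matches the reference
}
calls = call_variants(pileup, reference)
```

Only position 1 is reported. The lone T at position 0 is outvoted, and position 2 has too little data to call.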

The ability to accurately identify genetic differences forms the backbone of several projects underway at the Broad. Chief among them are efforts to map the genomes of cancer cells, including the federally funded TCGA (The Cancer Genome Atlas) project. Such projects aim to systematically compare DNA from tumor cells with DNA from normal cells in order to highlight and catalogue the genetic mutations associated with cancer.

Another major project on the horizon is the 1000 Genomes Project, which was launched by the NHGRI earlier this year. Researchers will collaborate on sequencing the genomes of individuals from around the world to develop a detailed picture of human genetic variation, including variants present in as little as 1% of the population. Such rare variants are believed to be key to understanding common diseases, such as diabetes, that have a large genetic component but for which no single locus is the cause.
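A quick calculation suggests why sample size matters for rare variants. If a variant is carried by a fraction p of people, and n people are sampled independently, the chance that at least one of them carries it is 1 - (1 - p)^n. The sketch below evaluates that formula. The independence assumption and the sample sizes are illustrative choices, not figures from the project's design.

```python
def chance_of_seeing_variant(carrier_fraction: float, people_sampled: int) -> float:
    """Probability that at least one sampled person carries the variant,
    assuming people are sampled independently."""
    return 1.0 - (1.0 - carrier_fraction) ** people_sampled


# Illustrative sample sizes for a variant carried by 1% of people.
with_100_people = chance_of_seeing_variant(0.01, 100)
with_1000_people = chance_of_seeing_variant(0.01, 1000)
```

With 100 people, the variant would be missed entirely roughly a third of the time. With 1,000 people, missing it becomes very unlikely.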

And like so many other projects now underway in the CRD group, the work will give Jaffe and his colleagues mountains of data to analyze and understand.