The Genomic Palimpsest: Genomics in Evolution and Ecology
Genomics is the discipline that has grown up around the sequencing and analysis of complete genomes. It has typically emphasized questions that involve the biological function of individual organisms, and has been somewhat isolated from the fields of evolutionary biology and ecology. However, genomic approaches also provide powerful tools for studying populations, interactions among organisms, and evolutionary history. Because of the large number of microbial genomes available, the first widespread use of genomic methods in evolution and ecology was in the study of bacteria and archaea, but similar approaches are being applied to eukaryotes. Genomic approaches have revolutionized the study of in situ microbial populations and facilitated the reconstruction of early events in the evolution of photosynthetic eukaryotes. Fields that have been largely unaffected by genomics will feel its influence in the near future, and greater interaction will benefit all of these historically distinct fields of study.
A tidal wave in the biological sciences, caused by the advent of genomic techniques and data, has begun to sweep across the fields of evolutionary biology and ecology. Although its first ripples were felt much earlier (Mount 2001), the modern discipline of genome biology, or genomics, surged in the late 1980s under the impetus of the Human Genome Project. While its purview was soon expanded to include a variety of microbes and model organisms, the primary motivation for the project was biomedical research, with a concomitant emphasis on physiological processes and functional questions. The National Plant Genome Initiative, which brought genomics to the plant sciences, has focused exclusively on economically important plants and model systems. Thus, genomics is largely associated with clinical and agricultural questions, and it has had its first great impact in fields such as biochemistry, cell biology, physiology, and genetics.
But there is more to a genome than its current functional biology. The genome is like a palimpsest (figure 1), an ancient, recycled manuscript in which the traces of an earlier text can be discerned. The current functions of an organism's genome represent the adaptations of the organism to its present environment, but these functions are superimposed on remnants of evolutionary history. Because the genome is the result of billions of years of evolution, natural selection has shaped its current functions through the modification of older functions. It is rare for genes to appear de novo; rather, they are typically formed by the modification of a previously existing gene, often one that has been made superfluous by gene or genome duplication (Dujon et al. 2004), perhaps with the added twist of horizontal gene transfer. This means that every feature of a genome has characteristics that reflect the adaptation of the genome to its environment, layered on top of those that reflect its evolutionary history. For researchers who are primarily concerned with understanding the function of the genome, the features that are primarily the result of evolutionary history can be distractions. However, just as it is possible to read the subtext in a palimpsest, it is possible to study genomes to gain insight into the evolutionary past. This is not to say that the field of genomics has been uninformed by evolutionary history; quite the opposite is true. For example, one of the most important techniques used to infer the function of protein-coding genes relies on the assumption that homologous genes (which are, by definition, descended from a single gene in a common ancestor) will often have similar functions. Yet it is only relatively recently that genomics has begun to have a substantial impact on the study of evolution and ecology.
What is genomics?
The field of genomics is usually understood to be the study of genomes as a whole, particularly the study of complete genome sequences at the nucleotide level. It is distinct from the field of genetics, which originated before the nature of the genetic material was known, and which relies on patterns of inheritance to make inferences about inherited traits and how they interact. Classical genetics is in some senses a distinct field from modern molecular genetics, which makes use of a variety of techniques to manipulate a cell's genetic contents and expression, although like classical genetics it takes advantage of patterns of inheritance to gain insight. Genomics complements molecular genetics by providing an encyclopedic list of a genome's contents. Once the genomic sequence is known, a variety of methods can be used to make inferences about the information held in that sequence. Consequently, the first major task faced in genomics research is the computational analysis of the genome sequence. Even a relatively small genome would be overwhelming to interpret manually, so genomics necessarily includes not only the techniques needed to perform efficient, high-throughput DNA sequencing but also a variety of information technologies used to interpret the data. Biological informatics (bioinformatics) has exploded in importance as a flood of new genomic data has become available with the advent of highly efficient DNA sequencing technologies. Thus, among the major new technologies at the core of genomics are computational methods such as contig assembly (the assembly of individual DNA sequence reads into contiguous stretches of genomic sequence), scaffolding (the use of long-range map information to facilitate large-scale contig assembly), pattern recognition and gene finding (for the detection of subtle patterns, including the beginning and end of protein-coding regions), and database searching (for the rapid identification of regions of interest in large databases).
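To make the idea of contig assembly concrete, the following is a minimal sketch in Python of a greedy overlap-merge strategy. It is a toy illustration and not the algorithm used by any production assembler: it assumes error-free reads that all come from the same strand, and the function names (`overlap`, `greedy_assemble`) and the `min_len` threshold are hypothetical choices made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs by repeatedly joining the best-overlapping pair.

    Assumption (illustrative only): reads are error-free and same-strand.
    """
    # Drop exact duplicates (keeping order) and reads contained in other reads.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]

    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining pair overlaps enough; what is left are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
    print(greedy_assemble(example_reads))
```

Reads that share no sufficient overlap with anything else simply remain as separate contigs, which is the situation that scaffolding with long-range map information is meant to resolve.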
As the efficiency of high-throughput methods and the advantages of working with large data sets have become clear, and as it has become increasingly obvious that the DNA sequence alone would provide only a partial view of the genome's biological function, a number of ancillary fields have developed. The first of these is the study of expressed RNAs, often in the form of reverse-transcribed complementary DNA (cDNA) libraries. Although the technique of cDNA library construction is not novel, the genomic approach to such libraries emphasizes high-throughput sequencing and near-saturation data collection. Rather than screen a library for a small number of clones of interest, a typical project will analyze a sample of 10,000 to 100,000 or more single-read sequences of clones randomly selected from a cDNA library. This expressed sequence tag (EST) approach takes advantage of the relative ease of determining DNA sequences from a uniform set of clones.
By analogy to the word genome, the set of transcribed sequences that can be produced by a cell is referred to as the transcriptome. Although it could be dismissed as jargon, this terminology is easily understood and has some legitimate linguistic basis. The term genome (literally “begetting stuff”) was coined to refer to the unknown substance that carried the information of inheritance; in current biological usage, it refers to the set of all genes. In this context, the -ome suffix conveys a sense of completeness. Following this lead, the set of all proteins that can be produced by a cell (which in eukaryotes most emphatically does not have a one-to-one correspondence with the set of protein-coding genes) is referred to as the proteome. One will also hear reference to the glycome (for carbohydrates), interactome (for interactions among genes), and so forth, all of which convey a sense of completeness and high-throughput screening. Also associated with genomics are techniques for high-throughput analysis, including microarrays for rapid screening of expression levels in different cell types and developmental stages, and mass-spectrometric methods that can be used for sequencing complex mixtures of polypeptides.
All of these approaches share an emphasis on large-scale data sets that approach exhaustive sampling, with an underlying assumption that such data will be cheaper on a per-unit basis than more narrowly targeted research. This is mass-production science, and the increased efficiency it can yield is very real. Genomics research has proved to be self-reinforcing, with large-scale sequencing efforts leading to improvements in techniques, so that a kilobase of DNA sequence can now be obtained for a fraction of the effort and cost that would have been required just 10 years ago. The cost of DNA sequencing at the inception of the Human Genome Project was roughly $10 per finished base, which would add up to roughly $27 billion for the cost of sequencing the complete genome. In fact, the project cost less than $3 billion, with most of the sequencing performed for less than $300 million, at a cost of less than $1 per 10 bases (Shendure et al. 2004). Furthermore, these economies of scale are not limited to megaprojects. In my own laboratory, we found that a relatively small project that called for making a cDNA library and determining 5000 individual sequence reads required far less time and effort to locate a sequence of interest than did more traditional selective methods. To my astonishment, this sequencing was brought from organism to publication by three dedicated students at a per-base cost comparable to that achieved by the Human Genome Project (Bachvaroff et al. 2004).
Genomics and microbial ecology
One of the first bits of hidden information to be read from the genomic palimpsest concerned microbial diversity. When high-throughput, random screening approaches were applied to the study of natural microbial populations, they provided a radically different kind of information than did traditional methods. This resulted in a remarkable reorganization of thought on microbial diversity. I first began to think about the intersection between genomics, evolution, and ecology when working on a collaborative project on microbial diversity (Barns et al. 1996). Traditional microbiology relied on the ability to grow cultures of microorganisms. Culture methods permitted the development of modern microbiology, but they suffered from the critical weakness that an organism could be studied only if it could be cultured. Because most bacteria and archaea are extremely small and show relatively little morphological diversity, organisms that could not be cultured remained almost completely unstudied (DeLong and Pace 2001). This presented a particularly grave impediment to microbial ecology, because a basic knowledge of organismal diversity within a community is essential to understanding how that community functions. To place this limitation in perspective, imagine trying to study a tropical rain forest by using a bulldozer to collect all of the plants and animals in a hectare of forest and then dumping these materials in a greenhouse with a temperature and moisture regimen similar to that of the natural forest. Some plants and animals would survive the process, and in due course a scientist could isolate pure cultures of these organisms. But these samples would not be a particularly good representation of diversity within the forest. Having information about a community in situ is key to understanding how it functions. Molecular methods closely related to genomics have opened a window on in situ microbial populations.
Early efforts to characterize natural microbial populations, before the ascendance of genomics per se, took advantage of the conservative nature and high expression of ribosomal RNA (rRNA) to permit the determination of rRNA sequences directly from field samples. The ribosomal genes had been used for the first steps toward a natural classification of bacteria. Analysis of rRNA had identified a group of organisms with ribosomes that were dramatically different from typical bacteria. These organisms, the Archaea, were subsequently shown to have a complex of properties such as membrane composition, mechanisms of transcription and translation, and biochemistry that showed they were a natural group, and as distinct from Bacteria as are eukaryotes (Woese 1994). Thus, analysis of rRNA was shown to have predictive value that was entirely independent of (and complementary to) classical, culture-based methods. It was a logical—if not entirely obvious—next step to apply ribosomal gene sequencing to the study of microorganisms as they occur in nature (Pace 1996).
Analyses of natural microbial populations using environmental molecular methods are naturally linked to genomics, because of the need to determine large numbers of sequences and because of the early availability of numerous microbial genomes. In the 1990s, environmental molecular studies quickly revealed that many microorganisms had not yet been characterized, and that in some communities only a small fraction of the microbial diversity was known (Pace 1996). For example, in the cyanobacterial mat community of Octopus Spring, Yellowstone National Park (a moderately hyperthermal environment), analysis of environmentally isolated ribosomal DNA (rDNA) sequences identified tens of sequences representing organisms that had not been previously characterized, despite years of study at the site by expert microbiologists using traditional techniques (Ward et al. 1998). A number of these organisms seem to have been entirely unknown until they were identified using molecular methods. I became involved when work on another site in Yellowstone yielded a large number of sequences from putative unknown Archaea. Among these sequences were two that have been treated as representatives of a novel kingdom, the Korarchaeota (Barns et al. 1996), although this conclusion remains controversial (even among the coauthors of the Barns et al. study) because it depends on the specific phylogenetic placement of these organisms based only on analysis of a single gene sequence. However, in such studies investigation should not end with the identification of an interesting sequence. Once a sequence representing a novel organism has been detected, it is possible to use in situ hybridization and guided culture techniques to obtain information about the biology of the organism (Amann et al. 1995, Rappe et al. 2002), linking the molecular data to traditional microbiological methods.
Environmental molecular studies have also found unexpected microbial diversity in the open ocean, in fresh waters, in mud, and elsewhere. The number of microorganisms yet to be discovered is unknown, but the best estimates suggest there is a huge diversity of organisms that have not yet been cultured in the laboratory (Dawson and Pace 2002). To what extent these organisms might prove to be important in fields outside microbial ecology is a matter of speculation, but their potential significance should not be underestimated. Thermus aquaticus, another microorganism that lives in Yellowstone's hot springs, is the source of Taq polymerase, the engine behind the polymerase chain reaction.
More recently, microbial ecology and genomics have become very closely intertwined. One promising approach is to sequence several members of a microbial community simultaneously. The idea is that because some microorganisms cannot be cultured individually, and because bacterial and archaeal genomes tend to be much smaller than eukaryotic genomes, it should be possible to determine several bacterial or archaeal genomes simultaneously with no more effort than determining a single eukaryotic genome. The set of genomes that comprise a microbial community could be sequenced in a single effort, and the result would be a data set with rich information about diversity and population-level variation. This would be difficult to do with map-based sequencing, but the random-clone strategy seems ideally suited to the purpose (Fraser et al. 2000). In practice, however, complicating factors such as variations in population sizes, which can lead to discrepancies in copy numbers among the genomes, mean that studies of this type are not trivial. Only a few such studies have yet been completed (Tyson et al. 2004, Venter et al. 2004). Nonetheless, high-throughput sequence analysis is permitting the evaluation of microbial communities with a level of sensitivity and detail that would previously have been unthinkable (Beja et al. 2002).
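The effect of unequal population sizes can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that random-clone reads are drawn in proportion to each taxon's share of the total DNA in the sample, so that expected coverage of a genome is simply the number of sequenced bases assigned to it divided by its genome size. The function name, the taxon labels, and all of the numbers are invented for illustration and do not come from any of the studies cited here.

```python
def expected_coverage(total_reads, read_length, abundances, genome_sizes):
    """Expected fold-coverage of each genome in a mixed community.

    abundances   -- relative cell counts per taxon (any consistent unit)
    genome_sizes -- genome size per taxon, in base pairs

    Assumption (illustrative only): reads sample the pooled DNA uniformly
    at random, so a taxon's share of reads equals its share of total DNA.
    """
    dna = {t: abundances[t] * genome_sizes[t] for t in abundances}
    total_dna = sum(dna.values())
    total_bases = total_reads * read_length
    return {
        t: total_bases * (dna[t] / total_dna) / genome_sizes[t]
        for t in abundances
    }


if __name__ == "__main__":
    # Hypothetical community: one dominant and one rare taxon,
    # each with a 2-megabase genome.
    coverage = expected_coverage(
        total_reads=100_000,
        read_length=500,
        abundances={"dominant": 99, "rare": 1},
        genome_sizes={"dominant": 2_000_000, "rare": 2_000_000},
    )
    for taxon, fold in coverage.items():
        print(f"{taxon}: {fold:.2f}x")
```

Under these assumptions the dominant genome is sequenced many times over while the rare one is barely sampled, which is why a sequencing effort sized for the total amount of DNA can still leave minority members of the community far from complete.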
An ideal candidate for community sequence analysis would be the hot-spring community that includes pJP27, a representative of the “Korarchaeota” that were identified by Barns and colleagues (1996). If these organisms really do represent a fundamentally distinct archaean lineage and an outgroup to the rest of the Archaea, then the analysis of their genome could be very informative. Communities that include pJP27 have been grown in the laboratory, but attempts at obtaining pure cultures have been unsuccessful (Burggraf et al. 1997). Thus, the most practical way of obtaining genome-level data about these organisms may be to determine the set of genomes that corresponds to the smallest culturable community that includes them. A somewhat similar problem was solved for Nanoarchaeum equitans, which, like pJP27, is a thermophilic archaeon that was first identified as an outlying archaeal rRNA sequence (Huber et al. 2002). Also like pJP27, Nanoarchaeum cannot be grown in pure culture. However, unlike pJP27 cells, Nanoarchaeum cells could be mechanically isolated, and consequently it was possible to determine the genome sequence for this species from isolated DNA. The Nanoarchaeum genome is extremely small (491 kilobase pairs), and its analysis supported placement of the organism as an outgroup to the rest of the Archaea, including even the Korarchaeota (Waters et al. 2003).
The community sequencing approach has attracted the participation of major players in the field of genomics. Most recently, a large-scale environmental sequencing project determined the sequence of more than 1 × 10⁹ base pairs of DNA from filtered water out of the Sargasso Sea (Venter et al. 2004). These data yielded information from roughly 1800 distinct species, including 148 rRNA sequences that are at least 3% different from any sequence in the ribosomal database project database. Such a level of divergence in ribosomal genes (3%) is a widely used rule of thumb to distinguish bacterial “species,” although the definition of species in bacteria remains controversial (Oren 2004), and lesser degrees of divergence can characterize species in other organisms; for example, humans and chimpanzees differ by about 1.2% (Clark et al. 2003). Also among these data were well over 700 distinct rhodopsin-like sequences. Rhodopsins are light-reactive proton pumps that are related to the light-sensing proteins in animal eyes, but that in archaea can use light to generate ATP (adenosine triphosphate) in a process that is entirely distinct from photosynthesis as it occurs in plants and bacteria. Environmental sampling has revealed that these proteins are found in bacteria as well, and that they are far more diverse and widely distributed than had been recognized (de la Torre et al. 2003), with the implication that a major form of primary productivity has been missed entirely. Future work will be needed to determine the absolute abundance of these organisms and to estimate their productivity and ecological significance, although it should be noted that absorption spectra indicate that the relative abundance of rhodopsin is low.
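The 3% rule of thumb can be expressed as a simple computation. The following Python sketch groups rRNA sequences into provisional "species" by placing each sequence in the first group whose founding sequence it differs from by less than the threshold. It assumes the sequences have already been aligned to a common length, with gaps written as `-`; the function names and the greedy grouping strategy are illustrative choices for this example and are not the procedure used in the Sargasso Sea study.

```python
def percent_divergence(a, b):
    """Percent of mismatched positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def cluster_by_divergence(seqs, threshold=3.0):
    """Greedy grouping: join the first group whose founder is < threshold % away.

    seqs is a dict mapping a sequence name to its aligned sequence.
    Returns a list of groups, each a list of names (founder first).
    """
    groups = []
    for name, seq in seqs.items():
        for group in groups:
            founder = seqs[group[0]]
            if percent_divergence(seq, founder) < threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    ref = "ACGT" * 25                  # 100-base toy "rRNA" sequence
    close = "TT" + ref[2:]             # 2 differences from ref
    distant = "T" * 10 + ref[10:]      # 8 differences from ref
    print(cluster_by_divergence({"ref": ref, "close": close, "distant": distant}))
```

A greedy scheme of this kind depends on the order in which sequences are examined, which is one small reflection of why a fixed percentage cutoff is better regarded as a convention than as a definition of species.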
Another major influence of genomics on microbial ecology has been the development of techniques that can rapidly probe the composition of microbial communities. Microarray analyses, for example, have been used to assay microbial populations. Several specific strategies have been attempted, but in general the microarray is designed to contain sequences that are either specific to the organisms of interest or else variable among them (Greene and Voordouw 2003). The array can then be probed with DNA or RNA extracted directly from the environment, making it possible to examine spatial and temporal variation in microbial diversity, much as the more familiar applications of microarray techniques permit study of patterns of gene expression. This approach seems to have found its first widespread use in environmental monitoring, but one could imagine numerous other potential applications to microbial ecology. With the reemergence of biological weapons, techniques for the rapid assessment of microbial communities have renewed importance.
Although the majority of environmental molecular studies have concentrated on bacteria and archaea, eukaryotic microorganisms have also received some attention. The Río Tinto in Spain, a river that is naturally extremely acidic and laden with heavy metals, has a high biomass of eukaryotic microorganisms. Environmental molecular analysis has indicated great eukaryotic diversity within the Río Tinto, and the environmental molecular data have complemented microscopic analysis of the same environment (Amaral Zettler et al. 2002). Similarly, studies of marine plankton revealed a substantial number of sequences that were quite distinct from any that had previously been determined (López-García et al. 2001, Moon–van der Staay et al. 2001). It is not clear, however, whether these sequences represent the discovery of organisms that are genuinely new to science, or simply the determination of molecular data for organisms that were previously known only by classical methods.
Because eukaryotes are typically much larger than bacteria or archaea but occur at lower population densities, many unicellular eukaryotes have been described microscopically, but have not been studied with molecular methods. Thus, some of the sequences determined from marine waters will certainly prove to be from previously described organisms. One such organism is Myrionecta rubra, a ciliate that acquires plastids from algae and retains them for long periods of time; its rDNA sequence has recently been shown to be a very close match to certain environmental sequences (Johnson et al. 2004). Nonetheless, at least some of the environmental sequences probably do represent entirely unknown eukaryotes. For example, Ostreococcus is a marine planktonic green alga so minute that it entirely escaped detection until 1995, despite its global distribution and relatively high abundance (Chrétiennot-Dinet et al. 1995). Ostreococcus was first detected by flow cytometry, but much of the information about its distribution and environmental role comes from molecular studies (Venter et al. 2004). The genomic sequences of two strains of Ostreococcus have now been determined and should be published shortly. It is impressive that less than a decade has passed from the discovery of the organism to the determination of its genome sequence, and noteworthy that a primary reason for choosing Ostreococcus for sequencing was its environmental significance. It will probably not be long before the first knowledge of an organism comes from its genome sequence (Tyson et al. 2004).
Inferring evolutionary history from genomic analysis
The genomic palimpsest carries information not only about what kinds of organisms exist but also about how they are related. Genomics has interacted with systematics—the study of how organisms are related—to modify existing areas of study and to create new ones. It has become apparent that the study of genomes is inherently a comparative exercise. When the sequence of the Haemophilus influenzae genome was determined, it demonstrated the practicality of the random-clone approach to genome sequencing and provided an exhaustive catalog of the gene content for this species (Fleischmann et al. 1995). But when the genome of Mycoplasma genitalium was completed a few months later, an entirely different kind of analysis became possible; the genomes of these two unrelated pathogens could be compared, and their independent adaptations to life on an animal host provided insights into the evolution of pathogenesis that would have been impossible to gain by studying a single genome (Fraser et al. 1995).
Comparative analysis is such a powerful tool that it is accepted as de rigueur in genome analysis (Martin et al. 2002, Skaletsky et al. 2003). The database search–pairwise alignment algorithm BLAST (basic local alignment search tool) has been widely used to infer homology, which in turn is used to infer function. In the absence of biochemical data, the best estimate of a gene's function is usually based on the function of related sequences for which biochemical data are available. In practice, such inferences are often based on sequence similarity that is two or more steps removed from biochemical data, because such data are available for only a tiny fraction of the putative gene products whose DNA sequence is now known. The assumption that underlies these inferences is that homologous genes, because they are derived from a common ancestor, are likely to share that ancestral function. Of course, this is not necessarily the case; evolutionary forces constantly alter gene function, so knowledge of homology simply provides a first guess as to function. Methods of inferring gene identity and function that do not rely on homology are in development, but at present, methods based on homology predominate (Eisen and Wu 2002).
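The kind of similarity on which such inferences rest can be made concrete with a small example. BLAST itself is a fast heuristic; the sketch below is not BLAST, but the exact dynamic-programming local alignment (the Smith-Waterman approach) that heuristics of this type approximate. The scoring values for matches, mismatches, and gaps are arbitrary assumptions chosen for illustration, and the function name is hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty.
    The scoring parameters are illustrative, not tuned values.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                        # start a new local alignment
                prev[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                prev[j] + gap,             # gap in b
                curr[j - 1] + gap,         # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A shared core ("ACGT") embedded in otherwise unrelated flanks.
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A high score indicates a region of strong similarity, from which homology may be inferred; whether the homologs also share a function is a separate question, as the caveats above make clear.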
A broader comparative effort that draws heavily on lessons from genomics is the Tree of Life Project, which aims to determine a reliable phylogenetic tree for at least a million organisms (Pennisi 2003). Phylogenetic analysis of molecular sequences is a powerful way of inferring evolutionary history. Whereas BLAST examines patterns of similarity that can be used to infer homology, phylogenetic analysis starts with the assumption that select sequences are homologous and examines patterns among these sequences to infer the evolutionary history of those sequences. To develop a robust phylogeny of a million organisms, the Tree of Life Project will require, at a minimum, the phylogenetic analysis of sequences amounting to some tens of thousands of nucleotides from each of those organisms, as well as the incorporation of other data types and the development of novel methods of data analysis and presentation. In terms of the scale of data collection and analysis, such a project is comparable to sequencing the human genome, although it comes with its own unique difficulties, not the least of which is identifying and isolating those million representative organisms. One can think of the Tree of Life Project as a vertically integrated genome project; rather than aiming to determine all of the nucleotide sequences from a single genome, it aims to determine a single subset of sequences from representatives of all organisms. It will also aid in (and be aided by) the analysis of complete genomes. The comparative analysis of genomes is a key tool for understanding their function, and knowledge of relationships among genomes will make such analyses more robust. Once again, this interaction between genomics and systematics expressed itself first in microbiology.
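As a minimal illustration of how a tree can be inferred from sequences that are assumed to be homologous, the following Python sketch computes pairwise distances (the proportion of differing sites) and joins the closest groups step by step, a simple clustering procedure known as UPGMA. This is one of the simplest distance methods and implicitly assumes a roughly constant rate of change; it is offered only as a sketch and is not the method of the Tree of Life Project. The sequences and names are invented, and the functions assume a pre-aligned, equal-length input.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Build a tree (nested tuples of names) by average-linkage clustering.

    seqs is a dict mapping a taxon name to its aligned sequence.
    """
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> leaf count
    dist = {
        (i, j): p_distance(seqs[names[i]], seqs[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)

    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        new_size = sizes[i] + sizes[j]
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / new_size
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        trees[next_id] = (trees[i], trees[j])
        sizes[next_id] = new_size
        del trees[i], trees[j], sizes[i], sizes[j]
        next_id += 1

    return next(iter(trees.values()))


if __name__ == "__main__":
    toy = {
        "A": "AAAAAAAA",
        "B": "AAAAAAAT",
        "C": "TTTTAAAA",
        "D": "TTTTAATT",
    }
    print(upgma(toy))
```

The pairwise comparison step grows with the square of the number of taxa, which gives some sense of why a tree of a million organisms calls for new methods of analysis as well as new data.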
The genomic palimpsest can also be read to infer processes, even those from the ancient past. Among the most striking observations to emerge from comparative analysis of microbial genomes was the surprising degree of gene transfer that seems to have shaped these genomes (Lawrence and Ochman 1997). Before the advent of genomics, there were just a handful of well-documented cases of gene transfer among distantly related organisms, and even the strongest cases were controversial (Cummings 1994, Delwiche and Palmer 1996). After a substantial number of microbial genomes became available, comparative analysis revealed several cases that were almost certainly the result of horizontal gene transfer. It is, however, inherently difficult to distinguish gene transfer from some other phenomena, particularly lineage sorting of paralogous genes (Delwiche and Palmer 1996, Lawrence and Ochman 2002), and a number of claims of gene transfer have not held up to scrutiny. It is now widely accepted that gene transfer has had a substantial impact on bacterial and archaeal evolution, perhaps even to the extent of challenging the concept of a phylogenetic “tree” (Doolittle 1999, Gogarten et al. 2002). It is much less clear what the evolutionary impact of gene transfer has been in eukaryotes.
Intraorganismal gene transfer and the plastids of dinoflagellates
One case in which gene transfer has clearly had a major impact on eukaryotic evolution is the interaction between host and symbiont genomes in the evolution of the endosymbiotic organelles (mitochondria and plastids). Interactions among organisms occur throughout nature, with particularly intimate and mutually obligate relationships referred to as symbioses. In some cases, such associations can be so close that substantial changes occur in the genomes of both partners (Wernegreen 2004). The most spectacularly modified symbiont genomes are those of plastids and mitochondria, where the majority of the endosymbiont genome has been lost, much of it to the nuclear genome. Most eukaryotes have mitochondria, and even those that lack them may once have possessed them (Tovar et al. 2003). In most cases the mitochondrial genome is very small, and in animals and fungi it undergoes extremely rapid sequence evolution. Plastids are less widely distributed among eukaryotes, and have relatively large and (in most cases) slowly evolving genomes, so they present an excellent opportunity for the study of organellar evolution (Delwiche et al. 2004). Plastids are most familiar as chloroplasts, the photosynthetic organelle in plants, but they are also found in all other photosynthetic eukaryotes (i.e., algae), where they display various characteristic pigmentations. The ancestors of plastids were free-living cyanobacteria that took up permanent and obligate residence in a eukaryotic host cell (figure 2). The extent to which the genome of each type of plastid has been modified or lost depends on the algal group, but none has been entirely unchanged.
In the case of dinoflagellates, the matter is a bit more complex, because they acquired their plastids indirectly by preying on another eukaryote. Dinoflagellates are common flagellates that account for an important part of the plankton in both freshwater and marine environments worldwide. About half of all dinoflagellates are photosynthetic. The majority of these are pigmented with peridinin, a modified carotenoid, and are thought to have acquired their plastids from red algae (perhaps indirectly), but some species are known to retain plastids that are clearly derived from several other algal groups (Delwiche et al. 2004). To understand the evolution of dinoflagellate genomes, one must consider at least five once-independent genomes: the nuclear and mitochondrial genomes of the dinoflagellate, and the chloroplast, nuclear, and mitochondrial genomes of the donor alga. In typical (peridinin-containing) dinoflagellates, the plastid genome has undergone drastic reduction, with the only known remnant within the plastid itself being a set of about a dozen genes that are encoded solely on single-gene minicircles (figure 3; Zhang et al. 1999, Laatsch et al. 2004). The vast majority of plastid-associated genes appear to now be encoded in the nuclear genome (Bachvaroff et al. 2004), but the fate of the nuclear and mitochondrial genomes of the eukaryotic endosymbiont remains essentially unknown.
My laboratory has been making use of moderately high-throughput EST sequencing to investigate the evolutionary history of the dinoflagellate plastid and to study how the nuclear and plastid genomes interact. ESTs are single-read sequences from arbitrarily selected cDNA clones; they are often used to determine what genes are expressed in a particular type of tissue or under a given set of environmental conditions. Although expressed sequences do not account for all potentially important components of a genome, they do provide a rapid way to get information about protein-coding sequences, particularly those that are highly expressed. Dinoflagellate nuclear genomes can be very large (as much as 100-fold larger than the human genome; Rizzo 1987), so complete sequencing remains a task for the future. Instead, my colleagues and I (Bachvaroff et al. 2004) chose to sample the nuclear genome by constructing cDNA libraries from cells that were selected to be in various different expression states. From two such libraries, we sequenced about 5000 ESTs (Bachvaroff et al. 2004). Because of the highly reduced nature of the dinoflagellate plastid genome (Zhang et al. 1999), it had been suggested that some of the genes that are encoded in the plastid genome in most organisms had undergone transfer to the nuclear genome, and we wanted to examine the nuclear genome for such sequences. In the absence of data from a large fraction of the nuclear genome, examining EST sequences was our best source of information.
Our work on the nuclear genome supported the hypothesis that minicircles constitute the entire plastid genome and indicated that the vast majority of plastid-expressed proteins are encoded by genes located in the nucleus. Out of a total of 4899 individual ESTs from two different dinoflagellates, we identified 118 unique sequences that were evolutionarily derived from the plastid (Bachvaroff et al. 2004). These are probably genuine nuclear-encoded genes, because they all have poly-A tails (a characteristic eukaryotic post-transcriptional modification that is not found in the plastid). In addition, many of them are encoded in multigene families (which are rare in plastid genomes), or have 5′ extensions resembling the targeting peptides that direct proteins into sub-cellular compartments such as the plastid, or both. Thirty of these putative nuclear-encoded, plastid-expressed genes are in the plastid genome of the red alga Porphyra and consequently are likely to have been transferred directly from the red algal plastid genome to the nucleus during dinoflagellate evolution. Eight are in the plastid genome of all photosynthetic eukaryotes other than dinoflagellates. Because only about a dozen protein-coding genes have been identified from single-gene minicircles, and because many of the same minicircle genes have been found by several different labs working independently, it seems plausible that these represent the majority of minicircle genes (although it is certainly possible that they represent only the most abundant class). Consequently, it has been widely assumed that there must be a second, unsampled source of plastid genes. Our data indicate that this second source is the nuclear genome.
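One of the criteria mentioned above, the presence of a poly-A tail, is simple enough to illustrate in a few lines. The following is only a minimal sketch and not the procedure used by Bachvaroff and colleagues: the function name, the example reads, and the threshold of 12 consecutive adenines are all assumptions chosen for illustration.

```python
def has_poly_a_tail(read: str, min_run: int = 12) -> bool:
    """Return True if the read ends in at least `min_run` consecutive A's.

    `min_run` is an arbitrary illustrative threshold, not a published value.
    Trailing N's (ambiguous base calls) are ignored before counting.
    """
    seq = read.upper().rstrip("N")
    run_length = len(seq) - len(seq.rstrip("A"))
    return run_length >= min_run


# Hypothetical EST reads, invented for this example.
example_reads = {
    "est_001": "ATGGCTTCCGGTACC" + "A" * 18,
    "est_002": "ATGGCTTCCGGTACCAAA",
}

for name, read in example_reads.items():
    print(name, has_poly_a_tail(read))
```

A screen of this kind can only flag candidates. As the text notes, the argument for nuclear encoding also rests on multigene families and on 5′ extensions that resemble targeting peptides, neither of which this sketch examines.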
One category of interest would be genes that were transferred from the nuclear genome of the red algal plastid donor to the nuclear genome of the dinoflagellate host, rather than from the plastid to the nuclear genome. This is a relatively difficult inference to make, because one cannot be certain of the genome content of the red algal ancestor. Although dinoflagellate diversity greatly expanded during the Jurassic period, fossils of unequivocal dinoflagellate affinity extend back at least to the early Triassic (Tappan 1980), and biogeochemical evidence suggests that the group may be substantially older. It is difficult to be certain when dinoflagellates acquired plastids, but it is very likely that the Triassic radiation involved photosynthetic forms. However, the red algae are a much older group, and their diversification greatly predates the Triassic (Tappan 1980). On this basis, one can infer that the features of genome architecture that are shared among all red algae were probably present in the ancestor that donated plastids to dinoflagellates. Therefore, dinoflagellate genes of red algal origin whose counterparts are encoded in the nuclear genome of red algae probably participated in nucleus-to-nucleus transfer, and several such genes have been identified in our data. Such observations make it possible to infer events that took place hundreds of millions of years ago and can provide information about the interaction between the genome, its adaptation to the environment, and its evolutionary history.
While the impact of genomics on evolution and ecology has been felt first and most strongly in microbiology, it is also moving rapidly into eukaryotic systems. With the availability of genome-scale data from several individual humans, from chimpanzees, and soon from other great apes, studies of human population biology and evolutionary history now have a richness of data that was previously unimaginable (Clark et al. 2003). The human genome has been (mostly) sequenced at least twice: once by the publicly funded Human Genome Project, which sequenced clones that were carefully selected on the basis of map information, and once by the private Celera Corporation, which sequenced a much larger number of randomly selected clones, relying on oversampling and powerful contig assembly algorithms to order the overlapping clones (Mount 2001). One interesting advantage of the latter approach is that, as in environmental sampling of microbial populations, DNA from more than one individual can be incorporated into the source library. This makes it possible to identify locations where those individuals' genomes differ as a part of the overall sequencing process. Most differences are single-nucleotide polymorphisms, or SNPs, which are a powerful source of data for population biology (Tishkoff et al. 2001).
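The idea of finding SNPs as a by-product of sequencing several individuals can be sketched in a toy form. The example below assumes that the sequences have already been aligned to a common coordinate system and are of equal length; the function name and the sequences are hypothetical, and a real pipeline would also have to weigh base-call quality and depth of coverage, which this sketch ignores.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_snps(aligned: Dict[str, str]) -> List[Tuple[int, Dict[str, int]]]:
    """Report alignment columns at which individuals carry different bases.

    `aligned` maps an individual's name to its pre-aligned sequence.
    Positions are 0-based. Gaps and ambiguous calls are skipped.
    """
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    snps = []
    for pos in range(length):
        # Count only unambiguous nucleotides in this column.
        counts = Counter(
            s[pos].upper() for s in seqs if s[pos].upper() in "ACGT"
        )
        if len(counts) > 1:
            snps.append((pos, dict(counts)))
    return snps


# Invented sequences from three hypothetical individuals.
individuals = {
    "individual_1": "ACGTACGT",
    "individual_2": "ACGTACGT",
    "individual_3": "ACATACGT",
}
print(find_snps(individuals))
```

The allele counts returned for each variable position are the raw material for the population-level analyses mentioned above, such as estimates of haplotype diversity.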
The human genome, which was selected for sequencing because of human beings' intrinsic interest in ourselves and because of its potential biomedical applications, is by far the largest genome determined to date. It is also probably more typical of eukaryotic genomes than are the model system genomes such as Drosophila, Caenorhabditis, yeast, or fugu, each of which was selected as a model system because of its short generation time and small genome. In addition, there is a very large database of human genetic disease and Mendelian inheritance. Unlike other organisms, humans self-report on illness. This greatly facilitates detection and characterization of phenotypically important genetic variation. Thus, humans present a rich source of data with fascinating prospects for basic biology (attended, of course, by difficult problems in bioethics). Remarkable comparative databases are also becoming available for yeasts (Dujon et al. 2004) and apicomplexan parasites (Abrahamsen et al. 2004), with Drosophila and others not far behind.
A common thread through all of genomics is the need to use the tools of bioinformatics to process and interpret data. The lens through which researchers read the genomic palimpsest is a computer screen. This means that as genomic approaches spread through other fields, there will be a growing need for appropriate computational tools and mathematical theory. Mathematical biology seems poised for rapid ascendancy, and population biology ranks high among the areas that seem ripe for expansion. Not only will new forms of data become available for population biology, as described above, but there is also reason to think that functional genomics would benefit from the application of ecological theory. Because individual loci within a genome resemble individuals within a population, methods from population biology can be used to model the change in genomes over time. In addition, the analysis of expression data from microarray hybridization involves complex networks of interactions with a high degree of stochastic variation. The resemblance to ecosystem modeling is striking, and collaboration between cellular physiologists and community ecologists would probably prove fruitful.
Genomic methods are not a panacea. Important questions will remain intractable, and a few fields of inquiry may be relatively little affected. (One might wonder how genomics will affect paleontology, for example, given that the most one can hope for is a tiny fragment of DNA isolated from a relatively recent and well-preserved fossil. However, because systematics and developmental biology are key to the interpretation of fossils, and because these fields are undergoing profound change in response to the advent of genomics, even paleontology will be rocked by the genomic wave.) Another legitimate concern is how the spread of large-scale studies that take advantage of the efficiencies of scale will affect the tradition of single-investigator science. There is probably no direct answer to this problem, other than to note that the change toward large-scale biology seems to be inevitable, and that to resist change is to risk stagnation.
We live, then, in interesting times. Genomic data sources and methods of analysis make it possible to find robust answers to questions that seemed intractable just a few years ago. This new knowledge has profound implications for the training of future biologists. Until recently, it seemed that students in conservation biology could safely ignore genomics, and that students of biochemistry could safely ignore population biology. This impression may have always been illusory, but the interdependence of such traditionally uncoupled disciplines is now unmistakable. Many of the biological sciences that have neglected mathematics and computer science will need to draw on their power, and some aspects of the standard life sciences curriculum will no doubt need to be abandoned to make room for training in the tools of modern biology. What will be achieved by scientists with interdisciplinary training and remarkably powerful tools can only be imagined. I look forward to being astonished.
The dinoflagellate plastid work described here is largely the dissertation work of Tsvetan R. Bachvaroff and M. Virginia Sanchez Puerta; Eliot Herman and Elizabeth Gantt are collaborators on that project. Other members of my laboratory, particularly John David Hall, kindly allowed me to distract them with questions and musings. Matthew D. Johnson made available prepublication data on Myrionecta. I apologize to those colleagues whose excellent work could not be cited because of limitations on space. Several scholars at the Library of Congress, the Walters Art Museum, and the Beinecke Rare Book and Manuscript Library patiently discussed palimpsests with an evolutionary biologist. John-Aaron Blanchette courteously feigned interest. This work was supported in part by NSF grants MCB-9984284 and DEB-9978117.
- MS Abrahamsen et al. 2004. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science. 304: 441-445.
- RI Amann, W Ludwig, KH Schleifer. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiological Reviews. 59: 143-169.
- LA Amaral Zettler, F Gómez, E Zettler, BG Keenan, R Amils, ML Sogin. 2002. Microbiology: Eukaryotic diversity in Spain's River of Fire. Nature. 417: 137.
- TR Bachvaroff, GT Concepcion, CR Rogers, EM Herman, CF Delwiche. 2004. Dinoflagellate expressed sequence tag data indicate massive transfer of genes to the nuclear genome. Protist. 155: 65-78.
- SM Barns, CF Delwiche, JD Palmer, NR Pace. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proceedings of the National Academy of Sciences. 93: 9188-9193.
- O Beja, MT Suzuki, JF Heidelberg, WC Nelson, CM Preston, T Hamada, JA Eisen, CM Fraser, EF DeLong. 2002. Unsuspected diversity among marine aerobic anoxygenic phototrophs. Nature. 415: 630-633.
- S Burggraf, P Heyder, N Eis. 1997. A pivotal Archaea group. Nature. 385: 780.
- MJ Chrétiennot-Dinet, C Courties, A Vaquer, J Neveux, H Claustre, J Lautier, MC Machado. 1995. A new marine picoeucaryote: Ostreococcus tauri gen. et sp. nov. (Chlorophyta, Prasinophyceae). Phycologia. 34: 285-292.
- AG Clark et al. 2003. Inferring nonneutral evolution from human–chimp–mouse orthologous gene trios. Science. 302: 1960-1963.
- MP Cummings. 1994. Transmission patterns of eukaryotic transposable elements: Arguments for and against horizontal transfer. Trends in Ecology and Evolution. 9: 141-145.
- SC Dawson, NR Pace. 2002. Novel kingdom-level eukaryotic diversity in anoxic environments. Proceedings of the National Academy of Sciences. 99: 8324-8329.
- JR de la Torre, LM Christianson, O Beja, MT Suzuki, DM Karl, J Heidelberg, EF DeLong. 2003. Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proceedings of the National Academy of Sciences. 100: 12830-12835.
- EE DeLong, NR Pace. 2001. Environmental diversity of bacteria and archaea. Systematic Biology. 50: 470-478.
- CF Delwiche, JD Palmer. 1996. Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Molecular Biology and Evolution. 13: 873-882.
- CF Delwiche, RA Andersen, D Bhattacharya, BD Mishler, RM McCourt. 2004. Algal evolution and the early radiation of green plants. Pages 121-137 in Cracraft J, Donoghue MJ, eds. The Tree of Life. London: Oxford University Press.
- WF Doolittle. 1999. Phylogenetic classification and the universal tree. Science. 284: 2124-2128.
- B Dujon et al. 2004. Genome evolution in yeasts. Nature. 430: 35-44.
- JA Eisen, M Wu. 2002. Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theoretical Population Biology. 61: 481-487.
- RD Fleischmann et al. 1995. Whole genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 269: 496-512.
- CM Fraser et al. 1995. The minimal gene complement of Mycoplasma genitalium. Science. 270: 397-403.
- CM Fraser, JA Eisen, SL Salzberg. 2000. Microbial genome sequencing. Nature. 406: 799-803.
- JP Gogarten, WF Doolittle, JG Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Molecular Biology and Evolution. 19: 2226-2238.
- EA Greene, G Voordouw. 2003. Analysis of environmental microbial communities by reverse sample genome probing. Journal of Microbiological Methods. 53: 211-219.
- H Huber, MJ Hohn, R Rachel, T Fuchs, VC Wimmer, KO Stetter. 2002. A new phylum of Archaea represented by a nanosized hyperthermophilic symbiont. Nature. 417: 63-67.
- MD Johnson, T Tengs, CF Delwiche, D Oldach, DK Stoecker. 2004. Highly divergent SSU rRNA genes found in marine ciliates Myrionecta rubra and Mesodinium pulex. Protist. 155: 347-359.
- T Laatsch, S Zauner, B Stoebe-Maier, KV Kowallik, U-G Maier. 2004. Plastid-derived single gene minicircles of the dinoflagellate Ceratium horridum are localized in the nucleus. Molecular Biology and Evolution. 21: 1318-1322.
- JG Lawrence, H Ochman. 1997. Amelioration of bacterial genomes: Rates of change and exchange. Journal of Molecular Evolution. 44: 383-397.
- JG Lawrence. 2002. Reconciling the many faces of lateral gene transfer. Trends in Microbiology. 10: 1-4.
- P López-García, F Rodríguez-Valera, C Pedrós-Alió, D Moreira. 2001. Unexpected diversity of small eukaryotes in deep-sea Antarctic plankton. Nature. 409: 603-607.
- W Martin et al. 2002. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proceedings of the National Academy of Sciences. 99: 12246-12251.
- SY Moon–van der Staay, R De Wachter, D Vaulot. 2001. Oceanic 18S rDNA sequences from picoplankton reveal unsuspected eukaryotic diversity. Nature. 409: 607-610.
- DW Mount. 2001. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press.
- A Oren. 2004. Prokaryote diversity and taxonomy: Current status and future challenges. Philosophical Transactions: Biological Sciences. 359: 623-638.
- NR Pace. 1996. New perspective on the natural microbial world: Molecular microbial ecology. ASM News. 62: 463-470.
- E Pennisi. 2003. Modernizing the tree of life. Science. 300: 1692-1697.
- MS Rappe, SA Connon, KL Vergin, SJ Giovannoni. 2002. Cultivation of the ubiquitous SAR11 marine bacterioplankton clade. Nature. 418: 630-633.
- PJ Rizzo. 1987. Biochemistry of the dinoflagellate nucleus. Pages 143-173 in Taylor FJR, ed. The Biology of Dinoflagellates. Oxford (United Kingdom): Blackwell Scientific.
- BA Shailor. 1998. The Medieval Book. Toronto: University of Toronto Press.
- J Shendure, RD Mitra, C Varma, GM Church. 2004. Advanced sequencing technologies: Methods and goals. Nature Reviews Genetics. 5: 335-344.
- H Skaletsky et al. 2003. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature. 423: 825-837.
- H Tappan. 1980. The Paleobiology of Plant Protists. San Francisco: W. H. Freeman.
- SA Tishkoff et al. 2001. Haplotype diversity and linkage disequilibrium at human G6PD: Recent origin of alleles that confer malarial resistance. Science. 293: 455-462.
- J Tovar, G Léon-Avila, LB Sánchez, R Sutak, J Tachezy, M van der Giezen, M Hernández, M Müller, JM Lucocq. 2003. Mitochondrial remnant organelles of Giardia function in iron–sulphur protein maturation. Nature. 426: 172-176.
- GW Tyson et al. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 428: 37-43.
- JC Venter et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 304: 66-74.
- DM Ward, MJ Ferris, SC Nold, MM Bateson. 1998. A natural view of microbial biodiversity within hot spring cyanobacterial mat communities. Microbiology and Molecular Biology Reviews. 62: 1353-1370.
- E Waters et al. 2003. The genome of Nanoarchaeum equitans: Insights into early archaeal evolution and derived parasitism. Proceedings of the National Academy of Sciences. 100: 12984-12988.
- JJ Wernegreen. 2004. Endosymbiosis: Lessons in conflict resolution. PLoS Biology. 2: 307-311.
- CR Woese. 1994. There must be a prokaryote somewhere: Microbiology's search for itself. Microbiological Reviews. 58: 1-9.
- Z Zhang, BR Green, T Cavalier-Smith. 1999. Single gene circles in dinoflagellate chloroplast genomes. Nature. 400: 155-159.
Genomics and microbial evolution: An annotated glossary.
Alga: A photosynthetic eukaryote. Algae may be unicellular or multicellular.
Analogy: A resemblance between two structures (such as gene sequences) that reflects evolutionary convergence rather than descent from a common ancestor.
Bioinformatics (biological informatics): The application of information technologies to biological data, particularly to databases of genomic and other biological sequences.
BLAST (basic local alignment search tool): A family of algorithms designed to use fast pairwise sequence alignment to detect sequences within a database that are more similar to a query sequence than would be expected at random. For each pair of aligned sequences, BLAST reports a measure of the information content (the bit score, a measure of how long the matching region is and how strongly the two sequences resemble each other) and an estimate of the expected number of alignments of comparable quality that one would expect to observe at random.
cDNA (complementary DNA): DNA that has been reverse-transcribed from an RNA transcript. Often inserted into an artificial vector, yielding a cDNA library.
Chloroplast: The plastid of a green alga or land plant.
Contig assembly: Computational linking of overlapping DNA sequence reads to infer a single, contiguous sequence.
EST (expressed sequence tag): A single DNA sequence read from an arbitrarily selected clone in a cDNA library. Used as a rapid way to screen for genes expressed as RNA.
Gene: A region of the genome that carries heritable information. In the context of genomics, often used informally to refer specifically to protein-coding regions.
Gene finding: The application of pattern recognition to the detection of specific elements within a genome, most commonly protein-coding regions.
Homology: In modern usage, the relationship between two structures (such as gene sequences) derived from the same structure in a common ancestor. Homology is an evolutionary inference made on the basis of study of the properties of a structure; it cannot be measured directly. Homologous structures may or may not have the same function. Several special cases of homology have been described to accommodate the evolution of genomes by gene duplication and transfer (e.g., orthology, paralogy, xenology). See similarity, analogy.
Microarrays: Hybridization arrays constructed to have extremely high density within a small area. This technology permits hybridization to be performed very rapidly and at a low cost per individual hybridization reaction. Microarray techniques have been developed for DNA (Southern hybridization), RNA (northern hybridization), and protein (western hybridization).
Natural classification: A classification that reflects evolutionary history.
Orthology: In duplicated genes, the relationship between the corresponding copies of a gene from two different genomes. Although it is often assumed that orthologs must be functionally equivalent, this is not necessarily the case. See homology.
Paralogy: In duplicated genes, the relationship between copies of a gene present within a single genome. In some cases, paralogous genes may have diverged in function. See homology.
Pattern recognition: The use of algorithms to detect information-containing regions within complex data (in this context, genomic data).
Phylogenetic analysis: Computer modeling of the changes that have occurred in homologous characters during the evolution of a group of organisms. In molecular phylogenetic analysis, the characters examined are nucleotides or amino acids, so the analysis attempts to reconstruct the history of mutation.
rDNA (ribosomal DNA): Used informally to refer to the gene that encodes the ribosomal RNA.
rRNA (ribosomal RNA): Used informally to refer to the RNA component of the ribosome. See rDNA.
Scaffolding: The use of long-range map information to facilitate large-scale contig assembly.
Sequence read: The output from a single DNA sequencing reaction. The number of bases determined in a single read depends on the technology, but it is typically in the range of 500 to 1000 nucleotides.
Similarity: A quantifiable resemblance between two structures. In genomics, measurements of similarity between sequences are often used to make inferences of homology. See homology.
SNP (single-nucleotide polymorphism): A location within a genome that can be shown to have a different nucleotide in different individuals in that species.
Xenology: The relationship between two genes, one of which has undergone horizontal gene transfer. See homology.
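The two quantities described in the BLAST entry above, the bit score and the expected number of chance alignments, are linked by a simple relationship in the standard Karlin-Altschul statistics: the expected number of chance hits is the size of the search space multiplied by two raised to the negative bit score. The sketch below illustrates that relationship only. The function and parameter names are hypothetical, and it uses raw sequence lengths, whereas actual BLAST implementations apply corrections for effective length that are omitted here.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate expected number of chance alignments, E = m * n * 2**(-S').

    m is the query length, n the total database length, and S' the bit score.
    Edge-effect (effective length) corrections are deliberately left out.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative values: a 300-residue query against a 10-million-residue database.
for score in (20, 40, 60):
    print(score, expected_hits(score, 300, 10_000_000))
```

Because the bit score appears in the exponent, each additional bit halves the expected number of chance alignments, which is why modest differences in bit score correspond to large differences in statistical support.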
Charles F. Delwiche (email@example.com) is an associate professor in the Department of Cell Biology and Molecular Genetics at the University of Maryland, College Park, MD 20742. He uses molecular and computational methods to study the diversification of algae, with an emphasis on the origin of plastids and the colonization of the land by plants (the “drier algae”).