Original NSF Proposal

"From the genome to the tree of life"

1.	Results from Prior Support
2.	Background: Phylogenetics / Evolution
3.	Background: Genomics
4.	Theme: Research Coordination Group
5.	Examples: Research Integrating Genomics / Phylogenetics
6.	Proposed Coordination Activities
7.	Management / Coordination Mechanisms
8.	Significance

Section 3: Background on Green Plant Genomics

The long-term goal of plant genomics is to identify, isolate, and determine the function of plant genes that are associated with both vegetative and reproductive phenotypes. Most phenotypes require the coordinated activity and regulatory control of suites of genes over time and in precise positions within the plant. Until recently, the idea of establishing a comprehensive approach to isolate and characterize all the genes involved in any complex phenotype was a daunting one; however, advances in genomics, informatics, and phylogenetics have brought such a prospect to a manageable level. The nucleotide sequence of the Arabidopsis genome is nearing completion, the sequencing of rice has begun, and large amounts of expressed sequence tag (EST) information are being obtained for many other plants. There are many new opportunities to use this wealth of information to accelerate progress toward an understanding of the genetic mechanisms that control plant growth and development and responses to the biotic and abiotic environment.

Progress in green plant genomics.

One of the first eukaryotic genomes to be completely sequenced will be that of the small mustard species Arabidopsis thaliana. During the past decade, Arabidopsis has emerged as one of the most widely used model organisms for studying the biology of higher plants. Its genome was chosen for sequencing because it is highly compact, about 130 Mb, with little interspersed repetitive DNA. However, since Arabidopsis is rather distantly related to the cereal crops that provide the bulk of the world food supply, the genome of rice will also be sequenced during the next decade. Rice was chosen because, in addition to its importance as a food source for about one quarter of the human population, it has one of the most compact genomes among the cereals. It contains about 3.5 times as much DNA as Arabidopsis but only about 20% as much DNA as maize and about 3% as much DNA as wheat (Bennett and Smith, 1991). However, the genome organization of the cereals appears to be very highly conserved; rice, wheat, maize, sorghum, millet and other cereals exhibit a high degree of synteny (Gale and Devos, 1998). The differences in genome size primarily reflect the amplification of interspersed repetitive sequences (Bennetzen et al., 1998); there is no evidence that angiosperms with large amounts of DNA per cell have substantially greater numbers of functional genes than angiosperms with relatively small DNA contents. Because of extensive synteny among the cereal genomes, knowledge of gene order and organization in rice may be used to isolate and characterize the corresponding genes in the other cereals (McCouch, 1998). Thus, for instance, if a genetic locus encoding a useful trait is mapped between a pair of closely linked molecular markers in wheat, it may be possible to identify candidate genes for the rice ortholog by analyzing the rice genome sequence located between the rice orthologs of the molecular markers.
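The synteny-based search for candidate genes described above can be illustrated with a minimal sketch. It assumes that the rice orthologs of the two flanking wheat markers have already been placed on the rice genome sequence; the gene names, chromosome labels, and coordinates are hypothetical and serve only to show the interval lookup.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def candidates_between(genes, chrom, marker_a_pos, marker_b_pos):
    """Return names of genes lying wholly between two marker positions on one chromosome."""
    lo, hi = sorted((marker_a_pos, marker_b_pos))
    in_interval = [
        g for g in genes
        if g.chrom == chrom and g.start >= lo and g.end <= hi
    ]
    # Report candidates in their order along the chromosome.
    return [g.name for g in sorted(in_interval, key=lambda g: g.start)]


# Hypothetical annotated rice genes (names and coordinates are illustrative only).
rice_genes = [
    Gene("geneA", "chr1", 1_000, 2_500),
    Gene("geneB", "chr1", 4_000, 6_000),
    Gene("geneC", "chr1", 9_000, 9_800),
    Gene("geneD", "chr2", 4_500, 5_500),
]

# Hypothetical rice positions of the orthologs of the two flanking wheat markers.
candidates = candidates_between(rice_genes, "chr1", 3_000, 9_900)
print(candidates)
```

In practice the interval would be read from the rice genome annotation, and the resulting candidates would still need to be tested experimentally in the species that carries the trait.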

The sequences of Arabidopsis and rice will provide two foci from which the genome contents of other angiosperms will be extrapolated. It seems likely that as the costs of DNA sequencing continue to drop, additional genomes of economically important plants may eventually be sequenced. During the next decade, however, additional complete plant genome sequences will probably not be publicly available because of the high cost of whole-genome sequencing for any of the major crops. Instead, extensive partial cDNA sequence information will be publicly available for a majority of the genes from many important plant species, both crop and non-crop (Pennisi, 1998). There are currently more than 127,000 EST sequences from 19 plant species in public databases, and this number is expected to grow rapidly during the next several years. These sequences will provide isomorphisms between the model genomes and other species, forming a kind of transect through genome diversity in all plants that is anchored in comprehensive knowledge of the two representative species. Thus, as genes associated with functions or traits in one plant are cloned, it will usually be possible to identify the orthologs responsible for the trait in other plant species, including those in the more basal lineages, by a database search or by using the sequence information to clone the corresponding gene from the species of interest.

Although flowering plants have evolved during the past 130 million years or so, and might therefore be expected to be very similar at the genetic level, substantial morphological, developmental, and metabolic diversity exists. A major challenge to understanding the genetic basis of interspecific diversity is that, in at least some cases, minor changes in the structure or expression of a gene can lead to major changes in phenotype. Understanding the basis of this diversity is a key to understanding how to effect rational improvements in the utility of crop species. Knowledge of the genetic basis for intraspecies variation in specific traits should be useful in selecting or creating useful variation within a species. Because of the relatively recent radiation of the angiosperms, we consider it likely that there will be very few protein-encoding angiosperm genes that do not have orthologs or paralogs in Arabidopsis or rice. Therefore, understanding the genetic basis for diversity may devolve to identifying the relevant differences in the control of expression or the function of essentially the same set of genes. Indeed, it has been hypothesized that the developmental diversity of angiosperms may largely result from changes in the cis-regulatory sequences of transcriptional regulators (Doebley and Lukens, 1998).

Assigning function to genes.

One of the major efficiencies that have emerged from plant genome research to date is that about 54% of Arabidopsis genes can be assigned some degree of function by comparison to the sequences of genes of known function (EU Arabidopsis Genome Project, 1998). In effect, a universal biology has coalesced from the common language of gene and protein sequences. Unfortunately, knowing the general function frequently does not provide an insight into the specific role in the organism. For instance, on the basis of sequence analysis, about 13% of Arabidopsis genes are inferred to be involved in transcription or signal transduction. However, knowing that a gene encodes a kinase or transcription factor does not provide any useful information about what processes are controlled by that gene. Thus, the completion of the genome sequences of Arabidopsis and rice will be followed by a second phase of large-scale functional genomics in which all of the approximately 20,000 to 25,000 genes that comprise the basic angiosperm genome will be assigned function on the basis of experimental evidence. Considering that the combined efforts of the plant biology community have resulted in the direct functional analysis of only about 1000 genes to date (Rounsley, 1996), this may seem like a tall order. However, it seems likely that the efficiency gained by reverse genetics will fundamentally change this equation. Large collections of insertion mutants are available for Arabidopsis, maize, petunia, and snapdragon, and collections of insertion mutants will probably be created in several other species, including rice. These collections can be screened for an insertional inactivation of any gene by using the polymerase chain reaction (PCR) primed with oligonucleotides based on the sequences of the target gene and the insertional mutagen (Martienssen, 1998a). The presence of an insertion in the target gene is indicated by the presence of a PCR product. By multiplexing DNA samples, hundreds of thousands of lines can be screened and the corresponding mutant plants identified with relatively small effort. In addition, several groups are embarking on the sequencing of the genomic DNA flanking large numbers of insertions so that an insertion in virtually any gene can be identified by a computer search (Bouchez and Höfte, 1998). Analysis of the phenotype and other properties of the corresponding mutant will frequently provide an insight into the function of the gene.
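The saving that comes from multiplexing DNA samples can be illustrated with a minimal sketch. The text does not specify a pooling design, so the row-and-column scheme below is an assumption chosen for simplicity: lines are arranged on a grid, DNA is pooled by row and by column, and a line carrying the insertion is located at the intersection of a positive row pool and a positive column pool. The function `pcr_positive` is a hypothetical stand-in for the actual PCR assay.

```python
def make_pools(n_rows, n_cols):
    """Arrange n_rows * n_cols lines on a grid and pool DNA by row and by column."""
    row_pools = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    col_pools = [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    return row_pools, col_pools


def pcr_positive(pool, lines_with_insertion):
    """Stand-in for the PCR assay: a pool is positive if any line in it carries the insertion."""
    return any(line in lines_with_insertion for line in pool)


def locate_candidates(n_rows, n_cols, lines_with_insertion):
    """Return the lines at intersections of positive row pools and positive column pools."""
    row_pools, col_pools = make_pools(n_rows, n_cols)
    pos_rows = [r for r, pool in enumerate(row_pools)
                if pcr_positive(pool, lines_with_insertion)]
    pos_cols = [c for c, pool in enumerate(col_pools)
                if pcr_positive(pool, lines_with_insertion)]
    return [r * n_cols + c for r in pos_rows for c in pos_cols]


# Hypothetical collection of 10,000 lines in which line 4217 carries the insertion.
n_rows, n_cols = 100, 100
hits = locate_candidates(n_rows, n_cols, {4217})
reactions = n_rows + n_cols  # one PCR per pool rather than one per line
print(hits, reactions)
```

Under these assumptions, 200 reactions stand in for 10,000 individual screens. When more than one line in the collection carries an insertion in the target gene, the intersections include some lines without it, so the short list of candidates would be confirmed by testing each line individually.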

A major limitation to the analysis of gene function by mutation is that a high degree of gene duplication is apparent in Arabidopsis (EU Arabidopsis Genome Project, 1998) and is, therefore, probably a common feature of plant genomes. Because many of the gene duplications in Arabidopsis are very tightly linked, it will usually not be feasible to produce double mutants by genetic recombination. Several alternative methods have been proposed, including using homologous recombination to eliminate tandem genes simultaneously by gene replacement (Kempin et al., 1997), or a method for producing point mutations using RNA:DNA hybrids (Cole-Strauss et al., 1996). It is expected that the application of these and related methods will lead to the assignment of some degree of gene function to all genes in the Arabidopsis genome within the next decade.

Impact of gene chips and microarrays.

One of the most important experimental approaches for discovering the function of genes promises to be gene chips and microarrays. In principle, DNA sequences representing all of the genes in an organism can be placed on miniature solid supports and used as hybridization substrates to quantitate the expression of all of the genes represented in a complex mRNA sample (Schena et al., 1996). Thus, we may expect to have extensive databases of quantitative information about the degree to which each gene responds to pathogens, pests, drought, cold, salt, photoperiod, and other environmental variation. Similarly, we will have extensive information about which genes respond to changes in developmental processes such as germination and flowering, or to the phytohormones, growth regulators, safeners, herbicides, and related agrichemicals. Knowledge of which genes exhibit changes in expression in any mutant of interest will be useful for formulating hypotheses about the roles of the gene affected by the mutation (Holstege et al., 1998).

These databases of gene expression information will provide completely novel insights into the pathways of genes that control complex responses and will be a first step toward an ecology of the genome in which the genome is viewed as a whole and the relationships of gene products to each other will be considered from at least one perspective (i.e., relative level of expression). Perhaps the kinds of models that ecologists currently use for understanding the interactions in ecosystems will prove useful (Service, 1999). Indeed, since microarrays can be made for any organism for which cDNAs can be isolated, it seems likely that ecological applications will be found. It is not necessary to know the sequence of the genes on a DNA microarray beforehand; this can be determined after the arrays have been used to identify genes that may be of interest by some criterion.

The accumulation of DNA microarray or gene chip data from many different experiments will create a novel and potentially very powerful opportunity to assign functional information to genes of otherwise unknown function. The conceptual basis of the approach is that genes that contribute to the same biological process will exhibit similar patterns of expression. Thus, by clustering genes based on the similarity of their relative levels of expression in response to diverse environmental stimuli or developmental conditions, it should be possible to assign hypothetical functions to genes based on the known function of other genes in the cluster (Chu et al., 1998).
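The clustering idea above can be illustrated with a minimal sketch. It is not the method of the cited work; it assumes, for illustration only, that similarity is measured by the Pearson correlation of expression profiles and that a gene of unknown function inherits a hypothetical function from its most similar annotated gene when the correlation exceeds a chosen threshold. The gene names, expression values, and threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def guess_functions(profiles, known, threshold=0.9):
    """Assign each unannotated gene the function of its best-correlated annotated gene."""
    guesses = {}
    for gene, profile in profiles.items():
        if gene in known:
            continue
        best = max(known, key=lambda k: pearson(profile, profiles[k]))
        if pearson(profile, profiles[best]) >= threshold:
            guesses[gene] = known[best]
    return guesses


# Hypothetical relative expression levels across five treatments.
profiles = {
    "known1": [0.1, 2.0, 2.5, 0.2, 0.0],
    "known2": [1.5, 0.1, 0.0, 1.8, 2.0],
    "unk1": [0.0, 1.8, 2.7, 0.3, 0.1],
    "unk2": [0.5, 0.4, 0.6, 0.5, 0.4],
}
# Hypothetical annotations for the genes of known function.
known = {"known1": "drought response", "known2": "cold response"}

print(guess_functions(profiles, known))
```

Here the profile of unk1 tracks that of known1 closely and so receives its annotation as a hypothesis, whereas unk2 resembles neither annotated gene and is left unassigned. Any such assignment remains a hypothesis to be tested experimentally.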

In addition to their use in the study of gene function, microarrays can be used to assess single nucleotide variation, both within and between species. This has the potential to increase dramatically the numbers of gene loci screened for studies involving, for example, the analyses of breeding systems, natural levels of genetic variation, or population variation in rare plants. The impact of this on the plant systematics community will be exciting because the increased genetic resolution will open the way for addressing a wider variety of questions about natural populations.

Work with plant microarrays is just beginning, but there seems every reason to believe that this approach will soon be a standard component of the repertoire of plant biologists (Baldwin et al., 1999). The principal challenge, at present, is to develop methods for databasing and interrogating the massive amounts of data that result from this type of experiment, and phylogenetic methods have much to offer as a framework for comparison.