NSF Proposal - 5. New Data Acquisition: Genomics, Morphological and Sequence Data

Our goal is to build a robust phylogenetic reconstruction across multiple scales. To do this, we will
generate a comprehensive dataset for at least 51 exemplar taxa. Taxa were chosen for their postulated
phylogenetic position relative to the unresolved nodes in green plant evolution (Fig. 1, Table 1). This
dataset will incorporate characters from morphology, ultrastructure, and organellar genome and nuclear
gene sequences. We will generate, annotate and archive these data (note taxa already done: Table 1).
These data will be used to reconstruct the "deep" phylogeny of green plants, and will serve as the
backbone for concatenating "deep" analyses with many ongoing shallower analyses in green plants.

Criteria for selection of taxa. Our primary criterion for selection of our 51 taxa was their hypothesized phylogenetic position in relation to the nodes we want to resolve (Fig. 1, Table 1). We selected among the many possible exemplars on the basis of four subsidiary criteria. 1) Taxa should complement the sampling of other studies that are developing genomic resources for comparative study in green plants, including the NSF-funded Collaborative Grant on Plant and Algal BACs (D. Mandoli, Project Director), the Organelle Genome Megasequencing Program (M. W. Gray and B. F. Lang, Program Directors), and Jansen's seed plant chloroplast sequencing project (see letters from B. F. Lang and R. Jansen, collaborators). 2) Taxa should facilitate concatenation of published and ongoing studies to the backbone phylogeny that we will develop here. 3) Taxa must be easy to obtain through collection or cultivation. 4) Taxa should be important models for research in various fields. When alternatives existed within the constraints of these criteria, the organism with the smaller nuclear genome size was chosen to maintain cost-efficiency of BAC production. If, in the course of our work, a species proves to be intractable, or we find another that seems more suitable or has a smaller genome than one selected initially, we will make the appropriate replacements.

Morphological, ultrastructural, and other non-molecular data. A major component of this project is accumulation and interpretation of morphological data. Accurate detailing of anatomical, developmental and ultrastructural features is critical to all future morphological inquiry. Although earlier studies of morphological components made crucial contributions to our understanding of green plant phylogeny, until recently they tended not to be conducted systematically. Differences in methodological approach, available technologies and investigator biases made them subject to discordances. Our studies are designed to provide reliable contemporary morphological data that will correct errors, clarify ambiguity and augment information available in the literature. To use "discrete" rather than "composite" OTUs, we will detail the morphological and ultrastructural features of all exemplars that we examine at the genomic level (Fig. 1, Table 1). In this way we will build a comprehensive dataset based on temporally and methodologically consistent approaches and maximally discrete OTUs. These data will allow us to critically evaluate morphological datasets compiled from the literature, for fossil as well as living specimens, and will contribute to analyses across "deep" and "shallow" scales by maximizing our ability to interpret homologies, paralogies and convergences in the evolution of morphological characters.

We will concentrate on 1) anatomical features that can be derived from light microscope observation of living, preserved and dried material and 2) ultrastructural, developmental and physiological data that require tissue preparation and observation in the TEM, SEM, fluorescence or light microscope. We will begin with recently composed coherent character matrices, including the 132-character bryophyte set [19, 24], the 75-character matrix for spermatogenesis in land plants [25], and the 77-character matrix for pteridophytes ([105, 76]; see http://www.science.siu.edu/landplants/Morphological/MorphData.html). These characters will serve as a baseline for data collection and will be substantially modified as characters are evaluated and character states defined. A major focus will be to construct like datasets for the chlorophyte algae, which have seldom been compiled in forms comparable to those cited above [66]. Acquisition of crucial ultrastructural and morphological characters will identify potential homologies and will significantly enhance resolution of morphological data.

In addition to accumulating general information on plant morphology, we will conduct intensive studies of key structural features and processes that are common to all or most taxa. This will provide data at all fractal scales and enable global comparisons. The available data are restricted to cellular features and so we will conduct thorough studies of cell division, especially mitosis, analyze cell wall constituents and examine motile cell structure and differentiation using standard TEM, fluorescent labels and immunolabeling protocols for TEM, fluorescent and light microscopy (e.g. [106, 25]).

Genomic data. Whole organellar genomes provide two distinct sorts of data for phylogenetic inference. Gene and intron losses, inversions, and other structural changes in the genome occur infrequently and can provide powerful phylogenetic markers (e.g., [107, 108, 72]; but see [109] for example of homoplasy). Complete chloroplast and mitochondrial genome sequences will also provide two important sequence data sets. In addition to structural genomic data, we will assemble chloroplast and mitochondrial datasets from all coding regions of sufficient size and conservation to permit confident sequence alignment. The tremendous amount of organellar sequence data should permit unambiguous reconstruction of organellar phylogenies for all taxa sampled. We will also sequence a few nuclear genes that are either single-copy or from small multi-gene families which are appropriate for analysis at this scale. BAC libraries will facilitate probing for (on filter arrays) and amplification of the desired sequences (from individual BAC clones).

Four approaches will be used to obtain organellar genomes (Table 2). The order in which we will apply them reflects the relative cost per genome and the likelihood that each will work readily.

Table 2: Comparison of four methods to obtain the organellar genome data.
Method                                                       Cost
Traditional isolation of organelles                          $100-5,000
FACS to purify organellar genomes                            $850
BAC library for 100 Mb nuclear genome, 5X                    $1,923
oBAC library biased for organellar genomes, ~19x coverage    $192

We will determine the size of those nuclear genomes that have not been directly measured using flow cytometry (see Arumuganathan CV). For genomes of ~100 Mb, we will make a standard BAC library (17 taxa in Table 1) because this is relatively inexpensive and will provide us with all three genomes. Average insert size per clone in the Wing lab is 130-150 kb, and we will aim for ≥5X coverage, which is considered a minimum BAC library standard by NSF. Quality control of all libraries will be done by the Wing lab (CUGI standards). For genomes >100 Mb we have three options for obtaining the organellar genomes. Our first option will be to create an "organellar bacterial artificial chromosome" or oBAC library. During normal BAC library construction, tissue from which the cell wall has been digested is embedded in agar. Proteins, carbohydrates and organellar genomes are removed in situ to preserve intact chromosomes. Normally, a Triton-X step is included to reduce the organellar genome representation from 10-15% to 2-3% in the final BAC library. We will omit the Triton-X step, essentially capitalizing on old technology for a new purpose, and make a very small library of 384 clones that will nevertheless represent each organellar genome ~19 times. The libraries will be arrayed and probed with standard genes to identify those clones containing organellar genomes (mito: atpA, cob, atp9, cox1; chloro: ndhA, rbcL, psbA). Many clones will contain the entire organellar genome. Our oBAC and BAC procedures may reveal nuclear regions that contain organellar DNA, such as have been found in rice (Wing, unpubl.). Not only is this method cost-effective (Table 2), but it is also automated (http://www.genome.clemson.edu/), produces arrayed filters and -80°C glycerol stocks of all clones, offers the best chance of preserving fragile organelles from some of the more ancient taxa (Delwiche, pers. comm.), and will foster data and bioinformatics exchange with the genomic community via the CUGI/AZ website.
Should this protocol fail for a particular organism, we will use a Fluorescence Activated Cell Sorter (FACS) to separate mitochondria and chloroplasts from fractionated cells. Standard DNA extractions will be made from the sorted organelles. If both oBAC and FACS fail for the larger genomes, we will fall back on centrifugation protocols for organellar separation and extraction, the classical approach [110, 111]. With four alternatives, all taxa should be feasible, but we will draw from our pool of "alternate" organisms if any taxa prove intractable.
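
As a rough illustration of the library-size arithmetic behind a coverage target, the following Python sketch computes how many clones a standard BAC library would need and, using the Clarke-Carbon relation, the chance that any given locus is represented. The 140 kb mean insert (taken here as the midpoint of a 130-150 kb range) and the function names are assumptions made for illustration; they are not part of the Wing lab protocol.

```python
import math


def clones_for_coverage(genome_bp: int, insert_bp: int, coverage: float) -> int:
    """Number of clones whose combined inserts equal `coverage` genome equivalents."""
    return math.ceil(coverage * genome_bp / insert_bp)


def prob_locus_represented(genome_bp: int, insert_bp: int, n_clones: int) -> float:
    """Clarke-Carbon estimate: chance that a given locus is in at least one clone."""
    return 1.0 - (1.0 - insert_bp / genome_bp) ** n_clones


# Assumed values, for illustration only.
GENOME_BP = 100_000_000  # 100 Mb nuclear genome
INSERT_BP = 140_000      # assumed mean insert size
COVERAGE = 5             # 5X coverage target

n_clones = clones_for_coverage(GENOME_BP, INSERT_BP, COVERAGE)
p_locus = prob_locus_represented(GENOME_BP, INSERT_BP, n_clones)
print(f"clones needed: {n_clones}")
print(f"P(locus represented): {p_locus:.3f}")
```

Under these assumptions a 5X library of a 100 Mb genome requires 3,572 clones and represents any single locus with a probability of roughly 99%. The same functions can be rerun with measured genome and insert sizes once they are available.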

The purified genomes or surrogate templates will be sheared randomly into fragments of ~3 kb using a Hydroshear device, end-repaired, and gel purified. Routine quality control measures ensure that shearing produces fragments of narrow size distribution (important in the later sequence assembly phase), with 1 s.d. ≤8% of the intended fragment length. These fragments will be blunt-ligated into pUC18, transformed into E. coli DH5α, and plated onto large-format bacterial plates under conditions that allow for blue-white color selection. Colonies will be grown overnight, then processed robotically through creation of glycerol stocks, extraction and amplification of plasmid DNA by rolling circle amplification, separation for forward and reverse primer sequencing, and setup of the sequencing reactions. Sequence determination will be on 96-capillary automated sequencers. For each genome, 96 clones will be sequenced to determine purity (based on BLAST searches of sequences). Sequencing will continue until approximately 8-fold redundancy, when gaps in the gene-rich genomes should be minimal. Gap filling and sequence completion will be done by returning to archived plasmid preps, or if necessary through amplification of genomic DNA. Gap filling will be done in collaboration between JGI and Utah State University. The goal will be to achieve a total of approximately 9 Mb of final sequence data.
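
The expectation that gaps should be minimal at about 8-fold redundancy can be checked with a simplified Lander-Waterman calculation. The Python sketch below is illustrative only: the 150 kb genome size and 600 bp read length are hypothetical values not taken from this proposal, and the model ignores the minimum overlap needed to join two reads.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, redundancy: float) -> int:
    """Reads required so that total read length equals `redundancy` genome equivalents."""
    return math.ceil(redundancy * genome_bp / read_bp)


def fraction_uncovered(redundancy: float) -> float:
    """Expected fraction of bases covered by no read (Poisson approximation)."""
    return math.exp(-redundancy)


def expected_gaps(n_reads: int, redundancy: float) -> float:
    """Expected number of reads not overlapped on one side by any other read."""
    return n_reads * math.exp(-redundancy)


# Hypothetical values, for illustration only.
GENOME_BP = 150_000  # assumed organellar genome size
READ_BP = 600        # assumed usable read length
REDUNDANCY = 8       # target fold redundancy

n_reads = reads_needed(GENOME_BP, READ_BP, REDUNDANCY)
gaps = expected_gaps(n_reads, REDUNDANCY)
print(f"reads needed: {n_reads}")
print(f"fraction uncovered: {fraction_uncovered(REDUNDANCY):.5f}")
print(f"expected gaps: {gaps:.2f}")
```

With these assumed inputs, 2,000 reads are needed and fewer than one gap is expected per genome, which is consistent with finishing by targeted gap filling rather than further random sequencing.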

At both CUGI/AZ and JGI all cloning and analysis steps are tracked using bar-code readers. The data are automatically entered into a workflow database for statistical analysis of each phase of the operation. Sequencing machines automatically output their data into a UNIX-based folder system, where they are assembled into contigs. The JGI software is unique in that it uses paired-plasmid ends to guide contig assembly. Gene annotation uses both standard and custom software which has been successful for many whole genomes sequenced at JGI. All sequence data will be deposited in GenBank.

Primary sequence characters. Sequences of chloroplast genomes are complete for 24 organisms, including four green algae, Marchantia, Psilotum, and numerous seed plants. From analyses of these genomes, we infer that the best source of characters will be protein-encoding genes and genes for the 16S and 23S ribosomal RNAs. Gene content ranges from 69 protein-coding genes in Pinus to 84 in Marchantia (78 in Chlorella; 76 in Nicotiana), so gene losses are likely to be important phylogenetic markers [108]. A strategy using nucleotide sequences of 17 protein-coding chloroplast genes exhibiting low synonymous substitution rates and site-to-site rate variation has been applied successfully to studies of basal angiosperms and land plants [112]. Additional results (R. Olmstead, unpubl.) suggest that this strategy can be used successfully at much deeper phylogenetic levels in green plants. Stoebe et al. [113] analyzed 46 protein-coding genes totaling >11,500 aligned amino acid positions in a study of 9 taxa representing all chloroplast genomes then available and including non-green plant taxa. Restricting our study to green plants will enable us to use 60 genes and ~50,000 nucleotides of DNA sequence. Characters will be defined at both the nucleotide and amino acid levels and analyses will be carried out where most appropriate given alignments and levels of nucleotide sequence divergence.
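
One way to picture a combined multi-gene dataset in which some taxa lack some genes is as a concatenated matrix with missing-data padding. The Python sketch below is a minimal illustration under stated assumptions: the function name, the "?" missing-data symbol, and the toy six-base sequences are all hypothetical and are not real data or the project's actual pipeline.

```python
def concatenate_alignments(gene_alignments, missing="?"):
    """Join per-gene alignments into one matrix, padding taxa that lack a gene.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    Returns (matrix, partitions), where partitions lists (gene, start, end)
    using 1-based inclusive column positions.
    """
    taxa = sorted({taxon for aln in gene_alignments.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon without this gene receives a block of missing-data symbols.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


# Toy sequences, invented for illustration only.
toy = {
    "rbcL": {"Marchantia": "ATGTCA", "Pinus": "ATGTCT", "Chlorella": "ATGGCT"},
    "ndhA": {"Marchantia": "ATGAAT", "Chlorella": "ATGAAC"},
}
supermatrix, partitions = concatenate_alignments(toy)
for taxon, seq in supermatrix.items():
    print(f"{taxon:<12}{seq}")
print(partitions)
```

Keeping the partition boundaries alongside the matrix lets each gene be analyzed separately or assigned its own model later, and the absence of a gene from a taxon remains visible as a block of missing data rather than being silently dropped.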

Complete mitochondrial genomes have been sequenced for fewer green plants than chloroplast genomes. However, two green algae, Marchantia, and at least one seed plant have been sequenced. Green plant mitochondrial genomes also contain small and large subunit rRNA genes, but contain far fewer protein-coding sequences than do chloroplast genomes. We will conduct combined multi-gene analyses for green plants as we described for chloroplast genomes. Mitochondrial DNA substitution rates are slower than those of either chloroplast or nuclear genomes [114]. Various mtDNA genes have been used recently for deep phylogenetic studies in land plants [115, 116, 28, 117, 37].

We will sequence single-copy nuclear genes as well as some from small multi-gene families. Again, nuclear BAC libraries and filter arrays made from them will greatly ease the acquisition of sequence for phylogenetic analysis and provide genomic tools for other researchers. Working from BACs instead of whole genomic DNA enables PCR-based approaches to recover all copies of the genes, without interference from more readily amplified copies, a problem when using PCR on nuclear multi-gene families. We will focus on protein-coding genes that have been identified as useful for deep phylogeny in plants. RNA polymerase II consists of several subunits, each encoded by a separate nuclear gene. With rare exceptions, the two largest subunits (RPB1 and RPB2) are encoded by single-copy genes in all groups in which they have been studied. RNA pol II genes have been used for deep phylogenetic studies of crown eukaryotes [118], red algae [119, 120], fungi [121], and land plants (B. Hall, pers. comm.) and should help resolve our fuzzy nodes. Phytochrome genes have a good signal for seed plant and basal angiosperm phylogeny [122], where a series of duplications has yielded a clearly defined set of phytochrome genes. However, in non-seed plants, evidence suggests that there is a single gene, with some lineages having duplications (e.g., Selaginella, Psilotum) [123]. Some of these duplications are likely to mark clades once sampling is expanded.
