NSF Proposal - 5. New Data Acquisition: Genomics,
Morphological and Sequence Data
Our goal is to build a robust phylogenetic reconstruction across
multiple scales. To do this, we will
generate a comprehensive dataset for at least 51 exemplar taxa.
Taxa were chosen for their postulated
phylogenetic position relative to the unresolved nodes in green
plant evolution (Fig. 1, Table 1). This
dataset will incorporate characters from morphology, ultrastructure,
and organellar genome and nuclear
gene sequences. We will generate, annotate and archive these data
(note taxa already done: Table 1).
These data will be used to reconstruct the "deep" phylogeny
of green plants, and will serve as the
backbone for concatenating "deep" analyses with many ongoing
shallower analyses in green plants.
Criteria for selection of taxa. Our primary criterion for selection
of our 51 taxa was their hypothesized phylogenetic position in relation
to the nodes we want to resolve (Fig. 1, Table 1). We selected among
the many possible exemplars on the basis of four subsidiary criteria.
1) To complement sampling of other studies that are developing genomic
resources for comparative study in green plants, including the NSF-funded
Collaborative Grant on Plant and Algal BACs (D. Mandoli, Project
Director), the Organelle Genome Megasequencing Program (M. W. Gray
and B. F. Lang, Program Directors), and Jansen's seed plant chloroplast
sequencing project (see letters B. F. Lang & R. Jansen, collaborators).
2) To include taxa that will facilitate concatenation of published
and ongoing studies to the backbone phylogeny that we will develop
here. 3) To include taxa that are easy to obtain through collection or
cultivation. 4) To include taxa that are important models for research in various
fields. When alternatives exist within the constraints of these
criteria, the organism with the smaller nuclear genome size was
chosen to maintain cost-efficiency of BAC production. If in the
course of our work, a species proves to be intractable or we find
another one that seems even more suitable or has a smaller genome
size than one selected initially, we will make the appropriate replacements.
Morphological, ultrastructural, and other non-molecular data.
A major component of this project is accumulation and interpretation
of morphological data. Accurate detailing of anatomical, developmental
and ultrastructural features is critical to all future morphological
inquiry. Though they made crucial contributions to our understanding
of green plant phylogeny, until recently studies on morphological
components tended not to be conducted systematically. Differences
in methodological approach, available technologies and investigator
biases made them subject to discordances. Our studies are designed
to provide reliable contemporary morphological data that will correct
errors, clarify ambiguity and augment information available in the
literature. To use "discrete" rather than "composite"
OTUs, we will detail the morphological and ultrastructural features
of all exemplars that we examine at the genomic level (Fig. 1, Table
1). In this way we will build a comprehensive dataset based on temporally
and methodologically consistent approaches and maximally discrete
OTUs. These data will allow us to critically evaluate morphological
datasets compiled from the literature, for fossil as well as living
specimens, and will contribute to analyses across "deep"
and "shallow" scales by maximizing our ability to interpret
homologies, paralogies and convergences in the evolution of morphological
We will concentrate on 1) anatomical features that can be derived
from light microscope observation of living, preserved and dried
material and 2) ultrastructural, developmental and physiological
data that require tissue preparation and observation in the TEM,
SEM, fluorescence or light microscope. We will begin with recently
composed coherent character matrices, including the 132-character
bryophyte set [19, 24], the 75-character matrix for spermatogenesis
in land plants , and the 77-character matrix for pteridophytes
([105, 76]; see http://www.science.siu.edu/landplants/Morphological/MorphData.html).
These characters will serve as a baseline for data collection and
will be substantially modified as characters are evaluated and character
states defined. A major focus will be to construct like datasets
for the chlorophyte algae, which have seldom been compiled in forms
comparable to those cited above . Acquisition of crucial ultrastructural
and morphological characters will identify potential homologies
and will significantly enhance resolution of morphological data.
In addition to accumulating general information on plant morphology,
we will conduct intensive studies of key structural features and
processes that are common to all or most taxa. This will provide
data at all fractal scales and enable global comparisons. The available
data are restricted to cellular features and so we will conduct
thorough studies of cell division, especially mitosis, analyze cell
wall constituents and examine motile cell structure and differentiation
using standard TEM, fluorescent labels and immunolabeling protocols
for TEM, fluorescence and light microscopy (e.g. [106, 25]).
Genomic data. Whole organellar genomes provide two distinct sorts
of data for phylogenetic inference. Gene and intron losses, inversions,
and other structural changes in the genome occur infrequently and
can provide powerful phylogenetic markers (e.g., [107, 108, 72];
but see  for example of homoplasy). Complete chloroplast and
mitochondrial genome sequences will also provide two important sequence
data sets. In addition to structural genomic data, we will assemble
chloroplast and mitochondrial datasets from all coding regions of
sufficient size and conservation to permit confident sequence alignment.
The tremendous amount of organellar sequence data should permit
unambiguous reconstruction of organellar phylogenies for all taxa
sampled. We will also sequence a few nuclear genes that are either
single-copy or from small multi-gene families which are appropriate
for analysis at this scale. BAC libraries will facilitate probing
for (on filter arrays) and amplification of the desired sequences
(from individual BAC clones).
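As an illustration of how structural characters such as gene losses could be coded for analysis, the following minimal Python sketch builds a binary presence/absence matrix and flags parsimony-informative characters (those in which each state occurs in at least two taxa). The taxon and gene names are placeholders rather than data from this project, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def presence_absence_matrix(
    gene_content: Dict[str, Set[str]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Code gene content as binary characters: 1 = present, 0 = absent."""
    # The character list is the union of all genes seen in any taxon.
    genes = sorted(set().union(*gene_content.values()))
    matrix = {
        taxon: [1 if gene in present else 0 for gene in genes]
        for taxon, present in gene_content.items()
    }
    return genes, matrix


def parsimony_informative(
    genes: List[str], matrix: Dict[str, List[int]]
) -> List[str]:
    """Return genes whose presence and absence are each shared by >= 2 taxa."""
    informative = []
    for i, gene in enumerate(genes):
        column = [row[i] for row in matrix.values()]
        present = sum(column)
        absent = len(column) - present
        if present >= 2 and absent >= 2:
            informative.append(gene)
    return informative


# Placeholder data (hypothetical taxa and genes, for illustration only).
toy_gene_content = {
    "taxon_A": {"geneW", "geneX", "geneY", "geneZ"},
    "taxon_B": {"geneW", "geneX", "geneY"},
    "taxon_C": {"geneW", "geneX"},
    "taxon_D": {"geneW", "geneZ"},
}

genes, matrix = presence_absence_matrix(toy_gene_content)
print(genes)
print(matrix)
print(parsimony_informative(genes, matrix))
```

In this toy case a gene present in all taxa, or lost in only one, carries no grouping information under parsimony; only characters shared by at least two taxa in each state can support a clade.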
Four approaches will be used to obtain organellar genomes (Table
2). The order in which we will execute these options reflects the
relative cost per genome and the likelihood that each will work readily.
Table 2: Comparison of four methods to obtain the organellar genome
| FACS to purify | BAC library for 100Mb nuclear genome, 5X | oBAC library biased for |
We will use flow cytometry to determine the size of those nuclear genomes
that have not been directly measured (see Arumuganathan cv).
For genomes ~100 Mb, we will make a standard BAC library (17 taxa
in Table 1) because this is relatively inexpensive and will provide
us with all three genomes. Average insert size per clone in the
Wing lab is 130-150 kb and we will aim for ≥5X coverage, which is
considered a minimum BAC library standard by NSF. Quality control
of all libraries will be done by the Wing lab (CUGI standards).
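The relationship between library size and coverage can be sketched as follows. This is a rough estimate only: it assumes a 100 Mb nuclear genome, a 140 kb average insert (an assumed value typical of BAC clones), and random cloning, and it uses the standard Clarke-Carbon relation for the probability that a given locus is represented. The function names are hypothetical.

```python
import math


def clones_needed(genome_mb: float, insert_kb: float, coverage: float) -> int:
    """Clones required so that total cloned DNA equals `coverage` genome equivalents."""
    genome_kb = genome_mb * 1000.0
    return math.ceil(coverage * genome_kb / insert_kb)


def prob_locus_present(genome_mb: float, insert_kb: float, n_clones: int) -> float:
    """Clarke-Carbon estimate: P = 1 - (1 - f)^N, with f = insert size / genome size."""
    f = insert_kb / (genome_mb * 1000.0)
    return 1.0 - (1.0 - f) ** n_clones


# Assumed inputs: 100 Mb genome, 140 kb inserts, 5X target coverage.
n = clones_needed(genome_mb=100, insert_kb=140, coverage=5)
p = prob_locus_present(genome_mb=100, insert_kb=140, n_clones=n)
print(f"clones needed: {n}")
print(f"probability a given locus is in the library: {p:.4f}")
```

With these assumed inputs, roughly 3,600 clones give 5X coverage, and any given locus is expected to be present with probability above 0.99.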
For genomes >100Mb we have three options to get the organellar
genomes. Our first option will be to create an "organellar bacterial
artificial chromosome" or oBAC library. During normal BAC library
construction, tissue from which the cell wall has been digested
is embedded in agar. Proteins, carbohydrates and organellar genomes
are removed in situ to preserve intact chromosomes. Normally, a
Triton-X step is included to reduce the organellar genome representation
from 10-15% to 2-3% in the final BAC library. We will omit the
Triton-X step, essentially capitalizing on old technology for a
new purpose, and make a very small library, 384 clones, that will
nevertheless represent each organellar genome ~19-times. The libraries
will be arrayed and probed with standard genes to identify those
clones containing organellar genomes (mito: atpA, cob, atp9, cox1;
chloro: ndhA, rbcL, psbA). Many clones will contain the entire organellar
genome. Our oBAC and BAC procedure may reveal nuclear regions that
contain organellar DNA such as has been found in rice (Wing, unpub.).
Not only is this method cost effective (Table 2), but it is automated
(http://www.genome.clemson.edu/), produces arrayed filters and
-80°C glycerol stocks of all clones, is the best chance of preserving
fragile organelles from some of the more ancient taxa (Delwiche,
pers. comm.), and will foster data and bioinformatics exchange into
the genomic community via the CUGI/AZ website. Should this protocol
fail for a particular organism, we will use a Fluorescence Activated
Cell Sorter (FACS) to separate mitochondria and chloroplasts from
fractionated cells. Standard DNA extractions will be made from the
sorted organelles. If both oBAC and FACS fail for the larger genomes,
we will fall back on centrifugation protocols for organellar separation
and extraction, the classical approach [110, 111]. With four alternatives,
all taxa should be feasible, but we will draw from our pool of "alternate"
organisms if any taxa prove intractable.
The purified genomes or surrogate templates will be sheared randomly
into fragments of ~3 kb using a Hydroshear device, end-repaired,
and gel purified. Routine quality control measures ensure that shearing
produces fragments of narrow size distribution (important in the
later sequence assembly phase), with 1 s.d. ≤8% of the intended
fragment length. These fragments will be blunt-ligated into pUC18,
transformed into E. coli DH5α, and plated onto large format bacterial
plates under conditions that allow for blue-white color selection.
Colonies will be grown overnight, then processed robotically through
creation of glycerol stocks, extraction and rolling circle
amplification of plasmid DNA, separation for forward and reverse primer
sequencing, and setup of the sequencing reactions. Sequence
determination will be on 96-capillary automated sequencers. For
each genome, 96 clones will be sequenced to determine purity (based
on BLAST searches of sequences). Sequencing will continue until
approximately 8-fold redundancy, when gaps in the gene-rich genomes
should be minimal. Gap filling and sequence completion will be done
by returning to archived plasmid preps, or if necessary through
amplification of genomic DNA. Gap filling will be done in collaboration
between JGI and Utah State University. The goal will be to achieve
a total of approximately 9 Mb of final sequence data.
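The choice of roughly 8-fold redundancy can be illustrated with the Lander-Waterman model of random shotgun sequencing. The sketch below is only an approximation: the 150 kb genome size and 500 bp read length are assumed values chosen for illustration, the minimum overlap needed to join reads is ignored, and the function names are hypothetical.

```python
import math


def reads_for_coverage(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Number of reads N such that N * read length / genome size = coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Lander-Waterman: fraction of bases hit by no read is exp(-coverage)."""
    return math.exp(-coverage)


def expected_contigs(n_reads: int, coverage: float) -> float:
    """Lander-Waterman: expected contig count is N * exp(-coverage),
    ignoring the minimum overlap required to detect a join."""
    return n_reads * math.exp(-coverage)


# Assumed inputs: a 150 kb organellar genome and 500 bp reads.
genome_bp, read_bp, coverage = 150_000, 500, 8
n_reads = reads_for_coverage(genome_bp, read_bp, coverage)
print(f"reads needed: {n_reads}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(coverage):.6f}")
print(f"expected number of contigs: {expected_contigs(n_reads, coverage):.2f}")
```

Under these assumptions, 8-fold redundancy leaves an expected uncovered fraction of about 0.03%, consistent with the expectation that gaps should be minimal at that depth.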
At both CUGI/AZ and JGI all cloning and analysis steps are tracked
using bar-code readers. The data are automatically entered into
a workflow database for statistical analysis of each phase of the
operation. Sequencing machines automatically output their data into
a UNIX-based folder system, where they are assembled into contigs.
The JGI software is unique in that it uses paired-plasmid ends to
guide contig assembly. Gene annotation uses both standard and custom
software which has been successful for many whole genomes sequenced
at JGI. All sequence data will be deposited in GenBank.
Primary sequence characters.
Sequences of chloroplast genomes are complete for 24 organisms,
including four green algae, Marchantia, Psilotum, and numerous seed
plants. From analyses of these genomes, we infer that the best source
of characters will be protein-encoding genes and genes for the 16S
and 23S ribosomal RNAs. Gene content ranges from 69 protein-coding
genes in Pinus to 84 in Marchantia (78 in Chlorella; 76 in Nicotiana)
so gene losses are likely to be important phylogenetic markers .
A strategy using nucleotide sequences of 17 protein-coding chloroplast
genes exhibiting low synonymous substitution rates and site-to-site
rate variation has been applied successfully to studies of basal
angiosperms and land plants . Additional results (R. Olmstead,
unpubl.) suggest that this strategy can be used successfully at
much deeper phylogenetic levels in green plants. Stoebe et al. 
analyzed 46 protein-coding genes totaling >11,500 aligned amino
acid positions in a study of 9 taxa representing all chloroplast
genomes then available and including non-green plant taxa. Restricting
our study to green plants will enable us to use 60 genes and ~50,000
nucleotides of DNA sequence. Characters will be defined at both
the nucleotide and amino acid levels and analyses will be carried
out where most appropriate given alignments and levels of nucleotide
Complete mitochondrial genomes have been sequenced for fewer green
plants than chloroplast genomes. However, two green algae, Marchantia,
and at least one seed plant have been sequenced. Green plant mitochondrial
genomes also contain small and large subunit rRNA genes, but contain
far fewer protein coding sequences than do chloroplast genomes.
We will conduct combined multi-gene analyses for green plants as
we described for chloroplast genomes. Mitochondrial DNA substitution
rates are slower than those of either chloroplast or nuclear genomes
. Various mtDNA genes have been used recently for deep phylogenetic
studies in land plants [115, 116, 28, 117, 37]. We will sequence
single-copy nuclear genes as well as some from small multi-gene
families. Again, nuclear BAC libraries and filter arrays made from
them will greatly ease the acquisition of sequence for phylogenetic
analysis and provide genomic tools for other researchers. Working
from BACs instead of whole genomic DNA enables PCR-based approaches
to recover all copies of the genes, without interference from more
readily amplified copies, a problem when using PCR on nuclear multi-gene
families. We will focus on protein-coding genes that have been
identified as useful for deep phylogeny in plants. RNA polymerase
II consists of several subunits each encoded by separate nuclear
genes. With rare exceptions, the two largest subunits (RPB1 and
RPB2) are single-copy genes in all groups in which they have been
studied. RNA pol II genes have been used for deep phylogenetic studies
of crown eukaryotes , red algae [119, 120], fungi , and
land plants (B. Hall, pers. comm.) and should help resolve our fuzzy
nodes. Phytochrome genes have a good signal for seed plant and basal
angiosperm phylogeny  where a series of duplications have yielded
a clearly defined set of phytochrome genes. However, in non-seed
plants  evidence suggests that there is a single gene with
some lineages having duplications (e.g., Selaginella, Psilotum).
Some of these duplications are likely to mark clades once sampling