Deep Green - College Park

The Computational Challenges of Green Plant Phylogeny: Minutes

[Note: these are rough notes taken down during the workshop itself by Russ Chapman. They are intended for archival purposes; please excuse the rough edges -- we would appreciate getting feedback from anyone about errors or omissions in this document.]

Saturday June 3, 2000
2460 A.V. Williams Building, University of Maryland
College Park, MD

Workshop: Coordinating Work on Computational Challenges in Phylogenetic Reconstruction

Session I: 9:00-10:30 a.m.

Focus: New and Emerging Computational Issues

Bernard Moret, U. of New Mexico

"Parallel computing on uniform-memory-access shared-memory architecture: linear speed-ups for complex combinatorial problems"

[This speaker's slides are available in Postscript form at: http://www.cs.unm.edu/~moret/deepgreen.ps
and in PDF form at:
http://www.cs.unm.edu/~moret/deepgreen.pdf]

Introduction: parallel computing has been around for 30 years (ILLIAC); massively parallel machines are becoming affordable but use shared memory except in small (4-8 processor) clusters

Discussion of problems with parallel computing to date: a factor-of-1000 slowdown when accessing other memories
Shared Memory Through Hardware: this is the key; SGI (ASCI BLUE MOUNTAIN) expensive
There is now an architecture that provides Uniform Memory Access (UMA) in Shared Memory: UMA - access from all processors to all memory locations; the Sun E10K is the first such machine (a large array is very expensive); San Diego Supercomputer Lab: there is caching and a slight delay of 600 nanoseconds (10 times slower than work on a single machine, but a fantastic rate for so many shared processors)
25 Years of PRAM ("pee-ram") Algorithms: Parallel Random Access Machine: theoretical model of a perfect shared-memory system
Message-passing vs. shared-memory: most programming is based on message passing (ear decomposition) - analysis of results shows message-passing mode slows down the process but using shared memory mode doesn't (linear speed up: 2x processors = 2x as fast)
Linear Speed-ups with PRAM Algorithms: should be available at a practical level in several years and we can attack problems that we cannot handle now
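
The "linear speed-up" claim above can be made concrete with the usual definitions of speed-up and parallel efficiency. The following Python sketch is illustrative only: the timings are hypothetical numbers chosen to show the ideal case, not measurements reported in the talk.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S = T(1) / T(p)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency E = S / p; E = 1.0 corresponds to linear speed-up."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical run times in seconds, keyed by processor count (assumed values).
timings = {1: 120.0, 2: 60.0, 4: 30.0}
for p, t in timings.items():
    print(p, speedup(timings[1], t), efficiency(timings[1], t, p))
```

An efficiency that stays near 1.0 as processors are added is what "2x processors = 2x as fast" means; communication overhead shows up as efficiency falling below 1.0.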

Conclusions: True UMA shared-memory helps: it will be available, and we should prepare to make use of it; a revolution in high-performance computing may be at hand and computational biology will benefit greatly from it.

Q&A: Isn't this a vectorization process? No, it is not vectorization; it is different. Actually, computation is pretty minor; it is the transfer of information (that is, the communication) that matters, which is why this new development is so important. This could lead to a revolution in the industry. PRAM is best for combinatorial optimization. Audience comments: This should be especially good for ML analyses. This all makes massive memory available. It was designed by Sun to handle giant databases (not for scientific computing), and this might also be important for the biological community. (Ed. note: Assuming the computer discussed is the "SUN Enterprise 10000" (also known as the "Starfire"), you can read a little about the technology here: http://www.sun.com/servers/highend/10000/tech.html

To summarize: it seems that they have a very, very fast interconnection scheme called the "Gigaplane-XB Interconnect" that outperforms earlier bus architectures so that the processors can communicate with each other and with the computer's memory much faster than in the earlier architectures.)

Lisa Vawter, SmithKline Beecham

"Inclusion of distant taxa"

Introductory comments on approaches: morphological and molecular etc.

Situation: one or a few clades or paraphyletic groups; rooted or unrooted trees

Unrooted networks of interest

Morphological Approach: work with Ken Rice: looking at intron splice sites as characters - there are problems in assuming the uninterrupted gene is a single exon (because genes often include duplicated segments etc.)

Structural-morphological work with Ken Rice: e.g. cell surface receptor can have several structural variations and all of these can be used as characters

IAS Method: Inferred Ancestral States (easiest with a rooted subtree)

IAS Method with unrooted networks: artificially root one of the trees in all possible ways and compute the tree lengths with each of these attached to the other tree at all possible nodes; this can help move toward the optimal tree

Other sources of data: regulation, pathways upstream, etc. - use all sources of data to reduce the number of possible correct trees

Comment: It is very helpful to have a central curator of all data in a relational format. (Important for Deep Green etc.)

Jessie Kissinger, University of Pennsylvania

"Visualizing the Plasmodium and Toxoplasma Genomes - Toxoplasma gondii"

Apicomplexans (have plastids)

Review of serial endosymbiosis; apicomplexans (e.g., Plasmodium, the malaria parasite) have an "apicoplastid" with 4 membranes; these plastids are essential to the organism and are therefore a great target for drugs. A lot of the endosymbiont's genes have been transferred to the host nucleus.

They have sequenced the entire 35kb apicoplastid DNA from Toxoplasma gondii. They don't know yet exactly what this genome codes for. The organism does not use the normal genetic code for one of the codons and the ribosomes have compensatory changes for this modified codon usage.

Comment: There is a need for a way to graphically show indels etc. for non-homologous chromosomes.

Apicoplastid genes: 3 genes known (the products are imported back into the plastid) and to find out about the others they are comparing these sequences with the nuclear genome and databases (data mining) - "Functional Genomics."

They began to look for plastid genes that were in the nuclear genome. They also "Blast'd" their genes against GenBank and indexed the Blast hits by name.

They have transit peptides but these are different from others in plastids.

They have found about 35 candidate genes and about 12 have been shown to go into the plastid.

Terpenoid biosynthesis, acyl chain biosynthesis, heme biosynthesis are there, but many things that should be there haven't been found yet (RNA transferases etc.).

Q&A: What is the phylogenetic position of this plastid? Based on nuclear genes, it is a red algal plastid (via a secondary endosymbiosis). Does Leishmania have a plastid? None has been found yet, but they are looking at the genes etc. Is there more than one nucleus? There is no evidence for a remnant nucleus. Does the current nucleus have genes from nuclei of previous endosymbionts? Definitely. What is the time frame for all of the genome changes? Not known - over millions of years. Comment: Herbicides as anti-malarial compounds are being studied and many have been patented!

Tsetso Bachvaroff, University of Maryland

Dinoflagellate Chloroplast Genes

Introductory comments on: Apicomplexans (related to dinoflagellates based on nuclear genes)

He has sequenced chloroplast (ct) genes and done a maximum likelihood (ML) analysis on psbB.

The preliminary results would suggest dinoflagellate chloroplasts come from the haptophytes (this disagrees with other information).

There are no dinoflagellate chloroplast genomes as such; all genes so far have been found on small circular DNA pieces.

Even within the peridinin-containing dinoflagellates, some have plastids from other algal types.

Dinoflagellates use a proteobacterial-type RUBISCO, not the plant type of RUBISCO. 8 dinoflagellate chloroplast proteins are known: psbB, LSU, SSU, etc.

Dinoflagellates don't necessarily need photosynthesis.

Plastids can be necessary for biosynthetic functions and not photosynthesis per se.

Dinoflagellate plastids are bound by 3 membranes (cf. 4 in apicoplastids)

Discussion: other examples of kleptoplastids (e.g., slugs with plastids) and potential bottlenecks created by endosymbiotic events

Red and green algal plastids have been shown to be monophyletic (cf. article in Nature). Audience comment: plastids may be monophyletic but the acquisition of the plastids is not necessarily monophyletic.

Session II: 11:00 a.m.-12:30 p.m.

Focus: Moving Beyond the Tree

Jim Rodman, NSF

"Broccoli, capers, papayas: Molecular phylogenetics of nouvelle cuisine"

Introductory comments: on the flavors that are caused by the sulfur-containing compounds that produce "mustard oil" (of which there are about 100) - mustard oil glucosides

The breakdown enzyme (myrosinase) is present in all Cruciferae studied (and also in the sister family, the Capparaceae; the flavor of capers is due to a mustard oil).

Mustard oils occur in 15-16 families of plants: Is this a case of rampant convergence, multiple convergences among these unrelated families? (review of many examples, including papayas)

Is it parallel evolution? In 1975 one scientist (Rolf Dahlgren) suggested these families were related and thus shared the trait. He was considered very wrong! Therefore, rbcL analysis was done to test his hypothesis, and he was basically right: the families are all in one clade (with one exception - Drypetes).

What about nuclear 18S rRNA gene sequence data? The answer is the same. And combination of rbcL and 18S data yielded better resolution. Bootstrapping and decay analysis, etc. were done to test the robustness of the tree.

This robustly supported tree was used to look at other characters, e.g. three dicot families produce erucic acid and now it is realized that these 3 families are in the mustard oil group.

Also, some flowers that seem morphologically very different are now thought to be more similar than originally thought. Now we can study the evolution of the biosynthesis of mustard oils. The process begins like the cyanogenic glycoside biosynthetic pathway. Maybe even some of the enzymes are related.

The big sister group to the mustard oil clade is the Malvaceae, which have sulfated enzymes. The sapindalean plants are a cyanogenic glycoside-producing group.

Drypetes (ca. 100 spp.) has some similar enzymes and even some similar specialized cells, but it is not closely related to the mustard oil clade at all. Thus, this must be an example of parallel evolution.

Q&A: How frequently has the enzyme been lost? Only one genus has been shown to lack the necessary enzyme, but it has other features of mustard oil production. Otherwise every plant examined has the synthesis feature. In other cases wherein many but not all plants share a feature, one can go back and look to see if other parts of the feature are present in the genome. Discussion: the sister groups for the Cruciferae.

Jungho Lee, University of Massachusetts

"Phylogeny of the charophytes - evidence from the chloroplast genome"

There are two lineages of green plants: chlorophytes (green algae) and streptophytes (green algae and land plants).

A phragmoplast-mediated cytokinesis occurs in some of the advanced charophytes (Chara and Coleochaete). Some Zygnematales have a phragmoplast-like structure.

The lineages within the charophycean green algae include Chlorokybus (Chlorokybales), Klebsormidiales, Zygnematales, Charales, and Coleochaetales.

Which is the basal lineage? Some studies have shown that Chlorokybus is basal, other research shows Charales as the basal charophyte.

He looked at chloroplast DNA of Spirogyra maxima to collect new sources of information on chloroplast genome structure. It has some features that are quite similar to those of land plants but some of the features seem to be unique to Spirogyra.

Introns in land plants and charophytes (both cis-splicing and trans-splicing introns): many that are in land plants are also in the charophytes. He examined all 5 lineages of charophytes.

The distribution of four introns: Chara and Coleochaete have all four, Klebsormidium and Zygnematales have different sets of three, and Chlorokybus has none. Because Chlorokybus has none of the introns, including certain key introns present in Chara, it is considered to be basal.

Q&A: Is the sister to land plants changed? The distribution of the four introns does not affect the view of Charales and/or Coleochaete as the sister(s) of the land plants (embryophytes). To resolve the question, investigation of more organellar genomic characters in charophytes and basal land plants is necessary.

Q&A: What is the view of lateral transfer? Lateral transfer of an intron cannot be ruled out, but it is unlikely for those four introns among charophytes.

Q&A: Will you sequence some chloroplast genomes? Both the Spirogyra and Coleochaete genomes have been about half characterized based on sequence data.

Dennis Wall, University of California, Berkeley

"Codon Usage in Green Plants"

Codon bias can be caused by genome base composition bias or can reflect selection for translational accuracy.

Codon bias will increase homoplasy in coding-gene data. Some genes are used in studies of deep branches within land plant phylogenies. Does codon bias exist in such genes, and if so does it change along lineages? Discussion of assessment of codon usage and bias
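
One common way to assess codon usage is relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The notes do not record which measure the speaker used, so the following Python sketch is only an illustration under that assumption; the function name `rscu` and the toy sequence are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for amino acids present in an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    result = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids absent from the sequence
        expected = total / len(family)  # count under equal usage
        for c in family:
            result[c] = counts[c] / expected
    return result


# Toy sequence (hypothetical): four alanine codons, three GCT and one GCC.
values = rscu("GCTGCTGCTGCC")
print(values["GCT"], values["GCC"], values["GCA"])
```

An RSCU of 1.0 means no preference; values well above or below 1.0 for many codons indicate bias. Comparing such profiles across lineages is one way to ask whether bias changes along a tree.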

Does codon bias exist in rbcL? Yes, high bias in spore-producing plants and low bias in seed-producing plants. The codon bias is not based on genome composition bias.

This information can be useful in phylogenetic reconstruction.

Reduced bias and no preference (no bias) are derived conditions in land plants.

Codon usage bias clearly evolves under modes of positive selection; such selection will increase homoplasy and must be considered; and recent maximum likelihood models of codon evolution may be overly simplified for wide phylogenetic sampling.

Q&A: How to avoid the problem? Audience comments: Compartmentalization would help. Codon bias per se will not cause a problem; differential codon bias would create a problem. In the former case the analysis would not be hurt. In most cases we do not see significant differential codon bias. If one of the seed plants developed a high codon bias, that situation might create a problem. The increased homoplasy within a group because of codon bias might be a problem, but there is no clear evidence of that. There is more variation among algae and lower land plants than among the seed plants.

Session III: 1:30-3:00 pm

Focus: Analytical Advances

Matt Cimino, University of Maryland

"Charales"

Comments on the available data set for Streptophyta, the position of Chaetosphaeridium (which, unlike Coleochaete, does not retain the egg cell), and the monophyly of Coleochaetales.

The rbcL and atpB analyses show that the Coleochaetales are monophyletic, but this result is not strongly supported.

Bret Larget, Duquesne University

"A Bayesian approach to phylogenetic inference from genome arrangement data"

What are the limits to what we can determine from sequence information? Bayesian approach: don't look for the single best tree, rather begin with a prior distribution of a set of evolutionary trees.

For example, observe genetic data at the leaves, then determine the posterior distribution on the basis of a likelihood model for change in genetic data and Bayes' Theorem.
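
The prior-to-posterior step can be sketched for a tiny, fully enumerable set of trees. This Python example is illustrative only: the three rooted topologies, the uniform prior, and the likelihood values are all hypothetical and do not come from the talk.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of candidate trees.

    posterior(tree) = prior(tree) * likelihood(data | tree) / evidence
    """
    joint = {tree: priors[tree] * likelihoods[tree] for tree in priors}
    evidence = sum(joint.values())  # normalizing constant P(data)
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical inputs: uniform prior over the three rooted trees on taxa A, B, C,
# and made-up likelihoods of the observed data under each tree.
priors = {"((A,B),C)": 1 / 3, "((A,C),B)": 1 / 3, "((B,C),A)": 1 / 3}
likelihoods = {"((A,B),C)": 0.020, "((A,C),B)": 0.005, "((B,C),A)": 0.005}

for tree, prob in posterior(priors, likelihoods).items():
    print(tree, round(prob, 3))
```

The result is a distribution over trees rather than a single best tree. For realistic numbers of taxa the set of trees cannot be enumerated like this, so the posterior has to be approximated by sampling.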

Why develop methods for genome arrangement data? Genome arrangements may provide more information about deep evolutionary relationships.

Ingredients of a Bayesian approach to phylogenetic inference: a prior distribution on trees and on the parameters of the likelihood model; a likelihood model that specifies the probability of the observed data given the tree, unknown data, and parameters; and a Markov chain on the space of trees, unknown data, and parameters.

Simple problem: inversions are the only mechanism of change; all inversions are equally likely; inversions occur according to a Poisson process.
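
A minimal simulation of this simple model might look as follows. It is a sketch under stated assumptions, not the speaker's implementation: the genome is represented as a linear list of signed integers (sign = gene orientation) even though chloroplast genomes are circular, and the function names are hypothetical.

```python
import math
import random


def apply_inversion(genome, i, j):
    """Reverse the segment genome[i..j] (inclusive) and flip each gene's orientation."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


def poisson_sample(rng, mean):
    """Draw a Poisson-distributed count (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def simulate(genome, expected_inversions, rng):
    """Apply a Poisson number of inversions, each chosen uniformly at random."""
    n = len(genome)
    segments = [(i, j) for i in range(n) for j in range(i, n)]  # all equally likely
    for _ in range(poisson_sample(rng, expected_inversions)):
        i, j = rng.choice(segments)
        genome = apply_inversion(genome, i, j)
    return genome


print(apply_inversion([1, 2, 3, 4], 1, 2))  # [1, -3, -2, 4]
print(simulate([1, 2, 3, 4, 5], 3.0, random.Random(0)))
```

Repeating such simulations many times gives a rough Monte Carlo estimate of how likely one arrangement is to turn into another for a given expected number of inversions.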

To compute the likelihood of observed data for any tree, we need to be able to calculate the probability of going between any two arrangements given an expected number of inversions.

Alfalfa and pea chloroplast DNA: 12 genes give 8.2 x 10^10 possibilities, so even this two-taxon situation is complex.
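
One counting convention that reproduces a figure of about 8.2 x 10^10 for 12 genes is to treat the genome as a circle of genes, each with an orientation, and to fix one gene's position and orientation as the reference. That gives (n-1)! orders times 2^(n-1) orientations. Whether this is exactly how the speaker counted is an assumption; the sketch below only checks the arithmetic.

```python
import math


def signed_circular_arrangements(n_genes):
    """Count gene orders and orientations on a circle, with one gene fixed as reference."""
    return math.factorial(n_genes - 1) * 2 ** (n_genes - 1)


print(signed_circular_arrangements(12))  # 81749606400, about 8.2 x 10^10
```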

Discussion of the shortest path to a given set of rearrangements. With very small trees you can run this even now without a special program. What is the computational complexity of the counting algorithm? Is there a better algorithm? Can a good approximation that is more computationally tractable be used? Can the model be generalized and remain computationally tractable? Can sequence data and genome arrangement data be combined?

Example: black chiton, fruit fly, acorn worm, and human

Brent Mishler, University of California, Berkeley

"Compartmentalization revisited"

problems of large-scale, deep analyses: too many taxa; difficulty in determining homology
alternatives: exemplars; consensus coding for a heterogeneous terminal group (ad hoc); compartmentalization
by analogy with a water-tight compartment on a ship
homoplasy isn't allowed in or out
goal of compartmentalization: cut data sets down to manageable size; suppress the effect of spurious homoplasy; allow use of more information in an analysis
some definitions: compartment: a group of terminals accepted as monophyletic a priori; sore thumb: a terminal that belongs to a compartment but jumps out in unconstrained global analyses
some approaches and some advantages and disadvantages

Some examples: in green plants, an unconstrained analysis of 18S data is replaced by an analysis of the same data with constrained monophyletic groups; a more radical approach is to use a consensus sequence for each group (an "HTU"). This approach will greatly reduce the number of taxa.

Q&A: Audience comment: This approach does not eliminate problems associated with determining the HTU. Response: This is somewhat similar to the supertree approach. It is perhaps midway between global analysis of all taxa on the one hand and the supertree approach on the other. Audience comment: This is very much a Bayesian approach in the sense that the accepted monophyletic groups are the "priors." Response: Yep, similar in basic philosophy.

Mark Chase, Royal Botanic Gardens, Kew

"Rigorous analyses on single gene matrices do not lead to more correct groups: a perspective based on results from matrices composed of several genes"

Introductory comments on the angiosperms and the developing data sets. Looked at starting lengths vs. maximum parsimony trees for different genes (rbcL, atpB, and 18S) and different paired analyses. Combined data sets worked better than single-gene analyses (with some starting trees close to the shortest trees).

You get shorter trees when you use more branch swapping (TBR swapping), e.g. for atpB trees and other trees. This is less true for the 18S data, which are flat (there is no decrease in tree length as you increase branch swapping). A single gene doesn't allow you to get to the best trees.

In a simulation with one of the real trees as the "true" tree, the combined data gives you a better tree but not the true tree.

Slow vs. fast genes (n.b., both lower numbers of variable sites and lower frequency of change at variable sites have been used when referring to slowly evolving genes)

rate of molecular evolution: nuclear > plastid > mitochondrial

Is use of slow genes better to infer ancient radiation events?

% of variable sites: 18S 22%, atpB 40%, rbcL 42%, matK 70%, atp1 ___%, matR ___%
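
A percentage of variable sites like those listed above can be computed from an alignment by counting columns that show more than one character state. The notes do not say how the speaker counted (for example, how gaps were treated), so this Python sketch makes its own assumptions: gaps and ambiguous characters are ignored, and the toy alignment is hypothetical.

```python
def percent_variable_sites(alignment, ignore="-?N"):
    """Percentage of alignment columns showing more than one character state."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all sequences must have the same aligned length")
    variable = 0
    for column in zip(*alignment):
        # Distinct states in this column, excluding gaps and ambiguity codes.
        states = {ch.upper() for ch in column} - set(ignore)
        if len(states) > 1:
            variable += 1
    return 100.0 * variable / length


# Toy alignment (hypothetical): columns 3 and 5 vary; the gap in column 4 is ignored.
toy = ["ACGTACGT", "ACGTTCGT", "ACCTACGT", "ACG-ACGT"]
print(percent_variable_sites(toy))  # 25.0
```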

The faster gene is best because it contains more variable characters. The angiosperm data perform almost like clean simulated data sets.

Session IV: 3:30-4:40 pm

Ray Cranfill, University of California, Berkeley

presented brief informal comments on the need to facilitate information sharing on the web.

Dick Olmstead, University of Washington

offered informal comments on the phylogeny of seed plants, and the nuclear and mitochondrial gene trees and relationships of the Gnetales to other seed plants;

there are long branches in the situation; there was another study with similar results; he and Sean Graham are looking at 17 chloroplast genes and he commented on the selection of slowly and more rapidly evolving genes for the study.

comments on the ILD test and their results with chloroplast genes; some genes had different base compositions among seed plants

The results (18S maximum parsimony) show Gnetales sister to the other seed plants (conifers and angiosperms) with high bootstrap support. (Ray Cranfill noted he had very similar [identical] results in various analyses of various genes.)

He saw no rate differences among lineages for any of the genes.

With one analysis (of first and second positions) he did get the Gnetales within the conifers. Thus, there are conflicting results even when you try to take into consideration some of the problems.

ILD test seems to be giving some strange results. (The test tells you if you could improve the nonhomology situation by partitioning the data.) Audience comment: The test may be doing what it is supposed to do.

The inverted repeat genes were good but some (ndhB) are not in all land plants, others are and can be used. Also introns are useful in these genes. Obtained a tree with Gnetales in the conifers. There is a need to sequence some more conifers.

Focus: Information Presentation

Brent Mishler, University of California, Berkeley

"The ultimate rank-free database"

There is a need to communicate well with the public and show them our results in a way they can understand. There is no need for traditional taxonomic ranks. Green plant hyperbolic map demonstration at Deep Green web site. There are 300-400 taxa in the tree. Thumbnails will be added. This is a good way to show the results to the public. Tree was constructed by Ray Cranfill and Dick Moe. "Site Lens Studio" was donated by Ramana Rao and Inxight Software, Inc. (Software is normally $4,000.)

Junhyong Kim, Yale University: Discussion Leader for "Data Visualization"

Opening comments on visualization: problems with visualizing large trees or a large set of trees. How to show the tree? How to show the tree well?

Large trees: make interactive (like Tree of Life) or embedded in hyperbolic space (fish eye effect).

Visualizing large objects well - humans are very much visually oriented.

What to do?

schemes: interactive visualization; data sensitive visualization; representative schemes
technicals: algorithm; hardware
biology: optimization

Comments by Ben Salisbury, Yale University:

Visualizing different trees

Session V: 5:00-5:30 pm Closing Discussion

current and future research plans
logistics for maintenance of resources
future funding for Deep Green

It would be good to broaden the focus to evolution in general, not just phylogenetics. Some would like to learn more about genomics. Others would be interested in meeting the people involved in the plant genome projects plus paleobotanists and phylogeneticists. The Taos meeting included a good mix. Some would like a proposal that funds two or three of these meetings. Doebley might be amenable to a meeting involving his plant genome project participants and Deep Green people. Would people like to continue these data analysis meetings? Bring in people working in related areas, including paleobotanists, morphologists, and developmental scientists. Genomics is a good central theme that could bring in these types of participants. The original idea of coordinating the research really has not happened, but it was always intended to be voluntary. The individual workshops (bryophytes, ferns, angiosperms) did a little more of this kind of thing (that is, resulted in explicit coordination of the research being done in various laboratories).

The deadline for the NSF Research Coordination Network proposals (in June 2000) was mentioned. There are at least two Deep Green RCN proposals being submitted that we know of. There have been more than 100 papers that were facilitated one way or another by Deep Green. If anyone else plans to submit a proposal within two weeks to the NSF Research Coordination Network program, they should let us know. Current Executive committee members are Brent Mishler, Liz Zimmer, Russ Chapman, Chuck Delwiche, and Ken Karol. If anyone has ideas for funding please let us know.

Adjournment at 5:23 pm.