[Note: these are rough notes taken down during the symposium itself by Russ Chapman. They are intended for archival purposes; please excuse the rough edges -- we would appreciate getting feedback from anyone about errors or omissions in this document.]

Green Plant Phylogeny Research Coordination Group (Deep Green)

The Computational Challenges of Green Plant Phylogeny

University of Maryland, College Park, MD

Friday 2 June 2000

Symposium - NOTES

2460 AV Williams Building

Dr. Sean Graham, University of Alberta

"Challenges in deep green phylogeny with examples from the seed plants"

Non-coding Regions of the Chloroplast DNA: evolution & alignment
- impact on alignment
- using indels in these regions as characters
  - single base indels are the most common but the size distribution of larger indels is not a simple distribution
  - evolution of the regions is very slow thus homoplasy is not great for these indel characters (and the homoplasy is lower there for nucleotide characters)
  - a couple of kilobases of non-coding data from spacer regions in ct DNA were analyzed
- comments on basal position of Amborella in recent analyses and rooting of the angiosperms
  - different taxon selection leads to different rooting
- examined slowly changing indels and this question of rooting -there is weak indel character support for current trees, and some support for the basal angiosperm being the water lilies (Nymphaeales)
- There is a correlation between the indel characters and the branch length based on nucleotide characters. Estimated size of the region needed to get decent support of basal angiosperm relationships from the indel characters is perhaps 4-6 kilobases not the current 2 kilobases
- 2 types of indels: tandem repeat sequences (insertions dominate) and unique sequences (deletions dominate)
- homology: looking at the mechanism for indel formation can help in determining homology or non-homology
- practical aspects of alignment: manual alignment versus computer- facilitated
- number of required changes varies with how you interpret the homology of the indels and the molecular mechanisms of their origin
Chloroplast Genes sampled in current study: 10 in photosystem II, 2 in ndh, 2 rps, 1 each in rpl, apB, and rbcL regions
- seed plant phylogeny: there are many important questions because the seed plants are so important
- synthesis of current knowledge: a bush-like tree (polytomy)
- the situation is complicated because extant taxa are only a small part of the evolution of the seed plants and morphology & molecular analyses can yield conflicting results
- for various reasons chloroplast DNA should be very appropriate for the analysis of seed plant phylogeny
- tried to root angiosperms with Gnetum (an appropriate outgroup based on morphology) but result was clearly wrong

(Mike Sanderson: simulation studies and long branch attractions

Type I errors - some trees are easy to reject if correct and Type II errors - some wrong trees are easy to accept)

Their analyses of 17 chloroplast genes gave conflicting results depending on whether non-seed-plant outgroups were included or not, and whether third or first plus second codon positions were considered.Gentales placed near Pinaceae in some analyses, and basally in the seedplants in others.

Likelihood analysis also placed Gnetales within the conifers, but not near Pinaceae.

Despite some problems, some important answers are coming out clearly,e.g. relationships within the conifers and within the flowering plants.

Will we find same problems deeper in the tree?

There is a need to integrate fossil evidence and developmental evidence.

Q&A There was discussion about long branch attraction and the question of what is the basal taxon.

Dr. John Huelsenbeck, University of Rochester

"Bayesian inference of large phylogenies"

opening comments on : small phylogeny problems: a few dozen spp.; and big phylogeny problems: >100 spp; there are unique problems associated with the latter consider the optimality criterion: you try to get closer and closer to the true tree but you may have millions to get through to reach the best tree and thus can get stuck on a local near optimum plateau of trees

Bayesian inference: example coin toss: results get closer to expected results as the number of tosses goes up; Bayesian inference brings in some prior beliefs to a posterior probability; as you add more and more data the result is less dependent on the a priori expectation; e.g. 3 trees assume all 3 are equally likely or one is more likely than the other two; calculating the posterior probability

Metropolis-Hastings algorithm - Markov Chain Monte Carlo approximation and inferring large phylogenies

new move-types: improve the analysis (single worm and double worm)
Metropolis-Coupled MCMC (i.e., MC3) and "incremental heating" (facilitates swaps from one local optimum to the other)
Conclusions:
1. random sampling of phylogenies may prove a better approach than optimization for inferring large phylogenies
2. Bayesian inference using ______ may be an effective approach
3. MC3
4. MC3 can take advantage of parallel processors

Dr. Tandy Warnow, University of Texas, Austin

"New approaches for large scale phylogeny reconstruction, and what we still need to do"

"Data decomposition techniques to increase speed and accuracy for phylogenetic analysis"

opening comments on Markov models of evolution

In NJ, both theoretical and experimental studies suggest that the sequence lengths needed to recover the true tree grow exponentially with the largest leaf-to-leaf evolutionary distance and this situation is bad since there is a finite limit to the data available

NJ analysis is very much affected by high rates of evolution but likelihood and parsimony seem to be less affected.

They defined the concept of "fast-converging" methods, which means that the method is guaranteed to recover the true tree with probability going to 1, from sequences that grow polynomially in the numberof taxa, for each bound on the branch lengths.

By this definition, no method currently in use can be proven to be fast converging, and empirical studies suggest that NJ is not fast converging.

They sought to develop fast converging methods through a divide-and-conquer strategy.

Technique: divide into small, low divergence subproblems; compute trees on subproblems; merge subtrees into final tree

Objective: divide and merge in polytime; obtain accurate subtrees -DCM (maxclique)

They noted that this DCM strategy does improve the accuracy of distance methods, but it also seems potentially useful for solving MLE and MP, and maybe helpful in real data problems.

Discussion of "disk covers of a tree" and of merging trees (first transform them through edge contractions so that they induce the same subrees on their shared leaves)

Discussion of algorithmic structure of the disk-covering method and a small test of the method.

For NJ, it is better to use small subproblems, but for parsimony it is better to use the whole data set.

Discussion of DCM2 (dac) - don't divide everything into very small subproblems

They are getting an improvement in accuracy with DCM1 and faster parsimony with DCM2.

Open Problems

Are any of the standard distance methods fast converging for CF trees?
Lower bounds?
Can we bound the expected error rate of different methods?
Characterization of real evolutionary tree topologies
optimization problems based upon fitting dissimilarity matrices to additive matrices, but considering only the entries below fixed constant q
Why does MP performs so well for biologically reasonable CF trees?
Which methods perform well in practice?
How can we speed up analyses?

Break

Mike Sanderson, University of California, Davis

"Supertree strategies for assembling the tree of life"

Comments on the "sparse matrix of comparative genomics/biology" and all of the missing data

But how do you handle data for a large number of taxa?
There are some straightforward analyses of large data sets (e.g., rbcL).
There is some good news and some bad news about the approach of sequencing a few genes for a broad spectrum of the tree of life.
They examined simulations to see the number of characters need to get 80% of the clades correct. It scales nicely, even large phylogenies don't require an impossible number of characters. But, the problem is that the deeper nodes are harder to get accurately with a reasonable number of characters. The tips of the tree are relatively easy to resolve.
Therefore, consider piecing together smaller trees to make a supertree. Under optimal conditions you can unambiguously combine trees,but often it is not easy to combine the trees (e.g., when some shared taxa show different relationships in the two trees).
Matrix representation: turn trees into data matrices and then combine the matrices and run a parsimony analysis to construct the super tree.
Example: two studies of two different groups of grasses; 177 original trees on carnivores were synthesized for 300 taxa
If you had a massive data set, would it be better to subdivide the data set and then synthesize a supertree or stick with the original data and analyze all at once? Use simulations to check. Both methods work, but both begin to fail if you begin to cut taxa out of the analysis.
How does one grapple with the growing data base "out there"? The first challenge is to decide which trees could/should be selected to be combined in the supertree. The approach can provide an alternative to massive sequencing of all of the taxa. The approach could also let you know where additional data are needed.

Q&A Q: If you analyze only published trees you could have a good data set that was poorly analyzed (and thus your analysis would be based on errors). A: Yes. Thus, ideally you would reanalyze the data or include bootstrap values from the original analysis to show weakly supported regions.

David Hillis, University of Texas, Austin

"Genetic algorithms for large phylogenies: problems and possibilities"

Given: many taxa, many characters, and complex evolutionary models

What can be said about computational tractability?

genetic algorithms
parallel processing
hitch-hiking algorithms
simulated annealing
Markov Chain Monte Carlo (MCMC) methods

Genetic Algorithms
set up the problem so that the 'solutions" are 'individuals" in a population and apply operations analogous to those of population biology

genome: string specifying topology and parameters

mutation: tree bisection and reconnection; change branch length by random values; change kappa by random value, etc.

recombination: recombination between chromosomes strings: good trees are likely to reproduce, recombine to form new trees

Discussion of Paul Lewis' "GAML" (Genetic Algorithm for Maximum Likelihood) approach.

Rank order selection of individuals for next generation

The analysis of the individual (solutions) can easily be "parallelized" (the algorithms are naturally parallel, with one processor assigned to each individual in the population) and there are various approaches to handle the parallel analyses

Discussion of evaluating Performance - two types of situations

1. Simulated data: Comparison to a known model tree

Advantages:
a. true tree known
b. model of evolution known
But, disadvantages:
c. simplicity of model may lead to data that are too"well-behaved"

2. Real data: Comparison of trees inferred by different methods

Advantages:
a. data realistic
But, disadvantages:
b. true tree unknown
c. model of evolution unknown

Used both real and simulated data for 228-taxon Angiosperm phylogeny (Soltis et al.) to test performance of parallel GAML approach.

Performance of method: curve of likelihood scores appears to plateau, but scores are actually still improving after 70,000 generations. A better way to see improvement is to plot agreement score of current best tree with model tree (or best known tree).

Discussion of distribution of branch length changes after optimization. Best trees at any one time can usually be improved by optimizing branch lengths. Small changes in branch lengths can have huge effects on likelihood scores.

Strategy: periodic optimization of branch lengths during selection process could lead to rapid improvement in GA approach.

Limits on Performance:: Adaptation is most efficient when: selection is strong; population size is large; variance is maintained (i.e., the mutation rate is high). In this approach, selection is always strong (because of rank selection scheme), but population sizes are necessarily small and selection may not counteract effects of branch length mutation as population approaches the optimum fitness.
Effect of population size:: At first small populations seem to give comparable good results but then performance gets better with larger population size. In terms of computational power used, the optimum population size for the Angiosperm phylogeny problem is somewhere near 15 individuals. However, in terms of real time, the larger the population size, the better, and the scale-up is almost linear. This gives great hope that algorithms may prove highly effective.
Effect of mutation near the optimum:: If you have an optimal tree in tree-space: effective asymptote - near optimal trees
Effect of combining populations of solutions:: works to speed up the approach to optimal solutions
Advantages to genetic algorithm approach:: Avoids potential bias from starting point and calculation order; Takes advantage of parallel computing, and is easily scalable to larger computer architectures; Easily distributed across large computer clusters
Future improvements:: changes in mutation rate as problem progresses; periodic branch length optimization; other topological mutations; larger populations; more population structure, with migration between subpopulations; distributions of populations across many machines

Acknowledgements: Paul Lewis, Matt Brauer, Mark Holder, Laurie Dries, Derrick Zwickl, National Partnership for Advanced Computational Infrastructure (NPACI), and Texas Institute for Computational and Applied Mathematics (TICAM).