[Note: these are rough notes taken down during the symposium itself by Russ Chapman. They are intended for archival purposes; please excuse the rough edges -- we would appreciate getting feedback from anyone about errors or omissions in this document.]

Green Plant Phylogeny Research Coordination Group (Deep Green)

The Computational Challenges of Green Plant Phylogeny

University of Maryland, College Park, MD

Friday 2 June 2000

Symposium - NOTES

2460 AV Williams Building

Dr. Sean Graham, University of Alberta

"Challenges in deep green phylogeny with examples from the seed plants"

  1. Non-coding Regions of the Chloroplast DNA: evolution & alignment
  2. Chloroplast Genes sampled in current study: 10 in photosystem II, 2 in ndh, 2 rps, 1 each in rpl, atpB, and rbcL regions

(Mike Sanderson: simulation studies and long branch attractions

Type I errors - some trees are easy to reject even if correct; Type II errors - some wrong trees are easy to accept)

Their analyses of 17 chloroplast genes gave conflicting results depending on whether non-seed-plant outgroups were included, and on whether third or first-plus-second codon positions were considered. Gnetales placed near Pinaceae in some analyses, and basally in the seed plants in others.

Likelihood analysis also placed Gnetales within the conifers, but not near Pinaceae.

Despite some problems, some important answers are coming out clearly, e.g. relationships within the conifers and within the flowering plants.

Will we find the same problems deeper in the tree?

There is a need to integrate fossil evidence and developmental evidence.

Q&A There was discussion about long branch attraction and the question of what is the basal taxon.

Dr. John Huelsenbeck, University of Rochester

"Bayesian inference of large phylogenies"

opening comments on: small phylogeny problems (a few dozen spp.) vs. big phylogeny problems (>100 spp.); there are unique problems associated with the latter. Consider the optimality criterion: you try to get closer and closer to the true tree, but you may have millions of trees to get through to reach the best one, and thus can get stuck on a plateau of locally near-optimal trees.

Bayesian inference: coin-toss example: results get closer to the expected results as the number of tosses goes up. Bayesian inference combines prior beliefs with the data into a posterior probability; as you add more and more data, the result becomes less dependent on the a priori expectation. E.g., given 3 candidate trees, assume all 3 are equally likely a priori, or that one is more likely than the other two, then calculate the posterior probability of each.
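The coin-toss point can be made concrete with a minimal sketch (invented numbers, not from the talk): a discrete posterior over three hypothesized heads-probabilities, computed under two different priors. With few tosses the prior matters; with many tosses both priors give essentially the same answer.

```python
from math import comb

# Toy illustration (invented example): discrete posterior over three
# hypothesized heads-probabilities for a coin, under two priors.
def posterior(prior, heads, tosses, thetas):
    # unnormalised posterior = prior * binomial likelihood
    post = [p * comb(tosses, heads) * t**heads * (1 - t)**(tosses - heads)
            for p, t in zip(prior, thetas)]
    z = sum(post)
    return [x / z for x in post]

thetas = [0.3, 0.5, 0.7]
flat   = [1/3, 1/3, 1/3]   # all hypotheses equally likely a priori
skewed = [0.8, 0.1, 0.1]   # strong prior belief in theta = 0.3

# few tosses: the prior still shows; many tosses: the data dominate
print(posterior(flat,   5, 10, thetas))
print(posterior(skewed, 5, 10, thetas))
print(posterior(flat,   500, 1000, thetas))
print(posterior(skewed, 500, 1000, thetas))
```

With 500 heads in 1000 tosses, both priors put essentially all posterior mass on theta = 0.5.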

Metropolis-Hastings algorithm - Markov Chain Monte Carlo approximation and inferring large phylogenies

  1. new move-types: improve the analysis (single worm and double worm)
  2. Metropolis-Coupled MCMC (i.e., MC3) and "incremental heating" (facilitates swaps from one local optimum to the other)
  3. Conclusions:
    1. random sampling of phylogenies may prove a better approach than optimization for inferring large phylogenies
    2. Bayesian inference using ______ may be an effective approach
    3. MC3 can take advantage of parallel processors
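A minimal sketch of the MC3 idea on a toy one-dimensional bimodal target (invented here; a real phylogenetic implementation proposes moves in tree space): incrementally heated chains flatten the target, and swap moves between adjacent chains let the cold chain escape a local optimum.

```python
import math
import random

random.seed(1)

# Toy MC3 sketch (invented example, not the talk's implementation):
# incrementally heated Metropolis chains sample a bimodal target,
# and swap moves let the cold chain hop between the two peaks.
def log_target(x):
    # two well-separated peaks, at -3 and +3
    return math.log(math.exp(-(x + 3) ** 2) + math.exp(-(x - 3) ** 2))

n_chains, lam, steps = 4, 1.0, 30000
# incremental heating: chain i samples target ** (1 / (1 + lam * i))
betas = [1.0 / (1.0 + lam * i) for i in range(n_chains)]
states = [0.0] * n_chains

def metropolis_step(x, beta):
    y = x + random.gauss(0.0, 1.0)
    if math.log(random.random()) < beta * (log_target(y) - log_target(x)):
        return y
    return x

cold_samples = []
for _ in range(steps):
    states = [metropolis_step(x, b) for x, b in zip(states, betas)]
    # propose swapping the states of two adjacent chains
    i = random.randrange(n_chains - 1)
    log_r = (betas[i] - betas[i + 1]) * (
        log_target(states[i + 1]) - log_target(states[i]))
    if math.log(random.random()) < log_r:
        states[i], states[i + 1] = states[i + 1], states[i]
    cold_samples.append(states[0])   # only the cold chain is recorded

# with working swaps, the cold chain visits both peaks
print(min(cold_samples), max(cold_samples))
```

A single unheated chain with these step sizes would tend to stay on whichever peak it reached first; the heated chains cross the valley and pass their states down the temperature ladder.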

Dr. Tandy Warnow, University of Texas, Austin

"New approaches for large scale phylogeny reconstruction, and what we still need to do"

"Data decomposition techniques to increase speed and accuracy for phylogenetic analysis"

opening comments on Markov models of evolution

For NJ, both theoretical and experimental studies suggest that the sequence lengths needed to recover the true tree grow exponentially with the largest leaf-to-leaf evolutionary distance; this is bad, since there is a finite limit to the data available.

NJ analysis is very much affected by high rates of evolution but likelihood and parsimony seem to be less affected.
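For readers less familiar with NJ, its core step can be sketched in a few lines (toy distances invented here; this is the standard Saitou-Nei Q-criterion, not anything specific to the talk):

```python
# Toy sketch of one neighbor-joining step: pick the pair of taxa
# minimising the standard NJ Q-criterion. Distances are invented.
def nj_pick_pair(d):
    """Return the pair (i, j) minimising Q = (n-2)*d(i,j) - r(i) - r(j)."""
    n = len(d)
    r = [sum(row) for row in d]            # net divergence of each taxon
    best, pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if q < best:
                best, pair = q, (i, j)
    return pair

# additive distances on the tree ((A,B),(C,D)) with unit branch lengths:
# d(A,B) = d(C,D) = 2, all cross-pairs = 3
d = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(nj_pick_pair(d))   # NJ correctly joins a cherry first
```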

They defined the concept of "fast-converging" methods: a method is fast converging if it is guaranteed to recover the true tree with probability going to 1, from sequences that grow only polynomially in the number of taxa, for each bound on the branch lengths.

By this definition, no method currently in use can be proven to be fast converging, and empirical studies suggest that NJ is not fast converging.

They sought to develop fast converging methods through a divide-and-conquer strategy.

Technique: divide into small, low divergence subproblems; compute trees on subproblems; merge subtrees into final tree

Objective: divide and merge in polytime; obtain accurate subtrees - DCM (max clique)
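As a crude illustration of the "divide" step only (the actual DCM uses threshold graphs and max-clique decompositions, which this does not attempt), one can greedily group taxa whose pairwise distances stay below a divergence threshold; the distances below are invented:

```python
# Crude stand-in (NOT the actual DCM algorithm) for the "divide" step:
# greedily group taxa so that every pair within a group falls below a
# divergence threshold; each group becomes a small subproblem.
def low_divergence_groups(taxa, dist, threshold):
    groups = []
    for t in taxa:
        for g in groups:
            if all(dist[t][u] <= threshold for u in g):
                g.append(t)
                break
        else:
            groups.append([t])   # no compatible group: start a new one
    return groups

# invented distances: A,B are close; C,D are close; cross-pairs diverge
dist = {
    "A": {"A": 0, "B": 1, "C": 5, "D": 6},
    "B": {"B": 0, "A": 1, "C": 5, "D": 6},
    "C": {"C": 0, "A": 5, "B": 5, "D": 1},
    "D": {"D": 0, "A": 6, "B": 6, "C": 1},
}
print(low_divergence_groups(["A", "B", "C", "D"], dist, 2))
```

Trees are then computed on each group and merged, which is the hard part the disk-covering machinery addresses.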

They noted that this DCM strategy does improve the accuracy of distance methods, but it also seems potentially useful for solving MLE and MP, and maybe helpful in real data problems.

Discussion of "disk covers of a tree" and of merging trees (first transform them through edge contractions so that they induce the same subtrees on their shared leaves)

Discussion of algorithmic structure of the disk-covering method and a small test of the method.

For NJ, it is better to use small subproblems, but for parsimony it is better to use the whole data set.

Discussion of DCM2 (dac) - don't divide everything into very small subproblems

They are getting an improvement in accuracy with DCM1 and faster parsimony with DCM2.

Open Problems

  1. Are any of the standard distance methods fast converging for CF trees?
  2. Lower bounds?
  3. Can we bound the expected error rate of different methods?
  4. Characterization of real evolutionary tree topologies
  5. optimization problems based upon fitting dissimilarity matrices to additive matrices, but considering only the entries below a fixed constant q
  6. Why does MP perform so well for biologically reasonable CF trees?
  7. Which methods perform well in practice?
  8. How can we speed up analyses?

Mike Sanderson, University of California, Davis

"Supertree strategies for assembling the tree of life"

Comments on the "sparse matrix of comparative genomics/biology" and all of the missing data
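The "sparse matrix" point can be illustrated with a toy supermatrix (taxon and gene names invented here): combining per-gene data sets into one taxon-by-gene matrix leaves most cells missing.

```python
# Toy illustration (invented taxon/gene incidences): combining
# per-gene data sets into one supermatrix leaves many cells missing,
# conventionally coded '?'.
datasets = {
    "rbcL": {"Oryza", "Pinus", "Ginkgo"},
    "atpB": {"Oryza", "Pinus"},
    "18S":  {"Oryza", "Chara"},
}
taxa = sorted(set().union(*datasets.values()))
matrix = {t: {g: ("X" if t in s else "?") for g, s in datasets.items()}
          for t in taxa}
cells = [matrix[t][g] for t in taxa for g in datasets]
print(f"fill: {cells.count('X') / len(cells):.0%}")   # -> fill: 58%
```

Real comparative-genomic supermatrices are typically far sparser than this toy 58%.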

Q&A Q: If you analyze only published trees you could have a good data set that was poorly analyzed (and thus your analysis would be based on errors). A: Yes. Thus, ideally you would reanalyze the data or include bootstrap values from the original analysis to show weakly supported regions.

David Hillis, University of Texas, Austin

"Genetic algorithms for large phylogenies: problems and possibilities"

Given: many taxa, many characters, and complex evolutionary models

What can be said about computational tractability?

Genetic Algorithms
set up the problem so that the 'solutions' are 'individuals' in a population and apply operations analogous to those of population biology

genome: string specifying topology and parameters

mutation: tree bisection and reconnection; change branch lengths by random values; change kappa by a random value, etc.

recombination: recombination between chromosome strings; good trees are likely to reproduce and recombine to form new trees

Discussion of Paul Lewis' "GAML" (Genetic Algorithm for Maximum Likelihood) approach.

Rank order selection of individuals for next generation
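The ingredients above (string genomes, mutation, recombination, rank-order selection) can be sketched on a toy problem. This is not GAML itself: bit-string genomes and a count-the-ones fitness stand in for tree topologies plus parameters and their likelihood scores.

```python
import random

random.seed(0)

# Toy GA with rank-order selection (not GAML): bit-string genomes and
# a count-the-ones fitness stand in for trees and likelihood scores.
GENOME, POP, GENS = 32, 20, 60

def mutate(g, rate=0.02):
    # flip each bit with a small probability
    return [b ^ (random.random() < rate) for b in g]

def recombine(a, b):
    cut = random.randrange(1, len(a))      # one-point crossover
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(GENOME)] for _ in range(POP)]
for _ in range(GENS):
    ranked = sorted(pop, key=sum, reverse=True)   # fitness = sum of bits
    # rank-order selection: reproduction chance depends only on rank
    weights = [POP - i for i in range(POP)]
    parents = random.choices(ranked, weights=weights, k=2 * POP)
    pop = [mutate(recombine(parents[2 * i], parents[2 * i + 1]))
           for i in range(POP)]

best = max(sum(g) for g in pop)
print(best)   # near the optimum of 32 after 60 generations
```

Because selection acts on rank rather than raw fitness, selection pressure stays constant even as the population converges, which matches the "selection is always strong" point below.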

The analysis of the individual (solutions) can easily be "parallelized" (the algorithms are naturally parallel, with one processor assigned to each individual in the population) and there are various approaches to handle the parallel analyses

Discussion of evaluating Performance - two types of situations

1. Simulated data: Comparison to a known model tree

2. Real data: Comparison of trees inferred by different methods

Used both real and simulated data for 228-taxon Angiosperm phylogeny (Soltis et al.) to test performance of parallel GAML approach.

Performance of method: curve of likelihood scores appears to plateau, but scores are actually still improving after 70,000 generations. A better way to see improvement is to plot agreement score of current best tree with model tree (or best known tree).

Discussion of distribution of branch length changes after optimization. Best trees at any one time can usually be improved by optimizing branch lengths. Small changes in branch lengths can have huge effects on likelihood scores.

Strategy: periodic optimization of branch lengths during selection process could lead to rapid improvement in GA approach.

Limits on Performance:
Adaptation is most efficient when: selection is strong; population size is large; variance is maintained (i.e., the mutation rate is high). In this approach, selection is always strong (because of rank selection scheme), but population sizes are necessarily small and selection may not counteract effects of branch length mutation as population approaches the optimum fitness.
Effect of population size:
At first, small populations seem to give comparably good results, but then performance gets better with larger population size. In terms of computational power used, the optimum population size for the Angiosperm phylogeny problem is somewhere near 15 individuals. However, in terms of real time, the larger the population size the better, and the scale-up is almost linear. This gives great hope that such algorithms may prove highly effective.
Effect of mutation near the optimum:
If you have an optimal tree in tree-space: effective asymptote at near-optimal trees
Effect of combining populations of solutions:
works to speed up the approach to optimal solutions
Advantages to genetic algorithm approach:
Avoids potential bias from starting point and calculation order
Takes advantage of parallel computing, and is easily scalable to larger computer architectures
Easily distributed across large computer clusters
Future improvements:
changes in mutation rate as problem progresses
periodic branch length optimization
other topological mutations
larger populations
more population structure, with migration between subpopulations
distributions of populations across many machines

Acknowledgements: Paul Lewis, Matt Brauer, Mark Holder, Laurie Dries, Derrick Zwickl, National Partnership for Advanced Computational Infrastructure (NPACI), and Texas Institute for Computational and Applied Mathematics (TICAM).