[Note: these are rough notes taken down during the symposium itself by Russ Chapman. They
are intended for archival purposes; please excuse the rough edges -- we would appreciate
getting feedback from anyone about errors or omissions in this document.]
Green Plant Phylogeny Research Coordination Group (Deep Green)
The Computational Challenges of Green Plant Phylogeny
University of Maryland, College Park, MD
Friday 2 June 2000
Symposium - NOTES
2460 AV Williams Building
Dr. Sean Graham, University of Alberta
"Challenges in deep green phylogeny with examples from the seed plants"
- Non-coding Regions of the Chloroplast DNA: evolution & alignment
- impact on alignment
- using indels in these regions as characters
- single-base indels are the most common, but the size distribution of
larger indels does not follow a simple pattern
- evolution of these regions is very slow, so homoplasy is low for these
indel characters (and homoplasy is likewise lower there for the
nucleotide characters)
- a couple of kilobases of non-coding data from spacer regions
in ct DNA were analyzed
- comments on basal position of Amborella in recent analyses and rooting of the angiosperms
- different taxon selection leads to different rooting
- examined slowly changing indels and this question of rooting: there is weak indel character support for current trees, and some support for the basal angiosperm being the water lilies (Nymphaeales)
- There is a correlation between the indel characters and the branch length based on nucleotide characters. Estimated size of the region needed to get decent support of basal angiosperm relationships from the indel characters is perhaps 4-6 kilobases not the current 2 kilobases
- 2 types of indels: tandem repeat sequences (insertions dominate) and unique sequences (deletions dominate)
- homology: looking at the mechanism for indel formation can help in determining homology or non-homology
- practical aspects of alignment: manual alignment versus computer-facilitated alignment
- number of required changes varies with how you interpret the homology of the indels and the molecular mechanisms of their origin
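The idea of using indels as characters can be sketched as simple presence/absence coding: each distinct gap (identified by its start and end coordinates in the alignment) becomes one binary character. This is a minimal sketch; the toy alignment and the coding scheme are illustrative assumptions, not the speaker's protocol:

```python
def indel_characters(alignment):
    """Code each distinct gap (start, end) in an alignment as a binary
    presence/absence character across taxa."""
    def gaps(seq):
        found, start = set(), None
        for i, c in enumerate(seq + "X"):  # sentinel closes a trailing gap
            if c == "-" and start is None:
                start = i
            elif c != "-" and start is not None:
                found.add((start, i))
                start = None
        return found
    taxon_gaps = {name: gaps(seq) for name, seq in alignment.items()}
    all_gaps = sorted(set().union(*taxon_gaps.values()))
    return {name: ["1" if g in had else "0" for g in all_gaps]
            for name, had in taxon_gaps.items()}

# Toy alignment: taxa A and B share a two-base deletion relative to C.
aln = {"A": "ACGT--ACGT", "B": "ACGT--ACGT", "C": "ACGTTTACGT"}
print(indel_characters(aln))  # → {'A': ['1'], 'B': ['1'], 'C': ['0']}
```

Note that overlapping but non-identical gaps are coded as different characters here; deciding whether such gaps are homologous is exactly the mechanism question raised in the talk.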
- Chloroplast Genes sampled in current study: 10 in photosystem II, 2 in ndh, 2 rps, 1 each in rpl, atpB, and rbcL regions
- seed plant phylogeny: there are many important questions because the seed plants
are so important
- synthesis of current knowledge: a bush-like tree (polytomy)
- the situation is complicated because extant taxa are only a small part of the
evolution of the seed plants and morphology & molecular analyses can
yield conflicting results
- for various reasons chloroplast DNA should be very appropriate for the analysis of seed plant phylogeny
- tried to root angiosperms with Gnetum (an appropriate outgroup based on morphology) but result was clearly wrong
(Mike Sanderson: simulation studies and long branch attraction;
Type I errors: some correct trees are easy to reject; Type II errors: some
wrong trees are easy to accept)
Their analyses of 17 chloroplast genes gave conflicting results depending on
whether non-seed-plant outgroups were included or not, and whether third or first
plus second codon positions were considered. Gnetales placed near Pinaceae in
some analyses, and basally in the seed plants in others.
Likelihood analysis also placed Gnetales within the conifers, but not near
Pinaceae.
Despite some problems, some important answers are coming out clearly, e.g., relationships within the conifers and within the flowering plants.
Will we find same problems deeper in the tree?
There is a need to integrate fossil evidence and developmental evidence.
Q&A There was discussion about long branch attraction and the question of what
is the basal taxon.
Dr. John Huelsenbeck, University of Rochester
"Bayesian inference of large phylogenies"
opening comments on small phylogeny problems (a few dozen spp.) and big
phylogeny problems (>100 spp.); there are unique problems associated with
the latter
consider the optimality criterion: you try to get closer and closer to the true tree,
but you may have millions of trees to search through to reach the best one and thus
can get stuck on a plateau of locally near-optimal trees
Bayesian inference: example coin toss: results get closer to expected results as
the number of tosses goes up; Bayesian inference combines prior beliefs with the
data to give a posterior probability; as you add more and more data the result is
less dependent on the a priori expectation; e.g., with 3 trees, assume a priori that
all 3 are equally likely, or that one is more likely than the other two, then
calculate the posterior probability
Metropolis-Hastings algorithm - Markov Chain Monte Carlo approximation
and inferring large phylogenies
- new move-types: improve the analysis (single worm and double worm)
- Metropolis-Coupled MCMC (i.e., MC3) and "incremental heating"
(facilitates swaps from one local optimum to the other)
- Conclusions:
- random sampling of phylogenies may prove a better approach than optimization for inferring large phylogenies
- Bayesian inference using MC3 may be an effective approach
- MC3 can take advantage of parallel processors
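The coin-toss example can be sketched as a tiny Metropolis sampler over the coin's bias. This is a minimal illustration of the Metropolis-Hastings idea behind the talk, not the phylogenetic implementation; the flat Beta(1,1) prior, step size, and chain length are assumptions:

```python
import math
import random

def log_posterior(p, heads, tails, a=1.0, b=1.0):
    """Unnormalized log posterior: Beta(a, b) prior times binomial likelihood."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    return (a - 1 + heads) * math.log(p) + (b - 1 + tails) * math.log(1.0 - p)

def metropolis(heads, tails, steps=20000, step_size=0.1, seed=1):
    """Random-walk Metropolis sampler for the coin bias p."""
    rng = random.Random(seed)
    p, samples = 0.5, []
    for _ in range(steps):
        proposal = p + rng.uniform(-step_size, step_size)
        delta = log_posterior(proposal, heads, tails) - log_posterior(p, heads, tails)
        # Accept with probability min(1, posterior ratio); out-of-range
        # proposals have -inf log posterior and are always rejected.
        if rng.random() < math.exp(min(0.0, delta)):
            p = proposal
        samples.append(p)
    return samples

# With more tosses, the posterior depends less on the prior and
# concentrates around the observed frequency of heads.
few = metropolis(heads=7, tails=3)
many = metropolis(heads=700, tails=300)
print(sum(few) / len(few), sum(many) / len(many))
```

A real phylogenetic chain walks over trees and model parameters rather than a single number, which is where the special move-types and MC3 incremental heating come in.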
Dr. Tandy Warnow, University of Texas, Austin
"New approaches for large scale phylogeny reconstruction, and what we still
need to do"
"Data decomposition techniques to increase speed and accuracy for phylogenetic
analysis"
opening comments on Markov models of evolution
In NJ, both theoretical and experimental studies suggest that the sequence lengths
needed to recover the true tree grow exponentially with the largest leaf-to-leaf
evolutionary distance and this situation is bad since there is a finite limit to the
data available
NJ analysis is very much affected by high rates of evolution but likelihood and
parsimony seem to be less affected.
They defined the concept of "fast-converging" methods, which means that the
method is guaranteed to recover the true tree with probability going to 1, from
sequences that grow polynomially in the number of taxa, for each bound on the
branch lengths.
By this definition, no method currently in use can be proven to be fast
converging, and empirical studies suggest that NJ is not fast converging.
They sought to develop fast converging methods through a divide-and-conquer
strategy.
Technique: divide into small, low divergence subproblems; compute trees on subproblems; merge subtrees into final tree
Objective: divide and merge in polytime; obtain accurate subtrees: DCM (max clique)
They noted that this DCM strategy does improve the accuracy of distance
methods, but it also seems potentially useful for solving MLE and MP, and
maybe helpful in real data problems.
Discussion of "disk covers of a tree" and of merging trees (first transform them
through edge contractions so that they induce the same subtrees on their shared
leaves)
Discussion of algorithmic structure of the disk-covering method and a small test
of the method.
For NJ, it is better to use small subproblems, but for parsimony it is better to use
the whole data set.
Discussion of DCM2 (dac) - don't divide everything into very small
subproblems
They are getting an improvement in accuracy with DCM1 and faster parsimony
with DCM2.
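For reference, the base distance method that DCM1 improves on can be sketched as a toy neighbor-joining implementation (nested-tuple trees, no branch lengths; the example distance matrix is an invented illustration, and this is not Warnow's code):

```python
def neighbor_joining(dist, labels):
    """Toy neighbor joining on a distance matrix (list of lists).
    Returns an unrooted topology as nested tuples."""
    d = [row[:] for row in dist]   # work on a copy
    nodes = list(labels)
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in d]
        # Minimize the Q-criterion to pick the next pair to join.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        keep = [k for k in range(n) if k not in (i, j)]
        # Distances from the new internal node to the remaining nodes.
        new_row = [(d[i][k] + d[j][k] - d[i][j]) / 2.0 for k in keep]
        new_node = (nodes[i], nodes[j])
        d = [[d[a][b] for b in keep] for a in keep]
        for idx, val in enumerate(new_row):
            d[idx].append(val)
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [new_node]
    return (nodes[0], nodes[1])

# Additive distances for the tree ((A,B),(C,D)).
dist = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
print(neighbor_joining(dist, ["A", "B", "C", "D"]))  # → (('A', 'B'), ('C', 'D'))
```

The divide-and-conquer point above is that a routine like this is run on small, low-divergence subsets of the taxa, and the resulting subtrees are then merged.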
Open Problems
- Are any of the standard distance methods fast converging for CF trees?
- Lower bounds?
- Can we bound the expected error rate of different methods?
- Characterization of real evolutionary tree topologies
- optimization problems based upon fitting dissimilarity matrices to additive matrices, but considering only the entries below fixed constant q
- Why does MP perform so well for biologically reasonable CF trees?
- Which methods perform well in practice?
- How can we speed up analyses?
Break
Mike Sanderson, University of California, Davis
"Supertree strategies for assembling the tree of life"
Comments on the "sparse matrix of comparative genomics/biology" and
all of the missing data
- But how do you handle data for a large number of taxa?
- There are some straightforward analyses of large data sets (e.g., rbcL).
- There is some good news and some bad news about the approach of sequencing a
few genes for a broad spectrum of the tree of life.
- They examined simulations to see the number of characters needed to get 80% of the clades correct. It scales nicely; even large phylogenies don't require an
impossible number of characters. But, the problem is that the deeper nodes are
harder to get accurately with a reasonable number of characters. The tips of the
tree are relatively easy to resolve.
- Therefore, consider piecing together smaller trees to make a supertree. Under optimal conditions you can unambiguously combine trees, but often it is not
easy to combine the trees (e.g., when some shared taxa show different
relationships in the two trees).
- Matrix representation: turn trees into data matrices and then combine the matrices
and run a parsimony analysis to construct the supertree.
- Example: two studies of two different groups of grasses; 177 original trees on carnivores were synthesized for 300 taxa
- If you had a massive data set, would it be better to subdivide the data set and then
synthesize a supertree or stick with the original data and analyze all at
once? Use simulations to check. Both methods work, but both begin to fail if
you begin to cut taxa out of the analysis.
- How does one grapple with the growing database "out there"? The first
challenge is to decide which trees could/should be selected to be combined in the
supertree. The approach can provide an alternative to massive sequencing of all
of the taxa. The approach could also let you know where additional data are
needed.
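The matrix-representation step can be sketched as Baum/Ragan-style coding: each clade of each source tree becomes one binary character, and taxa missing from a source tree are scored "?". Trees are written as nested tuples; this is an illustration of the coding idea, not the speaker's software:

```python
def leaves_of(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, tuple):
        out = set()
        for child in tree:
            out |= leaves_of(child)
        return out
    return {tree}

def clades(tree):
    """Leaf sets of the internal clades of a nested-tuple tree
    (the root clade, containing all taxa, is dropped as uninformative)."""
    found = []
    def walk(node):
        if isinstance(node, tuple):
            leaves = set()
            for child in node:
                leaves |= walk(child)
            found.append(frozenset(leaves))
            return leaves
        return {node}
    walk(tree)
    return found[:-1]

def matrix_representation(trees):
    """Baum/Ragan coding: one binary character per source-tree clade;
    taxa absent from a source tree are scored '?'."""
    taxa = sorted({x for t in trees for x in leaves_of(t)})
    matrix = []
    for t in trees:
        present = leaves_of(t)
        for clade in clades(t):
            matrix.append(["?" if x not in present
                           else ("1" if x in clade else "0")
                           for x in taxa])
    return taxa, matrix

# Two small source trees with partially overlapping taxon sets.
taxa, mat = matrix_representation([(("A", "B"), "C"), (("B", "C"), "D")])
print(taxa)  # → ['A', 'B', 'C', 'D']
print(mat)   # → [['1', '1', '0', '?'], ['?', '1', '1', '0']]
```

A parsimony analysis of the resulting matrix (one row per character, or transposed to one row per taxon, depending on the program) yields the supertree.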
Q&A Q: If you analyze only published trees you could have a good data set that
was poorly analyzed (and thus your analysis would be based on errors). A: Yes.
Thus, ideally you would reanalyze the data or include bootstrap values from the
original analysis to show weakly supported regions.
David Hillis, University of Texas, Austin
"Genetic algorithms for large phylogenies: problems and possibilities"
Given: many taxa, many characters, and complex evolutionary models
What can be said about computational tractability?
- genetic algorithms
- parallel processing
- hitch-hiking algorithms
- simulated annealing
- Markov Chain Monte Carlo (MCMC) methods
Genetic Algorithms
set up the problem so that the "solutions" are "individuals" in a population
and apply operations analogous to those of population biology
genome: string specifying topology and parameters
mutation: tree bisection and reconnection; change branch length by random values; change kappa by random value, etc.
recombination: recombination between chromosome strings; good trees are likely to reproduce and recombine to form new trees
Discussion of Paul Lewis' "GAML" (Genetic Algorithm for Maximum
Likelihood) approach.
Rank order selection of individuals for next generation
The analysis of the individuals (solutions) can easily be "parallelized" (the algorithms are naturally parallel, with one processor assigned to each individual
in the population) and there are various approaches to handle the parallel analyses
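The genome/mutation/recombination/selection loop above can be sketched as a toy genetic algorithm with rank-order selection. The fitness function here is a stand-in scalar objective (its peak is at every parameter equal to 0.5), not a real tree likelihood, and all settings are illustrative assumptions:

```python
import random

def rank_select(population, fitness, rng):
    """Rank-order selection: sampling weight grows with rank, not raw score."""
    ranked = sorted(population, key=fitness)   # worst first
    total = len(ranked) * (len(ranked) + 1) // 2
    pick, acc = rng.uniform(0, total), 0
    for rank, indiv in enumerate(ranked, start=1):
        acc += rank
        if pick <= acc:
            return indiv
    return ranked[-1]

def evolve(fitness, genome_len=5, pop_size=15, generations=200, seed=0):
    """Toy GA: each 'genome' is a parameter vector (standing in for a tree
    topology plus branch lengths); children are made by one-point
    recombination plus a small Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            a = rank_select(pop, fitness, rng)
            b = rank_select(pop, fitness, rng)
            cut = rng.randrange(genome_len)
            child = a[:cut] + b[cut:]            # recombination
            k = rng.randrange(genome_len)
            child[k] += rng.gauss(0, 0.05)       # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Stand-in objective with its optimum at all parameters = 0.5.
best = evolve(lambda g: -sum((x - 0.5) ** 2 for x in g))
print(best)
```

Near the optimum, the constant mutation rate keeps the population hovering at near-optimal solutions rather than converging exactly, which matches the asymptote behavior discussed in the talk.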
Discussion of evaluating Performance - two types of situations
1. Simulated data: Comparison to a known model tree
- Advantages:
a. true tree known
b. model of evolution known
- But, disadvantages:
c. simplicity of model may lead to data that are too "well-behaved"
2. Real data: Comparison of trees inferred by different methods
- Advantages:
a. data realistic
- But, disadvantages:
b. true tree unknown
c. model of evolution unknown
Used both real and simulated data for 228-taxon Angiosperm phylogeny (Soltis et
al.) to test performance of parallel GAML approach.
Performance of method: curve of likelihood scores appears to plateau, but scores
are actually still improving after 70,000 generations. A better way to see
improvement is to plot agreement score of current best tree with model tree (or
best known tree).
Discussion of distribution of branch length changes after optimization. Best trees
at any one time can usually be improved by optimizing branch lengths. Small
changes in branch lengths can have huge effects on likelihood scores.
Strategy: periodic optimization of branch lengths during selection process could
lead to rapid improvement in GA approach.
- Limits on Performance:
- Adaptation is most efficient when: selection is strong; population size is
large; variance is maintained (i.e., the mutation rate is high). In this
approach, selection is always strong (because of rank selection scheme),
but population sizes are necessarily small and selection may not counteract
effects of branch length mutation as population approaches the optimum
fitness.
- Effect of population size:
- At first small populations seem to give comparably good results, but then
performance gets better with larger population size. In terms of
computational power used, the optimum population size for the
Angiosperm phylogeny problem is somewhere near 15 individuals.
However, in terms of real time, the larger the population size, the better,
and the scale-up is almost linear. This gives great hope that algorithms
may prove highly effective.
- Effect of mutation near the optimum:
- If you have an optimal tree in tree-space, mutation keeps the population at an
effective asymptote of near-optimal trees
- Effect of combining populations of solutions:
- works to speed up the approach to optimal solutions
- Advantages to genetic algorithm approach:
- Avoids potential bias from starting point and calculation order
- Takes advantage of parallel computing, and is easily scalable to larger computer architectures
- Easily distributed across large computer clusters
- Future improvements:
- changes in mutation rate as problem progresses
- periodic branch length optimization
- other topological mutations
- larger populations
- more population structure, with migration between subpopulations
- distributions of populations across many machines
Acknowledgements: Paul Lewis, Matt Brauer, Mark Holder, Laurie Dries,
Derrick Zwickl, National Partnership for Advanced Computational Infrastructure
(NPACI), and Texas Institute for Computational and Applied Mathematics
(TICAM).