NSF Proposal - 3. Problems in Deep Phylogenetic
Reconstruction
Shallow versus deep phylogenetics.
The challenges associated with reconstruction of "shallow"
relationships are fundamentally different from those of "deep"
ones [1]. In "shallow" reconstruction problems, branching
events happened a relatively short time ago and the set of lineages
resulting from these branching events is relatively complete (extinction
has not had a major effect). In these situations, the relative lengths
of internal and external branches are similar, giving less opportunity
for long-branch attraction. However, at this level an investigator
often has to deal with the confounding effects of reticulation and
lineage sorting. Characters at the morphological level may be quite
subtle, and at the nucleotide level very careful analysis is required
to find rapidly evolving genes. (Note, however, that such genes are
likely to be relatively neutral, and thus less subject to the adaptive
constraints that can lead to non-independence.)
In contrast, in "deep" reconstruction problems, the
branching events happened a relatively long time ago and the set
of lineages resulting from these branching events is relatively
incomplete (extinction has had a major effect). In these situations,
the relative lengths of internal and external branches are often
quite different, thus there is a greater likelihood of long-branch
attraction. Conversely, there are few problems with reticulation
and lineage sorting, since most of the remaining branches are old
and widely separated in time. Given all the time available on many
branches, a myriad of morphological characters should be available,
yet they may have changed so much that homology assessments are difficult;
the same is true at the nucleotide level, where multiple mutations
in the same region may make alignment difficult. Thus very slowly
evolving genes must be found, but such conservatism is caused by
strong selective constraints that increase the danger of convergence
leading to character dependence.
Structural vs. DNA sequence characters.
How intrinsically useful are different categories of characters
at these different scales? Clearly, structural and DNA sequence
data have different and complementary strengths and weaknesses.
Especially in "deeper" comparisons, structural characters
such as morphological or genomic markers are more information-rich,
allowing a temporal axis of comparison not possible with DNA sequence
data. Structural characters often change in an episodic pattern,
which is necessary for evidence of deep, short branches to remain
detectable (clock-like markers are the worst kind of data for those
sorts of branches). The number of possible character states is usually
much higher in morphological character systems (and in genomic rearrangements)
than in DNA sequence data and this makes long-branch attraction
less problematic [2]. On the other hand, objectively defining character
states in morphological comparisons can be difficult, particularly
in "shallow" reconstructions, whereas the states are usually
clear-cut in DNA sequence data. DNA sequence markers are also much
more numerous, thus increasing the chance that sufficient markers
can be found for all branches of a tree.
Dealing with heterogeneous data types.
Deep phylogenetic reconstructions are inherently difficult, so all
characters should be developed and used if they meet the criteria
of good potential markers [1]. However, it remains controversial
how data from different sources are to be evaluated and integrated
with each other [3]. Some have argued that data sets derived from
fundamentally different sources should be analyzed separately, and
only common results taken as well-supported (i.e., consensus tree
approaches), or at least that only data sets that appear to be similar
in the trees they favor should be combined [4]. Others have argued
that all putative homologies should be combined into one matrix.
Theoretical arguments now favor the latter approach (i.e., "total
evidence" [2, 5-9]). If characters have been independently
judged to be good candidates for phylogenetic markers, then they
are equivalent and should be analyzed together.
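
As a rough illustration of what combining all putative homologies into one matrix involves, the following minimal Python sketch concatenates several character partitions by taxon and pads taxa that are absent from a partition with missing-data symbols. The taxon names, character states, and the function name combine_partitions are hypothetical placeholders, not data or software from this proposal, and real analyses would use dedicated matrix-handling tools.

```python
# Hypothetical toy partitions: taxon name -> string of character states.
# None of these names or values come from the proposal.
morphology = {"taxonA": "0102", "taxonB": "1110", "taxonC": "0000"}
genomic = {"taxonA": "01", "taxonB": "11"}  # taxonC not scored here
sequence = {"taxonA": "ACGT", "taxonB": "ACGA", "taxonC": "TCGA"}


def combine_partitions(partitions, missing="?"):
    """Concatenate character partitions into a single combined matrix.

    Taxa that are absent from a partition are padded with the
    missing-data symbol so that every row has the same length.
    """
    taxa = sorted(set().union(*partitions))
    combined = {taxon: "" for taxon in taxa}
    for part in partitions:
        widths = {len(states) for states in part.values()}
        if len(widths) != 1:
            raise ValueError("taxa in one partition differ in character count")
        width = widths.pop()
        for taxon in taxa:
            combined[taxon] += part.get(taxon, missing * width)
    return combined


matrix = combine_partitions([morphology, genomic, sequence])
for taxon, row in matrix.items():
    print(taxon, row)
```

In this sketch the partition boundaries are discarded after concatenation; an analysis that needs to weight or test partitions separately would have to record them alongside the matrix.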
There is one major exception to our preference for a "total
evidence" position: data should not be combined if there is
evidence that some of them had a different branching history than
the rest. However, there are several sources of homoplasy other
than different branching history, including evolutionary convergence.
If several data partitions show different, highly discordant trees
due to convergence, the only way to see the true tree topology is
to combine them. The only weapon a systematist has against convergence
is the likelihood that truly independent characters will be subject
to different confusing factors and thus the true history may emerge
when these independent characters are combined. Probably all character
systems are influenced by constraints that tend to bias phylogeny
reconstruction one way or another, yet a combination of very different
character sets can allow the "noise" to cancel out, revealing
the historical signal.
Therefore, observing a particular data partition exhibiting serious
conflict with another is not sufficient reason to reject combining
them. There must also be additional evidence, outside of the phylogenetic
analysis, of reticulation or lineage sorting. The best examples
of such discordance are in "shallow" analyses, where organellar
genomes may have different phylogenies than those of associated
nuclear genomes and morphologies [10-12]. Barring that sort of clearly
explainable discordance, all appropriate data should be used. This holds
especially for "deep" analyses because, as argued above, reticulation
and lineage sorting are much less likely to be problems there,
while convergence is likely to be a greater problem.
Global versus local approaches.
How will we ultimately connect "deep" and "shallow"
analyses, each with their own distinctively useful data and problems?
Some hold out hope for eventual global analyses, once enough universally
comparable data are amassed and computer programs are efficient
enough to deal with all extant species simultaneously. Others would
go to the opposite extreme, and use a "supertree" approach,
where the "shallow" analyses are simply grafted onto the
tips of the "deep" analyses. An intermediate approach,
"compartmentalization" [13, 2], uses the "shallow"
topologies (that are based on analyses of the characters useful
locally) to constrain "deep" analyses (that are based
on analyses of characters useful globally).
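
The grafting step of the "supertree" extreme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: trees are represented as nested tuples, each "shallow" tree attaches at exactly one tip of the "deep" tree, and all names (graft, to_newick, cladeA, and so on) are hypothetical. It does not implement compartmentalization, which uses the "shallow" topologies as constraints during the "deep" analysis rather than attaching them afterwards.

```python
def graft(deep_tree, shallow_trees):
    """Replace each tip of deep_tree that names a compartment
    with the corresponding shallow subtree."""
    if isinstance(deep_tree, tuple):
        return tuple(graft(child, shallow_trees) for child in deep_tree)
    # A tip: swap in the shallow tree if one exists, else keep the tip.
    return shallow_trees.get(deep_tree, deep_tree)


def to_newick(tree):
    """Write a nested-tuple tree as a Newick string (topology only)."""
    if isinstance(tree, tuple):
        return "(" + ",".join(to_newick(child) for child in tree) + ")"
    return tree


# Hypothetical placeholder topologies, not results from any analysis.
deep = ("outgroup", ("cladeA", "cladeB"))
shallow = {
    "cladeA": ("A1", ("A2", "A3")),
    "cladeB": ("B1", "B2"),
}

grafted = graft(deep, shallow)
print(to_newick(grafted) + ";")  # prints (outgroup,((A1,(A2,A3)),(B1,B2)));
```

Because the "deep" topology is fixed before grafting, nothing in the "shallow" data can alter the backbone in this approach, which is the limitation the intermediate approach is meant to address.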
The task at hand.
We need to address how characters can be selected, interpreted, and most
effectively analyzed at various scales. The primary advantages of using the
green plant lineage for this work are that a wealth of "shallow"
analyses are published and ongoing, and many of the methods for
collecting the data for "deep" analyses (particularly,
genome-level molecular data) are being developed. This enables us
to evaluate unresolved "deep" nodes by developing a large
dataset that encompasses characters derived from both genomic and
morphological analyses. Given the requested funds, we will link
our framework to existing "shallow" analyses and use these
linkages to test which of our scaling approaches works best.