NSF Proposal - 3. Problems in Deep Phylogenetic Reconstruction

Shallow versus deep phylogenetics. The challenges associated with reconstruction of "shallow" relationships are fundamentally different from those of "deep" ones [1]. In "shallow" reconstruction problems, the branching events happened a relatively short time ago and the set of lineages resulting from them is relatively complete (extinction has not had a major effect). In these situations, internal and external branches are similar in length, giving less opportunity for long branch attraction. However, at this level an investigator often has to deal with the confounding effects of reticulation and lineage sorting. Characters at the morphological level may be quite subtle, and at the nucleotide level very careful analysis is required to find genes that evolve rapidly enough to be informative. (Note, however, that such genes are likely to be relatively neutral, and thus less subject to the adaptive constraints that can lead to non-independence.)

In contrast, in "deep" reconstruction problems, the branching events happened a relatively long time ago and the set of lineages resulting from them is relatively incomplete (extinction has had a major effect). In these situations, internal and external branches often differ greatly in length, so there is a greater likelihood of long branch attraction. Conversely, there are few problems with reticulation and lineage sorting, since most of the remaining branches are old and widely separated in time. Given all the time available on many branches, a myriad of morphological characters should be available, yet they may have changed so much that homology assessments are difficult; the same is true at the nucleotide level, where multiple mutations in the same region may make alignment difficult. Thus very slowly evolving genes must be found, but such conservatism is caused by strong selective constraints that increase the danger of convergence, leading to character dependence.

Structural vs. DNA sequence characters. How intrinsically useful are different categories of characters at these different scales? Clearly, structural and DNA sequence data have different and complementary strengths and weaknesses. Especially in "deeper" comparisons, structural characters such as morphological or genomic markers are more information-rich, allowing a temporal axis of comparison not possible with DNA sequence data. Structural characters often change in an episodic pattern, which is necessary for evidence of deep, short branches to remain detectable (clock-like markers are the worst kind of data for those sorts of branches). The number of possible character states is usually much higher in morphological character systems (and in genomic rearrangements) than in DNA sequence data and this makes long-branch attraction less problematic [2]. On the other hand, objectively defining character states in morphological comparisons can be difficult, particularly in "shallow" reconstructions, whereas the states are usually clear-cut in DNA sequence data. DNA sequence markers are also much more numerous, thus increasing the chance that sufficient markers can be found for all branches of a tree.
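The effect of the number of character states on chance resemblance can be illustrated with a small calculation. The sketch below assumes a symmetric k-state model in which every change between states is equally likely (a generalization of the Jukes-Cantor model); this model and the function name p_same_state are illustrative assumptions, not part of the analyses proposed here. It returns the probability that two tips separated by a path of d expected changes per character show the same state.

```python
import math


def p_same_state(k: int, d: float) -> float:
    """Probability that both ends of a path show the same state.

    Assumes a symmetric k-state model (all changes equally likely), which is
    an illustrative assumption. d is the path length in expected changes per
    character.
    """
    if k < 2:
        raise ValueError("a character needs at least two states")
    if d < 0:
        raise ValueError("path length must be non-negative")
    # As d grows the exponential term vanishes and the value approaches 1/k,
    # the rate of purely accidental matches between saturated lineages.
    return 1.0 / k + (1.0 - 1.0 / k) * math.exp(-k * d / (k - 1))


if __name__ == "__main__":
    # Compare a four-state nucleotide character with richer character systems
    # on a long (saturated) path.
    for k in (2, 4, 20, 100):
        print(k, p_same_state(k, 5.0))
```

On long paths the value tends to 1/k, so under these assumptions a character with many possible states yields far fewer accidental matches between long branches than a four-state nucleotide site.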

Dealing with heterogeneous data types. Deep phylogenetic reconstructions are inherently difficult, so all characters should be developed and used if they meet the criteria of good potential markers [1]. However, it remains controversial how data from different sources are to be evaluated and integrated with each other [3]. Some have argued that data sets derived from fundamentally different sources should be analyzed separately, and only common results taken as well-supported (i.e., consensus tree approaches), or at least that only data sets that appear to be similar in the trees they favor should be combined [4]. Others have argued that all putative homologies should be combined into one matrix. Theoretical arguments now favor the latter approach (i.e., "total evidence;" [5-8, 2, 9]). If characters have been independently judged to be good candidates for phylogenetic markers, then they are equivalent and should be analyzed together.
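Mechanically, combining putative homologies into one matrix amounts to concatenating the data partitions taxon by taxon. The following is a minimal sketch under stated assumptions: each partition is represented as a mapping from taxon name to a string of character states, taxa absent from a partition are scored as missing ("?"), and the function name combine_matrices and the taxon names are hypothetical.

```python
def combine_matrices(*partitions, missing="?"):
    """Concatenate character matrices keyed by taxon name.

    Each partition maps taxon name -> string of character states. Rows within
    one partition must have equal length. Taxa absent from a partition are
    padded with the `missing` symbol for that partition's characters.
    """
    taxa = sorted(set().union(*(p.keys() for p in partitions)))
    combined = {taxon: "" for taxon in taxa}
    for part in partitions:
        lengths = {len(row) for row in part.values()}
        if len(lengths) != 1:
            raise ValueError("rows within a partition must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            combined[taxon] += part.get(taxon, missing * width)
    return combined


if __name__ == "__main__":
    # Hypothetical toy data: a sequence partition and a structural partition.
    sequence = {"taxon_A": "ACGT", "taxon_B": "ACGA"}
    structural = {"taxon_A": "01", "taxon_C": "11"}
    for taxon, row in combine_matrices(sequence, structural).items():
        print(taxon, row)
```

Keeping the partitions as separate inputs also leaves the alternative open: the same partitions can still be analyzed individually and their trees compared before any decision to combine.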

There is one major exception to our preference for a "total evidence" position: data should not be combined if there is evidence that some of them have had a different branching history than the rest. However, there are several sources of homoplasy other than different branching history, including evolutionary convergence. If several data partitions show highly discordant trees due to convergence, the only way to see the true tree topology is to combine them. The only weapon a systematist has against convergence is the likelihood that truly independent characters will be subject to different confusing factors, so that the true history may emerge when these independent characters are combined. Probably all character systems are influenced by constraints that tend to bias phylogeny reconstruction one way or another, yet a combination of very different character sets can allow the "noise" to cancel out, revealing the historical signal.

Therefore, observing a particular data partition exhibiting serious conflict with another is not sufficient reason to reject combining them. There must also be additional evidence, outside of the phylogenetic analysis, of reticulation or lineage sorting. The best examples of such discordance are in "shallow" analyses, where organellar genomes may have different phylogenies than those of associated nuclear genomes and morphologies [10-12]. Barring that sort of clearly explainable discordance, all appropriate data should be used. This applies especially to "deep" analyses because, as argued above, reticulation and lineage sorting are much less likely to be problems there, while convergence is likely to be a greater problem.

Global versus local approaches. How will we ultimately connect "deep" and "shallow" analyses, each with their own distinctively useful data and problems? Some hold out hope for eventual global analyses, once enough universally comparable data are amassed and computer programs are efficient enough to deal with all extant species simultaneously. Others would go to the opposite extreme, and use a "supertree" approach, where the "shallow" analyses are simply grafted onto the tips of the "deep" analyses. An intermediate approach, "compartmentalization" [13, 2], uses the "shallow" topologies (that are based on analyses of the characters useful locally) to constrain "deep" analyses (that are based on analyses of characters useful globally).
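The simplest of these options, grafting "shallow" topologies onto the tips of a "deep" tree, can be sketched in a few lines. The example assumes trees are represented as nested tuples with string tip labels; the function names graft and to_newick and the clade labels are hypothetical and serve only as illustration. It shows the plain grafting described above, not a formal supertree algorithm and not the constrained searches used in compartmentalization.

```python
def graft(deep_tree, shallow_trees):
    """Replace each tip of deep_tree that names a clade with that clade's tree.

    Trees are nested tuples; tips are strings. shallow_trees maps a tip label
    in the deep tree to the "shallow" topology resolved for that clade. Tips
    without an entry are left unchanged.
    """
    if isinstance(deep_tree, tuple):
        return tuple(graft(child, shallow_trees) for child in deep_tree)
    return shallow_trees.get(deep_tree, deep_tree)


def to_newick(tree):
    """Render a nested-tuple tree as a Newick string without the final ';'."""
    if isinstance(tree, tuple):
        return "(" + ",".join(to_newick(child) for child in tree) + ")"
    return str(tree)


if __name__ == "__main__":
    # Hypothetical labels: a deep backbone and one locally resolved clade.
    deep = (("cladeA", "cladeB"), "cladeC")
    shallow = {"cladeA": ("a1", ("a2", "a3"))}
    print(to_newick(graft(deep, shallow)) + ";")
```

Grafting treats each "shallow" result as fixed and never lets the "deep" characters bear on it, which is the main way it differs from compartmentalization, where the local topologies act as constraints within a "deep" analysis.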

The task at hand. We need to address how characters can be selected, interpreted, and most effectively analyzed at various scales. The primary advantages of using the green plant lineage for this work are that a wealth of "shallow" analyses are published and ongoing, and many of the methods for collecting the data for "deep" analyses (particularly, genome-level molecular data) are being developed. This enables us to evaluate unresolved "deep" nodes by developing a large dataset that encompasses characters derived from both genomic and morphological analyses. Given the requested funds, we will link our framework to existing "shallow" analyses and use these linkages to test which of our scaling approaches works best.
