A tradition in the computer science, information technology,
and mathematical communities is to issue "challenges."
Such challenges pose problems to fellow scientists, and provide
an entertaining way to advance the discipline. The form of
the challenge itself can be quite variable, but the challenge
should pose a problem that is sufficiently difficult to be
interesting (and indeed, challenging!), but should not normally
be so absurdly difficult that it is unlikely anyone would
ever be able to solve it. It is also important that the challenge
be sufficiently well defined that it is possible to determine
whether or not a person has successfully answered the challenge,
or at least to provide an objective measure of performance.
Because the concept of challenges has not been widely applied
in the life sciences, we hope that introducing them will promote
additional interactions among the biological and analytical
communities. Note that none of the challenges listed here
currently has a significant prize associated with it. The
winner (if any) will be selected arbitrarily by the person
who proposed the challenge, who in turn will offer the winner
a beer or other beverage of comparable value.
Use your imagination and propose your own challenge!
To submit a challenge, send the proposed text for a challenge
and relevant web links to Charles F. Delwiche (email@example.com).
Tree of Life Challenge #1 (proposed by Charles F. Delwiche):
A dataset comprising 64 carefully aligned Small Subunit ribosomal
RNA (SSU rRNA) sequences with 1620 characters was used by Barns
et al. (1996) to calculate a phylogenetic tree spanning all
known major groups of living organisms. This analysis presented
a maximum likelihood tree (as well as bootstrap values) based
on the F84 model of sequence evolution, and assuming site-to-site
a) What is the maximum likelihood tree you can find using
a GTR + gamma + invariant sites model of sequence evolution?
b) For your best tree, can you demonstrate that it is the
globally optimal tree? If not, can you provide a quantitative
estimate of the probability that there is another tree of
higher likelihood? And finally, how long did it take to determine
this tree, measured in both clock time and CPU-minutes?
c) Bonus: can you perform a bootstrap analysis using the
same model of sequence evolution?
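For part (c), the resampling step of a nonparametric bootstrap can be sketched as follows. This is only a minimal illustration: the function name and the toy alignment are hypothetical, and the tree search on each pseudoreplicate (under GTR + gamma + invariant sites) would still have to be run in a separate phylogenetics program.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudoreplicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Columns (sites) are sampled with replacement until the replicate
    has the same number of sites as the original.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

# Toy alignment (hypothetical data, not the Barns et al. matrix).
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "AAGTTC"}
replicate = bootstrap_alignment(toy, random.Random(1))
```

Every taxon is resampled with the same column indices, so each site pattern in the replicate is a site pattern from the original alignment.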
The 1620 character alignment is available on the web at:
Barns, S.M., C.F. Delwiche, J.D. Palmer, and N.R. Pace. 1996.
Perspectives on archaeal diversity, thermophily and monophyly
from environmental rRNA sequences. Proc. Natl. Acad. Sci.
Sean Graham adds:
"With reference to the rDNA data set posted, my challenge
would be to identify potentially problematical long branches,
those that may have resulted in spurious placement of one
or more taxa."
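One simple way to start screening for long branches is to flag terminal branches that are unusually long relative to the rest of the tree. The sketch below assumes the terminal branch lengths have already been extracted from a tree; the taxon names, the lengths, and the cutoff multiplier k are all hypothetical. A long branch is only a candidate for spurious placement, not evidence of it.

```python
from statistics import median

def flag_long_branches(branch_lengths, k=3.0):
    """Flag taxa whose terminal branch exceeds median + k * MAD.

    branch_lengths: dict mapping taxon name -> terminal branch length.
    MAD is the median absolute deviation from the median; k is an
    arbitrary screening threshold (an assumption, not a standard).
    """
    values = list(branch_lengths.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    cutoff = med + k * mad
    return sorted(name for name, v in branch_lengths.items() if v > cutoff)

# Hypothetical terminal branch lengths (substitutions per site).
toy_lengths = {"taxon1": 0.10, "taxon2": 0.12, "taxon3": 0.09,
               "taxon4": 0.11, "taxon5": 0.55}
long_taxa = flag_long_branches(toy_lengths)
```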
Tree of Life Challenge #2 (proposed by Charles F. Delwiche):
The analyses described in "Tree of Life Challenge #1"
all assume stationarity of the mode of sequence evolution.
Can you provide a quantitative measure of lineage-specific
modes of sequence evolution that can be displayed graphically
and apply it to those data? Because the tree presented in
the paper was calculated under an assumption of stationarity,
it can be expected to minimize inferred differences in mode
of sequence evolution. Thus to address the issue of non-stationarity,
one would ideally first identify a tree that is optimal under
some justifiable and biologically meaningful probabilistic
model of sequence evolution that allows for non-stationarity.
Once such a tree has been determined, the changes in the mode
of sequence evolution that were inferred by the analysis should
be displayed on the tree in a graphical manner (i.e., without
the use of text, and in a manner that can quickly and easily
be interpreted with minimal explanation).
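As a crude first look at non-stationarity, one could compare base composition among lineages. The sketch below computes a chi-square statistic for compositional homogeneity and each taxon's contribution to it, which could then be mapped onto a tree as a color or branch width. It is only a proxy: it is not the non-stationary probabilistic model the challenge asks for, it ignores the phylogenetic non-independence of the sequences, and the function names and toy data are hypothetical.

```python
def base_counts(seq, alphabet="ACGT"):
    """Count A, C, G, T in a sequence (U is treated as T; gaps ignored)."""
    seq = seq.upper().replace("U", "T")
    return [seq.count(b) for b in alphabet]

def composition_chi2(alignment):
    """Chi-square statistic for homogeneity of base composition.

    Returns (total, per_taxon), where per_taxon maps each taxon to its
    contribution to the total statistic.
    """
    counts = {name: base_counts(seq) for name, seq in alignment.items()}
    col_totals = [sum(c[j] for c in counts.values()) for j in range(4)]
    grand = sum(col_totals)
    per_taxon = {}
    for name, row in counts.items():
        row_total = sum(row)
        chi2 = 0.0
        for j in range(4):
            # Expected count if every taxon shared the pooled composition.
            expected = row_total * col_totals[j] / grand
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
        per_taxon[name] = chi2
    return sum(per_taxon.values()), per_taxon

# Toy data (hypothetical): t3 is strongly GC-biased.
toy = {"t1": "ACGTACGT", "t2": "ACGTACGT", "t3": "GGGGCCGC"}
total, per_taxon = composition_chi2(toy)
most_deviant = max(per_taxon, key=per_taxon.get)
```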
Supertree Challenge (proposed by Mike Sanderson firstname.lastname@example.org):
The TreeBASE database (http://www.herbaria.harvard.edu/treebase)
currently contains over 1000 phylogenies with over 11,000
taxa among them. Many of these trees share taxa with each
other and are therefore candidates for the construction of
composite phylogenies, or "supertrees", by various algorithms.
A challenging problem is the construction of the largest and
"best" supertree possible from this database. "Largest" and
"best" may represent conflicting goals, however, because resolution
of a supertree can be easily diminished by addition of "inappropriate"
trees or taxa.
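One widely used supertree approach is matrix representation with parsimony (MRP), in which every clade of every source tree becomes one binary character. The sketch below shows only the encoding step. It assumes the source trees are rooted and written as nested Python tuples of taxon names, which is a hypothetical input format and not how TreeBASE stores them; the resulting matrix would then be analyzed by parsimony in a separate program.

```python
def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def clades(tree, is_root=True):
    """Yield the leaf set of every internal node below the root."""
    if isinstance(tree, str):
        return
    if not is_root:
        yield frozenset(leaves(tree))
    for child in tree:
        yield from clades(child, is_root=False)

def mrp_matrix(trees):
    """Encode source trees as an MRP character matrix.

    For each clade: '1' = taxon is in the clade, '0' = taxon is in the
    source tree but outside the clade, '?' = taxon absent from that tree.
    """
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    matrix = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon].append("?")
                elif taxon in clade:
                    matrix[taxon].append("1")
                else:
                    matrix[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in matrix.items()}

# Two toy source trees (hypothetical) that share taxa B and C.
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("B", "C"), "E")
matrix = mrp_matrix([tree1, tree2])
```

In this toy case the two source trees conflict over the placement of B and C, which is the kind of disagreement that can reduce resolution in the combined tree.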
Originally posed following the discussions at Deep Green
- Princeton, the 232-Taxon challenge
is still open.
Russ Chapman (email@example.com)
As for the challenge(s): My only suggestion is very simplistic,
but I am not sure it has been done.
I would like to see an intensive analysis of a complete data
matrix for green algae and land plants in which 3 different
genes are used and for which every square in the matrix is
filled (i.e., no missing data, which will cut the total number
of taxa down a bit). Given one or more analyses of the massive
data set, what is the permutation of the results with random
"extinctions" of taxa? With random eliminations of data
points? Is there a % random reduction from the perfect data
matrix (in terms of taxa and/or in terms of data points) that
SIGNIFICANTLY alters the result of the analysis (analyses)?
I realize lots of this sort of thing has been done, but I
don't know if much has been done with a real green plant data
matrix. Of course, I am not sure of how big the current "perfect"
matrix is yet for all green plants, but I thought it should
now be large enough to set the stage for simulated taxon extinctions
and simulated missing data for real organisms and real data.
[CFD adds: Another issue to consider is whether or not extinction
affects taxa at random. I suspect that under at least some
conditions certain clades would be disproportionately prone
to extinction, so a logical extension of this challenge would
be to examine the effects of "patchiness" in extinction
on phylogenetic reconstruction.]
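The data manipulations in this challenge can be sketched as follows: random taxon deletion, random masking of data points, and a "patchy" variant in which extinction risk depends on clade membership. All function names, the clade assignments, the risk values, and the toy matrix are hypothetical; each reduced matrix would then be reanalyzed and its tree compared with the tree from the complete matrix.

```python
import random

def drop_taxa(alignment, fraction, rng):
    """Remove a random fraction of taxa (simulated random extinction)."""
    names = sorted(alignment)
    doomed = set(rng.sample(names, int(round(fraction * len(names)))))
    return {n: s for n, s in alignment.items() if n not in doomed}

def mask_cells(alignment, fraction, rng, missing="?"):
    """Replace a random fraction of matrix cells with a missing symbol."""
    cells = [(n, i) for n in sorted(alignment)
             for i in range(len(alignment[n]))]
    chosen = set(rng.sample(cells, int(round(fraction * len(cells)))))
    return {n: "".join(missing if (n, i) in chosen else ch
                       for i, ch in enumerate(s))
            for n, s in alignment.items()}

def drop_taxa_patchy(alignment, clade_of, risk, rng):
    """Remove each taxon with a probability set by its clade's risk."""
    return {n: s for n, s in alignment.items()
            if rng.random() >= risk[clade_of[n]]}

# Toy complete matrix (hypothetical): 4 taxa, 5 sites, no missing data.
toy = {"t1": "ACGTA", "t2": "ACGTT", "t3": "AAGTT", "t4": "CAGTT"}
rng = random.Random(42)
fewer_taxa = drop_taxa(toy, 0.5, rng)
masked = mask_cells(toy, 0.25, rng)
patchy = drop_taxa_patchy(
    toy,
    clade_of={"t1": "X", "t2": "X", "t3": "Y", "t4": "Y"},
    risk={"X": 1.0, "Y": 0.0},  # clade X always lost, clade Y never lost
    rng=rng,
)
```

Repeating these steps over a range of fractions and many random seeds would give the distribution of outcomes needed to ask at what level of reduction the result changes.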