[Note: these are rough minutes taken down during the meeting itself. They are intended for archival purposes; please excuse the rough edges -- we would appreciate getting feedback from anyone about errors or omissions in this document.]

Green Plant Phylogeny Research Coordination Group
Third meeting--Mardi Gras Symposium
Louisiana State University
Baton Rouge, Louisiana
February 15-16, 1996

Participants and Guests (with initials by which they are referred to below):
Brent Mishler--Co-PI (BDM)
Russell Chapman--Co-PI (RLC)
Mark Buchheim--PI (MAB)
Chuck Delwiche (CD)
Chris Henze (CH)
Detlef Leipe (DL)
Richard Olmstead (DO)
David Swofford (DS)
Debra Waters (DW)
Elizabeth Sweedyk (ES)
Gary Olsen (GO)
Junhyong Kim (JK)
John Huelsenbeck (JPH)
Juan M. López-Bautista
Ken Karol (KK)
Kenneth Rice (KR)
Lena Struve (LS)
Michael Donoghue (MD)
Michael Sanderson (MS)
Paul Kores (PK)
Paul Lewis (POL)
Pamela Soltis (PS)
Rick McCourt (RM)
Sean Turner (ST)
Tandy Warnow (TW)
Victor Albert (VA)
Xuhua Xia (XX)

Thursday, February 15, 1996

9:00 am Session I: Opening Remarks

RLC--introduced the Green Plant Phylogeny Working Group: Coordination grant from the USDA to help in the collaboration and coordination of international green plant phylogeny (see prospectus). Over 5 years, set up workshops addressing different problems, and finally disseminate the data to the scientific community at large. Workshops are meant to be small from both a budgetary and logistic perspective. Participants encouraged to recommend future participants and topics. Want to involve as many people as possible; therefore hold some workshops in conjunction with major meetings. The WWW site will be a primary site of dissemination of information. Concerns in terms of cooperation and sharing data--not trying to co-opt others' work. The only data included will be published data--this is both for quality control and to ensure that no one's data are stolen.
Want to achieve parallel data sets in order to complete a data matrix for data analysis for a robust solution. Idea of grant is not for PIs to get money for own work, but to bring people together. Trying to keep process as open as possible.

BDM--discussed 1st workshop, held in Berkeley: Problem--huge amount of green plant data that is to a large extent not comparable, because of failure to coordinate activities of independent researchers. Need to have a synthesis of the data in the near future, but difficult to achieve without a comparable data matrix. Not controlling science but supplying framework. Idea is not to have the group do research, but coordinate research for the ultimate purpose of coordinating data analysis. First meeting focused on one clade of green plants, the streptophytes--at least 400,000 species ranging from unicells on up, trying to put together comparable data set. Discussed structure of grant, mechanisms, and most importantly sat down as subgroups and debated exemplar taxa and characters of streptophyte groups, with 300-400 OTUs of both clades. At end of three-day meeting had series of ideals; timewise these ideals have not been met. Want to set up website, prototype of which exists, with exemplar taxa and characters available for anyone to use. Envision data availability matrix with list of taxa and kinds of data available, with footnote of reference where available. This data availability matrix is the ultimate goal of the first two years of the grant. Other types of databases were identified as sources of information that would fulfill the objectives of the overall project; these include culture collections and DNA banks.

Basic goals over five-year period--workshops held in conjunction with professional meetings to make money last longer. Lots of data gathering. In 1998 and 1999--goal is to have multiauthored paper with talk at 1999 Botanical Congress, a major public symposium. This would be in conjunction with a book-length set of analyses.
Nature of structure developed depending on outcome. Data individually published and independently justified. Analysts would get credit for their input (as editors, etc.). Book would also be electronic with MacClade-style matrices. Any royalties would be used to set up fund to continue workshops and maintain database.

No one has turned in their lists from Berkeley meeting. ST has agreed to release Chapman lab data alignment. Next symposium is at Seattle. Ultrastructural specialists at AIBS. International Congress of Fossil Charophytes--smaller workshop there with Linda Graham. Final session will come back to this topic. This group may be able to make suggestions. Grant is flexible.

MAB--Discussed Breckenridge meeting--One and one-half day meeting on green algae held in conjunction with PSA--maximum number of chlorophytes--about 200 on Mark's list. Our group different because of demographics--most of experts in green plant phylogeny (molecular), 75% at least. Decided not going to be able to be as ambitious in sequencing as wanted by end of project. Prioritize nuclear SSU rRNA and rbcL. Co-sponsoring symposium at International Congress in Leiden in 1997. Topic--invite other algal groups to explore idea of global consortia in research. RLC will give keynote lecture. Will include reds, browns, diatoms, etc.

RLC--Housekeeping--at least founders of group appreciate use of nonmolecular characters. When thinking about global characters, however, there is a problem with choosing characters; when combining with green algae, even more difficult. Land plant people don't have to worry about algal lineages. Brings up problems of global characters. Would like participants to feel free to cite workshops and even encourage everybody to cite grant if they cite any comment or idea that came up at meeting. That would be very helpful in showing tangible results for annual reports and will also alert other people to workshop. USDA grant 94-37105-0713.
Break--resume at 10:45

Session II: Discussions begun by a few short presentations.

BDM--How to Represent Large Terminal Taxa (e.g. exemplar taxa vs. compartments and consensus sequences) and How to Choose and Define Characters.

Large data analysis issues have broader implications beyond just plants (e.g., all of life). Questions: Are exemplars necessary? How are exemplars chosen? Where do data matrices come from? Most systematic theory focuses on phylogenetic reconstruction (making trees). Since the data matrix is primary, the issue of data matrix construction should also be considered.

Local vs. Global are the extremes. Where does an analysis fall on this spectrum? Global extreme--want to minimize a priori assumptions of group and include all taxa for which data are available. Local extreme--first reconstruct phylogenies within local groups and then link together. Must be dealt with pragmatically.

What about characters? Should we use "everything" or does one employ a "selection" process in constructing a data matrix? If use all data, must include whole genomes of all organisms vs. only slowly evolving ones? Homology issue--more local analyses don't usually suffer from questions of homology. Global analyses have tremendous homology problems (e.g., orchid vs. other angiosperm flower characteristics). With morphology it is easier to code with local group vs. global. Amount of data we can use will vary depending on scale.

Exemplars vs. composite coding--Should we have exemplars or erect an archetype from a broad analysis? If we do want to do composite coding, how can it be done? If do local, then how do we divide local set of OTUs into compartments--how do we decide what the compartments are?

Practical examples that arise--global analyses almost always get different local topology than when just do local. What happens when there is a conflict between the two? Want to avoid preconceptions, want to be maximally global, but the more global the more difficult to deal with.
Global vs. Local represents a compromise and finding a happy medium is the goal. What is the problem: pruning taxa or failure to establish homology? If the latter, some objective measure of what comprises compartments needs to be established: do global analyses first, then take well-supported clades and then do local phylogenies with perhaps additionally homologized characters included. Then go back to global with constraint statement based on local.

Open Forum on above

MS--How to sample taxa?

Simple random sampling seems to be missing from methods. Has not been identified as a method of selecting taxa. Suppose want to compartmentalize and then assign ancestral states to some chunk of life. How many taxa do we have to sample from a group to identify the root node? Underlying phylogeny with true root node, but in any real case we will only have a sample of taxa in group. The estimated root node may not correspond to the true root node. Whatever we do may not correspond to true root, compartment, or clade.

Some key issues--would like to have representative of both subclades descended from root node. If we don't know anything about phylogeny, faced with situation of how to assess if we have sampled from both sides of the root node. Some assumptions about diversification itself--if make assumptions about evolution can generate formulas for probability of sampling. Depends on process that is generating cladogram plus process of sampling. If assume simple diversification model (Poisson process), results are largely independent of clade size. Sampling strategies can be designed from this. If sample 40 to 100 taxa, have 90% probability to have samples from both sides. Can also do robustness analysis.

GRAPH OF minimal taxon sample to get 95% confidence level. Ratio of expected species diversity of larger to smaller clade.
So the more diverse, the more taxa must be sampled. In order to develop reasonable sampling strategies, need to be specific about how sampling is done and must make assumptions about the evolution of some clades.

Open Forum on above

Lunch Break

Session III.

MAB--Presentation on Dealing with Missing and Inapplicable Data

An example of inapplicable character:
  Pyrenoid: present or absent
  Pyrenoid position: basal or lateral
  Pyrenoid matrix: thylakoids or cytoplasm
How to deal with coding characters? Binary vs. Multistate

Open Forum on above

3:00 PM Break

Session IV:

J. Huelsenbeck (JPH)--Combination of diverse data sets

As relevant to goals of GPPWG: Major approaches--could debate partitioning of data by different criteria. How do we go about it? 1) Total evidence approach--seems reasonable because as add more data get better estimate of phylogeny, 2) biological or data partitions, should always treat separately, and 3) mix of both of above.

Analysis of different genes can give different assessment of phylogeny. Depending on different genes get different phylogeny. Example using four taxon statement and different genes. Why might different data sets provide different phylogenetic estimates? 1) stochastic variation, 2) different histories (gene vs. species trees), 3) method failure--when assumptions are violated, can become convergent on wrong tree. This can be demonstrated in simulation, but not as easily using real data. JPH advocates that you use tree stochastic variation as null hypothesis.

What tests can be applied? Likelihood ratio test of heterogeneity:
  H0: assumes one history (stochastic variation)
  H1: allows possibility of multiple histories
Calculate likelihood in 2 different ways. Standard likelihood ratio test that can be used. Simulate data under null hypothesis and then test.
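The likelihood ratio test of heterogeneity noted above can be sketched in a few lines. This is a minimal illustration, not JPH's implementation. It assumes that the log-likelihoods have already been computed by some phylogenetic program, and that the null distribution of the statistic is approximated from replicate data sets generated under H0. The function names and all numbers are hypothetical.

```python
def lrt_statistic(lnl_h0, lnl_h1_per_partition):
    """Return 2 * (lnL under H1 - lnL under H0).

    lnl_h0: log-likelihood with all data partitions forced onto one tree (H0).
    lnl_h1_per_partition: log-likelihoods with each partition on its own
        best tree (H1); their sum is the H1 log-likelihood.
    """
    return 2.0 * (sum(lnl_h1_per_partition) - lnl_h0)


def simulated_p_value(observed, null_stats):
    """Approximate the p-value from statistics of replicates generated under H0.

    Counts the replicates whose statistic is at least as large as the
    observed one; the +1 terms count the observed data set itself.
    """
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


# Hypothetical values, for illustration only.
observed = lrt_statistic(-1010.0, [-500.0, -505.0])
null_stats = [1.0, 2.0, 3.0, 12.0]  # statistics from replicates under H0
p_value = simulated_p_value(observed, null_stats)
```

A small p-value would suggest that the partitions are better explained by multiple histories than by stochastic variation around a single history.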
Open Forum

Break for dinner--resume at 7 pm

Session V:

POL--Equally and Unequally Weighted Parsimony and Maximum Likelihood

Are there benefits to using maximum likelihood that overcome the slowness of it? Distinctions between two: likelihood allows for changes in models by relaxing some assumptions--it is more flexible. It takes more time to analyze a data set, but you may not need as many taxa because of the flexibility. Assumptions--likelihood and parsimony compared. Take home messages--adding parameters makes the model less restrictive (NOT MORE!) and not knowing the model means not knowing what assumptions are being made. If have branch length heterogeneity can have long branch attraction, which is a violation of assumptions of model. Parsimony is a fairly restrictive model. Likelihood models relax some assumptions: each branch can have its own rate/time, thus avoiding the infamous "long branch attraction" problem.

Open Forum.

Adjourn

Friday am

Session VI:

DL--Practical Problems in Implementing and Maintaining Large Scale Phylogenetic Classification Systems at the Genetic Data Bases.

DL working at GenBank and organizing data in a meaningful taxonomic way. Organisms at GenBank as of Dec 15, 1995. Currently 18,000 organisms including all types of sequences. How many eukaryotes? About a third of all come from green plants, but about 50/50 between botanical things and animals; only about 300 sequences comprise green algae. How fast is data base growing? Currently fungi are most active. For every organism at GenBank there is generally only one sequence.

Open Forum

VA--New Approaches and New Developments in the Use of "Old" Approaches--Jackknifing

Expedient solutions for nasty problems (big matrix), e.g. green plant phylogeny. ML is too slow, MP is too slow. What about Neighbor Joining (NJ)? To swap or to resample? NJ may actually have more than one equally optimal tree even though only one tree is produced on any given run.
Entry order affects NJ, especially in data with substantial ambiguities. Data resampling uncovers the ambiguities. You may only need to ask about groups among a large sampling (and not necessarily the relationships among the groups). Or you may want a "quickie" answer for a 500 taxon problem--problem won't get with MP or ML or NJ? What is tree distribution? One way to look at problem--what are the groups we can find? Another way to look at it--branch swapping to look at different trees. Resampling: look at whether groups are supportable.

Open Forum

break

Session VII:

TW--How to Handle Massive Data Sets: Solving the optimization problems.

problems with existing approaches:
  exact algorithms too costly computationally, so limited to small data sets
  heuristics may not produce globally optimal solutions in all cases
  most of classical optimization problems are NP-hard, making exact solutions difficult to reliably obtain
  algorithms not based upon optimization criteria less appealing
  how to handle more than one tree

new approaches:
  new algorithms for classical problems: compatibility, ML, parsimony, some distance-based optimization criteria
  new optimization criteria proposed to bridge previous optimization criteria
  new models for inferring consensus of different trees and fast algorithms for classical consensus problems--wants to find more resolution in consensus trees
  new results for compatibility criterion

Linguistic data--a compatibility problem (not parsimony or distance)

Open Forum

GO--Parallel processing and phylogenetic inference

types of processor:
  single instruction, multiple data
  multiple instruction, single data
  multiple instruction, multiple data (current wave of hardware--connected but independent computers; a very general model)
3rd of these is model he uses and thinks in terms of; also most of machines built today are this. Want to take advantage of relative independence of this type of machine.
phylogenetic inference: in a criterion-based method, we want to evaluate how good a phylogenetic tree is (optimize parameters). In parsimony this is not a big deal, but in ML this is much more intense. In parallel computing, generating trees to test, dispatching them to individual processors, and then sending results back works more easily with ML.

Open Forum: There was much interest expressed in getting a group of people together to set up a network of machines working on phylogeny. This would be a good thing for GPPRCG to sponsor.

break

Session VIII.

Reconvened for general discussion--how can the coordination group help participants with data analysis? And, planned future activities.

BDM--Wrap-up. Not much can be done as a group re. these theoretical issues except to encourage further consideration and discussion of these issues. If specific groups work on part of the group analysis, mini-workshops could be justified.

Agenda summary: Where do data matrices come from? How do we select characters? Compartmentalization? Combining or not combining data? The differences in methods are probably minor compared to different approaches to character identification and analysis. Search strategies. Use and number of parameters. Equally or nearly equally optimal solutions. What uses can be made of constraint statements in reducing the computational load on searches?

Pursuing the parallel processing network may be a good goal for this group. Supporting TreeBASE is also a good goal for this group--should immediately encourage filling up TreeBASE with green plant stuff--MD should take the lead on that and we should all contribute. Tree of Life--collaborating with Maddisons on this. Rick McCourt has done a lot on this but we need to work on pages no one else is working on.

GPPRCG Web Page: still a draft. We would like feedback on this web page prior to taking it to the public. The demo of the group's page shows the participants in a "taxonomic" fashion (RLC).
If any names are missing, these should be identified. Minutes from previous meetings. BDM will set a two-week comment period on it before it makes its worldwide debut.

Open Forum.

Adjourn