Revised 21 July 1998
Green Plant Phylogeny Research Coordination Group
28 June 1998
held in conjunction with the
DIMACS Symposium on Estimating Large Scale Phylogenies: Biological, Statistical, & Computational Problems
Princeton, New Jersey
Russell L. Chapman, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA - Co-PI
Brent D. Mishler, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA - Co-PI
Victor A. Albert, The Lewis B. and Dorothy Cullman Program for Molecular Systematics, the New York Botanical Garden, New York, NY, USA
Gary Churchill, Biometrics Unit Faculty, Cornell University, Ithaca, NY, USA
Matthew T. Cimino, University of Maryland, College Park, MD, USA
Charles Delwiche, Department of Cell Biology and Molecular Genetics/Plant Biology, University of Maryland, College Park, MD, USA
Steve Farris, Molekylärsystematiska laboratoriet, Naturhistoriska riksmuseet, Stockholm, Sweden
Sean Graham, Department of Botany, University of Washington, Seattle, WA, USA
Mari Källersjö, Molekylärsystematiska laboratoriet, Naturhistoriska riksmuseet, Stockholm, Sweden
Louise Lewis, Department of Biology, University of New Mexico, Albuquerque, NM, USA
Paul Lewis, Department of Biology, University of New Mexico, Albuquerque, NM, USA
Francois Lutzoni, Field Museum of Natural History, Chicago, IL, USA
Richard M. McCourt, Academy of Natural Science, Philadelphia, PA, USA
Scott Nettles, University of Pennsylvania, Philadelphia, PA, USA
Kathleen Pryer, Field Museum of Natural History, Chicago, IL, USA
Pam Soltis, Department of Botany, Washington State University, Pullman, WA, USA
Elizabeth Zimmer, Smithsonian Institution, Laboratory for Molecular Systematics, Suitland, MD, USA
Session III Additional Attendees
Neil Caithness, Department of Ornithology, American Museum of Natural History, New York, NY, USA
Georg Fuellen, University of Bielefeld, Bielefeld, Germany
Mike Steel, Biomathematics Research Centre, Christchurch, NZ
Frank Olken, Lawrence Berkeley Lab, Berkeley, CA, USA
Paul Purdom, Indiana University, Bloomington, IN, USA
Ken Rice, Bioinformatics, UW2230, Smith-Kline Beecham, King of Prussia, PA, USA
David Swofford, Smithsonian Institution, Laboratory for Molecular Systematics, Suitland, MD, USA
Lisa Vawter, Bioinformatics, UW2230, Smith-Kline Beecham, King of Prussia, PA, USA
Tandy Warnow, Department of Computer Science, University of Arizona, Tucson, AZ, USA
Introductory Comments: Brent Mishler introduced the GPPRCG (history and activities) and the plans for the International Botanical Congress in St. Louis in August 1999. Russ Chapman explained that minutes from this workshop will be posted on the GPPRCG web site and that a draft will be sent to anyone who would like to review it. Chuck Delwiche mentioned a possible extension of the GPPRCG via a proposal through the Experiment Station at the University of Maryland. Brent noted that we plan to submit a proposal for a renewal.
Brent Mishler reviewed the overall phylogeny of the green plants and the two major lineages (chlorophycean and charophycean lineages). The ultimate goal is a book, and some publishers are interested (e.g., Oxford Press). The goal is to assemble the book as soon as possible after the International Botanical Congress. It would include various data sets. Current plans call for eight symposia at the International Botanical Congress sponsored by the GPPRCG: three seed plant symposia, one on mosses, one on liverworts and hornworts, one on ferns, one on lycophytes, plus a general overall GPPRCG symposium. We will support most of the speakers (56 plus a few other speakers); they will at least get travel and housing from our grant. The three seed plant symposia should be coordinated. Liz Zimmer mentioned trying to publish the symposia. Brent noted that we do not plan to publish the symposia in the book, but various groups will publish their symposia (e.g., the fern symposium and the bryophyte symposium) in various journals.
Data Sets: The large green plant 18S rRNA data set (232 taxa) was sent out by Brent Mishler, and this data set is close to the one used in the Missouri Botanical Garden Symposium paper. It is a very "messy" data set with lots of missing data. Chuck Delwiche noted that the alignment may require more work. Pam Soltis noted that the alignment she has does include green algae and that the alignment is workable. There are 550 angiosperm 18S rRNA sequences available. There was a general discussion of the available Belgian (Van De Peer's) alignments. It was noted that these alignments differ from other major alignments and that members of this group should provide their own alignments. We already have some data sets on the web site. We gave TreeBASE some financial assistance to get all the green plant analyses into TreeBASE. (There has been some delay because of hacker attacks on Michael Donoghue's computer.) It was noted that having data sets available on a CD would be useful, but that the web site might ultimately be more useful and long-lived (i.e., CDs may go out of date or become unusable).
Analyses: Some observations were made on the preliminary analyses for the 232 OTUs (angiosperms and outgroups) and the use of constraints within the compartments (clades). Discussion followed on the use of constraints and the local alignments within the compartments. Mindel has a paper dealing with this topic (based on an example with data for bats and other vertebrates).
Parsimony, Jackknifing, Constraints, and other concerns: Steve Farris presented his analyses of the "big messy data set" (BMDS) based on his program, Parsimony Jackknifing. There were a lot of unresolved relationships and only a little structure. This analysis was a short run on a laptop computer. He ran the analysis with and without branch swapping and obtained very similar results, although there are some cases in which branch swapping does make a difference. Steve Farris asked Kevin Nixon to use his new method (the Parsimony Ratchet) on the data set. Nixon obtained a tree that was at least 4 steps shorter than the published tree in a fairly short time (ca. 1.5 hours). Farris then presented an example from the literature on insects (Bower et al.) that had some strange results. Their approach involved constraining various groups during the analyses, but this approach can lead to erroneous results: one can generate artificial support for some groups by constraining seemingly unrelated other groups, and so generate support for almost anything one wants to show. He and coworkers published a paper in Cladistics on this topic. The important point is that one must be able to strongly justify the groups that are being constrained. He also showed a parsimony jackknife analysis of the insect data set that refuted the false conclusions of the Bower et al. paper. There was a general discussion of the fact that the use of constraints is "dangerous" and requires caution. For the green plants, the groupings are well supported by morphological characters. One could simply include the morphological characters in the analysis and they would serve the same function of constraining the compartments. For the green algae and land plants, the problem is global characters (e.g., the morphological characters).
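The resampling step behind a jackknife analysis of this kind can be sketched as follows. This is a minimal illustration and not the Parsimony Jackknifing program itself. It assumes that each character (alignment column) is deleted independently with probability 1/e in every replicate, and that groups found in fewer than half of the replicates are discarded. The function names, the 50% cutoff parameter, and the toy data are hypothetical. The tree search is passed in as a callable (find_groups), and the stand-in used here simply groups identical sequences, so it is a placeholder for a real parsimony search.

```python
import math
import random
from collections import Counter


def jackknife_columns(n_sites, rng, p_delete=math.exp(-1)):
    """Indices of the characters kept in one replicate.

    Each character is deleted independently with probability p_delete
    (assumed here to be 1/e).
    """
    return [i for i in range(n_sites) if rng.random() >= p_delete]


def jackknife_support(alignment, find_groups, replicates=100, cutoff=0.5, seed=0):
    """Frequency of each group over jackknife replicates.

    alignment   : dict mapping taxon name -> aligned sequence (equal lengths)
    find_groups : hypothetical callable; takes a resampled alignment and
                  returns an iterable of groups (frozensets of taxon names)
    cutoff      : groups below this frequency are dropped (assumed 0.5)
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    counts = Counter()
    for _ in range(replicates):
        kept = jackknife_columns(n_sites, rng)
        resampled = {t: "".join(s[i] for i in kept) for t, s in alignment.items()}
        counts.update(set(find_groups(resampled)))
    return {g: n / replicates for g, n in counts.items() if n / replicates >= cutoff}


def identical_sequence_groups(alignment):
    """Placeholder for a tree search: groups taxa with identical sequences.

    This is NOT a parsimony analysis; it only makes the sketch runnable.
    """
    by_seq = {}
    for taxon, seq in alignment.items():
        by_seq.setdefault(seq, set()).add(taxon)
    n_taxa = len(alignment)
    # Skip single taxa and the trivial group containing every taxon.
    return [frozenset(g) for g in by_seq.values() if 1 < len(g) < n_taxa]


# Toy data (invented): A and B are identical, C and D differ everywhere.
toy = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TTTATTTA", "D": "GGACGGAC"}
support = jackknife_support(toy, identical_sequence_groups)
```

In a real analysis, find_groups would be replaced by a parsimony search (with or without branch swapping) on each resampled matrix.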
Chuck Delwiche noted there is a need to use multiple approaches, but there should be some global unconstrained analyses. There was discussion of a hypothetical situation in which a local analysis of ferns gives a detailed topology, but when you do a global analysis including the ferns you have to delete some characters and you get a different topology for the ferns. Which analysis is better for the ferns? Steve Farris noted that one must consider the value of the characters that were deleted for the global analysis. If you get a big difference between the fern topology in the local and global analyses, then you know there is a problem because the data sets are giving you different results. You could use other types of analyses. Long branch attraction was discussed as one of the problems that might be involved in the differences between local and global analyses, as was breaking up long branches. Francois Lutzoni discussed the topic of independence of characters and key assumptions in various approaches to data analysis.
Recommended Genes, Question Marks: The GPPRCG has recommended certain genes (e.g. large and small subunit rRNA, rbcL, etc.). Not all of the groups recommended the same list of preferred molecules. Brent noted that there will be big blocks of question marks for a lot of the morphological characters. There was a discussion of bringing more cell biologists into the green plant phylogeny research effort.
rbcL Example: Mari Källersjö presented information on a large scale (500 sequence) data set for rbcL sequences on which she has been working with Steve Farris and Victor Albert with help from many other PIs. They used the Parsimony Jackknifing program. They gathered 2538 sequences with 1428 positions, aligned by eye. The taxa: blue-green outgroup, 41 green algae, 4 mosses, 1-2 each of liverworts, Equisetum, and Lycopods, 170 ferns, 26 cycads, 46 conifers, 2230 angiosperms (401 families, 185 represented by a single sequence). Two sets of analyses were performed (with and without branch swapping). They performed 1000 replicates and 5 random addition starts. There were 1235 informative positions. Without branch swapping there were 1359 supported groups. With branch swapping there were 1400 groups (41 new groups including monocots and eudicots). She assumed that with addition of more taxa (e.g., the ferns) the analysis might crash, but it didn't. She discussed the results and indicated which groups were stable and which were not. It was noted that Sanderson et al. presented an analysis based on photosynthesis genes at the Vancouver systematics meetings that supported some similar results. Mari indicated that these preliminary results were not fully resolved, but were consistent with other findings. She examined the informativeness of first, second, and third positions in the data set and found a lot of informative change. She examined saturation of the sites and did not get a saturation curve, but concluded that all sites were probably saturated. She then looked at transversions vs. transitions. First and second positions together recovered only 953 groups. Third positions recovered 1300 groups. Transversion analysis recovered 706 groups. She reviewed which major groups were recognized in the various analyses. Louise Lewis noted that in their liverwort paper analysis of the third position gave the same result as analysis of all three positions. Mari also got the best results with analysis of all three positions.
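The partitioned comparisons described above (by codon position, and transversions only) can be illustrated with a short sketch. It is an assumption-laden example, not the procedure used in the rbcL study. It assumes the alignment is in frame and starts at a first codon position, counts a site as parsimony-informative when at least two states each occur in at least two taxa, and takes "transversion analysis" to mean recoding purines as R and pyrimidines as Y. All function names and the toy alignment are hypothetical.

```python
from collections import Counter


def codon_position_columns(alignment, position):
    """Keep only the columns at codon position 1, 2 or 3.

    Assumes the alignment is in frame and column 0 is a first position.
    """
    start = position - 1
    return {taxon: seq[start::3] for taxon, seq in alignment.items()}


def is_parsimony_informative(column, ignore="?-N"):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(c for c in column.upper() if c not in ignore)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative(alignment):
    """Number of parsimony-informative columns in an alignment."""
    seqs = list(alignment.values())
    return sum(is_parsimony_informative("".join(col)) for col in zip(*seqs))


def to_purine_pyrimidine(alignment):
    """Recode A/G as R and C/T/U as Y so that only transversions count."""
    table = str.maketrans("AGCTUagctu", "RRYYYRRYYY")
    return {taxon: seq.translate(table) for taxon, seq in alignment.items()}


# Toy in-frame alignment (invented data, two codons per taxon).
toy = {
    "taxon1": "ATGAAA",
    "taxon2": "ATGAAC",
    "taxon3": "ATGGAC",
    "taxon4": "ATGGAA",
}

all_sites = count_informative(toy)
by_position = {p: count_informative(codon_position_columns(toy, p)) for p in (1, 2, 3)}
transversions_only = count_informative(to_purine_pyrimidine(toy))
```

Each partition produced this way could then be passed to a separate jackknife analysis, and the number of supported groups compared across partitions.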
Analysis of the consistency index (CI) for 2500 taxa: 1st position, 0.155; 2nd position, 0.178; 3rd position, 0.046. Results for the Retention Index (RI) were just the opposite. Mari indicated that the RI for the third position increased with added taxa. Parsimony jackknifing is an efficient method for analyzing large data sets. The rbcL data are highly structured and provide a surprisingly well resolved tree. rbcL data did not answer all of our questions on green plant phylogeny. Adding more taxa did not lead to loss of resolution of well supported groups. In fact, resolution is often improved with more extensive sampling. Third positions contribute most of the structure. Saturated sites still provide structure. Frequency of change should not be used for weighting. There was some discussion of the support for the monocot group and the results from other studies. There was some problem with long branch attraction in these analyses. There was a general discussion of data analysis.
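For readers less familiar with the two indices, the sketch below shows how they are conventionally computed for a set of characters on a given tree: CI = m/s and RI = (g - s)/(g - m), where m is the minimum conceivable number of steps, s the number of steps observed on the tree, and g the maximum conceivable number of steps. It is an illustration only. The tree is a hypothetical nested tuple of taxon names, steps are counted with Fitch's method for unordered characters on a binary tree, and missing data are not handled.

```python
from collections import Counter


def fitch_steps(tree, states):
    """Number of state changes for one character on a binary tree.

    tree   : nested 2-tuples of taxon names, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping taxon name -> character state
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # no shared state: one change is required here
        return left | right

    visit(tree)
    return steps


def ci_ri(tree, alignment):
    """Ensemble consistency and retention indices for an alignment on a tree."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    m = s = g = 0
    for i in range(n_sites):
        states = {t: alignment[t][i] for t in taxa}
        counts = Counter(states.values())
        m += len(counts) - 1                 # minimum steps for this site
        g += len(taxa) - max(counts.values())  # maximum steps for this site
        s += fitch_steps(tree, states)       # steps observed on the tree
    ci = m / s if s else float("nan")
    ri = (g - s) / (g - m) if g != m else float("nan")
    return ci, ri


# Toy example (invented): site 1 fits the tree, site 2 is homoplastic.
tree = (("A", "B"), ("C", "D"))
alignment = {"A": "AA", "B": "AC", "C": "GC", "D": "GA"}
ci, ri = ci_ri(tree, alignment)
```

Because s grows with every extra change on the tree while m depends only on the number of states, the CI tends to fall as taxa are added, which is one reason the RI is often reported alongside it for large data sets.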
18S Example: Sean Graham presented some results on the analysis of 18S data together with a mass of chloroplast gene data (17 chloroplast genes). The 18S data do not help resolve some of the deeper branches in the tree.
General Discussion: There was a brief discussion of the agenda for the afternoon joint session with the analysis group. There were also some comments on the International Botanical Congress plans and whether there was a symposium on data analysis (analytical methods). There is no planned symposium, but individual symposia should include coverage within their presentations. There was some discussion of the importance of the coordination that has been developing through the GPPRCG and the fact that zoologists and others have noted this activity among the green plant biologists. It was suggested that after the International Botanical Congress we should plan to have a symposium on the analytical methods etc.
Challenge Data Sets: Paul Lewis and Ken Rice discussed some of the comments that arose during the morning session of the analytical group. The group had discussed using data sets from the GPPRCG as "challenge data sets" for them to work on. There was some discussion about what type of data sets would be of most use and what information the analytical people would want. There was some discussion of the available data sets and the holes in those data sets (including missing data, no parallel taxa [i.e., data from different species used to represent a given genus]). Some challenges should be specific - eight to ten major questions (e.g., what is the sister group to the angiosperms?). We should identify the "sore thumb" or eccentric taxa that do not seem to "fit" well in any clade (e.g., Ceratophyllum). The need for a good mailing list of the analytical/experimentalist researchers who would work on our data sets was noted. Paul and Ken will have a web site, and they would welcome our "challenge data sets" and our lists of special problem taxa etc.
Alignments: There was a question concerning whether these data sets were available as multiple alignments, and it was stated that some of the alignments are in pretty good shape (e.g., the rbcL or atpB data sets), but the rRNA alignment is itself a challenge. There was a general discussion of the alignment topic and the fact that optimizing the alignment is not what is of interest to some of the analytical researchers. Tandy Warnow indicated that with the huge data sets there may be absolutely no need to worry about the hard-to-align regions, so one can avoid the multiple alignment problems. But it is possible the analytical people could solve the multiple alignment problem completely. There was a question about the computational web site (when it would be up and who would be doing it).
Bot Blots and Culture Collections: Louise Lewis noted that some molecular biologists could use a "bot blot" list of ten plants that should be examined when they are trying to do a survey of representative green plants. This would be comparable to a typical "zoo blot" used to survey an array of animal organisms representative of the diversity of the animal world. It was noted that these ten plants would have to be available to investigators. The "zoo blots" are commercially available. Perhaps some green plant people could provide DNA for the ten taxa. Discussion turned to the need for culture collections for plants other than the algae (for which there are already culture collections). Michael Christianson at the University of Kansas wants to establish a bryophyte culture collection and the GPPRCG is supportive of this proposal. Liz Zimmer could help provide some of the angiosperm DNAs for the "bot blots".
Concluding Remarks: Future activities must include getting more of the participants to contribute to the web site. Ken Rice would like to collaborate with someone on a large data set. The web site could list requests like Ken's. Model-intensive approaches to data analysis were discussed at lunch and considered an appropriate topic for future discussion. Tandy Warnow would like, in one year, to work on the multiple alignment problem and collaborate (but not right now, since she does not want to over-extend at this time). Brent Mishler noted that he would have the graduate student working on the GPPRCG web page add the results from those who worked on the 232 taxon data set. Brent Mishler will send out the data set to other people on the DIMACS workshop email list (originally he sent the set only to the green plant people) on his return to Berkeley. He will include the email address of the graduate student assistant. Tandy Warnow will ask Sandy Barbu to send the updated list of participants. Brent Mishler will try to develop the list of ten challenge questions himself. Mari noted there are approximately 3,000 rbcL sequences in GenBank for green plants and algae. These sequences are about 1,400 bases long. The alignment is no problem if limited to green plants. We could post a NEXUS file. Brent Mishler indicated he will put the rbcL database on the web as soon as possible. There was a general discussion of the availability of large data sets such as the rRNA sequences on the Belgian site. It was noted that we would have to let people know about our ten challenge questions. Tandy Warnow noted that part of DIMACS includes a computational challenge.
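As an illustration of what posting a NEXUS file would involve, the following sketch writes an aligned data set as a minimal NEXUS DATA block. It is not the group's actual file or tooling. The function name and the toy sequences are hypothetical, and the sketch assumes taxon names contain no whitespace or punctuation (such names would need quoting in NEXUS).

```python
def to_nexus(alignment, datatype="DNA", missing="?", gap="-"):
    """Return a minimal NEXUS DATA block for an aligned set of sequences.

    alignment : dict mapping taxon name -> aligned sequence (equal lengths)
    Assumes taxon names need no quoting (no spaces or punctuation).
    """
    taxa = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(t) for t in taxa)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING={missing} GAP={gap};",
        "  MATRIX",
    ]
    # One row per taxon: padded name, then the aligned sequence.
    lines += [f"    {t.ljust(width)}  {alignment[t]}" for t in taxa]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Toy example with invented sequences (not real rbcL data).
example = to_nexus({"Taxon_one": "ATGTCACC", "Taxon_two": "ATG-CACC"})
```

A file produced this way carries the taxon count, the alignment length, and the missing-data and gap symbols in its header, which is the information analytical collaborators would need to load the data set.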