NSF Proposal - 6. Databasing and Analysis

Many powerful resources for genomic sequence data (e.g., GenBank, EMBL, SWISS-PROT, and PIR) are archival in that they do not provide integrated access to the tools needed for phylogenetic and comparative analysis. The proposed bioinformatics infrastructure will support an expertly curated set of data (organellar genome sequences, genomic structure, and ultrastructural and morphological data) and integrate these data with the analytical tools needed for phylogenetic and comparative analysis. Thus we propose to create not only a database, but also a data laboratory.

Character database enhancements. Several databases, notably GOBASE [124] and MitBASE, focus on organelle genomic data. These resources use a relatively simple set of tables to display published sequences, gene locations, protein sequences, and genetic maps. A simple query interface allows data retrieval based on gene and protein names, exon and intron definitions, and taxonomy. GOBASE defines a standard nomenclature for mitochondrial genes, but none exists for chloroplast genes and gene products. We will extend this database structure to include phylogenetically important structural changes such as insertion/deletion regions, inversions, and duplications. The most straightforward way to implement this is to compare each chloroplast genome to a virtual standard genome; a pairwise comparison can then be generated simply by comparing each of the two genomes in question to the standard genome. We will also make several enhancements to GOBASE to improve its search and referencing abilities.
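The standard-genome strategy can be illustrated with a minimal sketch. It assumes that each genome's structural changes are stored as a set of events scored in the coordinates of the virtual standard genome; the `Change` record, the `pairwise_comparison` function, and the example coordinates are hypothetical and are not part of GOBASE or of any existing schema.

```python
from typing import Dict, FrozenSet, NamedTuple


class Change(NamedTuple):
    """One structural change, scored against the virtual standard genome."""
    kind: str   # e.g. "insertion", "deletion", "inversion", "duplication"
    start: int  # start coordinate on the standard genome
    end: int    # end coordinate on the standard genome


def pairwise_comparison(
    a: FrozenSet[Change], b: FrozenSet[Change]
) -> Dict[str, FrozenSet[Change]]:
    """Derive a pairwise comparison from two genome-vs-standard comparisons.

    Changes found in only one genome distinguish the pair; changes found
    in both are shared relative to the standard.
    """
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Hypothetical example data, for illustration only.
genome_a = frozenset({
    Change("inversion", 1000, 5000),
    Change("deletion", 8000, 8200),
})
genome_b = frozenset({
    Change("inversion", 1000, 5000),
    Change("duplication", 12000, 12500),
})

result = pairwise_comparison(genome_a, genome_b)
```

Under these assumptions, n genomes require only n comparisons to the standard rather than n(n-1)/2 direct pairwise comparisons. The sketch treats two events as identical only when their type and coordinates match exactly; overlapping or nested events would need additional handling.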

Annotation and alignment. The sequencing group will annotate single genomes for database deposition using the beta test versions of "Mitotater" and "Plastotater". PipMaker and MultiPipMaker will be used to identify a wide variety of structural changes that have occurred during plastid genome evolution and to generate multiple sequence alignments for downstream phylogenetic and molecular evolutionary analysis. PipMaker is a flexible program for visualization and evolutionary analysis of whole genome sequences, and is ideally suited to our bioinformatics needs. The PipMaker approach is based on percent identity plots (PIPs), which are linear representations of high-scoring regions found with genomic-scale dot matrix analyses. This approach efficiently aligns very large (megabase-scale) sequences. Alignments are fast and use a series of BLAST programs [125, 126]. PIP output is compact, which allows rapid identification of genes and other homologous sequences [127-129], repeat elements and their classification [130], and structural features, and reveals evolutionarily conserved promoter and regulatory elements [127] regardless of their linear order in the genome [131, 132]. The PipMaker website provides various tools that aid genome annotation and visual presentation of the results. MultiPipMaker, a recent expansion of PipMaker, allows simultaneous alignment of =100 genomes using a new multiple alignment algorithm (Miller, unpub.). In addition to generating compact summary maps, MultiPipMaker can export aligned sequences for downstream phylogenetic and molecular evolutionary analysis.
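The idea behind a percent identity plot can be shown with a simplified sketch. It is not PipMaker's implementation: it assumes a pairwise alignment is already available as two gapped strings of equal length, and it reports each gap-free segment by its position in the first (reference) sequence and its percent identity. The function name `pip_segments` and the example alignment are hypothetical.

```python
from typing import List, Tuple


def pip_segments(aligned_ref: str, aligned_other: str) -> List[Tuple[int, int, float]]:
    """Return (ref_start, ref_end, percent_identity) per gap-free segment.

    Coordinates are 1-based positions in the ungapped reference sequence.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")

    segments: List[Tuple[int, int, float]] = []
    ref_pos = 0    # reference bases consumed so far
    seg_start = 0  # reference position where the current segment began
    matches = 0    # identical columns in the current segment
    length = 0     # columns in the current segment

    def close_segment() -> None:
        # Record the current segment, if any, as one horizontal bar of the plot.
        if length:
            identity = 100.0 * matches / length
            segments.append((seg_start, seg_start + length - 1, identity))

    for r, o in zip(aligned_ref, aligned_other):
        if r != "-":
            ref_pos += 1
        if r == "-" or o == "-":
            # A gap in either sequence ends the current gap-free segment.
            close_segment()
            matches, length = 0, 0
            continue
        if length == 0:
            seg_start = ref_pos
        length += 1
        if r.upper() == o.upper():
            matches += 1

    close_segment()
    return segments


# Hypothetical toy alignment, for illustration only.
segments = pip_segments("ACGTAC--GTAC", "ACGAACTTG-AC")
```

Each returned tuple corresponds to one horizontal bar in the plot: its extent along the reference sequence gives the x-range, and its percent identity gives the height.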
