NSF Proposal - 6. Databasing and Analysis
Many powerful resources for genomic sequence data (e.g., GenBank,
EMBL, SWISS-PROT, PIR) are primarily archival: they do not provide
integrated access to the tools needed for phylogenetic and comparative
analysis. The proposed bioinformatics effort will support an
expertly curated set of data (organellar genome sequences, genomic
structure, ultrastructural and morphological data) and integrate
these data with the analytical tools needed for phylogenetic and
comparative analysis. Thus we propose to create not only a database,
but also a data laboratory.
Character database enhancements. Several databases, notably GOBASE
[124] and MitBASE,
focus on organelle genomic data. These resources use a relatively
simple set of tables to display published sequence, gene location,
protein sequences, and genetic maps. A simple query interface allows
data retrieval based on gene and protein names, exon and intron
definitions, and taxonomy. GOBASE defines a standard nomenclature
for mitochondrial genes, but none exists for chloroplast genes and
gene products. We will extend the above database structure to include
phylogenetically important structural changes such as insertion/deletion
regions, inversions, and duplications. The most straightforward
way to implement this is to compare each chloroplast genome to a
virtual standard genome; a pairwise comparison can then be generated
simply by comparing each of the two genomes in question to the standard
genome. We will make several enhancements to GOBASE to improve its
search and referencing abilities.
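
The standard-genome approach described above can be illustrated with a minimal sketch. It assumes that each genome's structural changes are stored as records in the coordinates of the standard genome; the record fields, the function name, and the example data are hypothetical and do not reflect the GOBASE schema. Under that assumption, a pairwise comparison reduces to finding the changes that only one of the two genomes carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralChange:
    """One structural difference relative to the (hypothetical) standard genome."""
    kind: str    # e.g. "insertion", "deletion", "inversion", "duplication"
    start: int   # start coordinate on the standard genome
    end: int     # end coordinate on the standard genome


def pairwise_differences(changes_a, changes_b):
    """Derive differences between two genomes from their comparisons to the standard.

    A change shared by both genomes (same kind, same standard coordinates)
    cancels out; a change present in only one genome distinguishes the pair.
    """
    only_a = set(changes_a) - set(changes_b)
    only_b = set(changes_b) - set(changes_a)

    def key(change):
        return (change.start, change.end, change.kind)

    return sorted(only_a, key=key), sorted(only_b, key=key)


# Invented example data, not real genome annotations.
genome_a = [StructuralChange("inversion", 1000, 5000),
            StructuralChange("deletion", 8000, 8200)]
genome_b = [StructuralChange("inversion", 1000, 5000),
            StructuralChange("duplication", 12000, 12500)]

only_a, only_b = pairwise_differences(genome_a, genome_b)
print("Only in genome A:", only_a)
print("Only in genome B:", only_b)
```

With this layout each genome needs to be compared to the standard only once, and any pair can be contrasted without a new genome-to-genome alignment. Changes that overlap but differ in their exact coordinates would need additional handling that this sketch omits.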
Annotation and alignment. The
sequencing group will annotate single genomes for database deposition
using the beta test versions of "Mitotater" and "Plastotater".
PipMaker and MultiPipMaker will be used to identify a wide variety
of structural changes that have occurred during plastid genome evolution
and to generate multiple sequence alignments for downstream phylogenetic
and molecular evolutionary analysis. PipMaker
is a flexible program for visualization and evolutionary analysis
of whole genome sequences, and is ideally suited to our bioinformatics
needs. The PipMaker approach is based on percent identity plots
(PIPs), which are linear representations of the high-scoring regions
found with genomic-scale dot matrix analyses. This approach efficiently
aligns very large (Mbp-scale) sequences. Alignments are fast and use
a series of BLAST programs [125, 126]. PIP output is compacted
to allow rapid identification of genes and other homologous sequences
[127-129], repeat elements and their classification [130] and structural
features, and to reveal evolutionarily conserved promoter and regulatory
elements [127] regardless of their linear order in the genome [131,
132]. The website provides various tools that aid genome annotation
and visual presentation of the results. MultiPipMaker, a recent
expansion of PipMaker, allows simultaneous alignment of =100 genomes
using a new multiple alignment algorithm (Miller, unpub.). In addition
to generating compact summary maps, aligned sequences can be exported
for downstream phylogenetic and molecular evolutionary analysis.
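
The idea behind a percent identity plot can be shown with a small sketch. This is not PipMaker's implementation: it assumes the local alignments have already been computed and cut into gap-free segments, and the function names, the `min_identity` cutoff, and the example sequences are hypothetical. Each segment becomes one row giving its position on the first sequence and its percent identity, which is the information a PIP draws as a horizontal bar.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of columns in a gap-free aligned segment with identical bases."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("segments must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return 100.0 * matches / len(aligned_a)


def pip_rows(segments, min_identity=50.0):
    """Turn gap-free aligned segments into (start, end, percent identity) rows.

    `segments` is a list of (start_on_first_sequence, aligned_a, aligned_b).
    `min_identity` is an illustrative cutoff chosen for this sketch.
    """
    rows = []
    for start, seg_a, seg_b in segments:
        pct = percent_identity(seg_a, seg_b)
        if pct >= min_identity:
            rows.append((start, start + len(seg_a) - 1, pct))
    return sorted(rows)


# Invented example segments, not real alignment output.
segments = [
    (101, "ACGTACGTAC", "ACGTACGTAC"),  # identical
    (501, "ACGTACGTAC", "ACGAACGTTC"),  # two mismatches
    (901, "AAAAAAAAAA", "CCCCCCCCAA"),  # eight mismatches, below the cutoff
]

rows = pip_rows(segments)
for start, end, pct in rows:
    print(f"{start}-{end}: {pct:.1f}% identity")
```

Plotting position on the horizontal axis against percent identity on the vertical axis gives a compact linear summary of where two sequences are conserved.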