A graph theoretical approach for comparative
analysis of genomic structures and biochemical
pathways
Hiroyuki Ogata and Minoru Kanehisa
We introduce and discuss a new computational
approach towards prediction and inference of
biological functions from genomic sequences by
making use of the pathway data in KEGG
(http://www.genome.ad.jp/kegg/). It is based on
a concept of graph, where a node is a gene or a
gene product and an edge is a link or a
relationship between genes or gene products.
Different kinds of links make different
networks or graphs. For example, a genome is
seen as a set of genes that are one-
dimensionally linked, so it is represented by a
linear graph. A set of interacting gene
products in a biochemical pathway is another
type of graph. Relationships of proteins by
sequence similarity can also be represented as
a complicated graph.
In order to obtain functional clues of
ORFs in the complete genome, it is customary to
perform homology search to identify relatedness
of sequence similarity. However, this search
alone leaves, at least, one third to one half
of the genes as hypothetical. By incorporating
additional information, such as relatedness on
the genome or in the biochemical pathway, it is
expected to uncover functions of more genes as
well as to verify functions that are already
assigned. To this end we have recently
developed a new algorithm for extraction of
correlated clusters among these graphs. Based
on graph theory and a hierarchical clustering
algorithm, it detects local clusters of
corresponding nodes that represent links of
genes and/or gene products. A wide variety of
utility of the method is demonstrated in the
following examples.
When the method is used for a comparison
of a genome versus another genome with
correspondence information of sequence
similarity, it extracts pairs of orthologous
gene clusters. While it is well known that
global arrangement of orthologous gene clusters
on the genome can be highly shuffled between
two bacterial lineage, we also observe gene
shuffling events by translocations, inversions,
insertions and deletions within some of the
orthologous gene clusters. Practically, the
extraction of orthologous gene clusters gives
important clues for identification of orthologs
that have been missed by simple homology
searches.
When the method is used for a comparison
of a genome with a set of known biochemical
pathways, it extracts gene clusters that play
their roles at close positions on the
biochemical pathways. For example, our analysis
on metabolic pathways showed that most of the
gene clusters in E. coli corresponded to enzyme
operons. By comapring each known genome against
metabolic pathways we observed many common gene
clusters that were conserved among multiple
organisms, as well as many organism-specific
gene clusters. This type of analysis would be
used for reconstructing and characterizing
functional units of each organism with
different physiology.
We believe comparative analysis of
networks of biological entities at these levels
of abstraction would be fruitful for basic
understanding of biological system and its
evolution, as well as for development of
practical tools such as for automatic
annotation of gene functions.
Glaxo Wellcome Symposium on Bioinformatics'98
(7th May, 1998, Nippon Glaxo Ltd. Tsukuba Research Laboratories)