A graph theoretical approach for comparative analysis of genomic structures and biochemical pathways

Hiroyuki Ogata and Minoru Kanehisa

We introduce and discuss a new computational approach towards prediction and inference of biological functions from genomic sequences by making use of the pathway data in KEGG (http://www.genome.ad.jp/kegg/). It is based on a concept of graph, where a node is a gene or a gene product and an edge is a link or a relationship between genes or gene products. Different kinds of links make different networks or graphs. For example, a genome is seen as a set of genes that are one- dimensionally linked, so it is represented by a linear graph. A set of interacting gene products in a biochemical pathway is another type of graph. Relationships of proteins by sequence similarity can also be represented as a complicated graph.

In order to obtain functional clues of ORFs in the complete genome, it is customary to perform homology search to identify relatedness of sequence similarity. However, this search alone leaves, at least, one third to one half of the genes as hypothetical. By incorporating additional information, such as relatedness on the genome or in the biochemical pathway, it is expected to uncover functions of more genes as well as to verify functions that are already assigned. To this end we have recently developed a new algorithm for extraction of correlated clusters among these graphs. Based on graph theory and a hierarchical clustering algorithm, it detects local clusters of corresponding nodes that represent links of genes and/or gene products. A wide variety of utility of the method is demonstrated in the following examples.

When the method is used for a comparison of a genome versus another genome with correspondence information of sequence similarity, it extracts pairs of orthologous gene clusters. While it is well known that global arrangement of orthologous gene clusters on the genome can be highly shuffled between two bacterial lineage, we also observe gene shuffling events by translocations, inversions, insertions and deletions within some of the orthologous gene clusters. Practically, the extraction of orthologous gene clusters gives important clues for identification of orthologs that have been missed by simple homology searches.

When the method is used for a comparison of a genome with a set of known biochemical pathways, it extracts gene clusters that play their roles at close positions on the biochemical pathways. For example, our analysis on metabolic pathways showed that most of the gene clusters in E. coli corresponded to enzyme operons. By comapring each known genome against metabolic pathways we observed many common gene clusters that were conserved among multiple organisms, as well as many organism-specific gene clusters. This type of analysis would be used for reconstructing and characterizing functional units of each organism with different physiology.

We believe comparative analysis of networks of biological entities at these levels of abstraction would be fruitful for basic understanding of biological system and its evolution, as well as for development of practical tools such as for automatic annotation of gene functions.
Glaxo Wellcome Symposium on Bioinformatics'98
(7th May, 1998, Nippon Glaxo Ltd. Tsukuba Research Laboratories)