D. DEPARTMENT OF POPULATION GENETICS
D-b. Division of Evolutionary Genetics (Until May) - Toshimichi Ikemura Group

RESEARCH ACTIVITIES

(1) Phylogenetic classification of environmental DNA sequences including those from the Sargasso Sea and of rDNA sequences with Self-Organizing Map (SOM)

Takashi Abe, Jian-Ping Song and Toshimichi Ikemura

--The self-organizing map (SOM) is a powerful tool for clustering and visualizing high-dimensional complex data on a two-dimensional plane. Oligonucleotide frequency is an example of high-dimensional data, and we developed an SOM as a novel bioinformatics strategy to capture and visualize the phylogenetic diversity of a wide variety of microbial genomic sequences obtained from an environmental sample. First, we constructed SOMs for tetranucleotide frequencies in approximately 200,000 5-kb sequence fragments obtained from 1,500 prokaryotes for which at least 10 kb of genomic sequence has been deposited in DDBJ/EMBL/GenBank (a total of 1.05 Gb). The sequences could be classified primarily according to 25 phylogenetic groups without information regarding the species. The classification was possible without orthologous sequence sets and is therefore especially useful for phylogenetic classification of novel sequences from poorly characterized species in environmental and clinical samples. We used the SOM method to classify 810,000 sequences recently reported by Venter et al. for pooled DNA samples from the Sargasso Sea near Bermuda. The phylogenetic diversity and novelty of the Sargasso Sea sequences were visualized on a single map, and sequences that were derived from a single genome but cloned independently could be reconstructed in silico.
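The input to such a map is one frequency vector per sequence fragment. The following is a minimal sketch of that feature-extraction step, assuming single-strand counting of overlapping tetranucleotides and non-overlapping 5-kb fragments; the function names (`kmer_index`, `oligo_frequencies`, `fragments`) are hypothetical, and the report does not state how ambiguous bases or complementary strands were handled.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over A, C, G, T to a column index (4**k columns)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def oligo_frequencies(seq, k=4):
    """Return the relative frequency of each overlapping k-mer in seq.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped; this is an assumption of the sketch.
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    # Dividing by the number of counted windows makes vectors comparable
    # across fragments of different lengths.
    return [c / total for c in counts]


def fragments(seq, size=5000):
    """Cut seq into non-overlapping pieces of exactly `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
```

Calling `oligo_frequencies(fragment)` on each element of `fragments(genome)` yields one 256-dimensional vector per 5-kb fragment, which can then be passed to any SOM implementation.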
--Phylogenetic classification of genomic sequence fragments from a groundwater metagenome library and a human gut library was conducted using the tetranucleotide SOM. (For details, see Nature Biotechnology, 2005; Canadian Journal of Microbiology, 2005.)
--Because 16S rDNA sequences have been highly conserved during evolution, they have been used for detailed phylogenetic classification of prokaryotic species, including uncultured prokaryotes. Approximately 20,000 16S rDNA sequences longer than 1 kb from 6,100 known prokaryotic species have been compiled in DDBJ/EMBL/GenBank. We constructed a tetranucleotide SOM with these 16S rDNA sequences after normalization for sequence length. Clear clustering according to phylogenetic group was observed; 97% of sequences were classified into the correct group territory on the tetranucleotide SOM. The finding that the proportion of 16S rDNA sequences classified into the correct group territory was higher than that of genomic sequences may indicate that horizontal transfer of rDNAs, if it occurs, is less frequent than that of other genome portions. A combination of SOMs for genomic and 16S rDNA sequences will provide a tool for detailed phylogenetic studies of genomic sequence fragments from uncultured environmental prokaryotes.
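The notions of a group territory and of the proportion of sequences falling into the correct territory can be illustrated with a small sketch. It assumes a textbook SOM with random initialization and sequential updates, which is a swapped-in simplification and not necessarily the procedure used for the maps described above. It also assumes that a lattice node's territory is defined by majority vote among the labeled sequences mapped to it. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def best_matching_unit(weights, vec):
    """Return the lattice position whose weight vector is closest to vec."""
    best, best_d = None, float("inf")
    for pos, w in weights.items():
        d = sum((a - b) ** 2 for a, b in zip(w, vec))
        if d < best_d:
            best, best_d = pos, d
    return best


def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, seed=0):
    """Train a rows x cols SOM with sequential updates (simplified sketch)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initialization is an assumption made for brevity.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    order = list(range(len(data)))
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        rng.shuffle(order)
        for i in order:
            vec = data[i]
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = lr * math.exp(-dist2 / (2.0 * sigma ** 2))
                for j in range(dim):
                    w[j] += h * (vec[j] - w[j])
    return weights


def node_territories(weights, data, labels):
    """Label each occupied node with the majority group of its sequences."""
    votes = defaultdict(Counter)
    for vec, lab in zip(data, labels):
        votes[best_matching_unit(weights, vec)][lab] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in votes.items()}


def hit_rate(weights, territories, data, labels):
    """Fraction of sequences whose node lies in their own group's territory."""
    hits = sum(territories.get(best_matching_unit(weights, v)) == lab
               for v, lab in zip(data, labels))
    return hits / len(data)
```

Here `data` would be a list of length-normalized tetranucleotide frequency vectors and `labels` the known phylogenetic group of each sequence. A figure such as the 97% above corresponds to the value returned by `hit_rate` under this majority-vote definition of a territory.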

(2) SOM classification of mammalian genomic and cDNA sequences according to function

Takashi Abe, Yoko Kosaka and Toshimichi Ikemura

--In addition to protein-coding sequences (CDSs), 5' and 3' untranslated regions (UTRs) and transcription regulatory regions of eukaryotic genes have attracted wide attention because of their crucial roles in transcriptional and post-transcriptional regulation of gene activity. We constructed SOMs of tri- and tetranucleotide frequencies in all 1-kb sequences derived from the human or mouse genome. When sequences of 5' and 3' UTRs, CDSs, and introns, as well as 1-kb regions upstream of transcriptional start sites, were mapped on these SOMs, a major portion of the sequences clustered primarily according to functional category. This showed that the SOM could detect sequence characteristics specific to the distinct functional categories. Importantly, the territory of each functional category was divided into multiple zones. Furthermore, when we constructed tetra- and pentanucleotide SOMs for human and mouse cDNA sequences, protein-coding and noncoding cDNA sequences tended to be separated from each other. Because no information other than oligonucleotide frequencies is required for map generation, the SOM is a novel in silico method useful for identifying characteristic and diagnostic sequences for individual functional categories. Sequences of unknown function that colocalize in a zone where sequences of known function are clustered can be presumed to have similar functions.

(3) Compilation of "GenomeWordDictionary"

Takashi Abe, Yoko Kosaka, Kiyomi Kita and Toshimichi Ikemura

--To understand the biological meaning of the characteristic oligonucleotide sequences identified by SOM, it is important to refer systematically to the literature on experimental studies of those oligonucleotide sequences. For genomes on which experimental studies are well advanced, many signal and motif sequences with functional activity are known. By referring to the SOM data, signal and motif sequences that have not yet been experimentally identified may be predicted. Furthermore, the SOM data may make it possible to explore signal and motif candidates in genomes for which experimental studies have not advanced. It should also be noted that the large amount of information accumulated in the course of this literature search itself constitutes a significant and valuable data set. We have compiled a new database called "GenomeWordDictionary", a collection of the results of searching for papers that describe experimental findings on each oligonucleotide sequence. In the dictionary, oligonucleotides composed of the four letters A, T, G, and C are arranged in alphabetical order, and the abstracts of the papers reporting the experimental results are compiled. Because we created the database using the Oracle and Postgres relational database systems, we can extract a dictionary for a particular phylogenetic group, e.g., Rice GenomeWordDictionary and Fly GenomeWordDictionary.
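Extracting a group-specific dictionary from a relational store can be sketched as follows. The sketch uses Python's built-in `sqlite3` module as a stand-in for the Oracle and Postgres systems named above. The table name `word_entry`, its columns, and both function names are hypothetical, since the actual schema is not described here. The abstracts in any example data should be read as placeholders.

```python
import sqlite3


def build_dictionary(entries):
    """Create an in-memory database from (oligo, organism_group, abstract) rows.

    The schema is an assumption of this sketch, not the real database layout.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE word_entry ("
        " oligo TEXT NOT NULL,"
        " organism_group TEXT NOT NULL,"
        " abstract TEXT NOT NULL)"
    )
    con.executemany("INSERT INTO word_entry VALUES (?, ?, ?)", entries)
    con.commit()
    return con


def extract_dictionary(con, organism_group):
    """Return (oligo, abstract) pairs for one group, in alphabetical order."""
    cur = con.execute(
        "SELECT oligo, abstract FROM word_entry"
        " WHERE organism_group = ?"
        " ORDER BY oligo, abstract",
        (organism_group,),
    )
    return cur.fetchall()
```

With rows tagged by organism group, a call such as `extract_dictionary(con, "rice")` returns only the entries for that group, sorted by oligonucleotide, which mirrors the idea of deriving a Rice GenomeWordDictionary from the full collection.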

PUBLICATIONS

Papers
1. Uchiyama, T., Abe, T., Ikemura, T. and Watanabe, K. (2005). Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes, Nature Biotechnology, 23, 88-93.
2. Abe, T., Kanaya, S., Kinouchi, M. and Ikemura, T. (2004). Genome informatics for unveiling hidden genome signatures. Proceedings of the Institute of Statistical Mathematics, 52, 207-215.
3. Abe, T., Ikemura, T., Kanaya, S., Kinouchi, M. and Sugawara, H. (2004). Novel genome informatics for unveiling hidden signatures in genome sequences: self-organizing map (SOM) of oligonucleotide frequencies, Proceedings of Information-Based Induction Sciences, 94-99.
4. Hayashi, H., Abe, T., Sakamoto, M., Ohara, H., Ikemura, T., Sakka, K. and Benno, Y. (2005). Direct cloning of genes encoding novel xylanases from human gut, Canadian Journal of Microbiology (in press).

Journal Editor
5. GENE (Elsevier)
6. DNA Sequence (Harwood Academic Publishers)

ORAL PRESENTATIONS

1) Abe, T., Kanaya, S., Kinouchi, M., Kosaka, Y. and Ikemura, T., "Novel bioinformatics for unveiling hidden characteristics in genome sequences and searching in silico for genetic signal sequences", The 8th World Multi-Conference on Systemics, Cybernetics and Informatics (Orlando, USA), July, 2004.
2) Nishio, H., Abe, T., Ogasawara, N., Ikemura, T. and Kanaya, S., "Gene classification based on expression profile using BL-SOM: Suitability assessment of multivariate gene expression data to spherical and plain SOM by N-measure", The 8th World Multi-Conference on Systemics, Cybernetics and Informatics (Orlando, USA), July, 2004.
3) Abe, T., Ikemura, T., Kozuki, T., Nakagawa, S., Kinouchi, M., Kanaya, S. and Sugawara, H., "A novel bioinformatics approach for genome analyses of environmental samples on the basis of self-organizing map (SOM)", 16th International Genome Sequencing & Analysis (Washington, DC), September, 2004.