D. DEPARTMENT OF POPULATION GENETICS
D-b. Division of Evolutionary Genetics (Until May)
- Toshimichi Ikemura Group
RESEARCH ACTIVITIES
(1) Phylogenetic classification of environmental DNA
sequences, including those from the Sargasso Sea, and
of rDNA sequences with a Self-Organizing Map (SOM)
Takashi Abe, Jian-Ping Song and Toshimichi
Ikemura
--The
self-organizing map (SOM) is a powerful tool for
clustering and visualizing high-dimensional complex
data on a two-dimensional plane. Oligonucleotide
frequency is an example of high-dimensional data,
and we developed an SOM as a novel bioinformatics
strategy to capture and visualize phylogenetic
diversity of a wide variety of microbial genomic
sequences obtained from an environmental sample.
First, we constructed SOMs for tetranucleotide
frequencies in approximately 200,000 5-kb sequence
fragments obtained from 1,500 prokaryotes for which
at least 10 kb of genomic sequence has been
deposited in DDBJ/EMBL/GenBank (a total of 1.05
Gb). The sequences could be classified primarily
according to 25 phylogenetic groups without
information regarding the species. The
classification was possible without orthologous
sequence sets and is therefore especially useful
for phylogenetic classification of novel sequences
from poorly characterized species in environmental
and clinical samples. We used the SOM method to
classify 810,000 sequences recently reported by
Venter et al. for pooled DNA samples from
the Sargasso Sea near Bermuda. Phylogenetic
diversity and novelty of the Sargasso Sea sequences
were visualized on a single map, and sequences that
were derived from a single genome but cloned
independently could be reconstructed in
silico.
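The input to the tetranucleotide SOM described above can be sketched as follows. This is a minimal illustration, not the group's actual pipeline: the function names, the non-overlapping fragmentation, and the decision to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from itertools import product

# All 256 tetranucleotides in alphabetical order (ordering is an assumption).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {word: i for i, word in enumerate(TETRAMERS)}


def split_fragments(sequence, size=5000):
    """Cut a sequence into non-overlapping fragments of `size` bases.

    A tail shorter than `size` is dropped (assumed behaviour).
    """
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, size)]


def tetranucleotide_frequencies(fragment):
    """Return a 256-element list of relative tetranucleotide frequencies."""
    counts = [0] * len(TETRAMERS)
    fragment = fragment.upper()
    for i in range(len(fragment) - 3):
        idx = INDEX.get(fragment[i:i + 4])
        if idx is not None:  # skip windows with ambiguous bases such as N
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in counts]
```

Each 5-kb fragment then becomes one 256-dimensional vector, which is the kind of high-dimensional data an SOM is designed to arrange on a two-dimensional plane.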
--Phylogenetic
classification of genomic sequence fragments from a
groundwater metagenome library and a human gut
library was conducted using the tetranucleotide
SOM. (For details, see Nature Biotechnology,
2005; Canadian Journal of Microbiology,
2005.)
--Because 16S rDNA
sequences have been highly conserved during
evolution, they have been used for detailed
phylogenetic classification of prokaryotic species,
including uncultured prokaryotes. Approximately
20,000 16S rDNA sequences longer than 1 kb from
6,100 known prokaryotic species have been compiled
in DDBJ/EMBL/GenBank. We constructed a
tetranucleotide SOM with these 16S rDNA sequences
after normalization for the sequence lengths. Clear
clustering according to phylogenetic group was
observed; 97% of sequences were classified into the
correct group territory on the tetranucleotide SOM.
The finding that the rate of 16S rDNA
classification into the correct group territory was
higher than that of genomic sequences may indicate
that horizontal transfer of rDNAs, if it occurs at
all, is less frequent than that of other genome
portions. Combining SOMs for genomic
and 16S rDNA sequences will provide a tool for
detailed phylogenetic studies of genomic sequence
fragments from environmental uncultured
prokaryotes.
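The idea of training a map on length-normalized frequency vectors and then asking what fraction of sequences fall into the correct territory can be sketched as below. This is a generic batch-style SOM written with NumPy under simplifying assumptions (random initialization from the data, a Gaussian neighbourhood, a majority-label definition of "correct territory"); it is not the SOM implementation used in the work described here, and every function name is hypothetical.

```python
import numpy as np


def best_matching_units(data, weights):
    """Index of the closest node (Euclidean distance) for every input vector."""
    data = np.asarray(data, dtype=float)
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def train_som(data, rows=4, cols=4, epochs=30, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Grid coordinates of every node, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    # Initialise weights from randomly chosen inputs (simplifying assumption).
    weights = data[rng.integers(0, len(data), size=rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Shrink the neighbourhood radius linearly from sigma0 to 0.5.
        sigma = sigma0 + (0.5 - sigma0) * epoch / max(epochs - 1, 1)
        bmu = best_matching_units(data, weights)
        # Squared grid distance between every node and each sample's BMU.
        d2 = ((grid[:, None, :] - grid[bmu][None, :, :]) ** 2).sum(axis=2)
        h = np.exp(-d2 / (2.0 * sigma ** 2))  # shape (nodes, samples)
        # Batch update: each node becomes a neighbourhood-weighted mean.
        weights = (h @ data) / h.sum(axis=1, keepdims=True)
    return weights


def hit_rate(bmu, labels):
    """Fraction of inputs whose node's majority label equals their own label."""
    labels = np.asarray(labels)
    hits = 0
    for node in np.unique(bmu):
        _, counts = np.unique(labels[bmu == node], return_counts=True)
        hits += counts.max()
    return hits / len(labels)
```

In this sketch, "normalization for the sequence lengths" corresponds to feeding relative frequencies (counts divided by the number of windows) rather than raw counts, so that sequences of different lengths are comparable.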
(2) SOM classification of mammalian genomic and cDNA
sequences according to function
Takashi Abe, Yoko Kosaka and Toshimichi
Ikemura
--In addition to
protein-coding sequences (CDSs), 5' and 3'
untranslated regions (UTRs) and transcription
regulatory regions of eukaryotic genes have
attracted wide attention because of their crucial
roles in transcriptional and post-transcriptional
regulation of gene activity. We constructed SOMs of
tri- and tetranucleotide frequencies in all 1-kb
sequences derived from the human or mouse genome. When
sequences of 5' and 3' UTRs, CDSs, and introns, as
well as 1-kb upstream regions from transcriptional
start sites, were mapped on these SOMs, a major
portion of the sequences were clustered primarily
according to the functional categories. This showed
that SOM could detect sequence characteristics
specific to the distinct functional categories.
Importantly, the territory of each functional
category was divided into multiple zones.
Furthermore, when we constructed tetra- and
pentanucleotide SOMs for human and mouse cDNA
sequences, protein-coding and -noncoding cDNA
sequences tended to be separated from each other.
Because no information other than oligonucleotide
frequencies is required for the map generation, SOM
is a novel in silico method useful for
identifying characteristic and diagnostic sequences
for individual functional categories.
Function-unknown sequences that colocalize in a zone
where sequences of known function are clustered
can be presumed to have similar
functions.
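The inference in the last sentence can be sketched as a simple label transfer. The sketch assumes that each sequence has already been assigned to a map node and that a node counts as belonging to a category only when every annotated sequence on it agrees; both the rule and the function names are illustrative assumptions, not the procedure used in the study.

```python
from collections import defaultdict


def label_nodes(node_of, category_of):
    """Label a node with a category only if all annotated sequences on it agree.

    node_of:     dict mapping sequence id -> node index on the map
    category_of: dict mapping annotated sequence id -> functional category
    """
    seen = defaultdict(set)
    for seq_id, category in category_of.items():
        seen[node_of[seq_id]].add(category)
    return {node: next(iter(cats))
            for node, cats in seen.items() if len(cats) == 1}


def predict_unknown(node_of, category_of):
    """Suggest a category for unannotated sequences on single-category nodes."""
    node_label = label_nodes(node_of, category_of)
    return {
        seq_id: node_label[node]
        for seq_id, node in node_of.items()
        if seq_id not in category_of and node in node_label
    }
```

Sequences that land on mixed or empty nodes receive no suggestion, which keeps the prediction as tentative as the text implies.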
(3) Compilation of "GenomeWordDictionary"
Takashi Abe, Yoko Kosaka, Kiyomi Kita and
Toshimichi Ikemura
--To understand
the biological meaning of characteristic
oligonucleotide sequences identified by SOM, it is
important to refer systematically to the literature
on experimental studies of those oligonucleotide
sequences. For genomes that have been studied
extensively by experiment, many signal and motif
sequences with functional activity are known. With
reference to the SOM data, signal and motif
sequences that have not yet been identified
experimentally may be newly predicted. Furthermore,
the SOM data may make it possible to explore signal
and motif candidates in genomes for which
experimental studies have not advanced. The large
amount of information accumulated in the course of
this literature survey should itself become a
significant and valuable data set. We have compiled
a new database called "GenomeWordDictionary", a
collection of the results of surveying papers that
describe experimental facts about each
oligonucleotide sequence. In the dictionary,
oligonucleotides composed of four letters (A, T, G,
and C) are arranged in alphabetical order, and the
abstracts of the papers reporting the experimental
results are compiled. Because we created the
database using the Oracle and Postgres
relational-database systems, we can extract a
dictionary for a particular phylogenetic group,
e.g., Rice GenomeWordDictionary and Fly
GenomeWordDictionary.
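Extracting a group-specific dictionary from a relational store can be sketched as below. The actual schema is not described in this report, so the table name, the column names, and the use of an in-memory SQLite database (instead of Oracle or Postgres) are all assumptions chosen to keep the example self-contained.

```python
import sqlite3


def build_dictionary(entries):
    """Create an in-memory table of (word, phylogenetic_group, abstract) rows.

    The schema is hypothetical; it only mirrors the description in the text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_entry ("
        " word TEXT NOT NULL,"
        " phylogenetic_group TEXT NOT NULL,"
        " abstract TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO word_entry VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def extract_dictionary(conn, group):
    """Return (word, abstract) pairs for one group, in alphabetical word order."""
    rows = conn.execute(
        "SELECT word, abstract FROM word_entry"
        " WHERE phylogenetic_group = ? ORDER BY word",
        (group,),
    )
    return rows.fetchall()
```

Called with a group name such as "rice", `extract_dictionary` would return the alphabetically ordered entries that make up that group's dictionary, given rows loaded through `build_dictionary`.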
PUBLICATIONS
Papers
1. Uchiyama, T., Abe, T., Ikemura, T. and
Watanabe, K. (2005). Substrate-induced
gene-expression screening of environmental
metagenome libraries for isolation of catabolic
genes, Nature Biotechnology, 23,
88-93.
2. Abe, T., Kanaya, S., Kinouchi, M. and Ikemura,
T. (2004). Genome informatics for unveiling hidden
genome signatures, Proceedings of the Institute
of Statistical Mathematics, 52,
207-215.
3. Abe, T., Ikemura, T., Kanaya, S., Kinouchi, M.
and Sugawara, H. (2004). Novel genome informatics
for unveiling hidden signatures in genome
sequences: self-organizing map (SOM) of
oligonucleotide frequencies, Proceedings of
Information-Based Induction Sciences,
94-99.
4. Hayashi, H., Abe, T., Sakamoto, M., Ohara, H.,
Ikemura, T., Sakka, K. and Benno, Y. (2005). Direct
cloning of genes encoding novel xylanases from
human gut, Canadian Journal of Microbiology
(in press).
Journal Editor
5. GENE (Elsevier)
6. DNA Sequence (Harwood Academic Publisher)
ORAL PRESENTATIONS
1) Abe, T., Kanaya, S., Kinouchi, M., Kosaka, Y.
and Ikemura, T., "Novel bioinformatics for
unveiling hidden characteristics in genome
sequences and searching in silico for
genetic signal sequences", The 8th World
Multi-Conference on Systemics, Cybernetics and
Informatics (Orlando, USA), July, 2004.
2) Nishio, H., Abe, T., Ogasawara, N., Ikemura, T.
and Kanaya, S., "Gene classification based on
expression profile using BL-SOM: Suitability
assessment of multivariate gene expression data to
spherical and plain SOM by N-measure", The 8th
World Multi-Conference on Systemics, Cybernetics
and Informatics (Orlando, USA), July, 2004.
3) Abe, T., Ikemura, T., Kozuki, T., Nakagawa, S.,
Kinouchi, M., Kanaya, S. and Sugawara, H., "A novel
bioinformatics approach for genome analyses of
environmental samples on the basis of
self-organizing map (SOM)", 16th International
Genome Sequencing & Analysis (Washington,
DC), September, 2004.