I. CENTER FOR INFORMATION BIOLOGY AND DNA DATA BANK OF JAPAN
I-d. Laboratory for Research and Development of Biological Databases - Hideaki Sugawara Group

RESEARCH ACTIVITIES

(1) Information systems for molecular biology and its related disciplines

1) From Web services to a Bioportal

Yasumasa Shigemoto, Haruka Sakai, Takashi Abe, Satoru Miyazaki and Hideaki Sugawara (Hitachi soft, Tokyo Univ. of Sci.)

--Publicly available bioinformatics resources, comprising databases and analytical tools, have expanded in recent years. While the information environment for the life sciences has gradually become richer, it is still difficult to combine multiple, heterogeneous bioinformatics resources for a specific research purpose. To set up and run an integrated system, it is often necessary to write and update custom programs. In addition, different research groups continually write programs that have overlapping functions. We need an information environment that is conducive to efficient and appropriate use of bioinformatics resources by a wide range of users. Therefore, the Center for Information Biology and DNA Data Bank of Japan, in alliance with the National Institute of Informatics (NII) and the Mitsubishi Research Institute, Inc. (MRI), started a three-year project in 2003, “Research and Development of the New Generation of Bio-portal”, to enhance the information environment for the relevant user communities. In this project, the Laboratory for Research and Development of Biological Databases is responsible for the development of biological Web services. The project site was opened at http://www.bioportal.jp/ in 2004. In addition to the biological Web services, the site provides a Web page of links to sites offering complete genome sequences and annotation; this page is named “Genome Menu”.

2) Expansion of Genome Information Broker (GIB)

Masaki Hirahata, Naoto Tanaka, Takashi Abe, Satoru Miyazaki and Hideaki Sugawara (Tokyo Univ. of Sci.)

--GIB was originally created for the retrieval and analysis of E. coli genomic information as a set. We have added microbial genome data to GIB whenever genome sequencing was completed and the data were made open to the public. At the GIB Web page (http://gib.genes.nig.ac.jp/), keyword search, homology search, links to DBGET, KEGG and GTOP, and visualization of the data were available for more than 200 strains as of December 2004. We have utilized XML, CORBA and a distributed database in order to cope with the explosion of microbial genome information.

 

(2) Information systems on microbes1)

1) WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)

Yasumasa Shigemoto, Junko Nagaya and Hideaki Sugawara (Fujitsu)

--WFCC and MIRCEN stand for World Federation for Culture Collections and Microbial Resource Centers Network, respectively. The laboratory is the host of WDCM and maintains the World Directory of microbial resource centers. The on-line World Directory contains detailed information on 469 centers in 65 countries and also lists of their holdings. Any culture collection is able to register, update and delete its information at http://www.wdcm.org/. With funding from the American Society for Microbiology and UNESCO, WDCM was able to promote the updating of the data by culture collections.

2) Development of an e-Workbench for Biological Classification and Identification (InforBIO)

Naoto Tanaka, Kouji Koorikawa, Takashi Abe, Satoru Miyazaki and Hideaki Sugawara (Hitachi soft, Tokyo Univ. of Sci.)

--We continued the development of an e-Workbench named InforBIO using Java, XML and a relational database management system in the public domain. We have distributed InforBIO to several laboratories that study microbes and have improved its utility and robustness based on their feedback (http://lilium.genes.nig.ac.jp/index_e.html).

3) An information system for pathogenic microorganisms

Masaki Hirahata, Naoto Tanaka, Yasumasa Shigemoto and Hideaki Sugawara (Fujitsu)

--We participated in a national project for the resource center of pathogenic microorganisms. Our role is to develop an information system for pathogenic fungi and actinomycetes, and also a portal site for pathogenic microorganisms in general (http://www.wdcm.org/byogen/).
(*) The information system on pathogenic microorganisms has been supported by Special Coordination Funds for Promoting Science and Technology.

(3) Applications of IT to the International Nucleotide Sequence Database2)

1) Development of Open Annotation System

Satoru Miyazaki, Takashi Abe and Hideaki Sugawara (Tokyo Univ. of Sci.)

--A number of complete genome sequences have been submitted to INSD since 1995. The annotation information, however, is not consistent among genome sequencing teams. In addition, researchers outside a sequencing team might have more information and knowledge on some genes and biological molecules. Therefore, it is quite important to develop a system that allows any expert to evaluate the annotation given by the team and to attach more valuable information. As a new feature of INSD, we are developing the so-called “Open Annotation System (OASYS)” as an annotation editor for the distributed environment of the Internet.
(*) The OASYS project has been supported by BIRD of the Japan Science and Technology Corporation (JST).

2) Exhaustive evaluation of microbial genome information by use of GRID

Takehide Kosuge, Toshihisa Okido, Yasumasa Shigemoto, Masaki Hirahata, Naoto Tanaka, Yuzuru Maruyama, Takashi Abe, Satoru Miyazaki and Hideaki Sugawara (Fujitsu, Tokyo Univ. of Sci.)

--The tsunami of biological data and the multiple views required for data analysis call for an expandable and flexible information environment. GRID computing is expected to be the solution. We prepared a computational environment composed of 5 sites in OBIGrid and succeeded in analyzing horizontal gene transfer and clusters of ORFs in more than 100 microbial genomes that were stored in the Genome Information Broker as of May 2003. This scheme is being applied to more than 300,000 ORFs in the genomic sequences of 124 microbial species. In 2004, we evaluated the results of the analysis and developed a site to disseminate the results to the public. We also applied the workflow to all the microbial genome sequences that were publicly available by September 2004.

(4) Genomics

1) Development of the H-Invitational Database

Yasumasa Shigemoto, Satoru Miyazaki and Hideaki Sugawara (Fujitsu, Tokyo Univ. of Sci.)

--We performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

2) Splicing Profile Based Protein Categorization between Human and Mouse Genomes

Åke Västermark, Yasumasa Shigemoto, Takashi Abe and Hideaki Sugawara (Univ. of Oxford, Fujitsu)

--We compared the gene structures of human and mouse to explore the relationship between gene function and exon-intron structure. The central question is whether protein function is more closely correlated with splicing profiles than with sequence similarity. To approach this question, a splicing profile similarity (SPS) index, which measures relative exon length discrepancy, was devised. Arbitrary human proteins were compared, in terms of SPS and amino acid sequence similarity, to their 1) mouse orthologues and 2) human paralogues, which epitomise functional equivalence and non-equivalence, respectively, to methodically elucidate the global relationship between a) biological function, b) splicing profile similarity, and c) sequence similarity. Protein function is more closely correlated with splicing profile similarity than with sequence similarity, as demonstrated by the fact that human-mouse orthologues (HMOs) display significantly higher splicing profile similarity than do human-human paralogues (HHPs), despite the mutual sequence similarity between these two categories. This finding indicates that splicing profile-based protein categorisation is biologically meaningful4).
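
The report states only that the SPS index "measures relative exon length discrepancy"; the exact formula is not given here. The following is a minimal sketch under an assumed definition: exons are paired in order, each pair contributes |a - b| / max(a, b), unmatched exons count as a full discrepancy, and SPS is one minus the mean. The function name and the exon lengths in the usage lines are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def splicing_profile_similarity(exons_a: Sequence[int], exons_b: Sequence[int]) -> float:
    """Assumed SPS: 1 minus the mean relative exon-length discrepancy.

    This is an illustrative definition, not the published formula.
    """
    if not exons_a or not exons_b:
        raise ValueError("each gene needs at least one exon")
    if any(x <= 0 for x in (*exons_a, *exons_b)):
        raise ValueError("exon lengths must be positive")
    n = max(len(exons_a), len(exons_b))
    # Pad the shorter profile with 0 so an unmatched exon scores a discrepancy of 1.
    a = list(exons_a) + [0] * (n - len(exons_a))
    b = list(exons_b) + [0] * (n - len(exons_b))
    discrepancy = sum(abs(x - y) / max(x, y) for x, y in zip(a, b)) / n
    return 1.0 - discrepancy


# Hypothetical exon lengths (in nucleotides) for one human gene,
# its mouse orthologue and a human paralogue.
human = [120, 85, 210]
mouse_orthologue = [120, 88, 210]
human_paralogue = [150, 60, 210, 95]

sps_hmo = splicing_profile_similarity(human, mouse_orthologue)
sps_hhp = splicing_profile_similarity(human, human_paralogue)
```

With these made-up profiles the orthologue pair scores higher than the paralogue pair, which is the kind of comparison the study performed across many proteins.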

3) Phylogenetics Analyses of Environmental Samples on the Basis of Self-Organizing Map (SOM)

Takashi Abe, Toshimichi Ikemura and Hideaki Sugawara (SOKEN-DAI)

--The metagenomic approach, which is genome analysis of a mixture of uncultured microorganisms, has recently been developed to search for novel and industrially useful genes and to study microbial diversity in a wide variety of environments. We previously modified the conventional SOM for genome informatics to make the learning process and the resulting map independent of the order of data input5), 6). In the present study, we developed the SOM as a novel bioinformatics strategy to capture and visualize the microbial diversity and relative abundance of microorganisms within an environmental sample. First, we constructed SOMs of tri- and tetranucleotide frequencies in 1- and 5-kb sequence fragments from prokaryotic genomes for which the complete sequence is available. The sequences could be classified primarily according to species and into 11 major phylogenetic groups without information regarding the species. For example, 88% of the 5-kb sequences were classified into the correct phylogenetic group. Importantly, the classification could be done without orthologous sequence sets, and therefore the SOM was especially useful for analyzing novel sequences from poorly characterized species for industrial applications and scientific studies. With the SOM method, all non-rRNA sequences in the Database that were from unidentified or uncultured bacteria and longer than 1 kb were classified into major phylogenetic groups7). The present method can also be developed as a tool for surveys of pathogenic microorganisms in environmental and clinical samples that cannot be cultured easily and in sterilized samples.
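
As a rough illustration of the input to such a map, the sketch below cuts a sequence into fixed-size fragments, converts each fragment to an oligonucleotide frequency vector, and applies a batch-style update to a one-dimensional map. It is not the authors' implementation: the function names, the 1-D map, the Gaussian neighbourhood, the toy sequences and the choice of initial weights are all assumptions. A batch update is used here because each step depends on the whole data set rather than on the order of presentation, which is one way to obtain the order independence described above.

```python
import math
from itertools import product


def fragments(seq, size):
    """Cut seq into non-overlapping fragments of exactly `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def kmer_frequencies(seq, k):
    """Normalized k-mer frequency vector of length 4**k (non-ACGT windows are skipped)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def best_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)))


def batch_update(weights, data, radius):
    """One batch step on a 1-D map: each unit becomes the neighbourhood-weighted
    mean of all inputs, with best units computed from the old weights."""
    n_units, dim = len(weights), len(weights[0])
    num = [[0.0] * dim for _ in range(n_units)]
    den = [0.0] * n_units
    for x in data:
        b = best_unit(weights, x)
        for u in range(n_units):
            h = math.exp(-((u - b) ** 2) / (2.0 * radius ** 2))
            den[u] += h
            for d in range(dim):
                num[u][d] += h * x[d]
    return [[num[u][d] / den[u] if den[u] > 0 else weights[u][d]
             for d in range(dim)] for u in range(n_units)]


# Toy input (hypothetical): an AT-rich and a GC-rich sequence, 200-base fragments,
# trinucleotide frequencies. Real use would take 1- or 5-kb genome fragments.
at_rich = "ATTAAATATTTA" * 50
gc_rich = "GCCGGGCGCCCG" * 50
data = [kmer_frequencies(f, 3)
        for f in fragments(at_rich, 200) + fragments(gc_rich, 200)]

weights = [data[0], data[-1]]  # assumed initialization, for the sketch only
for _ in range(5):
    weights = batch_update(weights, data, radius=0.5)
labels = [best_unit(weights, x) for x in data]
```

In this toy case the three AT-rich fragments and the three GC-rich fragments end up on different map units using composition alone, without any gene-level or orthology information.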

PUBLICATIONS

Papers
1. Sugawara, H., Abe, T., Tanaka, N. and Miyazaki, S. (2004). Encounter of microbiology with the data science in the phase called post-genome sequencing. Soil microorganisms. 58 (2), 57-67.
2. Miyazaki, S., Sugawara, H., Ikeo, K., Gojobori, T. and Tateno, Y. (2004). DDBJ in the stream of various biological data. Nucleic Acids Research. 32, D31-D34.
3. Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K., Barrero, R., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M., Yura, K., Miyazaki, S., Ikeo, K., Homma, K., Kasprzyk, A., Nishikawa, T., Hirakawa, M., Thierry-Mieg, J., Thierry-Mieg, D., Ashurst, J., Jia, L., Nakao, M., Thomas, M., Mulder, N., Karavidopoulou, Y., Jin, L., Kim, S., Yasuda, T., Lenhard, B., Eveno, E., Suzuki, Y., Yamasaki, C., Takeda, J., Gough, C., Hilton, P., Fujii, Y., Sakai, H., Tanaka, S., Amid, C., Bellgard, M., Bonaldo, Mde, F., Bono, H., Bromberg, S., Brookes, A., Bruford, E., Carninci, P., Chelala, C., Couillault, C., De Souza, SJ., Debily, M., Devignes, M., Dubchak, I., Endo, T., Estreicher, A., Eyras, E., Fukami-Kobayashi, K., Gopinath, G., Graudens, E., Hahn, Y., Han, M., Han, Z., Hanada, K., Hanaoka, H., Harada, E., Hashimoto, K., Hinz, U., Hirai, M., Hishiki, T., Hopkinson, I., Imbeaud, S., Inoko, H., Kanapin, A., Kaneko, Y., Kasukawa, T., Kelso, J., Kersey, P., Kikuno, R., Kimura, K., Korn, B., Kuryshev, V., Makalowska, I., Makino, T., Mano, S., Mariage-Samson, R., Mashima, J., Matsuda, H., Mewes, H., Minoshima, S., Nagai, K., Nagasaki, H., Nagata, N., Nigam, R., Ogasawara, O., Ohara, O., Ohtsubo, M., Okada, N., Okido, T., Oota, S., Ota, M., Ota, T., Otsuki, T., Piatier-Tonneau, D., Poustka, A., Ren, S., Saitou, N., Sakai, K., Sakamoto, S., Sakate, R., Schupp, I., Servant, F., Sherry, S., Shiba, R., Shimizu, N., Shimoyama, M., Simpson, AJ., Soares, B., Steward, C., Suwa, M., Suzuki, M., Takahashi, A., Tamiya, G., Tanaka, H., Taylor, T., Terwilliger, J., Unneberg, P., Veeramachaneni, V., Watanabe, S., Wilming, L., Yasuda, N., Yoo, H., Stodolsky, M., Makalowski, W., Go, M., Nakai, K., Takagi, T., Kanehisa, M., Sakaki, Y., Quackenbush, J., Okazaki, Y., Hayashizaki, Y., Hide, W., Chakraborty, R., Nishikawa, K., Sugawara, H., Tateno, Y., Chen, Z., Oishi, M., Tonellato, P., Apweiler, R., Okubo, K., Wagner, L., Wiemann, S., Strausberg, R., Isogai, T., Auffray, C., Nomura, N., Gojobori, T. and Sugano, S. (2004). Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol., 2 (6), e162.
4. Vastermark, A., Shigemoto, Y., Abe, T. and Sugawara, H. (2004). Splicing Profile-based Protein Categorization between Human and Mouse Genome by use of DDBJ Web Services. Genome Informatics 15, 13-20.
5. Abe, T., Kanaya, S., Kinouchi, M. and Ikemura, T. (2004). Genome Informatics for Unveiling Hidden Genome Signatures. Proceedings of the Institute of Statistical Mathematics 52, 207-215.
6. Abe, T., Kanaya, S., Kinouchi, M., Kosaka, Y. and Ikemura, T. (2004). Novel bioinformatics for unveiling hidden characteristics in genome sequences and searching in silico for genetic signal sequences. Proceedings of the 8th World Multi-Conference on Systemics, Cybernetics and Informatics.
7. Abe, T., Ikemura, T., Kanaya, S., Kinouchi, M. and Sugawara, H. (2004). Novel genome informatics for unveiling hidden signatures in genome sequences: self-organizing map (SOM) of oligonucleotide frequencies. Proceedings of Information-Based Induction Sciences, 94-99.

Books
8. Sugawara, H. (2004). Tsunami of data: Data resources and utilization. Microbial Genetic Resources and Biodiscovery. Kurtboke, I. and Swings, J. ed., (National Library of Australia), 40-56.

Databases
9. Japanese Bio-portal site (Jabion), http://www.bioportal.jp/
10. Genome Information Broker, http://gib.genes.nig.ac.jp/
11. WFCC-MIRCEN World Data Centre for Microorganisms (WDCM), http://www.wdcm.org/
12. The portal site for pathogenic microorganisms, http://www.wdcm.org/byogen/
13. e-Workbench for Biological Classification and Identification, http://lilium.genes.nig.ac.jp/index_e.html
14. H-Invitational Database, http://www.h-invitational.jp/

ORAL PRESENTATIONS

1. Sugawara, H., Culture collections face challenges and opportunities, International Symposium Towards a New Era's Microbial Resource Center, Beijing, February, 2004.
2. Miyazaki, S., Sugawara, H., Exhaustive analysis of microbial genomes by Web services and GRID, JST-BIR International Workshop "Integrated Databases and DataGrid for Structural Biology and Molecular Biology", Osaka, March, 2004.
3. Sugawara, H., Evolution of WFCC-MIRCEN World Data Centre for Microorganisms (WDCM). ISBER US Meeting 2004, New York City, May, 2004.
4. Sugawara, H., Gene Trek in Procaryote Space powered by a GRID environment, Proceedings of the First International Workshop on Life Science Grid (LSGRID2004), Kanazawa, May, 2004.
5. Sugawara, H., The Butterfly Effect. JSCC Award Lecture, Tsukuba, October, 2004.
6. Sugawara, H., WFCC-MIRCEN World Data Centre for Microorganisms (WDCM) meets Global Biodiversity Information Facility (GBIF), 19th International CODATA Conference "The Information Society: New Horizons for Science", Berlin, November, 2004.
7. Kosuge, T., Okido, T., Hirahata, M., Shigemoto, Y., Miyazaki, S., Abe, T., Gojobori, T., Sugawara, H., Development of a common protocol for the prediction of microbial genes. Genome Informatics Workshop, Yokohama, December, 2004.
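8. Sugawara, H., "International collaboration and information networks", 1st NITE Biotechnology Workshop "Recent topics and future developments surrounding microbial resource centers", Tokyo, January, 2004 (in Japanese).
9. Abe, T., Sugawara, H., Ikemura, T., "Bioinformatics for utilizing untapped genome resources hidden in the environment", International Bio EXPO, Tokyo, May, 2004 (in Japanese).
10. Sugawara, H., "Microorganisms and data science", 2004 Annual Meeting of the Japanese Society of Soil Microbiology, Tsukuba, June, 2004 (in Japanese).
11. Abe, T., Ikemura, T., Nakagawa, S., Kozuki, T., Kinouchi, M., Kanaya, S., Sugawara, H., "A novel informatics method for phylogenetic estimation of hard-to-culture microorganisms based on environmental DNA sequences: the Self-Organizing Map", 6th Annual Meeting of the Society of Evolutionary Studies, Japan, Tokyo, August, 2004 (in Japanese).
12. Kobayashi, S., Kawamoto, S., Mizuta, Y., Muljadi, H., Demiya, S. M., Iwama, H., Takezaki, N., Itoh, T., Araki, J., Yoshinari, Y., Kitamoto, A., Gojobori, T., Sugawara, H., Miyazaki, S., Takeda, H., Fujiyama, A., "Development of a new-generation bio-portal: toward the popularization of genetics through Web services", 76th Annual Meeting of the Genetics Society of Japan, Suita, September, 2004 (in Japanese).
13. Abe, T., Sugawara, H., Kanaya, S., Kinouchi, M., Nakagawa, S., Kozuki, T., Ikemura, T., "Phylogenetic classification of genomic DNA fragment sequences derived from hard-to-culture environmental microorganisms using the self-organizing map (SOM)", 76th Annual Meeting of the Genetics Society of Japan, Suita, September, 2004 (in Japanese).
14. Tanaka, N., Kosuge, T., Okido, T., Hirahata, M., Shigemoto, Y., Miyazaki, S., Abe, T., Sugawara, H., "Unified re-evaluation of microbial ORFs registered in the International Nucleotide Sequence Database", Meeting of the Japan Society for Microbial Systematics, Ito, November, 2004 (in Japanese).
15. Mizushima, H., Sugawara, H., Kano, T., Oroguchi, T., "Standardization of SNPs data exchange toward interoperability of biological databases", IPAB Symposium 2004, Tokyo, December, 2004 (in Japanese).
16. Abe, T., Sugawara, H., Kinouchi, M., Kanaya, S., Ikemura, T., "Novel genome informatics for searching for unknown signal sequences hidden in genomes", 27th Annual Meeting of the Molecular Biology Society of Japan, Kobe, December, 2004 (in Japanese).
17. Sugawara, H., "Life science viewed from sequences: are features other than sequence useful?", 27th Annual Meeting of the Molecular Biology Society of Japan, Kobe, December, 2004 (in Japanese).
18. Sugawara, H., "The bio-information environment accelerated by Web services", Annual Meeting of the Biophysical Society of Japan, Kyoto, December, 2004 (in Japanese).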

EDUCATION

1. Dr. H. Sugawara was invited to give a lecture on “Databases are the key to bioinformatics” at the 2nd Open Symposium of Joint Research with Wakayama Pref., Wakayama, 2004 (in Japanese).
2. Dr. H. Sugawara was invited to give a lecture on “Invitation to information biology” at Campus system Research Group of Private Universities, Hamanako, August 2004 (in Japanese).
3. Dr. T. Abe was invited to give a lecture on “Genome analysis by PC-cluster” at Working group of scientific system, Tokyo, August, 2004 (in Japanese).

ACADEMIC SOCIETY ACTIVITIES
1. Dr. H. Sugawara organized the 2nd International Conference on Biodata Interoperability, Tokyo, June, 2004.
2. Dr. H. Sugawara organized the International Program Committee of the International Congress for Culture Collections, Tsukuba, October, 2004.
3. Dr. H. Sugawara organized the Program Committee of Genome Informatics 2004, Yokohama, December, 2004.
4. Task Force of Biological Resource Centers, OECD Working Party for Biotechnology (Vice-chair).
5. World Federation for Culture Collections, Executive board member and journal editor.
6. Japanese Society for Extremophiles (Councilor).
7. Japan Society of Information and Knowledge (Director).