I. CENTER FOR INFORMATION BIOLOGY AND DNA DATA BANK OF JAPAN
I-e. Laboratory for Gene-Expression Analysis - Kousaku Okubo Group

RESEARCH ACTIVITIES

(1) Expression profiling of human genes

(1a) Data integration (BodyMap8499): “Do you know how many of our genes have reliable expression patterns on the net?”

Osamu Ogasawara and Kousaku Okubo

--The unexpectedly small difference in gene number among multicellular organisms, determined precisely through whole-genome sequencing, supports the idea that the complexity of our body was achieved in evolution by the sophistication of gene expression control. An anatomically comprehensive, genome-wide gene expression profile is the key data for appreciating this sophistication encoded in our genome. Moreover, the availability of such data opens up opportunities to explore how constitutive expression patterns depend on other features of genes and genomes, which may eventually lead to an understanding of the coding principles of our genome.
--Despite the frequent use of the term ‘genome-wide profiling’ and the wealth of expression data in the public domain, it is still not explicit what fraction of our genes have anatomical expression patterns (COVERAGE) or to what extent different data sets agree on tissue distribution (ACCURACY). To allow rational design of studies of the human transcriptome as a whole, we have started to integrate data from multiple platforms on the framework of the latest human genome. The preliminary data are open to the public in collaboration with the integrated database team at JBIRC (https://www.jbirc.aist.go.jp/hinv/h-angel/).
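--The two quantities named above can be illustrated with a minimal sketch. It assumes each data set is a mapping from gene to per-tissue expression values, and it uses a per-gene Pearson correlation over shared tissues as one possible measure of agreement; the report does not state which measure is used. The gene names, tissues and values are hypothetical.

```python
from math import sqrt

def coverage(datasets, all_genes):
    """Fraction of genes with an expression pattern in at least one data set."""
    covered = set().union(*(d.keys() for d in datasets)) & set(all_genes)
    return len(covered) / len(all_genes)

def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")

def agreement(ds_a, ds_b, min_tissues=3):
    """Per-gene correlation of tissue distributions between two data sets."""
    result = {}
    for gene in ds_a.keys() & ds_b.keys():
        tissues = sorted(ds_a[gene].keys() & ds_b[gene].keys())
        if len(tissues) >= min_tissues:
            result[gene] = pearson([ds_a[gene][t] for t in tissues],
                                   [ds_b[gene][t] for t in tissues])
    return result

# Hypothetical toy data from two platforms.
all_genes = ["g1", "g2", "g3", "g4"]
platform_a = {
    "g1": {"brain": 5.0, "kidney": 1.0, "liver": 0.0},
    "g2": {"brain": 0.0, "kidney": 4.0, "liver": 2.0},
}
platform_b = {
    "g1": {"brain": 50.0, "kidney": 10.0, "liver": 0.0},
    "g3": {"brain": 1.0, "kidney": 1.0, "liver": 7.0},
}

cov = coverage([platform_a, platform_b], all_genes)
agree = agreement(platform_a, platform_b)
```

Here only genes measured on both platforms contribute to the agreement, so coverage and accuracy are reported separately rather than combined into one score.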

(1b) Data generation and analysis: “Are you satisfied with the present resolution of anatomical expression data?”

Makiko Otsuji and Koichi Itoh

--To relate genes in a co-expression cluster functionally, or to deduce promoter sequences by aligning their upstream regions, expression patterns must be resolved at the level of single cells or homogeneous cell populations. The vast majority of public data do not meet this criterion, except for those from induction experiments with cell lines. Moreover, the majority of target genes for drug development, such as channels and receptors for signaling molecules, are localized in minority cell types within complex organs rather than in homogeneous cell populations. We are taking several different approaches to generate such data in a streamlined manner for organs with complex cell populations, such as the brain and kidney.

(2) Knowledge encoding and computation with gene functions (BOB): “Are you confident in your massive data interpretation?”

Kousaku Okubo, Koichi Itoh and Osamu Ogasawara

--With the advent of high-throughput genome-wide measurement, hypothesis generation on gene functions through systematic and integrative interpretation of the accumulating data is anticipated. Over the last few years, various statistical analysis techniques have been employed to extract global patterns from the massive data in the form of gene clusters and networks. At present, however, even with intensive use of web-based knowledge bases, human interpretation cannot match machine-aided data production in either speed or scope. Automation of the interpretation process, at least in part, appears essential for systematic and efficient hypothesis generation.
--There are three steps in automating the interpretation process: (1) encoding biomedical knowledge into a computable form, (2) interpreting data using the encoded knowledge, and (3) representing the interpretation results. The pioneering approaches to step (1) were all declarative (KEGG, GO). The declarative approach has some known inherent limitations. First, continuous revision and updating by experts is unavoidable. Second, a manually declared structure usually has low dimensionality and poor representational power; for example, GO provides only three ‘aspects’ for representing functional relations among genes. Third, experts' declarations tend to become inconsistent as the target domain widens. Although such limitations may cause little trouble in biochemistry and cell biology, they become serious in medical biology, where gene functions are described by relating their roles to many types of concepts, from behavior to chemical compounds. Because of these limitations, the effectiveness of the declarative approach for medical problems remains elusive.
--Rationale for our approach: A major step in the biomedical interpretation of these data is to determine whether any part of a data-driven structure has a unique and common functional feature. For this purpose, experts recall features of genes from every aspect. Such a process appears to depend so heavily on experts' flexible recall and thinking that a machine could not possibly perform it. In medical biology, however, we assumed that each aspect mostly represents a certain topic in biomedicine, such as ‘pathogenesis of a disease’ or ‘molecular mechanism of an organ's function.’ If there were a book covering essentially all fundamental topics in medical biology, and if each page contained a list of all relevant genes, interpretation could be achieved by computing the fitness of a given gene cluster to each page (meaningfulness) and returning the title of the best-fitting page as the meaning of the cluster.
--We aimed to create such a book, ‘BOB (Biomedical OminiBook)’, by concept-based structuring of the pages of biomedical textbooks and of descriptions of gene functions. For this purpose we applied a concept-based indexing technique, latent semantic indexing (LSI). LSI was developed to overcome a problem in document searching based on term matching: similar concepts may be represented by texts with different term combinations, and vice versa. In LSI, terms and documents are arranged beforehand in a ‘semantic’ vector space based on the global patterns in term-document association data. In searching, users' queries are mapped onto this space as pseudo-documents. For each query, the document vectors whose cosine with the query vector exceeds a threshold are returned.
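--The LSI search described above can be sketched as follows. This is a minimal illustration using NumPy's singular value decomposition, not the group's implementation. The toy term-document matrix, the rank `k`, the threshold, and the fold-in convention (dividing the projected query by the singular values) are all assumptions chosen for the example.

```python
import numpy as np

def build_lsi(term_doc, k):
    """Rank-k truncated SVD of a terms-by-documents count matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Term basis, singular values, and one k-dimensional vector per document.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(query_counts, U_k, s_k):
    """Map a term-count vector onto the space as a pseudo-document."""
    return (query_counts @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def search(query_counts, U_k, s_k, doc_vecs, threshold):
    """Return (document index, cosine) pairs at or above the threshold."""
    q = fold_in(query_counts, U_k, s_k)
    scored = [(j, cosine(q, d)) for j, d in enumerate(doc_vecs)]
    return sorted((h for h in scored if h[1] >= threshold), key=lambda h: -h[1])

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["kidney", "renal", "tubule", "neuron", "synapse"]
term_doc = np.array([
    [1, 0, 0, 0],   # kidney
    [0, 1, 0, 0],   # renal
    [1, 1, 0, 0],   # tubule
    [0, 0, 1, 1],   # neuron
    [0, 0, 1, 0],   # synapse
], dtype=float)

U_k, s_k, doc_vecs = build_lsi(term_doc, k=2)
query = np.array([1, 0, 0, 0, 0], dtype=float)   # the single term "kidney"
hits = search(query, U_k, s_k, doc_vecs, threshold=0.5)
```

In this toy case the query "kidney" also retrieves the document that contains only "renal" and "tubule", because both documents load on the same latent dimension. That is the behavior term matching alone would miss.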
--In our application, we took textbooks in place of the documents to be searched. Starting from the term-page association data provided in the index sections of textbooks, we created a high-dimensional vector space in which objects (terms or pages) are placed according to their relevance. We then prepared for each gene a gene vector, corresponding to the ‘query vector’ in the original LSI, by counting occurrences of textbook terms in the corresponding molecular database entry, including the abstracts of cited papers. The gene vectors thus prepared were mapped onto the textbook space in the same way as users' queries. The resulting space contains vector representations of three different classes of objects (term meanings, page contents, and gene functions) arranged according to their conceptual relevance. The high dimensionality of this space is expected to allow discriminative representation of many aspects of relations without inconsistency. Using this space, we can scale the functional relevance of any pair of genes, from which the ‘meaningfulness’ of any given gene cluster can be calculated. In addition, the title of the page vector nearest to the center of a gene cluster explains the biomedical ‘meaning’ of a meaningful cluster.
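--A minimal sketch of this use of the space is given below. The report does not specify how ‘meaningfulness’ is computed from pairwise relevance, so the mean pairwise cosine between gene vectors is used here as an assumption, and the ‘center’ of a cluster is taken to be the centroid. The page titles, gene names, term counts and rank `k` are hypothetical.

```python
from itertools import combinations
import numpy as np

# Hypothetical toy index: rows are index terms, columns are textbook pages.
terms = ["kidney", "renal", "tubule", "neuron", "synapse"]
page_titles = ["Renal physiology I", "Renal physiology II",
               "Synaptic transmission", "Neuron structure"]
term_page = np.array([
    [1, 0, 0, 0],   # kidney
    [0, 1, 0, 0],   # renal
    [1, 1, 0, 0],   # tubule
    [0, 0, 1, 1],   # neuron
    [0, 0, 1, 0],   # synapse
], dtype=float)

# Hypothetical counts of textbook terms in each gene's database entry.
gene_counts = {
    "geneA": np.array([2, 0, 1, 0, 0], dtype=float),
    "geneB": np.array([0, 3, 0, 0, 0], dtype=float),
    "geneC": np.array([0, 0, 0, 2, 1], dtype=float),
    "geneD": np.array([0, 0, 0, 1, 0], dtype=float),
}

# Build the textbook space (rank k is an assumption) and fold the genes in
# exactly as a query would be folded in.
k = 2
U, s, Vt = np.linalg.svd(term_page, full_matrices=False)
U_k, s_k, page_vecs = U[:, :k], s[:k], Vt[:k, :].T
gene_vecs = {g: (c @ U_k) / s_k for g, c in gene_counts.items()}

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def meaningfulness(cluster):
    """Mean pairwise cosine between the gene vectors of a cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(gene_vecs[a], gene_vecs[b]) for a, b in pairs) / len(pairs)

def cluster_meaning(cluster):
    """Title of the page vector nearest (by cosine) to the cluster centroid."""
    centroid = np.mean([gene_vecs[g] for g in cluster], axis=0)
    best = max(range(len(page_titles)),
               key=lambda j: cosine(centroid, page_vecs[j]))
    return page_titles[best]

renal_score = meaningfulness(["geneA", "geneB"])
mixed_score = meaningfulness(["geneA", "geneC"])
neuro_title = cluster_meaning(["geneC", "geneD"])
```

In this toy space a cluster of two kidney-related genes scores close to 1, while a cluster mixing a kidney-related and a neuron-related gene scores close to 0. A real application would need a threshold or a background distribution to decide which scores count as meaningful.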

Publications

Papers
1. Michibata, H., Chiba, H., Wakimoto, K., Seishima, M., Kawasaki, S., Okubo, K., Mitsui, H., Torii, H. and Imai, Y. (2004). Identification and characterization of a novel component of the cornified envelope, cornifelin. Biochem Biophys Res Commun., 318, 803-13.
2. Hishiki, T., Ogasawara, O., Tsuruoka, Y. and Okubo, K. (2004). Indexing anatomical concepts to OMIM Clinical Synopsis using the UMLS Metathesaurus. In Silico Biol., 4, 31-54.
3. Chiba, H., Michibata, H., Wakimoto, K., Seishima, M., Kawasaki, S., Okubo, K., Mitsui, H., Torii, H. and Imai, Y. (2004). Cloning of a gene for a novel epithelium-specific cytosolic phospholipase A2, cPLA2delta, induced in psoriatic skin. J Biol Chem., 279, 12890-7.

Books
4. Kaimori, J., Takenaka, M. and Okubo, K. (2004). Quantification of Gene Expression in Mouse and Human Renal Proximal Tubules. In: Laser Capture Microdissection: Methods and Protocols (Graeme I. Murray and Stephanie Curran, eds.), Methods in Molecular Biology 293, Humana Press, 209-220.
5. Okubo, K. and Kawamoto, S. (2004). Genome language and a metaphysics called ontology: evolution and medicine (in Japanese). Kagaku, Vol. 74, No. 10, 1254-1257.

ORAL PRESENTATIONS

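1. Okubo K. Automation of post-genome data interpretation. The 27th Annual Meeting of the Molecular Biology Society of Japan (Kobe), Dec. 2004 (in Japanese).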
2. Okubo K., MACHINE USE OF MEDICAL TEXTBOOKS FOR ESTABLISHING KNOWLEDGE HANDLING ENVIRONMENT IN FUNCTIONAL GENOMICS. The 5th HUGO Pacific Meeting and 6th Asia-Pacific Conference on Human Genetics (Singapore), Nov.2004.
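3. Okubo K. What can diversity research expect from the genome? GaiaList 21 Planning Committee Symposium “Taxonomy, Genome, Language,” The 75th Annual Meeting of the Zoological Society of Japan (Kobe), Sep. 2004 (in Japanese).
4. Ogasawara O., Kubota I. and Okubo K. Relationship between similarity of expression and similarity of structure among genes. Session theme: Aspects of biological evolution seen from genomes and gene expression. The 6th Annual Meeting of the Society of Evolutionary Studies, Japan (Tokyo), Aug. 2004 (in Japanese).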
5. Okubo K. Machine use of medical textbooks for establishing knowledge handling environment in functional genomics. The Third Workshop on Ontology and Genome-Development and Applications of Ontologies on OMICS Research (Göttingen, Germany), July 2004.
6. Okubo K. Machine Interpretation of Genome-wide Measurements for Integrative Biology. The Third Waterfront Symposium of Human Genome Science, Tokyo, Feb. 2004.

EDUCATION

1. Dr. Okubo gave a seminar on “The computational interpretation of biological data and its significance" at National Institute of Genetics, Oct., 2004 (in Japanese).
2. Dr. Okubo gave a seminar on “How can we (biologists) make good use of biological knowledge available?" for DDBJing and TERAKOYA, Tokyo, Mar., 2004 (in Japanese).
3. Dr. Okubo was invited to give a seminar on “Medical Science in the post-genome period: knowledge engineering in medicine” for the Department of Biomolecular Sciences, Saga Medical School, Faculty of Medicine, Saga University, Mar., 2004 (in Japanese).

SOCIAL CONTRIBUTIONS AND OTHERS

1. BodyMap: http://bodymap.genes.nig.ac.jp/
2. Database: http://www.jbirc.aist.go.jp/hinv/h-angel/
3. Database: http://bodymap.ims.u-tokyo.ac.jp
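4. Co-organizer, JST Research Area Exploration Program “Genome and Language”
5. Member, Liaison Committee on Genome Science Research, Science Council of Japan
6. Member, Research Promotion Committee for the Protein Function Analysis and Utilization Project, Japan Biological Informatics Consortium (incorporated association)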