I. CENTER FOR INFORMATION BIOLOGY AND DNA DATA BANK OF JAPAN
I-e. Laboratory for Gene-Expression Analysis - Kousaku Okubo Group
RESEARCH ACTIVITIES
(1) Expression profiling of human genes
(1a) Data integration (BodyMap8499): “Do you know how many of our
genes have reliable expression patterns on the net?”
Osamu Ogasawara and Kousaku Okubo
--The unexpectedly small difference in gene number among
multicellular organisms, determined precisely through whole-genome
sequencing, supports the idea that the complexity of our body was
achieved in evolution by increasingly sophisticated control of gene
expression. An anatomically comprehensive, genome-wide gene
expression profile is the key data for appreciating the
sophistication encoded in our genome. Moreover, the availability of
such data opens up opportunities to explore how constitutive
expression patterns depend on other features of genes and genomes,
which may eventually lead to an understanding of the coding
principles of our genome.
--Despite the frequent use of the term ‘genome-wide profiling’ and
the wealth of expression data in the public domain, it is still not
clear what fraction of our genes have anatomical expression patterns
available (COVERAGE), or to what extent different data sets agree on
tissue distribution (ACCURACY). To allow rational design of studies
of the human transcriptome as a whole, we started to integrate data
from multiple platforms within the framework of the latest human
genome sequence. The preliminary data are open to the public in
collaboration with the integrated database team at JBIRC
(https://www.jbirc.aist.go.jp/hinv/h-angel/).
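The two measures named above can be illustrated with a minimal
sketch. The report does not give formulas, so the definitions here
are assumptions: COVERAGE is taken as the fraction of reference genes
with at least one expression profile, and ACCURACY as the per-gene
Pearson correlation of tissue distributions between two platforms.
The gene names, tissue names, and values are hypothetical toy data.

```python
from math import sqrt

def coverage(profiled_genes, all_genes):
    """Fraction of reference genes that have at least one expression profile."""
    all_genes = set(all_genes)
    return len(all_genes & set(profiled_genes)) / len(all_genes)

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no tissue-distribution signal
    return sxy / sqrt(sxx * syy)

def accuracy(platform_a, platform_b, tissues):
    """Per-gene agreement of tissue distributions for genes on both platforms."""
    shared = sorted(set(platform_a) & set(platform_b))
    return {
        g: pearson([platform_a[g][t] for t in tissues],
                   [platform_b[g][t] for t in tissues])
        for g in shared
    }

# Hypothetical toy data: two platforms, three tissues, four reference genes.
genome = ["G1", "G2", "G3", "G4"]
tissues = ["brain", "kidney", "liver"]
est = {"G1": {"brain": 10, "kidney": 1, "liver": 0},
       "G2": {"brain": 0, "kidney": 5, "liver": 5}}
array = {"G1": {"brain": 8, "kidney": 2, "liver": 1},
         "G3": {"brain": 3, "kidney": 3, "liver": 9}}

cov = coverage(set(est) | set(array), genome)
agreement = accuracy(est, array, tissues)
print(cov, agreement)
```

Only genes measured on both platforms contribute to the agreement
score, so coverage and accuracy have to be reported together.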
(1b) Data generation and analysis: “Are you satisfied with the
present resolution of anatomical expression data?”
Makiko Otsuji and Koichi Itoh
--To relate the genes in a co-expression cluster functionally, or to
deduce promoter sequences by aligning their upstream regions,
expression patterns must be resolved at the level of cells or
homogeneous cell populations. The vast majority of public data do
not meet this criterion, except for those from induction experiments
with cell lines. Moreover, the majority of target genes for drug
development, such as receptors for signaling molecules and channels,
are localized in minority cell types within complex organs rather
than in homogeneous cell populations. We are taking several
different approaches to generate such data in a streamlined manner
for organs with complex cell populations, such as brain and kidney.
(2) Knowledge encoding and computation with gene functions (BOB):
“Are you confident in your massive data interpretation?”
Kousaku Okubo, Koichi Itoh and Osamu Ogasawara
--With the advent of high-throughput genome-wide measurement,
hypothesis generation on gene functions through systematic and
integrative interpretation of the accumulating data is anticipated.
Over the last few years, various statistical techniques have been
employed to extract global patterns from the massive data in the
form of gene clusters and networks. At present, however, even with
intensive use of web-based knowledge bases, human interpretation
cannot match machine-aided data production in either speed or scope.
Automating the interpretation process, at least in part, appears
essential for systematic and efficient hypothesis generation.
--Automating the interpretation process involves three steps: (1)
encoding biomedical knowledge in a computable form, (2) interpreting
data using the encoded knowledge, and (3) representing the results
of the interpretation. The pioneering approaches to step (1) were
all declarative (KEGG, GO). The declarative approach has some known
inherent limitations. First, continuous revision and updating by
experts is unavoidable. Second, a manually declared structure
usually has low dimensionality and poor representational power; for
example, GO provides only three ‘aspects’ for representing
functional relations among genes. Third, expert declarations tend to
become inconsistent as the target domain widens. Although these
limitations may cause few problems in biochemistry and cell biology,
they become serious in medical biology, where gene functions have
been described by relating their roles to many types of concepts,
from behavior to chemical compounds. Because of these limitations,
the effectiveness of the declarative approach for medical problems
remains elusive.
Rationale for our approach: A major step in the biomedical
interpretation of these data is to determine whether any part of a
data-driven structure shares a unique, common functional feature.
For this purpose, experts recall features of genes from every
aspect. Such a process appears to depend so heavily on the expert's
flexible recall and thinking that a machine could not possibly
perform it. In medical biology, however, we assumed that each aspect
mostly represents a certain topic in biomedicine, such as
‘pathogenesis of a disease’ or ‘molecular mechanism of an organ's
function.’ If there were a book covering essentially all fundamental
topics in medical biology, and if each page contained a list of all
relevant genes, interpretation could be achieved by computing the
fitness of a given gene cluster to each page (meaningfulness) and
returning the title of the best-fitting page as the meaning of the
cluster.
--We aimed to create such a book, ‘BOB (Biomedical OminiBook)’, by
concept-based structuring of the pages of biomedical textbooks and
of descriptions of gene functions. For this purpose we applied a
concept-based indexing technique, latent semantic indexing (LSI).
LSI was developed to overcome the problems in document searching
based on term matching, which arise because similar concepts may be
represented by texts with different term combinations, and vice
versa. In LSI, terms and documents are arranged beforehand in a
‘semantic’ vector space derived from the global patterns in
term-document association data. In searching, users' queries are
mapped onto this space as pseudo-documents. For each query, the
documents whose vectors have a supra-threshold cosine to the query
vector are returned.
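The LSI search procedure described above can be sketched with a
truncated singular value decomposition. This is a minimal
illustration using NumPy, not the group's implementation. The terms,
documents, counts, rank k = 2, and threshold 0.5 are all hypothetical,
and the fold-in formula q^T U_k S_k^{-1} is the standard LSI mapping
of a query as a pseudo-document.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["kidney", "renal", "tubule", "neuron", "synapse"]
docs = ["d_kidney", "d_renal", "d_brain"]
A = np.array([
    [2, 0, 0],  # kidney
    [0, 2, 0],  # renal
    [1, 1, 0],  # tubule
    [0, 0, 2],  # neuron
    [0, 0, 1],  # synapse
], dtype=float)

# Build the 'semantic' space beforehand: keep the k largest singular triplets.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T  # one k-dimensional row per document

def fold_in(term_counts):
    """Map a term-count query into the space as a pseudo-document."""
    return (term_counts @ U_k) / s_k

def search(term_counts, threshold=0.5):
    """Return (document, cosine) pairs whose cosine exceeds the threshold."""
    q = fold_in(term_counts)
    hits = []
    for name, d in zip(docs, doc_vecs):
        cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
        if cos > threshold:
            hits.append((name, cos))
    return sorted(hits, key=lambda pair: -pair[1])

# Query containing only the term "kidney".
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
results = search(query)
print(results)
```

In this toy matrix the query retrieves d_renal as well as d_kidney,
although d_renal never contains the term "kidney": the two documents
share the term "tubule", so the reduced space places them together.
This is the term-mismatch problem LSI was designed to address.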
--In our application, we used textbooks in place of the documents to
be searched. Starting from term-page association data provided in
the index sections of textbooks, we created a high-dimensional
vector space in which objects (terms or pages) are placed according
to their relevance. We then prepared a gene vector, corresponding to
a ‘query vector’ in the original LSI, by counting occurrences of
textbook terms in the corresponding molecular database entry,
including the abstracts of cited papers. The gene vectors thus
prepared were mapped onto the textbook space in the same way that
users' queries are mapped. The resulting space contains vector
representations of three different classes of objects (term
meanings, page contents, and gene functions) arranged according to
their conceptual relevance. The high dimensionality of this space is
expected to allow discriminative representation of many aspects of
relations without inconsistency. Using this space, we can measure
the functional relevance of any pair of genes, from which the
‘meaningfulness’ of any given gene cluster can be calculated. In
addition, the title of the page vector nearest to the center of a
gene cluster will explain the biomedical ‘meaning’ of a meaningful
cluster.
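The final scoring step can be sketched as follows. The report does
not specify the formulas, so this sketch makes two assumptions:
‘meaningfulness’ is taken as the mean pairwise cosine among the gene
vectors of a cluster, and the nearest page is the one with the
highest cosine to the cluster centroid. The gene names, page titles,
and coordinates are hypothetical and stand in for vectors already
mapped onto the textbook space.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def meaningfulness(cluster, gene_vecs):
    """Assumed score: mean pairwise cosine among the cluster's gene vectors."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(cosine(gene_vecs[a], gene_vecs[b]) for a, b in pairs) / len(pairs)

def centroid(cluster, gene_vecs):
    """Component-wise mean of the cluster's gene vectors."""
    vecs = [gene_vecs[g] for g in cluster]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def best_page(cluster, gene_vecs, page_vecs):
    """Title of the page vector closest (by cosine) to the cluster centroid."""
    c = centroid(cluster, gene_vecs)
    return max(page_vecs, key=lambda title: cosine(c, page_vecs[title]))

# Hypothetical vectors in a shared 3-dimensional space.
gene_vecs = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.0, 0.1, 0.9],
}
page_vecs = {
    "Renal tubular transport": [1.0, 0.0, 0.0],
    "Synaptic transmission": [0.0, 0.0, 1.0],
}

tight = meaningfulness(["GENE_A", "GENE_B"], gene_vecs)
loose = meaningfulness(["GENE_A", "GENE_C"], gene_vecs)
label = best_page(["GENE_A", "GENE_B"], gene_vecs, page_vecs)
print(tight, loose, label)
```

A cluster would be labeled with a page title only when its score is
high; how high is a threshold the report leaves open.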
Publications
Papers
1. Michibata, H., Chiba, H., Wakimoto, K.,
Seishima, M., Kawasaki, S., Okubo, K., Mitsui, H.,
Torii, H. and Imai, Y. (2004). Identification and
characterization of a novel component of the
cornified envelope, cornifelin. Biochem Biophys Res
Commun., 318, 803-13.
2. Hishiki, T., Ogasawara, O., Tsuruoka, Y. and
Okubo, K. (2004). Indexing anatomical concepts to
OMIM Clinical Synopsis using the UMLS
Metathesaurus. In Silico Biol., 4, 31-54.
3. Chiba, H., Michibata, H., Wakimoto, K.,
Seishima, M., Kawasaki, S., Okubo, K., Mitsui, H.,
Torii, H. and Imai, Y. (2004). Cloning of a gene
for a novel epithelium-specific cytosolic
phospholipase A2, cPLA2delta, induced in psoriatic
skin. J Biol Chem., 279, 12890-7.
Books
4. Kaimori, J., Takenaka, M. and Okubo, K. (2004). Quantification
of Gene Expression in Mouse and Human Renal Proximal Tubules. In:
Laser Capture Microdissection: Methods and Protocols (Graeme I.
Murray and Stephanie Curran, eds.), Methods in Molecular Biology
293, Humana Press, 209-220.
5. Okubo, K. and Kawamoto, S. (2004). Genome language and the
metaphysics called ontology: evolution and medicine. Kagaku,
Vol. 74, No. 10, 1254-1257 (in Japanese).
ORAL PRESENTATIONS
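1. Okubo K. Automation of post-genome data interpretation. The 27th
Annual Meeting of the Molecular Biology Society of Japan (Kobe),
Dec. 2004 (in Japanese).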
2. Okubo K., MACHINE USE OF MEDICAL TEXTBOOKS FOR
ESTABLISHING KNOWLEDGE HANDLING ENVIRONMENT IN
FUNCTIONAL GENOMICS. The 5th HUGO Pacific Meeting
and 6th Asia-Pacific Conference on Human Genetics
(Singapore), Nov.2004.
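3. Okubo K. What can diversity research expect from the genome?
Symposium of the GaiaList 21 Planning Committee, “Taxonomy, Genome,
Language”. The 75th Annual Meeting of the Zoological Society of
Japan (Kobe), Sep. 2004 (in Japanese).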
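4. Ogasawara O., Kubota I. and Okubo K. Relationship between
expression similarity and structural similarity among genes.
Session theme: Aspects of biological evolution seen from genomes
and gene expression. The 6th Annual Meeting of the Society of
Evolutionary Studies, Japan (Tokyo), Aug. 2004 (in Japanese).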
5. Okubo K. Machine use of medical textbooks for establishing
knowledge handling environment in functional genomics. The Third
Workshop on Ontology and Genome: Development and Applications of
Ontologies on OMICS Research (Gottingen, Germany), July 2004.
6. Okubo K. Machine Interpretation of Genome-wide
Measurements for Integrative Biology. The Third
Waterfront Symposium of Human Genome Science,
Tokyo, Feb. 2004.
EDUCATION
1. Dr. Okubo gave a seminar on “The
computational interpretation of biological data and
its significance" at National Institute of
Genetics, Oct., 2004 (in Japanese).
2. Dr. Okubo gave a seminar on “How can we
(biologists) make good use of biological knowledge
available?" for DDBJing and TERAKOYA, Tokyo, Mar.,
2004 (in Japanese).
3. Dr. Okubo was invited to give a seminar on “Medical Science in
the post-genome period: knowledge engineering in medicine” for the
Department of Biomolecular Sciences, Saga Medical School, Faculty of
Medicine, Saga University, Mar., 2004 (in Japanese).
SOCIAL CONTRIBUTIONS AND OTHERS
1. BodyMap: http://bodymap.genes.nig.ac.jp/
2. Database: http://www.jbirc.aist.go.jp/hinv/h-angel/
3. Database: http://bodymap.ims.u-tokyo.ac.jp
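4. Co-organizer, JST Research Area Exploration Program “Genome and
Language”
5. Member, Liaison Committee on Genome Science, Science Council of
Japan
6. Member, Research Promotion Committee of the Protein Function
Analysis and Utilization Project, Japan Biological Informatics
Consortium (JBIC)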