BIOLOGY Outline Introduction Background Literature Methodology - - PowerPoint PPT Presentation

slide-1
SLIDE 1

TEXT MINING FOR BONE BIOLOGY

Andrew Hoblitzell, Snehasis Mukhopadhyay, Qian You, Shiaofen Fang, Yuni Xia, and Joseph Bidwell Indiana University Purdue University Indianapolis

slide-2
SLIDE 2

Outline

  • Introduction
  • Background Literature
  • Methodology
  • Results and Discussion
  • Conclusion
slide-3
SLIDE 3

INTRODUCTION

slide-4
SLIDE 4

Introduction

  • Bone diseases affect tens of millions of people and include bone cysts, osteoarthritis, fibrous dysplasia, and osteoporosis, among others.
  • Osteoporosis affects an estimated 75 million people in Europe, the USA, and Japan, with 10 million people suffering from osteoporosis in the United States alone.

slide-5
SLIDE 5

Introduction

  • Goal: The extraction and visualization of relationships between biological entities related to bone biology appearing in biological databases
  • Benefit: Keep biologists up to date on the research and also possibly uncover new relationships among biological entities.

slide-6
SLIDE 6

Key Terms

  • Bioinformatics: the application of information technology and computer science to the field of molecular biology
  • Text mining: allows for the extraction of knowledge contained in the text-based literature

slide-7
SLIDE 7

BACKGROUND LITERATURE

slide-8
SLIDE 8

Background Literature

  • Computer science is still a relatively young science, and text mining is an even younger subset of the science
  • Nonetheless, the field of text mining has developed very well and quite rapidly
  • In particular, its application to the biomedical domain has attracted considerable attention
  • The PubMed resource maintained by NIH has more than 20 million research articles, necessitating the development of automated analysis methods

slide-9
SLIDE 9

Some Relevant Background

  • “Complementary Literatures: A Stimulus to Scientific Discovery”
  • 1997 paper by Swanson et al.
  • Begins with a list of viruses that have potential for weapons development and presents findings meant to act as a guide to the virus literature to support further studies of defensive measures.
  • Initially promising results
slide-10
SLIDE 10

Background Literature

  • “Automatic Term Identification and Classification in Biology Texts”
  • 1999 paper by Collier et al.
  • Made use of a decision tree for classification and term candidate identification
  • Results indicated that while identifying term boundaries was non-trivial, a high success rate could eventually be obtained in term classification.

slide-11
SLIDE 11

Background Literature

  • “Accomplishments and challenges in literature data mining for biology”
  • 2002 paper by Hirschman et al.
  • Traces literature data mining from its recognition of protein interactions to its solutions for improving homology search, identifying cellular location, and more
  • Notes the field has progressed from simple term recognition to much more complex interactions between degrees of entities

slide-12
SLIDE 12

Background Literature

  • “Support tools for literature-based information access in molecular biology”
  • 2009 paper by Fabio Rinaldi and Dietrich Rebholz-Schuhmann
  • Paper shows different tools developed by the authors to support professional biologists in accessing information
  • High performance on gold standard data does not necessarily translate into high performance for database annotation

slide-13
SLIDE 13

Background Literature

  • “An application of bioinformatics and text mining to the discovery of novel genes related to bone biology”
  • 2007 paper by Gajendran, Lin, and Fyhrie
  • Reports the results of text mining for a bone biology pathway including SMAD genes
  • Proposed a ranking system for relevant genes based on text mining

slide-14
SLIDE 14

METHODOLOGY

slide-15
SLIDE 15

Extraction

  • To extract entity relationships from the biological literature, we examined flat relationships, which simply state there exists a relationship between two biological entities
  • A thesaurus-based text analysis approach is used to discover the existence of relationships

slide-16
SLIDE 16

Extraction

  • The document representation step next converts the downloaded text documents into data structures that can be processed without losing any meaningful information
  • The process uses a thesaurus, an array T of atomic tokens (or terms), each identified by a unique numeric identifier.
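
The document representation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the thesaurus is a plain list of strings whose list index serves as the numeric identifier, and that terms are matched as case-insensitive whole words. The function name `term_counts` is hypothetical.

```python
import re
from typing import List


def term_counts(documents: List[str], thesaurus: List[str]) -> List[List[int]]:
    """Return counts where counts[i][k] is the number of occurrences
    of thesaurus term k in document i."""
    counts = [[0] * len(thesaurus) for _ in documents]
    for i, text in enumerate(documents):
        lowered = text.lower()
        for k, term in enumerate(thesaurus):
            # Assumption: case-insensitive whole-word matching. A real
            # thesaurus would also need synonym and spelling-variant handling.
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            counts[i][k] = len(re.findall(pattern, lowered))
    return counts


docs = ["Nmp4 represses genes; NMP4 binds DNA.", "SMAD signalling"]
counts = term_counts(docs, ["nmp4", "smad"])
```

Each row of `counts` is the raw term-frequency vector for one document, which is the input the weighting step expects.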

slide-17
SLIDE 17

Tf*idf method

  • The tf*idf (term frequency multiplied by inverse document frequency) algorithm is applied to achieve a refined discrimination at the term representation level.
  • The inverse document frequency (idf) component acts as a weighting factor by taking into account inter-document term distribution.

slide-18
SLIDE 18

Normalized weighting

  • where Tik represents the number of occurrences of term Tk in document i, Ik = log(N/nk) provides the inverse document frequency of term Tk in the base of documents, N is the number of documents in the base of documents, and nk is the number of documents in the base that contain the given term Tk.
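
The weighting equation itself did not survive in this transcript. One common normalized tf*idf form that is consistent with the definitions above is shown here as an assumption, not as the authors' exact formula. The sum runs over all M thesaurus terms, and the square-root denominator scales each document vector to unit length:

```latex
W_{ik} = \frac{T_{ik}\,\log(N/n_k)}{\sqrt{\sum_{j=1}^{M}\left(T_{ij}\,\log(N/n_j)\right)^{2}}}
```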

slide-19
SLIDE 19

Weight vector

  • Each document di is converted to an M-dimensional vector Wi, where Wik denotes the weight of the kth gene or protein term in the document and M indicates the total number of terms in the thesaurus.
  • Wik will increase with the term frequency (Tik) and decrease with the total number of documents containing the given term in the collection (nk).
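
A minimal sketch of how such weight vectors might be computed from raw counts. It assumes idf = log(N/nk) as defined on the previous slide and, as a further assumption, normalizes each document vector to unit length. The function name `tfidf_weights` is hypothetical.

```python
import math
from typing import List


def tfidf_weights(counts: List[List[int]]) -> List[List[float]]:
    """Convert raw counts T[i][k] into unit-length tf*idf weights W[i][k]."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # n_k: number of documents that contain term k at least once.
    doc_freq = [sum(1 for row in counts if row[k] > 0) for k in range(n_terms)]
    # I_k = log(N / n_k); a term that never occurs gets idf 0 (assumption,
    # chosen to avoid division by zero).
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    weights = []
    for row in counts:
        raw = [t * i for t, i in zip(row, idf)]
        norm = math.sqrt(sum(w * w for w in raw))
        weights.append([w / norm if norm else 0.0 for w in raw])
    return weights


w = tfidf_weights([[2, 0], [1, 1], [0, 0]])
```

In the second document both terms occur once, but the second term appears in fewer documents overall, so it receives the larger weight.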

slide-20
SLIDE 20

Association matrix

  • The associations between entities k and l are computed using the following equation:
  • The association[k][l] will always be greater than or equal to zero. The relative values of association[k][l] will indicate the product of the importance of the kth and lth terms in each document
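
The equation is missing from this transcript. Based on the description above, a plausible reading is that association[k][l] sums, over all documents i, the product Wik * Wil. The sketch below implements that assumed form; the function name `association_matrix` is hypothetical.

```python
from typing import List


def association_matrix(weights: List[List[float]]) -> List[List[float]]:
    """Assumed form: assoc[k][l] = sum over documents i of W[i][k] * W[i][l]."""
    n_terms = len(weights[0]) if weights else 0
    assoc = [[0.0] * n_terms for _ in range(n_terms)]
    for row in weights:
        for k in range(n_terms):
            if row[k] == 0.0:
                continue  # term k is absent from this document
            for l in range(n_terms):
                assoc[k][l] += row[k] * row[l]
    return assoc


assoc = association_matrix([[1.0, 0.0, 0.5], [0.0, 2.0, 1.0]])
```

Because the weights are non-negative, every entry is non-negative, and two terms that never co-occur in a document get an association of zero.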

slide-21
SLIDE 21

Transitive text mining

  • The basic premise of transitive text mining is that if there are direct associations between objects A and B, as well as direct associations between objects B and C, then an association between A and C may be hypothesized even if the latter has not been explicitly seen in the literature.
  • Such transitive associations may be efficiently determined by computing the transitive closure of the association matrix

slide-22
SLIDE 22

Floyd-Warshall algorithm

  • The transitive closure of a binary relation R on a set X is the smallest transitive relation on X that contains R
  • The Floyd-Warshall algorithm may be used to find the transitive closure
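
A minimal sketch of the boolean (Warshall) form of the algorithm. It assumes the association matrix has already been reduced to a true/false adjacency matrix, for example by treating any non-zero association as a direct link. The function name `transitive_closure` is hypothetical.

```python
from typing import List


def transitive_closure(adjacent: List[List[bool]]) -> List[List[bool]]:
    """Return reach where reach[a][c] is True if c can be reached from a
    through one or more direct links."""
    n = len(adjacent)
    reach = [row[:] for row in adjacent]
    for mid in range(n):
        for src in range(n):
            if not reach[src][mid]:
                continue
            for dst in range(n):
                # src -> mid and mid -> dst together imply src -> dst.
                if reach[mid][dst]:
                    reach[src][dst] = True
    return reach


# A -> B and B -> C are direct links; A -> C is only implied.
links = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(links)
```

Here `closure[0][2]` is true even though `links[0][2]` is not, which corresponds to the hypothesized A to C association.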

slide-23
SLIDE 23

Separation of evidence principle

  • Evidence (i.e., a part of the capacities) once used along a transitive path may not be used again along another transitive path in defining the confidence measure of a transitive association.
  • This will allow us to find association strength using a flow model

slide-24
SLIDE 24

Maximum flow

  • The maximum flow problem can be seen as a special case of the circulation problem
  • The Edmonds-Karp algorithm is applied for each transitive association (a,b), to find the maximum flow through the graph
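
A minimal sketch of Edmonds-Karp on a dense capacity matrix. It assumes that association strengths serve as edge capacities, with a as the source and b as the sink; how the authors actually built the capacities is not stated here. The function name `edmonds_karp` is hypothetical.

```python
from collections import deque
from typing import List


def edmonds_karp(capacity: List[List[float]], source: int, sink: int) -> float:
    """Maximum flow from source to sink using shortest augmenting paths."""
    if source == sink:
        raise ValueError("source and sink must differ")
    n = len(capacity)
    residual = [row[:] for row in capacity]
    total = 0.0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path remains
        # Bottleneck capacity along the path that was found.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Consume forward capacity and add reverse capacity, so capacity
        # used on one path is not available to another.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


capacity = [
    [0.0, 3.0, 2.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
]
flow = edmonds_karp(capacity, 0, 3)
```

The residual update is what enforces the separation of evidence principle: once part of an edge's capacity carries flow on one path, only the remainder can contribute to other paths.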

slide-25
SLIDE 25

RESULTS AND DISCUSSION

slide-26
SLIDE 26

Results and Discussion

  • To test our search strategy we chose to explore potential novel relationships between NMP4/CIZ (nuclear matrix protein 4/cas interacting zinc finger protein; hereafter referred to as Nmp4 for clarity) and proteins that may interact with this signalling pathway.
  • Nmp4 is a nuclear matrix architectural transcription factor that represses genes that support the osteoblast phenotype

slide-27
SLIDE 27

Terms used

  • A summary of the terms used is presented in the following legend:

slide-28
SLIDE 28

Direct Association Matrix

  • The following direct association matrix was generated:

slide-29
SLIDE 29

Transitive matrix

  • Transitive closure and the Edmonds-Karp algorithm provided the following results:

slide-30
SLIDE 30

Normalization

  • The direct association matrix was then normalized. A thresholding value of 152.1 was then obtained and used for examining and analyzing the data.
  • The MNF matrix was then normalized. A thresholding value of 7000.2 was obtained from inspection of the scores.
  • The normalized data were used to generate heat maps.

slide-31
SLIDE 31

Direct Association Heat Map

slide-32
SLIDE 32

MNF Heat Map

slide-33
SLIDE 33

Expert Heat Map

slide-34
SLIDE 34

Error computation

  • The results were then compared against expert-provided scores. The average error was then computed as follows:
  • ∑|Expert(l,k)-Predicted(l,k)|/Nr
  • where Expert(l,k) is the expert-provided score of a relationship between entities l and k, Predicted(l,k) is the predicted score of a given relationship between entities l and k, l is one entity, k is another entity, and Nr is the total number of relations.
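
The average error above is a mean absolute difference, which can be sketched as follows. The dictionary layout (entity pairs as keys) and the function name `average_error` are assumptions made for illustration, and the scores in the example are invented, not taken from the study.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def average_error(expert: Dict[Pair, float], predicted: Dict[Pair, float]) -> float:
    """Sum of |Expert(l,k) - Predicted(l,k)| over all relations, divided by Nr."""
    if not expert:
        raise ValueError("no relations to score")
    total = sum(abs(score - predicted[pair]) for pair, score in expert.items())
    return total / len(expert)


# Illustrative scores only.
expert_scores = {("a", "b"): 1.0, ("a", "c"): 0.0}
predicted_scores = {("a", "b"): 0.5, ("a", "c"): 0.25}
error = average_error(expert_scores, predicted_scores)
```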

slide-35
SLIDE 35

Error results

  • Using random guessing, a random average error rate of 0.58 was obtained
  • Using the corresponding direct association matrix, an error rate of 0.35 was obtained.
  • Using the maximum network flow method, an error rate of 0.24 was obtained.
  • Application of the maximum flow algorithm to this problem reduces the average error relative to both random guessing and the direct association matrix

slide-36
SLIDE 36

CONCLUSION

slide-37
SLIDE 37

Conclusion

  • The biological literature is a huge and constantly increasing source of information which the biologist may consult for information about their field, but the vast amount of data can sometimes become overwhelming
  • Text mining, a solution to this problem, has seen a great amount of development

slide-38
SLIDE 38

Conclusion

  • The aim was to present a method which uses MNF to determine a confidence score for the derived transitive associations
  • A specific pathway in bone biology consisting of a number of important proteins was subjected to the text mining approach
  • A significantly higher agreement with an expert’s knowledge can be obtained with transitive mining than with only direct associations.

slide-39
SLIDE 39

Extension: Hypergraphs

  • A hypergraph is a generalization of a graph, where edges can connect any number of vertices
  • Numerous problems have been studied on hypergraphs including transitive closure, transitive reduction, flow and cut problems, and minimum weight traversal problems
  • This could offer improved accuracy
slide-40
SLIDE 40

Other Future Work

  • Causal Model Development: A systematic procedure for constructing causality models from text mining knowledge could also be developed using Bayesian networks.
  • Biomedical Knowledge Visualization: A visualization environment would assist biologists in understanding the data. It would also aid in the knowledge discovery and the hypothesis generation process.