BIOLOGY Outline Introduction Background Literature Methodology - PowerPoint PPT Presentation

Andrew Hoblitzell, Snehasis Mukhopadhyay, Qian You, Shiaofen Fang, Yuni Xia, and Joseph Bidwell Indiana University Purdue University Indianapolis TEXT MINING FOR BONE BIOLOGY

Outline  Introduction  Background Literature  Methodology  Results and Discussion  Conclusion

INTRODUCTION

Introduction  Bone diseases affect tens of millions of people and include bone cysts, osteoarthritis, fibrous dysplasia, and osteoporosis, among others.  Osteoporosis affects an estimated 75 million people in Europe, the USA, and Japan, including 10 million people in the United States alone.

Introduction  Goal: The extraction and visualization of relationships between biological entities related to bone biology appearing in biological databases  Benefit: Keep biologists up to date on the research and also possibly uncover new relationships among biological entities.

Key Terms  Bioinformatics: the application of information technology and computer science to the field of molecular biology  Text mining: allows for the extraction of knowledge contained in the text-based literature

BACKGROUND LITERATURE

Background Literature  Computer Science is still a relatively young science, and text mining is an even younger subset of the science  Nonetheless, the field of text mining has developed very well and quite rapidly  In particular, its application to the biomedical domain has attracted considerable attention  The PubMed resource maintained by NIH has more than 20 million research articles, necessitating the development of automated analysis methods

Some Relevant Background  “Complementary Literatures: A Stimulus to Scientific Discovery”  1997 paper by Swanson et al.  Begins with a list of viruses that have potential for development as weapons and presents findings meant to act as a guide to the virus literature to support further studies of defensive measures.  Initially promising results

Background Literature  “Automatic Term Identification and Classification in Biology Texts”  1999 paper by Collier et al.  Made use of a decision tree for classification and term candidate identification  Results indicated that while identifying term boundaries was non-trivial, a high success rate could eventually be obtained in term classification.

Background Literature  “Accomplishments and challenges in literature data mining for biology”  2002 paper by Hirschman et al.  Traces literature data mining from its recognition of protein interactions to its solutions for improving homology search, identifying cellular location, and more  Notes that the field has progressed from simple term recognition to much more complex interactions between entities

Background Literature  “Support tools for literature-based information access in molecular biology”  2009 paper by Fabio Rinaldi and Dietrich Rebholz-Schuhmann  Paper shows different tools developed by the authors to support professional biologists in accessing information  High performance on gold standard data does not necessarily translate into high performance for database annotation

Background Literature  “An application of bioinformatics and text mining to the discovery of novel genes related to bone biology”  2007 paper by Gajendran, Lin, and Fyhrie  Reports the results of text mining for a bone biology pathway including SMAD genes  Proposed a ranking system for relevant genes based on text mining

METHODOLOGY

Extraction  To extract entity relationships from the biological literature, we examined flat relationships, which simply state that a relationship exists between two biological entities  A thesaurus-based text analysis approach is used to discover the existence of relationships

Extraction  The document representation step converts the downloaded text documents into data structures that can be processed without the loss of any meaningful information  The process uses a thesaurus, an array T of atomic tokens (or terms), each identified by a unique numeric identifier.
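The document representation step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the thesaurus entries, the tokenizer regex, and the function name `term_counts` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical thesaurus: term -> unique numeric identifier.
# The thesaurus used in the talk is not shown; these entries are placeholders.
THESAURUS = {"nmp4": 0, "smad2": 1, "smad3": 2}


def term_counts(text, thesaurus=THESAURUS):
    """Return {term_id: occurrences} for the thesaurus terms found in text."""
    # Lowercase and split on non-alphanumeric characters (a simplifying assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(thesaurus[t] for t in tokens if t in thesaurus))


if __name__ == "__main__":
    print(term_counts("Nmp4 interacts with Smad2; NMP4 may also bind Smad3."))
```

Matching whole tokens only is the simplest choice; multi-word terms and synonyms would need a richer lookup than this sketch provides.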

Tf*idf method  The tf*idf (term frequency multiplied by inverse document frequency) algorithm is applied to achieve a refined discrimination at the term representation level.  The inverse document frequency (idf) component acts as a weighting factor by taking into account inter-document term distribution.

Normalized weighting  W_ik = T_ik × I_k,  where T_ik represents the number of occurrences of term T_k in document i, I_k = log(N/n_k) provides the inverse document frequency of term T_k in the base of documents, N is the number of documents in the base of documents, and n_k is the number of documents in the base that contain the given term T_k.

Weight vector  Each document d_i is converted to an M-dimensional vector W_i = (W_i1, …, W_iM), where W_ik denotes the weight of the k-th gene or protein term in the document and M indicates the total number of terms in the thesaurus.  W_ik will increase with the term frequency (T_ik) and decrease with the total number of documents containing the given term in the collection (n_k).
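A small sketch of how such weight vectors might be computed. It assumes the weight is the plain product of term frequency and log(N/n_k), which matches the definitions given here; any further normalization the authors applied is not recoverable from the slides and is not modeled. The function name `tfidf_weights` is hypothetical.

```python
import math


def tfidf_weights(counts):
    """counts[i][k] is T_ik, the occurrences of term k in document i.

    Returns W where W[i][k] = T_ik * log(N / n_k) (assumed form).
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # n_k: number of documents that contain term k at least once.
    doc_freq = [sum(1 for row in counts if row[k] > 0) for k in range(n_terms)]
    # Terms that occur in no document get an idf of 0 to avoid dividing by zero.
    idf = [math.log(n_docs / nk) if nk else 0.0 for nk in doc_freq]
    return [[row[k] * idf[k] for k in range(n_terms)] for row in counts]


if __name__ == "__main__":
    for row in tfidf_weights([[2, 0, 1], [0, 1, 1], [1, 0, 1]]):
        print(row)
```

Note that a term appearing in every document receives weight 0 under this form, since log(N/N) = 0.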

Association matrix  The associations between entities k and l are computed using the following equation:  association[k][l] = ∑_i W_ik × W_il, summed over all N documents i  The association[k][l] will always be greater than or equal to zero.  The relative values of association[k][l] reflect the product of the importance of the k-th and l-th terms in each document, accumulated over the collection
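As an illustration, the association matrix could be built as below. The sketch assumes the association is the sum over documents of the product of the two term weights, which is consistent with the description here but is a reconstruction; `association_matrix` is a hypothetical name.

```python
def association_matrix(weights):
    """weights[i][k] is W_ik. Returns assoc[k][l] = sum over i of W_ik * W_il."""
    n_terms = len(weights[0])
    assoc = [[0.0] * n_terms for _ in range(n_terms)]
    for row in weights:
        for k in range(n_terms):
            if row[k] == 0:
                continue  # term k is absent from this document, nothing to add
            for l in range(n_terms):
                assoc[k][l] += row[k] * row[l]
    return assoc


if __name__ == "__main__":
    for row in association_matrix([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]):
        print(row)
```

Because the weights are non-negative, every entry is non-negative, and the matrix is symmetric by construction.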

Transitive text mining  The basic premise of transitive text mining is that if there are direct associations between objects A and B, as well as direct associations between objects B and C, then an association between A and C may be hypothesized even if the latter has not been explicitly seen in the literature.  Such transitive associations may be efficiently determined by computing the transitive closure of the association matrix

Floyd-Warshall algorithm  The transitive closure of a binary relation R on a set X is the smallest transitive relation on X that contains R  The Floyd-Warshall algorithm may be used to find the transitive closure
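A minimal sketch of the closure step, using the Boolean (Warshall) form of the algorithm. How the authors turned association scores into a binary relation is not stated; the `binarize` helper and its threshold are assumptions made for the example.

```python
def binarize(assoc, threshold):
    """True where the association between two distinct terms exceeds threshold."""
    n = len(assoc)
    return [[k != l and assoc[k][l] > threshold for l in range(n)] for k in range(n)]


def transitive_closure(adj):
    """Warshall's algorithm: reach[i][j] is True if j is reachable from i."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):  # allow paths that pass through intermediate node k
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


if __name__ == "__main__":
    assoc = [[5, 2, 0, 0], [2, 4, 3, 0], [0, 3, 6, 0], [0, 0, 0, 1]]
    direct = binarize(assoc, threshold=1)
    closure = transitive_closure(direct)
    # Terms 0 and 2 are never directly associated but are linked through term 1.
    print(direct[0][2], closure[0][2])
```

The closure only says whether a transitive link exists; it assigns no strength to it, which is what motivates the flow-based confidence measure.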

Separation of evidence principle  Evidence (i.e., a part of the capacities) once used along a transitive path may not be used again along another transitive path in defining the confidence measure of a transitive association.  This will allow us to find association strength using a flow model

Maximum flow  The maximum flow problem can be seen as a special case of the circulation problem  The Edmonds-Karp algorithm is applied for each transitive association (a,b) to find the maximum flow through the graph from a to b
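The flow computation can be sketched with a compact Edmonds-Karp implementation (shortest augmenting paths found by breadth-first search). Treating the association scores as edge capacities is an assumption for the example; the slides do not say how capacities were assigned.

```python
from collections import deque


def edmonds_karp(capacity, source, sink):
    """Maximum flow from source to sink; capacity[u][v] is the capacity of edge u->v."""
    if source == sink:
        raise ValueError("source and sink must differ")
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return flow  # no augmenting path remains
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount of flow along the path.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


if __name__ == "__main__":
    # Symmetric capacities: A-B = 4, B-C = 3, no direct A-C association.
    scores = [[0, 4, 0], [4, 0, 3], [0, 3, 0]]
    print(edmonds_karp(scores, 0, 2))
```

Each unit of capacity is consumed by at most one augmenting path, which is one way to read the separation of evidence principle: evidence used along one transitive path is not counted again along another.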

RESULTS AND DISCUSSION

Results and Discussion  To test our search strategy we chose to explore potential novel relationships between NMP4/CIZ (nuclear matrix protein 4/cas interacting zinc finger protein; hereafter referred to as Nmp4 for clarity) and proteins that may interact with this signalling pathway.  Nmp4 is a nuclear matrix architectural transcription factor that represses genes that support the osteoblast phenotype

Terms used  A summary of the terms used is presented in the following legend:

Direct Association Matrix  The following direct association matrix was generated:

Transitive matrix  Transitive closure and the Edmonds-Karp algorithm provided the following results:

Normalization  The direct association matrix was then normalized. A thresholding value of 152.1 was then obtained and used for examining and analyzing the data.  The MNF (maximum network flow) matrix was then normalized. A thresholding value of 7000.2 was obtained from inspection of the scores.  The normalized data were used to generate heat maps.

Direct Association Heat Map

MNF Heat Map

Expert Heat Map

Error computation  The results were then compared against expert-provided scores. The average error was then computed as follows:  ∑ |Expert(l,k) − Predicted(l,k)| / N_r  where Expert(l,k) is the expert-provided score of the relationship between entities l and k, Predicted(l,k) is the predicted score of that relationship, and N_r is the total number of relations.
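A short sketch of the error measure. It assumes both score matrices are symmetric and on the same scale, and that N_r counts each unordered pair of distinct entities once; the slides do not spell out either point, and `average_error` is a hypothetical name.

```python
def average_error(expert, predicted):
    """Mean absolute difference between expert and predicted scores.

    Each unordered pair (l, k) with l < k counts as one relation (an assumption).
    """
    n = len(expert)
    pairs = [(l, k) for l in range(n) for k in range(l + 1, n)]
    total = sum(abs(expert[l][k] - predicted[l][k]) for l, k in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    expert = [[0, 1.0, 0.5], [1.0, 0, 0.0], [0.5, 0.0, 0]]
    predicted = [[0, 0.5, 0.5], [0.5, 0, 1.0], [0.5, 1.0, 0]]
    print(average_error(expert, predicted))
```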

Error results  Using random guessing, an average error rate of 0.58 was obtained.  Using the corresponding direct association matrix, an error rate of 0.35 was obtained.  Using the maximum network flow method, an error rate of 0.24 was obtained.  Application of the maximum flow algorithm to this problem therefore lowers the error relative to both random guessing and direct associations

CONCLUSION

Conclusion  The biological literature is a huge and constantly growing source of information that biologists may consult about their field, but the vast amount of data can become overwhelming  Text mining, a solution to this problem, has seen a great amount of development

Conclusion  The aim was to present a method which uses MNF (maximum network flow) to determine a confidence score for the derived transitive associations  A specific pathway in bone biology consisting of a number of important proteins was subjected to the text mining approach  Transitive mining agreed more closely with an expert’s knowledge (average error 0.24) than direct associations alone (average error 0.35).
