The OntoGene system: an advanced information extraction application - - PowerPoint PPT Presentation



slide-1
SLIDE 1

The OntoGene system: an advanced information extraction application for biological literature

www.ontogene.org Fabio Rinaldi

slide-2
SLIDE 2

Outline

• Motivation, brief history
• OntoGene approach
• Evaluation (shared tasks)
• SASEBio: from text mining to interactive curation
• Recent developments
  • PharmGKB
  • CTD
  • BioTermEvo (Gintare)

slide-3
SLIDE 3

Motivations and History

• Motivation: prove that NLP technologies are mature enough for real-world applications
• Target: biomedical text mining
  • Richness of terminological resources (grounding!)
  • Large text DBs: potential interest from the bio community
• Goal: help organize the knowledge space of the biomedical sciences.
• Started in late 2004 with applications combining terminology structuring and dependency parsing.

slide-4
SLIDE 4
slide-5
SLIDE 5

OG-RM

slide-6
SLIDE 6

GENIA

slide-7
SLIDE 7

References

• Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, Martin Romacker. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics 2006, 7(Suppl 3):S3. doi:10.1186/1471-2105-7-S3-S3

slide-8
SLIDE 8

BC II (2006): approach

• Annotate entities using reference DBs as source
• Disambiguate proteins according to ORG distribution
• Give each ID a score according to frequency and position
• Combine IDs in the same syntactic span
• Use manually constructed syntactic patterns to filter out unlikely pairs
• Use novel/background filter to identify sentences likely to convey the 'core' message
• Results: 3rd best
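The ID scoring and pairing steps listed above can be illustrated with a minimal sketch. The slide only says that IDs are scored "according to freq and position", so the scoring formula here is an assumption, as is the use of a whole sentence as the "syntactic span". The names `score_ids` and `rank_pairs` and the IDs `P1` to `P3` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def score_ids(mentions, n_sentences):
    """Score each entity ID by how often and how early it is mentioned.

    mentions: list of (sentence_index, entity_id) tuples.
    n_sentences: number of sentences in the document (must be >= 1).
    Assumption: each mention counts 1.0 plus a bonus that shrinks
    linearly with the sentence position.
    """
    scores = defaultdict(float)
    for sent_idx, entity_id in mentions:
        position_weight = 1.0 - sent_idx / n_sentences
        scores[entity_id] += 1.0 + position_weight
    return dict(scores)


def rank_pairs(mentions, scores):
    """Pair up IDs that share a sentence and rank pairs by summed ID score."""
    by_sentence = defaultdict(set)
    for sent_idx, entity_id in mentions:
        by_sentence[sent_idx].add(entity_id)
    pairs = set()
    for ids in by_sentence.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs, key=lambda p: scores[p[0]] + scores[p[1]], reverse=True)


# Toy input: three sentences, three hypothetical protein IDs.
mentions = [(0, "P1"), (0, "P2"), (1, "P1"), (1, "P3"), (2, "P3")]
scores = score_ids(mentions, n_sentences=3)
ranked = rank_pairs(mentions, scores)
```

The syntactic-pattern and novelty filters from the slide would then be applied to `ranked`; they are not modelled here.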

slide-9
SLIDE 9

First SNF project

• “Detection of Biological Interactions from Biomedical Literature” (SNF 100014-118396/1)
• Funding: SNF and Novartis
• Duration: 18 months (April 2008 – October 2009)
• Main focus: IntAct database
  • Experimental methods (SMBM 2008)
  • Organisms (BioNLP 2009)
  • Entities (AIME 2009)
  • Interactions (CICLING 2009)

slide-10
SLIDE 10
slide-11
SLIDE 11

IntAct snippets

slide-12
SLIDE 12

Syntactic Filters

slide-13
SLIDE 13

PPI in BC II.5 (2009)

• All candidate pairs in a sentence are considered
  • Entity recognition and disambiguation learnt from IntAct
  • One semi-automated submission (ORG selection)
• Candidate pairs are scored according to:
  • Pair salience; zoning; novelty score; known interaction; syntactic paths
• Syntax: now using learning to derive syntactic patterns from a manually annotated corpus
• Results: best according to “raw” AUC iP/R
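One way to picture how the five scoring criteria might be combined is a weighted sum, sketched below. The slide does not say how the scores were actually combined, so the linear combination, the weights, and the names `PairFeatures`, `pair_score` and `rank_pairs` are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PairFeatures:
    salience: float  # prominence of both entities in the article
    zoning: float    # weight of the text zone where the pair occurs
    novelty: float   # novelty score of the sentence
    known: float     # 1.0 if the pair is an already known interaction
    syntax: float    # strength of the syntactic path between mentions


# Hypothetical weights, chosen only so that they sum to 1.
WEIGHTS = PairFeatures(salience=0.3, zoning=0.15, novelty=0.15, known=0.1, syntax=0.3)


def pair_score(features: PairFeatures) -> float:
    """Weighted sum of the feature values (an assumed combination rule)."""
    return sum(
        getattr(WEIGHTS, f.name) * getattr(features, f.name)
        for f in fields(PairFeatures)
    )


def rank_pairs(candidates: dict) -> list:
    """Return candidate pairs sorted from highest to lowest score."""
    return sorted(candidates, key=lambda pair: pair_score(candidates[pair]), reverse=True)


# Toy candidates with made-up feature values.
candidates = {
    ("A", "C"): PairFeatures(0.2, 0.0, 0.1, 0.0, 0.0),
    ("A", "B"): PairFeatures(0.9, 1.0, 0.8, 1.0, 0.9),
}
ranking = rank_pairs(candidates)
```

Because the shared task is evaluated on ranked output, what matters in such a scheme is the relative order of the pairs rather than the absolute score values.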

slide-14
SLIDE 14

Annotated Abstract

slide-15
SLIDE 15

Protein Interactions (IPS)

• Parse all positive sentences
• Apply lexico-syntactic patterns as filters
• Interactions which do not 'pass' a filter are discarded
• Results: P: 54.37%, R: 18.39%, F: 27.49%
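The reported F value can be related to P and R through the balanced F-score (harmonic mean). The sketch below assumes that this is the F measure meant on the slide; the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide, in percent.
print(f"{f_score(54.37, 18.39):.2f}")
```

The computed value falls within a few hundredths of the reported 27.49%; the small gap is presumably due to P and R being rounded before they were put on the slide.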

slide-16
SLIDE 16

Importance of ranking

MRR

MAP

AUC iP/R

TAP-k
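Two of the ranking measures named above, MRR and MAP, are means over queries of the per-query values sketched below. This is a generic textbook formulation rather than the official shared-task scorer, and AUC iP/R and TAP-k are not covered. The function names and the toy items are illustrative.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean(values):
    """Arithmetic mean; MRR and MAP average the per-query values."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# The same four results in two different orders.
relevant = {"b", "d"}
poor_order = ["a", "b", "c", "d"]
good_order = ["b", "d", "a", "c"]
```

Both orderings contain exactly the same results, so set-based precision and recall are identical, yet `poor_order` scores 0.5 on both measures while `good_order` scores 1.0. That difference is what ranking-aware metrics capture.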

slide-17
SLIDE 17

SASEBio

• Semi-Automated Semantic Enrichment of the Biomedical Literature
• Funding by SNF (grant 105315_130558/1) and Novartis
• Duration: 3 years
• Positions: 2 post-docs, 1 PhD
• Goals:
  • Improve our text mining technologies
  • Make the tools relevant to potential users

slide-18
SLIDE 18

SASEBio: activities so far

• CALBC: large-scale entity extraction
• BC III (2010): successful participation in all tasks
• PharmGKB assisted curation experiment
• Terminology evolution studies
• BC 2012: best overall results in “triage” task for CTD

slide-19
SLIDE 19

CALBC (2010)

• Large-scale entity extraction (900K abstracts)
• CALBC I: 3rd place for diseases (F: 84%) and species (F: 78%) against Silver Corpus I
• Best results for diseases and species against harmonized voting Silver Corpus II
• Challenges:
  • Processing large XML collections
  • Harmonizing annotations
  • Efficiency of the annotation process

slide-20
SLIDE 20

BioCreative III (2010)

• Good results in all tasks
  • GN: Gene Normalization
    • Middle-rank results
  • PPI-ACT: binary classification of PPI papers
    • Top-rank results
  • PPI-IMT: find experimental methods in papers
    • Top-rank results
  • IAT: experimental interactive task
    • Positive comments from curators about usability

slide-21
SLIDE 21

IAT: ODIN

slide-22
SLIDE 22

PharmGKB

• Provides manually annotated relationships between Drugs/Genes/Diseases (36557 as of Sep 30th, 2010)
• Annotation based on publications, pathways and RSIDs:
  • 26122 PMID
  • 5467 Pathway
  • 4968 RSID
• We consider only relationships derived from publications

slide-23
SLIDE 23

Approach

• Abstracts (5062) downloaded from PubMed
• Used the OG pipeline for entity annotation. Only terms derived from PharmGKB (Drugs: 30351 terms / 2986 ids, Diseases: 28633 terms / 3198 ids, Genes: 176366 terms / 28633 ids)
• Candidate interactions generated according to a set of different criteria (co-occurrence, syntax, ME)
• Comparison against “gold standard” using the BioCreative II.5 PPI scorer
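The simplest of the criteria above, co-occurrence, and the comparison against a gold standard can be sketched as follows. The sketch assumes sentence-level co-occurrence and plain set-based precision and recall; it is a stand-in for illustration and does not reproduce the BioCreative II.5 PPI scorer. The function names, the PMID `"1"` and the entity IDs are hypothetical.

```python
from itertools import combinations


def cooccurrence_candidates(annotations):
    """Propose every pair of entity IDs that share a sentence.

    annotations: {pmid: [set of entity IDs per sentence, ...]}
    Returns a set of (pmid, id_a, id_b) with id_a < id_b.
    """
    candidates = set()
    for pmid, sentences in annotations.items():
        for ids in sentences:
            for a, b in combinations(sorted(ids), 2):
                candidates.add((pmid, a, b))
    return candidates


def precision_recall(candidates, gold):
    """Set-based precision and recall of candidates against gold."""
    true_positives = len(candidates & gold)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: one abstract with two annotated sentences.
annotations = {"1": [{"G1", "D1"}, {"G1", "X1", "D1"}]}
gold = {("1", "D1", "G1"), ("1", "D9", "G2")}

candidates = cooccurrence_candidates(annotations)
precision, recall = precision_recall(candidates, gold)
```

Co-occurrence proposes every possible pair, so it tends to trade precision for recall; syntax-based criteria filter the same candidate set more strictly.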

slide-24
SLIDE 24

Creating a gold standard

• The manually annotated interactions can be used to generate a gold standard
  • 10597 Gene/Drug
  • 9415 Gene/Disease
  • 4202 Drug/Disease
  • 928 Gene/Gene
  • 742 Drug/Drug
  • 238 Disease/Disease
• Total: 26122 interactions (24958 without duplicates)
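As a quick consistency check, the per-type counts can be summed and compared with the reported totals. All figures below are taken from the slide; only the variable names are introduced here.

```python
# Interaction counts per entity-type pair, as listed on the slide.
pair_counts = {
    "Gene/Drug": 10597,
    "Gene/Disease": 9415,
    "Drug/Disease": 4202,
    "Gene/Gene": 928,
    "Drug/Drug": 742,
    "Disease/Disease": 238,
}

total = sum(pair_counts.values())
unique = 24958  # "without duplicates", as reported on the slide
duplicates = total - unique

print(total, duplicates, f"{duplicates / total:.1%}")
```

The six counts add up to the reported 26122, which leaves 1164 duplicate interactions, a little under 5% of the total.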

slide-25
SLIDE 25

Syntax-based approach

“The neuronal nicotinic acetylcholine receptor alpha7 (nAChR alpha7) may be involved in cognitive deficits in Schizophrenia and Alzheimer's disease.” [15695160]

slide-26
SLIDE 26

Computed Interactions

slide-27
SLIDE 27

Computed Interactions

P = 30%, R = 28%, AUC = 22%

P = 7%, R = 66%, AUC = 28%

slide-28
SLIDE 28

Interactive curation

slide-29
SLIDE 29

Interactive curation

slide-30
SLIDE 30

BioCreative 2012

• Best overall results in Task 1 (triage for the Comparative Toxicogenomics Database)
• Best entity recognition for diseases and chemicals

slide-31
SLIDE 31

Terminology evolution

• Goal: investigate appearance, disappearance and replacement of biomedical terminology over time
• Quality terminology is essential for text mining
• Experiments with PharmGKB/CTD/UMLS as reference terminology (diseases)
• Using PubMed abstracts as reference collection
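A minimal sketch of how term appearance and disappearance might be tracked over a dated collection is given below. It assumes naive case-insensitive substring matching and counts abstracts per year; the slides do not describe the actual method. The function names are hypothetical and the three "abstracts" are invented toy data, not PubMed records.

```python
from collections import Counter


def term_timeline(abstracts, terms):
    """Count, per term and year, how many abstracts mention the term.

    abstracts: iterable of (year, text) tuples.
    terms: iterable of term strings from a reference terminology.
    """
    timeline = {term: Counter() for term in terms}
    for year, text in abstracts:
        lowered = text.lower()
        for term in timeline:
            if term.lower() in lowered:
                timeline[term][year] += 1
    return timeline


def first_and_last_year(timeline):
    """First and last year in which each term is attested."""
    return {
        term: (min(years), max(years))
        for term, years in timeline.items()
        if years
    }


# Invented toy collection illustrating one term giving way to another.
toy_abstracts = [
    (1960, "A study of mongolism in children."),
    (1975, "Down syndrome, formerly called mongolism."),
    (1990, "A Down syndrome cohort."),
]
timeline = term_timeline(toy_abstracts, ["mongolism", "Down syndrome"])
spans = first_and_last_year(timeline)
```

Overlapping attestation spans of two terms for the same concept are one possible signal of replacement; a real study would also need to normalise for the growth of the collection over time.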

slide-32
SLIDE 32

Term replacement?

slide-33
SLIDE 33

Summary

• Goal: Develop innovative text mining technologies for the automatic extraction of information from the biomedical literature [application: assisted curation].
• OntoGene/SASEBio provide competitive text mining technologies (BC, CALBC prove quality)
• ODIN as a tool for text-mining supported interactive curation of the biomedical literature
• PharmGKB/CTD experiments provide case study
• Terminology studies

slide-34
SLIDE 34

OntoGene highlights

• [2006] BioCreative II: PPI (3rd), IMT (best)
• [2009] BioCreative II.5 PPI (best results); BioNLP
• [2010] BioCreative III: ACT, IMT, IAT
• [2011] CALBC (large scale entity extraction), BioNLP
• [2012] PharmGKB/CTD assisted curation experiments
• 60 peer-reviewed publications, 17 journal papers

http://www.ontogene.org/

slide-35
SLIDE 35

Acknowledgments

• Institute of Computational Linguistics UZH
  • Gerold Schneider (parsing, rel. extr., IMT, BioNLP)
  • Simon Clematide (ODIN, GN, ACT, CALBC)
  • Kaarel Kaljurand (pipeline, ODIN, BioNLP)
  • Gintare Grigonyte (Term evol.), Tilia Ellendorff
• NIBR-IT, Text Mining Services, Novartis
  • Therese Vachon, Martin Romacker
• Swiss National Science Foundation