SLIDE 1

The OntoGene system: an advanced information extraction application for biological literature

Fabio Rinaldi
www.ontogene.org

SLIDE 2

Outline

• Motivation, brief history
• OntoGene approach
• Evaluation (shared tasks)
• SASEBio: from text mining to interactive curation
• Recent developments
  • PharmGKB
  • CTD
  • BioTermEvo (Gintare)

SLIDE 3

Motivations and History

• Motivation: prove that NLP technologies are mature enough for real-world applications
• Target: biomedical text mining
  • Richness of terminological resources (grounding!)
  • Large text databases; potential interest from the biology community
• Goal: help organize the knowledge space of the biomedical sciences
• Started in late 2004 with applications combining terminology structuring and dependency parsing

SLIDE 4

SLIDE 5

OG-RM

SLIDE 6

GENIA

SLIDE 7

References

• Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, Martin Romacker. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics 2006, 7(Suppl 3):S3. doi:10.1186/1471-2105-7-S3-S3

SLIDE 8

SLIDE 9

BC II (2006): approach

• Annotate entities using reference DBs as source
• Disambiguate proteins according to organism distribution
• Give each ID a score according to frequency and position (see the sketch below)
• Combine IDs in the same syntactic span
• Use manually constructed syntactic patterns to filter out unlikely pairs
• Use a novel/background filter to identify sentences likely to convey the 'core' message
• Results: 3rd best
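A minimal sketch of the frequency-and-position scoring idea; the slide does not give the actual OntoGene weighting, so the 1/(1+index) position weight and the normalization here are assumptions:

```python
from collections import defaultdict

def score_identifiers(mentions):
    """Score candidate DB identifiers by mention frequency and position.
    mentions: (identifier, sentence_index) pairs; sentence 0 is the title.
    The 1/(1+index) position weight is an illustrative assumption."""
    scores = defaultdict(float)
    for ident, sent_idx in mentions:
        scores[ident] += 1.0 / (1 + sent_idx)  # earlier mentions count more
    total = sum(scores.values()) or 1.0        # normalize per document
    return {ident: s / total for ident, s in scores.items()}

# "P1" appears in the title and sentence 3; "P2" only in sentence 8.
print(score_identifiers([("P1", 0), ("P1", 3), ("P2", 8)]))
```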

SLIDE 10

OG pipeline

SLIDE 11

First SNF project

• “Detection of Biological Interactions from Biomedical Literature” (SNF 100014-118396/1)
• Funding: SNF and Novartis
• Duration: 18 months (April 2008 – October 2009)
• Main focus: IntAct database
  • Experimental methods (SMBM 2008)
  • Organisms (BioNLP 2009)
  • Entities (AIME 2009)
  • Interactions (CICLING 2009)

SLIDE 12

SLIDE 13

IntAct snippets

SLIDE 14

Syntactic Filters

SLIDE 15

PPI in BC II.5 (2009)

• All candidate pairs in a sentence are considered
• Entity recognition and disambiguation learnt from IntAct
• One semi-automated submission (organism selection)
• Candidate pairs are scored according to: pair salience, zoning, novelty score, known interaction, syntactic paths (see the sketch below)
• Syntax: now using machine learning to derive syntactic patterns from a manually annotated corpus
• Results: best according to “raw” AUC iP/R
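A sketch of how the listed evidence sources could be combined into one ranking score; the slide names the features but not the combination, so the linear weighting below is an assumption:

```python
def pair_score(features, weights=None):
    """Combine per-pair evidence into a single ranking score.
    features: dict with values in [0, 1] for each evidence source."""
    weights = weights or {
        "salience": 1.0,   # prominence of the two entities
        "zoning": 0.5,     # article zone (title, abstract, ...)
        "novelty": 0.5,    # novel vs. background sentence
        "known": 0.3,      # pair already present in a reference DB
        "syntax": 1.5,     # supported by a learnt syntactic path
    }
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

print(pair_score({"salience": 0.8, "syntax": 1.0, "zoning": 0.4}))  # 2.5
```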

SLIDE 16

Annotated Abstract

SLIDE 17

Protein Interactions (IPS)

• Parse all positive sentences
• Apply lexico-syntactic patterns as filters (see the sketch below)
• Interactions which do not 'pass' a filter are discarded
• Results: P: 54.37%, R: 18.39%, F: 27.49%
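A toy illustration of a lexico-syntactic filter; the real system applies patterns over parses, whereas this sketch only checks for an interaction trigger between the two mentions, and the trigger list is an assumption:

```python
import re

# Illustrative trigger lexicon, not the OntoGene pattern set.
INTERACTION_PATTERN = re.compile(
    r"\b(binds?|interacts?|phosphorylates?|regulates?|activates?|inhibits?)\b",
    re.IGNORECASE,
)

def passes_filter(sentence, entity_a, entity_b):
    """Keep a candidate pair only if an interaction trigger occurs
    between the two entity mentions."""
    if entity_a not in sentence or entity_b not in sentence:
        return False
    start = min(sentence.index(entity_a), sentence.index(entity_b))
    end = max(sentence.index(entity_a), sentence.index(entity_b))
    return bool(INTERACTION_PATTERN.search(sentence[start:end]))

print(passes_filter("Raf phosphorylates MEK in vitro.", "Raf", "MEK"))  # True
print(passes_filter("Raf and MEK were purified.", "Raf", "MEK"))        # False
```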

SLIDE 18

Importance of ranking

Ranking metrics compared: MRR, MAP, AUC iP/R, TAP-k (see the sketch below)
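For reference, MRR and MAP (two of the listed metrics) in a few lines of Python; AUC iP/R and TAP-k operate on the same ranked-list setup:

```python
def mrr(ranked_relevance):
    """Mean Reciprocal Rank over queries; each item is a list of
    0/1 relevance flags in rank order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def average_precision(flags):
    """AP for one query: mean of precision values at each relevant rank.
    MAP is the mean of this value over all queries."""
    hits, precisions = 0, []
    for rank, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(1, len(precisions))

print(mrr([[0, 1, 0], [1, 0, 0]]))      # 0.75
print(average_precision([1, 0, 1, 0]))  # mean(1/1, 2/3) ≈ 0.83
```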

SLIDE 19

SASEBio

• Semi-Automated Semantic Enrichment of the Biomedical Literature
• Funding by SNF (grant 105315_130558/1) and Novartis
• Duration: 3 years
• Positions: 2 post-docs, 1 PhD
• Goals:
  • Improve our text mining technologies
  • Make the tools relevant to potential users

SLIDE 20

SASEBio: activities so far

• CALBC: large-scale entity extraction
• BC III (2010): successful participation in all tasks
• PharmGKB assisted curation experiment
• Terminology evolution studies
• BC 2012: best overall results in the “triage” task for CTD

SLIDE 21

CALBC (2010)

• Large-scale entity extraction (900K abstracts)
• CALBC I: 3rd place for diseases (F: 84%) and species (F: 78%) against Silver Corpus I
• Best results for diseases and species against the harmonized-voting Silver Corpus II
• Challenges:
  • Processing large XML collections
  • Harmonizing annotations
  • Efficiency of the annotation process

SLIDE 22

BioCreative III (2010)

• Good results in all tasks
• GN: Gene Normalization
  • Middle-rank results
• PPI-ACT: binary classification of PPI papers
  • Top-rank results
• PPI-IMT: find experimental methods in papers
  • Top-rank results
• IAT: experimental interactive task
  • Positive comments from curators about usability

SLIDE 23

IAT: ODIN

SLIDE 24

SLIDE 25

PharmGKB

• Provides manually annotated relationships between Drugs/Genes/Diseases (36557 as of Sep 30th, 2010)
• Annotation based on publications, pathways and RSIDs:
  • 26122 PMID
  • 5467 Pathway
  • 4968 RSID
• We consider only relationships derived from publications

SLIDE 26

Approach

• Abstracts (5062) downloaded from PubMed
• Used the OG pipeline for entity annotation, restricted to terms derived from PharmGKB (Drugs: 30351 terms / 2986 IDs; Diseases: 28633 terms / 3198 IDs; Genes: 176366 terms / 28633 IDs)
• Candidate interactions generated according to a set of different criteria: co-occurrence, syntax, ME (see the sketch below)
• Comparison against the “gold standard” using the BioCreative II.5 PPI scorer
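A minimal sketch of the co-occurrence criterion (syntax and maximum entropy are separate criteria); the cross-type restriction mirrors the Drug/Gene/Disease relation types but is otherwise an assumption:

```python
from itertools import combinations

def cooccurrence_candidates(annotations):
    """Candidate interactions from one abstract's entity annotations.
    annotations: (entity_id, entity_type) pairs from the OG pipeline."""
    entity_type = dict(annotations)
    return [(a, b) for a, b in combinations(sorted(entity_type), 2)
            if entity_type[a] != entity_type[b]]  # cross-type pairs only

print(cooccurrence_candidates([("G100", "Gene"), ("D001", "Drug"), ("M05", "Disease")]))
# [('D001', 'G100'), ('D001', 'M05'), ('G100', 'M05')]
```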

SLIDE 27

Creating a gold standard

• The manually annotated interactions can be used to generate a gold standard:
  • 10597 Gene/Drug
  • 9415 Gene/Disease
  • 4202 Drug/Disease
  • 928 Gene/Gene
  • 742 Drug/Drug
  • 238 Disease/Disease
• Total: 26122 interactions (24958 without duplicates)

SLIDE 28

Syntax-based approach

“The neuronal nicotinic acetylcholine receptor alpha7 (nAChR alpha7) may be involved in cognitive deficits in Schizophrenia and Alzheimer's disease.” [PMID 15695160]

SLIDE 29

Computed Interactions

SLIDE 30

Computed Interactions

P = 30%, R = 28%, AUC = 22%
P = 7%, R = 66%, AUC = 28%

SLIDE 31

Interactive curation

SLIDE 32

Interactive curation

SLIDE 33

BioCreative 2012

• Best overall results in Task 1 (triage for the Comparative Toxicogenomics Database)
• Best entity recognition for diseases and chemicals

SLIDE 34

Terminology evolution

• Goal: investigate appearance, disappearance and replacement of biomedical terminology over time
• Quality terminology is essential for text mining
• Experiments with PharmGKB/CTD/UMLS as reference terminology (diseases)
• Using PubMed abstracts as reference collection

SLIDE 35

Term replacement?

SLIDE 36

Summary

• Goal: develop innovative text mining technologies for the automatic extraction of information from the biomedical literature [application: assisted curation]
• OntoGene/SASEBio provide competitive text mining technologies (BC and CALBC results prove quality)
• ODIN as a tool for text-mining-supported interactive curation of the biomedical literature
• PharmGKB/CTD experiments provide a case study
• Terminology studies

SLIDE 37

OntoGene highlights

• [2006] BioCreative II: PPI (3rd), IMT (best)
• [2009] BioCreative II.5 PPI (best results); BioNLP
• [2010] BioCreative III: ACT, IMT, IAT
• [2011] CALBC (large-scale entity extraction), BioNLP
• [2012] PharmGKB/CTD assisted curation experiments
• 60 peer-reviewed publications, 17 journal papers

http://www.ontogene.org/

SLIDE 38

Acknowledgments

• Institute of Computational Linguistics, UZH
  • Gerold Schneider (parsing, relation extraction, IMT, BioNLP)
  • Simon Clematide (ODIN, GN, ACT, CALBC)
  • Kaarel Kaljurand (pipeline, ODIN, BioNLP)
  • Gintare Grigonyte (terminology evolution), Tilia Ellendorff
• NIBR-IT, Text Mining Services, Novartis
  • Therese Vachon, Martin Romacker
• Swiss National Science Foundation

SLIDE 39

Syntactic Rules

synRel(passive, [X2,X1,X3],
       [ dep(subj,X2,X1), dep(pobj,X2,X3), dep(prep,X3,By),
         pos(X2,'VBN'), lemma(By,['by','through','via']) ]).

[Diagram: dependency arcs subj, prep and pobj linking X1, X2 and X3 through by/through/via]
Paraphrases covered: “A regulates B”, “B is regulated by A”, “the regulation of B by A”

SLIDE 40

Semantic Rules

semRel(xrel([H,A,B]), direct_transitive([H,A,B])).
semRel(xrel([H,A,B]), passive([H,B,A])).
semRel(xrel([H,A,B]), nominalization([H,B,A])).

H: event, A: agent, B: target
[Diagram: event H connecting agent A and target B]
Paraphrases covered: “A regulates B”, “B is regulated by A”, “the regulation of B by A”

SLIDE 41

Normalization

• Simple surface variants (hyphens, spaces, brackets); see the sketch below
• Normalize Greek letters and Roman numerals
• Simple morphological variants (lemmatization)
• Domain-specific variants (e.g. hPPP, PPPp)
• Syntactic variants considered but not used
• Might need to be adapted to the specific DB resources used (e.g. chemical entities have specific normalization/recognition rules, such as SMILES strings)
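A minimal sketch of the surface normalizations listed above (hyphens/brackets, Greek letters, whitespace); the exact OntoGene rule set is not shown on the slide, so these rules are assumptions:

```python
import re
import unicodedata

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}

def normalize_term(term):
    """Map surface variants of a term onto one canonical form."""
    t = unicodedata.normalize("NFKC", term).lower()
    for greek, name in GREEK.items():
        t = t.replace(greek, name)           # spell out Greek letters
    t = re.sub(r"[-_()\[\]{}]", " ", t)      # strip hyphens and brackets
    t = re.sub(r"\s+", " ", t).strip()       # collapse whitespace
    return t

print(normalize_term("nAChR-α7"))  # "nachr alpha7"
```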

SLIDE 42

Disambiguation

• Molecular biology papers typically (but not always) describe experiments focusing on one (or a few) model organisms -> focus organism(s)
• Can be used for disambiguation (see the sketch below)
  • reduces the set of identifiers
  • e.g. UniProtKB has an average ambiguity of 2.6 IDs per term
• Evaluation over IntAct snippets:
  • Before disambiguation: P = 0.025, R = 0.742, F = 0.048
  • After disambiguation: P = 0.499, R = 0.657, F = 0.567
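A sketch of the identifier filter; the candidate representation and the fallback behaviour are assumptions, since the slide reports only the effect:

```python
def filter_by_organism(candidates, focus_organisms):
    """Keep protein identifiers whose source organism matches the
    article's focus organism(s). candidates: (uniprot_id, organism)."""
    kept = [(pid, org) for pid, org in candidates if org in focus_organisms]
    # Fall back to the unfiltered set if nothing survives, so recall
    # does not collapse (an assumption, not stated on the slide).
    return kept or candidates

print(filter_by_organism([("P04637", "human"), ("P02340", "mouse")], {"human"}))
```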

SLIDE 43

Organism-based disambiguation

• Knowing the organism produces a huge disambiguation effect
• Most articles refer to a single organism (but not always)
• Adopted a statistical approach that combines evidence from the abstract, the main text and the absolute frequency of organisms (known from training data); see the sketch below
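A sketch of the evidence combination for picking the focus organism; the three evidence sources are from the slide, while the weights are assumptions:

```python
def focus_organism(abstract_counts, body_counts, prior, w=(2.0, 1.0, 0.5)):
    """Pick the focus organism by combining mention counts in the
    abstract and main text with the organism's prior frequency from
    training data. The weights w are illustrative assumptions."""
    organisms = set(abstract_counts) | set(body_counts) | set(prior)
    def score(org):
        return (w[0] * abstract_counts.get(org, 0)
                + w[1] * body_counts.get(org, 0)
                + w[2] * prior.get(org, 0.0))
    return max(organisms, key=score)

print(focus_organism({"human": 3}, {"human": 5, "mouse": 2},
                     {"human": 0.4, "mouse": 0.2}))  # "human"
```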

SLIDE 44

SLIDE 45

Announcing a New MT Challenge

MANTRA

http://www.mantra-project.eu/


SLIDE 46

Background Information

• Languages: English, German, French, Spanish
• Domain: biomedicine (you needn’t be an expert!)
• Text genres:
  a) [Medline] titles
  b) best-practice reports [Cochrane]
  c) pharmaceutical drug information [EMEA]
  d) patent claims
• Corpora: all parallel (pairwise)
• Timeframe: 1st quarter of 2013
• Under review for: CLEF 2013
SLIDE 47

Challenge Task I


1. Multilingual Named Entity Recognition
• Given:
  • Either: sentence-aligned parallel corpora for E-G, E-F, E-S, with no annotations
  • Or: non-aligned parallel corpora for E-G, E-F, E-S, with biomedical named entities marked up in the English documents
  • Monolingual biomedical terminologies (such as MeSH), with Concept Unique Identifiers (CUIs) as conceptual anchors across all four languages
• Find: occurrences of NEs (E-G, E-F, E-S) in all documents, including translation equivalents via the identification of CUIs

SLIDE 48
Challenge Task II

2. MT-based Lexical Acquisition
• Given: (almost complete) English terminological resources [such as MeSH], and incomplete German/French/Spanish (GFS) counterparts (ranging from 30% to 60% coverage)
• Find: novel terms for GFS and locate them in the corresponding [MeSH] hierarchies, using the CUIs

SLIDE 49

We Are Looking for …

• Challenge participants
  • … with an interest in multilingual named entity recognition, term extraction, lexical acquisition
• Commercial providers of baseline systems
  • … who want to benefit from scaling their systems up into super-large sublanguage domains using (en)rich(ed) terminological resources

http://www.mantra-project.eu/

SLIDE 50

SLIDE 51

BC 2012: CTD Triage

SLIDE 52

References

• Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, Therese Vachon. OntoGene in BioCreative II. Genome Biology, 2008, 9:S13.

SLIDE 53

Results in BC II.5

• Best results according to the official ‘raw’ AUC iP/R score
• Most balanced results after rescoring

SLIDE 54

References

• Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Thérèse Vachon, Martin Romacker. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3), pp. 472-480, 2010.
• Gerold Schneider, Kaarel Kaljurand, Thomas Kappeler, Fabio Rinaldi. Detecting protein-protein interactions in biomedical texts using a parser and linguistic resources. CICLING 2009.

SLIDE 55

Terminology evolution

• Using UMLS as reference terminology (diseases)
• Using PubMed abstracts as reference collection
• Goal: investigate appearance, disappearance and replacement of terminology

SLIDE 56

References

• Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, Therese Vachon. OntoGene in BioCreative II. Genome Biology, 2008, 9:S13.

SLIDE 57

Generating Relation Candidates: Basic Idea

Step I: Use the textual evidence from title and abstract – We combine entities across the whole abstract – Simple co-occurrence-based baseline: – Why not limiting to sentence co-occurrence? In our experimental data set (Pubmed abstracts with 12 or less curated relations) related genes, drugs and diseases did not cooccur in the same sentence in 50% of all cases.

relscore= freq(e1)+ freq(e2) freq(E)
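The same score as executable Python, assuming freq(E) is the total frequency of all annotated entities in the abstract (an interpretation of the slide's denominator):

```python
def relscore(freq_e1, freq_e2, freq_all_entities):
    """Co-occurrence baseline score for a candidate relation."""
    return (freq_e1 + freq_e2) / freq_all_entities

print(relscore(3, 2, 20))  # 0.25
```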

SLIDE 58

Generating Relation Candidates: Basic Idea

Step II: Use PubMed metadata (MeSH terms and chemical substance list) – This information is widely available (sometimes provided by the publishers before validation through PubMed) – Coverage of term recognizer improves (in case of term recognizer)

SLIDE 59

Ranking Relation Candidates: Learn from the Past

Fact is: CTD is a large resource with over 70,000 manually curated relations extracted from over 22,000 PubMed-indexed articles usable for BC4 task Question is: How can we make use of this data to improve relation ranking?

SLIDE 60

Learn from the Past

Our idea is: Use a maximum entropy classifier to assess the probability that an entity identified by our term recognizer turns into a component of a curatable relation.

 This idea implements a twofold task and domain

adaptation:

 entity identification is optimized (which terms tend to

be grounded in which concepts)

 relation relevance is optimized (entities are weighted

with regard to their appearance in gold standard relations). Concession: Using the past leads to conservative guesses (detecting novel concepts won't work optimal), however, textual evidence is still important.
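A minimal sketch of the maximum entropy step, using scikit-learn's logistic regression as a stand-in; the feature choice (entity type indicators, frequency) and the toy data are assumptions:

```python
from sklearn.linear_model import LogisticRegression

# One row per recognized entity: [is_chemical, is_disease, frequency].
# The label says whether the entity appeared in a curated CTD relation.
X_train = [[1, 0, 3], [0, 1, 1], [1, 0, 5], [0, 1, 2]]
y_train = [1, 0, 1, 0]

maxent = LogisticRegression()  # maximum entropy == logistic regression
maxent.fit(X_train, y_train)

# P(curatable) for a new entity, used later as its "gold probability".
print(maxent.predict_proba([[1, 0, 2]])[0][1])
```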

SLIDE 61

Scoring concepts

Data sources

• We learn from all CTD-curated PubMed abstracts with at most 12 relations.
  • Detecting more than 12 relations from abstracts only by text mining tools seems less promising.
• BC4 test articles were held out.

Maximum entropy modeling (aka logistic regression)

SLIDE 62

Ranking Relation Candidates

Entity score
• The score of an entity e in an article A is the sum of all zoned term frequencies (titles counted 10 times), weighted by their gold probability (a plausible reconstruction of the slide's formulas follows below)

Relation score
• Harmonic mean of the entity scores
slide-63
SLIDE 63

Ranking Relation Candidates

Relation score
• The harmonic mean is better than simply taking the sum. Reason: both entities have to contribute; imbalance is penalized.

Normalization
• In order to make relation scores comparable, they are normalized for a given data set, e.g. the BC4 test set (see the sketch below).
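A small sketch of a within-dataset normalization; the slide does not specify the scheme, so min-max scaling is an assumption:

```python
def normalize(scores):
    """Min-max normalize relation scores within one data set so that
    score thresholds are comparable across data sets."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0   # guard against all-equal scores
    return [(s - lo) / span for s in scores]

print(normalize([0.2, 0.5, 1.4]))  # [0.0, 0.25, 1.0]
```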

SLIDE 64

Entities and relationships

SLIDE 65

Integration with IR system

• Conventional IR system (Lucene) used to provide article ranking, modified through its API, as:
  • Standard IR
  • Tokens replaced by OG tokens
  • Terms used as tokens (synonym expansion); see the sketch below
  • Relations used for query expansion
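A sketch of the "terms used as tokens" configuration: recognized terms are replaced by concept IDs before indexing, so synonyms collapse onto one token. The term dictionary and IDs are illustrative, and the real integration would happen inside a custom Lucene analyzer rather than in plain Python:

```python
# Toy term dictionary: both surface forms map to one concept ID.
TERM_TO_ID = {"acetylsalicylic acid": "CHEM:D001241", "aspirin": "CHEM:D001241"}

def to_og_tokens(text):
    """Replace recognized terms with concept IDs, keep the rest as tokens."""
    tokens = []
    lowered = text.lower()
    for term, concept_id in TERM_TO_ID.items():
        if term in lowered:
            lowered = lowered.replace(term, " ")
            tokens.append(concept_id)
    tokens.extend(lowered.split())
    return tokens

# Both queries share the token CHEM:D001241, so either surface form
# matches documents indexed with the other.
print(to_og_tokens("Acetylsalicylic acid reduces fever"))
print(to_og_tokens("aspirin dosage"))
```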

SLIDE 66

Integration with IR system

SLIDE 67

SLIDE 68

OG terms to Lucene tokens

SLIDE 69

Using OG tokens+terms

SLIDE 70

SLIDE 71

OG terms to Lucene tokens

SLIDE 72

+Term IDs

SLIDE 73

SLIDE 74

SLIDE 75

Query expansion

SLIDE 76

Query expansion

SLIDE 77

Using OG Relations

SLIDE 78

SLIDE 79

Results

SLIDE 80

SLIDE 81