The OntoGene system: an advanced information extraction application for biological literature
www.ontogene.org
Fabio Rinaldi
Outline
- Motivation, brief history
- OntoGene approach
- Evaluation (shared tasks)
- SASEBio: from text mining to interactive curation
- Recent developments: PharmGKB, CTD, BioTermEvo (Gintare)
Motivations and History
- Motivation: prove that NLP technologies are mature enough for real-world applications
- Target: biomedical text mining
  - Richness of terminological resources (grounding!)
  - Large text DBs: potential interest from the bio community
- Goal: help organize the knowledge space of the biomedical sciences
- Started in late 2004 with applications combining terminology structuring and dependency parsing
OG-RM
GENIA
References
Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand,
Michael Hess, Martin Romacker. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics 2006, 7(Suppl 3):S3. doi:10.1186/1471-2105-7-S3-S3
BC II (2006): approach
- Annotate entities using reference DBs as source
- Disambiguate proteins according to organism distribution
- Give each ID a score according to frequency and position
- Combine IDs in the same syntactic span
- Use manually constructed syntactic patterns to filter out unlikely pairs
- Use a novel/background filter to identify sentences likely to convey the 'core' message
- Results: 3rd best
OG pipeline
First SNF project
“Detection of Biological Interactions from Biomedical Literature” (SNF 100014-118396/1)
- Funding: SNF and Novartis
- Duration: 18 months (April 2008 – October 2009)
- Main focus: IntAct database
- Experimental methods (SMBM 2008)
- Organisms (BioNLP 2009)
- Entities (AIME 2009)
- Interactions (CICLING 2009)
IntAct snippets
Syntactic Filters
PPI in BC II.5 (2009)
- All candidate pairs in a sentence are considered
- Entity recognition and disambiguation learnt from IntAct
- One semi-automated submission (organism selection)
- Candidate pairs are scored according to: pair salience; zoning; novelty score; known interaction; syntactic paths
- Syntax: now using learning to derive syntactic patterns from a manually annotated corpus
- Results: best according to “raw” AUC iP/R
Annotated Abstract
Protein Interactions (IPS)
- Parse all positive sentences
- Apply lexico-syntactic patterns as filters
- Interactions which do not 'pass' a filter are discarded
- Results: P: 54.37%, R: 18.39%, F: 27.49%
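As a sanity check, the F value reported on this slide is the harmonic mean of precision and recall:

```python
def f_score(p, r):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Values reported on the slide: P = 54.37%, R = 18.39% -> F ≈ 27.49%
f = round(100 * f_score(0.5437, 0.1839), 2)
```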
Importance of ranking
MRR
MAP
AUC iP/R
TAP-k
SASEBio
Semi-Automated Semantic Enrichment of the Biomedical Literature
- Funding by SNF (grant 105315_130558/1) and Novartis
- Duration: 3 years
- Positions: 2 post-docs, 1 PhD
- Goals:
  - Improve our text mining technologies
  - Make the tools relevant to potential users
SASEBio: activities so far
- CALBC: large-scale entity extraction
- BC III (2010): successful participation in all tasks
- PharmGKB assisted curation experiment
- Terminology evolution studies
- BC 2012: best overall results in the “triage” task for CTD
CALBC (2010)
- Large-scale entity extraction (900K abstracts)
- CALBC I: 3rd place for diseases (F: 84%) and species (F: 78%) against Silver Corpus I
- Best results for diseases and species against the harmonized voting Silver Corpus II
- Challenges:
  - Processing large XML collections
  - Harmonizing annotations
  - Efficiency of the annotation process
BioCreative III (2010)
Good results in all tasks
GN: Gene Normalization
Middle-rank results
PPI-ACT: binary classification of PPI papers
Top-rank results
PPI-IMT: find experimental methods in papers
Top-rank results
IAT: experimental interactive task
Positive comments from curators about usability
IAT: ODIN
PharmGKB
Provides manually annotated relationships between Drugs/Genes/Diseases (36557 as of Sep 30th, 2010)
Annotation based on publications, pathways and RSIDs:
- 26122 PMID
- 5467 Pathway
- 4968 RSID
We consider only relationships derived from publications
Approach
- Abstracts (5062) downloaded from PubMed
- Used the OG pipeline for entity annotation; only terms derived from PharmGKB (Drugs: 30351 terms / 2986 IDs; Diseases: 28633 terms / 3198 IDs; Genes: 176366 terms / 28633 IDs)
- Candidate interactions generated according to a set of different criteria (co-occurrence, syntax, ME)
- Comparison against the “gold standard” using the BioCreative II.5 PPI scorer
Creating a gold standard
The manually annotated interactions can be used to generate a gold standard:
- 10597 Gene/Drug
- 9415 Gene/Disease
- 4202 Drug/Disease
- 928 Gene/Gene
- 742 Drug/Drug
- 238 Disease/Disease
Total: 26122 interactions (24958 without duplicates)
Syntax-based approach
“The neuronal nicotinic acetylcholine receptor alpha7 (nAChR alpha7) may be involved in cognitive deficits in Schizophrenia and Alzheimer's disease.” [15695160]
Computed Interactions
- P = 30%, R = 28%, AUC = 22%
- P = 7%, R = 66%, AUC = 28%
Interactive curation
BioCreative 2012
- Best overall results in Task 1 (triage for the Comparative Toxicogenomics Database)
- Best entity recognition for diseases and chemicals
Terminology evolution
- Goal: investigate appearance, disappearance and replacement of biomedical terminology over time
- Quality terminology is essential for text mining
- Experiments with PharmGKB/CTD/UMLS as reference terminology (diseases)
- Using PubMed abstracts as reference collection
Term replacement?
Summary
- Goal: develop innovative text mining technologies for the automatic extraction of information from the biomedical literature [application: assisted curation]
- OntoGene/SASEBio provide competitive text mining technologies (BC, CALBC prove quality)
- ODIN as a tool for text-mining-supported interactive curation of the biomedical literature
- PharmGKB/CTD experiments provide case studies
- Terminology studies
OntoGene highlights
- [2006] BioCreative II: PPI (3rd), IMT (best)
- [2009] BioCreative II.5: PPI (best results); BioNLP
- [2010] BioCreative III: ACT, IMT, IAT
- [2011] CALBC (large-scale entity extraction), BioNLP
- [2012] PharmGKB/CTD assisted curation experiments
- 60 peer-reviewed publications, 17 journal papers
http://www.ontogene.org/
Acknowledgments
Institute of Computational Linguistics UZH
- Gerold Schneider (parsing, relation extraction, IMT, BioNLP)
- Simon Clematide (ODIN, GN, ACT, CALBC)
- Kaarel Kaljurand (pipeline, ODIN, BioNLP)
- Gintare Grigonyte (terminology evolution), Tilia Ellendorff
NIBR-IT, Text Mining Services, Novartis
Therese Vachon, Martin Romacker
Swiss National Science Foundation
Syntactic Rules
synRel(passive, [X2,X1,X3],
    [ dep(subj,X2,X1), dep(pobj,X2,X3), dep(prep,X3,By),
      pos(X2,'VBN'), lemma(By,['by','through','via']) ]).
[dependency diagram: X2 –subj→ X1; X2 –prep→ By –pobj→ X3; By ∈ {by, through, via}]
Examples: A regulates B; B is regulated by A; the regulation of B by A
Semantic Rules
semRel(xrel([H,A,B]), direct_transitive([H,A,B])).
semRel(xrel([H,A,B]), passive([H,B,A])).
semRel(xrel([H,A,B]), nominalization([H,B,A])).
H: event, A: agent, B: target
[diagram: H –agent→ A; H –target→ B]
Examples: A regulates B; B is regulated by A; the regulation of B by A
Normalization
- Simple surface variants (hyphens, spaces, brackets)
- Normalize Greek letters and Roman numerals
- Simple morphological variants (lemmatization)
- Domain-specific variants (e.g. hPPP, PPPp)
- Syntactic variants considered but not used
- Might need to be adapted to the specific DB resources used (e.g. chemical entities have specific normalization/recognition rules – SMILES strings)
Disambiguation
- Molecular biology papers typically (but not always) describe experiments focusing on one (or a few) model organisms -> focus organism(s)
- Can be used for disambiguation: reduces the set of identifiers (e.g. UniProtKB has an average ambiguity of 2.6 IDs per term)
Evaluation over IntAct snippets
- Before disambiguation: P = 0.025, R = 0.742, F = 0.048
- After disambiguation: P = 0.499, R = 0.657, F = 0.567
Organism-based disambiguation
- Knowing the organism produces a huge disambiguation effect
- Most articles refer to a single organism (but not always)
- Adopted a statistical approach that combines evidence from abstract, main text and absolute frequency of organisms (known from training data)
Announcing a New MT Challenge
MANTRA
http://www.mantra-project.eu/
10/31/12
Background Information
- Languages: English, German, French, Spanish
- Domain: biomedicine (you needn't be an expert!)
- Text genres:
  a) [Medline] titles
  b) best-practice reports [Cochrane]
  c) pharmaceutical drug information [EMEA]
  d) patent claims
- Corpora: all parallel (pairwise)
- Timeframe: 1st quarter of 2013
- Under review for: CLEF 2013
Challenge Task I
1. Multilingual Named Entity Recognition
- Given:
  - Either: sentence-aligned parallel corpora for E-G, E-F, E-S, no annotations
  - Or: non-aligned parallel corpora for E-G, E-F, E-S, with biomedical named entities marked up in the English documents
  - Monolingual biomedical terminologies (such as MeSH), with Concept Unique Identifiers (CUIs) as conceptual anchors across all four languages
- Find: occurrences of NEs (E-G, E-F, E-S) in all documents, including translation equivalents via the identification of CUIs
2. MT-based Lexical Acquisition
- Given: (almost complete) English terminological resources [such as MeSH], and incomplete GFS (German/French/Spanish) counterparts (ranging from 30% to 60% coverage)
- Find: novel terms for GFS and locate them in the corresponding [MeSH] hierarchies, using the CUIs
Challenge Task II
We Are Looking for …
- Challenge participants with an interest in multilingual named entity recognition, term extraction, lexical acquisition
- Commercial providers of baseline systems who want to benefit from scaling their systems up into super-large sublanguage domains using (en)rich(ed) terminological resources
http://www.mantra-project.eu/
BC 2012: CTD Triage
References
Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand,
Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, Therese Vachon. OntoGene in BioCreative II. Genome Biology, 2008, 9:S13.
Results in BC II.5
- Best results according to the official ‘raw’ AUC iP/R score
- Most balanced results after rescoring
References
Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand,
Simon Clematide, Thérèse Vachon, Martin Romacker, "OntoGene in BioCreative II.5," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3), pp. 472-480, 2010.
Gerold Schneider, Kaarel Kaljurand, Thomas
Kappeler, Fabio Rinaldi.Detecting protein-protein interactions in biomedical texts using a parser and linguistic resources. CICLING 2009.
Terminology evolution
- Using UMLS as reference terminology (diseases)
- Using PubMed abstracts as reference collection
- Goal: investigate appearance, disappearance and replacement of terminology
Generating Relation Candidates: Basic Idea
Step I: Use the textual evidence from title and abstract
- We combine entities across the whole abstract
- Simple co-occurrence-based baseline
- Why not limit to sentence co-occurrence? In our experimental data set (PubMed abstracts with 12 or fewer curated relations), related genes, drugs and diseases did not co-occur in the same sentence in 50% of all cases.
relscore = (freq(e1) + freq(e2)) / freq(E)
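A minimal sketch of this co-occurrence baseline, assuming freq(E) denotes the total frequency of all recognized entities in the abstract:

```python
def relscore(freqs, e1, e2):
    """Co-occurrence baseline: relscore = (freq(e1) + freq(e2)) / freq(E).
    freq(E) is read here as the total frequency of all recognized
    entities in the abstract (an assumption)."""
    return (freqs[e1] + freqs[e2]) / sum(freqs.values())

# Toy entity frequencies for one abstract:
freqs = {"aspirin": 3, "COX2": 2, "fever": 5}
score = relscore(freqs, "aspirin", "COX2")
```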
Generating Relation Candidates: Basic Idea
Step II: Use PubMed metadata (MeSH terms and chemical substance list)
- This information is widely available (sometimes provided by the publishers before validation through PubMed)
- Coverage of the term recognizer improves
Ranking Relation Candidates: Learn from the Past
- Fact is: CTD is a large resource with over 70,000 manually curated relations extracted from over 22,000 PubMed-indexed articles usable for the BC4 task
- Question is: how can we make use of this data to improve relation ranking?
Learn from the Past
Our idea: use a maximum entropy classifier to assess the probability that an entity identified by our term recognizer turns into a component of a curatable relation.
This idea implements a twofold task and domain adaptation:
- entity identification is optimized (which terms tend to be grounded in which concepts)
- relation relevance is optimized (entities are weighted with regard to their appearance in gold-standard relations)
Concession: using the past leads to conservative guesses (detecting novel concepts won't work optimally); however, textual evidence is still important.
Scoring concepts
Data sources
- We learn from all CTD-curated PubMed abstracts with max. 12 relations
- Detecting more than 12 relations from abstracts alone by text mining tools seems less promising
- BC4 test articles were held out
Maximum Entropy modeling (a.k.a. logistic regression)
Ranking Relation Candidates
Entity score
The score of an entity e in an article A is the sum of all zoned term frequencies (titles counted 10 times) weighted by their gold probability.
Relation score
Harmonic mean of the two entity scores
Ranking Relation Candidates
Relation score
- Harmonic mean is better than simply taking the sum. Reason: both entities have to contribute; imbalance is penalized.
Normalization
- In order to make relation scores comparable, they are normalized for a given data set, e.g. the BC4 test set.
Entities and relationships
Integration with IR system
Conventional IR system (Lucene) used to provide article ranking, modified through its API:
- Standard IR tokens replaced by OG tokens
- Terms used as tokens (synonym expansion)
- Relations used for query expansion
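The synonym and relation-based expansion can be sketched as a Lucene-style boolean query builder; the dictionaries and the exact query shape are illustrative assumptions:

```python
def expand_query(tokens, synonyms, relations):
    """Build a Lucene-style boolean query: each token becomes an OR-group
    of its synonyms; partner terms from known relations are OR-ed in."""
    groups = ["(" + " OR ".join([t] + synonyms.get(t, [])) + ")"
              for t in tokens]
    partners = [p for t in tokens for p in relations.get(t, [])]
    query = " AND ".join(groups)
    if partners:
        query += " OR " + " OR ".join(partners)
    return query
```

For example, a query for "aspirin" with a synonym "acetylsalicylic_acid" and a known relation partner "COX2" expands to `(aspirin OR acetylsalicylic_acid) OR COX2`.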