

SLIDE 1

Stanford-UBC at TAC-KBP

Eneko Agirre, Angel Chang, Dan Jurafsky, Christopher Manning, Valentin Spitkovsky, Eric Yeh

Ixa NLP Group, University of the Basque Country; NLP Group, Stanford University

SLIDE 2

Outline

  • Entity linking
  • Slot filling
SLIDE 4

Entity linking

  • Given Knowledge Base
  • Given target string and surrounding text:

I watched “Slapshot”, the 1977 hockey classic starring Paul Newman for the first time.

  • Return entity in KB (E0181364) or NIL
  • KB subset of Wikipedia

  Paul_Newman               E0181364
  Paul_Newman_(politician)  NIL
  Paul_Newman_(cricketer)   NIL
  Paul_Newman_(linguist)    NIL
  Paul_Newman_(band)        NIL

  string       entity
  Paul Newman  E0181364
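To make the input/output concrete, here is a minimal sketch of the task interface; the names `link`, `kb`, and `NIL` are illustrative, not the system's actual API, and the disambiguation itself is the subject of the following slides.

```python
from typing import Optional

NIL = None  # the answer when the referent is not in the KB

def link(target: str, context: str, kb: dict) -> Optional[str]:
    """Return a KB entity id such as 'E0181364', or NIL.

    `kb` maps surface strings to candidate entity ids; a real system
    would disambiguate among the candidates using `context`.
    Here we just take the sole candidate if it is unambiguous.
    """
    candidates = kb.get(target, [])
    return candidates[0] if len(candidates) == 1 else NIL

kb = {"Paul Newman": ["E0181364"]}
print(link("Paul Newman", 'I watched "Slapshot", the 1977 hockey classic ...', kb))
# -> E0181364
```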

SLIDE 5

Entity Linking vs. Word Sense Disambiguation

  • Same layout as WSD

– Given a preexisting dictionary (sense inventory):
  ➔ decide the appropriate sense in context
    He cashed a check at the bank
– Plethora of methods (Agirre and Edmonds, 2006)

  string       concept      phenomenon
  counterfeit  n-03562262   monosemy
  forgery      n-03562262   variants
  bank         n-09213565   polysemy
  bank         n-08420278   polysemy

SLIDE 6

Entity linking vs. Word Sense Disambiguation

  • Entity linking has the same layout, but...

– Entities rather than concepts (instance vs. class)
    Norfolk also took the Minor Counties One-day Title, in 1986 (under
    Quorn Handley) and again (at Lord's, under Paul Newman) in 1997 and 2001.
– Dictionary is partial, needs to be completed

  • No full set of entities: those given by KB, otherwise NIL
  • Only one string, potentially many other variants

(Paul Leonard Newman, Paul L. Newman, etc.)

  • Some differences, but same techniques might work
SLIDE 7

Approaches to entity linking

  • Dictionary lookup (no use of context)

– Construct dictionary
– Record preferred entity (prior)

  • Supervised system

– Use training examples from Wikipedia

  • Knowledge-based system

– Similarity between context and KB entry (Wikipedia article)

  • Combination
SLIDE 8

Constructing the dictionary

  • Table with all possible string–entity pairs
  • Two purposes

– Inventory for the supervised and knowledge-based algorithms
– Disambiguation method, using an estimation of the prior

  • Space of concepts: KB concepts + Wikipedia articles

– Remove redirection, disambiguation, and list_of pages
– Redirects are clustered, choosing the KB entry as canonical form

  • Space of strings: names in KB, titles of articles, plus ...

– Redirects (Paul Leonard Newman)
– Anchor text of links to the article (Newman, Paul L. Newman)
– Case normalization, fuzzy match for variations and misspellings (Paul Newma)
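A rough sketch of this construction, assuming the dump has already been parsed into per-article records of title, redirects, and anchor texts (parsing, case normalization, and fuzzy matching are omitted):

```python
from collections import defaultdict

def build_dictionary(articles):
    """articles: iterable of records with 'title', 'redirects', 'anchors'.
    Returns a mapping from surface string to candidate article titles."""
    dictionary = defaultdict(set)
    for art in articles:
        title = art["title"]
        dictionary[title.replace("_", " ")].add(title)  # the title itself
        for redirect in art["redirects"]:               # e.g. Paul Leonard Newman
            dictionary[redirect].add(title)
        for anchor in art["anchors"]:                   # e.g. Newman, Paul L. Newman
            dictionary[anchor].add(title)
    return dictionary

d = build_dictionary([{"title": "Paul_Newman",
                       "redirects": ["Paul Leonard Newman"],
                       "anchors": ["Newman", "Paul L. Newman"]}])
print(d["Paul L. Newman"])  # {'Paul_Newman'}
```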

SLIDE 9

Constructing the dictionary: priors

  • For every unique string, estimate a distribution over entities
    from its occurrences as anchor text

The Prize is a 1963 spy film starring <a href="/wiki/Paul_Newman">Paul Newman</a> ...

  w) inter-Wikipedia links (03/09 dump)
  W) external Web links into Wikipedia (06/09 crawl)

  string       prior    entity                   evidence
  Paul Newman  0.9959   Paul_Newman              W:1986/1988  w:990/1000
  Paul Newman  0.0023   Paul_Newman_(band)       w:7/1000
  Paul Newman  0.0003   Cool_Hand_Luke           W:1/1988
  Paul Newman  0.0003   Newman's_Own             W:1/1988
  Paul Newman  0.0003   Paul_Newman_(austr...)   w:1/1000
  Paul Newman  0.0003   Paul_Newman_(musician)   w:1/1000
  Paul Newman  0.0003   Paul_Newman_(professor)  w:1/1000
  Paul Newman  0        Paul_Newman_(cricketer)
  Paul Newman  0        Paul_Newman_(linguist)
  Paul Newman  0        Paul_Newman_(politician)
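A minimal sketch of the estimation, with toy counts patterned on the table above: the prior of an entity is its share of the string's anchor occurrences.

```python
from collections import Counter

# Toy anchor counts for one string (the w counts: inter-Wikipedia links).
anchor_counts = Counter({"Paul_Newman": 990,
                         "Paul_Newman_(band)": 7,
                         "Paul_Newman_(musician)": 1})

def priors(counts):
    """Prior of each entity = its share of the string's anchor uses."""
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}

print(priors(anchor_counts)["Paul_Newman"])  # ~0.992
```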

SLIDE 10

Constructing the dictionary

  • Three versions, depending on string matching:

a) EXCT: exact match
b) LNRM: lower-cased normalized UTF-8, minus non-alphanumeric low ASCII
c) FUZZ: nearest non-zero Hamming distance matches

  • Additional dictionary:

d) GOOG: Google search restricted to site:en.wikipedia.org
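Sketches of the LNRM key and the FUZZ lookup; the exact normalization rules are assumptions read off the one-line descriptions above.

```python
import unicodedata

def lnrm(s):
    """Lower-cased, Unicode-normalized key, dropping non-alphanumeric
    low-ASCII characters (an assumed reading of the LNRM rule)."""
    s = unicodedata.normalize("NFKC", s).lower()
    return "".join(c for c in s if c.isalnum() or ord(c) > 127)

def fuzz(query, keys):
    """Nearest dictionary key by non-zero Hamming distance; Hamming
    distance is only defined between equal-length strings."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    same_len = [k for k in keys if len(k) == len(query) and k != query]
    return min(same_len, key=lambda k: hamming(query, k), default=None)

print(lnrm("Paul  Newman!"))                 # 'paulnewman'
print(fuzz("Paul Newmen", ["Paul Newman"]))  # 'Paul Newman'
```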

SLIDE 11

Supervised disambiguation

  • Given target string and surrounding text,

pick most appropriate entity

– One multi-class classifier for each target string

  • Construct training data

– Use anchors in Wikipedia text

The Prize is a 1963 spy film starring <a href="/wiki/Paul_Newman">Paul Newman</a> ...

– Some strings have few occurrences

  • Also use other strings for the target entities

e.g. for “Paul L. Newman”, also use “Paul Newman”

SLIDE 12

Supervised disambiguation

  • Build multi-class classifiers for each string

– Inspired by the WSD literature
– Features:

  • Patterns around target: wordforms / lemma / PoS
  • Bag of words: lemmas in context window
  • Noun/verb/adjective before/after the anchor text

– SVM linear kernel
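A hedged sketch of one per-string classifier, using scikit-learn as a stand-in and plain bag-of-words features only (the pattern, lemma, and PoS features above are omitted):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_for_string(contexts, entities):
    """One multi-class linear SVM for a single ambiguous string;
    contexts are the texts around its Wikipedia anchors."""
    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(contexts, entities)
    return clf

clf = train_for_string(
    ["the 1977 hockey classic starring", "a new album by the band"],
    ["Paul_Newman", "Paul_Newman_(band)"],
)
print(clf.predict(["a film starring"]))  # ['Paul_Newman']
```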

SLIDE 13

Knowledge-based disambiguation

  • Given target string and surrounding text,

pick most appropriate entity

– Overlap between context and article text (Lesk, 1986)

  • Convert article text into a TF-IDF vector,

and store it in Lucene.

  • Given string and text,

rank articles by cosine similarity values.

– Keep only articles in the EXCT dictionary
– Document context: gather 25 tokens around all occurrences of the target string
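A minimal stand-in for this ranker, with scikit-learn's TF-IDF replacing Lucene and two toy article snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "Paul_Newman": "American actor who starred in Cool Hand Luke.",
    "Paul_Newman_(cricketer)":
        "English cricketer; captained Norfolk in the Minor Counties Championship.",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(articles.values())

def rank(context):
    """Candidate articles sorted by cosine similarity to the context."""
    sims = cosine_similarity(vectorizer.transform([context]), matrix)[0]
    return sorted(zip(articles, sims), key=lambda pair: -pair[1])

print(rank("Norfolk took the Minor Counties One-day Title under Paul Newman"))
# the cricketer article ranks first for this context
```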

SLIDE 14

Combination

  • Each method outputs entities with scores
  • Heuristic combinations

– RUN1 – Cascade of dictionaries: exact lookup; if no match, lower-case
  normalization; if still no match, fuzzy match
– RUN2 – Vote using inverse of ranking, over:

  • Cascade of dictionaries
  • Google ranking
  • Supervised system
  • Knowledge-based system

– RUN3 – Meta-classifier: linear combination, optimized on the development
  set using conjugate gradient
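A sketch of RUN2-style voting: each system contributes the inverse of the rank at which it proposes an entity, and the entity with the highest total wins.

```python
from collections import defaultdict

def vote(rankings):
    """rankings: one ranked list of entities per system."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, entity in enumerate(ranking, start=1):
            scores[entity] += 1.0 / rank
    return max(scores, key=scores.get) if scores else None

print(vote([
    ["Paul_Newman", "Paul_Newman_(band)"],  # dictionary cascade
    ["Paul_Newman"],                        # Google ranking
    ["Paul_Newman_(band)", "Paul_Newman"],  # supervised
    ["Paul_Newman"],                        # knowledge-based
]))  # -> Paul_Newman (score 3.5 vs. 1.5)
```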

SLIDE 15

Results & Conclusions

  system                   micro   KB
  Best                     82.17   77.25
  Stanford_UBC2 (voting)   78.84   75.88
  Stanford_UBC3 (meta)     75.10   73.25
  Stanford_UBC1 (dict)     74.85   69.49
  Median                   71.80   63.52

  • Good results overall

– Dictionary as cornerstone
– Priors remarkable
– NIL too conservative

  • Combination

– Effective use of context
– Voting worked best
– Meta-classifier weak

  • WSD techniques work
  • Currently

– Error analysis

SLIDE 16

Slot filling

  • Distant supervision (Mintz et al., 2009):

– Use facts in the Knowledge Base (via the provided mapping)
  => gold-standard entity–slot–filler tuples
– Search for spans containing an entity–filler pair in the document base
  => positive examples for training
– Search for mentions of the target entity in the document collection
– Run each of the classifiers

  • Manual work kept to a minimum: types of fillers
SLIDE 17

Get gold tuples from KB

  • Infobox slot names

– Use mapping provided by organizers

  Paul_Newman – occupation – “actor”
  Paul_Newman – per:title – “actor”

  • Ambiguity in the mapping; multiple fillers in one string

per:place_of_birth

“November 29, 1970 (1970-11-29) (age 38) Las Vegas, Nevada”

– Set type of filler (or closed list) for each slot
– Use NER on filler text
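A sketch of that cleanup, assuming an illustrative slot-to-NER-type table and a tagger that returns (span, type) pairs:

```python
# Expected NER type per slot (illustrative entries).
SLOT_TYPE = {"per:place_of_birth": "LOCATION", "per:date_of_birth": "DATE"}

def clean_filler(slot, raw_filler, ner):
    """Keep only the spans of the raw infobox filler whose NER type
    matches the slot's expected type."""
    return [span for span, typ in ner(raw_filler) if typ == SLOT_TYPE[slot]]

def toy_ner(text):  # stand-in for a real tagger
    return [("November 29, 1970", "DATE"), ("Las Vegas, Nevada", "LOCATION")]

print(clean_filler(
    "per:place_of_birth",
    "November 29, 1970 (1970-11-29) (age 38) Las Vegas, Nevada",
    toy_ner))  # ['Las Vegas, Nevada']
```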

SLIDE 18

Train classifiers for each slot

  • Extract positive examples from document base

  [5 words] entity [0–10 words] filler [5 words]
  [5 words] filler [0–10 words] entity [5 words]

  • Negative examples

– Spans from other slots matching the entity type (2× the positives, if available)
– Spans with the entity, containing a string of the required type

  • Train logistic regression
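A hedged sketch of the training step, assuming the windowed spans have already been extracted and are represented as plain text (scikit-learn's logistic regression as a stand-in):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_slot_classifier(positives, negatives):
    """Binary logistic regression for one slot; inputs are the text
    spans around the entity and candidate filler."""
    spans = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(spans, labels)
    return clf

clf = train_slot_classifier(
    positives=["ENTITY worked as a FILLER", "ENTITY , a FILLER , said"],
    negatives=["ENTITY was born in FILLER", "ENTITY met FILLER at"],
)
print(clf.predict(["ENTITY worked as FILLER"]))  # [1]
```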
SLIDE 19

Extract fillers

  • Search for mentions of target entity in collection

  [30 words] entity [30 words]

  • Run NER to select potential fillers
  • Run each of the classifiers
  • For each accepted entity–filler pair,

count and average classifier weights

  • For each entity slot:

– If single-valued, return top-scoring filler
– If multiple-valued, return 5 top-scoring fillers

  • Link fillers to entities using LNRM dictionary method
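A sketch of the selection step: average the classifier scores over the accepted occurrences of each filler, then keep the top one (single-valued slots) or the top five (multiple-valued slots).

```python
from collections import defaultdict

def select_fillers(scored_pairs, multi_valued):
    """scored_pairs: (filler, classifier score), one per accepted span."""
    totals, counts = defaultdict(float), defaultdict(int)
    for filler, score in scored_pairs:
        totals[filler] += score
        counts[filler] += 1
    average = {f: totals[f] / counts[f] for f in totals}
    ranked = sorted(average, key=average.get, reverse=True)
    return ranked[:5] if multi_valued else ranked[:1]

print(select_fillers([("actor", 0.9), ("actor", 0.7), ("director", 0.4)],
                     multi_valued=False))  # ['actor']
```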
SLIDE 20

Results and conclusions

  • Below median
  • Premature version of the baseline system

– Too liberal (few NILs)

  • Non-NILs over median

– Filler in more than one slot

  system         SF-average
  Best           77.9
  Median         46.1
  Stanford_UBC3  37.3
  Stanford_UBC1  35.5

  1 – Basic system
  2 – Bug, same as 1
  3 – Same as 2 but with more negative samples

SLIDE 21

Thank you!