Relation Extraction
- Prof. Sameer Singh
CS 295: STATISTICAL NLP WINTER 2017
February 23, 2017
Based on slides from Dan Jurafsky, Chris Manning, and everyone else they copied from.
Text
John was born in Liverpool, to Julia and Alfred Lennon.
Literal Facts
birthplace(John Lennon, Liverpool)
childOf(John Lennon, Julia Lennon)
childOf(John Lennon, Alfred Lennon)
Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
Extracted Complex Relation: Company-Founding
Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.
But we will focus on the simpler task of extracting relation triples:
Founding-year(IBM, 1911)
Founding-location(IBM, New York)
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891
Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford
ROLE: relates a person to an organization or a geopolitical entity
PART: generalized containment
AT: permanent and transient locations
SOCIAL: social relations among persons
ARTIFACT: User-Owner-Inventor-Manufacturer
GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
PART-WHOLE: Geographical, Subsidiary
PERSON-SOCIAL: Business, Family, Lasting Personal
PHYSICAL: Located, Near
UMLS Resource
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
↓
Echocardiography, Doppler DIAGNOSES Acquired stenosis
Thousands of relations and millions of instances!
Manually created from multiple sources, including Wikipedia InfoBoxes.
What does Gelidium mean? How do you know?
Hearst (1992): Automatic Acquisition of Hyponyms
Hearst pattern      Example occurrences
X and other Y       ...temples, treasuries, and other important civic buildings.
X or other Y        Bruises, wounds, broken bones or other injuries...
Y such as X         The bow lute, such as the Bambara ndang...
Such Y as X         ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X       ...common-law countries, including Canada and England...
Y , especially X    European countries, especially France, England, and Spain...
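The patterns in the table can be approximated with regular expressions. The sketch below is illustrative only: the names `WORD`, `PATTERNS`, and `extract_hyponyms` are made up for this example, and a single word stands in for each noun phrase, so multi-word phrases and modifiers (e.g. "important civic buildings") are not handled. A real system would match over noun-phrase chunks instead.

```python
import re

# Assumption: one alphabetic word approximates a noun phrase.
WORD = r"[A-Za-z-]+"

# Each regex names the hyponym (X) group "hypo" and the hypernym (Y) group "hyper".
PATTERNS = [
    # X and other Y / X or other Y
    re.compile(rf"(?P<hypo>{WORD}),? (?:and|or) other (?P<hyper>{WORD})"),
    # Y such as X / Y including X / Y, especially X
    re.compile(rf"(?P<hyper>{WORD}),? (?:such as|including|especially) (?P<hypo>{WORD})"),
    # Such Y as X
    re.compile(rf"such (?P<hyper>{WORD}) as (?P<hypo>{WORD})", re.IGNORECASE),
]


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by any of the patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("hypo"), match.group("hyper")))
    return pairs


print(extract_hyponyms("European countries, especially France, England, and Spain"))
```

Only the first item after the cue phrase is captured here; extending the match to the full coordinated list is left out to keep the sketch short.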
Who holds what office in what organization?
PERSON, POSITION of ORG
PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
Combine tokens, dependency paths, and entity types to define rules.
Argument 1: Person
Argument 2: Organization
Dependency path edges: appos, nmod, case, det
Bill Gates, the CEO of Microsoft, said …
… announced by Steve Jobs, the CEO of Apple.
… announced by Bill Gates, the director and CEO of Microsoft.
… mused Bill, a former CEO of Microsoft.
and many other possible instantiations…
Use a collection of rules as the system itself
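[Figure: a rule whose pattern links Argument 1 (Person) to Argument 2 (Organization) through the tokens “, DT CEO” and the dependency edges appos, nmod, case, det. Implies: a headOf relation between Argument 1 and Argument 2.]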
Source:
Multiple Rules:
Variations
Pluses
Minuses
ACE 2008 “Relation Extraction Task”
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
SUBSIDIARY, FAMILY, EMPLOYMENT, NIL, FOUNDER, CITIZEN, INVENTOR
Mention 1 (M1): American Airlines
Mention 2 (M2): Tim Wagner
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Headwords of M1 and M2, and combination
  Airlines
  Wagner
  Airlines-Wagner
Bag of words and bigrams in M1 and M2
  {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
Words or bigrams in particular positions left and right of M1/M2
  M2: -1 spokesman
  M2: +1 said
Bag of words or bigrams between the two entities
  {a, AMR, of, immediately, matched, move, spokesman, the, unit}
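The word features listed above can be computed directly from a tokenized sentence and the two mention spans. The following is a minimal sketch, not a reference implementation: `ngrams` and `word_features` are hypothetical names, mention spans are assumed to be given as (start, end) token offsets with M1 before M2, the headword is approximated by the last token of the mention, and punctuation is dropped from the between-words bag.

```python
def ngrams(tokens, n):
    """Set of space-joined n-grams over a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def word_features(tokens, m1, m2):
    """Word-based features for a mention pair.

    tokens: list of str
    m1, m2: (start, end) token spans, end exclusive; m1 is assumed to precede m2.
    """
    m1_tokens = tokens[m1[0]:m1[1]]
    m2_tokens = tokens[m2[0]:m2[1]]
    # Assumption: the last token of a mention approximates its headword.
    head1, head2 = m1_tokens[-1], m2_tokens[-1]
    # Keep only alphanumeric tokens between the mentions (drops commas).
    between = {t for t in tokens[m1[1]:m2[0]] if t.isalnum()}
    return {
        "head_m1": head1,
        "head_m2": head2,
        "head_pair": f"{head1}-{head2}",
        "mention_words": (ngrams(m1_tokens, 1) | ngrams(m2_tokens, 1)
                          | ngrams(m1_tokens, 2) | ngrams(m2_tokens, 2)),
        "m2_minus_1": tokens[m2[0] - 1] if m2[0] > 0 else None,
        "m2_plus_1": tokens[m2[1]] if m2[1] < len(tokens) else None,
        "between_words": between,
    }


tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
features = word_features(tokens, (0, 2), (14, 16))
print(features["head_pair"], features["m2_minus_1"], features["m2_plus_1"])
```

In a supervised system each of these values would typically be turned into indicator features before being passed to the classifier.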
Named-entity types
Concatenation of the two named-entity types
Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
[it or he would be PRONOUN]
[the company would be NOMINAL]
Base syntactic chunk sequence from one to the other
  NP NP PP VP NP NP
Constituent path through the tree from one to the other
  NP ↑ NP ↑ S ↑ S ↓ NP
Dependency path
  Airlines matched Wagner said
Trigger list for family: kinship terms
Gazetteer:
Machine Learning: hopefully, generalizes the labels in the right way.
Use all of NLP as features: words, POS, NER, dependencies, embeddings.
However:
Usually, a lot of labeled data is needed, which is expensive and time-consuming.
Requires a lot of feature engineering!
[Figure: “John was born in Liverpool, to Julia and Alfred Lennon.” → Feature Engineering (NER, POS, Dep Path, Text in b/w, embeddings, …) → Classifier → P(birthplace) = 0.75]
Pluses
to different genres
Minuses
<Mark Twain, Elmira>: seed tuple of “died in”
Look for the environments of the seed tuple:
“Mark Twain is buried in Elmira, NY.” → X is buried in Y
“The grave of Mark Twain is in Elmira” → The grave of X is in Y
“Elmira is Mark Twain’s final resting place” → Y is X’s final resting place
Use those patterns to find new tuples
Repeat
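The seed-to-pattern-to-tuple loop can be sketched in a few lines. Everything below is an illustrative assumption rather than a specific published system: the function names, the `<X>`/`<Y>` placeholders, whole-sentence patterns, and the toy corpus (whose extra names and places are invented) are all hypothetical, and there is no pattern scoring, so a single bad pattern would propagate errors.

```python
import re


def induce_pattern(sentence, seed):
    """Replace the seed entities in a sentence with <X>/<Y> placeholders."""
    x, y = seed
    if x not in sentence or y not in sentence:
        return None
    return sentence.replace(x, "<X>", 1).replace(y, "<Y>", 1)


def match_pattern(pattern, sentence):
    """Apply a placeholder pattern to a sentence; return (x, y) or None."""
    parts = re.split(r"(<X>|<Y>)", pattern)
    regex = "".join(
        "(?P<x>.+?)" if part == "<X>"
        else "(?P<y>.+?)" if part == "<Y>"
        else re.escape(part)
        for part in parts
    )
    match = re.fullmatch(regex, sentence)
    return (match.group("x"), match.group("y")) if match else None


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between inducing patterns from tuples and tuples from patterns."""
    tuples, patterns = set(seeds), set()
    for _ in range(rounds):
        for sentence in corpus:
            for seed in tuples:
                pattern = induce_pattern(sentence, seed)
                if pattern:
                    patterns.add(pattern)
        for sentence in corpus:
            for pattern in patterns:
                found = match_pattern(pattern, sentence)
                if found:
                    tuples.add(found)
    return tuples, patterns


# Toy corpus: the last two sentences use invented names.
corpus = [
    "Mark Twain is buried in Elmira",
    "The grave of Mark Twain is in Elmira",
    "Jane Doe is buried in Springfield",
    "The grave of John Roe is in Shelbyville",
]
tuples, patterns = bootstrap(corpus, {("Mark Twain", "Elmira")})
print(sorted(tuples))
```

Using the whole sentence as the pattern keeps the sketch short; practical systems generalize the context around the entities and filter patterns by confidence.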
Start with 5 seeds:
Author                Book
Isaac Asimov          The Robots of Dawn
David Brin            Startide Rising
James Gleick          Chaos: Making a New Science
Charles Dickens       Great Expectations
William Shakespeare   The Comedy of Errors
Find instances on the Web:
The Comedy of Errors, by William Shakespeare, was
The Comedy of Errors, by William Shakespeare, is
The Comedy of Errors, one of William Shakespeare's earliest attempts
The Comedy of Errors, one of William Shakespeare's most
Extract patterns (group by middle, take longest common prefix/suffix):
?x , by ?y ,
?x , one of ?y ‘s
Now iterate, finding new seeds that match the pattern
Brin, Sergey. 1998. Extracting Patterns …
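The pattern-extraction step (group by middle, take the longest common prefix and suffix) can be illustrated as follows. This is a simplified sketch under stated assumptions: each occurrence is assumed to be already split into (prefix, middle, suffix) strings around the two entities, the name `dipre_patterns` is hypothetical, and the comparison is character-based using the standard-library `os.path.commonprefix`.

```python
import os
from collections import defaultdict


def dipre_patterns(occurrences):
    """Group occurrences by middle string and generalize their contexts.

    occurrences: list of (prefix, middle, suffix) strings, where prefix is the
    text before ?x, middle the text between ?x and ?y, suffix the text after ?y.
    Returns a list of (prefix, middle, suffix) patterns.
    """
    by_middle = defaultdict(list)
    for prefix, middle, suffix in occurrences:
        by_middle[middle].append((prefix, suffix))

    patterns = []
    for middle, contexts in by_middle.items():
        prefixes = [p for p, _ in contexts]
        suffixes = [s for _, s in contexts]
        # Prefixes must agree on the text right before ?x, so compare them reversed.
        shared_prefix = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
        # Suffixes must agree on the text right after ?y.
        shared_suffix = os.path.commonprefix(suffixes)
        patterns.append((shared_prefix.strip(), middle, shared_suffix.strip()))
    return patterns


occurrences = [
    ("", ", by ", ", was"),
    ("", ", by ", ", is"),
    ("", ", one of ", "'s earliest attempts"),
    ("", ", one of ", "'s most"),
]
for prefix, middle, suffix in dipre_patterns(occurrences):
    print(f"{prefix}?x{middle}?y{suffix}")
```

The two printed patterns correspond to the ones shown above; matching them against new text to harvest further (author, book) pairs would be the next iteration.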
Similar iterative algorithm
Organization   Location of Headquarters
Microsoft      Redmond
Exxon          Irving
IBM            Armonk
Group instances w/similar prefix, middle, suffix, extract patterns
Pattern                                         Confidence
ORGANIZATION {’s, in, headquarters} LOCATION    .69
LOCATION {in, based} ORGANIZATION               .75
Snow, Jurafsky, Ng (2005), Wu & Weld (2007), Mintz, Bills, Snow, Jurafsky (2009)
1. For each relation: Born-In
2. For each tuple in big database: <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities:
   Hubble was born in Marshfield
   Einstein, born (1879), Ulm
   Hubble’s birthplace in Marshfield
4. Extract frequent features (parse, words, etc.):
   PER was born in LOC
   PER, born (XXXX), LOC
   PER’s birthplace in LOC
5. Train supervised classifier using these patterns:
   P(born-in | f1, f2, f3, …, f70000)
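The data-collection part of this recipe (everything before training the classifier) can be sketched as below. The sketch makes several simplifying assumptions that are not from the slides: entities are matched by exact substring, so the toy sentences use full names; the only feature is the sentence with the entities replaced by PER and LOC; and the function name `distant_features` and the toy knowledge base are hypothetical. Classifier training is omitted.

```python
from collections import Counter


def distant_features(kb, corpus):
    """Count lexical features of sentences that mention a known tuple.

    kb: dict mapping relation name -> set of (person, location) tuples
    corpus: list of sentences
    Returns a dict mapping relation name -> Counter of feature strings.
    """
    counts = {relation: Counter() for relation in kb}
    for relation, tuples in kb.items():
        for person, location in sorted(tuples):
            for sentence in corpus:
                # Distant supervision assumption: any sentence containing both
                # entities of a known tuple expresses the relation.
                if person in sentence and location in sentence:
                    feature = sentence.replace(person, "PER").replace(location, "LOC")
                    counts[relation][feature] += 1
    return counts


kb = {"born-in": {("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")}}
corpus = [
    "Edwin Hubble was born in Marshfield",
    "Albert Einstein was born in Ulm",
    "Edwin Hubble visited Ulm",  # no known tuple matches, so it is ignored
]
print(distant_features(kb, corpus)["born-in"].most_common(1))
```

The frequent features collected this way would then serve as inputs to the supervised classifier in the final step.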
Open Information Extraction:
1. Use parsed data to train a “trustworthy tuple” classifier
2. Single-pass extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
(FCI, specializes in, software development)
(Tesla, invented, coil transformer)
Banko, Cafarella, Soderland, Broadhead, Etzioni. 2007
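Since it extracts totally new relations from the web, there is no gold set of correct relation instances to compare against.
Instead, we can approximate precision (only), using a manually checked sample of the output:
$$\hat{P} = \frac{\text{\# of correctly extracted relations in the sample}}{\text{total \# of extracted relations in the sample}}$$
Can also compute precision at different levels of recall.
But no way to evaluate recall.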
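As a small worked illustration of the sampled precision estimate, the sketch below draws a random sample of extractions and scores it with a judgment function. The names `estimate_precision` and `is_correct` are hypothetical; in practice `is_correct` stands for a human annotator's decision, and the extraction list is assumed to be non-empty. The judgments in the toy data are made up for the example.

```python
import random


def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample of extracted relations.

    extractions: non-empty list of extracted tuples
    is_correct: callable returning True if a tuple is judged correct
                (a stand-in for manual annotation)
    """
    rng = random.Random(seed)
    sample = rng.sample(extractions, min(sample_size, len(extractions)))
    correct = sum(1 for extraction in sample if is_correct(extraction))
    return correct / len(sample)


extractions = [
    ("FCI", "specializes in", "software development"),
    ("Tesla", "invented", "coil transformer"),
    ("Stanford", "located in", "California"),
    ("the move", "matched", "spokesman"),  # made-up bad extraction
]
judged_wrong = {("the move", "matched", "spokesman")}
print(estimate_precision(extractions, lambda e: e not in judged_wrong, sample_size=4))
```

Applying the same function to the top-ranked 1,000 or 10,000 extractions gives precision at different output sizes, but it still says nothing about recall.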
Homework
Project
Summaries