Relation Extraction
CSCI 699, Instructor: Xiang Ren, USC Computer Science
Relation extraction example
CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
Question: What relations should we extract?
Relation extraction example
Subject             Relation     Object
American Airlines   subsidiary   AMR
Tim Wagner          employee     American Airlines
United Airlines     subsidiary   UAL
Why Relation Extraction?
- Create new structured knowledge bases, useful for any app
- Augment current knowledge bases
  - Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
- Support question answering
  - The granddaughter of which actor starred in the movie “E.T.”?
    (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
- But which relations should we extract?
Relation types
For generic news texts ...
slide adapted from Jim Martin
Databases of Wikipedia Relations
Relations extracted from Infobox:
- Stanford state California
- Stanford motto “Die Luft der Freiheit weht”

Wikipedia Infobox
Relation types: Freebase
23 Million Entities, thousands of relations
More relations: disease outbreaks
slide adapted from Eugene Agichtein
More relations: protein interactions
slide adapted from Rosario & Hearst
Relations between word senses
- NLP applications need word meaning!
- Question answering
- Conversational agents
- Summarization
- One key meaning component: word relations
- Hyponymy: San Francisco is an instance of a city
- Antonymy: acidic is the opposite of basic
- Meronymy: an alternator is a part of a car
WordNet is incomplete
Ontological relations are missing for many words:

In WordNet 3.1                 Not in WordNet 3.1
insulin, progesterone          leptin, pregnenolone
combustibility, navigability   affordability, reusability
HTML                           XML
Google, Yahoo                  Microsoft, IBM

- Esp. for specific domains: restaurants, auto parts, finance
Evaluation of Relation Extraction
- Compute Precision / Recall / F1 score for each relation

P  = (# of correctly extracted relations) / (total # of extracted relations)
R  = (# of correctly extracted relations) / (total # of gold relations)
F1 = 2PR / (P + R)
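These metrics can be computed directly over sets of extracted triples. A minimal sketch; the gold and extracted triples below are illustrative, not from a real system:

```python
# Precision / recall / F1 for relation extraction, computed over
# (subject, relation, object) triples.

def prf1(extracted, gold):
    """Return (precision, recall, F1) for two sets of relation triples."""
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

gold = {
    ("American Airlines", "subsidiary", "AMR"),
    ("Tim Wagner", "employee", "American Airlines"),
    ("United Airlines", "subsidiary", "UAL"),
}
extracted = {
    ("American Airlines", "subsidiary", "AMR"),
    ("United Airlines", "subsidiary", "UAL"),
    ("Tim Wagner", "employee", "UAL"),   # wrong object: counts against precision
}

p, r, f1 = prf1(extracted, gold)
print(p, r, f1)  # precision, recall, and F1 are each 2/3 here
```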
Relation extraction: 5 types of methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods
A hand-built extraction rule
NYU Proteus system (1997)
Patterns for learning hyponyms
- Intuition from Hearst (1992)
Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.
- What does Gelidium mean?
- How do you know?
Hearst’s lexico-syntactic patterns
- Y such as X ((, X)* (, and/or) X)
- such Y as X
- X ... or other Y
- X ... and other Y
- Y including X
- Y, especially X
Hearst, 1992. Automatic Acquisition of Hyponyms from Large Text Corpora.
Examples of the Hearst patterns
Hearst pattern    Example occurrence
X and other Y     ...temples, treasuries, and other important civic buildings.
X or other Y      bruises, wounds, broken bones or other injuries...
Y such as X       The bow lute, such as the Bambara ndang...
such Y as X       ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X     ...common-law countries, including Canada and England...
Y, especially X   European countries, especially France, England, and Spain...
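These patterns are easy to approximate with regular expressions. The sketch below hard-codes two of them over single-word "NPs"; real Hearst matching runs over POS-tagged NP chunks, so this deliberately misses multiword phrases:

```python
import re

# Toy regex rendering of two Hearst patterns.
NP = r"[A-Za-z][A-Za-z-]*"   # crude one-word noun-phrase stand-in

SUCH_AS = re.compile(rf"({NP}) such as ({NP})")      # Y such as X
AND_OTHER = re.compile(rf"({NP}) and other ({NP})")  # X and other Y

def hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        pairs.append((m.group(2), m.group(1)))   # X is-a Y
    for m in AND_OTHER.finditer(sentence):
        pairs.append((m.group(1), m.group(2)))   # X is-a Y
    return pairs

print(hyponym_pairs("flights to cities such as Dallas"))   # [('Dallas', 'cities')]
print(hyponym_pairs("broken bones and other injuries"))    # [('bones', 'injuries')]
```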
Problems with hand-built patterns
- Requires hand-building patterns for each relation!
- hard to write; hard to maintain
- there are zillions of them
- domain-dependent
- Don’t want to do this for all possible relations!
- Plus, we’d like better accuracy
- Hearst: 66% accuracy on hyponym extraction
Relation extraction: 5 easy methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods
Bootstrapping approaches
- If you don’t have enough annotated text to train on …
- But you do have:
- some seed instances of the relation
- (or some patterns that work pretty well)
- and lots & lots of unannotated text (e.g., the web)
- … can you use those seeds to do something useful?
- Bootstrapping can be considered semi-supervised
Bootstrapping example
- Target relation: burial place
- Seed tuple: [Mark Twain, Elmira]
- Grep/Google for “Mark Twain” and “Elmira”
“Mark Twain is buried in Elmira, NY.”           → X is buried in Y
“The grave of Mark Twain is in Elmira”          → The grave of X is in Y
“Elmira is Mark Twain’s final resting place”    → Y is X’s final resting place
- Use those patterns to search for new tuples
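One iteration of this seed-to-pattern-to-tuple loop can be sketched as follows. The corpus, seed, and naive string templates are illustrative; real systems such as DIPRE and Snowball also score patterns and tuples to limit semantic drift:

```python
import re

# Toy bootstrapping iteration for the burial-place relation.
corpus = [
    "Mark Twain is buried in Elmira.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Mark Twain is in Elmira",
    "The grave of Jane Austen is in Winchester",
]
seeds = {("Mark Twain", "Elmira")}

def induce_patterns(tuples, corpus):
    """Turn every sentence containing a seed pair into a template."""
    patterns = set()
    for x, y in tuples:
        for sent in corpus:
            if x in sent and y in sent:
                patterns.add(sent.replace(x, "{X}").replace(y, "{Y}"))
    return patterns

def apply_patterns(patterns, corpus):
    """Match the templates against the corpus to harvest new tuples."""
    tuples = set()
    for pat in patterns:
        rx = re.escape(pat).replace(r"\{X\}", "(.+?)").replace(r"\{Y\}", "(.+?)")
        for sent in corpus:
            m = re.fullmatch(rx, sent)
            if m:
                tuples.add((m.group(1), m.group(2)))
    return tuples

patterns = induce_patterns(seeds, corpus)   # e.g. "{X} is buried in {Y}."
tuples = apply_patterns(patterns, corpus)   # grows the seed set with new pairs
```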
Bootstrapping relations
slide adapted from Jim Martin
DIPRE (Brin 1998)
Extract (author, book) pairs.
- Start with 5 seed pairs
- Learn patterns from occurrences of the seeds
- Iterate: use patterns to get more instances & patterns
- Results: after three iterations of the bootstrapping loop, extracted 15,000 author-book pairs with 95% accuracy
Snowball (Agichtein & Gravano 2000)
New ideas:
- require that X and Y be named entities
- add heuristics to score extractions, select best ones
Snowball Results!
Actual patterns discovered by Snowball (for each pattern the left vector is empty, tag1 = ORGANIZATION, tag2 = LOCATION):

Conf   Middle vector
1.00   <based, 0.53> <in, 0.53> <, , 0.01>
0.69   <’, 0.42> <s, 0.42> <headquarters, 0.42> <in, 0.12>
0.61   <(, 0.93> <), 0.12>

Extraction accuracy:

Method                  Correct   Incorrect (Loc / Org / Rel errors)   P_ideal
DIPRE                   74        26 (3 / 18 / 5)                      90%
Snowball (all tuples)   52        48 (6 / 41 / 1)                      88%
Snowball (τ_t = 0.8)    93        7  (3 / 4 / 0)                       96%
Baseline                25        75 (8 / 62 / 5)                      66%
Bootstrapping problems
- Requires that we have seeds for each relation
- Sensitive to original set of seeds
- Big problem of semantic drift at each iteration
- Precision tends to be not that high
- Generally have lots of parameters to be tuned
- No probabilistic interpretation
- Hard to know how confident to be in each result
Relation extraction: 5 easy methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods
Supervised relation extraction
The supervised approach requires:
- Defining an inventory of output labels
  - Relation detection: true/false
  - Relation classification: located-in, employee-of, inventor-of, ...
- Collecting labeled training data: MUC, ACE, ...
- Defining a feature representation: words, entity types, ...
- Choosing a classifier: Naïve Bayes, MaxEnt, SVM, ...
- Evaluating the results
ACE 2008: relations
ACE 2008: data
Features
- Lightweight features — require little pre-processing
  - Bags of words & bigrams between, before, and after the entities
  - Stemmed versions of the same
  - The types of the entities
  - The distance (number of words) between the entities
- Medium-weight features — require base phrase chunking
  - Base-phrase chunk paths
- Heavyweight features — require full syntactic parsing
  - Dependency-tree paths
  - Constituent-tree paths
  - Tree distance between the entities
Let’s take a closer look at features used in (Zhou et al. 2005)
Features: words
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Bag-of-words features:   WM1 = {American, Airlines}, WM2 = {Tim, Wagner}
Head-word features:      HM1 = Airlines, HM2 = Wagner, HM12 = Airlines+Wagner
Words in between:        WBNULL = false, WBF = a, WBL = spokesman, WBO = {unit, of, AMR, immediately, matched, the, move}
Words before and after:  BM1F = NULL, BM1L = NULL, AM2F = said, AM2L = NULL
Word features yield good precision (69%), but poor recall (24%)
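A sketch of how these lexical features could be computed from token spans. Taking the mention's last token as its head word and filtering punctuation from the in-between words are simplifying assumptions of this sketch, not necessarily what Zhou et al. did:

```python
# Lexical features in the style of Zhou et al. (2005) for the example sentence.
tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
m1 = (0, 2)    # token span of "American Airlines"
m2 = (14, 16)  # token span of "Tim Wagner"

def word_features(tokens, m1, m2):
    feats = {}
    feats["WM1"] = set(tokens[m1[0]:m1[1]])      # bag of words in mention 1
    feats["WM2"] = set(tokens[m2[0]:m2[1]])
    feats["HM1"] = tokens[m1[1] - 1]             # head word (~ last token)
    feats["HM2"] = tokens[m2[1] - 1]
    between = [t for t in tokens[m1[1]:m2[0]] if t.isalnum()]  # drop punctuation
    feats["WBNULL"] = len(between) == 0          # no words in between?
    feats["WBF"] = between[0] if between else None   # first word between
    feats["WBL"] = between[-1] if between else None  # last word between
    feats["WBO"] = set(between[1:-1])            # other words in between
    return feats

f = word_features(tokens, m1, m2)
print(f["HM1"], f["HM2"], f["WBF"], f["WBL"])   # Airlines Wagner a spokesman
```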
Features: NE type & mention level
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Named entity types (ORG, LOC, PER, etc.):   ET1 = ORG, ET2 = PER, ET12 = ORG-PER
Mention levels (NAME, NOMINAL, or PRONOUN): ML1 = NAME, ML2 = NAME, ML12 = NAME+NAME
- Named entity type features help recall a lot (+8%)
- Mention level features have little impact
Features: overlap
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

- Number of mentions and words in between: #MB = 1, #WB = 9
- Does one mention include the other? M1>M2 = false, M1<M2 = false
- Conjunctive features:
  ET12+M1>M2 = ORG-PER+false, ET12+M1<M2 = ORG-PER+false
  HM12+M1>M2 = Airlines+Wagner+false, HM12+M1<M2 = Airlines+Wagner+false
- These features hurt precision a lot (-10%), but also help recall a lot (+8%)
Features: syntactic features
Features of mention dependencies:
  ET1DW1 = ORG:Airlines, H1DW1 = matched:Airlines
  ET2DW2 = PER:Wagner, H2DW2 = said:Wagner
Features describing entity types and the dependency tree:
  ET12SameNP = ORG-PER-false, ET12SamePP = ORG-PER-false, ET12SameVP = ORG-PER-false
These features had disappointingly little impact!
Relation extraction classifiers
Now use any (multiclass) classifier you like:
- SVM
- MaxEnt (aka multiclass logistic regression)
- Naïve Bayes
- etc.
[Zhou et al. 2005 used a one-vs-many SVM]
Zhou et al. 2005 results
Position-aware LSTM for Relation Extraction (Zhang et al., 2017)
Relation Extraction
Penner is survived by his brother, John, a copy editor at the Times, and his former wife, Times sportswriter Lisa Dillman.
Key elements:
- Context (relevant + irrelevant)
- Entities (types + positions)
Position-aware Attention Model

[Figure: an LSTM runs over the sentence "Mike and Lisa ... married ...", with Mike marked as subject and Lisa as object; each token i has a hidden state h_i, a subject position embedding p^s_i, and an object position embedding p^o_i; attention weights a_1 ... a_n combine the hidden states.]

- Embedding layers: word vectors x = [x_1, ..., x_n], subject position embeddings p^s = [p^s_1, ..., p^s_n], object position embeddings p^o = [p^o_1, ..., p^o_n]
- LSTM layer: {h_1, ..., h_n} = LSTM({x_1, ..., x_n})
- Summary vector: q = h_n
- Attention layer: computes weights a_i from h_i, p^s_i, p^o_i, and q
- Relation representation: attention-weighted sum of the hidden states
- Softmax layer: outputs the relation label
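The attention mechanics can be sketched in numpy. The scoring form v^T tanh(W_h h_i + W_q q + W_s p^s_i + W_o p^o_i) is my reading of the paper's position-aware attention; all weights here are random placeholders, so this shows shapes and mechanics only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, d_p, d_a = 6, 8, 4, 5      # seq len, hidden, position-emb, attn dims

H  = rng.normal(size=(n, d_h))     # LSTM hidden states h_1 .. h_n
Ps = rng.normal(size=(n, d_p))     # subject position embeddings p^s_i
Po = rng.normal(size=(n, d_p))     # object position embeddings p^o_i
q  = H[-1]                         # summary vector q = h_n

Wh = rng.normal(size=(d_h, d_a))
Wq = rng.normal(size=(d_h, d_a))
Ws = rng.normal(size=(d_p, d_a))
Wo = rng.normal(size=(d_p, d_a))
v  = rng.normal(size=(d_a,))

u = np.tanh(H @ Wh + q @ Wq + Ps @ Ws + Po @ Wo)   # (n, d_a) attention inputs
scores = u @ v                                     # (n,) unnormalized scores
a = np.exp(scores - scores.max())
a /= a.sum()                                       # softmax: weights sum to 1
z = a @ H                                          # relation representation (d_h,)
```

A softmax classifier over z would then produce the relation label.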
Other Augmentations

- Word dropout: balance the OOV distribution
  "Penner is survived by his brother, John" → "Penner is survived by his brother, <UNK>"
- Entity masking: focus on relations, not specific entities
  "Penner is survived by his brother, John" → "SUBJ-PER is survived by his brother, OBJ-PER"
- Linguistic information: POS and NER embeddings from Stanford CoreNLP
  "Penner is survived by ..." → POS: NNP VBZ VBN IN ...   NER: PER O O O ...
Models Compared Against

Non-neural:
- Stanford's TAC KBP 2015 winning system
- Patterns
- Logistic regression (LR)

Neural:
- CNN with positional encodings (Nguyen and Grishman, 2015)
- Dependency-based RNN (Xu et al., 2015)
- LSTM: 2-layer stacked LSTM
Relation Extraction Results

Model (traditional)   P      R      F1
Patterns              86.9   23.2   36.6
LR                    73.5   49.9   59.4
LR + Patterns         72.9   51.8   60.5

- Patterns: high precision
- LR: relatively higher recall
Relation Extraction Results

Model                        P      R      F1
LR + Patterns (traditional)  72.9   51.8   60.5
CNN                          75.6   47.5   58.3
CNN-PE                       70.3   54.2   61.2
SDP-LSTM                     66.3   52.7   58.7
LSTM                         65.7   59.9   62.7

- CNN: higher precision; LSTM: higher recall
- CNN-PE and LSTM outperform the traditional models
Relation Extraction Results

Model                        P      R      F1
LR + Patterns (traditional)  72.9   51.8   60.5
LSTM                         65.7   59.9   62.7
Our model                    65.7   64.5   65.1
Ensemble (5)                 70.1   64.6   67.2

- Our model: +2.4 improvement in F1
Supervised RE: summary
- Supervised approach can achieve high accuracy
- At least, for some relations
- If we have lots of hand-labeled training data
- But has significant limitations!
- Labeling 5,000 relations (+ named entities) is expensive
- Doesn’t generalize to different relations
Relation extraction: 5 easy methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods
Distant supervision
- Hypothesis: if two entities belong to a certain relation, any sentence containing those two entities is likely to express that relation
- Key idea: use a database of relations to get lots of noisy training examples
  - instead of hand-creating seed tuples (bootstrapping)
  - instead of using a hand-labeled corpus (supervised)
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.
Benefits of distant supervision
- Has advantages of supervised approach
- leverage rich, reliable hand-created knowledge
- relations have canonical names
- can use rich features (e.g. syntactic features)
- Has advantages of unsupervised approach
- leverage unlimited amounts of text data
- allows for very large number of weak features
- not sensitive to training corpus: genre-independent
Hypernyms via distant supervision
Construct a noisy training set consisting of occurrences from our corpus that contain a hyponym-hypernym pair from WordNet.
This yields high-signal examples like:
“...consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...”
slide adapted from Rion Snow
Learning hypernym patterns
Key idea: work at the corpus level (entity pairs), instead of the sentence level!

- 1. Take corpus sentences ("... doubly heavy hydrogen atom called deuterium ...")
- 2. Collect noun pairs, e.g. (atom, deuterium) (752,311 pairs from 6M sentences of newswire)
- 3. Is the pair an IS-A in WordNet? (14,387 yes; 737,924 no)
- 4. Parse the sentences
- 5. Extract patterns (69,592 dependency paths with >5 pairs)
- 6. Train classifier on patterns (logistic regression with 70K features, converted to 974,288 bucketed binary features)

slide adapted from Rion Snow
One of 70,000 patterns
Pattern: <superordinate> called <subordinate>

Learned from cases such as:
  (sarcoma, cancer):   ...an uncommon bone cancer called osteogenic sarcoma and to...
  (deuterium, atom):   ...heavy water rich in the doubly heavy hydrogen atom called deuterium.

New pairs discovered:
  (efflorescence, condition):   ...and a condition called efflorescence are other reasons for...
  (O'neal_inc, company):        ...The company, now called O'Neal Inc., was sole distributor of...
  (hat_creek_outfit, ranch):    ...run a small ranch called the Hat Creek Outfit.
  (hiv-1, aids_virus):          ...infected by the AIDS virus, called HIV-1.
  (bateau_mouche, attraction):  ...local sightseeing attraction called the Bateau Mouche...
slide adapted from Rion Snow
What about other relations?
Mintz, Bills, Snow, Jurafsky (2009). Distant supervision for relation extraction without labeled data.
Training set: 102 relations, 940,000 entities, 1.8 million instances
Corpus: 1.8 million articles, 25.7 million sentences
Frequent Freebase relations
Collecting training data

Corpus text:
  Bill Gates founded Microsoft in 1975.
  Bill Gates, founder of Microsoft, ...
  Bill Gates attended Harvard from ...
  Google was founded by Larry Page ...

Freebase:
  Founder: (Bill Gates, Microsoft)
  Founder: (Larry Page, Google)
  CollegeAttended: (Bill Gates, Harvard)

Training data:
  (Bill Gates, Microsoft)   Label: Founder          Features: X founded Y; X, founder of Y
  (Larry Page, Google)      Label: Founder          Feature: Y was founded by X
  (Bill Gates, Harvard)     Label: CollegeAttended  Feature: X attended Y
Negative training data
Corpus text
  Larry Page took a swipe at Microsoft...
  ...after Harvard invited Larry Page to...
  Google is Bill Gates' worst fear ...

Training data:
  (Larry Page, Microsoft)   Label: NO_RELATION   Feature: X took a swipe at Y
  (Bill Gates, Google)      Label: NO_RELATION   Feature: Y is X's worst fear
  (Larry Page, Harvard)     Label: NO_RELATION   Feature: Y invited X
Can’t train a classifier with only positive data! Need negative training data too! Solution? Sample 1% of unrelated pairs of entities.
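The labeling step above can be sketched end to end. The KB, corpus, and "entities abstracted" feature template below are toy stand-ins; Mintz et al. use rich lexical and syntactic features, and sample only a fraction of unrelated pairs as negatives:

```python
# Toy distant supervision: align a relation database against corpus
# sentences to produce (pair, label, feature) training examples.
kb = {
    ("Bill Gates", "Microsoft"): "Founder",
    ("Larry Page", "Google"): "Founder",
    ("Bill Gates", "Harvard"): "CollegeAttended",
}
corpus = [
    "Bill Gates founded Microsoft in 1975.",
    "Bill Gates attended Harvard from 1973.",
    "Google was founded by Larry Page.",
    "Larry Page took a swipe at Microsoft.",
]

def featurize(sent, x, y):
    """Crude feature: the sentence with the entities abstracted to X and Y."""
    return sent.replace(x, "X").replace(y, "Y")

examples = []
# Positives: any sentence containing a KB-related pair gets the KB label.
for sent in corpus:
    for (x, y), rel in kb.items():
        if x in sent and y in sent:
            examples.append(((x, y), rel, featurize(sent, x, y)))

# Negatives: co-occurring entity pairs that are in no KB relation.
entities = {e for pair in kb for e in pair}
for sent in corpus:
    present = [e for e in entities if e in sent]
    for x in present:
        for y in present:
            if x != y and (x, y) not in kb and (y, x) not in kb:
                examples.append(((x, y), "NO_RELATION", featurize(sent, x, y)))
```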
Preparing test data

Corpus text:
  Henry Ford founded Ford Motor Co. in...
  Ford Motor Co. was founded by Henry Ford...
  Steve Jobs attended Reed College from...

Test data:
  (Henry Ford, Ford Motor Co.)   Label: ???   Features: X founded Y; Y was founded by X
  (Steve Jobs, Reed College)     Label: ???   Feature: X attended Y
Predictions!
The experiment

Positive training data:
  (Bill Gates, Microsoft)   Label: Founder          Features: X founded Y; X, founder of Y
  (Larry Page, Google)      Label: Founder          Feature: Y was founded by X
  (Bill Gates, Harvard)     Label: CollegeAttended  Feature: X attended Y

Negative training data:
  (Larry Page, Microsoft)   Label: NO_RELATION   Feature: X took a swipe at Y
  (Bill Gates, Google)      Label: NO_RELATION   Feature: Y is X's worst fear
  (Larry Page, Harvard)     Label: NO_RELATION   Feature: Y invited X

Learning: multiclass logistic regression → trained relation classifier

Test data:
  (Henry Ford, Ford Motor Co.)   Label: Founder
  (Steve Jobs, Reed College)     Label: CollegeAttended
Advantages of the approach
- ACE paradigm: labeling sentences
- This paradigm: labeling entity pairs
  - Make use of multiple appearances of entities
  - If a pair of entities appears in 10 sentences, and each sentence has 5 features extracted from it, the entity pair will have 50 associated features
Experimental set-up
- 1.8 million relation instances used for training
  - Compared to 17,000 relation instances in ACE
- 800,000 Wikipedia articles used for training, 400,000 different articles used for testing
- Only extract relation instances not already in Freebase
Newly discovered instances
Ten relation instances extracted by the system that weren't in Freebase
Human evaluation
Precision@K, using Mechanical Turk labelers:
- At recall of 100 instances, using both feature sets (lexical and syntax) offers the best performance for a majority of the relations
- At recall of 1000 instances, using syntax features improves performance for a majority of the relations
Distant supervision: conclusions
- Distant supervision extracts high-precision patterns for a variety of relations
- Can make use of 1000x more data than simple supervised algorithms
- Syntax features almost always help
- The combination of syntax and lexical features is sometimes even better
- Syntax features are probably most useful when entities are far apart, often when there are modifiers in between
Heterogeneous Supervision
- Provide a general framework to encode knowledge for supervision:
  - Knowledge base facts, heuristic patterns, ...
- Labelling functions Λ (domain-specific patterns + knowledge base):
  - return died_in for <e1, e2, s> if DiedIn(e1, e2) in KB
  - return born_in for <e1, e2, s> if match('* born in *', s)
  - return died_in for <e1, e2, s> if match('* killed in *', s)
  - return born_in for <e1, e2, s> if BornIn(e1, e2) in KB
(Liu et al., EMNLP 2017)
[Figure: a corpus D with three relation mentions c1–c3:
  "Robert Newton "Bob" Ford was an American outlaw best known for killing his gang leader Jesse James (e1) in Missouri (e2)."
  "Hussein (e1) was born in Amman (e2) on 14 November 1935."
  "Gofraid (e1) died in 989, said to be killed in Dal Riata (e2)."
Each labeling function in Λ is applied to each mention, producing possibly conflicting annotations.]
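The labeling functions in Λ can be written as simple Python callables. The KB contents, the example mention, the function ordering, and the plain substring matcher are all illustrative simplifications of the paper's setup:

```python
KB = {"DiedIn": {("Gofraid", "Dal Riata")},
      "BornIn": {("Hussein", "Amman")}}

def lf_kb_died(e1, e2, s):
    return "died_in" if (e1, e2) in KB["DiedIn"] else None

def lf_pat_born(e1, e2, s):
    return "born_in" if "born in" in s else None

def lf_pat_killed(e1, e2, s):
    return "died_in" if "killed in" in s else None

def lf_kb_born(e1, e2, s):
    return "born_in" if (e1, e2) in KB["BornIn"] else None

LFS = [lf_kb_died, lf_pat_born, lf_pat_killed, lf_kb_born]

def annotate(e1, e2, s):
    """Collect the (possibly conflicting) labels each LF proposes."""
    return [lab for lf in LFS if (lab := lf(e1, e2, s)) is not None]

print(annotate("Gofraid", "Dal Riata",
               "Gofraid died in 989, said to be killed in Dal Riata."))
# two agreeing died_in votes: one from the KB, one from the pattern
```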
Challenges
- Relation Extraction
- Resolve Conflicts among Heterogeneous Supervision
Conflicts among Heterogeneous Supervision
- A straightforward way: majority voting
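A majority-vote resolver over labeling-function outputs can be sketched as follows; returning None on ties makes the unresolved-conflict case explicit:

```python
from collections import Counter

def majority_vote(labels):
    """Return the majority label, or None on an empty list or a tie."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None   # tie: majority voting cannot resolve this conflict
    return counts[0][0]

print(majority_vote(["died_in", "died_in", "born_in"]))  # died_in
print(majority_vote(["died_in", "born_in"]))             # None (tie)
```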
Conflicts among Heterogeneous Supervision
- How to resolve conflicts among heterogeneous supervision?
- Majority voting works for c3 and c2, but does not work for c1
Conflicts among Heterogeneous Supervision
- Truth Discovery: some sources (labeling functions) would be more reliable than others
- Source Consistency Assumption: a source is likely to provide true information with the same probability for all instances
Conflicts among Heterogeneous Supervision
- For Distant Supervision, all annotations come from the Knowledge Base.
- For Heterogeneous Supervision, annotations come from different sources, and some could be more reliable than others.
Conflicts among Heterogeneous Supervision
- We introduce context awareness to truth discovery, and modify the assumption:
  - A labeling function (LF) is likely to provide true information with the same probability for instances with similar context.
  - If we can "contextualize" an LF, then we can measure the "expertise" of an LF over a given sentence context.
Relation Mention Representation
- Text feature extraction, e.g. HEAD_EM1_Hussein, TKN_EM1_Hussein, HEAD_EM2_Amman, ...
- Text feature representation: each feature f_i has an embedding v_i ∈ R^{n_v}
- Relation mention representation: map the averaged feature embeddings to a mention embedding z_c ∈ R^{n_z}:

  z_c = tanh( W · (1/|f_c|) Σ_{f_i ∈ f_c} v_i )
True label discovery
- Probability Model:
- Describing the generation of Heterogeneous Supervision?
- Different from crowdsourcing. E.g., ONE worker may
annotate:
99
True label discovery
- Describing the correctness of Heterogeneous Supervision

[Plate diagram: for each relation mention c (|C| total) and labeling function λ_i (|Λ| total): z_c is the representation of the relation mention, l_i the representation of the labeling function, s_{c,i} whether c belongs to the proficient subset of λ_i, o_{c,i} the observed annotation, and ρ_{c,i} the correctness of annotation o_{c,i}.]

  ρ_{c,i} = 1(o_{c,i} == o_c*),  where o_c* is the underlying true label
True label discovery
- Describing the correctness of Heterogeneous Supervision:

  p(ρ_{c,i} = 1) = p(ρ_{c,i} = 1 | s_{c,i} = 1) · p(s_{c,i} = 1) + p(ρ_{c,i} = 1 | s_{c,i} = 0) · p(s_{c,i} = 0)

  p(s_{c,i} = 1) = σ(l_i^T · z_c)
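A numerical sketch of the proficiency term p(s_{c,i} = 1) = σ(l_i^T · z_c): the embedding values and the two conditional probabilities below are illustrative placeholders, not learned quantities:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z_c = np.array([0.5, -1.0, 0.25])   # relation-mention representation (toy values)
l_i = np.array([1.0, 0.5, -2.0])    # labeling-function representation (toy values)

# How likely labeling function i is to be proficient on mention c.
p_proficient = sigmoid(l_i @ z_c)

# p(rho = 1) mixes the proficient and non-proficient cases; the 0.9 and
# 0.5 conditionals are illustrative stand-ins for the model's parameters.
p_correct = p_proficient * 0.9 + (1 - p_proficient) * 0.5
```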
Case Study
Experiments
Relation extraction: 5 easy methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods
DIRT (Lin & Pantel 2003)
- DIRT = Discovery of Inference Rules from Text
- Looks at dependency paths between noun pairs
- N:subj:V←find→V:obj:N→solution→N:to:N
- i.e., X finds solution to Y
- Applies ”extended distributional hypothesis”
- If two paths tend to occur in similar contexts, the meanings of the paths tend to be similar.
- So, defines path similarity in terms of cooccurrence counts with various slot fillers
DIRT examples
The top-20 most similar paths to “X solves Y”:
Y is solved by X          Y is resolved in X
X resolves Y              Y is solved through X
X finds a solution to Y   X rectifies Y
X tries to solve Y        X copes with Y
X deals with Y            X overcomes Y
Y is resolved by X        X eases Y
X addresses Y             X tackles Y
X seeks a solution to Y   X alleviates Y
X do something about Y    X corrects Y
X solution to Y           X is a solution to Y
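DIRT-style path similarity can be sketched with toy slot-filler counts. The counts below are made up, and DIRT itself weights fillers by pointwise mutual information, which this cosine-style sketch omits:

```python
from collections import Counter

def slot_sim(fillers1, fillers2):
    """Cosine similarity between two slot-filler count vectors."""
    dot = sum(fillers1[w] * fillers2[w] for w in fillers1)
    n1 = sum(v * v for v in fillers1.values()) ** 0.5
    n2 = sum(v * v for v in fillers2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def path_sim(p1, p2):
    """Geometric mean of the X-slot and Y-slot similarities."""
    return (slot_sim(p1["X"], p2["X"]) * slot_sim(p1["Y"], p2["Y"])) ** 0.5

# Toy filler counts for three paths.
solves    = {"X": Counter(government=3, team=2), "Y": Counter(problem=4, crisis=1)}
resolves  = {"X": Counter(government=2, court=1), "Y": Counter(problem=3, dispute=2)}
addresses = {"X": Counter(speaker=3), "Y": Counter(audience=2, letter=1)}

sim_rr = path_sim(solves, resolves)
sim_ra = path_sim(solves, addresses)
print(sim_rr > sim_ra)  # True: "X solves Y" is closer to "X resolves Y"
```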
Ambiguous paths in DIRT
- X addresses Y
- I addressed my letter to him personally.
- She addressed an audience of Shawnee chiefs.
- Will Congress finally address the immigration issue?
- X tackles Y
- Foley tackled the quarterback in the endzone.
- Police are beginning to tackle rising crime.
- X is a solution to Y
- (5, 1) is a solution to the equation 2x – 3y = 7
- Nuclear energy is a solution to the energy crisis.
TextRunner (Banko et al. 2007)
- 1. Self-supervised learner: automatically labels +/- examples & learns a crude relation extractor
- 2. Single-pass extractor: makes one pass over the corpus, extracting candidate relations in each sentence
- 3. Redundancy-based assessor: assigns a probability to each extraction, based on frequency counts
Step 1: Self-supervised learner
- Run a parser over 2000 sentences
- Parsing is relatively expensive, so can’t run on whole web
- For each pair of base noun phrases NPi and NPj
- Extract all tuples t = (NPi, relationi,j , NPj)
- Label each tuple based on features of the parse:
  - Positive iff the dependency path between the NPs is short, doesn't cross a clause boundary, and neither NP is a pronoun
- Now train a Naïve Bayes classifier on the labeled tuples
  - Using lightweight features like nearby POS tags, stop words, etc.
Step 2: Single-pass extractor
- Over a huge (web-sized) corpus:
- Run a dumb POS tagger
- Run a dumb Base Noun Phrase chunker
- Extract all text strings between base NPs
- Run heuristic rules to simplify text strings
Scientists from many universities are intently studying stars → scientists, are studying, stars
- Pass candidate tuples to Naïve Bayes classifier
- Save only those predicted to be “trustworthy”
Step 3: Redundancy-based assessor
- Collect counts for each simplified tuple
scientists, are studying, stars → 17
- Compute likelihood of each tuple
- given the counts for each relation
- and the number of sentences
- and a combinatoric balls-and-urns model [Downey et al. 05]
TextRunner examples
slide from Oren Etzioni
TextRunner results
- From a corpus of 9M web pages, containing 133M sentences
- Extracted 60.5 million tuples
  - e.g. (FCI, specializes in, software development)
- Evaluation:
  - Not well formed: (demands, of securing, border); (29, dropped, instruments)
  - Abstract: (Einstein, derived, theory); (executive, hired by, company)
  - True, concrete: (Tesla, invented, coil transformer)
Yao et al. 2012: motivation
- Goal: induce clusters of dependency paths which express the same semantic relation, like DIRT
- But, improve upon DIRT by properly handling semantic ambiguity of individual paths
Yao et al. 2012: approach
- 1. Extract tuples (entity, path, entity) from corpus
- 2. Construct feature representations of every tuple
- 3. Group the tuples for each path into sense clusters
- 4. Cluster the sense clusters into semantic relations
Extracting tuples
- Start with NYT corpus
- Apply lemmatization, NER tagging, dependency parsing
- For each pair of entities in a sentence:
- Extract dependency path between them, as in Lin
- Form a tuple consisting of the two entities and the path
- Filter rare tuples, tuples with two direct objects, etc.
- Result: 1M tuples, 500K entities, 1300 patterns
Feature representation
- Entity names, as bags of words, prefixed with "l:" or "r:"
  - ex: ("LA Lakers", "NY Knicks") => {l:LA, l:Lakers, r:NY, r:Knicks}
  - Using bag-of-words encourages overlap, i.e., combats sparsity
- Using bag-of-words encourages overlap, i.e., combats sparsity
- Words between and around the two entities
- Exclude stop words, words with capital letters
- Include two words to the left and right
- Document theme (e.g. sports, politics, finance)
- Assigned by an LDA topic model which treats NYTimes topic descriptors as words in a synthetic document
- Sentence theme
- Assigned by a standard LDA topic model
Clustering tuples into senses
- Goal: group tuples for each path into coherent sense clusters
- Currently exploring multiple different approaches:
- LDA-like topic models
- Matrix factorization approaches
- Result: each tuple is assigned one topic/sense
- Tuples with the same topic/sense constitute a cluster
Sense cluster examples
Sense clusters for path “A play B”, along with sample entity pairs and top features.
Semantic relation results
Just like DIRT, each semantic relation has multiple paths. But, one path can now appear in multiple semantic relations. DIRT can’t do that!
Relation extraction: 5 easy methods
- 1. Hand-built patterns
- 2. Bootstrapping methods
- 3. Supervised methods
- 4. Distant supervision
- 5. Unsupervised methods