

SLIDE 1

Relation Extraction

CSCI 699

Instructor: Xiang Ren, USC Computer Science

SLIDE 2

Relation extraction example

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Question: What relations should we extract?

SLIDE 3

Relation extraction example

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Subject             Relation     Object
American Airlines   subsidiary   AMR
Tim Wagner          employee     American Airlines
United Airlines     subsidiary   UAL

SLIDE 4

Why Relation Extraction?

  • Create new structured knowledge bases, useful for any app
  • Augment current knowledge bases
  • Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
  • Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?

(acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)

  • But which relations should we extract?

SLIDE 5

Relation types

For generic news texts ...

slide adapted from Jim Martin

SLIDE 6

Databases of Wikipedia Relations


Relations extracted from the Wikipedia Infobox:

  • (Stanford, state, California)
  • (Stanford, motto, “Die Luft der Freiheit weht”)
  • …

SLIDE 7

Relation types: Freebase

23 million entities, thousands of relations

SLIDE 8

More relations: disease outbreaks

slide adapted from Eugene Agichtein

SLIDE 9

More relations: protein interactions

slide adapted from Rosario & Hearst

SLIDE 10

Relations between word senses

  • NLP applications need word meaning!
  • Question answering
  • Conversational agents
  • Summarization
  • One key meaning component: word relations
  • Hyponymy: San Francisco is an instance of a city
  • Antonymy: acidic is the opposite of basic
  • Meronymy: an alternator is a part of a car
SLIDE 11

WordNet is incomplete

Ontological relations are missing for many words:

In WordNet 3.1                 Not in WordNet 3.1
insulin, progesterone          leptin, pregnenolone
combustibility, navigability   affordability, reusability
HTML                           XML
Google, Yahoo                  Microsoft, IBM

  • Esp. for specific domains: restaurants, auto parts, finance

SLIDE 12

Evaluation of Relation Extraction

  • Compute precision / recall / F1 score for each relation:

P = (# of correctly extracted relations) / (total # of extracted relations)
R = (# of correctly extracted relations) / (total # of gold relations)
F1 = 2PR / (P + R)
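As a quick illustration, these scores can be computed over sets of (subject, relation, object) triples; a minimal Python sketch, with invented example triples:

    # Minimal sketch: precision/recall/F1 over (subject, relation, object)
    # triples; `extracted` and `gold` below are invented examples.
    def prf1(extracted: set, gold: set):
        correct = len(extracted & gold)          # correctly extracted relations
        p = correct / len(extracted) if extracted else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    extracted = {("American Airlines", "subsidiary", "AMR"),
                 ("Tim Wagner", "employee", "United Airlines")}   # second is wrong
    gold = {("American Airlines", "subsidiary", "AMR"),
            ("Tim Wagner", "employee", "American Airlines"),
            ("United Airlines", "subsidiary", "UAL")}
    print(prf1(extracted, gold))   # (0.5, 0.333..., 0.4)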

SLIDE 13

Relation extraction: 5 types of methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 14

Relation extraction: 5 types of methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 15

A hand-built extraction rule

NYU Proteus system (1997)

SLIDE 16

Patterns for learning hyponyms

  • Intuition from Hearst (1992)

Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.

  • What does Gelidium mean?
  • How do you know?
SLIDE 17

Patterns for learning hyponyms

  • Intuition from Hearst (1992)

Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.

  • What does Gelidium mean?
  • How do you know?
SLIDE 18

Hearst’s lexico-syntactic patterns

Y such as X ((, X)* (, and/or) X)
such Y as X…
X… or other Y
X… and other Y
Y including X…
Y, especially X…

Hearst, 1992. Automatic Acquisition of Hyponyms.
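To make the pattern concrete, here is a minimal Python sketch of the “Y such as X” pattern over raw text. Real systems match noun phrases via POS tags or parses; the regex below assumes hyponyms are single capitalized tokens, which is an oversimplification.

    import re

    # "Y such as X (, X)* (, and|or X)" -- crude surface version.
    # Assumes hyponyms are capitalized tokens; real systems use NP chunks.
    PATTERN = re.compile(
        r"(\w+),? such as ([A-Z]\w+(?:, [A-Z]\w+)*(?:,? (?:and|or) [A-Z]\w+)?)")

    text = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use.")
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        for hyponym in re.split(r", | and | or ", m.group(2)):
            print(f"hyponym({hyponym}, {hypernym})")   # hyponym(Gelidium, algae)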

SLIDE 19

Examples of the Hearst patterns

Hearst pattern     Example occurrences
X and other Y      ...temples, treasuries, and other important civic buildings.
X or other Y       bruises, wounds, broken bones or other injuries...
Y such as X        The bow lute, such as the Bambara ndang...
such Y as X        ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X      ...common-law countries, including Canada and England...
Y, especially X    European countries, especially France, England, and Spain...

SLIDE 20

Problems with hand-built patterns

  • Requires hand-building patterns for each relation!
  • hard to write; hard to maintain
  • there are zillions of them
  • domain-dependent
  • Don’t want to do this for all possible relations!
  • Plus, we’d like better accuracy
  • Hearst: 66% accuracy on hyponym extraction
SLIDE 21

Relation extraction: 5 easy methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 22

Bootstrapping approaches

  • If you don’t have enough annotated text to train on …
  • But you do have:
  • some seed instances of the relation
  • (or some patterns that work pretty well)
  • and lots & lots of unannotated text (e.g., the web)
  • … can you use those seeds to do something useful?
  • Bootstrapping can be considered semi-supervised
SLIDE 23

Bootstrapping example

  • Target relation: burial place
  • Seed tuple: [Mark Twain, Elmira]
  • Grep/Google for “Mark Twain” and “Elmira”

“Mark Twain is buried in Elmira, NY.”          → X is buried in Y
“The grave of Mark Twain is in Elmira”         → The grave of X is in Y
“Elmira is Mark Twain’s final resting place”   → Y is X’s final resting place

  • Use those patterns to search for new tuples
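The loop itself is simple; below is a self-contained Python sketch on a toy corpus. The corpus, the seed pair, and the replace-then-regex patterning are illustrative stand-ins for a real search-and-generalize pipeline.

    import re

    corpus = [
        "Mark Twain is buried in Elmira.",
        "The grave of Mark Twain is in Elmira.",
        "Will Rogers is buried in Claremore.",
    ]
    tuples = {("Mark Twain", "Elmira")}      # seed tuple
    patterns = set()

    for _ in range(2):                       # a couple of bootstrapping iterations
        # 1. Turn sentences that mention a known (X, Y) pair into patterns.
        for x, y in list(tuples):
            for sent in corpus:
                if x in sent and y in sent:
                    patterns.add(sent.replace(x, "(.+?)").replace(y, "(.+?)"))
        # 2. Match the patterns elsewhere in the corpus to harvest new pairs.
        for pat in patterns:
            for sent in corpus:
                m = re.fullmatch(pat, sent)
                if m and len(m.groups()) == 2:
                    tuples.add((m.group(1), m.group(2)))

    print(tuples)  # also picks up ("Will Rogers", "Claremore") via "X is buried in Y"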
SLIDE 24

Bootstrapping relations

slide adapted from Jim Martin

SLIDE 25

DIPRE (Brin 1998)

Extract (author, book) pairs.

  • Start with 5 seed pairs
  • Learn patterns from occurrences of the seeds
  • Iterate: use the patterns to get more instances & patterns…
  • Results: after three iterations of the bootstrapping loop, extracted 15,000 author-book pairs with 95% accuracy.

SLIDE 26

Snowball (Agichtein & Gravano 2000)

New ideas:

  • require that X and Y be named entities
  • add heuristics to score extractions, select best ones
SLIDE 27

Snowball Results!

Patterns discovered by Snowball (for each pattern the left context vector is empty, tag1 = ORGANIZATION, tag2 = LOCATION):

Conf   Middle
1.00   <based, 0.53> <in, 0.53> <, , 0.01>
0.69   <', 0.42> <s, 0.42> <headquarters, 0.42> <in, 0.12>
0.61   <(, 0.93> <), 0.12>

                        Correct   Incorrect, by error type (Location / Organization / Relationship)   P_Ideal
DIPRE                   74        26  (3 / 18 / 5)                                                    90%
Snowball (all tuples)   52        48  (6 / 41 / 1)                                                    88%
Snowball (τ_t = 0.8)    93        7   (3 / 4 / 0)                                                     96%
Baseline                25        75  (8 / 62 / 5)                                                    66%

SLIDE 28

Bootstrapping problems

  • Requires that we have seeds for each relation
  • Sensitive to original set of seeds
  • Big problem of semantic drift at each iteration
  • Precision tends to be not that high
  • Generally have lots of parameters to be tuned
  • No probabilistic interpretation
  • Hard to know how confident to be in each result
SLIDE 29

Relation extraction: 5 easy methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 30

Supervised relation extraction

The supervised approach requires:

  • Defining an inventory of output labels
  • Relation detection: true/false
  • Relation classification: located-in, employee-of, inventor-of, …
  • Collecting labeled training data: MUC, ACE, …
  • Defining a feature representation: words, entity types, …
  • Choosing a classifier: Naïve Bayes, MaxEnt, SVM, …
  • Evaluating the results
SLIDE 31

ACE 2008: relations

SLIDE 32

ACE 2008: data


SLIDE 33

Features

  • Lightweight features — require little pre-processing
  • Bags of words & bigrams between, before, and after the entities
  • Stemmed versions of the same
  • The types of the entities
  • The distance (number of words) between the entities
  • Medium-weight features — require base phrase chunking

  • Base-phrase chunk paths
  • Heavyweight features — require full syntactic parsing
  • Dependency-tree paths
  • Constituent-tree paths
  • Tree distance between the entities

Let’s take a closer look at the features used in Zhou et al. (2005)

SLIDE 34

Features: words

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Bag-of-words features:    WM1 = {American, Airlines}, WM2 = {Tim, Wagner}
Head-word features:       HM1 = Airlines, HM2 = Wagner, HM12 = Airlines+Wagner
Words in between:         WBNULL = false, WBF = a, WBL = spokesman, WBO = {unit, of, AMR, immediately, matched, the, move}
Words before and after:   BM1F = NULL, BM1L = NULL, AM2F = said, AM2L = NULL

Word features yield good precision (69%), but poor recall (24%)
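A minimal Python sketch of these word features, assuming a pre-tokenized sentence and known mention spans; treating the last token of a mention as its head word is a simplification.

    # Token spans are illustrative; a real system gets them from mention detection.
    tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
              "immediately", "matched", "the", "move", ",", "spokesman",
              "Tim", "Wagner", "said", "."]
    m1, m2 = (0, 2), (14, 16)                 # [start, end) spans of the mentions

    between = [t for t in tokens[m1[1]:m2[0]] if t.isalnum()]   # skip punctuation
    feats = {
        "WM1": set(tokens[m1[0]:m1[1]]),      # {American, Airlines}
        "WM2": set(tokens[m2[0]:m2[1]]),      # {Tim, Wagner}
        "HM1": tokens[m1[1] - 1],             # Airlines (head = last token here)
        "HM2": tokens[m2[1] - 1],             # Wagner
        "WBF": between[0],                    # a
        "WBL": between[-1],                   # spokesman
        "WBO": set(between[1:-1]),            # {unit, of, AMR, ...}
        "AM2F": tokens[m2[1]] if m2[1] < len(tokens) else None,   # said
    }
    print(feats["HM1"], feats["HM2"], feats["WBF"], feats["WBL"])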

SLIDE 35

Features: NE type & mention level

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.


Named entity types (ORG, LOC, PER, etc.):     ET1 = ORG, ET2 = PER, ET12 = ORG-PER
Mention levels (NAME, NOMINAL, or PRONOUN):   ML1 = NAME, ML2 = NAME, ML12 = NAME+NAME

  • Named entity type features help recall a lot (+8%)
  • Mention level features have little impact
SLIDE 36

Features: overlap

  • American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
  • Number of mentions and words in between
  • #MB = 1, #WB = 9
  • Is one mention included in the other?
  • M1>M2 = false, M1<M2 = false
  • Conjunctive features
  • ET12+M1>M2 = ORG-PER+false, ET12+M1<M2 = ORG-PER+false
  • HM12+M1>M2 = Airlines+Wagner+false, HM12+M1<M2 = Airlines+Wagner+false
  • These features hurt precision a lot (-10%), but also help recall a lot (+8%)
SLIDE 37

Features: syntactic features


Features of mention dependencies:   ET1DW1 = ORG:Airlines, H1DW1 = matched:Airlines, ET2DW2 = PER:Wagner, H2DW2 = said:Wagner
Features describing entity types and the dependency tree:   ET12SameNP = ORG-PER-false, ET12SamePP = ORG-PER-false, ET12SameVP = ORG-PER-false

These features had disappointingly little impact!

SLIDE 38

Relation extraction classifiers

Now use any (multiclass) classifier you like:

  • SVM
  • MaxEnt (aka multiclass logistic regression)
  • Naïve Bayes
  • etc.

[Zhou et al. 2005 used a one-vs-many SVM]
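For instance, a scikit-learn sketch that treats each relation mention as a feature dict; the three training examples are toy stand-ins, not ACE data.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy feature dicts in the style of the Zhou et al. features above.
    X = [{"HM12": "Airlines+Wagner", "ET12": "ORG-PER", "WBL": "spokesman"},
         {"HM12": "Airlines+AMR",    "ET12": "ORG-ORG", "WBF": "a"},
         {"HM12": "Gates+Microsoft", "ET12": "PER-ORG", "WBF": "founded"}]
    y = ["employee-of", "subsidiary-of", "founder-of"]

    clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    print(clf.predict([{"ET12": "ORG-PER", "WBL": "spokesman"}]))
    # expected: ['employee-of'] on this toy data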


SLIDE 39

Zhou et al. 2005 results

SLIDE 40

Position-aware LSTM for Relation Extraction

(Zhang et al., 2017)


SLIDE 41

Relation Extraction

Penner is survived by his brother, John, a copy editor at the Times, and his former wife, Times sportswriter Lisa Dillman.

SLIDE 42

Relation Extraction

Penner is survived by his brother, John, a copy editor at the Times, and his former wife, Times sportswriter Lisa Dillman.

Key elements:

  • Context (relevant + irrelevant)
  • Entities (types + positions)
SLIDE 43

Position-aware Attention Model

[Diagram: position-aware attention model. Tokens x1 … xn (e.g. “… and … married …”, with Mike as the subject and Lisa as the object) are embedded and fed to an LSTM producing states h1 … hn; each position i also carries position embeddings p^s_i and p^o_i relative to the subject and object; attention weights a1 … an and a summary vector q combine the states.]

SLIDE 44

Position-aware Attention Model

[Same diagram as above.]

Embedding layers
Word: x = [x_1, ..., x_n]
Position (relative to subject and object): p^s = [p^s_1, ..., p^s_n], p^o = [p^o_1, ..., p^o_n]

SLIDE 45

Position-aware Attention Model

[Same diagram as above.]

Embedding layers
Word: x = [x_1, ..., x_n]
Position: p^s = [p^s_1, ..., p^s_n], p^o = [p^o_1, ..., p^o_n]

LSTM layers
{h_1, ..., h_n} = LSTM({x_1, ..., x_n})

SLIDE 46

Position-aware Attention Model

[Same diagram as above.]

Summary vector: q = h_n

SLIDE 47

Position-aware Attention Model

[Same diagram as above.]

Summary vector: q = h_n

Attention layer: computes weights a_1, ..., a_n over the states h_i, conditioned on q and the position embeddings p^s_i, p^o_i.

SLIDE 48

Position-aware Attention Model

[Same diagram as above.]

Relation representation: z = Σ_{i=1}^{n} a_i h_i (attention-weighted sum of the LSTM states)

SLIDE 49

Position-aware Attention Model

[Same diagram as above.]

Relation representation: z = Σ_{i=1}^{n} a_i h_i

Softmax layer: z is fed to a softmax over the relation labels.
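Putting the pieces together, a minimal numpy sketch of the forward computation; the linear scoring function for the attention weights is a simplified stand-in for the learned layer in the paper, and all sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, dp = 6, 8, 4                      # tokens, hidden size, position dim
    h = rng.normal(size=(n, d))             # LSTM states h_1..h_n
    q = h[-1]                               # summary vector q = h_n
    ps = rng.normal(size=(n, dp))           # position embeddings rel. to subject
    po = rng.normal(size=(n, dp))           # position embeddings rel. to object

    w = rng.normal(size=(2 * d + 2 * dp,))  # untrained scoring weights (stand-in)
    scores = np.concatenate([h, np.tile(q, (n, 1)), ps, po], axis=1) @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # attention weights a_1..a_n
    z = a @ h                               # relation repr. z = sum_i a_i h_i
    print(a.round(3), z.shape)              # z then feeds a softmax over labels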

SLIDE 50

Other Augmentations

  • Word dropout: balance OOV distribution
SLIDE 51

Other Augmentations

  • Word dropout: balance the OOV distribution

Penner is survived by his brother, <UNK> … (here the token “John” has been randomly replaced by <UNK>)

SLIDE 52

Other Augmentations

  • Entity masking: focus on relations, not specific entities

Penner is survived by his brother, John …

SLIDE 53

Other Augmentations

  • Entity masking: focus on relations, not specific entities

SUBJ-PER is survived by his brother, John …

SLIDE 54

Other Augmentations

  • Entity masking: focus on relations, not specific entities

SUBJ-PER is survived by his brother, OBJ-PER …
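A minimal Python sketch of these two augmentations as preprocessing; the token spans, the dropout rate, and the choice to protect the mask tokens from dropout are illustrative assumptions.

    import random

    tokens = ["Penner", "is", "survived", "by", "his", "brother", ",", "John"]
    subj_span, obj_span = (0, 1), (7, 8)     # Penner (subject), John (object)

    def mask_entities(toks, subj, obj, subj_type="PER", obj_type="PER"):
        # Both spans are single tokens here, so indices do not shift.
        out = list(toks)
        out[subj[0]:subj[1]] = [f"SUBJ-{subj_type}"]
        out[obj[0]:obj[1]] = [f"OBJ-{obj_type}"]
        return out

    def word_dropout(toks, p=0.1, protected=("SUBJ-", "OBJ-")):
        return [t if t.startswith(protected) or random.random() > p else "<UNK>"
                for t in toks]

    masked = mask_entities(tokens, subj_span, obj_span)
    print(word_dropout(masked, p=0.3))
    # e.g. ['SUBJ-PER', 'is', '<UNK>', 'by', 'his', 'brother', ',', 'OBJ-PER']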

SLIDE 55

Other Augmentations

  • Linguistic information: POS and NER embeddings from Stanford CoreNLP

Penner is survived by …

SLIDE 56

Other Augmentations

  • Linguistic information: POS and NER embeddings from Stanford CoreNLP

Penner   is    survived   by   …
NNP      VBZ   VBN        IN
PER      O     O          O

SLIDE 57

Other Augmentations

  • Linguistic information: POS and NER embeddings from Stanford CoreNLP

SLIDE 58

Models Compared Against

Non-neural:
  • Stanford’s TAC KBP 2015 winning system
  • Patterns
  • Logistic regression (LR)

SLIDE 59

Models Compared Against

Non-neural:
  • Stanford’s TAC KBP 2015 winning system
  • Patterns
  • Logistic regression (LR)

Neural:
  • CNN with positional encodings (Nguyen and Grishman, 2015)
  • Dependency-based RNN (Xu et al., 2015)
  • LSTM: 2-layer stacked LSTM

SLIDE 60

Relation Extraction Results

Model             P      R      F1
Traditional
  Patterns        86.9   23.2   36.6
  LR              73.5   49.9   59.4
  LR + Patterns   72.9   51.8   60.5

SLIDE 61

Relation Extraction Results

Model             P      R      F1
Traditional
  Patterns        86.9   23.2   36.6
  LR              73.5   49.9   59.4
  LR + Patterns   72.9   51.8   60.5

  • Patterns: high precision
  • LR: relatively higher recall
SLIDE 62

Relation Extraction Results

Model             P      R      F1
Traditional
  LR + Patterns   72.9   51.8   60.5
Neural
  CNN             75.6   47.5   58.3
  CNN-PE          70.3   54.2   61.2
  SDP-LSTM        66.3   52.7   58.7
  LSTM            65.7   59.9   62.7

SLIDE 63

Relation Extraction Results

Model             P      R      F1
Traditional
  LR + Patterns   72.9   51.8   60.5
Neural
  CNN             75.6   47.5   58.3
  CNN-PE          70.3   54.2   61.2
  SDP-LSTM        66.3   52.7   58.7
  LSTM            65.7   59.9   62.7

  • CNN: higher precision; LSTM: higher recall
  • CNN-PE and LSTM outperform the traditional systems
SLIDE 64

Relation Extraction Results

Model             P      R      F1
Traditional
  LR + Patterns   72.9   51.8   60.5
Neural
  LSTM            65.7   59.9   62.7
  Our model       65.7   64.5   65.1
  Ensemble (5)    70.1   64.6   67.2

SLIDE 65

Relation Extraction Results

Model             P      R      F1
Traditional
  LR + Patterns   72.9   51.8   60.5
Neural
  LSTM            65.7   59.9   62.7
  Our model       65.7   64.5   65.1
  Ensemble (5)    70.1   64.6   67.2

  • Our model: +2.4 improvement on F1
SLIDE 66

Supervised RE: summary

  • Supervised approach can achieve high accuracy
  • At least, for some relations
  • If we have lots of hand-labeled training data
  • But has significant limitations!
  • Labeling 5,000 relations (+ named entities) is expensive
  • Doesn’t generalize to different relations
SLIDE 67

Relation extraction: 5 easy methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 68

Distant supervision

  • Hypothesis: if two entities belong to a certain relation, any sentence containing those two entities is likely to express that relation
  • Key idea: use a database of relations to get lots of noisy training examples
  • instead of hand-creating seed tuples (bootstrapping)
  • instead of using a hand-labeled corpus (supervised)

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.

SLIDE 69

Benefits of distant supervision

  • Has advantages of the supervised approach
  • leverage rich, reliable hand-created knowledge
  • relations have canonical names
  • can use rich features (e.g. syntactic features)
  • Has advantages of the unsupervised approach
  • leverage unlimited amounts of text data
  • allows for a very large number of weak features
  • not sensitive to training corpus: genre-independent

SLIDE 70

Hypernyms via distant supervision

Construct a noisy training set consisting of occurrences from our corpus that contain a hyponym-hypernym pair from WordNet.

This yields high-signal examples like:

“...consider authors like Shakespeare...”
“Some authors (including Shakespeare)...”
“Shakespeare was the author of several...”
“Shakespeare, author of The Tempest...”

slide adapted from Rion Snow

SLIDE 71

Learning hypernym patterns

Key idea: work at the corpus level (entity pairs), instead of the sentence level!

  • 1. Take corpus sentences: “... doubly heavy hydrogen atom called deuterium ...”
  • 2. Collect noun pairs: e.g. (atom, deuterium); 752,311 pairs from 6M sentences of newswire
  • 3. Is the pair an IS-A in WordNet? 14,387 yes; 737,924 no
  • 4. Parse the sentences
  • 5. Extract patterns: 69,592 dependency paths with >5 pairs
  • 6. Train a classifier on the patterns: logistic regression with 70K features (converted to 974,288 bucketed binary features)

slide adapted from Rion Snow

SLIDE 72

One of 70,000 patterns

Pattern: <superordinate> called <subordinate>

Learned from cases such as:
  (sarcoma, cancer): “…an uncommon bone cancer called osteogenic sarcoma and to…”
  (deuterium, atom): “…heavy water rich in the doubly heavy hydrogen atom called deuterium.”

New pairs discovered:
  (efflorescence, condition): “…and a condition called efflorescence are other reasons for…”
  (O’neal_inc, company): “…The company, now called O’Neal Inc., was sole distributor of…”
  (hat_creek_outfit, ranch): “…run a small ranch called the Hat Creek Outfit.”
  (hiv-1, aids_virus): “…infected by the AIDS virus, called HIV-1.”
  (bateau_mouche, attraction): “…local sightseeing attraction called the Bateau Mouche…”

SLIDE 73

What about other relations?

Mintz, Bills, Snow, Jurafsky (2009). Distant supervision for relation extraction without labeled data.

Training set: 102 relations, 940,000 entities, 1.8 million instances
Corpus: 1.8 million articles, 25.7 million sentences

slide adapted from Rion Snow

SLIDE 74

Frequent Freebase relations

SLIDE 75

Collecting training data

Corpus text:
  Bill Gates founded Microsoft in 1975.
  Bill Gates, founder of Microsoft, …
  Bill Gates attended Harvard from…
  Google was founded by Larry Page …

Freebase:
  Founder: (Bill Gates, Microsoft)
  Founder: (Larry Page, Google)
  CollegeAttended: (Bill Gates, Harvard)

Training data: (empty so far)

SLIDE 76

Collecting training data

Corpus text and Freebase as above.

Training data:
  (Bill Gates, Microsoft)   Label: Founder   Feature: X founded Y

SLIDE 77

Collecting training data

Corpus text and Freebase as above.

Training data:
  (Bill Gates, Microsoft)   Label: Founder   Features: X founded Y; X, founder of Y

SLIDE 78

Collecting training data

Corpus text and Freebase as above.

Training data:
  (Bill Gates, Microsoft)   Label: Founder           Features: X founded Y; X, founder of Y
  (Bill Gates, Harvard)     Label: CollegeAttended   Feature: X attended Y

SLIDE 79

Collecting training data

Corpus text and Freebase as above.

Training data:
  (Bill Gates, Microsoft)   Label: Founder           Features: X founded Y; X, founder of Y
  (Bill Gates, Harvard)     Label: CollegeAttended   Feature: X attended Y
  (Larry Page, Google)      Label: Founder           Feature: Y was founded by X

SLIDE 80

Negative training data

Can’t train a classifier with only positive data! We need negative training data too. Solution: sample 1% of unrelated pairs of entities.

Corpus text:
  Larry Page took a swipe at Microsoft...
  ...after Harvard invited Larry Page to...
  Google is Bill Gates' worst fear ...

Training data:
  (Larry Page, Microsoft)   Label: NO_RELATION   Feature: X took a swipe at Y
  (Larry Page, Harvard)     Label: NO_RELATION   Feature: Y invited X
  (Bill Gates, Google)      Label: NO_RELATION   Feature: Y is X's worst fear
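A minimal, self-contained Python sketch of collecting distantly supervised training data; the KB, the sentences, and the substring-based entity spotting are toy stand-ins.

    import itertools, random

    kb = {("Bill Gates", "Microsoft"): "Founder",
          ("Bill Gates", "Harvard"): "CollegeAttended"}
    sentences = ["Bill Gates founded Microsoft in 1975.",
                 "Larry Page took a swipe at Microsoft."]
    entities = ["Bill Gates", "Microsoft", "Larry Page", "Harvard"]

    training = []
    for sent in sentences:
        present = [e for e in entities if e in sent]      # crude entity spotting
        for e1, e2 in itertools.permutations(present, 2):
            label = kb.get((e1, e2))
            if label:                                     # distant positive
                training.append(((e1, e2), label, sent))
            elif random.random() < 0.01:                  # sample ~1% as negatives
                training.append(((e1, e2), "NO_RELATION", sent))

    print(training[0])   # (('Bill Gates', 'Microsoft'), 'Founder', '...')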

SLIDE 81

Preparing test data

Corpus text:
  Henry Ford founded Ford Motor Co. in…
  Ford Motor Co. was founded by Henry Ford…
  Steve Jobs attended Reed College from…

Test data: (empty so far)

SLIDE 82

Preparing test data

Corpus text as above.

Test data:
  (Henry Ford, Ford Motor Co.)   Label: ???   Feature: X founded Y

SLIDE 83

Preparing test data

Corpus text as above.

Test data:
  (Henry Ford, Ford Motor Co.)   Label: ???   Features: X founded Y; Y was founded by X

SLIDE 84

Preparing test data

Corpus text as above.

Test data:
  (Henry Ford, Ford Motor Co.)   Label: ???   Features: X founded Y; Y was founded by X
  (Steve Jobs, Reed College)     Label: ???   Feature: X attended Y

SLIDE 85

The experiment

Positive training data:
  (Bill Gates, Microsoft)   Label: Founder           Features: X founded Y; X, founder of Y
  (Bill Gates, Harvard)     Label: CollegeAttended   Feature: X attended Y
  (Larry Page, Google)      Label: Founder           Feature: Y was founded by X

Negative training data:
  (Larry Page, Microsoft)   Label: NO_RELATION   Feature: X took a swipe at Y
  (Bill Gates, Google)      Label: NO_RELATION   Feature: Y is X's worst fear
  (Larry Page, Harvard)     Label: NO_RELATION   Feature: Y invited X

Learning: multiclass logistic regression → trained relation classifier

Test data:
  (Henry Ford, Ford Motor Co.)   Features: X founded Y; Y was founded by X
  (Steve Jobs, Reed College)     Feature: X attended Y

Predictions!
  (Henry Ford, Ford Motor Co.)   Label: Founder
  (Steve Jobs, Reed College)     Label: CollegeAttended

SLIDE 86

Advantages of the approach

  • ACE paradigm: labeling sentences
  • This paradigm: labeling entity pairs
  • Make use of multiple appearances of entities
  • If a pair of entities appears in 10 sentences, and each sentence has 5 features extracted from it, the entity pair will have 50 associated features

SLIDE 87

Experimental set-up

  • 1.8 million relation instances used for training
  • Compared to 17,000 relation instances in ACE
  • 800,000 Wikipedia articles used for training, 400,000 different articles used for testing
  • Only extract relation instances not already in Freebase

SLIDE 88

Newly discovered instances

Ten relation instances extracted by the system that weren’t in Freebase

SLIDE 89

Human evaluation

Precision@K, using Mechanical Turk labelers:

  • At a recall of 100 instances, using both feature sets (lexical and syntactic) offers the best performance for a majority of the relations
  • At a recall of 1000 instances, using syntactic features improves performance for a majority of the relations

SLIDE 90

Distant supervision: conclusions

  • Distant supervision extracts high-precision patterns for a variety of relations
  • Can make use of 1000x more data than simple supervised algorithms
  • Syntax features almost always help
  • The combination of syntax and lexical features is sometimes even better
  • Syntax features are probably most useful when entities are far apart, often when there are modifiers in between

SLIDE 91

Heterogeneous Supervision

  • Provides a general framework to encode knowledge for supervision:
  • Knowledge-base facts, heuristic patterns, ……
  • Labeling functions Λ:

λ1: return born_in for <e1, e2, s> if BornIn(e1, e2) in KB           (knowledge base)
λ2: return died_in for <e1, e2, s> if DiedIn(e1, e2) in KB           (knowledge base)
λ3: return born_in for <e1, e2, s> if match(‘* born in *’, s)        (domain-specific pattern)
λ4: return died_in for <e1, e2, s> if match(‘* killed in *’, s)      (domain-specific pattern)

(Liu et al, EMNLP 2017)
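Written as plain Python callables, the four labeling functions might look like the sketch below; the KB contents are illustrative.

    import re

    KB = {"BornIn": {("Hussein", "Amman")},
          "DiedIn": {("Gofraid", "Dal Riata")}}

    def lf1(e1, e2, s): return "born_in" if (e1, e2) in KB["BornIn"] else None
    def lf2(e1, e2, s): return "died_in" if (e1, e2) in KB["DiedIn"] else None
    def lf3(e1, e2, s): return "born_in" if re.search(r"\bborn in\b", s) else None
    def lf4(e1, e2, s): return "died_in" if re.search(r"\bkilled in\b", s) else None

    s = "Hussein was born in Amman on 14 November 1935."
    print([lf("Hussein", "Amman", s) for lf in (lf1, lf2, lf3, lf4)])
    # ['born_in', None, 'born_in', None] -- the KB fact and the surface pattern
    # agree here; on other sentences, different LFs can conflict.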

SLIDE 92

Corpus D:
  c1: Robert Newton "Bob" Ford was an American outlaw best known for killing his gang leader Jesse James (e1) in Missouri (e2).
  c2: Gofraid (e1) died in 989, said to be killed in Dal Riata (e2).
  c3: Hussein (e1) was born in Amman (e2) on 14 November 1935.

[Diagram: the labeling functions λ1–λ4 above applied to c1–c3, producing overlapping and potentially conflicting annotations.]

Challenges

  • Relation extraction
  • Resolving conflicts among heterogeneous supervision

SLIDE 93

Conflicts among Heterogeneous Supervision

  • A straightforward way: majority voting

[Same corpus, labeling functions, and annotations as above.]

SLIDE 94

Conflicts among Heterogeneous Supervision

  • How do we resolve conflicts among heterogeneous supervision?
  • Majority voting works for c3 and c2, but not for c1

[Same corpus, labeling functions, and annotations as above.]

SLIDE 95

Conflicts among Heterogeneous Supervision

  • Truth discovery: some sources (labeling functions) may be more reliable than others
  • Source consistency assumption: a source is likely to provide true information with the same probability for all instances.

SLIDE 96

Conflicts among Heterogeneous Supervision

  • For distant supervision, all annotations come from the knowledge base.
  • For heterogeneous supervision, annotations come from different sources, and some may be more reliable than others.

[Labeling functions λ1–λ4 as above: knowledge-base facts and domain-specific patterns.]

SLIDE 97

Conflicts among Heterogeneous Supervision

  • We introduce context awareness to truth discovery, and modify the assumption:
  • A labeling function (LF) is likely to provide true information with the same probability for instances with similar context.
  • If we can “contextualize” an LF, then we can measure the “expertise” of the LF over a given sentence context

SLIDE 98

Relation Mention Representation

  • Text feature extraction
  • Text feature representation
  • Relation mention representation

Example (c3): Hussein (e1) was born in Amman (e2) on 14 November 1935.

Text feature extraction: HEAD_EM1_Hussein, TKN_EM1_Hussein, born, HEAD_EM2_Amman, ……
Text feature representation: each feature f_i has an embedding v_i ∈ R^{n_v}
Mapping from text embeddings to the relation mention embedding z_c ∈ R^{n_z}:

z_c = tanh( W · (1 / |f_c|) Σ_{f_i ∈ f_c} v_i )
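A minimal numpy sketch of this mapping; the embedding sizes and the feature list are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_v, n_z = 16, 8
    vocab = ["HEAD_EM1_Hussein", "TKN_EM1_Hussein", "born", "HEAD_EM2_Amman"]
    V = {f: rng.normal(size=n_v) for f in vocab}    # feature embeddings v_i
    W = rng.normal(size=(n_z, n_v))                 # (untrained) projection

    f_c = ["HEAD_EM1_Hussein", "born", "HEAD_EM2_Amman"]   # features of mention c
    z_c = np.tanh(W @ np.mean([V[f] for f in f_c], axis=0))
    print(z_c.shape)   # (8,) -- the relation mention representation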

SLIDE 99

True label discovery

  • Probability model: describing the generation of heterogeneous supervision
  • Different from crowdsourcing. E.g., ONE worker may annotate: […]

SLIDE 100

True label discovery

  • Describing the correctness of heterogeneous supervision

Plate-model variables (over |C| mentions, |Λ| labeling functions, and |O| observed annotations):
  z_c: representation of relation mention c
  l_i: representation of labeling function λ_i
  s_{c,i}: observed annotation
  o_{c,i}: whether c belongs to the proficient subset of λ_i
  ρ_{c,i} = 1(s_{c,i} = s_c^*): correctness of annotation s_{c,i}, where s_c^* is the underlying true label

SLIDE 101

True label discovery

  • Describing the correctness of heterogeneous supervision

P(ρ_{c,i} = 1) = P(ρ_{c,i} = 1 | o_{c,i} = 1) · P(o_{c,i} = 1) + P(ρ_{c,i} = 1 | o_{c,i} = 0) · P(o_{c,i} = 0)

P(o_{c,i} = 1) = σ(l_i^T · z_c)
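A small numpy sketch of the reconstructed correctness model; the two conditional probabilities (0.9 inside the proficient subset, 0.4 outside) are illustrative constants, not values from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    z_c = rng.normal(size=8)       # relation mention representation
    l_i = rng.normal(size=8)       # labeling function representation

    p_proficient = sigmoid(l_i @ z_c)                  # P(o_{c,i} = 1)
    p_correct = 0.9 * p_proficient + 0.4 * (1 - p_proficient)
    # P(rho=1) = P(rho=1|o=1) P(o=1) + P(rho=1|o=0) P(o=0)
    print(round(float(p_correct), 3))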

SLIDE 102

Case Study


SLIDE 103

Experiments


SLIDE 104

Relation extraction: 5 easy methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods
SLIDE 105

DIRT (Lin & Pantel 2003)

  • DIRT = Discovery of Inference Rules from Text
  • Looks at dependency paths between noun pairs
  • N:subj:V←find→V:obj:N→solution→N:to:N
  • i.e., X finds solution to Y
  • Applies the “extended distributional hypothesis”:
  • If two paths tend to occur in similar contexts, the meanings of the paths tend to be similar.
  • So, defines path similarity in terms of cooccurrence counts with various slot fillers

SLIDE 106

DIRT examples

The top-20 most similar paths to “X solves Y”:

Y is solved by X           Y is resolved in X
X resolves Y               Y is solved through X
X finds a solution to Y    X rectifies Y
X tries to solve Y         X copes with Y
X deals with Y             X overcomes Y
Y is resolved by X         X eases Y
X addresses Y              X tackles Y
X seeks a solution to Y    X alleviates Y
X do something about Y     X corrects Y
X solution to Y            X is a solution to Y

SLIDE 107

Ambiguous paths in DIRT

  • X addresses Y
  • I addressed my letter to him personally.
  • She addressed an audience of Shawnee chiefs.
  • Will Congress finally address the immigration issue?
  • X tackles Y
  • Foley tackled the quarterback in the endzone.
  • Police are beginning to tackle rising crime.
  • X is a solution to Y
  • (5, 1) is a solution to the equation 2x – 3y = 7
  • Nuclear energy is a solution to the energy crisis.
SLIDE 108

TextRunner (Banko et al. 2007)

  • 1. Self-supervised learner: automatically labels +/– examples & learns a crude relation extractor
  • 2. Single-pass extractor: makes one pass over the corpus, extracting candidate relations in each sentence
  • 3. Redundancy-based assessor: assigns a probability to each extraction, based on frequency counts

SLIDE 109

Step 1: Self-supervised learner

  • Run a parser over 2000 sentences
  • Parsing is relatively expensive, so can’t run on the whole web
  • For each pair of base noun phrases NPi and NPj
  • Extract all tuples t = (NPi, relation_{i,j}, NPj)
  • Label each tuple based on features of the parse:
  • Positive iff the dependency path between the NPs is short, doesn’t cross a clause boundary, and neither NP is a pronoun
  • Now train a Naïve Bayes classifier on the labeled tuples
  • Using lightweight features like nearby POS tags, stop words, etc.
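A minimal Python sketch of the labeling heuristic, assuming parser output has already been summarized into a dependency-path length and a clause-boundary flag; the path-length threshold is an invented stand-in.

    PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

    def label_tuple(np1, np2, dep_path_len, crosses_clause):
        positive = (dep_path_len <= 4            # "short" path (threshold invented)
                    and not crosses_clause       # no clause boundary on the path
                    and np1.lower() not in PRONOUNS
                    and np2.lower() not in PRONOUNS)
        return 1 if positive else 0

    print(label_tuple("scientists", "stars", 3, False))   # 1: positive example
    print(label_tuple("he", "stars", 3, False))           # 0: pronoun NP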

SLIDE 110

Step 2: Single-pass extractor

  • Over a huge (web-sized) corpus:
  • Run a dumb POS tagger
  • Run a dumb Base Noun Phrase chunker
  • Extract all text strings between base NPs
  • Run heuristic rules to simplify text strings

Scientists from many universities are intently studying stars → scientists, are studying, stars

  • Pass candidate tuples to Naïve Bayes classifier
  • Save only those predicted to be “trustworthy”
SLIDE 111

Step 3: Redundancy-based assessor

  • Collect counts for each simplified tuple

scientists, are studying, stars → 17

  • Compute likelihood of each tuple
  • given the counts for each relation
  • and the number of sentences
  • and a combinatoric balls-and-urns model [Downey et al. 05]
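As a much-simplified stand-in for the urns model, one can treat each of the k extractions of a tuple as independent evidence with a fixed extractor precision p, so P(tuple is true) ≈ 1 - (1 - p)^k; this is illustrative only, and the actual Downey et al. model is more involved.

    def redundancy_confidence(k, p=0.6):
        # k: how often the simplified tuple was extracted
        # p: assumed per-extraction precision (illustrative)
        return 1.0 - (1.0 - p) ** k

    print(redundancy_confidence(1))    # 0.6
    print(redundancy_confidence(17))   # ~1.0, e.g. (scientists, are studying, stars)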
SLIDE 112

TextRunner examples

slide from Oren Etzioni

SLIDE 113

TextRunner results

  • From a corpus of 9M web pages, containing 133M sentences
  • Extracted 60.5 million tuples
  • e.g. (FCI, specializes in, software development)
  • Evaluation
  • Not well formed: (demands, of securing, border); (29, dropped, instruments)
  • Abstract: (Einstein, derived, theory); (executive, hired by, company)
  • True, concrete: (Tesla, invented, coil transformer)

SLIDE 114

Yao et al. 2012: motivation

  • Goal: induce clusters of dependency paths which express the same semantic relation, like DIRT
  • But, improve upon DIRT by properly handling semantic ambiguity of individual paths

SLIDE 115

Yao et al. 2012: approach

  • 1. Extract tuples (entity, path, entity) from corpus
  • 2. Construct feature representations of every tuple
  • 3. Group the tuples for each path into sense clusters
  • 4. Cluster the sense clusters into semantic relations
SLIDE 116

Extracting tuples

  • Start with NYT corpus
  • Apply lemmatization, NER tagging, dependency parsing
  • For each pair of entities in a sentence:
  • Extract dependency path between them, as in Lin
  • Form a tuple consisting of the two entities and the path
  • Filter rare tuples, tuples with two direct objects, etc.
  • Result: 1M tuples, 500K entities, 1300 patterns
SLIDE 117

Feature representation

  • Entity names, as bags of words, prefixed with "l:" or "r:"
  • ex: ("LA Lakers", "NY Knicks") => {l:LA, l:Lakers, r:NY, r:Knicks}

  • Using bag-of-words encourages overlap, i.e., combats sparsity
  • Words between and around the two entities
  • Exclude stop words, words with capital letters
  • Include two words to the left and right
  • Document theme (e.g. sports, politics, finance)
  • Assigned by an LDA topic model which treats NYTimes topic descriptors as words in a synthetic document

  • Sentence theme
  • Assigned by a standard LDA topic model
SLIDE 118

Clustering tuples into senses

  • Goal: group tuples for each path into coherent sense clusters

  • Currently exploring multiple different approaches:
  • LDA-like topic models
  • Matrix factorization approaches
  • Result: each tuple is assigned one topic/sense
  • Tuples with the same topic/sense constitute a cluster
SLIDE 119

Sense cluster examples

Sense clusters for path “A play B”, along with sample entity pairs and top features.

SLIDE 120

Semantic relation results

Just like DIRT, each semantic relation has multiple paths. But, one path can now appear in multiple semantic relations. DIRT can’t do that!

SLIDE 121

Relation extraction: 5 easy methods

  • 1. Hand-built patterns
  • 2. Bootstrapping methods
  • 3. Supervised methods
  • 4. Distant supervision
  • 5. Unsupervised methods