

slide-1
SLIDE 1

CS 6956: Deep Learning for NLP

Application: Semantic Role Labeling

slide-2
SLIDE 2

Overview

  • What is semantic role labeling?

– The state-of-the-art before neural networks

  • Neural models for semantic roles

slide-4
SLIDE 4

Semantic roles

For an event that is described by a verb, different noun phrases fulfill different semantic roles.

Think of noun phrases as representing typed arguments.

John saw Mary eat the apple

The seeing event: Which entity is performing the “seeing” action (i.e. initiating it)? What is being seen?

The eating event: Which entity is performing the “eating”? What is being eaten?

slide-10
SLIDE 10

Semantic role labeling

Loosely speaking, the task of identifying who does what to whom, when, where, and why.

Input: A sentence and a verb
Output: A list of labeled spans

– Spans represent the arguments that participate in the event
– The labels represent the semantic role of each argument
– Optionally, also label the verb with a frame type that describes the action (think word sense disambiguation)

Variants exist, but for simplicity we will use this setting.
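To make the input/output contract concrete, here is a minimal Python sketch of how such a prediction could be represented. The class and function names are illustrative (not from any particular library), and the example uses the Nairobi sentence and the A0/A1 labels that appear later in these slides:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SRLArgument:
    start: int  # index of the first token of the span
    end: int    # index of the last token of the span (inclusive)
    label: str  # semantic role label, e.g. "A0", "A1"

def label_arguments(tokens: List[str], verb_index: int) -> List[SRLArgument]:
    """Input: a sentence and a verb. Output: a list of labeled spans.
    A real model would go here; this stub returns the gold analysis of the
    running example used later in these slides."""
    return [SRLArgument(0, 1, "A0"),  # "The bus" -- the mover
            SRLArgument(4, 7, "A1")]  # "for Nairobi in Kenya" -- the destination

tokens = "The bus was heading for Nairobi in Kenya".split()
print(label_arguments(tokens, verb_index=3))  # the verb is "heading"
```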

slide-14
SLIDE 14

What is the set of labels?

We want labels for the participants in event frames
– That is, the semantic arguments of events

Coming up with a closed set of labels can be daunting. Some examples (not nearly complete!):

Semantic role | Description | Example
Agent | The entity who initiates an event | John cut an apple with a knife
Patient | The entity who undergoes a change of state | John cut an apple with a knife
Instrument | The means/intermediary used to perform the action | John cut an apple with a knife
Location | The location of the event | John placed an apple on the table

slide-17
SLIDE 17

Two styles of labels commonly seen

  • FrameNet [Fillmore et al.]

– Labels are fine-grained semantic roles based on the theory of Frame Semantics

  • e.g. Agent, Patient, Instrument, Location, Beneficiary, etc.

– More a lexical resource than a corpus

  • Each semantic frame is associated with exemplars

  • PropBank [Palmer et al.]

– Labels are theory-neutral but defined on a verb-by-verb basis

  • More abstract labels: e.g. Arg0, Arg1, Arg2, Arg-Loc, etc.

– An annotated corpus

  • The Wall Street Journal part of the Penn Treebank

slide-18
SLIDE 18

FrameNet and PropBank: Examples

Jack bought a glove from Mary.
Jack acquired a glove from Mary.
Jack returned a glove to Mary.

FrameNet frame elements:
– bought evokes the COMMERCE_GOODS_TRANSFER frame: Buyer (Jack), Goods (a glove), Seller (Mary)
– acquired evokes the ACQUIRE frame: Recipient (Jack), Theme (a glove), Source (Mary)
– returned: Agent (Jack), Theme (a glove), Recipient (Mary)

PropBank labels:
– In all three sentences: Arg0 (Jack), Arg1 (a glove), Arg2 (Mary)
– The interpretation of these labels depends on the verb
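The verb-by-verb nature of PropBank labels is easy to see with a small lookup table. This is an illustrative sketch only; the roleset ids and glosses are paraphrased, not copied from the actual PropBank frame files:

```python
# Illustrative only: the same Arg0/Arg1/Arg2 labels mean different things
# for different verb rolesets (ids and glosses paraphrased, not official).
ROLESETS = {
    "buy.01":    {"Arg0": "buyer",    "Arg1": "thing bought",   "Arg2": "seller"},
    "return.01": {"Arg0": "returner", "Arg1": "thing returned", "Arg2": "recipient"},
}

def interpret(roleset: str, label: str) -> str:
    """Resolve an abstract PropBank label to its verb-specific meaning."""
    return ROLESETS[roleset][label]

# "Jack bought a glove from Mary": Mary is Arg2, i.e. the seller.
print(interpret("buy.01", "Arg2"))     # -> seller
# "Jack returned a glove to Mary": Mary is again Arg2, but now the recipient.
print(interpret("return.01", "Arg2"))  # -> recipient
```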

slide-21
SLIDE 21

Overview

  • What is semantic role labeling?

– The state-of-the-art before neural networks

  • Neural models for semantic roles


slide-22
SLIDE 22

Semantic Role Labeling

  • Mostly based on PropBank [Palmer et al. 2005]

– Large human-annotated corpus of verb semantic relations

  • The task: To predict arguments of verbs

Given a sentence, identify who does what to whom, where, and when.

The bus was heading for Nairobi in Kenya

– Relation: Head (the predicate)
– Mover [A0]: the bus (argument)
– Destination [A1]: Nairobi in Kenya (argument)

slide-25
SLIDE 25

Predicting verb arguments

1. Identify candidate arguments for the verb using the parse tree

– Filtered using a binary classifier

2. Classify argument candidates

– Multi-class classifier (one of multiple labels per candidate)

3. Inference

– Uses probability estimates from the argument classifier
– Must respect structural and linguistic constraints
  • E.g., no overlapping arguments

The bus was heading for Nairobi in Kenya.

A state-of-the-art pre-neural network approach


slide-29
SLIDE 29

Inference: verb arguments

The bus was heading for Nairobi in Kenya.

[Figure: candidate argument spans, each scored with a probability for every label; a special label means “Not an argument”]

The highest-scoring assignment, heading(The bus, for Nairobi, for Nairobi in Kenya), has total score 2.0, but it violates a constraint: overlapping arguments!

The best valid assignment, heading(The bus, for Nairobi in Kenya), has total score 1.9.

A state-of-the-art pre-neural network approach
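This inference step can be written down in a few lines. The sketch below brute-forces all labelings of the candidate spans and keeps the best one with no overlapping arguments. The candidate spans and the assignment of score columns to labels are assumptions chosen to reproduce the totals above (2.0 unconstrained, 1.9 constrained); a real system would use the classifier's actual probabilities and a smarter search than enumeration:

```python
from itertools import product

# Candidate spans as (text, start, end) with end inclusive, each with a
# probability for every label; "-" is the special "not an argument" label.
SCORES = {
    ("The bus", 0, 1):              {"-": 0.1, "A0": 0.5, "A1": 0.2, "A2": 0.1},
    ("for Nairobi", 4, 5):          {"-": 0.3, "A0": 0.1, "A1": 0.4, "A2": 0.1},
    ("in Kenya", 6, 7):             {"-": 0.6, "A0": 0.1, "A1": 0.1, "A2": 0.1},
    ("for Nairobi in Kenya", 4, 7): {"-": 0.1, "A0": 0.2, "A1": 0.5, "A2": 0.2},
}

def overlapping(span_a, span_b):
    (_, s1, e1), (_, s2, e2) = span_a, span_b
    return s1 <= e2 and s2 <= e1

def valid(assignment):
    """No two spans that are both real arguments may overlap."""
    args = [span for span, label in assignment if label != "-"]
    return not any(overlapping(a, b)
                   for i, a in enumerate(args) for b in args[i + 1:])

spans = list(SCORES)
best = None
for labels in product(["-", "A0", "A1", "A2"], repeat=len(spans)):
    assignment = list(zip(spans, labels))
    score = sum(SCORES[span][label] for span, label in assignment)
    if valid(assignment) and (best is None or score > best[0]):
        best = (score, assignment)

print(best)  # total 1.9: The bus -> A0, for Nairobi in Kenya -> A1
# The unconstrained argmax would score 2.0, but it labels both "for Nairobi"
# and "for Nairobi in Kenya", which overlap.
```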

slide-35
SLIDE 35

Scoring argument labels

  • Essentially a multi-class classification problem
  • Typically linear models with a large number of carefully hand-crafted features

– Words, parts of speech
– The type of the phrase in a parse tree
– The path in a parse tree from the verb to the phrase
– And many more carefully designed features; many-million-dimensional feature vectors

A state-of-the-art pre-neural network approach

[Gildea and Jurafsky 2002, Pradhan et al 2004-, Toutanova et al 2004-, Punyakanok et al 2004-, and others]

Figure from [Palmer et al 2010]
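Among these, the parse-tree path is the most distinctive feature, so here is a minimal sketch of it using nltk's Tree. The tree positions of the predicate and the argument are hard-coded for brevity, and the "^"/"v" notation is one common convention, not the only one; a real extractor would locate the nodes programmatically:

```python
from nltk import Tree

# A hand-built parse of the running example (a real system would use a parser).
parse = Tree.fromstring(
    "(S (NP (DT The) (NN bus)) (VP (VBD was) (VP (VBG heading)"
    " (PP (IN for) (NP (NNP Nairobi) (PP (IN in) (NP (NNP Kenya))))))))")

def path_feature(tree, pred_pos, arg_pos):
    """Classic SRL path feature: node labels going up from the predicate to
    the lowest common ancestor ('^'), then down to the argument ('v')."""
    common = 0
    while (common < min(len(pred_pos), len(arg_pos))
           and pred_pos[common] == arg_pos[common]):
        common += 1
    up = [tree[pred_pos[:i]].label() for i in range(len(pred_pos), common - 1, -1)]
    down = [tree[arg_pos[:i]].label() for i in range(common + 1, len(arg_pos) + 1)]
    return "^".join(up) + "".join("v" + label for label in down)

# Path from the verb "heading" to the PP "for Nairobi in Kenya": VBG^VPvPP
print(path_feature(parse, pred_pos=(1, 1, 0), arg_pos=(1, 1, 1)))
```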

slide-38
SLIDE 38

Extension: Structured learning

  • Why should we train a multiclass classifier that operates on each label independently?

– Instead, train a model that scores the entire set of labels for a frame jointly [Täckström et al 2015]

  • That is, train a model that learns to assign a score to the entire sentence rather than one label at a time:

$\text{Score}(\text{input}, \text{Labels}) = \sum_{\text{label} \in \text{Labels}} \text{score}(\text{input}, \text{label})$

  • At training time, we want

$\text{Score}(\text{input}, \text{Ground Truth}) > \text{Score}(\text{input}, \text{Another assignment})$

Requires enumerating all possible competing assignments, subject to linguistic constraints. Framed as a dynamic program by [Täckström et al 2015].
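A minimal sketch of this training criterion, assuming the competing assignments can be enumerated (in practice [Täckström et al 2015] search over them with a dynamic program rather than listing them). The margin of 1.0 is a standard max-margin addition, not something specified on the slide:

```python
def sequence_score(per_label_scores, assignment):
    """Score(input, Labels) decomposes as a sum of per-label scores."""
    return sum(per_label_scores[slot][label]
               for slot, label in enumerate(assignment))

def structured_hinge_loss(per_label_scores, gold, competitors, margin=1.0):
    """Push Score(input, ground truth) above Score(input, y) for every
    competing valid assignment y."""
    gold_score = sequence_score(per_label_scores, gold)
    worst = max(sequence_score(per_label_scores, y)
                for y in competitors if y != gold)
    return max(0.0, margin + worst - gold_score)

# Toy example: two argument slots, labels A0/A1/"-" with made-up scores.
scores = [{"A0": 0.5, "A1": 0.2, "-": 0.1},
          {"A0": 0.2, "A1": 0.5, "-": 0.1}]
gold = ("A0", "A1")
others = [("A0", "-"), ("-", "A1"), ("-", "-")]
print(structured_hinge_loss(scores, gold, others))  # 0.6: gold wins by only 0.4
```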

slide-42
SLIDE 42

How well did these perform?

  • Shared tasks and evaluations based on PropBank (F1 scores across all labels)

– [Toutanova et al. 2005-2008]: 80.3
– [Punyakanok et al. 2005-2008]: 79.4
– [Täckström et al 2015]: 79.9

  • Common characteristics of these approaches

– Rich features
– Used an ensemble of classifiers
– Used some way to integrate multiple multi-class decisions
  • Either only at prediction time, or at both training time and when the model is used

~10 years, nearly no change in numbers!!


slide-44
SLIDE 44

Why is this problem hard?

Encompasses a wide variety of linguistic phenomena

– Accounts for prepositional phrase attachment

John frightened the raccoon with a big tail.
John frightened the raccoon with a big stick.

(John is Arg0 in both. In the first sentence, Arg1 is “the raccoon with a big tail”; in the second, Arg1 is just “the raccoon” and the stick is the instrument.)

slide-45
SLIDE 45

Why is this problem hard?

Encompasses a wide variety of linguistic phenomena

– The dependencies can be very far away

John frightened the raccoon.
John walked quietly and frightened the raccoon.
John walked quietly into the garden and frightened the raccoon.

In all three cases, John is the Arg0 of frightened… but it can be far away from the verb.

slide-46
SLIDE 46

Why is this problem hard?

Encompasses a wide variety of linguistic phenomena

– Unifies syntactic alternations

John broke the vase
  • Subject position = Arg0, Object position = Arg1

The vase broke
  • Subject position = Arg1

slide-47
SLIDE 47

Overview

  • What is semantic role labeling?

– The state-of-the-art before neural networks

  • Neural models for semantic roles


slide-48
SLIDE 48

How can we introduce neural networks into this problem?

Let’s brainstorm ideas using the tools we have seen so far.

slide-49
SLIDE 49

Some approaches

  • We have scoring functions with hand-designed features

– Replace the scoring functions with a neural network

  • We want to share statistical information across labels

– Embed the labels into a vector space as well

  • We want better input representations

– Convolutional networks (we will see this later)
– BiLSTM networks

  • We still want to keep the constraints that help decoding

slide-50
SLIDE 50

Neural network factors

[FitzGerald et al 2015]

Input: Span, label, frame (think verb)
Output: A score for the span taking this label for this frame

– Embed the span using a two-layer network that uses hand-crafted features from the span
– Embed the frame using a one-hot representation of the frame
– Embed the label using a one-hot representation of the label
– Combine the frame and label embeddings into a frame-role vector using a ReLU layer
– Score = dot product of the span vector and the frame-role vector

Important: Once we have this scoring function, we can plug it into the previous methods directly. We can choose to train these networks independently as a multi-class classifier or in the structured version.
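A sketch of such a factor network in PyTorch. The dimensions and the exact wiring are assumptions based on the bullet points above (span features through a two-layer network; frame and label embeddings mixed by a ReLU layer into a frame-role vector; the score as a dot product), not the exact [FitzGerald et al 2015] architecture:

```python
import torch
import torch.nn as nn

class FactorScorer(nn.Module):
    def __init__(self, n_span_feats, n_frames, n_labels, dim=128):
        super().__init__()
        # Two-layer network over hand-crafted span features.
        self.span_net = nn.Sequential(
            nn.Linear(n_span_feats, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU())
        # An embedding is a one-hot input times a weight matrix.
        self.frame_emb = nn.Embedding(n_frames, dim)
        self.label_emb = nn.Embedding(n_labels, dim)
        # ReLU layer that mixes frame and label into a frame-role vector.
        self.frame_role = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, span_feats, frame_id, label_id):
        span_vec = self.span_net(span_feats)
        fr = torch.cat([self.frame_emb(frame_id),
                        self.label_emb(label_id)], dim=-1)
        role_vec = self.frame_role(fr)
        # Score for the span taking this label for this frame.
        return (span_vec * role_vec).sum(dim=-1)

scorer = FactorScorer(n_span_feats=1000, n_frames=500, n_labels=20)
feats = torch.randn(1, 1000)  # features of one candidate span
score = scorer(feats, torch.tensor([3]), torch.tensor([1]))
print(score.shape)  # torch.Size([1])
```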

slide-57
SLIDE 57

Performance

Shared tasks and evaluations based on PropBank (F1 scores across all labels)

– [Toutanova et al. 2005-2008]: 80.3
– [Punyakanok et al. 2005-2008]: 79.4
– [Täckström et al 2015]: 79.9
– [FitzGerald et al 2015] (structured, product of experts): 80.3

slide-58
SLIDE 58

BiLSTM networks

[Zhou et al 2015, He et al 2017]

[Figure from He et al 2017]

– Input: word embedding + an indicator for the predicate
– Multiple BiLSTM layers
– Highway connections
– BIO encoding of labels
– Each decision is independent of all others; invalid transitions are not allowed at prediction time
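The constrained prediction step can be made concrete: even though the per-token decisions are scored independently, decoding forbids BIO violations such as an I-A1 that does not follow B-A1 or I-A1. A minimal Viterbi-style sketch, with a toy label set and random scores standing in for the BiLSTM outputs:

```python
import numpy as np

LABELS = ["O", "B-A0", "I-A0", "B-A1", "I-A1"]

def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X with the same role X."""
    return (not cur.startswith("I-")) or prev in ("B-" + cur[2:], "I-" + cur[2:])

def constrained_decode(scores):
    """Viterbi over independent per-token label scores (tokens x labels),
    permitting only valid BIO transitions at prediction time."""
    n, k = scores.shape
    best = scores[0].copy()  # best path score ending in each label
    best[[c for c, l in enumerate(LABELS) if l.startswith("I-")]] = -np.inf
    back = np.zeros((n, k), dtype=int)  # backpointers
    for t in range(1, n):
        prev_best = best.copy()
        for c in range(k):
            options = [prev_best[p] if allowed(LABELS[p], LABELS[c]) else -np.inf
                       for p in range(k)]
            back[t, c] = int(np.argmax(options))
            best[c] = options[back[t, c]] + scores[t, c]
    # Trace back the best valid sequence.
    path = [int(np.argmax(best))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [LABELS[i] for i in reversed(path)]

rng = np.random.default_rng(0)
print(constrained_decode(rng.random((4, len(LABELS)))))
```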

slide-64
SLIDE 64

Many moving pieces…

  • Word embeddings (GloVe)

– Better word embeddings can give better results

  • Stacked BiLSTM networks
  • Highway connections
  • Constrained inference

– With a limited set of constraints
– Can be extended to include more of the constraints we saw before

  • Product of experts

– Train multiple models with random initializations and ensemble them

  • Important consideration during training

– Variational dropout: the same dropout mask at every time step
  • We will visit this later

See paper for details
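The variational dropout point is worth a tiny illustration: sample one dropout mask per sequence and reuse it at every time step, instead of resampling per step. A minimal PyTorch sketch of the idea, not the paper's exact implementation:

```python
import torch

def variational_dropout(x, p=0.5, training=True):
    """x: (batch, time, features). One mask per sequence, shared across time."""
    if not training or p == 0.0:
        return x
    # Sample the mask once per (batch, feature) pair, broadcast over time.
    mask = torch.bernoulli(
        torch.full((x.size(0), 1, x.size(2)), 1 - p, device=x.device))
    return x * mask / (1 - p)  # inverted-dropout scaling

x = torch.ones(2, 5, 4)  # batch of 2 sequences, 5 time steps, 4 features
y = variational_dropout(x, p=0.5)
# Every time step within a sequence shares the same zeroed features:
print(torch.all(y[:, 0] == y[:, 1]).item())  # True
```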

slide-65
SLIDE 65

Performance

Shared tasks and evaluations based on PropBank (F1 scores across all labels)

– [Toutanova et al. 2005-2008]: 80.3
– [Punyakanok et al. 2005-2008]: 79.4
– [Täckström et al 2015]: 79.9
– [FitzGerald et al 2015] (structured, product of experts): 80.3
– [He et al 2017] (with product of experts): 84.6

  • No hand-designed features!

slide-66
SLIDE 66

Coming up…

Several other advances in semantic role labeling in recent years. We will revisit this task:

– Convolutional networks (Collobert et al)
  • Slightly older results, but an important paper
– Transformer networks (Strubell et al)
  • Current state-of-the-art
  • The return of syntax
– LSTM-CRFs (Zhou et al)
  • Adding structure to an RNN