Deep Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond
slide-1
SLIDE 1

Deep Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond

Joint work with Luheng He†, Kenton Lee†, Matthew Peters*, Christopher Clark†, Matthew Gardner*, Mohit Iyyer*, Mandar Joshi†, Mike Lewis‡, Julian Michael†, Mark Neumann*

† Paul G. Allen School of Computer Science & Engineering, University of Washington,

‡ Facebook AI Research * Allen Institute for Artificial Intelligence

Luke Zettlemoyer†*

slide-2
SLIDE 2

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Step 2: Step 3:

slide-3
SLIDE 3

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Gather lots of training data!  Step 2:  Step 3:

slide-4
SLIDE 4

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Gather lots of training data!  Step 2:  Step 3:

slide-5
SLIDE 5

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Gather lots of training data!  Step 2: Apply Deep Learning!!  Step 3:

slide-6
SLIDE 6

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Gather lots of training data!  Step 2: Apply Deep Learning!!  Step 3:

slide-7
SLIDE 7

Three Simple Steps that will Revolutionize Your ML Research

Step 1: Gather lots of training data!  Step 2: Apply Deep Learning!!  Step 3: Observe Impressive Gains!!!

slide-8
SLIDE 8

Broad Coverage Semantics

Example Tasks:

Coreference: clustering NPs
  "A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building."

Semantic Role Labeling: who did what, etc.
  "[On January 5, 2015 TMP], [NASA ARG0] [observed PRED] [an X-ray flare 400 times brighter than usual ARG1]."

Many applications: Question Answering, Information Extraction, Machine Translation

slide-9
SLIDE 9

Does the Recipe Work for Broad Coverage Semantics?

Step 1: Gather lots of training data! Step 2: Apply Deep Learning!! Step 3: Observe Impressive Gains!!!

slide-10
SLIDE 10

Does the Recipe Work for Broad Coverage Semantics?

Step 1: Gather lots of training data! Step 2: Apply Deep Learning!! Step 3: Observe Impressive Gains!!!

Challenge 1: Data is costly and limited (e.g. linguists required to label PennTreebank / OntoNotes)

slide-11
SLIDE 11

Does the Recipe Work for Broad Coverage Semantics?

Step 1: Gather lots of training data! Step 2: Apply Deep Learning!! Step 3: Observe Impressive Gains!!!

Challenge 1: Data is costly and limited (e.g. linguists required to label PennTreebank / OntoNotes) Challenge 2: Pipeline of structured prediction problems with cascading errors (e.g. POS->Parsing->SRL->Coref)

slide-12
SLIDE 12

New Learning Approaches

New state-of-the-art results for two tasks:

Semantic Role Labeling:
  "[On January 5, 2015 TMP], [NASA ARG0] [observed PRED] [an X-ray flare 400 times brighter than usual ARG1]."

Coreference:
  "A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building."

Common themes:
  • End-to-end training of deep neural networks
  • No preprocessing (e.g., no POS, no parser, etc.)
  • Large gains in accuracy with simpler models and no extra training data

slide-13
SLIDE 13

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

predicate, argument, role label: who, what, when, where, why, …

slide-14
SLIDE 14

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

(Syntactic parse overlay on both sentences: subj, v, obj, prep, adv)

predicate, argument, role label: who, what, when, where, why, …

slide-15
SLIDE 15

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

(Syntactic parse overlay: subj, v, obj, prep, adv. Role overlay: "My mug" and "my favorite mug" are the thing broken.)

predicate, argument, role label: who, what, when, where, why, …

slide-16
SLIDE 16

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

(Role overlay: thing broken = "My mug" / "my favorite mug"; breaker = "The robot"; instrument = "with a wrench"; pieces (final state) = "into pieces"; temporal = "immediately".)

predicate, argument, role label: who, what, when, where, why, …

slide-17
SLIDE 17

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

Frame: break.01
  role   description
  ARG0   breaker
  ARG1   thing broken
  ARG2   instrument
  ARG3   pieces
  ARG4   broken away from what?

(Role overlay: thing broken = "My mug" / "my favorite mug"; breaker = "The robot"; instrument = "with a wrench"; pieces (final state) = "into pieces"; temporal = "immediately".)

predicate, argument, role label: who, what, when, where, why, …

slide-18
SLIDE 18

My mug broke into pieces immediately. The robot broke my favorite mug with a wrench.

Semantic Role Labeling (SRL)

Frame: break.01
  role   description
  ARG0   breaker
  ARG1   thing broken
  ARG2   instrument
  ARG3   pieces
  ARG4   broken away from what?

[My mug ARG1] broke [into pieces ARG3] [immediately ARGM-TMP].
[The robot ARG0] broke [my favorite mug ARG1] [with a wrench ARG2].

predicate, argument, role label: who, what, when, where, why, …

slide-19
SLIDE 19

SRL is a hard problem …

  • Over 10 years, F1 on PropBank: 80.3 (Toutanova et al., 2005) → 80.3 (FitzGerald et al., 2015)
  • Many interesting challenges: syntactic alternation, prepositional phrase attachment, long-range dependencies and common sense

slide-20
SLIDE 20

SRL Systems

Pipeline Systems (Punyakanok et al., 2008; Täckström et al., 2015; FitzGerald et al., 2015):
  sentence, predicate → argument id. (syntactic features) → candidate argument spans → labeling → prediction (ILP/DP) → labeled arguments

slide-21
SLIDE 21

SRL Systems

Pipeline Systems (Punyakanok et al., 2008; Täckström et al., 2015; FitzGerald et al., 2015):
  sentence, predicate → argument id. (syntactic features) → candidate argument spans → labeling → prediction (ILP/DP) → labeled arguments

End-to-end Systems (Collobert et al., 2011; Zhou and Xu, 2015; Wang et al., 2015):
  sentence, predicate → BIO sequence prediction, using context window features or deep BiLSTMs with a CRF layer and Viterbi decoding

slide-22
SLIDE 22

SRL Systems

Pipeline Systems (Punyakanok et al., 2008; Täckström et al., 2015; FitzGerald et al., 2015):
  sentence, predicate → argument id. (syntactic features) → candidate argument spans → labeling → prediction (ILP/DP) → labeled arguments

End-to-end Systems (Collobert et al., 2011; Zhou and Xu, 2015; Wang et al., 2015):
  sentence, predicate → BIO sequence prediction, using context window features or deep BiLSTMs with a CRF layer and Viterbi decoding

*This work (He et al., 2017):
  sentence, predicate → BIO sequence prediction with a deep BiLSTM and hard constraints at decoding

slide-23
SLIDE 23

Input (sentence and predicate): The cats love hats .

SRL as BIO Tagging Problem

slide-24
SLIDE 24

Input (sentence and predicate): The cats love hats .
BIO output (Begin, Inside, Outside): B-ARG0 I-ARG0 B-V B-ARG1 O

SRL as BIO Tagging Problem

slide-25
SLIDE 25

Input (sentence and predicate): The cats love hats .
BIO output (Begin, Inside, Outside): B-ARG0 I-ARG0 B-V B-ARG1 O
Final SRL output: [ARG0 The cats] [V love] [ARG1 hats]

SRL as BIO Tagging Problem
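To make the tagging formulation concrete, below is a minimal sketch (in Python, not the authors' code) of how a predicted BIO sequence is read back into labeled argument spans; the function name and the well-formedness assumption are mine.

```python
# A minimal sketch (not the authors' code) of reading a BIO tag sequence
# back into labeled SRL spans for one predicate; assumes well-formed BIO.
def bio_to_spans(tags):
    """Return (label, start, end) spans, with end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if label is not None and not tag.startswith("I-"):
            spans.append((label, start, i))      # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]            # open a new span
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "O"]
print(bio_to_spans(tags))  # [('ARG0', 0, 2), ('V', 2, 3), ('ARG1', 3, 4)]
```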

slide-26
SLIDE 26

the cats love hats  ([V] marks the predicate "love")
[Figure: the model's per-word output distribution over BIO tags; e.g. "love" receives B-V with probability 0.95]

[He et al., 2017]

slide-27
SLIDE 27

the cats love hats  ([V] marks the predicate "love")
[Figure: per-word output distributions over BIO tags; e.g. "love" receives B-V with probability 0.95]

(1) Deep BiLSTM tagger

[He et al., 2017]

slide-28
SLIDE 28

the cats love hats  ([V] marks the predicate "love")
[Figure: per-word output distributions over BIO tags]

(1) Deep BiLSTM tagger (2) Highway connections

[He et al., 2017]

slide-29
SLIDE 29

the cats love hats  ([V] marks the predicate "love")
[Figure: per-word output distributions over BIO tags]

(1) Deep BiLSTM tagger (2) Highway connections (3) Variational dropout

[He et al., 2017]

slide-30
SLIDE 30

the cats love hats  ([V] marks the predicate "love")
[Figure: per-word output distributions over BIO tags]

(1) Deep BiLSTM tagger (2) Highway connections (3) Variational dropout (4) Viterbi decoding with hard constraints

[He et al., 2017]
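The slides name the decoding step but not its implementation. Below is a minimal sketch, under the assumption that the hard constraints are the usual BIO transition rules (I-X may only follow B-X or I-X), of Viterbi decoding over per-token tag scores; the function and variable names are illustrative only.

```python
import numpy as np

# A minimal sketch (not the released implementation) of Viterbi decoding over
# per-token tag scores, with the hard BIO constraint that I-X may only follow
# B-X or I-X and may not start the sequence.
def constrained_viterbi(scores, tags):
    """scores: [n_words, n_tags] log scores; tags: list of tag strings."""
    def allowed(prev, curr):
        if curr.startswith("I-"):
            return prev in ("B-" + curr[2:], "I-" + curr[2:])
        return True

    n_words, n_tags = scores.shape
    trans = np.array([[0.0 if allowed(tags[p], tags[c]) else -np.inf
                       for c in range(n_tags)] for p in range(n_tags)])
    start = np.array([-np.inf if t.startswith("I-") else 0.0 for t in tags])
    best, back = scores[0] + start, np.zeros((n_words, n_tags), dtype=int)
    for t in range(1, n_words):
        cand = best[:, None] + trans             # cand[prev, curr]
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + scores[t]
    path = [int(best.argmax())]
    for t in range(n_words - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

tags = ["O", "B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1"]
scores = np.log(np.random.dirichlet(np.ones(len(tags)), size=4))
print(constrained_viterbi(scores, tags))
```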

slide-31
SLIDE 31

Other Implementation Details …

  • 8-layer BiLSTMs with 300D hidden layers
  • 100D GloVe embeddings, updated during training
  • Orthonormal initialization for LSTM weight matrices (Saxe et al., 2013)
  • 5-model ensemble with product of experts (Hinton, 2002); see the sketch below
  • Trained for 500 epochs
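A minimal sketch of the product-of-experts combination referenced above, assuming the experts' per-token tag distributions are simply multiplied and renormalized; the exact ensembling details are not spelled out on the slide.

```python
import numpy as np

# A minimal sketch of a product-of-experts combination over per-token tag
# distributions from independently trained models: multiply the experts'
# probabilities (sum their logs) and renormalize per token.
def product_of_experts(model_probs):
    """model_probs: [n_models, n_words, n_tags] arrays of probabilities."""
    log_p = np.log(np.asarray(model_probs)).sum(axis=0)
    log_p -= log_p.max(axis=-1, keepdims=True)   # for numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)
```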
slide-32
SLIDE 32

CoNLL 2005 Results
[Bar chart: F1 on the WSJ Test and Brown (out-of-domain) Test for Ours* (2017), Ours (2017), Zhou (2015), FitzGerald* (2015), Täckström (2015), Toutanova* (2008), Punyakanok* (2008); *: ensemble models]

slide-33
SLIDE 33

CoNLL 2005 Results (WSJ Test F1; *: ensemble models)
  Punyakanok* (2008)   79.4
  Toutanova* (2008)    80.3
  Täckström (2015)     79.9
  FitzGerald* (2015)   80.3
  Zhou (2015)          82.8
  Ours (2017)          83.1
  Ours* (2017)         84.6

slide-34
SLIDE 34

CoNLL 2005 Results: F1, WSJ Test / Brown (out-of-domain) Test (*: ensemble models)
  Punyakanok* (2008)   79.4 / 67.8
  Toutanova* (2008)    80.3 / 68.8
  Täckström (2015)     79.9 / 71.3
  FitzGerald* (2015)   80.3 / 72.2
  Zhou (2015)          82.8 / 69.4
  Ours (2017)          83.1 / 72.1
  Ours* (2017)         84.6 / 73.6

slide-35
SLIDE 35

CoNLL 2005 Results: F1, WSJ Test / Brown (out-of-domain) Test (*: ensemble models)
  Pipeline models:
    Punyakanok* (2008)   79.4 / 67.8
    Toutanova* (2008)    80.3 / 68.8
    Täckström (2015)     79.9 / 71.3
    FitzGerald* (2015)   80.3 / 72.2
  BiLSTM models:
    Zhou (2015)          82.8 / 69.4
    Ours (2017)          83.1 / 72.1
    Ours* (2017)         84.6 / 73.6

slide-36
SLIDE 36

Ablations on Number of Layers (2,4,6 and 8)

F1 on CoNLL-05 Dev., greedy decoding:
  L2 74.6, L4 79.1, L6 80.1, L8 80.5

slide-37
SLIDE 37

Ablations on Number of Layers (2,4,6 and 8)

F1 on CoNLL-05 Dev.:
  Greedy decoding:  L2 74.6, L4 79.1, L6 80.1, L8 80.5
  Viterbi decoding: L2 77.2, L4 80.5, L6 81.4, L8 81.6

slide-38
SLIDE 38

Ablations on Number of Layers (2,4,6 and 8)

F1 on CoNLL-05 Dev.:
  Greedy decoding:  L2 74.6, L4 79.1, L6 80.1, L8 80.5
  Viterbi decoding: L2 77.2, L4 80.5, L6 81.4, L8 81.6

Performance increases as the model gets deeper; the biggest jump is from 2 to 4 layers.

slide-39
SLIDE 39

Ablations on Number of Layers (2,4,6 and 8)

F1 on CoNLL-05 Dev.:
  Greedy decoding:  L2 74.6, L4 79.1, L6 80.1, L8 80.5
  Viterbi decoding: L2 77.2, L4 80.5, L6 81.4, L8 81.6

Shallow models benefit more from constrained decoding.
Performance increases as the model gets deeper; the biggest jump is from 2 to 4 layers.

slide-40
SLIDE 40

New Learning Approaches

New state-of-the-art results for two tasks:

Semantic Role Labeling:
  "[On January 5, 2015 TMP], [NASA ARG0] [observed PRED] [an X-ray flare 400 times brighter than usual ARG1]."

Coreference:
  "A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building."

Common themes:
  • End-to-end training of deep neural networks
  • No preprocessing (e.g., no POS, no parser, etc.)
  • Large gains in accuracy with simpler models and no extra training data

slide-41
SLIDE 41

Coreference Resolution

Input document:
  A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

slide-42
SLIDE 42

Coreference Resolution

Input document:
  A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

Cluster #1: {A fire in a Bangladeshi garment factory; the blaze in the four-story building}

slide-43
SLIDE 43

Coreference Resolution

Input document:
  A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

Cluster #1: {A fire in a Bangladeshi garment factory; the blaze in the four-story building}
Cluster #2: {a Bangladeshi garment factory; the four-story building}

slide-44
SLIDE 44

Coreference Resolution

Input document:
  A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

Cluster #1: {A fire in a Bangladeshi garment factory; the blaze in the four-story building}
Cluster #2: {a Bangladeshi garment factory; the four-story building}
Cluster #3: {at least 37 people; the deceased}

slide-45
SLIDE 45

Two Subproblems

Input document:
  A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

Mention detection: A fire in a Bangladeshi garment factory; at least 37 people; …; the four-story building

Mention clustering:
  Cluster #1: {A fire in a Bangladeshi garment factory; the blaze in the four-story building}
  Cluster #2: {a Bangladeshi garment factory; the four-story building}
  Cluster #3: {at least 37 people; the deceased}

slide-46
SLIDE 46

Previous Approach: Rule-based pipeline

[Pipeline figure] Input document ("A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized.") → syntactic parser → candidate mentions (A fire in a Bangladeshi garment factory; garment; factory; at least 37 people dead and 100 hospitalized; …) → hand-engineered rules → pairwise decisions: Mention #1 + Mention #2 → Coreferent? (✓/✗)

slide-47
SLIDE 47

Previous Approach: Rule-based pipeline

[Pipeline figure] Input document ("A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized.") → syntactic parser → candidate mentions (A fire in a Bangladeshi garment factory; garment; factory; at least 37 people dead and 100 hospitalized; …) → hand-engineered rules → pairwise decisions: Mention #1 + Mention #2 → Coreferent? (✓/✗)

Mention clustering: main source of improvement for many years!

  • Haghighi and Klein (2010)
  • Raghunathan et al. (2010)
  • Clark & Manning (2016)
slide-48
SLIDE 48

Previous Approach: Rule-based pipeline

[Pipeline figure] Input document ("A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized.") → syntactic parser → candidate mentions (A fire in a Bangladeshi garment factory; garment; factory; at least 37 people dead and 100 hospitalized; …) → hand-engineered rules → pairwise decisions: Mention #1 + Mention #2 → Coreferent? (✓/✗)

Relies on parser for:

  • mention detection
  • syntactic features for clustering (e.g. head words)
slide-49
SLIDE 49

End-to-end Approach

  • Consider all possible spans
  • Learn to rank antecedent spans
  • Factored model to prune search space
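A minimal sketch of the last bullet, assuming the common recipe of enumerating spans up to a maximum width and keeping the top-scoring fraction by a unary mention score; prune_spans, keep_ratio, and mention_score are illustrative names, not the released implementation.

```python
# A minimal sketch (illustrative names only) of factored span pruning:
# enumerate spans up to a maximum width, score each with a unary mention
# score, and keep only the highest-scoring fraction for antecedent ranking.
def prune_spans(n_words, max_width, mention_score, keep_ratio=0.4):
    spans = [(i, j) for i in range(n_words)              # inclusive boundaries
             for j in range(i, min(i + max_width, n_words))]
    spans.sort(key=mention_score, reverse=True)
    return sorted(spans[: int(keep_ratio * n_words)])    # back to document order

# Toy usage: a fake mention scorer that prefers longer spans.
print(prune_spans(10, 3, mention_score=lambda span: span[1] - span[0]))
```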
slide-50
SLIDE 50

General Electric said the Postal Service contacted the company

Bidirectional LSTM Word & character embeddings

Key Idea: Span Representations

slide-51
SLIDE 51

General Electric said the Postal Service contacted the company the Postal Service

+

Bidirectional LSTM Word & character embeddings Span representation

Key Idea: Span Representations

slide-52
SLIDE 52

Bidirectional LSTM Word & character embeddings Span representation

General Electric said the Postal Service contacted the company the Postal Service

Boundary representations

Key Idea: Span Representations

slide-53
SLIDE 53

General Electric said the Postal Service contacted the company the Postal Service

+

Bidirectional LSTM Word & character embeddings Head-finding attention Span representation

Attention mechanism to learn headedness

Key Idea: Span Representations

slide-54
SLIDE 54

Bidirectional LSTM Word & character embeddings Head-finding attention Span representation

Sentence: General Electric said the Postal Service contacted the company
Compute representations for all candidate spans, e.g.: "General Electric", "Electric said the", "the Postal Service", "Service contacted the", "the company", …

Key Idea: Span Representations
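A minimal sketch, with toy dimensions and untrained random weights, of the span representation pictured above: boundary BiLSTM states, an attention-weighted soft head word, and a span-width feature. All array names and sizes are assumptions for illustration.

```python
import numpy as np

# A minimal sketch (toy sizes, untrained random weights) of a span
# representation: boundary BiLSTM states + attention-weighted soft head word
# + a span-width feature.
rng = np.random.default_rng(0)
n_words, lstm_dim, emb_dim, width_dim = 9, 400, 300, 20
lstm_out = rng.normal(size=(n_words, lstm_dim))   # BiLSTM output per word
word_emb = rng.normal(size=(n_words, emb_dim))    # word embedding per word
w_attn = rng.normal(size=lstm_dim)                # head-finding attention scorer
width_emb = rng.normal(size=(30, width_dim))      # learned span-width buckets

def span_representation(start, end):              # boundaries are inclusive
    scores = lstm_out[start:end + 1] @ w_attn     # one score per word in span
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                        # softmax over the span
    soft_head = alphas @ word_emb[start:end + 1]  # attention-weighted head word
    return np.concatenate([lstm_out[start], lstm_out[end],
                           soft_head, width_emb[end - start]])

g = span_representation(3, 5)                     # e.g. "the Postal Service"
print(g.shape)                                    # (400 + 400 + 300 + 20,)
```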

slide-55
SLIDE 55

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out.

Every span independently chooses an antecedent

Input document

Mention Ranking

slide-56
SLIDE 56

  • Reason over all possible spans
  • Assign an antecedent to every span, e.g. y_3 ∈ {ε, 1, 2}

  Span:    1 "A"    2 "A fire"    3 "A fire in"    …    M
  Output:  y_1      y_2           y_3              …    y_M

Mention Ranking

slide-57
SLIDE 57

Example Clustering

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out. Input document

  Span                            Antecedent (output y_i)
  A                               ε
  A fire                          ε
  …                               …
  a Bangladeshi garment factory   ε
  …                               …
  the four-story building         a Bangladeshi garment factory
  …                               …

slide-58
SLIDE 58

Example Clustering

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out. Input document

  Span                            Antecedent (output y_i)
  A                               ε
  A fire                          ε
  …                               …
  a Bangladeshi garment factory   ε
  …                               …
  the four-story building         a Bangladeshi garment factory
  …                               …

Not a mention

slide-59
SLIDE 59

Example Clustering

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out. Input document

  Span                            Antecedent (output y_i)
  A                               ε
  A fire                          ε
  …                               …
  a Bangladeshi garment factory   ε
  …                               …
  the four-story building         a Bangladeshi garment factory
  …                               …

No link with previously occurring span

slide-60
SLIDE 60

Example Clustering

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out. Input document

  Span                            Antecedent (output y_i)
  A                               ε
  A fire                          ε
  …                               …
  a Bangladeshi garment factory   ε
  …                               …
  the four-story building         a Bangladeshi garment factory
  …                               …

Predicted coreference link

slide-61
SLIDE 61

P(y_1, \dots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}}

Factor the coreference score s(i, j) to enable span pruning:

s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases}

Span Ranking Model

slide-62
SLIDE 62

P(y_1, \dots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}}

Factor the coreference score s(i, j) to enable span pruning:

s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases}

s_m(i): Is this span a mention?

Span Ranking Model

slide-63
SLIDE 63

P(y_1, \dots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}}

Factor the coreference score s(i, j) to enable span pruning:

s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases}

s_a(i, j): Is span j an antecedent of span i?

Span Ranking Model

slide-64
SLIDE 64

P(y_1, \dots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}}

Factor the coreference score s(i, j) to enable span pruning:

s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases}

The dummy antecedent ε has a fixed score of zero.

Span Ranking Model
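A minimal sketch of the factored score, with random toy weights and plain linear scorers standing in for the model's feed-forward networks; only the factorization itself (s_m + s_m + s_a, with ε fixed at zero) is taken from the slide.

```python
import numpy as np

# A minimal sketch of the factored score s(i, j) = s_m(i) + s_m(j) + s_a(i, j),
# with the dummy antecedent fixed at 0. Linear scorers over toy random span
# representations stand in for the model's feed-forward networks.
rng = np.random.default_rng(0)
n_spans, dim = 6, 1120
g = rng.normal(size=(n_spans, dim))                 # span representations g_i
w_m = rng.normal(size=dim) / np.sqrt(dim)           # mention scorer weights
w_a = rng.normal(size=3 * dim) / np.sqrt(3 * dim)   # antecedent scorer weights

def s_m(i):                                         # is span i a mention?
    return g[i] @ w_m

def s_a(i, j):                                      # is span j an antecedent of i?
    return np.concatenate([g[i], g[j], g[i] * g[j]]) @ w_a

def antecedent_distribution(i):
    """P(y_i | D) over the dummy antecedent plus all spans j < i."""
    scores = np.array([0.0] + [s_m(i) + s_m(j) + s_a(i, j) for j in range(i)])
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                      # index 0 is the dummy epsilon

print(antecedent_distribution(3))                   # 4 probabilities summing to 1
```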

slide-65
SLIDE 65

Experimental Setup

Dataset: English OntoNotes (CoNLL-2012)
Genres: telephone conversations, newswire, newsgroups, broadcast conversation, broadcast news, weblogs
Documents: 2802 training, 343 development, 348 test (the longest document has 4009 words!)
Aggressive pruning: maximum span width, maximum number of sentences during training, suppress spans with inconsistent bracketing, maximum number of antecedents
Features: distance between spans, span width
Metadata: speaker information, genre

slide-66
SLIDE 66

Coreference Results

Test Avg. F1 (%), linear models:
  Durrett & Klein (2013)      60.3
  Björkelund & Kuhn (2014)    61.6
  Martschat & Strube (2015)   62.5

slide-67
SLIDE 67

Coreference Results

Test Avg. F1 (%):
  Linear models:
    Durrett & Klein (2013)      60.3
    Björkelund & Kuhn (2014)    61.6
    Martschat & Strube (2015)   62.5
  Neural models:
    Wiseman et al. (2016)       64.2
    Clark & Manning (2016)      65.7

slide-68
SLIDE 68

Coreference Results

Test Avg. F1 (%), pipelined models:
  Durrett & Klein (2013)      60.3
  Björkelund & Kuhn (2014)    61.6
  Martschat & Strube (2015)   62.5
  Wiseman et al. (2016)       64.2
  Clark & Manning (2016)      65.7

slide-69
SLIDE 69

Coreference Results

Test Avg. F1 (%):
  Pipelined models:
    Durrett & Klein (2013)      60.3
    Björkelund & Kuhn (2014)    61.6
    Martschat & Strube (2015)   62.5
    Wiseman et al. (2016)       64.2
    Clark & Manning (2016)      65.7
  End-to-end models:
    Our model (single)          67.2
    Our model (ensemble)        68.8

slide-70
SLIDE 70

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
[Figure legend: mentions in predicted clusters are marked; shading shows head-finding attention weights]

Qualitative Analysis

slide-71
SLIDE 71

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
[Figure legend: mentions in predicted clusters are marked; shading shows head-finding attention weights]

Qualitative Analysis

Attention-based head finder facilitates soft similarity cues

slide-72
SLIDE 72

A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
[Figure legend: mentions in predicted clusters are marked; shading shows head-finding attention weights]

Qualitative Analysis

Good head-finding requires word-order information!

slide-73
SLIDE 73

Common Error Case

[Figure legend: mentions in predicted clusters are marked; shading shows head-finding attention weights]

The flight attendants have until 6:00 today to ratify labor concessions. The pilots' union and ground crew did so yesterday.

slide-74
SLIDE 74

[Figure legend: mentions in predicted clusters are marked; shading shows head-finding attention weights]

The flight attendants have until 6:00 today to ratify labor concessions. The pilots' union and ground crew did so yesterday.

Conflating relatedness with paraphrasing

Common Error Case

slide-75
SLIDE 75

Does the Recipe Work for Broad Coverage Semantics?

Step 1: Gather lots of training data! Step 2: Apply Deep Learning!! Step 3: Observe Impressive Gains!!!

Challenge 1: Data is costly and limited (e.g. linguists required to label PennTreebank / OntoNotes) Challenge 2: Pipeline of structured prediction problems with cascading errors (e.g. POS->Parsing->SRL->Coref)

slide-76
SLIDE 76

Where Will the Data Come From???

slide-77
SLIDE 77

Option 1: Semi-supervised learning
  • E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]

Where Will the Data Come From???

slide-78
SLIDE 78

Option 1: Semi-supervised learning
  • E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
  • Can we learn better word representations?

Where Will the Data Come From???

slide-79
SLIDE 79

Option 1: Semi-supervised learning
  • E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
  • Can we learn better word representations?

Option 2: Supervised learning

Where Will the Data Come From???

slide-80
SLIDE 80

Option 1: Semi-supervised learning
  • E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
  • Can we learn better word representations?

Option 2: Supervised learning
  • Can we gather more direct forms of supervision?

Where Will the Data Come From???

slide-81
SLIDE 81

Learning Better Word Representations

Goal: Model contextualized syntax and semantics

R(w_i, w_1 \dots w_n) \in \mathbb{R}^n

R(plays, “The robot plays piano.”) ≠ R(plays, “The robot starred in many plays.”)

slide-82
SLIDE 82

2 Layer Bidirectional LSTM Character convolutions

General Electric said the Postal Service contacted the company

Word Embeddings from a Language Model

Step 1: Train a large BiLM on unlabeled data

slide-83
SLIDE 83

2 Layer Bidirectional LSTM Character convolutions

General Electric said the Postal Service contacted the company

Left and right per-word softmaxes (each position predicts the next word in its direction, e.g. “Electric”, “Postal”, “the”, “contacted”, …)

Word Embeddings from a Language Model

Step 1: Train a large BiLM on unlabeled data
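The two softmax layers correspond to the standard bidirectional language-modeling objective. The slide does not spell it out, but it is conventionally written as maximizing

\sum_{k=1}^{N} \Big[ \log p(t_k \mid t_1, \dots, t_{k-1}; \overrightarrow{\Theta}) + \log p(t_k \mid t_{k+1}, \dots, t_N; \overleftarrow{\Theta}) \Big]

over the unlabeled corpus, where the forward and backward LSTMs typically share the token representation and softmax parameters in formulations of this kind.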

slide-84
SLIDE 84

2 Layer Bidirectional LSTM Character convolutions

General Electric said the Postal Service contacted the company

Word Embeddings from a Language Model

Step 1: Train a large BiLM on unlabeled data. Step 2: Compute a linear function of the pre-trained model.

slide-85
SLIDE 85

Word Embeddings from a Language Model

2 Layer Bidirectional LSTM Character convolutions

General Electric said the Postal Service contacted the company

LM embeddings = α1·h1 + α2·h2 + α3·h3  (a learned weighted sum of the three layers above)

Step 1: Train a large BiLM on unlabeled data. Step 2: Compute a linear function of the pre-trained model.

slide-86
SLIDE 86

Word Embeddings from a Language Model

2 Layer Bidirectional LSTM Character convolutions

General Electric said the Postal Service contacted the company

LM embeddings = α1·h1 + α2·h2 + α3·h3  (a learned weighted sum of the three layers above)

Step 1: Train a large BiLM on unlabeled data. Step 2: Compute a linear function of the pre-trained model. Step 3: Learn the weights for each end task.
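A minimal sketch of the weighted combination above, assuming the standard formulation where the per-layer weights are softmax-normalized scalars and a global scale is learned with the end task; shapes and variable names are illustrative.

```python
import numpy as np

# A minimal sketch (toy shapes) of a task-specific weighted combination of
# biLM layers: softmax-normalized scalar weights per layer plus a global
# scale, both learned with the end task.
rng = np.random.default_rng(0)
n_layers, n_words, dim = 3, 9, 1024
h = rng.normal(size=(n_layers, n_words, dim))   # char-CNN layer + 2 biLSTM layers

w = np.zeros(n_layers)                          # learned scalar per layer
gamma = 1.0                                     # learned global scale
s = np.exp(w) / np.exp(w).sum()                 # softmax over layers

lm_embeddings = gamma * np.einsum("l,lwd->wd", s, h)
print(lm_embeddings.shape)                      # (9, 1024): one vector per word
```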

slide-87
SLIDE 87

Best Single System Results

Coreference (+3.2 F1), Test Avg. F1 (%): Feature Based 62.5, Neural 67.2, Neural+LM 70.4
SRL (+2.9 F1), Test F1 (%): Feature Based 79.9, Neural 81.7, Neural+LM 84.6

slide-88
SLIDE 88

Task              Previous SOTA   Baseline   Baseline+LM
SNLI              88.1            88.0       88.7
SQuAD             84.3            81.1       85.3
Coref             67.2            67.2       70.4
SRL               81.7            81.4       84.6
NER               91.9            90.2       92.2
Sentiment (SST)   53.7            51.4       54.7

SOTA For Many Other Tasks

slide-89
SLIDE 89

What Does it Learn?

slide-90
SLIDE 90

What Does it Learn?

Semantics:
  • Supervised WSD task [Miller et al., 1994]
  • Use the N-th layer in an NN classifier

slide-91
SLIDE 91

What Does it Learn?

Semantics:
  • Supervised WSD task [Miller et al., 1994]
  • Use the N-th layer in an NN classifier

WSD Avg. F1 (%): Layer 1 65.9, Layer 2 69.0, Iacobacci (2016) 70.1

slide-92
SLIDE 92

What Does it Learn?

Semantics:
  • Supervised WSD task [Miller et al., 1994]
  • Use the N-th layer in an NN classifier

WSD Avg. F1 (%): Layer 1 65.9, Layer 2 69.0, Iacobacci (2016) 70.1

Syntax:
  • Label a POS corpus [Marcus et al., 1993]
  • Learn a classifier on the N-th layer

slide-93
SLIDE 93

What Does it Learn?

Semantics:
  • Supervised WSD task [Miller et al., 1994]
  • Use the N-th layer in an NN classifier

WSD Avg. F1 (%): Layer 1 65.9, Layer 2 69.0, Iacobacci (2016) 70.1

Syntax:
  • Label a POS corpus [Marcus et al., 1993]
  • Learn a classifier on the N-th layer

POS Accuracy: Layer 1 97.0, Layer 2 95.8, Ling et al. (2015) 97.8
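A minimal sketch of the probing setup in the bullets above: freeze the biLM, take its N-th layer activation for every token, and fit a simple classifier on top. The random data, sizes, and the choice of logistic regression are stand-ins, not the evaluation actually run in the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal sketch of a probing classifier: frozen layer-N activations as
# input features, gold POS or word-sense labels as targets (toy random data).
rng = np.random.default_rng(0)
n_tokens, dim, n_labels = 500, 128, 12
layer_n = rng.normal(size=(n_tokens, dim))          # frozen layer-N activations
labels = rng.integers(0, n_labels, size=n_tokens)   # gold POS / sense labels

probe = LogisticRegression(max_iter=1000).fit(layer_n, labels)
print(probe.score(layer_n, labels))                 # accuracy of the probe
```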

slide-94
SLIDE 94

Option 1: Semi-supervised learning
  • E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
  • Can we learn better word representations?

Option 2: Supervised learning
  • Can we gather more direct forms of supervision?

Where Will the Data Come From???

slide-95
SLIDE 95
  • Introduce a new SRL formulation with no frame or role inventory
  • Use question-answer pairs to model verbal predicate-argument relations
  • Annotated over 3,000 sentences in weeks with non-expert, part-time annotators
  • Showed that this data is high-quality and learnable

A First Data Step: QA-SRL

[He et al., 2015]

slide-96
SLIDE 96

Frameset: rise.01, “go up”
  Arg1: logical subject, patient, thing rising
  Arg2-EXT: amount risen
  Arg3-DIR: start point
  Arg4-LOC: end point
  Argm-LOC: medium

Example: [The rent ARG1] rose [10% ARG2] [from $3000 ARG3] [to $3300 ARG4]

  • Depends on a pre-defined frame inventory, requires syntactic parses
  • Annotators need to: 1) identify the Frameset, 2) find arguments in the parse, 3) assign labels accordingly
  • If the frame doesn’t exist, create a new one

The Proposition Bank: An Annotated Corpus of Semantic Roles, Palmer et al., 2005
http://verbs.colorado.edu/propbank/framesets-english/rise-v.html

Previous Method: Annotation with Frames

slide-97
SLIDE 97

Our Annotation Scheme

Given a sentence and a verb: They increased the rent this year .

slide-98
SLIDE 98

Our Annotation Scheme

Given a sentence and a verb: They increased the rent this year .
Step 1: Ask a question about the verb: Who increased something?

slide-99
SLIDE 99

Our Annotation Scheme

Given a sentence and a verb: They increased the rent this year .
Step 1: Ask a question about the verb: Who increased something?
Step 2: Answer with words in the sentence: They

slide-100
SLIDE 100

Our Annotation Scheme

Given a sentence and a verb: They increased the rent this year .
Step 1: Ask a question about the verb: Who increased something?
Step 2: Answer with words in the sentence: They
Step 3: Repeat, writing as many QA pairs as possible …

slide-101
SLIDE 101

Our Annotation Scheme

Given a sentence and a verb: They increased the rent this year .
Step 1: Ask a question about the verb. Step 2: Answer with words in the sentence. Step 3: Repeat, writing as many QA pairs as possible:
  Who increased something?       They
  What is increased?             the rent
  When is something increased?   this year
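One way to picture the outcome of Steps 1-3 is as a small record per sentence-verb pair; the field names below are hypothetical, not the released QA-SRL data format.

```python
# A hypothetical record for one annotated sentence-verb pair (field names are
# illustrative only, not the released QA-SRL format).
qasrl_item = {
    "sentence": "They increased the rent this year .",
    "verb": "increased",
    "qa_pairs": [
        {"question": "Who increased something?", "answer": "They"},
        {"question": "What is increased?", "answer": "the rent"},
        {"question": "When is something increased?", "answer": "this year"},
    ],
}
```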

slide-102
SLIDE 102

The rent rose 10% from $3000 to $3300

  Wh-Question                      Answer
  What rose?                       the rent
  How much did something rise?     10%
  What did something rise from?    $3000
  What did something rise to?      $3300

(These question-answer pairs stand in for the PropBank roles ARG1, ARG2 = amount risen, ARG3 = start point, ARG4 = end point.)

Our Method: Q/A Pairs for Semantic Relations

slide-103
SLIDE 103

Wh-words vs. PropBank Roles

Counts of wh-words by PropBank role (columns: Who, What, When, Where, Why, How, How Much):
  ARG0:    1575   414     3     5    17    28     2
  ARG1:     285  2481     4    25    20    23    95
  ARG2:      85   364     2    49    17    51    74
  ARG3:      11    62     7     8     4    16    31
  ARG4:       2    30     5    11     2     4    30
  ARG5:       1     2
  AM-ADV:     5    44     9     2    25    27     6
  AM-CAU:     3     1    23     1
  AM-DIR:     6     1    13     4
  AM-EXT:     4     5     5
  AM-LOC:     1    35    10    89    13    11
  AM-MNR:     5    47     2     8     4   108    14
  AM-PNC:     2    21     1    39     7     2
  AM-PRD:     1     1     1
  AM-TMP:     2    51   341     2    11    20    10
(Rows with fewer than seven values had empty cells in the original table; their column positions are not recoverable here.)

slide-104
SLIDE 104
Advantages:
  • Easily explained
  • No pre-defined roles, few syntactic assumptions
  • Can capture implicit arguments
  • Generalizable across domains

Limitations:
  • Only modeling verbs (for now)
  • Not annotating verb senses directly
  • Can have multiple equivalent questions

Challenges:
  • What questions to ask?
  • How much data do we need?
  • Can we generalize to other tasks, such as coref?
slide-105
SLIDE 105

Does the Recipe Work for Broad Coverage Semantics?

Step 1: Gather lots of training data! Step 2: Apply Deep Learning!! Step 3: Observe Impressive Gains!!!

Challenge 1: Data is costly and limited (e.g. linguists required to label PennTreebank / OntoNotes) Challenge 2: Pipeline of structured prediction problems with cascading errors (e.g. POS->Parsing->SRL->Coref)

slide-106
SLIDE 106

Models
  • End-to-end deep learning for SRL and coreference
  • No preprocessing (e.g. no parser or POS tagger)

Data
  • Contextualized word embeddings from a language model
  • First steps towards scalable data annotation

Contributions

slide-107
SLIDE 107

Future Directions

  • Multi-task learning, given architectural similarities
  • Multi-lingual should work, in theory…
  • Need to scale up data annotation efforts, and focus on out-of-domain performance

The End: Questions?

Recent Release

  • AllenNLP: Deep Learning Semantic NLP toolkit
  • See demos and code at AllenNLP.org