

SLIDE 1

Higher-order Coreference Resolution with Coarse-to-fine Inference

1

Kenton Lee* Luheng He Luke Zettlemoyer University of Washington

* Now at Google

SLIDE 2

It’s because of what both of you are doing to have things change.

Coreference Resolution

2

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

Example from Wiseman et al. (2016)


SLIDE 5

Recent Trends in Coreference Resolution

5

End-to-end models have achieved large improvements

Advantages

  • Conceptually simple
  • Minimal feature engineering

Disadvantages

  • Computationally expensive
  • Very little “reasoning” involved
SLIDE 6

Contributions

6

  • Address a modeling challenge:
  • Enable higher-order (multi-hop) coreference
  • Address a computational challenge:
  • Coarse-to-fine inference with a factored model
SLIDE 8

Existing Approach: Span-ranking Model

8

Lee et al. 2017 (EMNLP):

  • Consider all possible spans i (1 ≤ i ≤ n) in the document
  • Compute neural span representations h(i)
  • Estimate a probability distribution over possible antecedents: P(y_i | h)
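The span-ranking step can be sketched in a few lines. This is a minimal NumPy sketch, assuming toy random vectors in place of the learned span representations; `ffnn`, `w_m`, and `w_a` are hypothetical one-layer stand-ins for the paper's feed-forward scorers, not its actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 candidate spans, toy dimension
h = rng.standard_normal((n, d))  # stand-in for the learned span representations

w_m = rng.standard_normal(d)          # mention-scorer weights (hypothetical)
w_a = rng.standard_normal(3 * d)      # antecedent-scorer weights (hypothetical)

def ffnn(x, w):                  # one-layer stand-in for the paper's FFNN scorers
    return float(np.tanh(x) @ w)

def antecedent_distribution(i):
    """P(y_i | h): softmax over antecedents j < i plus the dummy ε (score 0)."""
    scores = [0.0]               # ε ("no antecedent") is fixed at score 0
    for j in range(i):
        s = (ffnn(h[i], w_m) + ffnn(h[j], w_m)
             + ffnn(np.concatenate([h[i], h[j], h[i] * h[j]]), w_a))
        scores.append(s)
    scores = np.asarray(scores)
    e = np.exp(scores - scores.max())
    return e / e.sum()           # index 0 is ε, index j + 1 is antecedent j

p = antecedent_distribution(3)   # distribution over {ε, span 0, span 1, span 2}
```

Each span only ranks spans that precede it, so the model factors over per-span multinomials rather than global cluster structure.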

SLIDE 9

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well. Absolutely.

Limitations of a First Order Model

9

Example from Wiseman et al. (2016)

Local information not sufficient

SLIDE 10

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

10

Example from Wiseman et al. (2016)

Global structure reveals inconsistency

Limitations of a First Order Model

SLIDE 11
  • Let span representations softly condition on previous decisions

Higher-order Model

11

SLIDE 12
  • Let span representations softly condition on previous decisions
  • For each iteration:
  • Estimate the antecedent distribution
  • Attend over possible antecedents
  • Merge every span representation with its expected antecedent

Higher-order Model

12

SLIDE 13

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

13

Candidate antecedents: I, Linda, you, ε

P(y_{all of you} | h)

Higher-order Model

SLIDE 14

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

14

Higher-order Model

Candidate antecedents: I, Linda, ε

P(y_you | h)

SLIDE 15

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

15

Higher-order Model

Candidate antecedents: I, Linda, ε

P(y_you | h)

Learn a representation of “you” w.r.t. “I”

SLIDE 16

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

16

Higher-order Model

Candidate antecedents: I, Linda, you, ε

P(y_{all of you} | h)

SLIDE 17

I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.

17

Candidate antecedents: I, Linda, you, ε

P(y_{all of you} | h_0)

Candidate antecedents: I, Linda, you, ε

P(y_{all of you} | h)

SLIDE 18
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i)

Higher-order Model

18

SLIDE 19
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)

Higher-order Model

SLIDE 20
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)
  • Recursive case:
      a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)   (attention mechanism)

Higher-order Model

20

SLIDE 21
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)
  • Recursive case:
      a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)   (attention mechanism)
      f_n(i) = σ(W [a_n(i), h_{n−1}(i)])   (forget gates)

Higher-order Model

21

SLIDE 22
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)
  • Recursive case:
      a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)   (attention mechanism)
      f_n(i) = σ(W [a_n(i), h_{n−1}(i)])   (forget gates)
      h_n(i) = f_n(i) ∘ a_n(i) + (1 − f_n(i)) ∘ h_{n−1}(i)

Higher-order Model

22

SLIDE 23
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)
  • Recursive case:
      a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)   (attention mechanism)
      f_n(i) = σ(W [a_n(i), h_{n−1}(i)])   (forget gates)
      h_n(i) = f_n(i) ∘ a_n(i) + (1 − f_n(i)) ∘ h_{n−1}(i)
  • Final result: P(y_i | h_N)

Higher-order Model

23

SLIDE 24
  • Let span representations softly condition on previous decisions
  • Iterative inference to compute h_n(i):
  • Base case: h_0(i) = h(i) (from the baseline)
  • Recursive case:
      a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)   (attention mechanism)
      f_n(i) = σ(W [a_n(i), h_{n−1}(i)])   (forget gates)
      h_n(i) = f_n(i) ∘ a_n(i) + (1 − f_n(i)) ∘ h_{n−1}(i)
  • Final result: P(y_i | h_N)

Higher-order Model

24

Final coreference decision conditions on clusters of size N + 2
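The base case, attention step, and forget gate above can be put together as a short refinement loop. This is a minimal NumPy sketch, assuming N = 2 iterations over toy random vectors; the antecedent scorer is a dot product rather than the model's FFNN, and letting the dummy ε contribute the span's own vector is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, iters = 4, 6, 2                    # 4 spans, toy dimension, N = 2 iterations
h = rng.standard_normal((n, d))          # h_0(i): baseline span representations
W = rng.standard_normal((d, 2 * d))      # forget-gate weights (hypothetical)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def antecedent_probs(h):
    """Toy stand-in for P(y_i | h): dot-product scores over j < i plus ε."""
    P = np.zeros((n, n + 1))             # column 0 is the dummy antecedent ε
    for i in range(n):
        scores = np.array([0.0] + [float(h[i] @ h[j]) for j in range(i)])
        e = np.exp(scores - scores.max())
        P[i, : i + 1] = e / e.sum()
    return P

for _ in range(iters):
    P = antecedent_probs(h)
    h_new = np.empty_like(h)
    for i in range(n):
        # a_n(i): expected antecedent representation under P(y_i | h_{n-1});
        # the dummy ε contributes the span's own vector (a simplification)
        a = P[i, 0] * h[i] + sum(P[i, j + 1] * h[j] for j in range(i))
        f = sigmoid(W @ np.concatenate([a, h[i]]))   # forget gate f_n(i)
        h_new[i] = f * a + (1.0 - f) * h[i]          # gated update h_n(i)
    h = h_new
```

Because the update is a soft interpolation rather than a hard clustering decision, the whole loop stays differentiable and trains end-to-end.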

SLIDE 26

Recent Trends in Coreference Resolution

26

End-to-end models have achieved large improvements

Advantages

  • Conceptually simple
  • Minimal feature engineering

Disadvantages

  • Computationally expensive
  • Very little “reasoning” involved

A 2nd-order model already runs out of memory

SLIDE 27

Contributions

27

  • Address a modeling challenge:
  • Enable higher-order (multi-hop) coreference
  • Address a computational challenge:
  • Coarse-to-fine inference with a factored model
SLIDE 28

Computational Challenge

28

It’s because of what both of you are doing to have things change.

  • Mention candidates shown just for exposition
SLIDE 29

Computational Challenge

29

It’s because of what both of you are doing to have things change.

  • Mention candidates shown just for exposition
  • O(n²) spans to consider in practice
SLIDE 30

Computational Challenge

30

It’s because of what both of you are doing to have things change.

  • Mention candidates shown just for exposition
  • O(n²) spans to consider in practice
  • O(n⁴) coreference links to consider
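To make the blow-up concrete, here is the arithmetic for a 100-token document, counting every contiguous span; real systems also cap span width and prune mentions, which these toy counts ignore:

```python
# Counting candidate spans and antecedent pairs for an n-token document.
def span_count(n):
    return n * (n + 1) // 2          # every contiguous span: O(n^2)

def pair_count(n):
    s = span_count(n)
    return s * (s - 1) // 2          # every (span, earlier-span) pair: O(n^4)

print(span_count(100))   # 5050
print(pair_count(100))   # 12748725
```

At OntoNotes document lengths this pair count is far too large to score with a full FFNN, which motivates the coarse-to-fine stage that follows.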
SLIDE 31

Coarse-to-fine Inference

31

P(y_i | h) = softmax(s(i, y_i, h))

SLIDE 32

Coarse-to-fine Inference

32

P(y_i | h) = softmax(s(i, y_i, h))

Existing scoring function:

s(i, j, h) = FFNN(h(i)) + FFNN(h(j))          (mention scores)
           + FFNN(h(i), h(j), h(i) ∘ h(j))    (antecedent scores)

SLIDE 33

Coarse-to-fine Inference

33

P(y_i | h) = softmax(s(i, y_i, h))

Coarse-to-fine scoring function:

s(i, j, h) = FFNN(h(i)) + FFNN(h(j))          (mention scores)
           + h(i)ᵀ W_c h(j)                   (cheap/inaccurate antecedent scores)
           + FFNN(h(i), h(j), h(i) ∘ h(j))    (antecedent scores)

SLIDE 34

Coarse-to-fine Inference

34

P(y_i | h) = softmax(s(i, y_i, h))

Coarse-to-fine scoring function:

s(i, j, h) = FFNN(h(i)) + FFNN(h(j))          (mention scores)
           + h(i)ᵀ W_c h(j)                   (cheap/inaccurate antecedent scores)
           + FFNN(h(i), h(j), h(i) ∘ h(j))    (antecedent scores)

Only compute the expensive scores for the top K span pairs
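The two-stage idea can be sketched in NumPy: score all pairs with the cheap bilinear term (a single chain of matrix multiplies), then keep only the top K antecedent candidates per span for the expensive scorer. The names `W_c`, `top_k`, and the beam size K = 10 are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 50, 16, 10                     # 50 spans, toy dimension, beam of 10
h = rng.standard_normal((n, d))          # stand-in span representations
W_c = rng.standard_normal((d, d))        # cheap bilinear weights (hypothetical)

# Stage 1: cheap bilinear scores for ALL pairs via matrix multiplication.
coarse = h @ W_c @ h.T                   # coarse[i, j] = h(i)^T W_c h(j)

top_k = {}
for i in range(n):
    js = np.arange(i)                    # antecedents must precede span i
    order = np.argsort(coarse[i, js])[::-1]
    top_k[i] = js[order[:K]]             # keep only the K best-scoring candidates

# Stage 2 (not shown): run the expensive FFNN antecedent scorer only on
# these top-K pairs, reducing pairwise scoring from O(n^2) to O(n*K).
```

The coarse scores don't need to be accurate, only good enough that true antecedents usually survive the pruning; the fine scorer makes the final decision.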

SLIDE 35

Experimental Setup

35

Dataset: English OntoNotes (CoNLL-2012)
Baseline: Lee et al. 2017 with:
  (1) Better hyperparameters (deeper LSTMs, longer spans, etc.)
  (2) ELMo (Peters et al. 2018) embeddings

SLIDE 36

Coreference Results

36

Test Avg. F1 (%):

Lee et al. (2017) + ELMo + hyperparameter tuning: 72.3

SLIDE 37

Coreference Results

37

Test Avg. F1 (%):

Lee et al. (2017) + ELMo + hyperparameter tuning: 72.3
+ coarse-to-fine: 72.6

SLIDE 38

Coreference Results

38

Test Avg. F1 (%):

Lee et al. (2017) + ELMo + hyperparameter tuning: 72.3
+ coarse-to-fine: 72.6
+ second-order inference: 73.1

SLIDE 39

Summary

39

  • Improve structural consistency via multi-hop coreference
  • Enable more complex inference via coarse-to-fine beam search