Higher-order Coreference Resolution with Coarse-to-fine Inference
Kenton Lee* Luheng He Luke Zettlemoyer University of Washington
* Now at Google
Coreference Resolution
It’s because of what both of you are doing to have things change.
I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well.
Example from Wiseman et al. (2016)
Advantages
Disadvantages
Lee et al. 2017 (EMNLP): end-to-end neural coreference resolution.
For each span i (1 ≤ i ≤ n), compute a span representation h(i) and a distribution over its candidate antecedents, P(y_i | h).
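As a concrete sketch (not the authors' code), the antecedent distribution for a span can be computed as a softmax over candidate antecedent scores plus a dummy antecedent ε whose score is fixed at 0, representing "no antecedent":

```python
import numpy as np

def antecedent_distribution(scores):
    """Softmax over candidate antecedents plus a dummy antecedent epsilon.

    `scores` holds the score of each candidate antecedent of span i;
    the dummy antecedent (no antecedent / non-mention) gets a fixed
    score of 0, following Lee et al. 2017.
    """
    full = np.append(scores, 0.0)        # epsilon appended last
    exp = np.exp(full - full.max())      # numerically stable softmax
    return exp / exp.sum()

# Two candidate antecedents plus epsilon -> a 3-way distribution.
p = antecedent_distribution(np.array([2.0, -1.0]))
```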
In the example above, local information is not sufficient to resolve the pronouns correctly; only the global cluster structure reveals the inconsistency.
Iteratively refining the span representations: the distribution P(y_all of you | h) is computed over the candidate antecedents {I, Linda, you, ε}. Likewise, P(y_you | h) is computed over {I, Linda, ε}, and the model learns a representation of "you" with respect to its antecedent "I". Recomputing P(y_all of you | h) with these refined representations can change the prediction relative to the initial P(y_all of you | h_0), because "you" now carries information about its antecedent.
The refined representations h_n(i) are computed iteratively, starting from h_0(i) = h(i):

a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i)    (attention mechanism)

f_n(i) = σ(W [a_n(i), h_{n−1}(i)])    (forget gates)

h_n(i) = f_n(i) ∘ a_n(i) + (1 − f_n(i)) ∘ h_{n−1}(i)

where ∘ is element-wise multiplication. The antecedent distribution is then recomputed as P(y_i | h_n). The final coreference decision conditions on clusters of size n + 2.
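A minimal NumPy sketch of this refinement loop, under stated assumptions: the gate matrix `W_f` and the bilinear scorer `W_s` are random stand-ins for the learned parameters (the real model scores antecedents with an FFNN), and spans may only attend to earlier spans:

```python
import numpy as np

rng = np.random.default_rng(0)
num_spans, dim, N = 4, 8, 2
h = rng.standard_normal((num_spans, dim))        # h_0(i) = h(i)
W_f = rng.standard_normal((dim, 2 * dim)) * 0.1  # forget-gate weights (random stand-in)
W_s = rng.standard_normal((dim, dim)) * 0.1      # toy bilinear antecedent scorer (stand-in)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def antecedent_probs(h):
    """P(y_i | h): row-wise softmax over earlier spans only."""
    s = h @ W_s @ h.T
    valid = np.tril(np.ones((len(h), len(h))), k=-1)  # antecedent must precede span i
    valid[0, 0] = 1.0          # span 0 has no antecedents; give it a dummy self-link
    s = np.where(valid > 0, s, -1e9)
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for n in range(1, N + 1):
    P = antecedent_probs(h)                              # P(y_i | h_{n-1})
    a = P @ h                                            # a_n(i): expected antecedent rep.
    f = sigmoid(np.concatenate([a, h], axis=1) @ W_f.T)  # f_n(i): forget gates
    h = f * a + (1.0 - f) * h                            # h_n(i): gated interpolation
```

After N iterations, `antecedent_probs(h)` plays the role of P(y_i | h_N).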
Advantages
Disadvantages: a second-order model already runs out of memory.
It’s because of what both of you are doing to have things change.
P(y_i | h) = softmax(s(i, y_i, h))

Existing scoring function:

s(i, j, h) = FFNN(h(i)) + FFNN(h(j)) + FFNN([h(i), h(j), h(i) ∘ h(j)])

where the first two terms are mention scores and the last is the antecedent score.

Coarse-to-fine scoring function adds a cheap (but inaccurate) bilinear antecedent score:

s(i, j, h) = FFNN(h(i)) + FFNN(h(j)) + h(i)ᵀ W_c h(j) + FFNN([h(i), h(j), h(i) ∘ h(j)])

The expensive FFNN antecedent scores are only computed for the top K span pairs under the cheap scores.
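A small NumPy sketch of the coarse pruning stage, under stated assumptions: `W_c` is a random stand-in for the learned bilinear weights, and `num_spans` and `K` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_spans, dim, K = 6, 8, 3
h = rng.standard_normal((num_spans, dim))
W_c = rng.standard_normal((dim, dim)) * 0.1   # coarse bilinear weights (stand-in)

# Cheap coarse score for every pair, h(i)^T W_c h(j), as a single matmul.
coarse = h @ W_c @ h.T
valid = np.tril(np.ones((num_spans, num_spans)), k=-1)  # antecedent j must precede span i
coarse = np.where(valid > 0, coarse, -np.inf)

# Keep only the top-K candidate antecedents per span; the expensive
# FFNN antecedent score is then computed for these pairs alone.
top_k = np.argsort(-coarse, axis=1)[:, :K]
```

The quadratic bilinear pass is cheap because it is one matrix product; the cubic-cost FFNN scoring then touches only num_spans × K pairs.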
Dataset: English OntoNotes (CoNLL-2012)
Baseline: Lee et al. 2017 with (1) better hyperparameters (deeper LSTMs, longer spans, etc.) and (2) ELMo (Peters et al. 2018) embeddings

Results (test Avg. F1, %):

Lee et al. (2017) + ELMo + hyperparameter tuning    72.3
  + coarse-to-fine                                  72.6
  + second-order inference                          73.1