 
CoNLL 2005 Results (*: ensemble models)
[Figure: bar chart of F1 on the WSJ test set and the Brown (out-of-domain) test set for Ours* (2017), Ours (2017), Zhou (2015), FitzGerald* (2015), Täckström (2015), Toutanova* (2008) and Punyakanok* (2008), grouped into BiLSTM models and pipeline models. F1 ranges from 79.4 to 84.6 on WSJ and from 67.8 to 73.6 on Brown.]
Ablations on Number of Layers (2, 4, 6 and 8)
F1 on CoNLL-05 Dev.:
L2: greedy decoding 74.6, Viterbi decoding 77.2
L4: greedy decoding 79.1, Viterbi decoding 80.5
L6: greedy decoding 80.1, Viterbi decoding 81.4
L8: greedy decoding 80.5, Viterbi decoding 81.6
• Performance increases as the model goes deeper. The biggest jump is from 2 to 4 layers.
• Shallow models benefit more from constrained decoding.
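The difference between greedy and constrained Viterbi decoding can be sketched as follows. This is only an illustration under stated assumptions: the tag set `TAGS`, the BIO rule in `allowed`, and the toy `scores` are hypothetical and not taken from the slides, which do not spell out the constraints used.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1"]  # hypothetical tag set

def allowed(prev, cur):
    """Assumed BIO constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

def greedy_decode(scores):
    """Pick the highest-scoring tag for each token independently."""
    return [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in scores]

def viterbi_decode(scores):
    """Find the highest-scoring tag sequence that respects `allowed`."""
    n_tags = len(TAGS)
    # A sequence may not start inside an argument.
    best = [NEG_INF if TAGS[c].startswith("I-") else scores[0][c]
            for c in range(n_tags)]
    backpointers = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for c in range(n_tags):
            cands = [best[p] if allowed(TAGS[p], TAGS[c]) else NEG_INF
                     for p in range(n_tags)]
            p_star = max(range(n_tags), key=cands.__getitem__)
            pointers.append(p_star)
            new_best.append(cands[p_star] + row[c])
        best = new_best
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(range(n_tags), key=best.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [TAGS[k] for k in reversed(path)]

# Toy per-token tag scores (rows: tokens, columns: TAGS).
scores = [
    [0.1, 2.0, 0.0, 0.5, 0.0],
    [0.2, 0.0, 0.3, 0.0, 1.0],
    [1.0, 0.0, 0.2, 0.0, 0.1],
]
print(greedy_decode(scores))
print(viterbi_decode(scores))
```

On this toy input the greedy output contains an I-ARG1 directly after B-ARG0, which the assumed constraint forbids, while Viterbi returns a well-formed sequence.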
New Learning Approaches
New state-of-the-art results for two tasks:
Coreference: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
Semantic Role Labeling:
ARG0: NASA
PRED: observe
ARG1: an X-ray flare 400 times brighter than usual
TMP: On January 5, 2015
Common themes:
• End-to-end training of deep neural networks
• No preprocessing (e.g., no POS, no parser, etc.)
• Large gains in accuracy with simpler models and no extra training data
Coreference Resolution
Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
Cluster #1: A fire in a Bangladeshi garment factory; the blaze in the four-story building
Cluster #2: a Bangladeshi garment factory; the four-story building
Cluster #3: at least 37 people; the deceased
Two Subproblems
Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
Mention detection: A fire in a Bangladeshi garment factory; at least 37 people; …; the four-story building
Mention clustering:
Cluster #1: A fire in a Bangladeshi garment factory; the blaze in the four-story building
Cluster #2: a Bangladeshi garment factory; the four-story building
Cluster #3: at least 37 people; the deceased
Previous Approach: Rule-based pipeline
Components: syntactic parser, hand-engineered rules
Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized.
Candidate mentions: A fire in a Bangladeshi garment factory; garment; garment factory; factory; at least 37 people dead and 100 hospitalized; …
Each pair of candidate mentions (Mention #1, Mention #2) is judged: Coreferent? ✓ / ✗
Mention clustering: main source of improvement for many years!
• Haghighi and Klein (2010)
• Raghunathan et al. (2010)
• …
• Clark & Manning (2016)
Relies on parser for:
• mention detection
• syntactic features for clustering (e.g. head words)
End-to-end Approach
• Consider all possible spans
• Learn to rank antecedent spans
• Factored model to prune search space
Key Idea: Span Representations
Input: General Electric said the Postal Service contacted the company
Architecture (bottom to top): word & character embeddings, bidirectional LSTM, head-finding attention, span representation
• Boundary representations
• Attention mechanism to learn headedness
• Compute all span representations
Example spans: General Electric; Electric said the; the Postal Service; Service contacted the; the company
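A span representation built from boundary representations plus head-finding attention can be sketched roughly as below. This is a simplified illustration, not the authors' implementation: it assumes the per-token vectors `states` and per-token attention scores `head_scores` are already computed, combines the parts by concatenation, and takes the attention-weighted sum over the same vectors used for the boundaries. All names, including `max_width`, are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def span_representation(states, head_scores, start, end):
    """Concatenate the start state, the end state, and an
    attention-weighted sum of the states inside the span (end inclusive)."""
    weights = softmax(head_scores[start:end + 1])
    dim = len(states[0])
    head = [sum(w * states[t][d]
                for w, t in zip(weights, range(start, end + 1)))
            for d in range(dim)]
    return states[start] + states[end] + head

def all_spans(n_tokens, max_width):
    """Enumerate every (start, end) span of at most max_width tokens."""
    return [(i, j)
            for i in range(n_tokens)
            for j in range(i, min(i + max_width, n_tokens))]

# Toy 2-dimensional token states for a 3-token sentence.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_scores = [0.0, 0.0, 0.0]  # uniform attention in this toy case
for start, end in all_spans(len(states), max_width=2):
    print((start, end), span_representation(states, head_scores, start, end))
```

Enumerating all spans is quadratic in sentence length without a width limit, which is one reason a maximum span width is a natural pruning knob.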
Mention Ranking
Every span independently chooses an antecedent.
Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out.
• Reason over all possible spans
• Assign an antecedent to every span
Span 1 (A): antecedent y_1
Span 2 (A fire): antecedent y_2
Span 3 (A fire in): antecedent y_3
…
Span M (out): antecedent y_M
Example: y_3 ∈ {ε, 1, 2}
Example Clustering
Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out.
Span: Antecedent (y_i)
A: ε
A fire: ε
…
a Bangladeshi garment factory: ε
…
the four-story building: a Bangladeshi garment factory
…
out: ε
• ε marks a span that is not a mention, or a mention with no link to a previously occurring span
• A non-ε antecedent is a predicted coreference link
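Turning the per-span antecedent choices into clusters can be sketched as below. This is a minimal illustration, assuming spans are visited in document order and that `None` stands for the dummy antecedent ε; spans are represented by their text for readability, whereas a real system would more likely use token index pairs.

```python
def clusters_from_antecedents(antecedents):
    """Group spans into clusters by following predicted antecedent links.

    `antecedents` maps each span (in document order) to its chosen
    antecedent span, or to None for the dummy antecedent.
    """
    cluster_of = {}
    clusters = []
    for span, antecedent in antecedents.items():
        if antecedent is None:
            continue  # not a mention, or no link to an earlier span
        if antecedent not in cluster_of:
            # The antecedent starts a new cluster.
            cluster_of[antecedent] = len(clusters)
            clusters.append([antecedent])
        cluster_of[span] = cluster_of[antecedent]
        clusters[cluster_of[antecedent]].append(span)
    return clusters

# The assignments shown in the example table.
antecedents = {
    "A": None,
    "A fire": None,
    "a Bangladeshi garment factory": None,
    "the four-story building": "a Bangladeshi garment factory",
    "out": None,
}
print(clusters_from_antecedents(antecedents))
```

Spans that only ever choose ε and are never chosen as an antecedent end up in no cluster, which is how non-mentions drop out of the output.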
Span Ranking Model
\[ P(y_1, \ldots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}} \]
Factor coreference score s(i, j) to enable span pruning:
\[ s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases} \]
• s_m(i): Is this span a mention?
• s_a(i, j): Is span j an antecedent of span i?
• Dummy antecedent has a fixed zero score
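The per-span antecedent distribution can be worked through numerically as in the sketch below. It assumes the mention scores s_m and pairwise scores s_a are already given as plain numbers, and that the candidate set Y(i) is the dummy antecedent plus all earlier spans; the function and argument names are hypothetical.

```python
import math

def antecedent_distribution(i, mention_scores, antecedent_scores):
    """Return P(y_i = y | D) over the dummy antecedent (key None)
    and every earlier span j < i, using the factored score
    s(i, j) = s_m(i) + s_m(j) + s_a(i, j) and s(i, dummy) = 0."""
    scores = {None: 0.0}  # dummy antecedent has a fixed zero score
    for j in range(i):
        scores[j] = (mention_scores[i] + mention_scores[j]
                     + antecedent_scores[(i, j)])
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Toy scores for three spans (indices 0, 1, 2).
mention_scores = [1.0, -2.0, 0.5]
antecedent_scores = {(1, 0): -1.0, (2, 0): 1.5, (2, 1): 0.0}
dist = antecedent_distribution(2, mention_scores, antecedent_scores)
print(max(dist, key=dist.get), dist)
```

Because s_m depends on a single span, spans with low mention scores can be discarded before any pairwise s_a is computed, which is what makes the factorization useful for pruning.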
Experimental Setup
Dataset: English OntoNotes (CoNLL-2012)
Genres: telephone conversations, newswire, newsgroups, broadcast conversation, broadcast news, weblogs
Documents: 2802 training, 343 development, 348 test
Longest document has 4009 words!
Aggressive pruning: maximum span width, maximum number of sentences per document during training, suppress spans with inconsistent bracketing, maximum number of antecedents
Features: distance between spans, span width
Metadata: speaker information, genre
Coreference Results
Test Avg. F1 (%):
Pipelined models, linear:
• Durrett & Klein (2013): 60.3
• Björkelund & Kuhn (2014): 61.6
• Martschat & Strube (2015): 62.5
Pipelined models, neural:
• Wiseman et al. (2016): 64.2
• Clark & Manning (2016): 65.7
End-to-end models:
• Our model (single): 67.2
• Our model (ensemble): 68.8
Qualitative Analysis
Legend: mention in a predicted cluster; head-finding attention weight
A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
• Attention-based head finder facilitates soft similarity cues
• Good head-finding requires word-order information!
Common Error Case
Legend: mention in a predicted cluster; head-finding attention weight
The flight attendants have until 6:00 today to ratify labor concessions. The pilots' union and ground crew did so yesterday.
• Conflating relatedness with paraphrasing
Does the Recipe Work for Broad Coverage Semantics?
Step 1: Gather lots of training data!
Challenge 1: Data is costly and limited (e.g. linguists required to label Penn Treebank / OntoNotes)
Step 2: Apply Deep Learning!!
Challenge 2: Pipeline of structured prediction problems with cascading errors (e.g. POS -> Parsing -> SRL -> Coref)
Step 3: Observe Impressive Gains!!!
Where Will the Data Come From???
Option 1: Semi-supervised learning
• E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
• Can we learn better word representations?
Option 2: Supervised learning
• Can we gather more direct forms of supervision?
Learning Better Word Representations
Goal: Model contextualized syntax and semantics
\[ R(w_i, w_1 \ldots w_n) \in \mathbb{R}^n \]
\[ R(\text{plays}, \text{“The robot plays piano.”}) \neq R(\text{plays}, \text{“The robot starred in many plays.”}) \]
Word Embeddings from a Language Model
Input: General Electric said the Postal Service contacted the company
Architecture (bottom to top): character convolutions, 2-layer bidirectional LSTM, left and right per-word softmaxes (e.g. predicting Electric, the, Postal, contacted, …)
Step 1: Train a large BiLM on unlabeled data
Step 2: Compute a linear function of the pre-trained model
LM embeddings = α_1 · [layer] + α_2 · [layer] + α_3 · [layer], where each [layer] is one layer of the pre-trained model (shown graphically on the slide)
Step 3: Learn the weights α_1, α_2, α_3 for each end task
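Step 2 amounts to a weighted sum of the pre-trained model's layer outputs for each token. The sketch below shows only that arithmetic, assuming three layer vectors per token and weights α that an end task would learn; here they are fixed by hand. Any normalization or scaling of the weights is left out, since the slides do not specify one.

```python
def lm_embedding(layers, alphas):
    """Combine per-layer vectors for one token into a single embedding.

    layers: one vector (list of floats) per layer of the pre-trained model
    alphas: one weight per layer (learned per end task; fixed here)
    """
    if len(layers) != len(alphas):
        raise ValueError("need exactly one weight per layer")
    dim = len(layers[0])
    return [sum(a * layer[d] for a, layer in zip(alphas, layers))
            for d in range(dim)]

# Toy 2-dimensional outputs of three layers for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = [0.5, 0.25, 0.25]  # hypothetical task-specific weights
print(lm_embedding(layers, alphas))
```

Because only the α weights change per task, the pre-trained model itself can stay fixed while each end task decides how much to rely on each layer.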
Best Single System Results
Test Avg. F1 (%):
SRL: Feature Based 79.9, Neural 81.7, Neural+LM 84.6 (+2.9 F1)
Coreference: Feature Based 62.5, Neural 67.2, Neural+LM 70.4 (+3.2 F1)
SOTA For Many Other Tasks
[Figure: bar chart comparing Previous SOTA, Baseline and Baseline+LM on SNLI, SQuAD, Coref, SRL, NER and Sentiment (SST).]
What Does it Learn?
Semantics:
• Supervised WSD task [Miller et al., 1994]
• Use N-th layer in NN classifier
[Figure: bar chart of Avg. F1 (%) for Layer 1, Layer 2 and Iacobacci (2016).]
Syntax:
• Label POS corpus [Marcus et al., 1993]
• Learn classifier on N-th layer
[Figure: bar chart of accuracy for Layer 1, Layer 2 and Ling et al. (2015).]
Where Will the Data Come From???
Option 1: Semi-supervised learning
• E.g. word2vec and GloVe are in wide use [Mikolov et al., 2013; Pennington et al., 2014]
• Can we learn better word representations?
Option 2: Supervised learning
• Can we gather more direct forms of supervision?
A First Data Step: QA-SRL
• Introduce a new SRL formulation with no frame or role inventory
• Use question-answer pairs to model verbal predicate-argument relations
• Annotated over 3,000 sentences in weeks with non-expert, part-time annotators
• Showed that this data is high-quality and learnable [He et al., 2015]
Previous Method: Annotation with Frames
Example: The rent rose 10% from $3000 to $3300
Roles marked on the example: ARG1, ARG2 (amount risen), ARG3 (start point), ARG4 (end point)
Frameset: rise.01, go up
Arg1-: Logical subject, patient, thing rising
Arg2-EXT: EXT, amount risen
Arg3-DIR: start point
Arg4-LOC: end point
Argm-LOC: medium
• Depends on pre-defined frame inventory, requires syntactic parses
• Annotators need to:
1) Identify the Frameset
2) Find arguments in the parse
3) Assign labels accordingly
• If frame doesn't exist, create new
The Proposition Bank: An Annotated Corpus of Semantic Roles, Palmer et al., 2005
http://verbs.colorado.edu/propbank/framesets-english/rise-v.html
Our Annotation Scheme
Given sentence and a verb: They increased the rent this year .
Step 1: Ask a question about the verb: Who increased something ?
Step 2: Answer with words in the sentence: They
Step 3: Repeat, write as many QA pairs as possible …
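One way to picture the resulting annotations is as a list of question-answer pairs attached to a sentence and a verb. The sketch below is a hypothetical data structure, not the released QA-SRL format: the class names are invented, tokenization is plain whitespace splitting, and requiring the answer to be a contiguous word sequence is an assumption (the scheme only says answers use words from the sentence).

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class VerbAnnotation:
    sentence: str
    verb: str
    qa_pairs: list = field(default_factory=list)

    def add(self, question, answer):
        """Record a QA pair after checking that the answer is a
        non-empty contiguous run of words from the sentence."""
        tokens = self.sentence.split()
        answer_tokens = answer.split()
        n = len(answer_tokens)
        found = any(tokens[i:i + n] == answer_tokens
                    for i in range(len(tokens) - n + 1))
        if n == 0 or not found:
            raise ValueError(f"answer {answer!r} is not in the sentence")
        self.qa_pairs.append(QAPair(question, answer))

annotation = VerbAnnotation("They increased the rent this year .", "increased")
annotation.add("Who increased something ?", "They")
# Hypothetical second pair for illustration; it is not from the slides.
annotation.add("What did someone increase ?", "the rent")
print(annotation.qa_pairs)
```

No frame or role inventory appears anywhere in the structure: the question itself carries the role information.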