
Deep Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond
Luke Zettlemoyer*. Joint work with Luheng He, Kenton Lee, Matthew Peters*, Christopher Clark, Matthew Gardner*, Mohit Iyyer*, Mandar Joshi, Mike Lewis.


  1. SRL Results on CoNLL 2005 (*: ensemble models). Bar chart of F1 on the WSJ (in-domain) test set: Ours* 84.6 and Ours 83.1, ahead of Zhou (2015), FitzGerald* (2015), Täckström (2015), Toutanova* (2008), and Punyakanok* (2008), which range from 79.4 to 82.8.

  2. Same chart with the Brown (out-of-domain) test set added: Ours* 73.6 and Ours 72.1, with the prior systems ranging from 67.8 to 72.2.

  3. Same chart, with the systems grouped: Ours*, Ours, and Zhou (2015) are BiLSTM models; the rest are pipeline models.

  4. Ablations on number of layers (2, 4, 6, and 8). Bar chart of F1 on the CoNLL-05 dev set with greedy decoding: 74.6 (2 layers), 79.1 (4), 80.1 (6), 80.5 (8).

  5. Same chart with Viterbi decoding added: 77.2 (2 layers), 80.5 (4), 81.4 (6), 81.6 (8).

  6. Takeaway: performance increases as the model goes deeper, with the biggest jump from 2 to 4 layers.

  7. Takeaway: shallow models benefit more from constrained decoding.
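Constrained (Viterbi) decoding here means finding the highest-scoring tag sequence that respects BIO well-formedness, i.e. I-ARGX may only follow B-ARGX or I-ARGX. A minimal NumPy sketch of that idea, with random scores standing in for the BiLSTM outputs and hard (unlearned) transition constraints; it is an illustration, not the paper's exact decoder:

```python
import numpy as np

def bio_transition_mask(labels):
    """mask[i, j] = True if label j may follow label i under BIO rules."""
    n = len(labels)
    mask = np.ones((n, n), dtype=bool)
    for i, prev in enumerate(labels):
        for j, cur in enumerate(labels):
            if cur.startswith("I-"):
                role = cur[2:]
                mask[i, j] = prev in (f"B-{role}", f"I-{role}")  # I-X only after B-X / I-X
    return mask

def viterbi_decode(emissions, labels):
    """emissions: (seq_len, n_labels) per-token label scores, e.g. BiLSTM outputs."""
    trans = np.where(bio_transition_mask(labels), 0.0, -np.inf)  # hard constraints only
    start_ok = np.array([not l.startswith("I-") for l in labels])
    score = np.where(start_ok, emissions[0], -np.inf)            # no I-X at position 0
    back = []
    for t in range(1, len(emissions)):
        total = score[:, None] + trans + emissions[t][None, :]
        back.append(total.argmax(axis=0))
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for ptr in reversed(back):                                   # follow backpointers
        path.append(int(ptr[path[-1]]))
    return [labels[i] for i in reversed(path)]

labels = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-V"]
scores = np.random.randn(5, len(labels))                         # stand-in for model scores
print(viterbi_decode(scores, labels))
```

Greedy decoding is simply `emissions.argmax(axis=1)`; the gap between the two shrinks as the model gets deeper, which is exactly what the ablation above shows.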

  8. New Learning Approaches. New state-of-the-art results for two tasks.
Semantic Role Labeling: [NASA]ARG0 [observed]PRED [an X-ray flare 400 times brighter than usual]ARG1 [on January 5, 2015]TMP.
Coreference: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.
Common themes:
• End-to-end training of deep neural networks
• No preprocessing (e.g., no POS tagging, no parsing)
• Large gains in accuracy with simpler models and no extra training data

  9. Coreference Resolution Input document A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

  10. Coreference Resolution Input document A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Cluster #1 A fire in a Bangladeshi garment factory the blaze in the four-story building

  11. Coreference Resolution Input document A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Cluster #1 A fire in a Bangladeshi garment factory the blaze in the four-story building Cluster #2 a Bangladeshi garment factory the four-story building

  12. Coreference Resolution Input document A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Cluster #1 A fire in a Bangladeshi garment factory the blaze in the four-story building Cluster #2 a Bangladeshi garment factory the four-story building Cluster #3 at least 37 people the deceased

  13. Two Subproblems.
Mention detection: from the input document, propose candidate mentions ("A fire in a Bangladeshi garment factory", "at least 37 people", …, "the four-story building").
Mention clustering: Cluster #1: "A fire in a Bangladeshi garment factory" / "the blaze in the four-story building". Cluster #2: "a Bangladeshi garment factory" / "the four-story building". Cluster #3: "at least 37 people" / "the deceased".

  14. Previous Approach: Rule-Based Pipeline. The input document ("A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized.") goes through a syntactic parser to produce candidate mentions ("A fire in a Bangladeshi garment factory", "garment factory", "factory", "at least 37 people dead and 100 hospitalized", …); hand-engineered rules then decide, for each pair of candidates, coreferent or not (✓/✗).

  15. Same pipeline. Mention clustering was the main source of improvement for many years: Haghighi and Klein (2010), Raghunathan et al. (2010), …, Clark & Manning (2016).

  16. Same pipeline. It relies on the parser for mention detection and for syntactic clustering features (e.g., head words).

  17. End-to-end Approach • Consider all possible spans • Learn to rank antecedent spans • Factored model to prune search space

  18. Key Idea: Span Representations. Diagram: word & character embeddings for "General Electric said the Postal Service contacted the company", fed into a bidirectional LSTM.

  19. On top of the BiLSTM outputs, a span representation is computed for "the Postal Service".

  20. The span representation includes boundary representations: the BiLSTM states at the span's first and last words.

  21. It also uses a head-finding attention mechanism to learn headedness within the span.

  22. Span representations are computed for all spans ("General Electric", "Electric said the", "the Postal Service", "Service contacted the", "the company", …).
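Putting the pieces together, a span representation concatenates the BiLSTM states at the span boundaries, an attention-weighted sum of word vectors inside the span (the soft head), and a span-width feature. A minimal NumPy sketch under those assumptions; the dimensions, the single attention vector `w_att`, and attending over raw word embeddings (rather than a learned projection of the BiLSTM states) are simplifications:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_representation(h, x, start, end, w_att, width_emb):
    """h: (seq_len, d) BiLSTM outputs; x: (seq_len, d_x) word vectors.
    Returns [h_start; h_end; soft head word; span-width embedding]."""
    boundary = np.concatenate([h[start], h[end]])      # boundary representations
    scores = x[start:end + 1] @ w_att                  # head-finding attention scores
    alpha = softmax(scores)
    head = alpha @ x[start:end + 1]                    # attention-weighted soft head
    width = width_emb[min(end - start, len(width_emb) - 1)]  # bucketed width feature
    return np.concatenate([boundary, head, width])

seq_len, d, d_x = 9, 8, 6
h = np.random.randn(seq_len, d)                        # stand-in BiLSTM states
x = np.random.randn(seq_len, d_x)                      # stand-in word embeddings
w_att = np.random.randn(d_x)
width_emb = np.random.randn(10, 4)                     # learned width buckets
g = span_representation(h, x, 3, 5, w_att, width_emb)  # "the Postal Service"
print(g.shape)                                         # (2*d + d_x + 4,)
```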

  23. Mention Ranking Every span independently chooses an antecedent Input document A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out.

  24. Mention Ranking. Reason over all possible spans (1: "A", 2: "A fire", 3: "A fire in", …, M: "out") and assign an antecedent to every span: $y_i \in Y(i) = \{\epsilon, 1, \dots, i-1\}$, e.g. $y_3 \in \{\epsilon, 1, 2\}$, where $\epsilon$ is the dummy antecedent.

  25. Example Clustering. Input document: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building. Witnesses say the only exit door was on the ground floor, and that it was locked when the fire broke out.
Antecedent assignments ($y_i$): "A" → ε; "A fire" → ε; …; "a Bangladeshi garment factory" → ε; …; "the four-story building" → "a Bangladeshi garment factory"; …; "out" → ε.

  26. ε covers spans that are not mentions at all (e.g., "out").

  27. ε also covers mentions with no link to a previously occurring span (e.g., the first mention "a Bangladeshi garment factory").

  28. "the four-story building" → "a Bangladeshi garment factory" is a predicted coreference link.
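Antecedent links then induce clusters by transitivity: follow each span's predicted antecedent and group connected spans. A small sketch of that post-processing step over a toy assignment (spans in document order; `None` stands for the dummy ε):

```python
def clusters_from_antecedents(antecedent):
    """antecedent: dict span -> antecedent span, or None for the dummy eps."""
    cluster_of = {}                     # span -> cluster id
    clusters = []
    for span, ante in antecedent.items():
        if ante is None:                # eps: start a (possibly singleton) cluster
            cluster_of[span] = len(clusters)
            clusters.append([span])
        else:                           # join the antecedent's cluster
            cid = cluster_of[ante]
            cluster_of[span] = cid
            clusters[cid].append(span)
    return [c for c in clusters if len(c) > 1]   # drop singletons / non-mentions

links = {
    "A fire in a Bangladeshi garment factory": None,
    "a Bangladeshi garment factory": None,
    "at least 37 people": None,
    "the deceased": "at least 37 people",
    "the blaze in the four-story building": "A fire in a Bangladeshi garment factory",
    "the four-story building": "a Bangladeshi garment factory",
}
print(clusters_from_antecedents(links))   # the three clusters from the slides
```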

  29. Span Ranking Model.
$$P(y_1, \dots, y_M \mid D) = \prod_{i=1}^{M} P(y_i \mid D) = \prod_{i=1}^{M} \frac{e^{s(i, y_i)}}{\sum_{y' \in Y(i)} e^{s(i, y')}}$$
The coreference score is factored to enable span pruning:
$$s(i, j) = \begin{cases} s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \\ 0 & j = \epsilon \end{cases}$$

  30. $s_m(\cdot)$ answers: is this span a mention?

  31. $s_a(i, j)$ answers: is span $j$ an antecedent of span $i$?

  32. The dummy antecedent $\epsilon$ has a fixed zero score.
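In code, the factorization means the cheap mention score $s_m$ is computed once per span, the pairwise score $s_a$ only for surviving candidate pairs, and the dummy ε contributes a fixed 0 logit to the softmax. A NumPy sketch with random stand-ins for the learned scorers:

```python
import numpy as np

def antecedent_distribution(s_m, s_a_row, i):
    """P(y_i | D) over Y(i) = {eps} + spans 0..i-1 (0-indexed).
    s_m: (M,) mention scores; s_a_row: (i,) pairwise scores s_a(i, j) for j < i."""
    logits = np.concatenate([[0.0],                        # dummy eps: fixed score 0
                             s_m[i] + s_m[:i] + s_a_row])  # s(i,j) = s_m(i)+s_m(j)+s_a(i,j)
    e = np.exp(logits - logits.max())
    return e / e.sum()

M, i = 5, 3
s_m = np.random.randn(M)        # stand-in for the learned mention scorer
s_a_row = np.random.randn(i)    # stand-in for the learned antecedent scorer
p = antecedent_distribution(s_m, s_a_row, i)
print(p, p.sum())               # index 0 is eps; indices 1..i are spans 0..i-1
```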

  33. Experimental Setup
Dataset: English OntoNotes (CoNLL-2012).
Genres: telephone conversations, newswire, newsgroups, broadcast conversation, broadcast news, weblogs.
Documents: 2,802 training, 343 development, 348 test. The longest document has 4,009 words!
Aggressive pruning: a maximum span width, a maximum number of sentences during training, suppressing spans with inconsistent bracketing, and a maximum number of antecedents (see the sketch below).
Features: distance between spans, span width.
Metadata: speaker information, genre.
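This pruning is what makes "consider all possible spans" tractable: enumerate spans only up to a maximum width, score them with the cheap mention score, and keep the top λT for a T-word document. A sketch of that candidate step; the scorer is a random placeholder and λ = 0.4 is illustrative:

```python
import numpy as np

def prune_spans(tokens, mention_score, max_width=10, ratio=0.4):
    """Enumerate spans up to max_width; keep the top ratio*T by mention score."""
    spans = [(i, j) for i in range(len(tokens))
             for j in range(i, min(i + max_width, len(tokens)))]
    scores = np.array([mention_score(s) for s in spans])
    keep = int(ratio * len(tokens))
    top = np.argsort(-scores)[:keep]
    return [spans[k] for k in sorted(top)]        # keep document order

tokens = "A fire in a Bangladeshi garment factory has left ...".split()
rng = np.random.default_rng(0)
kept = prune_spans(tokens, lambda s: rng.standard_normal(), max_width=5)
print(len(tokens), "tokens ->", len(kept), "candidate spans")
```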

  34. Coreference Results. Bar chart of test avg. F1 (%) on CoNLL-2012 for linear models: Durrett & Klein (2013) 60.3, Björkelund & Kuhn (2014) 61.6, Martschat & Strube (2015) 62.5.

  35. Neural models added: Wiseman et al. (2016) 64.2 and Clark & Manning (2016) 65.7.

  36. All five of these are pipelined models.

  37. End-to-end models added: our single model 67.2 and our ensemble 68.8.

  38. Qualitative Analysis. Legend: highlighting marks a mention in a predicted cluster; shading marks head-finding attention weight. Example: A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in the four-story building.

  39. The attention-based head finder facilitates soft similarity cues.

  40. Good head finding requires word-order information!

  41. Common Error Case. The flight attendants have until 6:00 today to ratify labor concessions. The pilots' union and ground crew did so yesterday.

  42. The error: conflating relatedness with paraphrasing.

  43. Does the Recipe Work for Broad Coverage Semantics?
Step 1: Gather lots of training data! Challenge 1: data is costly and limited (e.g., linguists were required to label the Penn Treebank / OntoNotes).
Step 2: Apply deep learning!! Challenge 2: a pipeline of structured prediction problems with cascading errors (e.g., POS → parsing → SRL → coref).
Step 3: Observe impressive gains!!!

  44. Where Will the Data Come From???

  45. Where Will the Data Come From??? Option 1: Semi-supervised learning • E.g. word2vec and GloVe are in wide use 
 [Mikolov et al., 2013; Pennington et al., 2014]

  46. Where Will the Data Come From??? Option 1: Semi-supervised learning • E.g. word2vec and GloVe are in wide use 
 [Mikolov et al., 2013; Pennington et al., 2014] • Can we learn better word representations?

  47. Where Will the Data Come From??? Option 1: Semi-supervised learning • E.g. word2vec and GloVe are in wide use 
 [Mikolov et al., 2013; Pennington et al., 2014] • Can we learn better word representations? Option 2: Supervised learning

  48. Where Will the Data Come From??? Option 1: Semi-supervised learning • E.g. word2vec and GloVe are in wide use 
 [Mikolov et al., 2013; Pennington et al., 2014] • Can we learn better word representations? Option 2: Supervised learning • Can we gather more direct forms of supervision?

  49. Learning Better Word Representations. Goal: model contextualized syntax and semantics:
$$R(w_i, w_1 \dots w_n) \in \mathbb{R}^n$$
$$R(\text{plays}, \text{“The robot plays piano.”}) \neq R(\text{plays}, \text{“The robot starred in many plays.”})$$

  50. Word Embeddings from a Language Model. Step 1: train a large bidirectional LM (biLM) on unlabeled data: character convolutions feeding a 2-layer bidirectional LSTM, over e.g. "General Electric said the Postal Service contacted the company".

  51. The biLM is trained with left and right per-word softmaxes, predicting the next word in each direction (… Electric, … the Postal, … contacted …).

  52. Step 2: compute a linear function of the pre-trained model's layers.

  53. LM embeddings $= \alpha_1 h^{(1)} + \alpha_2 h^{(2)} + \alpha_3 h^{(3)}$: a weighted sum of the three layer representations (the character-convolution output and the two LSTM layers).

  54. Step 3: learn the weights $\alpha$ for each end task.
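Concretely, the mixing weights are learned per task and softmax-normalized over layers (the published ELMo version also adds an overall task scale γ). A NumPy sketch, assuming three frozen layer activations per token:

```python
import numpy as np

def mix_lm_layers(layers, s, gamma=1.0):
    """layers: (n_layers, seq_len, d) frozen biLM activations.
    s: (n_layers,) task-specific scalars, softmax-normalized here."""
    alpha = np.exp(s - s.max())
    alpha = alpha / alpha.sum()
    # weighted sum over layers, yielding one LM embedding per token
    return gamma * np.tensordot(alpha, layers, axes=1)

layers = np.random.randn(3, 9, 16)    # char-conv layer + 2 LSTM layers (stand-ins)
s = np.zeros(3)                        # learned per task; uniform at initialization
emb = mix_lm_layers(layers, s)
print(emb.shape)                       # (9, 16): one mixed embedding per token
```

The only new parameters per task are `s` (and γ); the biLM itself stays frozen, which is what makes Step 3 cheap.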

  55. Best Single-System Results. SRL (test F1): feature-based 79.9, neural 81.7, neural+LM 84.6 (+2.9 F1). Coreference (test avg. F1): feature-based 62.5, neural 67.2, neural+LM 70.4 (+3.2 F1).

  56. SOTA for Many Other Tasks. Bar chart comparing previous SOTA, our baseline, and baseline+LM on six tasks: SNLI, SQuAD, Coref, SRL, NER, and Sentiment (SST). Adding LM embeddings improves the baseline on every task and sets a new state of the art (e.g., Coref 67.2 → 70.4, SRL 81.7 → 84.6, NER up to 92.2, SST 51.4 → 54.7).

  57. What Does it Learn?

  58. What Does it Learn? Semantics: a supervised word sense disambiguation task [Miller et al., 1994], using the N-th layer representations in a NN classifier.

  59. Chart of avg. F1: Layer 2 (69.0) outperforms Layer 1 (67.4), approaching the dedicated WSD system of Iacobacci (2016) at 70.1.

  60. Syntax: label a POS corpus [Marcus et al., 1993] and learn a classifier on the N-th layer representations.

  61. Chart of accuracy: Layer 1 (97.4) outperforms Layer 2 (96.8), close to Ling et al. (2015) at 97.8. Lower layers capture more syntax; higher layers capture more semantics.
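Both probes follow the same recipe: freeze the biLM, take one layer's per-token activations as features, and fit a simple classifier. A sketch with scikit-learn's logistic regression standing in for the probe; the random features and toy labels are placeholders for real biLM activations and POS/WSD annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(layer_activations, labels):
    """Fit a linear probe on frozen per-token features from one biLM layer."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(layer_activations, labels)
    return clf.score(layer_activations, labels)   # train accuracy, for illustration

rng = np.random.default_rng(0)
tokens, d = 200, 32
layer1 = rng.standard_normal((tokens, d))          # stand-in layer-1 activations
layer2 = rng.standard_normal((tokens, d))          # stand-in layer-2 activations
pos_tags = rng.integers(0, 5, size=tokens)         # toy POS labels
print("layer 1 probe:", probe_layer(layer1, pos_tags))
print("layer 2 probe:", probe_layer(layer2, pos_tags))
```

With real activations, comparing the two probe scores is what produces the layer-1-vs-layer-2 bars in the charts above.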

  62. Where Will the Data Come From??? Option 1: Semi-supervised learning • E.g. word2vec and GloVe are in wide use 
 [Mikolov et al., 2013; Pennington et al., 2014] • Can we learn better word representations? Option 2: Supervised learning • Can we gather more direct forms of supervision?

  63. A First Data Step: QA-SRL
• Introduce a new SRL formulation with no frame or role inventory
• Use question-answer pairs to model verbal predicate-argument relations
• Annotated over 3,000 sentences in weeks with non-expert, part-time annotators
• Showed that this data is high-quality and learnable [He et al., 2015]

  64. Previous Method: Annotation with Frames. Example: [The rent]ARG1 rose [10%]ARG2 [from $3000]ARG3 [to $3300]ARG4.
Frameset rise.01, "go up": Arg1: logical subject, patient, thing rising; Arg2-EXT: amount risen; Arg3-DIR: start point; Arg4-LOC: end point; ArgM-LOC: medium.
This depends on a pre-defined frame inventory and requires syntactic parses. Annotators need to: 1) identify the frameset, 2) find the arguments in the parse, 3) assign labels accordingly. If the frame doesn't exist, they create a new one.
The Proposition Bank: An Annotated Corpus of Semantic Roles, Palmer et al., 2005. http://verbs.colorado.edu/propbank/framesets-english/rise-v.html

  65. Our Annotation Scheme. Given a sentence and a verb: They increased the rent this year.

  66. Step 1: Ask a question about the verb: Who increased something?

  67. Step 2: Answer with words in the sentence: They

  68. Step 3: Repeat; write as many QA pairs as possible…
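The result of the scheme is just the sentence, the target verb, and the collected QA pairs. A minimal illustrative record; the field names and the extra questions are hypothetical, following the pattern of Step 3, not the released data format:

```python
# One QA-SRL annotation: sentence + target verb + question-answer pairs.
# Field names and the added questions are illustrative only.
annotation = {
    "sentence": "They increased the rent this year .".split(),
    "verb_index": 1,                       # "increased"
    "qa_pairs": [
        {"question": "Who increased something?", "answer": "They"},
        {"question": "What did someone increase?", "answer": "the rent"},
        {"question": "When did someone increase something?", "answer": "this year"},
    ],
}
for qa in annotation["qa_pairs"]:
    print(f'{qa["question"]:45s} -> {qa["answer"]}')
```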
