Natural Language Processing (CSEP 517): Sequence Models — Noah Smith


  1. Natural Language Processing (CSEP 517): Sequence Models
     Noah Smith © 2017 University of Washington
     nasmith@cs.washington.edu
     April 17, 2017

  2. To-Do List
     - Online quiz: due Sunday
     - Read: Collins (2011), which has somewhat different notation; Jurafsky and Martin (2016a,b,c)
     - A2 due April 23 (Sunday)

  3. Linguistic Analysis: Overview
     Every linguistic analyzer comprises:
     1. Theoretical motivation from linguistics and/or the text domain
     2. An algorithm that maps V† to some output space Y
     3. An implementation of the algorithm
     - Once upon a time: rule systems and hand-crafted rules
     - Most common now: supervised learning from annotated data
     - Frontier: less supervision (semi-, un-, reinforcement, distant, ...)

  4. Sequence Labeling
     After text classification (V† → L), the next simplest type of output is a sequence labeling:
       ⟨x_1, x_2, ..., x_ℓ⟩ ↦ ⟨y_1, y_2, ..., y_ℓ⟩, i.e., x ↦ y
     Every word gets a label in L. Example problems:
     - part-of-speech tagging (Church, 1988)
     - spelling correction (Kernighan et al., 1990)
     - word alignment (Vogel et al., 1996)
     - named-entity recognition (Bikel et al., 1999)
     - compression (Conroy and O'Leary, 2001)
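
A minimal illustration of one sequence-labeling instance for part-of-speech tagging, using the example sentence that appears later in these slides; the specific gold tags shown here are my own assumption, not taken from the slides:

```python
# A sequence-labeling instance: one label per token, both sequences the same length.
x = ["I", "suspect", "the", "present", "forecast", "is", "pessimistic", "."]
y = ["noun", "verb", "det.", "adj.", "noun", "verb", "adj.", "punc."]  # assumed tags
assert len(x) == len(y)
for token, label in zip(x, y):
    print(f"{token}\t{label}")
```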

  5-7. The Simplest Sequence Labeler: "Local" Classifier
     Define features of a labeled word in context: φ(x, i, y).
     Train a classifier, e.g.,
       ŷ_i = argmax_{y ∈ L} s(x, i, y)
           = argmax_{y ∈ L} w · φ(x, i, y)   (if s is linear)
     Decide the label for each word independently.
     Sometimes this works! We can do better when there are predictable relationships between Y_i and Y_{i+1}.
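
A minimal sketch of local (per-word) decoding with a linear scorer. The feature templates, label set, and weight representation below are illustrative assumptions, not something specified on the slides:

```python
# Local classification: each position i is labeled independently by
# argmax over labels y of w . phi(x, i, y).

LABELS = ["noun", "verb", "det.", "adj.", "adv.", "num.", "punc."]

def features(x, i, y):
    """Sparse features of word x[i] with candidate label y (illustrative templates)."""
    word = x[i]
    prev_word = x[i - 1] if i > 0 else "<s>"
    return {
        f"word={word},label={y}": 1.0,
        f"prev={prev_word},label={y}": 1.0,
        f"suffix3={word[-3:]},label={y}": 1.0,
    }

def score(weights, x, i, y):
    """Linear score w . phi(x, i, y), with weights stored in a dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features(x, i, y).items())

def local_decode(weights, x):
    """Pick each label independently; adjacent labels never interact."""
    return [max(LABELS, key=lambda y: score(weights, x, i, y)) for i in range(len(x))]
```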

  8. Generative Sequence Labeling: Hidden Markov Models
       p(x, y) = \prod_{i=1}^{\ell+1} p(x_i | y_i) \cdot p(y_i | y_{i-1})
     For each state/label y ∈ L:
     - p(X_i | Y_i = y) is the "emission" distribution for y
     - p(Y_i | Y_{i-1} = y) is called the "transition" distribution for y
     Assume Y_0 is always a start state and Y_{ℓ+1} is always a stop state; x_{ℓ+1} is always the stop symbol.
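
A minimal sketch of the HMM joint probability, assuming transition and emission distributions are stored as nested dictionaries; the start/stop handling follows the slide, but the data representation itself is an assumption for illustration:

```python
# HMM joint probability: p(x, y) = prod_i p(x_i | y_i) * p(y_i | y_{i-1}),
# with a distinguished start state, stop state, and stop symbol.

START, STOP, STOP_SYMBOL = "<start>", "<stop>", "</x>"

def joint_prob(transition, emission, x, y):
    """transition[y_prev][y] = p(y | y_prev); emission[y][w] = p(w | y).
    The emission table for STOP should give STOP_SYMBOL probability 1."""
    x = list(x) + [STOP_SYMBOL]      # x_{l+1} is always the stop symbol
    y = [START] + list(y) + [STOP]   # Y_0 = start, Y_{l+1} = stop
    prob = 1.0
    for i in range(1, len(y)):       # i = 1, ..., l+1
        prob *= transition[y[i - 1]].get(y[i], 0.0) * emission[y[i]].get(x[i - 1], 0.0)
    return prob
```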

  9. Graphical Representation of Hidden Markov Models
     [Figure: a chain of states y_0, y_1, ..., y_5, with each y_i (i ≥ 1) emitting the observation x_i.]
     Note: handling of the beginning and end of the sequence is a bit different than before. The last x is known, since p(stop symbol | stop state) = 1.

  10. Structured vs. Not
     Each of these has an advantage over the other:
     - The HMM lets the different labels "interact."
     - The local classifier makes all of x available for every decision.

  11-12. Prediction with HMMs
     The classical HMM tells us to choose:
       argmax_{y ∈ L^{ℓ+1}} \prod_{i=1}^{\ell+1} p(x_i | y_i) \cdot p(y_i | y_{i-1})
     How to optimize over |L|^ℓ choices without explicit enumeration?
     Key: exploit the conditional independence assumptions:
       Y_i ⊥ Y_{1:i-2} | Y_{i-1}
       Y_i ⊥ Y_{i+2:ℓ} | Y_{i+1}

  13. Part-of-Speech Tagging Example
     Sentence: "I suspect the present forecast is pessimistic ."
     [Table: for each of the 8 words, the subset of the tags {noun, adj., adv., verb, num., det., punc.} it could plausibly take.]
     With this very simple tag set, 7^8 ≈ 5.8 million labelings. (Even restricting to the possibilities above, 288 labelings.)

  14-15. Two Obvious Solutions
     Brute force: enumerate all solutions, score them, pick the best.
     Greedy: pick each ŷ_i according to:
       ŷ_i = argmax_{y ∈ L} p(y | ŷ_{i-1}) \cdot p(x_i | y)
     What's wrong with these? Consider:
     - "the old dog the footsteps of the young" (credit: Julia Hirschberg)
     - "the horse raced past the barn fell"
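
A minimal sketch of greedy decoding under the HMM, using the same nested-dictionary parameterization assumed in the earlier joint-probability sketch. It commits to each label left to right, which is why it can be led astray by garden-path sentences like those above:

```python
# Greedy decoding: pick y_i = argmax_y p(y | y_{i-1}) * p(x_i | y), left to right.
START = "<start>"

def greedy_decode(transition, emission, labels, x):
    y_hat = []
    prev = START
    for word in x:
        best = max(labels,
                   key=lambda y: transition[prev].get(y, 0.0) * emission[y].get(word, 0.0))
        y_hat.append(best)   # committed; never revisited
        prev = best
    return y_hat
```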

  16. Conditional Independence
     We can get an exact solution in polynomial time!
       Y_i ⊥ Y_{1:i-2} | Y_{i-1}
       Y_i ⊥ Y_{i+2:ℓ} | Y_{i+1}
     Given the labels adjacent to Y_i, the others do not matter. Let's start at the last position, ℓ ...

  17-20. High-Level View of Viterbi
     - The decision about Y_ℓ is a function of y_{ℓ-1}, x_ℓ, and nothing else!
         p(Y_ℓ = y | x, y_{1:(ℓ-1)}) = p(Y_ℓ = y | X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                     = p(Y_ℓ = y, X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop) / p(X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                     ∝ p(stop | y) · p(x_ℓ | y) · p(y | y_{ℓ-1})
     - If, for each value of y_{ℓ-1}, we knew the best y_{1:(ℓ-1)}, then picking y_ℓ would be easy.
     - Idea: for each position i, calculate the score of the best label prefix y_{1:i} ending in each possible value for Y_i.
     - With a little bookkeeping, we can then trace backwards and recover the best label sequence.

  21. Chart Data Structure
     [Figure: a chart with one column per word x_1, x_2, ..., x_ℓ and one row per label y, y', ..., y_last; each cell will hold the score of the best label prefix ending in that label at that position.]
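
A minimal sketch of that chart as a Python structure, assuming we index it by position and label and keep a parallel table of backpointers; the names are illustrative:

```python
# score[i][y] = score of the best label prefix y_{1:i} ending in label y;
# back[i][y]  = which previous label achieved that best score (for the backward trace).
def make_chart(num_words, labels):
    score = [{y: 0.0 for y in labels} for _ in range(num_words + 1)]
    back = [{y: None for y in labels} for _ in range(num_words + 1)]
    return score, back
```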

  22-25. Recurrence
     First, think about the score of the best sequence. Let s_i(y) be the score of the best label sequence for x_{1:i} that ends in y. It is defined recursively:
       s_ℓ(y)   = p(stop | y) · p(x_ℓ | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-1}(y')
       s_{ℓ-1}(y) = p(x_{ℓ-1} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-2}(y')
       s_{ℓ-2}(y) = p(x_{ℓ-2} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-3}(y')
       ...
       s_i(y)   = p(x_i | y) · max_{y' ∈ L} p(y | y') · s_{i-1}(y')
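
A minimal Viterbi sketch implementing this recurrence with backtrace bookkeeping, using the nested-dictionary HMM parameterization assumed in the earlier sketches; the start/stop handling follows the slides' conventions, everything else is illustrative:

```python
# Viterbi: s_i(y) = p(x_i | y) * max_{y'} p(y | y') * s_{i-1}(y'),
# with the transition to the stop state folded into the final step.
START, STOP = "<start>", "<stop>"

def viterbi(transition, emission, labels, x):
    ell = len(x)
    # score[i][y] = s_i(y); back[i][y] = the argmax y' in the recurrence.
    score = [dict() for _ in range(ell + 1)]
    back = [dict() for _ in range(ell + 1)]
    score[0] = {START: 1.0}                      # base case: empty prefix in the start state
    for i in range(1, ell + 1):
        prev_labels = score[i - 1].keys()
        for y in labels:
            best_prev = max(prev_labels,
                            key=lambda yp: transition[yp].get(y, 0.0) * score[i - 1][yp])
            score[i][y] = (emission[y].get(x[i - 1], 0.0)
                           * transition[best_prev].get(y, 0.0) * score[i - 1][best_prev])
            back[i][y] = best_prev
    # Fold in p(stop | y) at the last position, then trace backwards.
    y_ell = max(labels, key=lambda y: score[ell][y] * transition[y].get(STOP, 0.0))
    y_hat = [y_ell]
    for i in range(ell, 1, -1):
        y_hat.append(back[i][y_hat[-1]])
    return list(reversed(y_hat))
```

In practice one would work with log probabilities to avoid numerical underflow on long sequences; the multiplicative form above mirrors the recurrence as written on the slides.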
