Natural Language Processing (CSEP 517): Sequence Models, Noah Smith (PowerPoint presentation)



slide-1
SLIDE 1

Natural Language Processing (CSEP 517): Sequence Models

Noah Smith

© 2017 University of Washington, nasmith@cs.washington.edu

April 17, 2017

1 / 98

slide-2
SLIDE 2

To-Do List

◮ Online quiz: due Sunday
◮ Read: Collins (2011), which has somewhat different notation; Jurafsky and Martin (2016a,b,c)

◮ A2 due April 23 (Sunday)

2 / 98

slide-3
SLIDE 3

Linguistic Analysis: Overview

Every linguistic analyzer comprises:

  • 1. Theoretical motivation from linguistics and/or the text domain
  • 2. An algorithm that maps V† to some output space Y.
  • 3. An implementation of the algorithm

◮ Once upon a time: rule systems and crafted rules
◮ Most common now: supervised learning from annotated data
◮ Frontier: less supervision (semi-, un-, reinforcement, distant, . . . )

3 / 98

slide-4
SLIDE 4

Sequence Labeling

After text classification (V† → L), the next simplest type of output is a sequence labeling:

x1, x2, . . . , xℓ → y1, y2, . . . , yℓ    (i.e., x → y)

Every word gets a label in L. Example problems:

◮ part-of-speech tagging (Church, 1988)
◮ spelling correction (Kernighan et al., 1990)
◮ word alignment (Vogel et al., 1996)
◮ named-entity recognition (Bikel et al., 1999)
◮ compression (Conroy and O’Leary, 2001)

4 / 98

slide-5
SLIDE 5

The Simplest Sequence Labeler: “Local” Classifier

Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,

ŷi = argmax_{y∈L} s(x, i, y) = argmax_{y∈L} w · φ(x, i, y)    (when s is linear)

Decide the label for each word independently.

5 / 98

slide-6
SLIDE 6

The Simplest Sequence Labeler: “Local” Classifier

Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,

ŷi = argmax_{y∈L} s(x, i, y) = argmax_{y∈L} w · φ(x, i, y)    (when s is linear)

Decide the label for each word independently. Sometimes this works!

6 / 98

slide-7
SLIDE 7

The Simplest Sequence Labeler: “Local” Classifier

Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,

ŷi = argmax_{y∈L} s(x, i, y) = argmax_{y∈L} w · φ(x, i, y)    (when s is linear)

Decide the label for each word independently. Sometimes this works! We can do better when there are predictable relationships between Yi and Yi+1.

7 / 98

slide-8
SLIDE 8

Generative Sequence Labeling: Hidden Markov Models

p(x, y) = ∏_{i=1}^{ℓ+1} p(xi | yi) · p(yi | yi−1)

For each state/label y ∈ L:

◮ p(Xi | Yi = y) is the “emission” distribution for y
◮ p(Yi | Yi−1 = y) is the “transition” distribution for y

Assume Y0 is always a start state and Yℓ+1 is always a stop state; xℓ+1 is always the stop symbol.

8 / 98
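As a concrete illustration of this factorization, a minimal sketch; the vocabulary, labels, and probability tables below are toy assumptions invented for the example, not from the lecture:

```python
# Sketch of the HMM joint probability p(x, y) = prod_i p(x_i | y_i) * p(y_i | y_{i-1}),
# with y_0 = START and a final transition to STOP playing the role of position l+1.
START, STOP = "<s>", "</s>"

emit = {  # p(x | y), toy values
    "D": {"the": 1.0},
    "N": {"dog": 0.5, "barks": 0.5},
    "V": {"barks": 1.0},
}
trans = {  # p(y' | y), toy values
    START: {"D": 1.0},
    "D": {"N": 1.0},
    "N": {"V": 0.5, STOP: 0.5},
    "V": {STOP: 1.0},
}

def joint_prob(xs, ys):
    """p(x, y) under the HMM; unseen emissions/transitions get probability 0."""
    p, prev = 1.0, START
    for x, y in zip(xs, ys):
        p *= trans[prev].get(y, 0.0) * emit[y].get(x, 0.0)
        prev = y
    return p * trans[prev].get(STOP, 0.0)

print(joint_prob(["the", "dog", "barks"], ["D", "N", "V"]))  # 0.25
```

Any labeling that uses an unseen transition scores zero, e.g. ["D", "N", "N"] here.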

slide-9
SLIDE 9

Graphical Representation of Hidden Markov Models

[Diagram: the chain Y0 → Y1 → · · · → Y5, with each Yi emitting Xi for i = 1, . . . , 5; here Y0 is the start state, Y5 the stop state, and x5 the stop symbol.]

Note: handling of the beginning and end of the sequence is a bit different than before. The last x is known, since p(Xℓ+1 = ⟨stop⟩ | Yℓ+1 = ⟨stop state⟩) = 1.

9 / 98

slide-10
SLIDE 10

Structured vs. Not

Each of these has an advantage over the other:

◮ The HMM lets the different labels “interact.”
◮ The local classifier makes all of x available for every decision.

10 / 98

slide-11
SLIDE 11

Prediction with HMMs

The classical HMM tells us to choose:

argmax_{y∈Lℓ+1} ∏_{i=1}^{ℓ+1} p(xi | yi) · p(yi | yi−1)

How to optimize over |L|ℓ choices without explicit enumeration?

11 / 98

slide-12
SLIDE 12

Prediction with HMMs

The classical HMM tells us to choose:

argmax_{y∈Lℓ+1} ∏_{i=1}^{ℓ+1} p(xi | yi) · p(yi | yi−1)

How to optimize over |L|ℓ choices without explicit enumeration? Key: exploit the conditional independence assumptions:

Yi ⊥ Y1:i−2 | Yi−1    and    Yi ⊥ Yi+2:ℓ | Yi+1

12 / 98

slide-13
SLIDE 13

Part-of-Speech Tagging Example

I suspect the present forecast is pessimistic .

Candidate tags for each word: noun, adj., adv., verb, num., det., punc.

With this very simple tag set of seven, there are 7⁸ ≈ 5.7 million labelings of the eight tokens.

(Even restricting to the possibilities above, 288 labelings.)

13 / 98

slide-14
SLIDE 14

Two Obvious Solutions

Brute force: enumerate all solutions, score them, pick the best.

Greedy: pick each ŷi according to:

ŷi = argmax_{y∈L} p(y | ŷi−1) · p(xi | y)

What’s wrong with these?

14 / 98

slide-15
SLIDE 15

Two Obvious Solutions

Brute force: enumerate all solutions, score them, pick the best.

Greedy: pick each ŷi according to:

ŷi = argmax_{y∈L} p(y | ŷi−1) · p(xi | y)

What’s wrong with these? Consider: “the old dog the footsteps of the young” (credit: Julia Hirschberg); “the horse raced past the barn fell”

15 / 98

slide-16
SLIDE 16

Conditional Independence

We can get an exact solution in polynomial time!

Yi ⊥ Y1:i−2 | Yi−1    and    Yi ⊥ Yi+2:ℓ | Yi+1

Given the labels adjacent to Yi, the others do not matter. Let’s start at the last position, ℓ . . .

16 / 98

slide-17
SLIDE 17

High-Level View of Viterbi

◮ The decision about Yℓ is a function of yℓ−1, xℓ, and nothing else!

p(Yℓ = y | x, y1:(ℓ−1)) = p(Yℓ = y | Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
= p(Yℓ = y, Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩) / p(Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
∝ p(⟨stop⟩ | y) · p(xℓ | y) · p(y | yℓ−1)

17 / 98

slide-18
SLIDE 18

High-Level View of Viterbi

◮ The decision about Yℓ is a function of yℓ−1, xℓ, and nothing else!

p(Yℓ = y | x, y1:(ℓ−1)) = p(Yℓ = y | Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
= p(Yℓ = y, Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩) / p(Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
∝ p(⟨stop⟩ | y) · p(xℓ | y) · p(y | yℓ−1)

◮ If, for each value of yℓ−1, we knew the best y1:(ℓ−1), then picking yℓ would be easy.

18 / 98

slide-19
SLIDE 19

High-Level View of Viterbi

◮ The decision about Yℓ is a function of yℓ−1, xℓ, and nothing else!

p(Yℓ = y | x, y1:(ℓ−1)) = p(Yℓ = y | Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
= p(Yℓ = y, Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩) / p(Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
∝ p(⟨stop⟩ | y) · p(xℓ | y) · p(y | yℓ−1)

◮ If, for each value of yℓ−1, we knew the best y1:(ℓ−1), then picking yℓ would be easy.

◮ Idea: for each position i, calculate the score of the best label prefix y1:i ending in each possible value for Yi.

19 / 98

slide-20
SLIDE 20

High-Level View of Viterbi

◮ The decision about Yℓ is a function of yℓ−1, xℓ, and nothing else!

p(Yℓ = y | x, y1:(ℓ−1)) = p(Yℓ = y | Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
= p(Yℓ = y, Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩) / p(Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
∝ p(⟨stop⟩ | y) · p(xℓ | y) · p(y | yℓ−1)

◮ If, for each value of yℓ−1, we knew the best y1:(ℓ−1), then picking yℓ would be easy.

◮ Idea: for each position i, calculate the score of the best label prefix y1:i ending in each possible value for Yi.

◮ With a little bookkeeping, we can then trace backwards and recover the best label sequence.

20 / 98

slide-21
SLIDE 21

Chart Data Structure

[Chart: a table with one row per label (y, y′, . . . , ylast) and one column per position (x1, x2, . . . , xℓ); the cells will hold prefix scores.]

21 / 98

slide-22
SLIDE 22

Recurrence

First, think about the score of the best sequence. Let si(y) be the score of the best label sequence for x1:i that ends in y. It is defined recursively:

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

22 / 98

slide-23
SLIDE 23

Recurrence

First, think about the score of the best sequence. Let si(y) be the score of the best label sequence for x1:i that ends in y. It is defined recursively:

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

sℓ−1(y) = p(xℓ−1 | y) · max_{y′∈L} p(y | y′) · sℓ−2(y′)

23 / 98

slide-24
SLIDE 24

Recurrence

First, think about the score of the best sequence. Let si(y) be the score of the best label sequence for x1:i that ends in y. It is defined recursively:

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

sℓ−1(y) = p(xℓ−1 | y) · max_{y′∈L} p(y | y′) · sℓ−2(y′)

sℓ−2(y) = p(xℓ−2 | y) · max_{y′∈L} p(y | y′) · sℓ−3(y′)

24 / 98

slide-25
SLIDE 25

Recurrence

First, think about the score of the best sequence. Let si(y) be the score of the best label sequence for x1:i that ends in y. It is defined recursively:

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

sℓ−1(y) = p(xℓ−1 | y) · max_{y′∈L} p(y | y′) · sℓ−2(y′)

sℓ−2(y) = p(xℓ−2 | y) · max_{y′∈L} p(y | y′) · sℓ−3(y′)

. . .

si(y) = p(xi | y) · max_{y′∈L} p(y | y′) · si−1(y′)

25 / 98

slide-26
SLIDE 26

Recurrence

First, think about the score of the best sequence. Let si(y) be the score of the best label sequence for x1:i that ends in y. It is defined recursively:

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

sℓ−1(y) = p(xℓ−1 | y) · max_{y′∈L} p(y | y′) · sℓ−2(y′)

sℓ−2(y) = p(xℓ−2 | y) · max_{y′∈L} p(y | y′) · sℓ−3(y′)

. . .

si(y) = p(xi | y) · max_{y′∈L} p(y | y′) · si−1(y′)

. . .

s1(y) = p(x1 | y) · p(y | y0)

26 / 98

slide-27
SLIDE 27

Viterbi Procedure (Part I: Prefix Scores)

[Chart: one row per label (y, y′, . . . , ylast), one column per position (x1, . . . , xℓ); still empty.]

27 / 98

slide-28
SLIDE 28

Viterbi Procedure (Part I: Prefix Scores)

[Chart: the first column now filled with s1(y), s1(y′), . . . , s1(ylast).]

s1(y) = p(x1 | y) · p(y | y0)

28 / 98

slide-29
SLIDE 29

Viterbi Procedure (Part I: Prefix Scores)

[Chart: the first two columns filled with s1(·) and s2(·).]

si(y) = p(xi | y) · max_{y′∈L} p(y | y′) · si−1(y′)

29 / 98

slide-30
SLIDE 30

Viterbi Procedure (Part I: Prefix Scores)

[Chart: all columns filled, through sℓ(·).]

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

30 / 98

slide-31
SLIDE 31

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

31 / 98

slide-32
SLIDE 32

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

max_{y∈L} sℓ(y) = max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

32 / 98

slide-33
SLIDE 33

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

max_{y∈L} sℓ(y) = max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · sℓ−2(y′′)

33 / 98

slide-34
SLIDE 34

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

max_{y∈L} sℓ(y) = max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · sℓ−2(y′′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · p(xℓ−2 | y′′) · max_{y′′′∈L} p(y′′ | y′′′) · sℓ−3(y′′′)

34 / 98

slide-35
SLIDE 35

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

max_{y∈L} sℓ(y) = max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · sℓ−2(y′′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · p(xℓ−2 | y′′) · max_{y′′′∈L} p(y′′ | y′′′) · sℓ−3(y′′′)

= max_{y∈Lℓ+1} p(⟨stop⟩ | yℓ) · p(xℓ | yℓ) · p(yℓ | yℓ−1) · p(xℓ−1 | yℓ−1) · p(yℓ−1 | yℓ−2) · p(xℓ−2 | yℓ−2) · · · p(x1 | y1) · p(y1 | y0)

35 / 98

slide-36
SLIDE 36

Claim: max_{y∈L} sℓ(y) = max_{y∈Lℓ+1} p(x, y)

max_{y∈L} sℓ(y) = max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · sℓ−2(y′′)

= max_{y∈L} p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · p(xℓ−1 | y′) · max_{y′′∈L} p(y′ | y′′) · p(xℓ−2 | y′′) · max_{y′′′∈L} p(y′′ | y′′′) · sℓ−3(y′′′)

= max_{y∈Lℓ+1} p(⟨stop⟩ | yℓ) · p(xℓ | yℓ) · p(yℓ | yℓ−1) · p(xℓ−1 | yℓ−1) · p(yℓ−1 | yℓ−2) · p(xℓ−2 | yℓ−2) · · · p(x1 | y1) · p(y1 | y0)

= max_{y∈Lℓ+1} ∏_{i=1}^{ℓ+1} p(xi | yi) · p(yi | yi−1)

36 / 98

slide-37
SLIDE 37

High-Level View of Viterbi

◮ The decision about Yℓ is a function of yℓ−1, xℓ, and nothing else!

p(Yℓ = y | x, y1:(ℓ−1)) = p(Yℓ = y | Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
= p(Yℓ = y, Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩) / p(Xℓ = xℓ, Yℓ−1 = yℓ−1, Yℓ+1 = ⟨stop⟩)
∝ p(⟨stop⟩ | y) · p(xℓ | y) · p(y | yℓ−1)

◮ If, for each value of yℓ−1, we knew the best y1:(ℓ−1), then picking yℓ would be easy.

◮ Idea: for each position i, calculate the score of the best label prefix y1:i ending in each possible value for Yi.

◮ With a little bookkeeping, we can then trace backwards and recover the best label sequence.

37 / 98

slide-38
SLIDE 38

Viterbi Procedure (Part I: Prefix Scores and Backpointers)

[Chart: one row per label (y, y′, . . . , ylast), one column per position (x1, . . . , xℓ); cells will hold prefix scores and backpointers.]

38 / 98

slide-39
SLIDE 39

Viterbi Procedure (Part I: Prefix Scores and Backpointers)

[Chart: the first column filled with s1(·) and backpointers b1(·).]

s1(y) = p(x1 | y) · p(y | y0)    b1(y) = y0

39 / 98

slide-40
SLIDE 40

Viterbi Procedure (Part I: Prefix Scores and Backpointers)

[Chart: the first two columns filled with si(·) and bi(·).]

si(y) = p(xi | y) · max_{y′∈L} p(y | y′) · si−1(y′)

bi(y) = argmax_{y′∈L} p(y | y′) · si−1(y′)

40 / 98

slide-41
SLIDE 41

Viterbi Procedure (Part I: Prefix Scores and Backpointers)

[Chart: all columns filled with si(·) and bi(·), through position ℓ.]

sℓ(y) = p(⟨stop⟩ | y) · p(xℓ | y) · max_{y′∈L} p(y | y′) · sℓ−1(y′)

bℓ(y) = argmax_{y′∈L} p(y | y′) · sℓ−1(y′)

41 / 98

slide-42
SLIDE 42

Full Viterbi Procedure

Input: x, p(Xi | Yi), p(Yi+1 | Yi). Output: ŷ.

  • 1. For i ∈ 1, . . . , ℓ, solve for si(∗) and bi(∗):

◮ Special base case for i = 1 to handle the start state y0 (no max)
◮ General recurrence for i ∈ 2, . . . , ℓ − 1
◮ Special case for i = ℓ to handle the stopping probability

  • 2. ŷℓ ← argmax_{y∈L} sℓ(y)

  • 3. For i ∈ ℓ, . . . , 2: ŷi−1 ← bi(ŷi)

42 / 98
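The full procedure, prefix scores, backpointers, and the backward trace, can be sketched as follows; log space avoids underflow, and the tables in the usage example are toy assumptions, not from the lecture:

```python
import math

# Viterbi as in the procedure above: prefix scores s_i(y), backpointers b_i(y),
# then trace back from the best final label.  Zero probabilities are floored
# at 1e-300 before taking logs so that impossible paths score very badly.
START, STOP = "<s>", "</s>"

def _log(p):
    return math.log(p if p > 0.0 else 1e-300)

def viterbi(xs, labels, emit, trans):
    # Base case (i = 1): s_1(y) = p(x_1 | y) * p(y | y_0), b_1(y) = y_0.
    s = [{y: _log(trans[START].get(y, 0.0)) + _log(emit[y].get(xs[0], 0.0))
          for y in labels}]
    b = [{y: START for y in labels}]
    # General recurrence (i = 2 .. l).
    for x in xs[1:]:
        si, bi = {}, {}
        for y in labels:
            best = max(labels, key=lambda yp: s[-1][yp] + _log(trans[yp].get(y, 0.0)))
            bi[y] = best
            si[y] = s[-1][best] + _log(trans[best].get(y, 0.0)) + _log(emit[y].get(x, 0.0))
        s.append(si)
        b.append(bi)
    # Special case at i = l: fold in the stopping probability.
    for y in labels:
        s[-1][y] += _log(trans[y].get(STOP, 0.0))
    # Trace backpointers to recover the best label sequence.
    ys = [max(labels, key=lambda y: s[-1][y])]
    for bi in reversed(b[1:]):
        ys.append(bi[ys[-1]])
    return list(reversed(ys))

emit = {"D": {"the": 1.0}, "N": {"dog": 0.5, "barks": 0.5}, "V": {"barks": 1.0}}
trans = {START: {"D": 1.0}, "D": {"N": 1.0}, "N": {"V": 0.5, STOP: 0.5}, "V": {STOP: 1.0}}
print(viterbi(["the", "dog", "barks"], ["D", "N", "V"], emit, trans))  # ['D', 'N', 'V']
```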

slide-43
SLIDE 43

Viterbi Asymptotics

Space: O(|L|ℓ). Runtime: O(|L|²ℓ).

[Chart as before: |L| rows, ℓ columns; each of the |L|ℓ cells takes O(|L|) time to fill.]

43 / 98

slide-44
SLIDE 44

Generalizing Viterbi

◮ Instead of HMM parameters, we can “featurize” or “neuralize.”

44 / 98

slide-45
SLIDE 45

Generalizing Viterbi

◮ Instead of HMM parameters, we can “featurize” or “neuralize.”

Define features of adjacent labeled words in context: φ(x, i, y, y′). “Structured” classifier/predictor:

ŷ = argmax_{y∈Lℓ+1} ∑_{i=1}^{ℓ+1} w · φ(x, i, yi, yi−1)

= argmax_{y∈Lℓ+1} ∑_{i=1}^{ℓ+1} log p(xi | yi) + log p(yi | yi−1)    (in the HMM special case)

45 / 98

slide-46
SLIDE 46

Generalizing Viterbi

◮ Instead of HMM parameters, we can “featurize” or “neuralize.”
◮ Viterbi instantiates a general algorithm called max-product variable elimination, for inference along a chain of variables with pairwise “links.” HMMs are the simplest example of a structured predictor: a collection of classifiers whose decisions depend on each other.

46 / 98

slide-47
SLIDE 47

Generalizing Viterbi

◮ Instead of HMM parameters, we can “featurize” or “neuralize.”
◮ Viterbi instantiates a general algorithm called max-product variable elimination, for inference along a chain of variables with pairwise “links.” HMMs are the simplest example of a structured predictor: a collection of classifiers whose decisions depend on each other.
◮ Viterbi solves a special case of the “best path” problem.

[Diagram: a lattice with one node per (position, label) pair, Yi ∈ {N, V, A} for i = 0, . . . , 4, plus an initial node and a final Y5 = ⟨stop⟩ node; each labeling is a path from start to stop.]

47 / 98

slide-48
SLIDE 48

Generalizing Viterbi

◮ Instead of HMM parameters, we can “featurize” or “neuralize.”
◮ Viterbi instantiates a general algorithm called max-product variable elimination, for inference along a chain of variables with pairwise “links.” HMMs are the simplest example of a structured predictor: a collection of classifiers whose decisions depend on each other.
◮ Viterbi solves a special case of the “best path” problem.
◮ Higher-order dependencies among Y are also possible:

si(y, y′) = max_{y′′∈L} p(xi | y) · p(y | y′, y′′) · si−1(y′, y′′)

48 / 98

slide-49
SLIDE 49

Applications of Sequence Models

◮ part-of-speech tagging (Church, 1988)
◮ supersense tagging (Ciaramita and Altun, 2006)
◮ named-entity recognition (Bikel et al., 1999)
◮ multiword expressions (Schneider and Smith, 2015)
◮ base noun phrase chunking (Sha and Pereira, 2003)

49 / 98

slide-50
SLIDE 50

Parts of Speech

http://mentalfloss.com/article/65608/master-particulars-grammar-pop-culture-primer

50 / 98

slide-51
SLIDE 51

Parts of Speech

◮ “Open classes”: nouns, verbs, adjectives, adverbs, numbers
◮ “Closed classes”:

◮ Modal verbs
◮ Prepositions (on, to)
◮ Particles (off, up)
◮ Determiners (the, some)
◮ Pronouns (she, they)
◮ Conjunctions (and, or)

51 / 98

slide-52
SLIDE 52

Parts of Speech in English: Decisions

Granularity decisions regarding:

◮ verb tenses, participles
◮ plural/singular for verbs, nouns
◮ proper nouns
◮ comparative, superlative adjectives and adverbs

Some linguistic reasoning required:

◮ Existential there
◮ Infinitive marker to
◮ wh words (pronouns, adverbs, determiners, possessive whose)

Interactions with tokenization:

◮ Punctuation
◮ Compounds (Mark’ll, someone’s, gonna)

Penn Treebank: 45 tags, ∼40 pages of guidelines (Marcus et al., 1993)

52 / 98

slide-53
SLIDE 53

Parts of Speech in English: Decisions

Granularity decisions regarding:

◮ verb tenses, participles
◮ plural/singular for verbs, nouns
◮ proper nouns
◮ comparative, superlative adjectives and adverbs

Some linguistic reasoning required:

◮ Existential there
◮ Infinitive marker to
◮ wh words (pronouns, adverbs, determiners, possessive whose)

Interactions with tokenization:

◮ Punctuation
◮ Compounds (Mark’ll, someone’s, gonna)
◮ Social media: hashtag, at-mention, discourse marker (RT), URL, emoticon, abbreviations, interjections, acronyms

Penn Treebank: 45 tags, ∼40 pages of guidelines (Marcus et al., 1993)
TweetNLP: 20 tags, 7 pages of guidelines (Gimpel et al., 2011)

53 / 98

slide-54
SLIDE 54

Example: Part-of-Speech Tagging

ikr smh he asked fir yo last name so he can add u on fb lololol

54 / 98

slide-55
SLIDE 55

Example: Part-of-Speech Tagging

ikr smh he asked fir yo last name so he can add u on fb lololol

Glosses: ikr = “I know, right”; smh = “shake my head”; fir = “for”; yo = “your”; u = “you”; fb = “Facebook”; lololol = “laugh out loud”

55 / 98

slide-56
SLIDE 56

Example: Part-of-Speech Tagging

ikr smh he asked fir yo last name so he can add u on fb lololol
!   G   O  V     P   D  A    N    P  O  V   V   O P  ∧  !

Glosses as before; tag key: ! interjection, G acronym, O pronoun, V verb, P preposition, D determiner, A adjective, N noun, ∧ proper noun

56 / 98

slide-57
SLIDE 57

Why POS?

◮ Text-to-speech: record, lead, protest
◮ Lemmatization: saw/V → see; saw/N → saw
◮ Quick-and-dirty multiword expressions: (Adjective | Noun)∗ Noun (Justeson and Katz, 1995)
◮ Preprocessing for harder disambiguation problems:

◮ The Georgia branch had taken on loan commitments . . .
◮ The average of interbank offered rates plummeted . . .

57 / 98

slide-58
SLIDE 58

A Simple POS Tagger

Define a map V → L.

58 / 98

slide-59
SLIDE 59

A Simple POS Tagger

Define a map V → L. How to pick the single POS for each word? E.g., raises, Fed, . . .

59 / 98

slide-60
SLIDE 60

A Simple POS Tagger

Define a map V → L. How to pick the single POS for each word? E.g., raises, Fed, . . . Penn Treebank: most frequent tag rule gives 90.3%, 93.7% if you’re clever about handling unknown words.

60 / 98
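The most-frequent-tag rule is a few lines of counting; a sketch over invented toy data (the training sentences and tags below are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: map each word type to the tag it takes most
# often in training; unseen words back off to the overall most frequent tag.
def train_mft(tagged_sents):
    by_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            by_word[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    table = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
    return lambda words: [table.get(w, default) for w in words]

train = [[("the", "D"), ("dog", "N"), ("barks", "V")],
         [("the", "D"), ("old", "A"), ("dog", "N")],
         [("dog", "N")]]
tagger = train_mft(train)
print(tagger(["the", "old", "cat"]))  # ['D', 'A', 'N'] -- "cat" unseen, backs off to N
```

Being clever about unknown words (the 90.3% vs. 93.7% gap on the slide) amounts to replacing that single back-off tag with a smarter guesser, e.g. one based on suffixes and capitalization.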

slide-61
SLIDE 61

A Simple POS Tagger

Define a map V → L. How to pick the single POS for each word? E.g., raises, Fed, . . . Penn Treebank: most frequent tag rule gives 90.3%, 93.7% if you’re clever about handling unknown words. All datasets have some errors; estimated upper bound for Penn Treebank is 98%.

61 / 98

slide-62
SLIDE 62

Supervised Training of Hidden Markov Models

Given: annotated sequences ⟨x1, y1⟩, . . . , ⟨xn, yn⟩

p(x, y) = ∏_{i=1}^{ℓ+1} θxi|yi · γyi|yi−1

Parameters, for each state/label y ∈ L:

◮ θ∗|y is the “emission” distribution, estimating p(x | y) for each x ∈ V
◮ γ∗|y is the “transition” distribution, estimating p(y′ | y) for each y′ ∈ L

62 / 98

slide-63
SLIDE 63

Supervised Training of Hidden Markov Models

Given: annotated sequences ⟨x1, y1⟩, . . . , ⟨xn, yn⟩

p(x, y) = ∏_{i=1}^{ℓ+1} θxi|yi · γyi|yi−1

Parameters, for each state/label y ∈ L:

◮ θ∗|y is the “emission” distribution, estimating p(x | y) for each x ∈ V
◮ γ∗|y is the “transition” distribution, estimating p(y′ | y) for each y′ ∈ L

Maximum likelihood estimate: count and normalize!

63 / 98
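“Count and normalize” can be sketched directly; the toy training data is an assumption for illustration:

```python
from collections import Counter

# Maximum likelihood estimation of the HMM's emission (theta) and transition
# (gamma) parameters: count events in the tagged data, then normalize each
# conditional distribution by its context count.
START, STOP = "<s>", "</s>"

def mle(tagged_sents):
    emit_c, trans_c, tag_c, prev_c = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            emit_c[tag, word] += 1
            tag_c[tag] += 1
            trans_c[prev, tag] += 1
            prev_c[prev] += 1
            prev = tag
        trans_c[prev, STOP] += 1   # the l+1 stop transition
        prev_c[prev] += 1
    theta = {(t, w): c / tag_c[t] for (t, w), c in emit_c.items()}
    gamma = {(p, t): c / prev_c[p] for (p, t), c in trans_c.items()}
    return theta, gamma

theta, gamma = mle([[("the", "D"), ("dog", "N")], [("a", "D"), ("cat", "N")]])
print(theta["D", "the"], gamma["N", STOP])  # 0.5 1.0
```

In practice these counts are smoothed; the raw MLE assigns zero probability to any unseen word/tag pair.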

slide-64
SLIDE 64

Back to POS

TnT, a trigram HMM tagger with smoothing: 96.7% (Brants, 2000)

64 / 98

slide-65
SLIDE 65

Back to POS

TnT, a trigram HMM tagger with smoothing: 96.7% (Brants, 2000)

State of the art: ∼97.5% (Toutanova et al., 2003); uses a feature-based model with:

◮ capitalization features
◮ spelling features
◮ name lists (“gazetteers”)
◮ context words
◮ hand-crafted patterns

65 / 98

slide-66
SLIDE 66

Back to POS

TnT, a trigram HMM tagger with smoothing: 96.7% (Brants, 2000)

State of the art: ∼97.5% (Toutanova et al., 2003); uses a feature-based model with:

◮ capitalization features
◮ spelling features
◮ name lists (“gazetteers”)
◮ context words
◮ hand-crafted patterns

There might be very recent improvements to this.

66 / 98

slide-67
SLIDE 67

Other Labels

Parts of speech are a minimal syntactic representation. Sequence labeling can get you a lightweight semantic representation, too.

67 / 98

slide-68
SLIDE 68

Supersenses

A problem with a long history: word-sense disambiguation.

68 / 98

slide-69
SLIDE 69

Supersenses

A problem with a long history: word-sense disambiguation. Classical approaches assumed you had a list of ambiguous words and their senses.

◮ E.g., from a dictionary

69 / 98

slide-70
SLIDE 70

Supersenses

A problem with a long history: word-sense disambiguation. Classical approaches assumed you had a list of ambiguous words and their senses.

◮ E.g., from a dictionary

Ciaramita and Johnson (2003) and Ciaramita and Altun (2006) used a lexicon called WordNet to define 41 semantic classes for words.

◮ WordNet (Fellbaum, 1998) is a fascinating resource in its own right! See http://wordnetweb.princeton.edu/perl/webwn to get an idea.

70 / 98

slide-71
SLIDE 71

Supersenses

A problem with a long history: word-sense disambiguation. Classical approaches assumed you had a list of ambiguous words and their senses.

◮ E.g., from a dictionary

Ciaramita and Johnson (2003) and Ciaramita and Altun (2006) used a lexicon called WordNet to define 41 semantic classes for words.

◮ WordNet (Fellbaum, 1998) is a fascinating resource in its own right! See http://wordnetweb.princeton.edu/perl/webwn to get an idea.

This represents a coarsening of the annotations in the Semcor corpus (Miller et al., 1993).

71 / 98

slide-72
SLIDE 72

Example: box’s Thirteen Synonym Sets, Eight Supersenses

  • 1. box: a (usually rectangular) container; may have a lid. “he rummaged through a box of spare parts”
  • 2. box/loge: private area in a theater or grandstand where a small group can watch the performance. “the

royal box was empty”

  • 3. box/boxful: the quantity contained in a box. “he gave her a box of chocolates”
  • 4. corner/box: a predicament from which a skillful or graceful escape is impossible. “his lying got him into a

tight corner”

  • 5. box: a rectangular drawing. “the flowchart contained many boxes”
  • 6. box/boxwood: evergreen shrubs or small trees
  • 7. box: any one of several designated areas on a ball field where the batter or catcher or coaches are positioned. “the umpire warned the batter to stay in the batter’s box”
  • 8. box/box seat: the driver’s seat on a coach. “an armed guard sat in the box with the driver”
  • 9. box: separate partitioned area in a public place for a few people. “the sentry stayed in his box to avoid

the cold”

  • 10. box: a blow with the hand (usually on the ear). “I gave him a good box on the ear”
  • 11. box/package: put into a box. “box the gift, please”
  • 12. box: hit with the fist. “I’ll box your ears!”
  • 13. box: engage in a boxing match.

72 / 98

slide-73
SLIDE 73

Example: box’s Thirteen Synonym Sets, Eight Supersenses

  • 1. box: a (usually rectangular) container; may have a lid. “he rummaged through a box of spare parts”

n.artifact

  • 2. box/loge: private area in a theater or grandstand where a small group can watch the performance. “the

royal box was empty” n.artifact

  • 3. box/boxful: the quantity contained in a box. “he gave her a box of chocolates” n.quantity
  • 4. corner/box: a predicament from which a skillful or graceful escape is impossible. “his lying got him into a

tight corner” n.state

  • 5. box: a rectangular drawing. “the flowchart contained many boxes” n.shape
  • 6. box/boxwood: evergreen shrubs or small trees n.plant
  • 7. box: any one of several designated areas on a ball field where the batter or catcher or coaches are positioned. “the umpire warned the batter to stay in the batter’s box” n.artifact
  • 8. box/box seat: the driver’s seat on a coach. “an armed guard sat in the box with the driver”

n.artifact

  • 9. box: separate partitioned area in a public place for a few people. “the sentry stayed in his box to avoid

the cold” n.artifact

  • 10. box: a blow with the hand (usually on the ear). “I gave him a good box on the ear” n.act
  • 11. box/package: put into a box. “box the gift, please” v.contact
  • 12. box: hit with the fist. “I’ll box your ears!” v.contact
  • 13. box: engage in a boxing match. v.competition

73 / 98

slide-74
SLIDE 74

Supersense Tagging Example

Clara Harris , one of the guests in the box , stood up and demanded water .

Tags: Clara Harris → n.person; box → n.artifact; stood up → v.motion; demanded → v.communication; water → n.substance

74 / 98

slide-75
SLIDE 75

Ciaramita and Altun’s Approach

Features at each position in the sentence:

◮ word
◮ “first sense” from WordNet (also conjoined with word)
◮ POS, coarse POS
◮ shape (case, punctuation symbols, etc.)
◮ previous label

All of these fit into “φ(x, i, y, y′).”

75 / 98

slide-76
SLIDE 76

Featurizing HMMs

Log-probability score of y (given x) decomposes into a sum of local scores:

score(x, y) = ∑_{i=1}^{ℓ+1} (log p(xi | yi) + log p(yi | yi−1))    (1)

Featurized HMM:

score(x, y) = ∑_{i=1}^{ℓ+1} (w · φ(x, i, yi, yi−1))    (2)

= w · ∑_{i=1}^{ℓ+1} φ(x, i, yi, yi−1), the global features Φ(x, y)    (3)

76 / 98

slide-77
SLIDE 77

What Changes?

Algorithmically, not much! Viterbi recurrence before (using log math):

s1(y) = log p(x1 | y) + log p(y | y0)
si(y) = log p(xi | y) + max_{y′∈L} log p(y | y′) + si−1(y′)
sℓ(y) = log p(⟨stop⟩ | y) + log p(xℓ | y) + max_{y′∈L} log p(y | y′) + sℓ−1(y′)

After:

s1(y) = w · φ(x, 1, y, y0)
si(y) = max_{y′∈L} w · φ(x, i, y, y′) + si−1(y′)
sℓ(y) = max_{y′∈L} w · (φ(x, ℓ, y, y′) + φ(x, ℓ + 1, ⟨stop⟩, y)) + sℓ−1(y′)

77 / 98
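The “after” recurrence really is the same dynamic program with w · φ as the local score; a sketch with a small hypothetical indicator feature map φ (the features and weights are invented for illustration, not the lecture’s):

```python
# Featurized Viterbi: identical recurrence and backtrace, but the local score
# is a dot product w . phi(x, i, y, y') instead of log probabilities.
START = "<s>"

def phi(xs, i, y, yprev):
    # Sparse indicator features for position i (1-based): word/label and
    # label/label pairs.  A toy stand-in for a real feature function.
    return {("emit", xs[i - 1], y): 1.0, ("trans", yprev, y): 1.0}

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def viterbi_feat(xs, labels, w):
    s = {y: score(w, phi(xs, 1, y, START)) for y in labels}   # base case
    b = []
    for i in range(2, len(xs) + 1):
        si, bi = {}, {}
        for y in labels:
            best = max(labels, key=lambda yp: s[yp] + score(w, phi(xs, i, y, yp)))
            si[y] = s[best] + score(w, phi(xs, i, y, best))
            bi[y] = best
        s, b = si, b + [bi]
    ys = [max(labels, key=lambda y: s[y])]
    for bi in reversed(b):
        ys.append(bi[ys[-1]])
    return list(reversed(ys))

w = {("emit", "the", "D"): 2.0, ("emit", "dog", "N"): 2.0,
     ("trans", START, "D"): 1.0, ("trans", "D", "N"): 1.0}
print(viterbi_feat(["the", "dog"], ["D", "N"], w))  # ['D', 'N']
```

(For brevity this sketch omits the slide’s special stop-feature term at position ℓ + 1.)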

slide-78
SLIDE 78

Supervised Training of Sequence Models (Discriminative)

Given: annotated sequences ⟨x1, y1⟩, . . . , ⟨xn, yn⟩. Assume:

predict(x) = argmax_{y∈Lℓ+1} score(x, y)
= argmax_{y∈Lℓ+1} ∑_{i=1}^{ℓ+1} w · φ(x, i, yi, yi−1)
= argmax_{y∈Lℓ+1} w · ∑_{i=1}^{ℓ+1} φ(x, i, yi, yi−1)
= argmax_{y∈Lℓ+1} w · Φ(x, y)

Estimate: w

78 / 98

slide-79
SLIDE 79

Perceptron

Perceptron algorithm for classification:

◮ For t ∈ {1, . . . , T}:

◮ Pick it uniformly at random from {1, . . . , n}.
◮ ℓ̂it ← argmax_{ℓ∈L} w · φ(xit, ℓ)
◮ w ← w − α (φ(xit, ℓ̂it) − φ(xit, ℓit))

79 / 98
slide-80
SLIDE 80

Structured Perceptron

Collins (2002)

Perceptron algorithm for structured prediction:

◮ For t ∈ {1, . . . , T}:

◮ Pick it uniformly at random from {1, . . . , n}.
◮ ŷit ← argmax_{y∈Lℓ+1} w · Φ(xit, y)
◮ w ← w − α (Φ(xit, ŷit) − Φ(xit, yit))

This can be viewed as stochastic subgradient descent on the structured hinge loss:

∑_{i=1}^{n} [ max_{y∈Lℓi+1} w · Φ(xi, y) (“fear”) − w · Φ(xi, yi) (“hope”) ]

80 / 98
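The structured perceptron update can be sketched as follows; for brevity a brute-force argmax stands in for Viterbi, and the feature map, data, and labels are toy assumptions invented for illustration:

```python
import itertools
import random

# Structured perceptron (Collins, 2002): decode with the current weights,
# then move w toward the gold features ("hope") and away from the predicted
# features ("fear") whenever the prediction is wrong.
START = "<s>"

def phi_global(xs, ys):
    """Global features Phi(x, y): summed sparse indicator features."""
    feats, prev = {}, START
    for x, y in zip(xs, ys):
        for f in [("emit", x, y), ("trans", prev, y)]:
            feats[f] = feats.get(f, 0.0) + 1.0
        prev = y
    return feats

def decode(xs, labels, w):
    # Brute-force argmax over L^l; a real implementation would use Viterbi.
    return max(itertools.product(labels, repeat=len(xs)),
               key=lambda ys: sum(w.get(f, 0.0) * v
                                  for f, v in phi_global(xs, ys).items()))

def perceptron(data, labels, T=50, alpha=1.0, seed=0):
    rng, w = random.Random(seed), {}
    for _ in range(T):
        xs, ys = rng.choice(data)          # pick i_t uniformly at random
        yhat = decode(xs, labels, w)
        if yhat != tuple(ys):
            for f, v in phi_global(xs, yhat).items():
                w[f] = w.get(f, 0.0) - alpha * v    # subtract "fear"
            for f, v in phi_global(xs, ys).items():
                w[f] = w.get(f, 0.0) + alpha * v    # add "hope"
    return w

data = [(["the", "dog"], ("D", "N")), (["the", "cat"], ("D", "N"))]
w = perceptron(data, ["D", "N"])
print(decode(["the", "dog"], ["D", "N"], w))  # ('D', 'N')
```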

slide-81
SLIDE 81

Back to Supersenses

Clara Harris , one of the guests in the box , stood up and demanded water .

Tags: Clara Harris → n.person; box → n.artifact; stood up → v.motion; demanded → v.communication; water → n.substance

Shouldn’t Clara Harris and stood up each be “grouped”?

81 / 98

slide-82
SLIDE 82

Segmentations

Segmentation:

◮ Input: x = x1, x2, . . . , xℓ
◮ Output: x1:ℓ1, x(1+ℓ1):(ℓ1+ℓ2), x(1+ℓ1+ℓ2):(ℓ1+ℓ2+ℓ3), . . . , x(1+∑_{i=1}^{m−1} ℓi):(∑_{i=1}^{m} ℓi)    (4)

where ℓ = ∑_{i=1}^{m} ℓi.

Application: word segmentation for writing systems without whitespace.

82 / 98

slide-83
SLIDE 83

Segmentations

Segmentation:

◮ Input: x = x1, x2, . . . , xℓ ◮ Output:

x1:ℓ1, x(1+ℓ1):(ℓ1+ℓ2), x(1+ℓ1+ℓ2):(ℓ1+ℓ2+ℓ3), . . . , x(1+m−1

i=1 ℓi):m i=1 ℓi

  • (4)

where ℓ = m

i=1 ℓi.

Application: word segmentation for writing systems without whitespace. With arbitrarily long segments, this does not look like a job for φ(x, i, y, y′)!

83 / 98

slide-84
SLIDE 84

Segmentation as Sequence Labeling

Ramshaw and Marcus (1995)

Two labels: B (“beginning of new segment”), I (“inside segment”)

◮ ℓ1 = 4, ℓ2 = 3, ℓ3 = 1, ℓ4 = 2 −→ B, I, I, I, B, I, I, B, B, I

Three labels: B, I, O (“outside segment”)

Five labels: B, I, O, E (“end of segment”), S (“singleton”)

84 / 98

slide-85
SLIDE 85

Segmentation as Sequence Labeling

Ramshaw and Marcus (1995)

Two labels: B (“beginning of new segment”), I (“inside segment”)

◮ ℓ1 = 4, ℓ2 = 3, ℓ3 = 1, ℓ4 = 2 −→ B, I, I, I, B, I, I, B, B, I

Three labels: B, I, O (“outside segment”)

Five labels: B, I, O, E (“end of segment”), S (“singleton”)

Bonus: combine these with a label to get labeled segmentation!

85 / 98
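The length-to-tag encoding is mechanical; a sketch of both directions for the two-label (B/I) scheme, checked against the slide’s example:

```python
# Segment lengths -> B/I tags and back (the two-label scheme, no O tag:
# every token is inside some segment).
def lengths_to_bi(lengths):
    tags = []
    for n in lengths:
        tags += ["B"] + ["I"] * (n - 1)   # one B, then n-1 I's per segment
    return tags

def bi_to_lengths(tags):
    lengths = []
    for t in tags:
        if t == "B":
            lengths.append(1)             # start a new segment
        else:
            lengths[-1] += 1              # "I": extend the current segment
    return lengths

print(lengths_to_bi([4, 3, 1, 2]))
# ['B', 'I', 'I', 'I', 'B', 'I', 'I', 'B', 'B', 'I']
print(bi_to_lengths(["B", "I", "I", "I", "B", "I", "I", "B", "B", "I"]))  # [4, 3, 1, 2]
```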

slide-86
SLIDE 86

Named Entity Recognition as Segmentation and Labeling

An older and narrower subset of supersenses used in information extraction:

◮ person,
◮ location,
◮ organization,
◮ geopolitical entity,
◮ . . . and perhaps domain-specific additions.

86 / 98

slide-87
SLIDE 87

Named Entity Recognition

With Commander Chris Ferguson at the helm , Atlantis touched down at Kennedy Space Center .

Entities: Commander Chris Ferguson → person; Atlantis → spacecraft; Kennedy Space Center → location

87 / 98

slide-88
SLIDE 88

Named Entity Recognition

With Commander Chris Ferguson at the helm , Atlantis touched down at Kennedy Space Center .
O    B         I     I        O  O   O    O  B        O       O    O  B       I     I      O

(Commander Chris Ferguson → person; Atlantis → spacecraft; Kennedy Space Center → location)

88 / 98

slide-89
SLIDE 89

Named Entity Recognition: Evaluation

x = Britain sent warships across the English Channel Monday to rescue Britons stranded by Eyjafjallajökull ’s volcanic ash cloud .

y  = B O O O O B I B O O B O O B O O O O O
y′ = O O O O O B I B O O B O O B O O O O O
89 / 98

slide-90
SLIDE 90

Segmentation Evaluation

Typically: precision, recall, and F1.

90 / 98
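Span-level precision/recall/F1 credit a predicted segment only when its boundaries exactly match a gold segment; a sketch, using the first nine positions of the NER evaluation example:

```python
# Span-level P/R/F1 for B/I/O-tagged sequences.
def spans(bio_tags):
    """Extract (start, end) spans from B/I/O tags; end is exclusive."""
    out, start = set(), None
    for i, t in enumerate(bio_tags + ["O"]):   # sentinel closes a final span
        if t != "I" and start is not None:
            out.add((start, i))
            start = None
        if t == "B":
            start = i
    return out

def prf1(gold_tags, pred_tags):
    gold, pred = spans(gold_tags), spans(pred_tags)
    tp = len(gold & pred)                      # exact boundary matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

y  = ["B", "O", "O", "O", "O", "B", "I", "B", "O"]
yp = ["O", "O", "O", "O", "O", "B", "I", "B", "O"]
print(prf1(y, yp))  # precision 1.0, recall 2/3, F1 ~0.8: "Britain" was missed
```

In labeled NER evaluation the entity type would be part of the span identity as well.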

slide-91
SLIDE 91

Multiword Expressions

Schneider et al. (2014b)

◮ MW compounds: red tape, motion picture, daddy longlegs, Bayes net, hot air balloon, skinny dip, trash talk
◮ verb-particle: pick up, dry out, take over, cut short
◮ verb-preposition: refer to, depend on, look for, prevent from
◮ verb-noun(-preposition): pay attention (to), go bananas, lose it, break a leg, make the most of
◮ support verb: make decisions, take breaks, take pictures, have fun, perform surgery
◮ other phrasal verb: put up with, miss out (on), get rid of, look forward to, run amok, cry foul, add insult to injury, make off with
◮ PP modifier: above board, beyond the pale, under the weather, at all, from time to time, in the nick of time
◮ coordinated phrase: cut and dry, more or less, up and leave
◮ conjunction/connective: as well as, let alone, in spite of, on the face of it/on its face
◮ semi-fixed VP: smack <one>’s lips, pick up where <one> left off, go over <thing> with a fine-tooth(ed) comb, take <one>’s time, draw <oneself> up to <one>’s full height
◮ fixed phrase: easy as pie, scared to death, go to hell in a handbasket, bring home the bacon, leave of absence, sense of humor
◮ phatic: You’re welcome. Me neither!
◮ proverb: Beggars can’t be choosers. The early bird gets the worm. To each his own. One man’s <thing1> is another man’s <thing2>.

91 / 98

slide-92
SLIDE 92

Sequence Labeling with Nesting

Schneider et al. (2014a)

he was willing to budge a little on the price which means a lot to me .
O  O   O       O  B     b ī     Ī  O   O     O     B     Ĩ Ī  Ĩ  Ĩ  O

Strong (subscript) vs. weak (superscript) MWEs. One level of nesting, plus the strong/weak distinction, can be handled with an eight-tag scheme.

92 / 98

slide-93
SLIDE 93

Back to Syntax

Base noun phrase chunking: [He]NP reckons [the current account deficit]NP will narrow to [only $ 1.8 billion]NP in [September]NP (What is a base noun phrase?) “Chunking” used generically includes base verb and prepositional phrases, too. Sequence labeling with BIO tags and features can be applied to this problem (Sha and Pereira, 2003).
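Recovering the bracketed chunks from a BIO-tagged sentence is a simple grouping pass; here is a minimal sketch (not code from the slides), using the slide's base NP example:

```python
def np_chunks(tokens, tags):
    """Group tokens into chunks from B/I/O tags: B starts a chunk, I continues it, O is outside."""
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":               # start a new chunk
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag == "I":             # continue the current chunk
            current.append(tok)
        else:                        # O: outside any chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

sent = "He reckons the current account deficit will narrow to only $ 1.8 billion in September".split()
tags = "B O B I I I O O O B I I I O B".split()
chunks = np_chunks(sent, tags)
# → ['He', 'the current account deficit', 'only $ 1.8 billion', 'September']
```

With only BIO tags over a single chunk type, chunking reduces exactly to the sequence labeling machinery already described, which is why featurized taggers such as Sha and Pereira's CRF apply directly.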

93 / 98

slide-94
SLIDE 94

Remarks

Sequence models are extremely useful:

◮ syntax: part-of-speech tags, base noun phrase chunking
◮ semantics: supersense tags, named entity recognition, multiword expressions

All of these are called “shallow” methods (why?).

94 / 98

slide-95
SLIDE 95

Remarks

Sequence models are extremely useful:

◮ syntax: part-of-speech tags, base noun phrase chunking
◮ semantics: supersense tags, named entity recognition, multiword expressions

All of these are called “shallow” methods (why?). Issues to be aware of:

◮ Supervised data for these problems is not cheap.
◮ Performance always suffers when you test on a different style, genre, dialect, etc. than you trained on.
◮ Runtime depends on the size of L and the number of consecutive labels that features can depend on.
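The runtime remark can be made concrete; assuming Viterbi decoding (covered earlier in the course) with features over k consecutive labels:

```latex
O(\ell \cdot |L|^k)
\qquad \text{e.g., } |L| = 45 \text{ Penn Treebank POS tags, } k = 2:\ 45^2 = 2025 \text{ label pairs scored per position.}
```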

95 / 98

slide-96
SLIDE 96

References I

Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. An algorithm that learns what’s in a name. Machine Learning, 34(1–3):211–231, 1999. URL http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf.
Thorsten Brants. TnT – a statistical part-of-speech tagger. In Proc. of ANLP, 2000.
Kenneth W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proc. of ANLP, 1988.
Massimiliano Ciaramita and Yasemin Altun. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proc. of EMNLP, 2006.
Massimiliano Ciaramita and Mark Johnson. Supersense tagging of unknown nouns in WordNet. In Proc. of EMNLP, 2003.
Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP, 2002.
Michael Collins. Tagging with hidden Markov models, 2011. URL http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/hmms.pdf.
John M. Conroy and Dianne P. O’Leary. Text summarization via hidden Markov models. In Proc. of SIGIR, 2001.
Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

96 / 98

slide-97
SLIDE 97

References II

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proc. of ACL, 2011.
Daniel Jurafsky and James H. Martin. Hidden Markov models (draft chapter), 2016a. URL https://web.stanford.edu/~jurafsky/slp3/9.pdf.
Daniel Jurafsky and James H. Martin. Information extraction (draft chapter), 2016b. URL https://web.stanford.edu/~jurafsky/slp3/21.pdf.
Daniel Jurafsky and James H. Martin. Part-of-speech tagging (draft chapter), 2016c. URL https://web.stanford.edu/~jurafsky/slp3/10.pdf.
John S. Justeson and Slava M. Katz. Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27, 1995.
Mark D. Kernighan, Kenneth W. Church, and William A. Gale. A spelling correction program based on a noisy channel model. In Proc. of COLING, 1990.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
G. A. Miller, C. Leacock, T. Randee, and R. Bunker. A semantic concordance. In Proc. of HLT, 1993.
Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning, 1995. URL http://arxiv.org/pdf/cmp-lg/9505040.pdf.

97 / 98

slide-98
SLIDE 98

References III

Nathan Schneider and Noah A. Smith. A corpus and model integrating multiword expressions and supersenses. In Proc. of NAACL, 2015.
Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. Transactions of the Association for Computational Linguistics, 2:193–206, April 2014a.
Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation of multiword expressions in a social web corpus. In Proc. of LREC, 2014b.
Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proc. of NAACL, 2003.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of NAACL, 2003.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proc. of COLING, 1996.

98 / 98