CSEP 517 Natural Language Processing, Autumn 2015
Hidden Markov Models
Yejin Choi, University of Washington
[Many slides from Dan Klein, Michael Collins, Luke Zettlemoyer]
§ Consider the problem of jointly modeling a pair of strings
§ E.g.: part of speech tagging
§ Q: How do we map each word in the input sentence onto the appropriate label?
§ A: We can learn a joint distribution: p(x1 … xn, y1 … yn)
§ And then compute the most likely assignment:
  y* = argmax_{y1…yn} p(x1 … xn, y1 … yn)
§ We want a model of sequences y and observations x:
  p(x1 … xn, y1 … yn) = q(STOP | yn) ∏_{i=1}^{n} q(yi | yi−1) e(xi | yi)
  where y0 = START, and we call q(y′ | y) the transition distribution and e(x | y) the emission (or observation) distribution.
§ Assumptions:
  § Tag/state sequence is generated by a Markov model
  § Words are chosen independently, conditioned only on the tag/state
  § These are totally broken assumptions: why?
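To make this concrete, here is a minimal Python sketch (not from the slides) that scores a tagged sentence under the joint model above; the dictionaries q and e and the boundary symbols START/STOP are assumptions of the sketch:

    START, STOP = "<s>", "</s>"   # hypothetical boundary symbols

    def score(words, tags, q, e):
        """p(x1..xn, y1..yn) = q(STOP|yn) * prod_i q(yi|yi-1) * e(xi|yi)."""
        p, prev = 1.0, START
        for word, tag in zip(words, tags):
            p *= q.get((prev, tag), 0.0) * e.get((tag, word), 0.0)  # transition * emission
            prev = tag
        return p * q.get((prev, STOP), 0.0)  # final transition to STOP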
§ Example output, a sequence of POS tags: DT NNP NN VBD VBN RP NN NNS
§ Example: named entity recognition as sequence tagging
  Germany/BL ’s/NA representative/NA to/NA the/NA European/BO Union/CO ’s/NA veterinary/NA committee/NA Werner/BP Zwingman/CP said/NA on/NA Wednesday/NA consumers/NA should/NA …
  [Germany]LOC ’s representative to the [European Union]ORG ’s veterinary committee [Werner Zwingman]PER said on Wednesday consumers should …
§ HMM Model:
  § States Y = {NA, BL, CL, BO, CO, BP, CP} represent beginnings (BL, BO, BP) and continuations (CL, CO, CP) of chunks, as well as other words (NA)
  § Observations X = V are words
  § Transition dist’n q(yi | yi−1) models the tag sequences
  § Emission dist’n e(xi | yi) models words given their type
§ Example: word alignment as an HMM
  E: Thank you , I shall do so gladly .
  F: Gracias , lo haré de muy buen grado .
  A: 1 3 7 6 8 8 8 8 9   (Ai = position of the English word aligned to the i-th French word)
§ Model Parameters
  § Transitions: p(A2 = 3 | A1 = 1)
  § Emissions: e(F1 = Gracias | E_{A1} = Thank)
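Viewed as an HMM (a sketch of the standard formulation; this factorization is not spelled out on the slide), the hidden states are alignment positions rather than tags: p(F, A | E) = ∏_i q(ai | ai−1) e(fi | e_{ai}). The same Viterbi and forward–backward machinery then applies, with alignment positions as states.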
§ Given the model
  p(x1 … xn, y1 … yn) = q(STOP | yn) ∏_{i=1}^{n} q(yi | yi−1) e(xi | yi)
  we can ask for the most likely sequence, argmax_{y1…yn} p(x1 … xn, y1 … yn),
  or for the marginals, p(yi | x1 … xn) ∝ Σ_{y1…yi−1} Σ_{yi+1…yn} p(x1 … xn, y1 … yn)
§ Which is likely to be more sparse, q or e?
§ Step 1: Split the vocabulary
  § Frequent words: appear more than M (often 5) times
  § Low frequency: everything else
§ Step 2: Map each low frequency word to one of a small, finite set of possibilities
§ For example, based on prefixes, suffixes, etc.
§ Step 3: Learn model for this new space of possible word sequences (a code sketch follows the example below)
Word class               Example                  Intuition
twoDigitNum              90                       Two-digit year
fourDigitNum             1990                     Four-digit year
containsDigitAndAlpha    A8956-67                 Product code
containsDigitAndDash     09-96                    Date
containsDigitAndSlash    11/9/89                  Date
containsDigitAndComma    23,000.00                Monetary amount
containsDigitAndPeriod   1.00                     Monetary amount, percentage
othernum                 456789                   Other number
allCaps                  BBN                      Organization
capPeriod                M.                       Person name initial
firstWord                first word of sentence   No useful capitalization information
initCap                  Sally                    Capitalized word
lowercase                can                      Uncapitalized word
other                    ,                        Punctuation marks, all other
§ Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA
§ firstword/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA
NA = No entity SC = Start Company CC = Continue Company SL = Start Location CL = Continue Location …
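A minimal sketch of Steps 1–2 (the helper below is hypothetical; the class names come from the table above):

    import re

    def classify_word(word, index=1):
        """Map a low-frequency word to a pseudo-word class (classes from the table above)."""
        if re.fullmatch(r"\d\d", word):              return "twoDigitNum"
        if re.fullmatch(r"\d{4}", word):             return "fourDigitNum"
        if any(ch.isdigit() for ch in word):
            if any(ch.isalpha() for ch in word):     return "containsDigitAndAlpha"
            if "-" in word:                          return "containsDigitAndDash"
            if "/" in word:                          return "containsDigitAndSlash"
            if "," in word:                          return "containsDigitAndComma"
            if "." in word:                          return "containsDigitAndPeriod"
            return "othernum"
        if word.isupper() and word.isalpha():        return "allCaps"
        if re.fullmatch(r"[A-Z]\.", word):           return "capPeriod"
        if index == 0:                               return "firstWord"   # sentence-initial
        if word[:1].isupper():                       return "initCap"
        if word.islower():                           return "lowercase"
        return "other"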
§ Problem: find the most likely (Viterbi) sequence under the model
  q(NNP|♦) e(Fed|NNP) q(VBZ|NNP) e(raises|VBZ) q(NN|VBZ) …
  NNP VBZ NN NNS CD NN   logP = −23
  NNP NNS NN NNS CD NN   logP = −29
  NNP VBZ VB NNS CD NN   logP = −27
§ In principle, we’re done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence). But there are exponentially many sequences, so we need dynamic programming.
NNP VBZ NN NNS CD NN .
§ Given model parameters, we can score any sequence pair
§ We want: max_{y1…yn} p(x1 … xn, y1 … yn)
§ Define the best score over prefixes: π(i, yi) = max_{y1…yi−1} p(x1 … xi, y1 … yi)
§ Since p(x1 … xn, y1 … yn) = q(STOP | yn) ∏_{i=1}^{n} q(yi | yi−1) e(xi | yi), the max decomposes:
  π(i, yi) = max_{yi−1} e(xi | yi) q(yi | yi−1) max_{y1…yi−2} p(x1 … xi−1, y1 … yi−1)
           = max_{yi−1} e(xi | yi) q(yi | yi−1) π(i−1, yi−1)
[Figure: a worked Viterbi trellis from START to STOP, three states × four positions, filled in column by column using π(i, yi) = max_{y1…yi−1} p(x1 … xi, y1 … yi):
  i = 1: 0, 0.01, 0.03
  i = 2: 0.005, 0.007, 0
  i = 3: 0.0007, 0.0003, 0.0001
  i = 4: 0.00001, 0, 0.00003]
§ The Viterbi algorithm:
  § Goal: max_{y1…yn} p(x1 … xn, y1 … yn), where
    p(x1 … xn, y1 … yn) = q(STOP | yn) ∏_{i=1}^{n} q(yi | yi−1) e(xi | yi)
  § Define π(i, yi) = max_{y1…yi−1} p(x1 … xi, y1 … yi)
  § Iterative computation, for i = 1 … n:
    π(i, yi) = max_{yi−1} e(xi | yi) q(yi | yi−1) π(i−1, yi−1)
  § Also, store back pointers:
    bp(i, yi) = argmax_{yi−1} e(xi | yi) q(yi | yi−1) π(i−1, yi−1)
  § Finally: max_{y1…yn} p(x1 … xn, y1 … yn) = max_{yn} q(STOP | yn) π(n, yn)
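A minimal Python sketch of this procedure (reusing the hypothetical q/e dictionaries and START/STOP symbols from the scoring sketch above):

    def viterbi(words, tags, q, e):
        """Most likely tag sequence: pi(i,y) = max_{y'} e(xi|y) q(y|y') pi(i-1,y')."""
        n = len(words)
        pi, bp = {(0, START): 1.0}, {}
        prev_tags = [START]
        for i, word in enumerate(words, start=1):
            for y in tags:
                # e(xi|y) does not depend on y_{i-1}, so maximize over the rest
                best = max(prev_tags, key=lambda yp: pi[(i - 1, yp)] * q.get((yp, y), 0.0))
                pi[(i, y)] = pi[(i - 1, best)] * q.get((best, y), 0.0) * e.get((y, word), 0.0)
                bp[(i, y)] = best   # back pointer
            prev_tags = list(tags)
        # final transition to STOP, then follow back pointers
        y = max(tags, key=lambda t: pi[(n, t)] * q.get((t, STOP), 0.0))
        seq = [y]
        for i in range(n, 1, -1):
            seq.append(bp[(i, seq[-1])])
        return list(reversed(seq))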
§ Problem: find the marginal probability of each tag for yi
  q(NNP|♦) e(Fed|NNP) q(VBZ|NNP) e(raises|VBZ) q(NN|VBZ) …
  NNP VBZ NN NNS CD NN   logP = −23
  NNP NNS NN NNS CD NN   logP = −29
  NNP VBZ VB NNS CD NN   logP = −27
§ In principle, we’re done – list all possible tag sequences, score each one, and for each value of yi sum the scores of the sequences that contain it
NNP VBZ NN NNS CD NN .
§ Given model parameters, we can score any sequence pair
§ The marginal of tag yi:
  p(yi | x1 … xn) = p(x1 … xn, yi) / p(x1 … xn), where
  p(x1 … xn, yi) = Σ_{y1…yi−1} Σ_{yi+1…yn} p(x1 … xn, y1 … yn)
§ Compare it to “Viterbi inference,” which replaces the sums with a max:
  π(i, yi) = max_{y1…yi−1} p(x1 … xi, y1 … yi)
[Figure: the tagging lattice for “Fed raises interest rates”, with states ^ (START), N, V, J, D, $ (STOP) at each position; one path picks up the weights e(Fed|N), e(raises|V), e(interest|V), e(rates|J), q(V|V), q(STOP|V).]
§ Split each path at position i; the sum over paths factors into a sum over prefixes times a sum over suffixes:
  p(x1 … xn, yi) = α(i, yi) · β(i, yi)
  α(i, yi) = Σ_{y1…yi−1} p(x1 … xi, y1 … yi)   (“forward”)
  β(i, yi) = Σ_{yi+1…yn} p(xi+1 … xn, yi+1 … yn | yi)   (“backward”)
§ Forward recursion, for i = 1 … n:
  α(i, yi) = Σ_{yi−1} e(xi | yi) q(yi | yi−1) α(i−1, yi−1)
§ Backward recursion, for i = n−1 … 1:
  β(i, yi) = Σ_{yi+1} e(xi+1 | yi+1) q(yi+1 | yi) β(i+1, yi+1)
§ Note the asymmetry at the sentence boundary:
  § In the marginal probability, the length of the input is given (= n), so we know that STOP follows yn (and that START precedes y1).
  § In the “forward” probability, on the other hand, the length of the input is not specified, so we don’t know where the input stops. Even at i = n, the forward quantity does not include the final transition to STOP after yn.
  α(i, yi) = p(x1 … xi, yi) = Σ_{y1…yi−1} p(x1 … xi, y1 … yi)
  β(i, yi) = p(xi+1 … xn | yi) = Σ_{yi+1…yn} p(xi+1 … xn, yi+1 … yn | yi)
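A minimal sketch of both recursions and the resulting tag marginals (same hypothetical q/e dictionaries as before; note that this version folds the final transition to STOP into β(n, ·)):

    def forward_backward(words, tags, q, e):
        """Tag marginals p(yi | x1..xn) = alpha(i, yi) * beta(i, yi) / p(x1..xn)."""
        n = len(words)
        alpha = [dict.fromkeys(tags, 0.0) for _ in range(n + 1)]
        beta = [dict.fromkeys(tags, 0.0) for _ in range(n + 1)]
        for y in tags:                      # base case: alpha(1, y) covers x1
            alpha[1][y] = q.get((START, y), 0.0) * e.get((y, words[0]), 0.0)
        for i in range(2, n + 1):           # forward recursion
            for y in tags:
                alpha[i][y] = e.get((y, words[i - 1]), 0.0) * sum(
                    q.get((yp, y), 0.0) * alpha[i - 1][yp] for yp in tags)
        for y in tags:                      # base case: fold q(STOP|y) into beta(n, y)
            beta[n][y] = q.get((y, STOP), 0.0)
        for i in range(n - 1, 0, -1):       # backward recursion
            for y in tags:
                beta[i][y] = sum(e.get((y2, words[i]), 0.0) * q.get((y, y2), 0.0)
                                 * beta[i + 1][y2] for y2 in tags)
        z = sum(alpha[n][y] * beta[n][y] for y in tags)   # = p(x1..xn)
        return [{y: alpha[i][y] * beta[i][y] / z for y in tags} for i in range(1, n + 1)]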
§ We’ve been computing joint scores p(x1 … xn, y1 … yn); what we really want are the per-position marginals (which we now know how to compute!):
  p(yi | x1 … xn) = α(i, yi) β(i, yi) / p(x1 … xn)
§ This means we can compute the expected count of things:
  § Expected transition counts: p(yi−1, yi | x1 … xn) = α(i−1, yi−1) q(yi | yi−1) e(xi | yi) β(i, yi) / p(x1 … xn)
  § Expected emission counts: accumulate p(yi | x1 … xn) over the positions i where xi = x
§ Maximum Likelihood Parameters (Supervised Learning): the count ratios below
§ For Unsupervised Learning, replace the actual counts with the expected counts.
§ The EM algorithm:
  § Initialize transition and emission parameters
    § Random, uniform, or more informed initialization
  § Iterate until convergence:
    § E-Step: compute expected counts
    § M-Step: compute new transition and emission parameters (using the expected counts computed above)
  § Convergence? Yes. Global optimum? No.
  q_ML(yi | yi−1) = c(yi−1, yi) / c(yi−1)
  e_ML(x | y) = c(y, x) / c(y)
Equivalent to the procedure given in the textbook (J&M) – slightly different notations
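A minimal sketch of the supervised case (a hypothetical helper; for unsupervised learning, the observed counts below would be replaced by the expected counts from forward–backward):

    from collections import defaultdict

    def ml_estimate(tagged_corpus):
        """q_ML(y'|y) = c(y, y')/c(y) and e_ML(x|y) = c(y, x)/c(y)."""
        c_trans, c_emit, c_tag = defaultdict(float), defaultdict(float), defaultdict(float)
        for sentence in tagged_corpus:       # each sentence: list of (word, tag) pairs
            prev = START
            for word, tag in sentence:
                c_trans[(prev, tag)] += 1
                c_tag[prev] += 1             # count the conditioning context
                c_emit[(tag, word)] += 1
                prev = tag
            c_trans[(prev, STOP)] += 1
            c_tag[prev] += 1
        # each token of tag y is one emission and one conditioning context, so c_tag[y] = c(y)
        q = {(y, y2): c / c_tag[y] for (y, y2), c in c_trans.items()}
        e = {(y, x): c / c_tag[y] for (y, x), c in c_emit.items()}
        return q, e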
WordNet entry for “garden”:
  Noun
    S: (n) garden (a plot of ground where plants are cultivated)
    S: (n) garden (the flowers or vegetables or fruits or herbs that are cultivated in a garden)
    S: (n) garden (a yard or lawn adjoining a house)
  Verb
    S: (v) garden (work in the garden) “My hobby is gardening”
  Adjective
    S: (adj) garden (the usual or familiar type) “it is a common or garden sparrow”
HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed
§ POS Accuracy: 74.7%
§ Significant effort in specifying prior distributions
§ Integrate out the parameters e(x|y) and t(y′|y)
§ POS Accuracy: 86.8%
§ Challenge: represent p(x, y) as a log-linear model, which requires normalizing over all possible sentences x
§ Smith presents a very clever approximation, based on local neighborhoods of x
§ POS Accuracy: 90.1%
§ “It is fair to assume that neither sentence (S1) nor (S2) had ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English” (Chomsky 1957)
§ i.e., p(x1 … xn) = Σ_{y1…yn} p(x1 … xn, y1 … yn), marginalized over all possible sequences of POS tags
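Note that this marginal is exactly what the forward recursion computes: p(x1 … xn) = Σ_{yn} q(STOP | yn) α(n, yn). So the same dynamic program that gives tag marginals also lets an HMM act as a language model.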
§ In a bigram tagger, states = tags
§ In a trigram tagger, states = tag pairs
[Figure: the same chain-structured model s0 → s1 → … → sn with emissions x1 … xn, drawn twice. For a bigram tagger, s0 = <♦> and si = <yi>; for a trigram tagger, s0 = <♦,♦> and the states are the tag pairs <♦, y1>, <y1, y2>, …, <yn−1, yn>.]
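Under the pair-state encoding, the transition distribution recovers trigram probabilities (a standard identity, not spelled out on the slide): q(<yi−1, yi> | <yi−2, yi−1>) = q(yi | yi−2, yi−1), and any transition between pairs that disagree on the shared tag has probability zero.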
[Figure: the Viterbi trellis for a trigram tagger over “START Fed raises interest …”, with pair states such as <^,^>, <^,N>, <^,V>, <N,V>, <N,D>, <D,V>, <N,N>, and emission weights e(Fed|N), e(raises|D), e(interest|V) on the arcs.]