Lecture 10: Introduction to POS Tagging

CS447 Natural Language Processing (J. Hockenmaier)
https://courses.grainger.illinois.edu/cs447/


SLIDE 1

Lecture 10: Introduction to POS Tagging
(Dynamic Programming for HMMs)

SLIDE 2

HMM decoding (Viterbi)

We are given a sentence w = w(1)…w(N), e.g. w = "she promised to back the bill".
We want to use an HMM tagger to find its POS tags t:

t* = argmax_t P(w, t)
   = argmax_t P(t(1)) · P(w(1) | t(1)) · P(t(2) | t(1)) · … · P(w(N) | t(N))

But: with T tags, w has O(T^N) possible tag sequences!
To do this efficiently (in O(T^2 N) time), we will use a dynamic programming technique called the Viterbi algorithm, which exploits the independence assumptions in the HMM.
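To make the blow-up concrete, here is a minimal brute-force decoder, a sketch only (the dict-of-dicts parameters p_init, p_trans, p_emit are my assumed layout, not from the slides). It scores every one of the T^N candidate sequences directly:

from itertools import product

def brute_force_decode(words, tags, p_init, p_trans, p_emit):
    # Exponential in len(words): enumerates all T^N tag sequences.
    best_seq, best_p = None, 0.0
    for seq in product(tags, repeat=len(words)):
        p = p_init[seq[0]] * p_emit[seq[0]][words[0]]
        for i in range(1, len(words)):
            p *= p_trans[seq[i-1]][seq[i]] * p_emit[seq[i]][words[i]]
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p

Viterbi computes the same argmax in O(T^2 N) time by sharing work across sequences.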

SLIDE 3

Dynamic programming

Dynamic programming is a general technique to solve certain complex search problems by memoization:

1.) Recursively decompose the large search problem into smaller subproblems that can be solved efficiently
– There is only a polynomial number of subproblems.

2.) Store (memoize) the solutions of each subproblem in a common data structure
– Processing this data structure takes polynomial time.
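As a minimal illustration of memoization (my example, not from the slides), compare naive Fibonacci recursion, which recomputes the same subproblems exponentially often, with a memoized version that solves each subproblem once:

from functools import lru_cache

@lru_cache(maxsize=None)   # memoize: each distinct subproblem is solved once
def fib(n: int) -> int:
    # Without the cache this recursion takes O(2^n) calls;
    # with it, only the n distinct subproblems are computed.
    return n if n < 2 else fib(n - 1) + fib(n - 2)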

SLIDE 4

The Viterbi algorithm

A dynamic programming algorithm which finds the best (= most probable) tag sequence t* for an input sentence w:

t* = argmax_t P(w | t) P(t)

Complexity: linear in the sentence length. With a bigram HMM, Viterbi runs in O(T^2 N) steps for an input sentence with N words and a tag set of T tags.
The independence assumptions of the HMM tell us how to break up the big search problem (find t* = argmax_t P(w | t) P(t)) into smaller subproblems.
The data structure used to store the solutions of these subproblems is the trellis.

SLIDE 5

Bookkeeping: the trellis

We use an N×T table ("trellis") to keep track of the HMM. The HMM can assign one of the T tags to each of the N words.

[Figure: the trellis. Columns = words w(1)…w(N) ("time steps"); rows = states (tags) t1…tT. The cell in column i, row j means: word w(i) has tag tj.]
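This bookkeeping can be represented directly in code. A minimal sketch (the Cell class and make_trellis are my names; the .viterbi and .backpointer fields mirror the notation used on the later slides):

from dataclasses import dataclass

@dataclass
class Cell:
    viterbi: float = 0.0    # probability of the best tag sequence ending in this cell
    backpointer: int = -1   # row index of the best cell in the preceding column

def make_trellis(n_words: int, n_tags: int):
    # trellis[i][j]: word w(i+1) with tag t(j+1), using 0-based indices
    return [[Cell() for _ in range(n_tags)] for _ in range(n_words)]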

SLIDE 6

Computing P(t, w) for one tag sequence

One path through the trellis = one tag sequence. Its probability multiplies, along the path, the initial probability P(t(1)), each transition probability P(t(i) | t(i−1)), and each emission probability P(w(i) | t(i)).

[Figure: one path through the trellis, with its factors P(t(1) = t1), P(w(1) | t1), P(tj | t1), P(w(2) | tj), …, P(w(N) | tj) attached to the nodes and edges of the path.]
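A sketch of this computation for one given tag sequence (same assumed parameter layout as before):

def joint_prob(words, tag_seq, p_init, p_trans, p_emit):
    # P(w, t) for one path: initial prob, then transition x emission per word
    p = p_init[tag_seq[0]] * p_emit[tag_seq[0]][words[0]]
    for i in range(1, len(words)):
        p *= p_trans[tag_seq[i-1]][tag_seq[i]] * p_emit[tag_seq[i]][words[i]]
    return p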

SLIDE 7

Viterbi: Basic Idea

Task: Find the tag sequence t(1)…t(N−1)t(N) that maximizes the joint probability

π(t(1)) P(w(1) | t(1)) ∏_{i=2..N} P(t(i) | t(i−1)) P(w(i) | t(i))

The choice of t(1) affects the probability of t(2), which in turn affects the probability of t(3), etc.:

π(t(1)) P(w(1) | t(1)) P(t(2) | t(1)) P(w(2) | t(2)) P(t(3) | t(2)) …

→ We cannot fix t(1) (or any tag) until the end of the sentence!

SLIDE 8

Exploiting the independence assumptions

You want to find the best tag sequence t(1)t(2)t(3)… = ti tj tk …:

argmax_{ti,tj,tk,…} π(t(1) = ti) P(w(1) | t(1) = ti) P(t(2) = tj | t(1) = ti) P(w(2) | t(2) = tj) P(t(3) = tk | t(2) = tj) P(w(3) | t(3) = tk) …

Step 1: For any particular choice of t(1) = ti for w(1), compute π(t(1) = ti) P(w(1) | t(1) = ti).
This depends only on the choice of t(1) = ti.

Step 2a): For any particular choice of t(2) = tj for w(2), pick the tag ti for w(1) that gives the highest probability to

argmax_{ti} π(t(1) = ti) P(w(1) | t(1) = ti) P(t(2) = tj | t(1) = ti)

This depends only on the choices of t(1) = ti and t(2) = tj.

Step 2b): Compute P(w(2) | t(2) = tj).
This depends only on the choice of t(2) = tj.

Step 3: You have already found the best ti for any t(2) = tj. Now, for any particular choice of t(3) = tk for w(3), pick the tag tj for w(2) that gives the highest probability to

argmax_{tj} π(t(1) = ti) P(w(1) | t(1) = ti) P(t(2) = tj | t(1) = ti) P(w(2) | t(2) = tj) P(t(3) = tk | t(2) = tj)

This depends only on the choices of t(2) = tj and t(3) = tk.

In general: for all words i = 1…N in the sentence, and for all tags j = 1…T in the tag set, find the best tag sequence t(1)…t(i) that ends in t(i) = tj.

SLIDE 9

Viterbi: Basic Idea

Assume we knew (for any tag tj) the maximum probability of any complete sequence t(1)…t(N) that ends in that tag, t(N) = tj. [N: last word in w]

Call that probability the Viterbi probability of tag tj at position N, and store it as trellis[N][j].viterbi.

Then, the probability of the best tag sequence for our sentence (i.e. the maximum probability of any complete sequence t(1)…t(N)) is max_{k∈{1,..,T}} (trellis[N][k].viterbi).

SLIDE 10

Viterbi: Basic Idea

Viterbi probability of tag tj for word w(i): trellis[i][j].viterbi
The highest probability P(w(1)…(i), t(1)…(i)) of the prefix w(1)…(i) and any tag sequence t(1)…(i) ending in t(i) = tj:

trellis[i][j].viterbi = max P(w(1)…w(i), t(1)…, t(i) = tj)

The probability of the best tag sequence overall is given by

max_k trellis[N][k].viterbi

(the largest entry in the last column of the trellis). The Viterbi probability trellis[i][j].viterbi (for any cell in the trellis) can easily be computed based on the cells in the preceding column, trellis[i−1][k].viterbi.

SLIDE 11

Viterbi: Basic Idea

Viterbi probability of tag tj for word w(i): trellis[i][j].viterbi
The highest probability P(w(1)…(i), t(1)…(i)) of the prefix w(1)…(i) and any tag sequence t(1)…(i) ending in t(i) = tj.

Base case: first word w(1) in the sentence (initial probability for tag tj × emission probability for w(1)):

trellis[1][j].viterbi = π(tj) P(w(1) | tj)

Recurrence: any other word w(i) in the sentence (Viterbi probability of tag tk for the preceding word w(i−1) × transition probability for tj given tk × emission probability for w(i) given tj):

trellis[i][j].viterbi = max_k ( trellis[i−1][k].viterbi × P(tj | tk) P(w(i) | tj) )
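A hedged Python sketch of the base case and the recurrence, filling the trellis from the earlier Cell sketch and recording backpointers as it goes (the parameter layout is again my assumption):

def fill_trellis(words, tags, p_init, p_trans, p_emit):
    trellis = make_trellis(len(words), len(tags))
    for j, t in enumerate(tags):            # base case: first word
        trellis[0][j].viterbi = p_init[t] * p_emit[t][words[0]]
    for i in range(1, len(words)):          # recurrence: every other word
        for j, t in enumerate(tags):
            best_k = max(range(len(tags)),
                         key=lambda k: trellis[i-1][k].viterbi * p_trans[tags[k]][t])
            trellis[i][j].viterbi = (trellis[i-1][best_k].viterbi
                                     * p_trans[tags[best_k]][t]
                                     * p_emit[t][words[i]])
            trellis[i][j].backpointer = best_k
    return trellis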

SLIDE 12

Initialization

For a bigram HMM: given an N-word sentence w(1)…w(N) and a tag set consisting of T tags, create a trellis of size N×T.
In the first column, initialize each cell trellis[1][k] as

trellis[1][k] := π(tk) P(w(1) | tk)

(there is only a single tag sequence for the first word that assigns a particular tag to that word)

SLIDE 13

Viterbi: filling in the first column

We want to find the best (most likely) tag sequence for the entire sentence. Each cell trellis[i][j] (corresponding to word w(i) with tag tj) contains:

— trellis[i][j].viterbi: the probability of the best sequence ending in tj
— trellis[i][j].backpointer: to the cell k in the previous column that corresponds to the best tag sequence ending in tj

[Figure: the first column of the trellis for w(1), with entries π(DT) × P(w(1) | DT), …, π(NNS) × P(w(1) | NNS), …, π(VBZ) × P(w(1) | VBZ). Here π(DT) is the probability that a sentence starts with DT, and P(w(1) | DT) is the probability that tag DT emits word w(1).]

SLIDE 14

At any internal cell

– For each cell in the preceding column: multiply its Viterbi probability with the transition probability to the current cell.
– Keep a single backpointer to the best (highest scoring) cell in the preceding column.
– Multiply this score with the emission probability of the current word.

[Figure: the column for w(n−1), whose cells hold max P(w(1..n−1), t(n−1) = tj) for j = 1…T, feeding into the cell for tag ti at w(n) via the transition probabilities P(ti | t1), …, P(ti | tT).]

trellis[n][i].viterbi = P(w(n) | ti) · max_j ( trellis[n−1][j].viterbi · P(ti | tj) )

SLIDE 15

At the end of the sentence

In the last column (i.e. at the end of the sentence), pick the cell with the highest entry, and trace back the backpointers to the first word in the sentence.
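A sketch of this final step over the filled trellis (the function name is mine):

def backtrace(trellis, tags):
    # Pick the best cell in the last column, then follow backpointers leftward.
    n = len(trellis)
    best = max(range(len(tags)), key=lambda k: trellis[n-1][k].viterbi)
    path = [best]
    for i in range(n - 1, 0, -1):
        path.append(trellis[i][path[-1]].backpointer)
    return [tags[j] for j in reversed(path)]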

SLIDE 16

Retrieving t* = argmax_t P(t, w)

By keeping one backpointer from each cell to the cell in the previous column that yields the highest probability, we can retrieve the most likely tag sequence when we're done.

[Figure: the trellis with the chain of backpointers traced from the best cell in the last column back to w(1).]

SLIDE 17

Viterbi

Each cell trellis[i][j] (word w(i) with tag tj) contains:

— The Viterbi probability trellis[i][j].viterbi: the maximum probability P(w(1)…w(i), t(1),…, t(i) = tj) of any tag sequence that ends in tj for the prefix w(1)…(i)
— A backpointer trellis[i][j].backpointer = k* to the cell trellis[i−1][k*] in the preceding column that corresponds to the tag tk* of the best such sequence

To fill trellis[i][j], find the best cell in the previous column (trellis[i−1][k*]) based on the previous column and the transition probabilities P(tj | tk):

k* for trellis[i][j] := argmax_k ( trellis[i−1][k].viterbi · P(tj | tk) )

The entry in trellis[i][j] includes the emission probability P(w(i) | tj):

trellis[i][j].viterbi := P(w(i) | tj) · trellis[i−1][k*].viterbi · P(tj | tk*)

We also associate a backpointer from trellis[i][j] to trellis[i−1][k*]. Finally, return the highest scoring entry in the last column of the trellis (= for the last word) and follow its backpointers.

SLIDE 18

Viterbi

trellis[i][j].viterbi (word w(i), tag tj) stores the probability of the best tag sequence for w(1)…w(i) that ends in tj:

trellis[i][j].viterbi = max P(w(1)…w(i), t(1)…, t(i) = tj)

We can recursively compute trellis[i][j].viterbi from the entries in the previous column, trellis[i−1][k].viterbi:

trellis[i][j].viterbi = P(w(i) | tj) · max_k ( trellis[i−1][k].viterbi · P(tj | tk) )

At the end of the sentence, we pick the highest scoring entry in the last column of the trellis.

SLIDES 19–31

Worked example (animation frames): the Viterbi trellis for the sentence "Janet will back the bill" with the tag set {DT, RB, NN, JJ, VB, MD, NNP}. The columns are filled left to right; at each step the max over the preceding column is taken, and in the last column the best cell is selected and its backpointers are followed.

Resulting tag sequence: Janet_NNP will_MD back_VB the_DT bill_NN
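Putting the earlier sketches together on this sentence gives a runnable toy version of the example. The probability tables below are invented purely for illustration (they are not the values used on the slides; every event not listed is assumed to have probability zero):

from collections import defaultdict

words = ["Janet", "will", "back", "the", "bill"]
tags  = ["DT", "RB", "NN", "JJ", "VB", "MD", "NNP"]

p_init  = defaultdict(float, {"NNP": 0.4, "DT": 0.4, "NN": 0.2})
p_trans = defaultdict(lambda: defaultdict(float))
p_emit  = defaultdict(lambda: defaultdict(float))
p_trans["NNP"]["MD"] = 0.5; p_trans["MD"]["VB"] = 0.6
p_trans["VB"]["DT"]  = 0.5; p_trans["DT"]["NN"] = 0.7
p_emit["NNP"]["Janet"] = 0.1; p_emit["MD"]["will"] = 0.3
p_emit["VB"]["back"]   = 0.1; p_emit["DT"]["the"]  = 0.5
p_emit["NN"]["bill"]   = 0.1

trellis = fill_trellis(words, tags, p_init, p_trans, p_emit)
print(backtrace(trellis, tags))   # ['NNP', 'MD', 'VB', 'DT', 'NN']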

SLIDE 32

The Viterbi algorithm

Viterbi(w[1..n]) {
  // INITIALIZATION: first column
  for t in (1...T)
    trellis[1][t].viterbi = p_init[t] × p_emit[t][w[1]]
  // RECURSION: every other column
  for i in (2...n) {
    for t in (1...T) {
      trellis[i][t].viterbi = 0
      for t' in (1...T) {
        tmp = trellis[i-1][t'].viterbi × p_trans[t'][t]
        if (tmp > trellis[i][t].viterbi) {
          trellis[i][t].viterbi = tmp
          trellis[i][t].backpointer = t'
        }
      }
      trellis[i][t].viterbi ×= p_emit[t][w[i]]
    }
  }
  // FINISH: find the best cell in the last column
  t_max = NULL; vit_max = 0
  for t in (1...T)
    if (trellis[n][t].viterbi > vit_max) { t_max = t; vit_max = trellis[n][t].viterbi }
  return unpack(n, t_max)   // follow backpointers from trellis[n][t_max]
}

SLIDE 33

Viterbi for Trigram HMMs

In a trigram HMM, transition probabilities are of the form P(t(i) = ti | t(i−1) = tj, t(i−2) = tk). The i-th tag in the sequence influences the probabilities of the (i+1)-th tag and the (i+2)-th tag:

… P(t(i+1) | t(i), t(i−1)) … P(t(i+2) | t(i+1), t(i)) …

Hence, each row in the trellis for a trigram HMM has to correspond to a pair of tags — the current and the preceding tag (abusing notation):

trellis[i]⟨j,k⟩: word w(i) has tag tj, word w(i−1) has tag tk

The trellis now has T^2 rows. But we still need to consider only T transitions into each cell, since the current word's tag is the next word's preceding tag: transitions are only possible from trellis[i]⟨j,k⟩ to trellis[i+1]⟨l,j⟩ (see the sketch below).
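A hedged sketch of one trigram column update (the names and the p_trans3[(k, j)][l] table layout are my assumptions):

def trigram_step(prev_col, tags, next_word, p_trans3, p_emit):
    # prev_col[(j, k)]: best probability with w(i) tagged tj and w(i-1) tagged tk.
    # Returns col[(l, j)] for the next word; each cell has only T incoming transitions.
    col = {}
    for l in tags:          # tag of the next word
        for j in tags:      # tag of the current word, carried over as the new "preceding" tag
            best = max(prev_col[(j, k)] * p_trans3[(k, j)][l] for k in tags)
            col[(l, j)] = best * p_emit[l][next_word]
    return col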

SLIDE 34

The three basic problems for HMMs

Given an output sequence w = w(1)…w(N): w = "she promised to back the bill"

Problem I (Likelihood): find P(w | λ)
Given an HMM λ = (A, B, π), compute the likelihood of the observed output, P(w | λ).

Problem II (Decoding, i.e. Tagging): find Q = q(1)…q(N)
Given an HMM λ = (A, B, π), what is the most likely sequence of states Q = q(1)…q(N) ≈ t(1)…t(N) to generate w?

Problem III (MLE Estimation): find argmax_λ P(w | λ)
Find the parameters A, B, π which maximize P(w | λ).

SLIDE 35

Dynamic programming algorithms for HMMs

I. Likelihood of the input: compute P(w | λ) for an input w and HMM λ ⇒ Forward algorithm

II. Decoding (= tagging) the input: find the best tags t* = argmax_t P(t | w, λ) for input w and HMM λ ⇒ Viterbi algorithm

III. Estimation (= learning the model): find the best model parameters λ* = argmax_λ P(t, w | λ) for unlabeled training data w ⇒ Forward-Backward algorithm

SLIDE 36

Computing P(w): the Forward algorithm

To compute the probability of a sentence according to an HMM, we have to sum over all possible tag sequences:

P(w) = Σ_t P(w, t)

The Forward algorithm computes this sum efficiently. It is the same as Viterbi, except with sum instead of max:

Base case: for the first word in the sentence, and for each tag j:
forward[1][j] = π(tj) P(w(1) | tj)

Recurrence: for any other word i, and for each tag j:
forward[i][j] = P(w(i) | tj) Σ_k forward[i−1][k] P(tj | tk)

End: for the last word in the sentence, sum over all tags k:
P(w) = Σ_k forward[N][k]
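A hedged sketch mirroring fill_trellis above, with max replaced by a sum (same assumed parameter layout); it returns P(w) rather than a tag sequence:

def forward(words, tags, p_init, p_trans, p_emit):
    # Same O(T^2 N) trellis traversal as Viterbi, but summing over predecessors.
    col = [p_init[t] * p_emit[t][words[0]] for t in tags]        # base case
    for i in range(1, len(words)):                               # recurrence
        col = [p_emit[t][words[i]] *
               sum(col[k] * p_trans[tags[k]][t] for k in range(len(tags)))
               for t in tags]
    return sum(col)                                              # end: sum the last column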