SLIDE 1

Lecture 11: Viterbi and Forward Algorithms

Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

Quiz 1

- Max: 24; Mean: 18.1; Median: 18; SD: 3.36

[Figure: histogram of Quiz 1 scores, bins [0-5], [6-10], [11-15], [16-20], [21-25]]

SLIDE 3

This lecture

- Two important algorithms for inference
  - Forward algorithm
  - Viterbi algorithm

SLIDE 4

SLIDE 5

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

SLIDE 6

Likelihood of the input

- How likely does the sentence "I love cat" occur?
- Compute P(x | μ) for the input x and HMM μ
- Remember, we model P(t, x | μ)
- P(x | μ) = Σ_t P(t, x | μ)
  (marginal probability: sum over all possible tag sequences t)

SLIDE 7

Likelihood of the input

- How likely does the sentence "I love cat" occur?
- P(x | μ) = Σ_t P(t, x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})
- Assume we have 2 tags, N and V:
  P("I love cat" | μ)
    = P("I love cat", "NNN" | μ) + P("I love cat", "NNV" | μ)
    + P("I love cat", "NVN" | μ) + P("I love cat", "NVV" | μ)
    + P("I love cat", "VNN" | μ) + P("I love cat", "VNV" | μ)
    + P("I love cat", "VVN" | μ) + P("I love cat", "VVV" | μ)
- Now, let's write down P("I love cat" | μ) with 45 tags… (see the brute-force sketch below)
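With T tags and n words there are T^n terms in this sum. A minimal brute-force sketch in Python makes the blow-up concrete (the dictionaries trans and emit and the start symbol "<S>" are assumed names, not from the slides); the forward algorithm avoids exactly this enumeration.

from itertools import product

def brute_force_likelihood(words, tags, trans, emit, start="<S>"):
    # P(x | mu) = sum over all T^n tag sequences t of P(t, x | mu)
    total = 0.0
    for seq in product(tags, repeat=len(words)):
        p, prev = 1.0, start
        for word, tag in zip(words, seq):
            # P(t_i | t_{i-1}) * P(x_i | t_i)
            p *= trans[prev][tag] * emit[tag][word]
            prev = tag
        total += p
    return total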

SLIDE 8

Trellis diagram

- Goal: P(x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})

[Trellis: positions j = 1 … 4, with edges labeled by transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and nodes by emission probabilities such as P(x_3 | t_3 = 1)]

- μ is the parameter set of the HMM; we drop it in some slides for simplicity's sake

SLIDE 9

Trellis diagram

- P("I eat a fish", "NVVA")
  = P(N|<S>) P(I|N) · P(V|N) P(eat|V) · P(V|V) P(a|V) · P(A|V) P(fish|A)

[Trellis: positions j = 1 … 4 with tag states N, V, A; the path N→V→V→A collects the emission and transition probabilities above]

SLIDE 10

Trellis diagram

- Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1}): sum over all paths

[Trellis: positions j = 1 … 4 with tag states N, V, A; every left-to-right path corresponds to one tag sequence]

SLIDE 11

Dynamic programming

- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for j = 1
  - Inductive step: assume we know the values for j = l, then compute j = l + 1

SLIDE 12

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- t_{1:l}: tag sequence of length l; x_{1:l} = x_1, x_2, …, x_l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_{t_{1:l−1}} Σ_r P(t_{1:l−1}, x_{1:l}, t_l = r)

[Trellis: positions j = 1 … 4 with tag states N, V, A; grouping the tag sequences by the tag r at the last position gives P(x_{1:l}, t_l = r)]

SLIDE 13

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_r P(x_{1:l}, t_l = r)
- P(x_{1:l}, t_l = r) = Σ_{r′} P(x_{1:l}, t_{l−1} = r′, t_l = r)
                      = Σ_{r′} P(x_{1:l−1}, t_{l−1} = r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 14

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_r P(x_{1:l}, t_l = r)
- P(x_{1:l}, t_l = r) = Σ_{r′} P(x_{1:l}, t_{l−1} = r′, t_l = r)
                      = Σ_{r′} P(x_{1:l−1}, t_{l−1} = r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
- Let's call P(x_{1:l}, t_l = r) "α_l(r)"; the factor P(x_{1:l−1}, t_{l−1} = r′) is then α_{l−1}(r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 15

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- α_l(r) = Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 16

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- α_l(r) = Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
         = P(x_l | t_l = r) Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]
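For instance, with the three tags N, V, A shown in the trellis, one cell of the second column is computed as

α_2(N) = P(x_2 | N) · ( α_1(N) P(N|N) + α_1(V) P(N|V) + α_1(A) P(N|A) )

and analogously for α_2(V) and α_2(A).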

SLIDE 17

Forward algorithm

- Base step: j = 1
- α_1(r) = P(x_1 | t_1 = r) P(t_1 = r | t_0), where P(t_1 = r | t_0) is the initial probability of tag r

SLIDE 18

Implementation using an array

- Use an n × T table (sentence length × number of tags) to keep α_l(r)


From Julia Hockenmaier, Intro to NLP

SLIDE 19

Implementation using an array

Initial: Trellis[1][r] = P(x_1 | t_1 = r) P(t_1 = r | t_0)

SLIDE 20

Implementation using an array


Induction:

α_l(r) = P(x_l | t_l = r) Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

SLIDE 21

The forward algorithm (Pseudo Code)

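A minimal Python sketch of the forward recursion derived above (not the slide's original pseudocode; the dictionaries init[r] = P(t_1 = r | t_0), trans[r'][r] = P(t_l = r | t_{l-1} = r'), and emit[r][x] = P(x | t = r) are assumed names). It fills the n × T trellis in O(n T^2) time.

def forward(words, tags, trans, emit, init):
    # alpha[l-1][r] = P(x_1 .. x_l, t_l = r)
    alpha = [{r: init[r] * emit[r][words[0]] for r in tags}]      # base step, j = 1
    for x in words[1:]:                                            # inductive step
        prev = alpha[-1]
        alpha.append({r: emit[r][x] *
                         sum(prev[rp] * trans[rp][r] for rp in tags)
                      for r in tags})
    return sum(alpha[-1].values())                                 # P(x | mu)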

SLIDE 22

Jason’s ice cream

- P("1,2,1")? (worked out below)

            p(…|C)   p(…|H)   p(…|START)
  p(1|…)     0.5      0.1
  p(2|…)     0.4      0.2
  p(3|…)     0.1      0.7
  p(C|…)     0.8      0.2      0.5
  p(H|…)     0.2      0.8      0.5

  (observations: number of ice cream cones eaten each day; hidden states: C = cold, H = hot)

[Trellis: three days, states C and H at each step]
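Running the forward recursion on this table (a worked sketch, assuming the START column gives the initial probabilities and there is no STOP transition):

α_1(C) = 0.5 · 0.5 = 0.25          α_1(H) = 0.5 · 0.1 = 0.05
α_2(C) = 0.4 · (0.25·0.8 + 0.05·0.2) = 0.084
α_2(H) = 0.2 · (0.25·0.2 + 0.05·0.8) = 0.018
α_3(C) = 0.5 · (0.084·0.8 + 0.018·0.2) = 0.0354
α_3(H) = 0.1 · (0.084·0.2 + 0.018·0.8) = 0.00312
P("1,2,1") = α_3(C) + α_3(H) = 0.0354 + 0.00312 = 0.03852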

SLIDE 23

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

SLIDE 24

Prediction in generative model

- Inference: what is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?

[Figure annotated with the initial probability P(t_1)]

SLIDE 25

Tagging the input

- Find the best tag sequence for "I love cat"
- Remember, we model P(t, x | μ)
- t* = argmax_t P(t, x | μ)
  (find the best one from all possible tag sequences)

SLIDE 26

Tagging the input

- Assume we have 2 tags, N and V. Which one is the best?
  P("I love cat", "NNN" | μ), P("I love cat", "NNV" | μ),
  P("I love cat", "NVN" | μ), P("I love cat", "NVV" | μ),
  P("I love cat", "VNN" | μ), P("I love cat", "VNV" | μ),
  P("I love cat", "VVN" | μ), P("I love cat", "VVV" | μ)

- Again! We need an efficient algorithm

SLIDE 27

Trellis diagram

- Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})

[Trellis: positions j = 1 … 4, with edges labeled by transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and nodes by emission probabilities such as P(x_3 | t_3 = 1)]

SLIDE 28

Trellis diagram

- Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})
- Find the best path!

[Trellis: positions j = 1 … 4 with tag states N, V, A; each path is one candidate tag sequence]

SLIDE 29

Dynamic programming again!

- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for j = 1
  - Inductive step: assume we know the values for j = l, then compute j = l + 1

SLIDE 30

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- t_{1:l}: tag sequence of length l; x_{1:l} = x_1, x_2, …, x_l
- max_{t_{1:l}} P(t_{1:l}, x_{1:l}) = max_r max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l})

[Trellis: positions j = 1 … 4 with tag states N, V, A; group the tag sequences by the tag r at the last position]

SLIDE 31

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l})
    = max_{r′} max_{t_{1:l−2}} P(t_{1:l−2}, t_l = r, t_{l−1} = r′, x_{1:l})
    = max_{r′} max_{t_{1:l−2}} P(t_{1:l−2}, t_{l−1} = r′, x_{1:l−1}) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
- Let's call max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l}) "δ_l(r)"; the inner max over t_{1:l−2} is then δ_{l−1}(r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 32

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- δ_l(r) = max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 33

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- δ_l(r) = max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
         = P(x_l | t_l = r) max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]
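Concretely, with the tags N, V, A from the trellis, one cell of the second column is

δ_2(N) = P(x_2 | N) · max( δ_1(N) P(N|N), δ_1(V) P(N|V), δ_1(A) P(N|A) )

and the r′ attaining the max is remembered as a backpointer (see the later slides).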

SLIDE 34

Viterbi algorithm

- Base step: j = 1
- δ_1(r) = P(x_1 | t_1 = r) P(t_1 = r | t_0), where P(t_1 = r | t_0) is the initial probability of tag r

SLIDE 35

Implementation using an array

Initial: Trellis[1][r] = P(x_1 | t_1 = r) P(t_1 = r | t_0)

SLIDE 36

Implementation using an array


Induction:

δ_l(r) = P(x_l | t_l = r) max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

SLIDE 37

Retrieving the best sequence

- Keep one backpointer per trellis cell: the r′ that attains the max; follow the backpointers back from the best final cell to recover the sequence

SLIDE 38

The Viterbi algorithm (Pseudo Code)


Max instead of sum
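A minimal Python sketch under the same assumed dictionaries as the forward sketch (init, trans, emit), replacing the sum with a max and keeping one backpointer per cell (not the slide's original pseudocode):

def viterbi(words, tags, trans, emit, init):
    # delta[l-1][r] = score of the best tag sequence ending in tag r at position l
    delta = [{r: init[r] * emit[r][words[0]] for r in tags}]      # base step, j = 1
    backptr = [{}]                                                 # no backpointer at j = 1
    for x in words[1:]:                                            # inductive step
        prev, scores, bp = delta[-1], {}, {}
        for r in tags:
            best = max(tags, key=lambda rp: prev[rp] * trans[rp][r])
            bp[r] = best
            scores[r] = emit[r][x] * prev[best] * trans[best][r]
        delta.append(scores)
        backptr.append(bp)
    last = max(tags, key=lambda r: delta[-1][r])                   # best final tag
    path = [last]
    for bp in reversed(backptr[1:]):                               # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[-1][last]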

SLIDE 39

SLIDE 40

Jason’s ice cream

- Best tag sequence for "1,2,1"? (worked out below)

            p(…|C)   p(…|H)   p(…|START)
  p(1|…)     0.5      0.1
  p(2|…)     0.4      0.2
  p(3|…)     0.1      0.7
  p(C|…)     0.8      0.2      0.5
  p(H|…)     0.2      0.8      0.5

  (observations: number of ice cream cones eaten each day; hidden states: C = cold, H = hot)

[Trellis: three days, states C and H at each step]
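Running the Viterbi recursion on the same table (a worked sketch, same assumptions as in the forward example):

δ_1(C) = 0.5 · 0.5 = 0.25          δ_1(H) = 0.5 · 0.1 = 0.05
δ_2(C) = 0.4 · max(0.25·0.8, 0.05·0.2) = 0.08      backpointer: C
δ_2(H) = 0.2 · max(0.25·0.2, 0.05·0.8) = 0.01      backpointer: C
δ_3(C) = 0.5 · max(0.08·0.8, 0.01·0.2) = 0.032     backpointer: C
δ_3(H) = 0.1 · max(0.08·0.2, 0.01·0.8) = 0.0016    backpointer: C
Best final state: C (0.032); following the backpointers gives the sequence C, C, C.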

SLIDE 41

Trick: computing everything in log space

- Homework:
  - Write the forward and Viterbi algorithms in log space
  - Hint: you need a function to compute log(a + b) from log a and log b (a sketch follows below)
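A minimal sketch of that helper (the standard log-sum-exp trick for two values; assumes the inputs are ordinary finite floats):

import math

def log_add(log_a, log_b):
    # returns log(a + b) given log a and log b, without leaving log space
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))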

SLIDE 42

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm