SLIDE 1

ANLP Lecture 9: Algorithms for HMMs

Sharon Goldwater 4 Oct 2019

SLIDE 2

Recap: HMM

  • Elements of HMM:

– Set of states (tags)
– Output alphabet (word types)
– Start state (beginning of sentence)
– State transition probabilities
– Output probabilities from each state

SLIDE 3

More general notation

  • Previous lecture:

– Sequence of tags T = t1…tn
– Sequence of words S = w1…wn

  • This lecture:

– Sequence of states Q = q1 ... qT
– Sequence of outputs O = o1 ... oT
– So t is now a time step, not a tag! And T is the sequence length.

SLIDE 4

Recap: HMM

  • Given a sentence O = o1 ... oT with tags Q = q1 ... qT, compute P(O,Q) as:

$$ P(O, Q) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) $$

  • But we want to find $\operatorname*{argmax}_Q P(Q \mid O)$ without enumerating all possible Q

– Use Viterbi algorithm to store partial computations.

SLIDE 5

Today’s lecture

  • What algorithms can we use to

– Efficiently compute the most probable tag sequence for a given word sequence?
– Efficiently compute the likelihood for an HMM (probability it outputs a given sequence s)?
– Learn the parameters of an HMM given unlabelled training data?

  • What are the properties of these algorithms

(complexity, convergence, etc)?

SLIDE 6

Tagging example

Words:                  <s>   one   dog   bit   </s>
Possible tags:          <s>   CD    NN    NN    </s>
(ordered by frequency         NN    VB    VBD
 for each word)               PRP

SLIDE 7

Tagging example

Words:                  <s>   one   dog   bit   </s>
Possible tags:          <s>   CD    NN    NN    </s>
(ordered by frequency         NN    VB    VBD
 for each word)               PRP

  • Choosing the best tag for each word independently

gives the wrong answer (<s> CD NN NN </s>).

  • P(VBD|bit) < P(NN|bit), but may yield a better

sequence (<s> CD NN VBD </s>)

– because P(VBD|NN) and P(</s>|VBD) are high.

SLIDE 8

Viterbi: intuition

Words:                  <s>   one   dog   bit   </s>
Possible tags:          <s>   CD    NN    NN    </s>
(ordered by frequency         NN    VB    VBD
 for each word)               PRP

  • Suppose we have already computed

a) The best tag sequence for <s> … bit that ends in NN.
b) The best tag sequence for <s> … bit that ends in VBD.

  • Then, the best full sequence would be either

– sequence (a) extended to include </s>, or
– sequence (b) extended to include </s>.

SLIDE 9

Viterbi: intuition

Words:                  <s>   one   dog   bit   </s>
Possible tags:          <s>   CD    NN    NN    </s>
(ordered by frequency         NN    VB    VBD
 for each word)               PRP

  • But similarly, to get

a) The best tag sequence for <s> … bit that ends in NN.

  • We could extend one of:

– The best tag sequence for <s> … dog that ends in NN.
– The best tag sequence for <s> … dog that ends in VB.

  • And so on…

SLIDE 10

Viterbi: high-level picture

  • Intuition: the best path of length t ending in state q

must include the best path of length t-1 to the previous state. (t now a time step, not a tag).

SLIDE 11

Viterbi: high-level picture

  • Intuition: the best path of length t ending in state q

must include the best path of length t-1 to the previous state. (t now a time step, not a tag). So,

– Find the best path of length t-1 to each state.
– Consider extending each of those by 1 step, to state q.
– Take the best of those options as the best path to state q.

SLIDE 12

Notation

  • Sequence of observations over time o1, o2, …, oT

– here, words in sentence

  • Vocabulary size V of possible observations
  • Set of possible states q1, q2, …, qN (see note next slide)

– here, tags

  • A, an NxN matrix of transition probabilities

– aij: the prob of transitioning from state i to j. (JM3 Fig 8.7)

  • B, an NxV matrix of output probabilities

– bi(ot): the prob of emitting ot from state i. (JM3 Fig 8.8)

SLIDE 13

Note on notation

  • J&M use q1, q2, …, qN for set of states, but also use

q1, q2, …, qT for state sequence over time.

– So, just seeing q1 is ambiguous (though usually disambiguated from context).
– I’ll instead use qi for state names, and qt for state at time t.
– So we could have qt = qi, meaning: the state we’re in at time t is qi.

SLIDE 14

HMM example w/ new notation

  • States {q1, q2} (or {<s>, q1, q2})
  • Output alphabet {x, y, z}

[Figure: state-transition diagram. Start → q1; q1→q1 = .7, q1→q2 = .3, q2→q1 = .5, q2→q2 = .5. Emissions: q1: x = .6, y = .1, z = .3; q2: x = .1, y = .7, z = .2.]

Adapted from Manning & Schuetze, Fig 9.2

SLIDE 15

Transition and Output Probabilities

  • Transition matrix A:

aij = P(qj | qi)

  • Output matrix B:

bi(o) = P(o | qi) for output o

Transition matrix A:
         q1     q2
  <s>    1
  q1     .7     .3
  q2     .5     .5

Output matrix B:
         x      y      z
  q1     .6     .1     .3
  q2     .1     .7     .2
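As a concrete reference for the later worked examples, here is one way these parameters might be written down in Python (a sketch, not from the slides; the numpy encoding and the choice to keep the <s> row as a separate start vector are my own):

```python
import numpy as np

# States q1, q2 and outputs x, y, z from the example HMM above.
states  = ["q1", "q2"]
outputs = ["x", "y", "z"]

# Row <s> of the transition table, kept as a separate start-probability vector.
pi = np.array([1.0, 0.0])                 # P(q1 | <s>), P(q2 | <s>)

# Transition matrix A: A[i, j] = a_ij = P(q_j | q_i).
A = np.array([[0.7, 0.3],
              [0.5, 0.5]])

# Output matrix B: B[i, k] = b_i(o_k), columns ordered x, y, z.
B = np.array([[0.6, 0.1, 0.3],
              [0.1, 0.7, 0.2]])
```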

SLIDE 16

Joint probability of (states, outputs)

  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q1 ... qT)

and output sequence O = (o1 ... oT), we have:

$$ P(O, Q \mid \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) $$

SLIDE 17

Joint probability of (states, outputs)

  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q1 ... qT)

and output sequence O = (o1 ... oT), we have:

$$ P(O, Q \mid \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) $$

  • Or:

$$ P(O, Q \mid \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)\, a_{q_{t-1} q_t} $$

SLIDE 18

Joint probability of (states, outputs)

  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q1 ... qT)

and output sequence O = (o1 ... oT), we have:

$$ P(O, Q \mid \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) $$

  • Or:

$$ P(O, Q \mid \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)\, a_{q_{t-1} q_t} $$

  • Example:

$$ P(O = (y, z), Q = (q_1, q_1) \mid \lambda) = b_1(y) \cdot b_1(z) \cdot a_{<s>,1} \cdot a_{11} = (.1)(.3)(1)(.7) $$
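This product is easy to check directly in code. A minimal sketch, assuming the example parameters above and a start vector for <s> (the function name joint_prob is mine):

```python
import numpy as np

# Example HMM parameters (<s> start row folded into pi).
pi = np.array([1.0, 0.0])                     # P(q1|<s>), P(q2|<s>)
A  = np.array([[0.7, 0.3], [0.5, 0.5]])       # A[i,j] = P(q_j | q_i)
B  = np.array([[0.6, 0.1, 0.3],               # B[i,k] = P(output k | q_i); columns x, y, z
               [0.1, 0.7, 0.2]])
out_index = {"x": 0, "y": 1, "z": 2}

def joint_prob(outputs, states):
    """P(O, Q | lambda) = prod_t b_{q_t}(o_t) * a_{q_{t-1} q_t}, with q_0 = <s>."""
    p = 1.0
    prev = None                                # None stands for the start state <s>
    for o, q in zip(outputs, states):
        trans = pi[q] if prev is None else A[prev, q]
        p *= trans * B[q, out_index[o]]
        prev = q
    return p

# The slide's example: O = (y, z), Q = (q1, q1)  ->  (.1)(.3)(1)(.7) = 0.021
print(joint_prob(["y", "z"], [0, 0]))
```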

SLIDE 19

Viterbi: high-level picture

  • Want to find $\operatorname*{argmax}_Q P(Q \mid O)$
  • Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. So,

– Find the best path of length t-1 to each state.
– Consider extending each of those by 1 step, to state q.
– Take the best of those options as the best path to state q.

SLIDE 20

Viterbi algorithm

  • Use a chart to store partial results as we go

– NxT table, where v(j,t) is the probability* of the best state sequence for o1…ot that ends in state j.

*Specifically, v(j,t) stores the max of the joint probability P(o1…ot,q1…qt-1,qt=j|λ)

SLIDE 21

Viterbi algorithm

  • Use a chart to store partial results as we go

– NxT table, where v(j,t) is the probability* of the best state sequence for o1…ot that ends in state j.

  • Fill in columns from left to right, with

$$ v(j, t) = \max_{i=1}^{N} v(i, t-1) \cdot a_{ij} \cdot b_j(o_t) $$

*Specifically, v(j,t) stores the max of the joint probability P(o1…ot,q1…qt-1,qt=j|λ)

SLIDE 22

Viterbi algorithm

  • Use a chart to store partial results as we go

– NxT table, where v(j,t) is the probability* of the best state sequence for o1…ot that ends in state j.

  • Fill in columns from left to right, with

$$ v(j, t) = \max_{i=1}^{N} v(i, t-1) \cdot a_{ij} \cdot b_j(o_t) $$

  • Store a backtrace to show, for each cell, which state at t-1 we came from.

*Specifically, v(j,t) stores the max of the joint probability P(o1…ot,q1…qt-1,qt=j|λ)
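A compact sketch of the whole algorithm in Python/numpy, using the example HMM from the earlier slides (the function name, the start vector pi, and the table layout are my own choices):

```python
import numpy as np

# Example HMM from the earlier slides.
pi = np.array([1.0, 0.0])                     # start probs: P(q1|<s>), P(q2|<s>)
A  = np.array([[0.7, 0.3], [0.5, 0.5]])       # A[i,j] = a_ij = P(q_j | q_i)
B  = np.array([[0.6, 0.1, 0.3],               # B[j,k] = b_j(output k); columns x, y, z
               [0.1, 0.7, 0.2]])
out_index = {"x": 0, "y": 1, "z": 2}

def viterbi(obs):
    """Return (best joint prob, best state sequence) for the observation list obs."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((N, T))                      # v[j,t] = prob of best path for o_1..o_t ending in j
    back = np.zeros((N, T), dtype=int)        # backtrace: best previous state for each cell

    # First column: transition from <s> times emission of o_1.
    v[:, 0] = pi * B[:, out_index[obs[0]]]

    # Remaining columns: v(j,t) = max_i v(i,t-1) * a_ij * b_j(o_t).
    for t in range(1, T):
        for j in range(N):
            scores = v[:, t - 1] * A[:, j] * B[j, out_index[obs[t]]]
            back[j, t] = int(np.argmax(scores))
            v[j, t] = scores[back[j, t]]

    # Best final state, then follow the backtrace.
    best_last = int(np.argmax(v[:, T - 1]))
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    path.reverse()
    return v[best_last, T - 1], path

prob, path = viterbi(["x", "z", "y"])
print(prob, [f"q{i+1}" for i in path])        # 0.02646 and ['q1', 'q1', 'q2']
```

On the observation sequence xzy this reproduces the cell values worked through on the following slides (.6, .126, .036, .00882, .02646) and the best path q1 q1 q2.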

SLIDE 23

Example

  • Suppose O=xzy. Our initially empty table:

        o1=x    o2=z    o3=y
  q1
  q2

SLIDE 24

Filling the first column

        o1=x    o2=z    o3=y
  q1    .6
  q2    0

$$ v(1,1) = a_{<s>,1} \cdot b_1(x) = (1)(.6) = .6 $$
$$ v(2,1) = a_{<s>,2} \cdot b_2(x) = (0)(.1) = 0 $$

SLIDE 25

Starting the second column

        o1=x    o2=z    o3=y
  q1    .6
  q2    0

$$ v(1,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i1} \cdot b_1(z) = \max\{\, v(1,1)\, a_{11}\, b_1(z),\; v(2,1)\, a_{21}\, b_1(z) \,\} = \max\{ (.6)(.7)(.3),\; (0)(.5)(.3) \} $$

SLIDE 26

Starting the second column

        o1=x    o2=z    o3=y
  q1    .6      .126
  q2    0

$$ v(1,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i1} \cdot b_1(z) = \max\{\, v(1,1)\, a_{11}\, b_1(z),\; v(2,1)\, a_{21}\, b_1(z) \,\} = \max\{ (.6)(.7)(.3),\; (0)(.5)(.3) \} $$

SLIDE 27

Finishing the second column

        o1=x    o2=z    o3=y
  q1    .6      .126
  q2    0

$$ v(2,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i2} \cdot b_2(z) = \max\{\, v(1,1)\, a_{12}\, b_2(z),\; v(2,1)\, a_{22}\, b_2(z) \,\} = \max\{ (.6)(.3)(.2),\; (0)(.5)(.2) \} $$

SLIDE 28

Finishing the second column

        o1=x    o2=z    o3=y
  q1    .6      .126
  q2    0       .036

$$ v(2,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i2} \cdot b_2(z) = \max\{\, v(1,1)\, a_{12}\, b_2(z),\; v(2,1)\, a_{22}\, b_2(z) \,\} = \max\{ (.6)(.3)(.2),\; (0)(.5)(.2) \} $$

SLIDE 29

Third column

  • Exercise: make sure you get the same results!
        o1=x    o2=z    o3=y
  q1    .6      .126    .00882
  q2    0       .036    .02646
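For checking the exercise: the two new cells follow from the same recurrence,

$$ v(1,3) = \max\{\, v(1,2)\, a_{11}\, b_1(y),\; v(2,2)\, a_{21}\, b_1(y) \,\} = \max\{ (.126)(.7)(.1),\; (.036)(.5)(.1) \} = .00882 $$
$$ v(2,3) = \max\{\, v(1,2)\, a_{12}\, b_2(y),\; v(2,2)\, a_{22}\, b_2(y) \,\} = \max\{ (.126)(.3)(.7),\; (.036)(.5)(.7) \} = .02646 $$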

SLIDE 30

Best Path

  • Choose best final state: $\max_{i=1}^{N} v(i, T)$
  • Follow backtraces to find best full sequence: q1q1q2, so:

        o1=x    o2=z    o3=y
  q1    .6      .126    .00882
  q2    0       .036    .02646

SLIDE 31

HMMs: what else?

  • Using Viterbi, we can find the best tags for a

sentence (decoding), and get P(O, Q | λ).

  • We might also want to

– Compute the likelihood P(O | λ), i.e., the probability of a sentence regardless of tags (a language model!)
– Learn the best set of parameters λ = (A, B) given only an unannotated corpus of sentences.

SLIDE 32

Computing the likelihood

  • From probability theory, we know that

$$ P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) $$

  • There are an exponential number of Qs.
  • Again, by computing and storing partial results, we can solve efficiently.
  • (Next slides show the algorithm but I’ll likely skip them)
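For the tiny example HMM used below, the sum can still be computed by brute force, which makes a handy check on the forward algorithm (a sketch; the enumeration over all 2^3 state sequences and the variable names are mine):

```python
import itertools
import numpy as np

# Example HMM parameters from the earlier slides.
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3],
               [0.1, 0.7, 0.2]])
out_index = {"x": 0, "y": 1, "z": 2}
obs = [out_index[o] for o in "xzy"]

# Sum P(O, Q | lambda) over every possible state sequence Q.
total = 0.0
for Q in itertools.product(range(2), repeat=len(obs)):
    p = pi[Q[0]] * B[Q[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], obs[t]]
    total += p

print(total)   # approx. 0.04968; the forward algorithm computes this without enumeration
```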

SLIDE 33

Forward algorithm

  • Use a table with cells α(j,t): the probability of being in state j after seeing o1…ot (forward probability):

$$ \alpha(j, t) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda) $$

  • Fill in columns from left to right, with

$$ \alpha(j, t) = \sum_{i=1}^{N} \alpha(i, t-1) \cdot a_{ij} \cdot b_j(o_t) $$

– Same as Viterbi, but sum instead of max (and no backtrace).
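A sketch of this recursion in Python/numpy, identical in shape to the Viterbi sketch but with a sum in place of the max (names are mine; parameters are the example HMM from earlier slides):

```python
import numpy as np

# Example HMM from the earlier slides.
pi = np.array([1.0, 0.0])                     # start probs: P(q1|<s>), P(q2|<s>)
A  = np.array([[0.7, 0.3], [0.5, 0.5]])       # a_ij = P(q_j | q_i)
B  = np.array([[0.6, 0.1, 0.3],               # b_j(o); columns x, y, z
               [0.1, 0.7, 0.2]])
out_index = {"x": 0, "y": 1, "z": 2}

def forward(obs):
    """Return the full table alpha and the likelihood P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, out_index[obs[0]]]          # first column
    for t in range(1, T):
        for j in range(N):
            # alpha(j,t) = sum_i alpha(i,t-1) * a_ij * b_j(o_t)
            alpha[j, t] = np.sum(alpha[:, t - 1] * A[:, j]) * B[j, out_index[obs[t]]]
    return alpha, alpha[:, T - 1].sum()

alpha, likelihood = forward(["x", "z", "y"])
print(alpha)            # columns match the slides: [.6, .126, .01062] and [0, .036, .03906]
print(likelihood)       # approx. 0.04968
```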

SLIDE 34

Example

  • Suppose O=xzy. Our initially empty table:

        o1=x    o2=z    o3=y
  q1
  q2

SLIDE 35

Filling the first column

        o1=x    o2=z    o3=y
  q1    .6
  q2    0

$$ \alpha(1,1) = a_{<s>,1} \cdot b_1(x) = (1)(.6) = .6 $$
$$ \alpha(2,1) = a_{<s>,2} \cdot b_2(x) = (0)(.1) = 0 $$

SLIDE 36

Starting the second column

        o1=x    o2=z    o3=y
  q1    .6      .126
  q2    0

$$ \alpha(1,2) = \sum_{i=1}^{N} \alpha(i,1) \cdot a_{i1} \cdot b_1(z) = \alpha(1,1)\, a_{11}\, b_1(z) + \alpha(2,1)\, a_{21}\, b_1(z) = (.6)(.7)(.3) + (0)(.5)(.3) = .126 $$

SLIDE 37

Finishing the second column

        o1=x    o2=z    o3=y
  q1    .6      .126
  q2    0       .036

$$ \alpha(2,2) = \sum_{i=1}^{N} \alpha(i,1) \cdot a_{i2} \cdot b_2(z) = \alpha(1,1)\, a_{12}\, b_2(z) + \alpha(2,1)\, a_{22}\, b_2(z) = (.6)(.3)(.2) + (0)(.5)(.2) = .036 $$

SLIDE 38

Third column and finish

  • Add up all probabilities in last column to get the probability of the entire sequence:

$$ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha(i, T) $$

        o1=x    o2=z    o3=y
  q1    .6      .126    .01062
  q2    0       .036    .03906
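Summing the last column explicitly:

$$ P(O \mid \lambda) = \alpha(1,3) + \alpha(2,3) = .01062 + .03906 = .04968 $$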

SLIDE 39

Learning

  • Given only the output sequence, learn the best set of

parameters λ = (A, B).

  • Assume ‘best’ = maximum-likelihood.
  • Other definitions are possible, won’t discuss here.

SLIDE 40

Unsupervised learning

  • Training an HMM from an annotated corpus is

simple.

– Supervised learning: we have examples labelled with the right ‘answers’ (here, tags): no hidden variables in training.

  • Training from unannotated corpus is trickier.

– Unsupervised learning: we have no examples labelled with the right ‘answers’: all we see are outputs, state sequence is hidden.

SLIDE 41

Circularity

  • If we know the state sequence, we can find the best λ.

– E.g., use MLE:

$$ P(q_j \mid q_i) = \frac{C(q_i \to q_j)}{C(q_i)} $$

  • If we know λ, we can find the best state sequence.

– use Viterbi

  • But we don't know either!
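A small sketch of this supervised MLE estimate for the transition probabilities, assuming tagged sequences were available (the function name, the <s> padding, and the toy input are mine):

```python
from collections import Counter

def mle_transitions(tag_sequences):
    """Estimate P(q_j | q_i) = C(q_i -> q_j) / C(q_i) from tagged sequences.

    Each sequence is a list of tags; '<s>' is prepended as the start state.
    """
    bigrams, unigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return {(i, j): c / unigrams[i] for (i, j), c in bigrams.items()}

# Tiny illustration with made-up tag sequences:
print(mle_transitions([["NN", "VBD"], ["NN", "NN"]]))
# {('<s>', 'NN'): 1.0, ('NN', 'VBD'): 0.5, ('NN', 'NN'): 0.5}
```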

SLIDE 42

Expectation-maximization (EM)

Essentially, a bootstrapping algorithm.

  • Initialize parameters λ(0)
  • At each iteration k,

– E-step: Compute expected counts using λ(k-1)
– M-step: Set λ(k) using MLE on the expected counts

  • Repeat until λ doesn't change (or other stopping

criterion).

SLIDE 43

Expected counts??

Counting transitions from qi→qj:

  • Real counts:

– count 1 each time we see qi→qj in true tag sequence.

  • Expected counts:

– With current λ, compute probs of all possible tag sequences.
– If sequence Q has probability p, count p for each qi→qj in Q.
– Add up these fractional counts across all possible sequences.

SLIDE 44

Example

  • Notionally, we compute expected counts as follows:

Observations: x z y

Possible sequence    Probability of sequence
Q1 = q1 q1 q1        p1
Q2 = q1 q2 q1        p2
Q3 = q1 q1 q2        p3
Q4 = q1 q2 q2        p4

SLIDE 45

Example

  • Notionally, we compute expected counts as follows:

$$ C(q_1 \to q_1) = 2 p_1 + p_3 $$

Observations: x z y

Possible sequence    Probability of sequence
Q1 = q1 q1 q1        p1
Q2 = q1 q2 q1        p2
Q3 = q1 q1 q2        p3
Q4 = q1 q2 q2        p4
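Continuing the same counting for the transitions not shown on the slide:

$$ C(q_1 \to q_2) = p_2 + p_3 + p_4, \qquad C(q_2 \to q_1) = p_2, \qquad C(q_2 \to q_2) = p_4 $$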

SLIDE 46

Forward-Backward algorithm

  • As usual, avoid enumerating all possible sequences.
  • Forward-Backward (Baum-Welch) algorithm computes expected counts using forward probabilities and backward probabilities:

$$ \beta(j, t) = P(q_t = j, o_{t+1}, o_{t+2}, \ldots, o_T \mid \lambda) $$

– Details, see J&M 6.5

  • EM idea is much more general: can use for many latent variable models.

SLIDE 47

Guarantees

  • EM is guaranteed to find a local maximum of the likelihood.

[Figure: plot of P(O|λ) against values of λ.]

SLIDE 48

Guarantees

  • EM is guaranteed to find a local maximum of the likelihood.
  • Not guaranteed to find global maximum.
  • Practical issues: initialization, random restarts, early stopping.

[Figure: plot of P(O|λ) against values of λ.]

SLIDE 49

Forward-backward/EM in practice

  • Fully unsupervised learning of HMM for POS

tagging does not work well.

– Model inaccuracies that work ok for supervised learning often cause problems for unsupervised.

  • Can be better if more constrained.

– Other tasks, using Bayesian priors, etc.

  • And, general idea of EM can also be useful.

– E.g., for clustering problems or word alignment in machine translation.

SLIDE 50

Summary

  • HMM: a generative model of sentences using

hidden state sequence

  • Dynamic programming algorithms to compute

– Best tag sequence given words (Viterbi algorithm)
– Likelihood (forward algorithm)
– Best parameters from unannotated corpus (forward-backward algorithm, an instance of EM)
