

SLIDE 1

Part-of-Speech Tagging: HMM & structured perceptron

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

SLIDE 2

Last time…

  • What are parts of speech (POS)?

– Equivalence classes or categories of words
– Open class vs. closed class
– Nouns, Verbs, Adjectives, Adverbs (English)

  • What is POS tagging?

– Assigning POS tags to words in context
– Penn Treebank

  • How to POS tag text automatically?

– Multiclass classification vs. sequence labeling

SLIDE 3

Today

  • 2 approaches to POS tagging

– Hidden Markov Models
– Structured Perceptron

SLIDE 4

Hidden Markov Models

  • Common approach to sequence labeling
  • A finite state machine with probabilistic

transitions

  • Markov Assumption

– the next state depends only on the current state and is independent of the previous history
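In symbols (a standard statement of the assumption, not shown on the slide): P(qt+1 | q1, q2, …, qt) = P(qt+1 | qt)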

SLIDE 5

HMM: Formal Specification

  • Q: a finite set of N states

– Q = {q0, q1, q2, q3, …}

  • N  N Transition probability matrix A = [aij]

– aij = P(qj|qi), Σj aij = 1 for each i

  • Sequence of observations O = o1, o2, ... oT

– Each drawn from a given set of symbols (vocabulary V)

  • N  |V| Emission probability matrix, B = [bit]

– bit = bi(ot) = P(ot|qi), Σt bit = 1 for each i

  • Start and end states

– An explicit start state q0, or alternatively, a prior distribution over start states: {π1, π2, π3, …}, Σi πi = 1
– The set of final states: qF

Markov Assumption
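As a concrete illustration (not from the slides), the parameters above can be stored as a prior vector, a transition matrix, and an emission matrix; the numbers below are hypothetical placeholders:

import numpy as np

# Hypothetical 3-state HMM over a 3-symbol vocabulary; all values are made-up placeholders.
pi = np.array([0.5, 0.2, 0.3])              # prior over start states, sums to 1
A = np.array([[0.6, 0.3, 0.1],              # A[i, j] = P(q_j | q_i); each row sums to 1
              [0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4]])
B = np.array([[0.7, 0.1, 0.2],              # B[i, k] = P(symbol k | q_i); each row sums to 1
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)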

SLIDE 6

Stock Market HMM

[Figure: stock market HMM with states Bull, Bear, Static; priors π1=0.5, π2=0.2, π3=0.3]

States? ✓  Transitions? ✓  Vocabulary? ✓  Emissions? ✓  Priors? ✓

SLIDE 7

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 8

HMM Problem #1: Likelihood

SLIDE 9

Computing Likelihood

t:  1  2  3  4  5  6
O:  ↑  ↓  ↔  ↑  ↓  ↔      (model: λstock)

Assuming λstock models the stock market, how likely are we to observe the sequence of outputs?

π1=0.5 π2=0.2 π3=0.3

SLIDE 10

Computing Likelihood

  • First try:

– Sum over all possible ways in which we could generate O from λ

Takes O(N^T) time to compute!
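For example, with the N = 3 stock-market states and the T = 6 observations from the previous slide, brute-force enumeration already sums over 3^6 = 729 state sequences; the count grows exponentially with T.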

SLIDE 11

Forward Algorithm

  • Use an N  T trellis or chart [αtj]
  • Forward probabilities: αtj or αt(j)

= P(being in state j after seeing t observations) = P(o1, o2, ... ot, qt=j)

  • Each cell = ∑ extensions of all paths from other cells

αt(j) = ∑i αt-1(i) aij bj(ot)

– αt-1(i): forward path probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j

  • P(O|λ) = ∑i αT(i)
SLIDE 12

Forward Algorithm: Formal Definition

  • Initialization
  • Recursion
  • Termination
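The slide's formulas are not captured in this transcript; as a rough sketch (assuming the numpy-style pi, A, B arrays from the earlier illustration, with observations given as symbol indices), the three steps look like:

import numpy as np

def forward(pi, A, B, obs):
    # Likelihood P(O | lambda) computed with the forward trellis.
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization: alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):                         # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        for j in range(N):
            alpha[t, j] = (alpha[t - 1] * A[:, j]).sum() * B[j, obs[t]]
    return alpha[T - 1].sum()                     # termination: P(O | lambda) = sum_i alpha_T(i)

Filling the trellis takes O(N²·T) time, versus the exponential cost of the brute-force sum.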
SLIDE 13

Forward Algorithm

O = ↑ ↓ ↑      find P(O|λstock)

SLIDE 14

Forward Algorithm

[Trellis: states Bear, Bull, Static (rows) × time steps t=1, 2, 3 (columns); observations ↑ ↓ ↑]

SLIDE 15

Forward Algorithm: Initialization

α1(Bull), α1(Bear), α1(Static)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑]

t=1 cells: 0.2×0.7 = 0.14    0.5×0.1 = 0.05    0.3×0.3 = 0.09

SLIDE 16

Forward Algorithm: Recursion

0.140.60.1=0.0084

α1(Bull)aBullBullbBull(↓)

.... and so on

time

↑ ↓ ↑

t=1 t=2 t=3

0.20.7=0 .14 0.50.1 =0.05 0.30.3 =0.09 0.0145

Bear Bull Static

states

SLIDE 17

Forward Algorithm: Recursion

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; one t=2 cell is 0.0145, the remaining cells are left as ?]

Work through the rest of these numbers… What's the asymptotic complexity of this algorithm?

SLIDE 18

HMM Problem #2: Decoding

SLIDE 19

Decoding

Given λstock as our model and O as our observations, what are the most likely states the market went through to produce O?

t:  1  2  3  4  5  6
O:  ↑  ↓  ↔  ↑  ↓  ↔      (model: λstock)

π1=0.5 π2=0.2 π3=0.3

SLIDE 20

Decoding

  • “Decoding” because states are hidden
  • First try:

– Compute P(O) for all possible state sequences, then choose sequence with highest probability

SLIDE 21

Viterbi Algorithm

  • “Decoding” = computing most likely state

sequence

– Another dynamic programming algorithm
– Efficient: polynomial vs. exponential (brute force)

  • Same idea as the forward algorithm

– Store intermediate computation results in a trellis
– Build new cells from existing cells

SLIDE 22

Viterbi Algorithm

  • Use an N  T trellis [vtj]

– Just like in forward algorithm

  • vtj or vt(j)

= P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q1, q2, ... qt-1, qt=j, o1, o2, ... ot)

  • Each cell = extension of most likely path from other cells

vt(j) = maxi vt-1(i) aij bj(ot)

– vt-1(i): Viterbi probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j

  • P = maxi vT(i)
SLIDE 23

Viterbi vs. Forward

  • Maximization instead of summation over previous paths
  • This algorithm is still missing something!

– In the forward algorithm, we only care about the probabilities
– What's different here?

  • We need to store the most likely path (transition):

– Use “backpointers” to keep track of the most likely transition
– At the end, follow the chain of backpointers to recover the most likely state sequence

SLIDE 24

Viterbi Algorithm: Formal Definition

  • Initialization
  • Recursion
  • Termination
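As with the forward algorithm, the slide's formulas are not in the transcript; a minimal sketch under the same assumptions, with backpointers added so the best state sequence can be recovered:

import numpy as np

def viterbi(pi, A, B, obs):
    # Most likely hidden state sequence and its probability.
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))
    bp = np.zeros((T, N), dtype=int)              # backpointers: best previous state for each cell
    v[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                         # recursion: max instead of sum
        for j in range(N):
            scores = v[t - 1] * A[:, j]
            bp[t, j] = scores.argmax()
            v[t, j] = scores[bp[t, j]] * B[j, obs[t]]
    path = [int(v[T - 1].argmax())]               # termination: best final state
    for t in range(T - 1, 0, -1):                 # follow the chain of backpointers
        path.append(int(bp[t, path[-1]]))
    return list(reversed(path)), v[T - 1].max()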
SLIDE 25

Viterbi Algorithm

O = ↑ ↓ ↑      find the most likely state sequence given λstock

SLIDE 26

Viterbi Algorithm

[Trellis: states Bear, Bull, Static (rows) × time steps t=1, 2, 3 (columns); observations ↑ ↓ ↑]

SLIDE 27

Viterbi Algorithm: Initialization

v1(Bull), v1(Bear), v1(Static)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑]

t=1 cells: 0.2×0.7 = 0.14    0.5×0.1 = 0.05    0.3×0.3 = 0.09

SLIDE 28

Viterbi Algorithm: Recursion

0.140.60.1=0.0084

Max

α1(Bull)aBullBullbBull(↓)

time

↑ ↓ ↑

t=1 t=2 t=3

0.20.7=0 .14 0.50.1 =0.05 0.30.3 =0.09 0.0084

Bear Bull Static

states

SLIDE 29

Viterbi Algorithm: Recursion

.... and so on (store a backpointer to the best incoming transition at each cell)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; t=2 cell: 0.0084]

SLIDE 30

Viterbi Algorithm: Recursion

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; t=2 cell: 0.0084, the remaining cells are left as ?]

Work through the rest of the algorithm…

SLIDE 31

POS Tagging with HMMs

SLIDE 32

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

SLIDE 33

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

SLIDE 34

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 35

HMM Problem #3: Learning

SLIDE 36

Learning HMMs for POS tagging is a supervised task

  • A POS tagged corpus tells us the hidden states!
  • We can compute Maximum Likelihood Estimates

(MLEs) for the various parameters

– MLE = fancy way of saying “count and divide”

  • These parameter estimates maximize the

likelihood of the data being generated by the model

SLIDE 37

Supervised Training

  • Transition Probabilities

– Any P(ti | ti-1) = C(ti-1, ti) / C(ti-1), from the tagged data
– Example: for P(NN|VB)

  • count how many times a noun follows a verb
  • divide by the total number of times you see a verb
SLIDE 38

Supervised Training

  • Emission Probabilities

– Any P(wi | ti) = C(wi, ti) / C(ti), from the tagged data
– For P(bank|NN)

  • count how many times bank is tagged as a noun
  • divide by how many times anything is tagged as a

noun

SLIDE 39

Supervised Training

  • Priors

– Any P(q1 = ti) = πi = C(ti)/N, from the tagged data
– For πNN, count the number of times NN occurs and divide by the total number of tags
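Putting the last three slides together, the count-and-divide estimates can be sketched as follows (function and variable names are assumptions, not code from the lecture):

from collections import Counter

def train_hmm(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    trans = Counter()        # C(t_{i-1}, t_i)
    emit = Counter()         # C(w_i, t_i)
    tag_count = Counter()    # C(t_i)
    prev_count = Counter()   # C(t_{i-1}) counted as a transition history
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for word, tag in sent:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
        for t_prev, t in zip(tags, tags[1:]):
            trans[(t_prev, t)] += 1
            prev_count[t_prev] += 1
    total_tags = sum(tag_count.values())
    P_T = {(a, b): c / prev_count[a] for (a, b), c in trans.items()}   # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    P_E = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}     # P(w_i | t_i) = C(w_i, t_i) / C(t_i)
    priors = {t: c / total_tags for t, c in tag_count.items()}         # pi_i = C(t_i) / N
    return P_T, P_E, priors

For instance, P_T[("VB", "NN")] is C(VB, NN) / C(VB), exactly the P(NN|VB) recipe above.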

SLIDE 40

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 41

Prediction Problems

  • Given x, predict y

A book review: “Oh, man I love this book!” / “This book is so boring...”
→ Is it positive? yes / no
Binary Prediction (2 choices)

A tweet: “On the way to the park!” / “公園に行くなう!”
→ Its language? English / Japanese
Multi-class Prediction (several choices)

A sentence: “I read a book”
→ Its syntactic parse [tree over “I read a book” with labels S, NP, VP, DET, NN, N, VBD]
Structured Prediction (millions of choices)

SLIDE 42

Approaches to POS tagging

Classifiers
– Multiclass classification problem
– Logistic Regression
– Models context using lots of features

Generative Models
– Structured prediction problem (sequence labeling)
– Hidden Markov Models
– Model transitions between states/POS

Structured perceptron → classification with lots of features over structured models!
SLIDE 43

Let’s restructure HMMs with features…

  • Given a sentence X, predict its part of

speech sequence Y

Natural language processing ( NLP ) is a field of computer science

JJ NN NN -LRB- NN -RRB- VBZ DT NN IN NN NN

SLIDE 44

Let’s restructure HMMs with features…

  • POS→POS transition probabilities
  • POS→Word emission probabilities

words:  natural language processing ( nlp ) ...
tags:   <s> JJ NN NN LRB NN RRB ... </s>

PT(JJ|<s>) PT(NN|JJ) PT(NN|NN) …    PE(natural|JJ) PE(language|NN) PE(processing|NN) …

P(Y) ≈ ∏i=1..I+1 PT(yi | yi−1)
P(X | Y) ≈ ∏i=1..I PE(xi | yi)

SLIDE 45

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

SLIDE 46

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

SLIDE 47

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

Score:

S(X, Y) = ∑i=1..I wE,yi,xi + ∑i=1..I+1 wT,yi−1,yi

SLIDE 48

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

Score:

S(X, Y) = ∑i=1..I wE,yi,xi + ∑i=1..I+1 wT,yi−1,yi

When:

wE,yi,xi = log PE(xi | yi)
wT,yi−1,yi = log PT(yi | yi−1)

then  log P(X, Y) = S(X, Y)

SLIDE 49

Example

φ( I visited Nara → PRP VBD NNP ) =    (correct tagging, Y1)

φ( I visited Nara → NNP VBD NNP ) =    (incorrect tagging, Y2)

φT,<S>,PRP(X,Y1) = 1   φT,PRP,VBD(X,Y1) = 1   φT,VBD,NNP(X,Y1) = 1   φT,NNP,</S>(X,Y1) = 1
φE,PRP,”I”(X,Y1) = 1   φE,VBD,”visited”(X,Y1) = 1   φE,NNP,”Nara”(X,Y1) = 1
φCAPS,PRP(X,Y1) = 1   φCAPS,NNP(X,Y1) = 1   φSUF,VBD,”...ed”(X,Y1) = 1

φT,<S>,NNP(X,Y2) = 1   φT,NNP,VBD(X,Y2) = 1   φT,VBD,NNP(X,Y2) = 1   φT,NNP,</S>(X,Y2) = 1
φE,NNP,”I”(X,Y2) = 1   φE,VBD,”visited”(X,Y2) = 1   φE,NNP,”Nara”(X,Y2) = 1
φCAPS,NNP(X,Y2) = 2   φSUF,VBD,”...ed”(X,Y2) = 1
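A hypothetical feature extractor producing indicator counts like these (feature names are tuples rather than the φ subscripts on the slide) might look as follows:

from collections import Counter

def create_features(words, tags):
    # Transition, emission, capitalization, and suffix features for one candidate tagging.
    phi = Counter()
    padded = ["<S>"] + list(tags) + ["</S>"]
    for prev, cur in zip(padded, padded[1:]):
        phi[("T", prev, cur)] += 1               # transition features (phi_T)
    for word, tag in zip(words, tags):
        phi[("E", tag, word)] += 1               # emission features (phi_E)
        if word[0].isupper():
            phi[("CAPS", tag)] += 1              # capitalization feature (phi_CAPS)
        if word.endswith("ed"):
            phi[("SUF", tag, "...ed")] += 1      # suffix feature (phi_SUF)
    return phi

Calling create_features("I visited Nara".split(), ["PRP", "VBD", "NNP"]) fires ("T", "<S>", "PRP"), ("E", "VBD", "visited"), ("CAPS", "NNP"), ("SUF", "VBD", "...ed"), and so on, matching the Y1 column above.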

SLIDE 50

How to decode?

  • We must find the POS sequence that satisfies:

Ŷ = argmaxY ∑j wj φj(X, Y)

  • Solution: the Viterbi algorithm

SLIDE 51

HMM Viterbi Algorithm

  • Forward step, calculate the best path to a

node

  • Find the path to each node with the lowest

negative log probability

  • Backward step, reproduce the path
SLIDE 52

Forward Step: Part 1

  • First, calculate transition from <S> and

emission of the first word for every POS

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = -log PT(NN|<S>)  + -log PE(I | NN)
best_score[“1 JJ”]  = -log PT(JJ|<S>)  + -log PE(I | JJ)
best_score[“1 VB”]  = -log PT(VB|<S>)  + -log PE(I | VB)
best_score[“1 PRP”] = -log PT(PRP|<S>) + -log PE(I | PRP)
best_score[“1 NNP”] = -log PT(NNP|<S>) + -log PE(I | NNP)

SLIDE 53

Forward Step: Middle Parts

  • For middle words, calculate the minimum

score for all possible previous POS tags

[Graph: nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP (word “I”) with edges into nodes 2:NN, 2:JJ, 2:VB, 2:PRP, 2:NNP (word “visited”)]

best_score[“2 NN”] = min(
    best_score[“1 NN”]  + -log PT(NN|NN)  + -log PE(visited | NN),
    best_score[“1 JJ”]  + -log PT(NN|JJ)  + -log PE(visited | NN),
    best_score[“1 VB”]  + -log PT(NN|VB)  + -log PE(visited | NN),
    best_score[“1 PRP”] + -log PT(NN|PRP) + -log PE(visited | NN),
    best_score[“1 NNP”] + -log PT(NN|NNP) + -log PE(visited | NN),
    ... )

best_score[“2 JJ”] = min(
    best_score[“1 NN”]  + -log PT(JJ|NN) + -log PE(visited | JJ),
    best_score[“1 JJ”]  + -log PT(JJ|JJ) + -log PE(visited | JJ),
    best_score[“1 VB”]  + -log PT(JJ|VB) + -log PE(visited | JJ),
    ... )

SLIDE 54

HMM Viterbi with Features

  • Same as probabilities, use feature weights

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = wT,<S>,NN  + wE,NN,I
best_score[“1 JJ”]  = wT,<S>,JJ  + wE,JJ,I
best_score[“1 VB”]  = wT,<S>,VB  + wE,VB,I
best_score[“1 PRP”] = wT,<S>,PRP + wE,PRP,I
best_score[“1 NNP”] = wT,<S>,NNP + wE,NNP,I

SLIDE 55

HMM Viterbi with Features

  • Can add additional features

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = wT,<S>,NN  + wE,NN,I  + wCAPS,NN
best_score[“1 JJ”]  = wT,<S>,JJ  + wE,JJ,I  + wCAPS,JJ
best_score[“1 VB”]  = wT,<S>,VB  + wE,VB,I  + wCAPS,VB
best_score[“1 PRP”] = wT,<S>,PRP + wE,PRP,I + wCAPS,PRP
best_score[“1 NNP”] = wT,<S>,NNP + wE,NNP,I + wCAPS,NNP
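With a weight map defaulting to zero, each best_score line above is just the sum of the weights of the features that fire; a small hypothetical sketch (maximizing the score, in line with the argmax decoding rule on a later slide):

from collections import defaultdict

w = defaultdict(float)                    # hypothetical learned weights; unseen features score 0.0

def score_first(tag, word):
    # Score for starting the sequence with `tag` emitting `word`.
    s = w[("T", "<S>", tag)] + w[("E", tag, word)]
    if word[0].isupper():
        s += w[("CAPS", tag)]             # the extra capitalization feature added on this slide
    return s

best_tag = max(["NN", "JJ", "VB", "PRP", "NNP"], key=lambda t: score_first(t, "I"))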

SLIDE 56

Learning in the Structured Perceptron

  • Remember the perceptron algorithm
  • If there is a mistake:
  • Update weights to:

– increase the score of positive examples
– decrease the score of negative examples

  • What is positive/negative in structured

perceptron?

w ← w + y φ(x)

SLIDE 57

Learning in the Structured Perceptron

  • Positive example, correct feature vector:
  • Negative example, incorrect feature vector:

φ( I visited Nara → PRP VBD NNP )      (correct)

φ( I visited Nara → NNP VBD NNP )      (incorrect)

SLIDE 58

Choosing an Incorrect Feature Vector

  • There are many incorrect feature vectors!

φ( I visited Nara → NNP VBD NNP )

φ( I visited Nara → PRP VBD NN )

φ( I visited Nara → PRP VB NNP )

SLIDE 59

Choosing an Incorrect Feature Vector

  • Answer: We update using the incorrect

answer with the highest score

  • Our update rule becomes:
  • Y' is the correct answer
  • Note: If highest scoring answer is correct, no change

Ŷ = argmaxY ∑j wj φj(X, Y)

w ← w + φ(X, Y′) − φ(X, Ŷ)
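Concretely, for the SLIDE 49 example: features that fire only in the correct tagging (e.g., φT,<S>,PRP and φE,PRP,”I”) get their weights increased, features that fire only in the highest-scoring incorrect tagging (e.g., φT,<S>,NNP and φE,NNP,”I”) get their weights decreased, and features shared by both taggings (e.g., φE,VBD,”visited”) cancel out and are left unchanged.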

SLIDE 60

Structured Perceptron Algorithm

  • create map w

for I iterations
    for each labeled pair X, Y_prime in the data
        Y_hat = hmm_viterbi(w, X)
        phi_prime = create_features(X, Y_prime)
        phi_hat = create_features(X, Y_hat)
        w += phi_prime - phi_hat
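A runnable version of this loop, reusing the hypothetical create_features sketch above and decoding with transition and emission feature weights only (all helper names are assumptions, not code from the lecture):

from collections import defaultdict

def hmm_viterbi(w, words, tagset):
    # Best tag sequence under the current feature weights (T and E features only, for brevity).
    best = [dict() for _ in words]
    back = [dict() for _ in words]
    for tag in tagset:
        best[0][tag] = w[("T", "<S>", tag)] + w[("E", tag, words[0])]
    for i in range(1, len(words)):
        for tag in tagset:
            scores = {p: best[i - 1][p] + w[("T", p, tag)] + w[("E", tag, words[i])] for p in tagset}
            back[i][tag] = max(scores, key=scores.get)
            best[i][tag] = scores[back[i][tag]]
    final = {t: best[-1][t] + w[("T", t, "</S>")] for t in tagset}
    tags = [max(final, key=final.get)]
    for i in range(len(words) - 1, 0, -1):        # follow backpointers
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

def train_structured_perceptron(data, tagset, iterations=5):
    # data: list of (words, gold_tags) pairs; returns the learned weight map.
    w = defaultdict(float)
    for _ in range(iterations):
        for words, y_prime in data:
            y_hat = hmm_viterbi(w, words, tagset)
            phi_prime = create_features(words, y_prime)
            phi_hat = create_features(words, y_hat)
            for f in set(phi_prime) | set(phi_hat):
                w[f] += phi_prime[f] - phi_hat[f]   # no-op when y_hat == y_prime
    return w

Scoring the extra CAPS and SUF features during decoding would follow the same pattern as on SLIDE 55.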

SLIDE 61

Recap: POS tagging…

  • A structured prediction task

– Hidden Markov Models

  • Decoding with Viterbi algorithm
  • Supervised training: counts-based estimates

– Structured Perceptron

  • Decoding with Viterbi algorithm
  • Supervised training: perceptron algorithm with

structured weight updates

  • First step toward modeling syntactic structure