  1. Part-of-Speech Tagging: HMM & structured perceptron CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu

  2. Last time… • What are parts of speech (POS)? – Equivalence classes or categories of words – Open class vs. closed class – Nouns, Verbs, Adjectives, Adverbs (English) • What is POS tagging? – Assigning POS tags to words in context – Penn Treebank • How to POS tag text automatically? – Multiclass classification vs. sequence labeling

  3. Today • 2 approaches to POS tagging – Hidden Markov Models – Structured Perceptron

  4. Hidden Markov Models • Common approach to sequence labeling • A finite state machine with probabilistic transitions • Markov Assumption – the next state depends only on the current state and is independent of the earlier history

  5. HMM: Formal Specification • Q: a finite set of N states – Q = {q0, q1, q2, q3, …} • N × N transition probability matrix A = [aij] (this is where the Markov assumption lives) – aij = P(qj | qi), Σj aij = 1 ∀i • Sequence of observations O = o1, o2, ... oT – each drawn from a given set of symbols (vocabulary V) • N × |V| emission probability matrix B = [bit] – bit = bi(ot) = P(ot | qi), Σo bi(o) = 1 ∀i • Start and end states – an explicit start state q0, or alternatively a prior distribution over start states: {π1, π2, π3, …}, Σ πi = 1 – the set of final states: qF
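To make the specification concrete, here is a minimal Python/numpy sketch of one way to represent the parameters (π, A, B). The state names, the priors, and the ↑-column emissions are taken from the worked stock-market example later in the deck; the remaining transition and emission values are made-up placeholders, not the actual numbers behind the slides.

```python
import numpy as np

# Illustrative HMM parameters (only partially taken from the slides' worked example).
states = ["Bull", "Bear", "Static"]   # Q (hidden states)
vocab = ["up", "down", "flat"]        # V (observation symbols: ↑ ↓ ↔)

# Prior distribution over start states (from the worked initialization: 0.2, 0.5, 0.3).
pi = np.array([0.2, 0.5, 0.3])

# A[i, j] = P(q_j | q_i): N x N transition matrix, rows sum to 1.
# Only a_BullBull = 0.6 appears on the slides; the rest are placeholders.
A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# B[i, k] = P(o_k | q_i): N x |V| emission matrix, rows sum to 1.
# The ↑ column (0.7, 0.1, 0.3) and b_Bull(↓) = 0.1 come from the slides; the rest are placeholders.
B = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

assert np.allclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```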

  6. Stock Market HMM ✓ States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors? [Figure: stock-market state diagram with priors π1 = 0.5, π2 = 0.2, π3 = 0.3]

  7. HMMs: Three Problems • Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ) • Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

  8. HMM Problem #1: Likelihood

  9. Computing Likelihood t: 1 2 3 4 5 6 O: ↑ ↓ ↔ ↑ ↓ ↔ Assuming λstock models the stock market (state diagram with priors π1 = 0.5, π2 = 0.2, π3 = 0.3), how likely are we to observe this sequence of outputs?

  10. Computing Likelihood • First try: – Sum over all possible ways in which we could generate O from λ – Takes O(N^T) time to compute!
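As a sanity check on the O(N^T) claim, here is a hedged sketch of that brute-force computation: it literally enumerates every hidden state sequence and sums the joint probabilities. It assumes pi, A, B are numpy arrays like the toy ones sketched above, and obs is a list of observation indices.

```python
import itertools
import numpy as np

def brute_force_likelihood(obs, pi, A, B):
    """P(O | lambda) by summing over all N**T hidden state sequences.

    Exponential in T, so only feasible for toy inputs.
    """
    N, T = len(pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(N), repeat=T):   # all N**T state sequences
        p = pi[seq[0]] * B[seq[0], obs[0]]               # start in seq[0], emit o_1
        for t in range(1, T):
            p *= A[seq[t - 1], seq[t]] * B[seq[t], obs[t]]
        total += p
    return total

# e.g. brute_force_likelihood([0, 1, 0], pi, A, B) with the toy arrays sketched earlier
```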

  11. Forward Algorithm • Use an N × T trellis or chart [αtj] • Forward probabilities: αtj or αt(j) = P(being in state j after seeing t observations) = P(o1, o2, ... ot, qt = j) • Each cell = ∑ extensions of all paths from other cells: αt(j) = ∑i αt-1(i) aij bj(ot) – αt-1(i): forward path probability until (t-1) – aij: transition probability of going from state i to j – bj(ot): probability of emitting symbol ot in state j • P(O | λ) = ∑i αT(i)

  12. Forward Algorithm: Formal Definition • Initialization: α1(j) = πj bj(o1) • Recursion: αt(j) = ∑i αt-1(i) aij bj(ot), for 1 < t ≤ T • Termination: P(O | λ) = ∑i αT(i)
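A compact numpy sketch of this recursion (illustrative, not the course's reference implementation): obs holds observation indices and the returned trellis has one column per time step.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: returns the N x T trellis alpha and P(O | lambda).

    Runs in O(N^2 T) time instead of the O(N^T) brute force.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        # each cell sums over extensions of all paths ending in the previous column
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[:, -1].sum()                     # termination: sum of the last column
```

With the toy arrays sketched earlier and O = ↑ ↓ ↑ encoded as [0, 1, 0], the first trellis column comes out as 0.14, 0.05, 0.09, matching the initialization slide below; later columns depend on the made-up transition values, so they will not reproduce the slides' numbers.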

  13. Forward Algorithm O = ↑ ↓ ↑ find P(O | λstock)

  14. Forward Algorithm [Figure: empty trellis with states Static, Bear, Bull as rows and observations ↑ ↓ ↑ at t=1, t=2, t=3 as columns]

  15. Forward Algorithm: Initialization α1(Static) = 0.3 × 0.3 = 0.09, α1(Bear) = 0.5 × 0.1 = 0.05, α1(Bull) = 0.2 × 0.7 = 0.14 [trellis column t=1 filled in; observations ↑ ↓ ↑ at t=1, 2, 3]

  16. Forward Algorithm: Recursion α2(Bull) = ∑i α1(i) ai,Bull bBull(↓) = 0.0145; e.g. the Bull → Bull term is α1(Bull) × aBullBull × bBull(↓) = 0.14 × 0.6 × 0.1 = 0.0084 ... and so on [trellis with t=1 column α1(Static) = 0.09, α1(Bear) = 0.05, α1(Bull) = 0.14]

  17. Forward Algorithm: Recursion Work through the rest of these numbers… [trellis: t=1 column 0.09, 0.05, 0.14; α2(Bull) = 0.0145; remaining cells marked ?] What’s the asymptotic complexity of this algorithm?

  18. HMM Problem #2: Decoding

  19. Decoding t: 1 2 3 4 5 6 O: ↑ ↓ ↔ ↑ ↓ ↔ Given λstock as our model (priors π1 = 0.5, π2 = 0.2, π3 = 0.3) and O as our observations, what are the most likely states the market went through to produce O?

  20. Decoding • “Decoding” because the states are hidden • First try: – Compute the joint probability P(O, Q) for every possible state sequence Q, then choose the sequence with the highest probability
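As a hedged sketch (same assumed numpy parameter arrays as before), this "first try" is just an argmax over all N^T state sequences:

```python
import itertools
import numpy as np

def brute_force_decode(obs, pi, A, B):
    """First-try decoder: score every possible state sequence, keep the best.

    Exponential in T, so only usable on toy inputs; the Viterbi algorithm on
    the next slides performs the same maximization in polynomial time.
    """
    N, T = len(pi), len(obs)

    def joint(seq):                                  # P(O, Q | lambda) for one sequence Q
        p = pi[seq[0]] * B[seq[0], obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1], seq[t]] * B[seq[t], obs[t]]
        return p

    return max(itertools.product(range(N), repeat=T), key=joint)
```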

  21. Viterbi Algorithm • “Decoding” = computing most likely state sequence – Another dynamic programming algorithm – Efficient: polynomial vs. exponential (brute force) • Same idea as the forward algorithm – Store intermediate computation results in a trellis – Build new cells from existing cells

  22. Viterbi Algorithm • Use an N × T trellis [vtj] – just like in the forward algorithm • vtj or vt(j) = P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q1, q2, ... qt-1, qt = j, o1, o2, ... ot) • Each cell = extension of the most likely path from other cells: vt(j) = maxi vt-1(i) aij bj(ot) – vt-1(i): Viterbi probability until (t-1) – aij: transition probability of going from state i to j – bj(ot): probability of emitting symbol ot in state j • P* = maxi vT(i)

  23. Viterbi vs. Forward • Maximization instead of summation over previous paths • This algorithm is still missing something! – In the forward algorithm, we only care about the probabilities – What’s different here? • We need to store the most likely path (transition): – Use “backpointers” to keep track of the most likely transition – At the end, follow the chain of backpointers to recover the most likely state sequence

  24. Viterbi Algorithm: Formal Definition • Initialization: v1(j) = πj bj(o1) • Recursion: vt(j) = maxi vt-1(i) aij bj(ot), with a backpointer btt(j) = argmaxi vt-1(i) aij recording the best predecessor • Termination: P* = maxi vT(i); q*T = argmaxi vT(i), then follow the backpointers to recover the most likely state sequence
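An illustrative numpy sketch of Viterbi with backpointers, mirroring the forward sketch above but with max/argmax in place of the sum; the backtrace at the end follows the stored backpointers from the best final state.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding: most likely hidden state sequence and its probability."""
    N, T = len(pi), len(obs)
    v = np.zeros((N, T))                      # Viterbi probabilities
    bp = np.zeros((N, T), dtype=int)          # backpointers
    v[:, 0] = pi * B[:, obs[0]]               # initialization
    for t in range(1, T):
        scores = v[:, t - 1, None] * A        # scores[i, j] = v_{t-1}(i) * a_ij
        bp[:, t] = scores.argmax(axis=0)      # best predecessor for each state j
        v[:, t] = scores.max(axis=0) * B[:, obs[t]]
    # termination: best final state, then follow backpointers right to left
    best_path = [int(v[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        best_path.append(int(bp[best_path[-1], t]))
    return list(reversed(best_path)), v[:, -1].max()
```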

  25. Viterbi Algorithm O = ↑ ↓ ↑ find the most likely state sequence given λstock

  26. Viterbi Algorithm [Figure: empty trellis with states Static, Bear, Bull as rows and observations ↑ ↓ ↑ at t=1, t=2, t=3 as columns]

  27. Viterbi Algorithm: Initialization v1(Static) = 0.3 × 0.3 = 0.09, v1(Bear) = 0.5 × 0.1 = 0.05, v1(Bull) = 0.2 × 0.7 = 0.14 (identical to the forward initialization at t = 1)

  28. Viterbi Algorithm: Recursion v2(Bull) = maxi v1(i) ai,Bull bBull(↓) = 0.0084; the winning term is v1(Bull) × aBullBull × bBull(↓) = 0.14 × 0.6 × 0.1 = 0.0084 [trellis with t=1 column v1(Static) = 0.09, v1(Bear) = 0.05, v1(Bull) = 0.14]

  29. Viterbi Algorithm: Recursion ... and so on – store a backpointer for each cell [trellis with t=1 column 0.09, 0.05, 0.14 and v2(Bull) = 0.0084]

  30. Viterbi Algorithm: Recursion Work through the rest of the algorithm… [trellis: t=1 column 0.09, 0.05, 0.14; v2(Bull) = 0.0084; remaining cells marked ?]

  31. POS Tagging with HMMs

  32. HMM for POS tagging: intuition (Credit: Jordan Boyd-Graber)

  33. HMM for POS tagging: intuition (Credit: Jordan Boyd-Graber)

  34. HMMs: Three Problems • Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ) • Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

  35. HMM Problem #3: Learning

  36. Learning HMMs for POS tagging is a supervised task • A POS tagged corpus tells us the hidden states! • We can compute Maximum Likelihood Estimates (MLEs) for the various parameters – MLE = fancy way of saying “count and divide” • These parameter estimates maximize the likelihood of the data being generated by the model

  37. Supervised Training • Transition probabilities – Any P(ti | ti-1) = C(ti-1, ti) / C(ti-1), from the tagged data – Example: for P(NN | VB) • count how many times a noun follows a verb • divide by the total number of times you see a verb

  38. Supervised Training • Emission probabilities – Any P(wi | ti) = C(wi, ti) / C(ti), from the tagged data – For P(bank | NN) • count how many times bank is tagged as a noun • divide by how many times anything is tagged as a noun

  39. Supervised Training • Priors – Any P(q1 = ti) = πi = C(ti) / N, from the tagged data – For πNN, count the number of times NN occurs and divide by the total number of tags (states)
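The three estimates above amount to counting and dividing over a tagged corpus. Here is a small illustrative sketch in plain Python (no smoothing; the function name and dict-based output are my own, not from the course). The prior follows slide 39 and uses overall tag frequency.

```python
from collections import Counter

def train_hmm_mle(tagged_sentences):
    """Count-and-divide MLE estimates from a POS-tagged corpus.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns transition, emission, and prior probabilities as plain dicts.
    Unseen events get probability zero (no smoothing).
    """
    trans, emit, tag_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        for i, (word, tag) in enumerate(sent):
            tag_counts[tag] += 1
            emit[(tag, word)] += 1                     # C(w_i, t_i)
            if i > 0:
                trans[(sent[i - 1][1], tag)] += 1      # C(t_{i-1}, t_i)

    total_tags = sum(tag_counts.values())
    A = {(t1, t2): c / tag_counts[t1] for (t1, t2), c in trans.items()}   # P(t_i | t_{i-1})
    B = {(t, w): c / tag_counts[t] for (t, w), c in emit.items()}         # P(w_i | t_i)
    pi = {t: c / total_tags for t, c in tag_counts.items()}               # prior, as on slide 39
    return A, B, pi
```

For example, train_hmm_mle([[("the", "DT"), ("bank", "NN")]]) gives P(bank | NN) = 1.0 and P(NN | DT) = 1.0 from a single two-word sentence.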

  40. HMMs: Three Problems • Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ) • Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

  41. Prediction Problems • Given x, predict y – Binary prediction (2 choices): a book review → is it positive? e.g. “Oh, man I love this book!” → yes; “This book is so boring...” → no – Multi-class prediction (several choices): a tweet → its language, e.g. “On the way to the park!” → English; “公園に行くなう!” → Japanese – Structured prediction (millions of choices): a sentence → its syntactic parse, e.g. “I read a book” → (S (NP (N I)) (VP (VBD read) (NP (DET a) (NN book))))
