SLIDE 1
Yulia Tsvetkov
Algorithms for NLP
CS 11-711, Fall 2019
Lecture 7: HMMs, POS tagging
SLIDE 2
Readings for today's lecture
▪ J&M SLP3: https://web.stanford.edu/~jurafsky/slp3/8.pdf
▪ Collins (2011): http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
SLIDE 3
Levels of linguistic knowledge
Slide credit: Noah Smith
SLIDE 4
▪ Map a sequence of words to a sequence of labels
▪ Part-of-speech tagging (Church, 1988; Brants, 2000)
▪ Named entity recognition (Bikel et al., 1999)
▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995)
▪ Word alignment of parallel text (Vogel et al., 1996)
▪ Compression (Conroy and O'Leary, 2001)
▪ Acoustic models, discourse segmentation, etc.
Sequence Labeling
SLIDE 5
Sequence labeling as classification
SLIDE 6
Generative sequence labeling: Hidden Markov Models
SLIDE 7 Markov Chain: weather
the future is independent of the past given the present
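The Markov property can be sketched directly in code: the next state is drawn from a distribution that depends only on the current state, never on earlier history. The transition probabilities below are toy numbers for illustration, not taken from the slides.

```python
import random

# Toy weather transition probabilities (illustrative numbers only)
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(start, n, seed=0):
    """Sample n transitions; each next state depends only on the current one."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(n):
        nxt_states, probs = zip(*TRANSITIONS[state].items())
        state = rng.choices(nxt_states, weights=probs)[0]
        path.append(state)
    return path
```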
SLIDE 8
Markov Chain
SLIDE 9 Markov Chain: words
the future is independent of the past given the present
SLIDE 10
▪ In the real world, many events are not directly observable
▪ Speech recognition: we observe acoustic features but not the phones
▪ POS tagging: we observe words but not the POS tags
Hidden Markov Models
[Diagram: hidden state sequence q1 → q2 → ... → qn]
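Concretely, an HMM is specified by an initial distribution, tag-to-tag transition probabilities, and tag-to-word emission probabilities. Sampling from it makes the "hidden" part explicit: the tags form a Markov chain, but we only observe the emitted words. All probabilities below are toy numbers for illustration.

```python
import random

# Toy HMM parameters (illustrative numbers only):
#   pi - initial tag probabilities
#   A  - tag-to-tag transition probabilities
#   B  - tag-to-word emission probabilities
pi = {"DET": 0.7, "NOUN": 0.3}
A = {"DET": {"DET": 0.1, "NOUN": 0.9},
     "NOUN": {"DET": 0.6, "NOUN": 0.4}}
B = {"DET": {"the": 0.8, "a": 0.2},
     "NOUN": {"dog": 0.5, "walk": 0.5}}

def generate(n, seed=0):
    """Generate n (tag, word) pairs: tags follow a Markov chain,
    and each word is emitted conditioned on its tag alone."""
    rng = random.Random(seed)
    def draw(dist):
        keys, probs = zip(*dist.items())
        return rng.choices(keys, weights=probs)[0]
    tag = draw(pi)
    pairs = []
    for _ in range(n):
        pairs.append((tag, draw(B[tag])))
        tag = draw(A[tag])
    return pairs
```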
SLIDE 12 HMM example
From J&M
SLIDE 13 Generative vs. Discriminative models
▪ Generative models specify a joint distribution over the labels and the data. With this you could generate new data
▪ Discriminative models specify the conditional distribution of the label y given the data x. These models focus on how to discriminate between the classes
From Bamman
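In symbols (a standard formulation, not taken from the slide):

```latex
% Generative: model the joint distribution, then decode via Bayes' rule
P(x, y) = P(y)\,P(x \mid y), \qquad
\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(y)\,P(x \mid y)

% Discriminative: model the conditional distribution directly
\hat{y} = \arg\max_y P(y \mid x)
```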
SLIDE 14 Types of HMMs
▪ + many more
From J&M
SLIDE 15
HMM in Language Technologies
▪ Part-of-speech tagging (Church, 1988; Brants, 2000) ▪ Named entity recognition (Bikel et al., 1999) and other information extraction tasks ▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995) ▪ Word alignment of parallel text (Vogel et al., 1996) ▪ Acoustic models in speech recognition (emissions are continuous) ▪ Discourse segmentation (labeling parts of a document)
SLIDE 16 HMM Parameters
From J&M
SLIDE 17 HMMs: Questions
From J&M
SLIDE 18 HMMs: Algorithms
From J&M
▪ Forward
▪ Viterbi
▪ Forward–Backward (Baum–Welch)
SLIDE 19
HMM tagging as decoding
SLIDE 20
HMM tagging as decoding
SLIDE 21
HMM tagging as decoding
SLIDE 22
HMM tagging as decoding
SLIDE 23
HMM tagging as decoding
SLIDE 24
HMM tagging as decoding
SLIDE 25 HMM tagging as decoding
How many possible choices?
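For the question above: with T candidate tags per word and an n-word sentence, there are T^n possible tag sequences, so exhaustive enumeration is hopeless. A quick sanity check, assuming the 45-tag Penn Treebank tagset:

```python
# T**n candidate tag sequences for T tags and n words
T, n = 45, 10              # Penn Treebank tagset size, a 10-word sentence
print(T ** n)              # ~3.4e16 sequences -- far too many to enumerate
```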
SLIDE 26
Part of speech tagging example
Slide credit: Noah Smith
SLIDE 27 Part of speech tagging example
Slide credit: Noah Smith
Greedy decoding?
SLIDE 28 Part of speech tagging example
Slide credit: Noah Smith
Greedy decoding? Consider: “the old dog the footsteps of the young”
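A greedy decoder commits to the locally best tag at each position and never revises; the garden-path sentence above shows the danger, since an early commitment ("old" as adjective, "dog" as noun) cannot be undone. A minimal sketch on a toy two-state HMM (all probabilities are illustrative, not from the slides):

```python
# Toy two-state HMM (illustrative numbers): states H/C, observations 1-3
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def greedy_decode(obs):
    """Pick the locally best state at each step; earlier choices are final."""
    scores = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    tag = max(scores, key=scores.get)
    tags = [tag]
    for o in obs[1:]:
        scores = {s: trans_p[tag][s] * emit_p[s][o] for s in states}
        tag = max(scores, key=scores.get)
        tags.append(tag)
    return tags

# On [3, 1, 3] greedy picks H, C, H, while exact search prefers H, H, H:
# a locally attractive second step leads the decoder down the wrong path.
```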
SLIDE 29
The Viterbi Algorithm
SLIDE 30
The Viterbi Algorithm
SLIDE 31
The Viterbi Algorithm
SLIDE 32
The Viterbi Algorithm
SLIDE 33
The Viterbi Algorithm
Complexity?
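Answering the complexity question: the two nested loops over positions and states give O(n·T²) time, versus O(T^n) for exhaustive search. A minimal log-space Viterbi sketch on a toy two-state HMM (all probabilities are illustrative, not from the slides):

```python
import math

# Toy two-state HMM (illustrative numbers): states H/C, observations 1-3
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Most likely state sequence; O(n * T^2) time, O(n * T) space."""
    # V[t][s]: log-probability of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: best predecessor of state s at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + math.log(trans_p[r][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Follow back-pointers from the best final state
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```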
SLIDE 34
Beam search
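A sketch of beam-search decoding on the same kind of toy HMM (illustrative probabilities, not from the slides): keep only the `beam` highest-scoring partial tag sequences at each step. A beam of 1 reduces to greedy decoding; wider beams trade speed for search quality.

```python
import math

# Toy two-state HMM (illustrative numbers): states H/C, observations 1-3
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def beam_decode(obs, beam=2):
    """Keep only the `beam` best (log-score, sequence) hypotheses per step."""
    hyps = [(math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
            for s in states]
    hyps = sorted(hyps, key=lambda h: h[0], reverse=True)[:beam]
    for o in obs[1:]:
        expanded = [
            (score + math.log(trans_p[seq[-1]][s]) + math.log(emit_p[s][o]),
             seq + [s])
            for score, seq in hyps
            for s in states
        ]
        hyps = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam]
    return hyps[0][1]
```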
SLIDE 35
Viterbi
▪ n-best decoding
▪ relationship to sequence alignment
SLIDE 36 HMMs: Algorithms
From J&M
▪ Forward
▪ Viterbi
▪ Forward–Backward (Baum–Welch)
SLIDE 37
The Forward Algorithm
sum instead of max
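The forward algorithm uses exactly the Viterbi recurrence with the max replaced by a sum, so it returns the total likelihood of the observations (summed over all state sequences) rather than the single best path. A sketch on a toy two-state HMM (illustrative probabilities, not from the slides):

```python
# Toy two-state HMM (illustrative numbers): states H/C, observations 1-3
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Total probability of obs, summed over all state sequences: O(n * T^2)."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())
```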
SLIDE 38
Parts of Speech
SLIDE 39
The closed classes
SLIDE 40
More Fine-Grained Classes
SLIDE 41
More Fine-Grained Classes
SLIDE 42
The Penn Treebank Part-of-Speech Tagset
SLIDE 43 The Universal POS tagset
https://universaldependencies.org
SLIDE 44
POS tagging
SLIDE 45
POS tagging
goal: resolve POS ambiguities
SLIDE 46
POS tagging
SLIDE 47 Most Frequent Class Baseline
Training on the WSJ corpus and testing on sections 22–24 of the same corpus, the most-frequent-tag baseline achieves an accuracy of ...
SLIDE 48 Most Frequent Class Baseline
Training on the WSJ corpus and testing on sections 22–24 of the same corpus, the most-frequent-tag baseline achieves an accuracy of 92.34%.
▪ 97% tag accuracy is achievable by most algorithms
(HMMs, MEMMs, neural networks, rule-based algorithms)
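The baseline above is simple to implement: tag each known word with the tag it received most often in training, and back off to the overall most frequent tag for unknown words. A sketch (the toy corpus and all names are my own, for illustration):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Most-frequent-class baseline: per-word most frequent training tag,
    with the corpus-wide most frequent tag as the unknown-word fallback."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    table = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return lambda words: [table.get(w, default) for w in words]

# Tiny illustrative corpus of (word, tag) sentences
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("the", "DET")],
]
tagger = train_baseline(corpus)
```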