Algorithms for NLP, CS 11711, Fall 2019. Lecture 7: HMMs, POS tagging





SLIDE 1

Yulia Tsvetkov

Algorithms for NLP

CS 11711, Fall 2019

Lecture 7: HMMs, POS tagging

SLIDE 2

▪ J&M SLP3: https://web.stanford.edu/~jurafsky/slp3/8.pdf
▪ Collins (2011): http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf

Readings for today’s lecture

SLIDE 3

Levels of linguistic knowledge

Slide credit: Noah Smith

SLIDE 4

▪ map a sequence of words to a sequence of labels
▪ Part-of-speech tagging (Church, 1988; Brants, 2000)
▪ Named entity recognition (Bikel et al., 1999)
▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995)
▪ Word alignment of parallel text (Vogel et al., 1996)
▪ Compression (Conroy and O’Leary, 2001)
▪ Acoustic models, discourse segmentation, etc.

Sequence Labeling

SLIDE 5

Sequence labeling as classification

SLIDE 6

Generative sequence labeling: Hidden Markov Models

SLIDE 7

Markov Chain: weather

the future is independent of the past given the present

SLIDE 8

Markov Chain

SLIDE 9

Markov Chain: words

the future is independent of the past given the present
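
The Markov property above can be sketched in code: the next state is sampled from a distribution conditioned only on the current state, never on earlier history. The transition table below is illustrative, not from the slides.

```python
import random

# Hypothetical transition probabilities between weather states
# (the numbers are made up for illustration).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(start, steps, rng=random.Random(0)):
    """Sample a state sequence: each next state depends only on the current one."""
    state, path = start, [start]
    for _ in range(steps):
        r, acc = rng.random(), 0.0
        for nxt, p in TRANSITIONS[state].items():
            acc += p
            if r < acc:
                state = nxt
                break
        path.append(state)
    return path
```

The same sampler works unchanged for a Markov chain over words: replace the weather states with word types.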

SLIDE 10

▪ In the real world, many events are not observable
▪ Speech recognition: we observe acoustic features but not the phones
▪ POS tagging: we observe words but not the POS tags

Hidden Markov Models

[Figure: chain of hidden states q1, q2, ..., qn]

SLIDE 11

HMM

From J&M

SLIDE 12

HMM example

From J&M

SLIDE 13

Generative vs. Discriminative models

▪ Generative models specify a joint distribution over the labels and the data; with this you could generate new data
▪ Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes

From Bamman

SLIDE 14

Types of HMMs

▪ + many more

From J&M

SLIDE 15

HMM in Language Technologies

▪ Part-of-speech tagging (Church, 1988; Brants, 2000)
▪ Named entity recognition (Bikel et al., 1999) and other information extraction tasks
▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995)
▪ Word alignment of parallel text (Vogel et al., 1996)
▪ Acoustic models in speech recognition (emissions are continuous)
▪ Discourse segmentation (labeling parts of a document)

SLIDE 16

HMM Parameters

From J&M
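
In J&M’s notation an HMM is specified by an initial distribution π, transition probabilities A, and emission probabilities B. A minimal sketch of these parameters for a two-tag toy model; all numbers are made up for illustration:

```python
# Toy HMM parameters in the J&M notation (illustrative numbers).
STATES = ["DT", "NN"]        # hidden states (POS tags)
VOCAB = ["the", "dog"]       # observable words

PI = {"DT": 0.9, "NN": 0.1}                 # pi: P(first tag)
A = {"DT": {"DT": 0.1, "NN": 0.9},          # A: P(tag_t | tag_{t-1})
     "NN": {"DT": 0.5, "NN": 0.5}}
B = {"DT": {"the": 0.99, "dog": 0.01},      # B: P(word | tag)
     "NN": {"the": 0.01, "dog": 0.99}}

# Sanity check: every distribution must sum to one.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in B.values())
```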

SLIDE 17

HMMs: Questions

From J&M

SLIDE 18

HMMs: Algorithms

From J&M
▪ Likelihood: the Forward algorithm
▪ Decoding: the Viterbi algorithm
▪ Learning: Forward–Backward (Baum–Welch)

SLIDE 19

HMM tagging as decoding

SLIDE 20

HMM tagging as decoding

SLIDE 21

HMM tagging as decoding

SLIDE 22

HMM tagging as decoding

SLIDE 23

HMM tagging as decoding

SLIDE 24

HMM tagging as decoding

SLIDE 25

HMM tagging as decoding

How many possible choices?
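
For a tagset of N tags and a sentence of T words there are N**T candidate tag sequences, so scoring them all is infeasible. A quick sanity check:

```python
# With N tags and a sentence of T words there are N**T candidate tag
# sequences, so brute-force scoring grows exponentially with length.
def num_sequences(num_tags, length):
    return num_tags ** length

# e.g. the 45-tag Penn Treebank tagset on a 10-word sentence:
print(num_sequences(45, 10))  # 34050628916015625
```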

SLIDE 26

Part of speech tagging example

Slide credit: Noah Smith

SLIDE 27

Part of speech tagging example

Slide credit: Noah Smith

Greedy decoding?

SLIDE 28

Part of speech tagging example

Slide credit: Noah Smith

Greedy decoding? Consider: “the old dog the footsteps of the young” (here “old” is a noun and “dog” is a verb, so locally best early choices go wrong)

SLIDE 29

The Viterbi Algorithm

SLIDE 30

The Viterbi Algorithm

SLIDE 31

The Viterbi Algorithm

SLIDE 32

The Viterbi Algorithm

SLIDE 33

The Viterbi Algorithm

Complexity?
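
A minimal Viterbi sketch over dict-based toy parameters (the numbers are illustrative, not from the slides). Dynamic programming keeps one best score per state per position, so the runtime is O(N²T) for N states and T words, instead of the O(Nᵀ) of exhaustive search:

```python
def viterbi(obs, states, pi, A, B):
    """Best state sequence for `obs`; runtime O(len(states)**2 * len(obs))."""
    # v[t][s]: probability of the best path that ends in state s at time t
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = []
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # take the max over predecessors (the forward algorithm sums here)
            best = max(states, key=lambda r: v[t - 1][r] * A[r][s])
            back[-1][s] = best
            v[-1][s] = v[t - 1][best] * A[best][s] * B[s][obs[t]]
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for bp in reversed(back):   # follow backpointers right to left
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy parameters (illustrative numbers)
STATES = ["DT", "NN"]
PI = {"DT": 0.9, "NN": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.5, "NN": 0.5}}
B = {"DT": {"the": 0.99, "dog": 0.01}, "NN": {"the": 0.01, "dog": 0.99}}
```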

SLIDE 34

Beam search
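
Beam search can be sketched as follows, under the same toy-parameter assumptions as above: instead of one best score per state, keep only the top-k partial hypotheses overall at each step, trading exactness for speed:

```python
import math

def beam_search(obs, states, pi, A, B, beam=2):
    """Approximate decoding: keep only the `beam` best partial tag
    sequences at each step (exact Viterbi keeps one score per state)."""
    # Each hypothesis: (log-probability, tag sequence so far)
    hyps = [(math.log(pi[s] * B[s][obs[0]]), [s])
            for s in states if pi[s] * B[s][obs[0]] > 0]
    hyps = sorted(hyps, reverse=True)[:beam]
    for w in obs[1:]:
        expanded = [(lp + math.log(A[seq[-1]][s] * B[s][w]), seq + [s])
                    for lp, seq in hyps for s in states
                    if A[seq[-1]][s] * B[s][w] > 0]
        hyps = sorted(expanded, reverse=True)[:beam]
    return hyps[0][1]   # tag sequence of the best surviving hypothesis

# Toy parameters (illustrative numbers)
STATES = ["DT", "NN"]
PI = {"DT": 0.9, "NN": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.5, "NN": 0.5}}
B = {"DT": {"the": 0.99, "dog": 0.01}, "NN": {"the": 0.01, "dog": 0.99}}
```

With beam=1 this degenerates to greedy decoding; with a beam at least as large as the state set it recovers the exact Viterbi answer on this toy model.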

SLIDE 35

Viterbi

▪ n-best decoding
▪ relationship to sequence alignment

SLIDE 36

HMMs: Algorithms

From J&M
▪ Likelihood: the Forward algorithm
▪ Decoding: the Viterbi algorithm
▪ Learning: Forward–Backward (Baum–Welch)

SLIDE 37

The Forward Algorithm

sum instead of max
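
The forward algorithm is the same recursion as Viterbi with the max over predecessors replaced by a sum; it returns the total probability of the observation sequence under the model. A sketch with illustrative toy parameters:

```python
def forward(obs, states, pi, A, B):
    """Total probability of `obs` under the HMM: sum instead of max."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for w in obs[1:]:
        # sum over all predecessors, where Viterbi takes the max
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][w]
                 for s in states}
    return sum(alpha.values())

# Toy parameters (illustrative numbers)
STATES = ["DT", "NN"]
PI = {"DT": 0.9, "NN": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.5, "NN": 0.5}}
B = {"DT": {"the": 0.99, "dog": 0.01}, "NN": {"the": 0.01, "dog": 0.99}}
```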

SLIDE 38

Parts of Speech

SLIDE 39

The closed classes

SLIDE 40

More Fine-Grained Classes

SLIDE 41

More Fine-Grained Classes

SLIDE 42

The Penn Treebank Part-of-Speech Tagset

SLIDE 43

The Universal POS tagset

https://universaldependencies.org

SLIDE 44

POS tagging

SLIDE 45

POS tagging

goal: resolve POS ambiguities

SLIDE 46

POS tagging

SLIDE 47

Most Frequent Class Baseline

Training on the WSJ training corpus and testing on sections 22–24 of the same corpus, the most-frequent-tag baseline achieves an accuracy of 92.34%.

SLIDE 48

Most Frequent Class Baseline

Training on the WSJ training corpus and testing on sections 22–24 of the same corpus, the most-frequent-tag baseline achieves an accuracy of 92.34%.

▪ ~97% tag accuracy is achievable by most algorithms (HMMs, MEMMs, neural networks, rule-based algorithms)
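
The most-frequent-tag baseline can be sketched as follows: tag each word with its most frequent tag in the training data. Handling of unseen words is not specified on the slides; falling back to the globally most frequent tag is one common assumption:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from (word, tag) training data."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    # Assumed fallback for unseen words: the overall most frequent tag.
    fallback = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, fallback

def tag_baseline(words, lexicon, fallback):
    """Tag every word with its most frequent training tag."""
    return [lexicon.get(w, fallback) for w in words]
```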