SLIDE 1

Language Modeling Recap

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

n-grams = Chain Rule + Backoff (Markov assumption)

SLIDE 3

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)
n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)
n = 4: 4-gram, history size 3, e.g. p(furiously | green ideas sleep)
general n: n-gram, history size n-1, p(w_i | w_{i-n+1} … w_{i-1})

How to (efficiently) compute p(Colorless green ideas sleep furiously)?
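As a sketch of what that question asks for (the <BOS>/<EOS> padding follows the convention used later in these slides): apply the chain rule and then a trigram (history size 2) Markov assumption,

p(Colorless green ideas sleep furiously)
  ≈ p(Colorless | <BOS> <BOS>)
  × p(green | <BOS> Colorless)
  × p(ideas | Colorless green)
  × p(sleep | green ideas)
  × p(furiously | ideas sleep)
  × p(<EOS> | sleep furiously)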

SLIDE 4

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

SLIDE 5

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: Why do we have all these options? Why is MLE not sufficient?

SLIDE 6

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: Why do we have all these options? Why is MLE not sufficient? A: Do we trust our training corpus? (insufficient counts → 0s; corpora have lexical biases; …)
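A minimal sketch of the zero-count problem (toy counts of my own, not from the slides): the MLE estimate is just a ratio of counts, so any trigram unseen in training gets probability exactly 0, which zeroes out every sentence containing it.

    from collections import Counter

    # toy training corpus, already tokenized
    tokens = "the film got a great opening".split()
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def p_mle(w, x, y):
        # MLE: count(x y w) / count(x y); 0 when the trigram (or its history) is unseen
        return trigram_counts[(x, y, w)] / bigram_counts[(x, y)] if bigram_counts[(x, y)] else 0.0

    print(p_mle("got", "the", "film"))   # 1.0 (seen in training)
    print(p_mle("went", "the", "film"))  # 0.0 (unseen: the whole sentence probability becomes 0)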

SLIDE 7

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the parameters we learn?

SLIDE 8

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the parameters we learn? A: The counts or normalized probability values

SLIDE 9

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the hyperparameters?

SLIDE 10

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the hyperparameters? A: Laplace, backoff, KN: the adjustments to counts; Interpolation: the reweighting values
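For concreteness, a sketch of where those hyperparameters sit (standard formulations, not copied from the slides):

Add-λ (Laplace when λ = 1): p(w | h) = (count(h, w) + λ) / (count(h) + λV), where V is the number of word types; λ is the hyperparameter.

Interpolation (trigram example): p(w | x y) = λ3 · p_MLE(w | x y) + λ2 · p_MLE(w | y) + λ1 · p_MLE(w), with λ1 + λ2 + λ3 = 1; the λs are the hyperparameters, tuned on held-out data (next slides).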

SLIDE 11

Evaluation Framework

Training Data: acquire primary statistics for learning model parameters

Dev Data: fine-tune any secondary (hyper)parameters

Test Data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 12

Setting Hyperparameters

Use a development corpus. Choose hyperparameters to maximize the likelihood of the dev data:

  • Fix the N-gram probabilities/counts (on the training data)
  • Search for the λs that give the largest probability to the held-out set (see the sketch below)

Training Data

Dev Data Test Data
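A minimal sketch of that search, assuming bigram and unigram probabilities already estimated on the training data and a single interpolation weight λ; the function and variable names are illustrative, not from the slides:

    import numpy as np

    def dev_log_likelihood(lam, dev_bigrams, p_bigram, p_unigram):
        # interpolated model: lam * p(w | h) + (1 - lam) * p(w)
        # assumes OOV words were already mapped to <UNK>, so p_unigram is never zero
        return sum(np.log(lam * p_bigram.get((h, w), 0.0) + (1 - lam) * p_unigram[w])
                   for (h, w) in dev_bigrams)

    def pick_lambda(dev_bigrams, p_bigram, p_unigram):
        # grid search: keep the lambda that gives the dev set the highest (log-)likelihood
        grid = np.linspace(0.05, 0.95, 19)
        return max(grid, key=lambda lam: dev_log_likelihood(lam, dev_bigrams, p_bigram, p_unigram))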

SLIDE 13

Evaluating Language Models

What is “correct?” What is working “well?”

  • Extrinsic: evaluate the LM in a downstream task
    Test an MT, ASR, etc. system and see which LM does better
    Propagate & conflate errors
  • Intrinsic: treat the LM as its own downstream task
    Use perplexity (from information theory)

SLIDE 14

Perplexity

Lower is better: lower perplexity --> less surprised

perplexity = exp( -(1/N) * Σ_{j=1}^{N} log q(x_j | h_j) )

where h_j is the n-gram history (n-1 items)

SLIDE 15

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use UNK probabilities for any word not in training
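A minimal sketch of this preprocessing, assuming the fixed lexicon is the V most frequent training types (that cutoff choice and the names are mine, not from the slides):

    from collections import Counter

    def build_lexicon(train_tokens, V):
        # 1. fixed lexicon L of size V: here, the V most frequent training word types
        return {w for w, _ in Counter(train_tokens).most_common(V)}

    def unk_replace(tokens, lexicon):
        # 2. change any word not in L to <UNK>; used on the training text, and again at
        #    evaluation time so unseen words pick up the <UNK> probabilities
        return [w if w in lexicon else "<UNK>" for w in tokens]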

SLIDE 16

Implementation: EOS Padding

Create an end of sentence (“chunk”) token <EOS>
Don’t estimate p(<BOS> | <EOS>)

Training & Evaluation:

  • 1. Identify “chunks” that are relevant (sentences, paragraphs, documents)
  • 2. Append the <EOS> token to the end of the chunk

  • 3. Train or evaluate LM as normal
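A minimal sketch of the padding, assuming the chunks are already split out (sentences here); the <BOS> padding mirrors the trigram examples on the next slides:

    def pad_chunk(tokens, n):
        # append one <EOS> to the chunk; prepend n-1 <BOS> markers so the first words
        # have full-length histories; padding each chunk separately means no n-gram
        # ever crosses a chunk boundary
        return ["<BOS>"] * (n - 1) + tokens + ["<EOS>"]

    # trigram (n=3) padding of one sentence-sized chunk
    print(pad_chunk("the film got a great opening".split(), n=3))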


SLIDE 17

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The
The film | film
The film | got | 1
The film | went
The film | OOV
The film | EOS
…
a great | great
a great | opening | 1
a great | and
a great | the
…

Q: With OOV, EOS, and BOS, how many types (for normalization)?

SLIDE 18

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The
The film | film
The film | got | 1
The film | went
The film | OOV
The film | EOS
…
a great | great
a great | opening | 1
a great | and
a great | the
…

Q: With OOV, EOS, and BOS, how many types (for normalization)? A: 16 (why don’t we count BOS?)

SLIDE 19

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The | 0 | 1 | 1/17
The film | film | 0 | 1 | 1/17
The film | got | 1 | 2 | 2/17
The film | went | 0 | 1 | 1/17
The film | OOV | 0 | 1 | 1/17
The film | EOS | 0 | 1 | 1/17
…
a great | great | 0 | 1 | 1/17
a great | opening | 1 | 2 | 2/17
a great | and | 0 | 1 | 1/17
a great | the | 0 | 1 | 1/17
…
(denominator 17 = 1 + 16*1: the context count plus one add-1 pseudo-count for each of the 16 types)

Q: With OOV, EOS, and BOS, how many types (for normalization)? A: 16 (why don’t we count BOS?)

SLIDE 20

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The | 0 | 1 | 1/17
The film | film | 0 | 1 | 1/17
The film | got | 1 | 2 | 2/17
The film | went | 0 | 1 | 1/17
The film | OOV | 0 | 1 | 1/17
The film | EOS | 0 | 1 | 1/17
…
a great | great | 0 | 1 | 1/17
a great | opening | 1 | 2 | 2/17
a great | and | 0 | 1 | 1/17
a great | the | 0 | 1 | 1/17
…

Q: What is the perplexity for the sentence “The film , a hit !”?

SLIDE 21

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram)
<BOS> <BOS> The | 1
<BOS> The film | 1
The film ,
film , a
, a hit
a hit !
hit ! <EOS>

Perplexity: ???

SLIDE 22

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram)
<BOS> <BOS> The | 1
<BOS> The film | 1
The film , | 0
film , a | 0
, a hit | 0
a hit ! | 0
hit ! <EOS> | 0

Perplexity: Infinity

SLIDE 23

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams
<BOS> <BOS> The | 1 | <BOS> <BOS> The
<BOS> The film | 1 | <BOS> The film
The film , | 0 | The film <UNK>
film , a | 0 | film <UNK> a
, a hit | 0 | <UNK> a hit
a hit ! | 0 | a hit <UNK>
hit ! <EOS> | 0 | hit <UNK> <EOS>

Perplexity (MLE): Infinity

SLIDE 24

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams | Smoothed p(trigram)
<BOS> <BOS> The | 1 | <BOS> <BOS> The | 2/17
<BOS> The film | 1 | <BOS> The film | 2/17
The film , | 0 | The film <UNK> | 1/17
film , a | 0 | film <UNK> a | 1/16
, a hit | 0 | <UNK> a hit | 1/16
a hit ! | 0 | a hit <UNK> | 1/17
hit ! <EOS> | 0 | hit <UNK> <EOS> | 1/16

Perplexity (MLE): Infinity | Perplexity (smoothed): ???

SLIDE 25

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams | Smoothed p(trigram)
<BOS> <BOS> The | 1 | <BOS> <BOS> The | 2/17
<BOS> The film | 1 | <BOS> The film | 2/17
The film , | 0 | The film <UNK> | 1/17
film , a | 0 | film <UNK> a | 1/16
, a hit | 0 | <UNK> a hit | 1/16
a hit ! | 0 | a hit <UNK> | 1/17
hit ! <EOS> | 0 | hit <UNK> <EOS> | 1/16

Perplexity (MLE): Infinity | Perplexity (smoothed): 13.59
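As a check on that number (my own arithmetic, using the perplexity definition from the earlier slide): there are 7 smoothed trigram probabilities, so

perplexity = exp( -(1/7) * (2*log(2/17) + 2*log(1/17) + 3*log(1/16)) ) ≈ exp(2.609) ≈ 13.59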

SLIDE 26

How to Compute Perplexity

  • If you have a list of the probabilities for each observed n-gram “token”:

numpy.exp(-numpy.mean(numpy.log(probs_per_trigram_token)))

  • If you have a list of observed n-gram “types” t and their counts c, and a log-probability function lp:

numpy.exp(-sum(c * lp(t) for (t, c) in ngram_types.items()) / sum(ngram_types.values()))
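As a quick sanity check against the 13.59 example above (the probability values are taken from that slide; numpy is assumed imported as np):

    import numpy as np

    # per-token smoothed trigram probabilities for "The film , a hit !"
    probs_per_trigram_token = [2/17, 2/17, 1/17, 1/16, 1/16, 1/17, 1/16]
    print(np.exp(-np.mean(np.log(probs_per_trigram_token))))  # ≈ 13.59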