

SLIDE 1

Language Models

CMPT 413/825: Natural Language Processing

SFU NatLangLab

Fall 2020 (2020-09-11)

Adapted from slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan

1

SLIDE 2
Announcements

  • Sign up on Piazza for announcements, discussion, and course materials:
    piazza.com/sfu.ca/fall2020/cmpt413825
  • Homework 0 is out (due 9/16, 11:59pm)
    • Review problems on probability, linear algebra, and calculus
    • Programming: set up your group, GitHub repo, and the starter problem
      • Choose a unique group name
      • Make sure your Coursys group name and your GitHub repo name match
      • Avoid strange characters in your group name
  • Interactive Tutorial Session
    • 11:50am to 12:20pm (the last 30 minutes of lecture)
    • Optional but recommended review of the math background

2

SLIDE 3

Consider:

"Today, in Vancouver, it is 76 F and red"

vs

"Today, in Vancouver, it is 76 F and sunny"

  • Both are grammatical
  • But which is more likely?

3

SLIDE 4

Language Modeling

  • We want to be able to estimate the probability of a sequence of words
  • How likely is a given phrase / sentence / paragraph / document?

Why is this useful?

4

SLIDE 5

Applications

  • Predicting words is important in many situations
  • Machine translation
  • Speech recognition/Spell checking
  • Information extraction, Question answering

P(a smooth finish) > P(a flat finish)

P(high school principal) > P(high school principle)

5

SLIDE 6

Language models are everywhere

Autocomplete

6

SLIDE 7

Impact on downstream applications

(Miki et al., 2006)

7

SLIDE 8

What is a language model?

Setup: Assume a finite vocabulary of words V, e.g.

V = {killer, crazy, clown}

From V we can construct an infinite set of sentences (sequences of words):

V+ = {clown, killer clown, crazy clown, crazy killer clown, killer crazy clown, …}

A sentence is defined as s ∈ V+, s = (w1, …, wn), where each wi ∈ V. A probabilistic model of a sequence of words assigns a probability to every such sentence.
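As a toy illustration (not from the slides), the shortest members of V+ can be enumerated in a few lines of Python:

```python
from itertools import product

V = ["killer", "crazy", "clown"]

# V+ is infinite; enumerate only its members up to length 2.
for n in range(1, 3):
    for words in product(V, repeat=n):
        print(" ".join(words))
```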

8

SLIDE 9

What is a language model?

Given a training data set of example sentences S = {s1, s2, …, sN}, si ∈ V+, estimate a probability model p such that

∑si∈V+ p(si) = ∑i p(w1, …, wni) = 1.0

This probabilistic model of a sequence of words is a Language Model.

9

SLIDE 10

Learning language models

How to estimate the probability of a sentence?

  • We can directly count using a training data set of sentences:

P(w1, …, wn) = c(w1, …, wn) / N

  • c(⋅) is a function that counts how many times each sentence occurs
  • N is the sum of c(⋅) over all possible sentences
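A direct translation of this estimator into code, as a minimal sketch (the toy corpus is illustrative, not from the slides):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "crazy killer clown",
]

counts = Counter(corpus)   # c(.): how many times each sentence occurs
N = sum(counts.values())   # N: total number of training sentences

def sentence_prob(sentence: str) -> float:
    """MLE estimate: P(w1, ..., wn) = c(w1, ..., wn) / N."""
    return counts[sentence] / N

print(sentence_prob("the cat sat on the mat"))  # 2/3
print(sentence_prob("killer crazy clown"))      # 0.0 for any unseen sentence
```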

10

SLIDE 11

Learning language models

How to estimate the probability of a sentence?

P(w1, …, wn) = c(w1, …, wn) / N

  • Problem: this estimate does not generalize to new sentences unseen in the training data
  • What are the chances you will see a sentence like "crazy killer clown crazy killer"?
  • In NLP applications, we often need to assign non-zero probability to previously unseen sentences

11

SLIDE 12

Estimating joint probabilities with the chain rule

p(w1, w2, …, wn) = p(w1) p(w2|w1) p(w3|w1, w2) × … × p(wn|w1, w2, …, wn−1)

Example sentence: "the cat sat on the mat"

P(the cat sat on the mat) = P(the) ∗ P(cat|the) ∗ P(sat|the cat) ∗ P(on|the cat sat) ∗ P(the|the cat sat on) ∗ P(mat|the cat sat on the)
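The chain rule translates directly into code; a minimal sketch, where cond_prob is a hypothetical stand-in for any estimator of P(w | history):

```python
def sequence_prob(words, cond_prob):
    """Chain rule: P(w1, ..., wn) = product over i of P(wi | w1, ..., w(i-1))."""
    prob = 1.0
    for i, w in enumerate(words):
        prob *= cond_prob(w, words[:i])  # history is everything before wi
    return prob
```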

12

SLIDE 13

Estimating probabilities

Let's count again! Maximum likelihood estimate (MLE):

P(sat|the cat) = count(the cat sat) / count(the cat)

P(on|the cat sat) = count(the cat sat on) / count(the cat sat)

  • With a vocabulary of size |V|, the number of sequences of length n is |V|^n
  • A typical vocabulary has ~ 50k words
  • Even sentences of length ≤ 11 result in ≈ 4.9 × 10^51 sequences! (# of atoms in the earth ≈ 10^50)
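A quick sanity check of that count (assuming |V| = 50,000 and n = 11):

```python
# Number of distinct word sequences of length 11 over a 50,000-word vocabulary
print(50_000 ** 11)  # ≈ 4.9 × 10**51, more than the ≈ 10**50 atoms in the earth
```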

13

SLIDE 14

Markov assumption

  • Use only the recent past to predict the next word
  • Reduces the number of estimated parameters in exchange for modeling capacity
  • 1st order: P(mat|the cat sat on the) ≈ P(mat|the)
  • 2nd order: P(mat|the cat sat on the) ≈ P(mat|on the)

14

SLIDE 15

kth order Markov

  • Consider only the last k words for context:

P(wi | w1, …, wi−1) ≈ P(wi | wi−k, …, wi−1)

which implies the probability of a sequence is:

P(w1, …, wn) ≈ ∏i P(wi | wi−k, …, wi−1)

This is called a (k+1)-gram model.
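In code, the kth-order assumption just truncates the history passed to the conditional model; a sketch reusing the hypothetical cond_prob from the chain-rule example:

```python
def markov_sequence_prob(words, cond_prob, k):
    """P(w1, ..., wn) ≈ product over i of P(wi | w(i-k), ..., w(i-1))."""
    prob = 1.0
    for i, w in enumerate(words):
        history = words[max(0, i - k):i]  # keep only the last k words
        prob *= cond_prob(w, history)
    return prob
```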

15

SLIDE 16

n-gram models

Unigram:

P(w1, w2, …, wn) = ∏i=1..n P(wi)

Bigram:

P(w1, w2, …, wn) = ∏i=1..n P(wi|wi−1)

and Trigram, 4-gram, and so on.

The larger the n, the more accurate and better the language model (but also the higher the cost). Caveat: assuming infinite data!
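To make the bigram case concrete, here is a minimal MLE bigram model (a toy sketch, not from the slides; the <s> and </s> boundary markers are my assumption, though a common convention):

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Count unigrams and bigrams, then estimate P(w|prev) = c(prev, w) / c(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                 # context counts
        bigrams.update(zip(words[:-1], words[1:]))  # (prev, w) pair counts
    def prob(w, prev):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

p = train_bigram_mle(["the cat sat on the mat", "the cat sat"])
print(p("cat", "the"))  # c(the cat) / c(the) = 2/3
```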

16

SLIDE 17

Unigram Model

17

SLIDE 18

Bigram Model

18

SLIDE 19

Trigram Model

19

SLIDE 20

Maximum Likelihood Estimate

20

SLIDE 21

Number of Parameters

Question

21

SLIDE 22

Number of Parameters

Question

22

SLIDE 23

Number of Parameters

Question

23

SLIDE 24

Number of parameters

24

SLIDE 25

Generalization of n-grams

  • Not all n-grams will be observed in training data!
  • Test corpus might have some that have zero probability under our model
  • Training set: Google news
  • Test set: Shakespeare
  • P(affray | voice doth us) = 0 ⇒ P(test corpus) = 0

25

SLIDE 26

Sparsity in language

  • Long tail of infrequent words
  • Most finite-size corpora will have this problem.

Zipf's Law: freq ∝ 1 / rank

[Figure: word frequency vs. frequency rank]

26

SLIDE 27

Smoothing n-gram Models

27

SLIDE 28

Handling unknown words

28

SLIDE 29

Smoothing

  • Smoothing deals with events that have been observed zero or very few times
  • Handle sparsity by making sure all probabilities are non-zero in our model
  • Additive: Add a small amount to all counts
  • Interpolation: Use a combination of different n-grams
  • Discounting: Redistribute probability mass from observed n-grams to unobserved ones
  • Back-off: Use lower order n-grams if higher ones are too sparse

29

SLIDE 30

Smoothing intuition

"Taking from the rich and giving to the poor"

(Credits: Dan Klein)

30

SLIDE 31
Add-one (Laplace) smoothing

  • Simplest form of smoothing: Just add 1 to all counts and renormalize!
  • Maximum likelihood estimate for bigrams:

P(wi|wi−1) = c(wi−1, wi) / c(wi−1)

  • Let |V| be the number of words in our vocabulary. Assign count of 1 to unseen bigrams.
  • After smoothing:

P(wi|wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + |V|)
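Given bigram and unigram Counter objects like those built in the earlier bigram sketch, add-one smoothing is a one-line change (a hedged sketch; vocab_size = |V| is assumed to be known):

```python
from collections import Counter

def laplace_prob(w, prev, bigrams: Counter, unigrams: Counter, vocab_size: int) -> float:
    """Add-one smoothed bigram estimate: (c(prev, w) + 1) / (c(prev) + |V|)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
```

Every unseen bigram now gets probability 1 / (c(prev) + |V|) instead of zero.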

31

SLIDE 32

Add-one (Laplace) smoothing

32

SLIDE 33
Additive smoothing

(Lidstone 1920, Jeffreys 1948)

  • Why add 1? 1 is an overestimate for unobserved events
  • Additive smoothing (0 < δ ≤ 1):

P(wi|wi−1) = (c(wi−1, wi) + δ) / (c(wi−1) + δ|V|)

  • Also known as add-alpha (the symbol α is used instead of δ)

33

SLIDE 34

Linear Interpolation (Jelinek-Mercer Smoothing)

  • Use a combination of models to estimate probability
  • Strong empirical performance

P̂(wi|wi−1, wi−2) = λ1 P(wi|wi−1, wi−2) + λ2 P(wi|wi−1) + λ3 P(wi)

where ∑i λi = 1
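A minimal sketch of the interpolated estimate (the three component models are assumed to be callables; the λ values shown are placeholders, not tuned):

```python
def interpolated_prob(w, prev1, prev2, p_tri, p_bi, p_uni,
                      lambdas=(0.5, 0.3, 0.2)):
    """Jelinek-Mercer: mix trigram, bigram, and unigram estimates (lambdas sum to 1)."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev1, prev2) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
```

The λ values are typically tuned on held-out data, which is the "finding lambda" question on the later slide.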

34

SLIDE 35

Linear Interpolation (Jelinek-Mercer Smoothing)

35

SLIDE 36

Linear Interpolation: Finding lambda

36

SLIDE 37
Next Week

  • More on language models
  • Using language models for generation
  • Evaluating language models
  • Text classification
  • Video lecture on levels of linguistic representation

37