Algorithms for NLP
CS 11711, Fall 2019
Lecture 2: Language Models


Slide 1: Title
Algorithms for NLP (CS 11711, Fall 2019)
Lecture 2: Language Models
Yulia Tsvetkov

Slide 2: Announcements
▪ Homework 1 released on 9/3
▪ you need to attend the next lecture to understand it
▪ Chan will give an overview at the end of the next lecture
▪ + recitation on 9/6

Slides 3-8: 1-slide review of probability
Slide credit: Noah Smith
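The review equations themselves are images and do not appear in the text above; the following is a minimal set of identities that a one-slide probability review typically covers. Treat it as an assumption about what the slides show, not a transcription of them.

```latex
% Assumed contents of the one-slide probability review (not transcribed from the slides):
P(A, B) = P(A \mid B)\, P(B)                                            % product rule
P(A) = \sum_{b} P(A, B{=}b)                                             % marginalization
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}                           % Bayes' rule
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})    % chain rule
```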

Slides 9-15: opening example, revealed one sentence at a time

My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother. Father used to dub me Shapka, for the fur hat I would don even in the summer month. He ceased dubbing me that because I ordered him to cease dubbing me that. It sounded boyish to me, and I have always thought of myself as very potent and generative.

Slide 16: Language models play the role of ...
▪ a judge of grammaticality
▪ a judge of semantic plausibility
▪ an enforcer of stylistic consistency
▪ a repository of knowledge (?)

Slide 17: The Language Modeling problem
▪ Assign a probability to every sentence (or any string of words)
▪ finite vocabulary (e.g. words or characters): {the, a, telescope, …}
▪ infinite set of sequences:
  ▪ a telescope STOP
  ▪ a STOP
  ▪ the the the STOP
  ▪ I saw a woman with a telescope STOP
  ▪ STOP
  ▪ ...

Slide 18: The Language Modeling problem
▪ Assign a probability to every sentence (or any string of words)
▪ finite vocabulary (e.g. words or characters)
▪ infinite set of sequences

Slide 19
p(disseminating so much currency STOP) = 10^-15
p(spending a lot of money STOP) = 10^-9

Slide 20: The Language Modeling problem (Objections?)
▪ Assign a probability to every sentence (or any string of words)
▪ finite vocabulary (e.g. words or characters)
▪ infinite set of sequences

Slide 21: Motivation
▪ Machine translation
  ▪ p(strong winds) > p(large winds)
▪ Spell correction
  ▪ The office is about fifteen minuets from my house
  ▪ p(about fifteen minutes from) > p(about fifteen minuets from)
▪ Speech recognition
  ▪ p(I saw a van) >> p(eyes awe of an)
▪ Summarization, question-answering, handwriting recognition, OCR, etc.

Slide 22: Motivation
▪ Speech recognition: we want to predict a sentence given acoustics
(Figure: a speech waveform segmented into phones, labeled "s p ee ch" and "l a b".)

Slide 23: Motivation
▪ Speech recognition: we want to predict a sentence given acoustics
▪ Candidate transcriptions and their scores:

  the station signs are in deep in english         14732
  the stations signs are in deep in english        14735
  the station signs are in deep into english       14739
  the station 's signs are in deep in english      14740
  the station signs are in deep in the english     14741
  the station signs are indeed in english          14757
  the station 's signs are indeed in english       14760
  the station signs are indians in english         14790
  the station signs are indian in english          14799
  the stations signs are indians in english        14807
  the stations signs are indians and english       14815

Slides 24-26: Motivation: the Noisy-Channel Model
(Diagram, built up across the slides: a source emits w; the noisy channel turns w into the observed a; the decoder maps the observed a back to the best w.)

Slide 27: Motivation: the Noisy-Channel Model
▪ We want to predict a sentence given acoustics:

Slide 28: Motivation: the Noisy-Channel Model
▪ We want to predict a sentence given acoustics:
▪ The noisy-channel approach:

Slide 29: Motivation: the Noisy-Channel Model
▪ The noisy-channel approach: channel model × source model

Slide 30: Motivation: the Noisy-Channel Model
▪ The noisy-channel approach:
  ▪ the prior is the language model: a distribution over sequences of words (sentences)
  ▪ the likelihood is the acoustic model (HMMs)
Slide 31: Noisy channel example: Automatic Speech Recognition
▪ source model P(w): the Language Model
▪ channel model P(a|w): the Acoustic Model
▪ decoder: given the observed a, find w_best = argmax_w P(w|a) = argmax_w P(a|w) P(w)
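Written out, the decoding rule on this slide, including the Bayes-rule step that lets us drop P(a), which does not depend on w:

```latex
% Bayes' rule, then drop P(a) since it is constant in w:
w_{\text{best}} = \arg\max_{w} P(w \mid a)
                = \arg\max_{w} \frac{P(a \mid w)\, P(w)}{P(a)}
                = \arg\max_{w} \underbrace{P(a \mid w)}_{\text{acoustic model}} \; \underbrace{P(w)}_{\text{language model}}
```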

Slide 32: Noisy channel example: Automatic Speech Recognition
▪ source model P(w): the Language Model; channel model P(a|w): the Acoustic Model
▪ the same candidate transcriptions and scores as on slide 23
▪ the station 's signs are in deep in english

Slide 33: Noisy channel example: Machine Translation
▪ source model P(e): the Language Model
▪ channel model P(f|e): the Translation Model
▪ decoder: given the observed f, find e_best = argmax_e P(e|f) = argmax_e P(f|e) P(e)
▪ sent transmission: English; recovered transmission: French; recovered message: English’
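A minimal sketch of noisy-channel rescoring over an n-best list, in the spirit of the ASR and MT examples above. The function name, the dictionaries, and the toy log-scores are hypothetical illustrations, not values taken from the slides.

```python
def rescore_nbest(hypotheses, channel_logprob, lm_logprob):
    """Noisy-channel decoding over an n-best list:
    pick argmax_w [ log P(a|w) + log P(w) ], i.e. channel score + language model score."""
    return max(hypotheses, key=lambda w: channel_logprob[w] + lm_logprob[w])

# Toy usage with made-up log-scores (illustration only):
nbest = [
    "the station signs are in deep in english",
    "the station 's signs are indeed in english",
]
toy_channel = {nbest[0]: -100.0, nbest[1]: -105.0}   # log P(a|w): acoustic / translation model
toy_lm      = {nbest[0]:  -60.0, nbest[1]:  -20.0}   # log P(w): language model
print(rescore_nbest(nbest, toy_channel, toy_lm))
# -> "the station 's signs are indeed in english": the LM outweighs the small channel penalty
```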

Slide 34

Slide 35: Noisy Channel Examples
▪ speech recognition
▪ machine translation
▪ optical character recognition
▪ spelling and grammar correction
▪ handwriting recognition
▪ document summarization
▪ dialog generation
▪ linguistic decipherment
▪ etc.

Slide 36: Plan
▪ what is language modeling
▪ motivation
▪ how to build an n-gram LM
▪ how to estimate parameters from training data (n-gram probabilities)
▪ how to evaluate (perplexity)
▪ how to select vocabulary, what to do with OOVs (smoothing)

Slide 37: The Language Modeling problem
▪ Assign a probability to every sentence (or any string of words)
▪ finite vocabulary (e.g. words or characters)
▪ infinite set of sequences

Slide 38: A trivial model
▪ Assume we have n training sentences
▪ Let x1, x2, …, xn be a sentence, and c(x1, x2, …, xn) be the number of times it appeared in the training data
▪ Define a language model (a count-based sketch follows below)

Slide 39: A trivial model
▪ Same definition as on slide 38
▪ No generalization!
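A minimal sketch of the trivial model described above, assuming the usual count-based definition (a sentence's probability is its training count divided by the total number of training sentences). Unseen sentences get probability zero, which is exactly the "No generalization!" problem noted on slide 39.

```python
from collections import Counter

def train_trivial_lm(training_sentences):
    """training_sentences: a list of sentences, each a tuple of tokens ending in STOP."""
    counts = Counter(training_sentences)
    total = len(training_sentences)
    # p(sentence) = c(sentence) / number of training sentences; unseen sentences get 0.
    return lambda sentence: counts[sentence] / total

lm = train_trivial_lm([("a", "telescope", "STOP"), ("a", "STOP"), ("a", "STOP")])
print(lm(("a", "STOP")))                 # 2/3: seen twice in three training sentences
print(lm(("the", "telescope", "STOP")))  # 0.0: no generalization to unseen sentences
```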

Slide 40: Markov processes
▪ Markov processes: given a sequence of n random variables
▪ We want a sequence probability model

Slide 41: Markov processes
▪ Markov processes: given a sequence of n random variables
▪ We want a sequence probability model
▪ There are |V|^n possible sequences

Slide 42: First-order Markov process
▪ Chain rule

Slide 43: First-order Markov process
▪ Chain rule
▪ Markov assumption
(Both equations are written out below.)
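The two equations these slides name, written in their standard form (the slide images themselves are not reproduced here; the i = 1 term is just P(X_1 = x_1)):

```latex
% Chain rule (exact, no assumptions):
P(X_1{=}x_1, \dots, X_n{=}x_n) = \prod_{i=1}^{n} P(X_i{=}x_i \mid X_1{=}x_1, \dots, X_{i-1}{=}x_{i-1})
% First-order Markov assumption: each word depends only on the previous one:
P(X_1{=}x_1, \dots, X_n{=}x_n) = \prod_{i=1}^{n} P(X_i{=}x_i \mid X_{i-1}{=}x_{i-1})
```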

Slide 44: Second-order Markov process
▪ Relax the independence assumption:

Slide 45: Second-order Markov process
▪ Relax the independence assumption
▪ Simplify notation (see below)
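The second-order version referenced on these slides, in standard form. The padding convention x_0 = x_{-1} = * is a common choice assumed here, not something stated in the transcribed text:

```latex
% Second-order Markov assumption: condition on the two previous words:
P(X_1{=}x_1, \dots, X_n{=}x_n) = \prod_{i=1}^{n} P(X_i{=}x_i \mid X_{i-2}{=}x_{i-2}, X_{i-1}{=}x_{i-1})
% Simplified notation (assuming the padding convention x_0 = x_{-1} = *):
p(x_1, \dots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})
```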

Slide 46: Detail: variable length
▪ We want a probability distribution over sequences of any length

Slide 47: Detail: variable length
▪ Probability distribution over sequences of any length
▪ Always define Xn = STOP, where STOP is a special symbol

Slide 48: Detail: variable length
▪ Probability distribution over sequences of any length
▪ Always define Xn = STOP, where STOP is a special symbol
▪ Then use a Markov process as before
▪ We now have a probability distribution over all sequences
▪ Intuition: at every step you have probability 𝛽h of stopping (conditioned on history) and (1-𝛽h) of continuing

Slide 49

Slide 50: 3-gram LMs
▪ A trigram language model contains:
  ▪ a vocabulary V
  ▪ non-negative parameters q(w|u,v) for every trigram, such that … (constraint written out below)
  ▪ the probability of a sentence x1, …, xn, where xn = STOP, is … (formula written out below)
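The two statements left unfinished above, written out in the standard form for a trigram language model (an assumption about the slide equations, which are not reproduced in the text):

```latex
% Normalization constraint on the parameters (for every conditioning bigram u, v):
q(w \mid u, v) \ge 0, \qquad \sum_{w \in V \cup \{\mathrm{STOP}\}} q(w \mid u, v) = 1
% Sentence probability, with x_n = STOP and padding x_0 = x_{-1} = *:
p(x_1, \dots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})
```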

Slides 51-53: Example

Slide 54: Limitations?

Slide 55: Limitation
▪ The Markovian assumption is false
▪ We would want to model longer dependencies

Slide 56: Plan
▪ what is language modeling
▪ motivation
▪ how to build n-gram LMs
▪ how to estimate parameters from training data (n-gram probabilities)
▪ how to evaluate (perplexity)
▪ how to select vocabulary, what to do with OOVs (smoothing)

Slide 57: Empirical N-Grams
▪ How do we know P(w | history)?
▪ Use statistics from data (examples using Google N-Grams)
▪ E.g. what is P(door | the)?

Training counts:
  198015222    the first
  194623024    the same
  168504105    the following
  158562063    the world
  …
  14112454     the door
  23135851162  the *

Slide 58: Increasing N-Gram Order
▪ Higher orders capture more dependencies

Bigram counts:
  198015222    the first
  194623024    the same
  168504105    the following
  158562063    the world
  …
  14112454     the door
  23135851162  the *

Trigram counts:
  197302   close the window
  191125   close the door
  152500   close the gap
  116451   close the thread
  87298    close the deal
  …
  3785230  close the *

Bigram model:  P(door | the) = 0.0006
Trigram model: P(door | close the) = 0.05
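A small sketch showing how the probabilities on this slide follow from the counts above as relative frequencies (maximum-likelihood estimates). The counts are the ones printed on the slide; the helper function name is mine.

```python
def mle_conditional(count_history_word, count_history_total):
    """Maximum-likelihood estimate P(word | history) = c(history, word) / c(history, *)."""
    return count_history_word / count_history_total

# Bigram: P(door | the), from the Google N-gram counts above
print(mle_conditional(14112454, 23135851162))   # ~0.0006
# Trigram: P(door | close the)
print(mle_conditional(191125, 3785230))         # ~0.05
```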

Slide 59: Berkeley restaurant project sentences
▪ can you tell me about any good cantonese restaurants close by
▪ mid priced thai food is what i’m looking for
▪ tell me about chez panisse
▪ can you give me a listing of the kinds of food that are available
▪ i’m looking for a good place to eat breakfast
▪ when is caffe venezia open during the day

Slide 60: Bigram counts (~10K sentences)
▪ out of 9,222 sentences

Slide 61: Bigram probabilities

Slide 62: What did we learn
▪ p(English | want) < p(Chinese | want): people like Chinese stuff more, at least in this corpus
▪ p(to | want) = 0.66: English behaves in a certain way
▪ p(eat | to) = 0.28: English behaves in a certain way

Slide 63: Sparseness
▪ Maximum likelihood for estimating q
▪ Let c(w1, …, wn) be the number of times that n-gram appears in a corpus
▪ If the vocabulary has 20,000 words ⇒ the number of parameters is 8 x 10^12!

Slide 64: Sparseness
▪ Maximum likelihood for estimating q (written out below)
▪ Let c(w1, …, wn) be the number of times that n-gram appears in a corpus
▪ If the vocabulary has 20,000 words ⇒ the number of parameters is 8 x 10^12!
▪ Most n-grams will never be observed, even if they are linguistically plausible (Zipf's law)
▪ ⇒ Most sentences will have zero or undefined probabilities
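The maximum-likelihood estimator referred to above, plus the arithmetic behind the 8 x 10^12 figure (standard relative-frequency form):

```latex
% Maximum-likelihood (relative-frequency) estimate of the trigram parameters:
q_{\mathrm{ML}}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}
% Parameter count with a 20,000-word vocabulary:
|V|^3 = 20{,}000^3 = 8 \times 10^{12}
```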

Slide 65: Plan
▪ what is language modeling
▪ motivation
▪ how to build n-gram LMs
▪ how to estimate parameters from training data (n-gram probabilities)
▪ how to evaluate (perplexity)
▪ how to select vocabulary, what to do with OOVs (smoothing)

Slide 66: Evaluation
▪ Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)
▪ Intrinsic evaluation: measure how good we are at modeling language

Slide 67: Intrinsic evaluation
▪ Intuitively, language models should assign high probability to real language they have not seen before
▪ Want to maximize likelihood on test data, not training data
▪ Models derived from counts / sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization
▪ Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

Data split:
▪ Training data: counts / parameters estimated here
▪ Held-out data: hyperparameters tuned here
▪ Test data: evaluate here

Slide 68: Evaluation: perplexity
▪ Test data: S = {s1, s2, …, s_sent}
▪ Parameters are not estimated from S
▪ Perplexity is the normalized inverse probability of S (formula below)

Slide 69: Evaluation: perplexity
▪ Test data: S = {s1, s2, …, s_sent}
▪ parameters are estimated on training data
▪ sent is the number of sentences in the test data
▪ M is the number of words in the test corpus
▪ A good language model has high p(S) and low perplexity
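The perplexity formula described in words above ("normalized inverse probability of S"), written out with sent and M as defined on slide 69:

```latex
% Perplexity of the test set S = {s_1, ..., s_sent}, where M is the total number of words in S:
\mathrm{perplexity}(S) = p(S)^{-1/M}
                       = \Bigl( \prod_{i=1}^{\text{sent}} p(s_i) \Bigr)^{-1/M}
                       = 2^{\,-\frac{1}{M} \sum_{i=1}^{\text{sent}} \log_2 p(s_i)}
```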

Slide 70: Understanding perplexity
▪ It’s a branching factor
  ▪ assign probability of 1 to the test data ⇒ perplexity = 1
  ▪ assign probability of 1/|V| to every word ⇒ perplexity = |V|
  ▪ assign probability of 0 to anything ⇒ perplexity = ∞
▪ this motivates the proper probability constraint
▪ cannot compare perplexities of LMs trained on different corpora

Slide 71: Typical values of perplexity
▪ When |V| = 50,000
  ▪ trigram model perplexity: 74 (<< 50,000)
  ▪ bigram model: 137
  ▪ unigram model: 955

Slide 72: Plan
▪ what is language modeling
▪ motivation
▪ how to build n-gram LMs
▪ how to estimate parameters from training data (n-gram probabilities)
▪ how to evaluate (perplexity)
▪ how to select vocabulary, what to do with OOVs (smoothing)
▪ better parameter estimation methods

Slide 73: Dealing with Out-of-Vocabulary terms
▪ Define a special OOV or “unknown” symbol unk; transform some (or all) rare words in the training data to unk (see the sketch below)
▪ You cannot fairly compare two language models that apply different unk treatments
▪ Build a language model at the character level
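A minimal sketch of the first strategy above: fix a vocabulary from the training data and map every other word to a special unk token before estimating the model. The frequency threshold and the token spelling "<unk>" are illustrative choices, not prescribed by the slide.

```python
from collections import Counter

def build_vocab(training_tokens, min_count=2):
    """Keep words seen at least min_count times; everything else will become <unk>."""
    counts = Counter(training_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab, unk="<unk>"):
    """Map out-of-vocabulary tokens to the unk symbol."""
    return [w if w in vocab else unk for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)                   # {'the', 'cat'} with min_count=2
print(replace_oov("the dog sat".split(), vocab))   # ['the', '<unk>', '<unk>']
```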

Slide 74: Dealing with sparsity: Smoothing
▪ For most N-grams, we have few observations
▪ General approach: modify observed counts to improve estimates
  ▪ Discounting: allocate probability mass for unobserved events by discounting counts for observed events
  ▪ Interpolation: approximate counts of an N-gram using a combination of estimates from related, denser histories
  ▪ Back-off: approximate counts of an unobserved N-gram based on the proportion of back-off events (e.g., the N-1 gram)

Slide 75: Bias-variance tradeoff
▪ Given a corpus of length M

Slide 76

Slide 77: Linear interpolation
▪ Combine the three models to get all the benefits

Slide 78: Linear interpolation
▪ Need to verify the parameters define a probability distribution (see the formula below)
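The interpolated estimator these slides describe and the constraint needed for it to define a probability distribution, in standard form (the lambdas are the hyperparameters tuned on held-out data, as slide 79 shows):

```latex
% Interpolate the trigram, bigram, and unigram maximum-likelihood estimates:
q(w \mid u, v) = \lambda_1\, q_{\mathrm{ML}}(w \mid u, v) + \lambda_2\, q_{\mathrm{ML}}(w \mid v) + \lambda_3\, q_{\mathrm{ML}}(w)
% For this to define a probability distribution, the coefficients must satisfy:
\lambda_1 + \lambda_2 + \lambda_3 = 1, \qquad \lambda_i \ge 0
```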

Slide 79: Estimating coefficients
▪ Training data: counts / parameters estimated here
▪ Held-out data: hyperparameters tuned here
▪ Test data: evaluate here

Slide 80: Discounting methods
▪ Low count bigrams have high estimates

Slide 81: Discounting methods

Slide 82: Discounting + Backoff
▪ next time: Kneser-Ney Smoothing
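One standard way to combine discounting with backoff, in the spirit of these slides: a Katz-style bigram formulation with absolutely discounted counts. This is a common textbook variant and an assumption here; the exact formulation used in the lecture may differ.

```latex
% Discounted counts leave probability mass for unseen continuations:
c^{*}(v, w) = c(v, w) - d \qquad (\text{e.g. } d = 0.5)
% Backoff estimate: use discounted bigram counts where available, otherwise
% distribute the missing mass alpha(v) over unseen words in proportion to their unigram estimates:
q_{\mathrm{BO}}(w \mid v) =
\begin{cases}
  \dfrac{c^{*}(v, w)}{c(v)} & \text{if } c(v, w) > 0 \\[1.5ex]
  \alpha(v)\, \dfrac{q_{\mathrm{ML}}(w)}{\sum_{w' : c(v, w') = 0} q_{\mathrm{ML}}(w')} & \text{otherwise}
\end{cases}
\qquad \alpha(v) = 1 - \sum_{w : c(v, w) > 0} \frac{c^{*}(v, w)}{c(v)}
```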

Slide 83: Next class
▪ KN smoothing
▪ Efficient LMs
  ▪ relevant to your homework 1