slide-1
SLIDE 1

Language Models

January 22, 2013


slide-2
SLIDE 2

Still no MT??

  • Today we will talk about models of p(sentence)
  • The rest of this semester will deal with p(translated sentence | input sentence)
  • Why do it this way?
  • Conditioning on more stuff makes modeling more complicated. That is: p(sentence) is easier than p(translated sentence | input sentence).
  • Language models are arguably the most important models in statistical MT

slide-3
SLIDE 3

My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother. Father used to dub me Shapka, for the fur hat I would don even in the summer month. He ceased dubbing me that because I ordered him to cease dubbing me that. It sounded boyish to me, and I have always thought of myself as very potent and generative. I have many many girls, believe me, and they all have a different name for me. One dubs me Baby, not because I am a baby, but because she attends to me.



slide-16
SLIDE 16

Language Models Matter

  • Language models play the role of ...
  • a judge of grammaticality
  • a judge of semantic plausibility
  • an enforcer of stylistic consistency
  • a repository of knowledge (?)



slide-19
SLIDE 19

What is the probability of a sentence?

  • Requirements
  • Assign a probability to every sentence (i.e., string of words)
  • Questions
  • How many sentences are there in English?
  • Too many :)

$$\sum_{e \in \Sigma^*} p_{LM}(e) = 1 \qquad p_{LM}(e) \geq 0 \;\; \forall e \in \Sigma^*$$

slide-20
SLIDE 20

$$p_{LM}(e) = p(e_1, e_2, e_3, \ldots, e_\ell) = p(e_1) \times p(e_2 \mid e_1) \times p(e_3 \mid e_1, e_2) \times p(e_4 \mid e_1, e_2, e_3) \times \cdots \times p(e_\ell \mid e_1, e_2, \ldots, e_{\ell-2}, e_{\ell-1})$$

n-gram LMs

(Here e is a vector-valued random variable.)


slide-31
SLIDE 31

$$p_{LM}(e) = p(e_1, e_2, e_3, \ldots, e_\ell) \approx p(e_1 \mid \text{START}) \times \prod_{i=2}^{\ell} p(e_i \mid e_{i-1}) \times p(\text{STOP} \mid e_\ell)$$

n-gram LMs (here, a bigram model)

Which do you think is better: the full chain-rule decomposition or this approximation? Why?
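To make the bigram approximation concrete, here is a minimal Python sketch (not from the lecture; the probability table and its values are hypothetical toy numbers). Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sentences.

```python
import math

# Hypothetical toy bigram table: p[(prev, word)] = p(word | prev)
p = {
    ("START", "my"): 0.08, ("my", "friends"): 0.24,
    ("friends", "call"): 0.10, ("call", "me"): 0.83,
    ("me", "Alex"): 0.03, ("Alex", "STOP"): 0.25,
}

def bigram_logprob(words, table):
    """log2 p_LM(w) for START w_1 ... w_l STOP under a bigram model."""
    chain = ["START"] + words + ["STOP"]
    return sum(math.log2(table[(a, b)]) for a, b in zip(chain, chain[1:]))

print(bigram_logprob("my friends call me Alex".split(), p))
```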


slide-38
SLIDE 38

START → my → friends → call → me → Alex → STOP

$$p(\text{my} \mid \text{START}) \times p(\text{friends} \mid \text{my}) \times p(\text{call} \mid \text{friends}) \times p(\text{me} \mid \text{call}) \times p(\text{Alex} \mid \text{me}) \times p(\text{STOP} \mid \text{Alex})$$


slide-41
SLIDE 41

START → my → friends → call → me → Alex → STOP
$$p(\text{my} \mid \text{START}) \times p(\text{friends} \mid \text{my}) \times p(\text{call} \mid \text{friends}) \times p(\text{me} \mid \text{call}) \times p(\text{Alex} \mid \text{me}) \times p(\text{STOP} \mid \text{Alex})$$

START → my → friends → dub → me → Alex → STOP
$$p(\text{my} \mid \text{START}) \times p(\text{friends} \mid \text{my}) \times p(\text{dub} \mid \text{friends}) \times p(\text{me} \mid \text{dub}) \times p(\text{Alex} \mid \text{me}) \times p(\text{STOP} \mid \text{Alex})$$

These sentences have many terms in common.

slide-42
SLIDE 42

Categorical Distributions

A categorical distribution characterizes a random event that can take on exactly one of K possible outcomes. (nb. we often call these "multinomial distributions")

$$p(x) = \begin{cases} p_1 & \text{if } x = 1 \\ p_2 & \text{if } x = 2 \\ \vdots & \\ p_K & \text{if } x = K \\ 0 & \text{otherwise} \end{cases}$$

$$\sum_i p_i = 1 \qquad p_i \geq 0 \;\; \forall i$$

slide-43
SLIDE 43

p(·)

Outcome      p
the          0.3
and          0.1
said         0.04
says         0.004
of           0.12
why          0.008
Why          0.0007
restaurant   0.00009
destitute    0.00000064

Probability tables like this are the workhorses of language (and translation) modeling.


slide-45
SLIDE 45

p(· | some context) = p(· | in):

Outcome      p
the          0.6
and          0.04
said         0.009
says         0.00001
of           0.1
why          0.1
Why          0.00008
restaurant   0.0000008
destitute    0.00000064

p(· | other context) = p(· | the):

Outcome      p
the          0.01
and          0.01
said         0.003
says         0.009
of           0.002
why          0.003
Why          0.0006
restaurant   0.2
destitute    0.1

slide-46
SLIDE 46

LM Evaluation

  • Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)
  • Intrinsic: measure how good we are at modeling language

We will use perplexity to evaluate models. Given w and p_LM:

$$\mathrm{PPL} = 2^{-\frac{1}{|w|} \log_2 p_{LM}(w)} \qquad 0 \leq \mathrm{PPL} \leq \infty$$
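As a minimal sketch of this formula (assuming log2 p_LM(w) has already been computed, and that |w| counts the scored tokens including STOP; conventions vary):

```python
def perplexity(log2_prob: float, num_tokens: int) -> float:
    """PPL = 2 ** (-(1/|w|) * log2 p_LM(w)); lower is better."""
    return 2.0 ** (-log2_prob / num_tokens)

# A model that assigns each of 6 tokens probability 1/256 (log2 p = -48)
# is "256-way confused" at every position:
print(perplexity(-48.0, 6))  # 256.0
```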

slide-47
SLIDE 47

Perplexity

  • Generally fairly good correlations with BLEU for n-gram models
  • Perplexity is a generalization of the notion of branching factor
  • How many choices do I have at each position?
  • State-of-the-art English LMs have PPL of ~100 word choices per position
  • A uniform LM has a perplexity of |Σ|
  • Humans do much better
  • ... and bad models can do even worse than uniform!


slide-49
SLIDE 49

Whence parameters?

Estimation.



slide-52
SLIDE 52

$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$

$$\hat{p}_{\mathrm{MLE}}(x) = \frac{\mathrm{count}(x)}{N} \qquad \hat{p}_{\mathrm{MLE}}(x, y) = \frac{\mathrm{count}(x, y)}{N}$$

$$\hat{p}_{\mathrm{MLE}}(x \mid y) = \frac{\mathrm{count}(x, y)}{N} \times \frac{N}{\mathrm{count}(y)} = \frac{\mathrm{count}(x, y)}{\mathrm{count}(y)}$$

$$\hat{p}_{\mathrm{MLE}}(\text{call} \mid \text{friends}) = \frac{\mathrm{count}(\text{friends call})}{\mathrm{count}(\text{friends})}$$
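A sketch of relative-frequency (MLE) estimation for a bigram model, on a toy two-sentence corpus echoing the running example (the data and numbers are illustrative, not from the lecture):

```python
from collections import Counter

corpus = ["my friends call me Alex", "my friends dub me Alex"]

context_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    chain = ["START"] + sent.split() + ["STOP"]
    context_counts.update(chain[:-1])            # count(y) for each context y
    bigram_counts.update(zip(chain, chain[1:]))  # count(y, x) for each pair

def p_mle(x, y):
    """p_MLE(x | y) = count(y, x) / count(y)."""
    return bigram_counts[(y, x)] / context_counts[y]

print(p_mle("call", "friends"))  # 0.5: "friends" precedes "call" once, "dub" once
print(p_mle("dub", "friends"))   # 0.5
```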

slide-53
SLIDE 53

MLE & Perplexity

  • What is the lowest (best) perplexity possible for your model class?
  • Compute the MLE!
  • Well, that's easy...


slide-62
SLIDE 62

START → my → friends → call → me → Alex → STOP
$$p(\text{my} \mid \text{START}) \times p(\text{friends} \mid \text{my}) \times p(\text{call} \mid \text{friends}) \times p(\text{me} \mid \text{call}) \times p(\text{Alex} \mid \text{me}) \times p(\text{STOP} \mid \text{Alex})$$

START → my → friends → dub → me → Alex → STOP
$$p(\text{my} \mid \text{START}) \times p(\text{friends} \mid \text{my}) \times p(\text{dub} \mid \text{friends}) \times p(\text{me} \mid \text{dub}) \times p(\text{Alex} \mid \text{me}) \times p(\text{STOP} \mid \text{Alex})$$

(The slide annotates each term of both MLE-estimated derivations with a numeric value: 3.65172, 2.07101, 3.32231, 0.271271, 2.54562, 4.961, 1.96773.)

MLE assigns probability zero to unseen events.


slide-64
SLIDE 64

Zeros

  • Two kinds of zero probs:
  • Sampling zeros: zeros in the MLE due to impoverished observations
  • Structural zeros: zeros that should be there. Do these really exist?
  • Just because you haven't seen something, doesn't mean it doesn't exist.
  • In practice, we don't like probability zero, even if there is an argument that it is a structural zero.

Example: "the a 's are nearing the end of their lease in oakland"

slide-65
SLIDE 65

Smoothing

Smoothing refers to a family of estimation techniques that seek to model important general patterns in data while avoiding modeling noise or sampling artifacts. In particular, for language modeling, we seek

$$p(e) > 0 \;\; \forall e \in \Sigma^*$$

We will assume that Σ is known and finite.

slide-66
SLIDE 66

Add-α Smoothing

$$p \sim \mathrm{Dirichlet}(\alpha) \qquad x_i \sim \mathrm{Categorical}(p) \;\; \forall\, 1 \leq i \leq |x|$$

Assuming this model, what is the most probable value of p, having observed training data x?

(bunch of calculus - read about it on Wikipedia)

$$p^*_x = \frac{\mathrm{count}(x) + \alpha_x - 1}{N + \sum_{x'} (\alpha_{x'} - 1)} \qquad \forall \alpha_x > 1$$

slide-67
SLIDE 67
Add-α Smoothing

  • Simplest possible smoother
  • Surprisingly effective in many models
  • Does not work well for language models
  • There are procedures for dealing with 0 < α < 1
  • When might these be useful?
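A sketch of the MAP estimate from the previous slide with a symmetric Dirichlet prior (the vocabulary and counts are hypothetical toy data; with α = 2 this reduces to classic add-one/Laplace smoothing):

```python
from collections import Counter

def add_alpha(counts, vocab, alpha=2.0):
    """Symmetric-Dirichlet MAP estimate, valid for alpha > 1:
    p*(x) = (count(x) + alpha - 1) / (N + |vocab| * (alpha - 1))."""
    n = sum(counts.values())
    denom = n + len(vocab) * (alpha - 1.0)
    return {x: (counts[x] + alpha - 1.0) / denom for x in vocab}

vocab = {"the", "and", "of", "restaurant"}
counts = Counter({"the": 6, "and": 3, "of": 1})  # "restaurant" unseen
p = add_alpha(counts, vocab)
print(p["restaurant"])  # unseen event now gets nonzero probability (1/14)
print(sum(p.values()))  # 1.0
```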


slide-69
SLIDE 69

Interpolation

  • "Mixture of MLEs"

$$\hat{p}(\text{dub} \mid \text{my friends}) = \lambda_3 \hat{p}_{\mathrm{MLE}}(\text{dub} \mid \text{my friends}) + \lambda_2 \hat{p}_{\mathrm{MLE}}(\text{dub} \mid \text{friends}) + \lambda_1 \hat{p}_{\mathrm{MLE}}(\text{dub}) + \lambda_0 \frac{1}{|\Sigma|}$$

Where do the lambdas come from?
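A sketch of the interpolated estimate, with hand-set λs and hypothetical MLE tables (in practice the λs are tuned on held-out data, e.g. with EM):

```python
def p_interp(word, ctx, lambdas, p3, p2, p1, vocab_size):
    """lambdas = (l3, l2, l1, l0) must sum to 1; ctx = (w1, w2)."""
    l3, l2, l1, l0 = lambdas
    return (l3 * p3.get((ctx, word), 0.0)        # trigram MLE
            + l2 * p2.get((ctx[-1], word), 0.0)  # bigram MLE
            + l1 * p1.get(word, 0.0)             # unigram MLE
            + l0 / vocab_size)                   # uniform floor

p3 = {}                          # the trigram "my friends dub" is unseen
p2 = {("friends", "dub"): 0.5}
p1 = {"dub": 0.001}
print(p_interp("dub", ("my", "friends"), (0.5, 0.3, 0.15, 0.05),
               p3, p2, p1, vocab_size=10_000))  # > 0 despite the unseen trigram
```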

slide-70
SLIDE 70

Discounting

Discounting adjusts the frequencies of observed events downward to reserve probability mass for the things that have not been observed. We introduce a discounted frequency f* satisfying

$$0 \leq f^*(w_3 \mid w_1, w_2) \leq f(w_3 \mid w_1, w_2),$$

with f*(w_3 | w_1, w_2) > 0 only when count(w_1, w_2, w_3) > 0. The total discount is the zero-frequency probability:

$$\lambda(w_1, w_2) = 1 - \sum_{w'} f^*(w' \mid w_1, w_2)$$


slide-73
SLIDE 73

Back-off

Recursive formulation of probability:

$$\hat{p}_{\mathrm{BO}}(w_3 \mid w_1, w_2) = \begin{cases} f^*(w_3 \mid w_1, w_2) & \text{if } f^*(w_3 \mid w_1, w_2) > 0 \\ \alpha_{w_1, w_2} \times \lambda(w_1, w_2) \times \hat{p}_{\mathrm{BO}}(w_3 \mid w_2) & \text{otherwise} \end{cases}$$

The factor α_{w1,w2} × λ(w1, w2) is the "back-off weight". Question: how do we discount?
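A sketch of the recursion with precomputed discounted frequencies f* and back-off weights (both tables here are hypothetical; ARPA-format LM files store essentially these two tables per n-gram):

```python
def p_bo(word, context, fstar, bow):
    """Back off to shorter contexts until a nonzero f* is found."""
    f = fstar.get((context, word), 0.0)
    if f > 0.0 or not context:  # at the empty context, f* is the unigram estimate
        return f
    return bow.get(context, 1.0) * p_bo(word, context[1:], fstar, bow)

# Hypothetical tables (the back-off weight folds in alpha * lambda):
fstar = {(("my", "friends"), "call"): 0.4,
         (("friends",), "dub"): 0.3,
         ((), "dub"): 0.001}
bow = {("my", "friends"): 0.6, ("friends",): 0.7}
print(p_bo("dub", ("my", "friends"), fstar, bow))  # 0.6 * 0.3 = 0.18
```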

slide-97
SLIDE 97

Witten-Bell Discounting

Let's assume that the zero-frequency probability λ(a, b) can be estimated by counting how often a novel word type follows the context (a, b) in the training data. Walking through the example sequence

a b c a b c a b x a b c c a b a b x c

the context (a, b) is followed by a novel type three times (first c, then x, then a), so λ(a, b) ∝ 1 + 1 + 1 = 3.

$$t(a, b) = |\{x : \mathrm{count}(a, b, x) > 0\}|$$

$$\lambda(a, b) = \frac{t(a, b)}{\mathrm{count}(a, b) + t(a, b)} \qquad f^*(c \mid a, b) = \frac{\mathrm{count}(a, b, c)}{\mathrm{count}(a, b) + t(a, b)}$$
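A sketch of the Witten-Bell quantities for a single context, using the continuation counts read off the a/b/c walkthrough above (c follows (a, b) three times, x twice, a once):

```python
from collections import Counter

# Continuations of context (a, b) in the example sequence:
cont = Counter({"c": 3, "x": 2, "a": 1})

n = sum(cont.values())  # count(a, b) = 6
t = len(cont)           # t(a, b) = 3 distinct continuation types

lam = t / (n + t)                                  # zero-frequency mass = 3/9
fstar = {w: c / (n + t) for w, c in cont.items()}  # discounted frequencies

print(fstar["c"], lam)            # 3/9 and 3/9
print(sum(fstar.values()) + lam)  # 1.0: the discounts exactly fund lambda
```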

slide-98
SLIDE 98

Kneser-Ney Discounting

  • State-of-the-art in language modeling for 15 years
  • Two major intuitions
  • Some contexts have lots of new words
  • Some words appear in lots of contexts
  • Procedure
  • Only register a lower-order count the first time it is seen in a backoff context
  • Example: bigram model
  • "San Francisco" is a common bigram
  • But, we only count the unigram "Francisco" the first time we see the bigram "San Francisco" - we change its unigram probability


slide-100
SLIDE 100

Kneser-Ney II

$$f^*(b \mid a) = \frac{\max\{t(\cdot, a, b) - d, 0\}}{t(\cdot, a, \cdot)}$$

$$t(\cdot, a, b) = |\{w : \mathrm{count}(w, a, b) > 0\}| \qquad t(\cdot, a, \cdot) = |\{(w, w') : \mathrm{count}(w, a, w') > 0\}|$$

Max-order n-grams estimated normally!
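A sketch of this lower-order estimate built from type counts rather than token counts (the trigram counts are hypothetical toy data; note how a frequent but predictable word ends up with a small continuation probability):

```python
from collections import Counter

def kn_lower(tri_counts, a, b, d=0.75):
    """f*(b | a) = max(t(., a, b) - d, 0) / t(., a, .), where t counts types."""
    t_ab = len({w for (w, x, y) in tri_counts if x == a and y == b})
    t_a = len({(w, y) for (w, x, y) in tri_counts if x == a})
    return max(t_ab - d, 0.0) / t_a if t_a else 0.0

# "Francisco" is frequent but occurs in only one context; "Jose" in two.
tc = Counter({("in", "San", "Francisco"): 50,
              ("in", "San", "Jose"): 3,
              ("to", "San", "Jose"): 2})
print(kn_lower(tc, "San", "Francisco"))  # (1 - 0.75) / 3 ~= 0.083
print(kn_lower(tc, "San", "Jose"))       # (2 - 0.75) / 3 ~= 0.417
```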

slide-101
SLIDE 101

Other Formulations

  • N-gram class-based language models
  • Syntactic language models
  • generative syntactic models induce distributions over strings

Syntactic LM: $$p(w) = \sum_{\tau : \mathrm{yield}(\tau) = w} p(\tau, w)$$

Class-based LM: $$p(w) = \prod_{i=1}^{\ell} p(c_i \mid c_{i-n+1}, \ldots, c_{i-1}) \times p(w_i \mid c_i)$$


slide-103
SLIDE 103

Pauls & Klein (2012)

$$p(\tau, w) = p(\tau) \times p(w \mid \tau)$$


slide-106
SLIDE 106

Feature-based Models

  • Rosenfeld (1996)
  • "Maximum entropy" language models
  • Replace independent parameters with a multinomial logit distribution
  • Encode domain-specific knowledge
  • Expressive, but expensive


slide-109
SLIDE 109

Less Stupid Multinomials

Features of w:

  • Ends in -ing?
  • Contains a digit?
  • Found in Gigaword?
  • Contains a capital letter?

Parameters (feature weights)


slide-114
SLIDE 114

Less Stupid Multinomials

No analytic solution! :(


slide-115
SLIDE 115

Announcements

  • First language-in-10 talks start next week
  • Tuesday, Jan 29: David - Latin
  • Thursday, Jan 31: Weston - Mandarin
  • HW 1 will be posted Thursday after class