Statistical Language Modeling for Speech Recognition - Berlin Chen (PowerPoint PPT presentation)

SLIDE 1

Statistical Language Modeling for Speech Recognition

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 11
  • 2. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
  • 3. Joshua Goodman's (Microsoft Research) public presentation material
  • 4. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on ASSP, March 1987
  • 5. R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," ICASSP 1995

Berlin Chen 2003

SLIDE 2

What is Language Modeling ?

  • Language Modeling (LM) deals with the probability

distribution of word sequences, e.g.:

P(“hi”) = 0.01, P(“and nothing but the truth”) ≈ 0.001, P(“and nuts sing on the roof”) ≈ 0

From Joshua Goodman’s material

SLIDE 3

What is Language Modeling ?

  • For a word sequence W, the probability P(W) can be decomposed into a product of conditional probabilities (chain rule):

    P(W) = P(w_1, w_2, ..., w_m)
         = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_m|w_1, w_2, ..., w_{m-1})
         = ∏_{i=1}^{m} P(w_i | w_1, ..., w_{i-1})

    where w_1, ..., w_{i-1} is the history of w_i

– E.g.: P(“and nothing but the truth”) = P(“and”) × P(“nothing|and”) × P(“but|and nothing”) × P(“the|and nothing but”) × P(“truth|and nothing but the”)
– However, it is impossible to estimate and store P(w_i | w_1, ..., w_{i-1}) if i is large (data sparseness problem, etc.)

SLIDE 4

What is LM Used for ?

  • Statistical language modeling attempts to capture the

regularities of natural languages

– Improve the performance of various natural language applications by estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents
– The first significant model was proposed in 1980

    P(W) = P(w_1, w_2, ..., w_m) = ?

SLIDE 5

What is LM Used for ?

  • Statistical language modeling is prevalent in many application domains

– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition (OCR)
– Machine translation
– Document classification and routing
– Information retrieval

SLIDE 6

Current Status

  • Ironically, the most successful statistical language

modeling techniques use very little knowledge of what language is

– The most prevalent n-gram language models take no advantage of the fact that what is being modeled is language
– It may be a sequence of arbitrary symbols, with no deep structure, intention, or thought behind them
– F. Jelinek said “put language back into language modeling”

    P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-2}, w_{i-1})    (history of length n-1)

SLIDE 7

LM in Speech Recognition

  • For a given acoustic observation X = x_1, x_2, ..., x_n, the goal of speech recognition is to find the corresponding word sequence W = w_1, w_2, ..., w_m that has the maximum posterior probability P(W|X):

    Ŵ = arg max_W P(W|X) = arg max_W [ P(W) P(X|W) / P(X) ] = arg max_W P(W) P(X|W)

    where W = w_1, w_2, ..., w_i, ..., w_m and each w_i belongs to Voc = {w_1, w_2, ..., w_V}

    P(W|X): posterior probability
    P(W): prior probability (language modeling)
    P(X|W): acoustic modeling

SLIDE 8

The Trigram Approximation

  • The trigram modeling assumes that each word depends only on the previous two words (a window of three words total)

– “tri” means three, “gram” means writing
– E.g.: P(“the | … whole truth and nothing but”) ≈ P(“the | nothing but”)
        P(“truth | … whole truth and nothing but the”) ≈ P(“truth | but the”)
– Similar definition for bigram (a window of two words total)

  • How do we find probabilities?

– Get real text, and start counting (empirically)!

    P(“the | nothing but”) ≈ C[“nothing but the”] / C[“nothing but”]

    (C[·] denotes a count; the resulting probability may be 0 for unseen trigrams)
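The counting recipe can be sketched in a few lines of Python (the toy corpus and the helper name trigram_mle are illustrative, not from the slides):

    from collections import defaultdict

    def trigram_mle(tokens):
        """Estimate P(w3 | w1 w2) = C[w1 w2 w3] / C[w1 w2] by simple counting."""
        tri_counts = defaultdict(int)
        bi_counts = defaultdict(int)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            tri_counts[(w1, w2, w3)] += 1
            bi_counts[(w1, w2)] += 1
        return lambda w1, w2, w3: (
            tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]
            if bi_counts[(w1, w2)] > 0 else 0.0  # unseen history -> probability 0 (no smoothing yet)
        )

    # Toy corpus; real systems count over large text collections.
    tokens = "and nothing but the truth and nothing but the facts".split()
    p = trigram_mle(tokens)
    print(p("nothing", "but", "the"))  # C["nothing but the"] / C["nothing but"] = 2/2 = 1.0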

SLIDE 9

Maximum Likelihood Estimate (ML/MLE) for LM

  • Given a training corpus T and the language model Λ

    Corpus T = w_1 w_2 ... w_k ... w_L ...,  Vocabulary = {w_1, w_2, ..., w_V}

    e.g.: …陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …    P(總統|陳水扁) = ?

    p(T|Λ) ≅ ∏_k p(w_k|h_k) = ∏_h ∏_{w_i} λ_{hw_i}^{N_{hw_i}}

    where λ_{hw_i} = p(w_i|h) for history h, N_{hw_i} is the count of the n-gram h w_i in the corpus, and n-grams with the same history h are collected together

– Essentially, the distribution of the sample counts N_{hw_i} with the same history h is a multinomial distribution:

    P(N_{hw_1}, N_{hw_2}, ..., N_{hw_V} | h, T) = ( N_h! / ∏_i N_{hw_i}! ) ∏_i λ_{hw_i}^{N_{hw_i}},
    where N_h = Σ_j N_{hw_j} and Σ_j λ_{hw_j} = 1 for every history h in T

– In terms of counts:

    p(w_i|h) = λ_{hw_i} = N_{hw_i} / N_h = C[h w_i] / C[h],
    where N_{hw_i} = C[h w_i] and N_h = C[h] are counted in the corpus T

SLIDE 10

Maximum Likelihood Estimate (ML/MLE) for LM

  • Taking the logarithm of p(T|Λ), we have

    Φ(Λ) = log p(T|Λ) = Σ_h Σ_{w_i} N_{hw_i} log λ_{hw_i}

  • For any pair (h, w_j), try to maximize Φ(Λ) subject to Σ_j λ_{hw_j} = 1, ∀h

    Introducing a Lagrange multiplier l_h for each history h:

    Φ'(Λ) = Φ(Λ) + Σ_h l_h ( 1 - Σ_j λ_{hw_j} )

    ∂Φ'(Λ)/∂λ_{hw_i} = N_{hw_i}/λ_{hw_i} - l_h = 0   ⇒   λ_{hw_i} = N_{hw_i}/l_h

    Summing over all w_i and using Σ_i λ_{hw_i} = 1:   l_h = Σ_i N_{hw_i} = N_h

    ∴ the ML solution is λ_{hw_i} = N_{hw_i}/N_h = C[h w_i]/C[h]

SLIDE 11

Main Issues for LM

  • Evaluation

– How can you tell a good language model from a bad one?
– Run a speech recognizer or adopt other statistical measurements

  • Smoothing

– Deal with data sparseness of real training data
– Various approaches have been proposed

  • Caching

– If you say something, you are likely to say it again later
– Adjust word frequencies observed in the current conversation

  • Clustering

– Group words with similar properties (similar semantic or grammatical behavior) into the same class
– Another efficient way to handle the data sparseness problem

SLIDE 12

Evaluation

  • The two most common metrics for evaluating a language model

– Word recognition error rate (WER)
– Perplexity (PP)

  • Word recognition error rate

– Requires the participation of a speech recognition system (slow!)
– Needs to deal with the combination of acoustic probabilities and language model probabilities (penalizing or weighting between them)

SLIDE 13

Evaluation

  • Perplexity

– Perplexity is the geometric-average inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability)
– It can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model

    PP(W) = PP(w_1, w_2, ..., w_m) = [ (1/P(w_1)) ∏_{i=2}^{m} 1/P(w_i | w_1, w_2, ..., w_{i-1}) ]^{1/m}

– For trigram modeling:

    PP(W) = PP(w_1, w_2, ..., w_m) = [ (1/P(w_1)) (1/P(w_2|w_1)) ∏_{i=3}^{m} 1/P(w_i | w_{i-2}, w_{i-1}) ]^{1/m}
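A minimal sketch of the computation, assuming the caller supplies the m conditional probabilities of a test text from some (smoothed) model:

    import math

    def perplexity(sentence_probs):
        """Geometric-average inverse probability: PP = (prod 1/p_i)^(1/m).
        sentence_probs: the m conditional probabilities P(w_i | history) of a test text."""
        m = len(sentence_probs)
        log_sum = sum(math.log(p) for p in sentence_probs)  # assumes every p > 0, i.e. a smoothed model
        return math.exp(-log_sum / m)

    # A uniform 10-way choice at every step gives perplexity 10 (the digit-recognition example below).
    print(perplexity([0.1] * 20))  # 10.0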

SLIDE 14

Evaluation

  • More about Perplexity

– Perplexity is an indication of the complexity of the language, if we have an accurate estimate of P(W)
– A language with higher perplexity means that the number of words branching from a previous word is larger on average
– A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities
– Examples:

  • Ask a speech recognizer to recognize digits “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity ≈ 10
  • Ask a speech recognizer to recognize names at a large institute (10,000 persons) – hard – perplexity ≈ 10,000

SLIDE 15

Evaluation

  • More about Perplexity (cont.)

– Training-set perplexity: measures how well the language model fits the training data
– Test-set perplexity: evaluates the generalization capability of the language model

  • When we say perplexity, we mean “test-set perplexity”
SLIDE 16

Evaluation

  • Is a language model with lower perplexity better?

– The true (optimal) model for the data has the lowest possible perplexity
– The lower the perplexity, the closer we are to the true model
– Typically, perplexity correlates well with speech recognition word error rate

  • It correlates better when both models are trained on the same data
  • It doesn’t correlate well when the training data changes

– The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128 ~ 176 (trigram)
– The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20

SLIDE 17

Evaluation

  • The perplexity of bigram models with different vocabulary sizes
SLIDE 18

Evaluation

  • A rough rule of thumb (by Rosenfeld)

– A reduction of 5% in perplexity is usually not practically significant
– A 10% ~ 20% reduction is noteworthy, and usually translates into some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline is quite significant

SLIDE 19

Smoothing

  • The maximum likelihood (ML) estimate of language models has been shown previously, e.g.:

– Trigram probabilities

    P_ML(z|xy) = C[xyz] / Σ_w C[xyw] = C[xyz] / C[xy]

– Bigram probabilities

    P_ML(y|x) = C[xy] / Σ_w C[xw] = C[xy] / C[x]

    (C[·] denotes a count in the training data)

SLIDE 20

Smoothing

  • Data Sparseness

– Many actually possible events (word successions) in the test set may not be well observed in the training set/data

  • E.g., bigram modeling:

    P(read|Mulan) = 0  ⇒  P(Mulan read a book) = 0  ⇒  P(W) = 0  ⇒  P(X|W)P(W) = 0

– Whenever a string W such that P(W) = 0 occurs during the speech recognition task, an error will be made

SLIDE 21

Smoothing

  • Operations of smoothing

– Assign all strings (or events/word successions) a nonzero probability, even if they never occur in the training data
– Tend to make distributions flatter, by adjusting lower probabilities upward and high probabilities downward

SLIDE 22

Smoothing: Simple Models

  • Add-one smoothing

– For example, pretend each trigram occurs once more than it actually does:

    P_smooth(z|xy) ≈ (C[xyz] + 1) / Σ_w (C[xyw] + 1) = (C[xyz] + 1) / (C[xy] + V),
    where V is the total number of vocabulary words

  • Add-delta smoothing

    P_smooth(z|xy) ≈ (C[xyz] + δ) / (C[xy] + δV)

Works badly. DO NOT DO THESE TWO (Joshua Goodman said)
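Despite the warning, a small sketch makes the add-δ formula concrete (the count dictionaries and vocabulary size are made up for illustration):

    def p_add_delta(tri_counts, bi_counts, vocab_size, w1, w2, w3, delta=0.5):
        """P_smooth(w3 | w1 w2) = (C[w1 w2 w3] + delta) / (C[w1 w2] + delta * V).
        delta = 1 gives add-one smoothing."""
        return (tri_counts.get((w1, w2, w3), 0) + delta) / \
               (bi_counts.get((w1, w2), 0) + delta * vocab_size)

    tri = {("nothing", "but", "the"): 2}
    bi = {("nothing", "but"): 2}
    print(p_add_delta(tri, bi, vocab_size=10, w1="nothing", w2="but", w3="the"))   # seen trigram
    print(p_add_delta(tri, bi, vocab_size=10, w1="nothing", w2="but", w3="roof"))  # unseen, but nonzero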

SLIDE 23

Smoothing: Back-Off Models

  • The general form of n-gram back-off:

    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) =
      P̂_smooth(w_i | w_{i-n+1}, ..., w_{i-1}),                                if C[w_{i-n+1}, ..., w_i] > 0
      α(w_{i-n+1}, ..., w_{i-1}) · P_smooth(w_i | w_{i-n+2}, ..., w_{i-1}),   if C[w_{i-n+1}, ..., w_i] = 0

– α(w_{i-n+1}, ..., w_{i-1}): a normalizing/scaling factor chosen to make the conditional probability sum to 1

  • I.e., Σ_{w_i} P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) = 1

    For example,

    α(w_{i-n+1}, ..., w_{i-1}) =
      [ 1 - Σ_{w_i: C[w_{i-n+1},...,w_{i-1},w_i] > 0} P̂_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) ]
      / Σ_{w_j: C[w_{i-n+1},...,w_{i-1},w_j] = 0} P_smooth(w_j | w_{i-n+2}, ..., w_{i-1})

SLIDE 24

Smoothing: Interpolated Models

  • The general form of the interpolated n-gram model:

    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) =
      λ_{w_{i-n+1},...,w_{i-1}} · P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) + (1 - λ_{w_{i-n+1},...,w_{i-1}}) · P_smooth(w_i | w_{i-n+2}, ..., w_{i-1})

    where P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) = C[w_{i-n+1}, ..., w_i] / C[w_{i-n+1}, ..., w_{i-1}]

  • The key difference between back-off and interpolated models

– For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not
– Moreover, n-grams with the same counts can have different probability estimates
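A sketch of the interpolated form for bigrams, with a hand-picked constant λ standing in for a weight that would normally be tuned on held-out data (function name and counts are illustrative):

    def p_interp(bi_counts, uni_counts, total_words, w_prev, w, lam=0.7):
        """Interpolated bigram: lam * P_ML(w | w_prev) + (1 - lam) * P_ML(w)."""
        p_bi = bi_counts.get((w_prev, w), 0) / uni_counts[w_prev] if uni_counts.get(w_prev) else 0.0
        p_uni = uni_counts.get(w, 0) / total_words
        return lam * p_bi + (1 - lam) * p_uni

    uni = {"but": 2, "the": 2, "nothing": 2}
    bi = {("but", "the"): 2}
    print(p_interp(bi, uni, total_words=6, w_prev="but", w="the"))      # nonzero count: both orders contribute
    print(p_interp(bi, uni, total_words=6, w_prev="but", w="nothing"))  # zero bigram count: unigram term only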

SLIDE 25

Clustering

  • Class-based language models

– Define classes for words that exhibit similar semantic or grammatical behavior:
    WEEKDAY = {Sunday, Monday, Tuesday, …}
    MONTH = {January, February, April, May, June, …}
    EVENT = {meeting, class, party, …}

  • P(Tuesday| party on) is similar to P(Monday| party on)
SLIDE 26

Clustering

  • A word may belong to more than one class and a class

may contain more than one word (many-to-many mapping)

    [Diagram: example phrases built from words such as “a/one”, “meeting/party/date”, “Sunday/Monday/Tuesday”, “in January/February/April”, “is canceled/will be postponed/prepared/arranged”, with words linked to overlapping classes]

SLIDE 27

Clustering

  • The n-gram model can be computed based on the previous n-1 classes

– If the trigram approximation and a unique mapping from words to word classes are used:

    P(w_i | w_{i-n+1} ... w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1}) ≈ P(w_i | Class(w_i)) · P(Class(w_i) | Class(w_{i-2}), Class(w_{i-1}))

    where Class(w_i) is the class to which w_i belongs

– Empirically estimate the probabilities:

    P(w_i | Class(w_i)) = C[w_i] / C[Class(w_i)]

    P(Class(w_i) | Class(w_{i-2}), Class(w_{i-1})) = C[Class(w_{i-2}) Class(w_{i-1}) Class(w_i)] / C[Class(w_{i-2}) Class(w_{i-1})]

SLIDE 28

Clustering

  • Clustering is another way to battle the data sparseness problem (smoothing of the language model)

  • For general-purpose large-vocabulary dictation applications, class-based n-grams have not significantly improved recognition accuracy

– They are mainly used as a back-off model to complement the lower-order n-grams for better smoothing

  • For limited-domain (or narrow-discourse) speech recognition, the class-based n-gram is very helpful

– The class can efficiently encode semantic information for improved keyword-spotting and speech understanding accuracy
– Good results are often achieved by manual clustering of semantic categories

SLIDE 29

Caching

  • The basic idea of caching is to accumulate n-grams dictated so far in the current document/conversation and use these to create a dynamic n-gram model

  • Trigram interpolated with a unigram cache:

    P_cache(z|xy) ≈ λ P_smooth(z|xy) + (1 - λ) P_cache(z|history),
    P_cache(z|history) = C[z ∈ history] / length(history)

  • Trigram interpolated with a bigram cache:

    P_cache(z|xy) ≈ λ P_smooth(z|xy) + (1 - λ) P_cache(z|y, history),
    P_cache(z|y, history) = C[yz ∈ history] / C[y ∈ history]

    (history: the document/conversation dictated so far)
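A sketch of the unigram-cache interpolation; the history list, the stand-in p_smooth, and λ = 0.9 are all illustrative choices:

    def p_cached(z, x, y, history, p_smooth, lam=0.9):
        """Trigram interpolated with a unigram cache:
        P_cache(z | xy) ~= lam * P_smooth(z | xy) + (1 - lam) * C[z in history] / len(history)."""
        p_cache_uni = history.count(z) / len(history) if history else 0.0
        return lam * p_smooth(z, x, y) + (1 - lam) * p_cache_uni

    # 'history' is whatever has been dictated so far in the current document/conversation.
    history = "i swear to tell the truth the whole truth".split()
    p_smooth = lambda z, x, y: 0.01  # stand-in for a smoothed static trigram model
    print(p_cached("truth", "the", "whole", history, p_smooth))  # boosted: "truth" already occurred twice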

SLIDE 30

Caching

  • Real life of caching

– Someone says “I swear to tell the truth”
– The system hears “I swerve to smell the soup”
– Someone says “The whole truth”, and, with the cache, the system hears “The toll booth” – the errors are locked in (the cache remembers!)
– Caching works well when users correct as they go; it works poorly, or even hurts, without correction

SLIDE 31

Known Weakness in Current LM

  • Brittleness Across Domains

– Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
– E.g., conversations vs. news broadcasts

  • False Independence Assumption

– In order to remain trainable, the n-gram model assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words

SLIDE 32

LM Integrated into Speech Recognition

  • Theoretically,

    Ŵ = arg max_W P(W) P(X|W)

  • Practically, the language model is a better predictor, while the acoustic probabilities aren’t “real” probabilities

– Penalize insertions:

    Ŵ = arg max_W P(X|W) P(W)^α · β^{length(W)},
    where α and β can be empirically decided
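In practice the combination is done in the log domain; a sketch with illustrative weight values (the names lm_weight and insertion_penalty are assumptions, not standard identifiers):

    import math

    def hypothesis_score(log_p_acoustic, log_p_lm, num_words, lm_weight=10.0, insertion_penalty=0.5):
        """Log-domain version of P(X|W) * P(W)^alpha * beta^length(W):
        score = log P(X|W) + alpha * log P(W) + length(W) * log(beta).
        alpha and beta would be tuned empirically on development data."""
        return log_p_acoustic + lm_weight * log_p_lm + num_words * math.log(insertion_penalty)

    # Comparing two competing hypotheses for the same acoustics:
    print(hypothesis_score(-120.0, -8.0, 5))
    print(hypothesis_score(-118.0, -12.0, 7))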

SLIDE 33

Good-Turing Estimate

  • First published by Good (1953), with the idea credited to Turing

  • A smoothing technique to deal with infrequent m-grams

(m-gram smoothing), but it usually needs to be used together with other back-off schemes to achieve good performance

  • How many words were seen once? Estimate for how

many are unseen. All other estimates are adjusted (down) to give probabilities for unseen

Use the notation m-grams instead of n-grams here

SLIDE 34

Good-Turing Estimate

  • For any m-gram w_1^m = a that occurs r times (r = c[w_1^m]), we pretend it occurs r* times (r* = c*[w_1^m]):

    r* = (r+1) · n_{r+1} / n_r,
    where n_r is the number of m-grams that occur exactly r times in the training data

– The probability estimate for an m-gram a with r counts:

    P_GT(a) = r* / N,
    where N is the size (total word count) of the training data
    (Note: this is not a conditional probability!)

  • The size (word count) of the training data remains the same:

    Σ_{r≥0} n_r · r* = Σ_{r≥0} (r+1) · n_{r+1} = Σ_{r≥1} n_r · r = N

SLIDE 35

Good-Turing Estimate

  • It follows from the above that the total probability estimate, using P_GT, for the set of m-grams that actually occurred in the sample is

    Σ_{w_1^m: c[w_1^m] > 0} P_GT(w_1^m) = 1 - n_1/N

  • The probability of observing some previously unseen m-gram is therefore

    Σ_{w_1^m: c[w_1^m] = 0} P_GT(w_1^m) = n_1/N

– which is just the fraction of singletons (m-grams occurring only once) in the text sample
slide-36
SLIDE 36

36

Good-Turing Estimate: Example

  • Imagine you are fishing. You have caught 10 Carp (鯉魚), 3 Cod (鱈魚), 2 tuna (鮪魚), 1 trout (鱒魚), 1 salmon (鮭魚), 1 eel (鰻魚); N = 18

  • How likely is it that the next species is new?

– p_0 = n_1/N = 3/18 = 1/6

  • How likely is eel? (1*)

– n_1 = 3, n_2 = 1
– 1* = 2 × 1/3 = 2/3
– P(eel) = 1*/N = (2/3)/18 = 1/27

  • How likely is tuna? (2*)

– n_2 = 1, n_3 = 1
– 2* = 3 × 1/1 = 3
– P(tuna) = 2*/N = 3/18 = 1/6

  • But how likely is Cod? (3*)

– Needs smoothing for n_4 in advance (3* = 4 × n_4/n_3, and n_4 = 0 here)
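A short sketch that reproduces the numbers of this example (only the species counts above are used):

    from collections import Counter

    catch = ["carp"] * 10 + ["cod"] * 3 + ["tuna"] * 2 + ["trout", "salmon", "eel"]
    counts = Counter(catch)          # species -> r
    n = Counter(counts.values())     # r -> n_r (how many species occur exactly r times)
    N = len(catch)                   # 18

    p_unseen = n[1] / N              # probability mass for a new species: 3/18 = 1/6

    def r_star(r):
        """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r."""
        return (r + 1) * n[r + 1] / n[r]

    print(p_unseen)          # 0.1666...
    print(r_star(1) / N)     # P(eel)  = (2/3) / 18 = 1/27
    print(r_star(2) / N)     # P(tuna) = 3 / 18 = 1/6
    # r_star(3) = 4 * n_4 / n_3 = 0 here: the degenerate case that calls for smoothing the n_r values.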

SLIDE 37

Good-Turing Estimate

  • The Good-Turing estimate may yield some problems when n_{r+1} = 0

– An alternative strategy is to apply Good-Turing only to the n-grams (events) seen at most k times, where k is a parameter chosen so that n_{r+1} ≠ 0 for r = 1, …, k

SLIDE 38

Good-Turing Estimate

  • With the Good-Turing estimate, it may happen that an m-gram (event) occurring k times takes on a higher probability than an event occurring k+1 times

– The choice of k may be selected in an attempt to overcome such a drawback
– Experimentally, k ranging from 4 to 8 will not allow the above condition to be true (for r ≤ k)

    P_GT(a_k) = (k+1) n_{k+1} / (N n_k),    P_GT(a_{k+1}) = (k+2) n_{k+2} / (N n_{k+1})

    P_GT(a_k) < P_GT(a_{k+1})  ⇒  (k+1) n_{k+1}^2 < (k+2) n_k n_{k+2}
SLIDE 39

Katz Back-off Smoothing

  • Extend the intuition of the Good-Turing estimate by

combining higher-order language models with lower-order ones

– E.g., bigrams and unigram language models

  • Larger counts are taken to be reliable, so they are not

discounted

– E.g., for frequency counts r > k

  • Lower counts are discounted, with total reduced counts

assigned to unseen events, based on the Good-Turing estimate

– E.g., for frequency counts r ≤ k

SLIDE 40

Assume lower level LM probability has been defined

Katz Back-off Smoothing

  • Take the bigram (m-gram, m = 2) counts as an example, with r = C[w_{i-1} w_i]:

    C*[w_{i-1} w_i] =
      r,                          if r > k
      d_r · r,                    if k ≥ r > 0
      β(w_{i-1}) · P_Katz(w_i),   if r = 0

1. Large counts (r > k) are taken to be reliable and are not discounted
2. d_r: discount constant (close to r*/r), satisfying the two constraints derived on the next slide
3. β(w_{i-1}): the scaling factor that redistributes the discounted count mass over unseen bigrams,

    β(w_{i-1}) = [ Σ_{w_i} C[w_{i-1} w_i] - Σ_{w_i: C[w_{i-1} w_i] > 0} C*[w_{i-1} w_i] ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_Katz(w_i)

Note: d_r should be calculated for different n-gram counts and different n-gram histories, e.g., w_{i-1} here

SLIDE 41

Katz Back-off Smoothing

  • Derivation of the discount constant d_r:

    Two constraints are imposed:

    (1)  1 - d_r = µ (1 - r*/r)                 (the discount is proportional to the Good-Turing discount)
    (2)  Σ_{r=1}^{k} n_r (1 - d_r) r = n_1      (the total discounted mass equals the mass reserved for unseen events)

    Also, the following equation is known:

    Σ_{r=1}^{k} n_r r* = Σ_{r=1}^{k} (r+1) n_{r+1} = Σ_{r=2}^{k+1} r n_r = Σ_{r=1}^{k} r n_r - n_1 + (k+1) n_{k+1}

    Substituting (1) into (2):

    µ Σ_{r=1}^{k} n_r (r - r*) = n_1   ⇒   µ [ n_1 - (k+1) n_{k+1} ] = n_1   ⇒   µ = n_1 / [ n_1 - (k+1) n_{k+1} ]

SLIDE 42

Katz Back-off Smoothing

  • Derivation of the discount constant d_r (cont.):

    d_r = 1 - µ (1 - r*/r) = 1 - n_1 (1 - r*/r) / [ n_1 - (k+1) n_{k+1} ]

    Dividing the numerator and denominator by n_1:

    d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

SLIDE 43

Katz Back-off Smoothing

  • Take the conditional probabilities of bigrams (m-gram, m = 2) as an example, with r = C[w_{i-1} w_i]:

    P_Katz(w_i | w_{i-1}) =
      C[w_{i-1} w_i] / C[w_{i-1}],          if r > k
      d_r · C[w_{i-1} w_i] / C[w_{i-1}],    if k ≥ r > 0
      α(w_{i-1}) · P_Katz(w_i),             if r = 0

  • 1. discount constant:

    d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

  • 2. normalizing constant:

    α(w_{i-1}) = [ 1 - Σ_{w_i: C[w_{i-1} w_i] > 0} P_Katz(w_i | w_{i-1}) ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_Katz(w_i)

SLIDE 44

Katz Back-off Smoothing: Example

  • A small vocabulary consists of only five words, i.e., V = {w_1, w_2, ..., w_5}. The frequency counts for word pairs starting with w_1 are:

    C[w_1, w_2] = 3, C[w_1, w_3] = 2, C[w_1, w_4] = 1, C[w_1, w_1] = C[w_1, w_5] = 0

    and the word frequency counts are:

    C[w_1] = 6, C[w_2] = 8, C[w_3] = 10, C[w_4] = 6, C[w_5] = 4

    Katz back-off smoothing with the Good-Turing estimate is used here for word pairs with frequency counts equal to or less than two (k = 2). Show the conditional probabilities of word bigrams starting with w_1, i.e.,

    P_Katz(w_1|w_1), P_Katz(w_2|w_1), ..., P_Katz(w_5|w_1) = ?

SLIDE 45

Katz Back-off Smoothing: Example

    Good-Turing adjusted counts: r* = (r+1) n_{r+1}/n_r, where n_r is the number of bigrams starting with w_1 that occur exactly r times (here n_1 = n_2 = n_3 = 1, n_4 = 0)

    Discount constants (k = 2): d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

    1* = 2 · n_2/n_1 = 2,  d_1 = (2/1 - 3)/(1 - 3) = 1/2
    2* = 3 · n_3/n_2 = 3,  d_2 = (3/2 - 3)/(1 - 3) = 3/4

    For r = 3 > k:  P_Katz(w_2|w_1) = C[w_1, w_2]/C[w_1] = 3/6 = 1/2
    For r = 2:      P_Katz(w_3|w_1) = d_2 · 2/6 = (3/4)(2/6) = 1/4
    For r = 1:      P_Katz(w_4|w_1) = d_1 · 1/6 = (1/2)(1/6) = 1/12

    For r = 0:  α(w_1) = [ 1 - (1/2 + 1/4 + 1/12) ] / [ P_Katz(w_1) + P_Katz(w_5) ] = (2/12) / (6/34 + 4/34) = (1/6)(34/10) = 17/30

      P_Katz(w_1|w_1) = α(w_1) · P_Katz(w_1) = (17/30)(6/34) = 1/10
      P_Katz(w_5|w_1) = α(w_1) · P_Katz(w_5) = (17/30)(4/34) = 1/15

    And P_Katz(w_1|w_1) + P_Katz(w_2|w_1) + ... + P_Katz(w_5|w_1) = 1/10 + 1/2 + 1/4 + 1/12 + 1/15 = 1

    Notice that P_Katz(w) = P_ML(w) = C[w]/34 here (the unigram probabilities are not smoothed in this example)
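A sketch that reproduces the numbers of this example; the count tables are exactly those given above, with k = 2 and n_r computed over the bigrams with history w_1, as noted earlier:

    bigram_counts = {("w1", "w2"): 3, ("w1", "w3"): 2, ("w1", "w4"): 1}   # pairs starting with w1
    unigram_counts = {"w1": 6, "w2": 8, "w3": 10, "w4": 6, "w5": 4}
    vocab = list(unigram_counts)
    N_uni = sum(unigram_counts.values())                                   # 34
    k = 2

    # n_r over bigrams with history w1
    n = {r: sum(1 for c in bigram_counts.values() if c == r) for r in range(1, k + 3)}

    def d(r):
        """Katz discount d_r = (r*/r - (k+1) n_{k+1}/n_1) / (1 - (k+1) n_{k+1}/n_1)."""
        r_star = (r + 1) * n[r + 1] / n[r]
        A = (k + 1) * n[k + 1] / n[1]
        return (r_star / r - A) / (1 - A)

    def p_katz(w, prev="w1"):
        r = bigram_counts.get((prev, w), 0)
        if r > k:
            return r / unigram_counts[prev]
        if r > 0:
            return d(r) * r / unigram_counts[prev]
        seen = [v for (h, v) in bigram_counts if h == prev]
        left = 1 - sum(p_katz(v, prev) for v in seen)                      # mass left for unseen words
        unseen_mass = sum(unigram_counts[v] / N_uni for v in vocab if v not in seen)
        return (left / unseen_mass) * (unigram_counts[w] / N_uni)

    print([round(p_katz(w), 4) for w in vocab])   # [0.1, 0.5, 0.25, 0.0833, 0.0667]
    print(sum(p_katz(w) for w in vocab))          # 1.0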

SLIDE 46

Kneser-Ney Back-off Smoothing

  • Absolute discounting, without the Good-Turing estimate

  • The lower-order n-gram (back-off n-gram) probability is made proportional not to the number of occurrences of a word, but instead to the number of different words that it follows, e.g.:

– In “San Francisco”, “Francisco” only follows a single history, so it should receive a low unigram probability
– In “US dollars”, “HK dollars”, “TW dollars”, etc., “dollars” follows many histories, so it should receive a high unigram probability

    C(US dollars) = 200, C(HK dollars) = 100, C(TW dollars) = 25, ...

SLIDE 47

Kneser-Ney Back-off Smoothing

  • Take the conditional probabilities of bigrams (m-gram, m = 2) as an example:

    P_KN(w_i | w_{i-1}) =
      max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}],   if C[w_{i-1} w_i] > 0
      α(w_{i-1}) · P_KN(w_i),                      otherwise

  • 1. the unigram back-off probability:

    P_KN(w_i) = C[• w_i] / Σ_{w_j} C[• w_j],
    where C[• w_i] is the number of unique words preceding w_i

  • 2. normalizing constant:

    α(w_{i-1}) = [ 1 - Σ_{w_i: C[w_{i-1} w_i] > 0} max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}] ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_KN(w_i)

    with the absolute discount 0 ≤ D ≤ 1

SLIDE 48

Kneser-Ney Back-off Smoothing: Example

  • Given a text sequence as the following:

    S A B C A A B B C S    (S is the sequence’s start/end mark)

    Show the corresponding Kneser-Ney unigram probabilities:

    C[•A] = 3, C[•B] = 2, C[•C] = 1, C[•S] = 1
    ⇒ P_KN(A) = 3/7, P_KN(B) = 2/7, P_KN(C) = 1/7, P_KN(S) = 1/7
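A short sketch that derives these continuation counts from the sequence itself:

    from collections import defaultdict

    seq = ["S", "A", "B", "C", "A", "A", "B", "B", "C", "S"]

    preceders = defaultdict(set)
    for prev, w in zip(seq, seq[1:]):
        preceders[w].add(prev)                                  # which distinct words precede w?

    continuation = {w: len(s) for w, s in preceders.items()}    # C[. w]
    total = sum(continuation.values())                          # 7

    for w in ["A", "B", "C", "S"]:
        print(w, continuation[w], continuation[w] / total)      # 3/7, 2/7, 1/7, 1/7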

SLIDE 49

Katz vs. Kneser-Ney Back-off Smoothing

  • Example 1: Wall Street Journal (WSJ), English

– A vocabulary of 60,000 words and a corpus of 260 million words (read speech) from a newspaper such as Wall Street Journal

SLIDE 50

Katz vs. Kneser-Ney Back-off Smoothing

  • Example 2: Broadcast News Speech, Mandarin

– A vocabulary of 72,000 words and a corpus of 170 million Chinese characters from the Central News Agency (CNA)
– Tested on Mandarin broadcast news speech collected in Taiwan, September 2002, about 3.7 hours
– The perplexities are high here, because the LM training materials are not speech transcripts but merely newswire texts

    Models               Perplexity    Character Error Rate (after tree-copy search, TC)
    Bigram Katz            959.56        16.81
    Bigram Kneser-Ney      942.34        18.17
    Trigram Katz           752.49        14.62
    Trigram Kneser-Ney     670.24        14.90

SLIDE 51

Interpolated Kneser-Ney Smoothing

  • Always combine both the higher-order and the lower-order LM probability distributions

  • Take the bigram (m-gram, m = 2) conditional probabilities as an example:

    P_IKN(w_i | w_{i-1}) = max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}] + λ(w_{i-1}) · C[• w_i] / Σ_w C[• w]

– where

  • C[• w_i]: the number of unique words that precede w_i
  • λ(w_{i-1}): a normalizing constant that makes the probabilities sum to 1,

    λ(w_{i-1}) = D · C[w_{i-1} •] / C[w_{i-1}],
    where C[w_{i-1} •] is the number of unique words that follow the history w_{i-1}

SLIDE 52

Interpolated Kneser-Ney Smoothing

  • The exact formulas for the interpolated Kneser-Ney smoothed trigram conditional probabilities:

    P_IKN(w_i | w_{i-2} w_{i-1}) = max( C[w_{i-2} w_{i-1} w_i] - D, 0 ) / C[w_{i-2} w_{i-1}] + λ(w_{i-2} w_{i-1}) · P_IKN(w_i | w_{i-1})

    P_IKN(w_i | w_{i-1}) = max( C[• w_{i-1} w_i] - D, 0 ) / Σ_w C[• w_{i-1} w] + λ(w_{i-1}) · P_IKN(w_i)

    P_IKN(w_i) = max( C[• w_i] - D, 0 ) / Σ_w C[• w] + λ(•) · 1/|V|

For the IKN bigram and unigram, the number of unique words that precede a given history is considered, instead of the frequency counts.
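A sketch of an interpolated Kneser-Ney bigram model in the spirit of the bigram form on the previous slide (the uniform 1/|V| back-off term is omitted, and D = 0.75 is just a common default, not a value from the slides):

    from collections import defaultdict

    def train_ikn_bigram(tokens, D=0.75):
        """Interpolated Kneser-Ney bigram:
        P(w | v) = max(C[v w] - D, 0)/C[v] + lam(v) * C[. w]/C[. .],
        lam(v) = D * (# distinct words following v) / C[v]."""
        bigram = defaultdict(int)
        unigram = defaultdict(int)      # counts v as a bigram history
        followers = defaultdict(set)
        preceders = defaultdict(set)
        for v, w in zip(tokens, tokens[1:]):
            bigram[(v, w)] += 1
            unigram[v] += 1
            followers[v].add(w)
            preceders[w].add(v)
        total_types = sum(len(s) for s in preceders.values())   # C[. .]

        def prob(w, v):
            cont = len(preceders[w]) / total_types              # continuation probability
            if unigram[v] == 0:
                return cont                                      # unseen history: back off entirely
            lam = D * len(followers[v]) / unigram[v]
            return max(bigram[(v, w)] - D, 0) / unigram[v] + lam * cont

        return prob

    p = train_ikn_bigram("S A B C A A B B C S".split())
    print(p("B", "A"), p("C", "A"))                    # seen and unseen bigrams both get mass
    print(sum(p(w, "A") for w in "S A B C".split()))   # sums to 1 over the vocabulary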

SLIDE 53

Back-off vs. Interpolation

  • When determining the probability of n-grams with

nonzero counts, interpolated models use information from lower-order distributions while back-off models do not

  • In both back-off and interpolated models, lower-order distributions are used in determining the probability of n-grams with zero counts

  • It is easy to create a back-off version of an interpolated algorithm by modifying the normalizing constant