

slide-1
SLIDE 1

Language Modeling

Hsin-min Wang

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 11
  • 2. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
  • 3. Joshua Goodman's (Microsoft Research) public presentation material
  • 4. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. ASSP, March 1987
  • 5. R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," ICASSP 1995
slide-2
SLIDE 2

2

Acoustic vs. Linguistic

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$

Acoustic pattern matching and knowledge about language are equally important in recognizing and understanding natural speech

– Lexical knowledge (vocabulary definition and word pronunciation) is required, as are the syntax and semantics of the language (the rules that determine what sequences of words are grammatically well formed and meaningful)
– In addition, knowledge of the pragmatics of language (the structure of extended discourse, and what people are likely to say in particular contexts) can be important to achieving the goal of spoken language understanding systems
slide-3
SLIDE 3

3

Language Modeling - Formal vs. Probabilistic

The formal language model – grammar and parsing

– The grammar is a formal specification of the permissible structures for the language
– The parsing technique is the method of analyzing the sentence to see if its structure is compliant with the grammar

The probabilistic (or stochastic) language model

– Stochastic language models take a probabilistic viewpoint of language modeling

  • The probabilistic relationship among a sequence of words can be derived and modeled from the corpora

– Avoid the need to create broad-coverage formal grammars
– Stochastic language models play a critical role in building a working spoken language system
– N-gram language models are the most widely used

slide-4
SLIDE 4

4

N-gram Language Models - Applications

N-gram language models are widely used in many application domains

– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition (OCR)
– Machine translation
– Document classification and routing
– Information retrieval

slide-5
SLIDE 5

5

N-gram Language Models

For a word sequence W, P(W) can be decomposed into a product of conditional probabilities:

– In reality, the probability $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is impossible to estimate reliably for even moderate values of i (data sparseness problem)
– A practical solution is to assume that $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ depends only on the several previous words

By the chain rule,

$$P(\mathbf{W}) = P(w_1, w_2, \ldots, w_m) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1}) = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

where $w_1, \ldots, w_{i-1}$ is the history of $w_i$. Truncating the history to the previous N-1 words,

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, w_{i-N+2}, \ldots, w_{i-1})$$

gives the N-gram language models.

slide-6
SLIDE 6

6

N-gram Language Models (cont.)

If the word depends on the previous two words, we have a trigram: $P(w_i \mid w_{i-2}, w_{i-1})$. Similarly, we can have a bigram, $P(w_i \mid w_{i-1})$, or a unigram, $P(w_i)$.

To calculate P(Mary loves that person):

– In trigram models, we would take
  P(Mary loves that person) = P(Mary|<s>) P(loves|<s>,Mary) P(that|Mary,loves) P(person|loves,that) P(</s>|that,person)
– In bigram models, we would take
  P(Mary loves that person) = P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that) P(</s>|person)
– In unigram models, we would take
  P(Mary loves that person) = P(Mary) P(loves) P(that) P(person)
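The decomposition above maps directly onto code. Below is a minimal sketch (the helper name and the probability values are made up for illustration) of scoring a sentence with a bigram model, using the same <s> and </s> markers as in the example.

```python
import math
from typing import Dict, List, Tuple

def bigram_sentence_logprob(words: List[str],
                            p_bigram: Dict[Tuple[str, str], float]) -> float:
    """Sum log P(w_i | w_{i-1}) over the sentence, padded with <s> and </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = p_bigram.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes the product (no smoothing yet)
        logp += math.log(p)
    return logp

# Illustrative numbers only, not estimates from any real corpus:
p = {("<s>", "Mary"): 0.2, ("Mary", "loves"): 0.3, ("loves", "that"): 0.1,
     ("that", "person"): 0.4, ("person", "</s>"): 0.5}
print(math.exp(bigram_sentence_logprob(["Mary", "loves", "that", "person"], p)))
```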

slide-7
SLIDE 7

7

N-gram Probability Estimation

The trigram can be estimated by observing from a text corpus the frequencies or counts of the word pair C(wi-2,wi-1) and the triplet C(wi-2,wi-1,wi) as follows:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

– This estimation is based on the maximum likelihood (ML) principle
  • This assignment of probabilities yields the trigram model that assigns the highest probability to the training data of all possible trigram models

The bigram can be estimated as

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

and the unigram can be estimated as

$$P(w_i) = \frac{C(w_i)}{\text{corpus size}}$$
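As a quick illustration of these count ratios, here is a minimal sketch (an assumed helper, not from the slides) that derives ML bigram estimates from a toy corpus padded with <s>/</s>:

```python
from collections import Counter
from typing import List

def ml_bigram_estimates(sentences: List[List[str]]):
    """Return P_ML(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}) from a toy corpus."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    return {(h, w): c / unigram[h] for (h, w), c in bigram.items()}

probs = ml_bigram_estimates([["John", "read", "her", "book"],
                             ["I", "read", "a", "different", "book"]])
print(probs[("read", "a")])   # C(read a) / C(read) = 1/2
```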

slide-8
SLIDE 8

8

Maximum Likelihood Estimation of N-gram Probability

Given a training corpus T and the language model Λ

$$\text{Corpus: } T = w^{(1)} w^{(2)} \cdots w^{(k)} \cdots w^{(L)}, \qquad \text{Vocabulary: } W = \{w_1, w_2, \ldots, w_V\}$$

(Example from a word-segmented Chinese corpus: "… 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …", i.e., "… President Chen Shui-bian visited New York, USA … President Chen Shui-bian stated in Panama …": P(總統 | 陳水扁) = P(President | Chen Shui-bian) = ?)

$$p(T \mid \Lambda) \cong \prod_{k} p\big(w^{(k)} \mid h^{(k)}\big) = \prod_{h,\, w_i} \lambda_{h w_i}^{\,N_{h w_i}}, \qquad \lambda_{h w_i} = p(w_i \mid h)$$

where N-grams with the same history $h$ are collected together, $N_{h w_i} = C[h\, w_i]$ is the count of $h w_i$ in $T$, and

$$\forall h \in T: \quad \sum_{w_i} \lambda_{h w_i} = 1, \qquad \lambda_{h w_i} \in [0, 1]$$

slide-9
SLIDE 9

9

Maximum Likelihood Estimation of N-gram Probability (cont.)

Taking the logarithm of $p(T \mid \Lambda)$, we have

$$\Phi(\Lambda) = \log p(T \mid \Lambda) = \sum_{h} \sum_{w_i} N_{h w_i} \log \lambda_{h w_i}$$

For any pair $(h, w_i)$, we try to maximize $\Phi(\Lambda)$ subject to $\sum_{w_i} \lambda_{h w_i} = 1,\ \forall h$. Using Lagrange multipliers $l_h$, define

$$\bar{\Phi}(\Lambda) = \Phi(\Lambda) + \sum_{h} l_h \Big(1 - \sum_{w_j} \lambda_{h w_j}\Big)$$

Setting the derivative to zero:

$$\frac{\partial \bar{\Phi}(\Lambda)}{\partial \lambda_{h w_i}} = \frac{N_{h w_i}}{\lambda_{h w_i}} - l_h = 0
\;\Rightarrow\; \lambda_{h w_i} = \frac{N_{h w_i}}{l_h}
\;\Rightarrow\; \sum_{w_j} \lambda_{h w_j} = \frac{\sum_{w_j} N_{h w_j}}{l_h} = 1
\;\Rightarrow\; l_h = \sum_{w_j} N_{h w_j}$$

$$\therefore\; \hat{\lambda}_{h w_i} = \frac{N_{h w_i}}{\sum_{w_j} N_{h w_j}} = \frac{C[h\, w_i]}{C[h]}$$

slide-10
SLIDE 10

10

A Simple Bigram Example

(Bigram count table from a small example corpus; see the smoothed continuation of this example on Slide 27.)

slide-11
SLIDE 11

11

Major Issues for N-gram LM

Evaluation

– How can you tell a good language model from a bad one?
– Run a speech recognizer or adopt other statistical measurements

Smoothing

– Deal with the data sparseness of real training data
– Various approaches have been proposed

Adaptation

– Dynamic adjustment of the language model parameters, such as the n-gram probabilities, vocabulary size, and the choice of words in the vocabulary
– E.g., P(table|the operating), P(system|the operating)

slide-12
SLIDE 12

12

How to Evaluate a Language Model?

Given two language models, how do we compare them?

Use them in a recognizer and find the one that leads to the lower recognition error rate

– The best way to evaluate a language model
– Expensive!

Use information theory (entropy and perplexity) to get an estimate of how good a language model might be

– Perplexity: the geometric mean of the number of words that can follow a history (or word) after the language model has been applied

slide-13
SLIDE 13

13

Entropy

The information derivable from outcome $x_i$ depends on its probability $P(x_i)$; the amount of information is defined as

$$I(x_i) = \log \frac{1}{P(x_i)}$$

The entropy H(X) of the random variable X is defined as the average amount of information:

$$H(X) = E[I(X)] = \sum_{i \in S} P(x_i)\, I(x_i) = \sum_{i \in S} P(x_i) \log \frac{1}{P(x_i)} = -E[\log P(x_i)]$$

– The entropy H(X) attains its maximum value when the random variable X has a uniform distribution, i.e., $P(x_i) = \frac{1}{N}\ \forall i$
– The entropy H(X) is nonnegative and becomes zero only if the probability function is deterministic, i.e., $P(x_i) = 1$ for some $x_i$
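A small numeric check of these two properties, as a sketch (the distributions are made up):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes -> log2(4) = 2.0 (maximum)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic -> 0.0
print(entropy([0.5, 0.25, 0.125, 0.125]))  # in between -> 1.75
```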

slide-14
SLIDE 14

14

Cross-Entropy

The entropy of a language is

$$H(\text{language}) = -\sum_i P(E_i) \log_2 P(E_i), \qquad \text{where } E_i \text{ is a language event}$$

It can be proved that

$$-\sum_i P(E_i) \log_2 P(E_i) \;\le\; -\sum_i P(E_i) \log_2 P(E_i \mid \text{Model})$$

i.e., the cross-entropy of a model with respect to the correct model is never below the true entropy; better models will have lower cross-entropy

The entropy of a language with a vocabulary size of |V| on a per-word basis is

$$H = -\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i)$$

– If every word is equally likely, $-\sum_{i=1}^{|V|} \frac{1}{|V|} \log_2 \frac{1}{|V|} = \log_2 |V| \;\ge\; H$ (the true entropy)

slide-15
SLIDE 15

15

Logprob

For a language with a vocabulary size of |V|, the cross-entropy of a model with respect to the correct model on a per-word basis is

$$H = -\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i \mid \text{Model})$$

Given a text corpus $\mathbf{W} = w_1, w_2, \ldots, w_N$, the cross-entropy of a model can be estimated by the logprob (LP), defined as

$$LP(\text{Model}) = -\frac{1}{N} \log_2 P(\mathbf{W} \mid \text{Model}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid \text{Model})$$

with $LP(\text{Model}) \ge H$ (the true entropy)

The goal is to find a model which has a logprob that is as close as possible to the true entropy

slide-16
SLIDE 16

16

Perplexity

The perplexity PP(W) is defined as the reciprocal of the geometric average probability assigned by the model to each word in the test set W

– This is a measure, related to cross-entropy, known as test-set perplexity

$$PP(\mathbf{W}) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$

Equivalently, perplexity is two to the power of the logprob:

$$PP(\mathbf{W}) = 2^{LP(\text{Model})} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \ldots w_{i-1})}
= \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1})^{-\frac{1}{N}}
= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
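Both forms above give the same number, which a short sketch can confirm (the per-word model probabilities here are invented):

```python
import math

def perplexity(word_probs):
    """PP = 2^(-1/N * sum log2 p_i) = (prod 1/p_i)^(1/N), given per-word model probabilities."""
    N = len(word_probs)
    lp = -sum(math.log2(p) for p in word_probs) / N      # logprob LP
    return 2.0 ** lp

probs = [0.2, 0.1, 0.25, 0.05]             # P(w_i | w_1..w_{i-1}) for a 4-word test text
print(perplexity(probs))
print(math.prod(1.0 / p for p in probs) ** (1.0 / len(probs)))  # same value, direct form
```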

slide-17
SLIDE 17

17

More about Perplexity

An LM that assigns equal probability to 100 words would have perplexity 100

Ask a speech recognizer to recognize digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity ≈ 10

Ask a speech recognizer to recognize names at a large institute (10,000 persons) – difficult – perplexity ≈ 10,000

$$\text{Entropy} = -\sum_{i=1}^{100} p(w_i) \log_2 p(w_i) = \sum_{i=1}^{100} \frac{1}{100} \log_2 100 = \log_2 100$$

$$\therefore\; \text{perplexity} = 2^{\log_2 100} = 100$$

slide-18
SLIDE 18

18

More about Perplexity (cont.)

Perplexity is an indication of the complexity of the language, if we have an accurate estimate of P(W)

A language with higher perplexity means that the number of words branching from a previous word is larger on average

A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities

$$PP(\text{language}) = 2^{H(\text{language})}$$

slide-19
SLIDE 19

19

More about Perplexity (cont.)

Training-set perplexity measures how well the language model fits the training data

Test-set perplexity evaluates the generalization capability of the language model

– When we say perplexity, we mean "test-set perplexity"

It is generally true that lower perplexity correlates with better recognition performance

– The perplexity is essentially a statistically weighted word branching measure on the test set
– The higher the perplexity, the more branches the speech recognizer needs to consider statistically

slide-20
SLIDE 20

20

Are Language Models with Lower Perplexity Better?

The true (optimal) model for the data has the lowest possible perplexity

The lower the perplexity, the closer we are to the true model

Typically, perplexity correlates well with speech recognition word error rate

– Correlates better when both models are trained on the same data
– Doesn't correlate well when the training data changes

The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128 to 176 (trigram)

The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20

slide-21
SLIDE 21

21

Perplexity - rule of thumb

A rough rule of thumb (by Rosenfeld)

– A reduction of 5% in perplexity is usually not practically significant
– A 10% to 20% reduction is noteworthy, and usually translates into some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline is quite significant

slide-22
SLIDE 22

22

Perplexity vs. Vocabulary Size

The perplexity of a bigram LM with different vocabulary sizes

– The perplexity generally increases with the vocabulary size
– There are generally more competing words for a given context when the vocabulary size becomes larger

slide-23
SLIDE 23

23

N-gram Smoothing

Maximum likelihood (ML) estimates of N-gram probabilities are computed as

– Trigram probabilities:
$$P_{ML}(z \mid xy) = \frac{C[xyz]}{\sum_{w} C[xyw]} = \frac{C[xyz]}{C[xy]}$$
– Bigram probabilities:
$$P_{ML}(y \mid x) = \frac{C[xy]}{\sum_{w} C[xw]} = \frac{C[xy]}{C[x]}$$

where $C[\cdot]$ denotes a count in the training data.

slide-24
SLIDE 24

24

Why N-gram Smoothing?

Data Sparseness of real training data

– If the training corpus is not large enough, many actually possible word successions may not be well observed, leading to many extremely small probabilities

  • E.g. bigram modeling

P(read|Mulan) = 0 ⇒ P(Mulan read a book) = 0 ⇒ P(W) = 0 ⇒ P(X|W)P(W) = 0

– In speech recognition, if P(W) is zero, the string W will never be considered as a possible transcription, regardless of how unambiguous the acoustic signal is → an error will be made
– Assigning all strings a nonzero probability helps prevent errors in speech recognition → smoothing

slide-25
SLIDE 25

25

Why N-gram Smoothing? (cont.)

Smoothing techniques adjust the maximum likelihood estimate of probabilities to produce more robust probabilities for unseen data, although the likelihood for the training data may be hurt slightly

– Tend to make distributions flatter by adjusting low probabilities such as zero probabilities upward, and high probabilities downward

slide-26
SLIDE 26

26

Simple Smoothing Methods

Add-one smoothing

– Pretend each trigram occurs once more than it actually does:

$$P_{smooth}(z \mid xy) = \frac{C[xyz] + 1}{\sum_{w}\big(C[xyw] + 1\big)} = \frac{C[xyz] + 1}{C[xy] + V}, \qquad
P_{smooth}(z \mid y) = \frac{C[yz] + 1}{C[y] + V}$$

where V is the size of the vocabulary

Add-delta smoothing

$$P_{smooth}(z \mid xy) = \frac{C[xyz] + \delta}{C[xy] + \delta V}, \qquad
P_{smooth}(z \mid y) = \frac{C[yz] + \delta}{C[y] + \delta V}$$

Both work badly. DO NOT DO THESE TWO (Joshua Goodman said)
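For completeness, a sketch of add-delta for bigrams (add-one when delta = 1), matching the formula above; as noted, this is shown only as a baseline, not something to use:

```python
from collections import Counter

def add_delta_bigram(sentences, vocab, delta=1.0):
    """P_smooth(z | y) = (C[yz] + delta) / (C[y] + delta * V); vocab includes <s> and </s>."""
    V = len(vocab)
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))

    def prob(y, z):
        return (bigram[(y, z)] + delta) / (unigram[y] + delta * V)
    return prob
```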

slide-27
SLIDE 27

27

A Simple Bigram Example (cont.)


V=11 {John, read, her, book, I, a, different, by, Mulan, <s>, </s>}

With add-one smoothing, a sentence whose bigrams were all seen in training gets, for example, 3/14 × 3/13 × 3/14 × 2/13 × 3/14 ≈ 0.00035

P(Mulan read a book)
= P(Mulan|<s>) P(read|Mulan) P(a|read) P(book|a) P(</s>|book)
= (0+1)/(3+11) × (0+1)/(1+11) × 3/14 × 2/13 × 3/14
≈ 0.000042

slide-28
SLIDE 28

28

General Backoff Smoothing

The general form for n-gram backoff

– If an n-gram has a nonzero count, we use the (discounted) distribution $\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$
– Otherwise, we back off to the lower-order n-gram distribution $P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$
  • The scaling factor $\gamma(w_{i-n+1}, \ldots, w_{i-1})$ is chosen to make the conditional probabilities sum to 1, i.e., $\sum_{w_i} P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = 1$

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
\begin{cases}
\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), & \text{if } C[w_{i-n+1}, \ldots, w_i] > 0 \\[4pt]
\gamma(w_{i-n+1}, \ldots, w_{i-1})\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}), & \text{if } C[w_{i-n+1}, \ldots, w_i] = 0
\end{cases}$$

where $\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) < P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$, i.e., the seen n-grams are discounted.

For a trigram, for example, the two cases are $\alpha(w_i \mid w_{i-2}, w_{i-1})$ and $\gamma(w_{i-2}, w_{i-1})\, P_{smooth}(w_i \mid w_{i-1})$.

slide-29
SLIDE 29

29

Interpolated Smoothing

The general form for interpolated n-gram smoothing is given below

The key difference between backoff and interpolated models:

– For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not

In both models, lower-order distributions are used in determining the probability of n-grams with zero counts

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\, P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + \big(1 - \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\big)\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$$

where

$$P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C[w_{i-n+1}, \ldots, w_i]}{C[w_{i-n+1}, \ldots, w_{i-1}]} \quad \text{(a ratio of counts)}$$

For a trigram model, for example, $P_{smooth}(w_i \mid w_{i-2}, w_{i-1})$ interpolates with $P_{smooth}(w_i \mid w_{i-1})$, which in turn interpolates with the unigram level, and the recursion ends with $P_{smooth}(w_i) = \lambda\, P_{ML}(w_i) + (1 - \lambda)\, \frac{1}{N}$.
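A sketch of this interpolation recursion for a bigram model that backs off to a unigram and then to a uniform floor (the λ values are fixed constants purely for illustration; in practice they would be tuned, e.g., on held-out data, and using a uniform floor over the vocabulary is my reading of the 1/N term above):

```python
from collections import Counter

def interpolated_bigram(sentences, vocab_size, lam_bi=0.7, lam_uni=0.8):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    N = sum(unigram.values())

    def p_uni(w):
        # ML unigram interpolated with a uniform floor over the vocabulary (assumption)
        return lam_uni * unigram[w] / N + (1 - lam_uni) / vocab_size

    def p_bi(h, w):
        # ML bigram interpolated with the smoothed unigram
        p_ml = bigram[(h, w)] / unigram[h] if unigram[h] else 0.0
        return lam_bi * p_ml + (1 - lam_bi) * p_uni(w)
    return p_bi
```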

slide-30
SLIDE 30

30

Why Backoff Smoothing ?

Backoff smoothing is attractive because it is easy to implement for practical speech recognition systems

– The Katz backoff model, based on the Good-Turing smoothing principle, is widely used

slide-31
SLIDE 31

31

Good-Turing Estimate

First published by Good (1953), with Turing acknowledged

A smoothing technique to deal with infrequent m-grams (m-gram smoothing); it usually needs to be used together with other backoff schemes to achieve good performance

How many words were seen once? Use that to estimate how many are unseen. All other estimates are adjusted (down) to give probability mass to the unseen events

Use the notation m-grams instead of n-grams here

slide-32
SLIDE 32

32

Good-Turing Estimate (cont.)

The Good-Turing estimate states that for any m-gram $a = w_1^m$ that occurs r times ($r = C[w_1^m]$), we should pretend it occurs $r^*$ times ($r^* = C^*[w_1^m]$), as follows:

$$r^* = (r+1)\, \frac{n_{r+1}}{n_r}, \qquad \text{where } n_r \text{ is the number of m-grams that occur exactly } r \text{ times in the training data}$$

– The probability estimate for an m-gram $a = w_1^m$ with r counts is

$$P_{GT}(a) = \frac{r^*}{N}, \qquad \text{where } N = \sum_{r=0}^{\infty} n_r\, r^* = \sum_{r=0}^{\infty} (r+1)\, n_{r+1} = \sum_{r=1}^{\infty} r\, n_r$$

(Not a conditional probability!) N is equal to the original number of counts in the training data.

slide-33
SLIDE 33

33

Good-Turing Estimate (cont.)

It follows from the above that the total probability estimate for the set of m-grams that actually occur in the training data is

$$\sum_{w_1^m:\; C[w_1^m] > 0} P_{GT}(w_1^m) = \frac{1}{N} \sum_{r=1}^{\infty} n_r\, r^* = \frac{1}{N} \sum_{r=1}^{\infty} (r+1)\, n_{r+1} = \frac{N - n_1}{N} = 1 - \frac{n_1}{N}$$

The probability of observing the previously unseen m-grams is therefore

$$\sum_{w_1^m:\; C[w_1^m] = 0} P_{GT}(w_1^m) = \frac{n_1}{N}$$

– which is just the fraction of singletons (m-grams occurring only once) in the training data

(The same result follows from $r^*\, n_r = (r+1)\, n_{r+1}$ with $r = 0$: the adjusted count mass for the unseen m-grams is $n_0 \cdot 0^* = n_1$.)

slide-34
SLIDE 34

34

Good-Turing Estimate: An Example

Imagine you are fishing. You have caught 10 carp (鯉魚), 3 cod (鱈魚), 2 tuna (鮪魚), 1 trout (鱒魚), 1 salmon (鮭魚), 1 eel (鰻魚), so N = 18

How likely is it that the next species is new?

– P0 = n1/N = 3/18 = 1/6

How likely is eel? (compute 1*)

– n1 = 3, n2 = 1
– 1* = 2 × 1/3 = 2/3
– P(eel) = 1*/N = (2/3)/18 = 1/27 (P(trout) = P(salmon) = 1/27)

How likely is tuna? (compute 2*)

– n2 = 1, n3 = 1
– 2* = 3 × 1/1 = 3
– P(tuna) = 2*/N = 3/18 = 1/6

But how likely is cod? (compute 3*)

– n4 = 0: need smoothing for n4 in advance

$$P_0 = \sum_{w_1^m:\; C[w_1^m] = 0} P_{GT}(w_1^m) = \frac{n_1}{N}, \qquad r^* = (r+1)\, \frac{n_{r+1}}{n_r}$$
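The fishing numbers can be reproduced mechanically; a sketch with the species counts hard-coded from the slide:

```python
from collections import Counter

catch = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())          # 18 fish in total
n = Counter(catch.values())      # n_r: how many species were seen exactly r times

def r_star(r):
    """Good-Turing adjusted count r* = (r+1) * n_{r+1} / n_r (undefined if n_{r+1} = 0)."""
    return (r + 1) * n[r + 1] / n[r]

print(n[1] / N)        # P(next species is new) = n_1/N = 3/18
print(r_star(1) / N)   # P(eel)  = (2 * 1/3) / 18 = 1/27
print(r_star(2) / N)   # P(tuna) = (3 * 1/1) / 18 = 1/6
print(n[4])            # 0 -> r* for cod (r = 3) needs smoothed n_r values first
```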

slide-35
SLIDE 35

35

Good-Turing Estimate (cont.)

The Good-Turing estimate may yield some problems when nr+1=0

– An alternative strategy is to apply Good-Turing to the m-grams (events) seen at most k times, where k is a parameter chosen so that nr+1 ≠0, r=1,…,k

slide-36
SLIDE 36

36

Good-Turing Estimate (cont.)

In the Good-Turing estimate, it may happen that an m-gram (event) occurring k times takes a higher probability than an event occurring k+1 times

– The value of k may be selected in an attempt to overcome such a drawback
– Experimentally, k ranging from 4 to 8 will not allow the above condition to be true (for r ≤ k)

$$\hat{P}_{GT}(a_k) = \frac{(k+1)\, n_{k+1}}{n_k\, N}, \qquad \hat{P}_{GT}(a_{k+1}) = \frac{(k+2)\, n_{k+2}}{n_{k+1}\, N}$$

It is possible that $(k+1)\, n_{k+1} \cdot n_{k+1} > (k+2)\, n_{k+2} \cdot n_k$, i.e., $k^* > (k+1)^*$. Requiring $\hat{P}_{GT}(a_k) < \hat{P}_{GT}(a_{k+1})$ amounts to

$$(k+1)\, n_{k+1}^{\,2} < (k+2)\, n_k\, n_{k+2}$$

slide-37
SLIDE 37

37

Katz Backoff Smoothing

Katz smoothing extends the intuitions of the Good-Turing estimate by adding the combination of higher-order language models with lower-order models

– E.g., bigram and unigram language models

Large counts are taken to be reliable, so they are not discounted

– E.g., r*=r for all r > k for some k, say k in the range of 5-8

The discount ratios for the lower counts r ≤ k are derived from the Good-Turing estimate

slide-38
SLIDE 38

38

Katz Backoff Smoothing (cont.)

Take the bigram (m-gram, m=2) counts as an example:

$$C^*[w_{i-1} w_i] =
\begin{cases}
r, & \text{if } r > k \\
d_r\, r, & \text{if } k \ge r > 0 \\
\beta(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}
\qquad\quad r = C[w_{i-1} w_i]$$

  • 1. The Good-Turing estimate predicts that the total count mass to be assigned to the bigrams with zero counts is $n_0 \cdot \frac{n_1}{n_0} = n_1$
  • 2. The discount ratio $d_r$ must satisfy two constraints: the discount is proportional to the discount predicted by the Good-Turing estimate,
$$1 - d_r = \mu \left(1 - \frac{r^*}{r}\right),$$
and the total number of counts removed from the seen bigrams equals the mass assigned to the unseen ones,
$$\sum_{r=1}^{k} n_r\, (1 - d_r)\, r = n_1$$
  • 3. The value $\beta(w_{i-1})$ is chosen to equalize the total number of counts in the distribution, i.e.,
$$\sum_{w_i} C^*[w_{i-1} w_i] = \sum_{w_i} C[w_{i-1} w_i]$$

slide-39
SLIDE 39

39

Katz Backoff Smoothing - Derivation of the Discount Ratio

Given the two constraints

$$1 - d_r = \mu \left(1 - \frac{r^*}{r}\right) \quad (1) \qquad\qquad \sum_{r=1}^{k} n_r\, (1 - d_r)\, r = n_1 \quad (2)$$

first note that, since $n_r\, r^* = (r+1)\, n_{r+1}$,

$$\sum_{r=1}^{k} (r - r^*)\, n_r = \sum_{r=1}^{k} r\, n_r - \sum_{r=1}^{k} (r+1)\, n_{r+1} = \sum_{r=1}^{k} r\, n_r - \sum_{r=2}^{k+1} r\, n_r = n_1 - (k+1)\, n_{k+1}$$

Substituting (1) into (2),

$$\mu \sum_{r=1}^{k} n_r\, r \left(1 - \frac{r^*}{r}\right) = \mu \sum_{r=1}^{k} (r - r^*)\, n_r = \mu \big(n_1 - (k+1)\, n_{k+1}\big) = n_1
\;\Rightarrow\; \mu = \frac{n_1}{n_1 - (k+1)\, n_{k+1}} \quad (3)$$

slide-40
SLIDE 40

40

Katz Backoff Smoothing - Derivation of the Discount Ratio (cont.)

Combining (1) and (3),

$$d_r = 1 - \mu \left(1 - \frac{r^*}{r}\right) = 1 - \frac{1 - \frac{r^*}{r}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}} = \frac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$$

(putting both terms over the common denominator, the constant terms in the numerator cancel)

slide-41
SLIDE 41

41

Katz Backoff Smoothing - Derivation of the Normalizing Constant

  • 3. The value $\beta(w_{i-1})$ is chosen to equalize the total number of counts in the distribution, i.e., $\sum_{w_i} C^*[w_{i-1} w_i] = \sum_{w_i} C[w_{i-1} w_i]$

The appropriate value of $\beta(w_{i-1})$ is computed so that the smoothed bigram satisfies this probability constraint:

$$\sum_{w_i:\, C[w_{i-1} w_i] = 0} C^*[w_{i-1} w_i]
\;=\; \beta(w_{i-1}) \sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)
\;=\; \sum_{w_i} C[w_{i-1} w_i] \;-\; \sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]$$

$$\Rightarrow\quad \beta(w_{i-1}) = \frac{\sum_{w_i} C[w_{i-1} w_i] - \sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)}$$

slide-42
SLIDE 42

42

Katz Backoff Smoothing - Derivation of the Normalizing Constant

Dividing the smoothed counts by $C[w_{i-1}] = \sum_{w_i} C[w_{i-1} w_i]$ turns them into conditional probabilities,

$$P^*(w_i \mid w_{i-1}) = \frac{C^*[w_{i-1} w_i]}{\sum_{w_i} C[w_{i-1} w_i]}
\qquad\text{with}\qquad
C^*[w_{i-1} w_i] =
\begin{cases}
r, & \text{if } r > k \\
d_r\, r, & \text{if } k \ge r > 0 \\
\beta(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}$$

so that

$$P^*(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } r > k \\[6pt]
d_r\, \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } k \ge r > 0 \\[6pt]
\alpha(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}$$

where the normalizing constant is

$$\alpha(w_{i-1}) = \frac{\beta(w_{i-1})}{C[w_{i-1}]}
= \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P^*(w_i \mid w_{i-1})}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)}
= \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P^*(w_i \mid w_{i-1})}{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P(w_i)}$$

slide-43
SLIDE 43

43

Katz Backoff Smoothing

Take the conditional probabilities of bigrams (m-gram, m=2) as an example:

$$P_{Katz}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } r > k \\[6pt]
d_r\, \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } k \ge r > 0 \\[6pt]
\alpha(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}
\qquad\quad r = C[w_{i-1} w_i]$$

  • 1. The discount ratio:
$$d_r = \frac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$$
  • 2. The normalizing constant:
$$\alpha(w_{i-1}) = \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P_{Katz}(w_i \mid w_{i-1})}{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P(w_i)}$$
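Putting the three pieces together, here is a minimal bigram sketch (the helper names and the fallback used when the count-of-count statistics n_r are unusable are my own assumptions, not part of Katz's recipe):

```python
from collections import Counter

def katz_bigram(sentences, k=5):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    N = sum(unigram.values())
    p_uni = {w: c / N for w, c in unigram.items()}   # backoff unigram P(w)
    n = Counter(bigram.values())                     # n_r over bigram types

    def d(r):
        """Discount ratio d_r for 0 < r <= k; counts above k are not discounted."""
        if r > k or not (n[r] and n[r + 1] and n[1]):
            return 1.0                               # assumption: skip discounting if n_r stats are unusable
        r_star = (r + 1) * n[r + 1] / n[r]
        A = (k + 1) * n[k + 1] / n[1]
        if A >= 1:
            return 1.0
        return (r_star / r - A) / (1 - A)

    def prob(h, w):
        r = bigram[(h, w)]
        if r > 0:
            return d(r) * r / unigram[h]
        seen = {v for (x, v) in bigram if x == h}    # words observed after history h
        alpha = (1 - sum(prob(h, v) for v in seen)) / (1 - sum(p_uni[v] for v in seen))
        return alpha * p_uni.get(w, 0.0)
    return prob
```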
slide-44
SLIDE 44

44

Katz Backoff Smoothing: An Example

A small vocabulary consists of only five words, i.e., $V = \{w_1, w_2, \ldots, w_5\}$. The frequency counts for word pairs starting with $w_1$ are

$$C[w_1, w_2] = 3, \quad C[w_1, w_3] = 2, \quad C[w_1, w_4] = 1, \quad C[w_1, w_1] = C[w_1, w_5] = 0$$

and the word frequency counts are

$$C[w_1] = 6, \quad C[w_2] = 8, \quad C[w_3] = 10, \quad C[w_4] = 6, \quad C[w_5] = 4$$

Katz backoff smoothing with the Good-Turing estimate is used here for word pairs with frequency counts equal to or less than two. Show the conditional probabilities of word bigrams starting with $w_1$, i.e., $P_{Katz}(w_1 \mid w_1),\ P_{Katz}(w_2 \mid w_1),\ \ldots,\ P_{Katz}(w_5 \mid w_1) = ?$

slide-45
SLIDE 45

45

Katz Backoff Smoothing: An Example (cont.)

Here k = 2. Among the bigrams starting with $w_1$: $n_1 = 1$, $n_2 = 1$, $n_3 = 1$, so

$$1^* = 2 \cdot \frac{n_2}{n_1} = 2, \qquad 2^* = 3 \cdot \frac{n_3}{n_2} = 3$$

Discount ratios, using $d_r = \dfrac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$ with $\frac{(k+1)\, n_{k+1}}{n_1} = \frac{3 \cdot 1}{1} = 3$:

$$d_1 = \frac{\frac{1^*}{1} - 3}{1 - 3} = \frac{2 - 3}{-2} = \frac{1}{2}, \qquad
d_2 = \frac{\frac{2^*}{2} - 3}{1 - 3} = \frac{\frac{3}{2} - 3}{-2} = \frac{3}{4}$$

Conditional probabilities (notice that $P_{Katz}(w_i) = P_{ML}(w_i) = C[w_i]/34$ here):

– For r = 3 > k: $P_{Katz}(w_2 \mid w_1) = \frac{C[w_1, w_2]}{C[w_1]} = \frac{3}{6} = \frac{1}{2}$
– For r = 2: $P_{Katz}(w_3 \mid w_1) = d_2 \cdot \frac{2}{6} = \frac{3}{4} \cdot \frac{2}{6} = \frac{1}{4}$
– For r = 1: $P_{Katz}(w_4 \mid w_1) = d_1 \cdot \frac{1}{6} = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$
– The normalizing constant:
$$\alpha(w_1) = \frac{1 - \frac{1}{2} - \frac{1}{4} - \frac{1}{12}}{\frac{6}{34} + \frac{4}{34}} = \frac{\frac{2}{12}}{\frac{10}{34}} = \frac{17}{30}$$
– For r = 0: $P_{Katz}(w_1 \mid w_1) = \alpha(w_1)\, P(w_1) = \frac{17}{30} \cdot \frac{6}{34} = \frac{1}{10}$ and $P_{Katz}(w_5 \mid w_1) = \alpha(w_1)\, P(w_5) = \frac{17}{30} \cdot \frac{4}{34} = \frac{1}{15}$

And $P_{Katz}(w_1 \mid w_1) + P_{Katz}(w_2 \mid w_1) + \cdots + P_{Katz}(w_5 \mid w_1) = \frac{1}{10} + \frac{1}{2} + \frac{1}{4} + \frac{1}{12} + \frac{1}{15} = 1$

slide-46
SLIDE 46

46

Absolute Discounting

Absolute discounting involves subtracting a fixed discount $D \le 1$ from each nonzero count.

If we express absolute discounting in terms of interpolated models, we have

$$P_{abs}(w_i \mid w_{i-N+1} \ldots w_{i-1}) = \frac{\max\{C[w_{i-N+1} \ldots w_i] - D,\, 0\}}{\sum_{w_i} C[w_{i-N+1} \ldots w_i]} + \big(1 - \lambda_{w_{i-N+1} \ldots w_{i-1}}\big)\, P_{abs}(w_i \mid w_{i-N+2} \ldots w_{i-1})$$

To make this distribution sum to 1, we normalize it to determine $1 - \lambda_{w_{i-N+1} \ldots w_{i-1}}$.

slide-47
SLIDE 47

47

The Problem of Using Lower-order N-grams for Smoothing

Absolute discounting and Katz smoothing assign a relatively high probability P(Francisco|w) to an unobserved event "w Francisco", because C(Francisco) is high and the unigram probability P(Francisco) is high

– But "Francisco" follows only a single history "San"

Also consider the probability P(dollars|w)

– An unobserved event "w dollars" will receive a high P(dollars|w) since P(dollars) is high
– "dollars" usually follows a country name or a number, e.g., "US dollars", "TW dollars", "two dollars", etc.

What about applying a backoff distribution that is proportional not to the number of occurrences of a word but to the number of different words that it follows?

slide-48
SLIDE 48

48

Kneser-Ney Backoff Smoothing

Take the conditional probabilities of bigrams (m-gram, m=2) as an example:

$$P_{KN}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}, & \text{if } C[w_{i-1} w_i] > 0 \\[8pt]
\alpha(w_{i-1})\, P_{KN}(w_i), & \text{otherwise}
\end{cases}$$

where the backoff unigram is based on continuation counts:

$$P_{KN}(w_i) = \frac{C[\bullet\, w_i]}{\sum_{w_j} C[\bullet\, w_j]}, \qquad C[\bullet\, w_i]: \text{the number of unique words preceding } w_i$$

  • 1. $0 \le D \le 1$
  • 2. The normalizing constant:
$$\alpha(w_{i-1}) = \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{KN}(w_i)}$$

chosen so that

$$\sum_{w_i:\, C[w_{i-1} w_i] > 0} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]} + \alpha(w_{i-1}) \sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{KN}(w_i) = 1$$

slide-49
SLIDE 49

49

Kneser-Ney Backoff Smoothing: An Example

Given a text sequence as follows:

S A B C A A B B C S (S is the sequence start/end marker)

Show the corresponding Kneser-Ney unigram probabilities:

$$C[\bullet\, A] = 3, \quad C[\bullet\, B] = 2, \quad C[\bullet\, C] = 1, \quad C[\bullet\, S] = 1
\;\Rightarrow\;
P_{KN}(A) = \frac{3}{7}, \quad P_{KN}(B) = \frac{2}{7}, \quad P_{KN}(C) = \frac{1}{7}, \quad P_{KN}(S) = \frac{1}{7}$$
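The continuation counts in this example are easy to verify in code; a sketch with the sequence taken from the slide:

```python
from collections import defaultdict

seq = list("SABCAABBCS")                    # S A B C A A B B C S
preceders = defaultdict(set)
for prev, cur in zip(seq, seq[1:]):
    preceders[cur].add(prev)                # unique words preceding each word

total_types = len({(p, c) for p, c in zip(seq, seq[1:])})   # distinct bigram types = 7
for w in "ABCS":
    print(w, len(preceders[w]), "/", total_types)           # A: 3/7, B: 2/7, C: 1/7, S: 1/7
```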

slide-50
SLIDE 50

50

Katz vs. Kneser-Ney Backoff Smoothing

Example 1: Wall Street Journal (WSJ), English

– A vocabulary of 60,000 words and a corpus of 260 million words (read speech) from a newspaper such as Wall Street Journal

The perplexity rose to 378 when tested on the personal information management domain

slide-51
SLIDE 51

51

Katz vs. Kneser-Ney Backoff Smoothing

Example 2: Broadcast News Speech, Mandarin

– A vocabulary of 72,000 words and a corpus of 170 million Chinese characters from the Central News Agency (CNA) – Tested on Mandarin broadcast news speech collected in Taiwan, September 2002, about 3.7 hours – The perplexities are high here, because the LM training materials are not speech transcripts but merely newswire texts

Models               Perplexity   Character Error Rate (after tree-copy search, TC)
Bigram Katz          959.56       16.81
Bigram Kneser-Ney    942.34       18.17
Trigram Katz         752.49       14.62
Trigram Kneser-Ney   670.24       14.90

(after Berlin Chen)

slide-52
SLIDE 52

52

Interpolated Kneser-Ney Smoothing

Always combine both the higher-order and the lower-order LM probability distributions

Take the bigram (m-gram, m=2) conditional probabilities as an example:

$$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]} + \lambda(w_{i-1})\, \frac{C[\bullet\, w_i]}{\sum_{w} C[\bullet\, w]}$$

– Where
  • $C[\bullet\, w_i]$: the number of unique words preceding $w_i$
  • $\lambda(w_{i-1})$: a normalizing constant that makes the probabilities sum to 1,
$$\lambda(w_{i-1}) = \frac{D}{C[w_{i-1}]}\, C[w_{i-1}\, \bullet], \qquad C[w_{i-1}\, \bullet]: \text{the number of unique words that follow the history } w_{i-1}$$

slide-53
SLIDE 53

53

Interpolated Kneser-Ney Smoothing (cont.)

The interpolated bigram above indeed sums to 1 over $w_i$:

$$\sum_{w_i} P_{IKN}(w_i \mid w_{i-1})
= \sum_{w_i} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}
+ \lambda(w_{i-1}) \sum_{w_i} \frac{C[\bullet\, w_i]}{\sum_{w} C[\bullet\, w]}$$

The second sum equals 1. For the first sum, there are $C[w_{i-1}\, \bullet]$ word types $w_i$ with $C[w_{i-1} w_i] > 0$, so

$$\sum_{w_i} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}
= \frac{\sum_{w_i:\, C[w_{i-1} w_i] > 0} C[w_{i-1} w_i] - D\, C[w_{i-1}\, \bullet]}{C[w_{i-1}]}
= 1 - \frac{D}{C[w_{i-1}]}\, C[w_{i-1}\, \bullet]
= 1 - \lambda(w_{i-1})$$

Hence $\sum_{w_i} P_{IKN}(w_i \mid w_{i-1}) = \big(1 - \lambda(w_{i-1})\big) + \lambda(w_{i-1}) = 1$.
slide-54
SLIDE 54

54

Interpolated Kneser-Ney Smoothing

The exact formula for interpolated Kneser-Ney smoothed trigram conditional probabilities

$$P_{IKN}(w_i \mid w_{i-2} w_{i-1}) = \frac{\max\{C[w_{i-2} w_{i-1} w_i] - D,\, 0\}}{\sum_{w_i} C[w_{i-2} w_{i-1} w_i]} + \lambda(w_{i-2} w_{i-1})\, P_{IKN}(w_i \mid w_{i-1})$$

$$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max\{C[\bullet\, w_{i-1} w_i] - D,\, 0\}}{\sum_{w_i} C[\bullet\, w_{i-1} w_i]} + \lambda(w_{i-1})\, P_{IKN}(w_i)$$

$$P_{IKN}(w_i) = \frac{\max\{C[\bullet\, w_i] - D,\, 0\}}{\sum_{w_i} C[\bullet\, w_i]} + \lambda\, \frac{1}{|V|}$$

For the IKN bigram and unigram, the number of unique words that precede a given word is considered, instead of the frequency counts.
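A sketch of the interpolated Kneser-Ney bigram corresponding to the formulas above (D is fixed at 0.75 purely for illustration; the slides only require 0 ≤ D ≤ 1, and the handling of unseen histories is my own assumption):

```python
from collections import Counter, defaultdict

def ikn_bigram(sentences, D=0.75):
    bigram = Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        bigram.update(zip(toks, toks[1:]))

    hist_total = Counter()            # C[w_{i-1}] = total count of the history
    followers = defaultdict(set)      # unique words that follow each history
    preceders = defaultdict(set)      # unique words that precede each word
    for (h, w), c in bigram.items():
        hist_total[h] += c
        followers[h].add(w)
        preceders[w].add(h)
    n_types = len(bigram)             # total number of distinct bigram types

    def p_cont(w):                    # continuation unigram C[. w] / sum_w' C[. w']
        return len(preceders[w]) / n_types

    def prob(h, w):
        if hist_total[h] == 0:        # unseen history: fall back to the continuation unigram
            return p_cont(w)
        lam = D * len(followers[h]) / hist_total[h]
        return max(bigram[(h, w)] - D, 0) / hist_total[h] + lam * p_cont(w)
    return prob
```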
slide-55
SLIDE 55

55

Backoff vs. Interpolation

When determining the probability of n-grams with nonzero counts, interpolated models use information from lower-order distributions while backoff models do not

In both backoff and interpolated models, lower-order distributions are used in determining the probability of n-grams with zero counts

It is easy to create a backoff version of an interpolated algorithm by modifying the normalizing constant

slide-56
SLIDE 56

56

Class N-grams

Define classes for words that exhibit similar semantic or grammatical behavior

This is another effective way to handle the data sparseness problem

WEEKDAY = Sunday, Monday, Tuesday, …
MONTH = January, February, April, May, June, …
EVENT = meeting, class, party, …

– Consider P(Tuesday|party on), P(Monday|party on), …, P(Saturday|party on)

A word may belong to more than one class and a class may contain more than one word (many-to-many mapping)

slide-57
SLIDE 57

57

Class N-grams (cont.)

The n-gram model can be computed based on the previous n-1 classes (assuming a word can be uniquely mapped to only one class):

$$P(w_i \mid c_{i-n+1} \ldots c_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-n+1} \ldots c_{i-1})$$

In general, we can express the class trigram as

$$P(\mathbf{W}) = \sum_{c_1 \ldots c_n} \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})$$

If the classes are nonoverlapping, i.e., a word may belong to only one class, this reduces to

$$P(\mathbf{W}) = \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})$$

Similarly, for the class bigram,

$$P(\mathbf{W}) = \sum_{c_1 \ldots c_n} \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1})
\qquad\text{or}\qquad
P(\mathbf{W}) = \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1}) \quad \text{(nonoverlapping classes)}$$

slide-58
SLIDE 58

58

Class N-grams (cont.)

The bigram probability of a word given the prior word can be estimated as (with $w_{i-1}$ and $c_{i-1}$ in one-to-one mapping)

$$P(w_i \mid w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1}) = \frac{C[w_i]}{C[c_i]} \cdot \frac{C[c_{i-1}, c_i]}{C[c_{i-1}]}$$

since $P(w_i \mid w_{i-1}) = P(w_i \mid c_{i-1}) = \sum_j P(w_i, c_j \mid c_{i-1}) = P(w_i, c_i \mid c_{i-1}) = P(w_i \mid c_i, c_{i-1})\, P(c_i \mid c_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$

The trigram probability of a word given the two prior words can be estimated as

$$P(w_i \mid w_{i-2}, w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1}) = \frac{C[w_i]}{C[c_i]} \cdot \frac{C[c_{i-2}, c_{i-1}, c_i]}{C[c_{i-2}, c_{i-1}]}$$
slide-59
SLIDE 59

59

Class N-grams (cont.)

Clustering is another way to handle the data sparseness problem (smoothing of the language model)

For general-purpose large vocabulary dictation applications, class-based n-grams have not significantly improved recognition accuracy

– Mainly used as a backoff model to complement the lower-order n-grams for better smoothing

For limited (or narrow discourse) domain speech recognition, the class-based n-gram is very helpful

– The class can efficiently encode semantic information for improving keyword spotting and speech understanding accuracy
– Good results are often achieved by manual clustering of semantic categories

slide-60
SLIDE 60

60

Rule-based Classes vs. Data-driven Classes

(Figure: phrase fragments such as "a meeting Sunday is canceled", "the date on Monday will be postponed", "one party Tuesday", and "prepared/arranged in January/February/April", illustrating rule-based word classes like WEEKDAY and MONTH.)

For general-purpose large-vocabulary dictation applications, it is impractical to derive functional classes in the same manner → data-driven approach

slide-61
SLIDE 61

61

Cache Language Models

The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to create a dynamic n-gram model

Trigram interpolated with unigram cache:

$$P_{cache}(z \mid xy) \approx \lambda\, P_{smooth}(z \mid xy) + (1 - \lambda)\, P_{cache}(z \mid history), \qquad
P_{cache}(z \mid history) = \frac{C[z \in history]}{\text{length}(history)}$$

(history: the document/conversation dictated so far)

Trigram interpolated with bigram cache:

$$P_{cache}(z \mid xy) \approx \lambda\, P_{smooth}(z \mid xy) + (1 - \lambda)\, P_{cache}(z \mid y, history), \qquad
P_{cache}(z \mid y, history) = \frac{C[yz \in history]}{C[y \in history]}$$
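A sketch of the unigram-cache interpolation above (λ is fixed arbitrarily; p_smooth stands for whatever static smoothed trigram is in use, passed in as a function):

```python
from collections import Counter

def cached_prob(z, x, y, history_tokens, p_smooth, lam=0.9):
    """P_cache(z | xy) ~= lam * P_smooth(z | xy) + (1 - lam) * C[z in history] / len(history)."""
    cache = Counter(history_tokens)
    p_hist = cache[z] / len(history_tokens) if history_tokens else 0.0
    return lam * p_smooth(z, x, y) + (1 - lam) * p_hist
```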

slide-62
SLIDE 62

62

LM Integrated into Speech Recognition

Theoretically,

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$

Practically, the language model is a better predictor while acoustic probabilities aren't "real" probabilities

– Penalize insertions

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)^{\alpha}\, \beta^{\,\text{length}(W)}, \qquad \text{where } \alpha \text{ and } \beta \text{ can be determined empirically}$$

slide-63
SLIDE 63

63

Known Weakness in Current LM

Brittleness Across Domain

– Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
– E.g., conversations vs. news broadcasts

False Independence Assumption

– In order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words