Statistical Language Modeling for Speech Recognition - Berlin Chen (PowerPoint PPT presentation)

SLIDE 1

Statistical Language Modeling for Speech Recognition

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 11
  • 2. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
  • 3. Joshua Goodman's (Microsoft Research) public presentation material
  • 4. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on ASSP, March 1987
  • 5. R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," ICASSP 1995

Berlin Chen 2003

SLIDE 2

What is Language Modeling ?

  • Language Modeling (LM) deals with the probability

distribution of word sequences, e.g.:

P(“hi”) = 0.01, P(“and nothing but the truth”) ≈ 0.001, P(“and nuts sing on the roof”) ≈ 0

From Joshua Goodman’s material

SLIDE 3

What is Language Modeling ?

  • For a word sequence W, the probability P(W) can be decomposed into a product of conditional probabilities (chain rule):

    P(W) = P(w_1, w_2, ..., w_m)
         = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_m|w_1, w_2, ..., w_{m-1})
         = ∏_{i=1}^{m} P(w_i | w_1, ..., w_{i-1})

    where w_1, ..., w_{i-1} is the history of w_i

– E.g.: P(“and nothing but the truth”) = P(“and”) × P(“nothing|and”) × P(“but|and nothing”) × P(“the|and nothing but”) × P(“truth|and nothing but the”)
– However, it is impossible to estimate and store P(w_i | w_1, ..., w_{i-1}) if i is large (data sparseness problem, etc.)

SLIDE 4

What is LM Used for ?

  • Statistical language modeling attempts to capture the

regularities of natural languages

– Improve the performance of various natural language applications by estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents
– The first significant model was proposed in 1980

    P(W) = P(w_1, w_2, ..., w_m) = ?

SLIDE 5

What is LM Used for ?

  • Statistical language modeling is prevalent in many application domains

– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition (OCR)
– Machine translation
– Document classification and routing
– Information retrieval

SLIDE 6

Current Status

  • Ironically, the most successful statistical language

modeling techniques use very little knowledge of what language is

– The most prevalent n-gram language models take no advantage of the fact that what is being modeled is language
– It may be a sequence of arbitrary symbols, with no deep structure, intention, or thought behind them
– F. Jelinek said “put language back into language modeling”

    P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-2}, w_{i-1})    (history of length n-1)

SLIDE 7

LM in Speech Recognition

  • For a given acoustic observation X = x_1, x_2, ..., x_n, the goal of speech recognition is to find the corresponding word sequence W = w_1, w_2, ..., w_m that has the maximum posterior probability P(W|X):

    Ŵ = arg max_W P(W|X) = arg max_W [ P(W) P(X|W) / P(X) ] = arg max_W P(W) P(X|W)

    where W = w_1, w_2, ..., w_i, ..., w_m and each w_i belongs to Voc = {w_1, w_2, ..., w_V}

    P(W|X): posterior probability
    P(W): prior probability (language modeling)
    P(X|W): acoustic modeling

SLIDE 8

The Trigram Approximation

  • The trigram modeling assumes that each word depends only on the previous two words (a window of three words total)

– “tri” means three, “gram” means writing
– E.g.: P(“the | … whole truth and nothing but”) ≈ P(“the | nothing but”)
        P(“truth | … whole truth and nothing but the”) ≈ P(“truth | but the”)
– Similar definition for bigram (a window of two words total)

  • How do we find probabilities?

– Get real text, and start counting (empirically)!

    P(“the | nothing but”) ≈ C[“nothing but the”] / C[“nothing but”]

    (C[·] denotes a count; the resulting probability may be 0 for unseen trigrams)
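The counting recipe can be sketched in a few lines of Python (the toy corpus and the helper name trigram_mle are illustrative, not from the slides):

    from collections import defaultdict

    def trigram_mle(tokens):
        """Estimate P(w3 | w1 w2) = C[w1 w2 w3] / C[w1 w2] by simple counting."""
        tri_counts = defaultdict(int)
        bi_counts = defaultdict(int)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            tri_counts[(w1, w2, w3)] += 1
            bi_counts[(w1, w2)] += 1
        return lambda w1, w2, w3: (
            tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]
            if bi_counts[(w1, w2)] > 0 else 0.0  # unseen history -> probability 0 (no smoothing yet)
        )

    # Toy corpus; real systems count over large text collections.
    tokens = "and nothing but the truth and nothing but the facts".split()
    p = trigram_mle(tokens)
    print(p("nothing", "but", "the"))  # C["nothing but the"] / C["nothing but"] = 2/2 = 1.0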

SLIDE 9

Maximum Likelihood Estimate (ML/MLE) for LM

  • Given a training corpus T and the language model Λ

    Corpus T = w_1 w_2 ... w_k ... w_L ...,  Vocabulary = {w_1, w_2, ..., w_V}

    e.g.: …陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …    P(總統|陳水扁) = ?

    p(T|Λ) ≅ ∏_k p(w_k|h_k) = ∏_h ∏_{w_i} λ_{hw_i}^{N_{hw_i}}

    where λ_{hw_i} = p(w_i|h) for history h, N_{hw_i} is the count of the n-gram h w_i in the corpus, and n-grams with the same history h are collected together

– Essentially, the distribution of the sample counts N_{hw_i} with the same history h is a multinomial distribution:

    P(N_{hw_1}, N_{hw_2}, ..., N_{hw_V} | h, T) = ( N_h! / ∏_i N_{hw_i}! ) ∏_i λ_{hw_i}^{N_{hw_i}},
    where N_h = Σ_j N_{hw_j} and Σ_j λ_{hw_j} = 1 for every history h in T

– In terms of counts:

    p(w_i|h) = λ_{hw_i} = N_{hw_i} / N_h = C[h w_i] / C[h],
    where N_{hw_i} = C[h w_i] and N_h = C[h] are counted in the corpus T

SLIDE 10

Maximum Likelihood Estimate (ML/MLE) for LM

  • Taking the logarithm of p(T|Λ), we have

    Φ(Λ) = log p(T|Λ) = Σ_h Σ_{w_i} N_{hw_i} log λ_{hw_i}

  • For any pair (h, w_j), try to maximize Φ(Λ) subject to Σ_j λ_{hw_j} = 1, ∀h

    Introducing a Lagrange multiplier l_h for each history h:

    Φ'(Λ) = Φ(Λ) + Σ_h l_h ( 1 - Σ_j λ_{hw_j} )

    ∂Φ'(Λ)/∂λ_{hw_i} = N_{hw_i}/λ_{hw_i} - l_h = 0   ⇒   λ_{hw_i} = N_{hw_i}/l_h

    Summing over all w_i and using Σ_i λ_{hw_i} = 1:   l_h = Σ_i N_{hw_i} = N_h

    ∴ the ML solution is λ_{hw_i} = N_{hw_i}/N_h = C[h w_i]/C[h]

SLIDE 11

Main Issues for LM

  • Evaluation

– How can you tell a good language model from a bad one?
– Run a speech recognizer or adopt other statistical measurements

  • Smoothing

– Deal with data sparseness of real training data
– Various approaches have been proposed

  • Caching

– If you say something, you are likely to say it again later
– Adjust word frequencies observed in the current conversation

  • Clustering

– Group words with similar properties (similar semantic or grammatical behavior) into the same class
– Another efficient way to handle the data sparseness problem

SLIDE 12

Evaluation

  • The two most common metrics for evaluating a language model

– Word recognition error rate (WER)
– Perplexity (PP)

  • Word recognition error rate

– Requires the participation of a speech recognition system (slow!)
– Needs to deal with the combination of acoustic probabilities and language model probabilities (penalizing or weighting between them)

SLIDE 13

Evaluation

  • Perplexity

– Perplexity is the geometric-average inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability)
– It can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model

    PP(W) = PP(w_1, w_2, ..., w_m) = [ (1/P(w_1)) ∏_{i=2}^{m} 1/P(w_i | w_1, w_2, ..., w_{i-1}) ]^{1/m}

– For trigram modeling:

    PP(W) = PP(w_1, w_2, ..., w_m) = [ (1/P(w_1)) (1/P(w_2|w_1)) ∏_{i=3}^{m} 1/P(w_i | w_{i-2}, w_{i-1}) ]^{1/m}
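A minimal sketch of the computation, assuming the caller supplies the m conditional probabilities of a test text from some (smoothed) model:

    import math

    def perplexity(sentence_probs):
        """Geometric-average inverse probability: PP = (prod 1/p_i)^(1/m).
        sentence_probs: the m conditional probabilities P(w_i | history) of a test text."""
        m = len(sentence_probs)
        log_sum = sum(math.log(p) for p in sentence_probs)  # assumes every p > 0, i.e. a smoothed model
        return math.exp(-log_sum / m)

    # A uniform 10-way choice at every step gives perplexity 10 (the digit-recognition example below).
    print(perplexity([0.1] * 20))  # 10.0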

SLIDE 14

Evaluation

  • More about Perplexity

– Perplexity is an indication of the complexity of the language, if we have an accurate estimate of P(W)
– A language with higher perplexity means that the number of words branching from a previous word is larger on average
– A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities
– Examples:

  • Ask a speech recognizer to recognize digits “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity ≈ 10
  • Ask a speech recognizer to recognize names at a large institute (10,000 persons) – hard – perplexity ≈ 10,000

SLIDE 15

Evaluation

  • More about Perplexity (cont.)

– Training-set perplexity: measures how well the language model fits the training data
– Test-set perplexity: evaluates the generalization capability of the language model

  • When we say perplexity, we mean “test-set perplexity”
SLIDE 16

Evaluation

  • Is a language model with lower perplexity better?

– The true (optimal) model for the data has the lowest possible perplexity
– The lower the perplexity, the closer we are to the true model
– Typically, perplexity correlates well with speech recognition word error rate

  • It correlates better when both models are trained on the same data
  • It doesn’t correlate well when the training data changes

– The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128 ~ 176 (trigram)
– The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20

SLIDE 17

Evaluation

  • The perplexity of bigram models with different vocabulary sizes
SLIDE 18

Evaluation

  • A rough rule of thumb (by Rosenfeld)

– A reduction of 5% in perplexity is usually not practically significant
– A 10% ~ 20% reduction is noteworthy, and usually translates into some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline is quite significant

SLIDE 19

Smoothing

  • The maximum likelihood (ML) estimate of language models has been shown previously, e.g.:

– Trigram probabilities

    P_ML(z|xy) = C[xyz] / Σ_w C[xyw] = C[xyz] / C[xy]

– Bigram probabilities

    P_ML(y|x) = C[xy] / Σ_w C[xw] = C[xy] / C[x]

    (C[·] denotes a count in the training data)

SLIDE 20

Smoothing

  • Data Sparseness

– Many actually possible events (word successions) in the test set may not be well observed in the training set/data

  • E.g., bigram modeling:

    P(read|Mulan) = 0  ⇒  P(Mulan read a book) = 0  ⇒  P(W) = 0  ⇒  P(X|W)P(W) = 0

– Whenever a string W such that P(W) = 0 occurs during the speech recognition task, an error will be made

SLIDE 21

Smoothing

  • Operations of smoothing

– Assign all strings (or events/word successions) a nonzero probability, even if they never occur in the training data
– Tend to make distributions flatter, by adjusting lower probabilities upward and high probabilities downward

SLIDE 22

Smoothing: Simple Models

  • Add-one smoothing

– For example, pretend each trigram occurs once more than it actually does:

    P_smooth(z|xy) ≈ (C[xyz] + 1) / Σ_w (C[xyw] + 1) = (C[xyz] + 1) / (C[xy] + V),
    where V is the total number of vocabulary words

  • Add-delta smoothing

    P_smooth(z|xy) ≈ (C[xyz] + δ) / (C[xy] + δV)

Works badly. DO NOT DO THESE TWO (Joshua Goodman said)
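Despite the warning, a small sketch makes the add-δ formula concrete (the count dictionaries and vocabulary size are made up for illustration):

    def p_add_delta(tri_counts, bi_counts, vocab_size, w1, w2, w3, delta=0.5):
        """P_smooth(w3 | w1 w2) = (C[w1 w2 w3] + delta) / (C[w1 w2] + delta * V).
        delta = 1 gives add-one smoothing."""
        return (tri_counts.get((w1, w2, w3), 0) + delta) / \
               (bi_counts.get((w1, w2), 0) + delta * vocab_size)

    tri = {("nothing", "but", "the"): 2}
    bi = {("nothing", "but"): 2}
    print(p_add_delta(tri, bi, vocab_size=10, w1="nothing", w2="but", w3="the"))   # seen trigram
    print(p_add_delta(tri, bi, vocab_size=10, w1="nothing", w2="but", w3="roof"))  # unseen, but nonzero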

SLIDE 23

Smoothing: Back-Off Models

  • The general form of n-gram back-off:

    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) =
      P̂_smooth(w_i | w_{i-n+1}, ..., w_{i-1}),                                if C[w_{i-n+1}, ..., w_i] > 0
      α(w_{i-n+1}, ..., w_{i-1}) · P_smooth(w_i | w_{i-n+2}, ..., w_{i-1}),   if C[w_{i-n+1}, ..., w_i] = 0

– α(w_{i-n+1}, ..., w_{i-1}): a normalizing/scaling factor chosen to make the conditional probability sum to 1

  • I.e., Σ_{w_i} P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) = 1

    For example,

    α(w_{i-n+1}, ..., w_{i-1}) =
      [ 1 - Σ_{w_i: C[w_{i-n+1},...,w_{i-1},w_i] > 0} P̂_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) ]
      / Σ_{w_j: C[w_{i-n+1},...,w_{i-1},w_j] = 0} P_smooth(w_j | w_{i-n+2}, ..., w_{i-1})

SLIDE 24

Smoothing: Interpolated Models

  • The general form of the interpolated n-gram model:

    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) =
      λ_{w_{i-n+1},...,w_{i-1}} · P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) + (1 - λ_{w_{i-n+1},...,w_{i-1}}) · P_smooth(w_i | w_{i-n+2}, ..., w_{i-1})

    where P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) = C[w_{i-n+1}, ..., w_i] / C[w_{i-n+1}, ..., w_{i-1}]

  • The key difference between back-off and interpolated models

– For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not
– Moreover, n-grams with the same counts can have different probability estimates
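A sketch of the interpolated form for bigrams, with a hand-picked constant λ standing in for a weight that would normally be tuned on held-out data (function name and counts are illustrative):

    def p_interp(bi_counts, uni_counts, total_words, w_prev, w, lam=0.7):
        """Interpolated bigram: lam * P_ML(w | w_prev) + (1 - lam) * P_ML(w)."""
        p_bi = bi_counts.get((w_prev, w), 0) / uni_counts[w_prev] if uni_counts.get(w_prev) else 0.0
        p_uni = uni_counts.get(w, 0) / total_words
        return lam * p_bi + (1 - lam) * p_uni

    uni = {"but": 2, "the": 2, "nothing": 2}
    bi = {("but", "the"): 2}
    print(p_interp(bi, uni, total_words=6, w_prev="but", w="the"))      # nonzero count: both orders contribute
    print(p_interp(bi, uni, total_words=6, w_prev="but", w="nothing"))  # zero bigram count: unigram term only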

SLIDE 25

Clustering

  • Class-based language models

– Define classes for words that exhibit similar semantic or grammatical behavior:
    WEEKDAY = {Sunday, Monday, Tuesday, …}
    MONTH = {January, February, April, May, June, …}
    EVENT = {meeting, class, party, …}

  • P(Tuesday| party on) is similar to P(Monday| party on)
SLIDE 26

Clustering

  • A word may belong to more than one class and a class

may contain more than one word (many-to-many mapping)

    [Diagram: example phrases built from words such as “a/one”, “meeting/party/date”, “Sunday/Monday/Tuesday”, “in January/February/April”, “is canceled/will be postponed/prepared/arranged”, with words linked to overlapping classes]

SLIDE 27

Clustering

  • The n-gram model can be computed based on the previous n-1 classes

– If the trigram approximation and a unique mapping from words to word classes are used:

    P(w_i | w_{i-n+1} ... w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1}) ≈ P(w_i | Class(w_i)) · P(Class(w_i) | Class(w_{i-2}), Class(w_{i-1}))

    where Class(w_i) is the class to which w_i belongs

– Empirically estimate the probabilities:

    P(w_i | Class(w_i)) = C[w_i] / C[Class(w_i)]

    P(Class(w_i) | Class(w_{i-2}), Class(w_{i-1})) = C[Class(w_{i-2}) Class(w_{i-1}) Class(w_i)] / C[Class(w_{i-2}) Class(w_{i-1})]

SLIDE 28

Clustering

  • Clustering is another way to battle the data sparseness problem (smoothing of the language model)

  • For general-purpose large-vocabulary dictation applications, class-based n-grams have not significantly improved recognition accuracy

– They are mainly used as a back-off model to complement the lower-order n-grams for better smoothing

  • For limited-domain (or narrow-discourse) speech recognition, the class-based n-gram is very helpful

– The class can efficiently encode semantic information for improved keyword-spotting and speech understanding accuracy
– Good results are often achieved by manual clustering of semantic categories

SLIDE 29

Caching

  • The basic idea of caching is to accumulate n-grams dictated so far in the current document/conversation and use these to create a dynamic n-gram model

  • Trigram interpolated with a unigram cache:

    P_cache(z|xy) ≈ λ P_smooth(z|xy) + (1 - λ) P_cache(z|history),
    P_cache(z|history) = C[z ∈ history] / length(history)

  • Trigram interpolated with a bigram cache:

    P_cache(z|xy) ≈ λ P_smooth(z|xy) + (1 - λ) P_cache(z|y, history),
    P_cache(z|y, history) = C[yz ∈ history] / C[y ∈ history]

    (history: the document/conversation dictated so far)
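A sketch of the unigram-cache interpolation; the history list, the stand-in p_smooth, and λ = 0.9 are all illustrative choices:

    def p_cached(z, x, y, history, p_smooth, lam=0.9):
        """Trigram interpolated with a unigram cache:
        P_cache(z | xy) ~= lam * P_smooth(z | xy) + (1 - lam) * C[z in history] / len(history)."""
        p_cache_uni = history.count(z) / len(history) if history else 0.0
        return lam * p_smooth(z, x, y) + (1 - lam) * p_cache_uni

    # 'history' is whatever has been dictated so far in the current document/conversation.
    history = "i swear to tell the truth the whole truth".split()
    p_smooth = lambda z, x, y: 0.01  # stand-in for a smoothed static trigram model
    print(p_cached("truth", "the", "whole", history, p_smooth))  # boosted: "truth" already occurred twice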

SLIDE 30

Caching

  • Real life of caching

– Someone says “I swear to tell the truth”
– The system hears “I swerve to smell the soup”
– Someone says “The whole truth”, and, with the cache, the system hears “The toll booth” – the errors are locked in (the cache remembers!)
– Caching works well when users correct as they go; it works poorly, or even hurts, without correction

SLIDE 31

Known Weakness in Current LM

  • Brittleness Across Domains

– Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
– E.g., conversations vs. news broadcasts

  • False Independence Assumption

– In order to remain trainable, the n-gram model assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words

SLIDE 32

LM Integrated into Speech Recognition

  • Theoretically,

    Ŵ = arg max_W P(W) P(X|W)

  • Practically, the language model is a better predictor, while the acoustic probabilities aren’t “real” probabilities

– Penalize insertions:

    Ŵ = arg max_W P(X|W) P(W)^α · β^{length(W)},
    where α and β can be empirically decided
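In practice the combination is done in the log domain; a sketch with illustrative weight values (the names lm_weight and insertion_penalty are assumptions, not standard identifiers):

    import math

    def hypothesis_score(log_p_acoustic, log_p_lm, num_words, lm_weight=10.0, insertion_penalty=0.5):
        """Log-domain version of P(X|W) * P(W)^alpha * beta^length(W):
        score = log P(X|W) + alpha * log P(W) + length(W) * log(beta).
        alpha and beta would be tuned empirically on development data."""
        return log_p_acoustic + lm_weight * log_p_lm + num_words * math.log(insertion_penalty)

    # Comparing two competing hypotheses for the same acoustics:
    print(hypothesis_score(-120.0, -8.0, 5))
    print(hypothesis_score(-118.0, -12.0, 7))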

SLIDE 33

Good-Turing Estimate

  • First published by Good (1953), with the idea credited to Turing

  • A smoothing technique to deal with infrequent m-grams

(m-gram smoothing), but it usually needs to be used together with other back-off schemes to achieve good performance

  • How many words were seen once? Estimate for how

many are unseen. All other estimates are adjusted (down) to give probabilities for unseen

Use the notation m-grams instead of n-grams here

SLIDE 34

Good-Turing Estimate

  • For any m-gram w_1^m = a that occurs r times (r = c[w_1^m]), we pretend it occurs r* times (r* = c*[w_1^m]):

    r* = (r+1) · n_{r+1} / n_r,
    where n_r is the number of m-grams that occur exactly r times in the training data

– The probability estimate for an m-gram a with r counts:

    P_GT(a) = r* / N,
    where N is the size (total word count) of the training data
    (Note: this is not a conditional probability!)

  • The size (word count) of the training data remains the same:

    Σ_{r≥0} n_r · r* = Σ_{r≥0} (r+1) · n_{r+1} = Σ_{r≥1} n_r · r = N

SLIDE 35

Good-Turing Estimate

  • It follows from the above that the total probability estimate, using P_GT, for the set of m-grams that actually occurred in the sample is

    Σ_{w_1^m: c[w_1^m] > 0} P_GT(w_1^m) = 1 - n_1/N

  • The probability of observing some previously unseen m-gram is therefore

    Σ_{w_1^m: c[w_1^m] = 0} P_GT(w_1^m) = n_1/N

– which is just the fraction of singletons (m-grams occurring only once) in the text sample
slide-36
SLIDE 36

36

Good-Turing Estimate: Example

  • Imagine you are fishing. You have caught 10 Carp (鯉魚), 3 Cod (鱈魚), 2 tuna (鮪魚), 1 trout (鱒魚), 1 salmon (鮭魚), 1 eel (鰻魚); N = 18

  • How likely is it that the next species is new?

– p_0 = n_1/N = 3/18 = 1/6

  • How likely is eel? (1*)

– n_1 = 3, n_2 = 1
– 1* = 2 × 1/3 = 2/3
– P(eel) = 1*/N = (2/3)/18 = 1/27

  • How likely is tuna? (2*)

– n_2 = 1, n_3 = 1
– 2* = 3 × 1/1 = 3
– P(tuna) = 2*/N = 3/18 = 1/6

  • But how likely is Cod? (3*)

– Needs smoothing for n_4 in advance (3* = 4 × n_4/n_3, and n_4 = 0 here)
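A short sketch that reproduces the numbers of this example (only the species counts above are used):

    from collections import Counter

    catch = ["carp"] * 10 + ["cod"] * 3 + ["tuna"] * 2 + ["trout", "salmon", "eel"]
    counts = Counter(catch)          # species -> r
    n = Counter(counts.values())     # r -> n_r (how many species occur exactly r times)
    N = len(catch)                   # 18

    p_unseen = n[1] / N              # probability mass for a new species: 3/18 = 1/6

    def r_star(r):
        """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r."""
        return (r + 1) * n[r + 1] / n[r]

    print(p_unseen)          # 0.1666...
    print(r_star(1) / N)     # P(eel)  = (2/3) / 18 = 1/27
    print(r_star(2) / N)     # P(tuna) = 3 / 18 = 1/6
    # r_star(3) = 4 * n_4 / n_3 = 0 here: the degenerate case that calls for smoothing the n_r values.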

SLIDE 37

Good-Turing Estimate

  • The Good-Turing estimate may yield some problems when n_{r+1} = 0

– An alternative strategy is to apply Good-Turing only to the n-grams (events) seen at most k times, where k is a parameter chosen so that n_{r+1} ≠ 0 for r = 1, …, k

SLIDE 38

Good-Turing Estimate

  • With the Good-Turing estimate, it may happen that an m-gram (event) occurring k times takes on a higher probability than an event occurring k+1 times

– The choice of k may be selected in an attempt to overcome such a drawback
– Experimentally, k ranging from 4 to 8 will not allow the above condition to be true (for r ≤ k)

    P_GT(a_k) = (k+1) n_{k+1} / (N n_k),    P_GT(a_{k+1}) = (k+2) n_{k+2} / (N n_{k+1})

    P_GT(a_k) < P_GT(a_{k+1})  ⇒  (k+1) n_{k+1}^2 < (k+2) n_k n_{k+2}
SLIDE 39

Katz Back-off Smoothing

  • Extend the intuition of the Good-Turing estimate by

combining higher-order language models with lower-order ones

– E.g., bigrams and unigram language models

  • Larger counts are taken to be reliable, so they are not

discounted

– E.g., for frequency counts r > k

  • Lower counts are discounted, with total reduced counts

assigned to unseen events, based on the Good-Turing estimate

– E.g., for frequency counts r ≤ k

SLIDE 40

Assume lower level LM probability has been defined

Katz Back-off Smoothing

  • Take the bigram (m-gram, m = 2) counts as an example, with r = C[w_{i-1} w_i]:

    C*[w_{i-1} w_i] =
      r,                          if r > k
      d_r · r,                    if k ≥ r > 0
      β(w_{i-1}) · P_Katz(w_i),   if r = 0

1. Large counts (r > k) are taken to be reliable and are not discounted
2. d_r: discount constant (close to r*/r), satisfying the two constraints derived on the next slide
3. β(w_{i-1}): the scaling factor that redistributes the discounted count mass over unseen bigrams,

    β(w_{i-1}) = [ Σ_{w_i} C[w_{i-1} w_i] - Σ_{w_i: C[w_{i-1} w_i] > 0} C*[w_{i-1} w_i] ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_Katz(w_i)

Note: d_r should be calculated for different n-gram counts and different n-gram histories, e.g., w_{i-1} here

SLIDE 41

Katz Back-off Smoothing

  • Derivation of the discount constant d_r:

    Two constraints are imposed:

    (1)  1 - d_r = µ (1 - r*/r)                 (the discount is proportional to the Good-Turing discount)
    (2)  Σ_{r=1}^{k} n_r (1 - d_r) r = n_1      (the total discounted mass equals the mass reserved for unseen events)

    Also, the following equation is known:

    Σ_{r=1}^{k} n_r r* = Σ_{r=1}^{k} (r+1) n_{r+1} = Σ_{r=2}^{k+1} r n_r = Σ_{r=1}^{k} r n_r - n_1 + (k+1) n_{k+1}

    Substituting (1) into (2):

    µ Σ_{r=1}^{k} n_r (r - r*) = n_1   ⇒   µ [ n_1 - (k+1) n_{k+1} ] = n_1   ⇒   µ = n_1 / [ n_1 - (k+1) n_{k+1} ]

SLIDE 42

Katz Back-off Smoothing

  • Derivation of the discount constant d_r (cont.):

    d_r = 1 - µ (1 - r*/r) = 1 - n_1 (1 - r*/r) / [ n_1 - (k+1) n_{k+1} ]

    Dividing the numerator and denominator by n_1:

    d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

SLIDE 43

Katz Back-off Smoothing

  • Take the conditional probabilities of bigrams (m-gram, m = 2) as an example, with r = C[w_{i-1} w_i]:

    P_Katz(w_i | w_{i-1}) =
      C[w_{i-1} w_i] / C[w_{i-1}],          if r > k
      d_r · C[w_{i-1} w_i] / C[w_{i-1}],    if k ≥ r > 0
      α(w_{i-1}) · P_Katz(w_i),             if r = 0

  • 1. discount constant:

    d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

  • 2. normalizing constant:

    α(w_{i-1}) = [ 1 - Σ_{w_i: C[w_{i-1} w_i] > 0} P_Katz(w_i | w_{i-1}) ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_Katz(w_i)

SLIDE 44

Katz Back-off Smoothing: Example

  • A small vocabulary consists of only five words, i.e., V = {w_1, w_2, ..., w_5}. The frequency counts for word pairs starting with w_1 are:

    C[w_1, w_2] = 3, C[w_1, w_3] = 2, C[w_1, w_4] = 1, C[w_1, w_1] = C[w_1, w_5] = 0

    and the word frequency counts are:

    C[w_1] = 6, C[w_2] = 8, C[w_3] = 10, C[w_4] = 6, C[w_5] = 4

    Katz back-off smoothing with the Good-Turing estimate is used here for word pairs with frequency counts equal to or less than two (k = 2). Show the conditional probabilities of word bigrams starting with w_1, i.e.,

    P_Katz(w_1|w_1), P_Katz(w_2|w_1), ..., P_Katz(w_5|w_1) = ?

SLIDE 45

Katz Back-off Smoothing: Example

    Good-Turing adjusted counts: r* = (r+1) n_{r+1}/n_r, where n_r is the number of bigrams starting with w_1 that occur exactly r times (here n_1 = n_2 = n_3 = 1, n_4 = 0)

    Discount constants (k = 2): d_r = [ r*/r - (k+1) n_{k+1}/n_1 ] / [ 1 - (k+1) n_{k+1}/n_1 ]

    1* = 2 · n_2/n_1 = 2,  d_1 = (2/1 - 3)/(1 - 3) = 1/2
    2* = 3 · n_3/n_2 = 3,  d_2 = (3/2 - 3)/(1 - 3) = 3/4

    For r = 3 > k:  P_Katz(w_2|w_1) = C[w_1, w_2]/C[w_1] = 3/6 = 1/2
    For r = 2:      P_Katz(w_3|w_1) = d_2 · 2/6 = (3/4)(2/6) = 1/4
    For r = 1:      P_Katz(w_4|w_1) = d_1 · 1/6 = (1/2)(1/6) = 1/12

    For r = 0:  α(w_1) = [ 1 - (1/2 + 1/4 + 1/12) ] / [ P_Katz(w_1) + P_Katz(w_5) ] = (2/12) / (6/34 + 4/34) = (1/6)(34/10) = 17/30

      P_Katz(w_1|w_1) = α(w_1) · P_Katz(w_1) = (17/30)(6/34) = 1/10
      P_Katz(w_5|w_1) = α(w_1) · P_Katz(w_5) = (17/30)(4/34) = 1/15

    And P_Katz(w_1|w_1) + P_Katz(w_2|w_1) + ... + P_Katz(w_5|w_1) = 1/10 + 1/2 + 1/4 + 1/12 + 1/15 = 1

    Notice that P_Katz(w) = P_ML(w) = C[w]/34 here (the unigram probabilities are not smoothed in this example)
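A sketch that reproduces the numbers of this example; the count tables are exactly those given above, with k = 2 and n_r computed over the bigrams with history w_1, as noted earlier:

    bigram_counts = {("w1", "w2"): 3, ("w1", "w3"): 2, ("w1", "w4"): 1}   # pairs starting with w1
    unigram_counts = {"w1": 6, "w2": 8, "w3": 10, "w4": 6, "w5": 4}
    vocab = list(unigram_counts)
    N_uni = sum(unigram_counts.values())                                   # 34
    k = 2

    # n_r over bigrams with history w1
    n = {r: sum(1 for c in bigram_counts.values() if c == r) for r in range(1, k + 3)}

    def d(r):
        """Katz discount d_r = (r*/r - (k+1) n_{k+1}/n_1) / (1 - (k+1) n_{k+1}/n_1)."""
        r_star = (r + 1) * n[r + 1] / n[r]
        A = (k + 1) * n[k + 1] / n[1]
        return (r_star / r - A) / (1 - A)

    def p_katz(w, prev="w1"):
        r = bigram_counts.get((prev, w), 0)
        if r > k:
            return r / unigram_counts[prev]
        if r > 0:
            return d(r) * r / unigram_counts[prev]
        seen = [v for (h, v) in bigram_counts if h == prev]
        left = 1 - sum(p_katz(v, prev) for v in seen)                      # mass left for unseen words
        unseen_mass = sum(unigram_counts[v] / N_uni for v in vocab if v not in seen)
        return (left / unseen_mass) * (unigram_counts[w] / N_uni)

    print([round(p_katz(w), 4) for w in vocab])   # [0.1, 0.5, 0.25, 0.0833, 0.0667]
    print(sum(p_katz(w) for w in vocab))          # 1.0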

SLIDE 46

Kneser-Ney Back-off Smoothing

  • Absolute discounting, without the Good-Turing estimate

  • The lower-order n-gram (back-off n-gram) probability is made proportional not to the number of occurrences of a word, but instead to the number of different words that it follows, e.g.:

– In “San Francisco”, “Francisco” only follows a single history, so it should receive a low unigram probability
– In “US dollars”, “HK dollars”, “TW dollars”, etc., “dollars” follows many histories, so it should receive a high unigram probability

    C(US dollars) = 200, C(HK dollars) = 100, C(TW dollars) = 25, ...

SLIDE 47

Kneser-Ney Back-off Smoothing

  • Take the conditional probabilities of bigrams (m-gram, m = 2) as an example:

    P_KN(w_i | w_{i-1}) =
      max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}],   if C[w_{i-1} w_i] > 0
      α(w_{i-1}) · P_KN(w_i),                      otherwise

  • 1. the unigram back-off probability:

    P_KN(w_i) = C[• w_i] / Σ_{w_j} C[• w_j],
    where C[• w_i] is the number of unique words preceding w_i

  • 2. normalizing constant:

    α(w_{i-1}) = [ 1 - Σ_{w_i: C[w_{i-1} w_i] > 0} max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}] ] / Σ_{w_i: C[w_{i-1} w_i] = 0} P_KN(w_i)

    with the absolute discount 0 ≤ D ≤ 1

SLIDE 48

Kneser-Ney Back-off Smoothing: Example

  • Given a text sequence as the following:

    S A B C A A B B C S    (S is the sequence’s start/end mark)

    Show the corresponding Kneser-Ney unigram probabilities:

    C[•A] = 3, C[•B] = 2, C[•C] = 1, C[•S] = 1
    ⇒ P_KN(A) = 3/7, P_KN(B) = 2/7, P_KN(C) = 1/7, P_KN(S) = 1/7
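A short sketch that derives these continuation counts from the sequence itself:

    from collections import defaultdict

    seq = ["S", "A", "B", "C", "A", "A", "B", "B", "C", "S"]

    preceders = defaultdict(set)
    for prev, w in zip(seq, seq[1:]):
        preceders[w].add(prev)                                  # which distinct words precede w?

    continuation = {w: len(s) for w, s in preceders.items()}    # C[. w]
    total = sum(continuation.values())                          # 7

    for w in ["A", "B", "C", "S"]:
        print(w, continuation[w], continuation[w] / total)      # 3/7, 2/7, 1/7, 1/7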

SLIDE 49

Katz vs. Kneser-Ney Back-off Smoothing

  • Example 1: Wall Street Journal (WSJ), English

– A vocabulary of 60,000 words and a corpus of 260 million words (read speech) from a newspaper such as Wall Street Journal

SLIDE 50

Katz vs. Kneser-Ney Back-off Smoothing

  • Example 2: Broadcast News Speech, Mandarin

– A vocabulary of 72,000 words and a corpus of 170 million Chinese characters from the Central News Agency (CNA)
– Tested on Mandarin broadcast news speech collected in Taiwan, September 2002, about 3.7 hours
– The perplexities are high here, because the LM training materials are not speech transcripts but merely newswire texts

    Models               Perplexity    Character Error Rate (after tree-copy search, TC)
    Bigram Katz            959.56        16.81
    Bigram Kneser-Ney      942.34        18.17
    Trigram Katz           752.49        14.62
    Trigram Kneser-Ney     670.24        14.90

SLIDE 51

Interpolated Kneser-Ney Smoothing

  • Always combine both the higher-order and the lower-order LM probability distributions

  • Take the bigram (m-gram, m = 2) conditional probabilities as an example:

    P_IKN(w_i | w_{i-1}) = max( C[w_{i-1} w_i] - D, 0 ) / C[w_{i-1}] + λ(w_{i-1}) · C[• w_i] / Σ_w C[• w]

– where

  • C[• w_i]: the number of unique words that precede w_i
  • λ(w_{i-1}): a normalizing constant that makes the probabilities sum to 1,

    λ(w_{i-1}) = D · C[w_{i-1} •] / C[w_{i-1}],
    where C[w_{i-1} •] is the number of unique words that follow the history w_{i-1}

SLIDE 52

Interpolated Kneser-Ney Smoothing

  • The exact formulas for the interpolated Kneser-Ney smoothed trigram conditional probabilities:

    P_IKN(w_i | w_{i-2} w_{i-1}) = max( C[w_{i-2} w_{i-1} w_i] - D, 0 ) / C[w_{i-2} w_{i-1}] + λ(w_{i-2} w_{i-1}) · P_IKN(w_i | w_{i-1})

    P_IKN(w_i | w_{i-1}) = max( C[• w_{i-1} w_i] - D, 0 ) / Σ_w C[• w_{i-1} w] + λ(w_{i-1}) · P_IKN(w_i)

    P_IKN(w_i) = max( C[• w_i] - D, 0 ) / Σ_w C[• w] + λ(•) · 1/|V|

For the IKN bigram and unigram, the number of unique words that precede a given history is considered, instead of the frequency counts.
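A sketch of an interpolated Kneser-Ney bigram model in the spirit of the bigram form on the previous slide (the uniform 1/|V| back-off term is omitted, and D = 0.75 is just a common default, not a value from the slides):

    from collections import defaultdict

    def train_ikn_bigram(tokens, D=0.75):
        """Interpolated Kneser-Ney bigram:
        P(w | v) = max(C[v w] - D, 0)/C[v] + lam(v) * C[. w]/C[. .],
        lam(v) = D * (# distinct words following v) / C[v]."""
        bigram = defaultdict(int)
        unigram = defaultdict(int)      # counts v as a bigram history
        followers = defaultdict(set)
        preceders = defaultdict(set)
        for v, w in zip(tokens, tokens[1:]):
            bigram[(v, w)] += 1
            unigram[v] += 1
            followers[v].add(w)
            preceders[w].add(v)
        total_types = sum(len(s) for s in preceders.values())   # C[. .]

        def prob(w, v):
            cont = len(preceders[w]) / total_types              # continuation probability
            if unigram[v] == 0:
                return cont                                      # unseen history: back off entirely
            lam = D * len(followers[v]) / unigram[v]
            return max(bigram[(v, w)] - D, 0) / unigram[v] + lam * cont

        return prob

    p = train_ikn_bigram("S A B C A A B B C S".split())
    print(p("B", "A"), p("C", "A"))                    # seen and unseen bigrams both get mass
    print(sum(p(w, "A") for w in "S A B C".split()))   # sums to 1 over the vocabulary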

SLIDE 53

Back-off vs. Interpolation

  • When determining the probability of n-grams with

nonzero counts, interpolated models use information from lower-order distributions while back-off models do not

  • In both back-off and interpolated models, lower-order distributions are used in determining the probability of n-grams with zero counts

  • It is easy to create a back-off version of an interpolated algorithm by modifying the normalizing constant