

slide-1
SLIDE 1

Language Modeling

Hsin-min Wang

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 11
  • 2. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
  • 3. Joshua Goodman's (Microsoft Research) public presentation material
  • 4. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. ASSP, March 1987
  • 5. R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," ICASSP 1995
slide-2
SLIDE 2

2

Acoustic vs. Linguistic

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$

Acoustic pattern matching and knowledge about language are equally important in recognizing and understanding natural speech

– Lexical knowledge (vocabulary definition and word pronunciation) is required, as are the syntax and semantics of the language (the rules that determine what sequences of words are grammatically well formed and meaningful)
– In addition, knowledge of the pragmatics of language (the structure of extended discourse, and what people are likely to say in particular contexts) can be important to achieving the goal of spoken language understanding systems
slide-3
SLIDE 3

3

Language Modeling - Formal vs. Probabilistic

The formal language model – grammar and parsing

– The grammar is a formal specification of the permissible structures for the language
– The parsing technique is the method of analyzing the sentence to see if its structure is compliant with the grammar

The probabilistic (or stochastic) language model

– Stochastic language models take a probabilistic viewpoint of language modeling

  • The probabilistic relationship among a sequence of words can be derived and modeled from the corpora

– Avoid the need to create broad-coverage formal grammars
– Stochastic language models play a critical role in building a working spoken language system
– N-gram language models are the most widely used

slide-4
SLIDE 4

4

N-gram Language Models - Applications

N-gram language models are widely used in many application domains

– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition (OCR)
– Machine translation
– Document classification and routing
– Information retrieval

slide-5
SLIDE 5

5

N-gram Language Models

For a word sequence W, P(W) can be decomposed into a product of conditional probabilities:

– In reality, the probability $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is impossible to estimate reliably for even moderate values of i (data sparseness problem)
– A practical solution is to assume that $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ depends only on the several previous words

By the chain rule,

$$P(\mathbf{W}) = P(w_1, w_2, \ldots, w_m) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1}) = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

where $w_1, \ldots, w_{i-1}$ is the history of $w_i$. Truncating the history to the previous N-1 words,

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, w_{i-N+2}, \ldots, w_{i-1})$$

gives the N-gram language models.

slide-6
SLIDE 6

6

N-gram Language Models (cont.)

If the word depends on the previous two words, we have a trigram: $P(w_i \mid w_{i-2}, w_{i-1})$. Similarly, we can have a bigram, $P(w_i \mid w_{i-1})$, or a unigram, $P(w_i)$.

To calculate P(Mary loves that person):

– In trigram models, we would take
  P(Mary loves that person) = P(Mary|<s>) P(loves|<s>,Mary) P(that|Mary,loves) P(person|loves,that) P(</s>|that,person)
– In bigram models, we would take
  P(Mary loves that person) = P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that) P(</s>|person)
– In unigram models, we would take
  P(Mary loves that person) = P(Mary) P(loves) P(that) P(person)
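The decomposition above maps directly onto code. Below is a minimal sketch (the helper name and the probability values are made up for illustration) of scoring a sentence with a bigram model, using the same <s> and </s> markers as in the example.

```python
import math
from typing import Dict, List, Tuple

def bigram_sentence_logprob(words: List[str],
                            p_bigram: Dict[Tuple[str, str], float]) -> float:
    """Sum log P(w_i | w_{i-1}) over the sentence, padded with <s> and </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = p_bigram.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes the product (no smoothing yet)
        logp += math.log(p)
    return logp

# Illustrative numbers only, not estimates from any real corpus:
p = {("<s>", "Mary"): 0.2, ("Mary", "loves"): 0.3, ("loves", "that"): 0.1,
     ("that", "person"): 0.4, ("person", "</s>"): 0.5}
print(math.exp(bigram_sentence_logprob(["Mary", "loves", "that", "person"], p)))
```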

slide-7
SLIDE 7

7

N-gram Probability Estimation

The trigram can be estimated by observing from a text corpus the frequencies or counts of the word pair C(wi-2,wi-1) and the triplet C(wi-2,wi-1,wi) as follows:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

– This estimation is based on the maximum likelihood (ML) principle
  • This assignment of probabilities yields the trigram model that assigns the highest probability to the training data of all possible trigram models

The bigram can be estimated as

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

and the unigram can be estimated as

$$P(w_i) = \frac{C(w_i)}{\text{corpus size}}$$
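As a quick illustration of these count ratios, here is a minimal sketch (an assumed helper, not from the slides) that derives ML bigram estimates from a toy corpus padded with <s>/</s>:

```python
from collections import Counter
from typing import List

def ml_bigram_estimates(sentences: List[List[str]]):
    """Return P_ML(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}) from a toy corpus."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    return {(h, w): c / unigram[h] for (h, w), c in bigram.items()}

probs = ml_bigram_estimates([["John", "read", "her", "book"],
                             ["I", "read", "a", "different", "book"]])
print(probs[("read", "a")])   # C(read a) / C(read) = 1/2
```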

slide-8
SLIDE 8

8

Maximum Likelihood Estimation of N-gram Probability

Given a training corpus T and the language model Λ

$$\text{Corpus: } T = w^{(1)} w^{(2)} \cdots w^{(k)} \cdots w^{(L)}, \qquad \text{Vocabulary: } W = \{w_1, w_2, \ldots, w_V\}$$

(Example from a word-segmented Chinese corpus: "… 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …", i.e., "… President Chen Shui-bian visited New York, USA … President Chen Shui-bian stated in Panama …": P(總統 | 陳水扁) = P(President | Chen Shui-bian) = ?)

$$p(T \mid \Lambda) \cong \prod_{k} p\big(w^{(k)} \mid h^{(k)}\big) = \prod_{h,\, w_i} \lambda_{h w_i}^{\,N_{h w_i}}, \qquad \lambda_{h w_i} = p(w_i \mid h)$$

where N-grams with the same history $h$ are collected together, $N_{h w_i} = C[h\, w_i]$ is the count of $h w_i$ in $T$, and

$$\forall h \in T: \quad \sum_{w_i} \lambda_{h w_i} = 1, \qquad \lambda_{h w_i} \in [0, 1]$$

slide-9
SLIDE 9

9

Maximum Likelihood Estimation of N-gram Probability (cont.)

Taking the logarithm of $p(T \mid \Lambda)$, we have

$$\Phi(\Lambda) = \log p(T \mid \Lambda) = \sum_{h} \sum_{w_i} N_{h w_i} \log \lambda_{h w_i}$$

For any pair $(h, w_i)$, we try to maximize $\Phi(\Lambda)$ subject to $\sum_{w_i} \lambda_{h w_i} = 1,\ \forall h$. Using Lagrange multipliers $l_h$, define

$$\bar{\Phi}(\Lambda) = \Phi(\Lambda) + \sum_{h} l_h \Big(1 - \sum_{w_j} \lambda_{h w_j}\Big)$$

Setting the derivative to zero:

$$\frac{\partial \bar{\Phi}(\Lambda)}{\partial \lambda_{h w_i}} = \frac{N_{h w_i}}{\lambda_{h w_i}} - l_h = 0
\;\Rightarrow\; \lambda_{h w_i} = \frac{N_{h w_i}}{l_h}
\;\Rightarrow\; \sum_{w_j} \lambda_{h w_j} = \frac{\sum_{w_j} N_{h w_j}}{l_h} = 1
\;\Rightarrow\; l_h = \sum_{w_j} N_{h w_j}$$

$$\therefore\; \hat{\lambda}_{h w_i} = \frac{N_{h w_i}}{\sum_{w_j} N_{h w_j}} = \frac{C[h\, w_i]}{C[h]}$$

slide-10
SLIDE 10

10

A Simple Bigram Example

(Bigram count table from a small example corpus; see the smoothed continuation of this example on Slide 27.)

slide-11
SLIDE 11

11

Major Issues for N-gram LM

Evaluation

– How can you tell a good language model from a bad one?
– Run a speech recognizer or adopt other statistical measurements

Smoothing

– Deal with the data sparseness of real training data
– Various approaches have been proposed

Adaptation

– Dynamic adjustment of the language model parameters, such as the n-gram probabilities, vocabulary size, and the choice of words in the vocabulary
– E.g., P(table|the operating), P(system|the operating)

slide-12
SLIDE 12

12

How to Evaluate a Language Model?

Given two language models, how do we compare them?

Use them in a recognizer and find the one that leads to the lower recognition error rate

– The best way to evaluate a language model
– Expensive!

Use information theory (entropy and perplexity) to get an estimate of how good a language model might be

– Perplexity: the geometric mean of the number of words that can follow a history (or word) after the language model has been applied

slide-13
SLIDE 13

13

Entropy

The information derivable from outcome $x_i$ depends on its probability $P(x_i)$; the amount of information is defined as

$$I(x_i) = \log \frac{1}{P(x_i)}$$

The entropy H(X) of the random variable X is defined as the average amount of information:

$$H(X) = E[I(X)] = \sum_{i \in S} P(x_i)\, I(x_i) = \sum_{i \in S} P(x_i) \log \frac{1}{P(x_i)} = -E[\log P(x_i)]$$

– The entropy H(X) attains its maximum value when the random variable X has a uniform distribution, i.e., $P(x_i) = \frac{1}{N}\ \forall i$
– The entropy H(X) is nonnegative and becomes zero only if the probability function is deterministic, i.e., $P(x_i) = 1$ for some $x_i$
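A small numeric check of these two properties, as a sketch (the distributions are made up):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes -> log2(4) = 2.0 (maximum)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic -> 0.0
print(entropy([0.5, 0.25, 0.125, 0.125]))  # in between -> 1.75
```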

slide-14
SLIDE 14

14

Cross-Entropy

The entropy of a language is

$$H(\text{language}) = -\sum_i P(E_i) \log_2 P(E_i), \qquad \text{where } E_i \text{ is a language event}$$

It can be proved that

$$-\sum_i P(E_i) \log_2 P(E_i) \;\le\; -\sum_i P(E_i) \log_2 P(E_i \mid \text{Model})$$

i.e., the cross-entropy of a model with respect to the correct model is never below the true entropy; better models will have lower cross-entropy

The entropy of a language with a vocabulary size of |V| on a per-word basis is

$$H = -\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i)$$

– If every word is equally likely, $-\sum_{i=1}^{|V|} \frac{1}{|V|} \log_2 \frac{1}{|V|} = \log_2 |V| \;\ge\; H$ (the true entropy)

slide-15
SLIDE 15

15

Logprob

For a language with a vocabulary size of |V|, the cross-entropy of a model with respect to the correct model on a per-word basis is

$$H = -\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i \mid \text{Model})$$

Given a text corpus $\mathbf{W} = w_1, w_2, \ldots, w_N$, the cross-entropy of a model can be estimated by the logprob (LP), defined as

$$LP(\text{Model}) = -\frac{1}{N} \log_2 P(\mathbf{W} \mid \text{Model}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid \text{Model})$$

with $LP(\text{Model}) \ge H$ (the true entropy)

The goal is to find a model which has a logprob that is as close as possible to the true entropy

slide-16
SLIDE 16

16

Perplexity

The perplexity PP(W) is defined as the reciprocal of the geometric average probability assigned by the model to each word in the test set W

– This is a measure, related to cross-entropy, known as test-set perplexity

$$PP(\mathbf{W}) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$

Equivalently, perplexity is two to the power of the logprob:

$$PP(\mathbf{W}) = 2^{LP(\text{Model})} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \ldots w_{i-1})}
= \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1})^{-\frac{1}{N}}
= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
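Both forms above give the same number, which a short sketch can confirm (the per-word model probabilities here are invented):

```python
import math

def perplexity(word_probs):
    """PP = 2^(-1/N * sum log2 p_i) = (prod 1/p_i)^(1/N), given per-word model probabilities."""
    N = len(word_probs)
    lp = -sum(math.log2(p) for p in word_probs) / N      # logprob LP
    return 2.0 ** lp

probs = [0.2, 0.1, 0.25, 0.05]             # P(w_i | w_1..w_{i-1}) for a 4-word test text
print(perplexity(probs))
print(math.prod(1.0 / p for p in probs) ** (1.0 / len(probs)))  # same value, direct form
```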

slide-17
SLIDE 17

17

More about Perplexity

An LM that assigns equal probability to 100 words would have perplexity 100

Ask a speech recognizer to recognize digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity ≈ 10

Ask a speech recognizer to recognize names at a large institute (10,000 persons) – difficult – perplexity ≈ 10,000

$$\text{Entropy} = -\sum_{i=1}^{100} p(w_i) \log_2 p(w_i) = \sum_{i=1}^{100} \frac{1}{100} \log_2 100 = \log_2 100$$

$$\therefore\; \text{perplexity} = 2^{\log_2 100} = 100$$

slide-18
SLIDE 18

18

More about Perplexity (cont.)

Perplexity is an indication of the complexity of the language, if we have an accurate estimate of P(W)

A language with higher perplexity means that the number of words branching from a previous word is larger on average

A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities

$$PP(\text{language}) = 2^{H(\text{language})}$$

slide-19
SLIDE 19

19

More about Perplexity (cont.)

Training-set perplexity measures how well the language model fits the training data

Test-set perplexity evaluates the generalization capability of the language model

– When we say perplexity, we mean "test-set perplexity"

It is generally true that lower perplexity correlates with better recognition performance

– The perplexity is essentially a statistically weighted word branching measure on the test set
– The higher the perplexity, the more branches the speech recognizer needs to consider statistically

slide-20
SLIDE 20

20

Are Language Models with Lower Perplexity Better?

The true (optimal) model for the data has the lowest possible perplexity

The lower the perplexity, the closer we are to the true model

Typically, perplexity correlates well with speech recognition word error rate

– Correlates better when both models are trained on the same data
– Doesn't correlate well when the training data changes

The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128 to 176 (trigram)

The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20

slide-21
SLIDE 21

21

Perplexity - rule of thumb

A rough rule of thumb (by Rosenfeld)

– A reduction of 5% in perplexity is usually not practically significant
– A 10% to 20% reduction is noteworthy, and usually translates into some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline is quite significant

slide-22
SLIDE 22

22

Perplexity vs. Vocabulary Size

The perplexity of a bigram LM with different vocabulary sizes

– The perplexity generally increases with the vocabulary size
– There are generally more competing words for a given context when the vocabulary size becomes larger

slide-23
SLIDE 23

23

N-gram Smoothing

Maximum likelihood (ML) estimates of N-gram probabilities are computed as

– Trigram probabilities:
$$P_{ML}(z \mid xy) = \frac{C[xyz]}{\sum_{w} C[xyw]} = \frac{C[xyz]}{C[xy]}$$
– Bigram probabilities:
$$P_{ML}(y \mid x) = \frac{C[xy]}{\sum_{w} C[xw]} = \frac{C[xy]}{C[x]}$$

where $C[\cdot]$ denotes a count in the training data.

slide-24
SLIDE 24

24

Why N-gram Smoothing?

Data Sparseness of real training data

– If the training corpus is not large enough, many actually possible word successions may not be well observed, leading to many extremely small probabilities

  • E.g. bigram modeling

P(read|Mulan) = 0 ⇒ P(Mulan read a book) = 0 ⇒ P(W) = 0 ⇒ P(X|W)P(W) = 0

– In speech recognition, if P(W) is zero, the string W will never be considered as a possible transcription, regardless of how unambiguous the acoustic signal is → an error will be made
– Assigning all strings a nonzero probability helps prevent errors in speech recognition → smoothing

slide-25
SLIDE 25

25

Why N-gram Smoothing? (cont.)

Smoothing techniques adjust the maximum likelihood estimate of probabilities to produce more robust probabilities for unseen data, although the likelihood for the training data may be hurt slightly

– Tend to make distributions flatter by adjusting low probabilities such as zero probabilities upward, and high probabilities downward

slide-26
SLIDE 26

26

Simple Smoothing Methods

Add-one smoothing

– Pretend each trigram occurs once more than it actually does:

$$P_{smooth}(z \mid xy) = \frac{C[xyz] + 1}{\sum_{w}\big(C[xyw] + 1\big)} = \frac{C[xyz] + 1}{C[xy] + V}, \qquad
P_{smooth}(z \mid y) = \frac{C[yz] + 1}{C[y] + V}$$

where V is the size of the vocabulary

Add-delta smoothing

$$P_{smooth}(z \mid xy) = \frac{C[xyz] + \delta}{C[xy] + \delta V}, \qquad
P_{smooth}(z \mid y) = \frac{C[yz] + \delta}{C[y] + \delta V}$$

Both work badly. DO NOT DO THESE TWO (Joshua Goodman said)
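For completeness, a sketch of add-delta for bigrams (add-one when delta = 1), matching the formula above; as noted, this is shown only as a baseline, not something to use:

```python
from collections import Counter

def add_delta_bigram(sentences, vocab, delta=1.0):
    """P_smooth(z | y) = (C[yz] + delta) / (C[y] + delta * V); vocab includes <s> and </s>."""
    V = len(vocab)
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))

    def prob(y, z):
        return (bigram[(y, z)] + delta) / (unigram[y] + delta * V)
    return prob
```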

slide-27
SLIDE 27

27

A Simple Bigram Example (cont.)


V=11 {John, read, her, book, I, a, different, by, Mulan, <s>, </s>}

With add-one smoothing, a sentence whose bigrams were all seen in training gets, for example, 3/14 × 3/13 × 3/14 × 2/13 × 3/14 ≈ 0.00035

P(Mulan read a book)
= P(Mulan|<s>) P(read|Mulan) P(a|read) P(book|a) P(</s>|book)
= (0+1)/(3+11) × (0+1)/(1+11) × 3/14 × 2/13 × 3/14
≈ 0.000042

slide-28
SLIDE 28

28

General Backoff Smoothing

The general form for n-gram backoff

– If an n-gram has a nonzero count, we use the (discounted) distribution $\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$
– Otherwise, we back off to the lower-order n-gram distribution $P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$
  • The scaling factor $\gamma(w_{i-n+1}, \ldots, w_{i-1})$ is chosen to make the conditional probabilities sum to 1, i.e., $\sum_{w_i} P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = 1$

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
\begin{cases}
\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), & \text{if } C[w_{i-n+1}, \ldots, w_i] > 0 \\[4pt]
\gamma(w_{i-n+1}, \ldots, w_{i-1})\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}), & \text{if } C[w_{i-n+1}, \ldots, w_i] = 0
\end{cases}$$

where $\alpha(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) < P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$, i.e., the seen n-grams are discounted.

For a trigram, for example, the two cases are $\alpha(w_i \mid w_{i-2}, w_{i-1})$ and $\gamma(w_{i-2}, w_{i-1})\, P_{smooth}(w_i \mid w_{i-1})$.

slide-29
SLIDE 29

29

Interpolated Smoothing

The general form for interpolated n-gram smoothing is given below

The key difference between backoff and interpolated models:

– For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not

In both models, lower-order distributions are used in determining the probability of n-grams with zero counts

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\, P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + \big(1 - \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\big)\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$$

where

$$P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C[w_{i-n+1}, \ldots, w_i]}{C[w_{i-n+1}, \ldots, w_{i-1}]} \quad \text{(a ratio of counts)}$$

For a trigram model, for example, $P_{smooth}(w_i \mid w_{i-2}, w_{i-1})$ interpolates with $P_{smooth}(w_i \mid w_{i-1})$, which in turn interpolates with the unigram level, and the recursion ends with $P_{smooth}(w_i) = \lambda\, P_{ML}(w_i) + (1 - \lambda)\, \frac{1}{N}$.
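A sketch of this interpolation recursion for a bigram model that backs off to a unigram and then to a uniform floor (the λ values are fixed constants purely for illustration; in practice they would be tuned, e.g., on held-out data, and using a uniform floor over the vocabulary is my reading of the 1/N term above):

```python
from collections import Counter

def interpolated_bigram(sentences, vocab_size, lam_bi=0.7, lam_uni=0.8):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    N = sum(unigram.values())

    def p_uni(w):
        # ML unigram interpolated with a uniform floor over the vocabulary (assumption)
        return lam_uni * unigram[w] / N + (1 - lam_uni) / vocab_size

    def p_bi(h, w):
        # ML bigram interpolated with the smoothed unigram
        p_ml = bigram[(h, w)] / unigram[h] if unigram[h] else 0.0
        return lam_bi * p_ml + (1 - lam_bi) * p_uni(w)
    return p_bi
```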

slide-30
SLIDE 30

30

Why Backoff Smoothing ?

Backoff smoothing is attractive because it is easy to implement for practical speech recognition systems

– The Katz backoff model, based on the Good-Turing smoothing principle, is widely used

slide-31
SLIDE 31

31

Good-Turing Estimate

First published by Good (1953), with Turing acknowledged

A smoothing technique to deal with infrequent m-grams (m-gram smoothing); it usually needs to be used together with other backoff schemes to achieve good performance

How many words were seen once? Use that to estimate how many are unseen. All other estimates are adjusted (down) to give probability mass to the unseen events

Use the notation m-grams instead of n-grams here

slide-32
SLIDE 32

32

Good-Turing Estimate (cont.)

The Good-Turing estimate states that for any m-gram $a = w_1^m$ that occurs r times ($r = C[w_1^m]$), we should pretend it occurs $r^*$ times ($r^* = C^*[w_1^m]$), as follows:

$$r^* = (r+1)\, \frac{n_{r+1}}{n_r}, \qquad \text{where } n_r \text{ is the number of m-grams that occur exactly } r \text{ times in the training data}$$

– The probability estimate for an m-gram $a = w_1^m$ with r counts is

$$P_{GT}(a) = \frac{r^*}{N}, \qquad \text{where } N = \sum_{r=0}^{\infty} n_r\, r^* = \sum_{r=0}^{\infty} (r+1)\, n_{r+1} = \sum_{r=1}^{\infty} r\, n_r$$

(Not a conditional probability!) N is equal to the original number of counts in the training data.

slide-33
SLIDE 33

33

Good-Turing Estimate (cont.)

It follows from the above that the total probability estimate for the set of m-grams that actually occur in the training data is

$$\sum_{w_1^m:\; C[w_1^m] > 0} P_{GT}(w_1^m) = \frac{1}{N} \sum_{r=1}^{\infty} n_r\, r^* = \frac{1}{N} \sum_{r=1}^{\infty} (r+1)\, n_{r+1} = \frac{N - n_1}{N} = 1 - \frac{n_1}{N}$$

The probability of observing the previously unseen m-grams is therefore

$$\sum_{w_1^m:\; C[w_1^m] = 0} P_{GT}(w_1^m) = \frac{n_1}{N}$$

– which is just the fraction of singletons (m-grams occurring only once) in the training data

(The same result follows from $r^*\, n_r = (r+1)\, n_{r+1}$ with $r = 0$: the adjusted count mass for the unseen m-grams is $n_0 \cdot 0^* = n_1$.)

slide-34
SLIDE 34

34

Good-Turing Estimate: An Example

Imagine you are fishing. You have caught 10 carp (鯉魚), 3 cod (鱈魚), 2 tuna (鮪魚), 1 trout (鱒魚), 1 salmon (鮭魚), 1 eel (鰻魚), so N = 18

How likely is it that the next species is new?

– P0 = n1/N = 3/18 = 1/6

How likely is eel? (compute 1*)

– n1 = 3, n2 = 1
– 1* = 2 × 1/3 = 2/3
– P(eel) = 1*/N = (2/3)/18 = 1/27 (P(trout) = P(salmon) = 1/27)

How likely is tuna? (compute 2*)

– n2 = 1, n3 = 1
– 2* = 3 × 1/1 = 3
– P(tuna) = 2*/N = 3/18 = 1/6

But how likely is cod? (compute 3*)

– n4 = 0: need smoothing for n4 in advance

$$P_0 = \sum_{w_1^m:\; C[w_1^m] = 0} P_{GT}(w_1^m) = \frac{n_1}{N}, \qquad r^* = (r+1)\, \frac{n_{r+1}}{n_r}$$
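The fishing numbers can be reproduced mechanically; a sketch with the species counts hard-coded from the slide:

```python
from collections import Counter

catch = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())          # 18 fish in total
n = Counter(catch.values())      # n_r: how many species were seen exactly r times

def r_star(r):
    """Good-Turing adjusted count r* = (r+1) * n_{r+1} / n_r (undefined if n_{r+1} = 0)."""
    return (r + 1) * n[r + 1] / n[r]

print(n[1] / N)        # P(next species is new) = n_1/N = 3/18
print(r_star(1) / N)   # P(eel)  = (2 * 1/3) / 18 = 1/27
print(r_star(2) / N)   # P(tuna) = (3 * 1/1) / 18 = 1/6
print(n[4])            # 0 -> r* for cod (r = 3) needs smoothed n_r values first
```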

slide-35
SLIDE 35

35

Good-Turing Estimate (cont.)

The Good-Turing estimate may yield some problems when nr+1=0

– An alternative strategy is to apply Good-Turing to the m-grams (events) seen at most k times, where k is a parameter chosen so that nr+1 ≠0, r=1,…,k

slide-36
SLIDE 36

36

Good-Turing Estimate (cont.)

In the Good-Turing estimate, it may happen that an m-gram (event) occurring k times takes a higher probability than an event occurring k+1 times

– The value of k may be selected in an attempt to overcome such a drawback
– Experimentally, k ranging from 4 to 8 will not allow the above condition to be true (for r ≤ k)

$$\hat{P}_{GT}(a_k) = \frac{(k+1)\, n_{k+1}}{n_k\, N}, \qquad \hat{P}_{GT}(a_{k+1}) = \frac{(k+2)\, n_{k+2}}{n_{k+1}\, N}$$

It is possible that $(k+1)\, n_{k+1} \cdot n_{k+1} > (k+2)\, n_{k+2} \cdot n_k$, i.e., $k^* > (k+1)^*$. Requiring $\hat{P}_{GT}(a_k) < \hat{P}_{GT}(a_{k+1})$ amounts to

$$(k+1)\, n_{k+1}^{\,2} < (k+2)\, n_k\, n_{k+2}$$

slide-37
SLIDE 37

37

Katz Backoff Smoothing

Katz smoothing extends the intuitions of the Good-Turing estimate by adding the combination of higher-order language models with lower-order models

– E.g., bigram and unigram language models

Large counts are taken to be reliable, so they are not discounted

– E.g., r*=r for all r > k for some k, say k in the range of 5-8

The discount ratios for the lower counts r ≤ k are derived from the Good-Turing estimate

slide-38
SLIDE 38

38

Katz Backoff Smoothing (cont.)

Take the bigram (m-gram, m=2) counts as an example:

$$C^*[w_{i-1} w_i] =
\begin{cases}
r, & \text{if } r > k \\
d_r\, r, & \text{if } k \ge r > 0 \\
\beta(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}
\qquad\quad r = C[w_{i-1} w_i]$$

  • 1. The Good-Turing estimate predicts that the total count mass to be assigned to the bigrams with zero counts is $n_0 \cdot \frac{n_1}{n_0} = n_1$
  • 2. The discount ratio $d_r$ must satisfy two constraints: the discount is proportional to the discount predicted by the Good-Turing estimate,
$$1 - d_r = \mu \left(1 - \frac{r^*}{r}\right),$$
and the total number of counts removed from the seen bigrams equals the mass assigned to the unseen ones,
$$\sum_{r=1}^{k} n_r\, (1 - d_r)\, r = n_1$$
  • 3. The value $\beta(w_{i-1})$ is chosen to equalize the total number of counts in the distribution, i.e.,
$$\sum_{w_i} C^*[w_{i-1} w_i] = \sum_{w_i} C[w_{i-1} w_i]$$

slide-39
SLIDE 39

39

Katz Backoff Smoothing - Derivation of the Discount Ratio

Given the two constraints

$$1 - d_r = \mu \left(1 - \frac{r^*}{r}\right) \quad (1) \qquad\qquad \sum_{r=1}^{k} n_r\, (1 - d_r)\, r = n_1 \quad (2)$$

first note that, since $n_r\, r^* = (r+1)\, n_{r+1}$,

$$\sum_{r=1}^{k} (r - r^*)\, n_r = \sum_{r=1}^{k} r\, n_r - \sum_{r=1}^{k} (r+1)\, n_{r+1} = \sum_{r=1}^{k} r\, n_r - \sum_{r=2}^{k+1} r\, n_r = n_1 - (k+1)\, n_{k+1}$$

Substituting (1) into (2),

$$\mu \sum_{r=1}^{k} n_r\, r \left(1 - \frac{r^*}{r}\right) = \mu \sum_{r=1}^{k} (r - r^*)\, n_r = \mu \big(n_1 - (k+1)\, n_{k+1}\big) = n_1
\;\Rightarrow\; \mu = \frac{n_1}{n_1 - (k+1)\, n_{k+1}} \quad (3)$$

slide-40
SLIDE 40

40

Katz Backoff Smoothing - Derivation of the Discount Ratio (cont.)

Combining (1) and (3),

$$d_r = 1 - \mu \left(1 - \frac{r^*}{r}\right) = 1 - \frac{1 - \frac{r^*}{r}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}} = \frac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$$

(putting both terms over the common denominator, the constant terms in the numerator cancel)

slide-41
SLIDE 41

41

Katz Backoff Smoothing - Derivation of the Normalizing Constant

  • 3. The value $\beta(w_{i-1})$ is chosen to equalize the total number of counts in the distribution, i.e., $\sum_{w_i} C^*[w_{i-1} w_i] = \sum_{w_i} C[w_{i-1} w_i]$

The appropriate value of $\beta(w_{i-1})$ is computed so that the smoothed bigram satisfies this probability constraint:

$$\sum_{w_i:\, C[w_{i-1} w_i] = 0} C^*[w_{i-1} w_i]
\;=\; \beta(w_{i-1}) \sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)
\;=\; \sum_{w_i} C[w_{i-1} w_i] \;-\; \sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]$$

$$\Rightarrow\quad \beta(w_{i-1}) = \frac{\sum_{w_i} C[w_{i-1} w_i] - \sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)}$$

slide-42
SLIDE 42

42

Katz Backoff Smoothing - Derivation of the Normalizing Constant

Dividing the smoothed counts by $C[w_{i-1}] = \sum_{w_i} C[w_{i-1} w_i]$ turns them into conditional probabilities,

$$P^*(w_i \mid w_{i-1}) = \frac{C^*[w_{i-1} w_i]}{\sum_{w_i} C[w_{i-1} w_i]}
\qquad\text{with}\qquad
C^*[w_{i-1} w_i] =
\begin{cases}
r, & \text{if } r > k \\
d_r\, r, & \text{if } k \ge r > 0 \\
\beta(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}$$

so that

$$P^*(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } r > k \\[6pt]
d_r\, \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } k \ge r > 0 \\[6pt]
\alpha(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}$$

where the normalizing constant is

$$\alpha(w_{i-1}) = \frac{\beta(w_{i-1})}{C[w_{i-1}]}
= \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P^*(w_i \mid w_{i-1})}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P(w_i)}
= \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P^*(w_i \mid w_{i-1})}{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P(w_i)}$$

slide-43
SLIDE 43

43

Katz Backoff Smoothing

Take the conditional probabilities of bigrams (m-gram, m=2) as an example:

$$P_{Katz}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } r > k \\[6pt]
d_r\, \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]}, & \text{if } k \ge r > 0 \\[6pt]
\alpha(w_{i-1})\, P(w_i), & \text{if } r = 0
\end{cases}
\qquad\quad r = C[w_{i-1} w_i]$$

  • 1. The discount ratio:
$$d_r = \frac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$$
  • 2. The normalizing constant:
$$\alpha(w_{i-1}) = \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P_{Katz}(w_i \mid w_{i-1})}{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P(w_i)}$$
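Putting the three pieces together, here is a minimal bigram sketch (the helper names and the fallback used when the count-of-count statistics n_r are unusable are my own assumptions, not part of Katz's recipe):

```python
from collections import Counter

def katz_bigram(sentences, k=5):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigram.update(toks)
        bigram.update(zip(toks, toks[1:]))
    N = sum(unigram.values())
    p_uni = {w: c / N for w, c in unigram.items()}   # backoff unigram P(w)
    n = Counter(bigram.values())                     # n_r over bigram types

    def d(r):
        """Discount ratio d_r for 0 < r <= k; counts above k are not discounted."""
        if r > k or not (n[r] and n[r + 1] and n[1]):
            return 1.0                               # assumption: skip discounting if n_r stats are unusable
        r_star = (r + 1) * n[r + 1] / n[r]
        A = (k + 1) * n[k + 1] / n[1]
        if A >= 1:
            return 1.0
        return (r_star / r - A) / (1 - A)

    def prob(h, w):
        r = bigram[(h, w)]
        if r > 0:
            return d(r) * r / unigram[h]
        seen = {v for (x, v) in bigram if x == h}    # words observed after history h
        alpha = (1 - sum(prob(h, v) for v in seen)) / (1 - sum(p_uni[v] for v in seen))
        return alpha * p_uni.get(w, 0.0)
    return prob
```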
slide-44
SLIDE 44

44

Katz Backoff Smoothing: An Example

A small vocabulary consists of only five words, i.e., $V = \{w_1, w_2, \ldots, w_5\}$. The frequency counts for word pairs starting with $w_1$ are

$$C[w_1, w_2] = 3, \quad C[w_1, w_3] = 2, \quad C[w_1, w_4] = 1, \quad C[w_1, w_1] = C[w_1, w_5] = 0$$

and the word frequency counts are

$$C[w_1] = 6, \quad C[w_2] = 8, \quad C[w_3] = 10, \quad C[w_4] = 6, \quad C[w_5] = 4$$

Katz backoff smoothing with the Good-Turing estimate is used here for word pairs with frequency counts equal to or less than two. Show the conditional probabilities of word bigrams starting with $w_1$, i.e., $P_{Katz}(w_1 \mid w_1),\ P_{Katz}(w_2 \mid w_1),\ \ldots,\ P_{Katz}(w_5 \mid w_1) = ?$

slide-45
SLIDE 45

45

Katz Backoff Smoothing: An Example (cont.)

Here k = 2. Among the bigrams starting with $w_1$: $n_1 = 1$, $n_2 = 1$, $n_3 = 1$, so

$$1^* = 2 \cdot \frac{n_2}{n_1} = 2, \qquad 2^* = 3 \cdot \frac{n_3}{n_2} = 3$$

Discount ratios, using $d_r = \dfrac{\frac{r^*}{r} - \frac{(k+1)\, n_{k+1}}{n_1}}{1 - \frac{(k+1)\, n_{k+1}}{n_1}}$ with $\frac{(k+1)\, n_{k+1}}{n_1} = \frac{3 \cdot 1}{1} = 3$:

$$d_1 = \frac{\frac{1^*}{1} - 3}{1 - 3} = \frac{2 - 3}{-2} = \frac{1}{2}, \qquad
d_2 = \frac{\frac{2^*}{2} - 3}{1 - 3} = \frac{\frac{3}{2} - 3}{-2} = \frac{3}{4}$$

Conditional probabilities (notice that $P_{Katz}(w_i) = P_{ML}(w_i) = C[w_i]/34$ here):

– For r = 3 > k: $P_{Katz}(w_2 \mid w_1) = \frac{C[w_1, w_2]}{C[w_1]} = \frac{3}{6} = \frac{1}{2}$
– For r = 2: $P_{Katz}(w_3 \mid w_1) = d_2 \cdot \frac{2}{6} = \frac{3}{4} \cdot \frac{2}{6} = \frac{1}{4}$
– For r = 1: $P_{Katz}(w_4 \mid w_1) = d_1 \cdot \frac{1}{6} = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$
– The normalizing constant:
$$\alpha(w_1) = \frac{1 - \frac{1}{2} - \frac{1}{4} - \frac{1}{12}}{\frac{6}{34} + \frac{4}{34}} = \frac{\frac{2}{12}}{\frac{10}{34}} = \frac{17}{30}$$
– For r = 0: $P_{Katz}(w_1 \mid w_1) = \alpha(w_1)\, P(w_1) = \frac{17}{30} \cdot \frac{6}{34} = \frac{1}{10}$ and $P_{Katz}(w_5 \mid w_1) = \alpha(w_1)\, P(w_5) = \frac{17}{30} \cdot \frac{4}{34} = \frac{1}{15}$

And $P_{Katz}(w_1 \mid w_1) + P_{Katz}(w_2 \mid w_1) + \cdots + P_{Katz}(w_5 \mid w_1) = \frac{1}{10} + \frac{1}{2} + \frac{1}{4} + \frac{1}{12} + \frac{1}{15} = 1$

slide-46
SLIDE 46

46

Absolute Discounting

Absolute discounting involves subtracting a fixed discount $D \le 1$ from each nonzero count.

If we express absolute discounting in terms of interpolated models, we have

$$P_{abs}(w_i \mid w_{i-N+1} \ldots w_{i-1}) = \frac{\max\{C[w_{i-N+1} \ldots w_i] - D,\, 0\}}{\sum_{w_i} C[w_{i-N+1} \ldots w_i]} + \big(1 - \lambda_{w_{i-N+1} \ldots w_{i-1}}\big)\, P_{abs}(w_i \mid w_{i-N+2} \ldots w_{i-1})$$

To make this distribution sum to 1, we normalize it to determine $1 - \lambda_{w_{i-N+1} \ldots w_{i-1}}$.

slide-47
SLIDE 47

47

The Problem of Using Lower-order N-grams for Smoothing

Absolute discounting and Katz smoothing assign a relatively high probability P(Francisco|w) to an unobserved event "w Francisco", because C(Francisco) is high and the unigram probability P(Francisco) is high

– But "Francisco" follows only a single history "San"

Also consider the probability P(dollars|w)

– An unobserved event "w dollars" will receive a high P(dollars|w) since P(dollars) is high
– "dollars" usually follows a country name or a number, e.g., "US dollars", "TW dollars", "two dollars", etc.

What about applying a backoff distribution that is proportional not to the number of occurrences of a word but to the number of different words that it follows?

slide-48
SLIDE 48

48

Kneser-Ney Backoff Smoothing

Take the conditional probabilities of bigrams (m-gram, m=2) as an example:

$$P_{KN}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}, & \text{if } C[w_{i-1} w_i] > 0 \\[8pt]
\alpha(w_{i-1})\, P_{KN}(w_i), & \text{otherwise}
\end{cases}$$

where the backoff unigram is based on continuation counts:

$$P_{KN}(w_i) = \frac{C[\bullet\, w_i]}{\sum_{w_j} C[\bullet\, w_j]}, \qquad C[\bullet\, w_i]: \text{the number of unique words preceding } w_i$$

  • 1. $0 \le D \le 1$
  • 2. The normalizing constant:
$$\alpha(w_{i-1}) = \frac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{KN}(w_i)}$$

chosen so that

$$\sum_{w_i:\, C[w_{i-1} w_i] > 0} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]} + \alpha(w_{i-1}) \sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{KN}(w_i) = 1$$

slide-49
SLIDE 49

49

Kneser-Ney Backoff Smoothing: An Example

Given a text sequence as follows:

S A B C A A B B C S (S is the sequence start/end marker)

Show the corresponding Kneser-Ney unigram probabilities:

$$C[\bullet\, A] = 3, \quad C[\bullet\, B] = 2, \quad C[\bullet\, C] = 1, \quad C[\bullet\, S] = 1
\;\Rightarrow\;
P_{KN}(A) = \frac{3}{7}, \quad P_{KN}(B) = \frac{2}{7}, \quad P_{KN}(C) = \frac{1}{7}, \quad P_{KN}(S) = \frac{1}{7}$$
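The continuation counts in this example are easy to verify in code; a sketch with the sequence taken from the slide:

```python
from collections import defaultdict

seq = list("SABCAABBCS")                    # S A B C A A B B C S
preceders = defaultdict(set)
for prev, cur in zip(seq, seq[1:]):
    preceders[cur].add(prev)                # unique words preceding each word

total_types = len({(p, c) for p, c in zip(seq, seq[1:])})   # distinct bigram types = 7
for w in "ABCS":
    print(w, len(preceders[w]), "/", total_types)           # A: 3/7, B: 2/7, C: 1/7, S: 1/7
```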

slide-50
SLIDE 50

50

Katz vs. Kneser-Ney Backoff Smoothing

Example 1: Wall Street Journal (WSJ), English

– A vocabulary of 60,000 words and a corpus of 260 million words (read speech) from a newspaper such as Wall Street Journal

The perplexity rose to 378 when tested on the personal information management domain

slide-51
SLIDE 51

51

Katz vs. Kneser-Ney Backoff Smoothing

Example 2: Broadcast News Speech, Mandarin

– A vocabulary of 72,000 words and a corpus of 170 million Chinese characters from the Central News Agency (CNA) – Tested on Mandarin broadcast news speech collected in Taiwan, September 2002, about 3.7 hours – The perplexities are high here, because the LM training materials are not speech transcripts but merely newswire texts

Models               Perplexity   Character Error Rate (after tree-copy search, TC)
Bigram Katz          959.56       16.81
Bigram Kneser-Ney    942.34       18.17
Trigram Katz         752.49       14.62
Trigram Kneser-Ney   670.24       14.90

(after Berlin Chen)

slide-52
SLIDE 52

52

Interpolated Kneser-Ney Smoothing

Always combine both the higher-order and the lower-order LM probability distributions

Take the bigram (m-gram, m=2) conditional probabilities as an example:

$$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]} + \lambda(w_{i-1})\, \frac{C[\bullet\, w_i]}{\sum_{w} C[\bullet\, w]}$$

– Where
  • $C[\bullet\, w_i]$: the number of unique words preceding $w_i$
  • $\lambda(w_{i-1})$: a normalizing constant that makes the probabilities sum to 1,
$$\lambda(w_{i-1}) = \frac{D}{C[w_{i-1}]}\, C[w_{i-1}\, \bullet], \qquad C[w_{i-1}\, \bullet]: \text{the number of unique words that follow the history } w_{i-1}$$

slide-53
SLIDE 53

53

Interpolated Kneser-Ney Smoothing (cont.)

The interpolated bigram above indeed sums to 1 over $w_i$:

$$\sum_{w_i} P_{IKN}(w_i \mid w_{i-1})
= \sum_{w_i} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}
+ \lambda(w_{i-1}) \sum_{w_i} \frac{C[\bullet\, w_i]}{\sum_{w} C[\bullet\, w]}$$

The second sum equals 1. For the first sum, there are $C[w_{i-1}\, \bullet]$ word types $w_i$ with $C[w_{i-1} w_i] > 0$, so

$$\sum_{w_i} \frac{\max\{C[w_{i-1} w_i] - D,\, 0\}}{C[w_{i-1}]}
= \frac{\sum_{w_i:\, C[w_{i-1} w_i] > 0} C[w_{i-1} w_i] - D\, C[w_{i-1}\, \bullet]}{C[w_{i-1}]}
= 1 - \frac{D}{C[w_{i-1}]}\, C[w_{i-1}\, \bullet]
= 1 - \lambda(w_{i-1})$$

Hence $\sum_{w_i} P_{IKN}(w_i \mid w_{i-1}) = \big(1 - \lambda(w_{i-1})\big) + \lambda(w_{i-1}) = 1$.
slide-54
SLIDE 54

54

Interpolated Kneser-Ney Smoothing

The exact formula for interpolated Kneser-Ney smoothed trigram conditional probabilities

$$P_{IKN}(w_i \mid w_{i-2} w_{i-1}) = \frac{\max\{C[w_{i-2} w_{i-1} w_i] - D,\, 0\}}{\sum_{w_i} C[w_{i-2} w_{i-1} w_i]} + \lambda(w_{i-2} w_{i-1})\, P_{IKN}(w_i \mid w_{i-1})$$

$$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max\{C[\bullet\, w_{i-1} w_i] - D,\, 0\}}{\sum_{w_i} C[\bullet\, w_{i-1} w_i]} + \lambda(w_{i-1})\, P_{IKN}(w_i)$$

$$P_{IKN}(w_i) = \frac{\max\{C[\bullet\, w_i] - D,\, 0\}}{\sum_{w_i} C[\bullet\, w_i]} + \lambda\, \frac{1}{|V|}$$

For the IKN bigram and unigram, the number of unique words that precede a given word is considered, instead of the frequency counts.
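A sketch of the interpolated Kneser-Ney bigram corresponding to the formulas above (D is fixed at 0.75 purely for illustration; the slides only require 0 ≤ D ≤ 1, and the handling of unseen histories is my own assumption):

```python
from collections import Counter, defaultdict

def ikn_bigram(sentences, D=0.75):
    bigram = Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        bigram.update(zip(toks, toks[1:]))

    hist_total = Counter()            # C[w_{i-1}] = total count of the history
    followers = defaultdict(set)      # unique words that follow each history
    preceders = defaultdict(set)      # unique words that precede each word
    for (h, w), c in bigram.items():
        hist_total[h] += c
        followers[h].add(w)
        preceders[w].add(h)
    n_types = len(bigram)             # total number of distinct bigram types

    def p_cont(w):                    # continuation unigram C[. w] / sum_w' C[. w']
        return len(preceders[w]) / n_types

    def prob(h, w):
        if hist_total[h] == 0:        # unseen history: fall back to the continuation unigram
            return p_cont(w)
        lam = D * len(followers[h]) / hist_total[h]
        return max(bigram[(h, w)] - D, 0) / hist_total[h] + lam * p_cont(w)
    return prob
```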
slide-55
SLIDE 55

55

Backoff vs. Interpolation

When determining the probability of n-grams with nonzero counts, interpolated models use information from lower-order distributions while backoff models do not

In both backoff and interpolated models, lower-order distributions are used in determining the probability of n-grams with zero counts

It is easy to create a backoff version of an interpolated algorithm by modifying the normalizing constant

slide-56
SLIDE 56

56

Class N-grams

Define classes for words that exhibit similar semantic or grammatical behavior

This is another effective way to handle the data sparseness problem

WEEKDAY = Sunday, Monday, Tuesday, …
MONTH = January, February, April, May, June, …
EVENT = meeting, class, party, …

– Consider P(Tuesday|party on), P(Monday|party on), …, P(Saturday|party on)

A word may belong to more than one class and a class may contain more than one word (many-to-many mapping)

slide-57
SLIDE 57

57

Class N-grams (cont.)

The n-gram model can be computed based on the previous n-1 classes (assuming a word can be uniquely mapped to only one class):

$$P(w_i \mid c_{i-n+1} \ldots c_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-n+1} \ldots c_{i-1})$$

In general, we can express the class trigram as

$$P(\mathbf{W}) = \sum_{c_1 \ldots c_n} \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})$$

If the classes are nonoverlapping, i.e., a word may belong to only one class, this reduces to

$$P(\mathbf{W}) = \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})$$

Similarly, for the class bigram,

$$P(\mathbf{W}) = \sum_{c_1 \ldots c_n} \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1})
\qquad\text{or}\qquad
P(\mathbf{W}) = \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1}) \quad \text{(nonoverlapping classes)}$$

slide-58
SLIDE 58

58

Class N-grams (cont.)

The bigram probability of a word given the prior word can be estimated as (with $w_{i-1}$ and $c_{i-1}$ in one-to-one mapping)

$$P(w_i \mid w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1}) = \frac{C[w_i]}{C[c_i]} \cdot \frac{C[c_{i-1}, c_i]}{C[c_{i-1}]}$$

since $P(w_i \mid w_{i-1}) = P(w_i \mid c_{i-1}) = \sum_j P(w_i, c_j \mid c_{i-1}) = P(w_i, c_i \mid c_{i-1}) = P(w_i \mid c_i, c_{i-1})\, P(c_i \mid c_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$

The trigram probability of a word given the two prior words can be estimated as

$$P(w_i \mid w_{i-2}, w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1}) = \frac{C[w_i]}{C[c_i]} \cdot \frac{C[c_{i-2}, c_{i-1}, c_i]}{C[c_{i-2}, c_{i-1}]}$$
slide-59
SLIDE 59

59

Class N-grams (cont.)

Clustering is another way to handle the data sparseness problem (smoothing of the language model)

For general-purpose large vocabulary dictation applications, class-based n-grams have not significantly improved recognition accuracy

– Mainly used as a backoff model to complement the lower-order n-grams for better smoothing

For limited (or narrow discourse) domain speech recognition, the class-based n-gram is very helpful

– The class can efficiently encode semantic information for improving keyword spotting and speech understanding accuracy
– Good results are often achieved by manual clustering of semantic categories

slide-60
SLIDE 60

60

Rule-based Classes vs. Data-driven Classes

(Figure: phrase fragments such as "a meeting Sunday is canceled", "the date on Monday will be postponed", "one party Tuesday", and "prepared/arranged in January/February/April", illustrating rule-based word classes like WEEKDAY and MONTH.)

For general-purpose large-vocabulary dictation applications, it is impractical to derive functional classes in the same manner → data-driven approach

slide-61
SLIDE 61

61

Cache Language Models

The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to create a dynamic n-gram model

Trigram interpolated with unigram cache:

$$P_{cache}(z \mid xy) \approx \lambda\, P_{smooth}(z \mid xy) + (1 - \lambda)\, P_{cache}(z \mid history), \qquad
P_{cache}(z \mid history) = \frac{C[z \in history]}{\text{length}(history)}$$

(history: the document/conversation dictated so far)

Trigram interpolated with bigram cache:

$$P_{cache}(z \mid xy) \approx \lambda\, P_{smooth}(z \mid xy) + (1 - \lambda)\, P_{cache}(z \mid y, history), \qquad
P_{cache}(z \mid y, history) = \frac{C[yz \in history]}{C[y \in history]}$$
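A sketch of the unigram-cache interpolation above (λ is fixed arbitrarily; p_smooth stands for whatever static smoothed trigram is in use, passed in as a function):

```python
from collections import Counter

def cached_prob(z, x, y, history_tokens, p_smooth, lam=0.9):
    """P_cache(z | xy) ~= lam * P_smooth(z | xy) + (1 - lam) * C[z in history] / len(history)."""
    cache = Counter(history_tokens)
    p_hist = cache[z] / len(history_tokens) if history_tokens else 0.0
    return lam * p_smooth(z, x, y) + (1 - lam) * p_hist
```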

slide-62
SLIDE 62

62

LM Integrated into Speech Recognition

Theoretically,

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$

Practically, the language model is a better predictor while acoustic probabilities aren't "real" probabilities

– Penalize insertions

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)^{\alpha}\, \beta^{\,\text{length}(W)}, \qquad \text{where } \alpha \text{ and } \beta \text{ can be determined empirically}$$

slide-63
SLIDE 63

63

Known Weakness in Current LM

Brittleness Across Domain

– Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
– E.g., conversations vs. news broadcasts

False Independence Assumption

– In order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words