Language Model Adaptation, by Hsin-min Wang (PowerPoint presentation)



Slide 1

Language Model Adaptation

Hsin-min Wang

References:

  • X. Huang et al., Spoken Language Processing (2001), Chapter 11.
  • M. Bacchiani and B. Roark, "Unsupervised language model adaptation," ICASSP 2003.
  • Marcello Federico, "Efficient language model adaptation through MDI estimation," Eurospeech 1999.
  • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Using information retrieval methods for language model adaptation," Eurospeech 2001.
  • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Unsupervised language model adaptation for broadcast news," ICASSP 2003.

Slide 2

Definition of Speech Recognition Problem

For the given acoustic observation X = x1x2...xn, the goal of speech recognition is to find the corresponding word sequence W = w1w2...wm that has the maximum posterior probability P(W|X):

  Ŵ = argmax_W P(W|X) = argmax_W P(X|W)P(W)/P(X) = argmax_W P(X|W)P(W)

  where W = w1w2...wm, wi ∈ V = {v1, v2, ..., vN}

P(X|W) is supplied by acoustic modeling and P(W) by language modeling.
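The decision rule above can be sketched in a few lines of code, working in log space so the product becomes a sum (and dropping P(X), which is constant over W). The hypothesis strings and log-scores below are invented for illustration:

```python
# Toy illustration of W* = argmax_W P(X|W) P(W), computed in log space.
# Hypotheses and their scores are made up for this sketch.
hypotheses = {
    "recognize speech": {"log_p_x_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_x_given_w": -11.5, "log_p_w": -7.9},
}

def decode(hyps):
    """Return the word sequence maximizing log P(X|W) + log P(W)."""
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

print(decode(hypotheses))
```

Note how the language model term overrides the slightly better acoustic score of the second hypothesis.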

Slide 3

Language Model (LM) Adaptation?

Why Language Model Adaptation?
Dynamic adjustment of the language model parameters, such as n-gram probabilities, vocabulary size, and the choice of words in the vocabulary, is important since the topic changes from time to time.

What is Language Model Adaptation?
Language model adaptation attempts to obtain language models for a new domain with a small amount of adaptation data (cf. acoustic model adaptation).

How to Adapt N-gram Probabilities?
The most widely used approaches are model interpolation and count mixing.

Slide 4

MAP LM Adaptation

Slide 5

MAP

The model parameters θ are assumed to be a random vector in the space Θ. Given an observation sample X, the MAP estimate is obtained as the mode of the posterior distribution of θ, denoted as g(θ|X):

  θ_MAP = argmax_θ g(θ|X) = argmax_θ f(X|θ) g(θ)

Slide 6

MAP Estimation for N-gram LM

Let w_k be the probability of observing the k-th discrete event e_k among a set of K possible outcomes {e_1, ..., e_K}, k = 1, ..., K, with Σ_{k=1}^K w_k = 1. Then the probability of observing a sequence of i.i.d. discrete observations X = x1x2...xT is

  p(X | w_1, ..., w_K) = Π_{k=1}^K w_k^{n_k}, where n_k = Σ_{t=1}^T 1(x_t = e_k)

is the number of occurrences of the k-th event in the sequence, with 1(·) the indicator function.

Slide 7

MAP Estimation for N-gram LM (cont.)

The prior distribution of (w_1, ..., w_K) can be assumed to be a Dirichlet density

  p(w_1, ..., w_K | ν_1, ..., ν_K) ∝ Π_{k=1}^K w_k^(ν_k - 1),

where {ν_k > 0, k = 1, ..., K} is the set of hyperparameters. So the posterior is

  p(w_1, ..., w_K | X, ν_1, ..., ν_K) ∝ Π_{k=1}^K w_k^(n_k + ν_k - 1).

Maximizing the log-posterior under the constraint Σ_{k=1}^K w_k = 1 with Lagrange multiplier l,

  Q = Σ_{k=1}^K (n_k + ν_k - 1) log w_k + l (1 - Σ_{k=1}^K w_k),

and differentiating with respect to w_k gives

  (n_k + ν_k - 1)/w_k - l = 0  ⇒  w_k = (n_k + ν_k - 1)/l.

Summing over k yields l = Σ_{j=1}^K (n_j + ν_j - 1), so

  ŵ_k = (n_k + ν_k - 1) / (Σ_{j=1}^K n_j + Σ_{j=1}^K ν_j - K)
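The closed-form MAP estimate derived above is one line of arithmetic; here is a minimal sketch (the function name and the toy counts/hyperparameters are mine):

```python
def map_estimate(counts, priors):
    """MAP estimate of multinomial parameters under a Dirichlet prior:
    w_k = (n_k + nu_k - 1) / sum_j (n_j + nu_j - 1)."""
    num = [n + v - 1 for n, v in zip(counts, priors)]
    total = sum(num)
    return [x / total for x in num]

# Toy counts n_k and hyperparameters nu_k (illustrative values only)
w = map_estimate([3, 1, 0], [2.0, 2.0, 2.0])  # w == [4/7, 2/7, 1/7]
```

With all ν_k = 2 this behaves like add-one smoothing: the unseen third event still gets nonzero probability.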

Slide 8

MAP N-gram LM Adaptation

Let the count for a word w_i in an n-gram history h be denoted as c(hw_i), and the count for the history as c(h) = Σ_{j=1}^K c(hw_j). Let the corresponding counts from the general-domain sample be denoted as c̃(hw_i) and c̃(h). Let P̃(w_i|h) and P(w_i|h) denote the probability of w_i in history h estimated from the general-domain sample and the adaptation sample, respectively.

If we choose ν_i = (α/β) c̃(h) P̃(w_i|h) + 1, then

  P̂(w_i|h) = [β c(h) P(w_i|h) + α c̃(h) P̃(w_i|h)] / [β c(h) + α c̃(h)]
            = [β c(hw_i) + α c̃(hw_i)] / [β c(h) + α c̃(h)]

Count mixing approach
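A minimal sketch of count mixing for a single history h, assuming counts are stored in dicts (the function name, dict layout, and toy weights are mine):

```python
def count_mixing(c_bg, c_ad, alpha, beta):
    """P(w|h) = (alpha*c~(hw) + beta*c(hw)) / (alpha*c~(h) + beta*c(h)),
    where c_bg holds general-domain counts c~ and c_ad adaptation counts c."""
    vocab = set(c_bg) | set(c_ad)
    denom = alpha * sum(c_bg.values()) + beta * sum(c_ad.values())
    return {w: (alpha * c_bg.get(w, 0) + beta * c_ad.get(w, 0)) / denom
            for w in vocab}

# Toy counts for one history; alpha/beta trade off the two corpora
p = count_mixing({"a": 8, "b": 2}, {"a": 1, "b": 4}, alpha=1.0, beta=2.0)
```

Raising beta pulls the mixed distribution toward the adaptation sample; raising alpha pulls it toward the general-domain counts.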

Slide 9

MAP N-gram LM Adaptation (cont.)

If we choose ν_i = [λ/(1-λ)] c(h) P̃(w_i|h) + 1, then

  P̂(w_i|h) = [c(h) P(w_i|h) + (λ/(1-λ)) c(h) P̃(w_i|h)] / [c(h) + (λ/(1-λ)) c(h)]
            = (1-λ) P(w_i|h) + λ P̃(w_i|h)

The MAP estimate reduces to the model interpolation approach
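This reduction can be checked numerically. The sketch below (function names and toy values are mine) computes the MAP estimate under the prior choice above and compares it with direct linear interpolation:

```python
def map_with_prior(c_ad, p_bg, lam):
    """MAP estimate with nu_i - 1 = (lam/(1-lam)) * c(h) * P~(w_i|h)."""
    c_h = sum(c_ad.values())
    k = lam / (1 - lam) * c_h
    return {w: (c_ad.get(w, 0) + k * p_bg[w]) / (c_h + k) for w in p_bg}

c_ad = {"a": 6, "b": 4}        # adaptation counts for history h (toy values)
p_bg = {"a": 0.2, "b": 0.8}    # general-domain P~(w|h) (toy values)
lam = 0.3
mapped = map_with_prior(c_ad, p_bg, lam)
interp = {w: lam * p_bg[w] + (1 - lam) * c_ad[w] / sum(c_ad.values())
          for w in p_bg}
# mapped and interp agree up to floating-point rounding
```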

Slide 10

MDI LM Adaptation

Slide 11

MDI LM Adaptation

Minimum Discrimination Information (MDI)

– A new LM is estimated so that it is "as close as possible" to a general background LM
– Given a background model P_B(h,w) and an adaptation corpus A, we want to find the model P_A(h,w) that satisfies the following set of linear constraints

  Σ_{hw ∈ V^n} P_A(h,w) δ_i(hw) = P̂_A(S_i),  i = 1, ..., M,

where the δ_i(·) are indicator functions of features S_i ⊂ V^n and the P̂_A(S_i) are empirical estimates of the features on A, and that minimizes the Kullback-Leibler distance to P_B(h,w):

  P_A(h,w) = argmin_Q Σ_{hw ∈ V^n} Q(h,w) log [Q(h,w) / P_B(h,w)]

Slide 12

MDI LM Adaptation (cont.)

The MDI model can be trained with the GIS (Generalized Iterative Scaling) algorithm

– Starting from P_A^(0)(h,w) = P_B(h,w), it performs the following iterations

  P_A^(r+1)(h,w) = P_A^(r)(h,w) Π_{i=1}^M [P̂_A(S_i) / P_A^(r)(S_i)]^(δ_i(hw)/k),

  where P_A^(r)(S_i) = Σ_{hw ∈ V^n} P_A^(r)(h,w) δ_i(hw), i = 1, ..., M,

and every hw ∈ V^n is assumed to satisfy exactly k features.

Slide 13

MDI LM Adaptation (cont.)

Given that the adaptation sample is typically small, we assume only unigram features can be reliably estimated

With unigram features, the constraints become

  Σ_{hw ∈ V^n} P_A(h,w) δ_ŵ(hw) = P̂_A(ŵ),  ∀ ŵ ∈ V,

where δ_ŵ(hw) = 1 if w = ŵ and 0 otherwise. Since every hw ∈ V^n satisfies exactly one unigram feature (k = 1), GIS converges in a single iteration:

  P_A(h,w) = P_B(h,w) α(w),  where α(w) = P̂_A(w)/P_B(w)

The corresponding conditional model is

  P_A(w|h) = P_B(w|h) α(w) / Σ_{ŵ ∈ V} P_B(ŵ|h) α(ŵ)

Slide 14

MDI LM Adaptation (cont.)

In practice the scaling factor is smoothed with an exponent γ:

  α(w) = [P̂_A(w) / P_B(w)]^γ,  where γ ranges from 0 to 1

Slide 15

Unsupervised LM Adaptation for Broadcast News Using IR Methods

Slide 16

Introduction

Unsupervised language model adaptation is an outstanding challenge for speech recognition, especially for complex tasks such as broadcast news transcription, where the content of any given show is related to multiple topics. It is not possible to select adaptation data in advance

– because of the dynamic nature of the task

Adaptation can be seen as tuning a general LM to some special topics without domain-specific training data. Information retrieval techniques have been proposed to address this problem: the speech recognition hypothesis is used as a query to extract articles or text segments on related topics.

Slide 17

Adaptation Method Overview

The adaptive algorithm can be divided into two parts

– Extraction of the adaptation corpus

  • Initial hypothesis segmentation
  • Keyword selection
  • Retrieving relevant articles

– LM adaptation

  • MAP adaptation
  • MDI adaptation
  • Dynamic mixture model
Slide 18

Keyword Selection

The content words with the most relevant topic information are selected as query terms. The relevance of word w_i to story s_j is given by the following score, where p(w_i, v) is the probability that w_i and v appear in the same story and k_j is the set of all words in story s_j

– All words with a relevance score higher than an empirically determined threshold are selected

  R(w_i, s_j) = Σ_{v ∈ k_j} log [p(w_i, v) / (p(w_i) p(v))]
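This score can be read as a sum of pointwise mutual information terms between the candidate keyword and each word of the story. A sketch under that reading (the probability tables, words, and function name are mine):

```python
import math

def relevance(w_i, story_words, p_joint, p_marg):
    """R(w_i, s_j) = sum over v in story s_j of log[p(w_i, v) / (p(w_i) p(v))]."""
    return sum(math.log(p_joint[(w_i, v)] / (p_marg[w_i] * p_marg[v]))
               for v in story_words)

# Toy probabilities: "bank" co-occurs with "loan" twice as often as chance
score = relevance("bank", ["loan"], {("bank", "loan"): 0.04},
                  {"bank": 0.1, "loan": 0.2})
```

Words that co-occur with the story vocabulary no more than chance contribute zero or negative terms, so topical content words rise above the threshold.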

Slide 19

Retrieving Relevant Articles

The selected N content words for each story are used as a query to retrieve relevant texts, scored as follows, where N_j is the number of content words in article A_j. All articles with a score exceeding an empirically determined threshold are extracted, and the selected articles are used as adaptation data to train the adapted LM.

  S(A_j) = (1/N_j) Σ_{i=1}^N Σ_{k=1}^{N_j} log [Pr(keyword_i, w_k) / (Pr(keyword_i) Pr(w_k))]