Language Model Adaptation, by Hsin-min Wang (PowerPoint presentation)



Slide 1

Language Model Adaptation

Hsin-min Wang

References:

  • X. Huang et al., Spoken Language Processing (2001), Chapter 11.
  • M. Bacchiani and B. Roark, "Unsupervised language model adaptation," ICASSP 2003.
  • Marcello Federico, "Efficient language model adaptation through MDI estimation," Eurospeech 1999.
  • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Using information retrieval methods for language model adaptation," Eurospeech 2001.
  • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Unsupervised language model adaptation for broadcast news," ICASSP 2003.

Slide 2

Definition of Speech Recognition Problem

For the given acoustic observation X = x1x2...xn, the goal of speech recognition is to find the corresponding word sequence W = w1w2...wm that has the maximum posterior probability P(W|X):

  Ŵ = argmax_W P(W|X) = argmax_W P(X|W)P(W)/P(X) = argmax_W P(X|W)P(W)

  where W = w1w2...wm, wi ∈ V = {v1, v2, ..., vN}

P(X|W) is supplied by acoustic modeling and P(W) by language modeling.
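The decision rule above can be sketched in a few lines of code, working in log space so the product becomes a sum (and dropping P(X), which is constant over W). The hypothesis strings and log-scores below are invented for illustration:

```python
# Toy illustration of W* = argmax_W P(X|W) P(W), computed in log space.
# Hypotheses and their scores are made up for this sketch.
hypotheses = {
    "recognize speech": {"log_p_x_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_x_given_w": -11.5, "log_p_w": -7.9},
}

def decode(hyps):
    """Return the word sequence maximizing log P(X|W) + log P(W)."""
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

print(decode(hypotheses))
```

Note how the language model term overrides the slightly better acoustic score of the second hypothesis.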

Slide 3

Language Model (LM) Adaptation?

Why Language Model Adaptation?
Dynamic adjustment of the language model parameters, such as n-gram probabilities, vocabulary size, and the choice of words in the vocabulary, is important since the topic changes from time to time.

What is Language Model Adaptation?
Language model adaptation attempts to obtain language models for a new domain with a small amount of adaptation data (cf. acoustic model adaptation).

How to Adapt N-gram Probabilities?
The most widely used approaches are model interpolation and count mixing.

Slide 4

MAP LM Adaptation

Slide 5

MAP

The model parameters θ are assumed to be a random vector in the space Θ. Given an observation sample X, the MAP estimate is obtained as the mode of the posterior distribution of θ, denoted as g(θ|X):

  θ_MAP = argmax_θ g(θ|X) = argmax_θ f(X|θ) g(θ)

Slide 6

MAP Estimation for N-gram LM

Let w_k be the probability of observing the k-th discrete event e_k among a set of K possible outcomes {e_1, ..., e_K}, k = 1, ..., K, with Σ_{k=1}^K w_k = 1. Then the probability of observing a sequence of i.i.d. discrete observations X = x1x2...xT is

  p(X | w_1, ..., w_K) = Π_{k=1}^K w_k^{n_k}, where n_k = Σ_{t=1}^T 1(x_t = e_k)

is the number of occurrences of the k-th event in the sequence, with 1(·) the indicator function.

Slide 7

MAP Estimation for N-gram LM (cont.)

The prior distribution of (w_1, ..., w_K) can be assumed to be a Dirichlet density

  p(w_1, ..., w_K | ν_1, ..., ν_K) ∝ Π_{k=1}^K w_k^(ν_k - 1),

where {ν_k > 0, k = 1, ..., K} is the set of hyperparameters. So the posterior is

  p(w_1, ..., w_K | X, ν_1, ..., ν_K) ∝ Π_{k=1}^K w_k^(n_k + ν_k - 1).

Maximizing the log-posterior under the constraint Σ_{k=1}^K w_k = 1 with Lagrange multiplier l,

  Q = Σ_{k=1}^K (n_k + ν_k - 1) log w_k + l (1 - Σ_{k=1}^K w_k),

and differentiating with respect to w_k gives

  (n_k + ν_k - 1)/w_k - l = 0  ⇒  w_k = (n_k + ν_k - 1)/l.

Summing over k yields l = Σ_{j=1}^K (n_j + ν_j - 1), so

  ŵ_k = (n_k + ν_k - 1) / (Σ_{j=1}^K n_j + Σ_{j=1}^K ν_j - K)
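The closed-form MAP estimate derived above is one line of arithmetic; here is a minimal sketch (the function name and the toy counts/hyperparameters are mine):

```python
def map_estimate(counts, priors):
    """MAP estimate of multinomial parameters under a Dirichlet prior:
    w_k = (n_k + nu_k - 1) / sum_j (n_j + nu_j - 1)."""
    num = [n + v - 1 for n, v in zip(counts, priors)]
    total = sum(num)
    return [x / total for x in num]

# Toy counts n_k and hyperparameters nu_k (illustrative values only)
w = map_estimate([3, 1, 0], [2.0, 2.0, 2.0])  # w == [4/7, 2/7, 1/7]
```

With all ν_k = 2 this behaves like add-one smoothing: the unseen third event still gets nonzero probability.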

Slide 8

MAP N-gram LM Adaptation

Let the count for a word w_i in an n-gram history h be denoted as c(hw_i), and the count for the history as c(h) = Σ_{j=1}^K c(hw_j). Let the corresponding counts from the general-domain sample be denoted as c̃(hw_i) and c̃(h). Let P̃(w_i|h) and P(w_i|h) denote the probability of w_i in history h estimated from the general-domain sample and the adaptation sample, respectively.

If we choose ν_i = (α/β) c̃(h) P̃(w_i|h) + 1, then

  P̂(w_i|h) = [β c(h) P(w_i|h) + α c̃(h) P̃(w_i|h)] / [β c(h) + α c̃(h)]
            = [β c(hw_i) + α c̃(hw_i)] / [β c(h) + α c̃(h)]

Count mixing approach
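A minimal sketch of count mixing for a single history h, assuming counts are stored in dicts (the function name, dict layout, and toy weights are mine):

```python
def count_mixing(c_bg, c_ad, alpha, beta):
    """P(w|h) = (alpha*c~(hw) + beta*c(hw)) / (alpha*c~(h) + beta*c(h)),
    where c_bg holds general-domain counts c~ and c_ad adaptation counts c."""
    vocab = set(c_bg) | set(c_ad)
    denom = alpha * sum(c_bg.values()) + beta * sum(c_ad.values())
    return {w: (alpha * c_bg.get(w, 0) + beta * c_ad.get(w, 0)) / denom
            for w in vocab}

# Toy counts for one history; alpha/beta trade off the two corpora
p = count_mixing({"a": 8, "b": 2}, {"a": 1, "b": 4}, alpha=1.0, beta=2.0)
```

Raising beta pulls the mixed distribution toward the adaptation sample; raising alpha pulls it toward the general-domain counts.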

Slide 9

MAP N-gram LM Adaptation (cont.)

If we choose ν_i = [λ/(1-λ)] c(h) P̃(w_i|h) + 1, then

  P̂(w_i|h) = [c(h) P(w_i|h) + (λ/(1-λ)) c(h) P̃(w_i|h)] / [c(h) + (λ/(1-λ)) c(h)]
            = (1-λ) P(w_i|h) + λ P̃(w_i|h)

The MAP estimate reduces to the model interpolation approach
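This reduction can be checked numerically. The sketch below (function names and toy values are mine) computes the MAP estimate under the prior choice above and compares it with direct linear interpolation:

```python
def map_with_prior(c_ad, p_bg, lam):
    """MAP estimate with nu_i - 1 = (lam/(1-lam)) * c(h) * P~(w_i|h)."""
    c_h = sum(c_ad.values())
    k = lam / (1 - lam) * c_h
    return {w: (c_ad.get(w, 0) + k * p_bg[w]) / (c_h + k) for w in p_bg}

c_ad = {"a": 6, "b": 4}        # adaptation counts for history h (toy values)
p_bg = {"a": 0.2, "b": 0.8}    # general-domain P~(w|h) (toy values)
lam = 0.3
mapped = map_with_prior(c_ad, p_bg, lam)
interp = {w: lam * p_bg[w] + (1 - lam) * c_ad[w] / sum(c_ad.values())
          for w in p_bg}
# mapped and interp agree up to floating-point rounding
```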

Slide 10

MDI LM Adaptation

Slide 11

MDI LM Adaptation

Minimum Discrimination Information (MDI)

– A new LM is estimated so that it is "as close as possible" to a general background LM
– Given a background model P_B(h,w) and an adaptation corpus A, we want to find the model P_A(h,w) that satisfies the following set of linear constraints

  Σ_{hw ∈ V^n} P_A(h,w) δ_i(hw) = P̂_A(S_i),  i = 1, ..., M,

where the δ_i(·) are indicator functions of features S_i ⊂ V^n and the P̂_A(S_i) are empirical estimates of the features on A, and that minimizes the Kullback-Leibler distance to P_B(h,w):

  P_A(h,w) = argmin_Q Σ_{hw ∈ V^n} Q(h,w) log [Q(h,w) / P_B(h,w)]

Slide 12

MDI LM Adaptation (cont.)

The MDI model can be trained with the GIS (Generalized Iterative Scaling) algorithm

– Starting from P_A^(0)(h,w) = P_B(h,w), it performs the following iterations

  P_A^(r+1)(h,w) = P_A^(r)(h,w) Π_{i=1}^M [P̂_A(S_i) / P_A^(r)(S_i)]^(δ_i(hw)/k),

  where P_A^(r)(S_i) = Σ_{hw ∈ V^n} P_A^(r)(h,w) δ_i(hw), i = 1, ..., M,

and every hw ∈ V^n is assumed to satisfy exactly k features.

Slide 13

MDI LM Adaptation (cont.)

Given that the adaptation sample is typically small, we assume only unigram features can be reliably estimated

With unigram features, the constraints become

  Σ_{hw ∈ V^n} P_A(h,w) δ_ŵ(hw) = P̂_A(ŵ),  ∀ ŵ ∈ V,

where δ_ŵ(hw) = 1 if w = ŵ and 0 otherwise. Since every hw ∈ V^n satisfies exactly one unigram feature (k = 1), GIS converges in a single iteration:

  P_A(h,w) = P_B(h,w) α(w),  where α(w) = P̂_A(w)/P_B(w)

The corresponding conditional model is

  P_A(w|h) = P_B(w|h) α(w) / Σ_{ŵ ∈ V} P_B(ŵ|h) α(ŵ)

Slide 14

MDI LM Adaptation (cont.)

In practice the scaling factor is smoothed with an exponent γ:

  α(w) = [P̂_A(w) / P_B(w)]^γ,  where γ ranges from 0 to 1

Slide 15

Unsupervised LM Adaptation for Broadcast News Using IR Methods

Slide 16

Introduction

Unsupervised language model adaptation is an outstanding challenge for speech recognition, especially for complex tasks such as broadcast news transcription, where the content of any given show is related to multiple topics. It is not possible to select adaptation data in advance

– because of the dynamic nature of the task

Adaptation can be seen as tuning a general LM to some special topics without domain-specific training data. Information retrieval techniques have been proposed to address this problem: the speech recognition hypothesis is used as a query to extract articles or text segments on related topics.

Slide 17

Adaptation Method Overview

The adaptive algorithm can be divided into two parts

– Extraction of the adaptation corpus

  • Initial hypothesis segmentation
  • Keyword selection
  • Retrieving relevant articles

– LM adaptation

  • MAP adaptation
  • MDI adaptation
  • Dynamic mixture model
Slide 18

Keyword Selection

The content words with the most relevant topic information are selected as query terms. The relevance of word w_i to story s_j is given by the following score, where p(w_i, v) is the probability that w_i and v appear in the same story and k_j is the set of all words in story s_j

– All words with a relevance score higher than an empirically determined threshold are selected

  R(w_i, s_j) = Σ_{v ∈ k_j} log [p(w_i, v) / (p(w_i) p(v))]
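This score can be read as a sum of pointwise mutual information terms between the candidate keyword and each word of the story. A sketch under that reading (the probability tables, words, and function name are mine):

```python
import math

def relevance(w_i, story_words, p_joint, p_marg):
    """R(w_i, s_j) = sum over v in story s_j of log[p(w_i, v) / (p(w_i) p(v))]."""
    return sum(math.log(p_joint[(w_i, v)] / (p_marg[w_i] * p_marg[v]))
               for v in story_words)

# Toy probabilities: "bank" co-occurs with "loan" twice as often as chance
score = relevance("bank", ["loan"], {("bank", "loan"): 0.04},
                  {"bank": 0.1, "loan": 0.2})
```

Words that co-occur with the story vocabulary no more than chance contribute zero or negative terms, so topical content words rise above the threshold.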

Slide 19

Retrieving Relevant Articles

The selected N content words for each story are used as a query to retrieve relevant texts, scored as follows, where N_j is the number of content words in article A_j. All articles with a score exceeding an empirically determined threshold are extracted, and the selected articles are used as adaptation data to train the adapted LM.

  S(A_j) = (1/N_j) Σ_{i=1}^N Σ_{k=1}^{N_j} log [Pr(keyword_i, w_k) / (Pr(keyword_i) Pr(w_k))]