SLIDE 1

Minimum Bayes-Risk Methods in Automatic Speech Recognition

Vaibhava Goel – IBM; William Byrne – Johns Hopkins University

Pattern Recognition in Speech and Language Processing – Chapter 2

SLIDE 2

Outline

  • Minimum Bayes-Risk Classification Framework

– Likelihood Ratio Based Hypothesis Testing
– Maximum A-Posteriori Probability Classification
– Previous Studies of Application Sensitive ASR

  • Practical MBR Procedures for ASR

– Summation over Hidden State Sequences
– MBR Recognition with N-best Lists
– MBR Recognition with Lattices

SLIDE 3

Outline

  • Segmental MBR Procedures

– Segmental Voting
– ROVER
– e-ROVER

  • Experimental Results

– Parameter Tuning within the MBR Classification Rule
– Utterance Level MBR Word and Keyword Recognition
– ROVER and e-ROVER for Multilingual ASR

  • Summary
SLIDE 4
  • Minimum Bayes-Risk Classification Framework

– Likelihood Ratio Based Hypothesis Testing
– Maximum A-Posteriori Probability Classification
– Previous Studies of Application Sensitive ASR

  • Practical MBR Procedures for ASR

– Summation over Hidden State Sequences
– MBR Recognition with N-best Lists
– MBR Recognition with Lattices

SLIDE 5

Minimum Bayes-Risk Classification Framework

  • Definition:

– $A$ : the acoustic observation sequence
– $W$ : word string
– $\mathcal{W}$ : the hypothesis space
– $\delta : \mathcal{A} \to \mathcal{W}$ : ASR classifier
– $l(W, W')$ : loss function, where $W'$ is the mistranscription
– $P(W, A)$ : true distribution of language and speech

SLIDE 6

Minimum Bayes-Risk Classification Framework

How to measure classifier performance? Using the Bayes risk

$$E_{P(W,A)}\big[ l(W, \delta(A)) \big] = \sum_{A} \sum_{W \in \mathcal{W}} l(W, \delta(A))\, P(W, A) \quad (2.1)$$

$\delta(A)$ is chosen to minimize the Bayes risk, but we use

$$\delta(A) = \arg\min_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} l(W', W)\, P(W \mid A) \quad (2.2)$$

where $W$ is the correct transcription. Let $\mathcal{W}_e$ be the subset of $\mathcal{W}$ with nonzero $P(W \mid A)$, and let $\mathcal{W}_h \subseteq \mathcal{W}$ be the set of hypotheses considered. Equation 2.2 can be rewritten as

$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h} \sum_{W \in \mathcal{W}_e} l(W', W)\, P(W \mid A) \quad (2.4)$$

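A minimal sketch of the decision rule of Equation 2.4 over an explicit evidence set. The posteriors and the keyword-style loss below are hypothetical, chosen to show that a non-0/1 loss can overrule the highest-posterior string:

    # Toy MBR decoder: argmin over hypotheses of the expected loss under P(W|A).
    def mbr_decode(hyp_space, evidence, loss):
        """evidence: dict mapping word string -> P(W|A)."""
        return min(hyp_space,
                   key=lambda hyp: sum(loss(hyp, w) * p
                                       for w, p in evidence.items()))

    def keyword_loss(hyp, truth):
        # Missing a true "no" is five times as costly as any other confusion.
        if hyp == truth:
            return 0
        return 5 if truth == "no" else 1

    evidence = {"yes": 0.45, "yeah": 0.35, "no": 0.20}
    print(mbr_decode(evidence.keys(), evidence, keyword_loss))  # -> "no", not "yes"
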
SLIDE 7

Minimum Bayes-Risk Classification Framework

  • How to define the loss function? Two ways:

– loss function $l_{LRT}(X, Y)$ → classifier $\delta_{LRT}(A)$ : likelihood ratio based hypothesis testing
– loss function $l_{0/1}(W, W')$ → classifier $\delta_{0/1}(A)$ : maximum a posteriori (MAP) classification

  • Since the observations in $\mathcal{W}_e$ serve as the evidence used by the MBR classifier, $\mathcal{W}_e$ is referred to as the evidence space and $P(W \mid A)$ is referred to as the evidence distribution.

SLIDE 8

Likelihood Ratio Based Hypothesis Testing

If $\mathcal{W}_h = \{H_n, H_a\}$ and $\mathcal{W}_e = \{H_n, H_a\}$, define the loss function

$$l(X, Y) = \begin{cases} 0 & \text{if } X = H_n,\ Y = H_n \\ t_1 & \text{if } X = H_n,\ Y = H_a \\ t_2 & \text{if } X = H_a,\ Y = H_n \\ 0 & \text{if } X = H_a,\ Y = H_a \end{cases}$$

The MBR classifier then becomes

$$\delta_{LRT}(A) = \arg\min_{W' \in \{H_n, H_a\}} \big[\, l(W', H_n)\, P(H_n \mid A) + l(W', H_a)\, P(H_a \mid A) \,\big]$$

so that the expected loss of deciding $H_n$ is $t_1 P(H_a \mid A)$ and that of deciding $H_a$ is $t_2 P(H_n \mid A)$. Applying Bayes rule, $H_n$ is chosen whenever

$$\frac{P(A \mid H_n)}{P(A \mid H_a)} > \frac{t_1\, P(H_a)}{t_2\, P(H_n)} = t$$

SLIDE 9

Likelihood Ratio Based Hypothesis Testing

$$\delta_{LRT}(A) = \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > t \\ H_a & \text{otherwise} \end{cases} \quad (2.6)$$

$H_n$ : null class; $H_a$ : alternative class.

The threshold $t$ is set in an application-specific manner; it determines the balance between false rejection and false acceptance.

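A small log-domain sketch of the classifier of Equation 2.6; the model scores and the threshold are hypothetical:

    import math

    def lrt_classify(log_p_a_given_hn, log_p_a_given_ha, threshold):
        """Decide H_n if the likelihood ratio P(A|H_n)/P(A|H_a) exceeds threshold."""
        log_ratio = log_p_a_given_hn - log_p_a_given_ha
        return "H_n" if log_ratio > math.log(threshold) else "H_a"

    # Acoustic log likelihoods of the same utterance under the two models:
    print(lrt_classify(-120.0, -123.5, threshold=2.0))  # e^3.5 > 2, so "H_n"
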
SLIDE 10

Maximum A-Posteriori Probability Classification

Define the 0/1 loss

$$l_{0/1}(W, W') = \begin{cases} 1 & \text{if } W \neq W' \\ 0 & \text{otherwise} \end{cases}$$

Then

$$\delta_{0/1}(A) = \arg\min_{W' \in \mathcal{W}_h} \sum_{W \in \mathcal{W}_e} l_{0/1}(W', W)\, P(W \mid A) = \arg\min_{W' \in \mathcal{W}_h} \sum_{W \neq W'} P(W \mid A) = \arg\min_{W' \in \mathcal{W}_h} \big( 1 - P(W' \mid A) \big) = \arg\max_{W' \in \mathcal{W}_h} P(W' \mid A)$$

SLIDE 11

Previous Studies of Application Sensitive ASR

  • Use of risk minimization in automatic speech recognition has not been extensive.
  • Early investigations into minimum Bayes-risk training criteria for speech recognizers were performed by Nadas.
  • However, our focus in this chapter is on minimum-risk classification rather than estimation.

SLIDE 12

Previous Studies of Application Sensitive ASR

  • Stolcke et al. proposed an approximation to a minimum Bayes-risk classifier for the generation of minimum word error rate hypotheses from recognition N-best lists.
  • Other researchers have proposed posterior probability and confidence based hypothesis selection strategies for word error rate reduction.

SLIDE 13
  • Minimum Bayes-Risk Classification Framework

– Likelihood Ratio Based Hypothesis Testing
– Maximum A-Posteriori Probability Classification
– Previous Studies of Application Sensitive ASR

  • Practical MBR Procedures for ASR

– Summation over Hidden State Sequences
– MBR Recognition with N-best Lists
– MBR Recognition with Lattices

SLIDE 14

Practical MBR Procedures for ASR

  • Why is MBR difficult to implement?

– The evidence and hypothesis spaces in Equation 2.4 tend to be quite large.
– The problem of large spaces is worsened by the fact that an ASR recognizer often has to process many consecutive utterances.
– There are efficient dynamic programming (DP) techniques for the MAP recognizer, but such methods are not yet available for an MBR recognizer under an arbitrary loss function.

SLIDE 15

Practical MBR Procedures for ASR

  • How to implement?

– Two implementations:

  • N-best list rescoring procedure
  • Search over a recognition lattice

– Segment long acoustic data into sentence or phrase length utterances.
– Restrict the evidence and hypothesis spaces to manageable sets of word strings.

SLIDE 16

Summation over Hidden State Sequences

  • A computational issue associated with the use of HMMs in the evidence distribution will be addressed.
  • How to obtain the true distribution?

$$P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \quad (2.12)$$

Here $P(W)$ is approximated using a language model, usually a Markov chain based N-gram model. $P(A \mid W)$ is usually approximated using an HMM, called the acoustic model. Let $\chi$ be the set of all the states in the acoustic HMM, and let $\chi_A$ denote the set of all possible state sequences that could generate $A$. The probability $P(A \mid W)$ is computed as

$$P(A \mid W) = \sum_{X \in \chi_A} P(X \mid W)\, P(A \mid X, W) \quad (2.13)$$

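A brute-force sketch of Equation 2.13 for a toy two-state HMM, accumulating the sum over state sequences in log space. All transition and emission probabilities are hypothetical; a real system computes this sum with the forward algorithm rather than by enumeration:

    import math
    from itertools import product

    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    log_trans = {(0, 0): math.log(0.6), (0, 1): math.log(0.4),
                 (1, 0): math.log(0.3), (1, 1): math.log(0.7)}
    log_emit = [{"x": math.log(0.8), "y": math.log(0.2)},   # state 0
                {"x": math.log(0.1), "y": math.log(0.9)}]   # state 1

    A = ["x", "y", "y"]
    # Enumerate every state sequence X with one state per frame (starting in
    # state 0), adding log P(X|W) + log P(A|X,W) for each.
    terms = []
    for tail in product([0, 1], repeat=len(A) - 1):
        X = (0,) + tail
        log_p_x = sum(log_trans[s, t] for s, t in zip(X[:-1], X[1:]))
        log_p_a = sum(log_emit[s][a] for s, a in zip(X, A))
        terms.append(log_p_x + log_p_a)

    print(math.exp(logsumexp(terms)))  # P(A|W), summed over hidden state sequences
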
SLIDE 17

Summation over Hidden State Sequences

( )

The summation over all possible hidden state sequences is too expensive. A computationally feasible alternative is to modify Equation 2.4 as

$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h} \sum_{W \in \mathcal{W}_e} l(W', W)\, \frac{P(W)}{P(A)} \sum_{X \in \chi_A} P(X \mid W)\, P(A \mid X, W)$$

$$\approx \arg\min_{(W', X') \in \mathcal{W}_h \times \chi_A} \sum_{(W, X) \in \mathcal{W}_e \times \chi_A} l\big( (W', X'), (W, X) \big)\, P(W, X \mid A)$$

where $\chi_A$ here is a sparse sampling of the most likely state sequences in $\chi$.

SLIDE 18

Summation over Hidden State Sequences

For convenience we use $\mathcal{W}_h$ rather than $\mathcal{W}_h \times \chi_A$, $\mathcal{W}_e$ rather than $\mathcal{W}_e \times \chi_A$, and $W$ rather than $(W, X)$. We then have

$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h} \sum_{W \in \mathcal{W}_e} l(W', W)\, P(W, A) \quad (2.15)$$

SLIDE 19

MBR Recognition with N-best Lists

  • The most direct approximation of Equation 2.15 is by N-best list rescoring procedures, as first proposed for WER minimization.
  • In this approach, the evidence and hypothesis spaces are restricted to the N-best lists produced by a recognizer. They are denoted $\mathcal{N}_e$ and $\mathcal{N}_h$:

$$\delta(A) \approx \arg\min_{W' \in \mathcal{N}_h} \sum_{W \in \mathcal{N}_e} l(W', W)\, P(W, A)$$

  • This approximation is particularly easy to implement for arbitrary loss functions.
  • It is of interest to increase the size of these two spaces to the recognition lattice, i.e., to consider more candidates in the search and the sum.

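A runnable sketch of this N-best rescoring rule under the Levenshtein loss; the three candidate strings and their posteriors are hypothetical:

    # N-best MBR rescoring: pick the candidate with the least expected edit distance.
    def levenshtein(a, b):
        """Word-level edit distance between two word strings (lists of words)."""
        d = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, wb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
        return d[-1]

    def mbr_rescore(nbest):
        """nbest: list of (word_string, posterior) pairs."""
        return min(nbest, key=lambda h: sum(p * levenshtein(h[0], w)
                                            for w, p in nbest))[0]

    nbest = [("a b c".split(), 0.40), ("a b d".split(), 0.35), ("a x d".split(), 0.25)]
    print(mbr_rescore(nbest))  # "a b d", although the MAP string is "a b c"
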
SLIDE 20

MBR Recognition with Lattices

  • The multistack prefix tree A* search algorithm uses recognition lattices as the hypothesis and evidence spaces.

  • Lattice Definitions:

– A compact representation for a large set of word strings and their time boundaries.
– A directed acyclic graph $(N, \varepsilon, n_s, n_e, \rho)$ with
  • $N$ : the set of nodes
  • $\varepsilon$ : the set of edges
  • $n_s$ : the unique lattice start node
  • $n_e$ : the unique lattice end node
  • $\rho : N \times N \to \varepsilon$ : specifies lattice connectivity

SLIDE 21

MBR Recognition with Lattices

SLIDE 22

MBR Recognition with Lattices

– Complete path : start node to end node, $W$
– Path segment $W_x(n_x, n_y)$ : (internal) node to (internal) node
– Partial path $W_p(n_s, n_x)$ : start node to (internal) node
– $W_e(n_x, n_e)$ : (internal) node to end node

Acoustic segments corresponding to these three subpaths will be $A(W_x)$, $A(W_p)$, and $A(W_e)$, respectively. The sums of log probabilities on the edges will be $\ln P(W_x, A(W_x))$, $\ln P(W_p, A(W_p))$, and $\ln P(W_e, A(W_e))$, respectively.

SLIDE 23

MBR Recognition with Lattices

The partial path log probability of $W_p$ is

$$L_f(W_p) = \ln P(W_p, A(W_p)) \quad (2.17)$$

The lattice backward log probability of $W_p$ is

$$L_b(W_p) = \ln \sum_{W_e :\ W_p \cdot W_e \in lat} P(W_e, A(W_e) \mid W_p) \quad (2.18)$$

where $W_p \cdot W_e \in lat$ denotes the complete lattice paths obtained by extending $W_p$. The lattice total probability of $W_p$ is

$$T(W_p) = \exp\big\{ L_f(W_p) + L_b(W_p) \big\} = \exp\left\{ \ln \sum_{W_e :\ W_p \cdot W_e \in lat} P(W_p, A(W_p))\, P(W_e, A(W_e) \mid W_p) \right\} = \sum_{W_e :\ W_p \cdot W_e \in lat} P(W_p \cdot W_e, A) \quad (2.19)$$

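A sketch of the backward recursion behind Equation 2.18 on a toy lattice; the node names and edge joint log probabilities are hypothetical:

    import math
    from functools import lru_cache

    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    edges = {                   # node -> list of (next node, ln P(word, acoustics))
        "s": [("1", -2.5), ("2", -2.3)],
        "1": [("e", -1.6)],
        "2": [("1", -3.5), ("e", -1.55)],
        "e": [],
    }

    @lru_cache(maxsize=None)
    def backward(node):
        """ln of the total probability of all paths from node to the end node."""
        if node == "e":
            return 0.0
        return logsumexp([lp + backward(nxt) for nxt, lp in edges[node]])

    print(math.exp(backward("s")))  # total probability of all complete paths
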
SLIDE 24

MBR Recognition with Lattices

(figure: example lattice with edge scores 2.5, 2.3, 3.5, 1.55, 1.6 and a resulting path probability of 0.043)

SLIDE 25

MBR Recognition with Lattices

The joint log probability $\ln P(W, A)$ can be computed by adding the log probabilities on the lattice edges along $W$. Therefore, on the lattice we would implement

$$\delta(A) = \arg\min_{W' \in \mathcal{W}_{lat}} \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A) \quad (2.20)$$

The goal is to find a complete hypothesis $W'$, i.e., a path from $n_s$ to $n_e$ through the lattice, such that its expected loss

$$S(W') = \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A)$$

is the least of all complete hypotheses. This search can be effectively implemented as an A* algorithm.

SLIDE 26

MBR Recognition with Lattices

  • Two cost functions are required for the search.

– The first cost function $C(\cdot)$ is associated with each hypothesis $W_p$, whether partial or complete.

  • Its value is a lower bound on the expected loss that can be obtained by extending the hypothesis through the lattice to completion:

$$C(W_p) \le \min_{W_e :\ W_p \cdot W_e \in \mathcal{W}_{lat}} \sum_{W \in \mathcal{W}_{lat}} l(W_p \cdot W_e, W)\, P(W, A) \quad (2.22)$$

that is, $C(W_p) \le \min_{W_e} S(W_p \cdot W_e)$.

SLIDE 27

MBR Recognition with Lattices

(figure: a partial hypothesis $W_p$ and a completion $W_e$ in the lattice)

SLIDE 28

MBR Recognition with Lattices

– The second cost function is only associated with complete hypotheses.

  • It is an over-estimate of the expected loss of a complete hypothesis $W'$:

$$\bar{C}(W') \ge \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A) = S(W') \quad (2.23)$$

– Hypotheses are kept in a priority queue sorted by cost $C$, with the smallest cost hypothesis at the top.
– At every iteration the hypothesis at the top of the stack is extended.

SLIDE 29

MBR Recognition with Lattices

  • When to terminate?

– When there is a complete hypothesis at the top of the stack, its second cost $\bar{C}$ is computed. The search ends if this over-estimate is smaller than the under-estimate cost $C$ of the next stack hypothesis.
– The search also ends when there is no partial hypothesis left in the stack.

  • Why do we use an over-estimate?

– A* searches usually employ an exact expected loss for complete hypotheses; however, this is prohibitively expensive to compute in our case.

SLIDE 30

Single Stack Search Levenshtein Loss Function

  • We now present usable cost functions for the Levenshtein distance $L(W, W')$.
  • These costs are not unique, and the efficiency of the search depends on the quality of both the under-estimate and the over-estimate.
  • The Levenshtein loss function is not sensitive to the word time boundaries.
  • Therefore, the word time boundaries are summed over during the search.

SLIDE 31

Single Stack Search Levenshtein Loss Function

Let $\mathcal{W}_{st}$ denote the set of all complete and partial hypotheses in the stack. The under-estimate for partial hypotheses is

$$C(W_p) = \sum_{\tilde{W} \in \mathcal{W}_{st}} \left\{ \min_{\substack{X :\ W_p \cdot X \in \mathcal{W} \\ Y :\ \tilde{W} \cdot Y \in \mathcal{W}}} L(\tilde{W} \cdot Y,\ W_p \cdot X) \right\} T(\tilde{W}) \quad (2.24)$$

where $\mathcal{W}$ is the set of all possible word strings and all their possible time boundaries that can be constructed by concatenating zero or more words of the vocabulary.

SLIDE 32

Single Stack Search Levenshtein Loss Function

(figure: a stack hypothesis $\tilde{W}$ with extension $Y$, a partial hypothesis $W_p$ with extension $X$, and the total probability $T_{st}(\tilde{W})$)

SLIDE 33

Single Stack Search Levenshtein Loss Function

A derivation shows that this cost function satisfies Equation 2.22. The over-estimate for a complete hypothesis can be computed as follows:

– For a hypothesis $\tilde{W}$ in stack $\mathcal{W}_{st}$, let $N(\tilde{W})$ be the length of the longest path from its end node to the lattice end node.
– Append each hypothesis $\tilde{W}$ in the stack with $N(\tilde{W})$ instances of out-of-vocabulary markers: $\tilde{W}' = \tilde{W} \cdot D_1 \cdots D_{N(\tilde{W})}$. These markers do not match any word in the vocabulary.
– Compute the over-estimate

$$\bar{C}(W') = \sum_{\tilde{W} \in \mathcal{W}_{st}} L(W', \tilde{W}')\, T(\tilde{W}) \quad (2.25)$$

SLIDE 34

Single Stack Search Levenshtein Loss Function

With the under-estimate (2.24) and the over-estimate (2.25), the following single stack search algorithm can be used to find the desired hypothesis in the recognition lattice:

1. Mark the lattice nodes with the lattice backward log probability (2.18). At each node, keep the length of the longest path to the end of the lattice.
2. Maintain a stack of partial and complete hypotheses. Each partial stack entry contains a hypothesis $W_p$, $L_f(W_p)$ (2.17), $T(W_p)$ (2.19), and $C(W_p)$ (2.24). Each complete stack entry contains a hypothesis $W'$, $T(W')$, $C(W')$, and $\bar{C}(W')$ (2.25). The stack ordering is defined first by increasing values of $C(\cdot)$, and second by decreasing values of $T(\cdot)$ in cases of identical $C(\cdot)$.
3. Initialize the search by inserting the start node of the lattice, i.e., the NULL hypothesis, into the stack.

SLIDE 35

Single Stack Search Levenshtein Loss Function

4. If there are incomplete hypotheses in the stack, extend the top incomplete hypothesis by all lattice arcs that leave its end node. Compute $C(W_p)$ for each of the newly created partial hypotheses. Compute $C(W')$ and $\bar{C}(W')$ for each newly created complete hypothesis. Otherwise, if there are no incomplete stack hypotheses, select the hypothesis with the least $\bar{C}(W')$. This is the desired candidate.
5. Update the cost estimates (2.24 and 2.25) of all other partial and complete stack hypotheses after adding the newly created hypotheses to the evidence space. Insert the newly created hypotheses at their appropriate places in the stack (sorted first by $C(\cdot)$ and second by $T(\cdot)$ in case of ties). Pruning may be applied during the insertion.
6. If there is a complete hypothesis at the top of the stack and its over-estimate is smaller than the under-estimate of the second stack hypothesis (partial or complete), it is the desired candidate and the search ends. Otherwise go to step 4.

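A control-flow sketch of steps 3 through 6. The callables extend, under_estimate (Equation 2.24) and over_estimate (Equation 2.25) are hypothetical stand-ins for the quantities defined above, and the cost updates of existing entries (step 5) are left out:

    def single_stack_search(lattice, extend, under_estimate, over_estimate):
        stack = [(0.0, False, ())]          # (under-estimate C, complete?, hypothesis)
        while stack:
            stack.sort(key=lambda e: e[0])  # smallest cost C on top (step 2)
            cost, complete, hyp = stack[0]
            # Step 6: accept the top complete hypothesis once its over-estimate
            # beats the under-estimate of the next stack entry.
            if complete and (len(stack) == 1 or over_estimate(hyp) < stack[1][0]):
                return hyp
            partials = [e for e in stack if not e[1]]
            if not partials:                # step 4: no incomplete hypotheses left
                return min(stack, key=lambda e: over_estimate(e[2]))[2]
            # Steps 4-5: extend the best incomplete hypothesis along all lattice
            # arcs leaving its end node, and insert the extensions by their cost.
            stack.remove(partials[0])
            for new_hyp, is_complete in extend(lattice, partials[0][2]):
                stack.append((under_estimate(new_hyp), is_complete, new_hyp))
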
SLIDE 36

Prefix Tree Search Under Levenshtein Loss Function

Since the Levenshtein distance does not depend on the time segmentation of hypotheses, let $\Phi$ be the operator that strips the time segmentations from hypotheses. Given a partial hypothesis $W_p$ from the stack, let $\Phi(W_p)$ be its word contents. Let

$$T_{st}(\Phi(W_p)) = \sum_{\tilde{W} \in \mathcal{W}_{st} :\ \Phi(\tilde{W}) = \Phi(W_p)} T(\tilde{W})$$

be the induced total probability over the current stack.

SLIDE 37

Prefix Tree Search Under Levenshtein Loss Function

The cost function can be rearranged using the $\Phi$ operator as

$$C(W_p) = \sum_{\tilde{W} \in \mathcal{W}_{st}} \min_{\substack{X :\ W_p \cdot X \in \mathcal{W} \\ Y :\ \tilde{W} \cdot Y \in \mathcal{W}}} L(\tilde{W} \cdot Y,\ W_p \cdot X)\; T(\tilde{W})$$

$$= \sum_{u \in \Phi(\mathcal{W}_{st})} \sum_{\tilde{W} \in \mathcal{W}_{st} :\ \Phi(\tilde{W}) = u} \min_{\substack{a :\ \Phi(W_p) \cdot a \in \Phi(\mathcal{W}) \\ b :\ u \cdot b \in \Phi(\mathcal{W})}} L(u \cdot b,\ \Phi(W_p) \cdot a)\; T(\tilde{W})$$

$$= \sum_{u \in \Phi(\mathcal{W}_{st})} \min_{\substack{a :\ \Phi(W_p) \cdot a \in \Phi(\mathcal{W}) \\ b :\ u \cdot b \in \Phi(\mathcal{W})}} L(u \cdot b,\ \Phi(W_p) \cdot a)\; T_{st}(u) = C(\Phi(W_p))$$

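A sketch of the $\Phi$ operator and the induced probability $T_{st}$: hypotheses that differ only in time segmentation collapse onto one word sequence, and their total probabilities add. The (word, start, end) triples and the $T$ values are hypothetical:

    from collections import defaultdict

    def phi(hyp):
        """Word contents of a time-marked hypothesis (the Phi operator)."""
        return tuple(word for word, start, end in hyp)

    stack = [
        ([("a", 0.0, 0.4), ("b", 0.4, 1.0)], 0.30),  # (hypothesis, T)
        ([("a", 0.0, 0.5), ("b", 0.5, 1.0)], 0.25),  # same words, other segmentation
        ([("a", 0.0, 0.6), ("c", 0.6, 1.0)], 0.45),
    ]

    t_st = defaultdict(float)
    for hyp, t in stack:
        t_st[phi(hyp)] += t  # T_st: sum of T over hypotheses with equal word contents

    print(dict(t_st))        # {('a', 'b'): 0.55, ('a', 'c'): 0.45}
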
SLIDE 38

Prefix Tree Search Under Levenshtein Loss Function

  • Therefore the cost of a partial hypothesis $W_p$ depends only on its word contents $\Phi(W_p)$.
  • This suggests that we can introduce a prefix tree as a compact representation of the word sequences associated with all partial hypotheses in the stack.
  • A node in the prefix tree identifies a set of hypotheses and their end nodes in the lattice.

SLIDE 39

Prefix Tree Search Under Levenshtein Loss Function

It is the same as the single stack search except that:

1. The stack contains prefix tree nodes and is ordered first by $C(\cdot)$ and then by $T(\cdot)$ in case of ties.
2. The lattice paths corresponding to the prefix tree node at the top of the stack are extended by one word. These extensions yield a new set of prefix tree nodes to be inserted in the stack.

The over-estimate is still computed according to Equation 2.25. A significant advantage of using prefix trees for the Levenshtein distance is that they facilitate storage and computation of the partial hypothesis comparison cost

$$\min_{\substack{a :\ \Phi(W_p) \cdot a \in \Phi(\mathcal{W}) \\ b :\ u \cdot b \in \Phi(\mathcal{W})}} L(u \cdot b,\ \Phi(W_p) \cdot a)$$

SLIDE 40

Pruning and Multistack Organization of the Prefix Tree Search

  • Equations 2.24 and 2.25 do not take pruning into account.
  • When entries are pruned from the stack, Equation 2.24 is still a valid under-estimate, but Equation 2.25 is no longer a valid over-estimate.
  • It is, however, a valid over-estimate for the sub-lattice of the original lattice that could be constructed by completion of the partial hypotheses in the pruned stack.

SLIDE 41

Pruning and Multistack Organization of the Prefix Tree Search

  • The single stack search and the prefix tree search have the disadvantage that the costs of partial hypotheses of different lengths are compared.
  • This is acceptable under the search formulation, but it is not a good comparison for use in pruning, since it favors short hypotheses and makes the pruning sub-optimal.
  • How to solve it?

– Use a multistack implementation that maintains a separate stack for each hypothesis length.
– It has been found to have better pruning characteristics in practice.

SLIDE 42
  • Segmental MBR Procedures

– Segmental Voting
– ROVER
– e-ROVER

  • Experimental Results

– Parameter Tuning within the MBR Classification Rule
– Utterance Level MBR Word and Keyword Recognition
– ROVER and e-ROVER for Multilingual ASR

  • Summary
SLIDE 43

Segmental MBR Procedures

  • Segmental MBR (SMBR).
  • Utterance level recognition is decomposed into a sequence of simpler MBR recognition problems.
  • The lattices or N-best lists are segmented into sets of words.
  • Advantages:

– The segmentation can be performed to identify high confidence regions within the evidence space.
– Within those regions we can produce reliable word hypotheses.
– SMBR can then focus on the low confidence regions.


SLIDE 44

Segmental MBR Procedures

Definition: every word sequence in the evidence and hypothesis spaces is broken into $N$ segment sets, and the final hypothesis is reconstructed by concatenation:

– evidence space: $W_e \in \mathcal{W}_e$, $\; W_e \downarrow \big( R_1(W_e), \ldots, R_N(W_e) \big)$
– hypothesis space: $W_h \in \mathcal{W}_h$, $\; W_h \downarrow \big( R_1(W_h), \ldots, R_N(W_h) \big)$
– reconstruction: $W' = W'_1 \cdots W'_N$

SLIDE 45

Segmental MBR Procedures

  • Assume that the utterance level loss can be found from the losses over the segment sets as

$$l(W, W') = \sum_{i=1}^{N} l_i\big( R_i(W_e), R_i(W_h) \big) \quad (2.28)$$

where $l_i$ is a loss function defined on the $i$-th segment set.

  • Proposition: an utterance level MBR recognizer can be implemented as a concatenation of segmental MBR recognizers

$$\delta(A) = \delta_1(A) \cdot \delta_2(A) \cdots \delta_N(A) \quad (2.29)$$

where

$$\delta_i(A) = \arg\min_{W'_i \in \mathcal{W}_{h_i}} \sum_{W_i \in \mathcal{W}_{e_i}} l_i(W'_i, W_i)\, P_i(W_i \mid A) \quad (2.30)$$

and $P_i(W_i \mid A)$ is the marginal probability over the evidence set

$$P_i(W_i \mid A) = \sum_{W \in \mathcal{W}_e :\ R_i(W) = W_i} P(W \mid A) \quad (2.31)$$

SLIDE 46

Segmental MBR Procedures

  • Therefore, under the assumption of Equation 2.28, utterance level MBR recognition becomes a sequence of smaller MBR recognition problems.
  • In practice it may be difficult to segment the evidence and hypothesis spaces so that Equation 2.28 holds.
  • The utterance level induced loss function is defined as

$$l_I(W, W') = \sum_{i=1}^{N} l_i\big( R_i(W_e), R_i(W_h) \big) \quad (2.32)$$

  • The overall performance under the desired loss function $l$ should depend on how well $l_I$ approximates $l$.

SLIDE 47

Segmental Voting

  • Segmental voting is a special case of segmental MBR recognition:

– Suppose each evidence and hypothesis segment set contains at most one word.
– A 0/1 loss function is used on the segment sets.

  • Under these conditions the segmental MBR recognizer of Equation 2.30 becomes

$$\delta_i(A) = \arg\max_{W'_i \in \mathcal{W}_{h_i}} P_i(W'_i \mid A) \quad (2.33)$$

  • The utterance level induced loss for segmental voting is

$$l^{seg\text{-}vote}_{0/1}(W, W') = \sum_{i=1}^{N} l_{0/1}\big( R_i(W_e), R_i(W_h) \big) \quad (2.34)$$

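A sketch of segmental voting under Equation 2.33: within each segment set, pick the word with the highest marginal posterior. The slot posteriors below are hypothetical, with an empty string marking a possible deletion:

    slots = [                 # one dict of word -> P_i(word | A) per segment set
        {"the": 0.9, "a": 0.1},
        {"cat": 0.5, "cap": 0.3, "": 0.2},
        {"sat": 0.8, "sad": 0.2},
    ]

    hypothesis = [max(slot, key=slot.get) for slot in slots]
    print(" ".join(w for w in hypothesis if w))  # "the cat sat"
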
SLIDE 48

Segmental Voting

  • We will now describe two versions of segmental MBR

recognition used in state-of-the-art ASR systems.

  • Both these procedures attempt to reduce the word error

rate (WER) and thus are based on the Levenshtein loss function.

SLIDE 49

ROVER

  • Recognizer Output Voting for Error Reduction

(ROVER) is an N-best list segmental voting procedure.

  • It combines the hypotheses from multiple independent

recognizers under the Levenshtein loss.

Let $N_m$, $m = 1, \ldots, K$, be N-best lists produced by $K$ recognition systems in response to acoustics $A$, and let $P_m(W \mid A)$ be the posterior distribution associated with $N_m$. The set

$$\mathcal{N}_e = \bigcup_{m=1}^{K} N_m \qquad \text{and} \qquad P_e(W \mid A) = \sum_{m=1}^{K} \alpha_m\, P_m(W \mid A), \quad \sum_{m=1}^{K} \alpha_m = 1$$

are the evidence space and the evidence distribution used by ROVER.

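A sketch of this evidence combination: pool the $K$ N-best lists and mix their posteriors with the weights $\alpha_m$. The lists, posteriors, and weights are hypothetical:

    from collections import defaultdict

    nbest_lists = [               # one list of (word string, P_m(W|A)) per system
        [("a b c", 0.7), ("a b d", 0.3)],
        [("a b d", 0.6), ("a x c", 0.4)],
    ]
    alphas = [0.5, 0.5]           # interpolation weights, summing to 1

    p_e = defaultdict(float)
    for alpha, nbest in zip(alphas, nbest_lists):
        for w, p in nbest:
            p_e[w] += alpha * p   # P_e(W|A) = sum_m alpha_m P_m(W|A)

    print(dict(p_e))              # combined evidence distribution over pooled strings
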
SLIDE 50

ROVER

  • The word strings of $\mathcal{N}_e$ are arranged in a word transition network (WTN) that represents an approximate simultaneous alignment of these hypotheses.

SLIDE 51

Segmental Voting

  • The utterance level induced loss in ROVER is derived as

$$l^{ROVER}_{0/1}(W, W') = \sum_{i=1}^{N} l_{0/1}\big( R_i(W_e), R_i(W_h) \big) \quad (2.36)$$

  • This loss is similar to the Levenshtein distance between strings $W$ and $W'$ when their alignment is specified by the WTN.

SLIDE 52

e-ROVER

  • Extended-ROVER (e-ROVER)
  • The utterance level loss function of e-ROVER is given

as follows.

– Start with the initial WTN.
– Merge two consecutive correspondence sets.
– Let the loss function on the expanded set be the Levenshtein distance.
– The loss function on correspondence sets that did not expand remains the 0/1 loss.

SLIDE 53

e-ROVER

SLIDE 54

e-ROVER

  • The utterance level induced loss in e-ROVER is

$$l^{e\text{-}ROVER}_{0/1}(W, W') = \sum_{\substack{i = 1 \\ i \neq m,\ i \neq m+1}}^{N} l_{0/1}\big( R_i(W_e), R_i(W_h) \big) + L\big( W_m \cdot W_{m+1},\ W'_m \cdot W'_{m+1} \big) \quad (2.37)$$

Here, $W_m \cdot W_{m+1}$ and $W'_m \cdot W'_{m+1}$ are word subsequences from the joined segment sets.

  • It follows from the definition of the Levenshtein distance that

$$L(W, W') \le l^{e\text{-}ROVER}_{0/1}(W, W') \le l^{ROVER}_{0/1}(W, W')$$

SLIDE 55

e-ROVER

  • There are two consequences of joining correspondence sets:

– After the joining operation, the loss function on the expanded set is no longer the 0/1 loss but is instead the Levenshtein distance.
– The size of the expanded set grows exponentially with the number of joining operations, making Equation 2.30 progressively more difficult to implement.

  • Therefore, it is important to choose the sets to be joined carefully, so as to yield the maximum gain in Levenshtein distance approximation with the minimum number of combinations of the correspondence sets.

SLIDE 56
  • Segmental MBR Procedures

– Segmental Voting
– ROVER
– e-ROVER

  • Experimental Results

– Parameter Tuning within the MBR Classification Rule
– Utterance Level MBR Word and Keyword Recognition
– ROVER and e-ROVER for Multilingual ASR

  • Summary
SLIDE 57

Parameter Tuning within the MBR Classification Rule

  • The joint distribution $P(W, A)$ to be used in the MBR recognizers is derived by combining probabilities from the acoustic and language models.
  • It is customary in ASR to use two tuning parameters in the computation of the joint probability:

$$P_{\alpha, \beta}(W, A) = e^{\beta |W|}\, P(A \mid W)\, P(W)^{\alpha} \quad (2.38)$$

where $|W|$ is the number of words in word string $W$:

– $\beta$, a negative constant, causes a decrease of probability with increasing $|W|$ → word insertion penalty
– $\alpha$ scales the language model probability relative to the acoustic model probability → language model scale factor

SLIDE 58

Parameter Tuning within the MBR Classification Rule

We have found it useful to introduce an additional likelihood scale factor $\gamma$:

$$P_{\alpha, \beta, \gamma}(W, A) = e^{\beta |W|}\, P(A \mid W)^{1/\gamma}\, P(W)^{\alpha} \quad (2.39)$$

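A log-domain sketch of Equation 2.39; the acoustic and language model scores and the parameter values below are hypothetical:

    def scaled_log_joint(log_p_a_given_w, log_p_w, num_words,
                         alpha=12.0, beta=-0.5, gamma=8.0):
        """ln of e^{beta |W|} * P(A|W)^{1/gamma} * P(W)^{alpha}."""
        return beta * num_words + log_p_a_given_w / gamma + alpha * log_p_w

    print(scaled_log_joint(log_p_a_given_w=-4200.0, log_p_w=-35.0, num_words=7))
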
SLIDE 59

Optimization of Likelihood Parameters

  • Supervised optimization: let $\delta_{\alpha, \beta, \gamma}$ be the minimum-risk recognizer incorporating the parameterized distribution $P_{\alpha, \beta, \gamma}(W, A)$ of Equation 2.39. We optimize $\alpha$, $\beta$, and $\gamma$ to minimize the empirical risk

$$(\hat{\alpha}, \hat{\beta}, \hat{\gamma}) = \arg\min_{\alpha, \beta, \gamma} \sum_{(W, A) \in \mathcal{T}} l\big( W, \delta_{\alpha, \beta, \gamma}(A) \big) \quad (2.40)$$

over a database $\mathcal{T} = \{(W, A)\}$ of labeled utterances. Since the utterance labels are known, this is supervised optimization.

  • Unsupervised optimization: minimize the empirical risk using the most likely evidence string in place of the truth.

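A grid-search sketch for the supervised tuning of Equation 2.40. Here mbr_decode and loss are hypothetical stand-ins for the MBR recognizer and the task loss, and dev_set holds (reference, acoustics) pairs:

    from itertools import product

    def tune(dev_set, mbr_decode, loss,
             alphas=(8, 12, 16), betas=(-1.0, -0.5, 0.0), gammas=(4, 8, 12)):
        def empirical_risk(params):
            return sum(loss(ref, mbr_decode(acoustics, *params))
                       for ref, acoustics in dev_set)
        # Return the (alpha, beta, gamma) triple with the least empirical risk.
        return min(product(alphas, betas, gammas), key=empirical_risk)
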
SLIDE 60

Utterance Level MBR Word and Keyword Recognition

SLIDE 61

ROVER and e-ROVER for Multilingual ASR