Minimum Bayes-Risk Methods in Automatic Speech Recognition
Vaibhava Goel (IBM) and William Byrne (Johns Hopkins University)
Pattern Recognition in Speech and Language Processing, Chapter 2
Outline
- Minimum Bayes-Risk Classification Framework
– Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR
- Practical MBR Procedures for ASR
– Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Outline
- Segmental MBR Procedures
– Segmental Voting – ROVER – e-ROVER
- Experimental Results
– Parameter Tuning within the MBR Classification Rule – Utterance Level MBR Word and Keyword Recognition – ROVER and e-ROVER for Multilingual ASR
- Summary
- Minimum Bayes-Risk Classification Framework
– Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR
- Practical MBR Procedures for ASR
– Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Minimum Bayes-Risk Classification Framework
- Definitions:
– $A$: acoustic observation sequence
– $W$: word string
– $\mathcal{W}$: the hypothesis space
– $\delta: \mathcal{A} \to \mathcal{W}$: ASR classifier
– $l(W, W')$: loss function, where $W'$ is the mistranscription of $W$
– $P(W, A)$: true distribution of speech and language
Minimum Bayes-Risk Classification Framework
- How to measure classifier performance? By the Bayes risk of the classifier $\delta$:
  $$E[l(W, \delta(A))] = \sum_{A}\sum_{W} l(W, \delta(A))\, P(W, A) \qquad (2.1)$$
- $\delta$ is chosen to minimize the Bayes risk:
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} l(W', W)\, P(W \mid A) \qquad (2.2)$$
- Ideally $\delta(A) = W_c$, the correct transcription, but $W_c$ is unknown, so we use the minimizer of the expected loss instead.
- Let $\mathcal{W}_e^A$ be the subset of $\mathcal{W}$ with nonzero $P(W \mid A)$, and let $\mathcal{W}_h^A$ be the hypothesis space searched for acoustics $A$. Equation 2.2 can be rewritten as
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W', W)\, P(W \mid A) \qquad (2.4)$$
Minimum Bayes-Risk Classification Framework
- Since the observations in $\mathcal{A}$ serve as the evidence used by the MBR classifier, $\mathcal{W}_e^A$ is referred to as the evidence space and $P(W \mid A)$ as the evidence distribution.
- How to define the loss function? Two ways:
– a two-class loss $l(X, Y)$ $\Rightarrow$ classifier $\delta_{LRT}(A)$ $\rightarrow$ likelihood ratio based hypothesis testing method
– the $0/1$ loss $l(W, W')$ $\Rightarrow$ classifier $\delta_{0/1}(A)$ $\rightarrow$ maximum a-posteriori probability classification method
Likelihood Ratio Based Hypothesis Testing
- Let $\mathcal{W}_h = \mathcal{W}_e = \{H_n, H_a\}$, where $H_n$ is the null class and $H_a$ the alternative class, and define
  $$l(X, Y) = \begin{cases} 0 & \text{if } X = H_n,\ Y = H_n \\ t_a & \text{if } X = H_n,\ Y = H_a \\ t_n & \text{if } X = H_a,\ Y = H_n \\ 0 & \text{if } X = H_a,\ Y = H_a \end{cases}$$
- Then
  $$\delta(A) = \arg\min_{W' \in \{H_n, H_a\}} \sum_{W \in \{H_n, H_a\}} l(W', W)\, P(W \mid A)$$
  $$= \arg\min \big\{\, l(H_n, H_n) P(H_n \mid A) + l(H_n, H_a) P(H_a \mid A),\ \ l(H_a, H_n) P(H_n \mid A) + l(H_a, H_a) P(H_a \mid A) \,\big\}$$
  $$= \arg\min \big\{\, t_a\, P(H_a \mid A),\ \ t_n\, P(H_n \mid A) \,\big\}$$
Likelihood Ratio Based Hypothesis Testing
- Deciding $H_n$ when $t_a P(H_a \mid A) < t_n P(H_n \mid A)$ and applying Bayes' rule gives the likelihood ratio test
  $$\delta_{LRT}(A) = \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > t \\ H_a & \text{otherwise} \end{cases} \qquad (2.6)$$
  with threshold $t = \dfrac{t_a\, P(H_a)}{t_n\, P(H_n)}$.
- The threshold is set in an application specific manner; it determines the balance between false rejection and false acceptance.
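A minimal numeric sketch of this decision rule, assuming the threshold form reconstructed above; the function name, log-likelihoods, priors, and costs are all made-up toy values:

```python
import math

def lrt_decide(log_p_a_given_hn, log_p_a_given_ha, p_hn, p_ha, t_n, t_a):
    """Two-class minimum Bayes-risk decision (cf. Eq. 2.6).

    Decide H_n when P(A|H_n)/P(A|H_a) exceeds t = t_a * P(H_a) / (t_n * P(H_n)),
    otherwise decide H_a.
    """
    log_ratio = log_p_a_given_hn - log_p_a_given_ha
    log_threshold = math.log(t_a * p_ha) - math.log(t_n * p_hn)
    return "H_n" if log_ratio > log_threshold else "H_a"

# Equal priors, but false acceptance (t_a) is five times costlier than false
# rejection (t_n), so the evidence must favour H_n strongly before accepting it.
print(lrt_decide(-10.0, -11.0, p_hn=0.5, p_ha=0.5, t_n=1.0, t_a=5.0))  # -> "H_a"
```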
Maximum A-Posteriori Probability Classification
- Define the $0/1$ loss
  $$l_{0/1}(W, W') = \begin{cases} 1 & \text{if } W \neq W' \\ 0 & \text{otherwise} \end{cases}$$
- Then
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} l_{0/1}(W', W)\, P(W \mid A) = \arg\min_{W'} \sum_{W \neq W'} P(W \mid A) = \arg\min_{W'} \big(1 - P(W' \mid A)\big) = \arg\max_{W'} P(W' \mid A)$$
- The MBR classifier under the $0/1$ loss is therefore the maximum a-posteriori (MAP) classifier.
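A small sketch of the generic MBR rule of Equation 2.4 on a toy N-best list (hypotheses and posteriors are invented), checking that with the 0/1 loss it coincides with the MAP rule:

```python
def mbr_decode(hypotheses, posteriors, loss):
    """Pick the hypothesis with the least expected loss under the posteriors."""
    def expected_loss(w_prime):
        return sum(loss(w_prime, w) * p for w, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=expected_loss)

zero_one = lambda w_prime, w: 0.0 if w_prime == w else 1.0

hyps = ["a b c", "a b", "a c c"]
post = [0.5, 0.3, 0.2]
# Under the 0/1 loss the expected loss of w' is 1 - P(w'|A),
# so the MBR choice equals argmax P(w'|A).
assert mbr_decode(hyps, post, zero_one) == max(zip(hyps, post), key=lambda x: x[1])[0]
```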
Previous Studies of Application Sensitive ASR
- Use of risk minimization in automatic speech recognition has not
been extensive.
- Early investigations into the minimum Bayes-risk
training criteria for speech recognizers were performed by Nadas.
- However, our focus in this chapter is on minimum-risk
classification rather than estimation.
Previous Studies of Application Sensitive ASR
- Stolcke et al. proposed an approximation to a minimum
Bayes-risk classifier for generating minimum word error rate hypotheses from recognition N-best lists.
- Other researchers have proposed posterior probability
and confidence based hypothesis selection strategies for word error rate reduction.
- Minimum Bayes-Risk Classification Framework
– Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR
- Practical MBR Procedures for ASR
– Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Practical MBR Procedures for ASR
- Why difficult to implement?
– The evidence and hypothesis spaces in Equation 2.4 tend to be quite large.
– The problem of large spaces is worsened by the fact that an ASR recognizer often has to process many consecutive utterances.
– While there are efficient DP techniques for the MAP recognizer, such methods are not yet available for an MBR recognizer under an arbitrary loss function.
Practical MBR Procedures for ASR
- How to implement?
– Two implementations:
- N-best list rescoring procedure
- Search over a recognition lattice
– Segment long acoustic data into sentence or phrase length utterances. – Restrict the evidence and hypothesis spaces to manageable sets of word strings.
Summation over Hidden State Sequences
- A computational issue associated with the use of HMMs in the
evidence distribution will be addressed.
- How to obtain the true distribution?
- The evidence distribution is obtained from the joint model
  $$P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \qquad (2.12)$$
- Here $P(W)$ is approximated using a language model, usually a Markov chain based N-gram model.
- $P(A \mid W)$ is usually approximated using an HMM, called the acoustic model.
- Let $S$ be the set of all the states in the acoustic HMM, and let $\chi_W$ denote the set of all possible state sequences that could generate $A$ under $W$. The probability $P(A \mid W)$ is computed as
  $$P(A \mid W) = \sum_{X \in \chi_W} P(X \mid W)\, P(A \mid X, W) \qquad (2.13)$$
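A sketch of the state-sequence sum of Equation 2.13 computed with the standard forward recursion; the two-state HMM, transition matrix, and per-frame acoustic scores below are made-up toy values, and `scipy` is assumed to be available:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_pi, log_trans, log_obs):
    """Return ln P(A|W) = ln sum_X P(X|W) P(A|X,W) for one word-string HMM.

    log_pi:    (S,)   initial state log-probabilities
    log_trans: (S, S) state transition log-probabilities
    log_obs:   (T, S) per-frame acoustic log-likelihoods ln p(a_t | state)
    """
    alpha = log_pi + log_obs[0]                      # ln P(a_1, x_1 = s)
    for t in range(1, log_obs.shape[0]):
        # ln P(a_1..a_t, x_t = s): marginalize over the previous state.
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_obs[t]
    return logsumexp(alpha)

# Toy 2-state HMM observed for 3 frames.
log_pi = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.2, 0.8]])
log_obs = np.log([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
print(forward_log_likelihood(log_pi, log_trans, log_obs))
```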
Summation over Hidden State Sequences
- The summation over all possible hidden state sequences is computationally too expensive.
- A computationally feasible alternative is to modify Equation 2.4 as
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W', W)\, P(W \mid A) = \arg\min_{W'} \sum_{W} l(W', W)\, \frac{P(W) \sum_{X \in \chi_W} P(X \mid W)\, P(A \mid X, W)}{P(A)}$$
  $$\approx \arg\min_{(W', X') \in \mathcal{W}_h^A \times \hat{\chi}} \sum_{(W, X) \in \mathcal{W}_e^A \times \hat{\chi}} l\big((W', X'), (W, X)\big)\, P(W, X \mid A)$$
  where $\hat{\chi}$ is a sparse sampling of the most likely state sequences in $\chi$.
Summation over Hidden State Sequences
- For convenience we use $\mathcal{W}_h^A$ rather than $\mathcal{W}_h^A \times \hat{\chi}$, $\mathcal{W}_e^A$ rather than $\mathcal{W}_e^A \times \hat{\chi}$, and $l(W', W)$ rather than $l\big((W', X'), (W, X)\big)$, so we have
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W', W)\, P(W, A) \qquad (2.15)$$
MBR Recognition with N-best Lists
- The most direct approximation of Equation 2.15 is by N-best list rescoring procedures, as first proposed for WER minimization.
- In this approach, the evidence and hypothesis spaces are restricted to the N-best lists produced by a recognizer. They are denoted $\mathcal{N}_e$ and $\mathcal{N}_h$:
  $$\delta(A) \approx \arg\min_{W' \in \mathcal{N}_h} \sum_{W \in \mathcal{N}_e} l(W', W)\, P(W \mid A)$$
- This approximation is particularly easy to implement for arbitrary loss functions.
- It is of interest to increase the size of these two spaces to the recognition lattice, i.e., to consider more candidates in the search and the sum.
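A minimal sketch of this N-best rescoring rule under the word-level Levenshtein loss; the list, its posteriors, and the helper names are illustrative, not taken from the chapter:

```python
def levenshtein(a, b):
    """Word-level Levenshtein (edit) distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution / match
        prev = cur
    return prev[-1]

def mbr_nbest(nbest, loss=levenshtein):
    """N-best approximation of Eq. 2.15: both spaces are the same N-best list."""
    def expected_loss(w_prime):
        return sum(p * loss(w_prime, w) for w, p in nbest)
    return min((w for w, _ in nbest), key=expected_loss)

nbest = [("the cat sat".split(), 0.40),
         ("the cat sat down".split(), 0.35),
         ("a cat sat".split(), 0.25)]
# Here the MAP hypothesis also minimizes the expected word error;
# with other posteriors the MAP and MBR choices can differ.
print(mbr_nbest(nbest))  # ['the', 'cat', 'sat']
```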
MBR Recognition with Lattices
- A multistack prefix tree A* search algorithm uses
recognition lattices as the hypothesis and evidence spaces.
- Lattice definitions:
– A compact representation for a large set of word strings and their time boundaries.
– $\mathcal{L} = (N, \varepsilon, n_s, n_e, \rho)$: an acyclic directed graph, where $N$ is the set of nodes, $\varepsilon$ the set of edges, $n_s$ the unique lattice start node, $n_e$ the unique lattice end node, and $\rho: \varepsilon \to N \times N$ specifies lattice connectivity.
MBR Recognition with Lattices
MBR Recognition with Lattices
- Path types:
– Complete path $W$: start node $n_s$ to end node $n_e$.
– Path segment $W_x$: (internal) node to (internal) node.
– $W_e$: (internal) node to end node.
– Partial path $W_p$: start node to (internal) node.
- Acoustic segments corresponding to these three paths will be $A(W_x)$, $A(W_e)$, and $A(W_p)$ respectively.
- The sum of log-probabilities on the edges will be $\ln P(W_x, A(W_x))$, $\ln P(W_e, A(W_e))$, and $\ln P(W_p, A(W_p))$ respectively.
MBR Recognition with Lattices
- The partial path log-probability is
  $$f(W_p) = \ln P(W_p, A(W_p)) \qquad (2.17)$$
- The lattice backward log-probability is
  $$b(W_p) = \ln \sum_{W_e :\, W_p \cdot W_e \in \mathcal{W}_{lat}} P(W_e, A(W_e) \mid W_p) \qquad (2.18)$$
  where $\mathcal{W}_{lat}$ denotes the set of all complete paths in the lattice.
- The total probability of all complete paths that extend $W_p$ is
  $$T(W_p) = \exp\{f(W_p) + b(W_p)\} = \sum_{W_e :\, W_p \cdot W_e \in \mathcal{W}_{lat}} P(W_p \cdot W_e, A) \qquad (2.19)$$
- The lattice total probability is obtained by taking $W_p$ to be the empty (start node) hypothesis.
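A toy sketch of the backward quantity of Equation 2.18 on a tiny hand-built lattice; the node numbering, edge scores, and function name are illustrative and assume every node can reach the end node:

```python
import math
from collections import defaultdict

def backward_log_probs(edges, end):
    """b(n) = ln sum over all paths from node n to `end` of the edge-score product.

    edges: list of (from_node, to_node, word, log_score), where log_score is the
    combined acoustic+LM log-probability of the edge. Nodes are assumed to be
    numbered in topological order, and b(end) = 0.
    """
    out = defaultdict(list)
    for u, v, _, lp in edges:
        out[u].append((v, lp))
    b = {end: 0.0}
    for n in sorted(out, reverse=True):               # reverse topological order
        scores = [lp + b[v] for v, lp in out[n]]
        m = max(scores)
        b[n] = m + math.log(sum(math.exp(s - m) for s in scores))
    return b

# Tiny two-word lattice: 0 -> {1,2} -> 3, with made-up scores.
edges = [(0, 1, "the", -1.0), (0, 2, "a", -1.5),
         (1, 3, "cat", -0.5), (2, 3, "cat", -0.7)]
b = backward_log_probs(edges, end=3)
# T(W_p) of Eq. 2.19 is then exp(f(W_p) + b(n)) for a partial path ending at n;
# b[0] is the log total probability of all complete lattice paths.
print(b[0])
```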
MBR Recognition with Lattices
(Figure: a small example recognition lattice with numeric edge scores.)
MBR Recognition with Lattices
- The joint log-probability $\ln P(W, A)$ can be computed by adding the log-probabilities on the lattice edges along $W$. Therefore, on the lattice we would implement
  $$\delta(A) = \arg\min_{W' \in \mathcal{W}_{lat}} \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A) \qquad (2.20)$$
- The goal is to find a complete hypothesis $W'$, i.e., a path from $n_s$ to $n_e$ through the lattice, such that its expected loss
  $$S(W') = \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A)$$
  is the least over all complete hypotheses. This search can be effectively implemented as an A* algorithm.
MBR Recognition with Lattices
- Two cost functions are required for the search.
– The first cost function is associated with each hypothesis $W_p$, whether partial or complete.
- Its value is a lower bound on the expected loss that can be obtained by extending the hypothesis through the lattice to completion:
  $$C(W_p) \leq \min_{W_e :\, W_p \cdot W_e \in \mathcal{W}_{lat}} \sum_{W \in \mathcal{W}_{lat}} l(W_p \cdot W_e, W)\, P(W, A) = \min_{W_e :\, W_p \cdot W_e \in \mathcal{W}_{lat}} S(W_p \cdot W_e) \qquad (2.22)$$
MBR Recognition with Lattices
(Figure: a partial path $W_p$ and a completion $W_e$ through the lattice.)
MBR Recognition with Lattices
– The second cost function is only associated with complete hypotheses.
- It is an over-estimate of the expected loss of a complete hypothesis $W'$:
  $$\bar{C}(W') \geq \sum_{W \in \mathcal{W}_{lat}} l(W', W)\, P(W, A) = S(W') \qquad (2.23)$$
– Hypotheses are kept in a priority queue sorted by cost $C$, with the smallest cost hypothesis at the top.
– At every iteration the hypothesis at the top of the stack is extended.
MBR Recognition with Lattices
- When to terminate?
– When there is a complete hypothesis at the top, its second cost is computed. If this over-estimate is smaller than the under-estimate cost $C$ of the next stack hypothesis, the search terminates with that complete hypothesis.
– There is no partial hypothesis left in the stack.
- Why use an over-estimate?
– A* searches usually employ an exact expected loss for complete hypotheses; however, this is prohibitively expensive to compute in our case.
Single Stack Search Levenshtein Loss Function
- We now present usable cost functions for the
Levenshtein distance $L(W, W')$.
- These costs are not unique, and the efficiency of the
search depends on the quality of both the under- estimate and the over-estimate.
- The Levenshtein loss function is not sensitive to the
word time boundaries.
- Therefore, the word time boundaries would be
summed over during the search.
Single Stack Search Levenshtein Loss Function
- Let $\mathcal{W}_{st}$ denote the set of all complete and partial hypotheses in the stack.
- The under-estimate for partial hypotheses is
  $$C(W_p) = \min_{\tilde{W}} \left\{ \sum_{W \in \mathcal{W}_{st}} L\big(W_p \cdot X(\tilde{W}),\ W \cdot Y(\tilde{W})\big)\, T(W) \right\} \qquad (2.24)$$
  where the extensions $X(\tilde{W})$ and $Y(\tilde{W})$ range over the set of all word strings, with all their possible time boundaries, that can be constructed by concatenating zero or more words of the vocabulary.
Single Stack Search Levenshtein Loss Function
(Figure: the quantities $W_p$, $\tilde{W}$, $X$, $Y$, $T(\tilde{W})$, and $\mathcal{W}_{st}$ used in the under-estimate computation.)
Single Stack Search Levenshtein Loss Function
- This cost function can be shown to satisfy Equation 2.22.
- The over-estimate for a complete hypothesis can be computed as follows:
– For a hypothesis $\tilde{W}$ in the stack, let $D(\tilde{W})$ be the length of the longest path from its end node $n(\tilde{W})$ to the lattice end node $n_e$.
– Append each hypothesis $\tilde{W}$ in the stack with $D(\tilde{W})$ instances of out-of-vocabulary markers. These markers do not match any word in the vocabulary.
– Compute the over-estimate
  $$\bar{C}(W') = \sum_{\tilde{W} \in \mathcal{W}_{st}} L\big(W',\ \tilde{W} \cdot \langle\text{oov}\rangle_1 \cdots \langle\text{oov}\rangle_{D(\tilde{W})}\big)\, T(\tilde{W}) \qquad (2.25)$$
Single Stack Search Levenshtein Loss Function
- With the under-estimate (2.24) and the over-estimate (2.25), the following single stack search algorithm can be used to find the desired hypothesis in the recognition lattice:
1. Mark the lattice nodes with the lattice backward log-probability (2.18). At each node, keep the length of the longest path to the end of the lattice.
2. Maintain a stack of partial and complete hypotheses. Each partial stack entry contains a hypothesis $W_p$, $f(W_p)$ (2.17), $T(W_p)$ (2.19), and $C(W_p)$ (2.24). Each complete stack entry contains a hypothesis $W'$, its path log-probability, $T(W')$, and $\bar{C}(W')$ (2.25). The stack ordering is defined first by increasing values of $C(\cdot)$, and second by decreasing values of $T(\cdot)$ in cases of identical $C(\cdot)$.
3. Initialize the search by inserting the start node of the lattice, i.e., the NULL hypothesis, into the stack.
Single Stack Search Levenshtein Loss Function
4. If there are incomplete hypotheses in the stack, extend the top incomplete hypothesis by all lattice arcs that leave its end node. Compute $C(\cdot)$ for each of the newly created partial hypotheses. Compute $\bar{C}(\cdot)$ and $T(\cdot)$ for each newly created complete hypothesis. Otherwise, if there are no incomplete stack hypotheses, select the hypothesis with the least $\bar{C}(\cdot)$. This is the desired candidate.
5. Update the cost estimates (2.24 and 2.25) of all other partial and complete stack hypotheses after adding the newly created hypotheses to the evidence space. Insert the newly created hypotheses at their appropriate places in the stack (sorted first by $C(\cdot)$ and second by $T(\cdot)$ in case of ties). Pruning may be applied during the insertion.
6. If there is a complete hypothesis at the top of the stack and its over-estimate is smaller than the under-estimate of the second stack hypothesis (partial or complete), it is the desired candidate and the search ends. Otherwise go to step 4.
Prefix Tree Search Under Levenshtein Loss Function
- Since the Levenshtein distance does not depend on the time segmentation of hypotheses, the time information can be removed.
- Let $\Phi$ be the operator that strips the time segmentations from hypotheses.
- Given a partial hypothesis $W_p$ from the stack, let $\Phi(W_p)$ be its word contents, and let
  $$U(\Phi(W_p)) = \sum_{\tilde{W} \in \mathcal{W}_{st} :\, \Phi(\tilde{W}) = \Phi(W_p)} T(\tilde{W})$$
  be the induced total probability over the current stack.
Prefix Tree Search Under Levenshtein Loss Function
- The cost function can be rearranged using the $\Phi$ operator: because $L$ ignores time boundaries, the sum over time-marked stack entries in Equation 2.24 collapses into a sum over their distinct word contents, weighted by the induced probabilities $U(\cdot)$:
  $$C(W_p) = \min_{a,\, b} \sum_{u :\, u = \Phi(W),\ W \in \mathcal{W}_{st}} L\big(\Phi(W_p) \cdot a,\ u \cdot b\big)\, U(u)$$
  where $a$ and $b$ range over word strings (without time marks) built from the vocabulary.
Prefix Tree Search Under Levenshtein Loss Function
- Therefore the cost of a partial hypothesis depends only on its word contents $\Phi(W_p)$.
- This suggests that we can introduce a prefix tree as a compact representation of the word sequences associated with all partial hypotheses in the stack.
- A node in the prefix tree identifies a set of hypotheses and their end nodes in the lattice.
Prefix Tree Search Under Levenshtein Loss Function
- It is the same as the single stack search except that:
1. The stack contains prefix tree nodes and is ordered first by $C(\cdot)$ and then by $T(\cdot)$ in case of ties.
2. The lattice paths corresponding to the prefix tree node at the top of the stack are extended by one word. These extensions yield a new set of prefix tree nodes to be inserted in the stack.
- The over-estimate is still computed according to Equation 2.25.
- A significant advantage of using prefix trees for the Levenshtein distance is that they facilitate storage and computation of the partial hypothesis comparison cost.
Pruning and Multistack Organization of the Prefix Tree Search
- Equations 2.24 and 2.25 do not take pruning into
account.
- When entries are pruned from the stack, Equation
2.24 is still a valid under-estimate but Equation 2.25 is no longer a valid over-estimate.
- It is, however, a valid over-estimate for the sub-lattice
of the original lattice that could be constructed by
completion of the partial hypotheses in the pruned stack.
Pruning and Multistack Organization of the Prefix Tree Search
- The single stack search and the prefix tree search
have the disadvantage that the costs of partial hypotheses of different lengths are compared.
- This is acceptable under the search formulation, but
it is not a good comparison for use in pruning, since it favors short hypotheses and can be sub-optimal.
- How to solve it?
– Use a multistack implementation that maintains a separate stack for each hypothesis length. – It has been found to have better pruning characteristics in practice.
- Segmental MBR Procedures
– Segmental Voting – ROVER – e-ROVER
- Experimental Results
– Parameter Tuning within the MBR Classification Rule – Utterance Level MBR Word and Keyword Recognition – ROVER and e-ROVER for Multilingual ASR
- Summary
Segmental MBR Procedures
- Segmental MBR (SMBR).
- Utterance level recognition is performed as a sequence of simpler
MBR recognition problems.
- The lattices or N-best lists are segmented into sets of
words.
- Advantages:
– The segmentation can be performed to identify high confidence regions within the evidence space. – Within these regions we can produce reliable word hypotheses. – The SMBR search effort then focuses on the low confidence regions.
Segmental MBR Procedures
- Definition: the evidence and hypothesis spaces are segmented into $N$ segment sets,
  $$\mathcal{W}_e \rightarrow \mathcal{W}_e^1, \ldots, \mathcal{W}_e^N \qquad \text{(evidence space)}$$
  $$\mathcal{W}_h \rightarrow \mathcal{W}_h^1, \ldots, \mathcal{W}_h^N \qquad \text{(hypothesis space)}$$
  and each word sequence is segmented accordingly,
  $$W \rightarrow R_1(W), \ldots, R_N(W), \qquad W' \rightarrow R_1(W'), \ldots, R_N(W')$$
- The utterance level hypothesis is reconstructed by joining the segment hypotheses: $W' = J(W'_1, \ldots, W'_N)$.
Segmental MBR Procedures
- Assume that the utterance level loss can be found
from the losses over the segment sets as
  $$l(W, W') = \sum_{i=1}^{N} l_i\big(R_i(W), R_i(W')\big) \qquad (2.28)$$
  where $l_i$ is a loss function defined on the $i$-th segment set.
- Proposition: an utterance level MBR recognizer can be implemented as a concatenation of $N$ segmental MBR recognizers,
  $$\delta(A) = J\big(\delta_1(A), \ldots, \delta_N(A)\big) \qquad (2.29)$$
  where
  $$\delta_i(A) = \arg\min_{W'_i \in \mathcal{W}_h^i} \sum_{W_i \in \mathcal{W}_e^i} l_i(W'_i, W_i)\, P(W_i \mid A) \qquad (2.30)$$
  and
  $$P(W_i \mid A) = \sum_{W \in \mathcal{W}_e :\, R_i(W) = W_i} P(W \mid A) \qquad (2.31)$$
  is the marginal probability over the evidence set.
Segmental MBR Procedures
- Therefore, under the assumption of Equation 2.28,
utterance level MBR recognition becomes a sequence
of smaller MBR recognition problems.
- In practice it may be difficult to segment the evidence
and hypothesis spaces.
- The utterance level induced loss function is defined as Equation 2.32 below.
- The overall performance under the desired loss
function $l$ should depend on how well $l_I$ approximates $l$.
  $$l_I(W, W') = \sum_{i=1}^{N} l_i\big(R_i(W), R_i(W')\big) \qquad (2.32)$$
Segmental Voting
- Special case of segmental MBR recognition
- Suppose each evidence and hypothesis segment set
contains at most one word.
- There is a 0/1 loss function on segment sets.
- Under these conditions the segmental MBR recognizer
of Equation 2.30 becomes
  $$\delta_i(A) = \arg\max_{W'_i \in \mathcal{W}_h^i} P(W'_i \mid A) \qquad (2.33)$$
- The utterance level induced loss for segmental voting is
  $$l_{\text{seg-vote}}(W, W') = \sum_{i=1}^{N} l_{0/1}\big(R_i(W), R_i(W')\big) \qquad (2.34)$$
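A minimal sketch of the per-segment rule of Equation 2.33 on toy segment marginals; the slot contents and probabilities are invented, and "" stands for the NULL (empty) word:

```python
def segmental_vote(slot_posteriors):
    """Pick the highest-posterior word in each segment set (Eq. 2.33)."""
    return [max(p, key=p.get) for p in slot_posteriors]

# Three segment sets with made-up marginals P(W_i | A) (cf. Eq. 2.31).
slots = [{"the": 0.6, "a": 0.4},
         {"cat": 0.9, "hat": 0.1},
         {"sat": 0.7, "": 0.3}]
print(segmental_vote(slots))  # ['the', 'cat', 'sat']
```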
Segmental Voting
- We will now describe two versions of segmental MBR
recognition used in state-of-the-art ASR systems.
- Both these procedures attempt to reduce the word error
rate (WER) and thus are based on the Levenshtein loss function.
ROVER
- Recognizer Output Voting for Error Reduction
(ROVER) is an N-best list segmental voting procedure.
- It combines the hypotheses from multiple independent
recognizers under the Levenshtein loss.
- Let $\mathcal{N}_m$, $m = 1, \ldots, K$, be N-best lists produced by $K$ recognition systems in response to acoustics $A$, and let $P_m(W \mid A)$ be the posterior distribution associated with $\mathcal{N}_m$.
- The set
  $$\mathcal{N}_e = \bigcup_{m=1}^{K} \mathcal{N}_m \qquad \text{and} \qquad P(W \mid A) = \sum_{m=1}^{K} \alpha_m\, P_m(W \mid A), \quad \sum_{m=1}^{K} \alpha_m = 1$$
  are the evidence space and the evidence distribution used by ROVER.
ROVER
- The word strings of $\mathcal{N}_e$ are arranged in a word transition
network (WTN) that represents an approximate simultaneous alignment of these hypotheses.
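A sketch of ROVER-style voting once the WTN alignment is given; the data layout, weights, and function name are illustrative assumptions, not the chapter's notation:

```python
def rover_combine(system_slots, alphas):
    """Combine per-slot word scores from K aligned systems and vote per slot.

    system_slots: K entries; each is a list over WTN correspondence sets
    mapping word -> that system's posterior (1.0 for a 1-best string).
    alphas: K interpolation weights with sum 1 (the alpha_m above).
    """
    n_slots = len(system_slots[0])
    result = []
    for i in range(n_slots):
        combined = {}
        for slots, a in zip(system_slots, alphas):
            for word, p in slots[i].items():
                combined[word] = combined.get(word, 0.0) + a * p
        result.append(max(combined, key=combined.get))
    return result

# Two 1-best systems over three correspondence sets; "" marks a NULL word.
sys1 = [{"the": 1.0}, {"cat": 1.0}, {"sat": 1.0}]
sys2 = [{"a": 1.0},   {"cat": 1.0}, {"": 1.0}]
print(rover_combine([sys1, sys2], alphas=[0.6, 0.4]))  # ['the', 'cat', 'sat']
```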
Segmental Voting
- The utterance level induced loss in ROVER is given by Equation 2.36 below.
- This loss is similar to the Levenshtein distance
between strings W and W’ when their alignment is specified by the WTN.
  $$l_{\text{ROVER}}(W, W') = \sum_{i=1}^{N} l_{0/1}\big(R_i(W), R_i(W')\big) \qquad (2.36)$$
  where the $R_i$ are defined by the WTN correspondence sets.
e-ROVER
- Extended-ROVER (e-ROVER)
- The utterance level loss function of e-ROVER is given
as follows.
– Start with the initial WTN. – Merge two consecutive correspondence sets. – Let the loss function on the expanded set be the Levenshtein distance. – The loss function on correspondence sets that did not expand remains the 0/1 loss.
e-ROVER
e-ROVER
- The utterance level induced loss in e-ROVER is
  $$l_{\text{e-ROVER}}(W, W') = \sum_{\substack{i=1 \\ i \neq m,\, m+1}}^{N} l_{0/1}\big(R_i(W), R_i(W')\big) + L\big(W_{m,m+1},\ W'_{m,m+1}\big) \qquad (2.37)$$
  Here, $W_{m,m+1}$ and $W'_{m,m+1}$ are word subsequences from the joined segment sets.
- It follows from the definition of the Levenshtein distance that
  $$L(W, W') \leq l_{\text{e-ROVER}}(W, W') \leq l_{\text{ROVER}}(W, W')$$
e-ROVER
- There are two consequences of joining correspondence sets:
– After the joining operation, the loss function on the expanded set is no longer the 0/1 loss but is instead the Levenshtein distance. – The size of the expanded set grows exponentially with the number of joining operations, making Equation 2.30 progressively difficult to implement.
- Therefore, it is important to determine the sets to be joined
carefully so as to yield maximum gain in Levenshtein distance approximation with minimum combinations of the correspondence sets.
- Segmental MBR Procedures
– Segmental Voting – ROVER – e-ROVER
- Experimental Results
– Parameter Tuning within the MBR Classification Rule – Utterance Level MBR Word and Keyword Recognition – ROVER and e-ROVER for Multilingual ASR
- Summary
Parameter Tuning within the MBR Classification Rule
- The joint distribution to be used in the MBR
recognizers is derived by combining probabilities from acoustic and language models.
- It is customary in ASR to use two tuning parameters
in the computation of the joint probability $P(W, A)$.
  $$P_{\alpha,\beta}(W, A) = e^{\beta |W|}\, P(A \mid W)\, P(W)^{\alpha} \qquad (2.38)$$
  where $|W|$ is the number of words in word string $W$.
– $\beta$ (a negative constant) is the word insertion penalty: increasing $|W|$ causes a decrease in probability.
– $\alpha$ is the language model scale factor: it scales the language model probability relative to the acoustic model probability.
Parameter Tuning within the MBR Classification Rule
- We have found it useful to introduce an additional likelihood scale factor $\gamma$:
  $$P_{\alpha,\beta,\gamma}(W, A) = \Big[ e^{\beta |W|}\, P(A \mid W)\, P(W)^{\alpha} \Big]^{1/\gamma} \qquad (2.39)$$
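A small sketch of turning Equation 2.39 into posteriors over an N-best list; the score values and tuple layout are invented for illustration:

```python
import math

def scaled_posteriors(nbest, alpha, beta, gamma):
    """Posteriors from the scaled joint score of Eq. 2.39, normalized over the list.

    nbest: list of (word_list, acoustic_logprob, lm_logprob).
    Log-score of W: (ln P(A|W) + alpha * ln P(W) + beta * |W|) / gamma.
    """
    scores = [(ac + alpha * lm + beta * len(w)) / gamma for w, ac, lm in nbest]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up acoustic and LM log-probabilities; larger gamma flattens the posteriors.
nbest = [("the cat sat".split(), -120.0, -8.0),
         ("the cat sat down".split(), -121.0, -9.5),
         ("a cat sat".split(), -123.0, -8.5)]
print(scaled_posteriors(nbest, alpha=12.0, beta=-0.5, gamma=10.0))
```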
Optimization of Likelihood Parameters
- Let $\delta_{\alpha,\beta,\gamma}$ be the minimum Bayes-risk recognizer incorporating the parameterized distribution $P_{\alpha,\beta,\gamma}(W, A)$ of Equation 2.39.
- We optimize $\alpha$, $\beta$, and $\gamma$ to minimize the empirical risk
  $$\sum_{(W, A) \in \mathcal{T}} l\big(W, \delta_{\alpha,\beta,\gamma}(A)\big) \qquad (2.40)$$
  over a database $\mathcal{T} = \{(W_i, A_i)\}$ of labeled utterances.
- Supervised optimization: the utterance labels (transcriptions) in $\mathcal{T}$ are known and used as the truth.
- Unsupervised optimization: minimize the empirical risk using the most likely evidence string in place of the truth.
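A sketch of the supervised tuning loop implied by Equation 2.40, as a plain grid search; the function names and arguments are illustrative glue, not the chapter's procedure:

```python
import itertools

def tune_parameters(dev_set, alphas, betas, gammas, decode, loss):
    """Grid-search (alpha, beta, gamma) to minimize empirical risk on dev data.

    dev_set: list of (reference_words, nbest) pairs with known transcriptions.
    decode(nbest, alpha, beta, gamma) -> hypothesis word list, e.g. an MBR
    N-best rescorer driven by the scaled posteriors sketched earlier.
    loss(reference, hypothesis) -> per-utterance loss, e.g. Levenshtein.
    """
    best, best_risk = None, float("inf")
    for a, b, g in itertools.product(alphas, betas, gammas):
        risk = sum(loss(ref, decode(nb, a, b, g)) for ref, nb in dev_set)
        if risk < best_risk:
            best, best_risk = (a, b, g), risk
    return best, best_risk

# For unsupervised tuning, replace each reference by the most likely evidence
# string (the MAP hypothesis) in place of the truth, as described above.
```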