Automatic Speech Recognition (CS753): Lecture 18, Search & Decoding (Part I)


SLIDE 1

Automatic Speech Recognition (CS753)
Lecture 18: Search & Decoding (Part I)
Instructor: Preethi Jyothi
Mar 23, 2017

SLIDE 2

Recall ASR Decoding

$$W^* = \arg\max_W \Pr(O_A \mid W)\,\Pr(W)$$

$$W^* = \arg\max_{w_1^N,\,N} \left\{ \left[\prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\}$$

$$\approx \arg\max_{w_1^N,\,N} \left\{ \left[\prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\} \quad \text{(Viterbi approximation)}$$
  • The Viterbi approximation divides the above optimisation problem into sub-problems, allowing the efficient application of dynamic programming
  • Even so, an exact search using Viterbi is infeasible for large vocabulary tasks!
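The approximation above can be checked numerically on a toy HMM: the exact quantity sums over all state sequences, while the Viterbi approximation keeps only the single largest term. A minimal sketch with made-up numbers (not from the lecture):

```python
import itertools

# Toy 2-state HMM with made-up numbers (not from the lecture)
start = [0.6, 0.4]                  # Pr(q_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]    # trans[s][s'] = Pr(q_t = s' | q_{t-1} = s)
emit  = [[0.9, 0.1], [0.2, 0.8]]    # emit[s][o]   = Pr(O_t = o | q_t = s)
obs = [0, 1, 0]

def path_prob(path):
    """Joint probability of one state sequence and the observations."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p

paths = list(itertools.product(range(2), repeat=len(obs)))
exact   = sum(path_prob(p) for p in paths)   # sum over all q_1^T
viterbi = max(path_prob(p) for p in paths)   # max over q_1^T (Viterbi)
```

For this toy model the best single path carries a large share of the total probability, which is why replacing the sum by the max works well in practice.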
SLIDE 3

Recall Viterbi search

  • Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
  • Viterbi algorithm: only needs to maintain information about the most probable path at each state

[Figure: trellis over the ice-cream weather HMM with states H and C, observation sequence 3 1 3, and arc scores such as P(H|start)·P(3|H) = .8·.4 and P(C|start)·P(3|C) = .2·.1; the Viterbi values are v1(2) = .32, v1(1) = .02, v2(2) = max(.32·.12, .02·.08) = .038, v2(1) = max(.32·.15, .02·.25) = .048. Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9]
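The trellis computation in the figure can be reproduced in a few lines. Below is a sketch of the Viterbi recursion using the transition and emission scores quoted on the slide (the Jurafsky & Martin ice-cream example; transitions into the end state are omitted for simplicity):

```python
# Scores from the slide: states H (hot) / C (cold), observations = ice creams
start = {"H": 0.8, "C": 0.2}                   # P(s | start)
trans = {("H", "H"): 0.6, ("H", "C"): 0.3,
         ("C", "C"): 0.5, ("C", "H"): 0.4}     # trans[(s, s')] = P(s' | s)
emit = {("H", 1): 0.2, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 3): 0.1}          # emit[(s, o)] = P(o | s)

def viterbi(obs):
    """Return the Viterbi values v_t(s) and the best state path."""
    v = [{s: start[s] * emit[(s, obs[0])] for s in start}]
    back = []
    for o in obs[1:]:
        prev, cur, bp = v[-1], {}, {}
        for s in start:
            # Only the most probable predecessor needs to be kept
            best = max(prev, key=lambda p: prev[p] * trans[(p, s)])
            cur[s] = prev[best] * trans[(best, s)] * emit[(s, o)]
            bp[s] = best
        v.append(cur)
        back.append(bp)
    state = max(v[-1], key=v[-1].get)   # best final state
    path = [state]
    for bp in reversed(back):           # follow backpointers
        path.append(bp[path[-1]])
    return v, path[::-1]

v, path = viterbi([3, 1, 3])
# v[0] == {"H": 0.32, "C": 0.02}; v[1]["C"] == 0.048;
# v[1]["H"] == 0.0384 (the slide rounds this to .038)
```

The key point, as on the slide, is that each trellis node stores only the score and backpointer of its single best incoming path.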
SLIDE 4

ASR Search Network

[Figure: the ASR search network at three levels: a network of words (the, birds, are, boy, is, walking), expanded into a network of phones, expanded into a network of HMM states]
SLIDE 5

Time-state trellis

[Figure: time-state trellis with time t on the horizontal axis and the HMM states of word1, word2, word3 stacked on the vertical axis]
SLIDE 6

Viterbi search over the large trellis

  • Exact search is infeasible for large vocabulary tasks
  • Word boundaries are unknown
  • Ngram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
SLIDE 8

Two main WFST Optimizations

  • Use determinization to reduce/eliminate redundancy
  • Recall that not all weighted transducers are determinizable
  • To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:
    read : r eh d #0
    red : r eh d #1
  • Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃
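The homophone rule above can be sketched in code. This is a simplified illustration only (the full construction, as used in OpenFst-based recipes, also disambiguates pronunciations that are prefixes of others); `disambiguate` is a hypothetical helper name:

```python
from collections import defaultdict

def disambiguate(lexicon):
    """Append #0, #1, ... to the pronunciations of homophones.

    lexicon: dict word -> tuple of phones. Simplified sketch: only exact
    homophones get symbols (the real construction also handles prefixes).
    """
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        groups[phones].append(word)
    out = {}
    for phones, words in groups.items():
        if len(words) == 1:
            out[words[0]] = phones           # unique pronunciation: unchanged
        else:
            for k, word in enumerate(sorted(words)):
                out[word] = phones + ("#%d" % k,)   # distinct #k per homophone
    return out

lex = disambiguate({"read": ("r", "eh", "d"), "red": ("r", "eh", "d")})
# "read" -> (r, eh, d, #0) and "red" -> (r, eh, d, #1), as on the slide
```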
SLIDE 9
  • Use determinization to reduce/eliminate redundancy
  • Use minimization to reduce space requirements

Two main WFST Optimizations

  • Minimization ensures that the final composed machine has the minimum number of states
  • Final optimization cascade: N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
  • πε replaces the disambiguation symbols in the input alphabet of H̃ with ε
SLIDE 10

Example G

[Figure: the acceptor G; state 1 has arcs bob:bob, bond:bond, rob:rob leading to state 2, which has arcs slept:slept, read:read, ate:ate]
SLIDE 11

Compact language models (G)

  • Use Backoff Ngram language models for G
[Figure: backoff Ngram LM as a WFST with history states (a,b), (b,c), (b) and a unigram state; word arcs such as c / Pr(c|a,b) from (a,b) to (b,c), c / Pr(c|b) from (b) to (b,c), and c / Pr(c) from the unigram state; ε-arcs weighted by the backoff weights α(a,b), α(b,c), α(b) drop to shorter histories]
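The ε-arcs in this machine implement the usual backoff recursion: if the full Ngram is absent, multiply by α(history) and retry with the shortened history. A sketch with a made-up trigram table (the numbers are illustrative, not from the lecture):

```python
# Made-up example probabilities and backoff weights (illustrative only)
probs = {("a", "b", "c"): 0.5, ("b", "c"): 0.3, ("c",): 0.1, ("x",): 0.05}
alpha = {("a", "b"): 0.4, ("b",): 0.6}    # backoff weights α(history)

def backoff_prob(hist, w):
    """Pr(w | hist) under a backoff Ngram model, mirroring the ε-arcs."""
    weight = 1.0
    while True:
        if hist + (w,) in probs:
            return weight * probs[hist + (w,)]
        if not hist:
            return 0.0                    # word missing even from unigrams
        weight *= alpha.get(hist, 1.0)    # take the ε / α(hist) arc
        hist = hist[1:]                   # drop the oldest history word
```

For example, `backoff_prob(("a", "b"), "x")` fails at the trigram and bigram levels, so it accumulates α(a,b)·α(b) before reading the unigram Pr(x), exactly the path the ε-arcs trace in the figure.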
SLIDE 12

Example G

[Figure: the acceptor G repeated; state 1 has arcs bob:bob, bond:bond, rob:rob leading to state 2, which has arcs slept:slept, read:read, ate:ate]
SLIDE 13

Example L̃ :Lexicon with disambig symbols

[Figure: the FST L̃; each word is a phone path, e.g. bob = b aa b #0, bond = b aa n d, rob = r aa b, slept = s l eh p t, read = r eh d, ate = ey t, with disambiguation symbols such as #0 inserted where needed]
SLIDE 14

L̃ ○ G

[Figure: L̃ ○ G; the words bob, bond, rob each start a separate phone path from the initial state, so the shared prefix b aa of bob and bond is duplicated, followed by paths for slept, read, ate]

det(L̃ ○ G)

[Figure: det(L̃ ○ G); the shared prefix b aa of bob and bond is now a single path, with the word outputs delayed until the paths diverge]
SLIDE 15

min(det(L̃ ○ G))

[Figure: min(det(L̃ ○ G)); shared suffixes across words (e.g. the final d of bond and read, the final t of slept and ate) are merged, further reducing the number of states]

det(L̃ ○ G)

[Figure: det(L̃ ○ G) repeated for comparison; it has more states than the minimized machine]
SLIDE 16

Viterbi search over the large trellis

  • Exact search is infeasible for large vocabulary tasks
  • Word boundaries are unknown
  • Ngram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
SLIDE 17

Beam pruning

  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
  • Given active nodes from the last time-step:
  • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
  • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis
SLIDE 18

Beam search

  • Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis
  • Hypotheses with Q(t, s) < δ ⋅ max Q(t, s') are pruned; here, δ controls the beam width
  • Search errors can occur if the most probable hypothesis gets pruned
  • There is a trade-off between search errors and decoding speed
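The pruning rule Q(t, s) < δ ⋅ max Q(t, s') is a one-liner over the active hypotheses. A sketch in the probability domain (so δ < 1), with made-up scores:

```python
def beam_prune(scores, delta):
    """Keep only states whose score is within a factor delta of the best.

    scores: dict state -> Q(t, s) for the current time-step.
    Returns the surviving (active) states for the next time-step.
    """
    best = max(scores.values())
    return {s: q for s, q in scores.items() if q >= delta * best}

active = beam_prune({"s1": 0.50, "s2": 0.10, "s3": 0.004}, delta=0.1)
# s3 is pruned: 0.004 < 0.1 * 0.50
```

In practice scores are log-probabilities, so the same rule becomes an additive comparison: keep s when Q(t, s) ≥ max Q(t, s') − beam.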
SLIDE 19

Static and dynamic networks

  • What we’ve seen so far: Static decoding graph
  • H ○ C ○ L ○ G
  • Determinize/minimize to make this graph more compact
  • Another approach: Dynamic graph expansion
  • Dynamically build the graph with active states on the fly
  • Do on-the-fly composition with the language model G
  • (H ○ C ○ L) ○ G
SLIDE 20

Multi-pass search

  • Some models are too expensive to use in first-pass decoding (e.g. RNN-based LMs)
  • First-pass decoding: use a simpler model (e.g. Ngram LMs) to find the most probable word sequences, represented as a word lattice or an N-best list
  • Rescore the first-pass hypotheses using the complex model to find the best word sequence
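Rescoring itself is simple once the N-best list exists: keep each hypothesis's first-pass acoustic score, swap in the stronger LM's score, and re-rank. A minimal sketch in the log domain, where `better_lm` is a made-up stand-in for a second-pass model such as an RNN LM:

```python
def rescore(nbest, better_lm, lm_weight=1.0):
    """Re-rank first-pass hypotheses with a stronger language model.

    nbest: list of (words, am_logprob) pairs from the first pass.
    better_lm: function words -> logprob under the second-pass LM.
    """
    scored = [(words, am + lm_weight * better_lm(words))
              for words, am in nbest]
    return max(scored, key=lambda h: h[1])[0]

# Toy second-pass LM with made-up scores that prefers the second hypothesis
def better_lm(words):
    return {"a b c": -2.0, "a b d": -1.0}[words]

best = rescore([("a b c", -10.0), ("a b d", -10.5)], better_lm)
# "a b d": -10.5 + (-1.0) = -11.5 beats "a b c": -10.0 + (-2.0) = -12.0
```

The `lm_weight` knob plays the same role as the LM scale factor in first-pass decoding, balancing the acoustic and language model scores.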
SLIDE 21

Multi-pass decoding with N-best lists

[Figure: N-best decoding pipeline; the speech input goes to an N-best decoder with a simple knowledge source, producing an N-best list ("If music be the food of love...", "If music be the foot of dove...", "Every happy family...", ...), which a smarter knowledge source rescores into the 1-best utterance]

  • Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
SLIDE 22

Multi-pass decoding with N-best lists

  • Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input (Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10)
  • N-best lists aren't as diverse as we'd like, and they don't carry enough information to effectively use other knowledge sources

Rank  Path                                                   AM logprob   LM logprob
 1.   it's an area that's naturally sort of mysterious         -7193.53       -20.25
 2.   that's an area that's naturally sort of mysterious       -7192.28       -21.11
 3.   it's an area that's not really sort of mysterious        -7221.68       -18.91
 4.   that scenario that's naturally sort of mysterious        -7189.19       -22.08
 5.   there's an area that's naturally sort of mysterious      -7198.35       -21.34
 6.   that's an area that's not really sort of mysterious      -7220.44       -19.77
 7.   the scenario that's naturally sort of mysterious         -7205.42       -21.50
 8.   so it's an area that's naturally sort of mysterious      -7195.92       -21.71
 9.   that scenario that's not really sort of mysterious       -7217.34       -20.70
10.   there's an area that's not really sort of mysterious     -7226.51       -20.01

Figure 10.2 An example 10-Best list from the Broadcast News corpus, produced by the
SLIDE 23

Multi-pass decoding with lattices

  • ASR lattice: weighted automaton/directed graph representing alternate word hypotheses from an ASR system

[Figure: word lattice with competing paths through so, it's / there's / that's / that scenario / the scenario, an area that's, naturally / not really, sort of mysterious]
SLIDE 24

Multi-pass decoding with lattices

  • Confusion networks/sausages: lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level

[Figure: confusion network with slots of competing words, e.g. {it's, there's, that's, that, the}, {scenario, an area}, {that's}, {naturally, not really}, {sort of mysterious}]
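Word posteriors in a confusion-network slot come from normalizing the (summed) scores of the competing words in that slot. A sketch with hypothetical log scores for the first slot of the example (the numbers are made up):

```python
import math

def slot_posteriors(log_scores):
    """Posterior over competing words in one confusion-network slot.

    log_scores: dict word -> summed log score of the lattice paths passing
    through that word (hypothetical values here). Softmax-normalize them.
    """
    m = max(log_scores.values())               # shift for numerical stability
    exps = {w: math.exp(s - m) for w, s in log_scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

post = slot_posteriors({"it's": -1.0, "that's": -1.5, "there's": -3.0})
# posteriors sum to 1, and "it's" gets the largest share
```

These word-level posteriors are what make confusion networks more useful than raw N-best lists for combining additional knowledge sources.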