

  1. Search and Decoding. Lecture 16, CS 753. Instructor: Preethi Jyothi

  2. Recall Viterbi search
  • Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
  • Viterbi algorithm: only needs to maintain information about the most probable path at each state
  [Figure: Viterbi trellis for a two-state (hot H / cold C) toy HMM over the observation sequence 3 1 3, e.g. v_1(2) = P(H|start)·P(3|H) = .8 × .4 = .32, v_1(1) = .02, v_2(2) = max(.32 × .12, .02 × .08) = .038, v_2(1) = max(.32 × .15, .02 × .25) = .048. Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9]
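  A minimal sketch of the recursion, using the probabilities that can be read off the trellis above (the remaining transition mass goes to the end state, which this sketch ignores); the dictionary-based representation is purely illustrative:

      # Viterbi over the two-state (H/C) toy HMM from the figure above.
      states = ["H", "C"]
      start_p = {"H": 0.8, "C": 0.2}                     # P(state | start)
      trans_p = {"H": {"H": 0.6, "C": 0.3},
                 "C": {"H": 0.4, "C": 0.5}}              # P(next state | previous state)
      emit_p = {"H": {1: 0.2, 3: 0.4},
                "C": {1: 0.5, 3: 0.1}}                   # P(observation | state)

      def viterbi(obs):
          # v[t][s]: probability of the most probable path ending in state s at time t
          v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
          backptr = [{}]
          for t in range(1, len(obs)):
              v.append({})
              backptr.append({})
              for s in states:
                  prev = max(states, key=lambda p: v[t - 1][p] * trans_p[p][s])
                  v[t][s] = v[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
                  backptr[t][s] = prev                   # only the best predecessor is kept
          last = max(states, key=lambda s: v[-1][s])
          path = [last]
          for t in range(len(obs) - 1, 0, -1):
              path.append(backptr[t][path[-1]])
          return list(reversed(path)), v[-1][last]

      print(viterbi([3, 1, 3]))                          # (['H', 'H', 'H'], ~0.0092)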

  3. ASR Search Network
  [Figure: the ASR search network as a hierarchy: a network of words (e.g., "the birds are walking", "is", "boy"), each word expanded into a network of phones (e.g., b oy; d ax b), and each phone expanded into a network of HMM states]

  4. Time-state trellis
  [Figure: time-state trellis with time t on the horizontal axis and the states of word 1, word 2, and word 3 stacked on the vertical axis]

  5. Viterbi search over the large trellis
  • Exact search is infeasible for large vocabulary tasks
    • Unknown word boundaries
    • Ngram language models greatly increase the search space
  • Solutions
    • Compactly represent the search space using WFST-based optimisations
    • Beam search: prune away parts of the search space that aren’t promising


  7. Two main WFST Optimizations
  • Use determinization to reduce/eliminate redundancy
  • Recall that not all weighted transducers are determinizable
  • To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon (sketched in the snippet after this slide):
      read : r eh d #1
      red : r eh d #2
  • Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃
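  The snippet referenced above is a toy illustration (not from the slides) of attaching auxiliary #k symbols to homophone pronunciations; the lexicon format and the function name are assumptions, and real toolkits additionally disambiguate pronunciations that are prefixes of other pronunciations:

      from collections import defaultdict

      # Append #1, #2, ... to pronunciations shared by more than one word so that
      # the lexicon transducer L (and hence L ○ G) becomes determinizable.
      def add_disambiguation_symbols(lexicon):
          """lexicon: dict mapping word -> tuple of phones."""
          by_pron = defaultdict(list)
          for word, phones in lexicon.items():
              by_pron[tuple(phones)].append(word)
          out = {}
          for phones, words in by_pron.items():
              if len(words) == 1:
                  out[words[0]] = list(phones)
              else:
                  for k, word in enumerate(sorted(words), start=1):
                      out[word] = list(phones) + [f"#{k}"]   # homophones get distinct symbols
          return out

      lexicon = {"read": ("r", "eh", "d"), "red": ("r", "eh", "d"), "bob": ("b", "aa", "b")}
      print(add_disambiguation_symbols(lexicon))
      # {'read': ['r', 'eh', 'd', '#1'], 'red': ['r', 'eh', 'd', '#2'], 'bob': ['b', 'aa', 'b']}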

  8. Two main WFST Optimizations
  • Use determinization to reduce/eliminate redundancy
  • Use minimization to reduce space requirements
  • Minimization ensures that the final composed machine has the minimum number of states
  • Final optimization cascade (rendered in code after this slide):
      N = π_ε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
  • π_ε replaces the disambiguation symbols in the input alphabet of H̃ with ε
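  One way to render the cascade in code, sketched with pynini (the Python wrapper around OpenFst); H̃, C̃, L̃ and G are assumed to have been built elsewhere with compatible symbol tables, and remove_disambig is a hypothetical helper standing in for π_ε:

      import pynini

      # Schematic: N = π_ε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
      def build_decoding_graph(H, C, L, G, remove_disambig):
          LG = pynini.determinize(pynini.compose(L, G))
          CLG = pynini.determinize(pynini.compose(C, LG))
          HCLG = pynini.determinize(pynini.compose(H, CLG))
          HCLG.minimize()                  # in place: minimum number of states
          return remove_disambig(HCLG)     # π_ε: relabel the #k input symbols to ε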

  9. Example G
  [Figure: toy word grammar G with states 0, 1, 2; arcs for bob, bond, rob from state 0 to 1 and slept, read, ate from state 1 to 2]

  10. Example L̃: Lexicon with disambiguation symbols
  [Figure: lexicon transducer L̃ with phones on the input side and words on the output side: bob = b aa b (followed by the auxiliary symbol #0), rob = r aa b, bond = b aa n d, slept = s l eh p t, read = r eh d, ate = ey t; ε arcs close each word back to the start state]

  11. L̃ ○ G and det(L̃ ○ G)
  [Figure: the composition L̃ ○ G (top) and its determinization det(L̃ ○ G) (bottom); determinization merges paths that share phone prefixes and delays word outputs until they are disambiguated (e.g., bond is emitted only at the n arc, once its path diverges from bob)]

  12. det(L̃ ○ G) and min(det(L̃ ○ G))
  [Figure: det(L̃ ○ G) (top) and min(det(L̃ ○ G)) (bottom); minimization merges equivalent states, e.g. shared word endings, yielding a machine with fewer states]

  13. 1st pass recognition networks (40K vocab)

      transducer                × real-time
      C ○ L ○ G                    12.5
      C ○ det(L ○ G)                1.2
      det(H ○ C ○ L ○ G)            1.0
      push(min(F))                  0.7

  Recognition speeds for systems with an accuracy of 83%

  14. Static and dynamic networks
  • What we’ve seen so far: static decoding graph
    • H ○ C ○ L ○ G
    • Determinize/minimize to make this graph more compact
  • Another approach: dynamic graph expansion
    • Dynamically build the graph with active states on the fly
    • Do on-the-fly composition with the language model G: (H ○ C ○ L) ○ G

  15. Viterbi search over the large trellis
  • Exact search is infeasible for large vocabulary tasks
    • Unknown word boundaries
    • Ngram language models greatly increase the search space
  • Solutions
    • Compactly represent the search space using WFST-based optimisations
    • Beam search: prune away parts of the search space that aren’t promising

  16. Beam pruning
  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path (see the sketch after this slide)
  • Given active nodes from the last time-step:
    • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
    • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis
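  As an illustrative sketch of this retention rule, with path costs stored as negative log probabilities (so smaller is better):

      # Keep only the trellis nodes whose path cost is within delta of the best path.
      def prune(active, delta):
          """active: dict mapping trellis node -> path cost (negative log probability)."""
          best = min(active.values())
          return {node: cost for node, cost in active.items() if cost <= best + delta}

      print(prune({"s1": 12.3, "s2": 14.1, "s3": 25.0}, delta=5.0))
      # {'s1': 12.3, 's2': 14.1}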

  17. Viterbi beam search decoder
  • Time-synchronous search algorithm:
    • For time t, each state is updated by the best score from all states in time t-1
  • Beam search prunes unpromising states at every time step:
    • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the score of the best hypothesis

  18. Beam search algorithm

      Initialization: current states := initial state
      while (current states do not contain the goal state) do:
          successor states := NEXT(current states)    (where NEXT is the next-state function)
          score the successor states
          set current states to a pruned set of successor states using beam width δ
              (only retain those successor states that are within δ times the best path weight)
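  A Python rendering of this loop, assuming costs are negative log probabilities (so the beam test becomes additive rather than multiplicative); is_goal, next_states and arc_cost are hypothetical stand-ins for the decoder-specific pieces:

      # Generic beam search following the pseudocode above.
      def beam_search(initial_state, is_goal, next_states, arc_cost, delta):
          current = {initial_state: 0.0}                 # state -> best path cost so far
          while not any(is_goal(s) for s in current):
              successors = {}
              for state, cost in current.items():
                  for nxt in next_states(state):
                      new_cost = cost + arc_cost(state, nxt)
                      if new_cost < successors.get(nxt, float("inf")):
                          successors[nxt] = new_cost
              best = min(successors.values())
              # Prune: retain only successors scoring within delta of the best hypothesis
              current = {s: c for s, c in successors.items() if c <= best + delta}
          return min((s for s in current if is_goal(s)), key=lambda s: current[s])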

  19. Beam search over the decoding graph
  • Say δ = 2
  • Score of an arc: -log P(O_1 | x_1) + graph cost
  [Figure: decoding-graph arcs such as x_1:the, x_2:a, x_200:the expanded against the observation sequence O_1, O_2, O_3, …, O_T]

  20. Beam search in a seq2seq model
  • Say δ = 3
  • At each step, each partial hypothesis is expanded with next-token probabilities such as P(ŷ_2 | x, "a"), P(ŷ_2 | x, "e"), P(ŷ_2 | x, "u")
  [Figure: encoder-decoder with attention; encoder states h_1 … h_M, attention weights α_i1 … α_iM, context vector c_i = Σ_j α_ij h_j feeding the decoder state s_i]
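  A hedged sketch of beam search for a seq2seq decoder, treating δ as the number of hypotheses kept, as in the figure; step is a hypothetical function that runs the attention/decoder stack (computing c_i = Σ_j α_ij h_j internally) and returns next-token log probabilities given the encoder output and a partial hypothesis:

      # Illustrative seq2seq beam search; step, eos and max_len are assumptions.
      def seq2seq_beam_search(encoder_out, step, beam_size, eos, max_len=50):
          beams = [([], 0.0)]                            # (token sequence, log probability)
          for _ in range(max_len):
              candidates = []
              for tokens, logp in beams:
                  if tokens and tokens[-1] == eos:       # finished hypotheses carry over
                      candidates.append((tokens, logp))
                      continue
                  for tok, tok_logp in step(encoder_out, tokens).items():
                      candidates.append((tokens + [tok], logp + tok_logp))
              # Keep only the beam_size highest-scoring hypotheses
              beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
              if all(t and t[-1] == eos for t, _ in beams):
                  break
          return max(beams, key=lambda c: c[1])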

  21. Lattices
  • “Lattices” are useful when more than one hypothesis is desired from a recognition pass
  • A lattice is a weighted, directed acyclic graph which encodes a large number of ASR hypotheses, weighted by acoustic model + language model scores specific to a given utterance

  22. Lattice Generation
  • Say we want to decode an utterance, U, of T frames
  • Construct a sausage acceptor for this utterance, X, with T+1 states and arcs for each context-dependent HMM state at each time-step (sketched after this slide)
  • Search the following composed machine for the best word sequence corresponding to U:

      D = X ○ HCLG
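  A minimal sketch of the sausage acceptor X using pynini; the data layout is an assumption (am_scores is a T × S array of negative log acoustic likelihoods, and context-dependent HMM state s is encoded as arc label s + 1, reserving 0 for ε):

      import pynini

      # Build a sausage acceptor with T+1 states: at each frame, one arc per
      # context-dependent HMM state, weighted by its acoustic cost.
      def make_sausage(am_scores):
          fst = pynini.Fst()                             # default arc type uses tropical weights
          prev = fst.add_state()
          fst.set_start(prev)
          for frame in am_scores:                        # one "sausage link" per frame
              nxt = fst.add_state()
              for s, cost in enumerate(frame):
                  label = s + 1                          # 0 is reserved for epsilon
                  fst.add_arc(prev, pynini.Arc(label, label, pynini.Weight("tropical", cost), nxt))
              prev = nxt
          fst.set_final(prev)
          return fst

      # D = X ○ HCLG, e.g.: D = pynini.compose(make_sausage(am_scores), HCLG)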

  23. Lattice Generation
  • For all practical applications, we have to use beam pruning over D such that only a subset of states/arcs in D is visited. Call the resulting pruned machine B
  • A word lattice, say L, is a further pruned version of B defined by a lattice beam, β. L satisfies the following requirements:
    • L should have a path for every word sequence within β of the best-scoring path in B (see the sketch after this slide)
    • All scores and alignments in L correspond to actual paths through B
    • L does not contain duplicate paths with the same word sequence
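  To make the first requirement concrete, here is a toy sketch of lattice-beam pruning over a plain dict-of-lists DAG (node -> list of (next node, arc cost)); an arc is kept only if the best complete path through it is within β of the overall best path. This illustrates the idea rather than any particular toolkit’s algorithm:

      import heapq

      # Keep arc (u -> v, cost) only if best(start->u) + cost + best(v->final) <= best + beta.
      def prune_lattice(arcs, start, final, beta):
          def best_costs(adj, source):
              # Dijkstra; arc costs (negative log probabilities) are non-negative
              dist = {source: 0.0}
              heap = [(0.0, source)]
              while heap:
                  d, u = heapq.heappop(heap)
                  if d > dist.get(u, float("inf")):
                      continue
                  for v, w in adj.get(u, []):
                      if d + w < dist.get(v, float("inf")):
                          dist[v] = d + w
                          heapq.heappush(heap, (d + w, v))
              return dist

          fwd = best_costs(arcs, start)                                  # best cost start -> node
          rev = {}
          for u, outs in arcs.items():
              for v, w in outs:
                  rev.setdefault(v, []).append((u, w))
          bwd = best_costs(rev, final)                                   # best cost node -> final
          best = fwd[final]
          return {u: [(v, w) for v, w in outs
                      if fwd.get(u, float("inf")) + w + bwd.get(v, float("inf")) <= best + beta]
                  for u, outs in arcs.items()}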
