Automatic Speech Recognition (CS753): Lecture 18, Search & Decoding (Part I)


SLIDE 1

Automatic Speech Recognition (CS753)
Lecture 18: Search & Decoding (Part I)
Instructor: Preethi Jyothi
Mar 23, 2017

SLIDE 2

Recall ASR Decoding

$$W^* = \arg\max_W \Pr(O_A \mid W)\,\Pr(W)$$

$$W^* = \arg\max_{w_1^N,\,N} \left\{ \left[\prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\}$$

$$\approx \arg\max_{w_1^N,\,N} \left\{ \left[\prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\} \quad \text{(Viterbi approximation)}$$
  • The Viterbi approximation divides the above optimisation problem into sub-problems, allowing the efficient application of dynamic programming
  • Even so, an exact search using Viterbi is infeasible for large vocabulary tasks!
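The approximation above can be checked numerically on a toy HMM: the exact quantity sums over all state sequences, while the Viterbi approximation keeps only the single largest term. A minimal sketch with made-up numbers (not from the lecture):

```python
import itertools

# Toy 2-state HMM with made-up numbers (not from the lecture)
start = [0.6, 0.4]                  # Pr(q_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]    # trans[s][s'] = Pr(q_t = s' | q_{t-1} = s)
emit  = [[0.9, 0.1], [0.2, 0.8]]    # emit[s][o]   = Pr(O_t = o | q_t = s)
obs = [0, 1, 0]

def path_prob(path):
    """Joint probability of one state sequence and the observations."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p

paths = list(itertools.product(range(2), repeat=len(obs)))
exact   = sum(path_prob(p) for p in paths)   # sum over all q_1^T
viterbi = max(path_prob(p) for p in paths)   # max over q_1^T (Viterbi)
```

For this toy model the best single path carries a large share of the total probability, which is why replacing the sum by the max works well in practice.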
SLIDE 3

Recall Viterbi search

  • Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
  • Viterbi algorithm: only needs to maintain information about the most probable path at each state

[Figure: trellis over the ice-cream weather HMM with states H and C, observation sequence 3 1 3, and arc scores such as P(H|start)·P(3|H) = .8·.4 and P(C|start)·P(3|C) = .2·.1; the Viterbi values are v1(2) = .32, v1(1) = .02, v2(2) = max(.32·.12, .02·.08) = .038, v2(1) = max(.32·.15, .02·.25) = .048. Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9]
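The trellis computation in the figure can be reproduced in a few lines. Below is a sketch of the Viterbi recursion using the transition and emission scores quoted on the slide (the Jurafsky & Martin ice-cream example; transitions into the end state are omitted for simplicity):

```python
# Scores from the slide: states H (hot) / C (cold), observations = ice creams
start = {"H": 0.8, "C": 0.2}                   # P(s | start)
trans = {("H", "H"): 0.6, ("H", "C"): 0.3,
         ("C", "C"): 0.5, ("C", "H"): 0.4}     # trans[(s, s')] = P(s' | s)
emit = {("H", 1): 0.2, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 3): 0.1}          # emit[(s, o)] = P(o | s)

def viterbi(obs):
    """Return the Viterbi values v_t(s) and the best state path."""
    v = [{s: start[s] * emit[(s, obs[0])] for s in start}]
    back = []
    for o in obs[1:]:
        prev, cur, bp = v[-1], {}, {}
        for s in start:
            # Only the most probable predecessor needs to be kept
            best = max(prev, key=lambda p: prev[p] * trans[(p, s)])
            cur[s] = prev[best] * trans[(best, s)] * emit[(s, o)]
            bp[s] = best
        v.append(cur)
        back.append(bp)
    state = max(v[-1], key=v[-1].get)   # best final state
    path = [state]
    for bp in reversed(back):           # follow backpointers
        path.append(bp[path[-1]])
    return v, path[::-1]

v, path = viterbi([3, 1, 3])
# v[0] == {"H": 0.32, "C": 0.02}; v[1]["C"] == 0.048;
# v[1]["H"] == 0.0384 (the slide rounds this to .038)
```

The key point, as on the slide, is that each trellis node stores only the score and backpointer of its single best incoming path.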
SLIDE 4

ASR Search Network

[Figure: the ASR search network at three levels: a network of words (the, birds, are, boy, is, walking), expanded into a network of phones, expanded into a network of HMM states]
SLIDE 5

Time-state trellis

[Figure: time-state trellis with time t on the horizontal axis and the HMM states of word1, word2, word3 stacked on the vertical axis]
SLIDE 6

Viterbi search over the large trellis

  • Exact search is infeasible for large vocabulary tasks
  • Word boundaries are unknown
  • Ngram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
SLIDE 8

Two main WFST Optimizations

  • Use determinization to reduce/eliminate redundancy
  • Recall that not all weighted transducers are determinizable
  • To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:
    read : r eh d #0
    red : r eh d #1
  • Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃
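The homophone rule above can be sketched in code. This is a simplified illustration only (the full construction, as used in OpenFst-based recipes, also disambiguates pronunciations that are prefixes of others); `disambiguate` is a hypothetical helper name:

```python
from collections import defaultdict

def disambiguate(lexicon):
    """Append #0, #1, ... to the pronunciations of homophones.

    lexicon: dict word -> tuple of phones. Simplified sketch: only exact
    homophones get symbols (the real construction also handles prefixes).
    """
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        groups[phones].append(word)
    out = {}
    for phones, words in groups.items():
        if len(words) == 1:
            out[words[0]] = phones           # unique pronunciation: unchanged
        else:
            for k, word in enumerate(sorted(words)):
                out[word] = phones + ("#%d" % k,)   # distinct #k per homophone
    return out

lex = disambiguate({"read": ("r", "eh", "d"), "red": ("r", "eh", "d")})
# "read" -> (r, eh, d, #0) and "red" -> (r, eh, d, #1), as on the slide
```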
SLIDE 9
  • Use determinization to reduce/eliminate redundancy
  • Use minimization to reduce space requirements

Two main WFST Optimizations

  • Minimization ensures that the final composed machine has the minimum number of states
  • Final optimization cascade: N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
  • πε replaces the disambiguation symbols in the input alphabet of H̃ with ε
SLIDE 10

Example G

[Figure: the acceptor G; state 1 has arcs bob:bob, bond:bond, rob:rob leading to state 2, which has arcs slept:slept, read:read, ate:ate]
SLIDE 11

Compact language models (G)

  • Use Backoff Ngram language models for G
[Figure: backoff Ngram LM as a WFST with history states (a,b), (b,c), (b) and a unigram state; word arcs such as c / Pr(c|a,b) from (a,b) to (b,c), c / Pr(c|b) from (b) to (b,c), and c / Pr(c) from the unigram state; ε-arcs weighted by the backoff weights α(a,b), α(b,c), α(b) drop to shorter histories]
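The ε-arcs in this machine implement the usual backoff recursion: if the full Ngram is absent, multiply by α(history) and retry with the shortened history. A sketch with a made-up trigram table (the numbers are illustrative, not from the lecture):

```python
# Made-up example probabilities and backoff weights (illustrative only)
probs = {("a", "b", "c"): 0.5, ("b", "c"): 0.3, ("c",): 0.1, ("x",): 0.05}
alpha = {("a", "b"): 0.4, ("b",): 0.6}    # backoff weights α(history)

def backoff_prob(hist, w):
    """Pr(w | hist) under a backoff Ngram model, mirroring the ε-arcs."""
    weight = 1.0
    while True:
        if hist + (w,) in probs:
            return weight * probs[hist + (w,)]
        if not hist:
            return 0.0                    # word missing even from unigrams
        weight *= alpha.get(hist, 1.0)    # take the ε / α(hist) arc
        hist = hist[1:]                   # drop the oldest history word
```

For example, `backoff_prob(("a", "b"), "x")` fails at the trigram and bigram levels, so it accumulates α(a,b)·α(b) before reading the unigram Pr(x), exactly the path the ε-arcs trace in the figure.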
SLIDE 12

Example G

[Figure: the acceptor G repeated; state 1 has arcs bob:bob, bond:bond, rob:rob leading to state 2, which has arcs slept:slept, read:read, ate:ate]
SLIDE 13

Example L̃ :Lexicon with disambig symbols

[Figure: the FST L̃; each word is a phone path, e.g. bob = b aa b #0, bond = b aa n d, rob = r aa b, slept = s l eh p t, read = r eh d, ate = ey t, with disambiguation symbols such as #0 inserted where needed]
SLIDE 14

L̃ ○ G

[Figure: L̃ ○ G; the words bob, bond, rob each start a separate phone path from the initial state, so the shared prefix b aa of bob and bond is duplicated, followed by paths for slept, read, ate]

det(L̃ ○ G)

[Figure: det(L̃ ○ G); the shared prefix b aa of bob and bond is now a single path, with the word outputs delayed until the paths diverge]
SLIDE 15

min(det(L̃ ○ G))

[Figure: min(det(L̃ ○ G)); shared suffixes across words (e.g. the final d of bond and read, the final t of slept and ate) are merged, further reducing the number of states]

det(L̃ ○ G)

[Figure: det(L̃ ○ G) repeated for comparison; it has more states than the minimized machine]
SLIDE 16

Viterbi search over the large trellis

  • Exact search is infeasible for large vocabulary tasks
  • Word boundaries are unknown
  • Ngram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
SLIDE 17

Beam pruning

  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
  • Given active nodes from the last time-step:
  • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
  • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis
SLIDE 18

Beam search

  • Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis
  • Hypotheses with Q(t, s) < δ ⋅ max Q(t, s') are pruned; here, δ controls the beam width
  • Search errors can occur if the most probable hypothesis gets pruned
  • There is a trade-off between search errors and decoding speed
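The pruning rule Q(t, s) < δ ⋅ max Q(t, s') is a one-liner over the active hypotheses. A sketch in the probability domain (so δ < 1), with made-up scores:

```python
def beam_prune(scores, delta):
    """Keep only states whose score is within a factor delta of the best.

    scores: dict state -> Q(t, s) for the current time-step.
    Returns the surviving (active) states for the next time-step.
    """
    best = max(scores.values())
    return {s: q for s, q in scores.items() if q >= delta * best}

active = beam_prune({"s1": 0.50, "s2": 0.10, "s3": 0.004}, delta=0.1)
# s3 is pruned: 0.004 < 0.1 * 0.50
```

In practice scores are log-probabilities, so the same rule becomes an additive comparison: keep s when Q(t, s) ≥ max Q(t, s') − beam.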
SLIDE 19

Static and dynamic networks

  • What we’ve seen so far: Static decoding graph
  • H ○ C ○ L ○ G
  • Determinize/minimize to make this graph more compact
  • Another approach: Dynamic graph expansion
  • Dynamically build the graph with active states on the fly
  • Do on-the-fly composition with the language model G
  • (H ○ C ○ L) ○ G
SLIDE 20

Multi-pass search

  • Some models are too expensive to use in first-pass decoding (e.g. RNN-based LMs)
  • First-pass decoding: use a simpler model (e.g. Ngram LMs) to find the most probable word sequences, represented as a word lattice or an N-best list
  • Rescore the first-pass hypotheses using the complex model to find the best word sequence
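Rescoring itself is simple once the N-best list exists: keep each hypothesis's first-pass acoustic score, swap in the stronger LM's score, and re-rank. A minimal sketch in the log domain, where `better_lm` is a made-up stand-in for a second-pass model such as an RNN LM:

```python
def rescore(nbest, better_lm, lm_weight=1.0):
    """Re-rank first-pass hypotheses with a stronger language model.

    nbest: list of (words, am_logprob) pairs from the first pass.
    better_lm: function words -> logprob under the second-pass LM.
    """
    scored = [(words, am + lm_weight * better_lm(words))
              for words, am in nbest]
    return max(scored, key=lambda h: h[1])[0]

# Toy second-pass LM with made-up scores that prefers the second hypothesis
def better_lm(words):
    return {"a b c": -2.0, "a b d": -1.0}[words]

best = rescore([("a b c", -10.0), ("a b d", -10.5)], better_lm)
# "a b d": -10.5 + (-1.0) = -11.5 beats "a b c": -10.0 + (-2.0) = -12.0
```

The `lm_weight` knob plays the same role as the LM scale factor in first-pass decoding, balancing the acoustic and language model scores.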
SLIDE 21

Multi-pass decoding with N-best lists

[Figure: N-best decoding pipeline; the speech input goes to an N-best decoder with a simple knowledge source, producing an N-best list ("If music be the food of love...", "If music be the foot of dove...", "Every happy family...", ...), which a smarter knowledge source rescores into the 1-best utterance]

  • Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
SLIDE 22

Multi-pass decoding with N-best lists

  • Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input (Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10)
  • N-best lists aren't as diverse as we'd like, and they don't carry enough information to effectively use other knowledge sources

Rank  Path                                                   AM logprob   LM logprob
 1.   it's an area that's naturally sort of mysterious         -7193.53       -20.25
 2.   that's an area that's naturally sort of mysterious       -7192.28       -21.11
 3.   it's an area that's not really sort of mysterious        -7221.68       -18.91
 4.   that scenario that's naturally sort of mysterious        -7189.19       -22.08
 5.   there's an area that's naturally sort of mysterious      -7198.35       -21.34
 6.   that's an area that's not really sort of mysterious      -7220.44       -19.77
 7.   the scenario that's naturally sort of mysterious         -7205.42       -21.50
 8.   so it's an area that's naturally sort of mysterious      -7195.92       -21.71
 9.   that scenario that's not really sort of mysterious       -7217.34       -20.70
10.   there's an area that's not really sort of mysterious     -7226.51       -20.01

Figure 10.2 An example 10-Best list from the Broadcast News corpus, produced by the
SLIDE 23

Multi-pass decoding with lattices

  • ASR lattice: weighted automaton/directed graph representing alternate word hypotheses from an ASR system

[Figure: word lattice with competing paths through so, it's / there's / that's / that scenario / the scenario, an area that's, naturally / not really, sort of mysterious]
SLIDE 24

Multi-pass decoding with lattices

  • Confusion networks/sausages: lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level

[Figure: confusion network with slots of competing words, e.g. {it's, there's, that's, that, the}, {scenario, an area}, {that's}, {naturally, not really}, {sort of mysterious}]
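Word posteriors in a confusion-network slot come from normalizing the (summed) scores of the competing words in that slot. A sketch with hypothetical log scores for the first slot of the example (the numbers are made up):

```python
import math

def slot_posteriors(log_scores):
    """Posterior over competing words in one confusion-network slot.

    log_scores: dict word -> summed log score of the lattice paths passing
    through that word (hypothetical values here). Softmax-normalize them.
    """
    m = max(log_scores.values())               # shift for numerical stability
    exps = {w: math.exp(s - m) for w, s in log_scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

post = slot_posteriors({"it's": -1.0, "that's": -1.5, "there's": -3.0})
# posteriors sum to 1, and "it's" gets the largest share
```

These word-level posteriors are what make confusion networks more useful than raw N-best lists for combining additional knowledge sources.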