Search and Decoding
Lecture 16, CS 753
Instructor: Preethi Jyothi
Recall Viterbi search
- Viterbi search finds the most probable path through a trellis with time on
the X-axis and states on the Y-axis
- Viterbi algorithm: Only needs to maintain information about the most
probable path at each state
[Figure: Viterbi trellis for the ice-cream HMM example with states H (hot) and C (cold) and observation sequence 3 1 3; arc weights are products such as P(H|start)*P(3|H) = .8 * .4 and P(C|H)*P(1|C) = .3 * .5, and the trellis values are v1(2) = .32, v1(1) = .02, v2(2) = max(.32*.12, .02*.08) = .038, v2(1) = max(.32*.15, .02*.25) = .048]
Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
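The following is a small Python sketch of the Viterbi recursion on the ice-cream example above; the probabilities are read off the arc labels in the figure, and the trellis values it computes match v1(2) = .32, v1(1) = .02, etc.

```python
# Viterbi search on the ice-cream HMM (states H and C, observations = number
# of ice creams eaten). Probabilities are those shown in the trellis figure.
obs = [3, 1, 3]
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
emit_p = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

def viterbi(obs):
    # v[t][s]: probability of the most probable path ending in state s at time t
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # only the best incoming path into each state needs to be kept
            prev, score = max(
                ((p, v[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]]) for p in states),
                key=lambda x: x[1])
            v[t][s], back[t][s] = score, prev
    # trace back the most probable state sequence from the best final state
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi(obs))  # most probable state sequence (H H H here) and its probability
```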
ASR Search Network
[Figure: the ASR search network at three levels: a network of words (the, birds, are, boy, is, walking), expanded into a network of phones (e.g., d ax b ...), expanded into a network of HMM states]
Time-state trellis
[Figure: the time-state trellis obtained by unrolling the search network over time, with HMM states for word1, word2, word3 on the Y-axis and time, t → on the X-axis]
Viterbi search over the large trellis
- Exact search is infeasible for large vocabulary tasks
- Unknown word boundaries
- Ngram language models greatly increase the search space
- Solutions
- Compactly represent the search space using WFST-based optimisations
- Beam search: Prune away parts of the search space that
aren’t promising
Two main WFST Optimizations
- Use determinization to reduce/eliminate redundancy
- Use minimization to reduce space requirements
Recall that not all weighted transducers are determinizable. To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:
read : r eh d #1
red : r eh d #2
Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃.
Minimization ensures that the final composed machine has the minimum number of states. Final optimization cascade:
N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
where πε replaces the disambiguation symbols in the input alphabet of H̃ with ε.
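A minimal sketch of this cascade using pynini (a Python wrapper around OpenFst); it assumes transducers H_t, C_t, L_t (the H̃, C̃, L̃ above, with disambiguation symbols already added) and G have been built elsewhere, and it omits the final πε relabelling and any weight pushing.

```python
import pynini

def build_decoding_graph(H_t, C_t, L_t, G):
    """Optimization cascade min(det(H~ o det(C~ o det(L~ o G)))) (a sketch).

    H_t, C_t, L_t are assumed to be the H~, C~, L~ transducers above (with
    disambiguation symbols already added), and G the grammar/LM acceptor,
    all built elsewhere as pynini FSTs.
    """
    LG = pynini.determinize(pynini.compose(L_t, G))
    CLG = pynini.determinize(pynini.compose(C_t, LG))
    HCLG = pynini.determinize(pynini.compose(H_t, CLG))
    HCLG.minimize()  # destructive minimization of the fully composed machine
    # The final pi_epsilon step (replacing the disambiguation symbols on the
    # input side with epsilon) and weight pushing are omitted here.
    return HCLG
```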
Example G
[Figure: grammar acceptor G with two states, accepting one word from {bob, bond, rob} followed by one word from {slept, read, ate}]
Example L̃: Lexicon with disambiguation symbols
[Figure: lexicon transducer L̃ mapping phone strings to words: b aa b → bob (followed by disambiguation symbol #0), b aa n d → bond, r aa b → rob, s l eh p t → slept, r eh d → read, ey t → ate]
L̃ ○ G
[Figure: the composition L̃ ○ G, pairing each pronunciation path in L̃ with the word sequences allowed by G]
det(L̃ ○ G)
[Figure: determinization of L̃ ○ G; redundancy is reduced by sharing common prefixes (e.g., b aa for bob and bond) and delaying word outputs until the paths diverge]
min(det(L̃ ○ G))
[Figure: minimization of det(L̃ ○ G); common suffixes are shared as well, giving the machine with the fewest states]
1st pass recognition networks (40K vocab)
Recognition speeds for systems with an accuracy of 83%

transducer            x real-time
C ◦ L ◦ G             12.5
C ◦ det(L ◦ G)        1.2
det(H ◦ C ◦ L ◦ G)    1.0
push(min(F))          0.7
Static and dynamic networks
- What we’ve seen so far: Static decoding graph
- H ○ C ○ L ○ G
- Determinize/minimize to make this graph more compact
- Another approach: Dynamic graph expansion
- Dynamically build the graph with active states on the fly
- Do on-the-fly composition with the language model G
- (H ○ C ○ L) ○ G
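Below is a minimal sketch of how on-the-fly composition with G can work; arcs_hcl and arcs_g are hypothetical callables giving the outgoing arcs of the static H ○ C ○ L machine and of G, and the epsilon-filter bookkeeping needed for full generality is omitted.

```python
from collections import namedtuple

Arc = namedtuple("Arc", "ilabel olabel weight nextstate")

def expand(state, arcs_hcl, arcs_g):
    """Lazily expand one state of (H o C o L) o G during decoding (a sketch).

    A state of the composed machine is a pair (s_hcl, s_g); its outgoing arcs
    are generated only when the decoder actually reaches that pair, so the
    full composition is never built.
    """
    s_hcl, s_g = state
    for a in arcs_hcl(s_hcl):
        if a.olabel == 0:            # epsilon output: G does not advance
            yield Arc(a.ilabel, 0, a.weight, (a.nextstate, s_g))
        else:                        # word output: advance G on that word
            for g in arcs_g(s_g):
                if g.ilabel == a.olabel:
                    yield Arc(a.ilabel, g.olabel, a.weight + g.weight,
                              (a.nextstate, g.nextstate))
```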
Viterbi search over the large trellis
- Exact search is infeasible for large vocabulary tasks
- Unknown word boundaries
- Ngram language models greatly increase the search space
- Solutions
- Compactly represent the search space using WFST-based optimisations
- Beam search: Prune away parts of the search space that
aren’t promising
Beam pruning
- At each time-step t, only retain those nodes in the time-state trellis that
are within a fixed threshold δ (beam width) of the best path
- Given active nodes from the last time-step:
- Examine nodes in the current time-step that are reachable from active
nodes in the previous time-step
- Get active nodes for the current time-step by only
retaining nodes with hypotheses that score close to the score of the best hypothesis
Viterbi beam search decoder
- Time-synchronous search algorithm:
- For time t, each state is updated by the best score from all
states in time t-1
- Beam search prunes unpromising states at every time step.
- At each time-step t, only retain those nodes in the time-state trellis that
are within a fixed threshold δ (beam width) of the score of the best hypothesis.
Beam search algorithm
Initialization: current states := initial state
while (current states do not contain the goal state) do:
    successor states := NEXT(current states), where NEXT is the next-state function
    score the successor states
    current states := pruned set of successor states, using beam width δ
- Only retain those successor states that are within δ times the best path weight
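A minimal Python sketch of this time-synchronous beam search over a decoding graph; arcs_from and acoustic_cost are hypothetical stand-ins for the decoding graph and the acoustic model, and costs are negative log probabilities, so the multiplicative beam on path weights becomes an additive threshold on costs.

```python
import math

def beam_search_decode(observations, start_state, arcs_from, acoustic_cost, delta):
    # active: best cost (negative log score) reached for each state at this frame
    active = {start_state: 0.0}
    backpointers = []
    for obs in observations:
        new_active, bp = {}, {}
        for state, cost in active.items():
            for next_state, label, graph_cost in arcs_from(state):
                # total arc cost = -log P(obs | label) + graph (HMM + lexicon + LM) cost
                c = cost + acoustic_cost(obs, label) + graph_cost
                if c < new_active.get(next_state, math.inf):
                    new_active[next_state] = c
                    bp[next_state] = (state, label)
        # beam pruning: keep only states within delta of the best cost
        best = min(new_active.values())
        active = {s: c for s, c in new_active.items() if c <= best + delta}
        backpointers.append({s: bp[s] for s in active})
    # returns the surviving states at the final frame and backpointers for traceback
    return active, backpointers
```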
Beam search over the decoding graph
[Figure: beam search with δ = 2 over the decoding graph, for observations O1 O2 O3 … OT; arcs carry labels such as x1:the, x2:a, x200:the, and the score of an arc is -log P(O1|x1) + graph cost]
Beam search in a seq2seq model
[Figure: attention-based encoder-decoder, where the decoder state si uses the context vector ci = Σj αij hj formed from encoder states h1 … hM and attention weights αi1 … αiM]
[Figure: beam search with δ = 3 over output tokens ŷ1, ŷ2, …; after the first step, the hypotheses "a", "e", "u" are each extended using P(ŷ2|x, "a"), P(ŷ2|x, "e"), P(ŷ2|x, "u")]
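A minimal Python sketch of beam search for a seq2seq decoder; step_logprobs is a hypothetical stand-in for the model, returning log P(next token | x, prefix) for each candidate token.

```python
def seq2seq_beam_search(step_logprobs, beam_size, eos, max_len=50):
    beams = [([], 0.0)]                      # (token prefix, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # extend every live hypothesis by every candidate next token
            for token, logp in step_logprobs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # keep only the beam_size highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                        # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])
```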
Lattices
- “Lattices” are useful when more than one hypothesis is
desired from a recognition pass
- A lattice is a weighted, directed acyclic graph which encodes a large
number of ASR hypotheses, weighted by acoustic model + language model
scores specific to a given utterance
Lattice Generation
- Say we want to decode an utterance, U, of T frames.
- Construct a sausage acceptor for this utterance, X, with T+1
states and arcs for each context-dependent HMM state at each time-step
- Search the following composed machine for the best word
sequence corresponding to U: D = X ○ HCLG
Lattice Generation
- For all practical applications, we have to use beam pruning over D so that
only a subset of the states/arcs in D are visited. Call the resulting pruned
machine B.
- The word lattice, say L, is a further pruned version of B defined by a
lattice beam β. L satisfies the following requirements:
- L should have a path for every word sequence within β of the best-
scoring path in B
- All scores and alignments in L correspond to actual paths through
B
- L does not contain duplicate paths with the same word sequence
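As a sketch of how multiple hypotheses can be read out of such a lattice, the snippet below assumes the lattice is a word-lattice acceptor loaded as a pynini (OpenFst) FST; pynini.shortestpath with nshortest > 1 and unique=True keeps only distinct word sequences.

```python
import pynini

def nbest_from_lattice(lattice, n=10):
    # `lattice` is assumed to be an acyclic word-lattice acceptor produced by
    # the pruned decoding described above (hypothetical variable).
    nbest = pynini.shortestpath(lattice, nshortest=n, unique=True)
    # Walk over the paths of the n-best FST, collecting word strings and costs.
    results = []
    it = nbest.paths()
    while not it.done():
        results.append((it.ostring(), it.weight()))
        it.next()
    return results
```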