

SLIDE 1

Search and Decoding

Lecture 16, CS 753

Instructor: Preethi Jyothi

SLIDE 2

Recall Viterbi search

  • Viterbi search finds the most probable path through a trellis, with time on the X-axis and states on the Y-axis
  • The Viterbi algorithm only needs to maintain information about the most probable path at each state

[Figure: Viterbi trellis for the ice-cream HMM with hidden states H (hot) and C (cold), observation sequence 3 1 3, and states q0 (start), q1 = C, q2 = H, qF (end). Arc scores multiply a transition and an emission probability, e.g. P(H|start)·P(3|H) = .8 × .4 and P(C|start)·P(3|C) = .2 × .1. The resulting Viterbi values are v1(2) = .32, v1(1) = .02, v2(2) = max(.32 × .12, .02 × .08) = .038, and v2(1) = max(.32 × .15, .02 × .25) = .048.]
Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
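
The recursion in this figure translates directly into code. Below is a minimal Viterbi sketch in Python (an illustration, not the lecture's code), hard-coding the initial, transition, and emission probabilities of the two-state ice-cream example from the figure:

states = ["H", "C"]
init = {"H": 0.8, "C": 0.2}                # P(state | start)
trans = {"H": {"H": 0.6, "C": 0.3},        # P(next state | current state)
         "C": {"H": 0.4, "C": 0.5}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4},     # P(observation | state)
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    # v[t][s]: probability of the most probable path ending in state s at time t
    v = [{s: init[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Keep only the best incoming path per state (the Viterbi property).
            scores = {p: v[-1][p] * trans[p][s] for p in states}
            best_prev = max(scores, key=scores.get)
            col[s], ptr[s] = scores[best_prev] * emit[s][o], best_prev
        v.append(col)
        back.append(ptr)
    # Trace the most probable state sequence back from the best final state.
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))  # reproduces v1(H) = .32, v2(H) = .038, v2(C) = .048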

SLIDE 3

ASR Search Network

[Figure: the ASR search network at three levels: a network of words (e.g. "the birds are", "the boy is walking"), expanded into a network of phones (d ax b …), expanded into a network of HMM states.]

SLIDE 4

Time-state trellis

[Figure: time-state trellis with the HMM states of word1, word2, word3 stacked on the Y-axis and time t on the X-axis.]

SLIDE 5

Viterbi search over the large trellis

  • Exact search is infeasible for large-vocabulary tasks:
  • Word boundaries are unknown
  • N-gram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren’t promising


SLIDE 7

Two main WFST Optimizations

  • Use determinization to reduce/eliminate redundancy

Recall that not all weighted transducers are determinizable. To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:

read : r eh d #1
red : r eh d #2

Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃.
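
As a concrete illustration of the disambiguation step (a sketch under an assumed list-of-pairs lexicon format, not the lecture's code), the following Python appends #1, #2, … to pronunciations shared by several words:

from collections import defaultdict

def add_disambig(lexicon):
    # lexicon: list of (word, pronunciation-tuple) pairs.
    by_pron = defaultdict(list)
    for word, pron in lexicon:
        by_pron[pron].append(word)
    out = []
    for word, pron in lexicon:
        if len(by_pron[pron]) > 1:
            # Homophones: give each word a distinct disambiguation symbol
            # so that the lexicon transducer L becomes determinizable.
            idx = by_pron[pron].index(word) + 1
            out.append((word, pron + ("#%d" % idx,)))
        else:
            out.append((word, pron))
    return out

lex = [("read", ("r", "eh", "d")), ("red", ("r", "eh", "d"))]
print(add_disambig(lex))
# [('read', ('r', 'eh', 'd', '#1')), ('red', ('r', 'eh', 'd', '#2'))]

Note that toolkits such as Kaldi also disambiguate pronunciations that are prefixes of other pronunciations; this sketch handles only exact homophones.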

SLIDE 8
Two main WFST Optimizations

  • Use determinization to reduce/eliminate redundancy
  • Use minimization to reduce space requirements

Minimization ensures that the final composed machine has the minimum number of states. The final optimization cascade is

N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))

where πε replaces the disambiguation symbols in the input alphabet of H̃ with ε.
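
The cascade reads naturally as a small build function. In the Python sketch below, compose, determinize, minimize, and remove_disambig are hypothetical stand-ins for the corresponding WFST toolkit routines (e.g. OpenFst-style operations), passed in as arguments:

def build_decoding_graph(H, C, L, G, compose, determinize, minimize, remove_disambig):
    # N = pi_eps(min(det(H~ o det(C~ o det(L~ o G)))))
    LG = determinize(compose(L, G))      # det(L~ o G)
    CLG = determinize(compose(C, LG))    # det(C~ o det(L~ o G))
    HCLG = determinize(compose(H, CLG))  # det(H~ o det(C~ o det(L~ o G)))
    # remove_disambig is the pi_eps step: it replaces the disambiguation
    # symbols #0, #1, ... in the input alphabet with epsilon.
    return remove_disambig(minimize(HCLG))

Determinizing after each composition keeps the intermediate machines small, which is why the det calls are nested rather than applied once at the end.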

SLIDE 9

Example G

[Figure: example grammar G, a two-state acceptor: state 1 has arcs bob, bond, rob to state 2; state 2 has arcs slept, read, ate.]

SLIDE 10

Example L̃: Lexicon with disambiguation symbols

[Figure: lexicon transducer L̃ with one path per word; the word label is output on the first arc and ε on the rest: bob = b aa b #0, bond = b aa n d, rob = r aa b, slept = s l eh p t, read = r eh d, ate = ey t, with ε arcs closing each path back to the start state.]
SLIDE 11

L̃ ○ G

[Figure: L̃ ○ G. Word outputs appear on the first arc of each path (b:bob, b:bond, r:rob out of the start state), so the shared initial b of bob and bond is duplicated; the second-word paths (s:slept, r:read, ey:ate) follow.]

det(L̃ ○ G)

[Figure: det(L̃ ○ G). Determinization merges the shared initial b arc (now b:ε); the outputs bob and bond are emitted only where the paths diverge (arcs b:bob and n:bond).]

SLIDE 12

min(det(L̃ ○ G))

[Figure: min(det(L̃ ○ G)). Minimization additionally merges equivalent suffix states (e.g. the shared final t and d arcs), shrinking the machine from 20 to 15 states.]

det(L̃ ○ G)

[Figure: det(L̃ ○ G), repeated from the previous slide for comparison.]

SLIDE 13

1st-pass recognition networks (40K vocabulary): recognition speeds for systems with an accuracy of 83%

transducer               × real-time
C ○ L ○ G                   12.5
C ○ det(L ○ G)               1.2
det(H ○ C ○ L ○ G)           1.0
push(min(F))                 0.7

SLIDE 14

Static and dynamic networks

  • What we’ve seen so far: Static decoding graph
  • H ○ C ○ L ○ G
  • Determinize/minimize to make this graph more compact
  • Another approach: Dynamic graph expansion
  • Dynamically build the graph with active states on the fly
  • Do on-the-fly composition with the language model G
  • (H ○ C ○ L) ○ G, composed lazily as sketched below
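
To make the dynamic approach concrete, here is a sketch (not the lecture's code) of on-the-fly composition: composed states are pairs (s1, s2), and their outgoing arcs are expanded only when the decoder reaches them. The .arcs interface is an assumption, and the ε-filter subtleties of full WFST composition are ignored:

def lazy_compose_arcs(hcl, g, state):
    # state is a pair (s1, s2) of an (H o C o L) state and a G state.
    # hcl and g are assumed to expose .arcs(s) yielding tuples
    # (ilabel, olabel, weight, nextstate).
    s1, s2 = state
    for i1, o1, w1, n1 in hcl.arcs(s1):
        if o1 == "<eps>":
            # No word emitted yet: advance in H o C o L only.
            yield (i1, o1, w1, (n1, s2))
        else:
            for i2, o2, w2, n2 in g.arcs(s2):
                if i2 == o1:  # match the emitted word against G's input
                    yield (i1, o2, w1 + w2, (n1, n2))  # costs add (log semiring)

Only the composed states that beam search actually visits are ever expanded, so the full (H ○ C ○ L) ○ G graph never needs to be built or stored.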
SLIDE 15

Viterbi search over the large trellis

  • Exact search is infeasible for large-vocabulary tasks:
  • Word boundaries are unknown
  • N-gram language models greatly increase the search space
  • Solutions:
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren’t promising

SLIDE 16

Beam pruning

  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
  • Given active nodes from the last time-step:
  • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
  • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis

SLIDE 17

Viterbi beam search decoder

  • Time-synchronous search algorithm:
  • For time t, each state is updated by the best score from all states at time t−1
  • Beam search prunes unpromising states at every time step:
  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the score of the best hypothesis.
SLIDE 18

Beam search algorithm

Initialization: current states := initial state
while (current states do not contain the goal state) do:
    successor states := NEXT(current states), where NEXT is the next-state function
    score the successor states
    set current states to a pruned set of successor states, using beam width δ:
    only retain those successor states that are within δ times the best path weight
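
A minimal runnable version of this loop (a sketch under assumed interfaces, not the lecture's implementation): next_states and score stand in for the decoder's transition and path-scoring functions, scores are negative log probabilities, so the multiplicative beam above becomes an additive margin over the best cost:

def beam_search(initial_state, is_goal, next_states, score, delta=8.0):
    current = [initial_state]
    while not any(is_goal(s) for s in current):
        successors = [t for s in current for t in next_states(s)]
        if not successors:
            return None  # search space exhausted without reaching the goal
        best = min(score(s) for s in successors)
        # Beam pruning: drop any hypothesis scoring worse than best + delta.
        current = [s for s in successors if score(s) <= best + delta]
    return min((s for s in current if is_goal(s)), key=score)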

SLIDE 19

Beam search over the decoding graph

[Figure: beam search over the decoding graph against observations O1 O2 O3 … OT. Many arcs (x1:the, x2:a, …, x200:the) leave each active state; with δ = 2, only the two best-scoring hypotheses survive each step.]

Score of an arc: −log P(O1|x1) + graph cost

SLIDE 20

Beam search in a seq2seq model

[Figure: attention-based encoder-decoder. For decoder state si, attention weights αi1, …, αij, …, αiM over encoder states h1, …, hj, …, hM give the context vector ci = Σj αij hj. With δ = 3, the three best first tokens ŷ1 (here a, e, u) are kept, and each is extended by scoring P(ŷ2|x, "a"), P(ŷ2|x, "e"), P(ŷ2|x, "u").]
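
A sketch of the same procedure for a seq2seq decoder (the step function and end-of-sequence convention are assumptions, not the lecture's code): step(prefix) returns log P(token | x, prefix) for each token, and the δ best partial hypotheses are kept and extended at every step:

def seq2seq_beam_search(step, eos, beam=3, max_len=50):
    beams = [(0.0, [])]       # (accumulated log-probability, token list)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for tok, tok_logp in step(prefix).items():
                candidates.append((logp + tok_logp, prefix + [tok]))
        # Keep only the `beam` best partial hypotheses.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = []
        for logp, prefix in candidates[:beam]:
            (finished if prefix[-1] == eos else beams).append((logp, prefix))
        if not beams:
            break  # every surviving hypothesis has ended with eos
    return max(finished + beams, key=lambda h: h[0])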

SLIDE 21

Lattices

  • “Lattices” are useful when more than one hypothesis is desired from a recognition pass
  • A lattice is a weighted, directed acyclic graph which encodes a large number of ASR hypotheses, weighted by acoustic model + language model scores specific to a given utterance

SLIDE 22

Lattice Generation

  • Say we want to decode an utterance, U, of T frames.
  • Construct a sausage acceptor for this utterance, X, with T+1 states and arcs for each context-dependent HMM state at each time-step
  • Search the following composed machine for the best word sequence corresponding to U:

D = X ○ HCLG
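
A sketch of the sausage acceptor construction (the input format is an assumption): between each pair of consecutive states t and t+1, X has one arc per context-dependent HMM state, weighted by that state's acoustic cost at frame t:

def build_sausage_acceptor(acoustic_costs):
    # acoustic_costs: T x S matrix; acoustic_costs[t][s] is the negative
    # log-likelihood of context-dependent HMM state s at frame t.
    # Returns arcs (src, dst, label, weight) over states 0 .. T.
    arcs = []
    for t, frame in enumerate(acoustic_costs):
        for s, cost in enumerate(frame):
            arcs.append((t, t + 1, s, cost))
    return arcs  # compose X with HCLG and search D = X o HCLG for the best path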

SLIDE 23

Lattice Generation

  • For all practical applications, we have to use beam pruning over D so that only a subset of states/arcs in D is visited. Call the resulting pruned machine B.
  • A word lattice, say L, is a further pruned version of B defined by a lattice beam, β. L satisfies the following requirements:
  • L should have a path for every word sequence within β of the best-scoring path in B
  • All scores and alignments in L correspond to actual paths through B
  • L does not contain duplicate paths with the same word sequence