Sequence to Sequence models: Connectionist Temporal Classification


SLIDE 1

Deep Learning
Sequence to Sequence models: Connectionist Temporal Classification
5 March 2018

SLIDE 2

Sequence-to-sequence modelling

  • Problem:
    – A sequence X_1 … X_N goes in
    – A different sequence Y_1 … Y_M comes out
  • E.g.
    – Speech recognition: Speech goes in, a word sequence comes out
      • Alternately the output may be a phoneme or character sequence
    – Machine translation: Word sequence goes in, word sequence comes out
  • In general N ≠ M
    – No synchrony between X and Y.

SLIDE 3

Sequence to sequence

  • Sequence goes in, sequence comes out
  • No notion of "synchrony" between input and output
    – May not even have a notion of "alignment"
  • E.g. "I ate an apple" → "Ich habe einen apfel gegessen"

[Figure: two seq2seq boxes, one mapping "I ate an apple" to "Ich habe einen apfel gegessen" and one mapping it back]

SLIDE 4

Case 1: With alignment

  • The input and output sequences happen in the same order
    – Although they may be asynchronous
    – E.g. Speech recognition
      • The input speech corresponds to the phoneme sequence output

[Figure: recurrent network unrolled over time, inputs X(t) and outputs Y(t)]

SLIDE 5

Variants on recurrent nets

  • 1: Conventional MLP
  • 2: Sequence generation, e.g. image to caption
  • 3: Sequence based prediction or classification, e.g. speech recognition, text classification

Images from Karpathy

SLIDE 6

Basic model

  • Sequence of inputs produces a single output

[Figure: inputs X_0 X_1 X_2 feed a recurrent net that outputs /AH/ at the end]

SLIDE 7

Training

  • The divergence is only defined at the final input:
    DIV(Y_target, Y) = Xent(Y(T), Phoneme)
  • This divergence must propagate through the net to update all parameters
  • Ignores outputs at intermediate steps

[Figure: inputs X_0 X_1 X_2; Div computed only between the final output Y(2) and the target /AH/]

SLIDE 8

Training

  • Exploiting the untagged inputs: assume the same output for the entire input
  • Define the divergence everywhere:
    DIV(Y_target, Y) = Σ_t w_t Xent(Y(t), Phoneme)

[Figure: Div now computed at every time step; fix: use the intermediate outputs too — they too must ideally point to the correct phoneme /AH/]

SLIDE 9

Training

  • Define the divergence everywhere:
    DIV(Y_target, Y) = Σ_t w_t Xent(Y(t), Phoneme)
  • Typical weighting scheme for speech: all are equally important
  • Problem like question answering: answer only expected after the question ends
    – Only w_T is high, other weights are 0 or low

[Figure: question-answering example — input "Color of sky", Div computed only on the final answer "Blue"]

SLIDE 10

Training

  • Define the divergence everywhere:
    DIV(Y_target, Y) = Σ_t w_t Xent(Y(t), Phoneme)
  • Typical weighting scheme for speech: all are equally important
  • Problem like question answering: answer only expected after the question ends
    – Only w_T is high, other weights are 0 or low
  • We will initially focus on the class of problems where uniform weights are reasonable (e.g. speech recognition)

SLIDE 11

The more complex problem

  • Objective: Given a sequence of inputs, asynchronously output a sequence of symbols
    – This is just a simple concatenation of many copies of the simple "output at the end of the input sequence" model we just saw
  • But this simple extension complicates matters..

[Figure: inputs X_0 … X_9 with outputs /B/, /IY/, /F/, /IY/ emitted at different points along the way]

SLIDE 12

The sequence-to-sequence problem

  • How do we know when to output symbols?
    – In fact, the network produces outputs at every time
    – Which of these are the real outputs?

[Figure: inputs X_0 … X_9; outputs /B/ /IY/ /F/ /IY/ at unknown positions]

SLIDE 13

The actual output of the network

  • At each time the network outputs a probability for each output symbol given all inputs until that time
    – E.g. y_4^{/D/} = prob(s_4 = /D/ | X_0 … X_4)

[Figure: table of per-frame probabilities y_t^S for t = 0 … 8 and each symbol S ∈ {/AH/, /B/, /D/, /EH/, /IY/, /F/, /G/}]

SLIDE 14

Overall objective

  • Find the most likely symbol sequence given the inputs:
    S_0 … S_{K−1} = argmax over S'_0 … S'_{K−1} of prob(S'_0 … S'_{K−1} | X_0 … X_{N−1})

[Figure: the same table of per-frame symbol probabilities y_t^S]

SLIDE 15

Finding the best output

  • Option 1: Simply select the most probable symbol at each time

[Figure: the per-frame probability table, with the most probable symbol highlighted in each column]

SLIDE 16

Finding the best output

  • Option 1: Simply select the most probable symbol at each time
    – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant

[Figure: the greedy per-frame picks collapse to the sequence /G/ /F/ /IY/ /D/]
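Option 1 fits in a few lines. A minimal sketch, not from the slides: the function name, the toy probability table, and the symbol list are all invented for illustration. It picks the argmax symbol per frame, merges adjacent repeats, and records the final instant of each run as the emission time.

```python
import numpy as np

def greedy_decode(probs, symbols):
    """Pick the most probable symbol at each time step, then merge
    adjacent repeats; the emission time of each symbol is the final
    instant of its run. `probs` is a (T, S) array of per-frame
    symbol probabilities."""
    picks = probs.argmax(axis=1)            # most probable symbol index per frame
    decoded, times = [], []
    for t, k in enumerate(picks):
        if decoded and decoded[-1] == symbols[k]:
            times[-1] = t                   # extend the run: emission moves to the final instant
        else:
            decoded.append(symbols[k])
            times.append(t)
    return decoded, times

# Toy 4-frame, 3-symbol table (each row is a distribution over symbols)
symbols = ["/B/", "/IY/", "/F/"]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
print(greedy_decode(probs, symbols))  # (['/B/', '/IY/', '/F/'], [1, 2, 3])
```

As the next slides point out, this merging step is exactly where the approach breaks down: a long /F/ and two consecutive /F/s produce the same collapsed output.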

SLIDE 17

The actual output of the network

  • Option 1: Simply select the most probable symbol at each time
    – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
  • Problem: Cannot distinguish between an extended symbol and repetitions of the symbol (e.g. /F/)

[Figure: the greedy decode /G/ /F/ /IY/ /D/]

SLIDE 18

The actual output of the network

  • Option 1: Simply select the most probable symbol at each time
    – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
  • Cannot distinguish between an extended symbol and repetitions of the symbol
  • Resulting sequence may be meaningless (what word is "GFIYD"?)

SLIDE 19

The actual output of the network

  • Option 2: Impose external constraints on what sequences are allowed
    – E.g. only allow sequences corresponding to dictionary words
    – E.g. sub-symbol units (like in HW1 – what were they?)

SLIDE 20

The actual output of the network

  • Option 2: Impose external constraints on what sequences are allowed
    – E.g. only allow sequences corresponding to dictionary words
    – E.g. sub-symbol units (like in HW1 – what were they?)
  • We will refer to the process of obtaining an output from the network as decoding

SLIDE 21

The sequence-to-sequence problem

  • How do we know when to output symbols?
    – In fact, the network produces outputs at every time
    – Which of these are the real outputs?
  • How do we train these models?

[Figure: inputs X_0 … X_9 with outputs /B/ /IY/ /F/ /IY/]

SLIDE 22

Training

  • Given output symbols at the right locations
    – The phoneme /B/ ends at X_2, /IY/ at X_4, /F/ at X_6, /IY/ at X_9

SLIDE 23

  • Either just define the divergence as:
    DIV = Xent(Y_2, /B/) + Xent(Y_4, /IY/) + Xent(Y_6, /F/) + Xent(Y_9, /IY/)
  • Or..

[Figure: Div computed at X_2, X_4, X_6, X_9 against /B/, /IY/, /F/, /IY/]

SLIDE 24

  • Either just define the divergence as:
    DIV = Xent(Y_2, /B/) + Xent(Y_4, /IY/) + Xent(Y_6, /F/) + Xent(Y_9, /IY/)
  • Or repeat the symbols over their duration:
    DIV = Σ_t Xent(Y_t, symbol_t) = − Σ_t log Y(t, symbol_t)

[Figure: Div computed at every time step against the repeated symbols]

SLIDE 25

DIV = Σ_t Xent(Y_t, symbol_t) = − Σ_t log Y(t, symbol_t)

  • The gradient w.r.t. the t-th output vector Y_t:
    ∇_{Y_t} DIV = [0 … −1/Y(t, symbol_t) … 0]
    – Zeros except at the component corresponding to the target
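This divergence and its gradient are a one-liner each once a frame-level alignment is given. A minimal numpy sketch (the function name and the toy numbers are invented for illustration):

```python
import numpy as np

def div_and_grad(Y, targets):
    """DIV = -sum_t log Y[t, target_t], and its gradient w.r.t. the
    output vectors: zero everywhere except -1/Y[t, target_t] at the
    target component of each frame."""
    t_idx = np.arange(Y.shape[0])
    picked = Y[t_idx, targets]              # Y(t, symbol_t) for every frame
    div = -np.log(picked).sum()
    grad = np.zeros_like(Y)                 # zeros except at the target components
    grad[t_idx, targets] = -1.0 / picked
    return div, grad

# Three frames, three symbols; frame-level target indices 0, 0, 1
Y = np.array([[0.8, 0.1, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.7, 0.1]])
div, grad = div_and_grad(Y, np.array([0, 0, 1]))
```

The sparse gradient is what gets backpropagated through the recurrent network at every frame.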

SLIDE 26

Problem: No timing information provided

  • Only the sequence of output symbols is provided for the training data
    – But no indication of which one occurs where
  • How do we compute the divergence?
    – And how do we compute its gradient w.r.t. Y_t?

[Figure: inputs X_0 … X_9; target sequence /B/ /IY/ /F/ /IY/, positions unknown]

SLIDE 27

Solution 1: Guess the alignment

  • Initialize: Assign an initial alignment
    – Either randomly, based on some heuristic, or any other rationale
  • Iterate:
    – Train the network using the current alignment
    – Re-estimate the alignment for each training instance
      • Using the decoding methods already discussed

[Figure: guessed alignment /B/ /B/ /IY/ /IY/ /IY/ /F/ /F/ /F/ /F/ /IY/ over inputs X_0 … X_9]


SLIDE 29

Estimating an alignment

  • Given:
    – The unaligned K-length symbol sequence S = S_0 … S_{K−1} (e.g. /B/ /IY/ /F/ /IY/)
    – An N-length input (N ≥ K)
    – And a (trained) recurrent network
  • Find:
    – An N-length expansion s_0 … s_{N−1} comprising the symbols in S in strict order
      • e.g. S_0 S_1 S_1 S_2 S_3 S_3 … S_{K−1}
      • i.e. s_0 = S_0, s_1 = S_1, s_2 = S_1, s_3 = S_2, s_4 = S_3, s_5 = S_3, …, s_{N−1} = S_{K−1}
    – E.g. /B/ /B/ /IY/ /IY/ /IY/ /F/ /F/ /F/ /F/ /IY/ ..
    – Formally: s_i = S_k ⇒ i ≥ k, and s_i = S_k, s_j = S_l, i < j ⇒ k ≤ l
  • Outcome: an alignment of the target symbol sequence S_0 … S_{K−1} to the input X_0 … X_{N−1}
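The order constraints above are easy to check mechanically. A small sketch (the function name is invented; it assumes, as in the slide's example, that the target has no adjacent repeated symbols, so an expansion is valid exactly when collapsing its runs recovers the target):

```python
def is_valid_expansion(frames, target):
    """True if `frames` comprises the symbols of `target` in strict
    order, each repeated one or more times. Assumes `target` has no
    adjacent repeated symbols (true of /B/ /IY/ /F/ /IY/)."""
    # Collapse runs of identical adjacent symbols to their single label
    collapsed = [s for i, s in enumerate(frames) if i == 0 or s != frames[i - 1]]
    return collapsed == list(target)

target = ["/B/", "/IY/", "/F/", "/IY/"]
print(is_valid_expansion(
    ["/B/", "/B/", "/IY/", "/IY/", "/IY/", "/F/", "/F/", "/F/", "/F/", "/IY/"],
    target))  # True
```

The next slides build a decoding grid that produces only sequences satisfying this check by construction.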

SLIDE 30

Recall: The actual output of the network

  • At each time the network outputs a probability for each output symbol

[Figure: the table of per-frame symbol probabilities y_t^S for /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/]

SLIDE 31

Recall: unconstrained decoding

  • We find the most likely sequence of symbols
    – (Conditioned on input X_0 … X_{N−1})
  • This may not correspond to an expansion of the desired symbol sequence
    – E.g. the unconstrained decode may be /AH//AH//AH//D//D//AH//F//IY//IY/
      • Contracts to /AH/ /D/ /AH/ /F/ /IY/
    – Whereas we want an expansion of /B//IY//F//IY/

SLIDE 32

Constraining the alignment: Try 1

  • Block out all rows that do not include symbols from the target sequence
    – E.g. Block out rows that are not /B/, /IY/ or /F/

[Figure: the probability table with all rows other than /B/, /IY/, /F/ greyed out]

SLIDE 33

Blocking out unnecessary outputs

  • Compute the entire output (for all symbols)
  • Copy the output values for the target symbols into the secondary reduced structure

[Figure: reduced table with rows /B/ (y_t^{/B/}), /IY/ (y_t^{/IY/}), /F/ (y_t^{/F/}) for t = 0 … 8]

SLIDE 34

Constraining the alignment: Try 1

  • Only decode on the reduced grid
    – We are now assured that only the appropriate symbols will be hypothesized

SLIDE 35

Constraining the alignment: Try 1

  • Only decode on the reduced grid
    – We are now assured that only the appropriate symbols will be hypothesized
  • Problem: This still doesn't assure that the decoded sequence correctly expands the target symbol sequence
    – E.g. the above decode is not an expansion of /B//IY//F//IY/
  • Still needs additional constraints

SLIDE 36

Try 2: Explicitly arrange the constructed table

  • Arrange the constructed table so that from top to bottom it has the exact sequence of symbols required

[Figure: rows ordered /B/, /IY/, /F/, /IY/, each row holding y_t for that symbol at t = 0 … 8]

SLIDE 37

Try 2: Explicitly arrange the constructed table

  • Arrange the constructed table so that from top to bottom it has the exact sequence of symbols required
  • Note: If a symbol occurs multiple times, we repeat the row in the appropriate location. E.g. the row for /IY/ occurs twice, in the 2nd and 4th positions.
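Constructing the ordered table is a single indexing operation. A sketch with invented names (`sym_index`, and random stand-in values in place of real network outputs); note how the repeated /IY/ target simply duplicates that row:

```python
import numpy as np

# Full output table: one row per symbol, one column per time step,
# rows in the order /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/ as on the slides
sym_index = {"/AH/": 0, "/B/": 1, "/D/": 2, "/EH/": 3, "/IY/": 4, "/F/": 5, "/G/": 6}
rng = np.random.default_rng(0)
probs = rng.random((7, 9))                 # stand-in for the network outputs y_t^S

target = ["/B/", "/IY/", "/F/", "/IY/"]
# (4, 9) table whose rows, top to bottom, are exactly the target sequence;
# the /IY/ row appears twice, in the 2nd and 4th positions
ordered = probs[[sym_index[s] for s in target]]
```

Decoding on this ordered table, with the monotonicity constraint of the next slide, guarantees an expansion of the target.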

SLIDE 38

Explicitly constrain alignment

  • Constrain that the first symbol in the decode must be the top left block
  • The last symbol must be the bottom right
  • The rest of the symbols must follow a sequence that monotonically travels down from top left to bottom right
    – I.e. never goes up
  • This guarantees that the sequence is an expansion of the target sequence
    – /B/ /IY/ /F/ /IY/ in this case

SLIDE 39

Explicitly constrain alignment

  • Compose a graph such that every path in the graph from source to sink represents a valid alignment
    – Which maps on to the target symbol sequence (/B//IY//F//IY/)
  • Edge scores are 1
  • Node scores are the probabilities assigned to the symbols by the neural network
  • The "score" of a path is the product of the probabilities of all nodes along the path
  • Find the most probable path from source to sink using any dynamic programming algorithm
    – E.g. The Viterbi algorithm

[Figure: the ordered table /B/ /IY/ /F/ /IY/ drawn as a trellis with monotonic down-right edges]

SLIDE 40

Viterbi algorithm

  • At each node, keep track of
    – The best incoming edge
    – The score of the best path from the source to the node
  • Dynamically compute the best path from source to sink

SLIDE 41

Viterbi algorithm

  • First, some notation:
  • y_t^{S(r)} is the probability of the target symbol assigned to the r-th row at the t-th time (given inputs X_0 … X_t)
    – E.g. S(0) = /B/: the scores in the 0th row have the form y_t^{/B/}
    – E.g. S(1) = S(3) = /IY/: the scores in the 1st and 3rd rows have the form y_t^{/IY/}
    – E.g. S(2) = /F/: the scores in the 2nd row have the form y_t^{/F/}

SLIDE 42

Viterbi algorithm

  • Initialization:
    BP(0, r) = null, r = 0 … K−1
    Bscr(0, 0) = y_0^{S(0)};  Bscr(0, r) = −∞, r = 1 … K−1
  • for t = 1 … T−1:
    BP(t, 0) = 0;  Bscr(t, 0) = Bscr(t−1, 0) · y_t^{S(0)}
    for r = 1 … K−1:
      • BP(t, r) = r−1 if Bscr(t−1, r−1) > Bscr(t−1, r), else r
      • Bscr(t, r) = Bscr(t−1, BP(t, r)) · y_t^{S(r)}

(BP holds the best incoming row for each node and Bscr the score of the best path to it; row 0 is handled separately, so the inner loop starts at r = 1.)

SLIDE 43

Viterbi algorithm

  • Initialization:
    BP(0, r) = null, r = 0 … K−1
    Bscr(0, 0) = y_0^{S(0)};  Bscr(0, r) = −∞, r = 1 … K−1
  • for t = 1 … T−1:
    BP(t, 0) = 0;  Bscr(t, 0) = Bscr(t−1, 0) · y_t^{S(0)}
    for r = 1 … K−1:
      • BP(t, r) = r−1 if Bscr(t−1, r−1) > Bscr(t−1, r), else r
      • Bscr(t, r) = Bscr(t−1, BP(t, r)) · y_t^{S(r)}


SLIDE 52

Viterbi algorithm

  • Backtrace from the bottom-right node:
    s(T−1) = K−1 (the last row, i.e. symbol S(K−1))
    for t = T−1 downto 1:
      s(t−1) = BP(t, s(t))

SLIDE 53

Viterbi algorithm

  • Backtrace from the bottom-right node:
    s(T−1) = K−1 (the last row, i.e. symbol S(K−1))
    for t = T−1 downto 1:
      s(t−1) = BP(t, s(t))

SLIDE 54

Viterbi algorithm

  • Backtrace from the bottom-right node:
    s(T−1) = K−1 (the last row, i.e. symbol S(K−1))
    for t = T−1 downto 1:
      s(t−1) = BP(t, s(t))
  • Resulting alignment: /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/
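The recursion and backtrace of slides 42–54 fit in one small function. A sketch under stated assumptions: `viterbi_align` and the toy table are invented for illustration, and it works in log-probabilities rather than the slides' products (the argmax path is the same, but products underflow for long inputs):

```python
import numpy as np

def viterbi_align(y, row_syms):
    """y[r, t] = probability of the symbol on row r at time t (the
    slides' y_t^{S(r)}); rows are ordered top-to-bottom as the target
    sequence. Returns the best monotonic alignment, one symbol per frame."""
    K, T = y.shape
    logp = np.log(y)
    bscr = np.full((T, K), -np.inf)      # best log-score reaching each node
    bp = np.zeros((T, K), dtype=int)     # best predecessor row (backpointer)
    bscr[0, 0] = logp[0, 0]              # paths must start at the top-left node
    for t in range(1, T):
        bscr[t, 0] = bscr[t - 1, 0] + logp[0, t]
        for r in range(1, K):
            bp[t, r] = r - 1 if bscr[t - 1, r - 1] > bscr[t - 1, r] else r
            bscr[t, r] = bscr[t - 1, bp[t, r]] + logp[r, t]
    rows = [K - 1]                       # backtrace from the bottom-right node
    for t in range(T - 1, 0, -1):
        rows.append(bp[t, rows[-1]])
    rows.reverse()
    return [row_syms[r] for r in rows]

row_syms = ["/B/", "/IY/", "/F/"]
y = np.array([[0.90, 0.60, 0.10, 0.10],
              [0.05, 0.30, 0.80, 0.20],
              [0.05, 0.10, 0.10, 0.70]])
print(viterbi_align(y, row_syms))        # ['/B/', '/B/', '/IY/', '/F/']
```

Rows that cannot yet be reached (r > t) keep score −∞ and are never chosen, which mirrors the slides' initialization.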

SLIDE 55

Gradients from the alignment

DIV = Σ_t Xent(Y_t, symbol_t^bestpath) = − Σ_t log Y(t, symbol_t^bestpath)

  • The gradient w.r.t. the t-th output vector Y_t:
    ∇_{Y_t} DIV = [0 … −1/Y(t, symbol_t^bestpath) … 0]
    – Zeros except at the component corresponding to the target in the estimated alignment

[Figure: the best-path alignment /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/ on the trellis]

SLIDE 56

Iterative estimate and training

  • Initialize alignments
  • Iterate:
    – Decode to obtain alignments
    – Train model with the given alignments
  • The "decode" and "train" steps may be combined into a single "decode, find alignment, compute derivatives" step for SGD and mini-batch updates

SLIDE 57

Iterative update

  • Option 1:
    – Determine alignments for every training instance
    – Train model (using SGD or your favorite approach) on the entire training set
    – Iterate
  • Option 2:
    – During SGD, for each training instance, find the alignment during the forward pass
    – Use it in the backward pass

SLIDE 58

Iterative update: Problem

  • Approach heavily dependent on initial alignment
  • Prone to poor local optima
  • Alternate solution: Do not commit to an alignment during any pass..

SLIDE 59

The reason for suboptimality

  • We commit to the single "best" estimated alignment
    – The most likely alignment:
      DIV = − Σ_t log Y(t, symbol_t^bestpath)
    – This can be way off, particularly in early iterations, or if the model is poorly initialized
  • Alternate view: there is a probability distribution over alignments
    – Selecting a single alignment is the same as drawing a single sample from this distribution
    – Selecting the most likely alignment is the same as deterministically always drawing the most probable value from the distribution

SLIDE 60

The reason for suboptimality

  • We commit to the single "best" estimated alignment
    – The most likely alignment:
      DIV = − Σ_t log Y(t, symbol_t^bestpath)
    – This can be way off, particularly in early iterations, or if the model is poorly initialized
  • Alternate view: there is a probability distribution over alignments of the target symbol sequence (to the input)
    – Selecting a single alignment is the same as drawing a single sample from it
    – Selecting the most likely alignment is the same as deterministically always drawing the most probable value from the distribution

slide-61
SLIDE 61
  • Instead of only selecting the most likely alignment, use the statistical expectation over all possible alignments

DIV = E[ −∑_t log Y(t, s_t) ]

– Use the entire distribution of alignments
– This will mitigate the issue of suboptimal selection of alignment

61

Averaging over all alignments

[Figure: the alignment trellis for /IY/ /B/ /F/ /IY/ over t = 1 … 8, considering all valid paths rather than a single one]

slide-62
SLIDE 62

DIV = E[ −∑_t log Y(t, s_t) ]

  • Using the linearity of expectation

DIV = −∑_t E[ log Y(t, s_t) ]

– This reduces to finding the expected divergence at each input

DIV = −∑_t ∑_{S ∈ S_1…S_K} P(s_t = S | S, X) log Y(t, s_t = S)

62
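The double sum on this slide is just a posterior-weighted cross-entropy at each time step. A minimal numpy sketch with made-up values — the arrays `y` and `post` below are illustrative stand-ins, not quantities from the lecture:

```python
import numpy as np

# Toy setup: T = 4 time steps, K = 3 target symbols.
# y[t, r]    : network probability of target symbol S(r) at time t (assumed values)
# post[t, r] : alignment posterior P(s_t = S_r | S, X) (assumed values)
rng = np.random.default_rng(0)
y = rng.uniform(0.1, 0.9, size=(4, 3))
post = rng.uniform(size=(4, 3))
post /= post.sum(axis=1, keepdims=True)   # each row is a distribution over symbols

# DIV = -sum_t sum_r P(s_t = S_r | S, X) * log y_t^{S(r)}
div = -np.sum(post * np.log(y))
```

The rest of the lecture is about how to actually obtain `post` via the forward-backward algorithm.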

The expectation over all alignments

[Figure: the trellis, with the posterior-weighted sum over symbols at each time]

slide-63
SLIDE 63


63

The expectation over all alignments

P(s_t = S | S, X) is the probability of seeing the specific symbol S at time t, given that the symbol sequence is an expansion of S = S_1 … S_K and given the input sequence X = X_0 … X_{T−1}. We need to be able to compute this.

[Figure: the trellis, with the term P(s_t = S | S, X) highlighted]

slide-64
SLIDE 64

P(s_t = S_r | S, X) ∝ P(s_t = S_r, S | X)

  • P(s_t = S_r, S | X) is the total probability of all valid paths in the graph for target sequence S that go through the symbol S_r (the r-th symbol in the sequence S_1 … S_K) at time t

  • We will compute this using the “forward-backward” algorithm

64

A posteriori probabilities of symbols

[Figure: the trellis for /IY/ /B/ /F/ /IY/; the quantity of interest is the total probability of all paths through node (t, r)]

slide-65
SLIDE 65
  • Decompose P(s_t = S_r, S | X) as follows:

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r, s_{t+1} … s_{T−1}, S | X)

  • [S_r+] indicates that s_{t+1} may be either S_r or S_{r+1}
  • [S_r−] indicates that s_{t−1} may be either S_r or S_{r−1}

= ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r, s_{t+1} … s_{T−1} | X)

– Because the target symbol sequence S is implicit in the time-synchronized sequences s_0 … s_{T−1}, which are constrained to be expansions of S

65


slide-66
SLIDE 66

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r, s_{t+1} … s_{T−1} | X)

= ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r | X) P(s_{t+1} … s_{T−1} | s_0 … s_{t−1}, s_t = S_r, X)

  • For a recurrent network without feedback from the output we can make the conditional independence assumption: P(s_{t+1} … | s_0 … s_t, X) = P(s_{t+1} … | X)

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r | X) P(s_{t+1} … s_{T−1} | s_t = S_r, X)

66


slide-67
SLIDE 67


67

Note: in reality, this assumption is not valid if the hidden states are unknown, but we will make it anyway


slide-68
SLIDE 68

Conditional independence

  • Dependency graph: the input sequence X = X_0 X_1 … X_{T−1} governs the hidden variables h = h_0 h_1 … h_{T−1}
  • The hidden variables govern the output predictions y_0, y_1, … y_{T−1} individually
  • y_0, y_1, … y_{T−1} are conditionally independent given h
  • Since h is deterministically derived from X, y_0, y_1, … y_{T−1} are also conditionally independent given X

– This wouldn’t be true if the relation between X and h were not deterministic, or if X were unknown

68

[Figure: dependency graph X = X_0 X_1 … X_{T−1} → h = h_0 h_1 … h_{T−1} → y_0, y_1, …, y_{T−1}]

slide-69
SLIDE 69

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]; s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_0 … s_{t−1}, s_t = S_r | X) P(s_{t+1} … s_{T−1} | X)

= ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1}, s_t = S_r | X) · ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

69


slide-70
SLIDE 70


70


slide-71
SLIDE 71

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1}, s_t = S_r | X) · ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

  • We will call the first term the forward probability α(t, r)
  • We will call the second term the backward probability β(t, r)

71


slide-72
SLIDE 72

α(t, r) = ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1}, s_t = S_r | X)

= ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1} | X) P(s_t = S_r | s_0 … s_{t−1}, X)

= ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1} | X) P(s_t = S_r | X)

= [ ∑_{s_0…s_{t−2} → S_1…[S_r−]} P(s_0 … s_{t−2}, s_{t−1} = S_r | X) + ∑_{s_0…s_{t−2} → S_1…[S_{r−1}−]} P(s_0 … s_{t−2}, s_{t−1} = S_{r−1} | X) ] P(s_t = S_r | X)

72

Forward algorithm

[Figure: the trellis; α(t, r) accumulates path probability from the nodes (t−1, r) and (t−1, r−1)]



slide-76
SLIDE 76

α(t, r) = [ ∑_{s_0…s_{t−2} → S_1…[S_r−]} P(s_0 … s_{t−2}, s_{t−1} = S_r | X) + ∑_{s_0…s_{t−2} → S_1…[S_{r−1}−]} P(s_0 … s_{t−2}, s_{t−1} = S_{r−1} | X) ] P(s_t = S_r | X)

giving the recursion

α(t, r) = (α(t−1, r) + α(t−1, r−1)) · y_t^{S(r)}

76


slide-77
SLIDE 77

α(t, r) = (α(t−1, r) + α(t−1, r−1)) · y_t^{S(r)}

77


slide-78
SLIDE 78

Forward algorithm

  • Initialization:

α(0, 1) = y_0^{S(1)},  α(0, r) = 0 for r > 1

  • for t = 1 … T−1:

α(t, 1) = α(t−1, 1) · y_t^{S(1)}

for r = 2 … K:
  • α(t, r) = (α(t−1, r) + α(t−1, r−1)) · y_t^{S(r)}

78

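The forward-algorithm pseudocode above translates almost line for line into Python. A sketch, using 0-based indices everywhere: `y[t, r]` is assumed to hold the network probability of the r-th target symbol at time t (the slide's S(1)…S(K) become columns 0…K−1):

```python
import numpy as np

def ctc_forward(y):
    """Forward (alpha) recursion.

    y: (T, K) array, y[t, r] = network probability of target symbol S(r) at time t.
    Returns alpha with alpha[t, r] = (alpha[t-1, r] + alpha[t-1, r-1]) * y[t, r].
    """
    T, K = y.shape
    alpha = np.zeros((T, K))
    alpha[0, 0] = y[0, 0]            # every path must start at the first symbol
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] * y[t, 0]
        for r in range(1, K):
            alpha[t, r] = (alpha[t - 1, r] + alpha[t - 1, r - 1]) * y[t, r]
    return alpha
```

As a sanity check, if all entries of `y` are 1, `alpha[t, r]` simply counts the monotone paths into node (t, r), i.e. Pascal's triangle.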

slide-84
SLIDE 84

In practice..

  • The recursion

α(t, r) = (α(t−1, r) + α(t−1, r−1)) · y_t^{S(r)}

will generally underflow

  • Instead we can do it in the log domain

log α(t, r) = log( e^{log α(t−1, r)} + e^{log α(t−1, r−1)} ) + log y_t^{S(r)}

– This can be computed entirely without underflow

84
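The log-domain sum log(eᵃ + eᵇ) is the standard log-sum-exp trick: factoring out the larger argument keeps every exponent non-positive, so nothing overflows or underflows. A minimal sketch:

```python
import numpy as np

def log_add(a, b):
    """log(e^a + e^b) computed stably (log-sum-exp with the max factored out)."""
    if a < b:
        a, b = b, a                  # ensure a >= b
    if b == -np.inf:                 # adding probability zero
        return a
    return a + np.log1p(np.exp(b - a))   # exp(b - a) <= 1, so this never overflows
```

The log-domain forward recursion is then `log_alpha[t, r] = log_add(log_alpha[t-1, r], log_alpha[t-1, r-1]) + log_y[t, r]`, with impossible cells initialized to `-inf` rather than 0.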

slide-85
SLIDE 85

Forward algorithm

  • Initialization:

α̂(0, 1) = 1,  α̂(0, r) = 0 for r > 1
α(0, r) = α̂(0, r) · y_0^{S(r)},  1 ≤ r ≤ K

  • for t = 1 … T−1:

α̂(t, 1) = α(t−1, 1)

for r = 2 … K:
  α̂(t, r) = α(t−1, r) + α(t−1, r−1)

α(t, r) = α̂(t, r) · y_t^{S(r)},  1 ≤ r ≤ K

85


slide-86
SLIDE 86

P(s_t = S_r, S | X) = ∑_{s_0…s_{t−1} → S_1…[S_r−]} P(s_0 … s_{t−1}, s_t = S_r | X) · ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

  • We will call the first term the forward probability α(t, r)
  • We will call the second term the backward probability β(t, r)

86

The forward probability

We have seen how to compute the first term, α(t, r).

slide-87
SLIDE 87

P(s_t = S_r, S | X) = α(t, r) · ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

  • We will call the first term the forward probability α(t, r)
  • We will call the second term the backward probability β(t, r)

87


slide-88
SLIDE 88


88

The backward probability

Now let’s look at the second term, β(t, r).

slide-89
SLIDE 89

β(t, r) = ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

= ∑_{s_{t+2}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} = S_r, s_{t+2} … s_{T−1} | X) + ∑_{s_{t+2}…s_{T−1} → [S_{r+1}+]…S_K} P(s_{t+1} = S_{r+1}, s_{t+2} … s_{T−1} | X)

89

Backward algorithm

[Figure: the trellis; β(t, r) accumulates path probability from the nodes (t+1, r) and (t+1, r+1)]

slide-90
SLIDE 90

β(t, r) = ∑_{s_{t+2}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} = S_r, s_{t+2} … s_{T−1} | X) + ∑_{s_{t+2}…s_{T−1} → [S_{r+1}+]…S_K} P(s_{t+1} = S_{r+1}, s_{t+2} … s_{T−1} | X)

= P(s_{t+1} = S_r | X) ∑_{s_{t+2}…s_{T−1} → [S_r+]…S_K} P(s_{t+2} … s_{T−1} | s_{t+1} = S_r, X) + P(s_{t+1} = S_{r+1} | X) ∑_{s_{t+2}…s_{T−1} → [S_{r+1}+]…S_K} P(s_{t+2} … s_{T−1} | s_{t+1} = S_{r+1}, X)

90


slide-91
SLIDE 91

Applying the same conditional independence assumption to the trailing terms:

β(t, r) = P(s_{t+1} = S_r | X) ∑_{s_{t+2}…s_{T−1} → [S_r+]…S_K} P(s_{t+2} … s_{T−1} | X) + P(s_{t+1} = S_{r+1} | X) ∑_{s_{t+2}…s_{T−1} → [S_{r+1}+]…S_K} P(s_{t+2} … s_{T−1} | X)

91


slide-92
SLIDE 92


92


slide-93
SLIDE 93

β(t, r) = y_{t+1}^{S(r)} · β(t+1, r) + y_{t+1}^{S(r+1)} · β(t+1, r+1)

93


slide-94
SLIDE 94

Backward algorithm

  • Initialization:

β(T−1, K) = 1,  β(T−1, r) = 0 for r < K

  • for t = T−2 downto 0:

β(t, K) = β(t+1, K) · y_{t+1}^{S(K)}

for r = K−1 … 1:
  • β(t, r) = y_{t+1}^{S(r)} · β(t+1, r) + y_{t+1}^{S(r+1)} · β(t+1, r+1)
94

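The backward recursion, mirroring the forward sketch given earlier (0-based indices; `y[t, r]` is the assumed network probability of the r-th target symbol at time t):

```python
import numpy as np

def ctc_backward(y):
    """Backward (beta) recursion.

    y: (T, K) array, y[t, r] = network probability of target symbol S(r) at time t.
    Returns beta with
        beta[t, r] = y[t+1, r] * beta[t+1, r] + y[t+1, r+1] * beta[t+1, r+1].
    """
    T, K = y.shape
    beta = np.zeros((T, K))
    beta[T - 1, K - 1] = 1.0         # every path must end at the last symbol
    for t in range(T - 2, -1, -1):
        beta[t, K - 1] = beta[t + 1, K - 1] * y[t + 1, K - 1]
        for r in range(K - 2, -1, -1):
            beta[t, r] = (y[t + 1, r] * beta[t + 1, r]
                          + y[t + 1, r + 1] * beta[t + 1, r + 1])
    return beta
```

With `y` all ones, `beta[t, r]` counts the monotone paths from node (t, r) to the final node, the mirror image of the forward count.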


slide-99
SLIDE 99

P(s_t = S_r, S | X) = α(t, r) · ∑_{s_{t+1}…s_{T−1} → [S_r+]…S_K} P(s_{t+1} … s_{T−1} | X)

  • The first term is the forward probability α(t, r)
  • The second term is the backward probability β(t, r), which we can now compute

99

The joint probability

[Figure: the trellis, with α(t, r) computed by the forward pass and β(t, r) by the backward pass]

slide-100
SLIDE 100

P(s_t = S_r, S | X) = α(t, r) · β(t, r)

  • The first term is the forward probability α(t, r)
  • The second term is the backward probability β(t, r)

100


slide-101
SLIDE 101

P(s_t = S_r, S | X) = α(t, r) · β(t, r)

  • The posterior is given by

P(s_t = S_r | S, X) = P(s_t = S_r, S | X) / ∑_{r′} P(s_t = S_{r′}, S | X) = α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′)

  • We can also write this as

P(s_t = S_r | S, X) = α̂(t, r) y_t^{S(r)} β(t, r) / ( α̂(t, r) y_t^{S(r)} β(t, r) + ∑_{r′ ≠ r} α(t, r′) β(t, r′) )

The posterior probability

[Figure: the trellis; at each time the joint probabilities α(t, r) β(t, r) are normalized across the rows r]
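Putting the two recursions together gives the posteriors. A self-contained sketch (same assumed `y[t, r]` layout as before); note that ∑_r α(t, r) β(t, r) is the same at every t — it always equals the total sequence probability P(S | X) — which is why the row-wise normalization below is well defined:

```python
import numpy as np

def posteriors(y):
    """Alignment posteriors P(s_t = S_r | S, X) from a (T, K) array y."""
    T, K = y.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    # forward pass
    alpha[0, 0] = y[0, 0]
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] * y[t, 0]
        for r in range(1, K):
            alpha[t, r] = (alpha[t - 1, r] + alpha[t - 1, r - 1]) * y[t, r]
    # backward pass
    beta[T - 1, K - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t, K - 1] = beta[t + 1, K - 1] * y[t + 1, K - 1]
        for r in range(K - 2, -1, -1):
            beta[t, r] = y[t + 1, r] * beta[t + 1, r] + y[t + 1, r + 1] * beta[t + 1, r + 1]
    joint = alpha * beta                               # P(s_t = S_r, S | X)
    return joint / joint.sum(axis=1, keepdims=True)    # P(s_t = S_r | S, X)
```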

slide-102
SLIDE 102

DIV = −∑_t ∑_{S ∈ S_1…S_K} P(s_t = S | S, X) log Y(t, s_t = S)

DIV = −∑_t ∑_r [ α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′) ] log y_t^{S(r)}

  • The derivative of the divergence w.r.t. the output Y_t of the net at any time:

∇_{Y_t} DIV = [ dDIV/dy_t^1  dDIV/dy_t^2  …  dDIV/dy_t^L ]

– Components will be non-zero only for symbols that occur in the training instance

102

The expected divergence

[Figure: the trellis; the expected divergence weights the log-probability at every node by its posterior]


slide-105
SLIDE 105

DIV = −∑_t ∑_r [ α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′) ] log y_t^{S(r)}

  • The derivative of the divergence w.r.t. any particular output of the network must sum over all instances of that symbol in the target sequence:

dDIV/dy_t^l = − ∑_{r : S(r) = l} d/dy_t^{S(r)} { [ α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′) ] log y_t^{S(r)} }

– E.g. the derivative w.r.t. y_4^{IY} will sum over both rows representing /IY/ in the figure

105

The expected divergence

[Figure: the trellis; the derivatives at both locations where /IY/ occurs at t = 4 must be summed to get dDIV/dy_4^{IY}]
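A common simplification of the derivative above treats the posterior weights γ(t, r) = α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′) as constants, which reduces each term to −γ(t, r) / y_t^{S(r)}. A sketch under that assumption (the function name and argument layout are illustrative, not from the lecture):

```python
import numpy as np

def div_grad(y, post, labels, n_symbols):
    """Gradient of DIV w.r.t. the full network output, holding the posteriors fixed.

    y[t, r]    : network probability of the r-th target symbol at time t
    post[t, r] : posterior P(s_t = S_r | S, X)
    labels[r]  : vocabulary index of target symbol S(r)
    Rows of the target that share a symbol are summed into the same output component:
        dDIV/dy_t^l = - sum_{r : S(r) = l} post[t, r] / y[t, r]
    """
    T, K = y.shape
    grad = np.zeros((T, n_symbols))
    for r in range(K):
        grad[:, labels[r]] -= post[:, r] / y[:, r]   # accumulate repeated symbols
    return grad
```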

slide-106
SLIDE 106

Overall training procedure for Seq2Seq case 1

  • Problem: Given input and output sequences without alignment, train models

106

[Figure: inputs X_0 … X_9 feed the network, which emits an output distribution Y_0 … Y_9 at each time; the alignment of the target phoneme sequence to the inputs is unknown, indicated by a ‘?’ at every time step]

slide-107
SLIDE 107

Overall training procedure for Seq2Seq case 1

  • Step 1: Set up the network

– Typically a many-layered LSTM

  • Step 2: Initialize all parameters of the network

107

slide-108
SLIDE 108

Overall Training: Forward pass

108

  • For each training instance
  • Step 3: Forward pass. Pass the training instance through the network and obtain all symbol probabilities at each time

slide-109
SLIDE 109

[Figure: the network outputs rows of probabilities y_t^B, y_t^IY, y_t^F for the target /B/ /IY/ /F/ /IY/ at every time t]

109

Overall training: Backward pass

  • For each training instance
  • Step 3: Forward pass. Pass the training instance through the network and obtain all symbol probabilities at each time
  • Step 4: Construct the graph representing the specific symbol sequence in the instance. This may require having multiple rows of nodes with the same symbol scores

slide-110
SLIDE 110
  • For each training instance:

– Step 5: Perform the forward-backward algorithm to compute α(t, r) and β(t, r) at each time, for each row of nodes in the graph
– Step 6: Compute the derivative of the divergence ∇_{Y_t} DIV for each Y_t

110

Overall training: Backward pass


slide-111
SLIDE 111

Overall training: Backward pass

  • For each instance

– Step 6: Compute the derivative of the divergence ∇_{Y_t} DIV for each Y_t

∇_{Y_t} DIV = [ dDIV/dy_t^1  dDIV/dy_t^2  …  dDIV/dy_t^L ]

dDIV/dy_t^l = − ∑_{r : S(r) = l} d/dy_t^{S(r)} { [ α(t, r) β(t, r) / ∑_{r′} α(t, r′) β(t, r′) ] log y_t^{S(r)} }

  • Step 7: Aggregate derivatives over the minibatch and update parameters

111

slide-112
SLIDE 112

A key decoding problem

  • Consider a problem where the output symbols are characters
  • We have a decode: R R R O O O O O D
  • Is this the symbol sequence ROD or ROOD?
  • Note: This problem does not always occur, e.g. when symbols have sub-symbols

– E.g. if O is produced as O1 and O2

  • A single O would be of the form O1 O1 .. O2
  • Multiple Os would have the decode O1 .. O2 .. O1 .. O2 ..

112

slide-113
SLIDE 113

We’ve seen this before

X_0 X_1 X_2 X_3 X_4 X_5 X_6 X_7 X_8

113

[Figure: at each time the network outputs probabilities over the symbols /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/; the greedy decode reads /G/ /F/ /IY/ /D/]

  • We cannot distinguish between an extended symbol and repetitions of the symbol /F/
  • Is the decode /G/ /F/ /F/ /IY/ /D/ or /G/ /F/ /IY/ /D/?
slide-114
SLIDE 114

A key decoding problem

  • Consider a problem where the output symbols are characters
  • We have a decode: R R R O O O O O D
  • Is this the symbol sequence ROD or ROOD?
  • Note: This problem does not always occur, e.g. when symbols have sub-symbols

– E.g. if O is produced as O1 and O2

  • A single O would be of the form O1 O1 .. O2 → O
  • Multiple Os would have the decode O1 .. O2 .. O1 .. O2 .. → OO

114

slide-115
SLIDE 115

A key decoding problem

  • We have a decode: R R R O O O O O D
  • Is this the symbol sequence ROD or ROOD?
  • Solution: Introduce an explicit extra symbol which serves to separate discrete versions of a symbol

– A “blank” (represented by “-”)
– RRR---OO---DDD = ROD
– RR-R---OO---D-DD = RRODD
– R-R-R---O-ODD-DDDD-D = RRROODDD

  • The next symbol at the end of a sequence of blanks is always a new character
  • When a symbol repeats, there must be at least one blank between the repetitions
  • The symbol set recognized by the network must now include the extra blank symbol

– Which too must be trained

115
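The collapsing rule above (merge runs of the same symbol, then remove blanks) can be sketched as a small function. This is a minimal illustration, not code from the slides; the blank is written as “-” as on the slide:

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level decode: merge runs of the same symbol,
    then drop blanks, so a blank separates genuine repetitions."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)
```

This reproduces the slide’s examples: RRR---OO---DDD collapses to ROD, RR-R---OO---D-DD to RRODD, and R-R-R---O-ODD-DDDD-D to RRROODDD.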

slide-116
SLIDE 116

The modified forward output

116

  • Note the extra “blank” at the output

[Figure: the output trellis now includes an extra row of blank outputs y_t^(–), t = 0 … 8]

slide-117
SLIDE 117

The modified forward output

117

  • Note the extra “blank” at the output

[Figure: output trellis with blank rows; the decoded sequence is /B/ /IY/ /F/ /IY/]

slide-118
SLIDE 118

The modified forward output

118

  • Note the extra “blank” at the output

[Figure: output trellis with blank rows; the decoded sequence is /B/ /IY/ /F/ /IY/]

slide-119
SLIDE 119

The modified forward output

119

  • Note the extra “blank” at the output

[Figure: output trellis with blank rows; the decoded sequence is /B/ /IY/ /F/ /F/ /IY/]

slide-120
SLIDE 120

120

Composing the graph for training

  • The original method without blanks
  • Changing the example to /B/ /IY/ /IY/ /F/ from /B/ /IY/ /F/ /IY/ for illustration

[Figure: training graph without blanks: one row of nodes per target symbol, columns t = 1 … 8]

slide-121
SLIDE 121


121

Composing the graph for training

  • With blanks
  • Note: a row of blanks between any two symbols
  • Also blanks at the very beginning and the very end

[Figure: training graph with a row of blank nodes between any two symbol rows, and blank rows at the very beginning and end]

slide-122
SLIDE 122


122

Composing the graph for training

  • Add edges such that all paths from initial node(s) to final node(s) unambiguously represent the target symbol sequence

[Figure: the composed training graph, with edges added so that every path spells out the target sequence]

slide-123
SLIDE 123

[Figure: the composed training graph with the initial and final blank nodes highlighted]

123

Composing the graph for training

  • The first and last columns are also allowed to start and end at the initial and final blanks, respectively

slide-124
SLIDE 124


124

Composing the graph for training

  • The first and last columns are also allowed to start and end at the initial and final blanks, respectively
  • Skips are permitted across a blank, but only if the symbols on either side are different

– Because a blank is mandatory between repetitions of a symbol, but not required between distinct symbols
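The graph-composition rules above can be sketched in code: interleave blank rows with the target symbols, and allow a skip over a blank only when the symbols on either side differ. The function names `extend_with_blanks` and `skip_allowed` are illustrative, not from the slides:

```python
def extend_with_blanks(target, blank="-"):
    """Interleave blanks with the target: [B, IY] -> [-, B, -, IY, -]."""
    ext = [blank]
    for sym in target:
        ext.extend([sym, blank])
    return ext

def skip_allowed(ext, s, blank="-"):
    """A transition from node s-2 to node s may skip the blank at s-1
    only when it lands on a non-blank symbol that differs from ext[s-2]."""
    return s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
```

For the target /B/ /IY/ /IY/ /F/, the skip from /B/ into the first /IY/ is allowed, but the blank between the two /IY/ rows cannot be skipped.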

[Figure: the training graph with skip edges across blanks between distinct symbols]

slide-125
SLIDE 125

Modified Forward Algorithm

  • Initialization:

– α(0, 0) = y_0^(b),  α(0, 1) = y_0^(S(1)),  α(0, s) = 0 for s > 1

125

[Figure: the graph for the example target with the initial-node values α(0, s) highlighted]

slide-126
SLIDE 126

Modified Forward Algorithm

  • Iteration:

– If S(s) = “−” or S(s) = S(s − 2):

α(t, s) = [α(t − 1, s) + α(t − 1, s − 1)] · y_t^(S(s))

– Otherwise:

α(t, s) = [α(t − 1, s) + α(t − 1, s − 1) + α(t − 1, s − 2)] · y_t^(S(s))

126
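Assuming the network outputs are given as a list y of per-frame dictionaries y[t][symbol], and ext is the blank-extended target built by interleaving blanks as on the earlier slides, the recursion above can be sketched as follows (a sketch under those assumptions, not the slides’ code):

```python
def ctc_forward(y, ext, blank="-"):
    """Modified forward algorithm over the blank-extended target ext.
    alpha[t][s] accumulates the probability of all partial paths
    that reach node s at time t."""
    T, S = len(y), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][ext[0]]          # start at the initial blank ...
    alpha[0][1] = y[0][ext[1]]          # ... or at the first symbol
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # the skip term applies only when S(s) is neither a blank
            # nor a repetition of S(s-2)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t, s] if False else None
            alpha[t][s] = a * y[t][ext[s]]
    # total probability: paths may end at the last symbol or the final blank
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2], alpha
```

For two frames of uniform outputs over {A, −} and the one-symbol target A, the three valid paths (AA, A−, −A) each have probability 0.25, so the total is 0.75.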

[Figure: one step of the forward recursion on the graph]

slide-127
SLIDE 127

Modified Forward Algorithm

  • Iteration:

– If S(s) = “−” or S(s) = S(s − 2):

α(t, s) = [α(t − 1, s) + α(t − 1, s − 1)] · y_t^(S(s))

– Otherwise:

α(t, s) = [α(t − 1, s) + α(t − 1, s − 1) + α(t − 1, s − 2)] · y_t^(S(s))

127

[Figure: the forward recursion continued across the graph]

slide-128
SLIDE 128

Modified Backward Algorithm

  • Initialization:

– β(T − 1, 2L) = β(T − 1, 2L − 1) = 1,  β(T − 1, s) = 0 for s < 2L − 1

128

[Figure: the graph with the terminal-node values β(T − 1, s) highlighted]

slide-129
SLIDE 129

Modified Backward Algorithm

  • Iteration:

– If S(s) = “−” or S(s) = S(s + 2):

β(t, s) = β(t + 1, s) · y_{t+1}^(S(s)) + β(t + 1, s + 1) · y_{t+1}^(S(s+1))

– Otherwise:

β(t, s) = β(t + 1, s) · y_{t+1}^(S(s)) + β(t + 1, s + 1) · y_{t+1}^(S(s+1)) + β(t + 1, s + 2) · y_{t+1}^(S(s+2))

129
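A matching sketch of the backward recursion, under the same y[t][symbol] representation assumed for the forward sketch. In this convention β(t, s) excludes the output at frame t itself, so the two terminal nodes are initialized to 1 and Σ_s α(t, s)β(t, s) equals the total sequence probability at every t:

```python
def ctc_backward(y, ext, blank="-"):
    """Modified backward algorithm: beta[t][s] sums the probability of
    completing the target from node s, using outputs from frame t+1 on."""
    T, S = len(y), len(ext)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = 1.0            # end at the final blank ...
    beta[T - 1][S - 2] = 1.0            # ... or at the last symbol
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s] * y[t + 1][ext[s]]
            if s + 1 < S:
                b += beta[t + 1][s + 1] * y[t + 1][ext[s + 1]]
            # skip a blank only between two distinct symbols
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1][s + 2] * y[t + 1][ext[s + 2]]
            beta[t][s] = b
    return beta
```

On the same two-frame toy example used for the forward sketch, β(0, ·) works out to [0.5, 1.0, 0.5].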

[Figure: one step of the backward recursion on the graph]

slide-130
SLIDE 130

Overall training procedure for Seq2Seq with blanks

  • Problem: Given input and output sequences without alignment, train models

130

[Figure: inputs X_0 … X_9, target phonemes /B/ /IY/ /IY/ /F/, outputs Y_1 … Y_9; the alignment (marked “?”) is unknown]

slide-131
SLIDE 131

Overall training procedure

  • Step 1: Set up the network

– Typically many-layered LSTM

  • Step 2: Initialize all parameters of the network

– Include a “blank” symbol in vocabulary

131

slide-132
SLIDE 132

Overall Training: Forward pass

132

  • Foreach training instance
  • Step 3: Forward pass. Pass the training instance through the network and obtain all symbol probabilities at each time, including blanks

[Figure: the network outputs at each time, including the blank row y_t^(–)]

slide-133
SLIDE 133

133

Overall training: Backward pass

  • Foreach training instance
  • Step 3: Forward pass. Pass the training instance through the network and obtain all symbol probabilities at each time
  • Step 4: Construct the graph representing the specific symbol sequence in the instance. Use appropriate connections if blanks are included

slide-134
SLIDE 134
  • Foreach training instance:

– Step 5: Perform the forward-backward algorithm to compute α(t, s) and β(t, s) at each time, for each row of nodes in the graph, using the modified forward-backward equations
– Step 6: Compute the derivative of the divergence ∇_{Y_t} DIV for each Y_t

134

Overall training: Backward pass

slide-135
SLIDE 135

Overall training: Backward pass

  • Foreach instance

– Step 6: Compute the derivative of the divergence ∇_{Y_t} DIV for each Y_t:

∇_{Y_t} DIV = [ dDIV/dy_t^1, dDIV/dy_t^2, …, dDIV/dy_t^L ]

dDIV/dy_t^l = − Σ_{s : S(s) = l} [ α(t, s) β(t, s) / Σ_{s′} α(t, s′) β(t, s′) ] · d log y_t^(S(s)) / d y_t^l

  • Step 7: Aggregate derivatives over minibatch and update parameters

135
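Step 6 can be sketched as follows, assuming α and β have been computed with the recursions above (with β excluding frame t, so that Σ_s α(t, s)β(t, s) is exactly the normalizer in the formula). The function name and the hand-computed α, β values in the usage example below are illustrative only:

```python
def ctc_output_gradient(y, ext, alpha, beta):
    """dDIV/dy_t^l: posterior-weighted derivative of log y_t^(S(s)),
    summed over the graph rows s that carry symbol l (d log y / dy = 1/y)."""
    T, S = len(y), len(ext)
    grads = []
    for t in range(T):
        norm = sum(alpha[t][s] * beta[t][s] for s in range(S))
        g = {}
        for sym in y[t]:
            post = sum(alpha[t][s] * beta[t][s]
                       for s in range(S) if ext[s] == sym)
            g[sym] = -post / (norm * y[t][sym])
        grads.append(g)
    return grads
```

On the two-frame toy problem with uniform outputs over {A, −} and target A, the gradient at t = 0 is −4/3 for A and −2/3 for the blank.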

slide-136
SLIDE 136

CTC: Connectionist Temporal Classification

  • The overall framework we saw is referred to as CTC

– Applies when “duplicating” labels at the output is considered acceptable, and when the output sequence length is less than the input sequence length

136

slide-137
SLIDE 137

CTC caveats

  • The “blank” structure (with concurrent modifications to the forward-backward equations) is only one way to deal with the problem of repeating symbols

  • Possible variants:

– Symbols partitioned into two or more sequential subunits

  • No blanks are required, since subunits must be visited in order

– Symbol-specific blanks

  • Doubles the “vocabulary”

– CTC can use bidirectional recurrent nets

  • And frequently does

– Other variants possible..

137

slide-138
SLIDE 138

Most common CTC applications

  • Speech recognition

– Speech in, phoneme sequence out
– Speech in, character sequence (spelling) out

  • Handwriting recognition

138

slide-139
SLIDE 139

Speech recognition using Recurrent Nets

  • Recurrent neural networks (with LSTMs) can be used to perform speech recognition

– Input: Sequences of audio feature vectors
– Output: Phonetic label of each vector

[Figure: input feature vectors X(t) from t = 0, each frame labeled with a phoneme P_1 … P_7]

139

slide-140
SLIDE 140

Speech recognition using Recurrent Nets

  • Alternative: Directly output the phoneme, character or word sequence

[Figure: input X(t) from t = 0, mapped directly to an output sequence W_1, W_2]

140
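One common way to realize this is a greedy “best path” decode (one of several options; beam search over the CTC outputs is another): take the most probable symbol at each frame, then collapse as before. A minimal sketch with an assumed frame-by-frame probability matrix:

```python
def greedy_ctc_decode(probs, symbols, blank="-"):
    """probs: T x V list of per-frame probabilities over `symbols`.
    Pick the argmax symbol per frame, then merge repeats and drop blanks."""
    best = [symbols[max(range(len(p)), key=p.__getitem__)] for p in probs]
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out
```

For example, per-frame argmaxes of A, A, −, B collapse to the output sequence [A, B].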

slide-141
SLIDE 141

Next up: Attention models

  • Will cover on Friday!

141