DTW and Search
Hsin-min Wang
References

Books
1. X. Huang, A. Acero, and H. Hon, Spoken Language Processing, Chapters 12-13, Prentice Hall, 2001
2. Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Automatic Speech and Speaker Recognition, Chapters 13, 16-18, Kluwer Academic Publishers, 1996
3. John R. Deller, Jr., John G. Proakis, and John H. L. Hansen, Discrete-Time Processing of Speech Signals, Chapters 11-12, IEEE Press, 2000
4. L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Chapter 7, Prentice Hall, 1993
5. Frederick Jelinek, Statistical Methods for Speech Recognition, Chapters 5-6, MIT Press, 1999
6. N. Nilsson, Principles of Artificial Intelligence, 1982

Papers
1. Hermann Ney, "Progress in Dynamic Programming Search for LVCSR," Proceedings of the IEEE, August 2000
2. Patrick Kenny et al., "A*-Admissible Heuristics for Rapid Lexical Access," IEEE Trans. on Speech and Audio Processing, 1993
3. Stefan Ortmanns and Hermann Ney, "A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, 11:43-72, 1997
4. Jean-Luc Gauvain and Lori Lamel, "Large-Vocabulary Continuous Speech Recognition: Advances and Applications," Proceedings of the IEEE, August 2000
Search Algorithms for Speech Recognition
Template-based: without statistical modeling/training
– Directly compare/align the test and reference feature-vector sequences (which may have different lengths) to derive the overall distortion between them
– Dynamic Time Warping (DTW): warp the speech templates along the time dimension to alleviate the distortion
Model-based: HMMs widely used
– Concatenate the subword models according to the pronunciations of the words in a lexicon
– The states in the HMMs can be expanded to form the state-search space (HMM state transition network) for the search
– Apply appropriate search strategies
Dynamic Time Warping (DTW)

[Figures across several slides, after Eamonn J. Keogh & Michael J. Pazzani (Fig. 3): alignment of two time series, the local-distance grid, and a warping path through cells (1,1), (2,2), (2,3), ..., (n,m).]
Advantages of DTW
Speech recognition based on DTW is simple to implement and fairly effective for small-vocabulary isolated-word recognition
Dynamic programming (DP) can temporally align patterns to account for differences in speaking rates across speakers as well as across repetitions of the word by the same speaker
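Purely as an illustration (not from the slides), a minimal DTW sketch in Python: it fills the cumulative-distance grid with the standard horizontal/vertical/diagonal local path constraint and returns the overall distortion between two feature-vector sequences of possibly different lengths.

```python
import numpy as np

def dtw(x, y):
    """Minimal DTW: x, y are (T, d) feature arrays; returns the total distortion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative-distance grid
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])       # local distance d(i, j)
            # local path constraint: best of horizontal, vertical, or diagonal step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```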
Weaknesses of DTW
Creation of the reference templates from data is non-trivial; it is typically accomplished by pairwise warping of training instances
Alternatively, all observed instances can be stored as templates, but matching against all of them is extremely slow
As a result, the HMM is a much better alternative for spoken language processing
Also refer to page 383 of the textbook
Continuous Speech Recognition
Continuous speech recognition is both a pattern recognition and a search problem
– In speech recognition, making a search decision is also referred to as decoding
  - Find the sequence of words whose corresponding acoustic and language models best match the input signal
  - The search complexity is highly correlated with the size of the search space, which is determined by the constraints imposed by the language models
Speech recognition search is usually done with Viterbi or A* stack decoders
– The relative merits of the two search algorithms were quite controversial in the 1980s
Model-based Speech Recognition
A search process to uncover the word sequence with the maximum posterior probability:

$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(W)\,P(X|W)}{P(X)} = \arg\max_{W} P(W)\,P(X|W)$$

where $W = w_1, w_2, \ldots, w_i, \ldots, w_m$ and $w_i \in V = \{v_1, v_2, \ldots, v_N\}$

$P(X|W)$: acoustic model probability; $P(W)$: language model probability

N-gram language modeling:
– Unigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2) \cdots P(w_k)$, with $P(w_j) = \dfrac{C(w_j)}{\sum_i C(w_i)}$
– Bigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2|w_1) \cdots P(w_k|w_{k-1})$, with $P(w_j|w_{j-1}) = \dfrac{C(w_{j-1}\,w_j)}{C(w_{j-1})}$
– Trigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2|w_1)\,P(w_3|w_1 w_2) \cdots P(w_k|w_{k-2}\,w_{k-1})$, with $P(w_j|w_{j-2}\,w_{j-1}) = \dfrac{C(w_{j-2}\,w_{j-1}\,w_j)}{C(w_{j-2}\,w_{j-1})}$
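As a concrete illustration of the bigram estimate above (my own sketch, not part of the slides), maximum-likelihood bigram probabilities computed directly from counts:

```python
from collections import Counter

def bigram_lm(corpus):
    """corpus: list of word lists; returns P(w_j | w_{j-1}) as MLE count ratios."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent[:-1], sent[1:]))
    # P(w_j | w_{j-1}) = C(w_{j-1} w_j) / C(w_{j-1})
    return {(u, v): c / unigrams[u] for (u, v), c in bigrams.items()}

probs = bigram_lm([["this", "is", "a", "test"], ["this", "is", "fun"]])
print(probs[("this", "is")])  # 1.0: "this" is always followed by "is" here
```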
Block Diagram of Model-based Speech Recognition
[Figure: block diagram of model-based speech recognition; the language-model component shown is an information-based case grammar.]
Basic Search Algorithms
What Is “Search”?
The idea of search implies moving around, examining things, and making decisions about whether the sought object has yet been found
Two classical problems in AI
– Traveling salesman problem: find a shortest-distance tour, starting at one of many cities, visiting each city exactly once, and returning to the starting city
– N-queens problem: place N queens on an N×N chessboard in such a way that no queen can capture any other queen; i.e., there is no more than one queen in any given row, column, or diagonal
A Simple City-traveling Problem

[Figures: a city-traveling example and the search tree/graph it induces.]
The General Graph Search Algorithm
If both steps 6(a) and 6(b) are omitted, graph search reduces to tree search
Steps 6(a) and 6(b) perform the bookkeeping/merging process
OPEN: stores the nodes waiting for expansion
CLOSED: stores the already expanded nodes
Blind Graph Search Algorithms
If the aim of the search problem is to find an acceptable path instead of the best path, blind search is often used
Blind search treats every node in the OPEN list the same and blindly decides the order of expansion without using any domain knowledge
Blind search does not expand nodes randomly; it follows some systematic way to explore the search graph
– Depth-first search
– Breadth-first search
Depth-First Search
Depth-first search picks an arbitrary alternative at every node visited
The search sticks with this partial path and works forward from it; other alternatives at the same level are ignored completely
The deepest nodes are expanded first, and nodes of equal depth are ordered arbitrarily
Depth-First Search (cont.)
Depth-first search generates only one successor at a time
– Graph search generates all successors at a time
When depth-first search reaches a dead end, it goes back to the last decision point and proceeds with another alternative
– Depth-first search can be dangerous because it might pursue a path that is actually an infinite dead end
– A depth bound can be placed to constrain the nodes to be explored
Breadth-First Search
Breadth-first search examines all the nodes on one level before considering any of the nodes on the next level (depth)
Breadth-first search is guaranteed to find a solution if one exists
– It might not find the shortest-distance path, but it is guaranteed to find the one with the fewest cities visited (the minimum-length path)
Breadth-first search may be inefficient when all solutions leading to the goal node are at approximately the same depth
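A compact sketch (my own, for illustration) showing that the two blind strategies differ only in how the OPEN list is used: as a stack for depth-first search, or as a queue for breadth-first search.

```python
from collections import deque

def blind_search(start, successors, is_goal, depth_first=True):
    """Blind graph search; OPEN as a stack -> DFS, as a queue -> BFS."""
    open_list = deque([[start]])
    closed = set()
    while open_list:
        path = open_list.pop() if depth_first else open_list.popleft()
        node = path[-1]
        if is_goal(node):
            return path
        if node in closed:          # bookkeeping: skip already expanded nodes
            continue
        closed.add(node)
        for nxt in successors(node):
            open_list.append(path + [nxt])
    return None
```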
Breadth-First Search (cont.)

[Figure: level-by-level breadth-first expansion of the search tree.]
Heuristic Graph Search
Blind search finds only one arbitrary solution instead of the optimal solution
– To find the optimal solution with depth-first or breadth-first search, the search needs to continue rather than stop when the first solution is discovered
  - After the search reaches all solutions, we can compare them and select the best
– This exhaustive strategy is called British Museum search or brute-force search
Heuristic search takes advantage of heuristic information (domain-specific knowledge) during the search
– Use the heuristic function to re-order the OPEN list in Step 7 of Algorithm 12.1
– Some heuristics can reduce search effort without sacrificing optimality, while others can greatly reduce search effort but provide only sub-optimal solutions
– Best-first (or A*) search and beam search
Best-First Search
Best-first search explores the best node first, since it offers the best hope of leading to the best path; this is why it is called best-first search
A search algorithm is called admissible if it can guarantee to find the optimal solution
– Admissible best-first search is called A* search
f(N) = g(N) + h(N); with h(N) = 0 and g(N) = the depth of node N, best-first search reduces to breadth-first search
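A minimal A* sketch (my own illustration, assuming an additive-cost graph given by a `successors` function): the OPEN list is ordered by f(n) = g(n) + h(n); with h ≡ 0 and unit arc costs it behaves like breadth-first search, as noted above.

```python
import heapq

def a_star(start, goals, successors, h):
    """successors(n) yields (neighbor, arc_cost); h(n) is the heuristic estimate."""
    open_list = [(h(start), 0.0, start, [start])]    # entries are (f, g, node, path)
    best_g = {start: 0.0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)  # expand the best node first
        if node in goals:
            return path, g
        for nxt, cost in successors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):   # keep only the best path to nxt
                best_g[nxt] = g2
                heapq.heappush(open_list, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")
```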
A* Search
History of A* search in AI
– The most widely studied best-first search (Hart & Nilsson, 1968)
– Developed for additive cost measures (the cost of a path = the sum of the costs of its arcs)
Properties
– A* search can sequentially generate multiple recognition candidates
– A* search needs a heuristic function that satisfies the admissibility condition
Admissibility
– The property that a search algorithm is guaranteed to find an optimal solution, if one exists
A* Search – 1st example

[Figures: A* applied to the city-traveling problem; nodes S, A, C, E, G are expanded with f = g + h values such as 3+8.5, 2+10.3, (3+3)+5.7, (3+4)+10.3, (6+3)+2.8, (9+5)+7, and 9+3.]
A* Search – 2nd example

Find a path with the highest score from root node "A" to some leaf node (one of "L1", "L2", "L3", "L4")

$f(n) = g(n) + h(n)$, where
– $g(n)$: accumulated score from the root node to node $n$, i.e., the decoded partial path score
– $h^*(n)$: exact score from node $n$ to a specific leaf node
– $h(n)$: heuristic function of node $n$, the estimated score from node $n$ to a goal node

Admissibility (for this maximization problem): $h(n) \ge h^*(n)$

[Figure: search tree with root A, internal nodes B-G, leaves L1-L4, and arc scores 4, 3, 2, 3, 2, 4, 1, 8, 1, 3.]
A* Search – 2nd example (cont.)
Proving the admissibility of the A* algorithm:

Suppose that when the algorithm terminates, "G" is a complete path on the top of the stack and "p" is a partial path somewhere on the stack. Assume there exists a complete path "P" passing through "p", not equal to "G", that is optimal.

Proof:
1. Since "P" is a complete path passing through "p" and $h(p) \ge h^*(p)$, we have $f(P) \le f(p)$
2. Because "G" is on the top of the stack, $f(G) \ge f(p) \ge f(P)$
3. This contradicts the assumption that "P", not "G", is optimal; hence "G" is optimal

Evaluation function of node $n$: $f(n) = g(n) + h^*(n)$

Node   g(n)   h*(n)   f(n)
A       0      15      15
B       4       9      13
C       3      12      15
D       2       5       7
E       7       4      11
F       7       2       9
G      11       3      14
L1      9       0       9
L2      8       0       8
L3     12       0      12
L4      5       0       5

List or stack (sorted) trace:

Stack Top   Stack Elements
A(15)       A(15)
C(15)       C(15), B(13), D(7)
G(14)       G(14), B(13), F(9), D(7)
B(13)       B(13), L3(12), F(9), D(7)
L3(12)      L3(12), E(11), F(9), D(7)

[Figure: the same search tree with root A, internal nodes B-G, leaves L1-L4, and arc scores 4, 3, 2, 3, 2, 4, 1, 8, 1, 3.]
A* Search – practice
Please find a path with the lowest path cost from the initial state α to one of the four goal states (β1, β2, β3, β4). Each arc is labeled with a number representing the cost of taking it, while each node is labeled with a number standing for the expected cost (the heuristic score/function) from that node to one of the four goal states.

[Figure: the search graph for this exercise.]
A* Search – practice (cont.)
Problems
– What is the first goal state found by depth-first search, which always selects a node's left-most child for path expansion? Is it an optimal solution? What is the total search cost?
– What is the first goal state found by breadth-first search, which always expands all child nodes at the same level from left to right? Is it an optimal solution? What is the total search cost?
– What is the first goal state found by A* search, which uses the path cost plus the heuristic function for path expansion? Is it an optimal solution? What is the total search cost?
– What is the search path cost if A* search is used to sequentially visit all four goal states?
Beam Search
Beam search is a widely used search technique for speech recognition systems
– It is a breadth-first-style search that progresses level by level with the depth
– Unlike traditional breadth-first search, beam search only expands nodes that are likely to succeed at each level; up to w best paths are kept at each level (stage), and the rest are discarded

[Figure: beam search with beam width w = 2.]
Beam Search (cont.)
Step 4 obviously requires sorting, which is time-consuming if the number w×b is huge
– Alternatively, keep only the nodes whose heuristic functions are within some threshold of the best node, and prune away the nodes outside the threshold
Unlike A* search, beam search is an approximate heuristic search that is not admissible
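One beam-search level in Python (my own hedged sketch): expand every surviving path, then keep only the w best-scoring successors.

```python
def beam_step(paths, successors, w):
    """One level of beam search; paths is a list of (path, score) hypotheses.
    successors(node) yields (next_node, local_score) pairs."""
    candidates = [(path + [nxt], s + local)
                  for path, s in paths
                  for nxt, local in successors(path[-1])]
    # keep only the w best partial paths at this level; the rest are pruned
    return sorted(candidates, key=lambda ps: ps[1], reverse=True)[:w]
```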
Search Algorithms for Speech Recognition
Calculating the Probability of an Observation Sequence Given an HMM

Direct evaluation: without using recursion (dynamic programming, DP) or memory
– Huge computation requirements: O(N^T)
  - Exponential computational complexity
More efficient algorithms can be used for the evaluation
– The Forward/Backward procedure (algorithm)
[Figure: state-time trellis in which each of the N states $v_1, v_2, \ldots, v_N$ may be occupied at every time frame, with initial probabilities $\pi_{s_1}, \pi_{s_2}, \pi_{s_3}$ on the starting states.]

$\lambda = (A, B, \pi)$: the state transition probabilities $A$, the state observation probabilities $B$, and the initial state probabilities $\pi$

$$P(O|\lambda) = \sum_{\text{all } Q} P(O, Q|\lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2)\, a_{q_2 q_3} b_{q_3}(o_3) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$$

Complexity: MUL $\approx (2T-1)N^T$, ADD $\approx N^T - 1$
Forward Procedure
Algorithm complexity: O(N²T); based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
– All state sequences, regardless of how long ago they diverged, merge to the N nodes (states) at each time instance t

Forward variable: $\alpha_t(i) = P(o_1 o_2 \cdots o_t,\, q_t = i \,|\, \lambda)$

1. Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
2. Induction: $\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Complexity: MUL $= N(N+1)(T-1) + N \approx N^2 T$; ADD $= (N-1)N(T-1) \approx N^2 T$

[Figure: trellis fragment showing the initial probabilities $\pi_1, \pi_2, \pi_3$, transitions $a_{ij}$, and observation probabilities $b_j(o_t)$ that enter the recursion.]
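A direct numpy transcription of the three steps (my own sketch; it assumes A is the N×N transition matrix, B[j, t] = b_j(o_t), pi is the initial distribution, and time is 0-based):

```python
import numpy as np

def forward(pi, A, B):
    """Forward procedure: returns P(O | lambda); alpha rows follow the
    slides' alpha_t with 0-based time indexing."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]                            # 1. initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, t + 1]    # 2. induction
    return alpha[-1].sum()                             # 3. termination
```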
Backward Procedure
Algorithm complexity: O(N²T)

Backward variable: $\beta_t(i) = P(o_{t+1}\, o_{t+2} \cdots o_T \,|\, q_t = i, \lambda)$

1. Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \ 1 \le i \le N$
3. Termination: $P(O|\lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)$

Complexity: MUL $\approx 2N^2(T-1) \approx N^2 T$; ADD $\approx (N-1)N(T-1) \approx N^2 T$

Combining the two variables — since the observations before and after time t are conditionally independent given $q_t = i$:

$$\alpha_t(i)\,\beta_t(i) = P(o_1 \cdots o_t,\, q_t = i \,|\, \lambda)\; P(o_{t+1} \cdots o_T \,|\, q_t = i, \lambda) = P(O,\, q_t = i \,|\, \lambda)$$

$$\therefore\ P(O|\lambda) = \sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i) \quad \text{for any } t$$
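The mirror-image numpy sketch, under the same assumed conventions as the forward pass above; combining the two gives the same P(O|λ) at any frame:

```python
import numpy as np

def backward(pi, A, B):
    """Backward procedure: returns P(O | lambda); beta rows follow the
    slides' beta_t with 0-based time indexing."""
    N, T = B.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # 1. initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, t + 1] * beta[t + 1])     # 2. induction
    return (pi * B[:, 0] * beta[0]).sum()             # 3. termination

# sanity check under the same model: forward(pi, A, B) == backward(pi, A, B)
```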
Viterbi Algorithm
Algorithm (complexity: O(N²T)):

1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \ \Psi_1(i) = 0, \quad 1 \le i \le N$
2. Induction:
   $\delta_{t+1}(j) = \max_{1 \le i \le N} \big[\delta_t(i)\, a_{ij}\big]\, b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
   $\Psi_{t+1}(j) = \arg\max_{1 \le i \le N} \big[\delta_t(i)\, a_{ij}\big], \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $P^*(O|\lambda) = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Backtracking: $q_t^* = \Psi_{t+1}(q_{t+1}^*), \ t = T-1, T-2, \ldots, 1$

$Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence
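A numpy sketch of the four steps (my own illustration, same assumed conventions as the forward pass); in practice the log-domain version on the next slide is preferred to avoid numerical underflow:

```python
import numpy as np

def viterbi(pi, A, B):
    """Returns the best state sequence and its probability."""
    N, T = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, 0]                          # 1. initialization
    for t in range(T - 1):
        scores = delta[t][:, None] * A               # scores[i, j] = delta_t(i) * a_ij
        psi[t + 1] = scores.argmax(axis=0)           # 2. induction: backpointers
        delta[t + 1] = scores.max(axis=0) * B[:, t + 1]
    q = [int(delta[-1].argmax())]                    # 3. termination
    for t in range(T - 1, 0, -1):                    # 4. backtracking
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta[-1].max()
```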
Viterbi Algorithm (cont.)
In practice, we calculate the logarithm of the path score instead of the score itself:

1. Initialization: $\delta_1(i) = \log \pi_i + \log b_i(o_1), \ \Psi_1(i) = 0, \quad 1 \le i \le N$
2. Induction:
   $\delta_{t+1}(j) = \max_{1 \le i \le N} \big[\delta_t(i) + \log a_{ij}\big] + \log b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
   $\Psi_{t+1}(j) = \arg\max_{1 \le i \le N} \big[\delta_t(i) + \log a_{ij}\big], \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $\log P^*(O|\lambda) = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Backtracking: $q_t^* = \Psi_{t+1}(q_{t+1}^*), \ t = T-1, T-2, \ldots, 1$

$Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence
Isolated Word Recognition
Word boundaries are known
Two search structures
– Lexicon list (linear lexicon)
  - Each word is individually represented as a large composite HMM constructed by concatenating the corresponding subword-level (phone/Initial-Final/syllable) HMMs
  - No sharing of computation between words when performing the search
  - The search becomes a simple pattern recognition problem, and the word with the highest forward probability is chosen as the recognized word
– Tree structure (tree lexicon)
  - Arrange the subword-level (phone/Initial-Final/syllable) representations of the words in a tree structure
  - Each arc stands for an HMM, i.e., a subword-level model
  - Computation is shared between words as much as possible
Isolated Word Recognition (cont.)
Two search structures (cont.)

[Figure: tree lexicon (13 arcs) vs. linear lexicon (18 arcs) for six Mandarin names — 林登賢, 林登輝, 李志賢, 李登輝, 王登賢, 王登輝 — each decomposed into Initial-Final units such as l-in, d-eng, h-uei, shi-ian; the tree shares common prefixes, with an empty arc (j-empt) where needed.]
Isolated Word Recognition (cont.)
More about the tree lexicon
– The idea of using a tree representation was suggested in the 1970s in the CASPERS system and the LAFS system
– When using such a lexical tree with a language model (bigram or trigram) and dynamic programming, many technical details must be taken into account and require a careful structuring of the search space (especially for continuous speech recognition, discussed later); a sketch of the tree construction follows this list
  - Delayed application of the language model until reaching the tree leaf nodes
  - A copy of the lexical tree for each live language model history in dynamic programming
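The sketch referred to above (my own, with made-up toy pronunciations): a tree lexicon built as a nested dict, where common prefixes share arcs and the word identity is known only at a leaf — which is exactly why the language model application is delayed.

```python
def build_tree_lexicon(pronunciations):
    """pronunciations: dict word -> list of subword units (e.g., phones)."""
    root = {}
    for word, units in pronunciations.items():
        node = root
        for u in units:                 # shared prefixes reuse existing arcs
            node = node.setdefault(u, {})
        node["#word"] = word            # word identity known only at the leaf
    return root

# toy entries, hypothetical unit inventory
lex = build_tree_lexicon({"deng-hui": ["d", "eng", "h", "uei"],
                          "deng-shian": ["d", "eng", "shi", "ian"]})
# the "d" -> "eng" arcs are shared; the two words diverge only afterwards
```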
Continuous Speech Recognition (CSR)
Search in CSR is rather complicated, since the search algorithm has to consider the possibility of each word starting at any arbitrary time frame
Some of the earliest recognition systems took a two-stage approach
– First hypothesizing the possible word boundaries
– Then using pattern-matching techniques to recognize the segmented patterns
Due to significant cross-word co-articulation, there is no reliable segmentation algorithm for detecting word boundaries other than doing the recognition itself
CSR – Linear lexicon without a language model

[Figure: word HMMs connected in parallel between a starting state and a collector state, with uniform priors P(w1) = P(w2) = 1/2.]
Continuous Digit Recognition - Viterbi
Unknown number of digits/syllables; search over a three-dimensional grid
At each frame, the maximum value achieved at the end states of all models in the previous frame is propagated and competes for the values of the start states of all models
May result in substitutions, deletions, and insertions; e.g., the correct string 32561 may be recognized as 325561 (insertion) or 3261 (deletion)

[Figure: digit models 1-9 stacked along the state axis against the time axis t.]
Continuous Digit Recognition - Level-Building
Known number of digits/syllables; higher computational complexity, but no deletions or insertions
Number of levels = number of digits in an utterance
Transitions go from the last states of the models at the previous level to the first states of the specific models at the current level

[Figure: four levels (l = 1, ..., 4), each containing the digit models 1-9, laid out against the time axis t.]
Level-Building

[Figure-only slides: level-building trellis illustrations.]
CSR – Linear lexicon with a unigram language model

$$P(W) = \prod_{i=1}^{n} P(w_i)$$
CSR – Linear lexicon with a bigram language model

$$P(W) = P(w_1|\langle s \rangle) \prod_{i=2}^{n} P(w_i|w_{i-1})$$

Katz backoff for an unseen bigram: $P(w_j|w_i) = \alpha(w_i)\, P(w_j)$
CSR – Linear lexicon with a trigram language model

$$P(W) = P(w_1|\langle s \rangle)\, P(w_2|\langle s \rangle, w_1) \prod_{i=3}^{n} P(w_i|w_{i-2}, w_{i-1})$$
Time-Synchronous Viterbi Beam Search
For isolated word recognition using word HMMs, the recognition task is defined as

$$\hat{w} = \arg\max_{i} P(X|w_i) = \arg\max_{i} \sum_{Q} P(X, Q|w_i)$$

For CSR, the goal of decoding is to uncover the best word sequence, so we can approximate the summation with the maximum and find the best state sequence instead:

$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} P(W) \sum_{\text{all possible } Q} P(X, Q|W) \approx \arg\max_{W} \Big[P(W)\, \max_{Q} P(X, Q|W)\Big]$$
Time-Synchronous Viterbi Beam Search (cont.)
Prune unlikely paths during the search; we need some criteria to decide which paths to prune

[Figure: time-synchronous search over the list lexicon of six names (林登輝, 林登賢, 李志賢, 李登輝, 王登賢, 王登輝), each expanded into its subword HMM sequence; low-scoring paths are pruned as the search proceeds frame by frame along the time axis.]
Time-Synchronous Viterbi Beam Search (cont.)
D(t; s_t; w): total cost of the best path up to time t that ends in state s_t of grammar word w
h(t; s_t; w): backtrack pointer for the best path up to time t that ends in state s_t of grammar word w
I(w): the initial state of grammar word w
F(w): the final state of grammar word w
η: the pseudo initial state

$$d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w) = -\log P(s_t \,|\, s_{t-1}; w) - \log b(\mathbf{x}_t \,|\, s_t; w)$$

$$D(t; s_t; w) = \min_{s_{t-1}} \big[D(t-1; s_{t-1}; w) + d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w)\big]$$

$$h(t; s_t; w) = \arg\min_{s_{t-1}} \big[D(t-1; s_{t-1}; w) + d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w)\big]$$

Beam size ~ 5-10% of the entire search space