DTW and Search
Hsin-min Wang
References

Books
1. X. Huang, A. Acero, and H. Hon, Spoken Language Processing, Chapters 12-13, Prentice Hall, 2001
2. Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Automatic Speech and Speaker Recognition, Chapters 13, 16-18, Kluwer Academic Publishers, 1996
3. John R. Deller, Jr., John G. Proakis, and John H. L. Hansen, Discrete-Time Processing of Speech Signals, Chapters 11-12, IEEE Press, 2000
4. L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Chapter 7, Prentice Hall, 1993
5. Frederick Jelinek, Statistical Methods for Speech Recognition, Chapters 5-6, MIT Press, 1999
6. N. Nilsson, Principles of Artificial Intelligence, 1982

Papers
1. Hermann Ney, "Progress in Dynamic Programming Search for LVCSR," Proceedings of the IEEE, August 2000
2. Patrick Kenny et al., "A*-Admissible Heuristics for Rapid Lexical Access," IEEE Trans. on Speech and Audio Processing, 1993
3. Stefan Ortmanns and Hermann Ney, "A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, 11:43-72, 1997
4. Jean-Luc Gauvain and Lori Lamel, "Large-Vocabulary Continuous Speech Recognition: Advances and Applications," Proceedings of the IEEE, August 2000
Search Algorithms for Speech Recognition
Template-based: without statistical modeling/training
– Directly compare/align the test and reference feature-vector sequences (which may have different lengths) to derive the overall distortion between them
– Dynamic Time Warping (DTW): warp the speech templates along the time dimension to alleviate the distortion
Model-based: HMMs widely used
– Concatenate the subword models according to the pronunciations of the words in a lexicon
– The states in the HMMs can be expanded to form the state-search space (HMM state transition network) for the search
– Apply appropriate search strategies
Dynamic Time Warping (DTW)

[Figures across several slides, after Eamonn J. Keogh & Michael J. Pazzani (Fig. 3): alignment of two time series, the local-distance grid, and a warping path through cells (1,1), (2,2), (2,3), ..., (n,m).]
Advantages of DTW
Speech recognition based on DTW is simple to implement and fairly effective for small-vocabulary isolated-word recognition
Dynamic programming (DP) can temporally align patterns to account for differences in speaking rates across speakers as well as across repetitions of the word by the same speaker
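Purely as an illustration (not from the slides), a minimal DTW sketch in Python: it fills the cumulative-distance grid with the standard horizontal/vertical/diagonal local path constraint and returns the overall distortion between two feature-vector sequences of possibly different lengths.

```python
import numpy as np

def dtw(x, y):
    """Minimal DTW: x, y are (T, d) feature arrays; returns the total distortion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative-distance grid
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])       # local distance d(i, j)
            # local path constraint: best of horizontal, vertical, or diagonal step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```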
Weaknesses of DTW
Creation of the reference templates from data is non-trivial; it is typically accomplished by pairwise warping of training instances
Alternatively, all observed instances can be stored as templates, but matching against all of them is extremely slow
As a result, the HMM is a much better alternative for spoken language processing
Also refer to page 383 of the textbook
Continuous Speech Recognition
Continuous speech recognition is both a pattern recognition and a search problem
– In speech recognition, making a search decision is also referred to as decoding
  - Find the sequence of words whose corresponding acoustic and language models best match the input signal
  - The search complexity is highly correlated with the size of the search space, which is determined by the constraints imposed by the language models
Speech recognition search is usually done with Viterbi or A* stack decoders
– The relative merits of the two search algorithms were quite controversial in the 1980s
Model-based Speech Recognition
A search process to uncover the word sequence with the maximum posterior probability:

$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(W)\,P(X|W)}{P(X)} = \arg\max_{W} P(W)\,P(X|W)$$

where $W = w_1, w_2, \ldots, w_i, \ldots, w_m$ and $w_i \in V = \{v_1, v_2, \ldots, v_N\}$

$P(X|W)$: acoustic model probability; $P(W)$: language model probability

N-gram language modeling:
– Unigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2) \cdots P(w_k)$, with $P(w_j) = \dfrac{C(w_j)}{\sum_i C(w_i)}$
– Bigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2|w_1) \cdots P(w_k|w_{k-1})$, with $P(w_j|w_{j-1}) = \dfrac{C(w_{j-1}\,w_j)}{C(w_{j-1})}$
– Trigram: $P(w_1 w_2 \cdots w_k) \approx P(w_1)\,P(w_2|w_1)\,P(w_3|w_1 w_2) \cdots P(w_k|w_{k-2}\,w_{k-1})$, with $P(w_j|w_{j-2}\,w_{j-1}) = \dfrac{C(w_{j-2}\,w_{j-1}\,w_j)}{C(w_{j-2}\,w_{j-1})}$
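As a concrete illustration of the bigram estimate above (my own sketch, not part of the slides), maximum-likelihood bigram probabilities computed directly from counts:

```python
from collections import Counter

def bigram_lm(corpus):
    """corpus: list of word lists; returns P(w_j | w_{j-1}) as MLE count ratios."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent[:-1], sent[1:]))
    # P(w_j | w_{j-1}) = C(w_{j-1} w_j) / C(w_{j-1})
    return {(u, v): c / unigrams[u] for (u, v), c in bigrams.items()}

probs = bigram_lm([["this", "is", "a", "test"], ["this", "is", "fun"]])
print(probs[("this", "is")])  # 1.0: "this" is always followed by "is" here
```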
Block Diagram of Model-based Speech Recognition
[Figure: block diagram of model-based speech recognition; the language-model component shown is an information-based case grammar.]
Basic Search Algorithms
What Is “Search”?
The idea of search implies moving around, examining things, and making decisions about whether the sought object has yet been found
Two classical problems in AI
– Traveling salesman problem: find a shortest-distance tour, starting at one of many cities, visiting each city exactly once, and returning to the starting city
– N-queens problem: place N queens on an N×N chessboard in such a way that no queen can capture any other queen; i.e., there is no more than one queen in any given row, column, or diagonal
A Simple City-traveling Problem

[Figures: a city-traveling example and the search tree/graph it induces.]
The General Graph Search Algorithm
If both steps 6(a) and 6(b) are omitted, graph search reduces to tree search
Steps 6(a) and 6(b) perform the bookkeeping/merging process
OPEN: stores the nodes waiting for expansion
CLOSED: stores the already expanded nodes
Blind Graph Search Algorithms
If the aim of the search problem is to find an acceptable path instead of the best path, blind search is often used
Blind search treats every node in the OPEN list the same and blindly decides the order of expansion without using any domain knowledge
Blind search does not expand nodes randomly; it follows some systematic way to explore the search graph
– Depth-first search
– Breadth-first search
Depth-First Search
Depth-first search picks an arbitrary alternative at every node visited
The search sticks with this partial path and works forward from it; other alternatives at the same level are ignored completely
The deepest nodes are expanded first, and nodes of equal depth are ordered arbitrarily
Depth-First Search (cont.)
Depth-first search generates only one successor at a time
– Graph search generates all successors at a time
When depth-first search reaches a dead end, it goes back to the last decision point and proceeds with another alternative
– Depth-first search can be dangerous because it might pursue a path that is actually an infinite dead end
– A depth bound can be placed to constrain the nodes to be explored
Breadth-First Search
Breadth-first search examines all the nodes on one level before considering any of the nodes on the next level (depth)
Breadth-first search is guaranteed to find a solution if one exists
– It might not find the shortest-distance path, but it is guaranteed to find the one with the fewest cities visited (the minimum-length path)
Breadth-first search may be inefficient when all solutions leading to the goal node are at approximately the same depth
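A compact sketch (my own, for illustration) showing that the two blind strategies differ only in how the OPEN list is used: as a stack for depth-first search, or as a queue for breadth-first search.

```python
from collections import deque

def blind_search(start, successors, is_goal, depth_first=True):
    """Blind graph search; OPEN as a stack -> DFS, as a queue -> BFS."""
    open_list = deque([[start]])
    closed = set()
    while open_list:
        path = open_list.pop() if depth_first else open_list.popleft()
        node = path[-1]
        if is_goal(node):
            return path
        if node in closed:          # bookkeeping: skip already expanded nodes
            continue
        closed.add(node)
        for nxt in successors(node):
            open_list.append(path + [nxt])
    return None
```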
Breadth-First Search (cont.)

[Figure: level-by-level breadth-first expansion of the search tree.]
Heuristic Graph Search
Blind search finds only one arbitrary solution instead of the optimal solution
– To find the optimal solution with depth-first or breadth-first search, the search needs to continue rather than stop when the first solution is discovered
  - After the search reaches all solutions, we can compare them and select the best
– This exhaustive strategy is called British Museum search or brute-force search
Heuristic search takes advantage of heuristic information (domain-specific knowledge) during the search
– Use the heuristic function to re-order the OPEN list in Step 7 of Algorithm 12.1
– Some heuristics can reduce search effort without sacrificing optimality, while others can greatly reduce search effort but provide only sub-optimal solutions
– Best-first (or A*) search and beam search
Best-First Search
Best-first search explores the best node first, since it offers the best hope of leading to the best path; this is why it is called best-first search
A search algorithm is called admissible if it can guarantee to find the optimal solution
– Admissible best-first search is called A* search
f(N) = g(N) + h(N); with h(N) = 0 and g(N) = the depth of node N, best-first search reduces to breadth-first search
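A minimal A* sketch (my own illustration, assuming an additive-cost graph given by a `successors` function): the OPEN list is ordered by f(n) = g(n) + h(n); with h ≡ 0 and unit arc costs it behaves like breadth-first search, as noted above.

```python
import heapq

def a_star(start, goals, successors, h):
    """successors(n) yields (neighbor, arc_cost); h(n) is the heuristic estimate."""
    open_list = [(h(start), 0.0, start, [start])]    # entries are (f, g, node, path)
    best_g = {start: 0.0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)  # expand the best node first
        if node in goals:
            return path, g
        for nxt, cost in successors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):   # keep only the best path to nxt
                best_g[nxt] = g2
                heapq.heappush(open_list, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")
```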
A* Search
History of A* search in AI
– The most widely studied best-first search (Hart & Nilsson, 1968)
– Developed for additive cost measures (the cost of a path = the sum of the costs of its arcs)
Properties
– A* search can sequentially generate multiple recognition candidates
– A* search needs a heuristic function that satisfies the admissibility condition
Admissibility
– The property that a search algorithm is guaranteed to find an optimal solution, if one exists
A* Search – 1st example

[Figures: A* applied to the city-traveling problem; nodes S, A, C, E, G are expanded with f = g + h values such as 3+8.5, 2+10.3, (3+3)+5.7, (3+4)+10.3, (6+3)+2.8, (9+5)+7, and 9+3.]
A* Search – 2nd example

Find a path with the highest score from root node "A" to some leaf node (one of "L1", "L2", "L3", "L4")

$f(n) = g(n) + h(n)$, where
– $g(n)$: accumulated score from the root node to node $n$, i.e., the decoded partial path score
– $h^*(n)$: exact score from node $n$ to a specific leaf node
– $h(n)$: heuristic function of node $n$, the estimated score from node $n$ to a goal node

Admissibility (for this maximization problem): $h(n) \ge h^*(n)$

[Figure: search tree with root A, internal nodes B-G, leaves L1-L4, and arc scores 4, 3, 2, 3, 2, 4, 1, 8, 1, 3.]
A* Search – 2nd example (cont.)
Proving the admissibility of the A* algorithm:

Suppose that when the algorithm terminates, "G" is a complete path on the top of the stack and "p" is a partial path somewhere on the stack. Assume there exists a complete path "P" passing through "p", not equal to "G", that is optimal.

Proof:
1. Since "P" is a complete path passing through "p" and $h(p) \ge h^*(p)$, we have $f(P) \le f(p)$
2. Because "G" is on the top of the stack, $f(G) \ge f(p) \ge f(P)$
3. This contradicts the assumption that "P", not "G", is optimal; hence "G" is optimal

Evaluation function of node $n$: $f(n) = g(n) + h^*(n)$

Node   g(n)   h*(n)   f(n)
A       0      15      15
B       4       9      13
C       3      12      15
D       2       5       7
E       7       4      11
F       7       2       9
G      11       3      14
L1      9       0       9
L2      8       0       8
L3     12       0      12
L4      5       0       5

List or stack (sorted) trace:

Stack Top   Stack Elements
A(15)       A(15)
C(15)       C(15), B(13), D(7)
G(14)       G(14), B(13), F(9), D(7)
B(13)       B(13), L3(12), F(9), D(7)
L3(12)      L3(12), E(11), F(9), D(7)

[Figure: the same search tree with root A, internal nodes B-G, leaves L1-L4, and arc scores 4, 3, 2, 3, 2, 4, 1, 8, 1, 3.]
A* Search – practice
Please find a path with the lowest path cost from the initial state α to one of the four goal states (β1, β2, β3, β4). Each arc is labeled with a number representing the cost of taking it, while each node is labeled with a number standing for the expected cost (the heuristic score/function) from that node to one of the four goal states.

[Figure: the search graph for this exercise.]
A* Search – practice (cont.)
Problems
– What is the first goal state found by depth-first search, which always selects a node's left-most child for path expansion? Is it an optimal solution? What is the total search cost?
– What is the first goal state found by breadth-first search, which always expands all child nodes at the same level from left to right? Is it an optimal solution? What is the total search cost?
– What is the first goal state found by A* search, which uses the path cost plus the heuristic function for path expansion? Is it an optimal solution? What is the total search cost?
– What is the search path cost if A* search is used to sequentially visit all four goal states?
Beam Search
Beam search is a widely used search technique for speech recognition systems
– It is a breadth-first-style search that progresses level by level with the depth
– Unlike traditional breadth-first search, beam search only expands nodes that are likely to succeed at each level; up to w best paths are kept at each level (stage), and the rest are discarded

[Figure: beam search with beam width w = 2.]
Beam Search (cont.)
Step 4 obviously requires sorting, which is time-consuming if the number w×b is huge
– Alternatively, keep only the nodes whose heuristic functions are within some threshold of the best node, and prune away the nodes outside the threshold
Unlike A* search, beam search is an approximate heuristic search that is not admissible
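One beam-search level in Python (my own hedged sketch): expand every surviving path, then keep only the w best-scoring successors.

```python
def beam_step(paths, successors, w):
    """One level of beam search; paths is a list of (path, score) hypotheses.
    successors(node) yields (next_node, local_score) pairs."""
    candidates = [(path + [nxt], s + local)
                  for path, s in paths
                  for nxt, local in successors(path[-1])]
    # keep only the w best partial paths at this level; the rest are pruned
    return sorted(candidates, key=lambda ps: ps[1], reverse=True)[:w]
```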
Search Algorithms for Speech Recognition
Calculating the Probability of an Observation Sequence Given an HMM

Direct evaluation: without using recursion (dynamic programming, DP) or memory
– Huge computation requirements: O(N^T)
  - Exponential computational complexity
More efficient algorithms can be used for the evaluation
– The Forward/Backward procedure (algorithm)
[Figure: state-time trellis in which each of the N states $v_1, v_2, \ldots, v_N$ may be occupied at every time frame, with initial probabilities $\pi_{s_1}, \pi_{s_2}, \pi_{s_3}$ on the starting states.]

$\lambda = (A, B, \pi)$: the state transition probabilities $A$, the state observation probabilities $B$, and the initial state probabilities $\pi$

$$P(O|\lambda) = \sum_{\text{all } Q} P(O, Q|\lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2)\, a_{q_2 q_3} b_{q_3}(o_3) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$$

Complexity: MUL $\approx (2T-1)N^T$, ADD $\approx N^T - 1$
Forward Procedure
Algorithm complexity: O(N²T); based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
– All state sequences, regardless of how long ago they diverged, merge to the N nodes (states) at each time instance t

Forward variable: $\alpha_t(i) = P(o_1 o_2 \cdots o_t,\, q_t = i \,|\, \lambda)$

1. Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
2. Induction: $\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Complexity: MUL $= N(N+1)(T-1) + N \approx N^2 T$; ADD $= (N-1)N(T-1) \approx N^2 T$

[Figure: trellis fragment showing the initial probabilities $\pi_1, \pi_2, \pi_3$, transitions $a_{ij}$, and observation probabilities $b_j(o_t)$ that enter the recursion.]
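A direct numpy transcription of the three steps (my own sketch; it assumes A is the N×N transition matrix, B[j, t] = b_j(o_t), pi is the initial distribution, and time is 0-based):

```python
import numpy as np

def forward(pi, A, B):
    """Forward procedure: returns P(O | lambda); alpha rows follow the
    slides' alpha_t with 0-based time indexing."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]                            # 1. initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, t + 1]    # 2. induction
    return alpha[-1].sum()                             # 3. termination
```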
Backward Procedure
Algorithm complexity: O(N²T)

Backward variable: $\beta_t(i) = P(o_{t+1}\, o_{t+2} \cdots o_T \,|\, q_t = i, \lambda)$

1. Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \ 1 \le i \le N$
3. Termination: $P(O|\lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)$

Complexity: MUL $\approx 2N^2(T-1) \approx N^2 T$; ADD $\approx (N-1)N(T-1) \approx N^2 T$

Combining the two variables — since the observations before and after time t are conditionally independent given $q_t = i$:

$$\alpha_t(i)\,\beta_t(i) = P(o_1 \cdots o_t,\, q_t = i \,|\, \lambda)\; P(o_{t+1} \cdots o_T \,|\, q_t = i, \lambda) = P(O,\, q_t = i \,|\, \lambda)$$

$$\therefore\ P(O|\lambda) = \sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i) \quad \text{for any } t$$
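The mirror-image numpy sketch, under the same assumed conventions as the forward pass above; combining the two gives the same P(O|λ) at any frame:

```python
import numpy as np

def backward(pi, A, B):
    """Backward procedure: returns P(O | lambda); beta rows follow the
    slides' beta_t with 0-based time indexing."""
    N, T = B.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # 1. initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, t + 1] * beta[t + 1])     # 2. induction
    return (pi * B[:, 0] * beta[0]).sum()             # 3. termination

# sanity check under the same model: forward(pi, A, B) == backward(pi, A, B)
```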
Viterbi Algorithm
Algorithm (complexity: O(N²T)):

1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \ \Psi_1(i) = 0, \quad 1 \le i \le N$
2. Induction:
   $\delta_{t+1}(j) = \max_{1 \le i \le N} \big[\delta_t(i)\, a_{ij}\big]\, b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
   $\Psi_{t+1}(j) = \arg\max_{1 \le i \le N} \big[\delta_t(i)\, a_{ij}\big], \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $P^*(O|\lambda) = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Backtracking: $q_t^* = \Psi_{t+1}(q_{t+1}^*), \ t = T-1, T-2, \ldots, 1$

$Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence
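A numpy sketch of the four steps (my own illustration, same assumed conventions as the forward pass); in practice the log-domain version on the next slide is preferred to avoid numerical underflow:

```python
import numpy as np

def viterbi(pi, A, B):
    """Returns the best state sequence and its probability."""
    N, T = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, 0]                          # 1. initialization
    for t in range(T - 1):
        scores = delta[t][:, None] * A               # scores[i, j] = delta_t(i) * a_ij
        psi[t + 1] = scores.argmax(axis=0)           # 2. induction: backpointers
        delta[t + 1] = scores.max(axis=0) * B[:, t + 1]
    q = [int(delta[-1].argmax())]                    # 3. termination
    for t in range(T - 1, 0, -1):                    # 4. backtracking
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta[-1].max()
```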
Viterbi Algorithm (cont.)
In practice, we calculate the logarithm of the path score instead of the score itself:

1. Initialization: $\delta_1(i) = \log \pi_i + \log b_i(o_1), \ \Psi_1(i) = 0, \quad 1 \le i \le N$
2. Induction:
   $\delta_{t+1}(j) = \max_{1 \le i \le N} \big[\delta_t(i) + \log a_{ij}\big] + \log b_j(o_{t+1}), \quad 1 \le t \le T-1, \ 1 \le j \le N$
   $\Psi_{t+1}(j) = \arg\max_{1 \le i \le N} \big[\delta_t(i) + \log a_{ij}\big], \quad 1 \le t \le T-1, \ 1 \le j \le N$
3. Termination: $\log P^*(O|\lambda) = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Backtracking: $q_t^* = \Psi_{t+1}(q_{t+1}^*), \ t = T-1, T-2, \ldots, 1$

$Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence
Isolated Word Recognition
Word boundaries are known
Two search structures
– Lexicon list (linear lexicon)
  - Each word is individually represented as a large composite HMM constructed by concatenating the corresponding subword-level (phone/Initial-Final/syllable) HMMs
  - No sharing of computation between words when performing the search
  - The search becomes a simple pattern recognition problem, and the word with the highest forward probability is chosen as the recognized word
– Tree structure (tree lexicon)
  - Arrange the subword-level (phone/Initial-Final/syllable) representations of the words in a tree structure
  - Each arc stands for an HMM, i.e., a subword-level model
  - Computation is shared between words as much as possible
Isolated Word Recognition (cont.)
Two search structures (cont.)

[Figure: tree lexicon (13 arcs) vs. linear lexicon (18 arcs) for six Mandarin names — 林登賢, 林登輝, 李志賢, 李登輝, 王登賢, 王登輝 — each decomposed into Initial-Final units such as l-in, d-eng, h-uei, shi-ian; the tree shares common prefixes, with an empty arc (j-empt) where needed.]
Isolated Word Recognition (cont.)
More about the tree lexicon
– The idea of using a tree representation was suggested in the 1970s in the CASPERS system and the LAFS system
– When using such a lexical tree with a language model (bigram or trigram) and dynamic programming, many technical details must be taken into account and require a careful structuring of the search space (especially for continuous speech recognition, discussed later); a sketch of the tree construction follows this list
  - Delayed application of the language model until reaching the tree leaf nodes
  - A copy of the lexical tree for each live language model history in dynamic programming
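The sketch referred to above (my own, with made-up toy pronunciations): a tree lexicon built as a nested dict, where common prefixes share arcs and the word identity is known only at a leaf — which is exactly why the language model application is delayed.

```python
def build_tree_lexicon(pronunciations):
    """pronunciations: dict word -> list of subword units (e.g., phones)."""
    root = {}
    for word, units in pronunciations.items():
        node = root
        for u in units:                 # shared prefixes reuse existing arcs
            node = node.setdefault(u, {})
        node["#word"] = word            # word identity known only at the leaf
    return root

# toy entries, hypothetical unit inventory
lex = build_tree_lexicon({"deng-hui": ["d", "eng", "h", "uei"],
                          "deng-shian": ["d", "eng", "shi", "ian"]})
# the "d" -> "eng" arcs are shared; the two words diverge only afterwards
```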
Continuous Speech Recognition (CSR)
Search in CSR is rather complicated, since the search algorithm has to consider the possibility of each word starting at any arbitrary time frame
Some of the earliest recognition systems took a two-stage approach
– First hypothesizing the possible word boundaries
– Then using pattern-matching techniques to recognize the segmented patterns
Due to significant cross-word co-articulation, there is no reliable segmentation algorithm for detecting word boundaries other than doing the recognition itself
CSR – Linear lexicon without a language model

[Figure: word HMMs connected in parallel between a starting state and a collector state, with uniform priors P(w1) = P(w2) = 1/2.]
Continuous Digit Recognition - Viterbi
Unknown number of digits/syllables; search over a three-dimensional grid
At each frame, the maximum value achieved at the end states of all models in the previous frame is propagated and competes for the values of the start states of all models
May result in substitutions, deletions, and insertions; e.g., the correct string 32561 may be recognized as 325561 (insertion) or 3261 (deletion)

[Figure: digit models 1-9 stacked along the state axis against the time axis t.]
Continuous Digit Recognition - Level-Building
Known number of digits/syllables; higher computational complexity, but no deletions or insertions
Number of levels = number of digits in an utterance
Transitions go from the last states of the models at the previous level to the first states of the specific models at the current level

[Figure: four levels (l = 1, ..., 4), each containing the digit models 1-9, laid out against the time axis t.]
Level-Building

[Figure-only slides: level-building trellis illustrations.]
CSR – Linear lexicon with a unigram language model

$$P(W) = \prod_{i=1}^{n} P(w_i)$$
CSR – Linear lexicon with a bigram language model

$$P(W) = P(w_1|\langle s \rangle) \prod_{i=2}^{n} P(w_i|w_{i-1})$$

Katz backoff for an unseen bigram: $P(w_j|w_i) = \alpha(w_i)\, P(w_j)$
CSR – Linear lexicon with a trigram language model

$$P(W) = P(w_1|\langle s \rangle)\, P(w_2|\langle s \rangle, w_1) \prod_{i=3}^{n} P(w_i|w_{i-2}, w_{i-1})$$
Time-Synchronous Viterbi Beam Search
For isolated word recognition using word HMMs, the recognition task is defined as

$$\hat{w} = \arg\max_{i} P(X|w_i) = \arg\max_{i} \sum_{Q} P(X, Q|w_i)$$

For CSR, the goal of decoding is to uncover the best word sequence, so we can approximate the summation with the maximum and find the best state sequence instead:

$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} P(W) \sum_{\text{all possible } Q} P(X, Q|W) \approx \arg\max_{W} \Big[P(W)\, \max_{Q} P(X, Q|W)\Big]$$
Time-Synchronous Viterbi Beam Search (cont.)
Prune unlikely paths during the search; we need some criteria to decide which paths to prune

[Figure: time-synchronous search over the list lexicon of six names (林登輝, 林登賢, 李志賢, 李登輝, 王登賢, 王登輝), each expanded into its subword HMM sequence; low-scoring paths are pruned as the search proceeds frame by frame along the time axis.]
Time-Synchronous Viterbi Beam Search (cont.)
D(t; s_t; w): total cost of the best path up to time t that ends in state s_t of grammar word w
h(t; s_t; w): backtrack pointer for the best path up to time t that ends in state s_t of grammar word w
I(w): the initial state of grammar word w
F(w): the final state of grammar word w
η: the pseudo initial state

$$d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w) = -\log P(s_t \,|\, s_{t-1}; w) - \log b(\mathbf{x}_t \,|\, s_t; w)$$

$$D(t; s_t; w) = \min_{s_{t-1}} \big[D(t-1; s_{t-1}; w) + d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w)\big]$$

$$h(t; s_t; w) = \arg\min_{s_{t-1}} \big[D(t-1; s_{t-1}; w) + d(s_{t-1}, s_t \,|\, \mathbf{x}_t; w)\big]$$

Beam size ~ 5-10% of the entire search space