

SLIDE 1

Structured Discriminative Models for Speech Recognition

Mark Gales - work with Anton Ragni, Austin Zhang, Rogier van Dalen

September 2012

Cambridge University Engineering Department

Symposium on Machine Learning in Speech and Language Processing

SLIDE 2

Overview

  • Acoustic Models for Speech Recognition

– generative and discriminative models

  • Sequence (dynamic) kernels

– discrete and continuous observation forms

  • Combining Generative and Discriminative Models

– generative score-spaces and log-linear models
– efficient feature extraction

  • Training Criteria

– large-margin-based training

  • Initial Evaluation on Noise Robust Speech Recognition

– AURORA-2 and AURORA-4 experimental results

Cambridge University Engineering Department MLSLP 2012 1

SLIDE 3

Acoustic Models

SLIDE 4

Hidden Markov Model - a Generative Model

[Figure: (a) standard HMM phone topology with transition probabilities aij and output distributions bj(); (b) HMM dynamic Bayesian network]

  • Conditional independence assumptions:

– observations conditionally independent of other observations given the state
– states conditionally independent of other states given the previous state

p(O; λ) = Σ_q Π_{t=1}^T P(qt|qt−1) p(ot|qt; λ)

  • Sentence models formed by “glueing” sub-sentence models together
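The factorisation above can be checked numerically with the forward algorithm. A minimal sketch using an invented 2-state, 3-symbol discrete HMM (all transition and output values are illustrative, not from the slides):

```python
import numpy as np

# Toy discrete-observation HMM: 2 emitting states, 3 output symbols.
A = np.array([[0.7, 0.3],       # P(q_t | q_{t-1})
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],  # P(o_t | q_t) for state 1
              [0.1, 0.3, 0.6]]) # ... and state 2
pi = np.array([0.9, 0.1])       # initial state distribution

def likelihood(obs):
    """p(O; lambda) = sum_q prod_t P(q_t|q_{t-1}) p(o_t|q_t) via the forward pass."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

obs = [0, 1, 2]
# Brute-force check: explicitly sum over all 2^3 state sequences.
brute = sum(
    pi[q0] * B[q0, obs[0]] * A[q0, q1] * B[q1, obs[1]] * A[q1, q2] * B[q2, obs[2]]
    for q0 in range(2) for q1 in range(2) for q2 in range(2)
)
print(abs(likelihood(obs) - brute) < 1e-12)  # True
```

The forward recursion gives the same value as the explicit sum over state sequences, at linear rather than exponential cost.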

SLIDE 5

Discriminative Models

  • Classification requires class posteriors P(w|O)

– generative model classification uses Bayes' rule, e.g. for HMMs

P(w|O; λ) = p(O|w; λ)P(w) / Σ_w̃ p(O|w̃; λ)P(w̃)

  • Discriminative model - directly model the posterior [1], e.g. log-linear model

P(w|O; α) = (1/Z) exp(αᵀφ(O, w))

– normalisation term Z (simpler to compute than for the generative model)

Z = Σ_w̃ exp(αᵀφ(O, w̃))

  • BUT still need to decide the form of the features φ(O, w)
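The log-linear posterior is just a softmax over candidate scores αᵀφ(O, w). A minimal sketch over a hand-picked candidate list (the weights and features are illustrative, not from the slides):

```python
import numpy as np

def log_linear_posterior(alpha, feats):
    """P(w|O; alpha) = exp(alpha^T phi(O,w)) / Z over a candidate list.
    `feats[w]` holds phi(O, w) for each hypothesis w."""
    scores = {w: float(alpha @ phi) for w, phi in feats.items()}
    m = max(scores.values())                       # log-sum-exp stabilisation
    Z = sum(np.exp(s - m) for s in scores.values())
    return {w: np.exp(s - m) / Z for w, s in scores.items()}

alpha = np.array([1.0, -0.5])
feats = {"one": np.array([0.2, 0.1]), "two": np.array([0.6, 0.3])}
post = log_linear_posterior(alpha, feats)
print(abs(sum(post.values()) - 1.0) < 1e-12)  # True: posteriors sum to one
```

Note Z is only a sum over the candidate hypotheses, which is why it is simpler to compute than the generative normalisation.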

SLIDE 6

Example Standard Sequence Models

[Figure: dynamic Bayesian networks for the HMM, MEMM and (H)CRF]

  • The segmentation, a, determines the state-sequence q

– maximum entropy Markov model [4]

P(q|O) = Π_{t=1}^T (1/Zt) exp(αᵀφ(qt, qt−1, ot))

– hidden conditional random field (simplified linear form only) [5]

P(q|O) = (1/Z) Π_{t=1}^T exp(αᵀφ(qt, qt−1, ot))
SLIDE 7

Sequence Discriminative Models

  • “Standard” models represent state sequences P(q|O)

– actually want word posteriors P(w|O)

  • Applying discriminative models directly to speech recognition:

1. Number of possible classes is vast

– motivates the use of structured discriminative models

2. Length of observation O varies from utterance to utterance

– motivates the use of sequence kernels to obtain features

3. Number of labels (words) and observations (frames) differ

– addressed by combining solutions to (1) and (2)

SLIDE 8

Code-Breaking Style

  • Rather than handle the complete sequence - split into segments

– perform simpler classification for each segment
– complexity determined by the segment (simplest: word)

[Figure: digit-string utterance split into word-level segments, each classified over {ONE, ..., SIL}]

1. Use the HMM-based hypothesis to obtain word start/end times
2. For each segment ai of a:

– binary SVMs with voting

argmax_{ω∈{ONE,...,SIL}} α(ω)ᵀφ(O{ai}, ω)

  • Limitations of the code-breaking approach [3]

– each segment is treated independently
– restricted to one segmentation, generated by the HMMs

SLIDE 9

Flat Direct Models

[Figure: flat direct model - features extracted from the complete observation sequence o1 ... oT for the whole sentence "<s> the dog chased the cat </s>"]

  • Log-linear model for the complete sentence [7]

P(w|O) = (1/Z) exp(αᵀφ(O, w))

  • Simple model, but lack of structure may cause problems

– extracted feature-space becomes vast (number of possible sentences)
– associated parameter vector is vast
– (possibly) large number of unseen examples

SLIDE 10

Structured Discriminative Models

[Figure: observation sequence segmented into word-level segments, e.g. "dog" and "chased"]

  • Introduce structure into the observation sequence [8] - segmentation a

– comprises: segment identity aiτ and set of observations O{aτ}

P(w|O) = (1/Z) Σ_a exp( αᵀ Σ_{τ=1}^{|a|} φ(O{aτ}, aiτ) )

– segmentation may be at the word, (context-dependent) phone, etc. level

  • What form should φ(O{aτ}, aiτ) have?

– must be able to handle variable-length O{aτ}

SLIDE 11

Features

  • Discriminative model performance is highly dependent on the features

– basic features - second-order statistics - (almost) a discriminative HMM
– simplest approach extends frame features (for each unit w(k)) [6]

φ(O{aτ}, aiτ) = [ ...;
                  Σ_{t∈{aτ}} δ(aiτ, w(k)) ot;
                  Σ_{t∈{aτ}} δ(aiτ, w(k)) ot ⊗ ot;
                  Σ_{t∈{aτ}} δ(aiτ, w(k)) ot ⊗ ot ⊗ ot;
                  ... ]

– features have the same conditional independence assumptions as the HMM

How to extend the range of features?

  • Consider extracting features for a complete segment of speech

– number of frames will vary from segment to segment
– need to map to a fixed dimensionality independent of the number of frames
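The fixed-dimensionality mapping for a single segment can be sketched directly: sum the frames and their outer products, so every segment yields a vector of length d + d², whatever its duration (dimensions and data below are illustrative):

```python
import numpy as np

def segment_features(frames):
    """Map a variable-length segment (T x d array of frames) to a fixed-length
    vector of first- and second-order statistics, as in the feature vector above."""
    frames = np.asarray(frames, dtype=float)
    first = frames.sum(axis=0)                    # sum_t o_t
    second = sum(np.outer(o, o) for o in frames)  # sum_t o_t (x) o_t
    return np.concatenate([first, second.ravel()])

# Segments of different lengths map to the same dimensionality (d + d^2 = 6 here).
short_seg = segment_features(np.ones((3, 2)))
long_seg = segment_features(np.ones((7, 2)))
print(short_seg.shape == long_seg.shape == (6,))  # True
```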

SLIDE 12

Sequence Kernels

SLIDE 13

Sequence Kernel

  • Sequence kernels are a class of kernels that handle sequence data

– also applied in a range of biological applications, text processing and speech
– these kernels may be partitioned into three broad classes

  • Discrete-observation kernels

– appropriate for text data
– string kernels are the simplest form

  • Distributional kernels (not discussed in this talk)

– distances between distributions trained on sequences

  • Generative kernels:

– parametric form: use the parameters of the generative model
– derivative form: use the derivatives with respect to the model parameters

SLIDE 14

String Kernel

  • For speech and text processing input space has variable dimension:

– use a kernel to map from variable to fixed length
– string kernels are an example for text [9]

  • Consider the words cat, cart, bar and a character string kernel

          c-a   c-t   c-r   a-r   r-t   b-a   b-r
φ(cat)     1     λ
φ(cart)    1     λ²    λ     1     1
φ(bar)                       1           1     λ

K(cat, cart) = 1 + λ³,  K(cat, bar) = 0,  K(cart, bar) = 1

  • Successfully applied to various text classification tasks:

– how to make process efficient (and more general)?
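A gappy character-bigram kernel can be sketched by weighting each ordered character pair by λ raised to the number of skipped characters; note the slide's table shows only a subset of the pairs, so this full version reproduces the kernel values between the words that share only the tabulated features:

```python
from collections import defaultdict
from itertools import combinations

def gappy_bigram_features(s, lam):
    """phi: map a string to gappy-bigram weights, each occurrence of an
    ordered character pair weighted by lam**(number of skipped characters)."""
    phi = defaultdict(float)
    for i, j in combinations(range(len(s)), 2):
        phi[s[i] + s[j]] += lam ** (j - i - 1)
    return phi

def string_kernel(s1, s2, lam):
    """K(s1, s2) = inner product of the two gappy-bigram feature vectors."""
    f1, f2 = gappy_bigram_features(s1, lam), gappy_bigram_features(s2, lam)
    return sum(v * f2[k] for k, v in f1.items() if k in f2)

lam = 0.5
print(string_kernel("cat", "bar", lam) == 0)    # True: no shared bigrams
print(string_kernel("cart", "bar", lam) == 1)   # True: only a-r is shared
print(gappy_bigram_features("cart", lam)["ct"] == lam ** 2)  # True: two gaps
```

Computed naively this enumerates all character pairs; the rational-kernel transducer view on the next slide is what makes the computation efficient and general.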

SLIDE 15

Rational Kernels

  • Rational kernels [10] encompass various standard feature-spaces and kernels:

– bag-of-words and N-gram counts, gappy N-grams (string kernel)

  • A transducer, T, for the string kernel (gappy bigram) over vocabulary {a, b}

[Figure: three-state weighted transducer with matched arcs a:a/1 and b:b/1, deletion arcs a:ε/1 and b:ε/1, and gap arcs a:ε/λ and b:ε/λ]

The kernel is: K(Oi, Oj) = w[ Oi ◦ (T ◦ T⁻¹) ◦ Oj ]

  • This form can also handle uncertainty in decoding:

– lattices can be used rather than the 1-best output (Oi)

  • Can also be applied for continuous data kernels [11].

SLIDE 16

Generative Score-Spaces

  • Generative kernels use scores of the following form [12]

φ(O; λ) = [ log p(O; λ) ]

– the simplest form maps a sequence to a 1-dimensional score-space

  • Parametric score-spaces increase the score-space size

φ(O; λ) = [ λ̂(1); ...; λ̂(K) ]

– parameters estimated on O: related to the mean-supervector kernel

  • Derivative score-spaces take the following form

φ(O; λ) = [ ∇λ log p(O; λ) ]

– using the appropriate metric this is the Fisher kernel [13]

SLIDE 17

Combining Generative & Discriminative Models

SLIDE 18

Combining Discriminative and Generative Models

[Figure: pipeline - test data O and hypotheses drive adaptation/compensation of the canonical generative HMM λ; the adapted model produces the score-space φ(O, λ), which feeds the discriminative classifier to give the final hypotheses]

  • Use generative model to extract features [13, 12] (we do like HMMs!)

– adapt the generative model - the discriminative model is speaker/noise independent

  • Use your favourite form of discriminative classifier, for example:

– log-linear model/logistic regression
– binary/multi-class support vector machines

SLIDE 19

Derivative Score-Spaces

  • Need a systematic approach to extracting sufficient statistics

– what about using the sequence-kernel score-spaces? φ(O) = φ(O; λ)
– does this help with the dependencies?

  • For an HMM the mean derivative elements become

∇μ(jm) log p(O; λ) = Σ_{t=1}^T P(qt = {θj, m}|O; λ) Σ(jm)⁻¹ (ot − μ(jm))

– state/component posterior is a function of the complete sequence O
– introduces longer-term dependencies
– different conditional-independence assumptions from the generative model
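The mean-derivative computation can be sketched for the degenerate single-state case, where the HMM reduces to a GMM and the state/component posterior is a per-frame component posterior (model values below are illustrative, 1-D observations for brevity):

```python
import numpy as np

def mean_derivative_features(obs, mu, var, weights):
    """Derivative score-space features for a diagonal-covariance GMM
    (a one-state "HMM"): grad_{mu_m} log p(O) = sum_t gamma_m(t) (o_t - mu_m) / var_m."""
    obs = np.asarray(obs, dtype=float)
    # Per-frame component log-likelihoods (1-D Gaussians) plus log weights.
    ll = -0.5 * (np.log(2 * np.pi * var) + (obs[:, None] - mu) ** 2 / var)
    ll += np.log(weights)
    gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)        # component posteriors
    # Accumulate posterior-weighted mean derivatives over the whole sequence.
    return (gamma * (obs[:, None] - mu) / var).sum(axis=0)

mu = np.array([0.0, 3.0]); var = np.array([1.0, 1.0]); w = np.array([0.5, 0.5])
feats = mean_derivative_features([0.1, 2.9, 3.2], mu, var, w)
print(feats.shape)  # (2,): one derivative feature per component mean
```

In the full HMM case the posteriors come from forward-backward over the complete sequence, which is what introduces the longer-term dependencies noted above.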

SLIDE 20

Score-Space Dependencies

  • Consider a simple 2-class, 2-symbol {A, B} problem:

– Class ω1: AAAA, BBBB
– Class ω2: AABB, BBAA

[Figure: ML-trained HMM topology with P(A) = P(B) = 0.5 in both emitting states]

             Class ω1          Class ω2
Feature     AAAA    BBBB      AABB    BBAA
Log-Lik    −1.11   −1.11     −1.11   −1.11
∇2A         0.50   −0.50      0.33   −0.33
∇2A∇ᵀ2A    −3.83    0.17     −3.28   −0.61
∇2A∇ᵀ3A    −0.17   −0.17     −0.06   −0.06

  • ML-trained HMMs are the same for both classes
  • First derivative: classes separable, but not linearly separable

– also true of the second derivative within a state

  • Second derivative across states: linearly separable

SLIDE 21

Score-Spaces for ASR

  • Forms of score-space used in the experiments:

φa0(O; λ) = [ log p(O; λ(1)); ...; log p(O; λ(K)) ];   φb1μ(O; λ) = [ log p(O; λ(i)); ∇μ(i) log p(O; λ(i)) ]

– appended log-likelihoods: φa0(O; λ)
– derivative (means only, for class ωi): φb1μ(O; λ)
– log-likelihood (for class ωi): φb0(O; λ) = [ log p(O; λ(i)) ]

  • In common with most discriminative models, joint feature-spaces are used:

φ(O, a; λ) = [ Σ_{τ=1}^{|a|} δ(aiτ, w(1)) φ(O{aτ}; λ); ...; Σ_{τ=1}^{|a|} δ(aiτ, w(P)) φ(O{aτ}; λ) ]

for α tied yielding "units" {w(1), ..., w(P)}, with underlying score-space φ(O; λ).

SLIDE 22

General Feature Extraction

[Figure: observation sequence for "dog chased the" with a candidate segment spanning times τ to t]

  • General features depend on all elements of the observation sequence

– consider φ(Oτ:t, wl) for all possible start/end times - T² feature evaluations
– general complexity O(T³) - assuming each evaluation is O(T)

Computationally expensive!

SLIDE 23

Efficient Extraction using Expectation Semiring

[Figure: trellis fragment - forward probabilities αt−1(i), αt−1(k) and accumulated derivative statistics Δαt−1 propagated to node (j, t) as αt(j), Δαt(j)]

  • Efficiently calculate derivative features using expectation semirings [20, 14]

– extend the statistics propagated/combined in the forward pass
– scalar summation extended to vector summation

  • Expectation semirings allow statistics to be accumulated in one pass

– derivative features can be computed for any node in the trellis - O(T²)

SLIDE 24

Handling Speaker/Noise Differences

  • A standard problem with discriminative approaches is adaptation/robustness

– not a problem with generative kernels/score-spaces
– adapt the generative models using model-based adaptation

  • Standard approaches for speaker/environment adaptation

– (Constrained) Maximum Likelihood Linear Regression [15]

xt = A ot + b;   μ(m) = A μx(m) + b

– Vector Taylor Series compensation [16] (used in this work)

μ(m) = C log( exp(C⁻¹(μx(m) + μh(m))) + exp(C⁻¹ μn(m)) )

  • Discriminative model parameters speaker/noise independent.
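The VTS mean compensation is a direct formula once C (the DCT) is fixed. A minimal sketch taking C as the identity (i.e. working in the log-spectral domain) with illustrative mean values:

```python
import numpy as np

# VTS compensation of the static means:
#   mu = C log( exp(C^-1 (mu_x + mu_h)) + exp(C^-1 mu_n) )
# C is taken as the identity here for simplicity; all values are illustrative.
C = np.eye(3)
C_inv = np.linalg.inv(C)

def vts_mean(mu_x, mu_h, mu_n):
    """Corrupted-speech mean from clean-speech, channel and noise means."""
    return C @ np.log(np.exp(C_inv @ (mu_x + mu_h)) + np.exp(C_inv @ mu_n))

mu_x = np.array([1.0, 2.0, 0.5])      # clean speech mean
mu_h = np.zeros(3)                    # channel mean
mu_n = np.array([-5.0, -5.0, -5.0])   # very low noise floor
# With negligible noise the corrupted mean stays close to the clean mean.
print(np.allclose(vts_mean(mu_x, mu_h, mu_n), mu_x, atol=0.01))  # True
```

As the noise mean grows, the compensated mean moves smoothly from the clean-speech mean towards the noise mean, which is the intended masking behaviour.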

SLIDE 25

Training Criteria

SLIDE 26

Simple MMIE Example

  • HMMs are not the correct model - discriminative criteria are a possibility

[Figure: two-class data showing the MLE solution (diagonal covariance) and the MMIE solution decision boundaries]

  • Discriminative criteria are a function of the posteriors P(w|O; λ)

– use them to train the discriminative model parameters α

SLIDE 27

Discriminative Training Criteria

  • Apply discriminative criteria to train the discriminative model parameters α

– Conditional Maximum Likelihood (CML) [21, 22]: maximise

Fcml(α) = (1/R) Σ_{r=1}^R log P(wref(r)|O(r); α)

– Minimum Classification Error (MCE) [23]: minimise

Fmce(α) = (1/R) Σ_{r=1}^R [ 1 + ( P(wref(r)|O(r); α) / Σ_{w≠wref(r)} P(w|O(r); α) )^ϱ ]⁻¹

– Minimum Bayes' Risk (MBR) [24, 25]: minimise

Fmbr(α) = (1/R) Σ_{r=1}^R Σ_w P(w|O(r); α) L(w, wref(r))
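The CML and MBR criteria are simple expectations once the posteriors are available. A minimal sketch over one toy utterance with three hypotheses (all numbers illustrative):

```python
import numpy as np

def cml(post_ref):
    """Conditional ML: mean log-posterior of the references (to maximise)."""
    return float(np.mean(np.log(post_ref)))

def mbr(posteriors, losses):
    """Minimum Bayes' risk: expected loss under the model (to minimise).
    posteriors[r][w] and losses[r][w] for each utterance r, hypothesis w."""
    return float(np.mean([sum(p[w] * l[w] for w in p)
                          for p, l in zip(posteriors, losses)]))

# One utterance, reference "one", with a 0/1 loss over the hypotheses.
post = [{"one": 0.7, "two": 0.2, "ten": 0.1}]
loss = [{"one": 0.0, "two": 1.0, "ten": 1.0}]
print(round(mbr(post, loss), 3))  # 0.3: total posterior mass on the errors
print(cml([0.7]) < 0)             # True: log of a probability is negative
```

With a 0/1 loss, MBR reduces to one minus the reference posterior, i.e. the expected sentence error.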

SLIDE 28

Large Margin Based Criteria

[Figure: hinge-like cost against the log-posterior-ratio - zero beyond the margin, rising through the correct region into errors]

  • Standard criterion for SVMs

– improves generalisation

  • Require the log-posterior-ratio

min_{w≠wref} log( P(wref|O; α) / P(w|O; α) )

to be beyond the margin

  • As sequences are being used, the margin can be made a function of the "loss" - minimise

Flm(α) = (1/R) Σ_{r=1}^R [ max_{w≠wref(r)} { L(w, wref(r)) − log( P(wref(r)|O(r); α) / P(w|O(r); α) ) } ]₊

using the hinge-loss [f(x)]₊. Many variants are possible [26, 27, 28, 29].
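The criterion above can be sketched directly from log-posteriors: find the worst margin violation per utterance, hinge it, and average (the posterior values below are illustrative):

```python
import numpy as np

def large_margin_loss(log_posts, refs, loss_fn):
    """F_lm: per utterance, hinge the worst violation
    max_{w != ref} { L(w, ref) - (log P(ref|O) - log P(w|O)) }, then average."""
    total = 0.0
    for posts, ref in zip(log_posts, refs):
        viol = max(loss_fn(w, ref) - (posts[ref] - posts[w])
                   for w in posts if w != ref)
        total += max(viol, 0.0)          # hinge [f(x)]_+
    return total / len(refs)

zero_one = lambda w, ref: 0.0 if w == ref else 1.0
# Log-posteriors for one utterance; "one" is the reference.
lp = [{"one": np.log(0.7), "two": np.log(0.2), "ten": np.log(0.1)}]
f = large_margin_loss(lp, ["one"], zero_one)
# log(0.7/0.2) ~= 1.25 and log(0.7/0.1) ~= 1.95 both exceed the loss-scaled
# margin of 1, so the hinge is inactive and the criterion is zero.
print(f)  # 0.0
```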

SLIDE 29

Relationship to (Structured) SVM

  • Commonly add a Gaussian prior for regularisation

F(α) = −log N(α; μα, Σα) + Flm(α)

  • Make the posteriors a log-linear model (α) with a generative score-space (λ) [30]

– restrict the parameters of the prior: N(α; μα, Σα) = N(α; 0, CI)

F(α) = (1/2)||α||² + (C/R) Σ_{r=1}^R [ max_{w≠wref(r)} { L(w, wref(r)) − αᵀφ(O(r), wref(r); λ) + αᵀφ(O(r), w; λ) } ]₊

  • Standard result - it's a structured SVM [31, 30]

SLIDE 30

Structured SVM Training

  • Training α so that αᵀφ(O, w) is maximal for the correct reference wref:

[Figure: training samples (O(r), wref(r)), e.g. "1 2 3" and "4 5 6", contrasted against competing hypotheses such as "0 0 0" and "9 9 9"]

  • General unconstrained form: use the cutting plane algorithm to solve [32, 33]

(1/2)||α||² + (C/R) Σ_{r=1}^R [ −αᵀφ(O(r), wref(r)) (linear) + max_{w≠wref(r)} { L(w, wref(r)) + αᵀφ(O(r), w) } (convex) ]₊
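Evaluating the objective above is straightforward once joint features are fixed; a minimal sketch over one toy training sample (features and weights illustrative - a real system would minimise this with the cutting-plane algorithm rather than just evaluate it):

```python
import numpy as np

def ssvm_objective(alpha, data, C, loss_fn):
    """1/2 ||alpha||^2 + C/R sum_r [ max_{w != ref} { L(w, ref)
    + alpha^T phi(O, w) } - alpha^T phi(O, ref) ]_+  over toy joint features."""
    hinge = 0.0
    for feats, ref in data:                  # feats[w] = phi(O, w)
        score_ref = float(alpha @ feats[ref])
        worst = max(loss_fn(w, ref) + float(alpha @ f)
                    for w, f in feats.items() if w != ref)
        hinge += max(worst - score_ref, 0.0)
    return 0.5 * float(alpha @ alpha) + (C / len(data)) * hinge

zero_one = lambda w, ref: 0.0 if w == ref else 1.0
feats = {"one": np.array([2.0, 0.0]), "two": np.array([0.0, 1.0])}
data = [(feats, "one")]
# This alpha scores the reference 2.0 vs 1.0 for the loss-augmented competitor,
# so the hinge term vanishes and only the regulariser remains.
print(ssvm_objective(np.array([1.0, 0.0]), data, C=1.0, loss_fn=zero_one))  # 0.5
```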

SLIDE 31

Handling Latent Variables

  • Ignored the issue of alignment so far

– for the SSVM it is necessary to use the "best" segmentation

  • Simplest solution is to use the single segmentation from the original HMM

âhmm = argmax_a { log P(a|O, w; λ) } = argmax_a { log( p(O|a, w; λ) P(a|w; λ) ) }

– equivalent of phone/word-marking lattices
– BUT the underlying model changes: would like

â = argmax_a { log p(O|a, w; λ, α) + log P(a|w; λ, α) }

This maps into a Concave-Convex Procedure (CCCP) [34]:

−max_a { αᵀφ(O(i), wref(i), a) } (concave) + max_{w≠wref, a} { L(w, wref(i)) + αᵀφ(O(i), w, a) } (convex)

SLIDE 32

Evaluation Tasks

SLIDE 33

Preliminary Evaluation Tasks

  • AURORA-2 small vocabulary digit string recognition task

– whole-word models, 16 emitting states with 3 components per state
– clean training data for HMM training - HTK parametrisation
– Set B and Set C unseen noise conditions, even for multi-style data
– noise estimated in an ML fashion for each utterance

  • AURORA-4 medium vocabulary speech recognition

– training data from WSJ0 SI-84 to train clean acoustic models
– state-clustered cross-word triphones (≈3k states, ≈50k components)
– noises added over a 5–15dB SNR range
– noise estimated in an ML fashion for each utterance

  • WARNING: optimisation techniques improved over time

– don't compare results across tables!

SLIDE 34

AURORA-2 - Training Criterion

Model       Criterion   Test set              Avg
                        A      B      C
HMM         —           9.8    9.1    9.5     9.5
LLM (φa0)   CML         8.1    7.7    8.3     8.1
            MWE         7.9    7.4    8.2     7.9
            LM          7.8    7.3    8.0     7.6

  • All approaches yield gains over the baseline VTS system

– very few additional parameters (12 × 12 = 144) added for the log-linear models (though these parameters are discriminatively trained)

  • Large-margin log-linear model will be referred to as Structured SVM

SLIDE 35

AURORA-2 - Support Vector Machines

Model   Features   Test set              Avg
                   A      B      C
HMM     —          9.8    9.1    9.5     9.5
SVM     φa0        9.1    8.7    9.2     9.0
MSVM    φa0        8.3    8.1    8.6     8.3
SSVM    φa0        7.8    7.3    8.0     7.6

  • Possible to compare the SSVM with more standard SVMs

– segmentations for the SVMs and multi-class SVMs (MSVMs) obtained from the HMM
– majority voting (HMM decision for ties with the standard SVM)

  • The difference between the MSVM and SSVM is the fixed HMM segmentation

– this does have an important impact on the performance

SLIDE 36

AURORA-2 - Derivative Score-Spaces - MWE Criterion

HMM    SDM     â       Test set              Avg
                       A      B      C
VTS    —       —       9.8    9.1    9.5     9.5
       φb1μ    âhmm    7.0    6.6    7.6     7.0
       φb1μ    â       6.8    6.4    7.3     6.7
VAT    —       —       8.9    8.3    8.8     8.6
       φb1μ    âhmm    6.6    6.5    7.0     6.6
       φb1μ    â       6.2    6.1    6.8     6.3
DVAT   —       —       6.7    6.6    7.0     6.7
       φb1μ    âhmm    6.1    6.2    6.7     6.3
       φb1μ    â       6.1    6.1    6.6     6.2

  • Derivative score-spaces (φb1μ) give consistent gains over all baseline HMM systems

– the derivative score-space is larger (1873 dimensions for each base score-space)
– adds approximately 50% more parameters to the system

SLIDE 37

AURORA-4 - Derivative Score-Space - MPE Criterion

System      Test set                            Avg
            A      B      C      D
VTS         7.1    15.3   12.1   23.1    17.9
VAT         8.6    13.8   12.0   20.1    16.0
DVAT        7.2    12.8   11.5   19.7    15.3
VAT+φb0     7.7    13.1   11.0   19.5    15.3
VAT+φb1μ    7.4    12.6   10.7   19.0    14.8

  • Contrast of the DVAT system with the log-linear system (4020 classes)

– a single-dimension score-space (φb0) with the VAT system yields DVAT performance

  • Gains from the derivative score-space are disappointing (limited training data)

– need to look at DVAT+φb1μ (need to try on more data)

SLIDE 38

Conclusions

  • Combination of generative and discriminative models

– use generative models to derive features for the discriminative model
– robustness and adaptation achieved by adapting the underlying acoustic model

  • Derivative features of generative models

– different conditional independence assumptions from the underlying model
– a systematic way to incorporate different dependencies into the model

  • Large margin training criterion

– yields a structured SVM (use standard optimisation code)
– still an issue scaling to large tasks/score-spaces

Interesting classifier options - without throwing away HMMs

SLIDE 39

Acknowledgements

  • This work has been funded from the following sources:

– Cambridge Research Lab, Toshiba Research Europe Ltd
– EPSRC - Generative Kernels and Score-Spaces for Classification of Speech

SLIDE 40

References

[1] C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, 2006.
[2] G. Zweig et al., "Speech recognition with segmental conditional random fields: A summary of the JHU CLSP Summer workshop," in Proc. ICASSP, 2011.
[3] V. Venkataramani, S. Chakrabartty, and W. Byrne, "Support vector machines for segmental minimum Bayes risk decoding of continuous speech," in Proc. ASRU, 2003.
[4] H-K. Kuo and Y. Gao, "Maximum entropy direct models for speech recognition," IEEE Transactions on Audio, Speech and Language Processing, 2006.
[5] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, "Hidden conditional random fields for phone classification," in Proc. Interspeech, 2005.
[6] S. Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, and H. Ney, "Investigations on features for log-linear acoustic models in continuous speech recognition," in Proc. ASRU, 2009, pp. 52–57.
[7] P. Nguyen, G. Heigold, and G. Zweig, "Speech recognition with flat direct models," IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 994–1006, 2010.
[8] M. J. F. Gales, S. Watanabe, and E. Fosler-Lussier, "Structured discriminative models for speech recognition," IEEE Signal Processing Magazine, 2012.
[9] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2, pp. 419–444, 2002.
[10] C. Cortes, P. Haffner, and M. Mohri, "Weighted automata kernels - general framework and algorithms," in Proc. Eurospeech, 2003.
[11] M. I. Layton and M. J. F. Gales, "Acoustic modelling using continuous rational kernels," Journal of VLSI Signal Processing Systems, August 2007.
[12] N. D. Smith and M. J. F. Gales, "Speech recognition using SVMs," in Advances in Neural Information Processing Systems, 2001.
[13] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Advances in Neural Information Processing Systems 11, 1999, pp. 487–493.
[14] R. C. van Dalen, A. Ragni, and M. J. F. Gales, "Efficient decoding with continuous rational kernels using the expectation semiring," Tech. Rep. CUED/F-INFENG/TR.674, 2012.

SLIDE 41
[15] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[16] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, "HMM adaptation using vector Taylor series for noisy speech recognition," in Proc. ICSLP, Beijing, China, 2000.
[17] M. Layton, Augmented Statistical Models for Classifying Sequence Data, Ph.D. thesis, Cambridge University, 2006.
[18] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," in Proc. ASRU, 2009, pp. 152–157.
[19] A. Ragni and M. J. F. Gales, "Structured discriminative models for noise robust continuous speech recognition," in Proc. ICASSP, 2011, pp. 4788–4791.
[20] J. Eisner, "Parameter estimation for probabilistic finite-state transducers," in Proc. ACL, 2002.
[21] P. S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. Information Theory, 1991.
[22] P. C. Woodland and D. Povey, "Large scale discriminative training of hidden Markov models for speech recognition," Computer Speech & Language, vol. 16, pp. 25–47, 2002.
[23] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, 1992.
[24] J. Kaiser, B. Horvat, and Z. Kacic, "A novel loss function for the overall risk criterion based discriminative training of HMM models," in Proc. ICSLP, 2000.
[25] W. Byrne, "Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition," IEICE Special Issue on Statistical Modelling for Speech Recognition, 2006.
[26] F. Sha and L. K. Saul, "Large margin Gaussian mixture modelling for phonetic classification and recognition," in Proc. ICASSP, 2007.
[27] J. Li, M. Siniscalchi, and C.-H. Lee, "Approximate test risk minimization through soft margin training," in Proc. ICASSP, 2007.
[28] G. Heigold, T. Deselaers, R. Schlüter, and H. Ney, "Modified MMI/MPE: A direct evaluation of the margin in speech recognition," in Proc. ICML, 2008.
[29] G. Saon and D. Povey, "Penalty function maximization for large margin HMM training," in Proc. Interspeech, 2008.
[30] S.-X. Zhang, A. Ragni, and M. J. F. Gales, "Structured log linear models for noise robust speech recognition," IEEE Signal Processing Letters, vol. 17, pp. 945–948, 2010.

SLIDE 42
[31] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," Journal of Machine Learning Research, vol. 6, pp. 1453–1484, 2005.
[32] T. Joachims, T. Finley, and C.-N. J. Yu, "Cutting-plane training of structural SVMs," Machine Learning, vol. 77, no. 1, pp. 27–59, 2009.
[33] S.-X. Zhang and M. J. F. Gales, "Extending noise robust structured support vector machines to larger vocabulary tasks," in Proc. ASRU, 2011.
[34] C.-N. Yu and T. Joachims, "Learning structural SVMs with latent variables," in Proc. ICML, 2009.
[35] W. M. Campbell, D. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, 2006.

SLIDE 43

Distributional Kernels

  • General family of kernels that operate on distances between distributions

– using the available data, estimate a distribution for each sequence:

λ(i) = argmax_λ { log p(Oi; λ) }

  • Forms of kernel are normally based on distances between the distributions (fi is the distribution with parameters λ(i))

– Kullback-Leibler divergence:

KL(fi||fj) = ∫ fi(O) log( fi(O) / fj(O) ) dO

– Bhattacharyya affinity measure:

B(fi, fj) = ∫ √( fi(O) fj(O) ) dO
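For Gaussians both quantities have closed forms, which makes the distributional-kernel idea easy to check; a minimal sketch for univariate Gaussians (the closed-form expressions are standard results, not from the slides):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """KL(f_i || f_j) between univariate Gaussians N(mu1, s1^2), N(mu2, s2^2)."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """B(f_i, f_j) = int sqrt(f_i f_j) dO for univariate Gaussians."""
    return (np.sqrt(2 * s1 * s2 / (s1**2 + s2**2))
            * np.exp(-(mu1 - mu2)**2 / (4 * (s1**2 + s2**2))))

# Identical distributions: zero divergence, affinity one.
print(np.isclose(kl_gauss(0.0, 1.0, 0.0, 1.0), 0.0))             # True
print(np.isclose(bhattacharyya_gauss(0.0, 1.0, 0.0, 1.0), 1.0))  # True
```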

SLIDE 44

Joint Feature-Space Example

[Figure: joint feature-space example - an utterance segmented into "ONE", "ONE", "TWO" (hypothesis "Three"), with generative features log p(O{aτ}; λ(1)) ... log p(O{aτ}; λ(K)) accumulated into class-specific blocks of the joint feature vector]

  • Size of the joint feature-space is the product of:

1. feature-space size (K) - determined by the generative model
2. number of α classes (P) - determined by the discriminative model

  • Segmentation of the sentence will alter the scores

SLIDE 45

GMM Mean-Supervector Kernel

  • The GMM mean-supervector kernel is derived from a range of approximations [35]

– use the symmetric KL-divergence: KL(fi||fj) + KL(fj||fi)
– use the matched-pair KL-divergence approximation
– GMM distributions only differ in terms of the means
– use the polarisation identity

  • Form of kernel is

K(Oi, Oj; λ) = Σ_{m=1}^M cm μ(im)ᵀ Σ(m)⁻¹ μ(jm)

– μ(im) is the mean (ML or MAP) for component m using sequence Oi

  • Used in a range of speaker verification applications

– BUT required to explicitly operate in feature-space
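The kernel itself is a weighted sum of inner products over the component means; a minimal sketch for diagonal covariances (all component weights, variances and adapted means below are illustrative):

```python
import numpy as np

def supervector_kernel(mus_i, mus_j, weights, inv_vars):
    """K(O_i, O_j) = sum_m c_m mu_i(m)^T Sigma(m)^-1 mu_j(m)
    with diagonal covariances (inv_vars[m] holds the inverse variances)."""
    return sum(c * float((mi * iv) @ mj)
               for c, mi, mj, iv in zip(weights, mus_i, mus_j, inv_vars))

weights = [0.6, 0.4]
inv_vars = [np.array([1.0, 2.0]), np.array([0.5, 1.0])]
mus_a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # means adapted to O_i
mus_b = [np.array([1.0, 1.0]), np.array([2.0, 0.0])]   # means adapted to O_j
print(supervector_kernel(mus_a, mus_b, weights, inv_vars))  # 0.6
```

Because the feature map is explicit (the stacked, variance-normalised means), this kernel requires operating directly in the feature space, as noted above.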

SLIDE 46

AURORA-2 - Optimising Segmentation

Model   Training        Segmentation {trn, tst}   Test set              Avg
                                                  A      B      C
HMM     —               —                         9.8    9.1    9.5     9.5
SSVM    n-slack         {âhmm, âhmm}              7.8    7.3    8.0     7.6
                        {âhmm, â}                 7.6    7.2    8.0     7.5
SSVM    n-slack batch   {âhmm, âhmm}              7.9    7.4    8.2     7.8
                        {âhmm, â}                 7.8    7.2    8.0     7.6
                        {â, â}                    7.6    7.1    7.8     7.4
SSVM    1-slack         {âhmm, â}                 7.6    7.3    7.9     7.5

  • Just using the HMM segmentation is suboptimal in terms of WER

– the n-slack batch and 1-slack schemes perform similarly to the full approach

SLIDE 47

AURORA-4 - Structured SVM Results

  • SSVM training configuration:

– 1-slack variable training
– prior distribution matched to the score-space φa0, mean set to 1/(LM scale)
– α tied at the monophone level (47 classes)

Model   Segmentation {trn, tst}   Test set                            Avg
                                  A      B      C      D
HMM     —                         7.1    15.3   12.1   23.1    17.9
SSVM    {âhmm, âhmm}              7.5    14.3   11.4   21.9    16.9
        {âhmm, â}                 7.4    14.2   11.3   21.9    16.8

  • SSVM gains over the baseline HMM-VTS system

– disappointing gain from segmentation - though only applied in test at the moment
– working on the optimal training segmentation as well

SLIDE 48

AURORA-4 - Derivative Score-Space

Classes    System   Comp   Test set                            Avg
(tied α)            tied   A      B      C      D
—          VTS      —      7.1    15.3   12.1   23.1    17.9
47         φb1μ     yes    7.5    14.1   11.3   21.6    16.6
                    no     7.4    14.3   11.7   21.9    16.9
4020       φb1μ     yes    6.8    13.7   10.6   21.3    16.2
                    no     6.7    13.5   10.2   21.1    16.0

  • MPE training for the log-linear model parameters

– derivative score-spaces give large gains over the (ML VTS) baseline

  • Component tying is important for heavily tied α (47 monophone classes)

SLIDE 49

Efficient Feature Extraction

SLIDE 50

Standard HMM Algorithms

[Figure: trellis over states and time, highlighting node (j, t)]

  • Efficient training and inference

– based on the forward-backward/Viterbi algorithms

γt(j) = P(qt(j)|O1:T; λ) = (1 / p(O1:T; λ)) · p(O1:t, qt(j); λ) · p(Ot+1:T|qt(j); λ)

– time/memory requirement O(T) + O(T)
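The posterior decomposition above can be sketched with one forward and one backward pass over a small invented discrete HMM (all probabilities illustrative):

```python
import numpy as np

# Toy 2-state, 3-symbol HMM used to compute gamma_t(j) via forward-backward.
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.9, 0.1])

def state_posteriors(obs):
    T, N = len(obs), len(pi)
    fwd = np.zeros((T, N)); bwd = np.zeros((T, N))
    fwd[0] = pi * B[:, obs[0]]
    for t in range(1, T):                   # forward: p(O_{1:t}, q_t)
        fwd[t] = (fwd[t - 1] @ A) * B[:, obs[t]]
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):          # backward: p(O_{t+1:T} | q_t)
        bwd[t] = A @ (B[:, obs[t + 1]] * bwd[t + 1])
    gamma = fwd * bwd                       # intersect and normalise by p(O_{1:T})
    return gamma / gamma.sum(axis=1, keepdims=True)

gamma = state_posteriors([0, 1, 2])
print(np.allclose(gamma.sum(axis=1), 1.0))  # True: posteriors sum to one per frame
```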

SLIDE 51

Structured Discriminative Models

[Figure: observation sequence o1 ... oT segmented into the word-level segments "dog", "chased", "the" with segment boundaries around τ−1, τ, τ+1]

  • Relate speech segments to words [17, 18, 19]

P(w1:L|O1:T; α) = (1/Z) Σ_a exp( αᵀ Σ_{τ=1}^{|a|} φ(O{aτ}, aiτ) )

– alignment unknown: marginalised over in training (or the 1-best taken)

  • Features extracted from the variable-length observation sequence O{aτ}

– need to use a sequence kernel or score-space

SLIDE 52

Forward/Backward Caching

  • Cache all state-level forward probabilities - O(T) forward passes
  • For each of the possible O(T) start-times:

– compute backward probabilities - O(T) possible backward passes
– the intersection of forward/backward yields the required posterior

  • BUT need to accumulate statistics for each start/end time - total O(T³)

SLIDE 53

Segmentation

[Figure: multi-level segmentation of "dog chased" - word segments subdivided into phone segments /d/ /ao/ /g/ /ch/ ...]

  • Segmentation can be viewed at multiple levels

– sentence: yields the flat direct model - standard problems
– word: easy implementation for small vocabularies, sparsity issues
– phone: may be context-dependent
– state: very flexible, but a large number of segments

  • Multiple levels of segmentation can be used/combined

– multiple segmentations can be used to derive features

  • Training/inference either marginalises over or picks the best segmentation

SLIDE 54

Approximate Training/Inference Schemes

  • If HMMs are being used anyway - use them for segmentation - O(T)

– simplest approach: use the Viterbi (1-best) segmentation from the HMM, âhmm
– use this fixed segmentation in training and test - highly efficient

P(w|O) ≈ (1/Z) Π_{τ=1}^{|âhmm|} exp( αᵀφ(O{âhmm,τ}, âihmm,τ) );   âhmm = argmax_a { p(O|a; λ) P(a) }

  • Assumption: segmentation not dependent on the discriminative model parameters

– unclear how accurate/appropriate this is for ASR

  • Efficient inference and feature extraction are described in [14]
