

SLIDE 1

Fundamental Equations

Bayes' decision rule:

  ω̂ = arg max_ω {P(ω|O)} = arg max_ω {P(ω) · P(O|ω)}

P(O|ω) — acoustic model: for word sequence ω, how likely are features O?
P(ω) — language model: how likely is word sequence ω?
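The decision rule above can be sketched with a toy search over a handful of hypotheses; the candidate word sequences and their log-scores below are made up purely for illustration.

```python
# Toy illustration of Bayes' decision rule: pick the word sequence w that
# maximizes P(w) * P(O|w), i.e. the sum of log LM and log AM scores.
def decide(candidates):
    """candidates: dict mapping word sequence -> (log P(w), log P(O|w))."""
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])

hyps = {
    "the dog ate my homework": (-7.1, -120.4),    # (log LM, log AM), illustrative
    "the dog eight my homework": (-13.5, -119.8),
    "thud dig ate may homework": (-19.0, -121.2),
}
best = decide(hyps)
print(best)  # the LM prefers the first hypothesis despite a slightly worse AM score
```

Note how the second hypothesis has the best acoustic score but loses once the language model prior is added in.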

SLIDE 2

Lecture 9: Speaker Adaptation

Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom

Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{picheny,bhuvana,stanchen,nussbaum}@us.ibm.com

8 April 2016

SLIDE 3

Where Are We?

1 Introduction
2 Segmentation and Clustering
3 Maximum Likelihood Linear Regression
4 Feature-based Maximum Likelihood Linear Regression
5 Speaker Adaptive Training

SLIDE 4

Problem: Sources of Variability

Gender: male / female.
Age: young / old.
Accents: Texas, South Carolina.
Environment noise: office, car, shopping mall.
Different types of microphone.
Channel characteristics: high-quality, telephone, mobile phone.
Question: Are all these effects covered in training?

SLIDE 5

Changing Conditions (I)

Training data: should represent the test data adequately.
Problem: there will always be new speakers or conditions.
Consequence: what will happen? Recognition performance drops.

SLIDE 6

Changing Conditions (II)

Why does the performance drop? The features differ from those seen in training.
Situation in training: a large amount of data from a specific set of speakers.
Situation in recognition: a small amount of data from a target speaker.
What can we do?

SLIDE 7

Adaptation vs. Normalization

What can we do to overcome the mismatch between training and recognition? Change the features O or the acoustic model P(O|ω, θ).
O: feature sequence. ω: word sequence. θ: free model parameters.
Model-based vs. feature-based:
Modify the model to better fit the features ⇒ adaptation.
Transform the features to better fit the model ⇒ normalization.

SLIDE 8

Terminology

Speaker: really a concept covering different signal conditions.
Speaker-independent (SI) system: trained on the complete data.
Speaker-dependent (SD) system: trained on the data of one speaker.
Speaker-adaptive (SA) system: SI system adapted using the speaker-dependent data.

SLIDE 9

Adaptation/Normalization Types

Supervised vs. unsupervised: is the correct transcription of the utterance available at test time?
Batch vs. incremental adaptation/normalization: the whole test data vs. only a small (time-critical) portion is available.
Online vs. offline system: real-time demand vs. no time restriction.

SLIDE 10

Question and Answer (I)

What is the concept of a speaker?
Are all speakers covered in the training data? What happens?
How do we approach the problem of unseen speakers?
Supervised vs. unsupervised?
Batch vs. incremental?

SLIDE 11

Strategies Summary

Goal: fit the acoustic model/features to the speaker, using new acoustic adaptation data from the current speaker.
Model-based vs. feature-based: modify the model to better fit the features ⇒ adaptation; transform the features to better fit the model ⇒ normalization.
Supervised vs. unsupervised: a transcription is available for adaptation ⇒ supervised; no transcription is available ⇒ unsupervised.
Training: normalization/adaptation also applied in training ⇒ Speaker Adaptive Training (SAT).
Incremental vs. batch: adaptation on small parts only ⇒ incremental; adaptation on all data ⇒ batch.

SLIDE 12

Transformation of a Random Variable

Consider a random variable O with density P(O), and a transform O′ = f(O) (assume f can be inverted).

Then the density P(O′) is:

  P(O′) = |d f(O) / d O|⁻¹ · P(f⁻¹(O′)) = |d f(O) / d O|⁻¹ · P(O)

with Jacobian determinant |d f(O) / d O|, or equivalently:

  P(O) = |d f(O) / d O| · P(O′)
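This change-of-variables formula can be checked numerically; a minimal sketch for an affine transform O′ = a·O + b (illustrative values of a, b, and a standard-normal O):

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    # Density of a univariate Gaussian N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

a, b = 2.0, 1.0          # transform parameters (illustrative)
o = 0.7                  # some observation
o_prime = a * o + b      # transformed observation

# If O ~ N(0, 1), then O' = a*O + b ~ N(b, a^2), so P(O') is a Gaussian with
# mean b and stddev |a|. The formula above predicts P(O') = P(O) / |df/dO| = P(O) / |a|.
p_transformed = gauss_pdf(o_prime, mu=b, sigma=abs(a))
p_formula = gauss_pdf(o) / abs(a)
print(abs(p_transformed - p_formula) < 1e-12)  # True
```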

SLIDE 13

Supervised Normalization and Adaptation

Estimation of the adaptation parameters Φ; the correct transcript ω of the adaptation data is given.

Model-based: θ′ = f(θ, Φ)

  Φ̂ = arg max_Φ P(O|ω, θ′)

Feature-based: O′ = f(O, Φ), with Jacobian determinant |d f(O, Φ) / d O|:

  Φ̂ = arg max_Φ |d f(O, Φ) / d O| · P(O′|ω, θ)

SLIDE 14

Unsupervised Normalization and Adaptation

Joint estimation of the adaptation parameters and the adaptation word sequence.

Model-based: θ′ = f(θ, Φ)

  (ω̂, Φ̂) = arg max_{ω,Φ} P(O|ω, θ′)

Feature-based: O′ = f(O, Φ)

  (ω̂, Φ̂) = arg max_{ω,Φ} |d f(O, Φ) / d O| · P(O′|ω, θ)

In practice this joint optimization is infeasible; instead, the transcript ω of the adaptation data is approximated.

SLIDE 15

Unsupervised Normalization and Adaptation: First-Best Approximation

A speaker-independent system generates the first-best output.
Estimation is performed exactly as in the supervised case, but uses the first-pass output as the transcription.
Most popular method.

SLIDE 16

Unsupervised Normalization and Adaptation: Word-Graph Approximation

A first-pass recognition generates a word graph.
Use the forward-backward algorithm as in the supervised case, based on the word graph.
Weighted accumulation over the graph's hypotheses.

[Figure: word graph with competing arcs, e.g. THE/THIS/THUD → DIG/DOG/DOGGY → ATE/EIGHT → MAY/MY]

SLIDE 17

Question and Answer (II)

Supervised vs. unsupervised adaptation?
Approximations?

SLIDE 18

Where Are We?

1 Introduction
2 Segmentation and Clustering
3 Maximum Likelihood Linear Regression
4 Feature-based Maximum Likelihood Linear Regression
5 Speaker Adaptive Training

SLIDE 19

Online System

Objective: generate a transcription from audio in real time.
The audio has multiple unknown speakers and conditions.
Requires fast adaptation with very little data (a couple of seconds).
Can benefit from incremental adaptation, which continuously updates the adaptation as new data arrives.

SLIDE 20

Offline System (I)

Objective: generate a transcription from audio. Multiple passes over the data are allowed.
The audio has multiple unknown speakers and conditions.
First step: segmentation and clustering.

SLIDE 21

Offline System (II)

Where are the speakers and conditions? ⇒ Segmentation: partitioning of the audio into homogeneous regions, ideally one speaker/condition per segment.
What are the speakers and conditions? ⇒ Clustering: grouping of segments into similar speakers/conditions.
The speakers are unknown and no transcribed audio data is available ⇒ unsupervised adaptation.

SLIDE 22

Audio Segmentation (I)

Objective: split the audio stream into homogeneous regions.
Properties: speaker identity, recording condition (e.g. background noise, telephone channel), signal type (e.g. speech, music, noise, silence), and spoken word sequence.

SLIDE 23

Audio Segmentation (II)

Segmentation affects speech recognition performance:
Speaker adaptation and speaker clustering assume one speaker per segment.
The language model assumes sentence boundaries at segment ends.
Non-speech regions cause insertion errors.
Overlapping speech is not recognized correctly and causes errors in surrounding regions.

SLIDE 24

Audio Segmentation: Methods

Metric-based: compute a distance between adjacent regions and segment at the maxima of the distances. Distances: Kullback-Leibler distance, Bayesian information criterion.
Model-based: classify regions using precomputed models for music, speech, etc., and segment at changes of the acoustic class.
Decoder-guided: apply speech recognition to the input audio stream and segment at silence regions. Other decoder output is useful too.

SLIDE 25

Audio Segmentation: Bayesian Information Criterion

The Bayesian Information Criterion (BIC) is a likelihood criterion for a model Θ given observations O:

  BIC(Θ, O) = log p(O|Θ) − (λ/2) · d(Θ) · log(N)

d(Θ): number of parameters in Θ; λ: penalty weight for model complexity; N: number of observations.
Used for model selection: choose the model maximizing the BIC.
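The criterion above can be sketched for a single 1-D Gaussian fitted by maximum likelihood; the observations below are the toy sequence from the change-point example later in the lecture, reused here purely for illustration.

```python
import math

def gaussian_loglik(obs):
    # Log-likelihood of a 1-D Gaussian at its maximum-likelihood fit:
    # -N/2 * (log(2*pi*var) + 1), with the MLE (population) variance.
    n = len(obs)
    mu = sum(obs) / n
    var = sum((o - mu) ** 2 for o in obs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def bic(loglik, num_params, n, lam=1.0):
    # BIC = log p(O|Theta) - (lam/2) * d(Theta) * log(N)
    return loglik - 0.5 * lam * num_params * math.log(n)

obs = [4.0, 3.0, 2.0, 9.0, 5.0, 7.0]
ll = gaussian_loglik(obs)
score = bic(ll, num_params=2, n=len(obs), lam=1.0)  # 2 params: mean and variance
print(round(score, 2))
```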

SLIDE 26

Change Point Detection: Modeling

Change point detection using BIC: the input stream is modeled as a Gaussian process in the cepstral domain.
The feature vectors of one segment are drawn from a multivariate Gaussian:

  O_i^j := O_i . . . O_j ∼ N(µ, Σ)

For a hypothesized segment boundary t in O_1^T, decide between

  O_1^T ∼ N(µ, Σ)

and

  O_1^t ∼ N(µ1, Σ1),  O_{t+1}^T ∼ N(µ2, Σ2)

Use the difference of BIC values:

  ∆BIC(t) = BIC(µ, Σ, O_1^T) − BIC(µ1, Σ1, O_1^t) − BIC(µ2, Σ2, O_{t+1}^T)

SLIDE 27

Change Point Detection: Criterion

Detect a single change point in O_1 . . . O_T:

  t̂ = arg max_t {∆BIC(t)}

SLIDE 28

Change Point Detection: Example

∆BIC(t) can be simplified to:

  ∆BIC(t) = T log|Σ| − t log|Σ1| − (T − t) log|Σ2| − λ P

with number of parameters P = (D + ½ D(D+1)) / 2 · log T, and D the feature dimensionality.

Example, O_1^6 = 4, 3, 2, 9, 5, 7 (D = 1):

  t        1      2      3      4      5      6
  O_t      4      3      2      9      5      7
  Σ        5.67 (whole sequence)
  Σ1,t     —      0.25   0.67   7.25   5.84   5.67
  Σ2,t     6.56   6.69   2.67   1.00   —      —
  ∆BIC(t)  −0.89  5.57   8.68   2.48   −0.17  —

The maximum of ∆BIC(t) is at t = 3, i.e. the change point is detected between the 2 and the 9.
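The worked example can be reproduced with a few lines; this sketch omits the complexity penalty (λ = 0) for simplicity and uses the slide's 1-D toy sequence.

```python
import math

def mle_var(obs):
    # MLE (population) variance of a 1-D sample.
    mu = sum(obs) / len(obs)
    return sum((o - mu) ** 2 for o in obs) / len(obs)

def delta_bic(obs, t):
    """DeltaBIC for a boundary after the t-th sample (1-indexed), lambda = 0:
    T*log(var) - t*log(var1) - (T-t)*log(var2)."""
    T = len(obs)
    return (T * math.log(mle_var(obs))
            - t * math.log(mle_var(obs[:t]))
            - (T - t) * math.log(mle_var(obs[t:])))

obs = [4.0, 3.0, 2.0, 9.0, 5.0, 7.0]
# Both halves need at least 2 samples so the MLE variance is positive.
scores = {t: delta_bic(obs, t) for t in range(2, len(obs) - 1)}
t_hat = max(scores, key=scores.get)
print(t_hat, round(scores[t_hat], 2))  # 3 8.68, the boundary after the 3rd sample
```

The values for t = 2, 3, 4 match the slide's table (5.57, 8.68, 2.48), confirming the boundary between the 2 and the 9.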

SLIDE 29

Question and Answer (III)

What should a segmentation ideally do?
What problems can occur due to segmenting (LM, non-speech, overlap)?
What methods exist for segmentation?
What is the Bayesian Information Criterion?
How does change point detection work?

SLIDE 30

Speaker Clustering: Introduction

Objective: group speech segments into clusters for adaptation; segments from the same or similar speakers should be grouped together.
BIC method: uses acoustic features only; greedy, bottom-up clustering; BIC is used to control the number of clusters.

SLIDE 31

Speaker Clustering: BIC Clustering

Greedy, bottom-up BIC clustering method.
Each cluster is modeled using a single full-covariance Gaussian.
Requirement: clustering should give the lowest possible adaptation WER.

Algorithm:
1 Start with one cluster for each segment.
2 Try all possible pairwise cluster merges.
3 Merge the pair that gives the largest increase in BIC.
4 Iterate from 2, until the BIC starts to decrease.
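The algorithm above can be sketched in miniature; this version uses 1-D single-Gaussian clusters instead of full covariances, and the four toy "segments" below are illustrative (two simulated speakers with offsets around 1 and 5).

```python
import math

def loglik(obs):
    # Max-likelihood log-likelihood of a 1-D Gaussian fitted to obs.
    n = len(obs)
    mu = sum(obs) / n
    var = sum((o - mu) ** 2 for o in obs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def bic_cluster(segments, lam=1.0):
    clusters = [list(seg) for seg in segments]
    n_total = sum(len(c) for c in clusters)
    # Each 1-D Gaussian has 2 parameters (mean, variance), so a merge removes
    # 2 parameters and gains (lam/2)*2*log(N) from the BIC penalty term.
    gain = lam * math.log(n_total)
    while len(clusters) > 1:
        best, best_delta = None, 0.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                delta = (loglik(merged) - loglik(clusters[i])
                         - loglik(clusters[j]) + gain)
                if delta > best_delta:       # keep the merge that raises BIC most
                    best, best_delta = (i, j), delta
        if best is None:                     # no merge increases the BIC: stop
            break
        i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

segs = [[1.0, 1.2, 0.9], [1.1, 0.8, 1.0], [5.0, 5.3, 4.9], [5.1, 4.8, 5.2]]
result = bic_cluster(segs)
print(len(result))  # 2: the two simulated speakers are recovered
```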

SLIDE 32

Where Are We?

1 Introduction
2 Segmentation and Clustering
3 Maximum Likelihood Linear Regression
4 Feature-based Maximum Likelihood Linear Regression
5 Speaker Adaptive Training

SLIDE 33

Remember? Batch Adaptation

SLIDE 34

Maximum Likelihood Linear Regression (MLLR)

Goal: modify the speaker-independent model to better fit the features.
Speaker-dependent transform: f(µ, (A, b)) = A · µ + b, simplified to f(µ, A) = A · µ.
Maximum likelihood:

  (Â, ω̂) = arg max_{A,ω} {P(ω) P(O|ω, µ, σ, A)}

Is a simultaneous optimization practical?

SLIDE 35

Maximum Likelihood Linear Regression (MLLR)

Ŵ is the result from a speaker-independent system.

  Â = arg max_A {P(Ŵ) P(O|Ŵ, A)}

EM algorithm:
1 Find the best state sequence for Ŵ given Â.
2 Estimate new parameters Â based on the given Ŵ.
3 Iterate from 1.

EM estimate: maximum approximation or forward-backward.

SLIDE 36

Remember? Gaussian Mixture Models

The speaker-dependent model is a Gaussian Mixture Model (GMM). Probability of an utterance given a hypothesized word sequence:

  P(O|ω, µ, σ) = ∏_{t=1..T} ∑_{k=1..K} p_k / (√(2π) σ_k) · exp(−(O_t − µ_k)² / (2σ_k²))

The log-likelihood is just as good:

  log P(O|ω, µ, σ) = ∑_{t=1..T} ln [ ∑_{k=1..K} p_k / (√(2π) σ_k) · exp(−(O_t − µ_k)² / (2σ_k²)) ]
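The GMM log-likelihood, and the maximum approximation that replaces the sum over components by a max, can be sketched directly from the formula; the mixture parameters below are illustrative.

```python
import math

def gmm_loglik(obs, weights, means, sigmas, max_approx=False):
    # Sum over frames of log( sum_k p_k * N(O_t | mu_k, sigma_k) ),
    # or log( max_k ... ) under the maximum approximation.
    total = 0.0
    for o in obs:
        comps = [p / (math.sqrt(2 * math.pi) * s)
                 * math.exp(-(o - m) ** 2 / (2 * s ** 2))
                 for p, m, s in zip(weights, means, sigmas)]
        total += math.log(max(comps) if max_approx else sum(comps))
    return total

obs = [0.1, 0.2, 2.9, 3.1]
weights, means, sigmas = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]
full = gmm_loglik(obs, weights, means, sigmas)
approx = gmm_loglik(obs, weights, means, sigmas, max_approx=True)
print(approx < full)  # True: the max over components never exceeds the sum
```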

SLIDE 37

Remember? Maximum Approximation

For simplification, the maximum approximation:

  log P(O|ω, µ, σ) ≈ ∑_{t=1..T} ln [ max_{k=1,...,K} p_k / (√(2π) σ_k) · exp(−(O_t − µ_k)² / (2σ_k²)) ]

Path (best component per frame):

  k̂_1, . . . , k̂_T = arg max_{k_1,...,k_T} ∑_{t=1..T} ln [ p_{k_t} / (√(2π) σ_{k_t}) · exp(−(O_t − µ_{k_t})² / (2σ_{k_t}²)) ]

SLIDE 38

Simple Linear Regression - Review (I)

Say we have a set of points (O_1, µ_{s_1}), (O_2, µ_{s_2}), . . . , (O_T, µ_{s_T}) and we want to find coefficients A so that

  ∑_{t=1..T} (O_t − A µ_{s_t})²

is minimized. Taking derivatives with respect to A we get

  ∑_{t=1..T} 2 µ_{s_t} (O_t − µ_{s_t}ᵀ A) = 0

SLIDE 39

Simple Linear Regression - Review (II)

Taking derivatives with respect to A we get

  ∑_{t=1..T} 2 µ_{s_t} (O_t − µ_{s_t}ᵀ A) = 0

  ⇔ ∑_{t=1..T} 2 µ_{s_t} O_tᵀ = ∑_{t=1..T} 2 µ_{s_t} µ_{s_t}ᵀ A

so collecting terms we get

  A = ( ∑_{t=1..T} µ_{s_t} µ_{s_t}ᵀ )⁻¹ ∑_{t=1..T} µ_{s_t} O_tᵀ
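The closed-form solution above can be sketched in the scalar case, where it reduces to A = (∑ µ_t²)⁻¹ ∑ µ_t O_t; the values below are illustrative (in MLLR, the µ would be Gaussian means along the Viterbi path).

```python
def fit_regression(mus, obs):
    # Scalar least-squares solution: A = sum(mu*O) / sum(mu^2).
    return sum(m * o for m, o in zip(mus, obs)) / sum(m * m for m in mus)

mus = [1.0, 2.0, 3.0, 4.0]
obs = [2.1, 3.9, 6.2, 7.8]   # roughly O_t = 2 * mu_t, with small noise
A = fit_regression(mus, obs)
print(round(A, 2))  # close to 2, the underlying slope
```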

SLIDE 40

MLLR: Estimation

Maximize the speaker-transformed log-likelihood:

  ∑_{t=1..T} ln [ max_{k=1,...,K} 1 / (√(2π) σ) · exp(−(O_t − A µ_k)² / (2σ²)) ]

Consider observations O_1, . . . , O_T and path s_1, . . . , s_T; for a fixed path, this means minimizing the squared error:

  d/dA ∑_{t=1..T} (O_t − A µ_{s_t})² / σ² = 0    (1)

Compare with linear regression:

  A = ( ∑_{t=1..T} µ_{s_t} µ_{s_t}ᵀ )⁻¹ ∑_{t=1..T} µ_{s_t} O_tᵀ

SLIDE 41

MLLR - Multiple Transforms

A single MLLR transform for all of speech is very restrictive.
Multiple transforms can be created by having state-dependent transforms.
Arrange the states in the form of a tree; if there are enough frames at a node, a separate transform is estimated for all the phones at that node.

SLIDE 42

MLLR - Performance

SLIDE 43

Where Are We?

1 Introduction
2 Segmentation and Clustering
3 Maximum Likelihood Linear Regression
4 Feature-based Maximum Likelihood Linear Regression
5 Speaker Adaptive Training

SLIDE 44

Feature-based Maximum Likelihood Linear Regression (fMLLR)

Goal: normalize the features to better fit the speaker.
Speaker-dependent transform: O′_t = A · O_t + b.
Transformation of the Gaussian density:

  P(O′_t) = N(O′_t | µ_k, σ_k)  ⇔  P(O_t) = |A| · N(A O_t + b | µ_k, σ_k)

A pure feature transform ⇒ no changes to the decoder are necessary.
Speaker adaptive training is easy to implement.
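A minimal 1-D sketch of fMLLR as a pure feature transform: apply O′_t = a·O_t + b to the features and add log|a| per frame to the log-likelihood, so the decoder itself is untouched. The transform parameters below are illustrative; in practice they would be estimated by maximizing the likelihood.

```python
import math

def log_gauss(x, mu, sigma):
    # Log-density of N(mu, sigma^2) at x.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fmllr_loglik(obs, a, b, mu, sigma):
    transformed = [a * o + b for o in obs]
    # log P(O) = T * log|a| + sum_t log N(a*O_t + b | mu, sigma)
    return len(obs) * math.log(abs(a)) + sum(log_gauss(o, mu, sigma) for o in transformed)

obs = [4.0, 4.4, 3.6]            # features from a mismatched speaker (illustrative)
mu, sigma = 2.0, 0.2             # speaker-independent model
before = sum(log_gauss(o, mu, sigma) for o in obs)
after = fmllr_loglik(obs, a=0.5, b=0.0, mu=mu, sigma=sigma)
print(after > before)  # True: the transform moves the features onto the model
```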

SLIDE 45

Where Are We?

1 Introduction
2 Segmentation and Clustering
3 Maximum Likelihood Linear Regression
4 Feature-based Maximum Likelihood Linear Regression
5 Speaker Adaptive Training

SLIDE 46

Speaker Adaptive Training

Introduction: adaptation compensates for speaker differences in recognition. But we also have speaker differences in the training corpus.
Question: how can we compensate for both of these differences?
Speaker-adaptive normalization: apply the transform to the training data as well; train the model on the transformed acoustic features.
Speaker-adaptive adaptation: the interaction between model and transform requires simultaneous training of model and transform parameters; we cannot simply retrain the model, a modified acoustic model training is necessary.

SLIDE 47

Speaker Adaptive Training: Training

Training procedure:
1 Estimate a speaker-independent model.
2 Compute the Viterbi path using a simple target model.
3 Use the Viterbi path to estimate the fMLLR adaptation (supervised) for each speaker in training.
4 Transform the features using the estimated fMLLR adaptation.
5 Train the speaker-adaptive model M_SAT on the transformed features, starting from the speaker-independent system.
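The procedure above can be caricatured end-to-end with a deliberately tiny stand-in: a bias-only "fMLLR" transform (O′ = O + b) and a single-Gaussian "model". The data and the bias-only transform are purely illustrative; real SAT estimates full affine transforms along Viterbi paths.

```python
def mean(xs):
    return sum(xs) / len(xs)

speakers = {
    "spk_a": [1.8, 2.2, 2.0],   # speaker with a low feature offset (illustrative)
    "spk_b": [4.1, 3.9, 4.0],   # speaker with a high feature offset
}

# 1) Speaker-independent model: global mean and variance.
all_obs = [o for obs in speakers.values() for o in obs]
si_mu = mean(all_obs)
si_var = mean([(o - si_mu) ** 2 for o in all_obs])

# 2+3) Per-speaker bias transform b = si_mu - speaker_mean (supervised).
biases = {s: si_mu - mean(obs) for s, obs in speakers.items()}

# 4) Transform the training features.
norm_obs = [o + biases[s] for s, obs in speakers.items() for o in obs]

# 5) Retrain on the normalized features: the SAT model has a much
#    smaller variance because speaker variability has been removed.
sat_mu = mean(norm_obs)
sat_var = mean([(o - sat_mu) ** 2 for o in norm_obs])
print(sat_var < si_var)  # True
```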

SLIDE 48

Speaker Adaptive Training: Recognition

Recognition procedure:
1 First-pass recognition using the speaker-independent model.
2 Estimate the fMLLR adaptation (unsupervised) using the simple target model.
3 Transform the features using the estimated fMLLR adaptation.
4 Second pass using the speaker-adaptive model on the transformed features.

SLIDE 49

Performance of MLLR and fMLLR

          Test1   Test2
  BASE    9.57    9.20
  MLLR    8.39    8.21
  fMLLR   9.07    7.97
  SAT     8.26    7.26

Task: Broadcast News with a 65K vocabulary. 15-26% relative improvement.