Robustness Techniques for Speech Recognition (Berlin Chen, 2004) - PowerPoint presentation transcript

SLIDE 1

Robustness Techniques for Speech Recognition

Berlin Chen, 2004

References:

  • 1. X. Huang et al. Spoken Language Processing (2001). Chapter 10
  • 2. J. C. Junqua and J. P. Haton. Robustness in Automatic Speech Recognition (1996), Chapters 5, 8-9
  • 3. T. F. Quatieri, Discrete-Time Speech Signal Processing (2002), Chapter 13
SLIDE 2

2004 Speech - Berlin Chen 2

Introduction

  • Classification of Speech Variability in Five Categories

  – Linguistic variability
  – Intra-speaker variability
  – Inter-speaker variability
  – Variability caused by the context
  – Variability caused by the environment

  • Corresponding modeling techniques (from the slide diagram): robustness enhancement, speaker independency, speaker adaptation, speaker dependency, context-dependent acoustic modeling, pronunciation-variation modeling

SLIDE 3

Introduction (cont.)

  • The Diagram for Speech Recognition
  • Importance of the robustness in speech recognition

  – Speech recognition systems must operate in situations with uncontrollable acoustic environments
  – The recognition performance is often degraded by the mismatch between the training and testing conditions
      • Varying environmental noises, different speaker characteristics (sex, age, dialects), different speaking modes (stylistic, Lombard effect), etc.

  (Diagram: speech signal → acoustic processing (feature extraction, likelihood computation with the acoustic model) → linguistic processing (linguistic network decoding with the lexicon and language model) → recognition results)

SLIDE 4

Introduction (cont.)

  • If a speech recognition system’s accuracy does not degrade very much under mismatch conditions, the system is called robust

  – ASR performance is rather uniform for SNRs greater than 25 dB, but degrades very steeply as the noise level increases

      SNR = 10 log₁₀(E_S/E_N) dB;  SNR = 25 dB ⇒ E_S/E_N = 10^2.5 ≈ 316

  • Various noises exist in real-world environments
  – periodic, impulsive, or wide/narrow band
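The dB arithmetic in the example above can be checked with a couple of lines of Python (a minimal sketch; the 25 dB figure is the slide's example):

```python
import math

# SNR in dB for a signal-to-noise energy ratio Es/En:
#   SNR_dB = 10 * log10(Es / En)
# The slide's example: SNR = 25 dB corresponds to Es/En = 10^2.5, about 316.
ratio = 10 ** (25 / 10)          # invert the dB formula
snr_db = 10 * math.log10(ratio)  # back to dB as a consistency check

print(round(ratio))
print(snr_db)
```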

SLIDE 5

Introduction (cont.)

  • Therefore, several possible robustness approaches have been developed to enhance the speech signal, its spectrum, and the acoustic models as well

  – Environment compensation processing (feature-based)
  – Environment model adaptation (model-based)
  – Inherently robust acoustic features (both model- and feature-based)
      • Discriminative acoustic features
SLIDE 6

The Noise Types

  • A model of the environment:

      x[m] = s[m] * h[m] + n[m]                  (time domain; * denotes convolution)
      ⇔ X(ω) = S(ω) H(ω) + N(ω)                  (spectrum)

  • Power spectrum:

      |X(ω)|² = |S(ω) H(ω) + N(ω)|²
              = |S(ω)|² |H(ω)|² + |N(ω)|² + 2 Re{ S(ω) H(ω) N*(ω) }
              = |S(ω)|² |H(ω)|² + |N(ω)|² + 2 |S(ω)| |H(ω)| |N(ω)| cos θ(ω)
              ≈ |S(ω)|² |H(ω)|² + |N(ω)|²

      P_X(ω) ≈ P_S(ω) |H(ω)|² + P_N(ω)
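The environment model above can be simulated directly; the sketch below uses toy random signals in place of real speech (the signal lengths and channel taps are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# x[m] = s[m] * h[m] + n[m]: clean signal s, channel impulse response h,
# additive noise n. (Toy signals; a real system would use speech frames.)
s = rng.standard_normal(512)                     # stand-in for clean speech
h = np.array([1.0, 0.5, 0.25])                   # short channel impulse response
n = 0.1 * rng.standard_normal(512 + len(h) - 1)  # additive noise

x = np.convolve(s, h) + n                        # the environment model

# In the frequency domain X = S*H + N, so the noisy power spectrum is
# approximately |S|^2 |H|^2 + |N|^2 when the cross term is neglected.
X = np.fft.rfft(x)
print(x.shape, X.shape)
```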

SLIDE 7

Additive Noises

  • Additive noises can be stationary or non-stationary

  – Stationary noises
      • Such as computer fan, air conditioning, or car noise: the power spectral density does not change over time (the above noises are also narrow-band noises)
  – Non-stationary noises
      • Machine gun, door slams, keyboard clicks, radio/TV, and other speakers’ voices (babble noise, wide-band noise, the most difficult): the statistical properties change over time

SLIDE 8

Additive Noises (cont.)

SLIDE 9

Convolutional Noises

  • Convolutional noises mainly result from channel distortion (they are sometimes called “channel noises”) and are stationary in most cases

  – Reverberation, the frequency response of the microphone, transmission lines, etc.

SLIDE 10

Noise Characteristics

  • White Noise

  – The power spectrum is flat, a condition equivalent to different samples being uncorrelated:

      S_nn(ω) = q,    R_nn[m] = q δ[m]

  – White noise has a zero mean, but can have different distributions
  – We are often interested in white Gaussian noise, as it better resembles the noise that tends to occur in practice

  • Colored Noise

  – The spectrum is not flat (like the noise captured by a microphone)
  – Pink noise
      • A particular type of colored noise that has a low-pass nature: it has more energy at low frequencies and rolls off at high frequencies
      • E.g., the noise generated by a computer fan, an air conditioner, or an automobile
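The white-noise properties above (flat spectrum, delta autocorrelation) are easy to verify empirically; the following sketch estimates R_nn at lag 0 and at a nonzero lag (the sample size and lag are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.standard_normal(200_000)   # zero-mean white Gaussian noise, q = 1

# Autocorrelation R_nn[m] = q * delta[m]: full power at lag 0,
# (near) zero at any other lag for a long enough sample.
r0 = np.mean(n * n)                # lag 0 -> the variance q
r5 = np.mean(n[:-5] * n[5:])       # lag 5 -> approximately 0

print(round(r0, 2), round(r5, 2))
```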

SLIDE 11

Noise Characteristics (cont.)

  • Musical Noise

  – Musical noise consists of short sinusoids (tones) randomly distributed over time and frequency
      • It occurs due to, e.g., the drawback of the original spectral subtraction technique and statistical inaccuracy in estimating the noise magnitude spectrum

  • Lombard Effect

  – A phenomenon by which a speaker increases his vocal effort in the presence of background noise (the additive noise)
  – When a large amount of noise is present, the speaker tends to shout, which entails not only a higher amplitude, but also often a higher pitch, slightly different formants, and a different coloring (shape) of the spectrum
  – The vowel portions of the words will be overemphasized by the speakers

SLIDE 12

Robustness Approaches

SLIDE 13

Three Basic Categories of Approaches

  • Speech Enhancement Techniques

  – Eliminate or reduce the noise effect on the speech signals, thus achieving better accuracy with the originally trained models (restore the clean speech signals or compensate for the distortions) – The feature part is modified while the model part remains unchanged

  • Model-based Noise Compensation Techniques

  – Adjust (change) the recognition model parameters (means and variances) to better match the noisy testing conditions – The model part is modified while the feature part remains unchanged

  • Inherently Robust Parameters for Speech

  – Find robust representations of the speech signal that are less influenced by additive or channel noise – Both the feature and model parts are changed

SLIDE 14

Assumptions & Evaluations

  • General Assumptions for the Noise

  – The noise is uncorrelated with the speech signal
  – The noise characteristics are fixed during the speech utterance, or vary very slowly (the noise is said to be stationary)
      • The estimates of the noise characteristics can be obtained during non-speech activity
  – The noise is supposed to be additive or convolutional

  • Performance Evaluations

  – Intelligibility, quality (subjective assessment)
  – Distortion between the clean and recovered speech (objective assessment)
  – Speech recognition accuracy

SLIDE 15

Spectral Subtraction (SS) S. F. Boll, 1979

  • A Speech Enhancement Technique
  • Estimate the magnitude (or power) spectrum of the clean speech by explicitly subtracting the noise magnitude (or power) spectrum from the noisy magnitude (or power) spectrum

  • Basic Assumptions of Spectral Subtraction

  – The clean speech s[m] is corrupted by additive noise n[m]:  x[m] = s[m] + n[m]
  – Different frequencies are uncorrelated with each other, and s[m] and n[m] are statistically independent, so the power spectrum of the noisy speech can be expressed as:

      P_X(ω) = P_S(ω) + P_N(ω)

  – To eliminate the additive noise:

      P̂_S(ω) = P_X(ω) − P̂_N(ω)

  – We can obtain the estimate P̂_N(ω) by averaging over M frames known to be just noise:

      P̂_N(ω) = (1/M) Σ_{i=0}^{M−1} P_{N,i}(ω)
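A minimal sketch of power spectral subtraction along the lines above (the function name, frame layout, and flooring constant are illustrative assumptions, not part of the original method description):

```python
import numpy as np

def spectral_subtraction(noisy_frames, noise_frames, floor=1e-3):
    """Basic power spectral subtraction sketch.

    noisy_frames, noise_frames: 2-D arrays of windowed time-domain frames.
    Returns the enhanced magnitude spectra of the noisy frames.
    """
    # Noise power spectrum estimate: average over M frames known to be noise.
    Pn = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** 2, axis=0)

    X = np.fft.rfft(noisy_frames, axis=1)
    Ps = np.abs(X) ** 2 - Pn           # P^_S = P_X - P^_N
    Ps = np.maximum(Ps, floor * Pn)    # flooring: P^_S can go negative
    return np.sqrt(Ps)                 # enhanced magnitude spectrum

# Toy usage: a sinusoid in white noise, with noise-only frames for the estimate.
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
clean = np.sin(2 * np.pi * 1000 * t)
noisy = np.stack([clean + 0.3 * rng.standard_normal(256) for _ in range(8)])
noise = 0.3 * rng.standard_normal((20, 256))
mag = spectral_subtraction(noisy, noise)
print(mag.shape)
```

The flooring step is exactly where "musical noise" originates: isolated bins that survive the subtraction show up as random short tones.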

SLIDE 16

Spectral Subtraction (cont.)

  • Problems of Spectral Subtraction

  – s[m] and n[m] are not statistically independent, so the cross term in the power spectrum cannot be eliminated
  – P̂_S(ω) can possibly be less than zero
  – It introduces “musical noise” when P_X(ω) ≈ P_N(ω)
  – It needs a robust endpoint (speech/noise/silence) detector

SLIDE 17

Spectral Subtraction (cont.)

  • Modification: Nonlinear Spectral Subtraction (NSS)

      P̂_S(ω) = P_X(ω) − φ(ω),   if P_X(ω) > φ(ω) + β·P̄_N(ω)
      P̂_S(ω) = β·P_X(ω),        otherwise

      P̄_X(ω) and P̄_N(ω): smoothed noisy and noise spectra
      φ(ω): a non-linear function of the SNR

  or

      P̂_S(ω) = P̄_X(ω) − P̄_N(ω),   if P̄_X(ω) ≥ P̄_N(ω)
      P̂_S(ω) = P̄_N(ω),            otherwise
SLIDE 18

Spectral Subtraction (cont.)

  • Spectral subtraction can be viewed as a filtering operation

      P̂_S(ω) = P_X(ω) − P_N(ω) = P_X(ω) [ 1 − P_N(ω)/P_X(ω) ]

      Supposing P_X(ω) ≈ P_S(ω) + P_N(ω), and defining the instantaneous SNR R(ω) = P_S(ω)/P_N(ω):

      P_N(ω)/P_X(ω) = [ 1 + R(ω) ]⁻¹
      ⇒ P̂_S(ω) = P_X(ω) [ 1 + 1/R(ω) ]⁻¹

      The time-varying suppression filter (in the spectrum domain, as opposed to the power spectrum domain) is approximately given by:

      H(ω) = [ 1 + 1/R(ω) ]^(−1/2)

SLIDE 19

Wiener Filtering

  • A Speech Enhancement Technique
  • From the Statistical Point of View

  – The process x[m] is the sum of the random process s[m] and the additive noise process n[m]:

      x[m] = s[m] + n[m]

  – Find a linear estimate ŝ[m] of s[m] in terms of the process x[m]:

      ŝ[m] = x[m] * h[m] = Σ_{l=−∞}^{∞} h[l] x[m−l]

      • That is, find a linear filter h[m] such that the sequence ŝ[m] = x[m] * h[m] minimizes the expected value of (ŝ[m] − s[m])²

  (Diagram: noisy speech x[m] → linear filter h[m] → clean-speech estimate ŝ[m])

SLIDE 20

Wiener Filtering (cont.)

  • Minimize the expectation of the squared error (MMSE estimate):

      F = E{ ( s[m] − Σ_{l=−∞}^{∞} h[l] x[m−l] )² }

      ∂F/∂h[k] = 0,  ∀k
      ⇒ E{ ( s[m] − Σ_l h[l] x[m−l] ) x[m−k] } = 0,  ∀k
      ⇒ E{ s[m] x[m−k] } = Σ_l h[l] E{ x[m−l] x[m−k] }
      ⇒ R_s[k] = Σ_l h[l] R_x[k−l]        (s[m] and n[m] are statistically independent!)
      ⇒ R_s[k] = h[k] * R_x[k]            (the summation over l is a convolution)
      ⇒ S_ss(ω) = H(ω) S_xx(ω)            (take the Fourier transform)

      R_s[n] and R_x[n]: the autocorrelation sequences of s[n] and x[n], respectively

SLIDE 21

Wiener Filtering (cont.)

  • Minimize the expectation of the squared error (MMSE estimate):

      S_ss(ω) = H(ω) S_xx(ω)
      ⇒ H(ω) = S_ss(ω) / S_xx(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) )      (where S_xx(ω) = S_ss(ω) + S_nn(ω))

      H(ω) is called the noncausal Wiener filter

SLIDE 22

Wiener Filtering (cont.)

      H(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) ) = P_S(ω) / ( P_S(ω) + P_N(ω) ) = [ 1 + 1/R(ω) ]⁻¹

      R(ω) = P_S(ω) / P_N(ω)      (instantaneous SNR)

  • The time-varying Wiener filter can also be expressed in a form similar to spectral subtraction

  SS vs. Wiener filter:
  • 1. The Wiener filter has stronger attenuation in the low-SNR region
  • 2. The Wiener filter does not invoke an absolute thresholding

  (Plot: attenuation of the two filters versus the SNR 10 log₁₀ P_S(ω)/P_N(ω))
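The two suppression rules can be compared numerically. Below, H_ss is the magnitude-domain spectral subtraction gain [1 + 1/R]^(−1/2) and H_wiener is the power-domain Wiener gain R/(1+R); the chosen SNR grid is arbitrary:

```python
import numpy as np

# Compare the two suppression gains as a function of instantaneous SNR R.
snr_db = np.array([-10.0, 0.0, 10.0, 30.0])
R = 10 ** (snr_db / 10)

H_ss = (1 + 1 / R) ** -0.5       # spectral subtraction (magnitude domain)
H_wiener = R / (1 + R)           # Wiener filter (power domain)

# The Wiener gain attenuates more at low SNR; both approach 1 at high SNR.
for db, a, b in zip(snr_db, H_ss, H_wiener):
    print(f"{db:6.1f} dB  SS={a:.3f}  Wiener={b:.3f}")
```

Since R/(1+R) lies in (0, 1), its square root is always larger, which is exactly the "stronger attenuation at low SNR" point made on the slide.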

SLIDE 23

Wiener Filtering (cont.)

  • Wiener filtering can be realized only if we know the power spectra of both the noise and the signal

  – A chicken-and-egg problem

  • Approach I: Ephraim (1992) proposed the use of an HMM where, if we know the state the current frame falls under, we can use that state’s mean spectrum as P_S(ω) (or S_ss(ω))

  – In practice, we do not know which state each frame falls into either
      • Weight the filters for each state by the posterior probability that the frame falls into that state
SLIDE 24

Wiener Filtering (cont.)

  • Approach II:

  – The background noise is stationary, and its power spectrum can be estimated by averaging spectra over a known background region
  – For the non-stationary speech signal, the time-varying power spectrum can be estimated using the Wiener filter of the previous frame:

      P̂_S(t, ω) = H(t−1, ω) · P_X(t, ω)      (t: frame index; H: Wiener filter)
      H(t, ω) = P̂_S(t, ω) / ( P̂_S(t, ω) + P_N(ω) )

      • The initial estimate of the speech spectrum can be derived from spectral subtraction

  – This sometimes introduces musical noise

SLIDE 25

Wiener Filtering (cont.)

  • Approach III:

  – Slow down the rapid frame-to-frame movement of the speech power spectrum estimate by applying temporal smoothing:

      P̃_S(t, ω) = α · P̃_S(t−1, ω) + (1 − α) · P̂_S(t, ω)

      Then use P̃_S(t, ω) to replace P̂_S(t, ω) in H(t, ω):

      H(t, ω) = P̃_S(t, ω) / ( P̃_S(t, ω) + P_N(ω) )
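Approaches II and III can be combined into a small frame-recursive sketch (the function name, initialization, and exact update order are assumptions made for illustration):

```python
import numpy as np

def iterative_wiener(Px, Pn, alpha=0.85):
    """Frame-recursive Wiener filtering sketch (Approaches II + III).

    Px: (T, K) noisy power spectra per frame; Pn: (K,) noise power spectrum.
    alpha: temporal smoothing weight for the speech power estimate.
    """
    # Initial speech estimate from spectral subtraction (floored at zero).
    Ps_smooth = np.maximum(Px[0] - Pn, 0.0)
    out = np.empty_like(Px)
    for t in range(Px.shape[0]):
        H = Ps_smooth / (Ps_smooth + Pn)     # Wiener gain for this frame
        Ps_hat = H * Px[t]                   # new raw speech power estimate
        # Temporal smoothing: P~_S(t) = a*P~_S(t-1) + (1-a)*P^_S(t)
        Ps_smooth = alpha * Ps_smooth + (1 - alpha) * Ps_hat
        out[t] = H * Px[t]                   # enhanced power spectrum
    return out

rng = np.random.default_rng(0)
Pn = np.full(64, 0.1)                    # toy stationary noise spectrum
Px = 1.0 + 0.1 * rng.random((50, 64))    # toy noisy power spectra
enhanced = iterative_wiener(Px, Pn)
print(enhanced.shape)
```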

SLIDE 26

Wiener Filtering (cont.)

  (Spectrogram examples: clean speech, noisy speech, and noisy speech enhanced using Approach III, with τ = 0.85)

  Other, more complicated Wiener filters also exist.

SLIDE 27

The Effectiveness of Active Noise

SLIDE 28

Cepstral Mean Normalization (CMN)

  • A Speech Enhancement Technique, sometimes called Cepstral Mean Subtraction (CMS)
  • CMN is a powerful and simple technique designed to handle convolutional (time-invariant linear filtering) distortions

  – Time domain:               x[n] = s[n] * h[n]
  – Spectral domain:           X(ω) = S(ω) H(ω)
  – Log power spectral domain: log|X|² = log|S|² + log|H|²
  – Cepstral domain:           Cx = C(S·H) = Cs + Ch

  – The cepstral mean over an utterance of T frames:

      C̄x = (1/T) Σ_{t=1}^{T} Cx_t = [ (1/T) Σ_{t=1}^{T} Cs_t ] + Ch = C̄s + Ch

      (C̄s can be eliminated under the assumption of a zero-mean speech contribution!)

  – If the training and testing speech materials were recorded over two different channels:

      Training: Cx(1) = C( S·H(1) ) = Cs + Ch(1)
      Testing:  Cx(2) = C( S·H(2) ) = Cs + Ch(2)

      Cx(1) − C̄x(1) = Cs − C̄s
      Cx(2) − C̄x(2) = Cs − C̄s

  The spectral characteristics of the microphone and room acoustics can thus be removed!
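The channel-cancellation argument above can be demonstrated in a few lines; the toy cepstra below stand in for real MFCC streams:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the per-utterance cepstral mean.

    cepstra: (T, D) array of cepstral vectors; returns the normalized (T, D).
    A time-invariant channel adds a constant Ch to every frame, so it
    cancels in Cx_t - mean_t(Cx_t).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
Cs = rng.standard_normal((100, 13))   # stand-in for clean cepstra
Ch1 = rng.standard_normal(13)         # channel 1 (training)
Ch2 = rng.standard_normal(13)         # channel 2 (testing)

# After CMN, the two channel versions of the same speech coincide.
a = cmn(Cs + Ch1)
b = cmn(Cs + Ch2)
print(np.allclose(a, b))
```

The same array identity also shows why CMN fails when the frames in the averaging window are nearly identical: the mean then removes the speech itself, not just the channel.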

SLIDE 29

Cepstral Mean Normalization (cont.)

  • Some Findings

  – Interestingly, CMN has been found effective even when the testing and training utterances come from the same microphone and environment
      • There are variations in the distance between the mouth and the microphone across utterances and speakers
  – Be careful with the duration/period used to estimate the cepstral mean of noisy speech
      • Why?
  – It is problematic when the acoustic feature vectors are almost identical within the selected time period

SLIDE 30

Cepstral Mean Normalization (cont.)

  • Performance

  – For telephone recordings, where each call has a different frequency response, the use of CMN has been shown to provide as much as a 30% relative decrease in error rate
  – When a system is trained on one microphone and tested on another, CMN can provide significant robustness

SLIDE 31

Cepstral Mean Normalization (cont.)

  • CMN has been shown to improve the robustness not only to varying channels but also to noise

  – White noise added at different SNRs
  – System trained with speech at the same SNR (matched condition)

  Cepstral delta and delta-delta features are computed prior to the CMN operation, so they are unaffected.

SLIDE 32

Cepstral Mean Normalization (cont.)

  • From another perspective

  – We can interpret CMN as subtracting the output of a low-pass temporal filter of length T whose coefficients are all identical and equal to 1/T; the overall operation is therefore a high-pass temporal filter (in the temporal/modulation frequency domain)
  – This alleviates the effect of convolutional noise introduced in the channel

SLIDE 33

Cepstral Mean Normalization (cont.)

  • Real-time Cepstral Normalization

  – CMN requires the complete utterance to compute the cepstral mean; thus, it cannot be used in a real-time system, and an approximation needs to be used
  – Based on the above perspective, we can implement other types of high-pass filters, e.g. a running estimate of the cepstral mean:

      C̄x_t = α · Cx_t + (1 − α) · C̄x_{t−1}      (C̄x_t: cepstral mean at frame t)

SLIDE 34

RASTA Temporal Filter (Hynek Hermansky, 1991)

  • A Speech Enhancement Technique
  • RASTA (RelAtive SpecTrA)

  Assumption

  – The linguistic message is coded into movements of the vocal tract (i.e., the change of spectral characteristics)
  – The rate of change of non-linguistic components in speech often lies outside the typical rate of change of the vocal tract shape
      • E.g., fixed or slowly time-varying linear communication channels
  – Human hearing is more sensitive to modulation frequencies around 4 Hz than to lower or higher modulation frequencies

  Effect

  – RASTA suppresses the spectral components that change more slowly or more quickly than the typical rate of change of speech
SLIDE 35

RASTA Temporal Filter (cont.)

  • The IIR transfer function:

      H(z) = C̃(z)/C(z) = 0.1 · z⁴ · (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹)

  • Another (causal) version:

      H(z) = 0.1 · (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹)

      c̃[t] = 0.98 · c̃[t−1] + 0.2 · c[t] + 0.1 · c[t−1] − 0.1 · c[t−3] − 0.2 · c[t−4]

  – The filter H(z) is applied to each MFCC stream c[t] along the frame index, producing the new MFCC stream c̃[t]
  – RASTA has a peak at about 4 Hz in modulation frequency (at a frame rate of 100 Hz)
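The causal difference equation above translates directly into code; below is a plain-Python sketch that also illustrates the high-pass behavior (a constant, channel-like input is driven toward zero because the numerator coefficients sum to zero):

```python
import numpy as np

def rasta_filter(c):
    """Causal RASTA filtering of one cepstral-coefficient stream c[t].

    Implements the difference equation
      c~[t] = 0.98*c~[t-1] + 0.2*c[t] + 0.1*c[t-1] - 0.1*c[t-3] - 0.2*c[t-4],
    i.e. H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98 z^-1).
    """
    c = np.asarray(c, dtype=float)
    out = np.zeros_like(c)
    for t in range(len(c)):
        x = (0.2 * c[t]
             + 0.1 * (c[t - 1] if t >= 1 else 0.0)
             - 0.1 * (c[t - 3] if t >= 3 else 0.0)
             - 0.2 * (c[t - 4] if t >= 4 else 0.0))
        out[t] = x + (0.98 * out[t - 1] if t >= 1 else 0.0)
    return out

# A constant (DC) input models a time-invariant convolutional distortion:
# the output decays toward zero, so the channel component is suppressed.
y = rasta_filter(np.ones(400))
print(abs(y[-1]))
```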

SLIDE 36

Retraining on Corrupted Speech

  • A Model-based Noise Compensation Technique
  • Matched-Conditions Training

– Take a noise waveform from the new environment, add it to all the utterances in the training database, and retrain the system – If the noise characteristics are known ahead of time, this method allows us to adapt the model to the new environment with a relatively small amount of data from the new environment, yet use a large amount of training data

SLIDE 37

Retraining on Corrupted Speech (cont.)

  • Multi-style Training

– Create a number of artificial acoustical environments by corrupting the clean training database with noise samples of varying levels (30dB, 20dB, etc.) and types (white, babble, etc.), as well as varying the channels – All those waveforms (copies of training database) from multiple acoustical environments can be used in training

SLIDE 38

Model Adaptation

  • A Model-based Noise Compensation Technique
  • The standard adaptation methods for speaker adaptation can be used to adapt speech recognizers to noisy environments

  – MAP (Maximum a Posteriori) adaptation can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data
  – MLLR (Maximum Likelihood Linear Regression) can achieve reasonable performance with about a minute of speech for minor mismatches. For severe mismatches, MLLR also requires a larger amount of adaptation data

SLIDE 39

Signal Decomposition Using HMMs

  • A Model-based Noise Compensation Technique
  • Recognize concurrent signals (speech and noise) simultaneously

  – Parallel HMMs (a clean-speech HMM and a noise HMM) are used to model the concurrent signals, and the composite signal is modeled as a function of their combined outputs
      • Three-dimensional Viterbi search (especially useful for non-stationary noise)

  Computationally expensive for both training and decoding!

SLIDE 40

Parallel Model Combination (PMC)

  • A Model-based Noise Compensation Technique
  • By using the clean-speech models and a noise model, we can approximate the distributions that would be obtained by training an HMM on corrupted speech

SLIDE 41

Parallel Model Combination (cont.)

  • The steps of Standard Parallel Model Combination (log-normal approximation), applied to the clean-speech HMMs (μᶜ, Σᶜ) and the noise HMMs (μ̃, Σ̃):

  1. Cepstral domain → log-spectral domain:

      μˡ = C⁻¹ μᶜ
      Σˡ = C⁻¹ Σᶜ (C⁻¹)ᵀ

  2. Log-spectral domain → linear spectral domain (in the linear spectral domain the distribution is lognormal):

      μᵢ = exp( μᵢˡ + Σˡᵢᵢ / 2 )
      Σᵢⱼ = μᵢ μⱼ [ exp(Σˡᵢⱼ) − 1 ]

  3. Combine the speech and noise statistics, because speech and noise are independent and additive in the linear spectral domain (g is a gain term, and the new distribution is assumed to be lognormal; this is the log-normal approximation):

      μ̂ = μ + g μ̃
      Σ̂ = Σ + g² Σ̃

  4. Linear spectral domain → log-spectral domain:

      μ̂ˡᵢ = log μ̂ᵢ − (1/2) log( Σ̂ᵢᵢ / μ̂ᵢ² + 1 )
      Σ̂ˡᵢⱼ = log( Σ̂ᵢⱼ / (μ̂ᵢ μ̂ⱼ) + 1 )

  5. Log-spectral domain → cepstral domain (noisy-speech HMMs):

      μ̂ᶜ = C μ̂ˡ
      Σ̂ᶜ = C Σ̂ˡ Cᵀ

  Constraint: the estimate of the variance must be positive
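A sketch of the five steps for a single diagonal Gaussian (the identity "cepstral" matrix and the g = 1 gain are simplifying assumptions for the demo; a real system would use the actual DCT cepstral transform and full covariances):

```python
import numpy as np

def lognormal_pmc(mu_c, var_c, mu_c_noise, var_c_noise, C, g=1.0):
    """Standard PMC with the log-normal approximation for one diagonal
    Gaussian (per-dimension cepstral means/variances). C: cepstral matrix."""
    Cinv = np.linalg.inv(C)

    def to_linear(mu_cep, var_cep):
        mu_l = Cinv @ mu_cep                        # cepstral -> log-spectral
        var_l = np.diag(Cinv @ np.diag(var_cep) @ Cinv.T)
        mu = np.exp(mu_l + var_l / 2)               # log-spectral -> linear
        var = mu ** 2 * (np.exp(var_l) - 1)         # (diagonal terms only)
        return mu, var

    mu_s, var_s = to_linear(mu_c, var_c)
    mu_n, var_n = to_linear(mu_c_noise, var_c_noise)

    mu_hat = mu_s + g * mu_n                        # combine in the linear domain
    var_hat = var_s + g ** 2 * var_n

    mu_l = np.log(mu_hat) - 0.5 * np.log(var_hat / mu_hat ** 2 + 1)
    var_l = np.log(var_hat / mu_hat ** 2 + 1)       # back to log-spectral
    return C @ mu_l, np.diag(C @ np.diag(var_l) @ C.T)  # back to cepstral

D = 4
C = np.eye(D)   # identity "cepstral" transform, just for the demo
mu, var = lognormal_pmc(np.zeros(D), 0.1 * np.ones(D),
                        -1.0 * np.ones(D), 0.05 * np.ones(D), C)
print(mu.shape, var.shape)
```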

SLIDE 42

Parallel Model Combination (cont.)

  • Modification I: perform the model combination in the log-spectral domain (the simplest approximation)

  – Log-Add approximation (without compensation of the variances):

      μ̂ˡ = log( exp(μˡ) + exp(μ̃ˡ) )

      • The variances are assumed to be small
  – A simplified version of the log-normal approximation
      • Reduction in computational load

  • Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC, or Iterative PMC)

  – Use the speech models to generate noisy samples (corrupted-speech observations) and then compute a maximum-likelihood estimate from these noisy samples
  – This method is less computationally expensive than standard PMC, with comparable performance

SLIDE 43

Parallel Model Combination (cont.)

  • Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC)

  (Diagram: clean-speech HMM (cepstral domain) → generate samples → domain transform → combine with the noise HMM in the linear spectral domain → noisy-speech HMM)

  Apply Monte Carlo simulation to draw random cepstral vectors (for example, at least 100 for each distribution)

SLIDE 44

Parallel Model Combination (cont.)

  • Data-Driven PMC
SLIDE 45

Vector Taylor Series (VTS) (P. J. Moreno, 1995)

  • A Model-based Noise Compensation Technique
  • VTS Approach

  – Similar to PMC, the noisy-speech-like models are generated by combining the clean-speech HMMs and the noise HMM
  – Unlike PMC, the VTS approach combines the parameters of the clean-speech HMMs and the noise HMM in the log-spectral domain, via a non-linear function

  – From the power spectrum to the log power spectrum:

      P_X(ω) = P_S(ω) P_H(ω) + P_N(ω)
      ⇒ Xˡ = log( exp(Sˡ + Hˡ) + exp(Nˡ) )
           = Sˡ + Hˡ + log( 1 + exp(Nˡ − Sˡ − Hˡ) )
           = Sˡ + Hˡ + f(Sˡ, Hˡ, Nˡ)

      where f(Sˡ, Hˡ, Nˡ) = log( 1 + exp(Nˡ − Sˡ − Hˡ) ) is a non-linear vector function
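The log-domain identity above can be checked numerically (the particular values of Sˡ, Hˡ, Nˡ are arbitrary):

```python
import math

# The log-domain environment relation used by VTS:
#   X^l = log(exp(S^l + H^l) + exp(N^l)) = S^l + H^l + f(S^l, H^l, N^l)
# with f(s, h, n) = log(1 + exp(n - s - h)).
def f(s, h, n):
    return math.log1p(math.exp(n - s - h))

s, h, n = 2.0, 0.5, 1.0    # arbitrary log-power values
lhs = math.log(math.exp(s + h) + math.exp(n))
rhs = s + h + f(s, h, n)
print(abs(lhs - rhs))
```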

SLIDE 46

Vector Taylor Series (cont.)

  • The Taylor series provides a polynomial representation of a function in terms of the function and its derivatives at a point

  – Applications often arise when nonlinear functions are employed and we desire to obtain a linear approximation
  – The function is represented as an offset and a linear term (plus higher-order terms):

      f: R → R
      f(x) = f(x₀) + f′(x₀)(x − x₀) + (1/2!) f″(x₀)(x − x₀)² + … + (1/n!) f⁽ⁿ⁾(x₀)(x − x₀)ⁿ + …

SLIDE 47

Vector Taylor Series (cont.)

  • Apply the Taylor Series Approximation

  – VTS-0: use only the 0th-order term of the Taylor series
  – VTS-1: use the 0th- and 1st-order terms of the Taylor series
  – f(S₀ˡ, H₀ˡ, N₀ˡ) is the vector function evaluated at a particular vector point (S₀ˡ, H₀ˡ, N₀ˡ):

      f(Sˡ, Hˡ, Nˡ) ≅ f(S₀ˡ, H₀ˡ, N₀ˡ)
                      + (∂f/∂Sˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Sˡ − S₀ˡ)
                      + (∂f/∂Hˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Hˡ − H₀ˡ)
                      + (∂f/∂Nˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Nˡ − N₀ˡ) + …

  • If VTS-0 is used (0th-order VTS):

      E[Xˡ] = E[ Sˡ + Hˡ + f(Sˡ, Hˡ, Nˡ) ] ≅ μₛˡ + μₕˡ + f(μₛˡ, μₕˡ, μₙˡ)      (Xˡ is also Gaussian)
      Σₓˡ ≅ Σₛˡ + Σₕˡ      (if Sˡ and Hˡ are independent)

  – If the channel filter is linear and time-invariant, we can regard it as a constant bias gˡ in the log power spectrum domain:

      μₓˡ ≅ μₛˡ + gˡ + f(μₛˡ, gˡ, μₙˡ),      Σₓˡ ≅ Σₛˡ

  – These relations can then be inverted to get the clean-speech statistics

SLIDE 48

Vector Taylor Series (cont.)

SLIDE 49

Retraining on Compensated Features

  • A Model-based Noise Compensation Technique that also uses enhanced features (processed by SS, CMN, etc.)

  – Combine speech enhancement and model compensation

SLIDE 50

Principal Component Analysis

  • Principal Component Analysis (PCA):

  – Widely applied for data analysis and dimensionality reduction, in order to derive the most “expressive” features
  – Criterion: for a zero-mean r.v. x ∈ Rᴺ, find k (k ≤ N) orthonormal vectors {e₁, e₂, …, e_k} so that
      (1) var(e₁ᵀx) is maximized
      (2) var(eᵢᵀx) is maximized, subject to eᵢ ⊥ eᵢ₋₁ ⊥ … ⊥ e₁, 1 ≤ i ≤ k
  – {e₁, e₂, …, e_k} are in fact the eigenvectors of the covariance matrix Σx of x corresponding to the largest k eigenvalues
  – The final r.v. y ∈ Rᵏ is the linear transform (projection) of the original r.v.: y = Aᵀx, A = [e₁ e₂ … e_k]

  (Figure: data scattered about the principal axis)
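A minimal PCA sketch following the criterion above (eigendecomposition of the sample covariance; the toy data and its variance profile are assumptions for the demo):

```python
import numpy as np

def pca(x, k):
    """PCA for zero-mean data x (T samples x N dims): return the N x k
    matrix A whose columns are the top-k eigenvectors of the covariance."""
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # take the largest k
    return eigvecs[:, order]

rng = np.random.default_rng(0)
# Anisotropic toy data: one dominant direction of variance.
x = rng.standard_normal((1000, 5)) * np.array([3.0, 1.0, 0.5, 0.2, 0.1])
x -= x.mean(axis=0)

A = pca(x, k=2)
y = x @ A                          # projected features y = A^T x, per sample
# Components of y are mutually uncorrelated: covariance of y is diagonal.
cov_y = np.cov(y, rowvar=False)
print(np.round(cov_y, 2))
```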

SLIDE 51

Principal Component Analysis (cont.)

SLIDE 52

Principal Component Analysis (cont.)

  • Properties of PCA

  – The components of y are mutually uncorrelated:

      E{yᵢyⱼ} = E{(eᵢᵀx)(eⱼᵀx)ᵀ} = E{(eᵢᵀx)(xᵀeⱼ)} = eᵢᵀ E{xxᵀ} eⱼ = eᵢᵀ Σx eⱼ = λⱼ eᵢᵀeⱼ = 0,  if i ≠ j

      ∴ the covariance of y is diagonal

  – The error power (mean-squared error) between the original vector x and the projected vector x′ is minimum:

      x  = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k + … + (e_Nᵀx)e_N
      x′ = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k      (note: x′ ∈ Rᴺ)

      error r.v.: x − x′ = (e_{k+1}ᵀx)e_{k+1} + (e_{k+2}ᵀx)e_{k+2} + … + (e_Nᵀx)e_N

      E( (x−x′)ᵀ(x−x′) ) = var(e_{k+1}ᵀx) + var(e_{k+2}ᵀx) + … + var(e_Nᵀx)
                         = λ_{k+1} + λ_{k+2} + … + λ_N,  which is the minimum achievable for k components

SLIDE 53

PCA Applied in Inherently Robust Features

  • Application 1: the linear transform of the original features (in the spatial domain)

      z_t = Aᵀ x_t

  – The columns of A are the “first k” eigenvectors of Σx; the transform Aᵀ is applied to each frame of the original feature stream x_t, producing the transformed feature stream z_t

SLIDE 54

PCA Applied in Inherently Robust Features (cont.)

  • Application 2: PCA-derived temporal filter (in the temporal domain)

  – The effect of the temporal filter is equivalent to the weighted sum of a length-L sequence of a specific MFCC coefficient, slid along the frame index
  – For the k-th cepstral-coefficient stream y_k(n) of an utterance of N frames, stack the windowed vectors

      z_k(n) = [ y_k(n)  y_k(n+1)  y_k(n+2)  …  y_k(n+L−1) ]ᵀ

    estimate their mean and covariance,

      μ_{z_k} = ( 1/(N−L+1) ) Σ_{n=1}^{N−L+1} z_k(n)
      Σ_{z_k} = ( 1/(N−L+1) ) Σ_{n=1}^{N−L+1} ( z_k(n) − μ_{z_k} )( z_k(n) − μ_{z_k} )ᵀ

    and take the impulse response of the temporal filter B_k(z) to be an eigenvector e_{k,1} of Σ_{z_k}; the element of the new feature vector is

      x̂_k(n) = e_{k,1}ᵀ z_k(n)

  From Dr. Jei-wei Hung
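The windowing-plus-eigenvector recipe above can be sketched as follows (the toy 4 Hz modulated stream and the window length are illustrative assumptions):

```python
import numpy as np

def pca_temporal_filter(y, L=10):
    """Derive a temporal filter for one cepstral-coefficient stream y[n]:
    stack length-L windows z(n), estimate their covariance, and take the
    eigenvector of the largest eigenvalue as the filter impulse response."""
    N = len(y)
    Z = np.stack([y[n:n + L] for n in range(N - L + 1)])   # (N-L+1, L)
    Z = Z - Z.mean(axis=0)
    cov = Z.T @ Z / Z.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]          # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)
# Toy stream: a slow 4 Hz modulation (speech-like) plus frame-level noise.
t = np.arange(500) / 100.0         # 100 frames per second
y = np.sin(2 * np.pi * 4 * t) + 0.3 * rng.standard_normal(500)

h = pca_temporal_filter(y, L=10)
filtered = np.convolve(y, h, mode="same")   # apply B_k(z) along the frames
print(h.shape)
```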

SLIDE 55

PCA Applied in Inherently Robust Features (cont.)

The frequency responses of the 15 PCA-derived temporal filters

From Dr. Jei-wei Hung

SLIDE 56

PCA Applied in Inherently Robust Features (cont.)

  • Application 2 : PCA-derived temporal filter

  (Recognition results for mismatched and matched conditions, filter length L = 10; from Dr. Jei-wei Hung)

SLIDE 57

PCA Applied in Inherently Robust Features (cont.)

  • Application 3: PCA-derived filter bank

  – The filters h₁, h₂, h₃, … are applied to the power spectrum obtained by the DFT; h_k is one of the eigenvectors of the covariance for x_k, the spectral components falling in band k

  From Dr. Jei-wei Hung

SLIDE 58

PCA Applied in Inherently Robust Features (cont.)

  • Application 3 : PCA-derived filter bank

From Dr. Jei-wei Hung

SLIDE 59

Linear Discriminant Analysis

  • Linear Discriminant Analysis (LDA)

  – Widely applied for pattern classification
  – In order to derive the most “discriminative” features
  – Criterion: assume wⱼ, μⱼ and Σⱼ are the weight, mean and covariance of class j, j = 1…N. Two matrices are defined as:

      Between-class covariance: S_b = Σ_{j=1}^{N} wⱼ (μⱼ − μ)(μⱼ − μ)ᵀ
      Within-class covariance:  S_w = Σ_{j=1}^{N} wⱼ Σⱼ

      Find W = [w₁ w₂ … w_k] such that

      Ŵ = argmax_W |Wᵀ S_b W| / |Wᵀ S_w W|

  – The columns wⱼ of W are the eigenvectors of S_w⁻¹ S_b having the largest eigenvalues
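A small sketch of the LDA criterion above (solving the eigenproblem of S_w⁻¹ S_b directly; the two-class toy data is an assumption for the demo):

```python
import numpy as np

def lda_directions(X, labels, k=1):
    """LDA sketch: eigenvectors of Sw^-1 Sb with the largest eigenvalues.
    X: (T, D) samples; labels: (T,) class ids; returns a D x k matrix W."""
    classes, counts = np.unique(labels, return_counts=True)
    mu = X.mean(axis=0)
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c, n in zip(classes, counts):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        w = n / len(X)                           # class weight
        Sb += w * np.outer(mc - mu, mc - mu)     # between-class covariance
        Sw += w * np.cov(Xc, rowvar=False)       # within-class covariance
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
# Two classes separated along the first dimension only.
X0 = rng.standard_normal((200, 3)) + np.array([3.0, 0.0, 0.0])
X1 = rng.standard_normal((200, 3)) - np.array([3.0, 0.0, 0.0])
X = np.vstack([X0, X1])
labels = np.array([0] * 200 + [1] * 200)

W = lda_directions(X, labels, k=1)
print(np.round(np.abs(W[:, 0]), 2))   # dominated by the first dimension
```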

SLIDE 60

Linear Discriminant Analysis (cont.)

The frequency responses of the 15 LDA-derived temporal filters

From Dr. Jei-wei Hung

SLIDE 61

Minimum Classification Error

  • Minimum Classification Error (MCE):

  – General objective: find an optimal feature representation or an optimal recognition model to minimize the expected error of classification
  – The recognizer is often operated under the following decision rule:

      C(X) = C_i  if  g_i(X, Λ) = max_j g_j(X, Λ),   Λ = {λ(i)}, i = 1…M

      (M models/classes; X: observations; g_i(X, Λ): class-conditioned likelihood function, for example g_i(X, Λ) = P(X|λ(i)))

  – Traditional training criterion: find λ(i) such that P(X|λ(i)) is maximum (maximum likelihood) if X ∈ C_i
      • This criterion does not always lead to minimum classification error, since it doesn’t consider the mutual relationship between different classes
      • For example, it’s possible that P(X|λ(i)) is maximum but X ∉ C_i
SLIDE 62

Minimum Classification Error (cont.)

  (Figure: histograms of the likelihood ratio LR(k) of a keyword verifier for utterances where the keyword KW_k belongs to class C_k and where it does not, separated by a threshold τ_k)

  Type I error (false rejection):                P( LR(k) < τ_k | KW_k ∈ C_k )
  Type II error (false alarm/false acceptance):  P( LR(k) ≥ τ_k | KW_k ∉ C_k )

SLIDE 63

Minimum Classification Error (cont.)

  • Minimum Classification Error (MCE) (cont.):

  – One form of the class misclassification measure:

      d_i(X) = −g_i(X, Λ) + log [ ( 1/(M−1) ) Σ_{j≠i} exp( α g_j(X, Λ) ) ]^{1/α}

      d_i(X) ≥ 0 implies a misclassification (error = 1); d_i(X) < 0 implies a correct classification (error = 0), for X ∈ C_i

  – A continuous loss function is defined as follows:

      l_i(X, Λ) = l( d_i(X) ),  where l is the sigmoid function  l(d) = 1 / ( 1 + exp(−γ d + θ) )

  – Classifier performance measure:

      L(Λ) = E_X[ L(X, Λ) ] = Σ_X Σ_{i=1}^{M} l_i(X, Λ) δ(X ∈ C_i)
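The misclassification measure and sigmoid loss above can be sketched directly (the α, γ, θ values and the toy class likelihoods are arbitrary):

```python
import numpy as np

def misclassification_measure(g, i, alpha=2.0):
    """d_i(X) = -g_i + log[(1/(M-1)) * sum_{j != i} exp(alpha*g_j)]^(1/alpha).
    g: vector of class log-likelihoods g_j(X, Lambda)."""
    others = np.delete(g, i)
    return -g[i] + np.log(np.mean(np.exp(alpha * others))) / alpha

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth 0/1 loss l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))

g = np.array([2.0, -1.0, -3.0])                 # log-likelihoods for one token
d_correct = misclassification_measure(g, i=0)   # true class scores highest
d_wrong = misclassification_measure(g, i=2)     # true class scores lowest
print(d_correct < 0, d_wrong > 0)
print(round(sigmoid_loss(d_correct), 3), round(sigmoid_loss(d_wrong), 3))
```

Because the loss is smooth in d (and d is smooth in the model parameters), it can be pushed through gradient descent, which is exactly the training scheme on the next slide.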

SLIDE 64

Minimum Classification Error (cont.)

  • Using MCE in model training:

  – Find  Λ̂ = argmin_Λ E_X[ L(X, Λ) ] = argmin_Λ L(Λ)
  – The above objective function in general cannot be minimized directly, but a local minimum can be reached using the gradient descent algorithm:

      w_{t+1} = w_t − ε ∂L(Λ)/∂w,      w: an arbitrary parameter of Λ

  • Using MCE in robust feature representation:

      f̂ = argmin_f E_X[ L(f(X), Λ) ],      f: a transform of the original features

      Note: while the feature representation is changed, the model is also changed accordingly