  1. Computer Speech Recognition: Mimicking the Human System. Li Deng, Microsoft Research, Redmond. July 24, 2005, Banff/BIRS

  2. Fundamental Equations
  • Enhancement (denoising)
  • Recognition: Ŵ = argmax_W P(W|x) = argmax_W P(x|W) P(W)
  • Importance of speech modeling
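The recognition equation above can be sketched directly: score each candidate word sequence by acoustic likelihood times language-model prior and take the argmax. All probabilities below are made-up toy numbers for illustration only, not values from any real system.

```python
import math

# Toy illustration of W* = argmax_W P(x|W) P(W): combine an acoustic
# likelihood P(x|W) with a language-model prior P(W) for each hypothesis.
# All probability values are hypothetical.
acoustic = {"ten themes": 0.20, "tin teams": 0.25, "ten teams": 0.15}   # P(x|W)
language = {"ten themes": 0.30, "tin teams": 0.02, "ten teams": 0.10}   # P(W)

def recognize(acoustic, language):
    # Work in the log domain, as real recognizers do, to avoid underflow.
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

print(recognize(acoustic, language))  # the language prior overrides the raw acoustics
```

Here the acoustically best hypothesis loses to one with a far stronger prior, which is exactly why the P(W) term matters.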

  3. Speech Recognition: Introduction
  • Converting naturally uttered speech into text and meaning
  • Conventional technology: statistical modeling and estimation (HMM)
  • Limitations:
    – noisy acoustic environments
    – rigid speaking style
    – constrained tasks
    – unrealistic demands on training data
    – huge model sizes, etc.
    – far below human speech recognition performance
  • Trend: incorporate key aspects of human speech processing mechanisms

  4. Segment-Level Speech Dynamics

  5. Production & Perception: Closed-Loop Chain
  [Diagram: speaker and listener in a closed-loop communication chain. The speaker's message drives motor/articulator commands that produce speech acoustics; the listener's ear/auditory reception, together with an internal message model, yields the decoded message, which feeds back to the speaker.]

  6. Encoder: Two-Stage Production Mechanisms
  Phonology (higher level):
  • Symbolic encoding of the linguistic message
  • Discrete representation by phonological features
  • Loosely coupled multiple feature tiers
  • Overcomes the beads-on-a-string phone model
  • Theories of distinctive features, feature geometry & articulatory phonology
  • Accounts for partial/full sound deletion/modification in casual speech
  Phonetics (lower level):
  • Converts discrete linguistic features to continuous acoustics
  • Mediated by motor control & articulatory dynamics
  • Mapping from articulatory variables to VT area function to acoustics
  • Accounts for coarticulation and reduction (target undershoot), etc.

  7. Encoder: Phonological Modeling
  Computational phonology:
  • Represents pronunciation variations as a constrained factorial Markov chain
  • Constraints from articulatory phonology
  • Language-universal representation
  [Gestural score for "ten themes" /t ε n θ i: m z/ across five feature tiers: LIPS (labial closure), Tongue Tip (alveolar closure, alveolar constriction, dental constriction, alveolar closure), Tongue Body (mid/front gesture for "eh", high/front gesture for "iy"), VEL (nasality), GLO (aspiration, voicing)]
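The multi-tier idea behind the factorial Markov chain can be sketched as follows: each feature tier evolves on its own clock, and the composite phonological state at any frame is the tuple of per-tier states. The tier labels and time spans below loosely follow the slide's gestural score for "ten themes" but are illustrative assumptions, not the model's actual parameters.

```python
# Sketch of a factorial (multi-tier) phonological representation for
# "ten themes" /t eh n th iy m z/. Each tier has its own sequence of
# (label, start_frame, end_frame) spans; spans need not align across
# tiers, which is what lets this representation escape the
# beads-on-a-string phone model. All time spans are invented.
tiers = {
    "LIPS": [("none", 0, 55), ("labial-closure", 55, 70)],
    "TT":   [("alveolar-closure", 0, 10), ("none", 10, 25),
             ("alveolar-closure", 25, 35), ("dental-constr", 35, 45),
             ("none", 45, 70)],
    "TB":   [("none", 0, 10), ("mid-front", 10, 25), ("none", 25, 45),
             ("high-front", 45, 55), ("none", 55, 70)],
    "VEL":  [("closed", 0, 25), ("nasal", 25, 35), ("closed", 35, 55),
             ("nasal", 55, 70)],
    "GLO":  [("aspiration", 0, 10), ("voicing", 10, 70)],
}

def composite_state(frame):
    """Composite phonological state: one symbol per tier at a given frame.
    The set of all such tuples is the state space of the factorial chain."""
    state = {}
    for tier, spans in tiers.items():
        for label, start, end in spans:
            if start <= frame < end:
                state[tier] = label
                break
    return state

print(composite_state(30))  # tiers overlap asynchronously: no single "phone"
```

At frame 30 the tongue tip is closed for /n/ while the velum is already open, the kind of feature overlap a linear phone string cannot express.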

  8. Encoder: Phonetic Modeling
  Computational phonetics:
  • Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
  • Switching trajectory model for target-directed articulatory dynamics
  • Switching nonlinear state-space model for dynamics in speech acoustics
  • Illustration:
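The target-directed dynamics can be sketched with a first-order recursion in which the hidden articulatory variable z moves exponentially toward a phonetic target; a short segment then leaves z short of the target, which is exactly the "target undershoot" (reduction) the earlier slide mentions. This scalar form, z[k+1] = a·z[k] + (1−a)·target, is a common simplification used here for illustration, not the talk's exact switching model, and the constants are invented.

```python
# Minimal sketch of target-directed articulatory dynamics: within a
# segment, z decays exponentially toward the segment's target. The
# time constant a and targets are illustrative assumptions.
def trajectory(z0, target, a, n_frames):
    z, traj = z0, []
    for _ in range(n_frames):
        z = a * z + (1.0 - a) * target   # exponential approach to the target
        traj.append(z)
    return traj

slow = trajectory(z0=0.0, target=1.0, a=0.9, n_frames=40)   # long segment
fast = trajectory(z0=0.0, target=1.0, a=0.9, n_frames=5)    # short segment

print(slow[-1], fast[-1])  # the short segment undershoots its target
```

In the full model this hidden trajectory is then passed through a nonlinear mapping to acoustics, giving the switching nonlinear state-space structure listed above.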

  9. Phonetic Encoder: Computation
  [Dynamic Bayesian network unrolled over K time slices. Layers, top to bottom: discrete states S_1 ... S_K; targets t_1 ... t_K; articulation z_1 ... z_K; distortion-free acoustics o_1 ... o_K; distorted acoustics y_1 ... y_K (via mapping h); distortion factors n_1 ... n_K and N_1 ... N_K, with feedback to articulation]

  10. Decoder I: Auditory Reception
  • Converts speech acoustic waves into an efficient & robust auditory representation
  • This processing is largely independent of phonological units
  • Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, ..., all the way to A1 cortex
  • Principal roles:
    1) combat environmental acoustic distortion
    2) detect relevant speech features
    3) provide temporal landmarks to aid decoding
  • Key properties:
    1) critical-band frequency scale, logarithmic compression
    2) adaptive frequency selectivity, cross-channel correlation
    3) sharp response to transient sounds
    4) modulation in independent frequency bands
    5) binaural noise suppression, etc.
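Two of the listed properties, the critical-band frequency scale and logarithmic compression, can be sketched concretely with the standard mel scale, mel = 2595·log10(1 + f/700), which is one common engineering approximation to the auditory frequency warp (the band count and frequency range below are arbitrary choices, not values from the talk):

```python
import math

# Critical-band-like frequency warping via the standard mel formula,
# plus logarithmic amplitude compression of band energies.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_lo, f_hi, n_bands):
    """Band edges equally spaced on the mel scale: denser at low frequency,
    mimicking the cochlea's critical-band resolution."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]

def log_compress(energy, floor=1e-10):
    """Logarithmic compression of a band energy, as in MFCC front ends."""
    return math.log(max(energy, floor))

edges = mel_band_edges(0.0, 8000.0, n_bands=10)
# Low-frequency bands come out much narrower than high-frequency ones:
print(edges[1] - edges[0], edges[-1] - edges[-2])
```

The uneven band widths are the point: equal spacing on the mel axis concentrates resolution where speech formants live.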

  11. Decoder II: Cognitive Perception
  • Cognitive process: recovery of the linguistic message
  • Relies on:
    1) an "internal" model: structural knowledge of the encoder (production system)
    2) a robust auditory representation of features
    3) temporal landmarks
  • Child speech acquisition is a process that gradually establishes the "internal" model
  • Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
  • No motor theory: the above strategy requires no articulatory recovery from speech acoustics
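The analysis-by-synthesis strategy can be sketched as: run each hypothesized message through an internal forward model of the production system, then keep the hypothesis whose synthesized acoustics best match what was heard. The forward model, feature vectors, and symbol inventory below are toy stand-ins, and note that, as the slide says, nothing here recovers articulation from the observed acoustics.

```python
# Sketch of analysis by synthesis: score hypotheses by synthesizing
# predicted acoustics with an internal model and comparing to the
# observation. All tables and feature values are invented.
def internal_model(hypothesis):
    """Hypothetical forward model: message symbols -> predicted features."""
    table = {"t": [0.9, 0.1], "eh": [0.2, 0.8], "n": [0.5, 0.5]}
    return [table[sym] for sym in hypothesis]

def mismatch(predicted, observed):
    """Sum of squared feature differences between synthesis and observation."""
    return sum((p - o) ** 2
               for pf, of in zip(predicted, observed)
               for p, o in zip(pf, of))

observed = [[0.85, 0.15], [0.25, 0.75], [0.5, 0.5]]   # heard features (toy)
hypotheses = [["t", "eh", "n"], ["n", "eh", "t"], ["t", "n", "eh"]]

best = min(hypotheses, key=lambda h: mismatch(internal_model(h), observed))
print(best)
```

A probabilistic version would replace `mismatch` with a likelihood under the encoder's generative model, but the control flow, synthesize then compare, is the same.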

  12. Speaker-Listener Interaction
  • On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e., discrimination)
  • Especially important for conversational speech recognition and understanding
  • On-line adaptation of "encoder" parameters
  • Novel criterion: maximize discrimination while minimizing articulation effort
  • In this closed-loop model, "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t
  • No such concept of "effort" exists in conventional HMM systems
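A simple discrete proxy for the trajectory "curvature" used as the effort measure is the summed squared second difference of z_t; this is an assumed stand-in for the talk's exact definition, and the two trajectories below are invented, but it captures the intended contrast: smooth, lazy trajectories score low, while rapid, clearly articulated movements score high.

```python
# Effort of an articulatory trajectory as summed squared second
# differences (a discrete curvature proxy; an illustrative assumption).
def effort(z):
    return sum((z[k + 1] - 2 * z[k] + z[k - 1]) ** 2
               for k in range(1, len(z) - 1))

lazy  = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # near-linear: low curvature
clear = [0.0, 0.4, 0.1, 0.5, 0.0, 0.5]   # rapid reversals: high curvature

print(effort(lazy), effort(clear))
```

Under the slide's criterion, a speaker would trade this quantity off against the listener's discrimination performance, spending effort only when decoding starts to fail.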

  13. Model synthesis in FT
  • xxx
  • xxx

  14. Model synthesis in cepstra
  [Plots of cepstral coefficients C1 and C2 over roughly 250 frames, comparing model synthesis against data]

  15. Procedure: N-Best Evaluation
  H* = arg max { P(H1), P(H2), ..., P(H1000) }
  [Block diagram: test data passes through LPCC feature extraction and a triphone HMM system to produce an N-best list (N = 1000); each hypothesis, with its phonetic transcript and time segmentation, is rescored through a table lookup, FIR filter, nonlinear mapping, and Gaussian scorer; free parameters include γ, μ, σ, T]
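The control flow of the N-best evaluation can be sketched as a rescoring loop: the HMM system supplies the list, the new model scores each hypothesis, and the best combined score wins. The interpolation weight and all scores below are illustrative stand-ins, not values from the experiment.

```python
# Sketch of N-best rescoring: pick the hypothesis maximizing an
# interpolation of the baseline HMM log score and the new model's
# log score. The weight and scores are invented for illustration.
def rescore(nbest, weight=0.5):
    """nbest: list of (hypothesis, hmm_score, new_model_score) tuples."""
    return max(nbest,
               key=lambda h: (1 - weight) * h[1] + weight * h[2])[0]

nbest = [
    ("hyp A", -120.0, -95.0),    # HMM's first choice; new model dislikes it
    ("hyp B", -123.0, -80.0),    # slightly worse HMM score, much better new score
    ("hyp C", -140.0, -130.0),
]
print(rescore(nbest))
```

With `weight=0.0` this degenerates to the HMM baseline, so the same harness measures how much the new model's scores actually change the ranking.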

  16. Results (recognition accuracy %) (work with Dong Yu)

  Models        Lattice Decode
  New model     75.1
  HMM system    72.5

  [Plot: accuracy (%) vs. N in the N-best list, N = 1 to 10001, comparing the new model against the HMM baseline]

  17. Summary & Conclusion
  • Human speech production and perception viewed as synergistic elements in a closed-loop communication chain
  • They function as encoding and decoding of linguistic messages, respectively
  • In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels
  • The current HMM approach approximates these two levels in a crude way:
    – phone-based phonological model ("beads-on-a-string")
    – multiple Gaussians as the phonetic model, applied to acoustics directly
    – very weak hidden structure

  18. Summary & Conclusion (cont'd)
  • "Linguistic message recovery" (decoding) formulated as:
    – auditory reception, for an efficient & robust speech representation and for providing temporal landmarks for phonological features
    – cognitive perception, using "encoder" knowledge (the "internal model") to perform probabilistic analysis by synthesis or pattern matching
  • Dynamic Bayesian networks developed as a computational tool for constructing the encoder and decoder
  • Speaker-listener interaction (in addition to a poor acoustic environment) causes substantial changes in articulation behavior and acoustic patterns

  19. Issues for Discussion
  • Differences and similarities in processing/analysis techniques for audio/speech and image/video processing
  • Integrated processing vs. modular processing: Ŵ = argmax_W P(W|x) = argmax_W P(x|W) P(W)
  • Feature extraction vs. classification
  • Use of semantic (class) information for feature extraction (dimensionality reduction, discriminative features, etc.)
  • Arbitrary signals vs. structured signals (e.g., face images, human body motion, speech, music)
