  1. Computer Speech Recognition: Mimicking the Human System. Li Deng, Microsoft Research, Redmond. July 24, 2005, Banff/BIRS

  2. Fundamental Equations
  • Enhancement (denoising)
  • Recognition: Ŵ = argmax_W P(W|x) = argmax_W P(x|W) P(W)
  • Importance of speech modeling
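The recognition equation above can be sketched directly: score each candidate word sequence by acoustic likelihood times language-model prior and take the argmax. All probabilities below are made-up toy numbers for illustration only, not values from any real system.

```python
import math

# Toy illustration of W* = argmax_W P(x|W) P(W): combine an acoustic
# likelihood P(x|W) with a language-model prior P(W) for each hypothesis.
# All probability values are hypothetical.
acoustic = {"ten themes": 0.20, "tin teams": 0.25, "ten teams": 0.15}   # P(x|W)
language = {"ten themes": 0.30, "tin teams": 0.02, "ten teams": 0.10}   # P(W)

def recognize(acoustic, language):
    # Work in the log domain, as real recognizers do, to avoid underflow.
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

print(recognize(acoustic, language))  # the language prior overrides the raw acoustics
```

Here the acoustically best hypothesis loses to one with a far stronger prior, which is exactly why the P(W) term matters.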

  3. Speech Recognition: Introduction
  • Converting naturally uttered speech into text and meaning
  • Conventional technology: statistical modeling and estimation (HMM)
  • Limitations:
    – noisy acoustic environments
    – rigid speaking style
    – constrained tasks
    – unrealistic demands on training data
    – huge model sizes, etc.
    – far below human speech recognition performance
  • Trend: incorporate key aspects of human speech processing mechanisms

  4. Segment-Level Speech Dynamics

  5. Production & Perception: Closed-Loop Chain
  [Diagram: speaker and listener in a closed-loop communication chain. The speaker's message drives motor/articulator commands that produce speech acoustics; the listener's ear/auditory reception, together with an internal message model, yields the decoded message, which feeds back to the speaker.]

  6. Encoder: Two-Stage Production Mechanisms
  Phonology (higher level):
  • Symbolic encoding of the linguistic message
  • Discrete representation by phonological features
  • Loosely coupled multiple feature tiers
  • Overcomes the beads-on-a-string phone model
  • Theories of distinctive features, feature geometry & articulatory phonology
  • Accounts for partial/full sound deletion/modification in casual speech
  Phonetics (lower level):
  • Converts discrete linguistic features to continuous acoustics
  • Mediated by motor control & articulatory dynamics
  • Mapping from articulatory variables to VT area function to acoustics
  • Accounts for coarticulation and reduction (target undershoot), etc.

  7. Encoder: Phonological Modeling
  Computational phonology:
  • Represents pronunciation variations as a constrained factorial Markov chain
  • Constraints from articulatory phonology
  • Language-universal representation
  [Gestural score for "ten themes" /t ε n θ i: m z/ across five feature tiers: LIPS (labial closure), Tongue Tip (alveolar closure, alveolar constriction, dental constriction, alveolar closure), Tongue Body (mid/front gesture for "eh", high/front gesture for "iy"), VEL (nasality), GLO (aspiration, voicing)]
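The multi-tier idea behind the factorial Markov chain can be sketched as follows: each feature tier evolves on its own clock, and the composite phonological state at any frame is the tuple of per-tier states. The tier labels and time spans below loosely follow the slide's gestural score for "ten themes" but are illustrative assumptions, not the model's actual parameters.

```python
# Sketch of a factorial (multi-tier) phonological representation for
# "ten themes" /t eh n th iy m z/. Each tier has its own sequence of
# (label, start_frame, end_frame) spans; spans need not align across
# tiers, which is what lets this representation escape the
# beads-on-a-string phone model. All time spans are invented.
tiers = {
    "LIPS": [("none", 0, 55), ("labial-closure", 55, 70)],
    "TT":   [("alveolar-closure", 0, 10), ("none", 10, 25),
             ("alveolar-closure", 25, 35), ("dental-constr", 35, 45),
             ("none", 45, 70)],
    "TB":   [("none", 0, 10), ("mid-front", 10, 25), ("none", 25, 45),
             ("high-front", 45, 55), ("none", 55, 70)],
    "VEL":  [("closed", 0, 25), ("nasal", 25, 35), ("closed", 35, 55),
             ("nasal", 55, 70)],
    "GLO":  [("aspiration", 0, 10), ("voicing", 10, 70)],
}

def composite_state(frame):
    """Composite phonological state: one symbol per tier at a given frame.
    The set of all such tuples is the state space of the factorial chain."""
    state = {}
    for tier, spans in tiers.items():
        for label, start, end in spans:
            if start <= frame < end:
                state[tier] = label
                break
    return state

print(composite_state(30))  # tiers overlap asynchronously: no single "phone"
```

At frame 30 the tongue tip is closed for /n/ while the velum is already open, the kind of feature overlap a linear phone string cannot express.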

  8. Encoder: Phonetic Modeling
  Computational phonetics:
  • Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
  • Switching trajectory model for target-directed articulatory dynamics
  • Switching nonlinear state-space model for dynamics in speech acoustics
  • Illustration:
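The target-directed dynamics can be sketched with a first-order recursion in which the hidden articulatory variable z moves exponentially toward a phonetic target; a short segment then leaves z short of the target, which is exactly the "target undershoot" (reduction) the earlier slide mentions. This scalar form, z[k+1] = a·z[k] + (1−a)·target, is a common simplification used here for illustration, not the talk's exact switching model, and the constants are invented.

```python
# Minimal sketch of target-directed articulatory dynamics: within a
# segment, z decays exponentially toward the segment's target. The
# time constant a and targets are illustrative assumptions.
def trajectory(z0, target, a, n_frames):
    z, traj = z0, []
    for _ in range(n_frames):
        z = a * z + (1.0 - a) * target   # exponential approach to the target
        traj.append(z)
    return traj

slow = trajectory(z0=0.0, target=1.0, a=0.9, n_frames=40)   # long segment
fast = trajectory(z0=0.0, target=1.0, a=0.9, n_frames=5)    # short segment

print(slow[-1], fast[-1])  # the short segment undershoots its target
```

In the full model this hidden trajectory is then passed through a nonlinear mapping to acoustics, giving the switching nonlinear state-space structure listed above.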

  9. Phonetic Encoder: Computation
  [Dynamic Bayesian network unrolled over K time slices. Layers, top to bottom: discrete states S_1 ... S_K; targets t_1 ... t_K; articulation z_1 ... z_K; distortion-free acoustics o_1 ... o_K; distorted acoustics y_1 ... y_K (via mapping h); distortion factors n_1 ... n_K and N_1 ... N_K, with feedback to articulation]

  10. Decoder I: Auditory Reception
  • Converts speech acoustic waves into an efficient & robust auditory representation
  • This processing is largely independent of phonological units
  • Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, ..., all the way to A1 cortex
  • Principal roles:
    1) combat environmental acoustic distortion
    2) detect relevant speech features
    3) provide temporal landmarks to aid decoding
  • Key properties:
    1) critical-band frequency scale, logarithmic compression
    2) adaptive frequency selectivity, cross-channel correlation
    3) sharp response to transient sounds
    4) modulation in independent frequency bands
    5) binaural noise suppression, etc.
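Two of the listed properties, the critical-band frequency scale and logarithmic compression, can be sketched concretely with the standard mel scale, mel = 2595·log10(1 + f/700), which is one common engineering approximation to the auditory frequency warp (the band count and frequency range below are arbitrary choices, not values from the talk):

```python
import math

# Critical-band-like frequency warping via the standard mel formula,
# plus logarithmic amplitude compression of band energies.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_lo, f_hi, n_bands):
    """Band edges equally spaced on the mel scale: denser at low frequency,
    mimicking the cochlea's critical-band resolution."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]

def log_compress(energy, floor=1e-10):
    """Logarithmic compression of a band energy, as in MFCC front ends."""
    return math.log(max(energy, floor))

edges = mel_band_edges(0.0, 8000.0, n_bands=10)
# Low-frequency bands come out much narrower than high-frequency ones:
print(edges[1] - edges[0], edges[-1] - edges[-2])
```

The uneven band widths are the point: equal spacing on the mel axis concentrates resolution where speech formants live.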

  11. Decoder II: Cognitive Perception
  • Cognitive process: recovery of the linguistic message
  • Relies on:
    1) an "internal" model: structural knowledge of the encoder (production system)
    2) a robust auditory representation of features
    3) temporal landmarks
  • Child speech acquisition is a process that gradually establishes the "internal" model
  • Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
  • No motor theory: the above strategy requires no articulatory recovery from speech acoustics
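The analysis-by-synthesis strategy can be sketched as: run each hypothesized message through an internal forward model of the production system, then keep the hypothesis whose synthesized acoustics best match what was heard. The forward model, feature vectors, and symbol inventory below are toy stand-ins, and note that, as the slide says, nothing here recovers articulation from the observed acoustics.

```python
# Sketch of analysis by synthesis: score hypotheses by synthesizing
# predicted acoustics with an internal model and comparing to the
# observation. All tables and feature values are invented.
def internal_model(hypothesis):
    """Hypothetical forward model: message symbols -> predicted features."""
    table = {"t": [0.9, 0.1], "eh": [0.2, 0.8], "n": [0.5, 0.5]}
    return [table[sym] for sym in hypothesis]

def mismatch(predicted, observed):
    """Sum of squared feature differences between synthesis and observation."""
    return sum((p - o) ** 2
               for pf, of in zip(predicted, observed)
               for p, o in zip(pf, of))

observed = [[0.85, 0.15], [0.25, 0.75], [0.5, 0.5]]   # heard features (toy)
hypotheses = [["t", "eh", "n"], ["n", "eh", "t"], ["t", "n", "eh"]]

best = min(hypotheses, key=lambda h: mismatch(internal_model(h), observed))
print(best)
```

A probabilistic version would replace `mismatch` with a likelihood under the encoder's generative model, but the control flow, synthesize then compare, is the same.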

  12. Speaker-Listener Interaction
  • On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e., discrimination)
  • Especially important for conversational speech recognition and understanding
  • On-line adaptation of "encoder" parameters
  • Novel criterion: maximize discrimination while minimizing articulation effort
  • In this closed-loop model, "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t
  • No such concept of "effort" exists in conventional HMM systems
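A simple discrete proxy for the trajectory "curvature" used as the effort measure is the summed squared second difference of z_t; this is an assumed stand-in for the talk's exact definition, and the two trajectories below are invented, but it captures the intended contrast: smooth, lazy trajectories score low, while rapid, clearly articulated movements score high.

```python
# Effort of an articulatory trajectory as summed squared second
# differences (a discrete curvature proxy; an illustrative assumption).
def effort(z):
    return sum((z[k + 1] - 2 * z[k] + z[k - 1]) ** 2
               for k in range(1, len(z) - 1))

lazy  = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # near-linear: low curvature
clear = [0.0, 0.4, 0.1, 0.5, 0.0, 0.5]   # rapid reversals: high curvature

print(effort(lazy), effort(clear))
```

Under the slide's criterion, a speaker would trade this quantity off against the listener's discrimination performance, spending effort only when decoding starts to fail.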

  13. Model synthesis in FT
  • xxx
  • xxx

  14. Model synthesis in cepstra
  [Plots of cepstral coefficients C1 and C2 over roughly 250 frames, comparing model synthesis against data]

  15. Procedure: N-Best Evaluation
  H* = arg max { P(H1), P(H2), ..., P(H1000) }
  [Block diagram: test data passes through LPCC feature extraction and a triphone HMM system to produce an N-best list (N = 1000); each hypothesis, with its phonetic transcript and time segmentation, is rescored through a table lookup, FIR filter, nonlinear mapping, and Gaussian scorer; free parameters include γ, μ, σ, T]
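The control flow of the N-best evaluation can be sketched as a rescoring loop: the HMM system supplies the list, the new model scores each hypothesis, and the best combined score wins. The interpolation weight and all scores below are illustrative stand-ins, not values from the experiment.

```python
# Sketch of N-best rescoring: pick the hypothesis maximizing an
# interpolation of the baseline HMM log score and the new model's
# log score. The weight and scores are invented for illustration.
def rescore(nbest, weight=0.5):
    """nbest: list of (hypothesis, hmm_score, new_model_score) tuples."""
    return max(nbest,
               key=lambda h: (1 - weight) * h[1] + weight * h[2])[0]

nbest = [
    ("hyp A", -120.0, -95.0),    # HMM's first choice; new model dislikes it
    ("hyp B", -123.0, -80.0),    # slightly worse HMM score, much better new score
    ("hyp C", -140.0, -130.0),
]
print(rescore(nbest))
```

With `weight=0.0` this degenerates to the HMM baseline, so the same harness measures how much the new model's scores actually change the ranking.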

  16. Results (recognition accuracy %) (work with Dong Yu)

  Models        Lattice Decode
  New model     75.1
  HMM system    72.5

  [Plot: accuracy (%) vs. N in the N-best list, N = 1 to 10001, comparing the new model against the HMM baseline]

  17. Summary & Conclusion
  • Human speech production and perception viewed as synergistic elements in a closed-loop communication chain
  • They function as encoding and decoding of linguistic messages, respectively
  • In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels
  • The current HMM approach approximates these two levels in a crude way:
    – phone-based phonological model ("beads-on-a-string")
    – multiple Gaussians as the phonetic model, applied to acoustics directly
    – very weak hidden structure

  18. Summary & Conclusion (cont'd)
  • "Linguistic message recovery" (decoding) formulated as:
    – auditory reception, for an efficient & robust speech representation and for providing temporal landmarks for phonological features
    – cognitive perception, using "encoder" knowledge (the "internal model") to perform probabilistic analysis by synthesis or pattern matching
  • Dynamic Bayesian networks developed as a computational tool for constructing the encoder and decoder
  • Speaker-listener interaction (in addition to a poor acoustic environment) causes substantial changes in articulation behavior and acoustic patterns

  19. Issues for Discussion
  • Differences and similarities in processing/analysis techniques for audio/speech and image/video processing
  • Integrated processing vs. modular processing: Ŵ = argmax_W P(W|x) = argmax_W P(x|W) P(W)
  • Feature extraction vs. classification
  • Use of semantic (class) information for feature extraction (dimensionality reduction, discriminative features, etc.)
  • Arbitrary signals vs. structured signals (e.g., face images, human body motion, speech, music)
