Computer Speech Recognition: Mimicking the Human System
Li Deng
Microsoft Research, Redmond
July 24, 2005 Banff/BIRS
Fundamental Equations
Enhancement (denoising):
Recognition:
Ŵ = argmax_W P(W | x) = argmax_W P(x | W) P(W)
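The recognition equation above can be illustrated with a toy decoder; the word strings, acoustic likelihoods P(x|W), and language-model priors P(W) below are invented placeholder numbers, not values from the talk:

```python
import math

# Toy sketch of W* = argmax_W P(x|W) P(W).
# All probabilities here are made-up illustrations.
acoustic_likelihood = {"ten themes": 0.020, "tin teams": 0.025, "ten teams": 0.018}
language_prior      = {"ten themes": 0.40,  "tin teams": 0.05,  "ten teams": 0.55}

def decode(hypotheses):
    # Score in the log domain to avoid underflow on long utterances.
    return max(hypotheses, key=lambda w: math.log(acoustic_likelihood[w])
                                         + math.log(language_prior[w]))

best = decode(acoustic_likelihood)
print(best)
```

Note how the prior can override the acoustic score: "tin teams" has the highest likelihood but a very low prior.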
estimation (HMM)
– noisy acoustic environments
– rigid speaking style
– constrained task
– unrealistic demand of training data
– huge model sizes, etc.
– far below human speech recognition performance
Speech communication mechanisms: speech acoustics in a closed-loop chain
[Diagram: SPEAKER — message → motor/articulators → speech acoustics; LISTENER — ear/auditory reception → internal model → decoded message]
Phonology (higher level):
– … & articulatory phonology
– … in casual speech
Phonetics (lower level):
– Convert discrete linguistic features to continuous acoustics
– Mediated by motor control & articulatory dynamics
– Mapping from articulatory variables to VT area function to acoustics
– Account for co-articulation and reduction (target undershoot), etc.
Computational phonology:
constrained factorial Markov chain
[Figure: gestural score for “ten themes” /t ε n θ i: m z/ — tiers: LIPS (labial closure), TT = Tongue Tip (alveolar closure, alveolar/dental constriction), TB = Tongue Body (gesture-iy: high/front; gesture-eh: mid/front), VEL (nasality), GLO (aspiration, voicing)]
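A factorial Markov chain over articulatory tiers can be sketched as follows; each tier evolves as its own Markov chain and the joint phonological state is the tuple of tier values. The tier inventories and the stay-probability are illustrative, and a real model would add constraints coupling the tiers (here they are independent for clarity):

```python
import random

# Sketch of a factorial Markov chain over articulatory feature tiers.
# Tier values and transition probabilities are invented placeholders.
TIERS = {
    "LIPS": ["none", "labial-closure"],
    "TT":   ["none", "alveolar-closure", "dental-constriction"],
    "VEL":  ["raised", "lowered"],          # lowered velum = nasality
    "GLO":  ["voiced", "aspirated"],
}

def step(state, stay_prob=0.8, rng=random):
    """Advance every tier one frame: stay put with prob 0.8, else jump uniformly."""
    new = {}
    for tier, value in state.items():
        new[tier] = value if rng.random() < stay_prob else rng.choice(TIERS[tier])
    return new

state = {tier: values[0] for tier, values in TIERS.items()}
for _ in range(5):
    state = step(state)
print(state)
```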
Computational phonetics:
– Segmental factorial HMM for sequential target in articulatory or vocal tract resonance domain
– Switching trajectory model for target-directed articulatory dynamics
– Switching nonlinear state-space model for dynamics in speech acoustics
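A target-directed trajectory model of the kind listed above can be sketched as a switching state-space simulation: within each segment s, the hidden variable z moves toward a segment target T_s at rate γ_s, and a short segment leaves the target undershot (reduction). All constants below are illustrative placeholders:

```python
import random

# Sketch of target-directed dynamics:
#   z_t = z_{t-1} + gamma_s * (T_s - z_{t-1}) + w_t   (hidden trajectory)
# Targets, rates, durations, and noise level are invented for illustration.
def simulate(segments, z0=0.0, noise=0.01, rng=random):
    """segments: list of (target T_s, rate gamma_s, duration in frames)."""
    z, traj = z0, []
    for target, gamma, frames in segments:
        for _ in range(frames):
            z = z + gamma * (target - z) + rng.gauss(0.0, noise)
            traj.append(z)
    return traj

# Long segment: z nearly reaches its target; short segment: undershoot.
traj = simulate([(1.0, 0.3, 20), (-1.0, 0.3, 3)])
print(round(traj[19], 2), round(traj[-1], 2))
```

The second segment is cut off after 3 frames, so the trajectory stops well short of its target of −1.0 — the "target undershoot" effect from the phonetics slide.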
Illustration:
[Figure: dynamic Bayesian network unrolled over K time slices — phonological states S_1 … S_K; articulation targets t_1 … t_K; hidden dynamics z_1 … z_K; distortion-free acoustics y_1 … y_K; distortion factors n_1 … n_K and N_1 … N_K producing the distorted acoustics, with feedback to articulation]
LISTENER
Efficient & robust auditory representation: (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
1) combat environmental acoustic distortion
2) detect relevant speech features
3) provide temporal landmarks to aid decoding

1) critical-band frequency scale, logarithmic compression
2) adaptive frequency selectivity, cross-channel correlation
3) sharp response to transient sounds
4) modulation in independent frequency bands
5) binaural noise suppression, etc.
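Two of the properties listed above — a critical-band frequency scale and logarithmic compression — can be sketched with a mel-scale band layout. The mel formula is the common HTK convention; the frequency range and band count are illustrative:

```python
import math

# Sketch of a critical-band (mel-scale) front end with log compression.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_band_edges(f_lo, f_hi, n_bands):
    """Place n_bands+2 band edges evenly on the mel scale, return them in Hz."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    mels = [m_lo + i * (m_hi - m_lo) / (n_bands + 1) for i in range(n_bands + 2)]
    return [700.0 * (10 ** (m / 2595.0) - 1.0) for m in mels]

def log_compress(energy, floor=1e-10):
    # Logarithmic amplitude compression with a floor against log(0).
    return math.log(max(energy, floor))

edges = mel_band_edges(0.0, 8000.0, 24)
print([round(e) for e in edges[:4]])
```

The edges crowd together at low frequencies and spread out at high frequencies, mimicking the cochlea's frequency resolution.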
1) “Internal” model: structural knowledge of the encoder (production system)
2) Robust auditory representation of features
3) Temporal landmarks
– gradually establishes the “internal” model
– decodes hidden linguistic units using the internal model
– requires no articulatory recovery from speech acoustics
– adapts articulation (speaking effort, rate, clarity, etc.) based on the listener’s “decoding” performance (i.e. discrimination)
– understanding
– maximize discrimination while minimizing articulation effort
[Plot: cepstral coefficients C1 and C2 over time — data vs. model]
[Diagram: N-best rescoring experiment — test data → feature extraction (LPCC) → triphone HMM system → N-best list (N=1000), each hypothesis with phonetic transcript & timing; per hypothesis: table lookup of segment targets T_s, FIR filtering (rates γ_s), nonlinear mapping, plus noise, then a Gaussian scorer (means µ_s, variances σ_s); H* = argmax { P(H_1), P(H_2), …, P(H_1000) }; the model is largely parameter-free]
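The rescoring step in the diagram above amounts to re-ranking the HMM system's N-best list with a second model's scores. A minimal sketch, with invented hypotheses and stand-in log-probabilities rather than the actual Gaussian scorer:

```python
# Sketch of N-best rescoring: H* = argmax over hypotheses of the new
# model's score. Hypotheses and scores below are made-up placeholders.
def rescore(nbest, scorer):
    """nbest: list of (hypothesis, hmm_score); returns the best hypothesis
    under the new scorer, ignoring the first-pass HMM score."""
    return max(nbest, key=lambda item: scorer(item[0]))[0]

toy_nbest = [("ten themes", -120.4), ("tin teams", -119.8), ("ten teams", -121.0)]
toy_scores = {"ten themes": -50.1, "tin teams": -55.3, "ten teams": -52.0}

best = rescore(toy_nbest, lambda hyp: toy_scores[hyp])
print(best)
```

In practice the first-pass and second-pass scores are usually interpolated; the pure argmax here matches the H* = argmax { P(H_1), …, P(H_N) } form on the slide.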
(work with Dong Yu)
[Plot: accuracy (Acc%, 30–100) vs. N in N-best (1 to 10001) for the new model vs. the HMM system; lattice decode accuracies: 72.5, 72.5, 72.5, 75.1]
– speech production and perception: elements in a closed-loop communication chain
– … at phonological (symbolic) and phonetic (numeric) levels, respectively
– current ASR mimics the encoder in a crude way:
  – phone-based phonological model (“beads-on-a-string”)
  – multiple Gaussians as phonetic model for acoustics directly
  – very weak hidden structure
– decoder side:
  – auditory reception for efficient & robust speech representation & for providing temporal landmarks for phonological features
  – cognition/perception using “encoder” knowledge or “internal model” to perform probabilistic analysis by synthesis or pattern matching
– … constructing encoder and decoder
– … (environment) cause substantial changes of articulation behavior and acoustic patterns
– … techniques for audio/speech and image/video processing
– … extraction (dim reduction, discriminative features, etc.)
– … (human body motion, speech, music)