Computer Speech Recognition: Mimicking the Human System - Li Deng - PowerPoint PPT Presentation


SLIDE 1

Computer Speech Recognition: Mimicking the Human System

Li Deng

Microsoft Research, Redmond

July 24, 2005 Banff/BIRS

SLIDE 2

Fundamental Equations

  • Enhancement (denoising):
  • Recognition:
  • Importance of speech modeling

$\hat{W} = \arg\max_{W} P(W \mid x) = \arg\max_{W} P(x \mid W)\, P(W)$
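As a concrete reading of the recognition equation, the sketch below picks the word sequence maximizing P(x|W)·P(W) in the log domain; the hypotheses and scores are made-up illustrations, not the output of any actual system.

```python
# Minimal sketch of W_hat = argmax_W P(x|W) P(W), with hypothetical log-domain scores.
hypotheses = {
    "recognize speech":   {"log_p_x_given_w": -120.0, "log_p_w": -4.2},
    "wreck a nice beach": {"log_p_x_given_w": -118.5, "log_p_w": -9.7},
}

def map_decode(hyps):
    """Pick the hypothesis maximizing log P(x|W) + log P(W)."""
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

print(map_decode(hypotheses))   # -> "recognize speech" (-124.2 vs. -128.2)
```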

SLIDE 3

Speech Recognition--- Introduction

  • Converting naturally uttered speech into text and meaning
  • Conventional technology --- statistical modeling and estimation (HMM)
  • Limitations
    – noisy acoustic environments
    – rigid speaking style
    – constrained task
    – unrealistic demand of training data
    – huge model sizes, etc.
    – far below human speech recognition performance
  • Trend: Incorporate key aspects of human speech processing mechanisms

SLIDE 4

Segment-Level Speech Dynamics

SLIDE 5

Production & Perception: Closed-Loop Chain

[Diagram: closed-loop speech chain. SPEAKER: message → motor/articulators → speech acoustics. LISTENER: ear/auditory reception → internal model → decoded message.]

SLIDE 6

Encoder: Two-Stage Production Mechanisms

[Encoder diagram: SPEAKER: message → motor/articulators → speech acoustics]

Phonology (higher level):

  • Symbolic encoding of linguistic message
  • Discrete representation by phonological features
  • Loosely-coupled multiple feature tiers
  • Overcome beads-on-a-string phone model
  • Theories of distinctive features, feature geometry & articulatory phonology
  • Account for partial/full sound deletion/modification in casual speech

Phonetics (lower level):

  • Convert discrete linguistic features to continuous acoustics
  • Mediated by motor control & articulatory dynamics
  • Mapping from articulatory variables to VT area function to acoustics
  • Account for coarticulation and reduction (target undershoot), etc. (see the sketch below)
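To make the "target undershoot" idea concrete, here is a minimal sketch (not the talk's model): a first-order approach toward a phonetic target reaches the target in careful, slow speech but falls short when the segment is short, as in casual speech.

```python
# Illustrative target-undershoot demo with assumed rate and durations.
import numpy as np

def approach_target(z0, target, n_frames, rate=0.25):
    """z[t+1] = z[t] + rate * (target - z[t]); short segments never reach the target."""
    z = np.empty(n_frames)
    z[0] = z0
    for t in range(1, n_frames):
        z[t] = z[t - 1] + rate * (target - z[t - 1])
    return z

slow = approach_target(z0=0.0, target=1.0, n_frames=30)   # careful speech
fast = approach_target(z0=0.0, target=1.0, n_frames=6)    # casual/fast speech
print(f"final value: {slow[-1]:.2f} (slow), {fast[-1]:.2f} (fast: undershoot)")
```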

SLIDE 7

Encoder: Phonological Modeling

[Encoder diagram: SPEAKER: message → motor/articulators → speech acoustics]

Computational phonology:

  • Represent pronunciation variations as constrained factorial Markov chain
  • Constraint: from articulatory phonology
  • Language-universal representation

[Illustration: overlapping feature tiers for "ten themes" /t ɛ n θ iː m z/. Tiers: LIPS (labial closure), TT = Tongue Tip (alveolar closure, alveolar/dental constriction), TB = Tongue Body (gesture-eh: mid/front; gesture-iy: high/front), VEL (nasality), GLO (aspiration, voicing).]
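A rough illustration of the factorial (multi-tier) representation, with assumed tier contents and frame counts for the word "ten"; the constrained factorial Markov chain described above would restrict which tier combinations can follow which.

```python
# Hedged sketch of a multi-tier phonological representation for "ten":
# each articulatory tier evolves on its own timeline, so tiers can overlap
# asynchronously (e.g., nasality spreading into the vowel). Tier names follow
# the slide (TT, TB, VEL, GLO); frame counts here are made up.
frames = 10
tiers = {
    "TT":  ["alveolar-closure"] * 3 + ["none"] * 4 + ["alveolar-closure"] * 3,  # /t/ ... /n/
    "TB":  ["none"] * 3 + ["mid-front"] * 7,                                    # vowel posture
    "VEL": ["closed"] * 5 + ["open"] * 5,    # nasality begins before the /n/ closure
    "GLO": ["aspiration"] * 3 + ["voicing"] * 7,
}

# One "phonological state" per frame is the tuple of tier values; a constrained
# factorial Markov chain limits which tuples may follow which.
states = [tuple(tiers[t][i] for t in tiers) for i in range(frames)]
for i, s in enumerate(states):
    print(i, s)
```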

SLIDE 8

Encoder: Phonetic Modeling

[Encoder diagram: SPEAKER: message → motor/articulators → speech acoustics]

Computational phonetics:

  • Segmental factorial HMM for sequential target in articulatory or vocal tract resonance domain
  • Switching trajectory model for target-directed articulatory dynamics
  • Switching nonlinear state-space model for dynamics in speech acoustics
  • Illustration:

SLIDE 9

Phonetic Encoder: Computation

[Encoder diagram: SPEAKER: message → motor/articulators → speech acoustics]

[DBN illustration over frames 1 … K: per-frame variables S_k, t_k, z_k, y_k, n_k, N_k and a channel term h, covering articulation targets, distortion-free acoustics, distorted acoustics, and distortion factors & feedback to articulation.]
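The sketch below is a minimal generative reading of the target-directed, switching state-space idea on slides 8-9 (not the exact model equations): each segment supplies a target, the articulatory/VTR state follows it with inertia and process noise, and a nonlinear map (here an arbitrary tanh stand-in for the vocal-tract mapping) produces the acoustics.

```python
# Toy switching, target-directed state-space generator; all parameters assumed.
import numpy as np

rng = np.random.default_rng(0)

segments = [("s1", 1.0, 8), ("s2", -0.5, 6), ("s3", 0.8, 10)]  # (label, target, frames)
rate, q_std, r_std = 0.3, 0.02, 0.05   # assumed dynamics and noise parameters

z, y, labels = [0.0], [], []
for label, target, n_frames in segments:
    for _ in range(n_frames):
        z_next = z[-1] + rate * (target - z[-1]) + rng.normal(0.0, q_std)  # hidden dynamics
        z.append(z_next)
        y.append(np.tanh(2.0 * z_next) + rng.normal(0.0, r_std))  # stand-in for VT mapping
        labels.append(label)

print(len(y), "acoustic frames generated; first 5:", np.round(y[:5], 3))
```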

SLIDE 10

SLIDE 11

Decoder I: Auditory Reception

[Closed-loop chain diagram, LISTENER side: ear/auditory reception → internal model → decoded message]

  • Convert speech acoustic waves into efficient & robust auditory representation
  • This processing is largely independent of phonological units
  • Involves processing stages in cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
  • Principal roles:
    1) combat environmental acoustic distortion
    2) detect relevant speech features
    3) provide temporal landmarks to aid decoding
  • Key properties (see the sketch below):
    1) critical-band frequency scale, logarithmic compression
    2) adaptive frequency selectivity, cross-channel correlation
    3) sharp response to transient sounds
    4) modulation in independent frequency bands
    5) binaural noise suppression, etc.
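Two of the listed properties, a critical-band-like frequency scale and logarithmic compression, can be illustrated with the common mel-scale formula as a stand-in; this is not the talk's auditory front end.

```python
# Illustrative critical-band-like scale and log compression (mel as a stand-in).
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Band edges equally spaced on the mel scale are narrow at low frequencies and
# wide at high frequencies, mimicking critical bands.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 25)
edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)
print("band widths (Hz):", np.round(np.diff(edges_hz), 1))

# Logarithmic compression of band energies (dynamic-range compression).
band_energy = np.array([1e-4, 1e-2, 1.0, 100.0])
print("log-compressed:", np.round(np.log(band_energy + 1e-10), 2))
```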

SLIDE 12

Decoder II: Cognitive Perception

[Closed-loop chain diagram, LISTENER side: ear/auditory reception → internal model → decoded message]

  • Cognitive process: recovery of linguistic message
  • Relies on:
    1) “internal” model: structural knowledge of the encoder (production system)
    2) robust auditory representation of features
    3) temporal landmarks
  • Child speech acquisition process is one that gradually establishes the “internal” model
  • Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model (see the sketch below)
  • No motor theory: the above strategy requires no articulatory recovery from speech acoustics
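A toy sketch of analysis by synthesis under these assumptions: candidate hidden hypotheses are pushed through an assumed internal forward (production) model, and the hypothesis whose synthesized acoustics best match the observation wins. All functions and hypotheses here are illustrative stand-ins.

```python
# Toy analysis-by-synthesis scorer; the forward model and hypotheses are made up.
import numpy as np

def internal_model(targets, n_frames=20, rate=0.3):
    """Assumed forward model: target-directed trajectory, as in the encoder sketch."""
    z, out = 0.0, []
    for t in np.repeat(targets, n_frames // len(targets)):
        z = z + rate * (t - z)
        out.append(np.tanh(2.0 * z))
    return np.array(out)

observed = internal_model([1.0, -0.5]) + 0.05 * np.random.default_rng(1).normal(size=20)

hypotheses = {"h1": [1.0, -0.5], "h2": [-0.5, 1.0], "h3": [0.2, 0.2]}
scores = {h: -np.sum((observed - internal_model(t)) ** 2) for h, t in hypotheses.items()}
print(max(scores, key=scores.get))   # -> "h1": best match under the internal model
```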

SLIDE 13

Speaker-Listener Interaction
  • On-line modification of speaker’s articulatory behavior (speaking effort, rate, clarity, etc.) based on listener’s “decoding” performance (i.e., discrimination)
  • Especially important for conversational speech recognition and understanding
  • On-line adaptation of “encoder” parameters
  • Novel criteria:
    – maximize discrimination while minimizing articulation effort
  • In this closed-loop model, the “effort” is quantified as the “curvature” of the temporal sequence of articulatory vectors z_t (see the sketch below)
  • No such concept of “effort” in conventional HMM systems
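A possible reading of "effort as curvature", sketched below with made-up trajectories: effort is measured as the summed squared second difference (discrete curvature) of the articulatory trajectory z_t, so large, fast movements cost more than reduced ones.

```python
# Illustrative effort measure on assumed articulatory trajectories.
import numpy as np

def effort(z):
    """Sum of squared second differences of the articulatory trajectory."""
    return float(np.sum(np.diff(z, n=2) ** 2))

t = np.linspace(0.0, 1.0, 50)
hyper_articulated = np.sin(2 * np.pi * 3 * t)    # fast, large movements
reduced = 0.3 * np.sin(2 * np.pi * 1 * t)        # slower, reduced movements
print(f"effort(hyper) = {effort(hyper_articulated):.3f}, effort(reduced) = {effort(reduced):.4f}")
```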


SLIDE 14
Model synthesis in FT

SLIDE 15

Model synthesis in cepstra

[Plot: model vs. data trajectories for cepstral coefficients C1 and C2 over frames 50–250]

SLIDE 16

Procedure --- N-best Evaluation

[Block diagram: test data → LPCC feature extraction → triphone HMM system → N-best list (N = 1000), each hypothesis with phonetic transcript & time marks. For each hypothesis: table lookup → FIR filter → nonlinear mapping → Gaussian scorer (parameter free). Final decision:]

H* = arg max { P(H_1), P(H_2), …, P(H_1000) }
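The sketch below walks through the rescoring procedure in the diagram under toy assumptions: synthesize expected features for each of the N = 1000 hypotheses, score them against the test data with a Gaussian scorer, and return H* = argmax_k P(H_k). The synthesis and scoring functions are stand-ins for the table-lookup/FIR/nonlinear-mapping blocks.

```python
# Toy N-best rescoring pipeline; all functions and data are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
test_features = rng.normal(size=(40, 2))            # stand-in for LPCC frames

def synthesize(hypothesis, n_frames=40, dim=2):
    """Stand-in for table lookup -> FIR -> nonlinear mapping on one hypothesis."""
    rng_h = np.random.default_rng(abs(hash(hypothesis)) % (2 ** 32))
    return rng_h.normal(size=(n_frames, dim))

def gaussian_log_score(observed, predicted, sigma=1.0):
    """Parameter-free Gaussian scorer: log N(observed; predicted, sigma^2 I), up to a constant."""
    diff = observed - predicted
    return float(-0.5 * np.sum(diff ** 2) / sigma ** 2)

nbest = [f"hyp_{k}" for k in range(1000)]           # N = 1000, as in the slide
scores = {h: gaussian_log_score(test_features, synthesize(h)) for h in nbest}
h_star = max(scores, key=scores.get)                # H* = argmax_k P(H_k)
print(h_star, round(scores[h_star], 1))             # with toy stand-ins the winner is arbitrary
```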

SLIDE 17

Results (recognition accuracy %)

(work with Dong Yu)

[Plot: recognition accuracy (%) vs. N in the N-best list (N = 1 to 1001) for the new model and the HMM system; lattice-decode accuracies shown include 72.5 and 75.1]

SLIDE 18
Summary & Conclusion

  • Human speech production/perception viewed as synergistic elements in a closed-loop communication chain
  • They function as encoding & decoding of linguistic messages, respectively
  • In humans, the speech “encoder” (production system) consists of phonological (symbolic) and phonetic (numeric) levels
  • Current HMM approach approximates these two levels in a crude way:
    – phone-based phonological model (“beads-on-a-string”)
    – multiple Gaussians as phonetic model for acoustics directly
    – very weak hidden structure

SLIDE 19
Summary & Conclusion (cont’d)

  • “Linguistic message recovery” (decoding) formulated as:
    – auditory reception for efficient & robust speech representation and for providing temporal landmarks for phonological features
    – cognitive perception using “encoder” knowledge, or the “internal model”, to perform probabilistic analysis by synthesis or pattern matching
  • Dynamic Bayes network developed as a computational tool for constructing encoder and decoder
  • Speaker-listener interaction (in addition to poor acoustic environment) causes substantial changes in articulation behavior and acoustic patterns

SLIDE 20

Issues for discussion

  • Differences and similarities in processing/analysis techniques for audio/speech and image/video processing
  • Integrated processing vs. modular processing
  • Feature extraction vs. classification
  • Use of semantics (class) information for feature extraction (dimensionality reduction, discriminative features, etc.)
  • Arbitrary signal vs. structured signal (e.g., face image, human body motion, speech, music)
