Subproject II: Robustness in Speech Recognition



SLIDE 1

Subproject II:

Robustness in Speech Recognition

SLIDE 2

Members (1/2)

  • Hsiao-Chuan Wang (PI), National Tsing Hua University
  • Jeih-Weih Hung (Co-PI), National Chi Nan University
  • Sin-Horng Chen, National Chiao Tung University
  • Jen-Tzung Chien (Co-PI), National Cheng Kung University
  • Lin-shan Lee, National Taiwan University
  • Hsin-min Wang, Academia Sinica

SLIDE 3

Members (2/2)

  • Yih-Ru Wang, National Chiao Tung University
  • Yuan-Fu Liao, National Taipei University of Technology
  • Berlin Chen, National Taiwan Normal University

SLIDE 4

Research Theme

[Block diagram: Input speech, Signal Processing, Feature Extraction & Transformation, Speech Decoding (including Word Graph Rescoring), Output Recognition Results. Adaptive components: Adaptive HMM Models, Adaptive Pronunciation Lexicon, Adaptive Language Models. Levels: Signal Level, Model Level, Lexical Level.]

SLIDE 5

Research Roadmap

Current Achievements

  • Speech enhancement & wavelet processing
  • Microphone array and noise cancellation approaches
  • Cepstral moment normalization & temporal filtering
  • Discriminative adaptation for acoustic and linguistic models
  • Maximum entropy modeling & data mining algorithm
  • Robust language modeling

Future Directions & Applications

  • Speech recognition in different adverse environments, e.g. car, home, etc.
  • Robust broadcast news transcription
  • Lecture speech recognition
  • Spontaneous speech recognition
  • Next generation automatic speech recognition
  • Powerful machine learning approaches for complicated robustness problems

SLIDE 6

Signal Level Approaches

  • Speech Enhancement – harmonic retaining, perceptual factor analysis, etc.
  • Robust Feature Representation – higher-order cepstral moment normalization, data-driven temporal filtering, etc.
  • Microphone Array Processing – microphone array with post-filtering, etc.
  • Missing-Feature Approach – sub-space missing feature imputation and environment sniffing, mismatch-aware stochastic matching, etc.

SLIDE 7

Higher-Order Cepstral Moment Normalization (HOCMN) (1/3)

Cepstral Feature Normalization Widely Used

– CMS: normalizing the first moment
– CMVN: normalizing the first and second moments
– HEQ: normalizing the full distribution (all order moments)
– How about normalizing a few higher order moments only?
– Disturbances of larger magnitudes may be the major sources of recognition errors, which are better reflected in higher order moments

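
The difference between normalizing the first moment (CMS), the first two moments (CMVN), and the first plus one higher even moment can be sketched as follows. This is only an illustrative sketch, not the HOCMN procedure itself: the function names are hypothetical, and the choice to match the N-th central moment to that of a unit Gaussian, (N-1)!!, is an assumption made here so that N = 2 reduces to CMVN.

```python
import numpy as np

def double_factorial(n):
    """Return n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def cms(cepstra):
    """Cepstral mean subtraction: zero the first moment of each dimension."""
    return cepstra - cepstra.mean(axis=0)

def cmvn(cepstra):
    """Cepstral mean and variance normalization: zero mean, unit variance."""
    centered = cepstra - cepstra.mean(axis=0)
    return centered / centered.std(axis=0)

def normalize_first_and_nth_moment(cepstra, n):
    """Zero the mean, then rescale so the n-th central moment (n even)
    equals the n-th moment of a unit Gaussian (assumed target)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    centered = cepstra - cepstra.mean(axis=0)
    nth_moment = np.mean(centered ** n, axis=0)
    target = double_factorial(n - 1)  # E[z^n] for z ~ N(0, 1), n even
    return centered / (nth_moment / target) ** (1.0 / n)
```

Each function takes a (frames × dimensions) array for one utterance. Because a high even moment is dominated by the largest deviations, scaling by it reacts more strongly to large-magnitude disturbances than scaling by the variance does.
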
SLIDE 8

Higher-Order Cepstral Moment Normalization (HOCMN) (2/3)

Experimental results: Aurora 2, clean-condition training, word accuracy averaged over 0 to 20 dB and all types of noise (sets A, B, C)

[Figure: word accuracy (%) versus N (even integer) for (a) HOCMN[1,N] (full-utterance) and (b) HOCMN[1,N] (L=86), with the 1st and N-th moments normalized, compared against CMVN and CMVN (L=86).]

SLIDE 9

Higher-Order Cepstral Moment Normalization (HOCMN) (3/3)

Experimental results: Aurora 2, clean-condition training, word accuracy averaged over 0 to 20 dB for each type of noise condition

[Figure: bar chart of word accuracy (%) for CMVN, HOCMN[1,5,100], and HEQ. Set A: Subway, Babble, Car, Exhibition, Set A Avg. Set B: Restaurant, Street, Airport, Station, Set B Avg. Set C: Subway, Street, Set C Avg.]

  • HOCMN is significantly better than CMVN for all types of noise
  • HOCMN is better than HEQ in most types of noise except for the “Subway” and “Street” noise

SLIDE 10

Data-Driven Temporal Filtering

  • The developed filters are applied in the temporal domain of the original features
  • The filters are derived in a data-driven manner according to the PCA, LDA, or MCE criterion
  • They can be integrated with cepstral mean and variance normalization (CMVN) to achieve further performance improvement
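
A PCA-derived temporal filter can be sketched as below, under assumptions not stated on the slide: the filter is taken as the leading eigenvector of the covariance of fixed-length windows cut from one feature dimension's time trajectory, and it is applied as an FIR filter along time. The function names and the window length are hypothetical; the LDA and MCE variants are not shown.

```python
import numpy as np

def pca_temporal_filter(trajectories, length):
    """Derive FIR taps as the first principal component of all
    length-sample windows taken from the given 1-D trajectories."""
    windows = np.asarray([
        t[i:i + length]
        for t in trajectories
        for i in range(len(t) - length + 1)
    ])
    windows = windows - windows.mean(axis=0)
    covariance = windows.T @ windows / len(windows)
    # eigh returns eigenvalues in ascending order; take the last eigenvector.
    _, eigenvectors = np.linalg.eigh(covariance)
    return eigenvectors[:, -1]

def apply_temporal_filter(trajectory, taps):
    """Filter one trajectory along time; taps are reversed so each output
    sample is the inner product of a window with the taps."""
    return np.convolve(trajectory, taps[::-1], mode="same")
```

In use, one filter would be derived per cepstral dimension from training data, and CMVN could be applied to the features before or after filtering.
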

SLIDE 11

Microphone Array Processing (1/3)

Integrated with Model Level Approaches (MLLR)

Using Time Domain Coherence Measure (TDCM)

[Block diagram: Speech Input Using Microphone Array, Delay Estimator, Delay-and-Sum Beamformer, Enhanced signal, Speech Recognition, Result. Initial HMM Parameters, MLLR Adaptation, Adapted HMM Parameters. Stages: Speech Enhancement, Speech Recognition, Model Adaptation.]
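
The delay-estimation and delay-and-sum steps can be illustrated with the minimal sketch below. The slide does not define the TDCM, so a plain cross-correlation peak search is swapped in here as the delay estimator; the function names, integer-sample delays, and circular shifting via `np.roll` are all simplifying assumptions.

```python
import numpy as np

def estimate_delay(reference, channel, max_lag):
    """Return the integer lag (in samples) by which `channel` trails
    `reference`, chosen as the peak of their cross-correlation.
    This is a stand-in for the TDCM, which is not specified here."""
    lags = range(-max_lag, max_lag + 1)
    scores = [float(np.dot(reference, np.roll(channel, -lag))) for lag in lags]
    return lags[int(np.argmax(scores))]

def delay_and_sum(channels, delays):
    """Undo each channel's delay (circular shift for simplicity) and
    average the aligned channels to form the enhanced signal."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

Averaging the aligned channels reinforces the speech, which adds coherently, while noise that is uncorrelated across microphones is attenuated.
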

SLIDE 12

Microphone Array Processing (2/3)

Further Improved with Wiener Filtering and Spectral Weighting Function (SWF)

[Block diagram: microphone signals x1 to x4 with estimated delays τ̂ feed the Delay-and-Sum Beamformer; FFT, Improved Wiener Filter, Spectral Weighting Function with Weight Selection (W, Ŵ), and IFFT produce the enhanced signal ŝ.]

SLIDE 13

Microphone Array Processing (3/3)

Applications for In-Car Speech Recognition

– Power Spectral Coherence Measure (PSCM) used to estimate the time delay

[Figures: physical configuration and configuration in car. Labels: microphone array, air conditioner, wheel, speaker, personal computer, fan noise, 45º, 90 cm.]

SLIDE 14

Model Level Approaches

  • Improved Parallel Model Combination
  • Bayesian Learning of Speech Duration Models
  • Aggregate a Posteriori Linear Regression Adaptation

SLIDE 15

Aggregate a Posteriori Linear Regression (AAPLR) (1/3)

  • Discriminative linear regression adaptation
  • Prior density of regression matrix is incorporated to construct Bayesian learning capabilities
  • Closed-form solution obtained for rapid adaptation

[Diagram: prior information of regression matrix and discriminative criterion combine in AAPLR, giving Bayesian learning and a closed-form solution.]
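
For context, linear regression adaptation transforms each Gaussian mean μ of the HMM with a regression matrix W = [b A] acting on the extended vector [1, μ], so the adapted mean is Aμ + b. The sketch below shows only how such a matrix is applied; how AAPLR estimates W is not shown, and the function name and array layout are assumptions.

```python
import numpy as np

def adapt_means(means, regression_matrix):
    """Apply a regression matrix W of shape (d, d + 1) to Gaussian means
    of shape (num_gaussians, d): each adapted mean is W @ [1, mu]."""
    means = np.asarray(means, dtype=float)
    ones = np.ones((means.shape[0], 1))
    extended = np.hstack([ones, means])   # rows are [1, mu_1, ..., mu_d]
    return extended @ np.asarray(regression_matrix, dtype=float).T
```

With cluster-dependent matrices, the Gaussians of each regression class r would be passed through their own W_r.
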

SLIDE 16

Aggregate a Posteriori Linear Regression (AAPLR) (2/3)

MAPLR

$$J_{\text{MAPLR}}(\hat{W}) = \sum_{m=1}^{M} \sum_{n=1}^{N_m} \log\left\{ p(X_{m,n} \mid \hat{W}_r, \lambda_m)\, g(\hat{W}_r) \right\}$$

AAPLR

$$J_{\text{AAPLR}}(W) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \frac{p(X_{m,n} \mid W_r, \lambda_m)\, P_m\, g(W_r)}{p(X_{m,n})}$$

aggregated over all model classes m with probabilities P_m

Discriminative Training

$$J_{\text{AAPLR}}(W) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell\left( d_m^{\text{AAPLR}} \right)$$

$$d_m^{\text{AAPLR}}(X) = -g_m(X; \lambda_m, W_r) + \log\left\{ \frac{1}{M-1} \sum_{j \neq m} \exp\left[ \eta\, g_j(X; \lambda_j, W_r) \right] \right\}^{1/\eta}$$

$$g_m(X; \lambda_m, W_r) = \log\left\{ p(X_{m,n} \mid W_r, \lambda_m)\, g(W_r) \right\}$$
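
A misclassification measure of the form d_m = -g_m + (1/η) log[(1/(M-1)) Σ_{j≠m} exp(η g_j)] can be evaluated numerically as sketched below, given the discriminant scores g_1, ..., g_M of one sample. The function names are hypothetical, and the sigmoid used as the loss ℓ is an assumption (a common choice in minimum classification error training), since the slide does not define ℓ.

```python
import numpy as np

def misclassification_measure(scores, m, eta=1.0):
    """Return d_m: the negated score of class m plus a soft maximum
    (controlled by eta) over the scores of all competing classes."""
    scores = np.asarray(scores, dtype=float)
    scaled = eta * np.delete(scores, m)      # eta * g_j for all j != m
    peak = scaled.max()                      # subtract the peak for stability
    log_mean_exp = peak + np.log(np.mean(np.exp(scaled - peak)))
    return -scores[m] + log_mean_exp / eta

def sigmoid_loss(d, gamma=1.0):
    """Smooth 0-1 loss (assumed form): near 0 when d << 0, near 1 when d >> 0."""
    return 1.0 / (1.0 + np.exp(-gamma * d))
```

A negative d_m means the correct class outscores its competitors. As η grows, the soft maximum approaches the score of the single strongest competitor.
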

SLIDE 17

Aggregate a Posteriori Linear Regression (AAPLR) (3/3)

Comparison with Other Approaches

Method | Estimation criterion (ML, MAP, MCE, MMI, AAP) | Discriminative adaptation | Bayesian learning | Closed-form solution
MLLR   | ○ (ML)  | No  | No  | Yes
MAPLR  | ○ (MAP) | No  | Yes | Yes
MCELR  | ○ (MCE) | Yes | No  | No
CMLLR  | ○ ○     | Yes | No  | Yes
AAPLR  | ○ ○     | Yes | Yes | Yes

SLIDE 18

Lexical Level Approaches

  • Pronunciation Modeling for Spontaneous Mandarin Speech
  • Language Model Adaptation
    – Latent Semantic Analysis and Smoothing
    – Maximum Entropy Principle
  • Association Pattern Language Model

SLIDE 19

Pronunciation Modeling for Spontaneous Mandarin Speech

Automatically constructing a multiple-pronunciation lexicon using a three-stage framework to reduce the confusion introduced by the added pronunciations:

  • Automatically generating possible surface forms but avoiding confusion across different words
  • Ranking the pronunciations to avoid confusion across different words
  • Keeping only the necessary pronunciations to avoid confusion across different words

SLIDE 20

Association Pattern Language Model (1/5)

  • N-grams consider only local relations
  • Trigger pairs consider long-distance relations, but only for two associated words
  • Word associations can be expanded for more than two distant words
  • A new algorithm to discover association patterns via data mining techniques

SLIDE 21

Association Pattern Language Model (2/5)

Bigram & Trigram

[Figure: example words "Sept.", "11", "Twin", "Towers", "George", "Bush" connected by bigram and trigram links.]

Trigger Pairs

[Figure: the same example words connected by bigram links and by trigger-pair links.]

SLIDE 22

Association Pattern Language Model (3/5)

Association Patterns

[Figure: example words "Sept.", "11", "Twin", "Towers", "George", "Bush" connected by bigram links and by association-pattern links.]

SLIDE 23

Association Pattern Language Model (4/5)

Association Pattern Mining Procedure

SLIDE 24

Association Pattern Language Model (5/5)

  • Association pattern set Ω_AS covering different association steps is constructed
  • Mutual information of all association patterns is merged
  • Association pattern n-gram is estimated

$$\mathrm{MI}(W_a^{q-1} \to w_j) = \log \frac{p(W_a^{q-1}, w_j)}{p(W_a^{q-1})\, p(w_j)}$$

$$\log p_{\mathrm{AS}}(W) = \sum_{q=1}^{L} \log p(w_q) + \sum_{s=1}^{S} \sum_{W_{a,s}^{q-1} \to w_{j,s} \in \Omega_{\mathrm{AS}}} \mathrm{MI}(W_{a,s}^{q-1} \to w_{j,s})$$

$$\log \tilde{p}(W) = a_1 \log p_{\mathrm{AS}}(W) + a_2 \log p(W)$$
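
The three quantities on this slide can be traced numerically with the sketch below: the mutual information of a pattern and a word, log[p(pattern, word) / (p(pattern) p(word))]; a sentence score formed from unigram log-probabilities plus the merged mutual information of the matched association patterns; and a weighted combination of two log-probabilities. The function names are hypothetical, the probabilities are assumed to be estimated elsewhere, and the pattern mining step itself is not shown.

```python
import math

def mutual_information(p_joint, p_pattern, p_word):
    """MI(pattern -> word) = log[ p(pattern, word) / (p(pattern) p(word)) ]."""
    return math.log(p_joint / (p_pattern * p_word))

def association_log_prob(unigram_probs, pattern_mis):
    """Sum of unigram log-probabilities of the words in a sentence plus
    the mutual information of every association pattern it contains."""
    return sum(math.log(p) for p in unigram_probs) + sum(pattern_mis)

def combine_log_probs(log_p_as, log_p_other, a1, a2):
    """Weighted combination a1 * log p_AS(W) + a2 * log p(W)."""
    return a1 * log_p_as + a2 * log_p_other
```

A pattern whose words co-occur more often than chance has positive mutual information and raises the sentence score above the unigram baseline; an independent pattern contributes zero.
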

SLIDE 25

Future Directions

Robustness in Detecting Speech Attributes and Events

– Detection-based processing for Next Generation Automatic Speech Recognition
– Robustness in sequential hypothesis test for acoustic and linguistic detectors

Beyond Current Robustness Approaches

– Maximum entropy framework is useful for building robust speech and linguistic models
– Develop new machine learning approaches, e.g. ICA, LDA, etc., for speech technologies
– Build powerful technologies to handle complicated robustness problems

Application of Robustness Techniques in Spontaneous Speech Recognition

– Robustness issue is ubiquitous in speech areas
– Towards robustness in different levels
– Robustness in establishing applications and systems