SLIDE 1

THE NTU-ADSC SYSTEMS FOR REVERBERATION CHALLENGE 2014

presented by

Xiong Xiao1, Shengkui Zhao2, Duc Hoang Ha Nguyen3, Xionghu Zhong3, Douglas L. Jones2, Eng Siong Chng1,3, Haizhou Li1,3,4

1Temasek Lab@NTU, Nanyang Technological University, Singapore. 2Advanced Digital Sciences Center, Singapore. 3School of Computer Engineering, Nanyang Technological University, Singapore. 4Department of Human Language Technology, Institute for Infocomm Research, Singapore.

SLIDE 2

Outline

  • System Highlights
  • Speech Enhancement

– Delay and Sum + spectral subtraction
– MVDR + DNN spectrogram enhancement

  • Speech Recognition

– Multi-condition training
– Clean-condition training

  • Summary


SLIDE 3

System Highlights

  • Beamforming

– Delay and Sum, MVDR
– Classic method, always works!

  • DNN feature mapping

– Mapping reverberant spectrograms to clean spectrograms for enhancement
– Mapping reverberant MFCC features to clean features for ASR

  • DNN acoustic modeling for ASR

– Discriminative feature learning and modeling in a single framework.

  • Feature adaptation (Cross-transform) for ASR

– A generalization of temporal filtering and the fMLLR transform.
– Explicitly uses the correlation between feature frames to counter distortions whose effects span many frames.

SLIDE 4

Outline

  • System Highlights
  • Speech Enhancement

– Delay and Sum + spectral subtraction
– MVDR + DNN spectrogram enhancement

  • Speech Recognition

– Multi-condition training
– Clean-condition training

  • Summary


SLIDE 5

Speech Enhancement Systems


Two speech enhancement systems are considered:

  • DS beamforming + spectral subtraction (DS+SS);
  • MVDR beamforming + DNN-based spectrogram enhancement (MVDR+DNN).

SLIDE 6

Speech Enhancement – DS + Spectral Subtraction


DS beamforming

  • Windowing/STFT: 64 ms Hanning window, 75% frame overlap, 1024-point STFT.
  • GCC-PHAT for TDOA estimation.
  • Multi-channel phase alignment and sum.
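The two DS steps above (TDOA estimation via GCC-PHAT, then alignment and summing) can be sketched as follows. This is a minimal time-domain numpy illustration with integer-sample alignment against a reference channel, not the authors' STFT-domain implementation:

```python
import numpy as np

def gcc_phat_tdoa(x, ref, fs, max_tau=None):
    """Estimate the time delay of x relative to ref via GCC-PHAT."""
    n = len(x) + len(ref)
    cross = np.fft.rfft(x, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, delays, fs):
    """Align each channel by its (integer-sample) delay and average.
    Uses a circular shift for simplicity; real systems apply a phase
    shift in the STFT domain to handle fractional delays."""
    out = np.zeros_like(channels[0], dtype=float)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(round(d * fs)))
    return out / len(channels)
```

In the actual system the alignment is done as a phase rotation of the STFT coefficients, which also handles fractional-sample delays.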

Spectral Subtraction

  • Reverberation time estimation: ML method.
  • Amplitude spectral subtraction.
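A hedged sketch of the amplitude spectral subtraction step. The late-reverberation magnitude is modeled here (a Lebart-style assumption, not necessarily the authors' exact estimator) as a delayed, exponentially attenuated copy of the observed magnitude; the ML T60 estimation is omitted and T60 is taken as given:

```python
import numpy as np

def spectral_subtract(mag, t60, frame_rate, delay_frames=4, floor=0.1):
    """Amplitude spectral subtraction of estimated late reverberation.

    mag: (frames, bins) STFT magnitudes; frame_rate: frames per second.
    Amplitude decays by exp(-6.9 * t / T60) (60 dB energy decay over T60),
    so the late-reverb estimate is a delayed copy scaled by that factor."""
    rho = np.exp(-6.9 * delay_frames / (t60 * frame_rate))
    late = np.zeros_like(mag)
    late[delay_frames:] = rho * mag[:-delay_frames]
    # subtract, with a spectral floor to limit musical-noise artifacts
    return np.maximum(mag - late, floor * mag)
```

With the 64 ms window and 75% overlap quoted above, the hop is 16 ms, i.e. a frame rate of 62.5 frames/s.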
SLIDE 7

Speech Enhancement – MVDR + DNN feature mapping


  • Use a DNN to map a window of reverberant feature vectors to the (central) clean feature vector.
  • Let the DNN learn to do dereverberation.
  • For speech enhancement, input and output are spectrum vectors.
  • For ASR, input and output are MFCC feature vectors.
  • Training data: frame-aligned clean and multi-condition data.
  • DNN size: 2827 – 3x3072 – 771.
  • Predict both static and dynamic spectra, then merge them to produce a smoothed static spectrum.
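The quoted layer sizes suggest an input of 11 spliced frames of 257-dim log-spectra (11 x 257 = 2827) and an output of static + delta + delta-delta spectra (3 x 257 = 771). A sketch of the input splicing; the edge-padding scheme is an assumption:

```python
import numpy as np

DIM, CONTEXT = 257, 5   # 257-dim log-spectrum, +/-5 frames -> 11*257 = 2827 inputs

def stack_context(feats, context=CONTEXT):
    """Splice each frame with its +/-context neighbours to build the DNN
    input vectors; edges are padded by repeating the first/last frame."""
    T, _ = feats.shape
    padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                        feats,
                        np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
```

The network then regresses each 2827-dim spliced vector to the 771-dim (static, delta, delta-delta) target of the central frame.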

SLIDE 8

Objective measures – CD and LLR


[Bar chart: Cepstral Distance (0–6) by room (Rooms 1–3, Near/Far) and average, for Unprocessed, SS (1ch), DNN (1ch), DS+SS (8ch), MVDR+DNN (8ch)]

[Bar chart: Log Likelihood Ratio (0.1–0.9), same conditions and systems]

Both DS+SS and MVDR+DNN reduce cepstral distance and LLR significantly, especially in high-reverberation cases. DNN degrades LLR significantly in the 8-channel low-reverberation cases.
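For reference, one common per-frame definition of the cepstral distance used in these measures (variants differ in the cepstral order and the weighting of c0, so the challenge toolkit's exact settings may differ):

```python
import numpy as np

def cepstral_distance(c_ref, c_test, order=24):
    """Mean per-frame cepstral distance in dB between two cepstral
    sequences of shape (frames, coeffs).  The c0 difference enters once;
    higher-order coefficients enter twice, since they represent the two
    symmetric halves of the log spectrum."""
    d0 = (c_ref[:, 0] - c_test[:, 0]) ** 2
    dk = np.sum((c_ref[:, 1:order + 1] - c_test[:, 1:order + 1]) ** 2, axis=1)
    return float(np.mean(10.0 / np.log(10.0) * np.sqrt(d0 + 2.0 * dk)))
```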

SLIDE 9

Objective measures – fwSegSNR and SRMR


[Bar chart: fwSegSNR (2–12) by room (Rooms 1–3, Near/Far) and average, for Unprocessed, SS (1ch), DNN (1ch), DS+SS (8ch), MVDR+DNN (8ch)]

[Bar chart: SRMR (1–7) for SimData (Rooms 1–3, Near/Far, average) and RealData (Room 1), same systems]

DNN improves fwSegSNR in most cases, but its SRMR improvements are smaller on real data, which points to a generalization problem of the DNN.
SLIDE 10

Subjective measures


MVDR+DNN generally removes more reverberation than DS+SS, but it also introduces more speech distortion and results in poorer perceived quality. Reasons?

  • Frame-by-frame processing of the DNN.
  • The DNN minimizes the mean square error between the predicted and clean log spectra, which is not a perceptually meaningful criterion.

[Tables: MeanOS scores for "Amount of Reverberation" and "Overall Quality", on simulated data (Room 1, Room 2) and real data, Near/Far: Unprocessed vs. Processed scores and improvements, for SS (1ch), DNN (1ch), DS+SS (8ch) and MVDR+DNN (8ch)]

SLIDE 11

Outline

  • System Highlights
  • Speech Enhancement

– Delay and Sum + spectral subtraction
– MVDR + DNN spectrogram enhancement

  • Speech Recognition

– Multi-condition training
– Clean-condition training

  • Summary


SLIDE 12

Speech Recognition Systems

  • MVDR beamforming for 2ch and 8ch.
  • Clean condition training scheme

– Cross-transform adaptation
– CMLLR (256-class) model adaptation
– HMM/GMM model (the challenge baseline settings)

  • Multi condition training scheme

– DNN-based feature compensation
– DNN-based acoustic modeling


SLIDE 13

ASR – Multi-condition training – Results

  • DNN feature mapping (585 – 3x2048 – 39)
  • DNN acoustic modeling (351 – 7x2048 – 3500; RBM pretraining + cross-entropy + sMBR)


DNN feature compensation and the DNN acoustic model are complementary. Reason?

  • DNN feature compensation uses a parallel corpus and a wider context.
  • Is it better to have two concatenated DNNs than one big DNN?

[Bar chart: WER (5–40) for simulated rooms (Rooms 1–3, Near/Far) and the real room, with and without DNN feature compensation, 1ch and 8ch]

SLIDE 14

ASR – Clean-condition training

  • Use cross-transform for feature compensation
  • Use CMLLR for model adaptation (challenge script)
  • HMM/GMM system (challenge script)


Temporal filtering processes feature trajectories; a linear transform processes feature vectors. How about combining them?

SLIDE 15

ASR – Cross-transform

  • Cross-transform is a generalization of both temporal filtering and linear transforms.
  • To adapt the feature at a time-frequency location, both the feature vector and the feature trajectory containing that location are used in the regression.


The cross shape is necessary to keep the number of free parameters small.
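Applying a cross-transform, given its parameters, can be sketched as below: a numpy illustration of the cross-shaped regression, with the fMLLR-style estimation of A, B and the bias omitted. Note that the centre tap of the temporal filter and the diagonal of the frame transform both see the same point x_t[f]; that overlap is inherent to the parameterization:

```python
import numpy as np

def cross_transform(X, A, B, bias):
    """Cross-transform feature adaptation (sketch of the idea, not the
    authors' estimation code).  Each adapted time-frequency point is a
    regression on the whole feature vector at time t (vertical arm, B)
    plus the trajectory of its own coefficient over +/-K frames
    (horizontal arm, A).

    X: (T, D) features; A: (D, 2K+1) per-coefficient temporal filters;
    B: (D, D) frame transform; bias: (D,) offset."""
    T, D = X.shape
    K = (A.shape[1] - 1) // 2
    pad = np.pad(X, ((K, K), (0, 0)), mode="edge")   # repeat edge frames
    out = X @ B.T + bias                             # vertical arm
    for tau in range(-K, K + 1):                     # horizontal arm
        out += A[:, tau + K] * pad[K + tau:K + tau + T]
    return out
```

Setting the temporal filter to a unit centre tap and B to zero recovers the identity, as does zero filters with B equal to the identity matrix, showing that both pure temporal filtering and pure linear transforms are special cases.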

SLIDE 16

ASR – Clean-condition training – Results

  • Cross-transform (33 frame window size, batch mode)
  • CMLLR (256 class, batch mode)
  • HMM/GMM system (Challenge scripts)


[Bar chart: WER (10–90) for SimData (Rooms 1–3, Near/Far), RealData (Room 1, Near/Far) and average; 1ch and 8ch systems with MVN, CrossTransform, CMLLR, and CrossTransform+CMLLR]

Cross-transform and CMLLR model adaptation are complementary. Reasons:

  • Cross-transform uses a longer context.
  • Multi-class CMLLR is more flexible: a different transform for each class.

SLIDE 17

Summary

  • Traditional beamforming works well for both speech enhancement and recognition.
  • DNN reduces reverberation significantly, but also introduces strong distortion, especially in high-reverberation cases.
  • Cross-transform adapts features using both long-term temporal information and spectral information; it is complementary to CMLLR.
  • Future directions

– Analyze why the DNN distorts the speech signal and propose a solution.
– Apply cross-transform to adaptive training of the DNN-based acoustic model in the multi-condition training scheme.

SLIDE 18

Thank you!
