SLIDE 1

Summary of the REVERB challenge

Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani

NTT Corporation

Emanuel Habets

International AudioLabs Erlangen

Reinhold Haeb‐Umbach, Volker Leutnant

Paderborn Univ.

Armin Sehr

Beuth Univ. of Applied Sciences Berlin

Walter Kellermann, Roland Maas

Univ. of Erlangen‐Nuremberg

Sharon Gannot

Bar‐Ilan Univ.

Bhiksha Raj

Carnegie Mellon Univ.


http://reverb2014.dereverberation.com/

SLIDE 2

Outline

  • Motivation and design of the REVERB challenge
  • Result summary
  • The SE (Speech Enhancement) results
  • The ASR results
  • Concluding remarks
  • Summary of the participants’ systems

SLIDE 3

Motivation

 Recently, substantial progress has been made in the field of reverberant speech signal processing, including

  • Single- and multi-channel dereverberation techniques
  • ASR techniques robust to reverberation

 Lack of a common evaluation framework

 The REVERB challenge provides a common evaluation framework for both ASR and SE studies

SLIDE 4

Target acoustic scenarios

  • Reverberant
  • Moderate stationary noise (SNR* ≈ 20 dB)
  • 1ch, 2ch and 8ch scenarios

Fig.: One of the microphone arrays used

* The “S” (signal) in SNR includes the direct signal and early reflections up to 50 ms.
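The footnote pins down the numerator of the SNR: everything within the first 50 ms of the room impulse response counts as signal. Below is a minimal sketch of how such an SNR could be computed from a measured RIR; `early_late_snr` is a hypothetical helper, and the official challenge tooling may define the split point differently (e.g., relative to the direct-path peak).

```python
import numpy as np
from scipy.signal import fftconvolve

def early_late_snr(x, h, noise, fs, early_ms=50.0):
    """SNR where the 'signal' is the clean source `x` convolved with the
    direct path plus early reflections (the first `early_ms` ms of the
    RIR `h`), matching the slide's definition. `noise` is the noise
    component at the microphone; all arrays are 1-D floats at rate `fs`."""
    split = int(round(early_ms * 1e-3 * fs))   # samples in the early part
    s = fftconvolve(x, h[:split])              # the "S" component
    return 10.0 * np.log10(np.mean(s ** 2) / np.mean(noise ** 2))
```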

SLIDE 5
The challenge data (1/2)

  • Based on the Wall Street Journal Cambridge (WSJCAM0) 5K task
  • Real recordings (RealData)*1 and simulated data (SimData)*2
    (development and evaluation sets provided)
  • SimData for experiments in various reverberation conditions
    (a part of SimData simulates RealData in terms of reverberation time)
  • RealData for validity assessment in real reverberation conditions
  • Text prompts used for both data sets were the same.
  • Clean and multi-condition (simulated) training data provided

*1 RealData is available from the LDC catalog as part of the MC-WSJ-AV corpus (since April 2014).
*2 Materials required to generate SimData are available on our webpage. The data will soon be available through the LDC catalog.

http://catalog.ldc.upenn.edu/LDC2014S03
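For intuition, SimData-style material is obtained by convolving clean utterances with measured room impulse responses and adding recorded noise at the target SNR. A minimal sketch of that process is below, with assumed inputs `clean`, `rir`, and `noise`; the official generation scripts on the challenge webpage are the authoritative recipe.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_reverberant(clean, rir, noise, snr_db=20.0):
    """Convolve clean speech with an RIR and add recorded noise scaled
    to the target SNR (the challenge uses roughly 20 dB). `noise` is
    assumed to be at least as long as `clean`."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    noise = noise[:len(reverberant)]
    # Choose gain g so that 10*log10(P_speech / (g^2 * P_noise)) == snr_db
    p_sig = np.mean(reverberant ** 2)
    p_noi = np.mean(noise ** 2)
    gain = np.sqrt(p_sig / (p_noi * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```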

SLIDE 6

The challenge data (2/2)

  • Acoustic conditions for SimData and RealData:

                Reverb time (T60)        Distance between speaker and mic
    SimData     0.25 s, 0.5 s, 0.7 s*    near: 0.5 m / far: 2.0 m
                (Room 1, 2, 3)
    RealData    0.7 s*                   near: ~1.0 m / far: > 2.5 m

    * SimData Room3 simulates RealData

  • Sound examples
    (Fig.: audio grid — RealData (far) and SimData (Room2, far), male and female speakers, clean/headset vs. observed)
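The T60 values in the table are properties of the rooms. As a reference point, a common way to estimate T60 from a measured RIR is Schroeder backward integration, sketched below under the assumption that a clean RIR and its sampling rate are available; this is illustrative and not part of the challenge tooling.

```python
import numpy as np

def t60_schroeder(rir, fs):
    """Estimate T60 from an RIR: build the Schroeder energy decay curve,
    fit a line to its -5 dB..-25 dB range, and extrapolate to -60 dB."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]            # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(rir)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB/s (negative)
    return -60.0 / slope
```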

SLIDE 7

The challenge tasks: ASR and SE

  • ASR task
    • Evaluation criterion: Word Error Rate (WER)
  • SE task
    • Objective evaluation criteria
      • Intrusive measures (require reference clean speech)
        • Cepstrum distance (CD)
        • Frequency-weighted segmental SNR (FWsegSNR)
        • Log likelihood ratio (LLR)
        • PESQ (optional)
      • Non-intrusive measure
        • Speech-to-reverberation modulation ratio (SRMR)
    • Subjective evaluation criteria (web-based MUSHRA test)
      • Perceived amount of reverberation
      • Overall quality (i.e., artifacts, distortions, remaining reverberation, etc.)
  • Same test & training data provided for both tasks
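To make one of the intrusive measures concrete, here is a rough sketch of a cepstrum-distance computation between reference and processed signals. It uses FFT-based real cepstra; the official REVERB evaluation tools may differ in detail (e.g., LPC-derived cepstra, handling of c0, frame settings), so treat this as illustrative only.

```python
import numpy as np

def cepstral_distance(ref, deg, n_fft=512, hop=128, n_cep=24):
    """Frame-averaged cepstral distance (dB) between a reference signal
    and a degraded/processed signal of (roughly) equal length."""
    def cepstra(x):
        win = np.hanning(n_fft)
        frames = [np.fft.irfft(np.log(np.abs(
                      np.fft.rfft(win * x[i:i + n_fft])) + 1e-10))[1:n_cep + 1]
                  for i in range(0, len(x) - n_fft, hop)]
        return np.array(frames)
    c_ref, c_deg = cepstra(ref), cepstra(deg)
    n = min(len(c_ref), len(c_deg))
    d = c_ref[:n] - c_deg[:n]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(d ** 2, axis=1))
    return float(np.mean(per_frame))
```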

SLIDE 8

Number of submissions

  • 18 submissions (incl. 49 systems) to the ASR task
  • 14 submissions (incl. 25 systems) to the SE task
  • 27 participants (i.e., number of papers)

(Fig.: percentage of 1ch, 2ch and 8ch systems in each task)

SLIDE 9

A quick introduction to the participants’ submitted systems

SLIDE 10

A wide variety of approaches submitted

(Fig.: processing pipeline — spatial filtering → 1ch SE/FE; main focus of SE participants)

SLIDE 11

A wide variety of approaches submitted

(Fig.: pipeline — spatial filtering → 1ch SE/FE → robust feature extraction/normalization → decoding (AM, LM); main focus of ASR participants)

SLIDE 12

A wide variety of approaches submitted

(Fig.: pipeline as above, extended with system combination; main focus of ASR participants)

SLIDE 13

A wide variety of approaches submitted

Submissions range from 1ch/multi-channel SE algorithms to ASR back-end algorithms.

(Fig.: pipeline — system combination, spatial filtering, 1ch SE/FE, robust feature extraction/normalization, decoding (AM, LM), adaptation; main focus of ASR participants)

SLIDE 14

Various approaches (1/4)

(Fig.: pipeline diagram; this slide covers the spatial-filtering stage)

  • De-noising (STFT, auditory-feature domain)
    e.g., MVDR, delay-and-sum, GSC, multichannel Wiener filter
  • De-reverberation
    • STFT domain
      • Inverse filtering
      • Linear prediction
      • Correlation shaping
      • DOA-detection-based beamformer
      • Mask-based approach
      • Phase-error filter
    • Magnitude-spectrum domain
      • Estimation of non-negative RIRs
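As an example of the simplest spatial-filtering approach listed above, here is a sketch of a frequency-domain delay-and-sum beamformer; the per-microphone steering delays are assumed to come from an external DOA estimate, and the names and sign convention are illustrative.

```python
import numpy as np

def delay_and_sum(x, fs, delays):
    """Delay-and-sum beamformer. `x`: (n_mics, n_samples) array of
    microphone signals; `delays`: per-mic steering delays in seconds.
    Each channel is time-shifted via a linear phase in the frequency
    domain, then the channels are averaged."""
    n_mics, n = x.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for m in range(n_mics):
        X = np.fft.rfft(x[m])
        X *= np.exp(-2j * np.pi * freqs * delays[m])  # apply delay
        out += np.fft.irfft(X, n)
    return out / n_mics
```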

SLIDE 15

Various approaches (2/4)

(Fig.: pipeline diagram; this slide covers the 1ch SE/FE stage)

  • De-noising
    e.g., spectral subtraction (SS), MMSE-STSA
  • De-reverberation
    • Power/magnitude/auditory spectrum domain
      e.g., exponential RIR model, linear prediction, non-negative matrix factorization/deconvolution, DNN/DRNN/DAE-based dereverberation
    • Cepstral domain
      e.g., cepstral smoothing, ML-based inverse filter estimation
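Of the single-channel de-noising techniques named above, spectral subtraction (SS) is the easiest to sketch. The version below estimates the noise magnitude from the first few frames (assumed speech-free) and applies a spectral floor; window normalization in the overlap-add is omitted for brevity, so this is a sketch rather than a production implementation.

```python
import numpy as np

def spectral_subtraction(x, n_fft=512, hop=128, noise_frames=10, floor=0.05):
    """Basic magnitude-domain spectral subtraction with a spectral floor
    to limit musical noise. Assumes the first `noise_frames` frames of
    `x` contain noise only."""
    win = np.hanning(n_fft)
    n_frames = (len(x) - n_fft) // hop + 1
    stft = np.array([np.fft.rfft(win * x[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])
    noise_mag = np.mean(np.abs(stft[:noise_frames]), axis=0)
    mag = np.maximum(np.abs(stft) - noise_mag, floor * np.abs(stft))
    clean_stft = mag * np.exp(1j * np.angle(stft))
    out = np.zeros(len(x))                      # overlap-add resynthesis
    for i, frame in enumerate(clean_stft):
        out[i * hop:i * hop + n_fft] += win * np.fft.irfft(frame, n_fft)
    return out
```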

SLIDE 16

Various approaches (3/4)

(Fig.: pipeline diagram; this slide covers robust feature extraction/normalization)

  • Robust features
    e.g., PLP, auditory/articulatory-based features, modified cepstral features, i-vector, warped MVDR, etc.
  • Normalization
    e.g., CMS, VTLN, CMLLR, (H)LDA
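Cepstral mean subtraction (CMS), listed under normalization, is worth a two-line illustration: a stationary convolutive channel (including some reverberation effects) is additive in the log-cepstral domain, so removing the per-utterance mean cancels it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS: subtract the per-utterance mean of each cepstral coefficient.
    `cepstra`: (n_frames, n_coeffs) feature matrix, e.g. MFCCs."""
    return cepstra - np.mean(cepstra, axis=0, keepdims=True)
```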

SLIDE 17

Various approaches (4/4)

(Fig.: pipeline diagram; this slide covers the ASR back-end)

  • Acoustic model: GMM, SGMM, DNN, LSTM
  • Adaptation: MLLR, DNN adaptation
  • Training: clean/multi-condition, SAT, ML/MMI/bMMI
  • System combination: ROVER, multi-stream HMM
  • Decoding: minimum Bayes risk decoding

SLIDE 18

Various approaches (4/4)

(Fig.: complete pipeline — system combination, spatial filtering, 1ch SE/FE, robust feature extraction/normalization, decoding with AM, LM, and adaptation)

SLIDE 19

Now, the results...

SLIDE 20

Results already publicly available

  • Results for the ASR task

http://reverb2014.dereverberation.com/result_asr.html

  • Results for the SE task

http://reverb2014.dereverberation.com/result_se.html

Note: More results (detailed/new/updated) are available in the participants’ papers.

SLIDE 21

Let’s start with the ASR results...

SLIDE 22

ASR results: baselines

(Fig.: bar chart of WER (%); conditions: SimData small/mid./large room and RealData, each near and far; systems compared: HTK baseline (clean training), HTK baseline + CMLLR (clean training), HTK baseline (multi-condition training), HTK baseline + CMLLR (multi-condition training))

Recognition of the unprocessed 1ch observation
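For reference, the WER used throughout the ASR results is the Levenshtein-aligned error count over the reference word count. A minimal sketch follows (standard dynamic programming, not the challenge's scoring script):

```python
import numpy as np

def word_error_rate(ref, hyp):
    """WER (%) = (substitutions + deletions + insertions) / #ref words."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(d[i - 1, j] + 1,   # deletion
                          d[i, j - 1] + 1,   # insertion
                          sub)               # substitution / match
    return 100.0 * d[len(r), len(h)] / len(r)
```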

SLIDE 23

ASR results: at a glance

  • All the submitted WERs (everything mixed, not a fair comparison)

(Fig.: all submitted WERs (%) over SimData small/mid./large room and RealData, near and far; clean/headset WERs and the four HTK baselines — clean and multi-condition training, with and without CMLLR — shown for reference)

SLIDE 24
ASR results analysis with bubble chart

  • Relationship between (averaged) WER and the number of microphones, training data, and acoustic model

(Fig.: bubble chart; the size of a circle indicates the number of systems in the corresponding category)
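A chart of this kind is straightforward to reproduce; the sketch below uses matplotlib with purely illustrative numbers (the real per-category WERs are in the result tables on the challenge webpage):

```python
import matplotlib.pyplot as plt

# Illustrative values only -- not the actual challenge results.
n_mics    = [1, 2, 8]           # category: number of microphones
mean_wer  = [35.0, 30.0, 25.0]  # averaged WER (%) per category
n_systems = [25, 10, 14]        # systems per category (sets bubble size)

plt.scatter(n_mics, mean_wer, s=[60 * n for n in n_systems], alpha=0.5)
plt.xlabel("Number of microphones")
plt.ylabel("Averaged WER (%)")
plt.title("ASR results analysis with bubble chart")
plt.show()
```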

SLIDE 25

ASR results analysis with bubble chart

  • Results per 1ch, 2ch and 8ch systems

More microphones lead to better performance

SLIDE 26

ASR results analysis with bubble chart

  • Training data: “clean” vs. “multi-condition” vs. “own data”※

More training data (greater acoustic variety) leads to better performance

※ e.g., American WSJ data, data with different SNRs

SLIDE 27

ASR results analysis with bubble chart

  • GMM-HMM recognizers vs. DNN-HMM recognizers
  • The top-performing systems often employ DNN-HMM
  • Resulting performance may differ due to the front-end processing, the DNN configuration, etc.

SLIDE 28
ASR results analysis: SimData vs. RealData

  • Relationship between SimData scores and RealData scores

(Fig.: scatter plots — SimData vs. RealData, and SimData Room3 Far vs. RealData)

Very strong correlation between SimData and RealData scores
(even stronger between SimData Room3 Far and RealData)

SLIDE 29

ASR results: Some remarks...

  • Some more work is required to reach clean/headset performance.
    (E.g., for RealData, the headset WER is roughly 60% of that of the best-performing system.)
  • Strategies often present in the top-performing systems include:
    • Some form of dereverberation (STFT/amplitude-spectrum/feature domain)
    • Linear multi-channel filtering (MVDR, delay-and-sum, etc.), often for denoising
    • A strong back-end (e.g., DNN-HMM recognizer, sophisticated adaptation, robust feature extraction, multi-condition training)
    • System combination
  • However, it is hard to tell the exact impact of each SE/ASR technique.
    (That is something we should discover at this workshop!)

SLIDE 30

Now, the SE part...

SLIDE 31

  • An important question in the SE task:

Most submissions managed to improve the objective measures (cf. webpage, presentations), but what about their subjective quality?

SLIDE 32

Subjective evaluation: test outline

  • MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test
  • 2 evaluation metrics:
    • Perceived amount of reverberation
      (scale: Very large / Large / Mid. / Small / Very small)
    • Overall quality (i.e., artifacts, distortions, remaining reverberation, etc.)
      (scale: Bad / Poor / Fair / Good / Excellent)
  • Test carried out separately for 1ch, 2ch and 8ch systems
  • Evaluation conditions (4 conditions): SimData Room2 near & far, RealData near & far
  • Test materials: clean (hidden ref.) + no processing + a 3.5 kHz low-pass of the reverberant speech (anchor) + test systems
  • Web-based listening test (not well controlled)

SLIDE 33

Subjective eval. result: 1ch

(Fig.: MUSHRA scores at the RealData far condition; headset and unprocessed references shown; number of samples N = 13 and N = 12 for the two tests)

SLIDE 34

Subjective eval. result: 1ch

(Fig.: MUSHRA scores at the RealData far condition; N = 13 and N = 12)

SLIDE 35

Subjective eval. result: 1ch

(Fig.: MUSHRA scores at the RealData far condition; N = 13 and N = 12)

SLIDE 36

Subjective eval. result: 1ch

(Fig.: MUSHRA scores at the RealData far condition; N = 13 and N = 12)

SLIDE 37

Subjective eval. result: 1ch

(Fig.: MUSHRA scores at the RealData far condition; N = 13 and N = 12)

SLIDE 38

Subjective eval. result: 2ch

(Fig.: MUSHRA scores at the RealData far condition; N = 15 for both tests)

SLIDE 39

Subjective eval. result: 2ch

(Fig.: MUSHRA scores at the RealData far condition; N = 15 for both tests)

SLIDE 40

Subjective eval. result: 8ch

(Fig.: MUSHRA scores at the RealData far condition; N = 9 and N = 10)

SLIDE 41

Subjective eval. result: 8ch

(Fig.: MUSHRA scores at the RealData far condition; N = 10)

SLIDE 42

  • Another important question:

How does the subjective score correlate with the objective measures?

SLIDE 43

SE results: subjective vs. objective

  • Relationship between the subjective scores and each objective score (averaged correlation coefficients):

                                          CD     FWSegSNR  LLR    SRMR   PESQ
    “Perceived amount of reverberation”   0.70   0.71      0.43   0.62   0.77
    “Overall quality”                     0.35   0.39      0.21   0.12   0.28

  • The amount of dereverberation can be roughly measured with objective measures such as CD, FWSegSNR and PESQ.
  • The overall quality is not well captured by the objective measures used.

There may be more appropriate objective measures that correlate well with the subjective scores.
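The coefficients above are plain Pearson correlations between per-system subjective scores and the corresponding objective scores (the slide reports values averaged over conditions). A minimal sketch, assuming matched per-system score arrays:

```python
import numpy as np

def subjective_objective_correlation(subjective, objective):
    """Pearson correlation between per-system subjective MUSHRA scores
    and one objective measure (e.g., CD or PESQ), as in the table above."""
    return float(np.corrcoef(subjective, objective)[0, 1])
```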

SLIDE 44

SE results: Some remarks...

  • 1ch dereverberation is still a challenging task
    (much room for improvement!)
  • Some multi-channel dereverberation methods were found to be effective in various conditions.
  • A more appropriate objective quality measure, one that coincides well with subjective scores, should be considered.

SLIDE 45

Conclusions...

  • A wide variety of approaches were submitted to both the ASR and the SE tasks
  • ASR task
    • Most submissions managed to bring improvement over the baseline systems
    • The top-performing systems tend to be quite sophisticated in both the front-end and the back-end
  • SE task
    • Most submissions succeeded in dereverberation
    • Improvement in the overall quality was not always easy
    • Better objective measures may be necessary

SLIDE 46

Important questions to be discussed...

  • Is this challenge already overcome?
  • Which directions/methodologies are essential to pursue?
  • Is collaboration between SE and ASR necessary?
  • How was the challenge framework? How can we do better?
    • For improving ASR performance
    • For improving SE performance

Let’s discover our own answers during the workshop and discuss them at the panel session
SLIDE 47

Thank you... and now let’s start the workshop!

SLIDE 48

Appendix

SLIDE 49

Intermediate result of the subjective quality test for 1ch systems

Notes:
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.
  • Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing

SLIDE 50

Intermediate result of the subjective quality test for 2ch systems

Notes:
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.
  • Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing

SLIDE 51

Intermediate result of the subjective quality test for 8ch systems

Notes:
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.
  • Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing

SLIDE 52

Differential score based on the MUSHRA score: 1ch systems

Notes:
  • The differential scores were calculated by subtracting the scores for the unprocessed signal from all the scores, to remove potential biases [1].
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.

[1] T. Zernicki et al., “Enhanced coding of high-frequency tonal components in MPEG-D USAC through joint application of eSBR and sinusoidal modeling,” Proc. ICASSP 2011.
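The bias-removal step described in the notes is a simple per-listener subtraction; here is a minimal sketch, under the assumption that the scores are stored as a (listeners × systems) matrix:

```python
import numpy as np

def differential_scores(scores, unprocessed_col):
    """Subtract each listener's score for the unprocessed signal from all
    of that listener's scores, removing per-listener bias [1].
    `scores`: (n_listeners, n_systems) matrix of MUSHRA scores."""
    return scores - scores[:, unprocessed_col:unprocessed_col + 1]
```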

SLIDE 53

Differential score based on the MUSHRA score: 2ch systems

Notes:
  • The differential scores were calculated by subtracting the scores for the unprocessed signal from all the scores, to remove potential biases [1].
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.

[1] T. Zernicki et al., “Enhanced coding of high-frequency tonal components in MPEG-D USAC through joint application of eSBR and sinusoidal modeling,” Proc. ICASSP 2011.

SLIDE 54

Differential score based on the MUSHRA score: 8ch systems

Notes:
  • The differential scores were calculated by subtracting the scores for the unprocessed signal from all the scores, to remove potential biases [1].
  • It is not recommended to directly compare numbers obtained under different reverberation conditions.
  • All mean scores are plotted with their associated 95% confidence intervals.

[1] T. Zernicki et al., “Enhanced coding of high-frequency tonal components in MPEG-D USAC through joint application of eSBR and sinusoidal modeling,” Proc. ICASSP 2011.

SLIDE 55

ASR result for the systems trained on clean data

Details of the ASR results are available at http://www.reverb2014.dereverberation.com/result_asr.html

Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing

SLIDE 56

ASR result for the systems trained on multi-condition data

Details of the ASR results are available at http://www.reverb2014.dereverberation.com/result_asr.html

Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing

SLIDE 57

ASR result for the systems trained on own data

Details of the ASR results are available at http://www.reverb2014.dereverberation.com/result_asr.html

Notation: RT = real-time processing; UB = utterance-batch processing; FB = full-batch processing
