

SLIDE 1

Joint Dereverberation and Noise Reduction Using Beamforming and a Single-Channel Speech Enhancement Scheme

B. Cauchi, I. Kodrasi, R. Rehr,
S. Gerlach, T. Gerkmann,
S. Doclo, S. Goetze

Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology
Oldenburg University, Signal Processing Group

Florence, May 10th 2014

benjamin.cauchi@idmt.fraunhofer.de, phone 0441 2172-450
© Fraunhofer IDMT

SLIDE 2

Introduction

Overview of the proposed system
Design of the MVDR beamformer
  DOA estimated using MUSIC
  Estimated noise covariance
Single-channel enhancement scheme
  Combination and optimization of published estimators
Results
  Objective measures
  MUSHRA scores
  WER using a baseline recognizer

SLIDE 3

1. Proposed System

Overview

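[Block diagram. Blocks: MVDR beamformer, single-channel enhancement, VAD, DOA estimation, coherence estimation. Signals: source $s(n)$; room impulse responses $h_1(n), h_2(n), \dots, h_M(n)$; microphone signals $y_1(n), y_2(n), \dots, y_M(n)$; estimated DOA $\hat{\theta}$; noise coherence $\Gamma(k)$; beamformer output $\hat{x}(n)$; enhanced output $\hat{s}(n)$]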

Beamformer: steered towards the estimated direction of arrival (DOA)
Single-channel enhancement: based on statistical estimators
  Late reverberant spectral variance (LRSV)
  Noise spectral variance (NSV)
  Speech spectral variance (SSV)

SLIDE 4

2. MVDR Beamformer

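With $Y_m(k, \ell)$ the STFT of the input signal in the $m$-th microphone, we define

$$\mathbf{Y}(k, \ell) = [Y_1(k, \ell) \; Y_2(k, \ell) \; \dots \; Y_M(k, \ell)]^T$$

The output $\hat{X}(k, \ell)$ of the beamformer is obtained as

$$\hat{X}(k, \ell) = \mathbf{W}_\theta^H(k) \, \mathbf{Y}(k, \ell)$$

where

$$\mathbf{W}_\theta(k) = \frac{\Gamma^{-1}(k) \, \mathbf{d}_\theta(k)}{\mathbf{d}_\theta^H(k) \, \Gamma^{-1}(k) \, \mathbf{d}_\theta(k)}$$

Noise coherence matrix $\Gamma(k)$: estimated using a VAD.
Steering vector $\mathbf{d}_\theta(k)$: computed from $\hat{\theta}$ using a far-field assumption.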
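As an illustration of the MVDR weights on this slide, the following is a minimal NumPy sketch for a single frequency bin. The array geometry (four microphones, 5 cm spacing), the speed of sound, the identity coherence matrix and the function names `mvdr_weights` and `far_field_steering` are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def mvdr_weights(gamma, d):
    """MVDR weights W = Gamma^{-1} d / (d^H Gamma^{-1} d) for one frequency bin."""
    gamma_inv_d = np.linalg.solve(gamma, d)        # Gamma^{-1} d without an explicit inverse
    return gamma_inv_d / (d.conj() @ gamma_inv_d)  # normalise so that W^H d = 1

def far_field_steering(theta, mic_xy, freq_hz, c=343.0):
    """Far-field steering vector for a planar array (mic_xy: M x 2 positions in metres)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return np.exp(2j * np.pi * freq_hz * (mic_xy @ direction) / c)

# Hypothetical 4-microphone linear array with 5 cm spacing (assumption).
mic_xy = np.stack([np.arange(4) * 0.05, np.zeros(4)], axis=1)
d = far_field_steering(np.deg2rad(60.0), mic_xy, freq_hz=1000.0)
gamma = np.eye(4)                  # spatially white noise as a placeholder coherence matrix
w = mvdr_weights(gamma, d)

y = 2.0 * d                        # plane wave of amplitude 2 arriving from the look direction
x_hat = w.conj() @ y               # beamformer output X_hat = W^H Y
```

Because the weights satisfy $\mathbf{W}_\theta^H \mathbf{d}_\theta = 1$, a signal arriving from the look direction passes undistorted, so `x_hat` recovers the amplitude of the simulated plane wave.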


SLIDE 5

2. MVDR Beamformer

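Estimation of noise field coherence

Noise periods are identified with a VAD
  Comparison between the long-term spectral envelope and the average noise spectrum
$\Gamma(k)$ is estimated using the detected noise-only frames
Alternatively, a theoretically diffuse noise field is used:

$$\mathbf{W}_\theta(k) = \frac{(\Gamma_{\mathrm{diff}}(k) + \varrho(k) \mathbf{I}_M)^{-1} \, \mathbf{d}_\theta(k)}{\mathbf{d}_\theta^H(k) \, (\Gamma_{\mathrm{diff}}(k) + \varrho(k) \mathbf{I}_M)^{-1} \, \mathbf{d}_\theta(k)}$$

with $\varrho(k)$ a regularization parameter chosen such that $\mathbf{W}_\theta^H(k) \mathbf{W}_\theta(k) \le \mathrm{WNG}_{\max} = 10$ dB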
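The diffuse-field alternative can be sketched as follows. The sketch assumes the standard sinc model for the coherence of a spherically isotropic noise field, reads the 10 dB limit as $\mathbf{W}_\theta^H \mathbf{W}_\theta \le 10$ on a linear scale, and finds $\varrho(k)$ by a simple geometric search. The circular 8-microphone array of radius 10 cm, the search parameters and the function names are all assumptions for illustration, not details given on the slide.

```python
import numpy as np

def diffuse_coherence(mic_xy, freq_hz, c=343.0):
    """Coherence matrix of an ideal diffuse field: sinc(2 f d_ij / c)."""
    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    return np.sinc(2.0 * freq_hz * dist / c)   # np.sinc(x) = sin(pi x) / (pi x)

def regularized_mvdr(gamma_diff, d, max_norm=10.0, rho=1e-6, growth=1.5):
    """Grow rho until the weights satisfy W^H W <= max_norm."""
    eye = np.eye(len(d))
    while True:
        a = np.linalg.solve(gamma_diff + rho * eye, d)
        w = a / (d.conj() @ a)                 # distortionless normalisation
        if np.real(w.conj() @ w) <= max_norm:
            return w, rho
        rho *= growth                          # more diagonal loading, smaller weight norm

# Hypothetical circular array: 8 microphones, 10 cm radius (assumption).
angles = 2 * np.pi * np.arange(8) / 8
mic_xy = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
freq_hz = 300.0
look = np.array([np.cos(0.5), np.sin(0.5)])    # look direction of 0.5 rad
d = np.exp(2j * np.pi * freq_hz * (mic_xy @ look) / 343.0)

w, rho = regularized_mvdr(diffuse_coherence(mic_xy, freq_hz), d)
```

The search terminates because, as $\varrho$ grows, the weights approach the delay-and-sum solution $\mathbf{d}_\theta / M$, whose squared norm is $1/M$.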

Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., and Rubio, A., Efficient voice activity detection algorithms using long-term speech information, 2003.


SLIDE 6

2. MVDR Beamformer

DOA Estimation

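[Figure: illustration of the direction of arrival $\theta$]

$$\hat{\theta} = \arg\max_{\theta} \frac{1}{K} \sum_{k = k_{\mathrm{low}}}^{k_{\mathrm{high}}} U_\theta(k, \ell)$$

where $U_\theta(k, \ell)$ is the MUSIC pseudo-spectrum:

$$U_\theta(k, \ell) = \frac{1}{\mathbf{d}_\theta^H(k) \, \mathbf{E}(k, \ell) \, \mathbf{E}^H(k, \ell) \, \mathbf{d}_\theta(k)}$$

$\mathbf{E}(k, \ell) = [\mathbf{e}_{Q+1}(k, \ell) \; \dots \; \mathbf{e}_M(k, \ell)]$, with $\mathbf{e}_m$ denoting the eigenvectors of the covariance matrix of $\mathbf{Y}(k, \ell)$.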
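The DOA search can be illustrated with a small simulation. This is a sketch under several assumptions that the slide does not state: a circular 8-microphone array of radius 10 cm, a single source ($Q = 1$), a batch sample covariance per frequency bin instead of a frame-wise estimate, and a 2-degree search grid. The function names are hypothetical.

```python
import numpy as np

def steering(theta, mic_xy, freq_hz, c=343.0):
    """Far-field steering vector for microphones at mic_xy (M x 2, metres)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return np.exp(2j * np.pi * freq_hz * (mic_xy @ direction) / c)

def music_doa(frames, mic_xy, freqs_hz, candidates, n_sources=1):
    """frames has shape (K, M, L): K frequency bins, M microphones, L frames."""
    score = np.zeros(len(candidates))
    n_mics = frames.shape[1]
    for y, f in zip(frames, freqs_hz):
        cov = y @ y.conj().T / y.shape[1]          # sample covariance of Y
        _, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
        e_noise = vecs[:, : n_mics - n_sources]    # noise subspace E (M - Q eigenvectors)
        for i, theta in enumerate(candidates):
            proj = e_noise.conj().T @ steering(theta, mic_xy, f)
            score[i] += 1.0 / np.real(proj.conj() @ proj)   # 1 / (d^H E E^H d)
    score /= len(freqs_hz)                         # average over the K bins
    return candidates[np.argmax(score)], score

# Simulated data: one source at 40 degrees plus sensor noise (all values are assumptions).
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(8) / 8
mic_xy = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
freqs_hz = np.array([500.0, 1000.0, 1500.0, 2000.0])
true_theta = np.deg2rad(40.0)
n_frames = 200
frames = np.empty((len(freqs_hz), 8, n_frames), dtype=complex)
for k, f in enumerate(freqs_hz):
    s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
    noise = 0.1 * (rng.standard_normal((8, n_frames))
                   + 1j * rng.standard_normal((8, n_frames)))
    frames[k] = np.outer(steering(true_theta, mic_xy, f), s) + noise

candidates = np.deg2rad(np.arange(0.0, 360.0, 2.0))
theta_hat, pseudo_spectrum = music_doa(frames, mic_xy, freqs_hz, candidates)
```

The steering vector is nearly orthogonal to the noise subspace at the true direction, so the denominator becomes small and the averaged pseudo-spectrum peaks there.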

Schmidt, R., Multiple emitter location and signal parameter estimation, 1986.


SLIDE 7

3. Single-channel Enhancement

Overview

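[Block diagram. Blocks: noise power estimation, speech power estimation, reverb power estimation, T60 estimator, speech power re-estimation, gain function. Signals: input $\hat{X}(k)$; estimates $\hat{\sigma}^2_{\tilde{v}}(k)$, $\hat{\sigma}^2_z(k)$, $\hat{\sigma}^2_r(k)$, the sum $\hat{\sigma}^2_r(k) + \hat{\sigma}^2_{\tilde{v}}(k)$ and $\hat{\sigma}^2_s(k)$; estimated $T_{60}$; gain $G(k)$; output $\hat{S}(k)$]

$\sigma^2_{\tilde{v}}(k, \ell)$ estimated using Minimum Statistics
$\sigma^2_s(k, \ell)$ estimated using Cepstral Smoothing
$\sigma^2_r(k, \ell)$ estimated using Lebart’s approach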

Martin, R., Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, 2001.
Breithaupt, C., Gerkmann, T., and Martin, R., A Novel A Priori SNR Estimation Approach Based on Selective Cepstro-Temporal Smoothing, 2008.
Eaton, J., Gaubitch, N.D., and Naylor, P.A., Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost, 2012.
Lebart, K., Boucher, J.M., and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.


SLIDE 8

3. Single-channel Enhancement

LRSV estimation

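RIR modeled as Gaussian noise with decay rate $\Delta = \frac{3 \ln 10}{T_{60} f_s}$

[Figure: RIR and its decay envelope; amplitude versus time [s], axis from 0.1 to 0.5 s]

Representing the variance of the reverberant speech as

$$\sigma^2_z(k, \ell) = \sigma^2_r(k, \ell) + \sigma^2_s(k, \ell)$$

leads to the estimator

$$\hat{\sigma}^2_r(k, \ell) = e^{-2 \Delta T_d f_s} \, \sigma^2_z(k, \ell - T_d / T_s)$$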
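The estimator can be made concrete with a short sketch. Substituting the decay rate into the exponent gives $e^{-2 \Delta T_d f_s} = 10^{-6 T_d / T_{60}}$, so the sampling rate cancels. The slide does not define $T_d$ and $T_s$; they are read here as the delay after which reverberation counts as late and the STFT frame shift, both in seconds, which is an assumption. The numerical values and the function name are illustrative only.

```python
import numpy as np

def late_reverb_variance(sigma_z2, t60, t_d, t_s):
    """Delayed and attenuated copy of sigma_z2, which has shape (K, L)."""
    decay = 10.0 ** (-6.0 * t_d / t60)        # equals exp(-2 * Delta * T_d * f_s)
    shift = int(round(t_d / t_s))             # delay T_d / T_s expressed in frames
    n_frames = sigma_z2.shape[1]
    sigma_r2 = np.zeros_like(sigma_z2)        # no estimate for the first `shift` frames
    sigma_r2[:, shift:] = decay * sigma_z2[:, : n_frames - shift]
    return sigma_r2

# Illustrative values (assumptions): T60 = 0.5 s, T_d = 50 ms, frame shift T_s = 10 ms.
sigma_z2 = np.ones((2, 20))                   # constant reverberant speech variance
sigma_r2 = late_reverb_variance(sigma_z2, t60=0.5, t_d=0.05, t_s=0.01)
```

With these values the attenuation is $10^{-0.6} \approx 0.25$, so a stationary reverberant variance of 1 yields a late reverberant variance of about 0.25 from the sixth frame onwards.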

Lebart, K., Boucher J.M. and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.


SLIDE 9

3. Single-channel Enhancement

Gain function

The output $\hat{X}(k, \ell)$ of the beamformer contains the anechoic speech, the remaining noise and the spatially filtered reverberation:

$$\hat{X}(k, \ell) = S(k, \ell) + \tilde{V}(k, \ell) + R(k, \ell)$$

We aim to compute a real-valued gain such that

$$\hat{S}(k, \ell) = G(k, \ell) \, \hat{X}(k, \ell)$$

$G(k, \ell)$ is computed using an MMSE estimator of the speech amplitude based on a super-Gaussian speech model.
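To show how a real-valued gain is applied to the beamformer output, the sketch below uses a plain Wiener gain as a stand-in. It is not the parameterized MMSE amplitude estimator cited on this slide. The sketch treats the sum of the late reverberant and noise variances as the interference, and the gain floor `g_min` is a hypothetical parameter.

```python
import numpy as np

def wiener_gain(sigma_s2, sigma_r2, sigma_v2, g_min=0.1):
    """Stand-in gain: xi / (1 + xi), with xi the speech-to-interference ratio."""
    interference = np.maximum(sigma_r2 + sigma_v2, 1e-12)   # avoid division by zero
    xi = sigma_s2 / interference
    return np.maximum(xi / (1.0 + xi), g_min)                # floor limits the attenuation

# Two example bins: one dominated by speech, one without speech.
sigma_s2 = np.array([4.0, 0.0])
sigma_r2 = np.array([0.5, 0.5])
sigma_v2 = np.array([0.5, 0.5])
x_hat = np.array([1.0 + 1.0j, 2.0 + 0.0j])

gain = wiener_gain(sigma_s2, sigma_r2, sigma_v2)
s_hat = gain * x_hat              # real gain: scales the magnitude, keeps the phase
```

In the first bin the ratio is 4, giving a gain of 0.8; in the second bin the gain falls to the floor of 0.1.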

Breithaupt, C., Krawczyk, M., and Martin, R., Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech, 2008.


SLIDE 10

4. Objective Measures

SRMR

[Figure: SRMR (dB) for the unprocessed, 1-channel and 8-channel signals. Panels: Real ($T_{60}$ = 700 ms; near, far, mean) and Simulated ($T_{60}$ = 250, 500 and 700 ms; near, far, mean)]

Illustrates dereverberation performance in all conditions
Better dereverberation is achieved by the multichannel system, except for $T_{60}$ = 500 ms


SLIDE 11

4. Objective Measures

FWSSNR

[Figure: FWSSNR (dB) for the unprocessed, 1-channel and 8-channel signals on simulated data ($T_{60}$ = 250, 500 and 700 ms; near, far, mean)]

Illustrates noise reduction in all conditions
The beamforming step is advantageous for noise reduction

SLIDE 12

4. Objective Measures

PESQ

[Figure: PESQ for the unprocessed, 1-channel and 8-channel signals on simulated data ($T_{60}$ = 250, 500 and 700 ms; near, far, mean)]

The improvement of the PESQ score in all conditions illustrates the overall improvement in speech quality


SLIDE 13

5. Subjective Tests

MUSHRA test

Intermediate results of the subjective test run by the organizers:

Tests carried out separately for 1 and 8 channels

[Figure: MUSHRA scores (%) for overall quality and reverberation, shown alongside the unprocessed signal. Panels: Single-channel and Multichannel, each with Real and Sim data (near, far)]

Improvement for all tested conditions
Higher improvement for the overall quality

SLIDE 14

6. Preprocessing for ASR

Word Error Rate

Baseline recognizer provided by the organizers
Using models pre-trained on clean data

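[Figure: WER (%) for the baseline, 1-channel and 8-channel systems. The values annotated in the plot are listed below as two series, in the order in which they appear; the legend order is 1 Channel, 8 Channels, baseline.]

Simulated
Condition   250, near   250, far   500, near   500, far   700, near   700, far     Mean
Series 1         2.33       3.27      −11.94     −20.31      −15.78     −18.69   −10.18
Series 2         1.65       2.83      −24.91     −42.53      −21.63     −30.48   −19.17

Real
Condition   700, near   700, far     Mean
Series 1       −12.74     −11.48   −12.11
Series 2       −25.93     −25.76   −25.85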
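For readers unfamiliar with the metric, the word error rate is commonly computed as the word-level edit distance between the reference and the recognized transcript, divided by the number of reference words. The sketch below follows that common definition. It is not the scoring tool of the baseline recognizer, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))              # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]                                 # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)

wer_same = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

One deleted word out of six reference words gives a WER of about 16.7 %.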


SLIDE 15

7. Conclusion

System based on a combination of an MVDR beamformer and spectral enhancement
All parameters are blindly estimated
Speech enhancement achieved in all conditions in terms of:
  Objective measures
  Subjective tests
  Word error rate

SLIDE 16

Thank you very much for your attention

House of Hearing, Oldenburg

Questions?
