Microphone Array Processing for Distant Speech Recognition


SLIDE 1

SPSC - Microphone Array Processing for Distant Speech Recognition

Microphone Array Processing for Distant Speech Recognition

From close-talking microphones to far-field sensors

Hannes Pessentheiner

Signal Processing and Speech Communication Laboratory

Advanced Signal Processing 2

Hannes Pessentheiner Advanced Signal Processing 2 page 1/32

SLIDE 2

Distant Speech-Interaction in Robust Home Applications

◮ people who would like assistance in everyday life
◮ physically handicapped people

Problem: they want to live independently, but

  • lack privacy
  • depend on other people

Solution: ambient assisted living (AAL)

  • operated by device
  • operated by voice command

AAL-scenario of handicapped woman.

What is a main challenge in voice command?

SLIDE 3

Distant Speech Recognition (DSR)

◮ most natural human-computer interface

  • interaction through speech
  • no use of body- or head-mounted microphones

Block-diagram of a simple DSR system.

How does speech capturing work?

SLIDE 4

Speech Capturing

◮ free field / diffuse field
◮ spherical / planar wave propagation

Propagation of spherical (left) and plane (right) wave.

What to do with multi-channel data?

SLIDE 5

Speech Enhancement: Source localization & tracking

◮ estimate speaker’s position / direction at each time instant
◮ compute trajectory of instantaneous position estimates
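The slides do not name a localization method. As one common option, the sketch below estimates the time difference of arrival between two microphones with GCC-PHAT and converts it to a direction under a far-field (plane-wave) assumption. The function name `gcc_phat_delay`, the sampling rate, the microphone spacing, and the synthetic test signal are all illustrative assumptions, not values from the talk.

```python
import numpy as np

def gcc_phat_delay(x1, x2, fs):
    """Estimate the delay of x2 relative to x1 (in seconds) with GCC-PHAT."""
    n = len(x1) + len(x2)              # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)      # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(cc) - max_shift) / fs

fs = 16000        # sampling rate in Hz (assumed)
spacing = 0.2     # microphone spacing in m (assumed)
c = 343.0         # speed of sound in m/s

# Synthetic test: the second channel is the first one delayed by 5 samples.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(4096)
true_delay = 5
x2 = np.concatenate((np.zeros(true_delay), x1[:-true_delay]))

tau = gcc_phat_delay(x1, x2, fs)
# Far-field assumption: sin(angle) = c * tau / spacing, angle relative to broadside.
angle_deg = np.degrees(np.arcsin(np.clip(c * tau / spacing, -1.0, 1.0)))
print(f"estimated delay: {tau * fs:.0f} samples, direction: {angle_deg:.1f} deg")
```

Repeating such an estimate frame by frame would give the instantaneous estimates from which a trajectory can be computed.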

Sound capturing (left), speaker localization (center), and speaker tracking (right).

How to employ directional information?

SLIDE 6

Speech Enhancement: Beamforming

How to improve pre-enhanced signal?

SLIDE 7

Speech Enhancement: Postfiltering

Beamformer-related Postfilter (among others)

Generalized Sidelobe Canceller (GSC): filter-and-sum beamformer $w_q^H$ with parallel filter.

Autonomous Postfilter (among others)

◮ single-channel source separation filter
◮ echo & noise attenuation filter
◮ spectral subtraction

What about errors in DSR systems?

SLIDE 8

Errors in DSR

Major Errors

◮ front-end errors
◮ corrupted training material for single-channel source separation or speech recognizer

Representative front-end errors.

Minor Errors (among others)

◮ distorted features
◮ numerical accuracy (single/double precision)

How to measure performance of DSR system?

SLIDE 9

Metrics

Word Error Rate (WER)

$\mathrm{WER} = \dfrac{S + D + I}{S + D + C}$

where
I . . . # of insertions
S . . . # of substitutions
D . . . # of deletions
C . . . # of correct words

Word Accuracy Rate (WACC)

WACC = 1 − WER
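As a minimal sketch of how these counts can be obtained, the code below aligns a reference and a hypothesis word sequence with a standard minimum-edit-distance alignment and then evaluates the WER and WACC formulas above. The function names and the example sentences are hypothetical; real scoring tools may break ties between equally good alignments differently.

```python
def align_counts(ref, hyp):
    """Return (S, D, I, C) for a minimum-edit-distance alignment of two word lists."""
    # best[i][j] = (errors, S, D, I, C) for ref[:i] aligned with hyp[:j]
    best = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    best[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        best[i][0] = (i, 0, i, 0, 0)      # only deletions
    for j in range(1, len(hyp) + 1):
        best[0][j] = (j, 0, 0, j, 0)      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins, c = best[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins, c + 1)        # correct word
            else:
                diag = (e + 1, s + 1, d, ins, c)    # substitution
            e, s, d, ins, c = best[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = best[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            best[i][j] = min(diag, deletion, insertion, key=lambda t: t[0])
    return best[-1][-1][1:]

def word_error_rate(ref, hyp):
    """WER = (S + D + I) / (S + D + C); the denominator is the reference length."""
    s, d, ins, c = align_counts(ref, hyp)
    return (s + d + ins) / (s + d + c)

# Hypothetical voice command and recognizer output.
reference = "turn on the kitchen light".split()
hypothesis = "turn the kitchen lights now".split()

counts = align_counts(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
wacc = 1.0 - wer
print("S, D, I, C =", counts, "WER =", wer, "WACC =", wacc)
```

Because insertions appear only in the numerator, the WER can exceed 1, in which case WACC becomes negative.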

SLIDE 10

Metrics cont’d

Real-valued Kurtosis (peakedness)

$K(X) = E\{|X|^4\} - \beta \cdot \left(E\{|X|^2\}\right)^2$

where
X . . . random variable (RV)
E . . . expectation operator
β . . . positive constant

K > 0: super-Gaussian probability density function (PDF)
K = 0: Gaussian PDF
K < 0: sub-Gaussian PDF
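A small numerical check can make the sign convention concrete. The sketch below estimates K from samples, assuming zero-mean real-valued data and β = 3 (the value for which a real Gaussian gives K = 0). The three unit-variance test distributions and the sample size are illustrative choices.

```python
import numpy as np

def kurtosis(x, beta=3.0):
    """Sample estimate of K(X) = E{|X|^4} - beta * (E{|X|^2})^2 for zero-mean x."""
    x = np.asarray(x)
    m2 = np.mean(np.abs(x) ** 2)
    m4 = np.mean(np.abs(x) ** 4)
    return m4 - beta * m2 ** 2

rng = np.random.default_rng(0)
n = 200_000

# Three unit-variance distributions (illustrative choices).
gaussian = rng.standard_normal(n)
laplacian = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)    # super-Gaussian
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)   # sub-Gaussian

k_gaussian = kurtosis(gaussian)
k_laplacian = kurtosis(laplacian)
k_uniform = kurtosis(uniform)
print(f"Gaussian: {k_gaussian:+.2f}  Laplacian: {k_laplacian:+.2f}  uniform: {k_uniform:+.2f}")
```

For unit variance the theoretical values are 0 (Gaussian), 3 (Laplacian), and -1.2 (uniform); the sample estimates fluctuate around them.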

SLIDE 11

Metrics cont’d

Negentropy (Gaussian-distance)

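$N(X) = H(X_{\mathrm{Gaussian}}) - H(X)$

where $X_{\mathrm{Gaussian}}$ is a Gaussian RV with the same variance as X, with differential entropy

$H(X) = -\int p_X(x) \log p_X(x)\,dx = -E\{\log p_X(x)\}$

N = 0: Gaussian PDF
N > 0: non-Gaussian PDF

Why consider kurtosis & negentropy?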
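The negentropy definition can be tried out numerically. The sketch below is one simple way to do so, assuming a histogram-based estimate of the differential entropy; the estimator, the bin count, and the test distributions are illustrative assumptions and not the method used in the talk.

```python
import numpy as np

def differential_entropy(x, bins=200):
    """Histogram estimate of H(X) = -integral p(x) log p(x) dx, in nats."""
    counts, edges = np.histogram(x, bins=bins)
    width = edges[1] - edges[0]
    p = counts[counts > 0] / counts.sum()     # probability mass per occupied bin
    return -np.sum(p * np.log(p / width))     # p / width approximates the density

def negentropy(x, bins=200):
    """N(X) = H(X_Gaussian) - H(X), using a Gaussian with the same variance as x."""
    h_gaussian = 0.5 * np.log(2.0 * np.pi * np.e * np.var(x))
    return h_gaussian - differential_entropy(x, bins)

rng = np.random.default_rng(0)
n = 200_000
n_gaussian = negentropy(rng.standard_normal(n))
n_laplacian = negentropy(rng.laplace(size=n))

# Theory: 0 for the Gaussian and 0.5 * (ln(pi) - 1), about 0.072 nats, for the Laplacian.
print(f"Gaussian: {n_gaussian:.3f}  Laplacian: {n_laplacian:.3f}")
```

The histogram estimator is slightly biased, so the Gaussian value comes out close to, but not exactly, zero.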

SLIDE 12

Distribution of Speech Samples

Histograms of real parts of sub-band frequency components (f = 800 Hz) of (a) clean speech, (b) noise-corrupted speech, and (c) reverberated speech snapshots.

◮ PDF of a sum of independent RVs approaches a Gaussian in the limit (central limit theorem)

  • mix of speech, reverb, & noise exhibits Gaussian PDF
  • clean speech exhibits super-Gaussian PDF
  • use negentropy N and kurtosis K as criteria to restore super-Gaussianity
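The central-limit effect described above can be illustrated with synthetic data. In this sketch, independent Laplacian sources stand in for super-Gaussian speech-like signals (an assumption made purely for illustration), and a variance-normalized excess kurtosis is used so that mixtures with different powers are comparable.

```python
import numpy as np

def normalized_kurtosis(x):
    """Excess kurtosis E{x^4} / (E{x^2})^2 - 3 of the mean-removed samples."""
    x = x - np.mean(x)
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(1)
n_samples = 100_000

# Sum an increasing number of independent super-Gaussian (Laplacian) sources.
kurt_by_n = {}
for n_sources in (1, 2, 4, 16):
    mix = rng.laplace(size=(n_sources, n_samples)).sum(axis=0)
    kurt_by_n[n_sources] = normalized_kurtosis(mix)
    print(f"{n_sources:2d} sources: kurtosis = {kurt_by_n[n_sources]:.2f}")
```

For n independent, identically distributed Laplacian sources the theoretical excess kurtosis is 3/n, so the mixture looks more Gaussian as n grows.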

How to restore super-Gaussianity?

SLIDE 13

Conventional Beamforming

Assumptions

◮ signal $s(\omega)$ exhibits plane-wave characteristics
◮ microphone i captures noise-corrupted signal $x_i(\omega)$

$x(\omega) = s(\omega)\,d(\omega, k) + v(\omega)$

where
ω . . . angular frequency
k . . . wavenumber vector
d . . . array manifold / sound capture model vector
x . . . snapshot vector
v . . . noise vector

SLIDE 14

Conventional Beamforming cont’d

Beamforming

◮ linear spatio-temporal filter
◮ compensate $d(\omega, k)$ for steering direction

$y(\omega) = w^H(\omega)\,x(\omega)$ with $w^H(\omega)\,d(\omega, k) = 1$
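The snapshot model and the distortionless constraint can be sketched for a single frequency bin. The slides do not spell out the entries of d, so the plane-wave phase model and its sign convention in `steering_vector` are assumptions, as are the array geometry, the frequency, and the noise level. The weights w = d / N (N microphones) are used only because they are the simplest choice that satisfies the constraint.

```python
import numpy as np

def steering_vector(omega, mic_positions, direction, c=343.0):
    """Plane-wave array manifold d for a unit vector pointing from the array to the source."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    delays = -(mic_positions @ direction) / c    # arrival time relative to the origin
    return np.exp(-1j * omega * delays)

# Hypothetical setup: 8 microphones on the x-axis, 2 cm apart, f = 800 Hz.
n_mics = 8
mic_positions = np.zeros((n_mics, 3))
mic_positions[:, 0] = 0.02 * np.arange(n_mics)
omega = 2.0 * np.pi * 800.0
direction = [np.cos(np.radians(60.0)), np.sin(np.radians(60.0)), 0.0]

rng = np.random.default_rng(0)
s = 1.0 + 0.5j                                   # source spectrum value in this bin
d = steering_vector(omega, mic_positions, direction)
v = 0.1 * (rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics))
x = s * d + v                                    # snapshot vector x = s d + v

w = d / n_mics                                   # simplest weights with w^H d = 1
constraint = np.vdot(w, d)                       # np.vdot conjugates its first argument
y = np.vdot(w, x)                                # beamformer output y = w^H x
print("w^H d =", np.round(constraint, 6), " y =", np.round(y, 3), " s =", s)
```

The output y equals s plus an averaged noise term, which is why the constraint is called distortionless.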

SLIDE 15

Speech Enhancement: Beamforming cont’d

3D directivity patterns of two different beamformers.

SLIDE 16

Speech Enhancement: Beamforming cont’d

Delay&Sum (DS)

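$w(\omega) = \dfrac{d(\omega, k)}{N}$

where
N . . . number of microphones

Minimum Variance Distortionless Response (MVDR)

$\arg\min_{w(\omega)} w^H(\omega)\,R_{NN}(\omega)\,w(\omega)$ subject to $w^H(\omega)\,d(\omega, k) = 1$

$w^H(\omega) = \dfrac{d^H(\omega, k)\,R_{NN}^{-1}(\omega)}{d^H(\omega, k)\,R_{NN}^{-1}(\omega)\,d(\omega, k)}$

where
$R_{NN}$ . . . noise covariance matrix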
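To compare the two designs, the following sketch computes DS and MVDR weights for one frequency bin and evaluates the noise power at the beamformer output. The uniform linear array, the steering-vector sign convention, and the noise field (one directional interferer plus weak sensor noise) are assumptions chosen for illustration.

```python
import numpy as np

def ula_steering(omega, n_mics, spacing, angle_rad, c=343.0):
    """Plane-wave steering vector of a uniform linear array (angle from the array axis)."""
    delays = -np.arange(n_mics) * spacing * np.cos(angle_rad) / c
    return np.exp(-1j * omega * delays)

def ds_weights(d):
    """Delay-and-sum: w = d / N."""
    return d / len(d)

def mvdr_weights(d, R_nn):
    """MVDR: w = R^{-1} d / (d^H R^{-1} d) for a Hermitian noise covariance R."""
    r_inv_d = np.linalg.solve(R_nn, d)
    return r_inv_d / np.vdot(d, r_inv_d)

def noise_power(w, R_nn):
    """Noise power at the beamformer output, w^H R w."""
    return np.real(np.vdot(w, R_nn @ w))

# Hypothetical scenario: 8 microphones, 5 cm spacing, f = 2 kHz.
omega = 2.0 * np.pi * 2000.0
n_mics, spacing = 8, 0.05
d_look = ula_steering(omega, n_mics, spacing, np.radians(90.0))   # desired speaker
d_int = ula_steering(omega, n_mics, spacing, np.radians(40.0))    # interferer

# Noise covariance: unit-power interferer plus weak uncorrelated sensor noise.
R_nn = np.outer(d_int, d_int.conj()) + 0.01 * np.eye(n_mics)

w_ds = ds_weights(d_look)
w_mvdr = mvdr_weights(d_look, R_nn)
p_ds = noise_power(w_ds, R_nn)
p_mvdr = noise_power(w_mvdr, R_nn)
print(f"output noise power: DS = {p_ds:.4f}, MVDR = {p_mvdr:.4f}")
```

Both weight vectors satisfy the distortionless constraint, so MVDR can never produce more output noise than DS for the covariance it was designed for.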

SLIDE 17

Speech Enhancement: Beamforming cont’d

MVDR with Diagonal Loading

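◮ consider quadratic constraint: $0 < |w|^2 < \gamma$
◮ optimization replaces $R_{NN}^{-1}(\omega)$ by $\left(R_{NN}(\omega) + \sigma^2 I\right)^{-1}$

where
σ . . . loading level
I . . . identity matrix

Super-directive MVDR

◮ replace $R_{NN}$ by the diffuse-noise coherence matrix with entries $\Gamma_{m,n} = \mathrm{sinc}\!\left(\dfrac{\omega \cdot l_{m,n}}{c}\right)$

where
$l_{m,n}$ . . . distance between microphones m and n
c . . . sound velocity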
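Both variants can be sketched with the same MVDR solver: the super-directive design uses the sinc coherence matrix as noise model, and diagonal loading adds σ² to its diagonal. The endfire array geometry, the frequency, and the loading level below are hypothetical. Note that `np.sinc(x)` computes sin(πx)/(πx), so the argument is divided by π to obtain sin(z)/z.

```python
import numpy as np

def diffuse_coherence(omega, mic_positions, c=343.0):
    """Coherence matrix with entries sinc(omega * l_mn / c), where sinc(z) = sin(z) / z."""
    diff = mic_positions[:, None, :] - mic_positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise distances l_mn
    return np.sinc(omega * dist / (np.pi * c))    # np.sinc is the normalized sinc

def mvdr_weights(d, R, loading=0.0):
    """MVDR weights with optional diagonal loading: R is replaced by R + loading * I."""
    r_inv_d = np.linalg.solve(R + loading * np.eye(len(d)), d)
    return r_inv_d / np.vdot(d, r_inv_d)

def white_noise_gain(w, d):
    """White noise gain |w^H d|^2 / (w^H w)."""
    return np.abs(np.vdot(w, d)) ** 2 / np.real(np.vdot(w, w))

# Hypothetical endfire setup: 4 microphones on the x-axis, 5 cm apart, f = 1 kHz.
n_mics, spacing, c = 4, 0.05, 343.0
omega = 2.0 * np.pi * 1000.0
mic_positions = np.zeros((n_mics, 3))
mic_positions[:, 0] = spacing * np.arange(n_mics)
d = np.exp(1j * omega * mic_positions[:, 0] / c)  # plane wave along the array axis

gamma = diffuse_coherence(omega, mic_positions)
w_sd = mvdr_weights(d, gamma)                     # super-directive, no loading
w_loaded = mvdr_weights(d, gamma, loading=0.1)    # sigma^2 = 0.1 (assumed)

wng_sd = white_noise_gain(w_sd, d)
wng_loaded = white_noise_gain(w_loaded, d)
print(f"white noise gain: unloaded = {wng_sd:.3f}, loaded = {wng_loaded:.3f}")
```

Loading trades directivity for robustness: the white noise gain rises toward its upper bound, the number of microphones, which is reached by delay-and-sum.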

SLIDE 18

Speech Enhancement: Beamforming cont’d

One-dimensional directivity pattern of DS and MVDR.

SLIDE 19

Speech Enhancement: Beamforming cont’d

3D Convex-optimized (CVX)

◮ more constraints & optimization in 3 spatial dimensions

$\arg\min_{w(\omega)} \left\| G(\omega) \cdot [w(\omega) \otimes I] - \hat{D} \right\|_F$

subject to

  • white noise gain: $\dfrac{|w^T(\omega)\,d(\omega)|^2}{w^H(\omega)\,w(\omega)} \geq \gamma$
  • distortionless response: $w^H(\omega)\,d(\omega) = 1$
  • null steering: $w^H(\omega)\,V(\omega) = 0$

where
G . . . 3D capturing response matrix
$\hat{D}$ . . . 3D desired response matrix
V . . . 3D null steering matrix

SLIDE 20

Speech Enhancement: Beamforming cont’d

Two-dimensional CVX directivity pattern with synthesized null and frequency-invariance.

SLIDE 21

Speech Enhancement: Beamforming cont’d

Generalized Sidelobe Canceller (GSC)

◮ combine beamformer and postfilter

Block-diagram of a GSC beamformer.

SLIDE 22

Speech Enhancement: Beamforming cont’d

Subspace Maximum Kurtosis (MK) / Negentropy (MN)

◮ based on GSC, but with subspace filter matrix U that

  • reduces dimensionality and
  • decomposes signal into spatially correlated and ambient components

◮ use kurtosis or negentropy to detect (sub-/super-)Gaussianity

Block-diagram of an MK/MN beamformer.

SLIDE 23

Speech Enhancement: Beamforming cont’d

Modes (eigenvalues) of spatially correlated and ambient components.

SLIDE 24

Speech Enhancement: Beamforming cont’d

Super-directive MVDR based on Spherical Harmonics

◮ based on 3D microphone array
◮ frequency-invariant directivity pattern
◮ directivity stability around spherical array
◮ redefine array manifold / sound capture model vector d
◮ all beamforming techniques can be applied

Eigenmike: spherical microphone array.

SLIDE 25

Speech Enhancement: Beamforming cont’d

Magnitudes of spherical harmonics of different order.

Magnitudes of super-directive MVDR beamformer based on spherical harmonics.

SLIDE 26

Experimental Setup #1

◮ Linear Array: 64 channels, $d_{i,j} = 2$ cm
◮ Spherical Array: Eigenmike, 32 channels, r = 4.2 cm
◮ Reverberation time: $T_{60} = 525$ ms

Recording room.

SLIDE 27

Results #1

WERs for each beamformer and decoding pass: (1) unadapted, (2) estimated vocal tract length normalization (among others), (3)-(4) additional parameters & models.

SLIDE 28

Results #1 cont’d

WERs for each beamformer and decoding pass.

SLIDE 29

Experimental Setup #2

◮ Uniform Circular Array: 8 channels, r = 10 cm
◮ Reverberation time: $T_{60} = 380$ ms
◮ Recording room size: 650 × 490 × 325 cm

SLIDE 30

Results #2

WERs for each beamformer and decoding pass.

SLIDE 31

Conclusion

◮ state-of-the-art DSR systems achieve WERs similar to close-talking microphones (CTM)
◮ compact spherical microphone array can achieve recognition performance comparable to large linear array

  • less space required
  • fewer channels to process (32 instead of 64)
  • still a largely unexplored field of research

SLIDE 32