SLIDE 1

E6820 SAPR - Dan Ellis L08 - Spatial sound 2002-04-01

EE E6820: Speech & Audio Processing & Recognition

Lecture 8: Spatial sound

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/

SLIDE 2

Spatial acoustics

  • Received sound = source + channel
      • so far, only considered ideal source waveform
  • Sound carries information on its spatial origin
      • e.g. “ripples in the lake”
      • great evolutionary significance
  • The basis of scene analysis?
      • yes and no - try blocking an ear

SLIDE 3

Ripples in the lake

  • Effect of relative position on sound
      • delay = $\Delta r / c$
      • energy decay ~ $1/r^2$
      • absorption ~ $G(f)^r$
      • direct energy plus reflections
  • Give cues for recovering source position
  • Describe wavefront by its normal

[Figure: source, listener, and wavefront travelling at c m/s; energy ∝ $1/r^2$]
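
The delay and energy-decay relations above can be checked numerically. This is a minimal sketch, assuming a speed of sound of 343 m/s (a typical room-temperature value that the slide does not state); the function names are illustrative only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def propagation_delay(distance_m, c=SPEED_OF_SOUND):
    """Time in seconds for a wavefront to travel distance_m (delay = r / c)."""
    return distance_m / c

def spreading_loss_db(r_m, r_ref_m=1.0):
    """Level change in dB relative to r_ref_m when energy falls as 1/r^2."""
    return -20.0 * math.log10(r_m / r_ref_m)

delay_s = propagation_delay(3.43)   # a source 3.43 m away
loss_db = spreading_loss_db(2.0)    # level change for doubling the distance
```

With energy proportional to $1/r^2$, each doubling of distance costs about 6 dB, which is one of the range cues a listener could in principle exploit.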

SLIDE 4

Recovering spatial information

  • Source direction as wavefront normal
      • moving plane found from timing at 3 points
      • need to solve correspondence
  • Space: need 3 parameters
      • e.g. 2 angles and range

[Figure: wavefront passing sensors A, B, C, with pressure vs. time at each sensor; path difference $\Delta s = c\,\Delta t = AB \cos\theta$; source position given by range $r$, azimuth $\theta$, elevation $\phi$]
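
The timing relation in the figure, path difference = c·Δt = AB·cos θ, can be inverted to get an arrival angle from one sensor pair. This is a sketch under stated assumptions: a plane wavefront, a speed of sound of 343 m/s, and a hypothetical function name.

```python
import math

def arrival_angle_deg(dt_s, spacing_m, c=343.0):
    """Angle in degrees between the A->B axis and the wavefront normal.

    dt_s is the arrival-time difference between sensors A and B,
    spacing_m is the distance AB. Solves c * dt = AB * cos(theta).
    """
    ratio = c * dt_s / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.acos(ratio))

broadside = arrival_angle_deg(0.0, 0.2)              # no delay: wave arrives side-on
oblique = arrival_angle_deg(0.2 * 0.5 / 343.0, 0.2)  # half the maximum delay
```

One pair only fixes the angle to the A-B axis; the third point C on the slide is what resolves the remaining direction ambiguity.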

SLIDE 5

The effect of the environment

  • Reflection causes additional wavefronts
      • + scattering, absorption
      • many paths → many echoes
  • Reverberant effect
      • causal ‘smearing’ of signal energy

[Figure: reflection, diffraction & shadowing]

[Figure: spectrograms (time / sec vs. freq / Hz, 0-8000 Hz) of dry speech (airvib16) and the same speech + reverb from hlwy16]

SLIDE 6

Reverberation impulse response

  • Exponential decay of reflections: $h_{room}(t) \sim e^{-t/T}$
  • Frequency-dependent
      • greater absorption at high frequencies → faster decay
  • Size-dependent
      • larger rooms → longer delays → slower decay
  • Sabine’s equation: $RT_{60} = \frac{0.049\,V}{S\alpha}$
      • time constant varies with room size and absorption

[Figure: spectrogram of hlwy16 (128pt window), time / s vs. freq / Hz, level scale from -70 to -10]
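
Sabine's equation and the exponential decay model can be tied together in a few lines. A sketch with labeled assumptions: the constant 0.049 applies when volume and surface area are in feet, and 0.161 is the usual metric equivalent; treating $e^{-t/T}$ as an amplitude envelope (so 60 dB corresponds to a factor of 1000) is an assumption, since the slide does not say whether it describes amplitude or energy.

```python
import math

def rt60_sabine(volume, surface_area, absorption, metric=True):
    """Sabine reverberation time in seconds.

    metric=True expects m^3 and m^2 (constant 0.161);
    metric=False expects ft^3 and ft^2 (constant 0.049, as on the slide).
    """
    k = 0.161 if metric else 0.049
    return k * volume / (surface_area * absorption)

def amplitude_time_constant(rt60):
    """T such that the amplitude envelope exp(-t/T) has dropped 60 dB at t = rt60."""
    return rt60 / math.log(1000.0)

# Hypothetical 5 m x 4 m x 3 m room with average absorption 0.2
volume = 5 * 4 * 3
surface = 2 * (5 * 4 + 5 * 3 + 4 * 3)
rt60 = rt60_sabine(volume, surface, 0.2)
T = amplitude_time_constant(rt60)
```

Larger volume lengthens RT60 and larger absorbing area shortens it, matching the size and absorption dependence listed above.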

SLIDE 7

Outline

1 Spatial acoustics
2 Binaural perception
  • The sound at the two ears
  • Available cues
  • Perceptual phenomena
3 Synthesizing spatial audio
4 Extracting spatial sounds

SLIDE 8

Binaural perception

  • What is the information in the 2 ear signals?
      • the sound of the source(s) (L+R)
      • the position of the source(s) (L-R)
  • Example waveforms (ShATR database)

[Figure: source and ears L, R, showing path length difference and head shadow (high freq)]

[Figure: shatr78m3 waveform, Left and Right channels, time / s from 2.2 to 2.235, amplitude from -0.1 to 0.1]

SLIDE 9

Main cues to spatial hearing

  • Interaural time difference (ITD)
      • from different path lengths around head
      • dominates in low frequency (< 1.5 kHz)
      • max ~ 750 µs → ambiguous for freqs > 600 Hz
  • Interaural intensity difference (IID)
      • from head shadowing of far ear
      • negligible for LF; increases with frequency
  • Spectral detail (from pinna reflections) useful for elevation & range
  • Direct-to-reverberant ratio useful for range
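
The link between the maximum ITD and the frequency where timing becomes ambiguous can be worked out directly. This sketch assumes the ambiguity starts once half a period of the tone is shorter than the largest possible ITD, which is a common rule of thumb rather than something the slide states; the function names and the 343 m/s speed of sound are assumptions too.

```python
def itd_ambiguity_hz(max_itd_s):
    """Frequency above which half a period is shorter than the maximum ITD."""
    return 1.0 / (2.0 * max_itd_s)

def path_difference_m(itd_s, c=343.0):
    """Extra path length to the far ear implied by a given ITD."""
    return itd_s * c

f_ambiguous = itd_ambiguity_hz(750e-6)    # roughly 670 Hz
extra_path = path_difference_m(750e-6)    # roughly 0.26 m around the head
```

Under this assumption a 750 µs maximum gives about 670 Hz, in line with the slide's figure of roughly 600 Hz.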

SLIDE 10

Head-Related Transfer Fns (HRTFs)

  • Capture source coupling as impulse responses $\{\, l_{\theta,\phi,R}(t),\; r_{\theta,\phi,R}(t) \,\}$
  • Collection: ( http://phosphor.cipic.ucdavis.edu/ )
  • Highly individual!

[Figure: HRIR_021 Left and Right @ 0 el, plotted as azimuth / deg (-45 to 45) vs. time / ms (0 to 1.5); HRIR_021 Left and Right @ 0 el 0 az, amplitude (-1 to 1) vs. time / ms]

SLIDE 11

Cone of confusion

  • Interaural timing cue dominates (below 1 kHz)
      • from differing path lengths to two ears
  • But: only resolves to a cone
      • Up/down? Front/back?

[Figure: cone of confusion (approx equal ITD) around azimuth θ]

SLIDE 12

Further cues

  • Pinna causes elevation-dependent coloration
  • Monaural perception
      • separate coloration from source spectrum?
  • Head motion
      • synchronized spectral changes
      • also for ITD (front/back) etc.

SLIDE 13

Combining multiple cues

  • Both ITD and ILD influence azimuth; what happens when they disagree?
      • trading @ around 0.1 ms / dB

[Figure: three panels of l(t) and r(t). Identical signals to both ears → image is centered. Delaying right channel (1 ms) moves image to left. Attenuating left channel returns image to center.]
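
The trading ratio can be turned into a rough estimate of how much attenuation offsets a given delay. A sketch only: it assumes the ratio of about 0.1 ms per dB holds linearly over the whole range, which the slide does not claim, and the function names are hypothetical.

```python
TRADING_MS_PER_DB = 0.1  # from the slide: around 0.1 ms of ITD per dB of level difference

def compensating_attenuation_db(delay_ms, ratio=TRADING_MS_PER_DB):
    """Attenuation of the leading channel (dB) that offsets delay_ms in the other."""
    return delay_ms / ratio

def db_to_gain(attenuation_db):
    """Linear amplitude gain for a given attenuation in dB."""
    return 10.0 ** (-attenuation_db / 20.0)

att_db = compensating_attenuation_db(1.0)   # for the 1 ms delay in the figure
left_gain = db_to_gain(att_db)              # linear gain applied to the left channel
```

For the 1 ms delay in the figure this suggests on the order of 10 dB of attenuation on the left channel to re-center the image.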

SLIDE 14

Binaural position estimation

  • Imperfect results: (Arruda, Kistler & Wightman 1992)
      • listening to ‘wrong’ HRTFs → errors
      • front/back reversals stay on cone of confusion

[Figure: Judged Azimuth (Deg) vs. Target Azimuth (Deg), both from -180 to 180]

SLIDE 15

The Precedence Effect

  • Reflections give misleading spatial cues
  • But: Spatial impression based on 1st wavefront, then ‘switches off’ for ~50 ms
      • .. even if ‘reflections’ are louder
      • .. leads to impression of room

[Figure: l(t) and r(t) showing direct and reflected arrivals, with the reflection delayed by R/c for extra path length R]

SLIDE 16

Binaural Masking Release

  • Adding noise to reveal target
      • why does this make sense?
  • Binaural Masking Level Difference up to 12 dB
      • greatest for noise in phase, tone anti-phase

[Figure: Tone + noise to one ear: tone is masked. Identical noise to other ear: tone is audible.]

SLIDE 17

Outline

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
  • Position
  • Environment
4 Extracting spatial sounds

SLIDE 18

Synthesizing spatial audio

  • Goal: recreate realistic soundfield
      • hi-fi experience
      • synthetic environments (VR)
  • Constraints
      • resources
      • information (individual HRTFs)
      • delivery mechanism (headphones)
  • Source material types
      • live recordings (actual soundfields)
      • synthetic (studio mixing, virtual environments)

SLIDE 19

Classic stereo

  • ‘Intensity panning’: no timing modifications, just vary level ±20 dB
      • works as long as listener is equidistant
  • Surround sound: extra channels in center, sides, ...
      • same basic effect - pan between pairs

[Figure: listener between speakers L and R]
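
Intensity panning as described above can be sketched as a pair of gains with a chosen level difference. The normalization to constant total power is an assumption added here for illustration; the slide only specifies that level is varied by up to ±20 dB with no timing change.

```python
import math

def pan_gains(level_diff_db):
    """Return (left_gain, right_gain) with left level_diff_db above right.

    Gains are scaled so left^2 + right^2 == 1 (constant power, an assumption).
    Positive level_diff_db moves the image toward the left speaker.
    """
    ratio = 10.0 ** (level_diff_db / 20.0)
    right = 1.0 / math.sqrt(1.0 + ratio * ratio)
    return ratio * right, right

def pan_mono(signal, level_diff_db):
    """Apply the same samples to both channels, differing only in level."""
    g_left, g_right = pan_gains(level_diff_db)
    return [g_left * s for s in signal], [g_right * s for s in signal]

left, right = pan_mono([0.0, 1.0, -1.0], 20.0)  # image pushed hard left
```

Because both channels carry identical timing, the effect relies on the listener sitting equidistant from the two speakers, as noted above.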

SLIDE 20

Simulating reverberation

  • Can characterize reverb by impulse response
      • spatial cues are important - record in stereo
      • IRs of ~ 1 sec → very long convolution
  • Image model: reflections as duplicate sources
  • ‘Early echoes’ in room impulse response
  • Actual reflection may be $h_{ref}(t)$, not $\delta(t)$

[Figure: source, listener and virtual (image) sources, showing direct and reflected paths; $h_{room}(t)$ vs. t with the direct path followed by early echoes]
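
The image model can be illustrated for a rectangular room by mirroring the source in each wall. This is a minimal sketch with several assumptions not taken from the slide: a 2-D room with one corner at the origin, first-order reflections only, a single frequency-independent reflection coefficient, ideal δ(t) reflections, and amplitude falling as 1/distance.

```python
import math

def early_echoes(src, lis, room, reflection=0.8, c=343.0):
    """Return sorted (delay_s, amplitude) taps for the direct path and
    the four first-order image sources of a rectangular 2-D room.

    src, lis: (x, y) positions in metres; room: (width, height) in metres.
    """
    sx, sy = src
    width, height = room
    images = [
        (sx, sy, 1.0),                      # direct path
        (-sx, sy, reflection),              # mirrored in wall x = 0
        (2 * width - sx, sy, reflection),   # mirrored in wall x = width
        (sx, -sy, reflection),              # mirrored in wall y = 0
        (sx, 2 * height - sy, reflection),  # mirrored in wall y = height
    ]
    taps = []
    for ix, iy, gain in images:
        dist = math.hypot(ix - lis[0], iy - lis[1])
        taps.append((dist / c, gain / dist))  # amplitude ~ 1/r, so energy ~ 1/r^2
    return sorted(taps)

taps = early_echoes(src=(1.0, 2.0), lis=(4.0, 2.0), room=(5.0, 4.0))
```

Each tap would become one spike in the early part of $h_{room}(t)$; replacing the spikes with a short filter would model the $h_{ref}(t)$ case mentioned above.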

SLIDE 21

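Artificial reverberation

  • Reproduce perceptually salient aspects
      • early echo pattern (→ room size impression)
      • overall decay tail (→ wall materials...)
      • interaural coherence (→ spaciousness)
  • Nested allpass filters (Gardner ’92)

Allpass section with delay k and gain g:

$H(z) = \frac{z^{-k} - g}{1 - g\,z^{-k}}$

$h[n]$: $-g$ at $n = 0$, $1 - g^2$ at $n = k$, $g(1 - g^2)$ at $n = 2k$, $g^2(1 - g^2)$ at $n = 3k$, ...

[Figure: allpass block diagram (delay $z^{-k}$, gains g and -g, input x[n], output y[n]); nested + cascade allpass with sections labelled (k, g) = 20,0.3; 30,0.7; 50,0.5; synthetic reverb built from AP0, AP1, AP2 with output taps a0, a1, a2, a lowpass filter (LPF) and feedback gain g]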
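
A single allpass section with $H(z) = (z^{-k} - g)/(1 - g z^{-k})$ can be written as a difference equation. This is a sketch of one plain section only, not Gardner's nested structure; the function name is illustrative.

```python
def allpass(x, k, g):
    """Filter x through H(z) = (z^-k - g) / (1 - g z^-k).

    Difference equation: y[n] = -g * x[n] + x[n - k] + g * y[n - k].
    """
    y = []
    for n, xn in enumerate(x):
        x_delayed = x[n - k] if n >= k else 0.0
        y_delayed = y[n - k] if n >= k else 0.0
        y.append(-g * xn + x_delayed + g * y_delayed)
    return y

impulse = [1.0] + [0.0] * 9
h = allpass(impulse, k=3, g=0.5)
# h is nonzero only at n = 0, k, 2k, 3k: -g, 1-g^2, g(1-g^2), g^2(1-g^2)
```

The impulse response is a decaying train of echoes spaced k samples apart while the magnitude response stays flat, which is why such sections are attractive building blocks for a dense but uncolored decay tail.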

SLIDE 22

Synthetic binaural audio

  • Source convolved with {L,R} HRTFs gives precise positioning
      • ...for headphone presentation
      • can combine multiple sources (by adding)
  • Where to get HRTFs?
      • measured set, but: specific to individual, discrete
      • interpolate by linear crossfade, PCA basis set
      • or: parametric model - delay, shadow, pinna
  • Head motion cues?
      • head tracking + fast updates

[Figure: parametric HRTF model (after Brown & Duda '97): source → per-ear delay $z^{-t_{DL}(\theta)}$, $z^{-t_{DR}(\theta)}$ → first-order head-shadow filter with coefficients $a$, $b_L(\theta)$, $b_R(\theta)$ → pinna echoes $\sum_k p_{kL}(\theta,\phi)\,z^{-t_{PkL}(\theta,\phi)}$, $\sum_k p_{kR}(\theta,\phi)\,z^{-t_{PkR}(\theta,\phi)}$, plus room echo $K_E\,z^{-t_E}$ added to each ear]
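
Convolving each source with a left/right impulse-response pair and summing the results can be sketched as follows. The impulse responses used here are toy placeholders (a pure delay and attenuation for the far ear), not measured HRIRs, and the function names are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def render_binaural(sources):
    """Mix sources for headphones.

    sources: list of (signal, hrir_left, hrir_right) tuples.
    Returns (left, right) sample lists; multiple sources are combined by adding.
    """
    n = max(len(s) + max(len(hl), len(hr)) - 1 for s, hl, hr in sources)
    left, right = [0.0] * n, [0.0] * n
    for s, hl, hr in sources:
        for i, v in enumerate(convolve(s, hl)):
            left[i] += v
        for i, v in enumerate(convolve(s, hr)):
            right[i] += v
    return left, right

# Toy source on the listener's left: right ear hears it 2 samples later at half amplitude
toy_left_hrir = [1.0]
toy_right_hrir = [0.0, 0.0, 0.5]
left, right = render_binaural([([1.0, 2.0], toy_left_hrir, toy_right_hrir)])
```

In a real renderer the direct convolution would normally be replaced by an FFT-based method, and the impulse responses would be swapped as the head tracker reports new angles.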

SLIDE 23

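Transaural sound

  • Binaural signals without headphones?
  • Can cross-cancel wrap-around signals
      • speakers $S_{L,R}$, ears $E_{L,R}$, binaural signals $B_{L,R}$

$S_L = H_{LL}^{-1}\,(B_L - H_{RL}\,S_R)$

$S_R = H_{RR}^{-1}\,(B_R - H_{LR}\,S_L)$

  • Narrow ‘sweet spot’
      • head motion?

[Figure: speakers $S_L$, $S_R$ driven from binaural signals $B_L$, $B_R$; paths $H_{LL}$, $H_{LR}$, $H_{RL}$, $H_{RR}$ from each speaker to ears $E_L$, $E_R$]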
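
The two cross-cancellation equations form a 2×2 linear system per frequency, which can be solved in closed form. A sketch assuming each $H$ is a single complex gain at one frequency, with $H_{RL}$ the path from the right speaker to the left ear (the reading implied by the equations); the function names are hypothetical.

```python
def crosstalk_cancel(b_l, b_r, h_ll, h_rl, h_lr, h_rr):
    """Speaker signals (s_l, s_r) that deliver b_l, b_r at the ears.

    Solves  b_l = h_ll * s_l + h_rl * s_r
            b_r = h_lr * s_l + h_rr * s_r
    for one frequency, using the inverse of the 2x2 path matrix.
    """
    det = h_ll * h_rr - h_rl * h_lr
    s_l = (h_rr * b_l - h_rl * b_r) / det
    s_r = (h_ll * b_r - h_lr * b_l) / det
    return s_l, s_r

def ear_signals(s_l, s_r, h_ll, h_rl, h_lr, h_rr):
    """What each ear receives from the two speakers."""
    return h_ll * s_l + h_rl * s_r, h_lr * s_l + h_rr * s_r

paths = dict(h_ll=1.0 + 0.0j, h_rl=0.3 - 0.2j, h_lr=0.3 - 0.2j, h_rr=1.0 + 0.0j)
s_l, s_r = crosstalk_cancel(1.0 + 0.0j, 0.0j, **paths)
e_l, e_r = ear_signals(s_l, s_r, **paths)   # should reproduce (1, 0)
```

The solution depends on the exact path gains, so any head movement changes the $H$ terms and breaks the cancellation, which is the narrow sweet spot noted above.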

SLIDE 24

Soundfield reconstruction

  • Stop thinking about ears: just reconstruct pressure + spatial derivatives
      • ears in reconstructed field receive same sounds
  • Complex reconstruction setup (ambisonics)
      • able to preserve head motion cues?

Quantities to reconstruct: $p(x,y,z,t)$, $\partial p(t)/\partial x$, $\partial p(t)/\partial y$, $\partial p(t)/\partial z$

SLIDE 25

Outline

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds
  • Microphone arrays
  • Modeling binaural processing

SLIDE 26

Extracting spatial sounds

  • Given access to soundfield, can we recover separate components?
      • degrees of freedom: >N signals from N sensors is hard
      • but: people can do it (somewhat)
  • Information-theoretic approach
      • use only very general constraints
      • rely on precision measurements
  • Anthropic approach
      • examine human perception
      • attempt to use same information

SLIDE 27

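Microphone arrays

  • Signals from multiple microphones can be combined to enhance/cancel certain sources
  • ‘Coincident’ mics with diff. directional gains

$\begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \cdot \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} \hat{s}_1 \\ \hat{s}_2 \end{bmatrix} = A^{-1} \cdot m$

  • Microphone arrays (endfire)

[Figure: sources s1, s2 reaching mics m1, m2 with gains a11, a12, a21, a22; endfire array of mics spaced D apart, with delays D and summation; directional response (level scale -40, -20) for λ = 4D, λ = 2D, λ = D]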
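
The wavelength dependence of an endfire array can be explored with a delay-and-sum model. This is a sketch under stated assumptions: plane waves, ideal omnidirectional mics spaced D apart on a line, each mic delayed so that a wave travelling along the array adds in phase, and four mics (the count is a guess from the figure).

```python
import cmath
import math

def endfire_response(theta_rad, d_over_lambda, n_mics=4):
    """Normalized magnitude response of a delay-and-sum endfire array.

    theta_rad: angle between the arrival direction and the array axis
               (0 = the steered endfire direction).
    d_over_lambda: mic spacing D divided by wavelength.
    """
    # Residual phase per mic after the steering delays are applied
    phase_step = 2.0 * math.pi * d_over_lambda * (math.cos(theta_rad) - 1.0)
    total = sum(cmath.exp(-1j * n * phase_step) for n in range(n_mics))
    return abs(total) / n_mics

front = endfire_response(0.0, 0.25)            # lambda = 4D, on-axis
back_4d = endfire_response(math.pi, 0.25)      # lambda = 4D, from behind
back_2d = endfire_response(math.pi, 0.5)       # lambda = 2D, from behind
```

In this model the rear arrival cancels at λ = 4D but adds fully in phase again at λ = 2D, so the directional pattern changes strongly with frequency.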

SLIDE 28

Adaptive Beamforming & Independent Component Analysis (ICA)

  • Formulate mathematical criteria to optimize
  • Beamforming: Drive interference to zero
      • cancel energy during nontarget intervals
  • ICA: maximize mutual independence of outputs
      • from higher-order moments during overlap
  • Limited by separation model parameter space
      • only NxN?

[Figure: sources s1, s2 mixed through a11, a12, a21, a22 into m1, m2; unmixing parameters adapted along $-\partial\,\mathrm{MutInfo} / \partial a$]

SLIDE 29

Binaural models

  • Human listeners do better?
      • certainly given only 2 channels
  • Extract ITD and IID cues?
      • cross-correlation finds timing differences
      • ‘consume’ counter-moving pulses
      • how to achieve IID, trading
      • vertical cues...

[Figure: interaural cross-correlation, lag / ms (-6 to 6) vs. center freq / Hz (100 to 3200), with target azimuth marked]
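
Finding a timing difference by cross-correlation can be sketched on raw sample lists. This is a broadband, single-channel simplification: the figure shows a correlation per frequency band, which would require a filterbank in front of this step. The function name and sign convention are assumptions.

```python
def estimate_itd_samples(left, right, max_lag):
    """Lag in samples that maximizes sum_n left[n] * right[n + lag].

    A positive result means the right-ear signal lags the left-ear signal.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                value += sample * right[m]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

left = [0.0, 0.0, 1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 1.0, -2.0, 3.0, 0.0, 0.0, 0.0]  # same pulse, 2 samples later
lag = estimate_itd_samples(left, right, max_lag=4)
itd_seconds = lag / 16000.0   # assuming a 16 kHz sampling rate
```

Dividing the lag by the sampling rate gives the ITD in seconds; limiting max_lag to the physically possible range (under 1 ms) avoids picking up periodic false peaks.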

SLIDE 30

Nonlinear filtering

  • How to separate sounds based on direction?
      • estimate direction locally
      • choose target direction
      • remove energy from other directions
  • E.g. Kollmeier, Peissig & Hohmann ’93
      • IID from $|L_w| / |R_w|$; ITD (IPD) from $\arg\{L_w R_w^*\}$
      • match to IID/IPD template for desired direction
      • also reverberation?

[Figure: block diagram. FFT analysis of l and r gives $L_w$, $R_w$ on a time-frequency grid $X_w(mH, 2\pi k/NT)$; moduli $|L_w|^2$, $|R_w|^2$ and cross-correlation $L_w R_w^*$ are each smoothed by $(1-a)/(1-a z^{-1})$ to give $S_{LL}$, $S_{RR}$, $S_{LR}$; a gain factor g is calculated and applied before OLA-FFT synthesis of l', r']
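
The per-bin cue extraction and the one-pole smoothing from the block diagram can be sketched for a single frequency bin. This covers only the cue and smoothing steps, not the gain rule or the FFT analysis and synthesis stages, and the function names are illustrative.

```python
import cmath
import math

def interaural_cues(l_bin, r_bin):
    """IID in dB and IPD in radians for one pair of complex STFT bins.

    IID comes from |L| / |R|; IPD is the phase of L times conj(R).
    """
    iid_db = 20.0 * math.log10(abs(l_bin) / abs(r_bin))
    ipd = cmath.phase(l_bin * r_bin.conjugate())
    return iid_db, ipd

def smooth(values, a):
    """One-pole smoother with transfer function (1 - a) / (1 - a z^-1).

    Works for real or complex sequences (e.g. the cross-spectrum over frames).
    """
    out, state = [], 0.0
    for v in values:
        state = (1.0 - a) * v + a * state
        out.append(state)
    return out

l_bin = cmath.rect(2.0, 0.5)   # left bin: magnitude 2, phase 0.5 rad
r_bin = cmath.rect(1.0, 0.2)   # right bin: magnitude 1, phase 0.2 rad
iid_db, ipd = interaural_cues(l_bin, r_bin)
```

Comparing the smoothed cues in each bin against the values expected for the target direction is what would then drive the gain factor g.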

SLIDE 31

Summary

  • Spatial sound
      • sampling at more than one point gives information on origin direction
  • Binaural perception
      • time & intensity cues used between/within ears
  • Sound rendering
      • conventional stereo
      • HRTF-based
  • Spatial analysis
      • optimal linear techniques
      • elusive auditory models

SLIDE 32

References

  • B.C.J. Moore, An Introduction to the Psychology of Hearing (4th ed.), Academic, 1997.
  • J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.
  • R.O. Duda, Sound Localization Research, http://www.engr.sjsu.edu/~duda/Duda.Research.frameset.html