E6820 SAPR - Dan Ellis L08 - Spatial sound 2002-04-01
EE E6820: Speech & Audio Processing & Recognition
Lecture 8: Spatial sound
1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds
Dan Ellis <dpwe@ee.columbia.edu>
Spatial acoustics
- Received sound = source + channel
- so far, only considered ideal source waveform
- Sound carries information on its spatial origin
- e.g. “ripples in the lake”
- great evolutionary significance
- The basis of scene analysis?
- yes and no - try blocking an ear
Ripples in the lake
- Effect of relative position on sound
- delay = $\Delta r / c$
- energy decay ~ $1/r^2$
- absorption ~ $G(f)^r$
- direct energy plus reflections
- Give cues for recovering source position
- Describe wavefront by its normal
[Figure: source, listener, and expanding wavefront travelling at c m/s; energy ∝ $1/r^2$]
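The delay and spreading-loss relations on this slide can be checked numerically. The sketch below is only an illustration: the speed of sound (343 m/s) and the function names are assumptions, not values from the lecture, and absorption G(f) is ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the slides


def propagation_delay(delta_r_m, c=SPEED_OF_SOUND):
    """Extra delay in seconds for an extra path length delta_r_m (delay = dr / c)."""
    return delta_r_m / c


def spreading_loss_db(r_m, r_ref_m=1.0):
    """Level change in dB relative to r_ref_m when energy falls as 1/r^2."""
    # 10*log10((r_ref/r)^2) == -20*log10(r/r_ref)
    return -20.0 * math.log10(r_m / r_ref_m)


delay_ms = 1000.0 * propagation_delay(3.43)  # 3.43 m of extra path
loss_db = spreading_loss_db(2.0)             # doubling the distance
print(f"delay: {delay_ms:.1f} ms, level change: {loss_db:.2f} dB")
```

Doubling the distance costs about 6 dB under the 1/r² law, which is one reason level alone is a weak range cue unless the source level is known.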
Recovering spatial information
- Source direction as wavefront normal
- moving plane found from timing at 3 points
- need to solve correspondence
- Space: need 3 parameters
- e.g. 2 angles and range
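[Figure: plane wavefront passing sensor points A, B, C, with pressure vs. time at each point; source position described by range r, azimuth θ, elevation φ]
Path difference between A and B: $\Delta s = c\,\Delta t = AB \cos\theta$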
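As a rough sketch of how a timing difference gives direction, the snippet below inverts the two-point relation, taking the path difference to be c·Δt = AB·cos θ. The function name, the 343 m/s speed of sound, and the example spacing are assumptions for illustration. A full 3-D direction would need three points, as the slide notes.

```python
import math


def arrival_angle_deg(delta_t_s, spacing_m, c=343.0):
    """Angle between the line AB and the propagation direction, in degrees.

    Uses c * delta_t = AB * cos(theta); c = 343 m/s is an assumed value.
    """
    cos_theta = c * delta_t_s / spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1]; clamp it.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Two sensors 0.2 m apart; the wavefront reaches B later by 0.1 m of travel.
theta = arrival_angle_deg(0.1 / 343.0, 0.2)
print(f"estimated angle: {theta:.1f} degrees")
```

A single pair only fixes the angle relative to the line AB, so every direction on a cone around that line gives the same Δt.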
The effect of the environment
- Reflection causes additional wavefronts
- + scattering, absorption
- many paths → many echoes
- Reverberant effect
- causal ‘smearing’ of signal energy
[Figure: reflection, diffraction & shadowing]
[Figure: spectrograms (time / sec vs. freq / Hz, 0 to 8000 Hz) of dry speech (airvib16) and the same speech with reverb from hlwy16]
Reverberation impulse response
- Exponential decay of reflections: $h_{room}(t) \sim e^{-t/T}$
- Frequency-dependent
- greater absorption at high frequencies → faster decay
- Size-dependent
- larger rooms → longer delays → slower decay
- Sabine’s equation: $RT_{60} = \dfrac{0.049\,V}{S\,\alpha}$ (the constant 0.049 applies for V in ft³ and S in ft²)
- Time constant varies with size, absorption
[Figure: spectrogram of the hlwy16 impulse response (128-pt window); time / s (0 to 0.7) vs. freq / Hz (0 to 8000), level scale -70 to -10 dB]
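A small worked example of Sabine's equation, RT60 = 0.049·V / (S·α), may help. The room dimensions and absorption value below are made up for illustration. The constant 0.049 goes with imperial units (ft³, ft²); about 0.161 is the usual metric equivalent. The conversion to the time constant T assumes the envelope e^(-t/T) describes amplitude rather than energy.

```python
import math


def rt60_sabine(volume, surface_area, mean_absorption, k=0.049):
    """Sabine reverberation time in seconds.

    k = 0.049 expects volume in ft^3 and area in ft^2;
    use k of about 0.161 for m^3 and m^2.
    """
    return k * volume / (surface_area * mean_absorption)


def time_constant_from_rt60(rt60_s):
    """T such that the amplitude envelope exp(-t/T) has fallen 60 dB at t = RT60."""
    return rt60_s / (3.0 * math.log(10.0))


# Hypothetical 20 x 15 x 10 ft room with mean absorption 0.3.
volume = 20 * 15 * 10
surface = 2 * (20 * 15 + 20 * 10 + 15 * 10)
rt60 = rt60_sabine(volume, surface, 0.3)
print(f"RT60 = {rt60:.3f} s, T = {time_constant_from_rt60(rt60) * 1000:.1f} ms")
```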
Outline
1 Spatial acoustics
2 Binaural perception
- The sound at the two ears
- Available cues
- Perceptual phenomena
3 Synthesizing spatial audio
4 Extracting spatial sounds
Binaural perception
- What is the information in the 2 ear signals?
- the sound of the source(s) (L+R)
- the position of the source(s) (L-R)
- Example waveforms (ShATR database)
[Figure: source reaching ears L and R; path length difference; head shadow (high freq)]
[Figure: shatr78m3 waveform, Left and Right channels; time / s (2.2 to 2.235), amplitude -0.1 to 0.1]
Main cues to spatial hearing
- Interaural time difference (ITD)
- from different path lengths around head
- dominates in low frequency (< 1.5 kHz)
- max ~ 750 µs → ambiguous for freqs > 600 Hz
- Interaural intensity difference (IID)
- from head shadowing of far ear
- negligible for LF; increases with frequency
- Spectral detail (from pinna reflections) useful for elevation & range
- Direct-to-reverberant useful for range
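The ITD ambiguity noted above can be illustrated with a little arithmetic. For a pure tone, the interaural delay is only observable as a phase difference, which wraps once the delay exceeds half a period. The sketch below uses the 750 µs maximum ITD from the slide; the function names are illustrative, and the half-period criterion is one simple reading of "ambiguous" that lands near the slide's ~600 Hz figure.

```python
import math


def itd_ambiguity_freq_hz(max_itd_s):
    """Lowest tone frequency whose half-period equals the maximum ITD."""
    return 1.0 / (2.0 * max_itd_s)


def interaural_phase_rad(freq_hz, itd_s):
    """Interaural phase difference of a tone, wrapped to [-pi, pi)."""
    phase = 2.0 * math.pi * freq_hz * itd_s
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


f_limit = itd_ambiguity_freq_hz(750e-6)
print(f"ambiguity starts near {f_limit:.0f} Hz")
print(f"500 Hz:  {interaural_phase_rad(500.0, 750e-6):+.2f} rad")
print(f"1000 Hz: {interaural_phase_rad(1000.0, 750e-6):+.2f} rad (sign has flipped)")
```

At 1000 Hz the wrapped phase has the opposite sign, so the same delay would look like a lead at the other ear.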
Head-Related Transfer Fns (HRTFs)
- Capture source coupling as impulse responses
- Collection: (http://phosphor.cipic.ucdavis.edu/)
- Highly individual!
Impulse-response pair for a source at azimuth θ, elevation φ, range R: $\{\, l_{\theta,\phi,R}(t),\ r_{\theta,\phi,R}(t) \,\}$
[Figure: HRIR_021 left and right impulse responses at 0 el as a function of azimuth / deg (-45 to 45) and time / ms (0 to 1.5); single left and right responses at 0 el, 0 az]
Cone of confusion
- Interaural timing cue dominates (below 1kHz)
- from differing path lengths to two ears
- But: only resolves to a cone
- Up/down? Front/back?
[Figure: cone of confusion (approx. equal ITD) around the interaural axis, at azimuth θ]
Further cues
- Pinna causes elevation-dependent coloration
- Monaural perception
- separate coloration from source spectrum?
- Head motion
- synchronized spectral changes
- also for ITD (front/back) etc.
Combining multiple cues
- Both ITD and IID (also called ILD) influence azimuth; what happens when they disagree?
- time-intensity trading at around 0.1 ms / dB
[Figure: three panels of pulse pairs l(t), r(t), with a 1 ms interval marked]
- Identical signals to both ears → image is centered
- Delaying right channel moves image to left
- Attenuating left channel returns image to center
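The trading ratio can be turned into a back-of-envelope calculation: how much attenuation offsets a given interaural delay? The sketch below treats the ~0.1 ms/dB figure as a constant, which is a simplification; the function name is illustrative.

```python
TRADING_RATIO_MS_PER_DB = 0.1  # approximate figure quoted on the slide


def compensating_attenuation_db(delay_ms, ratio_ms_per_db=TRADING_RATIO_MS_PER_DB):
    """Attenuation (dB) of the leading channel that roughly re-centres the image."""
    return delay_ms / ratio_ms_per_db


atten_db = compensating_attenuation_db(1.0)   # 1 ms delay on one channel
linear_gain = 10.0 ** (-atten_db / 20.0)      # gain to apply to the other channel
print(f"{atten_db:.0f} dB attenuation, linear gain {linear_gain:.3f}")
```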
Binaural position estimation
- Imperfect results: (Arruda, Kistler & Wightman 1992)
- listening to ‘wrong’ HRTFs → errors
- front/back reversals stay on cone of confusion
[Figure: judged azimuth (deg) vs. target azimuth (deg), both axes -180 to 180]
The Precedence Effect
- Reflections give misleading spatial cues
- But: spatial impression based on 1st wavefront, then ‘switches off’ for ~50 ms
- ... even if ‘reflections’ are louder
- ... leads to impression of room
[Figure: direct and reflected paths to the listener; l(t) and r(t) vs. t, with the reflection arriving after a delay R/c]
Binaural Masking Release
- Adding noise to reveal target
- why does this make sense?
- Binaural Masking Level Difference (BMLD) up to 12 dB
- greatest for noise in phase, tone anti-phase
[Figure: tone and noise waveforms at the two ears]
- Tone + noise to one ear: tone is masked
- Identical noise to other ear: tone is audible
Outline
1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
- Position
- Environment
4 Extracting spatial sounds
Synthesizing spatial audio
- Goal: recreate realistic soundfield
- hi-fi experience
- synthetic environments (VR)
- Constraints
- resources
- information (individual HRTFs)
- delivery mechanism (headphones)
- Source material types
- live recordings (actual soundfields)
- synthetic (studio mixing, virtual environments)
Classic stereo
- ‘Intensity panning’: no timing modifications, just vary level ±20 dB
- works as long as listener is equidistant
- Surround sound: extra channels in center, sides, ...
- same basic effect - pan between pairs
[Figure: listener facing left (L) and right (R) loudspeakers]
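Intensity panning can be sketched in a few lines: the same mono signal goes to both channels, with only the inter-channel level difference varied. The constant-total-power normalization below is an assumption (a common mixing convention), not something stated on the slide, and the function names are illustrative.

```python
import math


def pan_gains(level_diff_db):
    """Return (left_gain, right_gain) with right/left level = level_diff_db.

    Gains are normalized so left^2 + right^2 == 1 (assumed convention).
    """
    ratio = 10.0 ** (level_diff_db / 20.0)        # right/left amplitude ratio
    left = 1.0 / math.sqrt(1.0 + ratio * ratio)
    return left, ratio * left


def pan(mono, level_diff_db):
    """Apply intensity panning to a list of samples; no timing changes."""
    g_left, g_right = pan_gains(level_diff_db)
    return [g_left * x for x in mono], [g_right * x for x in mono]


center = pan_gains(0.0)     # equal gains: image in the middle
hard_right = pan_gains(20.0)  # +20 dB towards the right channel
left_ch, right_ch = pan([1.0, -0.5, 0.25], 20.0)
print(center, hard_right)
```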
Simulating reverberation
- Can characterize reverb by impulse response
- spatial cues are important - record in stereo
- IRs of ~ 1 sec → very long convolution
- Image model: reflections as duplicate sources
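- ‘Early echoes’ in room impulse response
- Actual reflection may be $h_{ref}(t)$, not $\delta(t)$
[Figure: source, listener and virtual (image) sources, with direct and reflected paths; room impulse response $h_{room}(t)$ vs. t showing the direct path followed by early echoes]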
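A minimal sketch of the image-model idea: each path (direct or via an image source) contributes one impulse at delay r/c. Here each reflection is an ideal δ(t) scaled by a reflection gain and by 1/r for spreading (pressure amplitude, consistent with energy ∝ 1/r²). The path lengths, gains, sample rate, and 343 m/s speed of sound are all made-up illustration values.

```python
def early_echo_ir(path_lengths_m, reflection_gains, fs=16000, c=343.0):
    """Build a sparse impulse response from direct and image-source paths.

    path_lengths_m   -- length of each path in metres (direct path first)
    reflection_gains -- net wall-reflection gain per path (1.0 for direct)
    """
    n_taps = int(round(max(path_lengths_m) / c * fs)) + 1
    h = [0.0] * n_taps
    for r, g in zip(path_lengths_m, reflection_gains):
        delay = int(round(r / c * fs))   # nearest-sample delay
        h[delay] += g / r                # 1/r amplitude spreading
    return h


# Direct path of 3.43 m plus one reflection with twice the path length.
h = early_echo_ir([3.43, 6.86], [1.0, 0.8])
taps = [(n, round(v, 4)) for n, v in enumerate(h) if v != 0.0]
print(taps)
```

Replacing each impulse with a short filter would model the h_ref(t) case mentioned above.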
Artificial reverberation
- Reproduce perceptually salient aspects
- early echo pattern (→ room size impression)
- overall decay tail (→ wall materials...)
- interaural coherence (→ spaciousness)
- Nested allpass filters (Gardner ’92)
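Allpass section: $H(z) = \dfrac{z^{-k} - g}{1 - g\,z^{-k}}$
Impulse response $h[n]$: $-g$ at $n = 0$, $(1-g^2)$ at $n = k$, $g(1-g^2)$ at $n = 2k$, $g^2(1-g^2)$ at $n = 3k$, ...
[Figure: allpass block diagram (delay $z^{-k}$, gains $g$ and $-g$, input x[n], output y[n]); nested + cascade allpass with (k, g) labels 20,0.3 / 30,0.7 / 50,0.5; synthetic reverb built from AP0, AP1, AP2 with a lowpass filter (LPF), gain g, and output taps a0, a1, a2]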
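The allpass section H(z) = (z^-k − g) / (1 − g·z^-k) corresponds to the difference equation y[n] = −g·x[n] + x[n−k] + g·y[n−k]. The sketch below implements a single section and a plain series cascade. It is not Gardner's nested structure, and reusing the (k, g) pairs from the figure in a simple cascade is only for illustration.

```python
def allpass(x, k, g):
    """Single allpass section: y[n] = -g*x[n] + x[n-k] + g*y[n-k]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        x_delayed = x[n - k] if n >= k else 0.0
        y_delayed = y[n - k] if n >= k else 0.0
        y[n] = -g * x[n] + x_delayed + g * y_delayed
    return y


def cascade(x, sections):
    """Run x through several allpass sections in series."""
    for k, g in sections:
        x = allpass(x, k, g)
    return x


impulse = [1.0] + [0.0] * 99
h = allpass(impulse, 20, 0.3)
print([round(h[n], 4) for n in (0, 20, 40, 60)])

# Plain cascade of three sections (delays and gains taken from the figure labels).
tail = cascade([1.0] + [0.0] * 1999, [(20, 0.3), (30, 0.7), (50, 0.5)])
```

The impulse response values match the −g, (1−g²), g(1−g²), ... pattern, and the total energy stays at one, which is the flat-magnitude (allpass) property.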
Synthetic binaural audio
- Source convolved with {L,R} HRTFs gives precise positioning
- ...for headphone presentation
- can combine multiple sources (by adding)
- Where to get HRTFs?
- measured set, but: specific to individual, discrete
- interpolate by linear crossfade, PCA basis set
- or: parametric model - delay, shadow, pinna
- Head motion cues?
- head tracking + fast updates
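[Figure: structural HRTF model (after Brown & Duda '97). The source feeds a chain per ear: Delay ($z^{-t_{DL}(\theta)}$, $z^{-t_{DR}(\theta)}$), Shadow (first-order filters built from a fixed term in $a$ and a direction-dependent term $1 - b_L(\theta) z^{-1}$ or $1 - b_R(\theta) z^{-1}$), and Pinna ($\sum_k p_{kL}(\theta,\phi)\, z^{-t_{PkL}(\theta,\phi)}$, $\sum_k p_{kR}(\theta,\phi)\, z^{-t_{PkR}(\theta,\phi)}$); a room echo $K_E\, z^{-t_E}$ is added to both ears]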
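Binaural synthesis by convolution can be sketched as follows: each source is convolved with a left and a right head-related impulse response (HRIR), and multiple sources are combined by adding. The toy HRIRs below (a pure delay and attenuation at the far ear) are placeholders, not measured data, and the function names are illustrative.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def render_binaural(sources):
    """Mix sources given as (signal, hrir_left, hrir_right) into (left, right)."""
    rendered = [(convolve(s, hl), convolve(s, hr)) for s, hl, hr in sources]
    length = max(len(ch) for pair in rendered for ch in pair)
    left, right = [0.0] * length, [0.0] * length
    for l_sig, r_sig in rendered:
        for i, v in enumerate(l_sig):
            left[i] += v
        for i, v in enumerate(r_sig):
            right[i] += v
    return left, right


# Toy HRIR pair for a source on the left: right ear is delayed and quieter.
hrir_left = [1.0]
hrir_right = [0.0, 0.0, 0.0, 0.5]
left, right = render_binaural([([1.0, 0.0], hrir_left, hrir_right)])
print(left, right)
```

With measured HRIRs of a few hundred taps, FFT-based convolution would normally replace the double loop.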
Transaural sound
- Binaural signals without headphones?
- Can cross-cancel wrap-around signals
- speakers $S_{L,R}$, ears $E_{L,R}$, binaural signals $B_{L,R}$
- Narrow ‘sweet spot’
- head motion?
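Cross-cancellation:
$S_L = H_{LL}^{-1}\,(B_L - H_{RL}\,S_R)$
$S_R = H_{RR}^{-1}\,(B_R - H_{LR}\,S_L)$
[Figure: binaural signals $B_L$, $B_R$ pass through matrix M to speakers $S_L$, $S_R$, which reach ears $E_L$, $E_R$ via direct paths $H_{LL}$, $H_{RR}$ and cross paths $H_{LR}$, $H_{RL}$]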
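The two coupled cross-cancellation equations can be solved jointly at each frequency as a 2×2 linear system. The sketch below does this for a single frequency bin with complex transfer values. It reads H_RL as the path from the right speaker to the left ear, as the equations imply; the numeric values are invented for the check.

```python
import cmath


def crosstalk_cancel(b_l, b_r, h_ll, h_lr, h_rl, h_rr):
    """Speaker signals (s_l, s_r) so that the ears receive (b_l, b_r).

    Ear model for one frequency bin:
        e_l = h_ll * s_l + h_rl * s_r
        e_r = h_lr * s_l + h_rr * s_r
    """
    det = h_ll * h_rr - h_rl * h_lr
    if abs(det) < 1e-12:
        raise ValueError("transfer matrix is singular at this frequency")
    s_l = (h_rr * b_l - h_rl * b_r) / det
    s_r = (h_ll * b_r - h_lr * b_l) / det
    return s_l, s_r


# Made-up symmetric head: cross paths are weaker and phase-shifted.
h_direct = 1.0 + 0.0j
h_cross = 0.5 * cmath.exp(-0.8j)
b_l, b_r = 1.0 + 0.0j, 0.2 - 0.1j
s_l, s_r = crosstalk_cancel(b_l, b_r, h_direct, h_cross, h_cross, h_direct)

e_l = h_direct * s_l + h_cross * s_r
e_r = h_cross * s_l + h_direct * s_r
print(abs(e_l - b_l), abs(e_r - b_r))
```

The solution depends on the exact transfer functions, which change as the head moves; that is one way to see why the sweet spot is narrow.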
Soundfield reconstruction
- Stop thinking about ears: just reconstruct pressure + spatial derivatives
- ears in reconstructed field receive same sounds
- Complex reconstruction setup (ambisonics)
- able to preserve head motion cues?
Quantities to reconstruct: $p(x,y,z,t)$, $\partial p(t)/\partial x$, $\partial p(t)/\partial y$, $\partial p(t)/\partial z$
Outline
1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds
- Microphone arrays
- Modeling binaural processing
Extracting spatial sounds
- Given access to soundfield, can we recover separate components?
- degrees of freedom: >N signals from N sensors is hard
- but: people can do it (somewhat)
- Information-theoretic approach
- use only very general constraints
- rely on precision measurements
- Anthropic approach
- examine human perception
- attempt to use same information
Microphone arrays
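- Signals from multiple microphones can be combined to enhance/cancel certain sources
- ‘Coincident’ mics with different directional gains:
$\begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} \hat{s}_1 \\ \hat{s}_2 \end{bmatrix} = A^{-1} m$
[Figure: sources s1, s2 reaching mics m1, m2 with gains a11, a12, a21, a22]
- Microphone arrays (endfire)
[Figure: endfire array with delays D and summation; directional gain patterns (dB scale, -20 and -40 marked) for λ = 4D, λ = 2D, λ = D]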
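For the coincident-microphone case, recovering the sources amounts to inverting the 2×2 gain matrix, ŝ = A⁻¹m, sample by sample. The sketch below assumes the gains are known exactly and instantaneous (no delays or filtering); the matrix values and signals are made up.

```python
def unmix_2x2(m1, m2, a):
    """Recover (s1, s2) from mic signals given gains a = ((a11, a12), (a21, a22))."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular; mics give no independent views")
    s1 = [(a22 * x - a12 * y) / det for x, y in zip(m1, m2)]
    s2 = [(-a21 * x + a11 * y) / det for x, y in zip(m1, m2)]
    return s1, s2


# Hypothetical sources and directional gains.
s1_true = [1.0, 0.0, -1.0, 0.5]
s2_true = [0.5, 0.5, 0.5, -0.25]
A = ((1.0, 0.5), (0.3, 1.0))
m1 = [A[0][0] * p + A[0][1] * q for p, q in zip(s1_true, s2_true)]
m2 = [A[1][0] * p + A[1][1] * q for p, q in zip(s1_true, s2_true)]

s1_hat, s2_hat = unmix_2x2(m1, m2, A)
print(s1_hat, s2_hat)
```

With more sources than microphones the matrix is no longer square and this direct inversion is unavailable, which is the degrees-of-freedom limit noted earlier in the lecture.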
Adaptive Beamforming & Independent Component Analysis (ICA)
- Formulate mathematical criteria to optimize
- Beamforming: Drive interference to zero
- cancel energy during nontarget intervals
- ICA: maximize mutual independence of outputs
- from higher-order moments during overlap
- Limited by separation model parameter space
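- only N×N?
[Figure: sources s1, s2 mixed through a11, a12, a21, a22 into mics m1, m2; unmixing weights adapted along $-\delta\,\mathrm{MutInfo} / \delta a$]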
Binaural models
- Human listeners do better?
- certainly given only 2 channels
- Extract ITD and IID cues?
- cross-correlation finds timing differences
- ‘consume’ counter-moving pulses
- how to achieve IID, trading
- vertical cues...
[Figure: interaural cross-correlation as a function of lag / ms (-6 to 6) and center freq / Hz (100 to 3200), with target azimuth marked]
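A minimal sketch of the cross-correlation idea: the lag at which the left and right signals line up best gives an ITD estimate. This version works on the broadband signals directly, whereas the figure suggests a per-frequency-band analysis; the function name and test signal are illustrative.

```python
def estimate_itd_samples(left, right, max_lag):
    """Lag (in samples) maximizing sum_n left[n] * right[n + lag].

    A positive result means the right-ear signal is delayed relative to the left.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, l_val in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += l_val * right[j]
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag


# Short pulse at the left ear; same pulse 3 samples later at the right ear.
pulse = [1.0, 0.5, -0.3]
left = [0.0] * 10 + pulse + [0.0] * 10
right = [0.0] * 13 + pulse + [0.0] * 7

lag = estimate_itd_samples(left, right, max_lag=8)
fs = 16000  # assumed sample rate for converting to time
print(f"lag = {lag} samples = {1e6 * lag / fs:.1f} microseconds")
```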
Nonlinear filtering
- How to separate sounds based on direction?
- estimate direction locally
- choose target direction
- remove energy from other directions
- E.g. Kollmeier, Peissig & Hohmann ’93
- IID from $|L_\omega| / |R_\omega|$; ITD (IPD) from $\arg\{L_\omega R_\omega^*\}$
- match to IID/IPD template for desired direction
- also reverberation?
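[Figure: time-frequency grid of short-time spectrum values $X_\omega(mH,\ 2\pi k / NT)$]
[Figure: block diagram. FFT analysis of l and r gives $L_\omega$ and $R_\omega$; the moduli $|L_\omega|^2$, $|R_\omega|^2$ and the cross-correlation $L_\omega R_\omega^*$ are each smoothed by $(1-a)/(1-a z^{-1})$ to give $S_{LL}$, $S_{RR}$, $S_{LR}$; a gain factor g is calculated and applied, and OLA-FFT synthesis gives l' and r']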
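The per-bin cue computation described above can be sketched for a single time-frequency bin: IID from the magnitude ratio and IPD from the phase of L·R*. The gain rule below (pass the bin if both cues are close to a target, otherwise attenuate to a floor) is a hypothetical stand-in for template matching, not the published gain function; the tolerances and floor are invented.

```python
import cmath
import math


def interaural_cues(l_bin, r_bin, eps=1e-12):
    """Return (IID in dB, IPD in radians) for one complex STFT bin pair."""
    iid_db = 20.0 * math.log10((abs(l_bin) + eps) / (abs(r_bin) + eps))
    ipd = cmath.phase(l_bin * r_bin.conjugate())
    return iid_db, ipd


def direction_gain(l_bin, r_bin, target_iid_db, target_ipd,
                   iid_tol_db=3.0, ipd_tol=0.5, floor=0.1):
    """Hypothetical gain: 1.0 if the bin's cues match the target, else `floor`."""
    iid_db, ipd = interaural_cues(l_bin, r_bin)
    # Compare phases on the circle so that +pi and -pi count as close.
    ipd_err = abs(cmath.phase(cmath.exp(1j * (ipd - target_ipd))))
    matches = abs(iid_db - target_iid_db) <= iid_tol_db and ipd_err <= ipd_tol
    return 1.0 if matches else floor


l_bin = 1.0 + 0.0j
r_bin = 0.5 * cmath.exp(-0.3j)   # right ear quieter and lagging in phase
iid_db, ipd = interaural_cues(l_bin, r_bin)
print(f"IID = {iid_db:.2f} dB, IPD = {ipd:.2f} rad")
print(direction_gain(l_bin, r_bin, 6.0, 0.3), direction_gain(l_bin, r_bin, 0.0, 0.0))
```

In a full system the same gain would be applied to both channels of each bin before overlap-add resynthesis, and the raw cues would be smoothed over time first.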
Summary
- Spatial sound
- sampling at more than one point gives information on origin direction
- Binaural perception
- time & intensity cues used between/within ears
- Sound rendering
- conventional stereo
- HRTF-based
- Spatial analysis
- optimal linear techniques
- elusive auditory models
References
B.C.J. Moore, An Introduction to the Psychology of Hearing (4th ed.), Academic Press, 1997.
J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.