EE E6820: Speech & Audio Processing & Recognition, Lecture 8: Spatial sound

  1. EE E6820: Speech & Audio Processing & Recognition
     Lecture 8: Spatial sound
     1 Spatial acoustics
     2 Binaural perception
     3 Synthesizing spatial audio
     4 Extracting spatial sounds
     Dan Ellis <dpwe@ee.columbia.edu>
     http://www.ee.columbia.edu/~dpwe/e6820/
     E6820 SAPR - Dan Ellis, L08 - Spatial sound, 2002-04-01

  2. Spatial acoustics (section 1)
     • Received sound = source + channel
       - so far, only considered ideal source waveform
     • Sound carries information on its spatial origin
       - e.g. “ripples in the lake”
       - great evolutionary significance
     • The basis of scene analysis?
       - yes and no: try blocking an ear

  3. Ripples in the lake
     [Figure: a source radiating wavefronts (speed c m/s) toward a listener; energy ∝ 1/r²]
     • Effect of relative position on sound
       - delay $\Delta = r/c$
       - energy decay ~ $1/r^2$
       - absorption ~ $G(f)^r$
       - direct energy plus reflections
     • Give cues for recovering source position
     • Describe wavefront by its normal
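The delay and energy-decay relations on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the lecture: the speed of sound (343 m/s) and the 1 m reference distance are assumed values, and the function names are made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def propagation_delay(r_m, c=SPEED_OF_SOUND):
    """Delay in seconds for a wavefront to travel r_m metres (delay = r/c)."""
    return r_m / c


def level_change_db(r_m, r_ref_m=1.0):
    """Level in dB relative to the reference distance, for energy ~ 1/r^2."""
    return 10.0 * math.log10((r_ref_m / r_m) ** 2)


delay_10m = propagation_delay(10.0)   # about 29 ms
drop_2m = level_change_db(2.0)        # about -6 dB per doubling of distance
```

Under these assumptions, each doubling of distance costs about 6 dB, before any frequency-dependent absorption or reflections are considered.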

  4. Recovering spatial information
     • Source direction as wavefront normal
       - moving plane found from timing at 3 points
       - path difference: $c \, \Delta t = \Delta s = AB \cos\theta$
       - need to solve correspondence
       [Figure: plane wavefront crossing points A, B, C; pressure vs. time at each point]
     • Space: need 3 parameters
       - e.g. 2 angles and range: azimuth θ, elevation φ, range r
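For a single pair of points, the timing relation can be inverted to get the angle between the baseline AB and the direction of travel. The sketch below assumes the path difference is c·Δt = AB·cos θ, a plane wave, and c = 343 m/s; the function name and the example numbers are hypothetical.

```python
import math


def arrival_angle_deg(delta_t, spacing_m, c=343.0):
    """Angle (degrees) between baseline AB and the propagation direction.

    Solves c * delta_t = AB * cos(theta) for theta. The cosine is clipped
    to [-1, 1] so that small timing errors do not raise a domain error.
    """
    cos_theta = c * delta_t / spacing_m
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical measurement: 0.5 ms delay across a 0.343 m baseline.
theta = arrival_angle_deg(delta_t=0.0005, spacing_m=0.343)  # about 60 degrees
```

A single pair only fixes the angle to the baseline; the third point mentioned on the slide is what pins down the plane's orientation.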

  5. The effect of the environment
     • Reflection causes additional wavefronts
       - + scattering, absorption
       - many paths → many echoes
       [Figure: reflection, diffraction & shadowing]
     • Reverberant effect
       - causal ‘smearing’ of signal energy
       [Figure: spectrograms of dry speech (airvib16) and the same speech + reverb from hlwy16; freq / Hz vs. time / sec]

  6. Reverberation impulse response
     • Exponential decay of reflections: $h_{room}(t) \sim e^{-t/T}$
       [Figure: h_room(t) and spectrogram of hlwy16 (128pt window); freq / Hz vs. time / s, level in dB]
     • Frequency-dependent
       - greater absorption at high frequencies → faster decay
     • Size-dependent
       - larger rooms → longer delays → slower decay
     • Sabine’s equation: $RT_{60} = \dfrac{0.049 \, V}{S \alpha}$
     • Time constant varies with size and absorption
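Sabine's equation is easy to evaluate directly. The constant 0.049 applies when the volume V is in cubic feet and the surface area S in square feet (the metric form uses about 0.161 with m³ and m²). The room dimensions and mean absorption below are invented purely for illustration.

```python
def rt60_sabine(volume, surface_area, mean_absorption, k=0.049):
    """Reverberation time in seconds from Sabine's equation.

    k = 0.049 expects volume in ft^3 and surface_area in ft^2;
    pass k = 0.161 for m^3 and m^2.
    """
    return k * volume / (surface_area * mean_absorption)


# Hypothetical room: 20 ft x 15 ft x 10 ft, mean absorption 0.3.
length, width, height = 20.0, 15.0, 10.0
volume = length * width * height                                       # 3000 ft^3
surface = 2 * (length * width + length * height + width * height)     # 1300 ft^2
rt60 = rt60_sabine(volume, surface, 0.3)                               # about 0.38 s
```

Doubling every dimension multiplies V by 8 and S by 4, so RT60 doubles, which matches the slide's point that larger rooms decay more slowly.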

  7. Outline
     1 Spatial acoustics
     2 Binaural perception
       - The sound at the two ears
       - Available cues
       - Perceptual phenomena
     3 Synthesizing spatial audio
     4 Extracting spatial sounds

  8. Binaural perception (section 2)
     [Figure: source to one side of a head with ears L and R; path length difference between the ears; head shadow (high freq) at the far ear]
     • What is the information in the 2 ear signals?
       - the sound of the source(s) (L+R)
       - the position of the source(s) (L-R)
     • Example waveforms (ShATR database)
       [Figure: shatr78m3 waveform, Left and Right channels, 2.2 to 2.235 s; time / s]

  9. Main cues to spatial hearing
     • Interaural time difference (ITD)
       - from different path lengths around head
       - dominates in low frequency (< 1.5 kHz)
       - max ~750 µs → ambiguous for freqs > 600 Hz
     • Interaural intensity difference (IID)
       - from head shadowing of far ear
       - negligible for LF; increases with frequency
     • Spectral detail (from pinna reflections) useful for elevation & range
     • Direct-to-reverberant useful for range
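One common way to measure an ITD from two ear signals is to find the lag that maximizes their cross-correlation, searching only within the physically plausible range. The sketch below is an illustration under stated assumptions, not the lecture's method: the test signal is synthetic white noise, the right channel is a pure delayed copy of the left, and `estimate_itd` is a hypothetical helper name.

```python
import numpy as np


def estimate_itd(left, right, fs, max_itd=750e-6):
    """Delay of `right` relative to `left` in seconds (positive: right lags).

    Picks the lag within +/- max_itd that maximizes the cross-correlation.
    """
    max_lag = int(round(max_itd * fs))
    n = len(left)
    lags = np.arange(-max_lag, max_lag + 1)
    ref = left[max_lag:n - max_lag]
    scores = [np.dot(ref, right[max_lag + lag:n - max_lag + lag]) for lag in lags]
    return lags[int(np.argmax(scores))] / fs


fs = 48000
rng = np.random.default_rng(0)
src = rng.standard_normal(4800)          # 0.1 s of synthetic noise
delay = 20                               # samples, about 417 µs
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])
itd = estimate_itd(left, right, fs)      # about 4.17e-4 s

# Frequency whose half period equals the maximum ITD: above this, a
# phase difference no longer maps to a unique delay.
ambiguity_hz = 1.0 / (2 * 750e-6)        # about 667 Hz
```

The half-period figure of roughly 667 Hz is in line with the slide's rounded "ambiguous for freqs > 600 Hz".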

  10. Head-Related Transfer Fns (HRTFs)
     • Capture source coupling as impulse responses $\{\, l_{\theta,\phi,R}(t),\; r_{\theta,\phi,R}(t) \,\}$
     • Collection: http://phosphor.cipic.ucdavis.edu/
       [Figure: HRIR_021 Left and Right @ 0 el, azimuth / deg (-45 LEFT to 45 RIGHT) vs. time / ms; HRIR_021 Left and Right @ 0 el 0 az, amplitude vs. time / ms]
     • Highly individual!

  11. Cone of confusion
     [Figure: cone of confusion (approx equal ITD) at azimuth θ]
     • Interaural timing cue dominates (below 1 kHz)
       - from differing path lengths to two ears
     • But: only resolves to a cone
       - Up/down? Front/back?

  12. Further cues
     • Pinna causes elevation-dependent coloration
     • Monaural perception
       - separate coloration from source spectrum?
     • Head motion
       - synchronized spectral changes
       - also for ITD (front/back) etc.

  13. Combining multiple cues
     • Both ITD and ILD influence azimuth; what happens when they disagree?
       - identical signals to both ears → image is centered
       - delaying right channel (1 ms) moves image to left
       - attenuating left channel returns image to center
       - trading @ around 0.1 ms / dB
       [Figure: l(t) and r(t) waveforms for each of the three cases]
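The trading ratio can be turned into a rough rule of thumb for how much attenuation offsets a given delay. This sketch assumes the trade is linear at the slide's approximate 0.1 ms / dB, which is only a first-order simplification; the function name is hypothetical.

```python
TRADING_MS_PER_DB = 0.1  # approximate ratio quoted on the slide


def compensating_attenuation_db(delay_ms, ms_per_db=TRADING_MS_PER_DB):
    """Level change (dB) that roughly offsets an interaural delay.

    Assumes a linear time-intensity trade, which is a simplification.
    """
    return delay_ms / ms_per_db


# Delaying the right channel by 1 ms moves the image left; under this
# assumption, attenuating the left channel by about 10 dB re-centers it.
atten = compensating_attenuation_db(1.0)
```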

  14. Binaural position estimation
     • Imperfect results: (Arruda, Kistler & Wightman 1992)
       [Figure: Judged Azimuth (Deg) vs. Target Azimuth (Deg), both from -180 to 180]
       - listening to ‘wrong’ HRTFs → errors
       - front/back reversals stay on cone of confusion

  15. The Precedence Effect
     • Reflections give misleading spatial cues
       [Figure: l(t) and r(t) showing direct and reflected arrivals; reflection delayed by R/c]
     • But: spatial impression based on 1st wavefront, then ‘switches off’ for ~50 ms
       - ... even if ‘reflections’ are louder
       - ... leads to impression of room

  16. Binaural Masking Release
     • Adding noise to reveal target
       - tone + noise to one ear: tone is masked
       - identical noise to other ear: tone is audible
       - why does this make sense?
       [Figure: tone + noise waveforms for each case]
     • Binaural Masking Level Difference up to 12 dB
       - greatest for noise in phase, tone anti-phase
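One way to see why identical noise at the second ear can help is a simple cancellation argument: if the noise is the same at both ears, subtracting one ear signal from the other removes it and leaves the tone. The slide does not state this explanation, so the sketch below is offered only as an illustrative assumption, using synthetic signals with made-up levels.

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)                 # same noise at both ears
tone = 0.1 * np.sin(2 * np.pi * 500 * t)        # weak 500 Hz target


def snr_db(signal, masker):
    """Signal-to-masker energy ratio in dB."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(masker ** 2))


snr_one_ear = snr_db(tone, noise)               # about -23 dB: tone is buried

# Tone at one ear only: the ear difference recovers the tone.
diff_monaural = (noise + tone) - noise

# Tone in anti-phase at the two ears: the difference doubles the tone.
diff_antiphase = (noise + tone) - (noise - tone)
```

In this idealized picture the anti-phase case gives the largest tone after subtraction, which is at least consistent with the slide's note that the effect is greatest for in-phase noise and an anti-phase tone. Real listeners do not cancel perfectly, which is one reason the measured benefit is bounded.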

  17. Outline
     1 Spatial acoustics
     2 Binaural perception
     3 Synthesizing spatial audio
       - Position
       - Environment
     4 Extracting spatial sounds

  18. Synthesizing spatial audio (section 3)
     • Goal: recreate realistic soundfield
       - hi-fi experience
       - synthetic environments (VR)
     • Constraints
       - resources
       - information (individual HRTFs)
       - delivery mechanism (headphones)
     • Source material types
       - live recordings (actual soundfields)
       - synthetic (studio mixing, virtual environments)

  19. Classic stereo
     [Figure: listener facing L and R loudspeakers]
     • ‘Intensity panning’: no timing modifications, just vary level ±20 dB
       - works as long as listener is equidistant
     • Surround sound: extra channels in center, sides, ...
       - same basic effect: pan between pairs
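Intensity panning amounts to choosing two gains with a given level difference. The sketch below maps a left-minus-right level difference in dB to a gain pair; normalizing the gains to unit total power is an added assumption (the slide only says to vary the level), and the function names are hypothetical.

```python
import numpy as np


def pan_gains(level_diff_db):
    """Left/right gains for a given L-minus-R level difference in dB.

    Gains are normalized so that g_left**2 + g_right**2 == 1 (assumed
    convention; keeps total power constant as the image moves).
    """
    ratio = 10.0 ** (level_diff_db / 20.0)      # g_left / g_right
    g_right = 1.0 / np.sqrt(1.0 + ratio ** 2)
    g_left = ratio * g_right
    return g_left, g_right


def pan(mono, level_diff_db):
    """Return a (2, N) stereo array; no timing differences are introduced."""
    g_left, g_right = pan_gains(level_diff_db)
    return np.stack([g_left * mono, g_right * mono])


centered = pan(np.ones(4), 0.0)       # equal gains of 1/sqrt(2)
hard_left = pan(np.ones(4), 20.0)     # left channel 20 dB above right
```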

  20. Simulating reverberation
     • Can characterize reverb by impulse response
       - spatial cues are important: record in stereo
       - IRs of ~1 sec → very long convolution
     • Image model: reflections as duplicate sources
       [Figure: source, listener, reflected path, and virtual (image) sources behind the wall]
     • ‘Early echoes’ in room impulse response
       [Figure: h_room(t) showing the direct path followed by early echoes]
     • Actual reflection may be $h_{ref}(t)$, not $\delta(t)$
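The image model can be sketched for the simplest case: first-order reflections in a 2-D rectangular room, where each wall contributes one mirrored copy of the source. The room size, positions, c = 343 m/s, and function names below are illustrative assumptions, and each reflection is treated as an ideal δ(t) rather than a filtered h_ref(t).

```python
import numpy as np


def first_order_images(source, room):
    """Mirror a source (x, y) in the four walls of a room [0, Lx] x [0, Ly]."""
    x, y = source
    lx, ly = room
    return [(-x, y), (2 * lx - x, y), (x, -y), (x, 2 * ly - y)]


def echo_times(source, listener, room, c=343.0):
    """Arrival times (s): direct path first, then four first-order echoes."""
    positions = [source] + first_order_images(source, room)
    dists = [np.hypot(px - listener[0], py - listener[1]) for px, py in positions]
    return [d / c for d in dists]


# Hypothetical 5 m x 4 m room.
times = echo_times(source=(1.0, 2.0), listener=(3.0, 2.0), room=(5.0, 4.0))
```

Higher-order reflections follow by mirroring the images again, which is why the echo density grows quickly with time.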

  21. Artificial reverberation
     • Reproduce perceptually salient aspects
       - early echo pattern (→ room size impression)
       - overall decay tail (→ wall materials...)
       - interaural coherence (→ spaciousness)
     • Nested allpass filters (Gardner ’92)
       - allpass: $H(z) = \dfrac{z^{-k} - g}{1 - g z^{-k}}$
       - impulse response: $h[0] = -g$, $h[k] = 1 - g^2$, $h[2k] = g(1 - g^2)$, $h[3k] = g^2(1 - g^2)$, ...
       [Figure: allpass block diagram from x[n] to y[n] with delay z^{-k} and gains g, -g; nested + cascade allpass synthetic reverb with sections labelled 50,0.5; 30,0.7; 20,0.3, allpasses AP0, AP1, AP2, output taps a0, a1, a2, and feedback gain g through an LPF]
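A single allpass section with transfer function H(z) = (z^-k - g) / (1 - g·z^-k) can be written as a short difference equation, and its impulse response checked against the values -g, 1-g², g(1-g²), g²(1-g²). This sketch covers one plain section only, not Gardner's nested structure; the values k = 5 and g = 0.7 are arbitrary.

```python
import numpy as np


def allpass(x, k, g):
    """Allpass section: y[n] = -g*x[n] + x[n-k] + g*y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_delayed = x[n - k] if n >= k else 0.0
        y_delayed = y[n - k] if n >= k else 0.0
        y[n] = -g * x[n] + x_delayed + g * y_delayed
    return y


k, g = 5, 0.7
impulse = np.zeros(64)
impulse[0] = 1.0
h = allpass(impulse, k, g)
# h[0] = -g, h[k] = 1 - g**2, h[2k] = g*(1 - g**2), h[3k] = g**2*(1 - g**2)
```

The output is a decaying train of echoes spaced k samples apart, yet the magnitude response stays flat, which is what makes these sections attractive as reverberator building blocks.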

  22. Synthetic binaural audio
     • Source convolved with {L,R} HRTFs gives precise positioning
       - ...for headphone presentation
       - can combine multiple sources (by adding)
     • Where to get HRTFs?
       - measured set, but: specific to individual, discrete
       - interpolate by linear crossfade, PCA basis set
       - or: parametric model - delay, shadow, pinna (after Brown & Duda '97)
         shadow: $\dfrac{1 - b_L(\theta) z^{-1}}{1 - a z^{-1}}$
         delay: $z^{-t_{DL}(\theta)}$
         pinna: $\sum_k p_{kL}(\theta,\phi) \, z^{-t_{PkL}(\theta,\phi)}$
         room echo: $K_E \, z^{-t_E}$
         (and likewise for the right ear with $b_R$, $t_{DR}$, $p_{kR}$, $t_{PkR}$)
       [Figure: source feeding left and right chains of shadow, delay, and pinna blocks, plus a shared room echo path]
     • Head motion cues?
       - head tracking + fast updates
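Convolving a mono source with a left/right impulse-response pair, and summing several rendered sources, takes only a few lines. In the sketch below the "HRIRs" are toy placeholders (a unit impulse for the near ear, a delayed and attenuated impulse for the far ear), not measured data; real use would load a measured set. The function names are hypothetical.

```python
import numpy as np


def binauralize(mono, hrir_left, hrir_right):
    """Convolve a mono source with an equal-length HRIR pair -> (2, N) array."""
    return np.stack([np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)])


def mix_sources(sources):
    """Sum several (mono, hrir_left, hrir_right) triples into one binaural mix."""
    rendered = [binauralize(*s) for s in sources]
    length = max(r.shape[1] for r in rendered)
    out = np.zeros((2, length))
    for r in rendered:
        out[:, :r.shape[1]] += r
    return out


# Toy HRIR pair for a source on the left: right ear delayed and quieter.
hrir_l = np.zeros(32)
hrir_l[0] = 1.0
hrir_r = np.zeros(32)
hrir_r[20] = 0.5

rng = np.random.default_rng(0)
mono = rng.standard_normal(1000)
binaural = binauralize(mono, hrir_l, hrir_r)
mix = mix_sources([(mono, hrir_l, hrir_r), (mono[:500], hrir_r, hrir_l)])
```

With head tracking, the HRIR pair would have to be swapped or interpolated as the head moves, which is where the slide's "fast updates" requirement comes from.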

  23. Transaural sound
     • Binaural signals without headphones?
     • Can cross-cancel wrap-around signals
       - speakers $S_{L,R}$, ears $E_{L,R}$, binaural signals $B_{L,R}$
       - $S_L = H_{LL}^{-1} \, (B_L - H_{RL} S_R)$
       - $S_R = H_{RR}^{-1} \, (B_R - H_{LR} S_L)$
       [Figure: binaural signals B_L, B_R pass through canceller M to speakers S_L, S_R; paths H_LL, H_LR, H_RL, H_RR lead to ears E_L, E_R]
     • Narrow ‘sweet spot’
       - head motion?
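The two coupled speaker equations can be solved jointly per frequency bin by inverting the 2×2 matrix of speaker-to-ear transfer functions. The sketch below assumes H_xy denotes the path from speaker x to ear y, so that E_L = H_LL·S_L + H_RL·S_R and E_R = H_LR·S_L + H_RR·S_R; the transfer-function values are invented for the demonstration, and no regularization is applied for near-singular bins.

```python
import numpy as np


def crosstalk_cancel(b_left, b_right, h_ll, h_lr, h_rl, h_rr):
    """Speaker spectra (S_L, S_R) that deliver (B_L, B_R) at the ears.

    All arguments are complex arrays over frequency bins; h_xy is the
    (assumed) transfer function from speaker x to ear y.
    """
    det = h_ll * h_rr - h_lr * h_rl
    s_left = (h_rr * b_left - h_rl * b_right) / det
    s_right = (h_ll * b_right - h_lr * b_left) / det
    return s_left, s_right


n_bins = 8
rng = np.random.default_rng(1)
b_left = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
b_right = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)

# Made-up paths: unit direct paths, weaker phase-shifted cross paths.
h_ll = np.ones(n_bins, dtype=complex)
h_rr = np.ones(n_bins, dtype=complex)
h_lr = 0.4 * np.exp(-1j * np.linspace(0.0, np.pi, n_bins))
h_rl = h_lr.copy()

s_left, s_right = crosstalk_cancel(b_left, b_right, h_ll, h_lr, h_rl, h_rr)

# Signals arriving at the ears: these should reproduce b_left and b_right.
e_left = h_ll * s_left + h_rl * s_right
e_right = h_lr * s_left + h_rr * s_right
```

Because the solution depends on the exact paths to each ear, small head movements change the H terms and break the cancellation, which is the narrow sweet spot the slide mentions.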
