EE E6820: Speech & Audio Processing & Recognition Lecture 4: Auditory Perception 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: detection & discrimination 4 Pitch perception 5 Auditory organization & scene analysis 6 Speech perception Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/e6820/ E6820 SAPR - Dan Ellis L04 - Perception 2002-02-18 - 1

1 Why study perception?
• Perception is messy: can we avoid it? No!
• Audition provides the ‘ground truth’ in audio
  - what is relevant and irrelevant
  - subjective importance of distortion (coding etc.)
  - (there could be other information in sound...)
• Some sounds are ‘designed’ for audition
  - co-evolution of speech and hearing
• The auditory system is very successful
  - we would do extremely well to duplicate it
• We are now able to model complex systems
  - faster computers, bigger memories

How to study perception?
Three different approaches:
• Analyze the example: physiology
  - dissection & nerve recordings
• Black box input/output: psychophysics
  - fit simple models of simple functions
• Information processing models
  - investigate and model complex functions
  - e.g. scene analysis, speech perception

Outline
1 Motivation
2 Physiology
  - Outer, middle & inner ear
  - The Auditory Nerve and beyond
  - Models
3 Psychophysics
4 Pitch perception
5 Scene analysis
6 Speech perception

2 Physiology
• Processing chain from air to brain:
  [Figure: outer ear → middle ear → inner ear → auditory nerve → midbrain → cortex]
• Study via:
  - anatomy
  - nerve recordings
• Signals flow in both directions

Outer & middle ear
[Figure: pinna, ear canal, eardrum (tympanum), middle ear bones]
• Pinna ‘horn’
  - complex reflections give spatial (elevation) cues
• Ear canal
  - acoustic tube
• Middle ear
  - bones provide impedance matching

Inner ear: Cochlea
[Figure: cochlea with oval window (driven by the middle ear bones), Basilar Membrane (BM) and travelling wave; resonant frequency falls from 16 kHz to 50 Hz over positions 0 to 35 mm]
• Mechanical input from the middle ear starts a traveling wave moving down the Basilar Membrane
• Varying stiffness and mass of the BM results in a continuous variation of resonant frequency
• At resonance, traveling wave energy is dissipated in BM movement
→ Frequency (Fourier) analysis
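To make the position-to-frequency idea concrete, here is a minimal sketch that interpolates between the two endpoints shown on the slide (16 kHz at 0 mm, 50 Hz at 35 mm). The exponential (log-linear) shape and the function name `resonant_frequency` are assumptions for illustration only; the slide does not give the actual map.

```python
# Endpoint values are read off the slide's figure; the exponential
# interpolation between them is an assumption, not a measured cochlear map.
BM_LENGTH_MM = 35.0
F_BASE_HZ = 16000.0   # resonance at position 0 mm (oval window end)
F_APEX_HZ = 50.0      # resonance at position 35 mm


def resonant_frequency(position_mm: float) -> float:
    """Hypothetical log-linear map from BM position to resonant frequency."""
    if not 0.0 <= position_mm <= BM_LENGTH_MM:
        raise ValueError("position must lie on the membrane")
    fraction = position_mm / BM_LENGTH_MM
    return F_BASE_HZ * (F_APEX_HZ / F_BASE_HZ) ** fraction


for x in (0.0, 17.5, 35.0):
    print(f"{x:5.1f} mm -> {resonant_frequency(x):8.1f} Hz")
```

Under this assumed shape, equal steps along the membrane correspond to equal frequency ratios, which is one simple way to picture a roughly logarithmic frequency axis.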

Cochlea hair cells
• Ear converts sound into BM motion; each point on the BM corresponds to a frequency
  [Figure: cochlea cross-section showing tectorial membrane, basilar membrane, auditory nerve, Inner Hair Cell (IHC) and Outer Hair Cell (OHC)]
• Hair cells on BM convert motion into nerve impulses (firings)
• Inner Hair Cells detect motion
• Outer Hair Cells? Variable damping? [Allen simulation]

Inner Hair Cells
• IHCs convert BM motion into nerve firings
• Human ear has ~3500 IHCs; each IHC has ~7 connections to the Auditory Nerve
• Each nerve fires (sometimes) near peak displacement:
  [Figure: local BM displacement and typical nerve signal (mV) vs. time / ms]
• Histogram to get firing probability:
  [Figure: firing count vs. cycle angle]
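The histogram of firings against cycle angle can be illustrated with a toy simulation. This is only a sketch under stated assumptions: firing probability is taken to be proportional to the half-wave rectified displacement of a pure tone, and all parameter values (`freq_hz`, `peak_prob`, `n_bins`, and so on) are hypothetical, not taken from the lecture.

```python
import math
import random


def period_histogram(freq_hz=500.0, duration_s=2.0, fs=20000,
                     peak_prob=0.02, n_bins=8, seed=0):
    """Count simulated firings by position within the stimulus cycle.

    Assumption: the chance of a firing at each sample is proportional to
    the half-wave rectified sinusoidal displacement.
    """
    rng = random.Random(seed)
    counts = [0] * n_bins
    for n in range(int(duration_s * fs)):
        phase = (freq_hz * n / fs) % 1.0                 # cycle position in [0, 1)
        drive = max(0.0, math.sin(2 * math.pi * phase))  # rectified displacement
        if rng.random() < peak_prob * drive:             # fires only sometimes
            counts[int(phase * n_bins)] += 1
    return counts


counts = period_histogram()
for b, c in enumerate(counts):
    print(f"cycle angle bin {b}: {c} firings")
```

The firings cluster in the bins around peak displacement and are absent in the half-cycle where the displacement is negative, which is the shape the firing-count histogram is meant to reveal.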

Auditory nerve (AN) signals
• Single nerve measurements:
  [Figure: tone burst histogram, spike count vs. time for a 100 ms tone burst]
  [Figure: frequency threshold curve, dB SPL vs. (log) frequency from 100 Hz to 10 kHz (approx. constant-Q)]
  [Figure: rate vs. intensity, spikes/sec (0 to 300) vs. intensity / dB SPL (0 to 100)]
  - One fiber: ~ 25 dB dynamic range
  - Hearing dynamic range > 100 dB
• Hard to measure: probe living ANs

AN population response
• All the information the brain has about sound:
  - average rate & spike timings on 30,000 fibers
• Not unlike a (constant-Q) spectrogram?
  [Figure: freq / 8ve re 100 Hz (0 to 5) vs. time / ms (0 to 60)]

Beyond the auditory nerve
• Ascending and descending
• Tonotopic × ?
  - modulation
  - position
  - source??

Periphery models
[Figure: sound → outer/middle ear filtering → cochlea filterbank → IHC (one per channel)]
• Modeled aspects:
  - outer/middle ear
  - cochlea filtering
  - hair cell transduction
  - efferent feedback?
• Result: ‘neurogram’ / ‘cochleagram’
  [Figure: SlaneyPatterson, 12 chans/oct from 180 Hz, BBC1tmp (20010218); channel (0 to 60) vs. time / s (0 to 0.5)]
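As a small illustration of the filterbank layout named in the figure caption (12 channels per octave starting at 180 Hz), the channel center frequencies could be laid out as below. The channel count of 60 is an assumption taken from the figure's axis range, and the function name is hypothetical; this is not the Slaney or Patterson implementation.

```python
def channel_center_frequencies(f_low_hz=180.0, chans_per_octave=12, n_channels=60):
    """Geometrically spaced center frequencies (constant ratio between channels).

    n_channels=60 is an assumption based on the figure's axis range.
    """
    return [f_low_hz * 2.0 ** (k / chans_per_octave) for k in range(n_channels)]


cfs = channel_center_frequencies()
print(f"channel  0: {cfs[0]:7.1f} Hz")
print(f"channel 12: {cfs[12]:7.1f} Hz")   # one octave above the lowest channel
print(f"channel 59: {cfs[-1]:7.1f} Hz")
```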

Outline
1 Motivation
2 Physiology
3 Psychophysics
  - Detection theory modeling
  - Intensity perception
  - Masking
4 Pitch perception
5 Scene analysis
6 Speech perception

3 Psychophysics
• Physiology looks at the implementation; psychology looks at the function/behavior
  [Expression lost in extraction: involves p, ω and O]
• Analyze audition as signal detection:
  - psychological tests reflect internal decisions
  - assume optimal decision process
  - infer nature of internal representations, noise, ...
  → lower bounds on more complex functions
• Different aspects to measure
  - time, frequency, intensity
  - tones, complexes, noise
  - binaural
  - pitch, detuning

Basic psychophysics
• Relate physical and perceptual variables
  - e.g. intensity → loudness, frequency → pitch
• Methodology: subject tests
  - just noticeable difference (jnd)
  - magnitude scaling, e.g. ‘adjust to twice as loud’
• Results for loudness vs. intensity:
  - Weber’s law: $\Delta I \propto I \rightarrow \log(L) = k \cdot \log(I)$
  - Textbook figure: $L \propto I^{0.3}$, so
    $\log_2(L) = 0.3 \log_2(I) = 0.3 \cdot \frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2} \cdot \frac{\mathrm{dB}}{10} \approx \frac{\mathrm{dB}}{10}$
  - Power law fit: $L \propto I^{0.22}$
  [Figure: Hartmann (1993), classroom loudness scaling data; log(loudness rating) (1.4 to 2.6) vs. sound level / dB (-20 to 10)]
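The power-law relation between loudness and intensity can be checked numerically. The sketch below assumes only the two exponents quoted on the slide (0.3 for the textbook figure, 0.22 for the power-law fit); the function name `loudness_ratio` is hypothetical.

```python
def loudness_ratio(level_change_db: float, exponent: float = 0.3) -> float:
    """Loudness ratio implied by L proportional to I**exponent for a level change in dB."""
    intensity_ratio = 10.0 ** (level_change_db / 10.0)  # dB -> intensity ratio
    return intensity_ratio ** exponent


for exponent in (0.3, 0.22):
    ratio = loudness_ratio(10.0, exponent)
    print(f"exponent {exponent}: +10 dB -> loudness x {ratio:.3f}")
```

With the textbook exponent of 0.3, a 10 dB increase comes out very close to a doubling of loudness, which is the content of the log2(L) ≈ dB/10 relation; the 0.22 fit gives a smaller ratio for the same step.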

Loudness as a function of frequency
• Fletcher-Munson equal-loudness curves:
  [Figure: intensity / dB SPL (0 to 120) vs. freq / Hz (100 to 10,000)]
  [Figure: equivalent loudness @ 1 kHz vs. intensity / dB, in two panels (100 Hz and 1 kHz), with a region of ‘rapid loudness growth’ marked]
• Hearing impairment: exaggerates

Loudness as a function of bandwidth
• Same total energy, different distribution:
  [Figure: two spectra (mag vs. freq) with the same total energy I·B, one with intensity I0 over bandwidth B0 and one with I1 over B1; the wider one is perceived as louder]
  [Figure: loudness vs. bandwidth B, with the ‘critical’ bandwidth marked]
  - e.g. 2 chans at -6 dB (not -10 dB)
• Critical bands: independent freq. channels
  - ~ 25 total (4-6 / octave)
[sndex]

Simultaneous masking
• A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
  [Figure: intensity / dB vs. log freq, showing the masking tone, the absolute threshold and the masked threshold]
• Suggests an ‘internal noise’ model:
  [Figure: distributions p(x | I) and p(x | I + ∆I) of the decision variable x, with internal noise σ_n]
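The internal noise model lends itself to a short worked example. The sketch below assumes that the decision variable x is Gaussian with standard deviation σ_n and that raising the intensity by ∆I shifts its mean by ∆I; these distributional choices and the function names are assumptions for illustration. Under them, the standard equal-variance signal detection result gives the probability of picking the louder of two intervals.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def d_prime(delta_i: float, sigma_n: float) -> float:
    """Separation of p(x | I) and p(x | I + dI) in units of the internal noise."""
    return delta_i / sigma_n


def prob_correct_two_interval(delta_i: float, sigma_n: float) -> float:
    """Chance an ideal observer picks the louder of two intervals.

    Assumes independent, equal-variance Gaussian noise in each interval.
    """
    return normal_cdf(d_prime(delta_i, sigma_n) / math.sqrt(2.0))


for delta in (0.0, 0.5, 1.0, 2.0):
    p = prob_correct_two_interval(delta, sigma_n=1.0)
    print(f"dI = {delta:3.1f} sigma_n -> P(correct) = {p:.3f}")
```

When ∆I is zero the observer is at chance (0.5); performance rises as the increment grows relative to σ_n, so a measured threshold can be read as an estimate of the internal noise.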

Sequential masking
• Backward/forward in time:
  [Figure: intensity / dB vs. time, showing the masker envelope and the masked threshold; simultaneous masking ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
  - suggests temporal envelope of decision var.
→ Time-frequency masking ‘skirt’:
  [Figure: masked threshold around the masking tone, intensity vs. freq and time]

What we do and don’t hear
“Two-interval forced-choice”: present A, B, then X; is X = A or B?
• Timing: 2 ms attack resolution, 20 ms discrimination
  - but: spectral splatter
• Tuning: ~ 1% discrimination
  - but: beats
• Spectrum: profile changes, formants
  - variable time-frequency resolution
• Harmonic phase
• Noisy signals & texture
• (Trace vs. categorical memory)

Outline
1 Motivation
2 Physiology
3 Psychophysics
4 Pitch perception
  - ‘Place’ models
  - ‘Time’ models
  - Multiple cues & competition
5 Scene analysis
6 Speech perception

4 Pitch perception: a classic argument in psychophysics
• Harmonic complexes are a pattern on AN
  [Figure: freq. chan. (10 to 70) vs. time / s (0 to 0.1)]
  - ... but give a fused percept (ecological)
• What determines the pitch percept?
  - not the fundamental
• How is it computed?
Two competing models: place and time
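The point that pitch is not set by the fundamental can be illustrated with a toy ‘time’ model. The sketch below builds a complex from harmonics 3, 4 and 5 of 200 Hz (so there is no energy at 200 Hz) and picks the lag with the largest autocorrelation. Autocorrelation is used here as one common way to express a time-domain pitch model, not necessarily the formulation used in the lecture; the signal parameters and function names are assumptions.

```python
import math


def harmonic_complex(f0_hz, harmonics, fs, n_samples):
    """Sum of equal-amplitude cosines at the given harmonic numbers of f0_hz."""
    return [sum(math.cos(2 * math.pi * h * f0_hz * n / fs) for h in harmonics)
            for n in range(n_samples)]


def autocorrelation_pitch(signal, fs, min_hz=125.0, max_hz=500.0):
    """Return fs / lag for the lag with the largest raw autocorrelation."""
    min_lag = int(fs / max_hz)
    max_lag = int(fs / min_hz)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


fs = 8000
# Harmonics 3, 4, 5 of 200 Hz: components at 600, 800 and 1000 Hz only.
x = harmonic_complex(200.0, harmonics=(3, 4, 5), fs=fs, n_samples=400)
print(f"estimated pitch: {autocorrelation_pitch(x, fs):.1f} Hz")
```

The estimate lands on the common period of the components, i.e. the missing 200 Hz fundamental, even though no component sits at that frequency. The search range (125 to 500 Hz) is a deliberate choice to keep the example to a single candidate period.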
