E6820 SAPR - Dan Ellis L04 - Perception 2002-02-18 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 4: Auditory Perception
1 Motivation: Why & how
2 Auditory physiology
3 Psychophysics: detection & discrimination
4 Pitch perception
5 Auditory organization &
Why study perception?
- Perception is messy: Can we avoid it?
No!
- Audition provides the ‘ground truth’ in audio
- what is relevant and irrelevant
- subjective importance of distortion (coding etc.)
- (there could be other information in sound...)
- Some sounds are ‘designed’ for audition
- co-evolution of speech and hearing
- The auditory system is very successful
- we would do extremely well to duplicate it
- We are now able to model complex systems
- faster computers, bigger memories
How to study perception?
Three different approaches:
- Analyze the example: physiology
- dissection & nerve recordings
- Black box input/output: psychophysics
- fit simple models of simple functions
- Information processing models
- investigate and model complex functions
- e.g. scene analysis, speech perception
Outline
1 Motivation
2 Physiology
- Outer, middle & inner ear
- The Auditory Nerve and beyond
- Models
3 Psychophysics
4 Pitch perception
5 Scene analysis
6 Speech perception
Physiology
- Processing chain from air to brain:
Outer ear → Middle ear → Inner ear → Auditory nerve → Midbrain → Cortex
- Study via:
- anatomy
- nerve recordings
- Signals flow in both directions
Outer & middle ear
- Pinna ‘horn’
- complex reflections give spatial (elevation) cues
- Ear canal
- acoustic tube
- Middle ear
- bones provide impedance matching
[Figure: ear anatomy showing pinna, ear canal, eardrum (tympanum), middle ear bones]
Inner ear: Cochlea
- Mechanical input from middle ear starts a traveling wave moving down the Basilar Membrane (BM)
- Varying stiffness and mass of BM results in continuous variation of resonant frequency
- At resonance, traveling wave energy is dissipated in BM movement → Frequency (Fourier) analysis
[Figure: cochlea with oval window (from ME bones) and Basilar Membrane (BM); travelling wave; resonant frequency vs. position, 16 kHz to 50 Hz over 35 mm]
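The position-to-frequency relation on this slide can be sketched numerically. The block below is only an illustration: it assumes a purely exponential (log-frequency) map between the two endpoints quoted in the figure (16 kHz at the oval window, 50 Hz at 35 mm), which is a simplification and not a claim about the real cochlear map. The function names are made up for this example.

```python
import math

BM_LENGTH_MM = 35.0   # basilar membrane length quoted in the figure
F_BASE_HZ = 16000.0   # resonant frequency at the oval-window end
F_APEX_HZ = 50.0      # resonant frequency at the far end

def place_to_freq(x_mm):
    """Resonant frequency at distance x_mm from the oval window.

    Assumption: frequency falls exponentially with distance.
    """
    frac = x_mm / BM_LENGTH_MM
    return F_BASE_HZ * (F_APEX_HZ / F_BASE_HZ) ** frac

def freq_to_place(f_hz):
    """Inverse of place_to_freq: distance (mm) at which f_hz resonates."""
    return BM_LENGTH_MM * math.log(f_hz / F_BASE_HZ) / math.log(F_APEX_HZ / F_BASE_HZ)
```

Under this assumed map, equal distances along the BM correspond to equal frequency ratios, which is one way to read the "(log) frequency" axes used later in the lecture.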
Cochlea hair cells
- Ear converts sound into BM motion; each point on BM corresponds to a frequency
- Hair cells on BM convert motion into nerve impulses (firings)
- Inner Hair Cells detect motion
- Outer Hair Cells? Variable damping?
[Figure: cochlea cross-section showing basilar membrane, tectorial membrane, Inner Hair Cell (IHC), Outer Hair Cell (OHC), auditory nerve] [Allen simulation]
Inner Hair Cells
- IHCs convert BM motion into nerve firings
- Human ear has ~3500 IHCs; each IHC has ~7 connections to Auditory Nerve
- Each nerve fires (sometimes) near peak displacement:
[Figure: local BM displacement and typical nerve signal (mV) vs. time / ms]
- Histogram to get firing probability:
[Figure: firing count vs. cycle angle]
Auditory nerve (AN) signals
- Single nerve measurements:
- Hard to measure: probe living ANs
[Figure: Frequency threshold: dB SPL (20 to 80) vs. (log) frequency, 100 Hz to 10 kHz (approx. constant-Q)]
[Figure: Tone burst histogram: spike count vs. time for a 100 ms tone burst]
[Figure: Rate vs. intensity: spikes/sec (100 to 300) vs. intensity / dB SPL (20 to 100)]
One fiber: ~ 25 dB dynamic range; hearing dynamic range > 100 dB
AN population response
- All the information the brain has about sound:
- average rate & spike timings on 30,000 fibers
- Not unlike a (constant-Q) spectrogram?
[Figure: AN population response, freq / 8ve re 100 Hz (1 to 5) vs. time / ms (10 to 60)]
Beyond the auditory nerve
- Ascending and descending
- Tonotopic × ?
- modulation, position, source??
Periphery models
- Modeled aspects:
- outer/middle ear
- cochlea filtering
- hair cell transduction
- efferent feedback?
- Result: ‘neurogram’ / ‘cochleagram’
[Figure: periphery model: Sound → outer/middle ear filtering → cochlea filterbank → IHC per channel]
[Figure: cochleagram, channel (10 to 60) vs. time / s (0.1 to 0.5); SlaneyPatterson, 12 chans/oct from 180 Hz, BBC1tmp (20010218)]
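The channel spacing quoted in the cochleagram caption (12 channels per octave starting at 180 Hz) can be turned into a list of center frequencies. This is a minimal sketch: the count of 60 channels is an assumption read off the axis ticks, the function name is hypothetical, and nothing here reproduces the filters of the model itself.

```python
def filterbank_centers(f_low_hz=180.0, chans_per_octave=12, n_channels=60):
    """Geometrically spaced center frequencies, lowest channel first.

    n_channels=60 is an assumption taken from the figure's channel axis.
    """
    return [f_low_hz * 2.0 ** (k / chans_per_octave) for k in range(n_channels)]

centers_hz = filterbank_centers()
```

With this spacing, every 12th channel doubles in frequency, so channel index maps linearly onto octaves above 180 Hz.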
Outline
1 Motivation
2 Physiology
3 Psychophysics
- Detection theory modeling
- Intensity perception
- Masking
4 Pitch perception
5 Scene analysis
6 Speech perception
Psychophysics
- Physiology looks at the implementation; Psychology looks at the function/behavior
- Analyze audition as signal detection:
- psychological tests reflect internal decisions
- assume optimal decision process
- infer nature of internal representations, noise, ...
→ lower bounds on more complex functions
- Different aspects to measure
- time, frequency, intensity
- tones, complexes, noise
- binaural
- pitch, detuning
$p(\omega \mid O)$
Basic psychophysics
- Relate physical and perceptual variables
- e.g. intensity → loudness, frequency → pitch
- Methodology: subject tests
- just noticeable difference (jnd)
- magnitude scaling e.g. ‘adjust to twice as loud’
- Results for Loudness vs. Intensity:
Weber’s law: ∆I ∝ I → log(L) = k·log(I)
[Figure: Hartmann (1993) classroom loudness scaling data, log(loudness rating) vs. sound level / dB; power law fit: L ∝ I^0.22; textbook figure: L ∝ I^0.3]
$\log_2(L) = 0.3 \log_2(I) = 0.3 \cdot \frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2} \cdot \frac{\mathrm{dB}}{10} \approx \mathrm{dB}/10$
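The power-law relation on this slide can be checked with a few lines of arithmetic. The sketch below assumes loudness follows L ∝ I^p exactly, with p = 0.3 (textbook) or p = 0.22 (classroom fit); the function names are made up for this example.

```python
import math

def loudness_ratio(delta_db, exponent=0.3):
    """Loudness ratio produced by a level change of delta_db, assuming L ∝ I**exponent."""
    intensity_ratio = 10.0 ** (delta_db / 10.0)
    return intensity_ratio ** exponent

def db_for_doubling(exponent=0.3):
    """Level increase (dB) needed to double loudness under the same power law."""
    return 10.0 * math.log10(2.0) / exponent

textbook_ratio = loudness_ratio(10.0, exponent=0.3)    # close to 2
classroom_ratio = loudness_ratio(10.0, exponent=0.22)  # noticeably below 2
```

With the textbook exponent, a 10 dB increase comes out very close to a doubling of loudness, which is the rule of thumb the equation chain arrives at; the shallower classroom exponent needs a larger level change for the same doubling.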
Loudness as a function of frequency
- Fletcher-Munson equal-loudness curves:
- Hearing impairment: exaggerates
[Figure: equal-loudness curves, intensity / dB SPL (20 to 120) vs. freq / Hz (100 to 10,000); panels for 100 Hz and 1 kHz plot equivalent loudness @ 1 kHz vs. intensity / dB, annotated ‘rapid loudness growth’]
Loudness as a function of bandwidth
- Same total energy, different distribution:
- e.g. 2 chans at -6 dB (not -10 dB)
- Critical bands: independent freq. channels
- ~ 25 total (4-6 / octave) [sndex]
[Figure: two spectra (mag vs. freq) with the same total energy I·B (I0·B0 vs. I1·B1) ... but the wider one is perceived as louder; loudness vs. bandwidth B, with the ‘critical’ bandwidth marked]
Simultaneous masking
- A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
[Figure: intensity / dB vs. log freq, showing absolute threshold, masking tone, and masked threshold]
- Suggests an ‘internal noise’ model:
[Figure: decision variable x with internal noise of spread σn around I; distributions p(x | I) and p(x | I+∆I)]
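The ‘internal noise’ picture can be made concrete with standard signal detection theory. The sketch below assumes the decision variable x is Gaussian with mean equal to the stimulus intensity and a fixed standard deviation σn, and that the listener is tested with two intervals and picks the one with the larger x. These modeling choices and the function names are assumptions for illustration, not taken from the slides.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def d_prime(delta_i, sigma_n):
    """Separation of p(x | I) and p(x | I + delta_i) in units of the noise spread."""
    return delta_i / sigma_n

def p_correct_2ifc(delta_i, sigma_n):
    """Probability of picking the louder interval when both intervals carry
    independent Gaussian internal noise (equal variance assumed)."""
    return norm_cdf(d_prime(delta_i, sigma_n) / math.sqrt(2.0))
```

For example, `p_correct_2ifc(0.0, 1.0)` is chance (0.5), and the score rises towards 1 as ∆I grows relative to σn. A just-noticeable difference then corresponds to whatever ∆I reaches a chosen criterion such as 75% correct.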
Sequential masking
- Backward/forward in time:
- suggests temporal envelope of decision var.
→ Time-frequency masking ‘skirt’:
[Figure: intensity / dB vs. time, showing masker envelope and masked threshold: simultaneous masking ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
[Figure: masking tone and masked threshold over time, freq and intensity]
What we do and don’t hear
- Timing: 2ms attack resolution, 20ms discrim
- but: spectral splatter
- Tuning: ~ 1% discrimination
- but: beats
- Spectrum: profile changes, formants
- variable time-frequency resolution
- Harmonic phase
- Noisy signals & texture
- (Trace vs. categorical memory)
[Figure: “two-interval forced-choice” test: stimuli A, B, X in time; X = A or B?]
Outline
1 Motivation
2 Physiology
3 Psychophysics
4 Pitch perception
- ‘Place’ models
- ‘Time’ models
- Multiple cues & competition
5 Scene analysis
6 Speech perception
Pitch perception: A classic argument in psychophysics
- Harmonic complexes are a pattern on AN
- .. but give a fused percept (ecological)
- What determines the pitch percept?
- not the fundamental
- How is it computed?
Two competing models: place and time
[Figure: freq. chan. (10 to 70) vs. time/s (0.05, 0.1)]
Place model of pitch
- AN excitation pattern shows individual peaks
[Figure: AN excitation vs. frequency channel: resolved harmonics at low frequencies; broader HF channels cannot resolve harmonics]
- ‘Pattern matching’ method to find pitch: correlate with harmonic ‘sieve’
[Figure: pitch strength vs. frequency channel]
- Support: Low harmonics are very important
- But: Flat-spectrum noise can carry pitch
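The harmonic ‘sieve’ idea can be sketched as a small pattern matcher over resolved peak frequencies. This is an illustrative toy, not the model from the lecture: the Gaussian tolerance, the rule of preferring the highest of equally good candidates (to avoid subharmonics), and all function names are assumptions.

```python
import math

def sieve_score(peaks_hz, f0, rel_tol=0.01):
    """How well the peaks fall through a sieve with slots at multiples of f0.

    Each peak contributes a weight near 1 when it sits on a harmonic of f0
    and near 0 when it is off by much more than rel_tol (relative error).
    """
    score = 0.0
    for f in peaks_hz:
        n = max(1, round(f / f0))        # nearest harmonic number
        rel_err = (f - n * f0) / f
        score += math.exp(-0.5 * (rel_err / rel_tol) ** 2)
    return score

def sieve_pitch(peaks_hz, candidates_hz):
    """Best-matching candidate f0; ties go to the highest candidate."""
    return max(candidates_hz,
               key=lambda f0: (round(sieve_score(peaks_hz, f0), 6), f0))

# Resolved harmonics 3, 4, 5 of 200 Hz, with no peak at 200 Hz itself
estimated_f0 = sieve_pitch([600.0, 800.0, 1000.0], range(50, 501))
```

The example input has no component at the fundamental, yet the sieve still settles on the common spacing of the peaks, which is the sense in which low resolved harmonics carry the pitch in a place model.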
Time model of pitch
- Timing information is preserved in AN up to ~ 1ms scale
- Extract periodicity by e.g. autocorrelation & combine across frequency chans:
[Figure: per-channel autocorrelation (freq vs. lag / ms, 10 to 30) and summary autocorrelation, with peak at the common period (pitch)]
- But: HF channels give weak pitch
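The autocorrelation approach can be sketched with NumPy. The example below skips the cochlear filterbank and hair-cell stages entirely and feeds in two synthetic "channels", each holding a pair of harmonics of 200 Hz; that shortcut, the lag search range, and the function names are assumptions for illustration only.

```python
import numpy as np

def summary_autocorrelation(channels):
    """Sum of per-channel autocorrelations over lags 0..N-1."""
    n = len(channels[0])
    sacf = np.zeros(n)
    for x in channels:
        x = np.asarray(x, dtype=float)
        # 'full' correlation covers lags -(N-1)..(N-1); keep the non-negative half
        sacf += np.correlate(x, x, mode="full")[n - 1:]
    return sacf

def pitch_from_sacf(sacf, sr, min_lag_s=0.002, max_lag_s=0.015):
    """Pick the strongest lag within an assumed pitch range and return it in Hz."""
    lo = int(round(min_lag_s * sr))
    hi = int(round(max_lag_s * sr))
    lag = lo + int(np.argmax(sacf[lo:hi + 1]))
    return sr / lag

# Synthetic input: harmonics 3..6 of 200 Hz, no energy at 200 Hz itself
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
channels = [
    np.sin(2 * np.pi * 600 * t) + np.sin(2 * np.pi * 800 * t),
    np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1200 * t),
]
pitch_hz = pitch_from_sacf(summary_autocorrelation(channels), sr)
```

Each channel's autocorrelation peaks at multiples of its own components' common period, and summing across channels reinforces the lag they all share (here 5 ms).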
Alternate & competing cues
- Pitch perception could rely on various cues
- average excitation pattern
- summary autocorrelation
- more complex pattern matching
- Relying on just one cue is brittle
- e.g. missing fundamental
→ Perceptual system appears to use a flexible, opportunistic combination
- Optimal detector justification?
$\hat{\omega} = \arg\max_{\omega} p(\omega \mid o) = \arg\max_{\omega} \frac{p(o \mid \omega) \cdot p(\omega)}{p(o)} = \arg\max_{\omega} p(o_1 \mid \omega) \cdot p(o_2 \mid \omega) \cdot p(\omega)$
if o1 and o2 are conditionally independent
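The optimal-detector argument amounts to multiplying per-cue likelihoods with the prior and picking the largest product. The sketch below uses invented numbers for two candidate pitches and two cues (labeled "place" and "time" only for illustration); the function name and all values are hypothetical.

```python
def combine_cues(prior, lik_cue1, lik_cue2):
    """Posterior over hypotheses given two conditionally independent cues.

    Each argument maps hypothesis -> probability (or likelihood).
    Returns (best hypothesis, normalized posterior).
    """
    scores = {w: lik_cue1[w] * lik_cue2[w] * prior[w] for w in prior}
    total = sum(scores.values())          # plays the role of p(o)
    posterior = {w: s / total for w, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Hypothetical example: is the pitch 100 Hz or 200 Hz?
prior = {100: 0.5, 200: 0.5}
place_cue = {100: 0.4, 200: 0.6}   # p(o1 | w), invented values
time_cue = {100: 0.2, 200: 0.8}    # p(o2 | w), invented values
best_pitch, posterior = combine_cues(prior, place_cue, time_cue)
```

Dividing by the total does not change which hypothesis wins, which is why p(o) drops out of the argmax; it is kept here only so the posterior sums to 1.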
Outline
1 Motivation
2 Physiology
3 Psychophysics
4 Pitch perception
5 Scene analysis
- Events and sources
- Fusion and streaming
- Continuity & restoration
6 Speech perception
Auditory Organization
- Detection model is a huge simplification
- Real role of hearing is much more general: Recover useful information from outside world
→ Sound organization into events and sources:
- Research questions:
- what determines perception of sources?
- how do humans separate mixtures?
- how much can we tell about a source?
[Figure: spectrogram, frq/Hz (2000, 4000) vs. time/s (2, 4), with labeled events: Voice, Stab, Rumble]
Auditory scene analysis: Simultaneous fusion
- Harmonics are distinct on AN,
but perceived as one sound (“fused”):
- depends on common onset
- depends on harmonicity (common period)
- Methodologies:
- ask subject how many ‘objects’
- match attributes e.g. object pitch
- manipulate higher level e.g. vowel identity
[Figure: freq vs. time]
Sequential grouping: streaming
- Pattern / rhythm: property of set of objects
- subsequent to fusion ∴ employs fused events?
- Measure by relative timing judgments
- cannot compare between streams
- Separate ‘coherence’ and ‘fusion’ boundaries
- Can interact and compete with fusion
[Figure: tone sequence, frequency vs. time, around 1 kHz; ∆f: –2 octaves; TRT: 60-150 ms] [sndex]
Continuity & restoration
- Tone is interrupted by noise burst:
What happened?
- masking makes tone undetectable during noise
- Need to infer most probable real-world events
- observation equally likely for either explanation
- prior on continuous tone much higher → choose
- Top-down influence on perceived events...
[Figure: tone interrupted by noise bursts, freq vs. time; pulsation threshold] [sndex]
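The inference described on this slide can be written as posterior odds in standard Bayes notation. The symbols below are labels introduced here for illustration: C is the hypothesis "tone continues through the noise", G is "tone has a gap", and o is the observation.

```latex
\frac{P(C \mid o)}{P(G \mid o)}
  = \frac{p(o \mid C)}{p(o \mid G)} \cdot \frac{P(C)}{P(G)}
  \approx 1 \cdot \frac{P(C)}{P(G)} \gg 1
```

Because masking makes the observation about equally likely under both explanations, the likelihood ratio is close to 1 and the decision is carried by the prior, which favors the continuous tone.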
Models of auditory organization
- Psychological accounts suggest bottom-up:
- (Brown 1991)
- Complications in practice:
- formation of separate elements
- contradictory cues
- influence of top-down constraints (context,
expectations) ...
[Figure: input mixture → Front end → signal features (maps: onset, period, frq.mod; freq vs. time) → Object formation → discrete objects → Grouping rules → Source groups]
Outline
1 Motivation
2 Physiology
3 Psychophysics
4 Pitch perception
5 Scene analysis
6 Speech perception
- The sounds of speech
- Phoneme perception
- Context and top-down influences
- Simultaneous vowels
Speech perception
- Highly specialized function
- subsequent to source organization?
- .. but also can interact
- Kinds of speech sounds:
- vowels
- glides
- nasals
- stops
- fricatives
...
[Figure: level/dB (20 to 60) and spectrogram, freq / Hz (1000 to 4000) vs. time/s (1.4 to 2.6), of “has a watch thin as a dime”, with labeled stop burst, fricative, vowel, nasal, glide]
Cues to phoneme perception
- Linguists describe speech with phonemes:
- phonemes define minimal word contrasts
- Acoustic-phoneticians describe phonemes by:
- formants & transitions
- bursts & onset times
[Figure: spectrogram (freq vs. time) of “has a watch thin as a dime” with phone labels, marking vowel formants, transition, stop burst, voicing onset time]
Categorical perception
- (Some) speech sounds are perceived categorically rather than analogically
- e.g. stop-burst & timing:
- tokens within category are hard to distinguish
- category boundaries are very sharp
- Categories are learned for native tongue
- “merry” / “mary” / “marry”
[Figure: perceived stop (P, T, K) as a function of burst freq fb / Hz (1000 to 4000) and following vowel; inset: stop burst at fb and vowel formants, freq vs. time]
Where is the information in speech?
- ‘Articulation’ of high/low-pass filtered speech:
- sums to more than 1...
- Speech message is highly redundant
- e.g. constraints of language, context
→ listeners can understand with very few cues
[Figure: Articulation / % (20 to 80) vs. cutoff freq / Hz (1000 to 4000) for high-pass and low-pass filtered speech]
Top-down influences: Phonemic restoration
(Warren 1970)
- What if a noise burst obscures speech?
- auditory system ‘restores’ the missing phoneme
... based on semantic context ... even in retrospect!
- Subjects are typically unaware of which
sounds are restored
[Figure: spectrogram, freq / Hz (1000 to 4000) vs. time / s (1.4 to 2.6)]
A predisposition for speech: Sinewave replicas
(Remez et al. 1994)
- Replace each formant with a single sinusoid:
- speech is (somewhat) intelligible
- people hear both whistles and speech (“duplex”)
- processed as speech despite being un-speech-like
- What does it take to be speech?
[Figure: spectrograms of Speech and Sines, freq / Hz (1000 to 5000) vs. time / s (0.5 to 3)]
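The construction "replace each formant with a single sinusoid" can be sketched as a sum of time-varying sinusoids. The code below assumes formant frequency and amplitude tracks are already available per analysis frame (formant tracking itself is not shown), interpolates them linearly to the sample rate, and integrates frequency to get phase. The function name, the track layout, and the example numbers are all hypothetical.

```python
import numpy as np

def sinewave_replica(formant_hz, formant_amp, sr, hop):
    """Sum of sinusoids following per-frame formant tracks.

    formant_hz, formant_amp: arrays of shape (n_frames, n_formants)
    sr: sample rate in Hz; hop: samples between frames.
    """
    formant_hz = np.asarray(formant_hz, dtype=float)
    formant_amp = np.asarray(formant_amp, dtype=float)
    n_frames, n_formants = formant_hz.shape
    frame_pos = np.arange(n_frames) * hop
    t = np.arange((n_frames - 1) * hop + 1)
    out = np.zeros(len(t))
    for k in range(n_formants):
        freq = np.interp(t, frame_pos, formant_hz[:, k])   # Hz at each sample
        amp = np.interp(t, frame_pos, formant_amp[:, k])
        phase = 2 * np.pi * np.cumsum(freq) / sr           # running phase
        out += amp * np.sin(phase)
    return out

# Invented two-frame, three-formant track purely for illustration
tracks_hz = [[500, 1500, 2500], [600, 1400, 2500]]
tracks_amp = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
replica = sinewave_replica(tracks_hz, tracks_amp, sr=8000, hop=80)
```

Accumulating phase rather than computing sin(2πft) directly keeps each sinusoid continuous while its frequency glides between frames.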
Simultaneous vowels
- Mix synthetic vowels with different f0s:
- Pitch difference helps (though not necessary):
[Figure: spectra (dB vs. freq): /iy/ @ 100 Hz + /ah/ @ 125 Hz = mixture]
[Figure: % both vowels correct (25 to 75) vs. ∆f0 (semitones: 1/4, 1/2, 1, 2, 4)]
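The ∆f0 axis is in semitones, so it may help to convert the example pair of fundamentals (100 Hz and 125 Hz) into that unit. A minimal sketch, assuming the usual equal-tempered definition of 12 semitones per octave; the function name is made up.

```python
import math

def semitones_between(f_a_hz, f_b_hz):
    """Interval from f_a_hz up to f_b_hz in equal-tempered semitones."""
    return 12.0 * math.log2(f_b_hz / f_a_hz)

example_df0 = semitones_between(100.0, 125.0)
```

The example pair is separated by a little under 4 semitones, which places it near the upper end of the plotted ∆f0 range.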
Computational models of speech perception
- Various theoretical and practical models of speech comprehension, e.g.:
Speech → Phoneme recognition → Lexical access → Grammar constraints → Words
- Open questions:
- mechanism of phoneme classification
- mechanism of lexical recall
- mechanism of grammar constraints
- ASR is a practical implementation (?)
Summary
- Auditory perception provides the ‘ground truth’
underlying audio processing
- Physiology specifies information available
- Psychophysics measures basic sensitivities
- Sound sources require further organization
- Strong contextual effects in speech perception