EE E6820: Speech & Audio Processing & Recognition Lecture 4: Auditory Perception



SLIDE 1

E6820 SAPR - Dan Ellis L04 - Perception 2002-02-18 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 4: Auditory Perception

1. Motivation: Why & how
2. Auditory physiology
3. Psychophysics: detection & discrimination
4. Pitch perception
5. Auditory organization & scene analysis
6. Speech perception

Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/e6820/

SLIDE 2

Why study perception?

  • Perception is messy: can we avoid it? No!
  • Audition provides the ‘ground truth’ in audio
      • what is relevant and irrelevant
      • subjective importance of distortion (coding etc.)
      • (there could be other information in sound...)
  • Some sounds are ‘designed’ for audition
      • co-evolution of speech and hearing
  • The auditory system is very successful
      • we would do extremely well to duplicate it
  • We are now able to model complex systems
      • faster computers, bigger memories

SLIDE 3

How to study perception?

Three different approaches:

  • Analyze the example: physiology
      • dissection & nerve recordings
  • Black box input/output: psychophysics
      • fit simple models of simple functions
  • Information processing models
      • investigate and model complex functions
      • e.g. scene analysis, speech perception

SLIDE 4

Outline

1. Motivation
2. Physiology
      • Outer, middle & inner ear
      • The Auditory Nerve and beyond
      • Models
3. Psychophysics
4. Pitch perception
5. Scene analysis
6. Speech perception

SLIDE 5

Physiology

  • Processing chain from air to brain:
    Outer ear → Middle ear → Inner ear → Auditory nerve → Midbrain → Cortex
  • Study via:
      • anatomy
      • nerve recordings
  • Signals flow in both directions

SLIDE 6

Outer & middle ear

  • Pinna ‘horn’
      • complex reflections give spatial (elevation) cues
  • Ear canal
      • acoustic tube
  • Middle ear
      • bones provide impedance matching

[Figure: outer and middle ear, labeled: pinna, ear canal, eardrum (tympanum), middle ear bones]

SLIDE 7

Inner ear: Cochlea

  • Mechanical input from the middle ear starts a traveling wave moving down the Basilar Membrane (BM)
  • Varying stiffness and mass of the BM results in a continuous variation of resonant frequency
  • At resonance, traveling wave energy is dissipated in BM movement
    → Frequency (Fourier) analysis

[Figure: cochlea with the oval window (driven by the middle ear bones), the basilar membrane and the traveling wave; resonant frequency vs. position, from 16 kHz to 50 Hz over 35 mm]

SLIDE 8

Cochlea hair cells

  • Ear converts sound into BM motion; each point on the BM corresponds to a frequency
  • Hair cells on the BM convert motion into nerve impulses (firings)
  • Inner Hair Cells detect motion
  • Outer Hair Cells? Variable damping?

[Allen simulation]

[Figure: cochlea cross-section, labeled: basilar membrane, tectorial membrane, Inner Hair Cell (IHC), Outer Hair Cell (OHC), auditory nerve]

SLIDE 9

Inner Hair Cells

  • IHCs convert BM motion into nerve firings
  • Human ear has ~3500 IHCs; each IHC has ~7 connections to the Auditory Nerve
  • Each nerve fires (sometimes) near peak displacement:
    [Figure: local BM displacement and typical nerve signal (mV) vs. time / ms]
  • Histogram to get firing probability:
    [Figure: firing count vs. cycle angle]
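The histogram idea can be tried on simulated data. The sketch below is a toy model, not a physiological one: it assumes firing probability is proportional to the half-wave rectified displacement of a sinusoid, and the names `phase_histogram`, `p_max` and `n_bins` are illustrative choices, not from the lecture.

```python
import math
import random

def phase_histogram(n_trials=200_000, n_bins=16, p_max=0.2, seed=0):
    """Count simulated firings by stimulus cycle angle.

    Toy assumption: firing probability is p_max times the half-wave
    rectified displacement sin(angle), so the fiber never fires during
    the negative half of the cycle.
    """
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_trials):
        angle = rng.uniform(0.0, 2.0 * math.pi)      # random point in the cycle
        p_fire = p_max * max(0.0, math.sin(angle))    # fire only on positive displacement
        if rng.random() < p_fire:
            bin_index = min(int(angle / (2.0 * math.pi) * n_bins), n_bins - 1)
            counts[bin_index] += 1
    return counts

counts = phase_histogram()
total = sum(counts)
firing_prob = [c / total for c in counts]   # normalized histogram over cycle angle
peak_bin = max(range(len(counts)), key=counts.__getitem__)
```

Under this assumption the counts cluster around a quarter cycle (peak displacement) and vanish in the second half of the cycle, which is the shape the firing-count histogram is meant to reveal.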

SLIDE 10

Auditory nerve (AN) signals

  • Single nerve measurements:
    [Figure, three panels:
      Frequency threshold: dB SPL (20-80) vs. (log) frequency, 100 Hz to 10 kHz (approx. constant-Q)
      Tone burst histogram: spike count vs. time for a 100 ms tone burst
      Rate vs. intensity: spikes/sec (100-300) vs. intensity / dB SPL (20-100)]
    One fiber: ~ 25 dB dynamic range. Hearing dynamic range > 100 dB
  • Hard to measure: probe living ANs

SLIDE 11

AN population response

  • All the information the brain has about sound:
      • average rate & spike timings on 30,000 fibers
  • Not unlike a (constant-Q) spectrogram?

[Figure: AN population response, freq / 8ve re 100 Hz vs. time / ms]

SLIDE 12

Beyond the auditory nerve

  • Ascending and descending
  • Tonotopic × ?
      • modulation - position - source??

SLIDE 13

Periphery models

  • Modeled aspects:
      • outer/middle ear
      • cochlea filtering
      • hair cell transduction
      • efferent feedback?
  • Result: ‘neurogram’ / ‘cochleagram’

[Figure: block diagram: Sound → Outer/middle ear filtering → Cochlea filterbank → IHC (one per channel)]

[Figure: cochleagram, channel vs. time / s; SlaneyPatterson, 12 chans/oct from 180 Hz, BBC1tmp (20010218)]
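As a rough illustration of the channel layout quoted in the figure caption (12 channels per octave starting at 180 Hz), the sketch below computes log-spaced center frequencies and applies half-wave rectification as a crude stand-in for hair cell transduction. The channel count of 60 is an assumption read off the axis ticks, and both function names are hypothetical; this is not the Slaney/Patterson model itself.

```python
def channel_center_freqs(f_low=180.0, chans_per_octave=12, n_channels=60):
    """Log-spaced (constant-Q style) center frequencies in Hz.

    n_channels=60 is an assumption, not a value stated in the lecture.
    """
    return [f_low * 2.0 ** (k / chans_per_octave) for k in range(n_channels)]

def half_wave_rectify(samples):
    """Crude hair-cell stand-in: keep positive displacement only."""
    return [max(0.0, s) for s in samples]

cfs = channel_center_freqs()
octave_ratio = cfs[12] / cfs[0]          # 12 channels up should be one octave
rectified = half_wave_rectify([-1.0, 0.5, 0.0, 2.0])
```

A full periphery model would place a bandpass filter at each center frequency and smooth the rectified outputs; those stages are omitted here.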

SLIDE 14

Outline

1. Motivation
2. Physiology
3. Psychophysics
      • Detection theory modeling
      • Intensity perception
      • Masking
4. Pitch perception
5. Scene analysis
6. Speech perception

SLIDE 15

Psychophysics

  • Physiology looks at the implementation; Psychology looks at the function/behavior
  • Analyze audition as signal detection: p(ω | O)
      • psychological tests reflect internal decisions
      • assume optimal decision process
      • infer nature of internal representations, noise, ...
    → lower bounds on more complex functions
  • Different aspects to measure
      • time, frequency, intensity
      • tones, complexes, noise
      • binaural
      • pitch, detuning

SLIDE 16

Basic psychophysics

  • Relate physical and perceptual variables
      • e.g. intensity → loudness, frequency → pitch
  • Methodology: subject tests
      • just noticeable difference (jnd)
      • magnitude scaling e.g. ‘adjust to twice as loud’
  • Results for Loudness vs. Intensity:
    Weber’s law: ∆I ∝ I → log(L) = k·log(I)

[Figure: Hartmann (1993) classroom loudness scaling data, log(loudness rating) vs. sound level / dB. Power law fit: L ∝ I^0.22. Textbook figure: L ∝ I^0.3]

$$\log_2(L) = 0.3\,\log_2(I) = 0.3\,\frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2}\cdot\frac{\mathrm{dB}}{10} \approx \frac{\mathrm{dB}}{10}$$
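The power law can be checked numerically. The sketch below assumes L ∝ I^exponent exactly and converts a level change in dB to a loudness ratio; the function name `loudness_ratio` is illustrative. With the textbook exponent of 0.3, a 10 dB increase comes out very close to a doubling of loudness, which is what log2(L) ≈ dB/10 says.

```python
def loudness_ratio(level_change_db, exponent=0.3):
    """Loudness ratio L2/L1 for a level change in dB, assuming L ∝ I**exponent."""
    intensity_ratio = 10.0 ** (level_change_db / 10.0)   # dB = 10·log10(I2/I1)
    return intensity_ratio ** exponent

textbook = loudness_ratio(10.0)          # textbook exponent 0.3
classroom = loudness_ratio(10.0, 0.22)   # exponent from the classroom fit
```

With the smaller classroom exponent of 0.22, the same 10 dB step gives a noticeably smaller loudness ratio, so the "twice as loud per 10 dB" rule depends on which fit is used.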

SLIDE 17

Loudness as a function of frequency

  • Fletcher-Munson equal-loudness curves:
    [Figure: equal-loudness contours, intensity / dB SPL (20-120) vs. freq / Hz (100 to 10,000)]
  • Hearing impairment: exaggerates
    [Figure: equivalent loudness @ 1 kHz vs. intensity / dB, panels for 100 Hz and 1 kHz; annotation: rapid loudness growth]

SLIDE 18

Loudness as a function of bandwidth

  • Same total energy, different distribution:
      • e.g. 2 chans at -6 dB (not -10 dB)
    [Figure: two spectra (mag vs. freq), intensity I0 over bandwidth B0 and I1 over B1: same total energy I·B ... but wider perceived as louder]
    [Figure: loudness vs. bandwidth B, with the ‘critical’ bandwidth marked]
  • Critical bands: independent freq. channels
      • ~ 25 total (4-6 / octave) [sndex]

SLIDE 19

Simultaneous masking

  • A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
    [Figure: intensity / dB vs. log freq, showing the absolute threshold, the masking tone and the masked threshold]
  • Suggests an ‘internal noise’ model:
    [Figure: distributions p(x | I) and p(x | I+∆I) of the decision variable x, with internal noise of width σn around I]
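One common way to make the internal noise picture quantitative is sketched below. It assumes equal-variance Gaussian noise on the decision variable and an ideal observer in a two-interval forced-choice task; these are standard signal detection assumptions and are not stated on the slide. The function names are illustrative.

```python
import math

def d_prime(delta_i, sigma_n):
    """Separation of p(x | I) and p(x | I+∆I) in units of the noise std."""
    return delta_i / sigma_n

def prob_correct_2ifc(dp):
    """Ideal-observer proportion correct in two-interval forced choice.

    Equals Phi(d'/sqrt(2)) for equal-variance Gaussian noise, written
    here with the error function: Phi(z) = 0.5·(1 + erf(z/sqrt(2))).
    """
    return 0.5 * (1.0 + math.erf(dp / 2.0))

chance = prob_correct_2ifc(d_prime(0.0, 1.0))      # no increment: guessing
at_one_sigma = prob_correct_2ifc(d_prime(1.0, 1.0))  # increment equal to σn
```

Under these assumptions an increment equal to σn is detected about three times out of four, so a measured threshold can be read as an estimate of the internal noise level.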

SLIDE 20

Sequential masking

  • Backward/forward in time:
    [Figure: intensity / dB vs. time, showing the masker envelope and the masked threshold: simultaneous masking ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
      • suggests temporal envelope of decision var.
    → Time-frequency masking ‘skirt’:
    [Figure: masking tone and masked threshold over time, freq and intensity]

SLIDE 21

What we do and don’t hear

  • Timing: 2ms attack resolution, 20ms discrim
      • but: spectral splatter
  • Tuning: ~ 1% discrimination
      • but: beats
  • Spectrum: profile changes, formants
      • variable time-frequency resolution
  • Harmonic phase
  • Noisy signals & texture
  • (Trace vs. categorical memory)

“two-interval forced-choice”: A, B, X presented in time; X = A or B?

SLIDE 22

Outline

1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
      • ‘Place’ models
      • ‘Time’ models
      • Multiple cues & competition
5. Scene analysis
6. Speech perception

SLIDE 23

Pitch perception: A classic argument in psychophysics

  • Harmonic complexes are a pattern on AN
      • .. but give a fused percept (ecological)
    [Figure: AN response to a harmonic complex, freq. chan. vs. time/s]
  • What determines the pitch percept?
      • not the fundamental
  • How is it computed?

Two competing models: place and time

SLIDE 24

Place model of pitch

  • AN excitation pattern shows individual peaks
    [Figure: AN excitation vs. frequency channel: resolved harmonics at low frequencies; broader HF channels cannot resolve harmonics]
  • ‘Pattern matching’ method to find pitch:
    Correlate with harmonic ‘sieve’:
    [Figure: pitch strength vs. frequency channel]
  • Support: Low harmonics are very important
  • But: Flat-spectrum noise can carry pitch
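A minimal sketch of the harmonic sieve idea follows, under simplifying assumptions: the resolved harmonics are given as a list of peak frequencies, each candidate pitch is scored by how closely the peaks fall on its harmonic grid, and ties go to the highest candidate to avoid subharmonic answers. The scoring rule, the 3% tolerance and the function names are illustrative choices, not the lecture's.

```python
def sieve_score(peaks_hz, f0, tolerance=0.03):
    """Sum of per-peak match quality against the harmonic grid of f0.

    A peak scores 1 for an exact hit on a harmonic and falls linearly
    to 0 at a relative deviation of `tolerance` (an assumed value).
    """
    score = 0.0
    for peak in peaks_hz:
        harmonic = max(1, round(peak / f0))            # nearest harmonic number
        deviation = abs(peak - harmonic * f0) / (harmonic * f0)
        if deviation < tolerance:
            score += 1.0 - deviation / tolerance
    return score

def estimate_pitch(peaks_hz, candidates_hz):
    """Best-scoring candidate; ties go to the highest f0."""
    best_f0, best_score = None, -1.0
    for f0 in sorted(candidates_hz):
        score = sieve_score(peaks_hz, f0)
        if score >= best_score:
            best_f0, best_score = f0, score
    return best_f0

# Harmonics 2, 3 and 4 of 200 Hz, with the fundamental itself absent.
peaks = [400.0, 600.0, 800.0]
pitch = estimate_pitch(peaks, range(50, 501))
```

The example recovers 200 Hz even though no peak sits at 200 Hz, consistent with pitch not being determined by the fundamental alone.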

SLIDE 25

Time model of pitch

  • Timing information is preserved in AN up to ~ 1ms scale
  • Extract periodicity by e.g. autocorrelation & combine across frequency chans:
    [Figure: per-channel autocorrelation (freq vs. lag / ms, 10-30) summed into a summary autocorrelation; the peak marks the common period (pitch)]
  • But: HF channels give weak pitch
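The summary autocorrelation can be sketched in a few lines. As a simplification, each "channel" below is a single sinusoid (harmonics 2, 3 and 4 of 200 Hz) standing in for the output of a cochlear filter; a real model would use a filterbank and hair cell stage. The sample rate, duration and lag range are assumed values.

```python
import math

def autocorrelation(x, lag):
    """Unnormalized autocorrelation of sequence x at one lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def summary_autocorrelation(channels, lags):
    """Sum the per-channel autocorrelations at each lag."""
    return {lag: sum(autocorrelation(ch, lag) for ch in channels) for lag in lags}

fs = 8000            # sample rate in Hz (assumed)
n_samples = 320      # 40 ms of signal
f0 = 200.0
# One sinusoid per channel: harmonics 2, 3, 4 (missing fundamental).
channels = [
    [math.sin(2.0 * math.pi * h * f0 * n / fs) for n in range(n_samples)]
    for h in (2, 3, 4)
]
lags = range(fs // 400, fs // 80 + 1)     # search pitches from 400 Hz down to 80 Hz
sacf = summary_autocorrelation(channels, lags)
best_lag = max(sacf, key=sacf.get)        # lag where all channels agree
pitch_hz = fs / best_lag
```

Each channel alone peaks at multiples of its own period; only at the common period of 5 ms do all three line up, so the summed function picks out 200 Hz.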

SLIDE 26

Alternate & competing cues

  • Pitch perception could rely on various cues
      • average excitation pattern
      • summary autocorrelation
      • more complex pattern matching
  • Relying on just one cue is brittle
      • e.g. missing fundamental
    → Perceptual system appears to use a flexible, opportunistic combination
  • Optimal detector justification?

$$\arg\max_{\omega}\, p(\omega \mid o) = \arg\max_{\omega}\, \frac{p(o \mid \omega)\, p(\omega)}{p(o)} = \arg\max_{\omega}\, p(o_1 \mid \omega)\, p(o_2 \mid \omega)\, p(\omega)$$

if o1 and o2 are conditionally independent
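The argmax rule can be illustrated with a toy calculation. All the numbers below are made up for illustration: `place_cue` and `time_cue` stand for p(o1 | ω) and p(o2 | ω) over three candidate pitches, and the prior p(ω) is uniform.

```python
def combine_cues(likelihoods_1, likelihoods_2, prior):
    """Posterior over candidates assuming o1 and o2 are conditionally independent."""
    joint = {w: likelihoods_1[w] * likelihoods_2[w] * prior[w] for w in prior}
    evidence = sum(joint.values())                 # p(o), constant across candidates
    posterior = {w: v / evidence for w, v in joint.items()}
    best = max(posterior, key=posterior.get)       # argmax over ω
    return best, posterior

# Hypothetical likelihoods for candidate pitches in Hz.
place_cue = {100: 0.3, 200: 0.5, 400: 0.2}   # p(o1 | ω)
time_cue = {100: 0.4, 200: 0.4, 400: 0.2}    # p(o2 | ω)
prior = {100: 1 / 3, 200: 1 / 3, 400: 1 / 3}  # p(ω)

best, posterior = combine_cues(place_cue, time_cue, prior)
```

Because p(o) is the same for every candidate, dividing by it changes the posterior values but not which candidate wins, which is why it drops out of the argmax.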

SLIDE 27

Outline

1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
5. Scene analysis
      • Events and sources
      • Fusion and streaming
      • Continuity & restoration
6. Speech perception

SLIDE 28

Auditory Organization

  • Detection model is huge simplification
  • Real role of hearing is much more general: Recover useful information from outside world
    → Sound organization into events and sources:
    [Figure: spectrogram, frq/Hz (2000, 4000) vs. time/s (2, 4), with events labeled Voice, Stab, Rumble]
  • Research questions:
      • what determines perception of sources?
      • how do humans separate mixtures?
      • how much can we tell about a source?

SLIDE 29

Auditory scene analysis: Simultaneous fusion

  • Harmonics are distinct on AN, but perceived as one sound (“fused”):
    [Figure: harmonics of one sound, freq vs. time]
      • depends on common onset
      • depends on harmonicity (common period)
  • Methodologies:
      • ask subject how many ‘objects’
      • match attributes e.g. object pitch
      • manipulate higher level e.g. vowel identity

SLIDE 30

Sequential grouping: streaming

  • Pattern / rhythm: property of set of objects
      • subsequent to fusion; employs fused events?
  • Measure by relative timing judgments
      • cannot compare between streams
  • Separate ‘coherence’ and ‘fusion’ boundaries
  • Can interact and compete with fusion

[sndex]

[Figure: tone sequence, frequency vs. time; ∆f: –2 octaves, 1 kHz; TRT: 60-150 ms]

SLIDE 31

Continuity & restoration

  • Tone is interrupted by noise burst: What happened?
    [Figure: freq vs. time, tone + noise burst + tone; is the tone present during the noise?]
      • masking makes tone undetectable during noise
  • Need to infer most probable real-world events
      • observation equally likely for either explanation
      • prior on continuous tone much higher → choose
  • Top-down influence on perceived events...

pulsation threshold [sndex]

SLIDE 32

Models of auditory organization

  • Psychological accounts suggest bottom-up:
    [Figure: input mixture → Front end → signal features (maps of onset, period, frq.mod over time and freq) → Object formation → discrete objects → Grouping rules → Source groups]
      • (Brown 1991)
  • Complications in practice:
      • formation of separate elements
      • contradictory cues
      • influence of top-down constraints (context, expectations) ...

SLIDE 33

Outline

1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
5. Scene analysis
6. Speech perception
      • The sounds of speech
      • Phoneme perception
      • Context and top-down influences
      • Simultaneous vowels

SLIDE 34

Speech perception

  • Highly specialized function
      • subsequent to source organization?
      • .. but also can interact
  • Kinds of speech sounds:
      • vowels
      • glides
      • nasals
      • stops
      • fricatives
      • ...

[Figure: level/dB and spectrogram (freq / Hz, 1000-4000, vs. time/s, 1.4-2.6) of an utterance with word labels (watch, thin, as, a, dime, a, has); examples of stop burst, fricative, vowel, nasal and glide are marked]

SLIDE 35

Cues to phoneme perception

  • Linguists describe speech with phonemes:
    [Figure: spectrogram with word labels (watch, thin, as, a, dime, a, has) and a phoneme-level transcription]
      • phonemes define minimal word contrasts
  • Acoustic-phoneticians describe phonemes by:
      • formants & transitions
      • bursts & onset times
    [Figure: schematic, freq vs. time, showing vowel formants, transition, stop burst, voicing onset time]

SLIDE 36

Categorical perception

  • (Some) speech sounds perceived categorically rather than analogically
      • e.g. stop-burst & timing:
    [Figure: perceived stop (P, T, K) as a function of burst freq fb / Hz (1000-4000) and following vowel (i, e, ε, a, ɔ, o, u); schematic, freq vs. time, of a stop burst at fb followed by vowel formants]
      • tokens within category are hard to distinguish
      • category boundaries are very sharp
  • Categories are learned for native tongue
      • “merry” / “mary” / “marry”

SLIDE 37

Where is the information in speech?

  • ‘Articulation’ of high/low-pass filtered speech:
    [Figure: articulation / % (20-80) vs. freq / Hz (1000-4000) for high-pass and low-pass filtered speech]
      • sums to more than 1...
  • Speech message is highly redundant
      • e.g. constraints of language, context
    → listeners can understand with very few cues

SLIDE 38

Top-down influences: Phonemic restoration

(Warren 1970)

  • What if a noise burst obscures speech?
      • auditory system ‘restores’ the missing phoneme
        ... based on semantic context ... even in retrospect!
  • Subjects are typically unaware of which sounds are restored

[Figure: spectrogram, freq / Hz (1000-4000) vs. time / s (1.4-2.6)]

SLIDE 39

A predisposition for speech: Sinewave replicas

(Remez et al. 1994)

  • Replace each formant with a single sinusoid:
    [Figure: spectrograms, freq / Hz (1000-5000) vs. time / s (0.5-3), of the original speech and its sinewave replica]
      • speech is (somewhat) intelligible
      • people hear both whistles and speech (“duplex”)
      • processed as speech despite un-speech-like
  • What does it take to be speech?
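The replacement step can be sketched as follows, assuming the formant frequencies have already been estimated frame by frame. The tracks and amplitudes below are hypothetical values for three frames, not data from the lecture, and the function name is illustrative. Phase is accumulated sample by sample so each sinusoid stays continuous when its frequency changes between frames.

```python
import math

def sinewave_replica(formant_tracks_hz, amplitudes, fs=8000, frame_s=0.01):
    """Sum one sinusoid per formant track (piecewise-constant frequency per frame)."""
    n_frames = len(formant_tracks_hz[0])
    samples_per_frame = int(round(fs * frame_s))
    phases = [0.0] * len(formant_tracks_hz)
    out = []
    for frame in range(n_frames):
        for _ in range(samples_per_frame):
            sample = 0.0
            for k, track in enumerate(formant_tracks_hz):
                phases[k] += 2.0 * math.pi * track[frame] / fs   # accumulate phase
                sample += amplitudes[k] * math.sin(phases[k])
            out.append(sample)
    return out

# Hypothetical formant tracks (Hz) for three 10 ms frames.
tracks = [
    [500.0, 520.0, 560.0],
    [1500.0, 1450.0, 1400.0],
    [2500.0, 2500.0, 2550.0],
]
amps = [1.0, 0.5, 0.25]
signal = sinewave_replica(tracks, amps)
```

A practical version would interpolate frequency and amplitude within each frame rather than holding them constant.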

SLIDE 40

Simultaneous vowels

  • Mix synthetic vowels with different f0s:
    [Figure: spectra (dB vs. freq) of /iy/ @ 100 Hz + /ah/ @ 125 Hz = mixture]
  • Pitch difference helps (though not necessary):
    [Figure: % both vowels correct (25-75) vs. ∆f0 (semitones: 1/4, 1/2, 1, 2, 4)]
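For reference, the ∆f0 axis is in semitones, i.e. twelfths of an octave on a log frequency scale. The small helper below (the name is illustrative) converts a pair of fundamentals to semitones, using the 100 Hz and 125 Hz vowels from the example.

```python
import math

def semitones_between(f_low_hz, f_high_hz):
    """Interval between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)

delta_f0 = semitones_between(100.0, 125.0)   # the /iy/ and /ah/ example
octave = semitones_between(100.0, 200.0)
```

The 100 Hz / 125 Hz pair is a little under 4 semitones apart, so it sits near the large-∆f0 end of the plotted range.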

SLIDE 41

Computational models of speech perception

  • Various theoretical - practical models of speech comprehension, e.g.:
    Speech → Phoneme recognition → Lexical access → Grammar constraints → Words
  • Open questions:
      • mechanism of phoneme classification
      • mechanism of lexical recall
      • mechanism of grammar constraints
  • ASR is a practical implementation (?)

SLIDE 42

Summary

  • Auditory perception provides the ‘ground truth’ underlying audio processing
  • Physiology specifies information available
  • Psychophysics measures basic sensitivities
  • Sound sources require further organization
  • Strong contextual effects in speech perception

[Figure: block diagram with stages labeled: Sound, Transduce, Multiple represent'ns, Scene analysis, High-level recognition]