SLIDE 1

ASA - Dan Ellis 1998jul11 - 1

Auditory Scene Analysis: phenomena, theories and computational models

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
July 1998

Outline
1. The computational theory of ASA
2. Cues & grouping
3. Expectations & inference
4. Big issues

SLIDE 2

Auditory Scene Analysis

What does our sense of hearing do?

recover useful information
... about objects of interest
... in a wide range of circumstances

Measuring objects in an auditory scene:

SLIDE 3

Subjective analysis of auditory scenes

Subjects identify structures in dense scenes with high agreement

S3−crash (not car) S3−closeup car S3−1st horn S3−2nd horn S1−slam S4−horn1 S4−crash S5−Honk S5−Trash can S5−Acceleration S6−double horn S1−honk, honk S6−slam S6−doppler horn S6−acceleration S7−gunshot S7−horn S7−horn2 S7−horn3 S7−horn4 S8−car horns S8−car horns S8−car horns S8−large object crash S8−truck engine S1−rev up/passing S9−horn 2 S9−horn 3 S9−door Slam? S9−horn 5 S10−car horn S10−car horn S10−door slamming S10−wheels on road S2−first double horn S2−crash S2−horn during crash S2−truck accelerating

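[Figure: spectrogram of the "City" scene, frequency axis f/Hz (200 to 4000), time axis 1 to 9, annotated with the subject labels listed above. Events with subject agreement: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10)]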

SLIDE 4

Outline

1. The computational theory of ASA
   ASA and CASA
   The grouping paradigm
   Marr’s three levels of explanation
2. Cues & grouping
3. Expectations & inference
4. Big issues

SLIDE 5

Auditory Scene Analysis (ASA)

“The organization of sound scenes according to their inferred sources”

Real-world sounds rarely occur in isolation

→ a useful sense of hearing must be able to segregate mixtures

people (and ...) do this very well; unexpectedly difficult to model

depends on:
subjective definition of relevant sources
regularity/constraints of real-world sounds

Studied via experimental psychology
characterize ‘rules’ for organizing simple pieces (tones, noise bursts, clicks)
i.e. ‘reductive’ approach

SLIDE 6

Computational Auditory Scene Analysis (CASA)

Psychological ‘rules’ suggest computer implementation

.. but many practical problems arise!
Motivations:

Practical applications

real-world interactive systems
indexing of media databases
hearing prostheses

Crossover opportunities

unknown signal/information processing principles?

Benefits for theory

implementations are very revealing

SLIDE 7

The grouping paradigm

Standard theory of ASA (Bregman, Darwin &c):
sound mixture is broken up into small elements

e.g. time-frequency ‘cells’

each element has a number of feature dimensions (amplitude, ITD, period)
elements are grouped together according to their features to form larger structures
resulting groups have overall attributes (pitch, location)

(from Darwin 1996)

SLIDE 8

Marr’s levels-of-explanation of information processing

Three distinct aspects to info. processing

Why bother?

to help organize understanding
avoid confusion/wasted effort

→ use as an analysis tool...

Computational theory: ‘what’ and ‘why’; the overall goal. Example: Sound source organization
Algorithm: ‘how’; an approach to meeting the goal. Example: Auditory grouping
Implementation: practical realization of the process. Example: Feature calculation & binding

SLIDE 9

Level 1: Computational theory

The underlying regularities that make the problem possible

i.e. the ‘ecological’ facts
Implicit definition of “what is a source?”:

Independence of attributes between sources
Continuity of attributes for each source

+ other source-specific constraints

SLIDE 10

Level 2: Algorithm

A particular approach to exploiting the constraints of the computational theory

both process & representation
Audition:

the “elements-then-grouping” approach

could have been otherwise e.g. templates
Often the focus of analysis
but: debate is muddled without a clear computational theory

SLIDE 11

Level 3: Implementation

A specific realization of the algorithm
computer programs
neurons
...
Can be analyzed separately?
provided epiphenomena are correctly assigned
Needs context of algorithm, computational theory

“You cannot understand stereopsis simply by thinking about neurons”

SLIDE 12

The advantage of the appropriate level

Computational theory
determines the purpose of the process; provides focus necessary for analysis
e.g. biosonar: benefit of hyperresolution

Algorithm
abstraction that is still specific, transferable

e.g. autocorrelation for pitch

Implementation
explain ‘epiphenomena’

e.g. ‘subjective octave’ from refractory period
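
The "autocorrelation for pitch" example named under Algorithm above can be illustrated with a minimal sketch. It assumes a clean, single periodic signal and picks the lag with the largest raw autocorrelation; the function name `autocorr_pitch`, the search range `fmin`/`fmax` and the test tone are illustrative assumptions, not taken from the slides.

```python
import math

def autocorr_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate pitch (Hz) as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)   # shortest period considered
    max_lag = int(sample_rate / fmin)   # longest period considered
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        value = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Hypothetical test signal: 0.1 s of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
estimate = autocorr_pitch(tone, sample_rate)
print(estimate)
```

The same algorithm could be realized by quite different implementations, which is the point of keeping the algorithm level separate from the implementation level.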

SLIDE 13

An example: Neural inhibition

Computational theory: Frequency-domain processing
Algorithm: Discrete-time filtering (subtraction)
Implementation: Neurons with GABAergic inhibitions

[Figure: spectrum X(f) plotted against f]
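
One way to read "discrete-time filtering (subtraction)" is a filter that subtracts a delayed copy of the input from the current input. The sketch below assumes that first-difference form, y[n] = x[n] - w·x[n-1]; the exact filter and the `weight` parameter are assumptions for illustration, not something the slide specifies. It shows how a time-domain subtraction has a frequency-domain effect.

```python
import cmath
import math

def subtraction_filter(x, weight=1.0):
    """y[n] = x[n] - weight * x[n-1]: the delayed input 'inhibits' the current one."""
    previous = 0.0
    y = []
    for sample in x:
        y.append(sample - weight * previous)
        previous = sample
    return y

def magnitude_response(omega, weight=1.0):
    """|H(e^{j omega})| for H(z) = 1 - weight * z^-1, omega in radians/sample."""
    return abs(1 - weight * cmath.exp(-1j * omega))

# A steady input is cancelled after the first sample.
constant_out = subtraction_filter([1.0] * 5)
# The fastest possible oscillation is doubled after the first sample.
alternating_out = subtraction_filter([1.0, -1.0] * 3)

dc_gain = magnitude_response(0.0)          # gain at zero frequency
nyquist_gain = magnitude_response(math.pi)  # gain at half the sample rate
print(constant_out, alternating_out, dc_gain, nyquist_gain)
```

Under these assumptions the subtraction acts as a high-pass filter: slow components are suppressed and rapid changes are emphasized.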

SLIDE 14

Summary

Acoustic scenes are very complex
.. but the auditory system extracts useful information
Grouping is the main focus of Auditory Scene Analysis

.. but it fits into a larger Marrian framework

SLIDE 15

Outline

1. The computational theory of ASA
2. Cues & grouping
   Cue analysis
   Simple scenes
   Models
   Complications: interaction, ambiguity, time
3. Expectations & inference
4. Big issues

SLIDE 16

Cues to grouping

Common onset/offset/modulation (“fate”)
Common periodicity (“pitch”)
Spatial location (ITD, ILD, spectral cues)
Sequential cues...
Source-specific cues...

Common onset
Computational theory: Acoustic consequences tend to be synchronized
Algorithm: Group elements that start in a time range
Implementation: Onset detector cells; Synchronized osc’s?

Periodicity
Computational theory: (Nonlinear) cyclic processes are common
Algorithm: ? Place patterns; ? Autocorrelation
Implementation: ? Delay-and-mult; ? Modulation spect

SLIDE 17

Simple grouping

E.g. isolated tones

Computational theory

common onset
common period (harmonicity)

Algorithm

locate elements (tracks)
group by shared features

Implementation
? exhaustive search
evolution in time

[Figure: sketch on time / freq axes]
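
The "locate elements, group by shared features" algorithm above can be sketched for isolated tones. This is a minimal illustration, assuming the elements (tracks) have already been located and carry only an onset time and a frequency; the dictionary layout, the 30 ms onset tolerance and the 3% harmonicity tolerance are hypothetical choices.

```python
def group_by_onset(tracks, onset_tolerance=0.03):
    """Greedily cluster tracks whose onsets lie within `onset_tolerance`
    seconds of the first onset in an existing group."""
    groups = []
    for track in sorted(tracks, key=lambda t: t["onset"]):
        for group in groups:
            if abs(track["onset"] - group[0]["onset"]) <= onset_tolerance:
                group.append(track)
                break
        else:
            # No existing group started close enough in time: start a new one.
            groups.append([track])
    return groups

def is_harmonic(freq, f0, tolerance=0.03):
    """True if freq lies within a relative `tolerance` of an integer multiple of f0."""
    harmonic_number = max(1, round(freq / f0))
    return abs(freq - harmonic_number * f0) <= tolerance * harmonic_number * f0

# Hypothetical tracks: two harmonic complexes starting half a second apart.
tracks = [
    {"onset": 0.00, "freq": 200.0},
    {"onset": 0.01, "freq": 400.0},
    {"onset": 0.01, "freq": 600.0},
    {"onset": 0.50, "freq": 330.0},
    {"onset": 0.51, "freq": 660.0},
]
groups = group_by_onset(tracks)
group_freqs = [[t["freq"] for t in g] for g in groups]
print(group_freqs)
```

Here the onset cue alone separates the two complexes, and `is_harmonic` could then confirm each group against its lowest component. When cues disagree, a fixed rule like this has no obvious answer.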

SLIDE 18

Computer models of grouping

“Bregman at face value” (e.g. Brown 1992):
feature maps
periodicity cue
common-onset boost
resynthesis

[Block diagram: input mixture → Front end → signal features (maps: onset, period, frq.mod) → Object formation → discrete objects → Grouping rules → Source groups. Maps drawn on time / freq axes]

SLIDE 19

Grouping model results

Able to extract voiced speech:
Periodicity is the primary cue
how to handle aperiodic energy?
Limitations
resynthesis via filter-mask
only periodic targets
robustness of discrete objects

[Figure: two spectrograms, brn1h.aif and brn1h.fi.aif, each frq/Hz (100 to 3000) vs. time/s (0.2 to 1.0)]

SLIDE 20

Complications for grouping: 1: Cues in conflict

Mistuned harmonic (Moore, Darwin..):
harmonic usually groups by onset & periodicity
can alter frequency and/or onset time
‘degree of grouping’ from overall pitch match
Gradual, various results:
heard as separate tone, still affects pitch

[Figure: mistuned-harmonic sketch on time / freq axes; annotations: 3% mistuning, pitch shift]

SLIDE 21

Complications for grouping: 2: The effect of time

Added harmonics:
onset cue initially segregates; periodicity eventually fuses

The effect of time
some cues take time to become apparent
onset cue becomes increasingly distant...
What is the impetus for fission?
e.g. double vowels
depends on what you expect .. ?

[Figure: sketch on time / freq axes]

SLIDE 22

Summary

Known grouping cues make sense
Simple examples are straightforward
Models can be implemented directly
.. but problematic situations abound

SLIDE 23

Outline

1. The computational theory of ASA
2. Cues & grouping
3. Expectations & inference
   “Old-plus-new”
   Streaming
   Restoration & illusions
   Top-down models
4. Big issues

SLIDE 24

The effect of context

Context can create an ‘expectation’:

i.e. a bias towards a particular interpretation

e.g. Bregman’s “old-plus-new” principle:

A change in a signal will be interpreted as an added source whenever possible

a different division of the same energy depending on what preceded it

[Figure: spectrogram, freq/kHz (1 to 2) vs. time/s (0.0 to 1.2)]

SLIDE 25

Streaming

Successive tone events form separate streams
Order, rhythm &c within, not between, streams

Computational theory: Consistency of properties for successive source events
Algorithm: ‘expectation window’ for known streams (widens with time)
Implementation: competing time-frequency affinity weights...

[Figure: tone sequence on time / freq. axes; annotations: ∆f: 1 kHz, ±2 octaves, TRT: 60-150 ms]
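
The "expectation window that widens with time" can be sketched as follows. This is only an illustration under stated assumptions: each stream remembers its last tone, the window is measured in octaves, and its half-width grows linearly with the time since that tone. The names `assign_streams`, `base_width` and `widen_rate`, and their values, are hypothetical and not fitted to any data.

```python
import math

def assign_streams(tones, base_width=0.25, widen_rate=4.0):
    """Assign (time_s, freq_hz) tones to streams.

    A tone joins the closest stream whose expectation window contains it.
    The window half-width, in octaves, is base_width + widen_rate * (time
    since that stream's last tone). Otherwise the tone starts a new stream.
    """
    streams = []   # each stream is a list of (time, freq) tones
    labels = []
    for time, freq in tones:
        best_index, best_distance = None, None
        for index, stream in enumerate(streams):
            last_time, last_freq = stream[-1]
            distance = abs(math.log2(freq / last_freq))
            width = base_width + widen_rate * (time - last_time)
            if distance <= width and (best_distance is None or distance < best_distance):
                best_index, best_distance = index, distance
        if best_index is None:
            streams.append([(time, freq)])
            best_index = len(streams) - 1
        else:
            streams[best_index].append((time, freq))
        labels.append(best_index)
    return labels

def alternating(interval):
    """Six tones alternating between 1 kHz and 2 kHz, `interval` seconds apart."""
    return [(i * interval, 1000.0 if i % 2 == 0 else 2000.0) for i in range(6)]

fast_labels = assign_streams(alternating(0.1))   # rapid alternation
slow_labels = assign_streams(alternating(0.4))   # slow alternation
print(fast_labels, slow_labels)
```

With these illustrative settings the rapid sequence splits into two streams while the slow one stays in a single stream, which is the qualitative dependence on tone repetition time that the slide describes.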

SLIDE 26

Restoration & illusions

Direct evidence may be masked or distorted

→ make best guess using available information

E.g. the ‘continuity illusion’:
tone alternates with noise bursts
noise is strong enough to mask tone
... so listener cannot discriminate presence
continuous tone distinctly perceived for gaps ~100s of ms

→ Inference acts at low, preconscious level

[Figure: spectrogram "ptshort", f/Hz (1000 to 4000) vs. time/s (0.0 to 1.4)]

SLIDE 27

Speech restoration

Speech provides very strong bases for inference (coarticulation, grammar, semantics):

Phonemic restoration [Figure: spectrogram of nsoffee.aif, frq/Hz (500 to 3500) vs. time/s (1.2 to 1.7)]
Sinewave speech (duplex?) [Figure: S1−env.pf:0, f/Bark (5 to 15) vs. time (0.0 to 1.8)]
Temporal compounds [Figure: "Temporal compound (1998jul10)", time / ms (50 to 550)]

SLIDE 28

Models of top-down processing

Perception as a search for plausible explanations

‘Prediction-driven’ CASA (PDCASA):
An approach as well as an implementation...
Key features:
‘complete explanation’ of all scene energy
vocabulary of periodic/noise/transient elements
multiple hypotheses
explanation hierarchy

[Block diagram: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Periodic components, Noise components) → Predict & combine → predicted features → back to Compare & reconcile]

SLIDE 29

PDCASA for old-plus-new

Incremental analysis

[Figure: input signal with times t1, t2, t3 marked]
Time t1: Initial element created
Time t2: Additional element required
Time t3: Second element finished
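
The incremental, old-plus-new style of analysis can be sketched in a few lines. This is a toy illustration, not the PDCASA system: it assumes a single energy value per time step, that the energies of simultaneous elements simply add, and that each element has a constant level. The function name `explain_energy` and the `tolerance` are hypothetical.

```python
def explain_energy(observed, tolerance=0.1):
    """Explain a sequence of energy values as a sum of persistent elements.

    At each step the active elements predict the energy. Unexplained extra
    energy starts a new element (old-plus-new); a shortfall ends the most
    recently added elements until the prediction fits again.
    """
    active = []    # levels of the currently active elements
    events = []    # (frame index, "start" or "end", level)
    for frame, energy in enumerate(observed):
        predicted = sum(active)
        # Reconcile a shortfall: remove elements until the prediction fits.
        while active and predicted - energy > tolerance:
            level = active.pop()
            events.append((frame, "end", level))
            predicted = sum(active)
        # Reconcile an excess: explain it as one added element.
        if energy - predicted > tolerance:
            level = energy - predicted
            active.append(level)
            events.append((frame, "start", level))
    return events

# Hypothetical input: a steady sound, joined for two frames by a louder one.
observed = [1.0, 1.0, 3.0, 3.0, 1.0, 1.0]
events = explain_energy(observed)
print(events)
```

The three events correspond to t1, t2 and t3 above: the rise in energy is explained as an added element rather than as a change in the existing one.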

SLIDE 30

PDCASA for the continuity illusion

Subjects hear the tone as continuous ... if the noise is a plausible masker

Data-driven analysis gives just visible portions:
Prediction-driven can infer masking:

[Figure: spectrogram "ptshort", f/Hz (1000 to 4000) vs. time (0.0 to 1.4)]
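
The contrast between the two analyses can be sketched with a toy model. It assumes each time frame records whether the tone is directly visible and how strong the noise is in the tone's band, and it treats noise at least as strong as the tone as a plausible masker. The frame layout, the function names and the levels are hypothetical.

```python
def visible_segments(frames):
    """Data-driven analysis: (start, end) frame ranges, end exclusive,
    where the tone is directly observed."""
    segments, start = [], None
    for index, (tone_visible, _noise_level) in enumerate(frames):
        if tone_visible and start is None:
            start = index
        elif not tone_visible and start is not None:
            segments.append((start, index))
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments

def inferred_segments(frames, tone_level):
    """Prediction-driven analysis: bridge a gap between visible portions
    when the noise throughout the gap could have masked the tone."""
    merged = []
    for start, end in visible_segments(frames):
        if merged:
            gap = frames[merged[-1][1]:start]
            if all(noise >= tone_level for _visible, noise in gap):
                merged[-1] = (merged[-1][0], end)
                continue
        merged.append((start, end))
    return merged

# Each frame is (tone_visible, noise_level); levels are arbitrary units.
tone_level = 1.0
loud_noise = [(True, 0.0)] * 2 + [(False, 2.0)] * 2 + [(True, 0.0)] * 2
quiet_noise = [(True, 0.0)] * 2 + [(False, 0.2)] * 2 + [(True, 0.0)] * 2
print(visible_segments(loud_noise))
print(inferred_segments(loud_noise, tone_level))
print(inferred_segments(quiet_noise, tone_level))
```

With loud noise in the gap the tone is inferred to continue; with quiet noise the gap is kept, since the tone would have been audible had it been present.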

SLIDE 31

PDCASA analysis of a complex scene

[Figure: PDCASA analysis of the "City" scene. Panels, each with frequency axis f/Hz: City; Wefts1−4; Weft5; Wefts6,7; Weft8; Wefts9−12; Noise1; Noise2,Click1. Time axis time/s (1 to 9); level scale −70 to −40 dB. Events with subject agreement: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]

SLIDE 32

Marrian analysis of PDCASA

Marr invoked to separate high-level function from low-level details

“It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory...”

Computational theory

Objects persist predictably
Observations interact irreversibly

Algorithm

Build hypotheses from generic elements

Update by prediction-reconciliation

Implementation ???

SLIDE 33

Summary

Perceptual processing is highly context-dependent
Auditory system will use prior knowledge to fill-in gaps (subconsciously)
Prediction-reconciliation models can encompass this behavior

SLIDE 34

Outline

1. The computational theory of ASA
2. Cues & grouping
3. Expectations & inference
4. Big issues
   the state of ASA and CASA
   outstanding issues
   discussion points

SLIDE 35

The current state of ASA and CASA

ASA
detailed descriptions of “in vitro” tests
some quite subtle effects explained (DV beats)

but: how to extend to complex scenarios?

CASA
numerous models, some convergence

(mainly periodicity-based)

best results sound impressive

(least plausible systems!)

applications in speech recognition?

but: domains limited, poor robustness

SLIDE 36

Big issues in CASA:

Plausibility
correct level for human correspondence?
which phenomena are important to match?
how to implement symbolic-style processing?
Top-down vs. bottom-up
different approaches to ambiguity, latency
how far down for top-down?
how far ‘up’ for high level?
choice between extraction & inference?
Integrating multiple cues (e.g. binaural)
Other debates:
what is the real goal?
resynthesis
evaluation

SLIDE 37

Big issues in ASA & CASA:

Knowledge:

how to acquire, represent & store ...

short-term: context
long-term: memories
abstract: classes, generalities
Attention:
what does it mean in these models?
limitation or important principle?

SLIDE 38

Conclusions

Real-world sounds are complex; scene-analysis is required
We know certain cues & some rules, but real situations raise contradictions
Current models handle ‘obvious’ cases; robustness & generality are hard

Many issues remain

SLIDE 39

Discussion points

Are Marr’s levels important? Useful?

Can you study levels in isolation?

What do restoration phenomena imply about internal representations?

Do we have an adequate account of an ASA algorithm? e.g. where do hypotheses come from?

How important/challenging are phenomena like