ASA - Dan Ellis 1998jul11 - 1
Auditory Scene Analysis: phenomena, theories and computational models
July 1998
Dan Ellis
International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>
Outline
1 The computational theory of ASA
2 Cues & grouping
Auditory Scene Analysis
What does our sense of hearing do?
- recover useful information
... about objects of interest
... in a wide range of circumstances
Measuring objects in an auditory scene:
Subjective analysis of auditory scenes
- Subjects identify structures in dense scenes
with high agreement
[Figure: spectrogram of the "City" scene (f/Hz vs. time/s) annotated with the event labels given by subjects S1 to S10, e.g. horns, crash, door slam, truck accelerating]
Events identified (subjects out of 10): Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10)
Outline
1 The computational theory of ASA
- ASA and CASA
- The grouping paradigm
- Marr’s three levels of explanation
2 Cues & grouping
3 Expectations & inference
4 Big issues
Auditory Scene Analysis (ASA)
“The organization of sound scenes according to their inferred sources”
- Real-world sounds rarely occur in isolation
→ a useful sense of hearing must be able to segregate mixtures
- people (and ...) do this very well;
unexpectedly difficult to model
- depends on:
- subjective definition of relevant sources
- regularity/constraints of real-world sounds
- Studied via experimental psychology
- characterize ‘rules’ for organizing simple pieces
(tones, noise bursts, clicks) i.e. ‘reductive’ approach
Computational Auditory Scene Analysis (CASA)
- Psychological ‘rules’ suggest computer
implementation
- .. but many practical problems arise!
- Motivations:
Practical applications
- real-world interactive systems
- indexing of media databases
- hearing prostheses
Crossover opportunities
- unknown signal/information processing
principles?
Benefits for theory
- implementations are very revealing
The grouping paradigm
- Standard theory of ASA (Bregman, Darwin &c):
- sound mixture is broken up into small elements
e.g. time-frequency ‘cells’
- each element has a number of feature
dimensions (amplitude, ITD, period)
- elements are grouped together according to their
features to form larger structures
- resulting groups have overall attributes (pitch,
location)
(from Darwin 1996)
Marr’s levels-of-explanation of information processing
- Three distinct aspects to info. processing
Why bother?
- to help organize understanding
- avoid confusion/wasted effort
→ use as an analysis tool...
Computational Theory: ‘what’ and ‘why’; the overall goal (in ASA: sound source organization)
Algorithm: ‘how’; an approach to meeting the goal (in ASA: auditory grouping)
Implementation: practical realization of the process (in ASA: feature calculation & binding)
Level 1: Computational theory
- The underlying regularities that make the
problem possible
- i.e. the ‘ecological’ facts
- Implicit definition of “what is a source?”:
Independence of attributes between sources
Continuity of attributes for each source
+ other source-specific constraints
Level 2: Algorithm
- A particular approach to exploiting the
constraints of the computational theory
- both process & representation
- Audition:
the “elements-then-grouping” approach
- could have been otherwise e.g. templates
- Often the focus of analysis
- but: debate is muddled without a clear
computational theory
Level 3: Implementation
- A specific realization of the algorithm
- computer programs
- neurons
- ...
- Can be analyzed separately?
- provided epiphenomena are correctly assigned
- Needs context of algorithm,
computational theory
“You cannot understand stereopsis simply by thinking about neurons”
The advantage of the appropriate level
- Computational theory
- determines the purpose of the process;
provides focus necessary for analysis
e.g. biosonar: benefit of hyperresolution
- Algorithm
- abstraction that is still specific, transferable
e.g. autocorrelation for pitch
- Implementation
- explain ‘epiphenomena’
e.g. ‘subjective octave’ from refractory period
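The algorithm-level example above, autocorrelation for pitch, can be sketched in a few lines. This is a minimal illustration, not a model from the talk: the function name, the search range (`f_min`, `f_max`) and the test signal are all assumptions chosen for the example.

```python
import math

def autocorrelation_pitch(signal, sample_rate, f_min=50.0, f_max=500.0):
    """Return the pitch (Hz) whose period maximizes the autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        value = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Hypothetical test signal: harmonics 2-4 of 200 Hz, with no energy at 200 Hz itself.
sample_rate = 8000
tone = [sum(math.sin(2 * math.pi * h * 200.0 * n / sample_rate) for h in (2, 3, 4))
        for n in range(800)]
estimate = autocorrelation_pitch(tone, sample_rate)
print(estimate)
```

The estimate lands on the common period of the components (200 Hz) even though no component sits at that frequency, which is why the method transfers between implementations as an abstraction.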
An example: Neural inhibition
Computational theory: Frequency-domain processing
Algorithm: Discrete-time filtering (subtraction)
Implementation: Neurons with GABAergic inhibition
[Figure: spectrum X(f) against frequency f]
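One possible reading of "discrete-time filtering (subtraction)" is sketched here as an assumption, since the slide does not spell out the filter: inhibition subtracts a delayed copy of the signal, and that subtraction has a frequency-dependent gain. The delay and gain values are illustrative only.

```python
import cmath
import math

def inhibit(signal, delay, gain=1.0):
    """y[n] = x[n] - gain * x[n - delay], with x taken as zero before the start."""
    return [x - gain * (signal[n - delay] if n >= delay else 0.0)
            for n, x in enumerate(signal)]

def magnitude_response(freq_hz, sample_rate, delay, gain=1.0):
    """|H(f)| for H(z) = 1 - gain * z**(-delay)."""
    omega = 2 * math.pi * freq_hz / sample_rate
    return abs(1 - gain * cmath.exp(-1j * omega * delay))

sample_rate, delay = 8000, 8  # assumed values: a 1 ms inhibitory delay
gain_500 = magnitude_response(500, sample_rate, delay)    # delayed copy arrives in antiphase
gain_1000 = magnitude_response(1000, sample_rate, delay)  # delayed copy arrives in phase

# Time-domain check: a 1 kHz tone is cancelled once the delayed copy has arrived.
tone_1k = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(80)]
residual = max(abs(y) for y in inhibit(tone_1k, delay)[delay:])
```

Under these assumptions the 500 Hz component is doubled and the 1 kHz component is removed, so a purely time-domain subtraction does frequency-domain work.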
Summary
- Acoustic scenes are very complex
- .. but the auditory system extracts useful
information
- Grouping is the main focus of Auditory Scene
Analysis
- .. but it fits into a larger Marrian framework
Outline
1 The computational theory of ASA
2 Cues & grouping
- Cue analysis
- Simple scenes
- Models
- Complications: interaction, ambiguity, time
3 Expectations & inference
4 Big issues
Cues to grouping
- Common onset/offset/modulation (“fate”)
- Common periodicity (“pitch”)
- Spatial location (ITD, ILD, spectral cues)
- Sequential cues...
- Source-specific cues...
Common onset
- Computational theory: Acoustic consequences tend to be synchronized
- Algorithm: Group elements that start in a time range
- Implementation: Onset detector cells; Synchronized osc’s?
Periodicity
- Computational theory: (Nonlinear) cyclic processes are common
- Algorithm: ? Place patterns; ? Autocorrelation
- Implementation: ? Delay-and-mult; ? Modulation spect
Simple grouping
- E.g. isolated tones
Computational theory
- common onset
- common period (harmonicity)
Algorithm
- locate elements (tracks)
- group by shared features
Implementation
? exhaustive search
- evolution in time
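The simple-grouping recipe (locate tracks, group by shared onset and period, exhaustive search) might look like the sketch below. It is an illustration under assumptions: tracks are reduced to hypothetical `(onset_s, freq_hz)` pairs, and the tolerances and the candidate range of fundamentals are invented for the example.

```python
def mistuning(freq_hz, f0_hz):
    """Relative distance of freq_hz from the nearest harmonic of f0_hz."""
    harmonic = max(1, round(freq_hz / f0_hz))
    return abs(freq_hz / f0_hz - harmonic) / harmonic

def group_tracks(tracks, onset_tol=0.03, harm_tol=0.02,
                 f0_candidates=range(400, 79, -1)):
    """Group (onset_s, freq_hz) tracks by common onset, then by harmonicity."""
    remaining = sorted(tracks)
    groups = []
    while remaining:
        seed_onset = remaining[0][0]
        synchronous = [t for t in remaining if t[0] - seed_onset <= onset_tol]
        # Fallback: an unexplained track forms a group of its own (f0 unknown).
        best_f0, best_members, best_score = None, [remaining[0]], (0, 0.0)
        for f0 in f0_candidates:  # exhaustive search over candidate fundamentals
            members = [t for t in synchronous if mistuning(t[1], f0) <= harm_tol]
            # Prefer more members, then smaller total mistuning, then higher f0.
            score = (len(members), -sum(mistuning(t[1], f0) for t in members))
            if score > best_score:
                best_f0, best_members, best_score = f0, members, score
        groups.append((best_f0, best_members))
        remaining = [t for t in remaining if t not in best_members]
    return groups

tracks = [(0.0, 200.0), (0.005, 400.0), (0.01, 600.0), (0.5, 330.0), (0.5, 660.0)]
groups = group_tracks(tracks)
```

With these inputs the three near-simultaneous tracks are grouped under a 200 Hz fundamental and the two later tracks under 330 Hz. Evolution in time is ignored here; each track is treated as a single static element.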
Computer models of grouping
- “Bregman at face value” (e.g. Brown 1992):
- feature maps
- periodicity cue
- common-onset boost
- resynthesis
[Figure: block diagram. Input mixture → Front end → signal features (maps: onset, period, frq.mod over time and freq) → Object formation → discrete objects → Grouping rules → Source groups]
Grouping model results
- Able to extract voiced speech:
- Periodicity is the primary cue
- how to handle aperiodic energy?
- Limitations
- resynthesis via filter-mask
- only periodic targets
- robustness of discrete objects
[Figure: two spectrograms, frq/Hz vs. time/s: brn1h.aif and brn1h.fi.aif]
Complications for grouping: 1: Cues in conflict
- Mistuned harmonic (Moore, Darwin..):
- harmonic usually groups by onset & periodicity
- can alter frequency and/or onset time
- ‘degree of grouping’ from overall pitch match
- Gradual, various results:
- heard as separate tone, still affects pitch
[Figure: harmonic complex on time-frequency axes; 3% mistuning of one harmonic produces a pitch shift]
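The "gradual" character of the mistuned-harmonic result can be illustrated with a toy pitch estimator in which each component votes for a fundamental and its vote is down-weighted smoothly as it drifts from the harmonic series. The Gaussian weighting and the value of `sigma` are assumptions made for this sketch; they are not fitted to the experimental data cited on the slide.

```python
import math

def pitch_estimate(components_hz, nominal_f0, sigma=0.04):
    """Weighted mean of per-component f0 estimates; weight falls off with mistuning."""
    weighted_sum = total_weight = 0.0
    for freq in components_hz:
        harmonic = max(1, round(freq / nominal_f0))
        mistuning = freq / (harmonic * nominal_f0) - 1.0
        weight = math.exp(-0.5 * (mistuning / sigma) ** 2)  # assumed gradual weighting
        weighted_sum += weight * freq / harmonic
        total_weight += weight
    return weighted_sum / total_weight

def pitch_shift(mistune_fraction, f0=200.0, mistuned_harmonic=4, n_harmonics=12):
    """Shift (Hz) of the estimated pitch when one harmonic is mistuned."""
    components = [h * f0 * (1 + mistune_fraction if h == mistuned_harmonic else 1.0)
                  for h in range(1, n_harmonics + 1)]
    return pitch_estimate(components, f0) - f0

small, large = pitch_shift(0.03), pitch_shift(0.08)
```

In this toy model a 3% mistuning shifts the pitch more than an 8% mistuning does, yet the 8% component still has a small effect: the component is partly out of the group while continuing to influence the pitch.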
Complications for grouping: 2: The effect of time
- Added harmonics:
- onset cue initially segregates;
periodicity eventually fuses
- The effect of time
- some cues take time to become apparent
- onset cue becomes increasingly distant...
- What is the impetus for fission?
- e.g. double vowels
- depends on what you expect .. ?
Summary
- Known grouping cues make sense
- Simple examples are straightforward
- Models can be implemented directly
- .. but problematic situations abound
Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
- “Old-plus-new”
- Streaming
- Restoration & illusions
- Top-down models
4 Big issues
The effect of context
- Context can create an ‘expectation’:
i.e. a bias towards a particular interpretation
- e.g. Bregman’s “old-plus-new” principle:
A change in a signal will be interpreted as an added source whenever possible
- a different division of the same energy
depending on what preceded it
[Figure: spectrogram illustrating old-plus-new, freq/kHz vs. time/s]
Streaming
- Successive tone events form separate streams
- Order, rhythm &c within, not between, streams
Computational theory: Consistency of properties for successive source events
Algorithm: ‘expectation window’ for known streams (widens with time)
Implementation: competing time-frequency affinity weights...
[Figure: alternating tone sequence on time-frequency axes; annotations: ∆f: 1 kHz, ±2 octaves, TRT: 60-150 ms]
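The ‘expectation window’ algorithm can be sketched as follows: each known stream accepts a new tone only if the tone falls inside a frequency window around the stream's last event, and the window widens with the time elapsed since that event. The window size and widening rate below are placeholder assumptions, not values from the talk.

```python
import math

def assign_streams(tones, base_window_oct=0.25, widen_oct_per_s=2.0):
    """Label each (time_s, freq_hz) tone, given in time order, with a stream index."""
    streams = []  # (time_s, freq_hz) of the most recent tone in each stream
    labels = []
    for time_s, freq_hz in tones:
        best, best_dist = None, None
        for i, (last_time, last_freq) in enumerate(streams):
            # The expectation window grows with the silence since the last event.
            window = base_window_oct + widen_oct_per_s * (time_s - last_time)
            dist = abs(math.log2(freq_hz / last_freq))
            if dist <= window and (best is None or dist < best_dist):
                best, best_dist = i, dist
        if best is None:
            streams.append((time_s, freq_hz))  # no stream expects this tone
            best = len(streams) - 1
        else:
            streams[best] = (time_s, freq_hz)
        labels.append(best)
    return labels

# Alternating 400 Hz / 1600 Hz tones, presented quickly and then slowly.
fast = assign_streams([(0.1 * i, 400.0 if i % 2 == 0 else 1600.0) for i in range(6)])
slow = assign_streams([(1.0 * i, 400.0 if i % 2 == 0 else 1600.0) for i in range(4)])
```

With these assumed parameters the fast sequence splits into two streams, while the slow one stays as a single stream because the window has had time to widen past the two-octave jump.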
Restoration & illusions
- Direct evidence may be masked or distorted
→ make best guess using available information
- E.g. the ‘continuity illusion’:
- tone alternates with noise bursts
- noise is strong enough to mask tone
... so listener cannot discriminate presence
- continuous tone distinctly perceived
for gaps ~100s of ms
→ Inference acts at low, preconscious level
[Figure: spectrogram of ptshort, f/Hz vs. time/s]
Speech restoration
- Speech provides very strong bases for
inference (coarticulation, grammar, semantics):
- Phonemic restoration
- Sinewave speech (duplex?)
- Temporal compounds
[Figures: spectrogram of nsoffee.aif (frq/Hz vs. time/s); S1−env.pf:0 (f/Bark vs. time); “Temporal compound (1998jul10)” (time / ms)]
Models of top-down processing
Perception as a search for plausible explanations
- ‘Prediction-driven’ CASA (PDCASA):
- An approach as well as an implementation...
- Key features:
- ‘complete explanation’ of all scene energy
- vocabulary of periodic/noise/transient elements
- multiple hypotheses
- explanation hierarchy
[Figure: block diagram. Input mixture → Front end → signal features → Compare & reconcile (against predicted features) → prediction errors → Hypothesis management → hypotheses (periodic components, noise components) → Predict & combine → predicted features]
PDCASA for old-plus-new
- Incremental analysis
[Figure: input signal with analysis times t1, t2, t3]
Time t1: initial element created
Time t2: additional element required
Time t3: second element finished
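The incremental t1/t2/t3 analysis can be sketched as a prediction-reconciliation loop over frame energies. This is a toy reduction, not the PDCASA implementation: elements are assumed to have a constant level, energies are assumed to add, and the tolerance `tol` is arbitrary.

```python
def explain(frames, tol=0.1):
    """Explain per-frame energy as a sum of constant-level elements.

    Returns a sorted list of (start_frame, end_frame, level) elements.
    """
    active = []    # (start_frame, level) of elements currently predicted to continue
    finished = []  # (start_frame, end_frame, level)
    for t, observed in enumerate(frames):
        predicted = sum(level for _, level in active)
        error = observed - predicted
        if error > tol:
            # Unexplained extra energy: old-plus-new says add an element, keep the old.
            active.append((t, error))
        elif error < -tol:
            # Prediction too high: end the newest elements until it fits again.
            while active and sum(level for _, level in active) - observed > tol:
                start, level = active.pop()
                finished.append((start, t, level))
    finished.extend((start, len(frames), level) for start, level in active)
    return sorted(finished)

# Hypothetical input: a steady sound with a brief addition in frames 3-5.
elements = explain([1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.0, 1.0, 1.0])
```

The result is one element spanning the whole input plus a short added element, rather than three successive sounds of different levels. Ending the newest element first is a simplification chosen for this sketch.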
PDCASA for the continuity illusion
- Subjects hear the tone as continuous
... if the noise is a plausible masker
- Data-driven analysis gives just visible portions:
- Prediction-driven can infer masking:
[Figure: spectrogram of ptshort, f/Hz vs. time/s]
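The contrast between the two analyses can be sketched on a single frequency band. The representation is an assumption made for illustration: one level in dB per frame in the tone's band, with the tone level known in advance and an arbitrary matching tolerance.

```python
def data_driven(band_levels, tone_level, tol=1.0):
    """Mark the tone present only where the observed level matches it directly."""
    return [abs(level - tone_level) <= tol for level in band_levels]

def prediction_driven(band_levels, tone_level, tol=1.0):
    """Keep predicting the tone while the observation does not contradict it."""
    present, tracking = [], False
    for level in band_levels:
        if abs(level - tone_level) <= tol:
            tracking = True        # direct evidence for the tone
            present.append(True)
        elif tracking and level > tone_level + tol:
            present.append(True)   # louder energy could be masking the predicted tone
        else:
            tracking = False       # band too quiet: the prediction is contradicted
            present.append(False)
    return present

masked = [60, 60, 75, 75, 60, 60]   # tone interrupted by a louder noise burst
silent = [60, 60, 20, 20, 60, 60]   # tone interrupted by a silent gap
visible_only = data_driven(masked, 60)
through_noise = prediction_driven(masked, 60)
through_gap = prediction_driven(silent, 60)
```

The data-driven pass reports two separate tone segments in both cases. The prediction-driven pass reports one continuous tone through the noise burst but not through the silent gap, matching the "plausible masker" condition.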
PDCASA analysis of a complex scene
[Figure: PDCASA analysis of the "City" scene, f/Hz vs. time/s, level in dB; extracted elements: Noise1; Noise2, Click1; Wefts1−4; Weft5; Wefts6,7; Weft8; Wefts9−12]
Events identified by subjects (out of 10): Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)
Marrian analysis of PDCASA
- Marr invoked to separate high-level function
from low-level details
“It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory...”
Computational theory
- Objects persist predictably
- Observations interact irreversibly
Algorithm
- Build hypotheses from generic
elements
- Update by prediction-reconciliation
Implementation ???
Summary
- Perceptual processing is highly
context-dependent
- Auditory system will use prior knowledge
to fill-in gaps (subconsciously)
- Prediction-reconciliation models can
encompass this behavior
Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues
- the state of ASA and CASA
- outstanding issues
- discussion points
The current state of ASA and CASA
- ASA
- detailed descriptions of “in vitro” tests
- some quite subtle effects explained (DV beats)
but: how to extend to complex scenarios?
- CASA
- numerous models, some convergence
(mainly periodicity-based)
- best results sound impressive
(least plausible systems!)
- applications in speech recognition?
but: domains limited, poor robustness
Big issues in CASA:
- Plausibility
- correct level for human correspondence?
- which phenomena are important to match?
- how to implement symbolic-style processing?
- Top-down vs. bottom-up
- different approaches to ambiguity, latency
- how far down for top-down?
- how far ‘up’ for high level?
- choice between extraction & inference?
- Integrating multiple cues (e.g. binaural)
- Other debates:
- what is the real goal?
- resynthesis
- evaluation
Big issues in ASA & CASA:
- Knowledge:
how to acquire, represent & store ...
- short-term: context
- long-term: memories
- abstract: classes, generalities
- Attention:
- what does it mean in these models?
- limitation or important principle?
Conclusions
- Real-world sounds are complex;
scene-analysis is required
- We know certain cues & some rules,
but real situations raise contradictions
- Current models handle ‘obvious’ cases;
robustness & generality are hard
- Many issues remain
Discussion points
- Are Marr’s levels important? Useful?
Can you study levels in isolation?
- What do restoration phenomena imply about
internal representations?
- Do we have an adequate account of an ASA
algorithm? e.g. where do hypotheses come from?
- How important/challenging are phenomena like