
Sound Organization By Source Models in Humans and Machines
Dan Ellis, Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu · http://labrosa.ee.columbia.edu/


SLIDE 1

Sound Organization by Models - Dan Ellis, 2006-12-09 (29 slides)

  • 1. Mixtures and Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization
  • 4. Research Questions

Sound Organization By Source Models in Humans and Machines

Dan Ellis

Laboratory for Recognition and Organization of Speech and Audio

  • Dept. Electrical Eng., Columbia Univ., NY USA

dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/

SLIDE 2

The Problem of Mixtures

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)

  • Received waveform is a mixture
  • Received waveform is a mixture

2 sensors, N sources - underconstrained

  • Undoing mixtures: hearing’s primary goal?

.. by any means available


SLIDE 3

Sound Organization Scenarios

  • Interactive voice systems

human-level understanding is expected

  • Speech prostheses

crowds: #1 complaint of hearing aid users

  • Archive analysis

identifying and isolating sound events

  • Unmixing/remixing/enhancement...

[Figure: spectrogram of recording dpwe-2004-09-10-13:15:40]

Sound example: pa-2004-09-10-131540.wav

SLIDE 4

How Can We Separate?

  • By between-sensor differences (spatial cues)

‘steer a null’ onto a compact interfering source
the filtering/signal processing paradigm

  • By finding a ‘separable representation’

spectral? sources are broadband but sparse
periodicity? maybe – for pitched sounds
something more signal-specific...

  • By inference (based on knowledge/models)

acoustic sources are redundant → use part to guess the remainder

  • limited possible solutions
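The ‘steer a null’ idea in the first bullet can be sketched numerically. A minimal sketch, assuming an idealized two-sensor setup (signals, geometry, and the 3-sample interferer delay are all invented for illustration): delay one sensor so the interferer's components align, then subtract.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
s_target = rng.standard_normal(n)   # target: same delay at both sensors (broadside)
s_interf = rng.standard_normal(n)   # compact interfering source
d = 3                               # interferer arrives d samples later at sensor 2

# Two-sensor mixtures
x1 = s_target + s_interf
x2 = s_target.copy()
x2[d:] += s_interf[:-d]

# 'Steer a null': advance sensor 2 by d samples and subtract, so the
# interferer components align and cancel exactly.
y = x1[:-d] - x2[d:]

# The interferer is gone; the target survives, but comb-filtered:
assert np.allclose(y, s_target[:-d] - s_target[d:])
```

Note the trade-off the assertion exposes: the null removes the interferer exactly, but the target is not untouched; real beamformers balance null depth against target distortion.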


SLIDE 5

Separation vs. Inference

  • Ideal separation is rarely possible

i.e. no projection can completely remove overlaps

  • Overlaps → Ambiguity

scene analysis = find “most reasonable” explanation

  • Ambiguity can be expressed probabilistically

i.e. posteriors of sources {Si} given observations X:

P({Si} | X) ∝ P(X | {Si}) · P({Si})

  • Better source models → better inference

.. learn from examples?
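The posterior rule above can be exercised on a toy discrete case (templates, priors, and noise level below are all invented): each source picks one of two spectral templates, the mixture is their sum plus Gaussian noise, and the posterior is computed by enumeration. The example is rigged so two template pairs explain the data equally well, and the priors break the tie — the “most reasonable explanation” idea in miniature.

```python
import numpy as np

# Toy source models: two spectral templates each, with priors
A = np.array([[1.0, 0.0], [0.0, 1.0]])   # source-1 templates (rows)
B = np.array([[0.0, 1.0], [1.0, 0.0]])   # source-2 templates
pA = np.array([0.5, 0.5])
pB = np.array([0.7, 0.3])
sigma = 0.1

x = np.array([1.0, 1.0])                 # observed mixture: deliberately ambiguous

# P({Si} | X) ∝ P(X | {Si}) P({Si}), enumerated over joint template choices
post = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        lik = np.exp(-np.sum((x - (A[i] + B[j])) ** 2) / (2 * sigma ** 2))
        post[i, j] = lik * pA[i] * pB[j]
post /= post.sum()

# Pairs (0,0) and (1,1) both reconstruct x exactly; the prior on source 2 decides
assert np.argmax(post) == 0
assert np.isclose(post[0, 0] / post[1, 1], 0.7 / 0.3)
```

Better source models sharpen the likelihood term; better priors sharpen the tie-breaking — both improve the inference.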


SLIDE 6

A Simple Example

  • Source models are codebooks

from separate subspaces


SLIDE 7

A Slightly Less Simple Example

  • Sources with Markov transitions
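Markov transitions turn per-frame codeword choice into a joint path search. A minimal sketch (codewords, transition matrices, and observations are invented): brute-force Viterbi over paired states (i1, i2), where one source's chain is “sticky” and the other switches freely, so the decoder attributes the changes to the free source.

```python
import numpy as np

K = 2
c = np.array([0.0, 1.0])                  # per-state codewords (scalar spectra)
T1 = np.array([[0.9, 0.1], [0.1, 0.9]])   # source 1: sticky Markov chain
T2 = np.array([[0.5, 0.5], [0.5, 0.5]])   # source 2: switches freely
sigma = 0.2

obs = np.array([0.0, 1.0, 1.0, 2.0])      # mixture x(t) ≈ c[i1] + c[i2]

def loglik(x, i, j):
    return -((x - (c[i] + c[j])) ** 2) / (2 * sigma ** 2)

# Viterbi over the joint (factorial) state space: transitions factorize
states = [(i, j) for i in range(K) for j in range(K)]
delta = np.array([loglik(obs[0], i, j) for i, j in states])
backptr = []
for x in obs[1:]:
    step = np.empty(len(states))
    ptr = np.empty(len(states), dtype=int)
    for s, (i, j) in enumerate(states):
        trans = [delta[p] + np.log(T1[pi, i]) + np.log(T2[pj, j])
                 for p, (pi, pj) in enumerate(states)]
        ptr[s] = int(np.argmax(trans))
        step[s] = trans[ptr[s]] + loglik(x, i, j)
    backptr.append(ptr)
    delta = step

best = [int(np.argmax(delta))]
for ptr in reversed(backptr):
    best.append(int(ptr[best[-1]]))
path = [states[s] for s in reversed(best)]

# The sticky source 1 holds its state while source 2 explains the changes
assert path == [(0, 0), (0, 1), (0, 1), (1, 1)]
```

The joint state space grows as K² here (K^N for N sources), which is exactly the tractability pressure the later “Source Model Issues” slide raises.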


SLIDE 8

What is a Source Model?

  • Source Model describes signal behavior

encapsulates constraints on form of signal (any such constraint can be seen as a model...)

  • A model has parameters

model + parameters → instance

  • What is not a source model?

detail not provided in instance, e.g. using phase from original mixture
constraints on interaction between sources, e.g. independence, clustering attributes

[Figure: excitation source g[n] driving a resonance filter H(e^jω)]

SLIDE 9

Outline

  • 1. Mixtures and Models
  • 2. Human Sound Organization

Auditory Scene Analysis
Using source characteristics
Illusions

  • 3. Machine Sound Organization
  • 4. Research Questions


SLIDE 10

Auditory Scene Analysis

  • How do people analyze sound mixtures?

break mixture into small elements (in time-freq)
elements are grouped into sources using cues
sources have aggregate attributes

  • Grouping rules (Darwin, Carlyon, ...):

cues: common onset/modulation, harmonicity, ...

  • Also learned “schema” (for speech etc.)

[Diagram: frequency analysis → sound elements → grouping mechanism (onset map, harmonicity map, spatial map) → sources and source properties]

(after Darwin 1996)

Bregman’90 Darwin & Carlyon’95

SLIDE 11

Perceiving Sources

  • Harmonics distinct in ear, but perceived as one source (“fused”):

depends on common onset
depends on harmonics

  • Experimental techniques

ask subjects “how many”
match attributes, e.g. pitch, vowel identity
brain recordings (EEG “mismatch negativity”)


SLIDE 12

Auditory “Illusions”

  • How do we explain illusions?

pulsation threshold
sinewave speech
phonemic restoration

  • Something is providing the missing (illusory) pieces ... source models


SLIDE 13

Human Speech Separation

  • Task: Coordinate Response Measure

“Ready Baron go to green eight now”
256 variants, 16 speakers
correct = color and number for “Baron”

  • Accuracy as a function of spatial separation:

[Plot: accuracy vs. spatial separation; A, B = same speaker; note range effect]

Brungart et al. ’02

Sound example: crm-11737+16515.wav

SLIDE 14

Separation by Vocal Differences

  • CRM varying the level and voice character

energetic vs. informational masking
more than pitch ... source models

Brungart et al.’01 (same spatial location)

SLIDE 15

Outline

  • 1. Mixtures and Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization

Computational Auditory Scene Analysis
Dictionary Source Models

  • 4. Research Questions


SLIDE 16

Source Model Issues

  • Domain

parsimonious expression of constraints
nice combination physics

  • Tractability

size of search space
tricks to speed search/inference

  • Acquisition

hand-designed vs. learned
static vs. short-term

  • Factorization

independent aspects
hierarchy & specificity

SLIDE 17

Computational Auditory Scene Analysis

  • Central idea:

Segment time-frequency into sources based on perceptual grouping cues

... principal cue is harmonicity


Brown & Cooke’94 Okuno et al.’99 Hu & Wang’04 ...

[System diagram: input mixture → front end (signal feature maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups; segment, then group]

SLIDE 18

CASA limitations

  • Limitations of T-F masking

cannot undo overlaps – leaves gaps

  • Driven by local features

limited model scope ➝ no inference or illusions

  • Does not learn from data


from Hu & Wang ’04

Sound example: huwang-v3n7.wav

SLIDE 19

Basic Dictionary Models

  • Given models for sources,

find “best” (most likely) states for spectra:

can include sequential constraints...
different domains for combining codewords c and defining the likelihood

  • E.g. stationary noise:


{i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)

p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)   (combination)

→ model inference of source state

[Spectrograms (time / s vs. freq / mel bin): original speech; speech in noise; VQ-inferred states]

Roweis ’01, ’03; Kristjansson ’04, ’06

SLIDE 20

Deeper Models: Iroquois

  • Optimal inference on mixed spectra

speaker-specific models (e.g. 512-mixture GMM)
Algonquin inference

  • .. for Speech Separation Challenge (Cooke/Lee’06)

exploit grammar constraints - higher-level dynamics


Kristjansson, Hershey et al. ’06

[Plot: noisy vector, clean vector, and estimate of clean vector (energy / dB vs. frequency / Hz)]

[Chart: WER (%) for same-gender speakers at SNRs from 6 dB down to −9 dB: SDL recognizer with no dynamics, acoustic dynamics, grammar dynamics, vs. human listeners]

SLIDE 21

Faster Search: Fragment Decoder

  • Match ‘uncorrupt’ spectrum to ASR models using missing data recognition

easy if you know the segregation mask S

  • Joint search for model M and segregation S

to maximize:

Barker et al. ’05

[Diagram: observation Y(f), segregation mask S, source X(f)]

P(M, S | Y) = P(M) · ∫ [ P(X | M) P(X | Y, S) / P(X) ] dX · P(S | Y)

(the factor with the integral: isolated source model; P(S | Y): segregation model)
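The core of missing-data recognition is scoring a model only on the channels the segregation mask marks reliable. A minimal sketch (model means, variances, observation, and mask are invented; a real decoder would also bound the masked channels rather than simply drop them):

```python
import numpy as np

def md_loglik(y, mask, mu, var):
    """Diagonal-Gaussian log-likelihood over the reliable channels only."""
    r, m, v = y[mask], mu[mask], var[mask]
    return float(np.sum(-0.5 * np.log(2 * np.pi * v) - (r - m) ** 2 / (2 * v)))

mu = np.array([10.0, 20.0, 5.0, 15.0])    # source model mean (log-energies)
var = np.full(4, 4.0)
y = np.array([10.5, 2.0, 5.2, 30.0])      # observed mixture spectrum
S = np.array([True, False, True, False])  # channels 1, 3 dominated by the other source

# Masking the corrupted channels rescues the match to the source model
assert md_loglik(y, S, mu, var) > md_loglik(y, np.ones(4, bool), mu, var)
```

The joint search over (M, S) then compares such masked scores across candidate segregations, weighted by P(S | Y).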

SLIDE 22

CASA in the Fragment Decoder

  • CASA can help search

consider only segregations made from CASA chunks

  • CASA can rate segregation

construct P(S|Y) to reward CASA qualities:

frequency proximity, harmonicity, common onset

P(M, S | Y) = P(M) · ∫ [ P(X | M) P(X | Y, S) / P(X) ] dX · P(S | Y)

SLIDE 23

(Pitch) Factored Dictionaries

  • Separate representations for

“source” (pitch) and “filter”

N·M codewords from N+M entries
but: overgeneration...

  • Faster search

direct extraction of pitches
immediate separation of (most of) spectra
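The N·M-from-N+M economy is just an outer combination of two small dictionaries. A toy sketch (sizes, candidate pitches, and envelopes are all invented): harmonic “source” combs crossed with smooth “filter” envelopes.

```python
import numpy as np

N, M, F = 4, 3, 64                        # pitches, envelopes, frequency bins
rng = np.random.default_rng(1)

# "Source" dictionary: harmonic combs at N candidate pitches (toy line spectra)
bins = np.arange(1, F + 1)
excitation = np.array([(bins % p == 0).astype(float) for p in (4, 5, 6, 7)])

# "Filter" dictionary: M smooth spectral envelopes
envelope = np.abs(rng.standard_normal((M, F)))
envelope = np.array([np.convolve(e, np.ones(9) / 9, mode="same") for e in envelope])

# Factored model: N*M composite spectra from only N+M stored entries
composite = excitation[:, None, :] * envelope[None, :, :]
assert composite.shape == (N, M, F)       # 12 spectra from 7 dictionary entries
```

The flip side is the overgeneration the slide warns about: not every (pitch, envelope) pair is a spectrum the source could actually produce.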


Gandhi & Hasegawa-Johnson ’04; Radfar et al. ’06

SLIDE 24

Discriminant Models for Music

  • Transcribe piano recordings by classification

train SVM detectors for every piano note
88 separate detectors, independent smoothing

  • Trained on player piano recordings
  • Can resynthesize from transcript...
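The “one independent detector per note” design can be sketched on synthetic data. Everything below is invented for illustration (toy spectra, 4 notes instead of 88), and a ridge least-squares classifier stands in for the paper's SVMs so the sketch has no dependencies beyond numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n_notes, n_frames, n_bins = 4, 200, 32

# Synthetic "piano": each note excites its own band of the spectrum
W_true = np.zeros((n_notes, n_bins))
for k in range(n_notes):
    W_true[k, 8 * k: 8 * (k + 1)] = 1.0
roll = (rng.random((n_frames, n_notes)) < 0.2).astype(float)   # piano-roll labels
X = roll @ W_true + 0.1 * rng.standard_normal((n_frames, n_bins))

# One independent binary detector per note (ridge least squares standing in
# for SVMs); each column of W is trained against its own note's labels only.
Xb = np.hstack([X, np.ones((n_frames, 1))])
W = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(n_bins + 1),
                    Xb.T @ (2 * roll - 1))

pred = (Xb @ W) > 0                        # frame-level transcription
acc = float((pred == roll.astype(bool)).mean())
assert acc > 0.95
```

The design choice worth noting: detectors are trained and run independently per note, with any temporal smoothing applied afterwards, rather than modeling note interactions jointly.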

[Figure: transcription of Bach 847 on a Disklavier — level / dB over time / sec vs. pitch (A1–A6)]

Poliner & Ellis ’06

SLIDE 25

Piano Transcription Results

  • Significant improvement from classifier:

frame-level accuracy results; breakdown by frame type:

Algorithm            Errs    False Pos  False Neg  d′
SVM                  43.3%   27.9%      15.4%      3.44
Klapuri & Ryynänen   66.6%   28.1%      38.5%      2.71
Marolt               84.6%   36.5%      48.1%      2.35

[Plot: classification error % vs. number of notes present, split into false negatives and false positives]
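The d′ column is the standard detection-theory sensitivity index. A small helper (the example rates below are invented; the table's own d′ values depend on the evaluation's frame-level hit and false-alarm definitions, so they are not reproduced here):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = Z(hit rate) - Z(false-alarm rate), Z = inverse normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Higher hits and lower false alarms give larger d'
assert d_prime(0.99, 0.01) > d_prime(0.85, 0.28)
```

d′ is useful here precisely because raw error percentages conflate sensitivity with the detectors' operating points.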

SLIDE 26

Outline

  • 1. Mixtures & Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization
  • 4. Research Questions

Task and Evaluation
Generic vs. Specific


SLIDE 27

Task & Evaluation

  • How to measure separation performance?

depends what you are trying to do

  • SNR?

energy (and distortions) are not created equal
different nonlinear components [Vincent et al. ’06]

  • Human Intelligibility?

rare for nonlinear processing to improve intelligibility
listening tests expensive

  • ASR performance?

separate-then-recognize too simplistic; ASR needs to accommodate separation
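The plainest SNR measure treats every error sample identically — exactly the weakness the SNR bullet notes. A sketch (test signal invented; Vincent et al.'s evaluation instead decomposes the error into interference and artefact components before scoring):

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain SNR: reference energy over residual-error energy, in dB."""
    err = estimate - reference
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2)))

s = np.sin(np.linspace(0.0, 20.0, 1000))
assert abs(snr_db(s, 1.1 * s) - 20.0) < 1e-6   # a 10% gain error scores 20 dB
```

Note the failure mode: a benign 10% gain error and 20 dB of perceptually ugly musical-noise artefacts would score identically here, which is why the decomposed measures exist.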

[Plot: net effect vs. aggressiveness of processing — reduced interference vs. increased artefacts and transmission errors; the net effect peaks at an optimum]

SLIDE 28

How Many Models?

  • More specific models ➝ better separation

need individual dictionaries for “everything”??

  • Model adaptation and hierarchy

speaker-adapted models: base + parameters
extrapolation beyond normal
generic-specific: pitched ➝ piano ➝ this piano

  • Time scales of model acquisition

innate/evolutionary (hair-cell tuning)
developmental (mother-tongue phones)
dynamic – the “slung mugs” effect; Ozerov

[Plot (Smith, Patterson et al. ’05): mean pitch / Hz vs. VT length ratio]

SLIDE 29

Summary & Conclusions

  • Listeners do well separating sound mixtures

using signal cues (location, periodicity)
using source-property variations

  • Machines do less well

difficult to apply enough constraints
need to exploit signal detail

  • Models capture constraints

learn from the real world
adapt to sources

  • Separation feasible only sometimes

describing source properties is easier
