
Sound Organization By Source Models in Humans and Machines
Dan Ellis, Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu · http://labrosa.ee.columbia.edu/


SLIDE 1

Sound Organization by Models - Dan Ellis, 2006-12-09 (29 slides)

  • 1. Mixtures and Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization
  • 4. Research Questions

Sound Organization By Source Models in Humans and Machines

Dan Ellis

Laboratory for Recognition and Organization of Speech and Audio

  • Dept. Electrical Eng., Columbia Univ., NY USA

dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/

SLIDE 2

The Problem of Mixtures

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)

  • Received waveform is a mixture
  • Received waveform is a mixture

2 sensors, N sources - underconstrained

  • Undoing mixtures: hearing’s primary goal?

.. by any means available


SLIDE 3

Sound Organization Scenarios

  • Interactive voice systems

human-level understanding is expected

  • Speech prostheses

crowds: #1 complaint of hearing aid users

  • Archive analysis

identifying and isolating sound events

  • Unmixing/remixing/enhancement...

[Figure: spectrogram of recording dpwe-2004-09-10-13:15:40]

Sound example: pa-2004-09-10-131540.wav

SLIDE 4

How Can We Separate?

  • By between-sensor differences (spatial cues)

‘steer a null’ onto a compact interfering source
the filtering/signal processing paradigm

  • By finding a ‘separable representation’

spectral? sources are broadband but sparse
periodicity? maybe – for pitched sounds
something more signal-specific...

  • By inference (based on knowledge/models)

acoustic sources are redundant → use part to guess the remainder

  • limited possible solutions
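The ‘steer a null’ idea in the first bullet can be sketched numerically. A minimal sketch, assuming an idealized two-sensor setup (signals, geometry, and the 3-sample interferer delay are all invented for illustration): delay one sensor so the interferer's components align, then subtract.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
s_target = rng.standard_normal(n)   # target: same delay at both sensors (broadside)
s_interf = rng.standard_normal(n)   # compact interfering source
d = 3                               # interferer arrives d samples later at sensor 2

# Two-sensor mixtures
x1 = s_target + s_interf
x2 = s_target.copy()
x2[d:] += s_interf[:-d]

# 'Steer a null': advance sensor 2 by d samples and subtract, so the
# interferer components align and cancel exactly.
y = x1[:-d] - x2[d:]

# The interferer is gone; the target survives, but comb-filtered:
assert np.allclose(y, s_target[:-d] - s_target[d:])
```

Note the trade-off the assertion exposes: the null removes the interferer exactly, but the target is not untouched; real beamformers balance null depth against target distortion.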


SLIDE 5

Separation vs. Inference

  • Ideal separation is rarely possible

i.e. no projection can completely remove overlaps

  • Overlaps → Ambiguity

scene analysis = find “most reasonable” explanation

  • Ambiguity can be expressed probabilistically

i.e. posteriors of sources {Si} given observations X:

P({Si} | X) ∝ P(X | {Si}) · P({Si})

  • Better source models → better inference

.. learn from examples?
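The posterior rule above can be exercised on a toy discrete case (templates, priors, and noise level below are all invented): each source picks one of two spectral templates, the mixture is their sum plus Gaussian noise, and the posterior is computed by enumeration. The example is rigged so two template pairs explain the data equally well, and the priors break the tie — the “most reasonable explanation” idea in miniature.

```python
import numpy as np

# Toy source models: two spectral templates each, with priors
A = np.array([[1.0, 0.0], [0.0, 1.0]])   # source-1 templates (rows)
B = np.array([[0.0, 1.0], [1.0, 0.0]])   # source-2 templates
pA = np.array([0.5, 0.5])
pB = np.array([0.7, 0.3])
sigma = 0.1

x = np.array([1.0, 1.0])                 # observed mixture: deliberately ambiguous

# P({Si} | X) ∝ P(X | {Si}) P({Si}), enumerated over joint template choices
post = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        lik = np.exp(-np.sum((x - (A[i] + B[j])) ** 2) / (2 * sigma ** 2))
        post[i, j] = lik * pA[i] * pB[j]
post /= post.sum()

# Pairs (0,0) and (1,1) both reconstruct x exactly; the prior on source 2 decides
assert np.argmax(post) == 0
assert np.isclose(post[0, 0] / post[1, 1], 0.7 / 0.3)
```

Better source models sharpen the likelihood term; better priors sharpen the tie-breaking — both improve the inference.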


SLIDE 6

A Simple Example

  • Source models are codebooks

from separate subspaces


SLIDE 7

A Slightly Less Simple Example

  • Sources with Markov transitions
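Markov transitions turn per-frame codeword choice into a joint path search. A minimal sketch (codewords, transition matrices, and observations are invented): brute-force Viterbi over paired states (i1, i2), where one source's chain is “sticky” and the other switches freely, so the decoder attributes the changes to the free source.

```python
import numpy as np

K = 2
c = np.array([0.0, 1.0])                  # per-state codewords (scalar spectra)
T1 = np.array([[0.9, 0.1], [0.1, 0.9]])   # source 1: sticky Markov chain
T2 = np.array([[0.5, 0.5], [0.5, 0.5]])   # source 2: switches freely
sigma = 0.2

obs = np.array([0.0, 1.0, 1.0, 2.0])      # mixture x(t) ≈ c[i1] + c[i2]

def loglik(x, i, j):
    return -((x - (c[i] + c[j])) ** 2) / (2 * sigma ** 2)

# Viterbi over the joint (factorial) state space: transitions factorize
states = [(i, j) for i in range(K) for j in range(K)]
delta = np.array([loglik(obs[0], i, j) for i, j in states])
backptr = []
for x in obs[1:]:
    step = np.empty(len(states))
    ptr = np.empty(len(states), dtype=int)
    for s, (i, j) in enumerate(states):
        trans = [delta[p] + np.log(T1[pi, i]) + np.log(T2[pj, j])
                 for p, (pi, pj) in enumerate(states)]
        ptr[s] = int(np.argmax(trans))
        step[s] = trans[ptr[s]] + loglik(x, i, j)
    backptr.append(ptr)
    delta = step

best = [int(np.argmax(delta))]
for ptr in reversed(backptr):
    best.append(int(ptr[best[-1]]))
path = [states[s] for s in reversed(best)]

# The sticky source 1 holds its state while source 2 explains the changes
assert path == [(0, 0), (0, 1), (0, 1), (1, 1)]
```

The joint state space grows as K² here (K^N for N sources), which is exactly the tractability pressure the later “Source Model Issues” slide raises.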


SLIDE 8

What is a Source Model?

  • Source Model describes signal behavior

encapsulates constraints on form of signal (any such constraint can be seen as a model...)

  • A model has parameters

model + parameters → instance

  • What is not a source model?

detail not provided in instance, e.g. using phase from original mixture
constraints on interaction between sources, e.g. independence, clustering attributes

[Figure: excitation source g[n] driving a resonance filter H(e^jω)]

SLIDE 9

Outline

  • 1. Mixtures and Models
  • 2. Human Sound Organization

Auditory Scene Analysis
Using source characteristics
Illusions

  • 3. Machine Sound Organization
  • 4. Research Questions


SLIDE 10

Auditory Scene Analysis

  • How do people analyze sound mixtures?

break mixture into small elements (in time-freq)
elements are grouped into sources using cues
sources have aggregate attributes

  • Grouping rules (Darwin, Carlyon, ...):

cues: common onset/modulation, harmonicity, ...

  • Also learned “schema” (for speech etc.)

[Diagram: frequency analysis → sound elements → grouping mechanism (onset map, harmonicity map, spatial map) → sources and source properties]

(after Darwin 1996)

Bregman’90 Darwin & Carlyon’95

SLIDE 11

Perceiving Sources

  • Harmonics distinct in ear, but perceived as one source (“fused”):

depends on common onset
depends on harmonics

  • Experimental techniques

ask subjects “how many”
match attributes, e.g. pitch, vowel identity
brain recordings (EEG “mismatch negativity”)


SLIDE 12

Auditory “Illusions”

  • How do we explain illusions?

pulsation threshold
sinewave speech
phonemic restoration

  • Something is providing the missing (illusory) pieces ... source models


SLIDE 13

Human Speech Separation

  • Task: Coordinate Response Measure

“Ready Baron go to green eight now”
256 variants, 16 speakers
correct = color and number for “Baron”

  • Accuracy as a function of spatial separation:

[Plot: accuracy vs. spatial separation; A, B = same speaker; note range effect]

Brungart et al. ’02

Sound example: crm-11737+16515.wav

SLIDE 14

Separation by Vocal Differences

  • CRM varying the level and voice character

energetic vs. informational masking
more than pitch ... source models

Brungart et al.’01 (same spatial location)

SLIDE 15

Outline

  • 1. Mixtures and Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization

Computational Auditory Scene Analysis
Dictionary Source Models

  • 4. Research Questions


SLIDE 16

Source Model Issues

  • Domain

parsimonious expression of constraints
nice combination physics

  • Tractability

size of search space
tricks to speed search/inference

  • Acquisition

hand-designed vs. learned
static vs. short-term

  • Factorization

independent aspects
hierarchy & specificity

SLIDE 17

Computational Auditory Scene Analysis

  • Central idea:

Segment time-frequency into sources based on perceptual grouping cues

... principal cue is harmonicity


Brown & Cooke’94 Okuno et al.’99 Hu & Wang’04 ...

[System diagram: input mixture → front end (signal feature maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups; segment, then group]

SLIDE 18

CASA limitations

  • Limitations of T-F masking

cannot undo overlaps – leaves gaps

  • Driven by local features

limited model scope ➝ no inference or illusions

  • Does not learn from data


from Hu & Wang ’04

Sound example: huwang-v3n7.wav

SLIDE 19

Basic Dictionary Models

  • Given models for sources,

find “best” (most likely) states for spectra:

can include sequential constraints...
different domains for combining codewords c and defining the likelihood

  • E.g. stationary noise:


{i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)

p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)   (combination)

→ model inference of source state

[Spectrograms (time / s vs. freq / mel bin): original speech; speech in noise; VQ-inferred states]

Roweis ’01, ’03; Kristjansson ’04, ’06

SLIDE 20

Deeper Models: Iroquois

  • Optimal inference on mixed spectra

speaker-specific models (e.g. 512-mixture GMM)
Algonquin inference

  • .. for Speech Separation Challenge (Cooke/Lee’06)

exploit grammar constraints - higher-level dynamics


Kristjansson, Hershey et al. ’06

[Plot: noisy vector, clean vector, and estimate of clean vector (energy / dB vs. frequency / Hz)]

[Chart: WER (%) for same-gender speakers at SNRs from 6 dB down to −9 dB: SDL recognizer with no dynamics, acoustic dynamics, grammar dynamics, vs. human listeners]

SLIDE 21

Faster Search: Fragment Decoder

  • Match ‘uncorrupt’ spectrum to ASR models using missing data recognition

easy if you know the segregation mask S

  • Joint search for model M and segregation S

to maximize:

Barker et al. ’05

[Diagram: observation Y(f), segregation mask S, source X(f)]

P(M, S | Y) = P(M) · ∫ [ P(X | M) P(X | Y, S) / P(X) ] dX · P(S | Y)

(the factor with the integral: isolated source model; P(S | Y): segregation model)
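The core of missing-data recognition is scoring a model only on the channels the segregation mask marks reliable. A minimal sketch (model means, variances, observation, and mask are invented; a real decoder would also bound the masked channels rather than simply drop them):

```python
import numpy as np

def md_loglik(y, mask, mu, var):
    """Diagonal-Gaussian log-likelihood over the reliable channels only."""
    r, m, v = y[mask], mu[mask], var[mask]
    return float(np.sum(-0.5 * np.log(2 * np.pi * v) - (r - m) ** 2 / (2 * v)))

mu = np.array([10.0, 20.0, 5.0, 15.0])    # source model mean (log-energies)
var = np.full(4, 4.0)
y = np.array([10.5, 2.0, 5.2, 30.0])      # observed mixture spectrum
S = np.array([True, False, True, False])  # channels 1, 3 dominated by the other source

# Masking the corrupted channels rescues the match to the source model
assert md_loglik(y, S, mu, var) > md_loglik(y, np.ones(4, bool), mu, var)
```

The joint search over (M, S) then compares such masked scores across candidate segregations, weighted by P(S | Y).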

SLIDE 22

CASA in the Fragment Decoder

  • CASA can help search

consider only segregations made from CASA chunks

  • CASA can rate segregation

construct P(S|Y) to reward CASA qualities:

frequency proximity, harmonicity, common onset

P(M, S | Y) = P(M) · ∫ [ P(X | M) P(X | Y, S) / P(X) ] dX · P(S | Y)

SLIDE 23

(Pitch) Factored Dictionaries

  • Separate representations for

“source” (pitch) and “filter”

N·M codewords from N+M entries
but: overgeneration...

  • Faster search

direct extraction of pitches
immediate separation of (most of) spectra
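The N·M-from-N+M economy is just an outer combination of two small dictionaries. A toy sketch (sizes, candidate pitches, and envelopes are all invented): harmonic “source” combs crossed with smooth “filter” envelopes.

```python
import numpy as np

N, M, F = 4, 3, 64                        # pitches, envelopes, frequency bins
rng = np.random.default_rng(1)

# "Source" dictionary: harmonic combs at N candidate pitches (toy line spectra)
bins = np.arange(1, F + 1)
excitation = np.array([(bins % p == 0).astype(float) for p in (4, 5, 6, 7)])

# "Filter" dictionary: M smooth spectral envelopes
envelope = np.abs(rng.standard_normal((M, F)))
envelope = np.array([np.convolve(e, np.ones(9) / 9, mode="same") for e in envelope])

# Factored model: N*M composite spectra from only N+M stored entries
composite = excitation[:, None, :] * envelope[None, :, :]
assert composite.shape == (N, M, F)       # 12 spectra from 7 dictionary entries
```

The flip side is the overgeneration the slide warns about: not every (pitch, envelope) pair is a spectrum the source could actually produce.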


Gandhi & Hasegawa-Johnson ’04; Radfar et al. ’06

SLIDE 24

Discriminant Models for Music

  • Transcribe piano recordings by classification

train SVM detectors for every piano note
88 separate detectors, independent smoothing

  • Trained on player piano recordings
  • Can resynthesize from transcript...
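The “one independent detector per note” design can be sketched on synthetic data. Everything below is invented for illustration (toy spectra, 4 notes instead of 88), and a ridge least-squares classifier stands in for the paper's SVMs so the sketch has no dependencies beyond numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n_notes, n_frames, n_bins = 4, 200, 32

# Synthetic "piano": each note excites its own band of the spectrum
W_true = np.zeros((n_notes, n_bins))
for k in range(n_notes):
    W_true[k, 8 * k: 8 * (k + 1)] = 1.0
roll = (rng.random((n_frames, n_notes)) < 0.2).astype(float)   # piano-roll labels
X = roll @ W_true + 0.1 * rng.standard_normal((n_frames, n_bins))

# One independent binary detector per note (ridge least squares standing in
# for SVMs); each column of W is trained against its own note's labels only.
Xb = np.hstack([X, np.ones((n_frames, 1))])
W = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(n_bins + 1),
                    Xb.T @ (2 * roll - 1))

pred = (Xb @ W) > 0                        # frame-level transcription
acc = float((pred == roll.astype(bool)).mean())
assert acc > 0.95
```

The design choice worth noting: detectors are trained and run independently per note, with any temporal smoothing applied afterwards, rather than modeling note interactions jointly.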

[Figure: transcription of Bach 847 on a Disklavier — level / dB over time / sec vs. pitch (A1–A6)]

Poliner & Ellis ’06

SLIDE 25

Piano Transcription Results

  • Significant improvement from classifier:

frame-level accuracy results; breakdown by frame type:

Algorithm            Errs    False Pos  False Neg  d′
SVM                  43.3%   27.9%      15.4%      3.44
Klapuri & Ryynänen   66.6%   28.1%      38.5%      2.71
Marolt               84.6%   36.5%      48.1%      2.35

[Plot: classification error % vs. number of notes present, split into false negatives and false positives]
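The d′ column is the standard detection-theory sensitivity index. A small helper (the example rates below are invented; the table's own d′ values depend on the evaluation's frame-level hit and false-alarm definitions, so they are not reproduced here):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = Z(hit rate) - Z(false-alarm rate), Z = inverse normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Higher hits and lower false alarms give larger d'
assert d_prime(0.99, 0.01) > d_prime(0.85, 0.28)
```

d′ is useful here precisely because raw error percentages conflate sensitivity with the detectors' operating points.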

SLIDE 26

Outline

  • 1. Mixtures & Models
  • 2. Human Sound Organization
  • 3. Machine Sound Organization
  • 4. Research Questions

Task and Evaluation
Generic vs. Specific


SLIDE 27

Task & Evaluation

  • How to measure separation performance?

depends what you are trying to do

  • SNR?

energy (and distortions) are not created equal
different nonlinear components [Vincent et al. ’06]

  • Human Intelligibility?

rare for nonlinear processing to improve intelligibility
listening tests expensive

  • ASR performance?

separate-then-recognize too simplistic; ASR needs to accommodate separation
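The plainest SNR measure treats every error sample identically — exactly the weakness the SNR bullet notes. A sketch (test signal invented; Vincent et al.'s evaluation instead decomposes the error into interference and artefact components before scoring):

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain SNR: reference energy over residual-error energy, in dB."""
    err = estimate - reference
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2)))

s = np.sin(np.linspace(0.0, 20.0, 1000))
assert abs(snr_db(s, 1.1 * s) - 20.0) < 1e-6   # a 10% gain error scores 20 dB
```

Note the failure mode: a benign 10% gain error and 20 dB of perceptually ugly musical-noise artefacts would score identically here, which is why the decomposed measures exist.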

[Plot: net effect vs. aggressiveness of processing — reduced interference vs. increased artefacts and transmission errors; the net effect peaks at an optimum]

SLIDE 28

How Many Models?

  • More specific models ➝ better separation

need individual dictionaries for “everything”??

  • Model adaptation and hierarchy

speaker-adapted models: base + parameters
extrapolation beyond normal
generic-specific: pitched ➝ piano ➝ this piano

  • Time scales of model acquisition

innate/evolutionary (hair-cell tuning)
developmental (mother-tongue phones)
dynamic – the “slung mugs” effect; Ozerov

[Plot (Smith, Patterson et al. ’05): mean pitch / Hz vs. VT length ratio]

SLIDE 29

Summary & Conclusions

  • Listeners do well separating sound mixtures

using signal cues (location, periodicity)
using source-property variations

  • Machines do less well

difficult to apply enough constraints
need to exploit signal detail

  • Models capture constraints

learn from the real world
adapt to sources

  • Separation feasible only sometimes

describing source properties is easier
