SLIDE 1

Machine Listening in Complex Environments

Some challenges in understanding musical and environmental sounds

Mathieu Lagrange

June 25, 2013

SLIDE 2

Music Human Perception Melody Enhancement Scattering Scene Synthesis

Outline

Motivation

Let humans access audio data in a way that makes sense for them

Means

Explore different means of representing sound to quantify the notion of resemblance between sounds as experienced by humans:

in musical corpora
for environmental sounds

2 / 28

SLIDE 4

Listening in a Complex Environment

semantic representations:
human perception processes: is there a need for segregating elements of interest?
mathematical representation: what models can be relevant for representing complex scenes in a generic way?
computational issues: is it meaningful to evaluate computational systems using artificial data?

3 / 28

SLIDE 8

Music Information Retrieval (MIR)

As in every multimedia retrieval task, the main issue is to bridge the semantic gap. Depending on the data at hand, the difficulty of the task ranges from impossible to barely feasible:

1 raw data (signal)
2 metadata (tags: genre)
3 user ratings (likes)

4 / 28

SLIDE 11

Content-based Similarity in Music

[Figure labels: Fingerprint, Cover, Similarity, Signal, Chords, Tag, User]

5 / 28

SLIDE 12

Can we break the glass ceiling in MIR?

One way is to exploit an important property: music is usually polyphonic. Let us focus on what is important in this polyphony, at a low auditory level [Mesgarani & al Nature'12].

Since multi-track recordings are usually not available, one has to resort to approximate solutions, based on enhancing the sources of interest within the mixed-down versions. The expected performance gain is far from being achieved with state-of-the-art approaches, though.

6 / 28

SLIDE 17

We do segregate, right?

In front of a complex scene, we are remarkably efficient at focusing on a source of interest. But are we actually performing the segregation at a low level (i.e. at the spectrogram level)? Apparently, we do ;-)

7 / 28

SLIDE 19

Mesgarani & al [Nature'12]

Cortical activity was recorded from human subjects implanted with customized high-density multi-electrode arrays as part of their clinical work-up for epilepsy surgery.

Subjects listened to speech samples from a corpus commonly used in multi-talker communication research. A typical sentence was "ready tiger go to red two now", where "tiger" is the call sign and "red two" is the colour-number combination.

The method of stimulus reconstruction was used to estimate the speech spectrogram represented by the population neural responses.

8 / 28

SLIDE 22

Mesgarani & al [Nature'12]

9 / 28

SLIDE 24

Why is source separation for music analysis not trivial?

The enhancement process must operate with limited prior knowledge about the properties of the specific parts to be enhanced. Distortions inevitably remain and propagate to the subsequent feature extraction and classification stages. To reduce their impact, one needs to either

design features that are robust, or
use standard features and estimate whether they are relevant or not by considering a notion of uncertainty.

Uncertainty can be

binary (missing feature theory [Eggink2003])
Gaussian [Droppo2005].
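
The difference between the two kinds of uncertainty can be sketched on a single diagonal Gaussian. This is an illustration only, not the talk's Matlab code: the function names are hypothetical, and the model is assumed to have independent dimensions. With binary uncertainty, unreliable dimensions are simply left out of the likelihood; with Gaussian uncertainty, each feature carries its own variance, which is added to the model variance.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def loglik_binary(features, reliable, means, variances):
    """Binary uncertainty: dimensions flagged as unreliable are skipped
    (marginalized out); reliable ones are scored as usual."""
    return sum(log_gaussian(x, m, v)
               for x, r, m, v in zip(features, reliable, means, variances) if r)

def loglik_gaussian(features, uncertainties, means, variances):
    """Gaussian uncertainty: the per-feature uncertainty variance is added
    to the model variance, so doubtful features are down-weighted smoothly."""
    return sum(log_gaussian(x, m, v + u)
               for x, u, m, v in zip(features, uncertainties, means, variances))

# Toy example (assumed values): the second feature is badly distorted.
means, variances = [0.0, 0.0], [1.0, 1.0]
observed = [0.1, 10.0]
print(loglik_gaussian(observed, [0.0, 0.0], means, variances))    # trusts everything
print(loglik_gaussian(observed, [0.0, 100.0], means, variances))  # soft down-weighting
print(loglik_binary(observed, [True, False], means, variances))   # hard removal
```

For a GMM, the same variance augmentation would be applied inside each mixture component.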

10 / 28

SLIDE 28

Contributions

1 Promote the use of Gaussian uncertainty instead of binary uncertainty for robust classification in the field of MIR.
2 Use a procedure for Gaussian uncertainty estimation that is fully automatic and that can be derived for any feature.
3 Learn classifiers directly from noisy data with Gaussian uncertainty.
4 Matlab source code for GMM decoding and learning is available at http://bass-db.gforge.inria.fr/amulet.

11 / 28

SLIDE 30

The standard MFCC / GMM classification scheme.

Learning: Mixture -> MFCC computation -> Model learning -> GMM
Decoding: Mixture -> MFCC computation -> Likelihood computation (with the learned GMM) -> Likelihood
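
The decoding branch of this scheme can be sketched as follows. This is a minimal illustration with hypothetical names, assuming the MFCC frames are already computed, each class has a diagonal-covariance GMM with known parameters, and frames are treated as independent (the usual bag-of-frames assumption). Model learning (e.g. by EM) is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as a list of
    (weight, mean, var) components, computed with log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify(frames, models):
    """Sum frame log-likelihoods per class and return the best class."""
    scores = {label: sum(gmm_frame_loglik(f, gmm) for f in frames)
              for label, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy two-dimensional "MFCC" frames and two made-up class models.
models = {
    "singer_a": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "singer_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}
frames = [[4.8, 5.1], [7.9, 8.2], [5.3, 4.9]]
print(classify(frames, models))
```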

12 / 28

SLIDE 31

Considering melody enhancement as a pre-processing step

Learning: Mixture -> Melody enhancement -> MFCC computation -> Model learning -> GMM
Decoding: Mixture -> Melody enhancement -> MFCC computation -> Likelihood computation (with the learned GMM) -> Likelihood

14 / 28

SLIDE 33

Proposed Approach with melody enhancement and Gaussian uncertainty

Learning: Mixture -> Melody enhancement -> MFCC computation -> Model learning -> GMM
Decoding: Mixture -> Melody enhancement -> MFCC computation -> Likelihood computation (with the learned GMM) -> Likelihood

15 / 28

SLIDE 34

Results

Accuracy (%) per 10 sec. singing segment

input         Fold 1  Fold 2  Fold 3  Fold 4  Total
mix             51      53      55      38      49
v-sep           60      63      53      43      55
v-sep-uncrt     71      72      84      83      77

16 / 28

SLIDE 35

Future work

larger-scale experiments
current experiments on orca whale calls

17 / 28

SLIDE 36

What is a good representation of sounds?

Seek invariance

in time
in frequency

Seek compactness

18 / 28

SLIDE 41

Goals

Need an alternative to the accumulation of heterogeneous features.
Need to model the spectral and temporal characteristics of sound that are perceptually meaningful.

19 / 28

SLIDE 42

The scattering in a nutshell [Anden11]

The scattering has relevant properties:

local time-shift invariance, while keeping the amplitude modulations
a logarithmic filter bank, which stabilizes the high frequencies

The scattering operator:

at order 1, the Cosine Log Scattering is equivalent to the MFCCs
at order 2, it nicely captures temporal modulations
higher orders do not capture a lot of energy, but may be interesting to study
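
The first two orders can be illustrated with a deliberately simplified sketch. It is not the implementation referenced in [Anden11]: it assumes complex Gabor filters as stand-ins for wavelets, circular convolution, and a global time average instead of a local low-pass filter. All names and parameter values are hypothetical. Order 1 averages the modulus of each band; order 2 filters that modulus again to pick up amplitude modulations.

```python
import cmath
import math

def gabor(center_freq, q=4.0):
    """Complex Gabor filter whose width scales as q / center_freq,
    sampled on -3*sigma..3*sigma (a simple stand-in for a wavelet)."""
    sigma = q / center_freq
    half = int(3 * sigma)
    return [math.exp(-t * t / (2 * sigma * sigma)) * cmath.exp(1j * center_freq * t)
            for t in range(-half, half + 1)]

def modulus_conv(x, filt):
    """Modulus of the circular convolution of x with a centred filter."""
    n, half = len(x), len(filt) // 2
    return [abs(sum(x[(i - k + half) % n] * filt[k] for k in range(len(filt))))
            for i in range(n)]

def scattering(x,
               freqs1=(math.pi / 2, math.pi / 4, math.pi / 8),
               freqs2=(math.pi / 16, math.pi / 32)):
    """Return (order-1, order-2) coefficients, each averaged over time."""
    s1, s2 = [], []
    for f1 in freqs1:
        u1 = modulus_conv(x, gabor(f1))            # envelope of the band at f1
        s1.append(sum(u1) / len(u1))
        for f2 in freqs2:
            if f2 < f1:                             # modulations slower than the carrier
                u2 = modulus_conv(u1, gabor(f2))
                s2.append(sum(u2) / len(u2))
    return s1, s2

n = 128
tone = [math.cos(math.pi / 4 * t) for t in range(n)]
am = [(1 + 0.5 * math.cos(math.pi / 16 * t)) * s for t, s in enumerate(tone)]
shifted = am[5:] + am[:5]                           # circular time shift

print(scattering(am)[0], scattering(shifted)[0])    # order 1 is unchanged by the shift
print(sum(scattering(tone)[1]), sum(scattering(am)[1]))  # order 2 reacts to the modulation
```

With a global average the invariance to circular shifts is exact; a real scattering transform averages over a finite window, so the invariance is only local, as stated on the slide.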

20 / 28

SLIDE 47

MDS visualization on the Houix Database

Human similarity judgment (MAP=94%)
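
The MAP figures quoted on these slides are mean average precision scores. The exact evaluation protocol is not given in the slides, so the following is only one plausible reading: each sound is used in turn as a query, the others are ranked by increasing distance, and items sharing the query's class count as relevant. The function names are hypothetical.

```python
def average_precision(ranked_labels, query_label):
    """Average of the precision values measured at each relevant item."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(distances, labels):
    """distances is a square matrix (list of lists); labels gives one class
    per item. Every item is used once as the query."""
    scores = []
    for q in range(len(labels)):
        ranking = sorted((i for i in range(len(labels)) if i != q),
                         key=lambda i: distances[q][i])
        scores.append(average_precision([labels[i] for i in ranking], labels[q]))
    return sum(scores) / len(scores)

# Toy example: two well-separated classes.
labels = ["door", "door", "water", "water"]
distances = [[0, 1, 5, 5],
             [1, 0, 5, 5],
             [5, 5, 0, 1],
             [5, 5, 1, 0]]
print(mean_average_precision(distances, labels))
```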

21 / 28

SLIDE 48

MDS visualization on the Houix Database

Dynamic Time Warping (DTW) (MAP=55.4%)
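
For reference, a textbook dynamic time warping distance between two sequences can be sketched as below. The slides do not say which features or local cost were used, so scalar frames and an absolute-difference cost are assumptions made for the example.

```python
def dtw_distance(a, b, cost=lambda u, v: abs(u - v)):
    """Minimal cumulative cost of aligning sequence a with sequence b,
    allowing frames to be repeated (stretched) on either side."""
    inf = float("inf")
    n, m = len(a), len(b)
    # acc[i][j] holds the best cost of aligning a[:i] with b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost(a[i - 1], b[j - 1]) + step
    return acc[n][m]

# The same contour played at two speeds aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

For multi-dimensional frames, a different `cost` function (for instance a Euclidean distance between vectors) can be passed in.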

21 / 28

SLIDE 49

MDS visualization on the Houix Database

l1-norm of order 2 CLS Coefficients (MAP=56.7%)

21 / 28

SLIDE 50

MDS visualization on the Houix Database

Cosine distance of order 2 Separable Scattering Coefficients (MAP=59%)
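
The l1 and cosine distances named on these slides behave differently under a change of overall level, which the following sketch makes concrete. The vectors are made-up stand-ins for coefficient vectors, and nothing here is meant to explain the reported MAP values.

```python
import math

def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

quiet = [1.0, 2.0, 3.0]
loud = [2.0, 4.0, 6.0]   # same profile, doubled in level
print(l1_distance(quiet, loud))       # sensitive to the gain change
print(cosine_distance(quiet, loud))   # close to zero: only the shape matters
```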

21 / 28

SLIDE 51

Take home message on the scattering

The scattering operator

has strong mathematical grounding
shows strong resemblances with models derived from neurophysiological studies (Torsten Dau, Shihab Shamma)

Challenges for retrieval purposes

dimensionality
larger time scales: model long-term complex modulations due to articulation
separability (as by segregation)

22 / 28

SLIDE 52

Synthesizing Environmental Scenes

Automatically analyzing environmental scenes has interesting applications and is useful for research purposes. That said, evaluation data is still relatively scarce and does not offer any kind of flexibility in terms of controlling important characteristics such as

diversity of sources
degree of overlap
background level
...
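
To make two of these controls concrete, here is a small sketch of how a synthetic scene could expose them. It is not the tool presented in the talk: the function names, the event-to-background ratio parameter `ebr_db` and the overlap measure are all assumptions made for illustration, and events are assumed to be non-silent.

```python
import math

def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_scene(background, events, ebr_db):
    """Add each (onset, samples) event to the background, scaled so that
    its RMS level sits ebr_db decibels above the background RMS level."""
    scene = list(background)
    target = rms(background) * 10 ** (ebr_db / 20)
    for onset, samples in events:
        gain = target / rms(samples)
        for i, s in enumerate(samples):
            if onset + i < len(scene):
                scene[onset + i] += gain * s
    return scene

def overlap_ratio(events, length):
    """Fraction of event-active samples where at least two events overlap."""
    active = [0] * length
    for onset, samples in events:
        for i in range(onset, min(onset + len(samples), length)):
            active[i] += 1
    busy = sum(1 for c in active if c >= 1)
    return sum(1 for c in active if c >= 2) / busy if busy else 0.0

background = [0.1] * 8
events = [(0, [1.0, -1.0, 1.0, -1.0]), (2, [1.0, -1.0, 1.0, -1.0])]
print(overlap_ratio(events, len(background)))
print(mix_scene(background, events, ebr_db=6.0))
```

Diversity of sources would then be a matter of how the event samples are drawn, which this sketch leaves out.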

23 / 28

SLIDE 54

Proposal: build a synthesizer

Design choices

open source
communicate with freesound.org
full Web Audio interface
Matlab frontend

24 / 28

SLIDE 55

Usage

First use for the evaluation of machine audition systems during the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events. Good results so far!

25 / 28

SLIDE 57

and beyond

Numerous potential research applications have been raised:

"Paris by ear, Paris by heart": reverse engineering the mental representation of the city
Sound and well-being in urban areas
...
Bioacoustics?

26 / 28

SLIDE 58

Thank you !!

27 / 28

SLIDE 59

People

Carlo Baugé
Joakim Anden
Stéphane Mallat

28 / 28