SLIDE 1

Music/Voice Separation using the Similarity Matrix

Zafar Rafii & Bryan Pardo

SLIDE 2

Introduction

Musical pieces are often characterized by an underlying repeating structure over which varying elements are superimposed.

Zafar Rafii & Bryan Pardo, 10/12/12

[Figure: waveform of Propellerheads - History Repeating; x-axis: time (s)]

SLIDE 3

Introduction

The REpeating Pattern Extraction Technique (REPET) was proposed to extract the repeating structure from the non-repeating structure.

[Figure: block diagram with Mixture, REPET, Repeating Structure, Non-repeating Structure]

SLIDE 4

REPET

[Figure: overview of REPET in three steps (Step 1, Step 2, Step 3). Panels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrum b (period p), Median, Repeating Segment S, min, Repeating Spectrogram W, Time-Frequency Mask M]

SLIDE 5

Adaptive REPET

[Figure: overview of Adaptive REPET in three steps (Step 1, Step 2, Step 3). Panels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrogram B (local period p_i at frame i), Median, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

SLIDE 6

Limitations

Both the original and the adaptive REPET assume periodically repeating patterns.

[Figure: Mixture; Periodically repeating background; Beat spectrogram; period finder]

SLIDE 7

Limitations

Repetitions can also happen intermittently or without a global (or local) period.

[Figure: Mixture; Non-periodically repeating background; Beat spectrogram; period finder]

SLIDE 8

Limitations

Instead of looking for periodicities, we can look for similarities, using a similarity matrix.

[Figure: Mixture; Non-periodically repeating background; Similarity matrix (color scale from +similar to +dissimilar)]

SLIDE 9

Similarity Matrix

The similarity matrix is a matrix where each bin measures the (dis)similarity between any two elements of a sequence given a metric.

[Figure: a Sequence with elements i1 and i2 compared under a metric; the corresponding bins (i1, i2) and (i2, i1) in the Similarity matrix; color scale from +similar to +dissimilar]
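The definition above can be sketched in a few lines. This is only an illustration: the function name `similarity_matrix`, the toy sequence, and the choice of metric (negative absolute difference, so larger means more similar) are assumptions, not part of the slides.

```python
import numpy as np

def similarity_matrix(sequence, metric):
    """Return S with S[a, b] = metric(sequence[a], sequence[b])."""
    n = len(sequence)
    S = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            S[a, b] = metric(sequence[a], sequence[b])
    return S

# Hypothetical toy sequence that alternates between two values.
seq = [1.0, 5.0, 1.0, 5.0]
# Hypothetical metric: negative absolute difference (0 = identical).
S = similarity_matrix(seq, lambda u, v: -abs(u - v))
```

Elements 0 and 2 are identical, so `S[0, 2]` takes the highest possible value (0), while `S[0, 1]` is lower. With a symmetric metric the matrix is symmetric.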

SLIDE 10

Similarity Matrix

In audio, the similarity matrix (SM) can help to visualize the time structure and find repeating/similar patterns.

[Figure: Spectrogram (time (s) vs. frequency (kHz)) and, via the cosine metric, its Similarity Matrix (time (s) vs. time (s)); color scale from +similar to +dissimilar]

SLIDE 11

Assumptions

Given a mixture of music + voice:

– The repeating background is dense & low-rank
– The non-repeating foreground is sparse & varied

[Figure: Mixture Spectrogram; Background Spectrogram; Foreground Spectrogram; axes: time (s), frequency (kHz)]

SLIDE 12

Assumptions

The SM of a mixture is then likely to reveal the structure of the repeating background.

[Figure: Mixture Spectrogram; Background Spectrogram (axes: time (s), frequency (kHz)); Similarity Matrix (axes: time (s), time (s))]

SLIDE 13

REPET-SIM

REPET with Similarity Matrix!
1. Identify the repeating/similar elements
2. Derive a repeating model
3. Extract the repeating structure

[Figure: block diagram with Mixture Signal, REPET-SIM, Repeating Structure, Non-repeating Structure]

SLIDE 14

REPET-SIM

Advantages compared with REPET:

– Can handle intermittent repeating elements
– Can handle fast-varying repeating structures
– Can handle full-track songs

[Figure: block diagram with Mixture Signal, REPET-SIM, Repeating Structure, Non-repeating Structure]

SLIDE 15

Interests

Practical Interests

– Audio post processing
– Melody extraction
– Karaoke gaming

Intellectual Interests

– Music perception
– Music understanding
– Simply based on self-similarity!

SLIDE 16

REPET-SIM

[Figure: overview of REPET-SIM in three steps (Step 1, Step 2, Step 3). Panels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S (frame i and its similar frames j1, j2 = i, j3), Median, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

SLIDE 17

1. Repeating Elements

[Figure: REPET-SIM overview diagram (Step 1, Step 2, Step 3)]

SLIDE 18

1. Repeating Elements
We take the cosine similarity between every pair of columns (time frames) of the mixture spectrogram and get a similarity matrix.

[Figure: Mixture Spectrogram (time (s) vs. frequency (kHz)) with columns i1 and i2; via the cosine metric, the Similarity Matrix (time (s) vs. time (s)) with bins (i1, i2) and (i2, i1)]
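A minimal sketch of this step, assuming the mixture spectrogram is a non-negative NumPy array `V` of shape (frequency bins, time frames). The function name and the small `eps` guard against silent (all-zero) frames are assumptions added for illustration.

```python
import numpy as np

def cosine_similarity_matrix(V, eps=1e-12):
    """Cosine similarity between every pair of columns (frames) of V.

    V: array of shape (n_freq, n_frames).
    Returns an array of shape (n_frames, n_frames).
    """
    # Scale every column to unit Euclidean norm (eps avoids division by zero).
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    Vn = V / np.maximum(norms, eps)
    # The dot product of unit-norm columns is their cosine similarity.
    return Vn.T @ Vn

# Toy spectrogram: frames 0 and 2 have the same shape, frame 1 is orthogonal.
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
S = cosine_similarity_matrix(V)
```

Because the cosine similarity ignores overall scale, frames 0 and 2 are treated as identical even though frame 2 is twice as loud.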

SLIDE 19

1. Repeating Elements
The SM reveals, for every frame i, the frames j_k that are the most similar to frame i.

[Figure: Mixture Spectrogram with frame i and its similar frames j1, j2, j3; via the cosine metric, the Similarity Matrix (time (s) vs. time (s)) with the corresponding bins in row i]
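One simple way to read the most similar frames off the similarity matrix is to keep the k largest entries of each row. This is a sketch under that assumption only: the slides do not state the selection rule, and the function name and the parameter `k` are hypothetical.

```python
import numpy as np

def most_similar_frames(S, k):
    """For every frame i, return the indices of the k frames most similar to i.

    S: similarity matrix of shape (n_frames, n_frames), larger = more similar.
    Returns an integer array of shape (n_frames, k), most similar first.
    Frame i itself usually appears in its own row, since its
    self-similarity is maximal.
    """
    # Sort each row by decreasing similarity; "stable" makes ties deterministic.
    order = np.argsort(-S, axis=1, kind="stable")
    return order[:, :k]

S = np.array([[1.0, 0.2, 0.9],
              [0.2, 1.0, 0.1],
              [0.9, 0.1, 1.0]])
neighbors = most_similar_frames(S, k=2)
```

Here frames 0 and 2 pick each other, while frame 1 has no close match and falls back to its least dissimilar frame.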

SLIDE 20

1. Repeating Elements

[Figure: REPET-SIM overview diagram (Step 1, Step 2, Step 3)]

SLIDE 21

2. Repeating Model

[Figure: REPET-SIM overview diagram (Step 1, Step 2, Step 3)]

SLIDE 22

2. Repeating Model
For every frame i, we take the median of its most similar frames j_k found using the SM.

[Figure: Mixture Spectrogram with frame i; Mixture Spectrogram with the similar frames j1, j2, j3 selected via the SM]

SLIDE 23

2. Repeating Model
We obtain an initial repeating spectrogram model.

[Figure: Mixture Spectrogram with frame i; Mixture Spectrogram with the similar frames j1, j2, j3 selected via the SM; their median gives frame i of the Repeating Spectrogram]
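The median step can be sketched as follows, assuming `V` is the mixture spectrogram (frequency bins by frames) and `neighbors[i]` lists the indices of the frames most similar to frame i. Both names and the toy values are assumptions for illustration.

```python
import numpy as np

def repeating_model(V, neighbors):
    """Initial repeating spectrogram U.

    V: mixture spectrogram, shape (n_freq, n_frames).
    neighbors: integer array, shape (n_frames, k); row i holds the indices
        of the frames most similar to frame i.
    Returns U with U[:, i] = median over the frames in neighbors[i].
    """
    # Fancy indexing gives shape (n_freq, n_frames, k); take the median over k.
    return np.median(V[:, neighbors], axis=2)

# Toy example: a constant background [1, 2] with two sparse outliers.
V = np.array([[1.0, 10.0, 1.0, 1.0],
              [2.0,  2.0, 2.0, 8.0]])
neighbors = np.array([[0, 2, 3],
                      [0, 1, 2],
                      [0, 2, 3],
                      [0, 2, 3]])
U = repeating_model(V, neighbors)
```

The outliers (10 in frame 1, 8 in frame 3) are in the minority among the similar frames, so the median discards them and `U` recovers the constant background.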

SLIDE 24

2. Repeating Model

[Figure: REPET-SIM overview diagram (Step 1, Step 2, Step 3)]

SLIDE 25

3. Repeating Structure

[Figure: REPET-SIM overview diagram (Step 1, Step 2, Step 3)]

SLIDE 26

3. Repeating Structure
We take the element-wise minimum between the repeating and mixture spectrograms.

[Figure: Mixture Spectrogram and repeating spectrogram (time (s) vs. frequency (kHz)) combined by min]

SLIDE 27

3. Repeating Structure
We obtain a refined repeating spectrogram model for the repeating background.

[Figure: Mixture Spectrogram and initial repeating spectrogram combined by min to give the Repeating Spectrogram; axes: time (s), frequency (kHz)]
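In NumPy terms, the element-wise minimum is a single call. The arrays below are toy values; the names `U` (initial repeating model), `V` (mixture spectrogram) and `W` (refined model) follow the labels of the REPET-SIM diagram.

```python
import numpy as np

U = np.array([[0.5, 3.0],
              [2.0, 1.0]])   # initial repeating spectrogram (toy values)
V = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # mixture spectrogram (toy values)

# Refined repeating spectrogram: never larger than the mixture in any bin.
W = np.minimum(U, V)
```

Only the bin where the model exceeded the mixture (3.0 against 2.0) is changed; the other bins are left as they were.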

SLIDE 28

3. Repeating Structure
The repeating spectrogram cannot have values higher than the mixture spectrogram.

[Figure: Mixture Spectrogram; Non-repeating Spectrogram; Repeating Spectrogram (annotated with ≥ signs); axes: time (s), frequency (kHz)]

SLIDE 29

3. Repeating Structure
We divide the repeating spectrogram by the mixture spectrogram, element-wise.

[Figure: Repeating Spectrogram divided by Mixture Spectrogram; axes: time (s), frequency (kHz)]

SLIDE 30

3. Repeating Structure
We obtain a soft time-frequency mask (with values in [0,1]).

[Figure: Repeating Spectrogram divided by Mixture Spectrogram gives the Time-frequency Mask; axes: time (s), frequency (kHz)]
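A sketch of the division, assuming `W` is the refined repeating spectrogram (so `W <= V` everywhere) and `V` the mixture spectrogram. The `eps` guard for empty bins and the final clip are assumptions added for numerical safety, not something stated on the slides.

```python
import numpy as np

def soft_mask(W, V, eps=1e-12):
    """Soft time-frequency mask M = W / V, element-wise, with values in [0, 1]."""
    # Where V is 0, W is 0 as well (W <= V), so those bins get a mask of 0.
    M = W / np.maximum(V, eps)
    # Clip only to absorb floating-point round-off.
    return np.clip(M, 0.0, 1.0)

W = np.array([[0.5, 2.0],
              [0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [0.0, 4.0]])
M = soft_mask(W, V)
```

Bins close to 1 are attributed to the repeating background, bins close to 0 to the non-repeating foreground.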

SLIDE 31

3. Repeating Structure
We apply the t-f mask to the mixture STFT and obtain the repeating background.

[Figure: Mixture Spectrogram multiplied element-wise (.x) by the Time-frequency Mask gives the Background Spectrogram; iSTFT gives the Background Signal; axes: time (s), frequency (kHz)]
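The masking and inversion could look like the sketch below. It relies on `scipy.signal.stft` and `scipy.signal.istft` with their default Hann window and 50% overlap; the function name `apply_mask`, the window length, and the use of a one-sided STFT are assumptions, since the slides do not give the analysis parameters.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(x, M, fs, nperseg=1024):
    """Apply a time-frequency mask M to the STFT of x and invert it.

    M must have the same shape as the (one-sided) STFT of x.
    Returns a time-domain signal with the same length as x.
    """
    _, _, X = stft(x, fs=fs, nperseg=nperseg)    # complex mixture STFT
    if M.shape != X.shape:
        raise ValueError("mask and STFT shapes differ")
    _, y = istft(M * X, fs=fs, nperseg=nperseg)  # element-wise product, then iSTFT
    return y[:len(x)]                            # drop the padding added by stft

fs = 8000
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in mixture
_, _, X = stft(x, fs=fs, nperseg=1024)
background = apply_mask(x, np.ones(X.shape), fs)   # an all-ones mask keeps everything
```

Multiplying the complex STFT by a real mask keeps the phase of the mixture and only rescales the magnitude in each bin.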

SLIDE 32

3. Repeating Structure
The non-repeating foreground is obtained by subtracting the background from the mixture.

[Figure: Mixture Spectrogram; Background Spectrogram; iSTFT; Mixture Signal, Background Signal and Foreground Signal over time (s)]

SLIDE 33

Music/Voice Separation

Repeating background ≈ music component
Non-repeating foreground ≈ voice component

REPET-SIM

1. Repeating elements
2. Repeating model
3. Repeating structure

[Figure: Mixture Signal, Foreground Signal and Background Signal over time (s)]
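Putting the three steps together, a compact end-to-end sketch might look like this. It is a simplified illustration, not the authors' implementation: the function name, the top-k neighbor selection, the parameter values, and the STFT settings are all assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def repet_sim_sketch(x, fs, k=5, nperseg=1024, eps=1e-12):
    """Split x into (background, foreground) with a simplified REPET-SIM."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)   # mixture STFT
    V = np.abs(X)                               # mixture spectrogram

    # 1. Repeating elements: cosine similarity between frames,
    #    then the k most similar frames for every frame (assumed rule).
    Vn = V / np.maximum(np.linalg.norm(V, axis=0, keepdims=True), eps)
    S = Vn.T @ Vn
    k = min(k, S.shape[0])
    neighbors = np.argsort(-S, axis=1, kind="stable")[:, :k]

    # 2. Repeating model: median of the most similar frames.
    U = np.median(V[:, neighbors], axis=2)

    # 3. Repeating structure: element-wise min, soft mask, masked iSTFT.
    W = np.minimum(U, V)
    M = W / np.maximum(V, eps)
    _, background = istft(M * X, fs=fs, nperseg=nperseg)
    background = background[:len(x)]

    # The foreground is what remains after removing the background.
    foreground = x - background
    return background, foreground

# Synthetic stand-in mixture: a steady tone plus a short noise burst.
fs = 8000
t = np.arange(2 * fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440.0 * t)
x[7200:8000] += 0.5 * np.random.default_rng(0).standard_normal(800)
background, foreground = repet_sim_sketch(x, fs)
```

By construction the two outputs sum back to the mixture; how cleanly the tone and the burst end up in separate outputs depends on `k` and the STFT settings.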

SLIDE 34

Evaluation

Competing method 1 [Liutkus et al., 2012]

– Adaptive REPET with an automatic period finder and soft time-frequency masking

Competing method 2 [FitzGerald et al., 2010]

– Median filtering of the spectrogram at different frequency resolutions to extract the vocals

Data set

– 14 full-track real-world songs (Beach Boys)
– 3 voice-to-music mixing ratios (-6, 0, and 6 dB)
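To make the mixing ratios concrete, here is one way to build a mixture at a given voice-to-music ratio. The slides do not say how the mixtures were created, so this sketch assumes the ratio is the power ratio of voice to music in dB; the function name and the random stand-in signals are hypothetical.

```python
import numpy as np

def mix_at_ratio(voice, music, ratio_db):
    """Scale voice so that 10*log10(P_voice / P_music) = ratio_db, then add music."""
    p_voice = np.mean(voice ** 2)   # average power of the voice
    p_music = np.mean(music ** 2)   # average power of the music
    gain = np.sqrt(p_music / p_voice * 10.0 ** (ratio_db / 10.0))
    return gain * voice + music

rng = np.random.default_rng(0)
voice = rng.standard_normal(1000)        # stand-in for a voice track
music = 3.0 * rng.standard_normal(1000)  # stand-in for a music track
mixtures = {r: mix_at_ratio(voice, music, r) for r in (-6, 0, 6)}
```

At -6 dB the music dominates, at 0 dB both sources have equal power, and at 6 dB the voice dominates.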

SLIDE 35

Evaluation

[Figure: evaluation results (plot content not recoverable). Legend: MMFS = FitzGerald et al.; REPET+ = Liutkus et al.; Proposed = REPET-SIM]

SLIDE 36

Examples

REPET-SIM vs. FitzGerald et al.

[Figure: waveforms over time (s) for Wham! - Freedom: mixture; Music estimate (FitzGerald); Voice estimate (FitzGerald); Music estimate (REPET-SIM); Voice estimate (REPET-SIM)]

SLIDE 37

Examples

REPET-SIM

[Figure: waveforms over time (s) for Blackalicious - Alphabet Aerobics: mixture; Music estimate; Voice estimate]

SLIDE 38

Examples

Adaptive REPET

[Figure: waveforms over time (s) for Blackalicious - Alphabet Aerobics: mixture; Music estimate; Voice estimate]

SLIDE 39

Conclusion

The analysis of the repetitions/similarities in music can be used for source separation.

REPET-SIM

1. Repeating elements
2. Repeating model
3. Repeating structure

[Figure: block diagram with Mixture Signal, REPET-SIM, Repeating Structure, Non-repeating Structure]

SLIDE 40

Questions?

D. FitzGerald and M. Gainza, “Single Channel Vocal Separation using Median Filtering and Factorisation Techniques,” ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.

J. Foote, “Visualizing Music and Audio using Self-Similarity,” ACM International Conference on Multimedia, Orlando, FL, USA, October 30-November 5, 1999.

A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, “Adaptive Filtering for Music/Voice Separation exploiting the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.

Z. Rafii and B. Pardo, “A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.