Music/Voice Separation using the Similarity Matrix, Zafar Rafii & Bryan Pardo



slide-1
SLIDE 1

Music/Voice Separation using the Similarity Matrix

Zafar Rafii & Bryan Pardo

slide-2
SLIDE 2
Introduction

  • Musical pieces are often characterized by an underlying repeating structure over which varying elements are superimposed

[Figure: signal of Propellerheads - History, labeled "Repeating"; axis: time (s)]

Zafar Rafii & Bryan Pardo 2 10/12/12

slide-3
SLIDE 3

Introduction

  • The REpeating Pattern Extraction Technique (REPET) was proposed to separate the repeating structure from the non-repeating structure

[Figure: block diagram. Mixture into REPET, giving Repeating Structure and Non-repeating Structure]

slide-4
SLIDE 4

REPET

[Figure: REPET overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrum b, period p (1p, 2p), Median, Repeating Segment S, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-5
SLIDE 5

Adaptive REPET

[Figure: Adaptive REPET overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrogram B, period pi, Median over frames i-1pi, i, i+1pi, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-6
SLIDE 6

Limitations

  • Both the original and the adaptive REPET assume periodically repeating patterns

[Figure: Mixture with a periodically repeating background; beat spectrogram; period finder]

slide-7
SLIDE 7

Limitations

  • Repetitions can also happen intermittently or without a global (or local) period

[Figure: Mixture with a non-periodically repeating background; beat spectrogram; period finder]

slide-8
SLIDE 8

Limitations

  • Instead of looking for periodicities, we can look for similarities, using a similarity matrix

[Figure: Mixture with a non-periodically repeating background; similarity matrix; scale: +similar / +dissimilar]

slide-9
SLIDE 9
Similarity Matrix

  • The similarity matrix is a matrix where each bin (i1, i2) measures the (dis)similarity between elements i1 and i2 of a sequence, under a given metric

[Figure: a sequence with elements i1 and i2, the metric applied to them, and the resulting bin of the similarity matrix; scale: +similar / +dissimilar]

slide-10
SLIDE 10
Similarity Matrix

  • In audio, the similarity matrix (SM) can help to visualize the time structure and find repeating/similar patterns

[Figure: Spectrogram (time (s) vs. frequency (kHz)) and its Similarity Matrix (time (s) vs. time (s)), metric: cosine; scale: +similar / +dissimilar]

slide-11
SLIDE 11

Assumptions

  • Given a mixture of music + voice:

– The repeating background is dense & low-rank
– The non-repeating foreground is sparse & varied

[Figure: Mixture Spectrogram, Background Spectrogram and Foreground Spectrogram; axes: time (s), frequency (kHz)]

slide-12
SLIDE 12

Assumptions

  • The SM of a mixture is then likely to reveal the structure of the repeating background

[Figure: Mixture Spectrogram and Background Spectrogram (time (s) vs. frequency (kHz)); Similarity Matrix (time (s) vs. time (s))]

slide-13
SLIDE 13

REPET-SIM

  • REPET with Similarity Matrix!
  • 1. Identify the repeating/similar elements
  • 2. Derive a repeating model
  • 3. Extract the repeating structure

[Figure: block diagram. Mixture Signal into REPET-SIM, giving Repeating Structure and Non-repeating Structure]

slide-14
SLIDE 14

REPET-SIM

  • Advantages compared with REPET:

– Can handle intermittent repeating elements
– Can handle fast-varying repeating structures
– Can handle full-track songs

[Figure: block diagram. Mixture Signal into REPET-SIM, giving Repeating Structure and Non-repeating Structure]

slide-15
SLIDE 15

Interests

  • Practical Interests

– Audio post-processing
– Melody extraction
– Karaoke gaming

  • Intellectual Interests

– Music perception
– Music understanding
– Simply based on self-similarity!

slide-16
SLIDE 16

REPET-SIM

[Figure: REPET-SIM overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, frame i and its similar frames j1, j2 (= i), j3, Median, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-17
SLIDE 17

  • 1. Repeating Elements

[Figure: REPET-SIM overview diagram, Steps 1 to 3. Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, Median over frames j1, j2 (= i), j3, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-18
SLIDE 18
  • 1. Repeating Elements
  • We take the cosine similarity between every pair of columns (time frames) of the mixture spectrogram and get a similarity matrix

[Figure: Mixture Spectrogram (time (s) vs. frequency (kHz)) and Similarity Matrix (time (s) vs. time (s)); the cosine similarity of columns i1 and i2 fills bin (i1, i2)]
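The step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `cosine_similarity_matrix` and the small `eps` guard against all-zero frames are assumptions made here.

```python
import numpy as np

def cosine_similarity_matrix(V, eps=1e-12):
    """Cosine similarity between every pair of columns (time frames) of V.

    V: non-negative magnitude spectrogram, shape (n_freq, n_frames).
    Returns S with shape (n_frames, n_frames); S[i1, i2] compares
    frames i1 and i2.
    """
    # Normalize each column to unit Euclidean length (eps avoids 0/0).
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    Vn = V / np.maximum(norms, eps)
    # The dot product of unit-length columns is their cosine similarity.
    return Vn.T @ Vn

# Toy spectrogram: frames 0 and 2 have the same spectral shape.
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
S = cosine_similarity_matrix(V)
print(S)
```

Because the columns are normalized, frame 2 (a louder copy of frame 0) is reported as fully similar to frame 0: the metric compares spectral shape, not level.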

slide-19
SLIDE 19
  • 1. Repeating Elements
  • The SM reveals, for every frame i, the frames jk that are the most similar to frame i

[Figure: Mixture Spectrogram (time (s) vs. frequency (kHz)) and Similarity Matrix (cosine; time (s) vs. time (s)); frame i and its similar frames j1, j2, j3]
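One simple way to read the most similar frames off the SM is to sort each row and keep the top entries. The sketch below assumes a fixed number `k` of frames per row; the slides do not say how many frames are kept or whether a similarity threshold is used, so `k` and the function name `most_similar_frames` are illustrative only. Frame i is always among its own most similar frames, which matches the "j2 = i" label in the overview figure.

```python
import numpy as np

def most_similar_frames(S, k):
    """For every frame i, return the indices of the k frames most similar to i.

    S: similarity matrix, shape (n_frames, n_frames).
    Returns an integer array J of shape (n_frames, k); row i lists the
    frames jk, from most to least similar.
    """
    # Sort each row by decreasing similarity and keep the first k indices.
    order = np.argsort(-S, axis=1, kind="stable")
    return order[:, :k]

S = np.array([[1.0, 0.2, 0.9, 0.1],
              [0.2, 1.0, 0.3, 0.8],
              [0.9, 0.3, 1.0, 0.4],
              [0.1, 0.8, 0.4, 1.0]])
J = most_similar_frames(S, k=2)
print(J)
```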

slide-20
SLIDE 20

  • 1. Repeating Elements

[Figure: REPET-SIM overview diagram, Steps 1 to 3. Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, Median over frames j1, j2 (= i), j3, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-21
SLIDE 21

  • 2. Repeating Model

[Figure: REPET-SIM overview diagram, Steps 1 to 3. Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, Median over frames j1, j2 (= i), j3, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-22
SLIDE 22
  • 2. Repeating Model
  • For every frame i, we take the median, at each frequency bin, of its most similar frames jk found using the SM

[Figure: Mixture Spectrogram with frame i and its similar frames j1, j2, j3 found using the SM; axes: time (s), frequency (kHz)]

slide-23
SLIDE 23
  • 2. Repeating Model
  • We obtain an initial repeating spectrogram model

[Figure: the median of frames j1, j2, j3 of the Mixture Spectrogram (found using the SM) gives frame i of the Repeating Spectrogram; axes: time (s), frequency (kHz)]
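The median step can be sketched as follows, assuming the similar-frame indices are already available as an integer array `J` with one row per frame (a layout chosen here for illustration, not taken from the slides). The names `repeating_model`, `V`, `J` and `U` follow the labels in the overview figure where possible.

```python
import numpy as np

def repeating_model(V, J):
    """Initial repeating spectrogram U from the mixture spectrogram V.

    V: magnitude spectrogram, shape (n_freq, n_frames).
    J: similar-frame indices, shape (n_frames, k); row i lists frames jk.
    Frame i of U is the per-frequency median of frames V[:, J[i]].
    """
    # V[:, J] has shape (n_freq, n_frames, k); take the median over k.
    return np.median(V[:, J], axis=2)

# Toy mixture: a constant background, plus a loud non-repeating
# element in frame 2 of the first frequency bin.
V = np.array([[1.0, 1.0, 9.0, 1.0],
              [2.0, 2.0, 2.0, 2.0]])
J = np.array([[0, 1, 3],
              [1, 0, 3],
              [2, 0, 1],
              [3, 0, 1]])
U = repeating_model(V, J)
print(U)
```

In the toy example the value 9 in frame 2 is an outlier among its similar frames, so the median discards it and frame 2 of U keeps only the background level.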

slide-24
SLIDE 24

  • 2. Repeating Model

[Figure: REPET-SIM overview diagram, Steps 1 to 3. Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, Median over frames j1, j2 (= i), j3, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-25
SLIDE 25

  • 3. Repeating Structure

[Figure: REPET-SIM overview diagram, Steps 1 to 3. Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, Median over frames j1, j2 (= i), j3, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]

slide-26
SLIDE 26
  • 3. Repeating Structure
  • We take the element-wise minimum between the repeating and mixture spectrograms

[Figure: Mixture Spectrogram and a second spectrogram panel combined by min; axes: time (s), frequency (kHz)]

slide-27
SLIDE 27

  • 3. Repeating Structure
  • We obtain a refined repeating spectrogram model for the repeating background

[Figure: Mixture Spectrogram and a second spectrogram panel combined by min, giving the Repeating Spectrogram; axes: time (s), frequency (kHz)]

slide-28
SLIDE 28

  • 3. Repeating Structure
  • The repeating spectrogram cannot have values higher than the mixture spectrogram

[Figure: Mixture Spectrogram, Repeating Spectrogram and Non-repeating Spectrogram, each panel annotated "≥ 𝟏" in the extracted text; axes: time (s), frequency (kHz)]
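A short numerical sketch of the minimum step. The reasoning assumed here is that the mixture magnitude is modeled as a non-negative repeating part plus a non-negative non-repeating part, so the repeating part can never exceed the mixture. The toy values are made up, and `U`, `V`, `W` follow the labels in the overview figure.

```python
import numpy as np

V = np.array([[1.0, 4.0],
              [3.0, 0.5]])   # mixture spectrogram
U = np.array([[2.0, 3.0],
              [3.0, 1.0]])   # initial repeating model (from the median)

# Refined repeating spectrogram: clip the model wherever it exceeds the mixture.
W = np.minimum(U, V)
print(W)

# The part left over for the non-repeating foreground is never negative.
print(V - W)
```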

slide-29
SLIDE 29

  • 3. Repeating Structure
  • We divide the repeating spectrogram by the mixture spectrogram, element-wise

[Figure: Repeating Spectrogram divided by Mixture Spectrogram; axes: time (s), frequency (kHz)]

slide-30
SLIDE 30

  • 3. Repeating Structure
  • We obtain a soft time-frequency mask (with values in [0,1])

[Figure: Repeating Spectrogram divided by Mixture Spectrogram gives the Time-frequency Mask; axes: time (s), frequency (kHz)]
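The division can be sketched as below. The `eps` floor on the denominator is an assumption added here to avoid dividing by zero in silent bins; the slides do not say how such bins are handled. The mask stays in [0, 1] only because the refined repeating spectrogram W is never larger than the mixture spectrogram V.

```python
import numpy as np

def soft_mask(W, V, eps=1e-12):
    """Soft time-frequency mask M = W / V, element-wise.

    W: refined repeating spectrogram, with 0 <= W <= V.
    V: mixture spectrogram (same shape).
    Bins where V is zero get a mask value of 0.
    """
    return W / np.maximum(V, eps)

V = np.array([[1.0, 4.0, 0.0],
              [3.0, 0.5, 2.0]])
W = np.array([[1.0, 3.0, 0.0],
              [3.0, 0.5, 0.5]])
M = soft_mask(W, V)
print(M)
```

A value near 1 means the bin is attributed to the repeating background, and a value near 0 means it is attributed to the non-repeating foreground.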

slide-31
SLIDE 31

  • 3. Repeating Structure
  • We apply the t-f mask to the mixture STFT and obtain the repeating background

[Figure: Time-frequency Mask multiplied element-wise (.x) with the Mixture Spectrogram gives the Background Spectrogram; iSTFT gives the Background Signal; axes: time (s), frequency (kHz)]

slide-32
SLIDE 32

  • 3. Repeating Structure
  • The non-repeating foreground is obtained by subtracting the background from the mixture

[Figure: Mixture Spectrogram and Background Spectrogram (time (s) vs. frequency (kHz)); iSTFT; Mixture Signal, Background Signal and Foreground Signal (time (s))]
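The last two slides can be sketched with SciPy's `stft` and `istft`. This is an illustration under assumptions: the window length `nperseg=256`, the random test signal and the constant mask are placeholders chosen here, not values from the slides. The mask multiplies the complex mixture STFT, so the background keeps the mixture phase.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(X, M, x, fs=1.0, nperseg=256):
    """Recover background and foreground signals from a soft mask.

    X: complex STFT of the mixture x, as returned by scipy.signal.stft.
    M: soft time-frequency mask with the same shape as X, values in [0, 1].
    """
    # Masked STFT back to the time domain: the repeating background.
    _, background = istft(M * X, fs=fs, nperseg=nperseg)
    background = background[:len(x)]   # drop the zero-padding added by stft
    # Non-repeating foreground: mixture minus background.
    foreground = x - background
    return background, foreground

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)           # stand-in for a mixture signal
_, _, X = stft(x, fs=1.0, nperseg=256)
M = np.full(X.shape, 0.5)               # placeholder mask, not a real REPET-SIM mask
background, foreground = apply_mask(X, M, x)
print(np.allclose(background + foreground, x))
```

With the constant placeholder mask the background is simply half of the mixture; a real mask varies per bin, but background and foreground still sum back to the mixture by construction.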

slide-33
SLIDE 33

Music/Voice Separation

  • Repeating background ≈ music component
  • Non-repeating foreground ≈ voice component

REPET-SIM

  • 1. Repeating elements
  • 2. Repeating model
  • 3. Repeating structure

[Figure: Mixture Signal into REPET-SIM, giving Background Signal and Foreground Signal; axis: time (s)]

slide-34
SLIDE 34

Evaluation

  • Competing method 1 [Liutkus et al., 2012]

– Adaptive REPET with automatic period finder and soft time-frequency masking

  • Competing method 2 [FitzGerald et al., 2010]

– Median filtering of the spectrogram at different frequency resolutions to extract the vocals

  • Data set

– 14 full-track real-world songs (Beach Boys)
– 3 voice-to-music mixing ratios (-6, 0, and 6 dB)
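For readers unfamiliar with mixing ratios in dB, the sketch below shows one common way to build a mixture at a given voice-to-music ratio: scale the voice so that the power ratio of voice to music matches the target. The slides do not describe how the evaluation mixtures were actually produced, so the function `mix_at_ratio` and the random stand-in signals are assumptions for illustration only.

```python
import numpy as np

def mix_at_ratio(voice, music, ratio_db):
    """Scale `voice` so that 10*log10(P_voice / P_music) equals ratio_db, then mix.

    Returns the mixture and the gain applied to the voice.
    """
    p_voice = np.mean(voice ** 2)
    p_music = np.mean(music ** 2)
    # Solve gain^2 * p_voice / p_music = 10^(ratio_db / 10) for the gain.
    gain = np.sqrt(p_music / p_voice * 10.0 ** (ratio_db / 10.0))
    return gain * voice + music, gain

rng = np.random.default_rng(0)
voice = rng.standard_normal(1000)        # stand-in for a voice track
music = 2.0 * rng.standard_normal(1000)  # stand-in for a music track

for ratio_db in (-6, 0, 6):
    mixture, gain = mix_at_ratio(voice, music, ratio_db)
    achieved = 10 * np.log10(np.mean((gain * voice) ** 2) / np.mean(music ** 2))
    print(ratio_db, round(achieved, 6))
```

At -6 dB the music has about four times the power of the voice, at 0 dB the two have equal power, and at 6 dB the voice has about four times the power of the music.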

slide-35
SLIDE 35

Evaluation

MMFS = FitzGerald et al.; REPET+ = Liutkus et al.; Proposed = REPET-SIM

slide-36
SLIDE 36

Examples

  • REPET-SIM vs. FitzGerald et al.

[Figure: signals over time (s) for Wham! - Freedom: mixture, Voice estimate (REPET-SIM), Music estimate (REPET-SIM), Voice estimate (FitzGerald), Music estimate (FitzGerald)]

slide-37
SLIDE 37

Examples

  • REPET-SIM

[Figure: signals over time (s) for Blackalicious - Alphabet Aerobics: mixture, Voice estimate, Music estimate]

slide-38
SLIDE 38

Examples

  • Adaptive REPET

[Figure: signals over time (s) for Blackalicious - Alphabet Aerobics: mixture, Voice estimate, Music estimate]

slide-39
SLIDE 39
Conclusion

  • The analysis of the repetitions/similarities in music can be used for source separation

REPET-SIM

  • 1. Repeating elements
  • 2. Repeating model
  • 3. Repeating structure

[Figure: block diagram. Mixture Signal into REPET-SIM, giving Repeating Structure and Non-repeating Structure]

slide-40
SLIDE 40

Questions?

  • D. FitzGerald and M. Gainza, “Single Channel Vocal Separation using Median Filtering and Factorisation Techniques,” ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.
  • J. Foote, “Visualizing Music and Audio using Self-Similarity,” ACM International Conference on Multimedia, Orlando, FL, USA, October 30-November 5, 1999.
  • A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, “Adaptive Filtering for Music/Voice Separation exploiting the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.
  • Z. Rafii and B. Pardo, “A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.