Music/Voice Separation using the Similarity Matrix
Zafar Rafii & Bryan Pardo
10/12/12
Introduction
- Musical pieces are often characterized by an underlying repeating structure over which varying elements are superimposed
[Figure: waveform of "Propellerheads - History" with a "Repeating" annotation; x-axis: time (s)]
Introduction
- The REpeating Pattern Extraction Technique (REPET) was proposed to extract the repeating structure from the non-repeating structure
[Figure: block diagram with labels Mixture, REPET, Repeating Structure, Non-repeating Structure]
REPET
[Figure: REPET overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrum b with periods p and 2p, Median, Repeating Segment S, min, Repeating Spectrogram W, Time-Frequency Mask M]
Adaptive REPET
[Figure: Adaptive REPET overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Beat Spectrogram B with local periods p_i, Median over frames i-p_i, i, i+p_i, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]
Limitations
- Both the original and the adaptive REPET assume periodically repeating patterns
[Figure: labels Mixture, Beat spectrogram, period finder, Periodically repeating background]
- Repetitions can also happen intermittently or without a global (or local) period
[Figure: labels Mixture, Beat spectrogram, period finder, Non-periodically repeating background]
- Instead of looking for periodicities, we can look for similarities, using a similarity matrix
[Figure: labels Mixture, Similarity matrix (color scale from +similar to +dissimilar), Non-periodically repeating background]
Similarity Matrix
- The similarity matrix is a matrix where each bin measures the (dis)similarity between any two elements of a sequence, given a metric
[Figure: a sequence with elements i1 and i2; the metric applied to i1 and i2 fills bins (i1, i2) and (i2, i1) of the similarity matrix; color scale from +similar to +dissimilar]
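As a minimal sketch of this definition, the following builds a similarity matrix for an arbitrary sequence and metric. The function name `similarity_matrix`, the toy sequence, and the negative absolute difference used as the metric are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def similarity_matrix(sequence, metric):
    """Return S where S[a, b] = metric(sequence[a], sequence[b])."""
    n = len(sequence)
    S = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            S[a, b] = metric(sequence[a], sequence[b])
    return S

# Toy sequence that alternates between two values (hypothetical data).
seq = [1.0, 5.0, 1.0, 5.0]

# Assumed metric: 0 means identical, more negative means more dissimilar.
S = similarity_matrix(seq, lambda u, v: -abs(u - v))
```

With a symmetric metric such as this one the matrix is symmetric, and elements that repeat (here positions 0 and 2, or 1 and 3) show up as high-similarity bins away from the main diagonal.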
Similarity Matrix
- In audio, the similarity matrix (SM) can help to visualize the time structure and find repeating/similar patterns
[Figure: Spectrogram (time (s) vs. frequency (kHz)), cosine, Similarity Matrix (time (s) vs. time (s)); color scale from +similar to +dissimilar]
Assumptions
- Given a mixture of music + voice:
  – The repeating background is dense & low-ranked
  – The non-repeating foreground is sparse & varied
[Figure: Mixture Spectrogram, Background Spectrogram, Foreground Spectrogram; axes: time (s) vs. frequency (kHz)]
- The SM of a mixture is then likely to reveal the structure of the repeating background
[Figure: Mixture Spectrogram and Background Spectrogram (time (s) vs. frequency (kHz)), Similarity Matrix (time (s) vs. time (s))]
REPET-SIM
- REPET with Similarity Matrix!
- 1. Identify the repeating/similar elements
- 2. Derive a repeating model
- 3. Extract the repeating structure
[Figure: block diagram with labels Mixture Signal, REPET-SIM, Repeating Structure, Non-repeating Structure]
- Advantages compared with REPET:
  – Can handle intermittent repeating elements
  – Can handle fast-varying repeating structures
  – Can handle full-track songs
Interests
- Practical Interests
  – Audio post processing
  – Melody extraction
  – Karaoke gaming
- Intellectual Interests
  – Music perception
  – Music understanding
  – Simply based on self-similarity!
REPET-SIM
[Figure: REPET-SIM overview in three steps (Step 1, Step 2, Step 3). Labels: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, frame i with similar frames j1, j2 = i, j3, Median, Repeating Spectrogram U, min, Repeating Spectrogram W, Time-Frequency Mask M]
1. Repeating Elements
- We take the cosine similarity between every pair of columns of the mixture spectrogram and get a similarity matrix
[Figure: Mixture Spectrogram (time (s) vs. frequency (kHz)) with columns i1 and i2; their cosine similarity fills bins (i1, i2) and (i2, i1) of the Similarity Matrix (time (s) vs. time (s))]
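One possible way to compute this step with NumPy is sketched below. It assumes `V` is a non-negative magnitude spectrogram stored as a (frequency x time) array; the function name and the `eps` guard against all-zero (silent) frames are assumptions of this sketch rather than details given in the slides.

```python
import numpy as np

def cosine_similarity_matrix(V, eps=1e-12):
    """Cosine similarity between every pair of columns (time frames) of V.

    V: array of shape (n_freq, n_frames).
    Returns S of shape (n_frames, n_frames).
    """
    V = np.asarray(V, dtype=float)
    norms = np.linalg.norm(V, axis=0, keepdims=True)  # one norm per frame
    Vn = V / np.maximum(norms, eps)                   # unit-norm columns
    return Vn.T @ Vn                                  # all pairwise dot products

# Toy spectrogram: frames 0 and 2 have the same spectral shape (hypothetical data).
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
S = cosine_similarity_matrix(V)
```

Because cosine similarity ignores scale, frames 0 and 2 are maximally similar even though frame 2 is louder, which is convenient when the same musical pattern recurs at different levels.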
1. Repeating Elements
- The SM reveals, for every frame i, the frames j_k that are the most similar to frame i
[Figure: Mixture Spectrogram with frame i and its similar frames j1, j2, j3, read from the cosine Similarity Matrix]
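A hedged sketch of this selection step: for each row of the similarity matrix, keep the indices of the highest-scoring frames. The fixed count `k` and the function name are assumptions; the slides do not state how many frames are kept or whether any threshold is applied. Frame i is kept among its own neighbours, in line with the overview figure where j2 = i.

```python
import numpy as np

def most_similar_frames(S, k):
    """For every frame i, return the indices of the k most similar frames.

    S: (n_frames, n_frames) similarity matrix.
    Returns J of shape (n_frames, k), most similar first; frame i itself is
    included because S[i, i] is the maximum of row i for cosine similarity.
    """
    order = np.argsort(-S, axis=1, kind="stable")  # descending similarity
    return order[:, :k]

# Toy similarity matrix (hypothetical values): frames 0 and 2 resemble each other.
S = np.array([[1.0, 0.2, 0.9],
              [0.2, 1.0, 0.1],
              [0.9, 0.1, 1.0]])
J = most_similar_frames(S, k=2)
```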
2. Repeating Model
- For every frame i, we take the median of its most similar frames j_k found using the SM
[Figure: Mixture Spectrogram with frame i and its similar frames j1, j2, j3 selected from the SM]
- We obtain an initial repeating spectrogram model
[Figure: the median of frames j1, j2, j3 of the Mixture Spectrogram gives frame i of the Repeating Spectrogram; axes: time (s) vs. frequency (kHz)]
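The median step can be sketched as follows, assuming the similar-frame indices are available as an integer array `J` with one row per frame (a hypothetical layout chosen for this example). The median is taken per frequency bin across the selected frames.

```python
import numpy as np

def repeating_model(V, J):
    """Initial repeating spectrogram U from the mixture spectrogram V.

    V: (n_freq, n_frames) magnitude spectrogram.
    J: (n_frames, k) integer indices of the frames most similar to each frame.
    U[:, i] is the median, per frequency bin, of the columns V[:, J[i]].
    """
    # V[:, J] has shape (n_freq, n_frames, k); reduce over the k similar frames.
    return np.median(V[:, J], axis=2)

# Toy data: a constant background [1, 2] with two sparse outliers (9 and 8).
V = np.array([[1.0, 9.0, 1.0, 1.0],
              [2.0, 2.0, 8.0, 2.0]])
J = np.array([[0, 2, 3],
              [1, 0, 3],
              [2, 0, 3],
              [3, 0, 2]])
U = repeating_model(V, J)
```

In this toy case the outliers appear in only one of the three frames being compared, so the median discards them and `U` recovers the constant background. This is the reason the foreground is assumed to be sparse and varied.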
3. Repeating Structure
- We take the element-wise minimum between the repeating and mixture spectrograms
- We obtain a refined repeating spectrogram model for the repeating background
[Figure: min of the Repeating Spectrogram and the Mixture Spectrogram; axes: time (s) vs. frequency (kHz)]
- The repeating spectrogram cannot have values higher than the mixture spectrogram
[Figure: Mixture Spectrogram, Non-repeating Spectrogram, Repeating Spectrogram, each annotated "≥ 1"; axes: time (s) vs. frequency (kHz)]
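A minimal sketch of the refinement, assuming `U` (initial repeating model) and `V` (mixture spectrogram) are non-negative arrays of the same shape; the function name is hypothetical.

```python
import numpy as np

def refine_repeating_model(U, V):
    """Element-wise minimum, so the repeating model never exceeds the mixture."""
    return np.minimum(U, V)

# Toy values (hypothetical): U overshoots V in two bins.
V = np.array([[1.0, 4.0],
              [3.0, 0.5]])
U = np.array([[2.0, 2.0],
              [2.0, 2.0]])
W = refine_repeating_model(U, V)
```

Bins where the median model overshoots the mixture are clipped to the mixture value, and all other bins are left unchanged.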
3. Repeating Structure
- We divide the repeating spectrogram by the mixture spectrogram, element-wise
- We obtain a soft time-frequency mask (with values in [0,1])
[Figure: Repeating Spectrogram divided element-wise by the Mixture Spectrogram gives the Time-frequency Mask; axes: time (s) vs. frequency (kHz)]
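The division can be sketched as below. The `eps` guard for bins where the mixture is zero and the final clip are assumptions added for numerical safety; the slides only state the element-wise division and the [0,1] range.

```python
import numpy as np

def soft_mask(W, V, eps=1e-12):
    """Soft time-frequency mask M = W / V, element-wise.

    W: refined repeating spectrogram, with 0 <= W <= V.
    V: mixture spectrogram.
    """
    M = W / np.maximum(V, eps)   # avoid division by zero in silent bins
    return np.clip(M, 0.0, 1.0)  # guard against rounding outside [0, 1]

# Toy values (hypothetical): one silent bin in the mixture.
V = np.array([[1.0, 4.0],
              [0.0, 0.5]])
W = np.minimum(np.array([[2.0, 2.0],
                         [2.0, 2.0]]), V)
M = soft_mask(W, V)
```

A mask value near 1 marks a bin dominated by the repeating background, and a value near 0 marks a bin dominated by the non-repeating foreground.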
3. Repeating Structure
- We apply the t-f mask to the mixture STFT and obtain the repeating background
[Figure: Time-frequency Mask multiplied element-wise (.x) with the Mixture Spectrogram gives the Background Spectrogram; iSTFT gives the Background Signal; axes: time (s) vs. frequency (kHz)]
- The non-repeating foreground is obtained by subtracting the background from the mixture
[Figure: waveforms of the Mixture Signal, Background Signal and Foreground Signal (x-axis: time (s)); Mixture Spectrogram and Background Spectrogram with iSTFT]
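The last two operations (masking the mixture STFT, then subtracting in the time domain) can be sketched end to end. The STFT parameters (`N_FFT`, `HOP`, periodic Hann window), the small NumPy-only `stft`/`istft` helpers, and the constant placeholder mask are all assumptions made to keep the example self-contained; the slides do not specify the analysis settings.

```python
import numpy as np

N_FFT, HOP = 256, 128                 # assumed analysis parameters
WIN = np.hanning(N_FFT + 1)[:-1]      # periodic Hann window

def stft(x):
    """Complex STFT of x, shape (N_FFT // 2 + 1, n_frames)."""
    xp = np.pad(x, N_FFT)             # zero-pad both ends so edges are covered
    n_frames = 1 + (len(xp) - N_FFT) // HOP
    frames = np.stack([xp[t * HOP : t * HOP + N_FFT] * WIN
                       for t in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def istft(X, length):
    """Weighted overlap-add inverse of `stft`, trimmed to `length` samples."""
    frames = np.fft.irfft(X, n=N_FFT, axis=0)
    out = np.zeros(N_FFT + HOP * (X.shape[1] - 1))
    norm = np.zeros_like(out)
    for t in range(X.shape[1]):
        sl = slice(t * HOP, t * HOP + N_FFT)
        out[sl] += frames[:, t] * WIN
        norm[sl] += WIN ** 2
    out /= np.maximum(norm, 1e-12)
    return out[N_FFT : N_FFT + length]

def separate(x, M):
    """Apply mask M to the mixture STFT; return (background, foreground)."""
    background = istft(M * stft(x), len(x))
    foreground = x - background       # subtraction in the time domain
    return background, foreground

# Hypothetical mixture and a placeholder mask that keeps half of every bin.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
M = np.full(stft(x).shape, 0.5)
background, foreground = separate(x, M)
```

In a real run `M` would be the soft mask derived from the repeating spectrogram, computed on the magnitude of the same STFT so that the shapes match. Because the foreground is defined by subtraction, background and foreground always sum back to the mixture.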
Music/Voice Separation
- Repeating background ≈ music component
- Non-repeating foreground ≈ voice component
[Figure: REPET-SIM (1. Repeating elements, 2. Repeating model, 3. Repeating structure) with output waveforms Foreground Signal and Background Signal; x-axis: time (s)]
Evaluation
- Competitive method 1 [Liutkus et al., 2012]
  – Adaptive REPET with automatic periods finder and soft time-frequency masking
- Competitive method 2 [FitzGerald et al., 2010]
  – Median filtering of the spectrogram at different frequency resolutions to extract the vocals
- Data set
  – 14 full-track real-world songs (Beach Boys)
  – 3 voice-to-music mixing ratios (-6, 0, and 6 dB)
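To illustrate what a voice-to-music mixing ratio in dB means, here is a small sketch that scales a voice track so that its power relative to the music matches a target ratio before summing. The power-based definition of the ratio, the choice to scale the voice rather than the music, and the function name are assumptions; the slides do not describe how the mixtures were built.

```python
import numpy as np

def mix_at_ratio(voice, music, ratio_db):
    """Mix voice and music at a given voice-to-music power ratio in dB.

    Returns (mixture, scaled_voice), where
    10 * log10(mean(scaled_voice**2) / mean(music**2)) == ratio_db.
    """
    p_voice = np.mean(voice ** 2)
    p_music = np.mean(music ** 2)
    gain = np.sqrt(p_music / p_voice * 10.0 ** (ratio_db / 10.0))
    scaled_voice = gain * voice
    return scaled_voice + music, scaled_voice

# Hypothetical stand-ins for isolated voice and music tracks.
rng = np.random.default_rng(0)
voice = rng.standard_normal(1000)
music = 3.0 * rng.standard_normal(1000)

# The three ratios listed in the data set description.
mixtures = {r: mix_at_ratio(voice, music, r) for r in (-6, 0, 6)}
```

At -6 dB the music dominates, at 0 dB the two sources have equal power, and at 6 dB the voice dominates.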
Evaluation
[Figure: evaluation results. Legend: MMFS = FitzGerald et al.; REPET+ = Liutkus et al.; Proposed = REPET-SIM]
Examples
- REPET-SIM vs. FitzGerald et al.
[Figure: waveforms for "Wham! - Freedom": mixture, Voice estimate (REPET-SIM), Music estimate (REPET-SIM), Voice estimate (FitzGerald), Music estimate (FitzGerald); x-axis: time (s)]
- REPET-SIM
[Figure: waveforms for "Blackalicious - Alphabet Aerobics": mixture, Voice estimate, Music estimate; x-axis: time (s)]
- Adaptive REPET
[Figure: waveforms for "Blackalicious - Alphabet Aerobics": mixture, Voice estimate, Music estimate; x-axis: time (s)]
Conclusion
- The analysis of the repetitions/similarities in music can be used for source separation
[Figure: block diagram with labels Mixture Signal, REPET-SIM (1. Repeating elements, 2. Repeating model, 3. Repeating structure), Repeating Structure, Non-repeating Structure]
Questions?
- D. FitzGerald and M. Gainza, “Single Channel Vocal Separation using Median Filtering and Factorisation Techniques,” ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.
- J. Foote, “Visualizing Music and Audio using Self-Similarity,” ACM International Conference on Multimedia, Orlando, FL, USA, October 30-November 5, 1999.
- A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, “Adaptive Filtering for Music/Voice Separation exploiting the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.
- Z. Rafii and B. Pardo, “A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure,” IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.