SLIDE 1

Outline: Time-frequency masking, Fixed patterns, Varying patterns, Demonstration

Adaptive Filtering for Music/Voice Separation Exploiting the Repeating Musical Structure

Adaptive REPET

Antoine Liutkus (1), Zafar Rafii (2), Roland Badeau (1), Bryan Pardo (2), Gaël Richard (1)

(1) Telecom ParisTech, CNRS LTCI, Paris, France
(2) Northwestern University, EECS Department, Evanston, USA

ICASSP 2012, Kyoto, Japan

Liutkus, Rafii et al Adaptive REPET

SLIDE 2

Source separation: notation

[Figure: waveform and spectrogram of the voice (v, V), of the background (b, B), and of the mix (x, X); time in seconds or frames against frequency in Hz.]

SLIDE 3

Separation as an adaptive filter

Separating a source = filtering the mixture
Time-varying filter w_t: different for each frame t
Element-wise weighting of the STFT
Here: W ∈ [0, 1]

[Figure: time-frequency mask W, mix spectrogram X, and weighted mix spectrogram W .* X; frame against frequency (Hz).]
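The element-wise weighting described on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `apply_tf_mask` and the toy random STFT are hypothetical, not taken from the talk.

```python
import numpy as np

def apply_tf_mask(X, W):
    """Element-wise weighting of an STFT X (frequency x frames) by a mask W in [0, 1]."""
    X = np.asarray(X)
    W = np.asarray(W, dtype=float)
    if X.shape != W.shape:
        raise ValueError("X and W must have the same shape")
    if W.min() < 0.0 or W.max() > 1.0:
        raise ValueError("mask values must lie in [0, 1]")
    return W * X  # the W .* X operation shown in the figure

# Toy data (assumption): a random complex "STFT" and a random mask.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))
W = rng.uniform(0.0, 1.0, size=X.shape)

Y = apply_tf_mask(X, W)        # estimate of the source of interest
R = apply_tf_mask(X, 1.0 - W)  # the complementary mask gives the remainder
```

Because the mask and its complement sum to one in every bin, the two filtered spectrograms add back up to the mixture.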

SLIDE 4

Time-Frequency masks

Interpretation

W(f, t) ∈ [0, 1]: proportion of the source of interest in the mix.
W(f, t) ≈ 1 ⇒ TF bin (f, t) mostly comes from the source of interest
W(f, t) ≈ 0 ⇒ TF bin (f, t) mostly comes from other sources

Comb filter

Given a pitch contour f0(t), keep multiples of f0(t)

[Figure: time-varying comb filter; frame against frequency (Hz).]
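As a rough illustration of the comb filter, the sketch below builds a binary mask that keeps the bins lying close to a multiple of f0(t). The tolerance `half_width_hz`, the convention that `f0 <= 0` marks an unvoiced frame, and the function name `comb_mask` are assumptions made for this example.

```python
import numpy as np

def comb_mask(f0, freqs, half_width_hz=20.0):
    """Binary comb mask: 1 where a bin lies within half_width_hz of a harmonic of f0(t)."""
    f0 = np.asarray(f0, dtype=float)        # pitch contour, one value per frame
    freqs = np.asarray(freqs, dtype=float)  # centre frequency of each bin, in Hz
    W = np.zeros((freqs.size, f0.size))
    voiced = f0 > 0                         # assumption: f0 <= 0 means "no pitch"
    fv = f0[voiced]
    # Index of the nearest harmonic (at least 1) for every bin and voiced frame.
    h = np.maximum(1.0, np.round(freqs[:, None] / fv[None, :]))
    dist = np.abs(freqs[:, None] - h * fv[None, :])
    W[:, voiced] = (dist <= half_width_hz).astype(float)
    return W

freqs = np.arange(0.0, 1001.0, 50.0)  # toy frequency grid: 0, 50, ..., 1000 Hz
f0 = np.array([200.0, 0.0, 250.0])    # toy pitch contour, middle frame unvoiced
W = comb_mask(f0, freqs)
```

In the first frame only the bins at 200, 400, 600, 800 and 1000 Hz are kept, and the unvoiced frame is masked out entirely.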

SLIDE 5

Beyond the harmonic model

Modeling the accompaniment

Most studies focus on harmonic voice models:

Voice assumed harmonic, and the predominant pitch is estimated
Filtering, e.g. through comb filters

Problems:

Breathy voices? Consonants? Loud accompaniment?

We focus on a model for the background B!

SLIDE 6

Filtering given the model

From the B to the mask

Mask from B alone

Imagine X and B are available. What is W?
X(f, t) close to B(f, t) → W(f, t) ≈ 1
X(f, t) far from B(f, t) → W(f, t) ≈ 0

Binary mask: 0 or 1 based on a thresholding of B / X

Soft mask:

$$W(f, t) = \exp\left( -\frac{\left( \log X(f, t) - \log B(f, t) \right)^2}{\lambda^2} \right)$$
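The two masks can be sketched as follows, assuming X and B are magnitude spectrograms of the same shape. The threshold value of 0.5, the small constant `EPS` that guards the logarithm and the division, and the function names are assumptions for illustration; the slide does not specify them.

```python
import numpy as np

EPS = 1e-12  # assumption: floor that avoids log(0) and division by zero

def soft_mask(X, B, lam=1.0):
    """Soft mask W = exp(-(log X - log B)^2 / lam^2), computed bin by bin."""
    X = np.maximum(np.asarray(X, dtype=float), EPS)
    B = np.maximum(np.asarray(B, dtype=float), EPS)
    return np.exp(-((np.log(X) - np.log(B)) ** 2) / lam ** 2)

def binary_mask(X, B, threshold=0.5):
    """Binary mask obtained by thresholding the ratio B / X."""
    X = np.maximum(np.asarray(X, dtype=float), EPS)
    B = np.asarray(B, dtype=float)
    return (B / X >= threshold).astype(float)

# Toy spectrograms (assumption): the second bin of the first row is far from B.
X = np.array([[1.0, 10.0], [2.0, 2.0]])
B = np.array([[1.0, 1.0], [2.0, 1.9]])
W_soft = soft_mask(X, B, lam=1.0)
W_bin = binary_mask(X, B, threshold=0.5)
```

Where X equals B the soft mask is exactly 1, and it decays smoothly as the log-spectra move apart; a larger λ makes the mask more tolerant.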

SLIDE 7

Repeating patterns in music

Modeling B

Musical background is repetitive!

[Figure: background spectrogram B and its repeating pattern of period T; frames against frequency (Hz).]

Given several repetitions, average them to estimate B and filter it out!

SLIDE 8

REpeating Pattern Extraction Technique (REPET)

Original REPET algorithm

Estimate a fixed repeating period T
Estimate the fixed repeating pattern through averaging
Compute W as a binary mask
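The last two steps can be sketched as below for a known period T (in frames). This is a minimal illustration and not the authors' implementation: the function names, the 0.5 threshold, and clipping the repeated pattern to the mixture with `np.minimum` are assumptions, and the period is taken as given rather than estimated.

```python
import numpy as np

def repeating_pattern(X, T):
    """Average the consecutive length-T segments of a magnitude spectrogram X."""
    n_freq, n_frames = X.shape
    n_seg = n_frames // T
    if n_seg < 1:
        raise ValueError("the spectrogram is shorter than one period")
    segments = X[:, :n_seg * T].reshape(n_freq, n_seg, T)
    return segments.mean(axis=1)

def fixed_repet_mask(X, T, threshold=0.5):
    """Binary background mask from a fixed repeating pattern of period T."""
    pattern = repeating_pattern(X, T)
    n_frames = X.shape[1]
    reps = -(-n_frames // T)                         # ceiling division
    B = np.tile(pattern, (1, reps))[:, :n_frames]    # repeat the pattern over time
    ratio = np.minimum(B, X) / np.maximum(X, 1e-12)  # background cannot exceed the mix
    return (ratio >= threshold).astype(float), B

# Toy mixture (assumption): a pattern repeated four times plus one loud "voice" bin.
rng = np.random.default_rng(0)
pattern = rng.uniform(1.0, 2.0, size=(6, 10))
X = np.tile(pattern, (1, 4))
X[3, 25] += 50.0
W, B = fixed_repet_mask(X, T=10)
```

On this toy example the mask is 1 everywhere except at the bin where the extra energy was added, which is therefore attributed to the non-repeating source.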

SLIDE 9

Advantages and limitations of REPET

Advantages

Fast
Efficient for constant rhythmic patterns (electro, short excerpts)

Limitations

The repeating pattern changes over time
Binary masking leads to artifacts

We extend REPET to varying repeating patterns

SLIDE 10

Pseudo-periodic patterns

Patterns are not fixed:

The period may vary
The pattern may vary

Frequency bands of B are assumed pseudo-periodic, with the same period

[Figure: background spectrogram B (frame against frequency in Hz), and the log-value of three of its frequency bands (bands 20, 40 and 60) over frames.]

SLIDE 11

Beat-spectrum estimation

Estimating the period (1/2)

Perform a short-term analysis of each band
Add them all together
Beat spectrogram: rhythmic content of the signal

[Figure: mix spectrogram X (frame against frequency in Hz); spectrograms of bands 20, 401 and 1001 and the resulting beat spectrogram (bag of frames against frequency in 1/frame).]
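One possible reading of these steps is sketched below: each frequency band of the spectrogram is treated as a signal over frames, a short-term Fourier analysis is run on it, and the resulting power spectra are summed over bands. The window length `win`, the hop `hop`, the Hann window and the removal of the per-band mean are all assumptions made for this example.

```python
import numpy as np

def beat_spectrogram(X, win=256, hop=64):
    """Sum over bands of the short-term power spectrum of each row of X.

    X is a magnitude spectrogram (frequency x frames). The result has shape
    (win // 2 + 1, n_blocks): modulation frequency (1/frame) x bag of frames.
    """
    n_freq, n_frames = X.shape
    if n_frames < win:
        raise ValueError("need at least `win` frames")
    window = np.hanning(win)
    starts = range(0, n_frames - win + 1, hop)
    out = np.zeros((win // 2 + 1, len(starts)))
    for j, s in enumerate(starts):
        block = X[:, s:s + win]
        block = block - block.mean(axis=1, keepdims=True)  # drop the constant part
        spec = np.fft.rfft(block * window, axis=1)         # one spectrum per band
        out[:, j] = (np.abs(spec) ** 2).sum(axis=0)        # add all bands together
    return out

# Toy spectrogram (assumption): every band oscillates with a period of 16 frames.
t = np.arange(1024)
phases = np.linspace(0.0, np.pi, 8)
X = 1.0 + np.cos(2 * np.pi * t[None, :] / 16 + phases[:, None])
bs = beat_spectrogram(X, win=256, hop=64)
```

With a window of 256 frames, a period of 16 frames shows up as a peak at bin 256 / 16 = 16 in every column.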

SLIDE 12

Pseudo-period estimation

Compute the beat spectrogram
Estimate the time-varying repeating period
Any frequency-based pitch detector will do!
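Since the slide leaves the choice of detector open, the sketch below uses the simplest stand-in, picking the strongest bin in each column of the beat spectrogram and converting it to a period in frames. It is not the detector used by the authors; the function name, the search bounds `min_period` and `max_period`, and the assumed layout of the input (bins of a length-`win` real FFT along the first axis) are assumptions.

```python
import numpy as np

def estimate_periods(beat_spec, win, min_period=2, max_period=None):
    """Peak-picking period estimate, one value (in frames) per column."""
    n_bins = beat_spec.shape[0]
    if max_period is None:
        max_period = win // 2
    # A period of P frames corresponds to bin win / P of a length-win FFT.
    k_min = max(1, int(np.ceil(win / max_period)))
    k_max = min(n_bins - 1, int(np.floor(win / min_period)))
    peak = k_min + np.argmax(beat_spec[k_min:k_max + 1], axis=0)
    return win / peak

# Toy beat spectrogram (assumption): the period shortens from 16 to 8 frames.
beat_spec = np.zeros((129, 2))
beat_spec[16, 0] = 1.0
beat_spec[32, 1] = 1.0
T0 = estimate_periods(beat_spec, win=256)
```

A more robust detector would also have to deal with harmonics of the true period, which plain peak picking ignores.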

SLIDE 13

Background model given T0 (t)

Background model

∀t, the accompaniment is periodic for 2K periods around t:

$$B(f, t) = B(f, t + k T_0(t)), \quad k = -K, \dots, K$$

SLIDE 14

Voice model

Voice model

The voice V is assumed to be sparse

[Figure: voice spectrogram V; frequency (Hz).]

SLIDE 15

Background estimation

Estimation of B given X and T0(t)

Sparsity of V

Most of the time, V ≈ 0 ⇒ X ≈ B
Sometimes, V active ⇒ outliers

[Figure: mix spectrogram X; frequency (Hz).]

$$\hat{B}(f, t) = \operatorname{median}_{k = -K, \dots, K} \left[ X(f, t + k T_0(t)) \right]$$
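The median estimator above can be sketched directly, assuming a magnitude spectrogram X and one period per frame. Rounding T0(t) to an integer number of frames and dropping the neighbours that fall outside the recording are assumptions of this example, as is the function name.

```python
import numpy as np

def adaptive_background(X, T0, K=2):
    """Median over the frames t + k * T0(t), k = -K..K, for every frame t."""
    n_freq, n_frames = X.shape
    T0 = np.asarray(T0)
    B_hat = np.empty_like(X, dtype=float)
    ks = np.arange(-K, K + 1)
    for t in range(n_frames):
        idx = t + ks * int(round(T0[t]))          # frames one or more periods away
        idx = idx[(idx >= 0) & (idx < n_frames)]  # assumption: ignore frames out of range
        B_hat[:, t] = np.median(X[:, idx], axis=1)
    return B_hat

# Toy mixture (assumption): a pattern of period 10 plus one sparse "voice" outlier.
rng = np.random.default_rng(0)
pattern = rng.uniform(1.0, 2.0, size=(6, 10))
background = np.tile(pattern, (1, 6))
X = background.copy()
X[3, 25] += 50.0
B_hat = adaptive_background(X, T0=np.full(60, 10.0), K=2)
```

The single outlier does not affect the median, so the estimate matches the repeating background on this toy example, which is the reason a median suits a sparse voice better than a mean.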

SLIDE 16

Adaptive REPET Block diagram

SLIDE 17

Demonstration

Demonstration on different musical genres

SLIDE 18

Conclusion

Adaptive algorithms for complete recordings
Fast (approx. reading time)
Extensions: from repetitivity to self-similarity
