Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Daniel Stoller (1), Sebastian Ewert (2)*, Simon Dixon (1)
(1) Centre for Digital Music, Queen Mary University of London
(2) Spotify London
LVA ICA, 05.07.2018
* Work was conducted at Queen Mary University of London

Vocal separation: Introduction
- Main task: Separate vocals from music pieces
- Applications: Karaoke generation, singer identification, voice analysis, ...

Vocal separation: Challenges
- Difficult task, small multi-track datasets ⇒ Overfitting
- Give model more knowledge:
  - Regularisation (e.g. weight decay): regularise the separator model
  - Knowledge-driven (e.g. KAM [4]): restrict the separator model
  - Informed source separation [2]: give the separator model side information
  - Integrate information from related tasks/datasets: share information between the separator model (music from dataset A → accompaniment, voice) and a second model (music from dataset B → label)
[Diagram: Music → Separator model → Accompaniment, Voice]

Goal
- Which other tasks could help?
- Vocal activity detection is promising:
  - Knowing vocal activity improves vocal separation [1]
  - Vocal detection networks learn a form of separation [5]

Initial approach: Using additional non-vocal sections
- U-Net adaptation [3] as separator, MSE loss
- Sample instrumental sections also from singing voice detection (SVD) databases, not only from singing voice separation (SVS) databases ⇒ Diversifies instrumental training data
[Diagram: Song A from the SVS database, Song B from the SVD database]
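The sampling idea on this slide can be sketched as follows. This is only an illustration under assumptions that the slides do not confirm: the function name `nonvocal_excerpts`, the per-sample boolean annotation `vocal_active`, and the fixed non-overlapping excerpt grid are all hypothetical. The point is that a section annotated as non-vocal gives a full separation target for free: the vocal target is silence and the accompaniment target is the mixture itself.

```python
import numpy as np

def nonvocal_excerpts(mixture, vocal_active, excerpt_len):
    """Collect separation training triples from an SVD-annotated track.

    mixture      : 1-D array with the mixture signal (hypothetical layout)
    vocal_active : 1-D boolean array, True where vocals are annotated
    excerpt_len  : excerpt length in samples

    Returns a list of (mixture, vocals, accompaniment) triples taken only
    from excerpts that contain no annotated vocals.
    """
    examples = []
    last_start = len(mixture) - excerpt_len
    for start in range(0, last_start + 1, excerpt_len):
        stop = start + excerpt_len
        if vocal_active[start:stop].any():
            continue  # vocals present: no separation target available
        mix = mixture[start:stop]
        # No vocals: the vocal target is silence and the accompaniment
        # target equals the mixture.
        examples.append((mix, np.zeros_like(mix), mix.copy()))
    return examples

# Toy usage: 8 samples, vocals annotated only at sample 4.
mixture = np.arange(8.0)
vocal_active = np.array([0, 0, 0, 0, 1, 0, 0, 0], dtype=bool)
examples = nonvocal_excerpts(mixture, vocal_active, excerpt_len=2)
```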

Initial approach: Results
[Diagram: datasets grouped under the legend Training (SVS), Training (SVD), Validation/Test (SVS), shown in three configurations]
- Configuration 1 (DSD100, Jamendo, DSD100, Electro, RWC): performance decrease
- Configuration 2 (DSD100, Jamendo, DSD100, Electro, CCMixter, RWC, iKala): performance increase
- Configuration 3 (DSD100, Jamendo, DSD100, CCMixter, Electro, CCMixter, iKala, RWC, iKala, MedleyDB, MedleyDB): performance decrease
- Dataset bias?

Dataset bias: Analysis
[Figure: three bar charts comparing datasets.
 a) Mean RMS of mixture (DSD, MDB, CCM, iKala, Jam., RWC)
 b) Accomp. to vocals RMS ratio (DSD, MDB, CCM, iKala)
 c) Mean vocal activity (DSD, MDB, CCM, iKala, Jam., RWC)]
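The three statistics named in the panels could be computed roughly as below. This is a minimal sketch, not the authors' analysis code: the track layout (a dict with `vocals`, `accompaniment` and a boolean `vocal_active` array) and the function names are assumptions, and the slides do not say how values were averaged. Panel b) requires separate stems, which is consistent with it listing only four datasets.

```python
import numpy as np

def rms(x):
    """Root mean square of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def dataset_stats(tracks):
    """Average three per-track statistics over a dataset.

    tracks: list of dicts with keys (hypothetical layout)
      'vocals', 'accompaniment' : 1-D float arrays of equal length
      'vocal_active'            : 1-D boolean array of vocal annotations
    Tracks with silent vocals would need special handling for the ratio.
    """
    mix_rms = [rms(t["vocals"] + t["accompaniment"]) for t in tracks]
    ratio = [rms(t["accompaniment"]) / rms(t["vocals"]) for t in tracks]
    activity = [float(np.mean(t["vocal_active"])) for t in tracks]
    return {
        "mean_mixture_rms": float(np.mean(mix_rms)),
        "accomp_to_vocals_rms_ratio": float(np.mean(ratio)),
        "mean_vocal_activity": float(np.mean(activity)),
    }

# Toy usage with a single four-sample track.
track = {
    "vocals": np.array([1.0, -1.0, 1.0, -1.0]),
    "accompaniment": np.array([2.0, 2.0, -2.0, -2.0]),
    "vocal_active": np.array([True, True, False, False]),
}
stats = dataset_stats([track])
```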

Multi-task approach: Introduction and motivation
- Key idea: Predict both audio and label
[Diagram: Music (SVS dataset) → Separator model → Accompaniment, Voice; Music (SVD dataset) → Detection model → Voice activity label; the two models share weights]

Separation loss, on pairs (m, s) drawn from the SVS data p_1:
L_{MSE} = \mathbb{E}_{(m,s) \sim p_1} \left[ \frac{1}{N} \lVert s - f_\phi(m) \rVert^2 \right]

Detection loss, on pairs (m, o) drawn from the SVD data p_2:
L_{CE} = -\mathbb{E}_{(m,o) \sim p_2} \left[ \frac{1}{T} \sum_{t=1}^{T} \log p_\phi^t(o_t \mid m) \right]

Multi-task loss:
L_{MTL} = \alpha L_{MSE} + (1 - \alpha) L_{CE}

- Robust to dataset bias and label accuracy
- Can train with vocal sections from SVD data
- Needs only mixture at test time
- Solves two tasks at once
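As a rough illustration of the weighted combination L_MTL = α L_MSE + (1 − α) L_CE, the sketch below evaluates both terms on precomputed network outputs with NumPy. The function names, the binary per-frame labels, and the clipping constant `eps` are assumptions made for the example; the shared-weight network itself is not modelled here.

```python
import numpy as np

def mse_loss(target, estimate):
    """Mean squared error between target source s and estimate f(m)."""
    return float(np.mean((target - estimate) ** 2))

def ce_loss(labels, probs, eps=1e-7):
    """Binary cross-entropy averaged over time frames.

    labels : 0/1 vocal activity per frame (o_t)
    probs  : predicted probability of vocal activity per frame
    eps    : clipping constant to avoid log(0) (an assumption)
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    log_lik = np.where(labels == 1, np.log(probs), np.log(1.0 - probs))
    return float(-np.mean(log_lik))

def mtl_loss(sep_batch, det_batch, alpha):
    """Weighted sum of the separation and detection losses."""
    target, estimate = sep_batch   # from the SVS dataset
    labels, probs = det_batch      # from the SVD dataset
    return alpha * mse_loss(target, estimate) + (1.0 - alpha) * ce_loss(labels, probs)

# Toy usage: one separation example and one detection example.
sep_batch = (np.zeros(4), np.ones(4))
det_batch = (np.array([1, 0]), np.array([0.5, 0.5]))
loss = mtl_loss(sep_batch, det_batch, alpha=0.5)
```

Setting `alpha=1.0` recovers the pure separation loss and `alpha=0.0` the pure detection loss, so a single parameter moves between the two single-task objectives.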

Experimental setup: Model architecture and dataset
- DSD100 as SVS, Jamendo as SVD training data

Experimental setup: Evaluation metrics (AU-ROC, MSE, SDR)
- AU-ROC for SVD
- MSE and SDR/SIR/SAR for separation
- SDR gives log(0) for non-vocal sections (≈ 10%) ⇒ Also measure RMS of vocal estimates for non-vocal sections
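The log(0) problem can be seen with a simplified energy-ratio SDR, 10 log10(||s||² / ||s − ŝ||²). This is not the full BSS Eval definition used for SDR/SIR/SAR, only a sketch of why a silent reference breaks the metric; the function names and the fallback logic are assumptions for illustration.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified SDR in dB; returns None when the reference is silent.

    With a silent reference the numerator energy is zero, so the ratio
    would be log(0) and the metric is undefined.
    """
    signal_energy = float(np.sum(reference ** 2))
    if signal_energy == 0.0:
        return None
    error_energy = float(np.sum((reference - estimate) ** 2))
    if error_energy == 0.0:
        return float("inf")  # perfect reconstruction
    return 10.0 * np.log10(signal_energy / error_energy)

def evaluate_vocal_segment(reference, estimate):
    """Report SDR for vocal segments, RMS of the estimate otherwise."""
    sdr = simple_sdr(reference, estimate)
    if sdr is None:
        # Non-vocal segment: any energy in the estimate is leakage.
        leak_rms = float(np.sqrt(np.mean(estimate ** 2)))
        return {"sdr": None, "nonvocal_rms": leak_rms}
    return {"sdr": sdr, "nonvocal_rms": None}

# Toy usage: one vocal segment and one silent (non-vocal) segment.
vocal = evaluate_vocal_segment(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
silent = evaluate_vocal_segment(np.zeros(4), np.array([0.1, -0.1, 0.1, -0.1]))
```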

Results: Single-task vs. multi-task model

                                      Vocals              Accompaniment
Model  AU-ROC  MSE      Non-voc. RMS  SDR   SIR   SAR     SDR   SIR    SAR
SVD    0.9239  -        -             -     -     -       -     -      -
SVS    -       0.01865  0.0194        2.83  5.27  6.88    6.71  14.75  13.25
Ours   0.9250  0.01755  0.0155        2.86  5.56  6.23    6.69  13.24  14.11

Table: Comparing SVS and SVD baseline with our approach

Results: Qualitative comparison

Summary
- Current SotA methods only use multi-track data
- Our approach also uses SVD databases
- Improved separation and detection performance
- Future work: Larger datasets, more related tasks

References
[1] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang. Vocal activity informed singing voice separation with the iKala dataset. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 718–722. IEEE, 2015.
[2] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley. Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3):116–124, 2014.
[3] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 323–332, 2017.
[4] A. Liutkus, D. Fitzgerald, and Z. Rafii. Scalable audio separation with light kernel additive modelling.
