GCT634@KAIST Invited lecture: Sound Source Separation
7 June 2018 Keunwoo Choi at QMUL.uk, Spotify.us, groovo.io
Sound Source Separation
Let’s isolate the “target” audio signal!
Cocktail party effect: ..as if we’re simulating the human brain (as if we know what’s going on there)
Input                                 | Target       | Noise
Speech + ambience                     | Speech       | Ambience
Mixture of speech                     | Speaker i    | All speakers j != i
Music (vocal, drum, guitar, bass, ..) | Instrument i | All instruments j != i
problem = f(assumptions)
assumptions = {environments: {dry, wet, ..},
               signal: {ch: {mono, stereo, ..}, content: {speech, music}},
               target: {...}}
Solving SSS would make many other tasks much easier
s    : source signals (instruments)
a_xx : amplitude mixing coefficients
x    : stereo input signal
w    : estimated mixing coefficients
y    : estimated source signals (instruments)
demo/bss2to4/index.html
Further study: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
→ In real (reverberant) environments, the mixing matrix A also involves time delays, not just amplitudes
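The instantaneous-mixing model above (x = A s, recovered as y = W x) can be sketched with a toy FastICA in NumPy. This is my own minimal implementation of Hyvärinen's fixed-point rule, and the two "instrument" signals and the 2×2 mixing matrix are made up for illustration:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Toy symmetric FastICA (tanh nonlinearity): estimates W so that
    y = W x recovers independent sources from the mixture x = A s.
    X has shape (n_samples, n_mixtures)."""
    X = X - X.mean(axis=0)
    # whiten: decorrelate the channels and scale to unit variance
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))
    Xw = X @ E @ np.diag(d ** -0.5) @ E.T
    n = Xw.shape[1]
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        g = np.tanh(Xw @ W.T)
        # Hyvarinen's fixed point: w <- E[x g(w.x)] - E[g'(w.x)] w
        W_new = (g.T @ Xw) / len(Xw) - np.diag((1 - g ** 2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt  # symmetric decorrelation keeps the rows orthonormal
    return Xw @ W.T, W

# two toy "instruments" and a made-up 2x2 amplitude mixing matrix A
t = np.linspace(0, 1, 4000)
S = np.c_[np.sin(2 * np.pi * 5 * t), np.sign(np.sin(2 * np.pi * 3 * t))]
A = np.array([[1.0, 0.5], [0.4, 1.0]])
X = S @ A.T          # observed stereo mixture x = A s
Y, W = fastica(X)    # estimated sources, up to order/sign/scale
```

Note the inherent ambiguities: the recovered sources come back in arbitrary order, sign, and scale, which is exactly why blind separation can only get you "up to permutation and scaling".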
to suppress non-speech sounds (but perhaps not in your earphones/headphones)
http://www.physics.usyd.edu.au/teach_res/hsp/sp/mod31/m31_strings.htm
https://www.slideshare.net/DaichiKitamura/robust-music-signal-separation-based-on-supervised-nonnegative-matrix-factorization-with-prevention-of-basis-sharing
2017), ...
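The NMF approach referenced above factorises a magnitude spectrogram V into nonnegative spectral templates and activations. A minimal sketch using the standard Lee–Seung multiplicative updates (the rank, iteration count, and random test data below are illustrative choices of mine):

```python
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF (Frobenius cost): V ~= W @ H with all entries
    nonnegative. For audio, V is a magnitude spectrogram: columns of W act as
    spectral templates, rows of H as their activations over time."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update templates
    return W, H
```

Supervised variants (as in the slides cited above) pre-learn W per source and keep it fixed, so H then says how active each source is in each frame.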
Vocals are usually panned at the centre (and we all love karaoke)
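The centre-channel trick can be sketched in a couple of lines (toy signals; the function name is mine):

```python
import numpy as np

def remove_center(left, right):
    """Classic "karaoke" trick: a source panned dead centre is identical in
    both channels, so the channel difference cancels it and keeps only the
    side-panned content."""
    return (left - right) / 2.0
```

In practice the cancellation is rarely clean: stereo reverb on the vocal survives, and other centre-panned parts (bass, kick) are removed along with it.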
Harmonic and percussive components behave differently along the spectral/temporal axes (FitzGerald, DAFx)
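FitzGerald's median-filtering HPSS exploits exactly this: harmonic energy is smooth along time, percussive energy along frequency. A sketch of the masking step (the kernel size is an assumed hyperparameter; SciPy supplies the median filter):

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss(S, kernel=17, eps=1e-10):
    """Median-filtering HPSS on a magnitude spectrogram S (freq x time)."""
    H = median_filter(S, size=(1, kernel))   # median along time -> harmonic estimate
    P = median_filter(S, size=(kernel, 1))   # median along freq -> percussive estimate
    mask = H / (H + P + eps)                 # soft mask in [0, 1]
    return mask * S, (1.0 - mask) * S
```

A horizontal ridge in the spectrogram (a steady partial) survives the time-direction median and ends up in the harmonic output; a vertical ridge (a broadband hit) ends up in the percussive output.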
“Gaussian mixture model for singing voice separation from stereophonic music”, M Kim et al., 2011
“Score-Informed Source Separation for Musical Audio Recordings”, S Ewert et al., 2013
Stronger assumptions → less generality
as time goes by
E.g., a model trained on speech probably wouldn’t work on music.
Does it estimate the phase? Can it take stereo input?
“Deep Learning For Monaural Speech Separation”, Po-sen Huang et al, 2014
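Most of these deep separators answer the phase question by not estimating it: the network predicts a time-frequency mask over the magnitude spectrogram, and the mixture phase is reused as-is. A sketch of that masking step (function name and shapes are my own):

```python
import numpy as np

def ratio_mask_separate(mag_target, mag_rest, mixture_stft, eps=1e-10):
    """Apply a soft (ratio) time-frequency mask, built from estimated source
    magnitudes, to the complex mixture STFT. The mixture phase passes through
    untouched, which is why many masking models never estimate phase."""
    mask = mag_target / (mag_target + mag_rest + eps)   # in [0, 1] per TF bin
    return mask * mixture_stft
```

Inverting the masked STFT then gives the separated waveform with the (slightly wrong) mixture phase, which is usually perceptually acceptable.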
“Singing Voice Separation With Deep U-Net Convolutional Networks”, A Jansson et al., ISMIR 2017
“U-Net: Convolutional Networks for Biomedical Image Segmentation”, O Ronneberger et al., 2015
Dataset
x: [mixtures]   y: [instrumental mixtures; vocal tracks]
Inst 1 | Vocal 1
Inst 2 | Vocal 2
Inst 3 | Vocal 3
→ paired dataset
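Under this paired assumption, building one training example is trivial (a sketch with made-up names, assuming the stems are time-aligned):

```python
import numpy as np

def make_paired_example(inst, vocal):
    """One paired training example for vocal separation: the input is the
    stems' mixture, the targets are the stems themselves."""
    x = inst + vocal               # what the separator hears
    y = np.stack([inst, vocal])    # what it should output
    return x, y
```

The hard part is not the code but the data: aligned (instrumental, vocal) stem pairs are scarce, which motivates the semi-supervised setup below.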
{many instrumental tracks} (the “real” examples) + {many vocal + instrumental tracks} (whose separated output is the “fake”)
The discriminator judges {real instrumentals vs vocal-separated (“fake”) instrumentals}, and the separator and discriminator learn simultaneously.
Inst tracks | Mix tracks
→ unpaired dataset
Inst 1 | Vocal 1
Inst 2 | Vocal 2
Inst 3 | Vocal 3
→ paired dataset
“Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction”, D Stoller et al., ICASSP 2018
proceedings/tutorial_1_Vincent-Ono.pdf