GCT634@KAIST Invited lecture: Sound Source Separation
7 June 2018 Keunwoo Choi at QMUL.uk, Spotify.us, groovo.io
Sound Source Separation
Let’s isolate the “target” audio signal!
Cocktail party effect: ..as if we’re simulating the human brain (as if we know what’s going on there)
Input                                 | Target       | Noise
Speech + ambience                     | Speech       | Ambience
Mixture of speech                     | Speaker i    | All speakers j != i
Music (vocal, drum, guitar, bass, ..) | Instrument i | All instruments j != i
problem = f(assumptions)
assumptions = {environments: {dry, wet, ..},
               signal: {ch: {mono, stereo, ..}, content: {speech, music}},
               target: {...}}
Solving SSS would make many other tasks much easier
s    : source signals (instruments)
a_xx : amplitude mixing coefficients
x    : stereo input signal
w    : estimated mixing coefficients
y    : estimated source signals (instruments)
demo/bss2to4/index.html
Further study: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
→ In real (reverberant) environments, the mixing matrix A also involves time delays, not just amplitudes
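The instantaneous-mixing model above (x = A s, recovered as y = W x) can be sketched with a toy FastICA in NumPy. This is my own minimal implementation of Hyvärinen's fixed-point rule, and the two "instrument" signals and the 2×2 mixing matrix are made up for illustration:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Toy symmetric FastICA (tanh nonlinearity): estimates W so that
    y = W x recovers independent sources from the mixture x = A s.
    X has shape (n_samples, n_mixtures)."""
    X = X - X.mean(axis=0)
    # whiten: decorrelate the channels and scale to unit variance
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))
    Xw = X @ E @ np.diag(d ** -0.5) @ E.T
    n = Xw.shape[1]
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        g = np.tanh(Xw @ W.T)
        # Hyvarinen's fixed point: w <- E[x g(w.x)] - E[g'(w.x)] w
        W_new = (g.T @ Xw) / len(Xw) - np.diag((1 - g ** 2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt  # symmetric decorrelation keeps the rows orthonormal
    return Xw @ W.T, W

# two toy "instruments" and a made-up 2x2 amplitude mixing matrix A
t = np.linspace(0, 1, 4000)
S = np.c_[np.sin(2 * np.pi * 5 * t), np.sign(np.sin(2 * np.pi * 3 * t))]
A = np.array([[1.0, 0.5], [0.4, 1.0]])
X = S @ A.T          # observed stereo mixture x = A s
Y, W = fastica(X)    # estimated sources, up to order/sign/scale
```

Note the inherent ambiguities: the recovered sources come back in arbitrary order, sign, and scale, which is exactly why blind separation can only get you "up to permutation and scaling".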
to suppress non-speech sounds (but perhaps not in your earphones/headphones)
http://www.physics.usyd.edu.au/teach_res/hsp/sp/mod31/m31_strings.htm
https://www.slideshare.net/DaichiKitamura/robust-music-signal-separation-based-on-supervised-nonnegative-matrix-factorization-with-prevention-of-basis-sharing
2017), ...
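The NMF approach referenced above factorises a magnitude spectrogram V into nonnegative spectral templates and activations. A minimal sketch using the standard Lee–Seung multiplicative updates (the rank, iteration count, and random test data below are illustrative choices of mine):

```python
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF (Frobenius cost): V ~= W @ H with all entries
    nonnegative. For audio, V is a magnitude spectrogram: columns of W act as
    spectral templates, rows of H as their activations over time."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update templates
    return W, H
```

Supervised variants (as in the slides cited above) pre-learn W per source and keep it fixed, so H then says how active each source is in each frame.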
Vocals are usually panned at the centre (and we all love karaoke)
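The centre-channel trick can be sketched in a couple of lines (toy signals; the function name is mine):

```python
import numpy as np

def remove_center(left, right):
    """Classic "karaoke" trick: a source panned dead centre is identical in
    both channels, so the channel difference cancels it and keeps only the
    side-panned content."""
    return (left - right) / 2.0
```

In practice the cancellation is rarely clean: stereo reverb on the vocal survives, and other centre-panned parts (bass, kick) are removed along with it.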
Harmonic and percussive components behave differently along the spectral/temporal axes (FitzGerald, DAFx)
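FitzGerald's median-filtering HPSS exploits exactly this: harmonic energy is smooth along time, percussive energy along frequency. A sketch of the masking step (the kernel size is an assumed hyperparameter; SciPy supplies the median filter):

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss(S, kernel=17, eps=1e-10):
    """Median-filtering HPSS on a magnitude spectrogram S (freq x time)."""
    H = median_filter(S, size=(1, kernel))   # median along time -> harmonic estimate
    P = median_filter(S, size=(kernel, 1))   # median along freq -> percussive estimate
    mask = H / (H + P + eps)                 # soft mask in [0, 1]
    return mask * S, (1.0 - mask) * S
```

A horizontal ridge in the spectrogram (a steady partial) survives the time-direction median and ends up in the harmonic output; a vertical ridge (a broadband hit) ends up in the percussive output.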
“Gaussian mixture model for singing voice separation from stereophonic music”, M Kim et al., 2011
“Score-Informed Source Separation for Musical Audio Recordings”, S Ewert et al., 2013
Stronger assumptions → less generality
as time goes by
E.g., a model trained on speech probably wouldn’t work on music.
Does it estimate the phase? Can it take stereo input?
“Deep Learning For Monaural Speech Separation”, Po-sen Huang et al, 2014
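Most of these deep separators answer the phase question by not estimating it: the network predicts a time-frequency mask over the magnitude spectrogram, and the mixture phase is reused as-is. A sketch of that masking step (function name and shapes are my own):

```python
import numpy as np

def ratio_mask_separate(mag_target, mag_rest, mixture_stft, eps=1e-10):
    """Apply a soft (ratio) time-frequency mask, built from estimated source
    magnitudes, to the complex mixture STFT. The mixture phase passes through
    untouched, which is why many masking models never estimate phase."""
    mask = mag_target / (mag_target + mag_rest + eps)   # in [0, 1] per TF bin
    return mask * mixture_stft
```

Inverting the masked STFT then gives the separated waveform with the (slightly wrong) mixture phase, which is usually perceptually acceptable.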
“Singing Voice Separation With Deep U-Net Convolutional Networks”, A Jansson et al., ISMIR 2017
“U-Net: Convolutional Networks for Biomedical Image Segmentation”, O Ronneberger et al., 2015
Dataset
x: [mixtures]   y: [instrumental mixtures; vocal tracks]
Inst 1 | Vocal 1
Inst 2 | Vocal 2
Inst 3 | Vocal 3
→ paired dataset
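Under this paired assumption, building one training example is trivial (a sketch with made-up names, assuming the stems are time-aligned):

```python
import numpy as np

def make_paired_example(inst, vocal):
    """One paired training example for vocal separation: the input is the
    stems' mixture, the targets are the stems themselves."""
    x = inst + vocal               # what the separator hears
    y = np.stack([inst, vocal])    # what it should output
    return x, y
```

The hard part is not the code but the data: aligned (instrumental, vocal) stem pairs are scarce, which motivates the semi-supervised setup below.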
{many instrumental tracks} (the “real” examples) + {many vocal + instrumental tracks} (whose separated output is the “fake”)
The discriminator judges {real instrumentals vs vocal-separated (“fake”) instrumentals}, and the separator and discriminator learn simultaneously.
Inst tracks | Mix tracks
→ unpaired dataset
Inst 1 | Vocal 1
Inst 2 | Vocal 2
Inst 3 | Vocal 3
→ paired dataset
“Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction”, D Stoller et al., ICASSP 2018
proceedings/tutorial_1_Vincent-Ono.pdf