Deep Learning with Audio Signals
Prepare, Process, Design, Expect
Keunwoo Choi
Research Scientist
QMUL, UK; ETRI, S. Korea; SNU, S. Korea
@keunwoochoi (Twitter, GitHub)
WARNING: THIS MATERIAL IS WRITTEN FOR ATTENDEES OF QCON.AI, NAMELY SOFTWARE ENGINEERS AND DEEP LEARNING PRACTITIONERS, TO PROVIDE AN OFF-THE-SHELF GUIDE. MY ADVICE MIGHT NOT BE THE FINAL SOLUTION FOR YOUR PROBLEM, BUT IT SHOULD BE A GOOD STARTING POINT. ALSO, THERE'S NO SPOTIFY SECRET HERE :P
digital space
the sound is in the real world
Our lovely cyberspace
Source → Noise → Reverberation → Microphone
Dear everyone, YOU ARE ALWAYS IN THE "UGH..." SITUATION
DL models are robust only within the variance they've seen. → Good at interpolation.. only.
E.g., a model trained with clean signals probably can't deal with noisy signals
(noisy environment, cheap mic)
Noise: clean signal + noise signal → noisy signal
Reverberation: dry signal, convolved with a room impulse response → wet signal
Microphone: signal → band-pass filter → recorded signal
Noise
  babble noise recording, home noise recording, cafe noise recording, street noise recording
  white noise, brown noise
  x_noise = x + alpha * noise
Reverberation (maybe skip it)
  room impulse responses (RIR), reverberation simulators
  x_wet = np.convolve(x, rir)
Microphone
  band-pass filter: scipy.signal filtering
  microphone specification, speaker specification, microphone frequency response
  scipy.signal.convolve, scipy.signal.fftconvolve
  Or trimming off your spectrograms
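The three augmentations above can be sketched as follows. This is a minimal example, not the speaker's code: the helper names (`add_noise`, `add_reverb`, `band_pass`), the SNR-based choice of `alpha`, the 300-3400 Hz pass band, and the synthetic sine, noise, and impulse response are all assumptions for illustration. In practice you would load real noise recordings and measured RIRs.

```python
import numpy as np
from scipy import signal


def add_noise(x, noise, snr_db):
    """x_noise = x + alpha * noise, with alpha chosen to hit a target SNR in dB."""
    noise = np.resize(noise, x.shape)  # loop or trim the noise to the length of x
    x_power = np.mean(x ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Solve x_power / (alpha^2 * noise_power) = 10^(snr_db / 10) for alpha.
    alpha = np.sqrt(x_power / (noise_power * 10 ** (snr_db / 10)))
    return x + alpha * noise


def add_reverb(x, rir):
    """Convolve a dry signal with a room impulse response, keeping len(x) samples."""
    return signal.fftconvolve(x, rir, mode="full")[: len(x)]


def band_pass(x, sr, low_hz=300.0, high_hz=3400.0, order=4):
    """Butterworth band-pass filter as a rough stand-in for a band-limited microphone."""
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return signal.sosfilt(sos, x)


sr = 44100
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # hypothetical clean signal (1 s, 440 Hz)
noise = rng.standard_normal(sr)          # white noise
n_rir = sr // 10
rir = np.exp(-np.arange(n_rir) / 500.0) * rng.standard_normal(n_rir)  # synthetic decaying RIR

x_noise = add_noise(x, noise, snr_db=10.0)
x_wet = add_reverb(x, rir)
x_mic = band_pass(x, sr)
```

Parameterizing the noise level by SNR instead of a raw `alpha` keeps the augmentation strength comparable across clips of different loudness.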
Audio, 1 second: size=(44100,), dtype=int16
CIFAR10: (32, 32, 3), uint8
ImageNet: (256, 256, 3), uint8
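To make the size comparison concrete, here is a small sketch that counts values and bytes for each input. The array names are hypothetical, and 8-bit unsigned pixels are assumed for the images.

```python
import numpy as np

audio_1s = np.zeros((44100,), dtype=np.int16)       # 1 second at 44.1 kHz
cifar10 = np.zeros((32, 32, 3), dtype=np.uint8)     # one CIFAR10 image
imagenet = np.zeros((256, 256, 3), dtype=np.uint8)  # one resized ImageNet image

for name, arr in [("audio (1 s)", audio_1s), ("CIFAR10", cifar10), ("ImageNet", imagenet)]:
    print(f"{name:12s} values={arr.size:7d} bytes={arr.nbytes:7d}")
```

One second of raw audio holds about 14 times as many values as a CIFAR10 image, and the count grows linearly with clip length.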
Type          Description                          Data shape and size (e.g., 1 second, sampling rate = 44100)
Waveform      x                                    44100 x [int16]
Spectrograms  STFT(x)                              513 x 87 x [float32]
              Melspectrogram(x)                    128 x 87 x [float32]
              CQT(x)                               72 x 87 x [float32]
Features      MFCC(x) = some process on STFT(x)    20 x 87 x [float32]
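The 513 x 87 STFT shape can be reproduced with a short NumPy sketch. The slides do not state the analysis parameters; `n_fft=1024`, `hop=512`, centered frames, and a Hann window are assumptions that happen to give the same shape. `stft_magnitude` is a hypothetical helper, not a library function.

```python
import numpy as np


def stft_magnitude(x, n_fft=1024, hop=512):
    """Magnitude STFT with centered, Hann-windowed frames."""
    x = np.pad(x, n_fft // 2, mode="reflect")  # pad so frames are centered on their timestamps
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=0)).astype(np.float32)


sr = 44100
x = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # 1 s of placeholder audio
S = stft_magnitude(x)
print(S.shape)  # (513, 87)
```

The frequency axis has `n_fft // 2 + 1 = 513` bins and the time axis has `1 + 44100 // 512 = 87` frames, so a different `n_fft` or `hop` changes the table entries accordingly.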
Spoiler: log10(Melspectrograms) for the win, but let's see some details
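As a rough illustration of what a log10 mel-spectrogram involves, the sketch below builds a triangular mel filterbank in NumPy and applies it to a power spectrogram. The helper names, the HTK-style mel formula, the `eps` floor, and the random placeholder input are assumptions; audio libraries offer tuned versions of the same idea.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_melspectrogram(power_spec, sr, n_fft, n_mels=128, eps=1e-10):
    """Project a (n_fft // 2 + 1, n_frames) power spectrogram to mel bands, then take log10."""
    mel = mel_filterbank(sr, n_fft, n_mels) @ power_spec
    # eps guards against log10(0), e.g. from silent frames or empty low-frequency filters
    return np.log10(mel + eps)


sr, n_fft = 44100, 1024
power_spec = np.random.default_rng(0).random((n_fft // 2 + 1, 87))  # placeholder |STFT|^2
log_mel = log_melspectrogram(power_spec, sr, n_fft)
print(log_mel.shape)  # (128, 87)
```

The mel projection shrinks the frequency axis from 513 to 128 bins, and the logarithm compresses the dynamic range, which is closer to how loudness is perceived.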
TODO: IMAGE
https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
interested in.
humans are more interested
the training
= Best to optimize audio-related parameters
Disclaimer: I'm the maintainer
A dumb-but-strong-therefore-good-while-annoying-since-it's-from-computer-vision baseline approach
ConvNet, 3x3 kernels (aka VGGNet)
My spectrogram is 28x28 bc the model I downloaded is trained on MNIST
Don't use spectrograms as if they are images
It all boils down to pattern recognition; they're actually similar tasks.
The time and frequency axes have totally different meanings
I don't know how to incorporate them into my model..
BUT IT WORKS!
Conclusion!
analog process, too.
Reduce the size. Don't start with end-to-end.