Deep Learning with Audio Signals: Prepare, Process, Design, Expect - PowerPoint PPT Presentation



SLIDE 1

Deep Learning with Audio Signals

Prepare, Process, Design, Expect. Keunwoo Choi

SLIDE 2

Keunwoo Choi

QMUL, UK ETRI, S. Korea SNU, S. Korea @keunwoochoi (twtr, github)

Research Scientist

SLIDE 3

WARNING

THIS MATERIAL IS WRITTEN FOR ATTENDEES IN QCON.AI, NAMELY, SOFTWARE ENGINEERS AND DEEP LEARNING PRACTITIONERS TO PROVIDE AN OFF-THE- SHELF GUIDE. MY ADVICE MIGHT NOT BE THE FINAL SOLUTION FOR YOUR PROBLEM, BUT WOULD BE A GOOD STARTING POINT. ..ALSO, THERE'S NO SPOTIFY SECRET HERE :P

SLIDE 4

Content

  • Prepare the dataset
  • Pre-process the signal
  • Design your network
  • Expect the result
SLIDE 5

Prepare the datasets

  • or, know your data
  • Q. How to start an audio task?
SLIDE 6

LMGTFY

  • Google them, of course
  • But....
SLIDE 7

Audio dataset

  • Lucky → exactly the same class(es), many of them, yay!
  • Meh → same or similar classes, sounds alright..
  • Ugh.. → there are 2 in freesound.org and 3 on youtube
SLIDE 8

Audio (or, sound) dataset

  • Our algorithm lives in the digital space
  • So do the .wav files
  • But the sound itself is in the real world

SLIDE 9

Audio dataset

Source → Noise → Reverberation → Microphone

  • Room reverberation image from https://johnlsayers.com/Recmanual/Pages/Reverb.htm
SLIDE 10

Audio dataset

Dear everyone, YOU ARE ALWAYS IN THE "UGH..." SITUATION

→ HOW TO BUILD A CORRECT AUDIO DATASET?

SLIDE 11

What we can do

  • Know your real situation
  • You can mimic noise/reverberation/mic if you have clean/dry/high-quality source signals

DL models are robust only within the variance they've seen → good at interpolation.. only.

E.g., a model trained with clean signals probably can't deal with noisy signals (a noisy environment, a cheap mic).

SLIDE 12

Simulate the real world

clean signal + noise signal → noisy signal

dry signal * room impulse response → wet signal

original signal * band-pass filter → recorded signal

SLIDE 13

What to Google

Noise
  • search: babble noise recording, home noise recording, cafe noise recording, street noise recording, white noise, brown noise
  • x_noise = x + alpha * noise

Reverberation (maybe skip it)
  • search: room impulse responses (RIR), reverberation simulators
  • x_wet = np.convolve(x, rir)

Microphone
  • search: band-pass filter, scipy.signal filtering, microphone specification, speaker specification, microphone frequency response
  • scipy.signal.convolve, scipy.signal.fftconvolve
  • or trimming off bands from your spectrograms
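These recipes can be sketched in a few lines of NumPy/SciPy. A minimal illustration, not a production pipeline: the signals here are random stand-ins for a real clean recording and a real noise bed, alpha is an arbitrary choice, and the RIR is a synthetic decaying-noise substitute for a measured room impulse response.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
sr = 16000
x = rng.standard_normal(sr).astype(np.float32)      # stand-in for a 1 s clean signal
noise = rng.standard_normal(sr).astype(np.float32)  # stand-in for a recorded noise bed

# 1) Additive noise: x_noise = x + alpha * noise
alpha = 0.1
x_noise = x + alpha * noise

# 2) Reverberation: convolve with a (here synthetic) room impulse response
rir = rng.standard_normal(sr // 4).astype(np.float32) * np.exp(-np.linspace(0, 8, sr // 4))
x_wet = signal.fftconvolve(x, rir, mode="full")[: len(x)]

# 3) Cheap-mic simulation: band-pass filter, e.g. a telephone-like 300-3400 Hz band
sos = signal.butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
x_mic = signal.sosfilt(sos, x)
```

All three outputs keep the original length, so they can be dropped into the same training pipeline as the clean signals.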

SLIDE 14

Pre-process the signals

  • or, log(melgram)
  • Q. What to do after loading the signals?
SLIDE 15

Digital Audio 101

  • 1 second of digital audio: size=(44100, ), dtype=int16
  • MNIST: (28, 28, 1), int8
  • CIFAR10: (32, 32, 3), int8
  • ImageNet: (256, 256, 3), int8
  • Audio: lots of data points in one item!
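To make the size gap concrete, a quick back-of-the-envelope comparison (the arrays are just zero placeholders):

```python
import numpy as np

sr = 44100
audio_1s = np.zeros(sr, dtype=np.int16)           # 1 second of CD-quality mono audio
mnist_img = np.zeros((28, 28, 1), dtype=np.int8)  # one MNIST image

print(audio_1s.size)                     # 44100 samples
print(mnist_img.size)                    # 784 pixels
print(audio_1s.size // mnist_img.size)   # ~56x more values per item
```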
SLIDE 16

Audio representations

Type | Description | Data shape and size (e.g., 1 second, sampling rate=44100)
Waveform | x | 44100 x [int16]
Spectrogram | STFT(x) | 513 x 87 x [float32]
Spectrogram | Melspectrogram(x) | 128 x 87 x [float32]
Spectrogram | CQT(x) | 72 x 87 x [float32]
Feature | MFCC(x) = some process on STFT(x) | 20 x 87 x [float32]

Spoiler: log10(Melspectrograms) for the win, but let's see some details
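The STFT row of the table can be reproduced with SciPy (librosa would work just as well): a 1024-sample window gives 1024/2 + 1 = 513 frequency bins, and a 512-sample hop over 1 second of audio gives roughly 87 frames (the exact count depends on the library's padding convention).

```python
import numpy as np
from scipy import signal

sr = 44100
x = np.random.default_rng(0).standard_normal(sr)  # 1 s of noise as a stand-in signal

# STFT with 1024-sample window, 512-sample hop -> (513 freq bins, ~87 time frames)
f, t, Z = signal.stft(x, fs=sr, nperseg=1024, noverlap=512)
print(Z.shape)  # (513, n_frames)
```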

SLIDE 17

Spectrograms

  • A 2-dim (time-frequency) representation of an audio signal

TODO: IMAGE

SLIDE 18

Practitioner's choice

  • Rule of thumb: DISCARD ALL THE REDUNDANCY
  • Goal: to optimize the input audio data for your model
  • Sample rate, or bandwidth:
    • by resampling - can be computationally heavier
    • by discarding some freq bands - can be storage-heavy

https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
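Resampling is a one-liner with SciPy's polyphase resampler, which includes the anti-aliasing filter for you. A sketch, assuming a 44.1 kHz input downsampled to 16 kHz (44100 * 160 / 441 = 16000):

```python
import numpy as np
from scipy import signal

sr_in, sr_out = 44100, 16000
x = np.random.default_rng(0).standard_normal(sr_in)  # 1 s at 44.1 kHz

# Polyphase resampling by the rational factor 160/441, with anti-aliasing built in
y = signal.resample_poly(x, up=160, down=441)
print(y.shape)  # (16000,)
```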

SLIDE 19

Practitioner's choice

  • Melspectrogram
    • in decibel scale
    • covering only the frequency range you're interested in
  • Why?
    • smaller, therefore easier and faster training
    • perceptual - weighs more on the freq regions humans care more about
    • faster than CQT to compute
    • decibel scale - another perceptually motivated choice
  • Q. Ok, how can I compute them?
SLIDE 20

import librosa
import madmom

  • Python libraries - librosa/madmom/scipy/..
  • Computations on CPU
  • Best when all the processing will be done before the training

SLIDE 21

import kapre

  • Keras Audio Preprocessing layers
  • CPU and GPU
  • Best when you want to do things on the fly / on GPU = best to optimize audio-related parameters

  • pip install kapre
  • There's also pytorch-audio!

Disclaimer: I'm the maintainer

SLIDE 22

Design your network

  • or, know the assumptions
  • Q. What kind of network structure do I need?
SLIDE 23

A dumb-but-strong-therefore-good-while- annoying-since-it's-from-computer-vision baseline approach

  • Trim the signals properly (e.g. 1-sec)
  • Do the classification with a 2D convnet, 3x3 kernels (aka VGGNet)

  • Raise $1B
  • Retire
  • Post "why i retired.." on Medium
  • Happy life!
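The baseline above can be sketched in a few lines. This uses PyTorch (the deck itself points at Keras/kapre and pytorch-audio; either works), and every choice here - three blocks, 32/64/128 channels, 10 classes, the (128, 87) input shape - is an illustrative assumption, not a prescription:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 conv -> batch norm -> ReLU -> 2x2 max-pool, the VGG-style unit."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

model = nn.Sequential(
    conv_block(1, 32),
    conv_block(32, 64),
    conv_block(64, 128),
    nn.AdaptiveAvgPool2d(1),  # shape-agnostic: works for any input large enough
    nn.Flatten(),
    nn.Linear(128, 10),       # e.g., 10 sound classes
)

x = torch.randn(4, 1, 128, 87)  # batch of 4 trimmed 1-sec log-melspectrograms
print(model(x).shape)  # torch.Size([4, 10])
```

The adaptive average pooling is the one audio-friendly tweak: it lets the same network accept clips of slightly different lengths.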
SLIDE 24

Go even dumber

  • Just download some pre-trained networks for..

  • music

  • audio

  • image (?)
  • Re-use it for your task (aka transfer learning)
  • 1B - retire - Medium - happy - repeat
SLIDE 25

Better and stronger, by understanding assumptions

  • assert "Receptive field" size == size of the target pattern
  • How sparse is the target pattern?

  • Bird singing sparse? 

  • Voice-in-music sparse? 

  • Distortion-guitar-in-Metallica sparse?
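The receptive-field check in the first bullet is simple arithmetic: each layer grows the receptive field by (kernel - 1) times the product of the strides before it. A sketch for the hypothetical three-block convnet above (three 3x3 convs, each followed by 2x2 pooling):

```python
# (kernel_size, stride) per layer, in order
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]

rf, jump = 1, 1  # receptive field and cumulative stride, in input bins/frames
for k, s in layers:
    rf += (k - 1) * jump
    jump *= s

print(rf)  # -> 22: each output unit sees 22 input bins/frames
```

If the pattern you care about (a bird chirp, a sung phrase) spans more frames than this, add layers or pooling until the numbers match.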
SLIDE 26

Have no idea?

  • Go see how computer vision people are doing
  • Clone it
  • It's ok, it's a good baseline at least
SLIDE 27

  • "My spectrogram is 28x28 bc the model I downloaded is trained on MNIST" / "I don't know how to incorporate them into my model.. BUT IT WORKS!"
  • Don't use spectrograms as if they are images: the time and frequency axes have totally different meanings.
  • That said, it all boils down to pattern recognition; they're actually similar tasks.

SLIDE 28

Expect the result

  • or, know the problem
  • Q. How would it work?
SLIDE 29

YOU

  • You are responsible for the feasibility
  • Is it a task you can do?
  • Is the information in the input (mel-spectrogram)?
  • Are similar tasks being solved?
SLIDE 30

Think about it!

  • Is it possible? To what extent? E.g.,
  • Baby crying detection
  • Baby crying recognition and classification
  • Dog barking translation
  • Hit song detection
SLIDE 31

Conclusion Conclusion..

Conclusion!

SLIDE 32

Conclusion

  • Sound is analog; you might need to think about some analog processes, too.
  • Pre-process: follow others when you're lost.
  • Audio is big in data size, but sparse in information. Reduce the size. Don't start with end-to-end.
  • Design: follow others when you're lost.
  • Expect: make sure it's doable.
SLIDE 33

Deep Learning with Audio Signals

Prepare, Process, Design, Expect. Keunwoo Choi

Q&A

  • PS. See you soon at the panel talk!