GCT634: Musical Applications of Machine Learning
Deep Learning: Part 2
Graduate School of Culture Technology, KAIST Juhan Nam
Outline
- Convolutional Neural Networks (CNN): Introduction, Mechanics, CNN for music classification
- Training
LeNet-5 (LeCun, 1998): INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling/pooling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT: 10 (Gaussian connections)
(Manassi, 2013)
(Krizhevsky et al., 2012)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)
Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier (Zeiler and Fergus, 2013)
Borrowed from LeCun’s slides
(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)
The bird occupies a local area and looks the same in different parts of an image
(from the MS-Celeb-1M dataset)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)
same weights (filters)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)
(e.g. face recognition)
2D convolution
The input is a 3D tensor (Height × Width × Channel, where channel is also called depth). The filter must have the same depth as the input; each hidden unit of the output feature map is computed from a local region of the input. The channel dimension of the output corresponds to the number of filters. Note that the tensors become 4D when a batch or mini-batch is used.
striding (analogous to the hop size in STFT)
The output size is (N - F)/S + 1 (N: input size, F: filter size, S: stride). Apply zero-padding if (N - F)/S is not an integer.
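As a concrete check of these shape rules, here is a minimal PyTorch sketch; the layer sizes are arbitrary examples, not values from the slides:

```python
import torch
import torch.nn as nn

# A mini-batch of 8 inputs with 3 channels and 32x32 spatial size: a 4D tensor (N, C, H, W)
x = torch.randn(8, 3, 32, 32)

# 16 filters of size 5x5; each filter must have the same depth (3) as the input
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=0)

y = conv(x)
print(y.shape)  # torch.Size([8, 16, 28, 28])

# Output size per spatial dimension: (N - F)/S + 1 = (32 - 5)/1 + 1 = 28
# The output channel dimension (16) equals the number of filters.
```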
Common settings: K (number of filters) = powers of 2, e.g. 32, 64, 128, 512
(Stanford CS231n Slides)
2x2 max pooling with stride 2:
input (4x4): 1 5 2 4 / 2 3 9 1 / 5 3 3 4 / 7 8 2 2 → output (2x2): 5 9 / 8 4
CNN for music classification: a spectrogram (Frequency × Time) is the input to a stack of Convolution and Pooling layers followed by Fully-Connected layers.
In the 2D case the feature maps keep both the frequency and time axes; alternatively, the frequency axis can be treated as the channel dimension so that the convolution is 1D along time (compared to the 2D feature map).
FCN-4
- Mel-spectrogram input: 96×1366×1
- Conv 3×3×128, MP (2, 4) → output: 48×341×128
- Conv 3×3×384, MP (4, 5) → output: 24×85×384
- Conv 3×3×768, MP (3, 8) → output: 12×21×768
- Conv 3×3×2048, MP (4, 8) → output: 1×1×2048
- Output: 50×1 (sigmoid)
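A minimal PyTorch sketch in the spirit of this FCN-style tagger. The channel counts follow the list above, but the exact pooling sizes are an assumption; an adaptive max pool at the end collapses whatever remains to 1×1, so this is an illustrative sketch rather than the published configuration:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # 3x3 convolution followed by max pooling over (frequency, time)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(pool),
    )

class FCNTagger(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 128, (2, 4)),
            conv_block(128, 384, (4, 5)),
            conv_block(384, 768, (3, 8)),
            nn.Conv2d(768, 2048, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),      # collapse the remaining map to 1x1
        )
        self.classifier = nn.Linear(2048, n_tags)

    def forward(self, x):                 # x: (batch, 1, 96, 1366) mel-spectrogram
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(h))

model = FCNTagger()
print(model(torch.randn(2, 1, 96, 1366)).shape)  # torch.Size([2, 50])
```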
(compared to the 2D feature map)
resources (e.g. GPU and memory)
level (e.g. 2 or 3 samples)
[Architecture diagrams: 1D convolutional blocks applied along time (Conv1D → BatchNorm → ReLU → MaxPool), stacked as Blocks/Layers 1-4; features are aggregated with global max pooling (optionally at multiple levels) and fed to FC layers with a sigmoid output. A variant block adds Dropout and a recalibration path (GlobalAvgPool → FC → FC → Scale, T×C → 1×C → 1×αC → 1×C), as in squeeze-and-excitation. A frame-level variant uses a large filter size and stride in the first layer.]
Sample-level raw waveform model: 3^9 model, 19683 frames, 59049 samples (2678 ms) as input

layer | stride | output | # of params
conv 3-128 | 3 | 19683 × 128 | 512
conv 3-128, maxpool 3 | 1, 3 | 19683 × 128 → 6561 × 128 | 49280
conv 3-128, maxpool 3 | 1, 3 | 6561 × 128 → 2187 × 128 | 49280
conv 3-256, maxpool 3 | 1, 3 | 2187 × 256 → 729 × 256 | 98560
conv 3-256, maxpool 3 | 1, 3 | 729 × 256 → 243 × 256 | 196864
conv 3-256, maxpool 3 | 1, 3 | 243 × 256 → 81 × 256 | 196864
conv 3-256, maxpool 3 | 1, 3 | 81 × 256 → 27 × 256 | 196864
conv 3-256, maxpool 3 | 1, 3 | 27 × 256 → 9 × 256 | 196864
conv 3-256, maxpool 3 | 1, 3 | 9 × 256 → 3 × 256 | 196864
conv 3-512, maxpool 3 | 1, 3 | 3 × 512 → 1 × 512 | 393728
conv 1-512, dropout 0.5 | 1 | 1 × 512 | 262656
sigmoid | − | 50 | 25650
Total params: 1.9 × 10^6

(Proceedings of the 14th Sound and Music Computing Conference, SMC 2017, Espoo, Finland)
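A minimal PyTorch sketch of the idea in the table: a first strided conv turns 59049 raw samples into 19683 frames, then repeated conv3-maxpool3 blocks reduce the time axis by a factor of 3 each. The channel widths follow the table; the final conv 1-512 layer and other details are simplified, so treat this as a sketch rather than the exact published model:

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    # filter length 3 (stride 1), then max pooling of size 3 (reduces time by 3x)
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(3),
    )

class SampleLevelCNN(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        self.frontend = nn.Sequential(       # strided conv: 59049 samples -> 19683 frames
            nn.Conv1d(1, 128, kernel_size=3, stride=3),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        widths = [128, 128, 128, 256, 256, 256, 256, 256, 256, 512]
        self.blocks = nn.Sequential(*[block(widths[i], widths[i + 1])
                                      for i in range(len(widths) - 1)])
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(512, n_tags))

    def forward(self, x):                    # x: (batch, 1, 59049) raw waveform
        h = self.blocks(self.frontend(x))    # (batch, 512, 1) after nine /3 reductions
        return torch.sigmoid(self.head(h.squeeze(-1)))

model = SampleLevelCNN()
print(model(torch.randn(2, 1, 59049)).shape)  # torch.Size([2, 50])
```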
Sample-level strided convolution layer
Regularization!
they are trained with different training sets
[Figures: decision boundaries in a "Jazz" vs "Classical" feature space, illustrating under-fitting, good fitting, and over-fitting.]
L2 regularization (weight decay): $S(W) = \sum_{i,j} W_{i,j}^2$
L1 regularization: $S(W) = \sum_{i,j} |W_{i,j}|$
These penalties can be added to the loss separately.
Regularized loss: $L(W) = \sum_{n=1}^{N} g(y_n, z_n; W) + \lambda S(W)$
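In practice the L2 penalty is rarely written out by hand. A small illustrative sketch in PyTorch, where $\lambda$ corresponds to the weight_decay argument (the value 1e-4 is just an example):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
criterion = nn.CrossEntropyLoss()

# Option 1: L2 regularization handled by the optimizer (weight decay)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Option 2: add the penalty term lambda * S(W) to the loss explicitly
x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(x), y) + 1e-4 * l2_penalty
loss.backward()
```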
Dropout: during training, hidden units are randomly turned off (each unit's activation is multiplied by a Bernoulli random variable). (Srivastava et al., 2014)
(Stanford CS231n Slides)
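A small PyTorch illustration of dropout behaving differently at training and test time (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.ones(1, 10)

net.train()   # units after the ReLU are randomly zeroed (the rest are rescaled by 1/(1-p))
print(net(x))

net.eval()    # dropout is disabled at test time; all units are kept
print(net(x))
```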
Early stopping: stop training when the error on the validation set does not decrease.
[Figure: training error and validation error over training iterations, motivating early stopping. (The PRML book)]
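A sketch of the early-stopping logic as a training loop. This assumes a PyTorch model; train_one_epoch and evaluate are hypothetical helper functions, not library calls:

```python
def fit(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_val, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # early stopping
    model.load_state_dict(best_state)       # restore the best checkpoint
    return model
```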
corresponding output property
Normalization!
Mean subtraction followed by division by the standard deviation; alternatively, mean subtraction followed by rotation (decorrelation) and division by the standard deviation. Be careful with dividing by zero.
music classification (e.g. log-compressed spectrogram)
Compute the normalization statistics from the training data (not the entire dataset).
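A minimal NumPy sketch of this standardization, using statistics from the training split only; the epsilon guards against division by zero:

```python
import numpy as np

def standardize(train_X, test_X, eps=1e-8):
    # Compute the statistics on the training data only (not the entire dataset)
    mean = train_X.mean(axis=0)
    scale = train_X.std(axis=0) + eps          # be careful with dividing by zero
    return (train_X - mean) / scale, (test_X - mean) / scale

train_X = np.random.randn(100, 40) * 3.0 + 1.0  # e.g. 100 frames of 40-dim features
test_X = np.random.randn(20, 40) * 3.0 + 1.0
train_X, test_X = standardize(train_X, test_X)
print(train_X.mean(axis=0).round(2), train_X.std(axis=0).round(2))
```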
The distribution of activations changes as the data flows through the network. Batch normalization normalizes the inputs of each layer so that the activations keep a consistent distribution; with zero mean and unit variance, the activations stay mostly in the linear range of saturating nonlinearities (sigmoid or tanh).
The Batch Norm layer is inserted between the FC or Conv layer and the nonlinearity. (Ioffe and Szegedy, 2015)
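The Conv → BatchNorm → nonlinearity ordering written out in PyTorch (a generic example, not a specific model from the slides):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1),  # FC or Conv layer
    nn.BatchNorm2d(64),                          # Batch Norm (per-channel over the mini-batch)
    nn.ReLU(),                                   # nonlinearity
)
```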
[Figure 2 (Ioffe and Szegedy, 2015): single-crop validation accuracy of Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid) vs. the number of training steps, with the steps needed to match Inception marked.]
Batch Normalization is a must-have layer!!!
ImageNet Classification (Ioffe and Szegedy, 2015)
Optimization!
$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?
Very slow progress along the shallow dimension, jitter along the steep direction.
The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
(Stanford CS231n Slides)
What if the loss function has a local minimum or saddle point? Zero gradient, and gradient descent gets stuck.
Local Minimum Saddle Point
(Stanford CS231n Slides)
Momentum uses not only the current gradient but also the previous history of gradients.
$v_{t+1} = \rho v_t - \eta \nabla L(\theta_t)$  (accumulated gradients plus the current gradient scaled by the learning rate)
$\theta_{t+1} = \theta_t + v_{t+1}$
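The momentum update in plain NumPy; a toy ill-conditioned quadratic loss stands in for the network loss, and rho and eta are the momentum coefficient and learning rate:

```python
import numpy as np

grad = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])  # gradient of a toy quadratic loss
theta, v = np.array([1.0, 1.0]), np.zeros(2)
rho, eta = 0.9, 0.01

for _ in range(100):
    v = rho * v - eta * grad(theta)   # accumulated gradients + current gradient (with learning rate)
    theta = theta + v                 # parameter update
print(theta)
```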
Momentum helps with local minima, saddle points, poor conditioning, and gradient noise. It's like rolling a ball.
(Stanford CS231n Slides)
Momentum update: the actual step is the combination of the velocity and the current gradient step.
Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983; Nesterov, "Introductory lectures on convex optimization: a basic course", 2004; Sutskever et al., "On the importance of initialization and momentum in deep learning", ICML 2013
Nesterov momentum: the actual step combines the velocity with the gradient evaluated at the look-ahead point.
$v_{t+1} = \rho v_t - \eta \nabla L(\theta_t + \rho v_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$
The gradient $\nabla L(\theta_t + \rho v_t)$ is evaluated at the look-ahead point $\theta_t + \rho v_t$ rather than at $\theta_t$.
(Stanford CS231n Slides)
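The same toy setup with the Nesterov look-ahead gradient (again an illustrative sketch, not from the slides):

```python
import numpy as np

grad = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])  # same toy loss as above
theta, v = np.array([1.0, 1.0]), np.zeros(2)
rho, eta = 0.9, 0.01

for _ in range(100):
    v = rho * v - eta * grad(theta + rho * v)  # gradient at the look-ahead point
    theta = theta + v
print(theta)
```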
SGD SGD+Momentum Nesterov
(Stanford CS231n Slides)
SGD uses a single learning rate for all parameters: $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
Adaptive methods use a separate learning rate for individual parameters: $\theta_{t+1,i} = \theta_{t,i} - \eta_i [\nabla L(\theta_t)]_i$
Adagrad: the accumulated sum of squared gradients keeps growing, which eventually drives the effective learning rate to zero.
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,i}} + \epsilon} [\nabla L(\theta_t)]_i$, where $G_{t,i} = \sum_{\tau \le t} [\nabla L(\theta_\tau)]_i^2$
RMSProp replaces the sum with an exponentially decaying average of squared gradients:
$G_{t+1,i} = \gamma G_{t,i} + (1 - \gamma) [\nabla L(\theta_t)]_i^2$
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t+1,i}} + \epsilon} [\nabla L(\theta_t)]_i$
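RMSProp as a few lines of NumPy; Adagrad is the same except that G is an un-decayed running sum. gamma, eta, and eps are the decay rate, learning rate, and small constant from the formulas:

```python
import numpy as np

grad = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])  # toy loss
theta, G = np.array([1.0, 1.0]), np.zeros(2)
gamma, eta, eps = 0.9, 0.01, 1e-7

for _ in range(500):
    g = grad(theta)
    G = gamma * G + (1.0 - gamma) * g ** 2        # exponentially decaying sum of squared gradients
    theta = theta - eta * g / (np.sqrt(G) + eps)  # per-parameter effective learning rate
print(theta)
```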
SGD SGD+Momentum RMSProp
(Stanford CS231n Slides)
Adam combines the first moment (momentum) and the second moment (RMSProp):
$m_{t+1} = \rho m_t + (1 - \rho) \nabla L(\theta_t)$  (first moment)
$G_{t+1,i} = \gamma G_{t,i} + (1 - \gamma) [\nabla L(\theta_t)]_i^2$  (second moment)
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t+1,i}} + \epsilon} m_{t+1,i}$
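Adam combines the two running estimates above; this simplified sketch omits the bias-correction terms of the full algorithm:

```python
import numpy as np

grad = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])  # toy loss
theta, m, G = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
rho, gamma, eta, eps = 0.9, 0.999, 0.01, 1e-7

for _ in range(500):
    g = grad(theta)
    m = rho * m + (1.0 - rho) * g            # first moment (momentum)
    G = gamma * G + (1.0 - gamma) * g ** 2   # second moment (RMSProp)
    theta = theta - eta * m / (np.sqrt(G) + eps)
print(theta)
```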
Alec Radford: https://imgur.com/a/Hqolp
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Learning rate decay! More critical with SGD+Momentum, less common with Adam. (Stanford CS231n Slides)
[Figure: loss vs. epoch under a learning rate decay schedule.]
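Learning rate decay in PyTorch, e.g. a step decay every 30 epochs; the schedule values are examples, not the ones from the slides' figure:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # lr *= 0.1 every 30 epochs

for epoch in range(60):
    # ... training for one epoch (optimizer.step() per mini-batch) goes here ...
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # 0.1 -> 0.01 -> 0.001 after two decays
```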