GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Deep Learning: Intro
Juhan Nam
Review of Traditional Machine Learning
- The traditional machine learning pipeline (a code sketch is given below)
○ Frame-level Features → Temporal Summary → Unsupervised Learning → Classifier
○ Frame-level features (MFCC): DFT → Abs (magnitude) → Mel Filterbank → Log compression → DCT
○ Temporal summary: temporal pooling; unsupervised learning: K-means (a non-linear transform); classifier: logistic regression (a linear classifier)
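As a concrete (hypothetical) instance of this pipeline, here is a minimal Python sketch using librosa for the MFCC chain, mean/std pooling as the temporal summary, and scikit-learn's logistic regression as the classifier; the paths, labels, and parameter values are placeholders, and the K-means stage is omitted for brevity.

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def clip_features(path):
    # Frame-level features: DFT -> |.| -> mel filterbank -> log compression -> DCT (= MFCC)
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)          # shape: (20, num_frames)
    # Temporal summary: pool the frame-level features over time (mean and standard deviation)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# One pooled feature vector per audio clip, plus a label per clip (placeholders)
# X_train = np.stack([clip_features(p) for p in train_paths])
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# predictions = clf.predict(np.stack([clip_features(p) for p in test_paths]))
```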
Review of Traditional Machine Learning
- Each module can be replaced with a chain of a linear transform and a non-linear function
[Diagram: the MFCC chain, K-means, and logistic regression blocks each redrawn as Linear Transform → Non-linear function pairs, with Temporal Pooling and a final Linear Classifier]
Review of Traditional Machine Learning
- The entire set of modules can be replaced with a single long chain of linear transforms and non-linear functions (i.e., a deep neural network)
○ In the traditional machine learning pipeline, each module is optimized only locally
[Diagram: the full pipeline redrawn as a Deep Neural Network: a stack of Linear Transform and Non-linear function layers ending with Temporal Pooling and a Linear Classifier]
Deep Learning
- The entire set of blocks (or layers) is optimized in an end-to-end manner
○ The parameters (or weights) in all layers are learned to minimize the loss function of the classifier
○ The loss is back-propagated through all layers (from the output back toward the input) as a gradient with respect to each parameter (“error back-propagation”)
- Therefore, we “learn features” instead of designing or engineering them
[Diagram: the same chain of Linear Transform and Non-linear function layers, Temporal Pooling, and Linear Classifier, labeled as a Deep Neural Network]
Deep Learning: Building Models
- There are many choices of basic building blocks (or layers); see the sketch below
- Connectivity patterns (parametric)
○ Fully-connected (i.e. linear transform)
○ Convolutional (note that the STFT is a convolutional operation)
○ Skip / Residual
○ Recurrent
- Non-linearity functions (non-parametric)
○ Sigmoid
○ Tanh
○ Rectified Linear Unit (ReLU) and its variations
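A minimal PyTorch sketch of a few of these building blocks, assuming PyTorch as the framework; the layer sizes and kernel settings are arbitrary placeholders, not values from the slides.

```python
import torch
import torch.nn as nn

# Parametric connectivity patterns (arbitrary sizes)
fc   = nn.Linear(128, 64)                                # fully-connected (linear transform)
conv = nn.Conv1d(1, 16, kernel_size=1024, stride=512)    # convolutional (an STFT-like framing of audio)
rnn  = nn.GRU(input_size=64, hidden_size=64)             # recurrent

# Non-parametric non-linearity functions
relu, sigmoid, tanh = nn.ReLU(), nn.Sigmoid(), nn.Tanh()

x = torch.randn(8, 128)        # a batch of 8 feature vectors
h = relu(fc(x))                # a linear transform followed by a non-linear function
```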
Deep Learning: Building Models
- We “design” a deep neural network architecture depending on the nature of the data and the task
○ Modular synth as a “musical analogy”
(Images from the Arturia Modular V manual)
Deep Learning: Training Models
- Loss functions
○ Cross entropy (logistic loss)
○ Hinge loss
○ Maximum likelihood
○ L2 (mean squared error) and L1
○ Adversarial
○ Variational
- Optimizers
○ SGD
○ Momentum
○ RMSProp
○ Adagrad
○ Adam
- Hyper-parameters (initialization, regularization, model search)
○ Weight initialization
○ L1 and L2 regularization (weight decay)
○ Dropout
○ Learning rate
○ Layer size
○ Batch size
○ Data augmentation
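To make these choices concrete, here is a minimal PyTorch sketch of one training step; the model, loss, optimizer, and hyper-parameter values below are arbitrary placeholders, not recommendations from the slides.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 10))

loss_fn   = nn.CrossEntropyLoss()                    # loss function: cross entropy
optimizer = torch.optim.Adam(model.parameters(),     # optimizer: Adam (or SGD, RMSProp, ...)
                             lr=1e-3,                # learning rate (hyper-parameter)
                             weight_decay=1e-5)      # L2 regularization (weight decay)

x = torch.randn(32, 40)                              # batch size 32 (hyper-parameter)
y = torch.randint(0, 10, (32,))                      # class labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                      # error back-propagation
optimizer.step()                                     # one gradient update
```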
Multi-Layer Perceptron (MLP)
- Neural networks that consist of fully-connected layers and non-linear functions
○ Also called Feedforward Neural Network or Deep Feedforward Network
○ A long history: perceptron (Rosenblatt, 1962), back-propagation (Rumelhart, 1986), deep belief networks (Hinton and Salakhutdinov, 2006)
[Diagram: input layer $x$, hidden layers $h^{(1)}, h^{(2)}, h^{(3)}$, output layer $\hat{y}$]
$a^{(1)} = W^{(1)}x + b^{(1)}, \quad h^{(1)} = \sigma(a^{(1)})$
$a^{(2)} = W^{(2)}h^{(1)} + b^{(2)}, \quad h^{(2)} = \sigma(a^{(2)})$
$a^{(3)} = W^{(3)}h^{(2)} + b^{(3)}, \quad h^{(3)} = \sigma(a^{(3)})$
$\hat{y} = W^{(4)}h^{(3)} + b^{(4)}$
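A minimal NumPy sketch of this forward pass; the layer sizes are arbitrary and $\sigma$ is taken to be the sigmoid (any element-wise non-linearity would do).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
sizes = [40, 64, 64, 64, 1]     # x, h1, h2, h3, y_hat (arbitrary layer sizes)
W = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for l in range(3):          # three hidden layers
        a = W[l] @ h + b[l]     # a^(l+1) = W^(l+1) h^(l) + b^(l+1)
        h = sigmoid(a)          # h^(l+1) = sigma(a^(l+1))
    return W[3] @ h + b[3]      # y_hat = W^(4) h^(3) + b^(4)

y_hat = forward(rng.standard_normal(40))
```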
Deep Feedforward Network
- It is argued that the first breakthrough of deep learning came from the deep feedforward network (2011)
○ The state-of-the-art acoustic model in speech recognition was the GMM-HMM
○ The GMM module was replaced with a deep feedforward network (up to 5 layers)
○ The weight matrices were initialized using an unsupervised learning algorithm
■ Deep belief network: greedy layer-wise pre-training using restricted Boltzmann machines
Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2012
Non-linear Functions
- There are several choices of non-linear functions (or activation functions)
○ ReLU is the default choice in modern deep learning: fast and effective
○ There are also other choices such as the Exponential Linear Unit (ELU) and Maxout
○ Note that these are element-wise operations in the neural network
[Plots: Sigmoid, Tanh, ReLU, and Leaky ReLU]
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\max(0, x)$, Leaky ReLU: $\max(0.1x, x)$
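A minimal NumPy sketch of these element-wise activation functions; the 0.1 slope of the Leaky ReLU follows the plot above and is a common but not universal choice.

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)

x = np.linspace(-10, 10, 5)       # [-10, -5, 0, 5, 10]
print(relu(x))                    # applied element-wise: [ 0.  0.  0.  5. 10.]
```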
Why the Nonlinear Function in the Hidden Layer?
- They capture high-order interactions between the input elements
○ This enables finding non-linear boundaries between different classes
○ Taylor series of a non-linear function $f(a)$: the non-zero coefficients of the high-order polynomial terms create interactions between all input elements
■ $f(a) = c_0 + c_1 a + c_2 a^2 + c_3 a^3 + \cdots$
■ $a = w_1 x_1 + w_2 x_2 + b$
■ $a^2 = w_1^2 x_1^2 + 2 w_1 w_2 x_1 x_2 + w_2^2 x_2^2 + 2 w_1 x_1 b + 2 w_2 x_2 b + b^2$
[Figure: decision boundaries in the $(x_1, x_2)$ plane: a linear boundary $a = 0$ vs. a curved boundary $c_0 + c_1 a + c_2 a^2 = 0$]
Why the Nonlinear Function in the Hidden Layer?
- What if the nonlinear functions are absent?
○ A product of linear transforms is just another linear transform
○ Geometrically, a linear transformation only does scaling, shearing, and rotation
source: http://www.ams.org/publicoutreach/feature-column/fcarc-svd
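A quick NumPy check of the first point (not from the slides): stacking linear layers without non-linearities collapses to a single linear transform.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

stacked   = W3 @ (W2 @ (W1 @ x))        # three "layers" without non-linearities
collapsed = (W3 @ W2 @ W1) @ x          # one equivalent linear transform
print(np.allclose(stacked, collapsed))  # True
```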
Why the Nonlinear Function in the Hidden Layer?
- The non-linear function warps the input space such that data from different classes become more linearly separable
[Figure: a linear classifier, the data in the NN input space, and the same data in the NN hidden-layer space]
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Training Deep Neural Network
- Gradient descent learning
○ We need to compute the gradient of the loss for all parameters in all layers
○ We compute them via error back-propagation, starting from the top layer
$W^{(l)}_{\text{new}} = W^{(l)}_{\text{old}} - \eta \frac{\partial L}{\partial W^{(l)}_{\text{old}}}$
[Diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}$ and loss $L(y, \hat{y})$]
Training Deep Neural Network
- Step 1) Initialize all weights to random numbers
- Step 2) Feedforward computation
○ Feed the input $x$, compute $h^{(1)}$ and so on up to $\hat{y}$, and keep all of the intermediate values
[Diagram: the same network and forward-pass equations as on the MLP slide]
Training Deep Neural Network
- Step 3) Compute the loss $L(y, \hat{y})$ using the prediction $\hat{y}$ and the ground truth $y$
○ For simplicity, let's assume a squared error (L2) loss
■ $L(y, \hat{y}) = (y - \hat{y})^2 = \big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)^2$
■ $\frac{\partial L}{\partial W^{(4)}_j} = 2\big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)\,(-h^{(3)}_j)$
■ $\frac{\partial L}{\partial h^{(3)}_j} = 2\big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)\,(-W^{(4)}_j)$
○ We can compute these because $h^{(3)}$ was computed in the forward pass and $W^{(4)}$ and $b^{(4)}$ were already initialized
○ We still need the gradient w.r.t. the hidden layer units ($\partial L / \partial h^{(3)}$) in order to reach the lower layers
Training Deep Neural Network
- Step 4) Back-propagate the loss to the lower layers
○ Let's assume the sigmoid function for the non-linearity
■ $h^{(3)}_j = \sigma(a^{(3)}_j) = \frac{1}{1 + e^{-a^{(3)}_j}}$
■ $\frac{\partial L}{\partial a^{(3)}_j} = \frac{\partial L}{\partial h^{(3)}_j} \cdot \frac{\partial h^{(3)}_j}{\partial a^{(3)}_j} = \frac{\partial L}{\partial h^{(3)}_j} \cdot \frac{1}{1 + e^{-a^{(3)}_j}} \cdot \frac{e^{-a^{(3)}_j}}{1 + e^{-a^{(3)}_j}}$
○ We know $\partial L / \partial h^{(3)}_j$ from the upper layer (the previous slide), and we can compute the local derivative because $a^{(3)}$ was kept from the forward pass
Training Deep Neural Network
- Step 5) Back-propagate the loss further to the lower layers
■ $\frac{\partial L}{\partial W^{(3)}_{j,k}} = \frac{\partial L}{\partial a^{(3)}_j} \cdot \frac{\partial a^{(3)}_j}{\partial W^{(3)}_{j,k}} = \frac{\partial L}{\partial a^{(3)}_j} \cdot h^{(2)}_k$
■ $\frac{\partial L}{\partial h^{(2)}_k} = \sum_j \frac{\partial L}{\partial a^{(3)}_j} \cdot \frac{\partial a^{(3)}_j}{\partial h^{(2)}_k} = \sum_j \frac{\partial L}{\partial a^{(3)}_j} \cdot W^{(3)}_{j,k}$
○ We know $\partial L / \partial a^{(3)}_j$ from the upper layer (the previous slide); the other factors come from the local layer
Training Deep Neural Network
- Step 6) Repeat step 4 for the 2nd layer
■ $\frac{\partial L}{\partial a^{(2)}_j} = \frac{\partial L}{\partial h^{(2)}_j} \cdot \frac{\partial h^{(2)}_j}{\partial a^{(2)}_j} = \frac{\partial L}{\partial h^{(2)}_j} \cdot \frac{1}{1 + e^{-a^{(2)}_j}} \cdot \frac{e^{-a^{(2)}_j}}{1 + e^{-a^{(2)}_j}}$
○ We know $\partial L / \partial h^{(2)}_j$ from the upper layer (the previous slide)
Training Deep Neural Network
- Step 7) Repeat step 5 for the 2nd layer
■ $\frac{\partial L}{\partial W^{(2)}_{j,k}} = \frac{\partial L}{\partial a^{(2)}_j} \cdot \frac{\partial a^{(2)}_j}{\partial W^{(2)}_{j,k}} = \frac{\partial L}{\partial a^{(2)}_j} \cdot h^{(1)}_k$
■ $\frac{\partial L}{\partial h^{(1)}_k} = \sum_j \frac{\partial L}{\partial a^{(2)}_j} \cdot \frac{\partial a^{(2)}_j}{\partial h^{(1)}_k} = \sum_j \frac{\partial L}{\partial a^{(2)}_j} \cdot W^{(2)}_{j,k}$
○ We know $\partial L / \partial a^{(2)}_j$ from the upper layer (the previous slide); the other factors come from the local layer
Training Deep Neural Network
- Step 8) Repeat the two previous steps for the 1st layer
○ This completes one iteration of computing the gradient
○ Update the weights using the gradient: $W^{(l)}_{\text{new}} = W^{(l)}_{\text{old}} - \eta \frac{\partial L}{\partial W^{(l)}_{\text{old}}}$
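Putting steps 1 through 8 together, here is a minimal NumPy sketch of the training iteration for the 3-hidden-layer network above, using the sigmoid non-linearity and the squared error loss; the layer sizes, learning rate, and data are arbitrary placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
sizes = [40, 64, 64, 64, 1]                       # x, h1, h2, h3, y_hat (arbitrary)
# Step 1) initialize all weights to (small) random numbers
W = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
eta = 0.01                                        # learning rate

def train_step(x, y):
    # Step 2) feedforward: keep all intermediate values
    a1 = W[0] @ x  + b[0]; h1 = sigmoid(a1)
    a2 = W[1] @ h1 + b[1]; h2 = sigmoid(a2)
    a3 = W[2] @ h2 + b[2]; h3 = sigmoid(a3)
    y_hat = W[3] @ h3 + b[3]

    # Step 3) squared error loss and gradients at the output layer
    dL_dyhat = -2.0 * (y - y_hat)                 # dL/d(y_hat)
    dL_dW4 = np.outer(dL_dyhat, h3);  dL_db4 = dL_dyhat
    dL_dh3 = W[3].T @ dL_dyhat

    # Steps 4-5) back-propagate through the 3rd layer (sigmoid derivative = h(1-h))
    dL_da3 = dL_dh3 * h3 * (1.0 - h3)
    dL_dW3 = np.outer(dL_da3, h2);    dL_db3 = dL_da3
    dL_dh2 = W[2].T @ dL_da3

    # Steps 6-7) repeat for the 2nd layer
    dL_da2 = dL_dh2 * h2 * (1.0 - h2)
    dL_dW2 = np.outer(dL_da2, h1);    dL_db2 = dL_da2
    dL_dh1 = W[1].T @ dL_da2

    # Step 8) repeat for the 1st layer, then update all weights with the gradients
    dL_da1 = dL_dh1 * h1 * (1.0 - h1)
    dL_dW1 = np.outer(dL_da1, x);     dL_db1 = dL_da1

    for Wl, dWl in zip(W, [dL_dW1, dL_dW2, dL_dW3, dL_dW4]):
        Wl -= eta * dWl                           # W_new = W_old - eta * dL/dW_old
    for bl, dbl in zip(b, [dL_db1, dL_db2, dL_db3, dL_db4]):
        bl -= eta * dbl

    return float(np.sum((y - y_hat) ** 2))

# Step 9) keep repeating the feedforward and backward passes
x, y = rng.standard_normal(40), np.array([1.0])
for _ in range(100):
    loss = train_step(x, y)
```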
Training Deep Neural Network
- Step 9) Keep repeating the feedforward and backward passes
○ Update the weights with the gradients at every iteration: $W^{(l)}_{\text{new}} = W^{(l)}_{\text{old}} - \eta \frac{\partial L}{\partial W^{(l)}_{\text{old}}}$
[Figure: successive weight updates $x_0, x_1, \dots$ moving toward the minimum $x^*$ of the loss surface]
Training Deep Neural Network
- Step 10) Monitor both the training loss and the validation loss, and stop the iterations when the validation loss no longer decreases (early stopping)
○ Note that the training set is used for both the feedforward and backward passes, whereas the validation set is used only for the feedforward pass; the weight updates come from the training set alone
○ Early stopping is a regularization method
[Figure: training and validation loss curves over epochs; early stopping occurs at the point where the validation loss turns upward (overfitting starts)]
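A minimal sketch of such a monitoring loop; train_one_epoch, validation_loss, model, train_set, and valid_set are hypothetical helpers and objects (not from the slides), and the patience value is arbitrary.

```python
# Hypothetical helpers: train_one_epoch runs the feedforward and backward passes and
# updates the weights; validation_loss runs the feedforward pass only.
best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    train_one_epoch(model, train_set)           # training set: forward + backward + update
    val = validation_loss(model, valid_set)     # validation set: forward pass only
    if val < best_val:
        best_val, bad_epochs = val, 0           # validation loss is still decreasing
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # validation loss stopped decreasing
            break                               # early stopping
```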
Wrap up: Training Deep Neural Networks
- Generalization
○ Any “differentiable” module can be used in the neural network as a layer
○ Gradient-based learning will train the entire network
[Diagram: a parametric module (connectivity patterns: fully-connected, convolutional, ...) and a non-parametric module (activation functions; max or average pooling), each with a forward pass ($x \rightarrow y \rightarrow z$) and a backward pass ($\partial L/\partial z \rightarrow \partial L/\partial y \rightarrow \partial L/\partial x$)]
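As a quick illustration of this point, a hedged PyTorch sketch (not from the slides): a custom differentiable module, here a hypothetical log-compression-plus-temporal-average pooling layer, can be dropped into a network and trained together with the other layers, since autograd supplies its backward pass.

```python
import torch
import torch.nn as nn

class LogMeanPool(nn.Module):
    # A non-parametric "temporal summary" layer: log compression, then averaging over time
    def forward(self, x):                     # x: (batch, time, features)
        return torch.log1p(x.abs()).mean(dim=1)

model = nn.Sequential(
    nn.Linear(40, 64), nn.ReLU(),             # parametric + non-parametric modules
    LogMeanPool(),                            # custom differentiable module
    nn.Linear(64, 10),                        # linear classifier
)

x = torch.randn(8, 100, 40)                   # 8 clips, 100 frames, 40 features (placeholders)
y = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()                               # gradients flow through every layer, including the custom one
```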
MLP Demo and Visualization
- https://playground.tensorflow.org/