SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Deep Learning: Intro

Juhan Nam

SLIDE 2

Review of Traditional Machine Learning

  • The traditional machine learning pipeline

[Pipeline diagram: Frame-level Features → Temporal Summary → Unsupervised Learning → Classifier]

SLIDE 3

Review of Traditional Machine Learning

  • The traditional machine learning pipeline

[Pipeline diagram: Frame-level Features → Temporal Summary → Unsupervised Learning → Classifier, instantiated as: MFCC for the frame-level features (DFT → Abs (magnitude) → Mel Filterbank → Log compression → DCT), Temporal Pooling for the temporal summary, K-means for the unsupervised learning, and Logistic Regression (a linear classifier with a non-linear transform) for the classifier]

SLIDE 4

Review of Traditional Machine Learning

  • Each module can be replaced with a chain of a linear transform and a non-linear function

[Diagram: the same pipeline (MFCC, Temporal Pooling, K-means, Logistic Regression), with each module redrawn as Linear Transform → Non-linear function blocks, ending in Temporal Pooling and a Linear Classifier]

SLIDE 5

Review of Traditional Machine Learning

  • The entire set of modules can be replaced with one long chain of linear transforms and non-linear functions (i.e., a deep neural network)
    ○ In the traditional machine learning pipeline, each module is locally optimized

[Diagram: the traditional pipeline (MFCC, Temporal Pooling, K-means, Logistic Regression) redrawn as one long chain of Linear Transform → Non-linear function blocks followed by Temporal Pooling and a Linear Classifier, labeled "Deep Neural Network"]

SLIDE 6

Deep Learning

  • The entire stack of blocks (or layers) is optimized in an end-to-end manner
    ○ The parameters (or weights) in all layers are learned to minimize the loss function of the classifier
    ○ The loss is back-propagated through all layers (from right to left) as a gradient with respect to each parameter ("error back-propagation")
  • Therefore, we "learn features" instead of designing or engineering them

[Diagram: the long chain of Linear Transform → Non-linear function blocks, Temporal Pooling, and a Linear Classifier, labeled "Deep Neural Network"]

SLIDE 7

Deep Learning: Building Models

  • There are many choices of basic building blocks (or layers); see the sketch below
  • Connectivity patterns (parametric)
    ○ Fully-connected (i.e., linear transform)
    ○ Convolutional (note that the STFT is a convolutional operation)
    ○ Skip / Residual
    ○ Recurrent
  • Non-linearity functions (non-parametric)
    ○ Sigmoid
    ○ Tanh
    ○ Rectified Linear Unit (ReLU) and variations
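The slides do not prescribe a software framework; as a minimal sketch of how these building blocks look in code, here is a PyTorch version (all layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Parametric connectivity patterns (these layers have learnable weights)
fully_connected = nn.Linear(in_features=40, out_features=128)                  # linear transform
convolutional = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=1024)    # framed filtering, like the STFT
recurrent = nn.RNN(input_size=40, hidden_size=64, batch_first=True)

class ResidualBlock(nn.Module):
    """Skip / residual connectivity: output = input + F(input)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.fc(x))

# Non-parametric non-linearity functions (no weights, applied element-wise)
sigmoid, tanh, relu = nn.Sigmoid(), nn.Tanh(), nn.ReLU()

# Example: pass a batch of 8 feature vectors through two of the blocks
x = torch.randn(8, 40)
out = relu(fully_connected(x))          # -> shape (8, 128)
res = ResidualBlock(40)(x)              # -> shape (8, 40)
```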

SLIDE 8

Deep Learning: Building Models

  • We "design" a deep neural network architecture depending on the nature of the data and the task
    ○ Modular synth as a "musical analogy"

(Images from the Arturia Modular V manual)

[Manual excerpts (image text): the oscillator bank (one driver and three slave oscillators, which can also serve as LFOs) and the portamento/glide and monophonic-mode controls, shown only as an analogy for patching modules together into an architecture]
SLIDE 9

Deep Learning: Training Models

  • Loss functions
    ○ Cross entropy (logistic loss)
    ○ Hinge loss
    ○ Maximum likelihood
    ○ L2 (root mean square error) and L1
    ○ Adversarial
    ○ Variational
  • Optimizers
    ○ SGD
    ○ Momentum
    ○ RMSProp
    ○ Adagrad
    ○ Adam
  • Hyperparameters (initialization, regularization, model search)
    ○ Weight initialization
    ○ L1 and L2 regularization (weight decay)
    ○ Dropout
    ○ Learning rate
    ○ Layer size
    ○ Batch size
    ○ Data augmentation
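As a minimal sketch (not from the slides) of how these training choices fit together in PyTorch, with a hypothetical model, batch, and label set used only for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical model and batch, used only for illustration
model = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 10))
features = torch.randn(32, 40)              # a batch of 32 feature vectors
labels = torch.randint(0, 10, (32,))        # 10 classes

loss_fn = nn.CrossEntropyLoss()             # loss function choice
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # learning rate (hyperparameter)
                             weight_decay=1e-4)  # L2 regularization (weight decay)

logits = model(features)                    # feedforward pass
loss = loss_fn(logits, labels)              # compute the loss
optimizer.zero_grad()
loss.backward()                             # error back-propagation
optimizer.step()                            # gradient-based weight update
```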

SLIDE 10

Multi-Layer Perceptron (MLP)

  • Neural networks that consist of fully-connected layers and non-linear functions
    ○ Also called Feedforward Neural Network or Deep Feedforward Network
    ○ A long history: perceptron (Rosenblatt, 1962), back-propagation (Rumelhart, 1986), deep belief networks (Hinton and Salakhutdinov, 2006)

[Diagram: input layer $x$, hidden layers $h^{(1)}, h^{(2)}, h^{(3)}$, output layer $\hat{y}$]

Forward computation, where $g(\cdot)$ is the element-wise non-linear function:
$z^{(1)} = W^{(1)} x + b^{(1)}$,  $h^{(1)} = g(z^{(1)})$
$z^{(2)} = W^{(2)} h^{(1)} + b^{(2)}$,  $h^{(2)} = g(z^{(2)})$
$z^{(3)} = W^{(3)} h^{(2)} + b^{(3)}$,  $h^{(3)} = g(z^{(3)})$
$\hat{y} = W^{(4)} h^{(3)} + b^{(4)}$
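A minimal NumPy sketch of this forward computation (layer sizes and the input are arbitrary placeholders; sigmoid stands in for $g$):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    """Element-wise non-linear function (sigmoid used as a stand-in)."""
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 40-dimensional input, three hidden layers, one output unit
sizes = [40, 64, 64, 64, 1]
W = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

x = rng.standard_normal(40)          # one input vector

h = x
for l in range(3):                   # z^(l) = W^(l) h^(l-1) + b^(l),  h^(l) = g(z^(l))
    z = W[l] @ h + b[l]
    h = g(z)
y_hat = W[3] @ h + b[3]              # output layer: y_hat = W^(4) h^(3) + b^(4)
print(y_hat)
```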

SLIDE 11

Deep Feedforward Network

  • It is argued that the first breakthrough of deep learning came from the deep feedforward network (2011)
    ○ The state-of-the-art acoustic model in speech recognition was the GMM-HMM
    ○ Replace the GMM module with a deep feedforward network (up to 5 layers)
    ○ Initialize the weight matrices using an unsupervised learning algorithm
      ■ Deep belief network: greedy layer-wise pre-training using restricted Boltzmann machines

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2012

SLIDE 12

Non-linear Functions

  • There are several choices of non-linear functions (or activation functions); see the sketch below
    ○ ReLU is the default choice in modern deep learning: fast and effective
    ○ There are also other choices such as the Exponential Linear Unit (ELU) and Maxout
    ○ Note that this is an element-wise operation in the neural network

[Plots of the four activation functions over roughly $-10 \le x \le 10$]

Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Tanh: $\dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\max(0, x)$
Leaky ReLU: $\max(0.1x, x)$
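A minimal NumPy sketch of these four functions; all of them operate element-wise on an array:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

x = np.linspace(-10.0, 10.0, 5)
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")
```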
SLIDE 13

Why the Nonlinear Function in the Hidden Layer?

  • It captures high-order interactions between the input elements
    ○ This enables finding non-linear boundaries between different classes
    ○ Taylor series of a nonlinear function $g(z)$
      ■ Non-zero coefficients for high-order polynomials: $g(z) = a_0 + a_1 z + a_2 z^2 + a_3 z^3 + \cdots$
      ■ The high-order terms contain interactions between all input elements:
        $z = w_1 x_1 + w_2 x_2 + b$
        $z^2 = w_1^2 x_1^2 + 2 w_1 w_2 x_1 x_2 + w_2^2 x_2^2 + 2 w_1 x_1 b + 2 w_2 x_2 b + b^2$

[Figure over the $(x_1, x_2)$ plane: the linear decision boundary $z = 0$ versus the curved boundary $a_0 + a_1 z + a_2 z^2 = 0$]
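A small SymPy check (not from the slides) of the expansion above: squaring the pre-activation produces the cross term $2 w_1 w_2 x_1 x_2$, i.e., an interaction between the two input elements.

```python
import sympy as sp

w1, w2, x1, x2, b = sp.symbols("w1 w2 x1 x2 b")
z = w1 * x1 + w2 * x2 + b

expanded = sp.expand(z**2)
print(expanded)                           # contains 2*w1*w2*x1*x2 among its terms
print(expanded.coeff(w1 * w2 * x1 * x2))  # -> 2
```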

SLIDE 14

Why the Nonlinear Function in the Hidden Layer?

  • What if the nonlinear functions are absent?
    ○ A product of linear transforms is just another linear transform (see the sketch below)
    ○ Geometrically, a linear transformation only does scaling, shearing, and rotation

source: http://www.ams.org/publicoutreach/feature-column/fcarc-svd
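A minimal NumPy sketch of the first point: stacking two linear layers without a non-linearity collapses into a single linear transform.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 40))
W2 = rng.standard_normal((10, 64))
x = rng.standard_normal(40)

two_layers = W2 @ (W1 @ x)       # "deep" network without non-linearities
one_layer = (W2 @ W1) @ x        # equivalent single linear transform

print(np.allclose(two_layers, one_layer))  # True
```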

SLIDE 15

Why the Nonlinear Function in the Hidden Layer?

  • The non-linear function warps the input space so that data from different classes become more linearly separable

[Figure: a linear classifier applied in the original NN input space versus in the NN hidden-layer space]

source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

SLIDE 16

Training Deep Neural Network

  • Gradient descent learning
    ○ Need to compute the gradient for all parameters in all layers
    ○ Compute them via error back-propagation from the top layer

Weight update rule: $W^{(l)}_{\mathrm{new}} = W^{(l)}_{\mathrm{old}} - \mu \, \dfrac{\partial \ell}{\partial W^{(l)}_{\mathrm{old}}}$

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]

SLIDE 17

Training Deep Neural Network

  • Step 1) Initialize all weights to random numbers
  • Step 2) Feedforward computation
    ○ Feed the input $x$, compute $h^{(1)}$ up through $\hat{y}$, and keep all of the intermediate values

$z^{(1)} = W^{(1)} x + b^{(1)}$,  $h^{(1)} = g(z^{(1)})$
$z^{(2)} = W^{(2)} h^{(1)} + b^{(2)}$,  $h^{(2)} = g(z^{(2)})$
$z^{(3)} = W^{(3)} h^{(2)} + b^{(3)}$,  $h^{(3)} = g(z^{(3)})$
$\hat{y} = W^{(4)} h^{(3)} + b^{(4)}$

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]

SLIDE 18

Training Deep Neural Network

  • Step 3) Compute the loss $\ell(y, \hat{y})$ using the prediction $\hat{y}$ and the ground truth $y$
    ○ For simplicity, let's assume a squared-error (L2) loss
      ■ $\ell(y, \hat{y}) = (y - \hat{y})^2 = \big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)^2$
      ■ $\dfrac{\partial \ell}{\partial W^{(4)}_{j}} = 2\big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)\big(-h^{(3)}_{j}\big)$
      ■ $\dfrac{\partial \ell}{\partial h^{(3)}_{j}} = 2\big(y - (W^{(4)} h^{(3)} + b^{(4)})\big)\big(-W^{(4)}_{j}\big)$

We can compute these because $h^{(3)}$ was computed in the forward pass and $W^{(4)}$ and $b^{(4)}$ were already initialized. We need the gradient with respect to the hidden-layer units for the lower layer!

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]
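A small NumPy check (not from the slides) that the Step 3 formula for $\partial \ell / \partial W^{(4)}_{j}$ matches a finite-difference estimate, assuming a scalar output and arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
h3 = rng.standard_normal(8)       # hidden activations from the forward pass
W4 = rng.standard_normal(8)       # top-layer weights (scalar output)
b4 = 0.5
y = 1.0                           # ground truth

def loss(W):
    y_hat = W @ h3 + b4
    return (y - y_hat) ** 2

# Analytic gradient from the slide: 2 (y - y_hat) (-h3_j)
y_hat = W4 @ h3 + b4
grad_analytic = 2.0 * (y - y_hat) * (-h3)

# Central finite-difference estimate
eps = 1e-6
grad_numeric = np.zeros_like(W4)
for j in range(len(W4)):
    W_plus = W4.copy();  W_plus[j] += eps
    W_minus = W4.copy(); W_minus[j] -= eps
    grad_numeric[j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```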

SLIDE 19

Training Deep Neural Network

  • Step 4) Back-propagate the loss to the lower layers
    ○ Let's assume the sigmoid function for the non-linearity
      ■ $h^{(3)}_{j} = g(z^{(3)}_{j}) = \dfrac{1}{1 + e^{-z^{(3)}_{j}}}$
      ■ $\dfrac{\partial \ell}{\partial z^{(3)}_{j}} = \dfrac{\partial \ell}{\partial h^{(3)}_{j}} \cdot \dfrac{\partial h^{(3)}_{j}}{\partial z^{(3)}_{j}} = \dfrac{\partial \ell}{\partial h^{(3)}_{j}} \cdot \dfrac{1}{1 + e^{-z^{(3)}_{j}}} \cdot \dfrac{e^{-z^{(3)}_{j}}}{1 + e^{-z^{(3)}_{j}}}$

We know $\partial \ell / \partial h^{(3)}_{j}$ from the upper layer (the previous slide), and we can compute the sigmoid derivative using $z^{(3)}$ from the forward pass.

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]
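A small NumPy check (not from the slides) that the sigmoid derivative used in Step 4, $\frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}$, matches a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Derivative as written on the slide: sigma(z) * (1 - sigma(z))
d_analytic = (1.0 / (1.0 + np.exp(-z))) * (np.exp(-z) / (1.0 + np.exp(-z)))

# Central finite-difference estimate
eps = 1e-6
d_numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(d_analytic, d_numeric, atol=1e-6))  # True
```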

SLIDE 20

Training Deep Neural Network

  • Step 5) Back-propagate the loss to the lower layers (continued)
    ○ $\dfrac{\partial \ell}{\partial W^{(3)}_{j,k}} = \dfrac{\partial \ell}{\partial z^{(3)}_{j}} \cdot \dfrac{\partial z^{(3)}_{j}}{\partial W^{(3)}_{j,k}} = \dfrac{\partial \ell}{\partial z^{(3)}_{j}} \cdot h^{(2)}_{k}$
    ○ $\dfrac{\partial \ell}{\partial h^{(2)}_{k}} = \dfrac{\partial \ell}{\partial z^{(3)}_{j}} \cdot \dfrac{\partial z^{(3)}_{j}}{\partial h^{(2)}_{k}} = \dfrac{\partial \ell}{\partial z^{(3)}_{j}} \cdot W^{(3)}_{j,k}$ (summed over $j$ when $h^{(2)}_{k}$ feeds several units in layer 3)

We know $\partial \ell / \partial z^{(3)}_{j}$ from the upper layer (the previous slide); the other factors come from the local layer.

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]

SLIDE 21

Training Deep Neural Network

  • Step 6) Repeat Step 4 for the 2nd layer
    ○ $\dfrac{\partial \ell}{\partial z^{(2)}_{k}} = \dfrac{\partial \ell}{\partial h^{(2)}_{k}} \cdot \dfrac{\partial h^{(2)}_{k}}{\partial z^{(2)}_{k}} = \dfrac{\partial \ell}{\partial h^{(2)}_{k}} \cdot \dfrac{1}{1 + e^{-z^{(2)}_{k}}} \cdot \dfrac{e^{-z^{(2)}_{k}}}{1 + e^{-z^{(2)}_{k}}}$

We know $\partial \ell / \partial h^{(2)}_{k}$ from the upper layer (the previous slide).

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]

SLIDE 22

Training Deep Neural Network

  • Step 7) Repeat Step 5 for the 2nd layer
    ○ $\dfrac{\partial \ell}{\partial W^{(2)}_{k,m}} = \dfrac{\partial \ell}{\partial z^{(2)}_{k}} \cdot \dfrac{\partial z^{(2)}_{k}}{\partial W^{(2)}_{k,m}} = \dfrac{\partial \ell}{\partial z^{(2)}_{k}} \cdot h^{(1)}_{m}$
    ○ $\dfrac{\partial \ell}{\partial h^{(1)}_{m}} = \dfrac{\partial \ell}{\partial z^{(2)}_{k}} \cdot \dfrac{\partial z^{(2)}_{k}}{\partial h^{(1)}_{m}} = \dfrac{\partial \ell}{\partial z^{(2)}_{k}} \cdot W^{(2)}_{k,m}$

We know $\partial \ell / \partial z^{(2)}_{k}$ from the upper layer (the previous slide); the other factors come from the local layer.

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]

SLIDE 23

Training Deep Neural Network

  • Step 8) Repeat the two previous steps for the 1st layer
    ○ Done with computing one iteration of the gradient
    ○ Update the weights using the gradient: $W^{(l)}_{\mathrm{new}} = W^{(l)}_{\mathrm{old}} - \mu \, \dfrac{\partial \ell}{\partial W^{(l)}_{\mathrm{old}}}$ (a NumPy sketch of Steps 1 through 8 follows below)

[Network diagram: $x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow h^{(3)} \rightarrow \hat{y}$ with weights $W^{(1)}, \ldots, W^{(4)}$ and loss $\ell(y, \hat{y})$]
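A minimal NumPy sketch (not from the slides) of Steps 1 through 8 for the 4-layer sigmoid MLP with a squared-error loss: one forward pass, one backward pass, and one weight update. The layer sizes, learning rate, and toy input/target are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1) initialize all weights to random numbers
sizes = [40, 64, 64, 64, 1]                                    # input, h1, h2, h3, output
W = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
mu = 0.01                                                      # learning rate

x = rng.standard_normal(40)                                    # toy input
y = np.array([1.0])                                            # toy ground truth

# Step 2) feedforward computation, keeping the intermediate values
z, h = [], [x]
for l in range(3):
    z.append(W[l] @ h[-1] + b[l])
    h.append(sigmoid(z[-1]))
y_hat = W[3] @ h[-1] + b[3]

# Step 3) loss and gradients of the top layer
loss = np.sum((y - y_hat) ** 2)
d_yhat = -2.0 * (y - y_hat)                                    # d loss / d y_hat
dW, db = [None] * 4, [None] * 4
dW[3] = np.outer(d_yhat, h[3])
db[3] = d_yhat
d_h = W[3].T @ d_yhat                                          # d loss / d h^(3)

# Steps 4-8) back-propagate through layers 3, 2, 1
for l in [2, 1, 0]:
    d_z = d_h * sigmoid(z[l]) * (1.0 - sigmoid(z[l]))          # sigmoid derivative
    dW[l] = np.outer(d_z, h[l])
    db[l] = d_z
    d_h = W[l].T @ d_z                                         # gradient for the layer below

# Weight update: W_new = W_old - mu * d loss / d W_old
for l in range(4):
    W[l] -= mu * dW[l]
    b[l] -= mu * db[l]

print("loss:", loss)
```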

SLIDE 24

Training Deep Neural Network

  • Step 9) Keep repeating the feedforward and backward passes
    ○ Updating the weights with the gradients: $W^{(l)}_{\mathrm{new}} = W^{(l)}_{\mathrm{old}} - \mu \, \dfrac{\partial \ell}{\partial W^{(l)}_{\mathrm{old}}}$

[Figure: the parameters moving step by step across the loss surface from their initial point toward the optimum]

SLIDE 25

Training Deep Neural Network

  • Step 10) Monitor both the training loss and the validation loss, and stop the iterations when the validation loss no longer decreases (early stopping); see the sketch below
    ○ Note that the training set is used for both the feedforward and backward passes, whereas the validation set is used only for the feedforward pass
    ○ Early stopping is a regularization method

[Figure: training and validation loss curves over epochs; the weight updates come from the training set, and early stopping occurs where the validation loss starts to rise (overfitting starts)]
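A minimal sketch (not from the slides) of the early-stopping logic in Step 10. The train_one_epoch and validation_loss functions here are stand-ins that simulate loss curves; in practice they would run the feedforward/backward passes on the training set and a feedforward-only pass on the validation set.

```python
import math

def train_one_epoch(epoch):
    # Stand-in: training loss keeps decreasing
    return math.exp(-0.1 * epoch)

def validation_loss(epoch):
    # Stand-in: validation loss decreases, then rises once overfitting starts
    return math.exp(-0.1 * epoch) + 0.002 * epoch ** 1.5

best_val, best_epoch, patience, wait = float("inf"), 0, 5, 0

for epoch in range(200):
    train_loss = train_one_epoch(epoch)       # forward + backward passes on the training set
    val_loss = validation_loss(epoch)         # forward pass only on the validation set

    if val_loss < best_val:
        best_val, best_epoch, wait = val_loss, epoch, 0
        # in practice: save a checkpoint of the weights here
    else:
        wait += 1
        if wait >= patience:                  # validation loss has stopped improving
            print(f"early stopping at epoch {epoch}, best epoch was {best_epoch}")
            break
```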

SLIDE 26

Wrap up: Training Deep Neural Networks

  • Generalization
    ○ Any "differentiable" module can be used in the neural network as a layer (see the sketch below)
    ○ Gradient-based learning will train the entire network

[Diagram: a parametric module (connectivity patterns: fully-connected, convolutional, ...) maps input $x$ and weights $w$ to output $y$ in the forward pass, and in the backward pass turns $\partial \ell / \partial y$ into $\partial \ell / \partial x$ (passed to the layer below) and $\partial \ell / \partial w$ (used for the weight update); a non-parametric module (activation functions; pooling: max, average, ...) has only the forward map $x \rightarrow y$ and the backward map $\partial \ell / \partial y \rightarrow \partial \ell / \partial x$]
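A minimal sketch (not from the slides) of this forward/backward module interface, with one parametric module (fully-connected) and one non-parametric module (ReLU); grad_out plays the role of $\partial \ell / \partial y$.

```python
import numpy as np

class Linear:
    """Parametric module: y = W x + b, with backward producing dW, db, and dx."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                              # cache the input for the backward pass
        return self.W @ x + self.b

    def backward(self, grad_out):               # grad_out = d loss / d y
        self.dW = np.outer(grad_out, self.x)    # gradient for the weight update
        self.db = grad_out
        return self.W.T @ grad_out              # d loss / d x, passed to the layer below

class ReLU:
    """Non-parametric module: element-wise max(0, x), no weight gradients."""
    def forward(self, x):
        self.x = x
        return np.maximum(0.0, x)

    def backward(self, grad_out):
        return grad_out * (self.x > 0)

# Chain the modules: forward left to right, backward right to left
rng = np.random.default_rng(0)
layers = [Linear(40, 64, rng), ReLU(), Linear(64, 1, rng)]

out = rng.standard_normal(40)
for layer in layers:
    out = layer.forward(out)

grad = np.ones_like(out)                        # stand-in for d loss / d output
for layer in reversed(layers):
    grad = layer.backward(grad)
```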

SLIDE 27

MLP Demo and Visualization

  • https://playground.tensorflow.org/