Advanced Machine Learning: Dense Neural Networks
Amit Sethi, Electrical Engineering, IIT Bombay
Learning objectives
- Learn the motivations behind neural networks
- Become familiar with neural network terms
- Understand the working of neural networks
- Understand behind-the-scenes training of neural networks
Neural networks are inspired by the mammalian brain
- Each unit (neuron) is simple
- But the human brain has 100 billion neurons with 100 trillion connections
- The strength and nature of the connections store memories and the "program" that makes us human
- A neural network is a web of artificial neurons
Artificial neurons are inspired by biological neurons
- Neural networks are made up of artificial neurons
- Artificial neurons are only loosely based on real neurons, just like neural networks are only loosely based on the human brain
[Diagram: a neuron computes g(w1·x1 + w2·x2 + w3·x3 + b), where g is the activation function]
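The neuron in the diagram can be sketched in a few lines; a minimal sketch assuming NumPy, with a sigmoid activation for g and made-up input, weight, and bias values:

```python
import numpy as np

def neuron(x, w, b, g):
    """One artificial neuron: weighted sum of inputs plus bias, then activation g."""
    return g(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3 (made up)
w = np.array([0.5, -0.25, 0.1])  # weights w1, w2, w3 (made up)
b = 0.2                          # bias (made up)
y = neuron(x, w, b, sigmoid)     # output is squashed into (0, 1) by the sigmoid
```

Training, as the following slides explain, is the process of adjusting `w` and `b`.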
Activation function is the secret sauce of neural networks
- Neural network training is all about tuning weights and biases
- If there were no activation function g, the output of the entire neural network would be a linear function of the inputs
- The earliest models used a step function
[Diagram: a neuron computes g(w1·x1 + w2·x2 + w3·x3 + b)]
Types of activation functions
- Step: original concept behind classification and region bifurcation. Not used anymore
- Sigmoid and tanh: trainable approximations of the step function
- ReLU: currently preferred due to fast convergence
- Softmax: currently preferred for the output of a classification net. Generalized sigmoid
- Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
- Step: y = (sign(x) + 1) / 2
- Sigmoid: y = 1 / (1 + e^(−x))
- Tanh: y = tanh(x)
- ReLU: y = max(0, x)
- Softmax: y_j = e^(x_j) / Σ_k e^(x_k)
- Linear: y = x
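The formulas above translate directly into code; a minimal sketch assuming NumPy (function names are my own, and softmax subtracts the max for numerical stability, a standard trick not shown in the formula):

```python
import numpy as np

def step(x):
    return (np.sign(x) + 1) / 2          # 0 for x < 0, 1 for x > 0 (0.5 at x = 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # smooth approximation of the step

def relu(x):
    return np.maximum(0, x)              # zero for negatives, identity for positives

def softmax(x):
    e = np.exp(x - np.max(x))            # shift by max for numerical stability
    return e / e.sum()                   # outputs are positive and sum to 1

x = np.array([-2.0, 0.0, 2.0])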
Step function divides the input space into two halves: 0 and 1
- In a single neuron, the step function acts as a linear binary classifier
- The weights and biases determine where the step will be in n dimensions
- But, as we shall see later, it gives little information about how to change the weights if we make a mistake
- So, we need a smoother version of the step function
- Enter: the sigmoid function
The sigmoid function is a smoother step function
- Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
- The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
The problem with sigmoid is (near) zero gradient at both extremes
- For both large positive and large negative input values, the sigmoid doesn't change much with a change of input
- ReLU has a constant gradient for almost half of the inputs
- But ReLU cannot give a meaningful final output
Output activation functions can only be of the following kinds
- Sigmoid gives binary classification output
- Tanh can also do that, provided the desired output is in {−1, +1}
- Softmax generalizes sigmoid to n-ary classification
- Linear is used for regression
- ReLU is only used in internal (non-output) nodes
Contents
- Introduction to neural networks
- Feed forward neural networks
- Gradient descent and backpropagation
- Learning rate setting and tuning
Basic structure of a neural network
- It is feed forward
– Connections go from inputs towards outputs
– No connection comes backwards
- It consists of layers
– Current layer's input is previous layer's output
– No lateral (intra-layer) connections
- That's it!
[Diagram: feed-forward network with inputs x1 … xd, hidden units h11 … h1n, and outputs y1 … yn]
Basic structure of a neural network
- Output layer
– Represents the output of the neural network
– For a two-class problem or regression with a 1-d output, we need only one output node
- Hidden layer(s)
– Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
– These usually form a hidden layer
– Usually, there is only one such layer
– Given enough hidden nodes, we can model an arbitrary input-output relation
- Input layer
– Represents the dimensions of the input vector (one node for each dimension)
– These usually form an input layer
– Usually, there is only one such layer
Importance of hidden layers
- First hidden layer extracts features
- Second hidden layer extracts features of features
- …
- Output layer gives the desired output
[Figure: two-class 2-d data — a single sigmoid gives only a linear boundary; sigmoid hidden layers with a sigmoid output give a nonlinear boundary]
Overall function of a neural network
- f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Weights form a matrix
- Outputs of the previous layer form a vector
- The activation (nonlinear) function is applied point-wise to the weight matrix times the input
- Design questions (hyper-parameters):
– Number of layers
– Number of neurons in each layer (rows of weight matrices)
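The nested-function form above is just a loop of matrix multiplications followed by point-wise activations; a minimal sketch assuming NumPy, with made-up layer sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, activations):
    """Apply each layer: activation of (weight matrix times previous output)."""
    a = x
    for W, f in zip(weights, activations):
        a = f(W @ a)
    return a

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up design: 4-d input -> 5 hidden ReLU units -> 3 softmax outputs
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
y = forward(rng.standard_normal(4), weights, [relu, softmax])
```

The rows of each weight matrix correspond to the neurons of that layer, matching the design questions above.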
Training the neural network
- Given inputs x_i and targets y_i
- Think of what hyper-parameters and neural network design might work
- Form a neural network: f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Compute f_w(x_i) as an estimate of y_i for all samples
- Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
- Tweak the weights w to reduce the loss (optimization algorithm)
- Repeat the last three steps
Loss function choice
- In regression, errors can be positive or negative, and MSE (mean squared error) is the most common loss function
- In classification, the output is a probability of the correct class, for which cross entropy is the most common loss function
Some loss functions and their derivatives
- Terminology
– ŷ is the network output
– y is the target output
- Mean square error
- Loss: (ŷ − y)²
- Derivative of the loss: 2(ŷ − y)
- Cross entropy
- Loss: −Σ_{d=1}^{D} y_d log ŷ_d
- Derivative of the loss: −1/ŷ_d, evaluated at the target class d
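Both losses and their derivatives are one-liners; a minimal sketch assuming NumPy and a one-hot target vector (the example values are made up):

```python
import numpy as np

def mse_loss(y_hat, y):
    return (y_hat - y) ** 2

def mse_grad(y_hat, y):
    return 2 * (y_hat - y)

def cross_entropy(y_hat, y):
    """y_hat: predicted probabilities, y: one-hot target."""
    return -np.sum(y * np.log(y_hat))

def cross_entropy_grad(y_hat, y):
    return -y / y_hat            # nonzero only at the target class

y_hat = np.array([0.7, 0.2, 0.1])  # made-up softmax output
y = np.array([1.0, 0.0, 0.0])      # one-hot target
```

With a one-hot target the cross-entropy loss reduces to −log of the probability assigned to the correct class.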
Computational graph of a single hidden layer NN
[Graph: x → (W1·x + b1 = Z1) → ReLU → A1 → (W2·A1 + b2 = Z2) → Softmax → A2 → cross-entropy loss against the target]
Advanced Machine Learning Backpropagation
Amit Sethi Electrical Engineering, IIT Bombay
Learning objectives
- Write derivative of a nested function using
chain rule
- Articulate how storage of partial derivatives
leads to an efficient gradient descent for neural networks
- Write gradient descent as matrix operations
Overall function of a neural network
- f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Weights form a matrix
- Outputs of the previous layer form a vector
- The activation (nonlinear) function is applied point-wise to the weight matrix times the input
- Design questions (hyper-parameters):
– Number of layers
– Number of neurons in each layer (rows of weight matrices)
Training the neural network
- Given inputs x_i and targets y_i
- Think of what hyper-parameters and neural network design might work
- Form a neural network: f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Compute f_w(x_i) as an estimate of y_i for all samples
- Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
- Tweak the weights w to reduce the loss (optimization algorithm)
- Repeat the last three steps
Gradient ascent
- If you didn’t know the shape of a mountain
- But at every step you knew the slope
- Can you reach the top of the mountain?
Gradient descent minimizes the loss function
- At every point, compute
- Loss (scalar): l_i(w)
- Gradient of the loss with respect to the weights (vector): ∇_w l_i(w)
- Take a step towards the negative gradient:
w ← w − η ∇_w (1/N) Σ_{i=1}^{N} l_i(w)
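The update rule above is a short loop; a minimal sketch assuming NumPy, minimizing a simple quadratic whose gradient we know in closed form (the learning rate and step count are made up):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.05, steps=200):
    """Repeat w <- w - lr * gradient until the step budget runs out."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(w) = 5*w1^2 + 3*w2^2, whose gradient is [10*w1, 6*w2]
grad = lambda w: np.array([10 * w[0], 6 * w[1]])
w_star = gradient_descent(grad, [2.0, 1.0])
```

For a neural network, `grad` would be the average of the per-sample gradients ∇_w l_i(w), computed by backpropagation.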
Derivative of a function of a scalar
- Derivative g′(x) = dg(x)/dx is the rate of change of g(x) with x
- It is zero where the function is flat (horizontal), such as at a minimum or maximum of g(x)
- It is positive when g(x) is sloping up, and negative when g(x) is sloping down
- To move towards a maximum, take a small step in the direction of the derivative
E.g. g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Gradient of a function of a vector
- Derivative with respect to each dimension, holding the other dimensions constant
- ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2]
- At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Surface plot of f(x1, x2)]
Gradient of a function of a vector
- The gradient gives a direction for moving towards the minimum
- Take a small step towards the negative of the gradient
Example of gradient
- Let g(x) = g(x1, x2) = 5x1² + 3x2²
- Then ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2] = [10x1, 6x2]
- At the location (2, 1), a step in the [20, 6] direction (unit vector [0.958, 0.287]) will lead to the maximal increase in the function
This story is unfolding in multiple dimensions
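The worked example above can be verified numerically; a sketch assuming NumPy, using a central finite-difference check against the analytic gradient:

```python
import numpy as np

g    = lambda x: 5 * x[0] ** 2 + 3 * x[1] ** 2
grad = lambda x: np.array([10 * x[0], 6 * x[1]])   # analytic gradient

x0 = np.array([2.0, 1.0])
gv = grad(x0)                       # should be [20, 6]
unit = gv / np.linalg.norm(gv)      # direction of maximal increase

# Central finite-difference approximation of each partial derivative
eps = 1e-6
num = np.array([(g(x0 + eps * e) - g(x0 - eps * e)) / (2 * eps)
                for e in np.eye(2)])
```

The same finite-difference trick is a standard sanity check for hand-derived gradients in backpropagation code.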
Backpropagation
- Backpropagation is an efficient method to do gradient descent
- It saves the gradient w.r.t. the upper layer's output to compute the gradient w.r.t. the weights immediately below
- It is linked to the chain rule of derivatives
- All intermediary functions must be differentiable, including the activation functions
Chain rule of differentiation
- Very handy for complicated functions
- Especially functions of functions
- E.g. NN outputs are functions of previous layers
- For example: let f(x) = g(h(x))
- Let u = h(x), so F = f(x) = g(u)
- Then f′(x) = dF/dx = (dF/du)(du/dx) = g′(u) h′(x)
- For example:
d sin(x²)/dx = 2x cos(x²)
Backpropagation makes use of the chain rule of derivatives
[Graph: x → (W1·x + b1 = Z1) → ReLU → A1 → (W2·A1 + b2 = Z2) → Softmax → A2 → cross-entropy loss against the target]
- Chain rule: ∂g(h(x))/∂x = (∂g(h)/∂h) · (∂h(x)/∂x)
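Applying the chain rule along this graph, back to front, gives backpropagation for the one-hidden-layer network; a minimal NumPy sketch with made-up sizes and random weights (dZ2 = A2 − t is the standard combined gradient of softmax followed by cross-entropy):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up instance of the graph: 4-d input, 5 hidden units, 3 classes
x = rng.standard_normal(4)
t = np.array([0.0, 1.0, 0.0])            # one-hot target
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)

# Forward pass, saving every intermediate value
Z1 = W1 @ x + b1
A1 = np.maximum(0, Z1)                   # ReLU
Z2 = W2 @ A1 + b2
A2 = softmax(Z2)
loss = -np.sum(t * np.log(A2))           # cross-entropy

# Backward pass: each gradient reuses the one computed just above it
dZ2 = A2 - t                             # softmax + cross-entropy combine to this
dW2 = np.outer(dZ2, A1)
db2 = dZ2
dA1 = W2.T @ dZ2                         # gradient handed down to the hidden layer
dZ1 = dA1 * (Z1 > 0)                     # ReLU passes gradient only where Z1 > 0
dW1 = np.outer(dZ1, x)
db1 = dZ1
```

Saving A1 and Z1 during the forward pass is exactly the "storage of partial derivatives" that makes backpropagation efficient.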
Vector valued functions and Jacobians
- We often deal with functions that give multiple outputs
- Let g(x) = [g1(x); g2(x)] = [g1(x1, x2, x3); g2(x1, x2, x3)]
- Thinking in terms of vectors of functions can make the representation less cumbersome and computations more efficient
- Then the Jacobian is
J(g) = [∂g/∂x1  ∂g/∂x2  ∂g/∂x3]
     = [∂g1/∂x1  ∂g1/∂x2  ∂g1/∂x3]
       [∂g2/∂x1  ∂g2/∂x2  ∂g2/∂x3]
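A Jacobian can be approximated column by column with finite differences; a sketch assuming NumPy, with a made-up function g: R³ → R² whose Jacobian is easy to compute by hand:

```python
import numpy as np

# Hypothetical vector-valued function g: R^3 -> R^2
g = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian: entry (i, j) holds df_i / dx_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference
    return J

x0 = np.array([1.0, 2.0, 3.0])
J = jacobian(g, x0)   # analytic answer: [[x2, x1, 0], [0, 1, 2*x3]] at x0
```

Each layer of a network has such a Jacobian; backpropagation multiplies them without ever forming them explicitly.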
Jacobian of each layer
- Compute the derivatives of a higher layer's output with respect to those of the lower layer
- What if we scale all the weights by a factor R?
- What happens a few layers down?
Role of step size and learning rate
- Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
- Success not guaranteed
The perfect step size is impossible to guess
- Goldilocks finds the perfect balance only in a fairy tale
- The step size is decided by the learning rate η and the gradient
Double derivative
- Double derivative g′′(x) = d²g(x)/dx² is the derivative of the derivative of g(x)
- The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
E.g. g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Double derivative
- The double derivative tells how far the minimum might be from a given point
- From x = 0 the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Perfect step size for a paraboloid
- Let g(x) = bx² + cx + d
- Assuming b > 0
- The minimum is at: x* = −c/(2b)
- For any x the perfect step would be:
−c/(2b) − x = −(2bx + c)/(2b) = −g′(x)/g′′(x)
- So, the perfect learning rate is: η* = 1/g′′(x)
- In multiple dimensions, x ← x − H(g(x))⁻¹ ∇g(x)
- Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
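The perfect-step claim can be checked on a concrete parabola (the coefficients are made up); one step of size g′(x)/g′′(x) should land exactly on the minimum:

```python
# Parabola g(x) = b*x^2 + c*x + d with b > 0 (made-up coefficients)
b, c, d = 2.0, -8.0, 1.0

def g1(x):
    return 2 * b * x + c        # g'(x)

g2 = 2 * b                      # g''(x), constant for a parabola

x = 5.0                         # made-up starting point
x_new = x - g1(x) / g2          # one step with the "perfect" learning rate 1/g''
x_star = -c / (2 * b)           # analytic minimum
```

For non-quadratic functions g′′ varies with x, so this Newton-style step is only locally perfect.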
Hessian of a function of a vector
- The double derivative with respect to each pair of dimensions forms the Hessian matrix: H_ij = ∂²g/∂x_i∂x_j
- If all eigenvalues of the Hessian matrix are positive, then the function is convex
Example of Hessian
- Let g(x) = g(x1, x2) = 5x1² + 3x2² + 4x1x2
- Then ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2] = [10x1 + 4x2, 6x2 + 4x1]
- And H(g(x)) = [∂²g/∂x1²    ∂²g/∂x1∂x2]
               [∂²g/∂x2∂x1  ∂²g/∂x2²  ]
             = [10  4]
               [ 4  6]
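The eigenvalue test from the previous slide can be applied to this example Hessian; a sketch assuming NumPy:

```python
import numpy as np

# Hessian of g(x1, x2) = 5*x1^2 + 3*x2^2 + 4*x1*x2 (constant, since g is quadratic)
H = np.array([[10.0, 4.0],
              [4.0, 6.0]])

eigenvalues = np.linalg.eigvalsh(H)     # symmetric matrix -> real eigenvalues
is_convex = bool(np.all(eigenvalues > 0))
```

Both eigenvalues (8 ± √20) are positive, so this g is convex and has a single minimum.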
Saddle points, Hessian and long local furrows
- Some variables may have reached a local minimum while others have not
- Some weights may have almost zero gradient
- At a saddle point, at least some eigenvalues of the Hessian are negative while others are positive
Complicated loss functions
A realistic picture
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
[Figure: loss landscape annotated with a saddle point, local minima, a possible global minimum, and local maxima]