Advanced Machine Learning: Dense Neural Networks - Amit Sethi - PowerPoint PPT Presentation



SLIDE 1

Advanced Machine Learning Dense Neural Networks

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 2

Learning objectives

  • Learn the motivations behind neural networks
  • Become familiar with neural network terms
  • Understand the working of neural networks
  • Understand behind-the-scenes training of neural networks

SLIDE 3

Neural networks are inspired by the mammalian brain

  • Each unit (neuron) is simple
  • But the human brain has 100 billion neurons with 100 trillion connections
  • The strength and nature of the connections store memories and the “program” that makes us human
  • A neural network is a web of artificial neurons
SLIDE 4

Artificial neurons are inspired by biological neurons

  • Neural networks are made up of artificial neurons
  • Artificial neurons are only loosely based on real neurons, just like neural networks are only loosely based on the human brain

[Diagram: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, summation Σ, and activation g]

SLIDE 5

Activation function is the secret sauce of neural networks

  • Neural network training is all about tuning weights and biases
  • If there were no activation function, the output of the entire neural network would be a linear function of the inputs
  • The earliest models used a step function

[Diagram: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, summation Σ, and activation g]

SLIDE 6

Types of activation functions

  • Step: the original concept behind classification and region bifurcation. Not used anymore
  • Sigmoid and tanh: trainable approximations of the step function
  • ReLU: currently preferred due to fast convergence
  • Softmax: currently preferred for the output of a classification net. A generalized sigmoid
  • Linear: good for modeling a range in the output of a regression net

SLIDE 7

Formulas for activation functions

  • Step: $g(x) = \frac{\operatorname{sign}(x) + 1}{2}$
  • Sigmoid: $g(x) = \frac{1}{1 + e^{-x}}$
  • Tanh: $g(x) = \tanh(x)$
  • ReLU: $g(x) = \max(0, x)$
  • Softmax: $g(x_j) = \frac{e^{x_j}}{\sum_j e^{x_j}}$
  • Linear: $g(x) = x$
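These formulas translate directly into NumPy. A minimal sketch (the max subtraction inside softmax is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

# Direct NumPy versions of the activation formulas above
step    = lambda x: (np.sign(x) + 1) / 2
sigmoid = lambda x: 1 / (1 + np.exp(-x))
relu    = lambda x: np.maximum(0, x)
linear  = lambda x: x

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(step(x), sigmoid(x), np.tanh(x), relu(x), softmax(x), linear(x), sep="\n")
```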
SLIDE 8

Step function divides the input space into two halves: 0 and 1

  • In a single neuron, the step function is a linear binary classifier
  • The weights and biases determine where the step will be in n dimensions
  • But, as we shall see later, it gives little information about how to change the weights if we make a mistake
  • So, we need a smoother version of a step function
  • Enter: the sigmoid function
SLIDE 9

The sigmoid function is a smoother step function

  • Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
  • The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier

SLIDE 10

The problem with sigmoid is (near) zero gradient on both extremes

  • For large positive and large negative input values, the sigmoid doesn’t change much with a change of input
  • ReLU has a constant gradient for almost half of the inputs
  • But ReLU cannot give a meaningful final output
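A quick numerical illustration of this saturation, comparing the sigmoid and ReLU gradients at a few inputs (a small sketch, not from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
d_relu = lambda z: float(z > 0)                          # derivative of ReLU

for z in (-10.0, 0.0, 10.0):
    print(z, d_sigmoid(z), d_relu(z))
# At |z| = 10 the sigmoid gradient is ~4.5e-5 (nearly zero),
# while ReLU keeps a constant gradient of 1 for all z > 0.
```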

SLIDE 11

Output activation functions can only be of the following kinds

  • Sigmoid gives binary classification output
  • Tanh can also do that, provided the desired output is in {-1, +1}
  • Softmax generalizes sigmoid to n-ary classification
  • Linear is used for regression
  • ReLU is only used in internal (non-output) nodes
SLIDE 12

Contents

  • Introduction to neural networks
  • Feed forward neural networks
  • Gradient descent and backpropagation
  • Learning rate setting and tuning
SLIDE 13

Basic structure of a neural network

  • It is feed forward
    – Connections go from inputs towards outputs
    – No connection comes backwards
  • It consists of layers
    – The current layer’s input is the previous layer’s output
    – No lateral (intra-layer) connections
  • That’s it!

[Diagram: a dense feed-forward network with inputs x1 … xd, a hidden layer h11 … h1n, bias nodes, and outputs y1 … yn]

SLIDE 14

Basic structure of a neural network

  • Output layer
    – Represents the output of the neural network
    – For a two-class problem, or regression with a 1-d output, we need only one output node
  • Hidden layer(s)
    – Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
    – These usually form a hidden layer; usually there is only one such layer
    – Given enough hidden nodes, we can model an arbitrary input-output relation
  • Input layer
    – Represents the dimensions of the input vector (one node for each dimension)
    – These usually form an input layer, and usually there is only one such layer

[Diagram: a dense feed-forward network with inputs x1 … xd, a hidden layer h11 … h1n, bias nodes, and outputs y1 … yn]

SLIDE 15

Importance of hidden layers

  • The first hidden layer extracts features
  • The second hidden layer extracts features of features
  • The output layer gives the desired output
[Figure: separating ‘+’ and ‘−’ samples with a single sigmoid vs. with sigmoid hidden layers and a sigmoid output]
SLIDE 16

Overall function of a neural network

  • $f(\mathbf{x}_i) = g_m(\mathbf{W}_m \, g_{m-1}(\mathbf{W}_{m-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
  • Weights form a matrix
  • Outputs of the previous layer form a vector
  • The activation (nonlinear) function is applied point-wise to the weights times the input
  • Design questions (hyper-parameters):
    – Number of layers
    – Number of neurons in each layer (rows of the weight matrices)
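A minimal sketch of this layered formula (assuming NumPy and made-up layer sizes): each layer multiplies the previous layer's output by a weight matrix, adds a bias, and applies the activation point-wise.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, activations):
    """Forward pass: repeatedly apply W·a + b, then a point-wise activation."""
    a = x
    for W, b, g in zip(weights, biases, activations):
        a = g(W @ a + b)
    return a

# Hypothetical sizes: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs (linear)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
activations = [relu, lambda z: z]

print(forward(rng.normal(size=3), weights, biases, activations))
```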

SLIDE 17

Training the neural network

  • Given inputs $\mathbf{x}_i$ and targets $t_i$
  • Think of what hyper-parameters and neural network design might work
  • Form a neural network:
    $f(\mathbf{x}_i) = g_m(\mathbf{W}_m \, g_{m-1}(\mathbf{W}_{m-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
  • Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $t_i$ for all samples
  • Compute the loss: $\frac{1}{N}\sum_{i=1}^{N} L\!\left(f_{\mathbf{w}}(\mathbf{x}_i), t_i\right) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
  • Tweak the weights $\mathbf{w}$ to reduce the loss (optimization algorithm)
  • Repeat the last three steps (see the sketch below)
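These steps map directly onto a loop. A minimal sketch, assuming a toy linear model $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$ and a squared-error loss instead of a full multi-layer network (all data and names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # N = 100 samples, 3 features
t = X @ np.array([1.0, -2.0, 0.5])     # targets generated by a known rule
w = np.zeros(3)                        # weights to be learned
eta = 0.1                              # learning rate

for step in range(200):
    y = X @ w                          # estimates f_w(x_i) for all samples
    loss = np.mean((y - t) ** 2)       # (1/N) * sum of per-sample losses
    grad = 2 * X.T @ (y - t) / len(X)  # gradient of the loss w.r.t. w
    w -= eta * grad                    # tweak w to reduce the loss
print(loss, w)                         # w approaches (1, -2, 0.5)
```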
SLIDE 18

Loss function choice

  • There are positive and negative errors in regression, for which MSE is the most common loss function
  • There is a probability of the correct class in classification, for which cross-entropy is the most common loss function

[Plots: loss as a function of the error for the two cases]

SLIDE 19

Some loss functions and their derivatives

  • Terminology
    – $y$ is the output
    – $t$ is the target output
  • Mean square error
  • Loss: $(y - t)^2$
  • Derivative of the loss: $2(y - t)$
  • Cross-entropy
  • Loss: $-\sum_{d=1}^{D} t_d \log y_d$
  • Derivative of the loss: $-\frac{1}{y_d}$ at the correct class $d$
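A small NumPy sketch of these two losses and their derivatives (the numbers are made up):

```python
import numpy as np

def mse(y, t):
    """Mean square error and its derivative with respect to the output y."""
    return (y - t) ** 2, 2 * (y - t)

def cross_entropy(y, t):
    """Cross-entropy for a probability vector y and one-hot target t,
    with its derivative w.r.t. y (nonzero only at the correct class)."""
    return -np.sum(t * np.log(y)), -t / y

y = np.array([0.7, 0.2, 0.1])   # e.g. a softmax output
t = np.array([1.0, 0.0, 0.0])   # one-hot target
print(mse(0.8, 1.0))            # (0.04, -0.4)
print(cross_entropy(y, t))      # (~0.357, [-1.43, 0, 0])
```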

SLIDE 20

Computational graph of a single hidden layer NN

[Computational graph: x → W1·x + b1 → Z1 → ReLU → A1 → W2·A1 + b2 → Z2 → Softmax → A2 → cross-entropy (CE) loss against the target]
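A minimal NumPy sketch of the same computational graph (the shapes and values are made up):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, target):
    """Forward pass through the graph above: affine -> ReLU -> affine -> softmax -> CE loss."""
    Z1 = W1 @ x + b1
    A1 = np.maximum(0.0, Z1)          # ReLU
    Z2 = W2 @ A1 + b2
    A2 = np.exp(Z2 - Z2.max())
    A2 /= A2.sum()                    # softmax (stabilized)
    loss = -np.log(A2[target])        # cross-entropy against a one-hot target
    return Z1, A1, Z2, A2, loss

# Made-up sizes: 4 inputs, 5 hidden units, 3 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
print(forward(rng.normal(size=4), W1, b1, W2, b2, target=2)[-1])
```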

SLIDE 21

Advanced Machine Learning Backpropagation

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 22

Learning objectives

  • Write the derivative of a nested function using the chain rule
  • Articulate how storage of partial derivatives leads to an efficient gradient descent for neural networks
  • Write gradient descent as matrix operations
SLIDE 23

Overall function of a neural network

  • $f(\mathbf{x}_i) = g_m(\mathbf{W}_m \, g_{m-1}(\mathbf{W}_{m-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
  • Weights form a matrix
  • Outputs of the previous layer form a vector
  • The activation (nonlinear) function is applied point-wise to the weights times the input
  • Design questions (hyper-parameters):
    – Number of layers
    – Number of neurons in each layer (rows of the weight matrices)

SLIDE 24

Training the neural network

  • Given inputs $\mathbf{x}_i$ and targets $t_i$
  • Think of what hyper-parameters and neural network design might work
  • Form a neural network:
    $f(\mathbf{x}_i) = g_m(\mathbf{W}_m \, g_{m-1}(\mathbf{W}_{m-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
  • Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $t_i$ for all samples
  • Compute the loss: $\frac{1}{N}\sum_{i=1}^{N} L\!\left(f_{\mathbf{w}}(\mathbf{x}_i), t_i\right) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
  • Tweak the weights $\mathbf{w}$ to reduce the loss (optimization algorithm)
  • Repeat the last three steps
SLIDE 25

Gradient ascent

  • If you didn’t know the shape of a mountain
  • But at every step you knew the slope
  • Can you reach the top of the mountain?
SLIDE 26

Gradient descent minimizes the loss function

  • At every point, compute
  • The loss (scalar): $l_i(\mathbf{w})$
  • The gradient of the loss with respect to the weights (vector): $\nabla_{\mathbf{w}} l_i(\mathbf{w})$
  • Take a step towards the negative gradient (see the sketch below):
    $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$

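A minimal sketch of this update rule, using the simple quadratic $f(x_1, x_2) = 5x_1^2 + 3x_2^2$ from a later slide in place of a neural-network loss:

```python
import numpy as np

def grad_f(w):
    """Gradient of f(w1, w2) = 5*w1^2 + 3*w2^2."""
    return np.array([10 * w[0], 6 * w[1]])

w = np.array([2.0, 1.0])
eta = 0.05                      # learning rate
for _ in range(100):
    w = w - eta * grad_f(w)     # step towards the negative gradient
print(w)                        # approaches the minimum at (0, 0)
```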
SLIDE 27

Derivative of a function of a scalar

  • The derivative $f'(x) = \frac{d f(x)}{d x}$ is the rate of change of $f(x)$ with $x$
  • It is zero when the function is flat (horizontal), such as at a minimum or maximum of $f(x)$
  • It is positive when $f(x)$ is sloping up, and negative when $f(x)$ is sloping down
  • To move towards a maximum, take a small step in the direction of the derivative

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

SLIDE 28

Gradient of a function of a vector

  • Derivative with respect to each dimension, holding the other dimensions constant
  • $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right)$
  • At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction

[Surface plot of a function f(x1, x2)]

SLIDE 29

Gradient of a function of a vector

  • The gradient gives a direction for moving towards the minimum
  • Take a small step towards the negative of the gradient

[Surface plot of a function f(x1, x2)]

SLIDE 30

Example of gradient

  • Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2$
  • Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) = (10x_1, \; 6x_2)$
  • At the location $(2, 1)$, a step in the direction $(20, 6)$, i.e. $(0.958, 0.287)$ after normalization, leads to the maximal increase in the function
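The same example in NumPy, with a finite-difference sanity check of the gradient:

```python
import numpy as np

def f(x):
    return 5 * x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    return np.array([10 * x[0], 6 * x[1]])

x = np.array([2.0, 1.0])
g = grad_f(x)
print(g, g / np.linalg.norm(g))    # [20. 6.] and [0.958 0.287]

# Finite-difference approximation of the same gradient
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(fd)                          # ~[20. 6.]
```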

SLIDE 31

This story is unfolding in multiple dimensions

SLIDE 32

Backpropagation

  • Backpropagation is an efficient method to do gradient descent
  • It saves the gradient w.r.t. the upper layer’s output to compute the gradient w.r.t. the weights immediately below
  • It is linked to the chain rule of derivatives
  • All intermediary functions must be differentiable, including the activation functions

[Diagram: a dense feed-forward network with inputs x1 … xd, a hidden layer h11 … h1n, bias nodes, and outputs y1 … yn]

SLIDE 33

Chain rule of differentiation

  • Very handy for complicated functions
  • Especially functions of functions
  • E.g. NN outputs are functions of previous layers
  • For example, let $f(x) = g(h(x))$
  • Let $y = h(x)$, $z = g(y) = g(h(x))$
  • Then $f'(x) = \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} = g'(y)\, h'(x)$
  • For example: $\frac{d \sin(x^2)}{dx} = 2x \cos(x^2)$
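A quick numerical check of that last example:

```python
import numpy as np

# Check the chain-rule result d/dx sin(x^2) = 2x cos(x^2) at a sample point
def f(x):
    return np.sin(x ** 2)

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = 2 * x * np.cos(x ** 2)                 # chain-rule derivative
print(numeric, analytic)                          # both ~ -0.309
```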

SLIDE 34

Backpropagation makes use of chain rule of derivatives

[Computational graph: x → W1·x + b1 → Z1 → ReLU → A1 → W2·A1 + b2 → Z2 → Softmax → A2 → cross-entropy (CE) loss against the target]

  • Chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$

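A minimal sketch of backpropagation through the single-hidden-layer graph above (shapes and values are made up; the softmax and cross-entropy derivatives are folded into the single term A2 − t):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
target = 2

# Forward pass, saving intermediate outputs for reuse in the backward pass
Z1 = W1 @ x + b1
A1 = np.maximum(0.0, Z1)
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()
t = np.eye(3)[target]
loss = -np.log(A2[target])

# Backward pass: each gradient reuses the one computed just above it
dZ2 = A2 - t                   # d loss / d Z2 (softmax + cross-entropy combined)
dW2 = np.outer(dZ2, A1)        # d loss / d W2
db2 = dZ2
dA1 = W2.T @ dZ2               # propagate to the layer below
dZ1 = dA1 * (Z1 > 0)           # ReLU gate
dW1 = np.outer(dZ1, x)         # d loss / d W1
db1 = dZ1
print(loss, dW1.shape, dW2.shape)
```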
SLIDE 35

Vector valued functions and Jacobians

  • We often deal with functions that give multiple outputs
  • Let $\mathbf{f}(\mathbf{x}) = \begin{pmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \end{pmatrix}$
  • Thinking in terms of a vector of functions can make the representation less cumbersome and the computations more efficient
  • Then the Jacobian is
  • $\mathbf{J}(\mathbf{f}) = \begin{pmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \frac{\partial \mathbf{f}}{\partial x_3} \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{pmatrix}$

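A small sketch of computing such a Jacobian numerically, for a made-up two-output function of three inputs:

```python
import numpy as np

def f(x):
    """A made-up vector-valued function with 3 inputs and 2 outputs."""
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian: row i holds the partials of output i."""
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 0.5])
print(jacobian(f, x))
# analytic rows: [x2, x1, 0] = [2, 1, 0] and [2*x1, 0, cos(x3)] = [2, 0, ~0.878]
```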
SLIDE 36

Jacobian of each layer

  • Compute the derivatives of a higher layer’s output with respect to those of the lower layer
  • What if we scale all the weights by a factor R?
  • What happens a few layers down?
SLIDE 37

Role of step size and learning rate

  • A tale of two loss functions
    – Same value, and
    – Same gradient (first derivative), but
    – Different Hessian (second derivative)
    – Different step sizes needed
  • Success is not guaranteed
SLIDE 38

The perfect step size is impossible to guess

  • Goldilocks finds the perfect balance only in a fairy tale
  • The step size is decided by the learning rate $\eta$ and the gradient

SLIDE 39

Double derivative

  • The double derivative $f''(x) = \frac{d^2 f(x)}{d x^2}$ is the derivative of the derivative of $f(x)$
  • The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

SLIDE 40

Double derivative

  • The double derivative tells how far the minimum might be from a given point
  • From $x = 0$ the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)

$f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

SLIDE 41

Perfect step size for a paraboloid

  • Let $f(x) = ax^2 + bx + c$
  • Assuming $a > 0$
  • The minimum is at: $x^* = -\frac{b}{2a}$
  • For any $x$ the perfect step would be: $-\frac{b}{2a} - x = -\frac{2ax + b}{2a} = -\frac{f'(x)}{f''(x)}$
  • So, the perfect learning rate is: $\eta^* = \frac{1}{f''(x)}$
  • In multiple dimensions, $\mathbf{x} \leftarrow \mathbf{x} - H(f(\mathbf{x}))^{-1} \nabla(f(\mathbf{x}))$
  • Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
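A sketch of this “perfect step” (a Newton step) on a made-up 1-d quadratic:

```python
# Newton's method on f(x) = a*x^2 + b*x + c: one step of size 1/f''(x)
# lands exactly on the minimum, whatever the starting point.
a, b, c = 3.0, -12.0, 1.0               # made-up coefficients, a > 0
f_prime = lambda x: 2 * a * x + b
f_double_prime = lambda x: 2 * a

x = 10.0                                # arbitrary starting point
x = x - f_prime(x) / f_double_prime(x)  # perfect step = -f'(x) / f''(x)
print(x, -b / (2 * a))                  # both print 2.0, the exact minimum
```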

SLIDE 42

Hessian of a function of a vector

  • The double derivatives with respect to each pair of dimensions form the Hessian matrix
  • If all eigenvalues of the Hessian matrix are positive, then the function is convex

[Surface plot of a function f(x1, x2)]

SLIDE 43

Example of Hessian

  • Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2 + 4x_1 x_2$
  • Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) = (10x_1 + 4x_2, \; 6x_2 + 4x_1)$
  • And $H(f(\mathbf{x})) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{pmatrix} = \begin{pmatrix} 10 & 4 \\ 4 & 6 \end{pmatrix}$
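A quick check of the convexity criterion from the previous slide, using the eigenvalues of this Hessian:

```python
import numpy as np

# Hessian of f(x1, x2) = 5*x1^2 + 3*x2^2 + 4*x1*x2 (constant for a quadratic)
H = np.array([[10.0, 4.0],
              [4.0, 6.0]])
eigvals = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
print(eigvals)                    # ~[3.53, 12.47]; all positive, so f is convex
```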

SLIDE 44

Saddle points, Hessian and long local furrows

  • Some variables may have reached a local minimum while others have not
  • Some weights may have an almost zero gradient
  • At least some eigenvalues of the Hessian may not be positive

SLIDE 45

Complicated loss functions

SLIDE 46

A realistic picture

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/

[Figure labels: saddle point, local minima, global minima?, local maxima]