Deep Feedforward Networks
Thanks to Sargur Srihari, Alexander Ororbia, Christopher Olah
Deep Learning Srihari
Topics
- Overview
- 1. Example: Learning XOR
- 2. Gradient-Based Learning
- 3. Hidden Units
- 4. Architecture Design
- 5. Backpropagation and Other Differentiation Algorithms
- 6. Historical Notes
Feedforward Neural Networks:
quintessential deep learning models
- Deep feedforward networks are also called
– Feedforward neural networks, or
– Multilayer Perceptrons (MLPs)
- Their goal is to approximate some function f *
– E.g., a classifier y = f *(x) maps an input x to a category y
– A feedforward network defines a mapping y = f (x; θ) and learns the values of the parameters θ that result in the best function approximation
Feedforward Network
- Models are called Feedforward because:
– Information flows through the function being evaluated, from x through the intermediate computations used to define f, and finally to the output y
- There are no feedback connections
– No outputs of the model are fed back
- Example: the US Presidential Election
– Inputs are the raw votes cast
– Hidden layer is the electoral college
– Outputs are the candidates
Feedforward vs. Recurrent
- When feedforward neural networks are
extended to include feedback connections they are called Recurrent Neural Networks
Importance of Feedforward Networks
- They are extremely important to ML practice
- Form basis for many commercial applications
- 1. Convolutional networks are a special kind of feedforward network
– used for recognizing objects from photos
- 2. They are a conceptual stepping stone on path
to recurrent networks
- Which power many NLP applications
Feedforward Neural Network Structures
- They are called networks because they are
composed of many different functions
- The model is associated with a directed acyclic graph describing how the functions are composed
– E.g., functions f (1), f (2), f (3) connected in a chain to form f (x) = f (3)( f (2)( f (1)(x) ) )
- f (1) is called the first layer of the network
- f (2) is called the second layer, etc
- These chain structures are the most commonly
used structures of neural networks
Definition of Depth
- Overall length of the chain is the depth of
the model
- The name deep learning arises from this
terminology
- Final layer of a feedforward network is
called the output layer
A Feed-forward Neural Network
y_k(x,w) = σ( Σ_{j=1}^{M} w_kj^(2) h( Σ_{i=1}^{D} w_ji^(1) x_i + w_j0^(1) ) + w_k0^(2) )

f_m^(1) = z_m = h(x, w^(1)), m = 1,..,M
f_k^(2) = σ(z, w^(2)), k = 1,..,K
K outputs y_1,..,y_K for a given input x; the hidden layer consists of M units
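The forward computation above can be sketched in NumPy. This is a minimal illustration, not the slides' code; the choice of tanh for the hidden activation h and the random weight shapes are my assumptions:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward net: y = sigmoid(W2 h + b2) with h = tanh(W1 x + b1)."""
    h = np.tanh(W1 @ x + b1)             # M hidden-unit activations
    a = W2 @ h + b2                      # K output pre-activations
    return 1.0 / (1.0 + np.exp(-a))      # logistic sigmoid output

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                        # input, hidden, output dimensions
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
y = forward(rng.normal(size=D), W1, b1, W2, b2)   # K outputs, each in (0, 1)
```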
Example of Feedforward Network
- Hidden layer compares raw pixel inputs to
component patterns
Optical Character Recognition (OCR)
Training the Network
- In network training we drive f(x) to match f*(x)
- Training data provides us with noisy,
approximate examples of f*(x) evaluated at different training points
- Each example accompanied by label y ≈ f*(x)
- Training examples specify directly what the output layer must do at each point x
– It must produce a value that is close to y
What are Hidden Layers?
- Behavior of other layers is not directly specified
by the data
- Learning algorithm must decide how to use
those layers to produce value that is close to y
- Training data does not say what individual
layers should do
- Since the desired output for these layers is not
shown, they are called hidden layers
Networks and Neuroscience
- These networks are loosely inspired by
neuroscience
- Each hidden layer is typically vector-valued
– Dimensionality of the hidden layer is the width of the model
– Each element of the vector is viewed as a neuron
– Instead of thinking of the layer as a vector-to-vector function, we regard its elements as units acting in parallel
- Each unit receives inputs from many other units and computes its own activation value
Function Approximation is the Goal
- Choice of functions f (i)(x):
– Loosely guided by neuroscientific observations about biological neurons
- Modern neural networks are guided by many mathematical and engineering disciplines
- They do not perfectly model the brain
- Think of feedforward networks as function approximation machines
– Designed to achieve statistical generalization
– Occasionally drawing insights from what we know about the brain
– Rather than as models of brain function
Extending Linear Models
- To represent non-linear functions of x apply
linear model not to x but to a transformed input ϕ(x) where ϕ is non-linear
– Equivalently, the kernel trick obtains a nonlinearity
- SVM: f (x) = wTx + b can be written as f (x) = b + Σi αi xTx(i)
- Replacing x by features ϕ(x) gives the kernel k(x, x(i)) = ϕ(x)Tϕ(x(i))
- The weights αi are found by convex optimization of the Lagrangian dual
- f is evaluated only over the samples with non-zero αi (the support vectors)
- ϕ provides a set of features describing x
- Replace x by the function ϕ(x)
View as Extension of Linear Models
- Begin with linear models and see their limitations
– Linear regression: y(x,w) = Σ_{j=0}^{M−1} w_j ϕ_j(x) = wTϕ(x)
- Simple closed-form solution: w_ML = (ΦTΦ)^{−1} ΦTt, where Φ is the N×M design matrix with elements Φ_nj = ϕ_j(x_n)
- Or solved with convex optimization (SGD): w^(τ+1) = w^(τ) − η ∇E_n, with ∇E_n = −{ t_n − wTϕ(x_n) } ϕ(x_n) and error E_D(w) = ½ Σ_{n=1}^{N} { t_n − wTϕ(x_n) }²
– Logistic regression: y(x,w) = σ( wTϕ(x) )
- No closed-form solution
- Convex optimization: w^(τ+1) = w^(τ) − η ∇E_n, with ∇E_n = ( y_n − t_n ) ϕ(x_n)
- If ϕ(x) = x, model capacity is limited to linear functions and the model has no understanding of the interaction between any two input variables
Three methods to choose ϕ
- 1. Use a generic feature function ϕ(x)
– E.g., the RBF kernel N(x; x(i), σ²I) centered at each x(i)
- 2. Manually engineer ϕ
– Dominant approach until the arrival of deep learning
– Requires decades of effort, e.g., in speech recognition and computer vision
– Laborious, and non-transferable between domains
- 3. Learn ϕ
– The principle used in deep learning
Approach 3: Learn Features
- Model is y=f (x;θ,w) = ϕ(x;θ)T w
– θ is used to learn ϕ from a broad class of functions
– Parameters w map from ϕ(x) to the output
– This defines a feedforward network in which ϕ defines a hidden layer
- Unlike the other two approaches (fixed basis functions, manual engineering), this approach gives up on convexity of training
– But its benefits outweigh its harms
Extend Linear Methods to Learn ϕ
Can be viewed as a generalization of linear models
- Nonlinear function f_k with M+1 parameters w_k = (w_k0,..,w_kM)
- M basis functions ϕ_j, j = 1,..,M, each with D+1 parameters θ_j = (θ_j0,..,θ_jD)
- Both w_k and θ_j are learnt from data

y_k(x;θ,w) = Σ_{j=1}^{M} w_kj ϕ_j( Σ_{i=1}^{D} θ_ji x_i + θ_j0 ) + w_k0

K outputs y_1,..,y_K for a given input x; the hidden layer consists of M units
y_k = f_k(x;θ,w) = ϕ(x;θ)T w
Approaches to Learning ϕ
- Parameterize the basis functions as ϕ(x;θ)
– Use optimization to find θ that corresponds to a good representation
- Approach can capture benefit of first approach
(fixed basis functions) by being highly generic
– By using a broad family for ϕ(x;θ)
- Can also capture benefits of second approach
– Human practitioners design families ϕ(x;θ) that they expect will perform well
– We need only find the right function family rather than precisely the right function
Importance of Learning ϕ
- Learning ϕ is discussed beyond this first
introduction to feed-forward networks
– It is a recurring theme throughout deep learning applicable to all kinds of models
- Feedforward networks are the application of this principle to learning deterministic mappings from x to y without feedback
- The principle is also applicable to
– learning stochastic mappings
– functions with feedback
– learning probability distributions over a single vector
Plan of Discussion: Feedforward Networks
- 1. A simple example: learning XOR
- 2. Design decisions for a feedforward network
– Many are same as for designing a linear model
- Basics of gradient descent
– Choosing the optimizer, Cost function, Form of output units
– Some are unique
- Concept of hidden layer
– Makes it necessary to have activation functions
- Architecture of the network
– How many layers, how they are connected to each other, how many units in each layer
- Learning requires gradients of complicated functions
– Backpropagation and modern generalizations
- 1. Ex: XOR problem
- XOR: an operation on binary variables x1 and x2
– When exactly one value equals 1 it returns 1; otherwise it returns 0
– Target function is y=f *(x) that we want to learn
- Our model is y =f ([x1, x2] ; θ) which we learn, i.e., adapt
parameters θ to make it similar to f *
- Not concerned with statistical generalization
– Perform correctly on four training points:
- X={[0,0]T, [0,1]T,[1,0]T, [1,1]T}
– Challenge is to fit the training set
- We want f ([0,0]T; θ) = f ([1,1]T; θ) = 0
- f ([0,1]T; θ) = f ([1,0]T; θ) = 1
ML for XOR: linear model doesn’t fit
- Treat it as regression with MSE loss function
– MSE is usually not used for binary data
– But the math is simple
- We must choose the form of the model
- Consider a linear model with θ ={w,b} where
– Minimize to get closed-form solution
- Differentiate wrt w and b to obtain w = 0 and b=½
– Then the linear model f(x;w,b)=½ simply outputs 0.5 everywhere
– Why does this happen?
J(θ) = (1/4) Σ_{x∈X} ( f *(x) − f(x;θ) )² = (1/4) Σ_{n=1}^{4} ( f *(x_n) − f(x_n;θ) )²

With the linear model f(x;w,b) = xTw + b this becomes J(θ) = (1/4) Σ_{n=1}^{4} ( t_n − x_nTw − b )²

An alternative is the cross-entropy J(θ) = −ln p(t|θ) = −Σ_{n=1}^{N} { t_n ln y_n + (1−t_n) ln(1−y_n) }, with y_n = σ(θTx_n)
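The claim that the fitted linear model outputs 0.5 everywhere is easy to check numerically. A minimal sketch in NumPy (using `np.linalg.lstsq` to solve the least-squares problem is my choice, not from the slides):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # the four XOR inputs
t = np.array([0., 1., 1., 0.])                           # XOR targets

# Augment with a column of ones so the last coefficient is the bias b
A = np.hstack([X, np.ones((4, 1))])
w1, w2, b = np.linalg.lstsq(A, t, rcond=None)[0]
# Minimizing the MSE yields w = 0 and b = 1/2: f(x) = 0.5 everywhere
```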
Linear model cannot solve XOR
- Bold numbers are the values the system must output
- When x1=0, output has to increase with x2
- When x1=1, output has to decrease with x2
- Linear model f (x;w,b)= x1w1+x2w2+b has to assign a
single weight to x2, so it cannot solve this problem
- A better solution:
– use a model to learn a different representation
- in which a linear model is able to represent the solution
– We use a simple feedforward network
- one hidden layer containing two hidden units
Feedforward Network for XOR
- Introduce a simple feedforward
network
– with one hidden layer containing two units
- Same network drawn in two different
styles
– Matrix W describes the mapping from x to h
– Vector w describes the mapping from h to y
– Intercept parameters b are omitted
Functions computed by Network
- Layer 1 (hidden layer): vector of hidden
units h computed by function f (1)(x; W,c)
– c are bias variables
- Layer 2 (output layer) computes
f (2)(h; w,b)
– w are linear regression weights
– Output is linear regression applied to h rather than to x
- Complete model is
f (x; W,c,w,b)=f (2)(f (1)(x))
Linear vs Nonlinear functions
- If we choose both f (1) and f (2) to be linear, the total function will still be linear
– Suppose f (1)(x) = WTx and f (2)(h) = hTw
– Then we could represent this function as f (x) = xTw', where w' = Ww
- Since linear is insufficient, we must use a nonlinear function to describe the features
– We use the strategy of neural networks: a nonlinear activation function h = g(WTx + c)
Activation Function
- In linear regression we used a vector of weights w and a scalar bias b: f(x;w,b) = xTw + b
– describing an affine transformation from an input vector to an output scalar
- Now we describe an affine transformation from
a vector x to a vector h, so an entire vector of bias parameters is needed
- Activation function g is typically chosen to be
applied element-wise hi=g(xTW:,i+ci)
Default Activation Function
- Activation: g(z) = max{0, z}, the Rectified Linear Unit (ReLU)
– Applying this to the output of a linear transformation yields a nonlinear transformation
– However, the function remains close to linear
- Piecewise linear, with two pieces
- ReLUs therefore preserve the properties that make linear models easy to optimize with gradient-based methods
- They also preserve many of the properties that make linear models generalize well
- A principle of CS: build complicated systems from minimal components
– A Turing machine's memory needs only 0 and 1 states
– We can build a universal function approximator from ReLUs
Specifying the Network using ReLU
- Activation: g(z)=max{0,z}
- We can now specify the complete network as
f (x; W,c,w,b)=f (2)(f (1)(x))=wT max {0,WTx+c}+b
We can now specify XOR Solution
- Let
- Now walk through how model processes a
batch of inputs
- Design matrix X of all four points:
- First step is XW:
- Adding c:
- Compute h Using ReLU
- Finish by multiplying by w:
- Network has obtained
correct answer for all 4 examples
W = [1 1; 1 1],  c = [0, −1]T,  w = [1, −2]T,  b = 0

Design matrix of all four points (one input per row):
X = [0 0; 0 1; 1 0; 1 1]

XW = [0 0; 1 1; 1 1; 2 2]
XW + c = [0 −1; 1 0; 1 0; 2 1]
h = max{0, XW + c} = [0 0; 1 0; 1 0; 2 1]
f(X) = hw + b = [0, 1, 1, 0]T

f (x; W,c,w,b) = wT max{0, WTx + c} + b

Figure captions: after XW, all points lie along a line with slope 1, which a linear model cannot handle; adding c and applying the ReLU changes the relationship among the examples, so they no longer lie on a single line and a linear model suffices.
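The walk-through above can be reproduced directly in NumPy (a sketch; the variable names are mine):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # design matrix, one input per row
W = np.array([[1., 1.], [1., 1.]])                      # input-to-hidden weights
c = np.array([0., -1.])                                 # hidden biases
w = np.array([1., -2.])                                 # hidden-to-output weights
b = 0.0                                                 # output bias

H = np.maximum(0.0, X @ W + c)   # hidden representation h = max{0, XW + c}
y = H @ w + b                    # linear output layer gives [0, 1, 1, 0]
```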
Learned representation for XOR
- The two points that must have output 1 have been collapsed into a single point in feature space
– Points x=[0,1]T and x=[1,0]T have both been mapped to h=[1,0]T
- In the learned h-space a linear model can describe the solution
– For fixed h2, the output increases linearly in h1
- At the three mapped points h=[0,0]T, [1,0]T, [2,1]T the network outputs 0, 1 and 0, respectively
- Contrast with the original space: when x1=0 the output has to increase with x2, and when x1=1 it has to decrease with x2
About the XOR example
- We simply specified the solution
– Then showed that it achieves zero error
- In real situations there might be billions of
parameters and billions of training examples
– So one cannot simply guess the solution
- Instead gradient descent optimization can find
parameters that produce very little error
– The solution described is at the global minimum
- Gradient descent could converge to this solution
- Convergence depends on initial values
- Would not always find easily understood integer solutions
Topics in Gradient-based Learning
- Overview
- 1. Cost Functions
- 1. Learning Conditional Distributions with Max
Likelihood
- 2. Learning Conditional Statistics
- 2. Output Units
- 1. Linear Units for Gaussian Output Distributions
- 2. Sigmoid Units for Bernoulli Output Distributions
- 3. Softmax Units for Multinoulli Output Distributions
- 4. Other Output Types
Overview of Gradient-based Learning
Standard ML Training vs NN Training
- Neural network training is not different from training other ML models with gradient descent. We need:
- 1. optimization procedure, e.g., gradient descent
- 2. cost function, e.g., MLE
- 3. model family, e.g., linear with basis functions
- Difference: nonlinearity causes non-convex loss
– Use iterative gradient-based optimizers that merely drive the cost to a low value, rather than
- Exact linear equation solvers used for linear regression or
- convex optimization algorithms used for logistic
regression or SVMs
Convex vs Non-convex
- Convex methods:
– Converge from any initial parameters
– Robust, but can encounter numerical problems
- SGD with non-convex loss:
– Sensitive to initial parameters
– For feedforward networks, it is important to initialize weights to small random values and biases to zero or to small positive values
– SGD can also train linear regression and SVMs, especially with large training sets
– Training a neural net is otherwise similar to training other models, except that computing the gradient is more complex

Linear regression with basis functions: E_D(w) = ½ Σ_{n=1}^{N} { t_n − wTϕ(x_n) }²
Cost Functions
Cost Functions for Deep Learning
- Important aspect of design of deep neural
networks is the cost function
– They are similar to those for parametric models such as linear models
- Parametric model: logistic regression
– Binary training data defines a likelihood p(y |x ;θ)
- data set {ϕ_n, t_n}, with t_n ∈ {0,1} and ϕ_n = ϕ(x_n)
– and we use the principle of maximum likelihood
- Cost function: cross-entropy between training data tn and the
model’s prediction yn
- Gradient of the error function (using dσ(a)/da = σ(1−σ)) is

∇J(θ) = Σ_{n=1}^{N} ( y_n − t_n ) ϕ_n

Model: p(C1|ϕ) = y(ϕ) = σ(θTϕ)
Likelihood: p(t|θ) = Π_{n=1}^{N} y_n^{t_n} { 1 − y_n }^{1−t_n}, with y_n = σ(θTϕ_n)
Cost: J(θ) = −ln p(t|θ) = −Σ_{n=1}^{N} { t_n ln y_n + (1−t_n) ln(1−y_n) }
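The gradient formula above is easy to verify against finite differences. A sketch in NumPy (the random data and helper names are invented for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(theta, Phi, t):
    """Cross-entropy J(theta) for y_n = sigmoid(theta^T phi_n)."""
    y = sigmoid(Phi @ theta)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))                    # rows are the feature vectors phi_n
t = rng.integers(0, 2, size=20).astype(float)     # binary targets
theta = rng.normal(size=3)

grad = Phi.T @ (sigmoid(Phi @ theta) - t)         # analytic: sum_n (y_n - t_n) phi_n

eps = 1e-6                                        # central finite differences
num = np.array([(nll(theta + eps * e, Phi, t) - nll(theta - eps * e, Phi, t)) / (2 * eps)
                for e in np.eye(3)])
```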
Learning Conditional Distributions with maximum likelihood
- Specifying the model p(y |x) automatically determines a cost function: the negative log-likelihood −log p(y |x)
– Equivalently described as the cross-entropy between the training data and the model distribution – Gaussian case:
- If pmodel(y|x) =N ( y| f (x ; θ), I)
- then we recover the mean squared error cost
- up to a scaling factor of ½ and a term independent of θ
– the constant depends on the variance of the Gaussian, which we chose not to parameterize
J(θ) = −E_{x,y∼p̂_data} log p_model(y|x)

For p_model(y|x) = N( y; f(x;θ), I ) ∝ exp( −½ || y − f(x;θ) ||² ), this becomes

J(θ) = ½ E_{x,y∼p̂_data} || y − f(x;θ) ||² + const
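The equivalence stated above (Gaussian negative log-likelihood = half the squared error plus a θ-independent constant) can be checked numerically; this sketch assumes a D-dimensional Gaussian with identity covariance:

```python
import numpy as np

def gaussian_nll(y, mu):
    """-log N(y; mu, I) for a D-dimensional Gaussian with identity covariance."""
    D = y.shape[0]
    return 0.5 * np.sum((y - mu) ** 2) + 0.5 * D * np.log(2 * np.pi)

rng = np.random.default_rng(2)
y, mu = rng.normal(size=4), rng.normal(size=4)

half_sq_err = 0.5 * np.sum((y - mu) ** 2)   # the MSE term
const = 0.5 * 4 * np.log(2 * np.pi)         # independent of the model parameters
```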
Desirable Property of Gradient
- Recurring theme in neural network design is:
– Gradient must be large and predictable enough to serve as good guide to the learning algorithm
- Functions that saturate (become very flat)
undermine this objective
– Because the gradient becomes very small
- This happens when the activation functions producing the output of hidden/output units saturate
Keeping the Gradient Large
- Negative log-likelihood helps avoid saturation
problem for many models
– Many output units involve exp functions that saturate when the argument is very negative
– The log in the negative log-likelihood cost undoes the exp of such units
Cross Entropy and Gradient
- A property of cross-entropy cost used for MLE
is that it does not have a minimum value
– For discrete output variables, models cannot represent a probability of exactly zero or one, but can come arbitrarily close
- Logistic Regression is an example
– For real-valued output variables it becomes possible to assign extremely high density to the correct training-set outputs, e.g., by shrinking the variance parameter of a Gaussian output, and the cross-entropy then approaches negative infinity
- Regularization modifies learning problem so
model cannot reap unlimited reward this way
Learning Conditional Statistics
- Instead of learning a full probability distribution,
learn just one conditional statistic of y given x
– E.g., we may have a predictor f (x ;θ) which gives the mean of y
- Think of the neural network as being powerful enough to represent any function f
– This function is limited only by
- boundedness and
- continuity
- rather than by having a specific parametric form
– From this point of view, cost function is a functional rather than a function
Cost Function vs Cost Functional
- Cost function is a functional, not a function
– A functional is a mapping from functions to real numbers
- We can think of learning as a task of choosing
a function rather than a set of parameters
- We design the cost functional so that its minimum occurs at a function we desire
– E.g., design the cost functional to have its minimum at the function f that maps x to the expected value of y given x
Optimization via Calculus of Variations
- Solving the optimization problem requires a
mathematical tool: calculus of variations
– E.g., Minimum of Cost functional is:
- function that maps x to the expected value of y given x
- Only necessary to understand that calculus of
variations can be used to derive two results
First Result from Calculus of Variations
- Solving the optimization problem

f* = argmin_f E_{x,y∼p̂_data} || y − f(x) ||²

yields

f*(x) = E_{y∼p_data(y|x)} [ y ]

- which means that if we could train on infinitely many samples from the true data-generating distribution, minimizing MSE gives a function that predicts the mean of y for each value of x
Second Result from Calculus of Variations
- A different cost function is

f* = argmin_f E_{x,y∼p_data} || y − f(x) ||_1

– which yields a function that predicts the median of y for each x
– The corresponding cost is referred to as mean absolute error (MAE)
- MSE and MAE often lead to poor results with gradient-based optimization: output units that saturate produce very small gradients with these costs
– This is one reason the cross-entropy cost is more popular than mean squared error and mean absolute error, even when it is not necessary to estimate the entire distribution p(y |x)
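The two results above (squared error is minimized by the mean, absolute error by the median) can be illustrated by searching over constant predictions c for a fixed x. The skewed distribution and the search grid are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(size=2000)          # skewed samples, so mean != median

c = np.linspace(0.0, 3.0, 301)          # candidate constant predictions f(x) = c
mse = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - c[None, :]).mean(axis=0)

best_mse = c[np.argmin(mse)]            # lands at the sample mean
best_mae = c[np.argmin(mae)]            # lands at the sample median
```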
Output Units
- Choice of cost function is tightly coupled with
choice of output unit
– Most of the time we use cross-entropy between data distribution and model distribution
- Choice of how to represent the output then determines
the form of the cross-entropy function
Cross-entropy in logistic regression, with θ = {w, b}:
J(θ) = −ln p(t|θ) = −Σ_{n=1}^{N} { t_n ln y_n + (1−t_n) ln(1−y_n) }, with y_n = σ(θTx_n)
Role of Output Units
- Any output unit is also usable as a hidden unit
- Our focus here is on units used as outputs, not internally
– Revisit it when discussing hidden units
- A feedforward network provides a hidden set of
features h =f (x ; θ)
- Role of output layer is to provide some
additional transformation from the features to the task that network must perform
Types of output units
- 1. Linear units: no non-linearity
– for Gaussian Output distributions
- 2. Sigmoid units
– for Bernoulli Output Distributions
- 3. Softmax units
– for Multinoulli Output Distributions
- 4. Other Output Types
– Not direct prediction of y but provide parameters of distribution over y
Linear Units for Gaussian Output Distributions
- Linear unit: simple output based on affine
transformation with no nonlinearity
– Given features h, a layer of linear output units produces a vector ŷ = WTh + b
- Linear units are often used to produce the mean of a conditional Gaussian distribution: P(y |x) = N( y; ŷ, I )
- Maximizing the log-likelihood is then equivalent to minimizing the mean squared error
- Linear units can be used to learn the covariance of a Gaussian too
Sigmoid Units for Bernoulli Output Distributions
- Task of predicting value of binary variable y
– Classification problem with two classes
- Maximum likelihood approach is to define a
Bernoulli distribution over y conditioned on x
- Neural net needs to predict p(y=1|x)
– which lies in the interval [0,1]
- The constraint needs careful design
– Suppose we were to use P(y = 1 |x) = max{ 0, min{ 1, wTh + b } }
– We would define a valid conditional distribution, but could not train it effectively with gradient descent: whenever wTh + b lies outside [0,1] the gradient is 0, and the learning algorithm cannot be guided
Sigmoid and Logistic Regression
- Using sigmoid always gives a strong gradient
– Sigmoid output units combined with maximum likelihood
- where σ (x) is the logistic sigmoid function:
- Sigmoid output unit has two components:
- 1. A linear layer to compute
- 2. Use sigmoid activation function to convert z into a
probability
ŷ = σ( wTh + b ),  where σ(x) = 1 / ( 1 + exp(−x) ) and z = wTh + b
Probability distribution using Sigmoid
- Describe a probability distribution over y using z (y is the output, z is the input)
– Construct unnormalized probability distribution
- Assuming unnormalized log probability is linear in y and z
- Normalizing yields a Bernoulli distribution controlled by σ
– Probability distributions based on exponentiation and normalization are common throughout statistical modeling
- The variable z defining such a distribution over binary variables is called a logit
log P̃(y) = yz,  so P̃(y) = exp(yz)
P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z) = σ( (2y−1) z ),  with z = wTh + b
Max Likelihood Loss Function
- Given binary y and some z, a normalized probability distribution over y is

P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z) = σ( (2y−1) z )

- We can use this approach in maximum-likelihood learning
– The loss for maximum-likelihood learning is −log P(y |x):

J(θ) = −log P(y |x) = −log σ( (2y−1) z ) = ζ( (1−2y) z )

where ζ is the softplus function
- This is for a single sample
Softplus function
- Sigmoid saturates when its argument is very
positive or very negative
– i.e., function is insensitive to small changes in input
- Another function is the softplus function
ζ(x) = log(1+ exp(x))
– Its range is (0,∞). It arises in expressions involving sigmoids.
- Its name comes from its being a smoothed or
softened version of x+=max(0, x)
Properties of Sigmoid & Softplus
The identity ζ(x) − ζ(−x) = x provides extra justification for the name "softplus": ζ is a smoothed or softened version of the positive part function x⁺ = max{0, x}, the counterpart of the negative part function x⁻ = max{0, −x}.
Loss Function for Bernoulli MLE
– Rewriting the loss in terms of the softplus function shows that it saturates only when (1−2y)z ≪ 0
– Saturation occurs only when the model already has the right answer
- i.e., when y=1 and z ≫ 0, or y=0 and z ≪ 0
- When z has the wrong sign, (1−2y)z simplifies to |z|
– As |z| becomes large while z has the wrong sign, the softplus asymptotes to its argument |z| and its derivative w.r.t. z asymptotes to sign(z), so in the limit of extremely incorrect z the softplus does not shrink the gradient at all
– This is a useful property: gradient-based learning can act quickly to correct a mistaken z
J(θ) = −logP(y | x) = −logσ((2y −1)z) =ζ((1 - 2y)z)
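The identity in the loss above, −log σ((2y−1)z) = ζ((1−2y)z), can be confirmed numerically (a sketch; `np.logaddexp` serves here as a stable softplus):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(x):
    return np.logaddexp(0.0, x)          # numerically stable log(1 + exp(x))

z = np.linspace(-5.0, 5.0, 101)
loss_y1 = -np.log(sigmoid(z))            # -log sigma((2*1 - 1) z)
loss_y0 = -np.log(sigmoid(-z))           # -log sigma((2*0 - 1) z)
# These match softplus((1 - 2y) z) for y = 1 and y = 0 respectively
```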
Cross-Entropy vs Softplus Loss
– Cross-entropy loss can saturate anytime σ(z) saturates
- Sigmoid saturates to 0 when z becomes very negative
and saturates to 1 when z becomes very positive
– The gradient can shrink too small to be useful for learning, whether the model has the correct or the incorrect answer
– We have provided an alternative implementation of logistic regression!
p(y|θ) = Π_{n=1}^{N} σ(θTx_n)^{y_n} { 1 − σ(θTx_n) }^{1−y_n}
J(θ) = −ln p(y|θ) = −Σ_{n=1}^{N} { y_n ln σ(θTx_n) + (1−y_n) ln( 1 − σ(θTx_n) ) }

versus

J(θ) = −log P(y |x) = −log σ( (2y−1) z ) = ζ( (1−2y) z ),  with z = θTx + b
Softmax units for Multinoulli Output
- Any time we want a probability distribution over a discrete variable with n values, we may use the softmax function
– Can be seen as a generalization of sigmoid function used to represent probability distribution over a binary variable
- Softmax most often used for output of classifier
to represent distribution over n classes
– Also inside the model itself when we wish to choose between one of n options
From Sigmoid to Softmax
- Binary case: we wished to produce a single number ŷ = P(y = 1 |x)
– Since (i) this number needed to lie between 0 and 1, and (ii) we wanted its logarithm to be well-behaved for gradient-based optimization of the log-likelihood, we chose instead to predict the logit z = log P̃(y = 1 |x)
– Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid:

P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z) = σ( (2y−1) z ),  with z = wTh + b

- Case of n values: we need to produce a vector ŷ with values ŷ_i = P(y = i |x)
Softmax definition
- We need to produce a vector ŷ with values ŷ_i = P(y = i |x)
- The elements of ŷ must lie in [0,1] and sum to 1
- The same approach as for the Bernoulli works for the Multinoulli distribution
– First a linear layer predicts unnormalized log probabilities: z = WTh + b, where z_i = log P̃(y = i |x)
- The softmax can then exponentiate and normalize z to obtain the desired ŷ:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Softmax Regression
A generalization of logistic regression to multivalued output.
The network computes, in matrix notation:
z = WTx + b,  ŷ_i = softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Intuition of Log-likelihood Terms
- The exp within softmax works
very well when training using log-likelihood
– The log-likelihood can undo the exp of the softmax
– The input z_i always has a direct contribution to the cost
- Because this term cannot saturate, learning can proceed
even if second term becomes very small
– The first term encourages z_i to be pushed up
– The second term encourages all z to be pushed down
log softmax(z)_i = z_i − log Σ_j exp(z_j)
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Intuition of second term of likelihood
- The log-likelihood is log softmax(z)_i = z_i − log Σ_j exp(z_j)
- Consider the second term, log Σ_j exp(z_j)
– It can be approximated by max_j z_j
– Based on the idea that exp(z_k) is insignificant for any z_k noticeably less than max_j z_j
- Intuition gained:
– The cost penalizes the most active incorrect prediction
– If the correct answer already has the largest input to the softmax, then the −z_i term and the log Σ_j exp(z_j) ≈ max_j z_j = z_i term roughly cancel
– Such an example contributes little to the overall training cost, which will be dominated by other, incorrectly classified examples
Generalization to Training Set
- So far we discussed only a single example
- Overall, unregularized maximum likelihood will drive the model to learn parameters that make the softmax predict the fraction of counts of each outcome observed in the training set:

softmax( z(x;θ) )_i ≈ Σ_{j=1}^{m} 1_{ y^(j)=i, x^(j)=x } / Σ_{j=1}^{m} 1_{ x^(j)=x }
Softmax and Objective Functions
- Objective functions that do not use a log to
undo the exp of softmax fail to learn when argument of exp becomes very negative, causing gradient to vanish
- Squared error is a poor loss function for softmax units
– It can fail to train the model to change its output, even when the model makes highly incorrect predictions
Saturation of Sigmoid and Softmax
- Sigmoid has a single output that saturates
– When input is extremely negative or positive
- Like sigmoid, softmax activation can saturate
– In case of softmax there are multiple output values
- These output values can saturate when the differences
between input values become extreme
– Many cost functions based on softmax also saturate
Softmax & Input Difference
- Softmax invariant to adding the same scalar to
all inputs:
softmax(z) = softmax(z+c)
- Using this property we can derive a numerically
stable variant of softmax softmax(z) = softmax(z – maxi zi)
- Reformulation allows us to evaluate softmax
– With only small numerical errors, even when z contains extremely large or extremely small numbers
– The stable variant is driven by the amount each input deviates from max_i z_i
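The invariance softmax(z) = softmax(z − max_i z_i) gives the standard numerically stable implementation; a minimal sketch:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but prevents overflow in exp
    s = np.exp(z - np.max(z))
    return s / s.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(z) would overflow to inf
p = softmax(z)                           # finite, and sums to 1
```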
Saturation of Softmax
- An output softmax(z)i saturates to 1 when the
corresponding input is maximal (zi= maxi zi) and zi is much greater than all the other inputs
- The output can also saturate to 0 when z_i is not maximal and the maximum is much greater
- This is a generalization of the way the sigmoid
units saturate
– They can cause similar difficulties in learning if the loss function is not designed to compensate for it
Other Output Types
- Linear, Sigmoid and Softmax output units are
the most common
- Neural networks can generalize to any kind of
output layer
- Principle of maximum likelihood provides a
guide for how to design a good cost function for any output layer
– If we define a conditional distribution p(y | x), the principle of maximum likelihood suggests we use −log p(y | x) as our cost function
Determining Distribution Parameters
- We can think of the neural network as
representing a function f (x ; θ)
- Outputs are not direct predictions of value of y
- Instead f (x ; θ)=ω provides the parameters for
a distribution over y
- Our loss function can then be interpreted as
−log p(y ; ω(x))
Ex: Learning a Distribution Parameter
- We wish to learn the variance of a conditional
Gaussian of y given x
- Simple case: variance σ2 is constant
– σ2 has a closed-form maximum-likelihood estimate: the empirical mean of the squared difference between observations y and their expected value
– A computationally more expensive approach:
- Does not require writing special-case code
- Include variance as one of the properties of distribution
p(y |x) that is controlled by ω = f (x ; θ)
- Negative log-likelihood -log p(y ; ω(x)) will then provide
cost function with appropriate terms to learn variance
Topics in Hidden Units
- 1. ReLU and their generalizations
- 2. Logistic sigmoid and Hyperbolic tangent
- 3. Other hidden units
Choice of hidden unit
- Previously we discussed design choices for neural
networks that are common to most parametric learning models trained with gradient optimization
- We now look at how to choose the type of
hidden unit in the hidden layers of the model
- Design of hidden units is an active research
area that does not have many definitive guiding theoretical principles
Choice of hidden unit
- ReLU is an excellent default choice
- But there are many other types of hidden units
available
- When to use which kind (though ReLU is
usually an acceptable choice)?
- We discuss motivations behind choice of
hidden unit
– It is impossible to predict in advance which will work best
– The design process is trial and error
- Evaluate performance on a validation set
Is Differentiability necessary?
- Some hidden units are not differentiable at all
input points
– Rectified Linear Function g(z)=max{0,z} is not differentiable at z=0
- This may seem to invalidate such units for gradient-
based learning
- In practice gradient descent still performs well
enough for these models to be used in ML tasks
Differentiability ignored
- Neural network training
– does not usually arrive at a local minimum of the cost function
– instead it merely reduces the cost's value significantly
- Since we do not expect training to reach a
point where the gradient is 0,
– we accept minima that correspond to points of undefined gradient
- Hidden units that are not differentiable
are usually non-differentiable at
only a small no. of points
Left and Right Differentiability
- A function g(z) has a left derivative defined by
the slope immediately to the left of z
- A right derivative defined by the slope of the
function immediately to the right of z
- A function is differentiable at z=a only if both
– the left derivative and
– the right derivative exist and are equal there
Figure: a function that is not continuous has no derivative at the marked point; it nonetheless has a right derivative at all points, with ∂+f(a)=0 at the marked point
Software Reporting of Non-differentiability
- In the case of g(z)=max{0,z}, the left derivative
at z = 0 is 0 and right derivative is 1
- Software implementations of neural network
training usually return:
– one of the one-sided derivatives rather than reporting that derivative is undefined or an error
- This is justified in that gradient-based optimization is subject to
numerical error anyway
- When a function is asked to evaluate g(0), it is very
unlikely that the underlying value was truly 0, instead it was a small value ε that was rounded to 0
What a Hidden unit does
- Accepts a vector of inputs x and computes an
affine transformation z = WTx+b
- Computes an element-wise non-linear function
g(z)
- Most hidden units are distinguished from each
other by the choice of activation function g(z)
– We look at: ReLU, Sigmoid and tanh, and other hidden units
Rectified Linear Unit & Generalizations
- Rectified linear units use the activation function
g(z)=max{0,z}
– They are easy to optimize due to similarity with linear units
- The only difference from linear units is that a ReLU outputs 0 across
half of its domain
- Derivative is 1 everywhere that the unit is active
- Thus the gradient direction is far more useful than with
activation functions that introduce second-order effects
Use of ReLU
- Usually used on top of an affine transformation
h=g(WTx+b)
- Good practice to set all elements of b to a
small value such as 0.1
– This makes it likely that ReLU will be initially active for most training samples and allow derivatives to pass through
Generalizations of ReLU
- Perform comparably to ReLU and occasionally
perform better
- ReLU cannot learn on examples for which the
activation is zero.
- Generalizations guarantee that they receive
gradient everywhere
Three generalizations of ReLU
- Three methods based on using a non-zero
slope αi when zi<0:  hi = g(z,α)i = max(0, zi) + αi min(0, zi)
- 1. Absolute-value rectification:
- fixes αi=-1 to obtain g(z)=|z|
- 2. Leaky ReLU:
- fixes αi to a small value like 0.01
- 3. Parametric ReLU or PReLU:
- treats αi as a parameter
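The three variants share the formula hi = max(0, zi) + αi min(0, zi); a minimal illustrative sketch in Python (the function names are my own, not from the slides):

```python
def generalized_relu(z, alpha):
    # h_i = g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)
    return [max(0.0, zi) + ai * min(0.0, zi) for zi, ai in zip(z, alpha)]

def absolute_value_rectification(z):
    # fixing alpha_i = -1 gives g(z) = |z|
    return generalized_relu(z, [-1.0] * len(z))

def leaky_relu(z, alpha=0.01):
    # alpha_i fixed to a small value such as 0.01
    return generalized_relu(z, [alpha] * len(z))
```

A PReLU would instead treat each αi as a parameter learned by gradient descent rather than a fixed constant.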
Maxout Units
- Maxout units further generalize ReLUs
- Instead of applying element-wise function g(z),
maxout units divide z into groups of k values
- Each maxout unit then outputs the maximum
element of one of these groups: g(z)i = max j∈G(i) zj
– where G(i) is the set of indices into the inputs for group i, {(i-1)k+1,..,ik}
- This provides a way of learning a piecewise
linear function that responds to multiple directions in the input x space
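The grouping can be sketched as follows (illustrative Python with 0-based indexing, not part of the original slides):

```python
def maxout(z, k):
    # g(z)_i = max over the i-th consecutive group of k entries of z
    assert len(z) % k == 0, "z must divide evenly into groups of k"
    return [max(z[i:i + k]) for i in range(0, len(z), k)]
```

For example, maxout([1, 5, 2, 3, 7, 0], k=3) outputs the maximum of each group of three values.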
Maxout as Learning Activation
- A maxout unit can learn a piecewise linear,
convex function with up to k pieces
– Thus seen as learning the activation function itself rather than just the relationship between units
- With large enough k, approximate any convex function
– A maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using ReLU or its generalizations
Learning Dynamics of Maxout
- Maxout layers are parameterized differently
- Their learning dynamics differ even when
implementing the same function of x as one of the
other layer types
– Each maxout unit parameterized by k weight vectors instead of one
- A maxout layer therefore requires more regularization than ReLU
- Can work well without regularization if training set is large
and no. of pieces per unit is kept low
Other benefits of maxout
- Can gain statistical and computational
advantages by requiring fewer parameters
- If the features captured by n different linear
filters can be summarized without losing information by taking max over each group of k features, then next layer can get by with k times fewer weights
- Because of multiple filters, their redundancy
helps them avoid catastrophic forgetting
– in which a network forgets how to perform tasks it was previously trained to perform
Principle of Linearity
- ReLU based on principle that models are easier
to optimize if behavior closer to linear
– The principle also applies in contexts besides deep feedforward networks
- Recurrent networks can learn from sequences and
produce a sequence of states and outputs
- When training them we need to propagate information
through several time steps
– which is much easier when some linear computations (with some directional derivatives of magnitude near 1) are involved
Linearity in LSTM
- LSTM: best performing recurrent architecture
– Propagates information through time via summation
- A straightforward kind of linear activation
[LSTM block diagram: the input, forget and output gates each compute y = σ(Σi wixi); the gated values are combined multiplicatively, y = Πi xi; the internal state is propagated via a linear sum, y = Σi wixi.
An LSTM is an ANN that contains LSTM blocks in addition to regular network units.
Input gate: determines when inputs are allowed to flow into the block; when its output is close to zero, it zeros the input.
Forget gate: when close to zero, the block forgets whatever value it was remembering.
Output gate: determines when the unit should output its value.]
Logistic Sigmoid
- Prior to introduction of ReLU, most neural
networks used logistic sigmoid activation g(z)=σ(z)
- Or the hyperbolic tangent
g(z)=tanh(z)
- These activation functions are closely related
because tanh(z)=2σ(2z)-1
- Sigmoid units are used to predict probability
that a binary variable is 1
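The identity tanh(z) = 2σ(2z) − 1 is easy to verify numerically (illustrative Python, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# tanh(z) = 2*sigmoid(2z) - 1 holds for any z
diffs = [abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0))
         for z in (-2.0, 0.0, 0.5, 3.0)]
```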
Sigmoid Saturation
- Sigmoidal units saturate across most of their domain
– They saturate to 1 when z is very positive and to 0 when z is very negative
– They are strongly sensitive to the input only when z is near 0
– Saturation makes gradient-based learning difficult
- ReLU and Softplus increase for input >0
Sigmoid can still be used when the cost function undoes the sigmoid in the output layer
Sigmoid vs tanh Activation
- Hyperbolic tangent typically performs better
than logistic sigmoid
- It resembles the identity function more closely
tanh(0)=0 while σ(0)=½
- Because tanh is similar to identity near 0,
training a deep neural network resembles training a linear model so long as the activations can be kept small
ŷ = wT tanh(UT tanh(VTx))  resembles the linear model  ŷ = wTUTVTx
Sigmoidal units still useful
- Sigmoidal more common in settings other
than feed-forward networks
- Recurrent networks, many probabilistic
models and autoencoders have additional requirements that rule out piecewise linear activation functions
- They make sigmoid units appealing
despite saturation
Other Hidden Units
- Many other types of hidden units possible, but
used less frequently
– Feed-forward network using h = cos(Wx + b)
- on MNIST obtained error rate of less than 1%
– Radial Basis
- Becomes more active as x approaches a template W:,i
– Softplus
- Smooth version of the rectifier
– Hard tanh
- Shaped similar to tanh and the rectifier but it is bounded
hi = exp( −(1/σ2) ||W:,i − x||2 )  (radial basis function)
g(a) = ζ(a) = log(1 + ea)  (softplus)
g(a) = max(−1, min(1, a))  (hard tanh)
Topics in Architecture Design
- 1. Chart of 27 neural network designs (generic)
- 2. Specific deep learning architectures
- 3. Architecture Terminology
- 4. Equations for Layers
- 5. Theoretical underpinnings
– Universal Approximation Theorem – No Free Lunch Theorem
- 6. Advantages of deeper networks
Generic Neural Architectures (1-11)
Generic Neural Architectures (12-19)
Generic Neural Architectures (20-27)
Specific Application Architectures
Examples: an RNN architecture to study how images in the mind can influence movements and motor skills; an architecture for cancer prognosis
An architecture for Game Design
CNN Architectures
More complex features are captured in deeper layers
Architecture Blending Deep Learning and Reinforcement Learning
- Human Level Control Through Deep
Reinforcement Learning
Architecture Terminology
- The word architecture refers to the overall
structure of the network:
– How many units should it have?
– How should the units be connected to each other?
- Most neural networks are organized into groups
of units called layers
– Most neural network architectures arrange these layers in a chain structure
– with each layer being a function of the layer that preceded it
Equations for Layers
- Organized groups of units are called layers
- Layers are arranged in a chain structure
- Each layer is a function of the layer that
preceded it
– First layer is given by h(1)=g(1)(W(1)Tx + b(1))
– Second layer is h(2)=g(2)(W(2)Th(1) + b(2)), etc.
Figure: one network layer, with the layer input written in matrix-multiplication notation
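The chain of layers can be sketched directly from these equations (illustrative Python, not from the slides; here W is stored as a list of weight columns, one per output unit, and the helper names are my own):

```python
def affine(W, b, x):
    # z = W^T x + b; each column of W produces one output unit
    return [sum(wj * xj for wj, xj in zip(col, x)) + bi
            for col, bi in zip(W, b)]

def relu(z):
    return [max(0.0, zi) for zi in z]

def forward(layers, x):
    # Each layer (W, b, g) is a function of the previous layer's output
    h = x
    for W, b, g in layers:
        h = g(affine(W, b, h))
    return h
```

Feeding the output of layer 1 into layer 2 is exactly the h(2)=g(2)(W(2)Th(1)+b(2)) composition above.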
Main Architectural Considerations
- 1. Choice of depth of network
- 2. Choice of width of each layer
A network with even one hidden layer is sufficient to fit the training set
Advantage of Deeper Networks
- Deeper networks have
– far fewer units in each layer
– far fewer parameters
– often generalize well to the test set
– but are often more difficult to optimize
- Ideal network architecture must be found via
experimentation guided by validation set error
Theoretical underpinnings
- Mathematical theory of Artificial Neural
Networks
– Linear versus Nonlinear Models – Universal Approximation Theorem
- No Free Lunch Theorem
- Size of network
Linear vs Nonlinear Models
- A linear model, mapping features to outputs via a matrix
multiplication, can represent only linear functions
– They are easy to train
- Because loss functions result in convex optimization
- Unfortunately often we want to learn nonlinear
functions
– It is not necessary to define a separate family of nonlinear functions for each task
– Feedforward networks with hidden layers provide a universal approximation framework
Universal Approximation Theorem
- A feed-forward network with a single hidden
layer containing a finite number of neurons can approximate continuous functions on compact subsets of Rn, under mild assumptions on the activation function
– Simple neural networks can represent a wide variety of interesting functions when given appropriate parameters – However, it does not touch upon the algorithmic learnability of those parameters.
(The result extends to Borel measurable functions.)
Formal Theorem
– Let φ(⋅) be a continuous, non-constant, bounded, monotonically increasing (activation) function
– Let Im be the unit hypercube [0,1]m (m inputs, values in [0,1])
– Let C(Im) be the space of continuous functions on Im
- Then, given any function f ∈ C(Im) and ε>0, there exist
an integer N (the number of hidden units),
real constants vi, bi ∈ R (output weights and biases), and
real vectors wi ∈ Rm, i=1,…,N (input weights),
such that we may define
F(x) = Σi=1..N vi φ(wiTx + bi)
as an approximation of f, i.e., |F(x) − f(x)| < ε for all x ∈ Im
i.e., functions of the form F(x) are dense in C(Im)
Implication of Theorem
- A feedforward network with a linear output
layer and at least one hidden layer with any “squashing” activation function (such as logistic sigmoid) can approximate:
– any Borel measurable function from one finite-dimensional space to another
– with any desired non-zero amount of error
– provided the network is given enough hidden units
- The derivatives of the network can also
approximate derivatives of function well
Applicability of Theorem
- Any continuous function on a closed and
bounded subset of Rn is Borel measurable
– Therefore approximated by a neural network
- Discrete case:
– A neural network may also approximate any function mapping from any finite dimensional discrete space to another
- Original theorems stated for activations that
saturate for very negative/positive arguments
– Also proved for wider class including ReLU
Theorem and Training
- Whatever function we are trying to learn, a
large MLP will be able to represent it
- However we are not guaranteed that the
training algorithm will learn this function
- 1. The optimization algorithm may not be able to find the parameter values
- 2. May choose wrong function due to over-fitting
- No Free Lunch: There is no universal
procedure for examining a training set of samples and choosing a function that will generalize to points not in training set
Feed-forward & No Free Lunch
- Feed-forward networks provide a universal
system for representing functions
– Given a function, there is a feed-forward network that approximates the function
- There is no universal procedure for examining
a training set of specific examples and choosing a function that will generalize to points not in training set
On Size of Network
- Universal Approximation Theorem
– says there is a network large enough to achieve any degree of accuracy
– but does not say how large that network will be
- Some bounds on size of the single-layer
network exist for a broad class of functions
– But worst case is exponential no. of hidden units
- The no. of possible binary functions on vectors v ∈ {0,1}n is
2^(2^n)
- Selecting one such function requires 2n bits, which will
require O(2n) degrees of freedom
Summary/Implications of Theorem
- A feedforward network with a single hidden layer
is sufficient to represent any function
- But the layer may be infeasibly large and
may fail to generalize correctly
- Using deeper models can reduce no. of
units required and reduce generalization error
Function Families and Depth
- Some families of functions can be represented
efficiently when depth ≥ d, but require a much larger model when depth < d
- In some cases no. of hidden units required by
shallow model is exponential in n
– Functions representable with a deep rectifier net can require an exponential no. of hidden units with a shallow (one hidden layer) network
- Piecewise linear networks (which can be obtained from
rectifier nonlinearities or maxout units) can represent functions with a no. of regions that is exponential in d
Advantage of deeper networks
[Figure: a unit has the same output for every pair of mirror points in the input; its axis of symmetry is given by the unit's weights and bias. A function computed on top of the unit (the green decision surface) is a mirror image of a simpler pattern across that axis; the function can be obtained by folding the space around the axis of symmetry. Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry, now repeated four times with two hidden layers.]
Absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses. By composing these folding operations we obtain an exponentially large no. of piecewise linear regions, which can capture all kinds of repeating patterns.
Theorem on Depth
- The no. of linear regions carved out by a deep
rectifier network with d inputs, depth l and n units per hidden layer is
– i.e., exponential in the depth l.
- In the case of maxout networks with k filters per
unit, the no. of linear regions is
- There is no guarantee that the kinds of
functions we want to learn in AI share such a property
O( (n/d)^(d(l−1)) n^d )   (deep rectifier network)
O( k^((l−1)+d) )   (maxout network with k filters per unit)
Statistical Justification for Depth
- We may want to choose a deep model for
statistical reasons
- Any time we choose an ML algorithm we are
implicitly stating a set of beliefs about what kind
of functions that algorithm should learn
- Choosing a deep model encodes a belief that
the function should be a composition of several simpler functions
Intuition on Depth
- We can interpret the use of a deep architecture
as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output
- The intermediate outputs are not necessarily
factors of variation, but can be analogous to counters or pointers used for organizing processing
- Empirically greater depth results in better
generalization
Empirical Results
- Deeper networks perform better
- Deep architectures indeed express a useful
prior over the space of functions the model learns
Test accuracy consistently increases with depth; increasing the number of parameters without increasing depth is not as effective
Other architectural considerations
- Specialized architectures are discussed later
- Convolutional Networks
– Used for computer vision
- Recurrent Neural Networks
– Used for sequence processing – Have their own architectural considerations
Non-chain architecture
- Layers connected in a chain is common
- Skip connections go from layer i to layer i+2 or
higher
– During learning, they make it easier for the gradient to flow from the output layers to layers nearer the input
Connecting a pair of layers
- In the default neural network layer, described by
a linear transformation via a matrix W,
every input unit is connected to every output unit
- Specialized networks have fewer connections
– Each unit in the input layer is connected to only a small subset of units in the output layer
– This reduces the no. of parameters and the computation needed for evaluation
– E.g., CNNs use specialized patterns of sparse connections that are effective for computer vision
Topics in Backpropagation
- 1. Overview
- 2. Computational Graphs
- 3. Chain Rule of Calculus
- 4. Recursively applying the chain rule to obtain
backprop
- 5. Backpropagation computation in fully-connected MLP
- 6. Symbol-to-symbol derivatives
- 7. General backpropagation
- 8. Ex: backpropagation for MLP training
- 9. Complications
- 10. Differentiation outside the deep learning community
- 11. Higher-order derivatives
Overview of Backpropagation
Forward Propagation
- Producing an output from input
– When we use a Feed-Forward Network to accept an input x and produce an output ŷ, information flows forward: x propagates to the hidden units at each layer and finally produces ŷ
– This is called forward propagation
- During training (quality of result is evaluated):
– forward propagation can continue onward
– until it produces a scalar cost J(θ) over the N training samples (xn, yn)
Equations for Forward Propagation
Producing an output:
First layer given by h(1)=g(1)(W(1)Tx + b(1))
Second layer is h(2)=g(2)(W(2)Th(1) + b(2)), …
Final output is ŷ = g(d)(W(d)Th(d−1) + b(d))
During training, cost over the N exemplars:
J(θ) = JMLE + λ ( Σi,j (Wi,j(1))2 + Σi,j (Wi,j(2))2 + … )
JMLE = (1/N) Σn=1..N || ŷn − yn ||2
Back-Propagation Algorithm
- Often simply called backprop
– Allows information from the cost to flow back through network to compute gradient
- Computing analytical expression for the
gradient is straightforward
– But numerically evaluating the gradient is computationally expensive
- The backpropagation algorithm does this using
a simple and inexpensive procedure
Analytical Expression for Gradient
- Sum-of-squares criterion over n samples
– Expression for gradient
- Another way of saying the same with cost J(θ):
E(w) = ½ Σn=1..N { wTφ(xn) − tn }2
∇w E(w) = Σn=1..N { wTφ(xn) − tn } φ(xn)
Jn(θ) = || θTxn − yn ||2,  ∇θ Jn(θ) = 2(θTxn − yn) xn
Backpropagation is not Learning
- Backpropagation often misunderstood as the
whole learning algorithm for multilayer networks
– It only refers to method of computing gradient
- Another algorithm, e.g., SGD, is used to
perform learning using this gradient
– Learning is updating weights using gradient:
- Backpropagation is also misunderstood to
being specific to multilayer neural networks
– It can be used to compute derivatives for any function (or report that the derivative is undefined)
θ(τ+1) = θ(τ) − η ∇θ Jn(θ(τ))
Importance of Backpropagation
- Backprop is a technique for computing
derivatives quickly
– It is the key algorithm that makes training deep models computationally tractable
– For modern neural networks, it can make gradient-descent training 10 million times faster relative to a naive implementation
- It is the difference between a model that takes a week to
train instead of 200,000 years
Computing gradient for arbitrary function
- Arbitrary function f(x,y)
– x : variables for which derivatives are desired – y is an additional set of variables that are inputs to the function but whose derivatives are not required
- The gradient most often required is that of the cost with respect to the parameters, ∇θ J(θ)
- Backprop is also useful for other ML tasks
– Those that need derivatives as part of the learning process, or to analyze a learned model
– To compute the Jacobian of a function f with multiple outputs
- More generally, backprop computes ∇x f(x, y)
- We restrict to the case where f has a single output
Computational Graphs
- To describe backpropagation use precise
computational graph language
– Each node is either
- A variable
– Scalar, vector, matrix, tensor, or other type
- Or an Operation
– Simple function of one or more variables – Functions more complex than operations are obtained by composing operations
– If variable y is computed by applying operation to variable x then draw directed edge from x to y
Ex: Computational Graph of xy
(a) Compute z = xy
Ex: Graph of Logistic Regression
(b) Logistic Regression Prediction
– Variables in graph u(1) and u(2) are not in original expression, but are needed in graph
ŷ = σ(xTw + b)
Ex: Graph for ReLU
(c) Compute expression H=max{0,XW+b}
– Computes a design matrix of Rectified linear unit activations H given design matrix of minibatch of inputs X
Ex: Two operations on input
(d) Performing more than one
operation on a variable:
weights w are used in two
operations,
- to make the prediction ŷ, and
- in the weight decay penalty λ Σi wi2
Chain Rule of Calculus
Calculus’ Chain Rule for Scalars
- Formula for computing derivatives of functions
formed by composing other functions whose derivatives are known
– Backpropagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient
- Let x be a real number
- Let f and g be functions mapping from a real number to a
real number
- If y=g(x) and z=f (g(x))=f (y)
- Then the chain rule states that
dz/dx = (dz/dy) · (dy/dx)
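The rule can be checked numerically (illustrative Python; the example functions g(x) = x² and f(y) = sin y are my own, not from the slides):

```python
import math

def g(x):
    # y = g(x) = x^2
    return x * x

def f(y):
    # z = f(y) = sin(y)
    return math.sin(y)

x0 = 2.0
# Chain rule: dz/dx = dz/dy * dy/dx = cos(g(x0)) * 2*x0
chain = math.cos(g(x0)) * (2.0 * x0)

# Central-difference approximation of d/dx f(g(x)) at x0
eps = 1e-6
numeric = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2.0 * eps)
```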
Generalizing Chain Rule to Vectors
- Suppose
g maps from Rm to Rn and
f from Rn to R
- If y=g(x) and z=f (y), with x ∈ Rm, y ∈ Rn, then
∂z/∂xi = Σj (∂z/∂yj) · (∂yj/∂xi)
- In vector notation this is
∇xz = (∂y/∂x)T ∇yz
- where ∂y/∂x is the n × m Jacobian matrix of g
- Thus the gradient of z wrt x is the product of:
- a Jacobian matrix and a gradient vector
- The backprop algorithm consists of performing a
Jacobian-gradient product for each step of the graph
Generalizing Chain Rule to Tensors
- Backpropagation is usually applied to tensors
with arbitrary dimensionality
- This is exactly the same as with vectors
– Only difference is how numbers are arranged in a grid to form a tensor
- We could flatten each tensor into a vector, compute a
vector-valued gradient and reshape it back to a tensor
- In this view backpropagation is still multiplying
Jacobians by gradients
∇xz = (∂y/∂x)T ∇yz
Chain Rule for tensors
- To denote the gradient of a value z wrt a tensor
X, we write ∇Xz, just as if X were a vector
- For a 3-D tensor, X has three coordinates
– We can abstract this away by using a single variable i to represent the complete tuple of indices
- For all possible tuples i, (∇Xz)i gives ∂z/∂Xi
- Exactly the same as how, for all possible indices i into a
vector, (∇xz)i gives ∂z/∂xi
- Chain rule for tensors
– If Y=g(X) and z=f (Y), then
∇Xz = Σj (∇XYj) ∂z/∂Yj
Recursively applying the chain rule to obtain backprop
Backprop is Recursive Chain Rule
- Backprop is obtained by recursively applying
the chain rule
- Using chain rule it is straightforward to write
expression for gradient of a scalar wrt any node in graph for producing that scalar
- However, evaluating that expression on a
computer has some extra considerations
– E.g., many subexpressions may be repeated several times within overall expression
- Whether to store subexpressions or recompute them
Example of repeated subexpressions
- Let w be the input to the graph
– We use the same function f: R→R at every step:
x=f (w), y=f (x), z=f (y)
- To compute ∂z/∂w, apply the chain rule:
∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w)
      = f′(y) f′(x) f′(w)            (1)
      = f′(f(f(w))) f′(f(w)) f′(w)   (2)
- Eq. (1): compute f(w) once and store it in x
– This is the approach taken by backprop
- Eq. (2): the expression f(w) appears more than once and
is recomputed each time it is needed
– (1) is preferable for reduced runtime; (2) is also a valid chain rule, useful when memory is limited
- For complicated graphs, a naive implementation
of the chain rule wastes exponentially many computations,
making it infeasible
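The two evaluation strategies can be compared concretely (illustrative Python, using f(u) = u² as the repeated function; the example is my own):

```python
def f(u):
    return u * u

def fprime(u):
    # derivative of f(u) = u^2
    return 2.0 * u

w = 1.5
# Eq. (1): the forward pass stores the intermediates x and y
# (this is backprop's approach)
x = f(w)
y = f(x)
grad_stored = fprime(y) * fprime(x) * fprime(w)

# Eq. (2): f(w) and f(f(w)) are recomputed wherever they appear
grad_recomputed = fprime(f(f(w))) * fprime(f(w)) * fprime(w)
```

Both equal the analytic derivative of z = ((w²)²)² = w⁸, namely 8w⁷; Eq. (1) simply avoids evaluating f(w) repeatedly.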
Simplified Backprop Algorithm
- Version that directly computes actual gradient
– In the order it will actually be done according to recursive application of chain rule
- See the Simplified Backprop algorithm, along with its associated
forward propagation
- Could either directly perform these operations
– or view algorithm as symbolic specification of computational graph for computing the back-prop
- This formulation does not make explicit
– the manipulation and construction of the symbolic graph that performs the gradient computation
Computing a single scalar
- Consider computational graph of how to
compute scalar u(n)
– say loss on a training example
- We want gradient of this scalar u(n) wrt ni input
nodes u(1),..u(ni)
- i.e., we wish to compute ∂u(n)/∂u(i) for all i =1,..,ni
- In application of backprop to computing
gradients for gradient descent over parameters
– u(n) will be cost associated with an example or a minibatch, while – u(1),..u(ni) correspond to model parameters
Nodes of Computational Graph
- Assume that the nodes of the graph have been
ordered such that
– we can compute their output one after another
– starting at u(ni+1) and going up to u(n)
- As defined in Algorithm shown next
– Each node u(i) is associated with operation f (i) and is computed by evaluating the function
u(i) = f (A(i))
where A(i) = Pa(u(i)) is set of nodes that are parents of u(i)
- Algorithm specifies a computational graph G
– Computation in reverse order gives back- propagation computational graph B
Forward Propagation Algorithm
Algorithm 1: Performs the computations mapping ni inputs u(1),..,u(ni) to an output u(n). This defines a computational graph G where each node computes a numerical value u(i) by applying a function f (i) to the set of arguments A(i) that comprises the values of previous nodes u(j), j < i, with j ∈ Pa(u(i)). The input to G is the vector x, set into the first ni nodes u(1),..,u(ni). The output of G is read off the last (output) node u(n).
- for i = 1,..,ni do
    u(i) ← xi
- end for
- for i = ni+1,.., n do
    A(i) ← { u(j) | j ∈ Pa(u(i)) }
    u(i) ← f (i)(A(i))
- end for
- return u(n)
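Algorithm 1 can be sketched directly (illustrative Python with 0-based node indices; the graph representation is my own choice, not from the slides):

```python
def forward_propagation(graph, x):
    """Map the n_i inputs in x to the output u(n).

    graph: list of (f_i, parent_indices) pairs for the non-input
    nodes u(n_i+1), ..., u(n), in topological order.
    Returns the full list of node values; the last entry is u(n).
    """
    u = list(x)                      # u(1..n_i) <- x
    for f_i, parents in graph:       # remaining nodes, in order
        u.append(f_i([u[j] for j in parents]))
    return u
```

For example, z = (a + b) · b is a graph with one sum node and one product node.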
Computation in B
- Proceeds exactly in reverse order of
computation in G
- Each node in B computes the derivative ∂u(n)/∂u(i) associated with the forward graph node u(i)
- This is done using the chain rule wrt the scalar output u(n):
      ∂u(n)/∂u(j) = Σ_{i : j ∈ Pa(u(i))} (∂u(n)/∂u(i)) (∂u(i)/∂u(j))
Preamble to Simplified Backprop
- Objective is to compute derivatives of u(n) with
respect to variables in the graph
– Here all variables are scalars and we wish to compute the derivatives wrt u(1),..,u(ni)
- Algorithm computes the derivatives of all nodes
in the graph
Simplified Backprop Algorithm
- Algorithm 2: For computing the derivatives of u(n) wrt the variables in G. All variables are scalars and we wish to compute the derivatives wrt u(1),..,u(ni). We compute the derivatives of all nodes in G.
- Run forward propagation (Algorithm 1) to obtain the network activations
- Initialize grad-table, a data structure that will store the derivatives that have been computed. The entry grad-table[u(i)] will store the computed value of ∂u(n)/∂u(i)
- 1. grad-table[u(n)] ← 1
- 2. for j = n−1 down to 1 do
        grad-table[u(j)] ← Σ_{i : j ∈ Pa(u(i))} grad-table[u(i)] · ∂u(i)/∂u(j)
- 3. end for
- 4. return { grad-table[u(i)] | i = 1,..,ni }
- Step 2 computes ∂u(n)/∂u(j) = Σ_{i : j ∈ Pa(u(i))} (∂u(n)/∂u(i)) (∂u(i)/∂u(j))
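Algorithm 2 can be sketched as follows. The graph encoding — each non-input node carries its parents plus a function returning the local partials ∂u(i)/∂u(j) — is my own, not from the slides:

```python
# A sketch of the Simplified Backprop algorithm over scalar nodes.

def backprop(n_inputs, nodes, x):
    """nodes[k] = (f, parents, dpartials), where dpartials(A) returns the
    list of local derivatives du(i)/du(j) for j in parents, at values A."""
    # forward pass: obtain all activations u(1),..,u(n)
    u = list(x)
    for f, parents, _ in nodes:
        u.append(f([u[j] for j in parents]))
    # grad_table[j] will hold du(n)/du(j); initialize du(n)/du(n) = 1
    n = len(u)
    grad_table = [0.0] * n
    grad_table[n - 1] = 1.0
    # sweep nodes in reverse topological order, accumulating the chain rule
    for idx, (f, parents, dpartials) in reversed(list(enumerate(nodes))):
        i = n_inputs + idx                      # index of node u(i)
        local = dpartials([u[j] for j in parents])
        for j, d in zip(parents, local):
            grad_table[j] += grad_table[i] * d  # du(n)/du(j) accumulation
    return grad_table[:n_inputs]                # gradients wrt the inputs

# u3 = u1*u2, u4 = u3 + u1: du4/du1 = u2 + 1 = 4, du4/du2 = u1 = 2
nodes = [(lambda a: a[0] * a[1], [0, 1], lambda a: [a[1], a[0]]),
         (lambda a: a[0] + a[1], [2, 0], lambda a: [1.0, 1.0])]
print(backprop(2, nodes, [2.0, 3.0]))  # [4.0, 2.0]
```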
Computational Complexity
- Computational cost is proportional to no. of
edges in graph (same as for forward prop)
– Each ∂u(i)/∂u(j) is a function of the parents of u(j) and of u(i), thus linking the nodes of the forward graph to those added for B
- Backpropagation thus avoids the exponential explosion in repeated subexpressions
  – By simplifications on the computational graph
Generalization to Tensors
- Backprop is designed to reduce the no. of
common sub-expressions without regard to memory
- It performs on the order of one Jacobian
product per node in the graph
Backprop in fully connected MLP
- Consider specific graph associated with
fully-connected multilayer perceptron
- Algorithm discussed next shows forward
propagation
– Maps parameters to the supervised loss L(ŷ, y) associated with a single training example (x, y), where ŷ is the output when x is the input
Forward Prop: deep nn & cost computation
- Algorithm 3: The loss L(ŷ, y) depends on the output ŷ and on the target y. To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 4 computes the gradients of J wrt the parameters W and b. This demonstration uses only a single input example x.
- Require: Net depth l; weight matrices W(k), k ∈ {1,..,l}; bias parameters b(k), k ∈ {1,..,l}; input x; target output y
- 1. h(0) = x
- 2. for k = 1 to l do
        a(k) = b(k) + W(k) h(k−1)
        h(k) = f(a(k))
- 3. end for
- 4. ŷ = h(l)
- 5. J = L(ŷ, y) + λΩ(θ)
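Algorithm 3 can be sketched in numpy. The slides leave f, L, and Ω(θ) generic; a sigmoid nonlinearity, squared-error loss, and L2 weight decay are my illustrative assumptions:

```python
# A sketch of Algorithm 3: forward propagation and cost for a deep net.
import numpy as np

def mlp_forward(Ws, bs, x, y, lam=0.01):
    f = lambda a: 1.0 / (1.0 + np.exp(-a))     # assumed elementwise f
    h = x                                       # h(0) = x
    for W, b in zip(Ws, bs):
        a = b + W @ h                           # a(k) = b(k) + W(k) h(k-1)
        h = f(a)                                # h(k) = f(a(k))
    y_hat = h                                   # yhat = h(l)
    loss = 0.5 * np.sum((y_hat - y) ** 2)       # assumed L(yhat, y)
    omega = sum(np.sum(W ** 2) for W in Ws)     # assumed Omega(theta)
    return loss + lam * omega                   # J = L + lambda*Omega

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [np.zeros(3), np.zeros(1)]
J = mlp_forward(Ws, bs, np.array([1.0, -1.0]), np.array([1.0]))
print(float(J))
```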
Backward compute: deep NN of Algorithm 3
- Algorithm 4: uses, in addition to the input x, a target y. It yields the gradients on the activations a(k) for each layer, starting from the output layer and going back to the first hidden layer. From these gradients one can obtain the gradient on the parameters of each layer. The gradients can be used as part of SGD.
- After the forward computation, compute the gradient on the output layer:
      g ← ∇ŷ J = ∇ŷ L(ŷ, y)
- for k = l, l−1, .., 1 do
      Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (elementwise multiply if f is elementwise):
          g ← ∇a(k) J = g ⊙ f′(a(k))
      Compute the gradients on weights and biases (including the regularization term):
          ∇b(k) J = g + λ ∇b(k) Ω(θ)
          ∇W(k) J = g h(k−1)T + λ ∇W(k) Ω(θ)
      Propagate the gradients wrt the next lower-level hidden layer's activations:
          g ← ∇h(k−1) J = W(k)T g
- end for
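The backward sweep of Algorithm 4 can be sketched to pair with the forward pass above; as before, the sigmoid f, squared-error L, and L2 Ω(θ) are my assumptions, not the slides':

```python
# A sketch of Algorithm 4: backward computation for the deep net above.
import numpy as np

def mlp_grads(Ws, bs, x, y, lam=0.01):
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    # forward pass, storing a(k) and h(k) for the backward sweep
    hs, acts = [x], []
    for W, b in zip(Ws, bs):
        acts.append(b + W @ hs[-1])
        hs.append(f(acts[-1]))
    y_hat = hs[-1]
    # g <- grad of L(yhat, y) wrt yhat (squared error: yhat - y)
    g = y_hat - y
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    for k in reversed(range(len(Ws))):
        fa = f(acts[k])
        g = g * fa * (1 - fa)                 # g <- g * f'(a(k)), elementwise
        gbs[k] = g + 0.0                      # grad on b(k); Omega has no b
        gWs[k] = np.outer(g, hs[k]) + lam * 2 * Ws[k]  # g h(k-1)^T + reg
        g = Ws[k].T @ g                       # propagate: g <- W(k)^T g
    return gWs, gbs
```

A quick finite-difference check against the forward cost is a good way to validate such a sketch.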
Symbol-to-Symbol Derivatives
- Both algebraic expressions and computational
graphs operate on symbols, or variables that do not have specific values
- They are called symbolic representations
- When we actually use or train a neural network,
we must assign specific values for these symbols
- We replace a symbolic input to the network with
a specific numeric value
– E.g., [2.5, 3.75, -1.8]T
Two approaches to backpropagation
- 1. Symbol-to-number differentiation
  – Take a computational graph and a set of numerical values for the inputs to the graph
  – Return a set of numerical values describing the gradient at those input values
  – Used by libraries: Torch and Caffe
- 2. Symbol-to-symbol differentiation
  – Take a computational graph
  – Add additional nodes to the graph that provide a symbolic description of the desired derivatives
  – Used by libraries: Theano and TensorFlow
Symbol-to-symbol Derivatives
- To compute derivative using this
approach, backpropagation does not need to ever access any actual numerical values
  – Instead it adds nodes to a computational graph describing how to compute the derivatives
  – A generic graph evaluation engine can later compute the derivatives for any specific numerical values
Ex: Symbol-to-symbol Derivatives
- Begin with a graph representing z = f(f(f(w)))
Symbol-to-Symbol Derivative Computation
- We run backpropagation, instructing it to construct the graph for the expression corresponding to dz/dw
- The result is a computational graph with a symbolic description of the derivative
Advantages of Approach
- Derivatives are described in the same
language as the original expression
- Because the derivatives are just another
computational graph, it is possible to run back-propagation again
– Differentiating the derivatives – Yields higher-order derivatives
General Backpropagation
- To compute the gradient of a scalar z wrt one of its ancestors x in the graph:
  – Begin by observing that the gradient wrt z itself is dz/dz = 1
  – Then compute the gradient wrt each parent of z by multiplying the current gradient by the Jacobian of the operation f that produced z
  – We continue multiplying by Jacobians, traveling backwards until we reach x
  – For any node that can be reached by going backwards from z through two or more paths, sum the gradients arriving from the different paths at that node
- For a chain x → y → z, with y = g(x) and z = f(y), each step applies
      ∇x z = (∂y/∂x)T ∇y z
Formal Notation for backprop
- Each node in the graph G corresponds to
a variable
- Each variable is described by a tensor V
– Tensors have any no. of dimensions – They subsume scalars, vectors and matrices
Each variable V is associated with the following subroutines:
- get_operation (V)
  – Returns the operation that computes V, represented by the edges coming into V in G
  – Suppose we have a variable that is computed by matrix multiplication, C = AB
- Then get_operation (C) returns a pointer to an instance of the corresponding C++ class
Other Subroutines of V
- get_consumers (V, G)
– Returns list of variables that are children of V in the computational graph G
- get_inputs (V, G)
– Returns list of variables that are parents of V in the computational graph G
bprop operation
- Each operation op is associated with a bprop operation
- The bprop operation can compute a Jacobian-vector product, as described by
      ∇x z = (∂y/∂x)T ∇y z
- This is how the backpropagation algorithm achieves great generality
  – Each operation is responsible for knowing how to backpropagate through the edges in the graph that it participates in
Example of bprop
- Suppose we have
  – a variable C computed by matrix multiplication, C = AB
  – the gradient of a scalar z wrt C, given by G
- The matrix multiplication operation is
responsible for two back propagation rules
– One for each of its input arguments
- If we call bprop to request the gradient wrt A, given that the gradient on the output is G
  – Then the bprop method of matrix multiplication must state that the gradient wrt A is given by GBᵀ
- If we call bprop to request the gradient wrt B
  – Then the matrix operation is responsible for implementing its bprop and specifying that the desired gradient is AᵀG
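The two matmul bprop rules can be verified numerically. The function names and the choice of scalar z = sum(C) (so that G is all ones) are my own for illustration:

```python
# A numerical check of the matmul backpropagation rules GB^T and A^T G.
import numpy as np

def matmul_bprop_wrt_A(A, B, G):
    return G @ B.T          # gradient of scalar z wrt A, given G = grad wrt C

def matmul_bprop_wrt_B(A, B, G):
    return A.T @ G          # gradient of scalar z wrt B

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
# take z = sum(C) with C = AB, so G (the gradient on C) is all ones
G = np.ones((2, 4))
gA = matmul_bprop_wrt_A(A, B, G)
# finite-difference check on one entry of A
eps = 1e-6
Ap = A.copy(); Ap[0, 1] += eps
num = (np.sum(Ap @ B) - np.sum(A @ B)) / eps
print(abs(num - gA[0, 1]) < 1e-4)   # True
```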
Inputs, outputs of bprop
- The backpropagation algorithm itself does not need to know any differentiation rules
– It only needs to call each operation’s bprop rules with the right arguments
- Formally, op.bprop (inputs, X, G) must return
      Σi (∇X op.f(inputs)i) Gi
  which is just an implementation of the chain rule
  – inputs is the list of inputs supplied to the operation
  – op.f is the mathematical function that the operation implements
  – X is the input whose gradient we wish to compute
  – G is the gradient on the output of the operation
Computing the derivative of x²
- Example: the mul operation is passed two copies of x to compute x²
- op.bprop still returns x as the derivative wrt each of the two inputs
- Backpropagation then adds both contributions together to obtain the correct total derivative 2x
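This x² example can be sketched directly; the function name mul_bprop is my own stand-in for the mul operation's bprop method:

```python
# A sketch of the x^2 example: mul's bprop returns the *other* input as the
# local derivative, and backprop sums the two edge contributions.

def mul_bprop(inputs, wrt_index, G):
    # d(a*b)/da = b and d(a*b)/db = a, scaled by the incoming gradient G
    a, b = inputs
    return G * (b if wrt_index == 0 else a)

x = 3.0
inputs = (x, x)                  # two copies of x feed the mul node
g0 = mul_bprop(inputs, 0, 1.0)   # contribution through the first edge: x
g1 = mul_bprop(inputs, 1, 1.0)   # contribution through the second edge: x
print(g0 + g1)                   # 6.0, i.e. 2x at x = 3
```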
Software Implementations
- Usually provide both:
- 1. Operations
- 2. Their bprop methods
- Users of software libraries are able to
backpropagate through graphs built using common operations like
– Matrix multiplication, exponents, logarithms, etc
- To add a new operation to an existing library, one must derive its bprop method manually
Formal Backpropagation Algorithm
- Algorithm 5: Outermost skeleton of backprop
- This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of Algorithm 6
- Require: T, the target set of variables whose gradients must be computed
- Require: G, the computational graph
- Require: z, the variable to be differentiated
- 1. Let G′ be G pruned to contain only nodes that are ancestors of z and descendants of nodes in T
- 2. Initialize grad-table, with grad-table[z] ← 1
- 3. for V in T do
        build_grad (V, G, G′, grad-table)
    end for
- 4. Return grad-table restricted to T
Inner Loop: build-grad
- Algorithm 6: Inner-loop subroutine build_grad (V, G, G′, grad-table) of the back-propagation algorithm, called by Algorithm 5
- Require: V, the variable whose gradient should be added to G and grad-table; G, the graph to modify; G′, the restriction of G to nodes that participate in the gradient; grad-table, a data structure mapping nodes to their gradients
- if V is in grad-table then return grad-table[V] endif
- i ← 1
- for C in get_consumers (V, G′) do
      op ← get_operation (C)
      D ← build_grad (C, G, G′, grad-table)
      G(i) ← op.bprop (get_inputs (C, G′), V, D)
      i ← i + 1
  end for
- G ← Σi G(i)
- grad-table[V] ← G
- Insert G and the operations creating it into G
- Return G
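The build_grad recursion can be condensed into a short sketch. The Node class and its fields are my own encoding of get_operation, get_inputs, get_consumers, and op.bprop; a real implementation would build symbolic graph nodes rather than numbers:

```python
# A minimal numeric sketch of the build_grad recursion (Algorithms 5-6).

class Node:
    def __init__(self, value, inputs=(), bprop=None):
        self.value = value          # numeric value from forward prop
        self.inputs = list(inputs)  # get_inputs: parents in the graph
        self.bprop = bprop          # op.bprop(inputs, X, G) for this op
        self.consumers = []         # get_consumers: children in the graph
        for p in inputs:
            p.consumers.append(self)

def build_grad(V, z, grad_table):
    if V in grad_table:                       # already computed
        return grad_table[V]
    if V is z:
        grad_table[V] = 1.0                   # dz/dz = 1
        return 1.0
    total = 0.0
    for C in V.consumers:                     # sum over paths through consumers
        D = build_grad(C, z, grad_table)      # gradient on the consumer
        total += C.bprop(C.inputs, V, D)      # chain rule through C's op
    grad_table[V] = total
    return total

# z = x*y + x, with x=2, y=3  ->  dz/dx = y + 1 = 4, dz/dy = x = 2
x, y = Node(2.0), Node(3.0)
mul = Node(x.value * y.value, (x, y),
           lambda ins, X, G: G * (ins[1].value if X is ins[0] else ins[0].value))
z = Node(mul.value + x.value, (mul, x), lambda ins, X, G: G * 1.0)
table = {}
print(build_grad(x, z, table), build_grad(y, z, table))  # 4.0 2.0
```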
Ex: backprop for MLP training
- As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron
- We use Minibatch stochastic gradient descent
- Backpropagation algorithm is used to compute
the gradient of the cost on a single minibatch
- We use a minibatch of examples from the
training set formatted as a design matrix X, and a vector of associated class labels y
Ex: details of MLP training
- The network computes a layer of hidden features H = max{0, XW(1)}
  – There are no biases in the model
- The graph language has a relu operation to compute max{0, Z}
- Predictions (unnormalized log probabilities over the classes) are given by HW(2)
- The graph language includes a cross-entropy operation
  – It computes the cross-entropy between the targets y and the probability distribution defined by the unnormalized log probabilities
  – The resulting cross-entropy defines the cost JMLE
  – We include a regularization term
Forward propagation graph
- Total cost: J = JMLE + λ ( Σi,j (Wi,j(1))² + Σi,j (Wi,j(2))² )
Computational Graph of Gradient
- It would be large and tedious for this
example
- One benefit of the back-propagation algorithm is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually
Tracing behavior of Backprop
- Looking at the forward prop graph
- To train, we wish to compute both ∇W(1) J and ∇W(2) J
- There are two different paths leading backward from J to the weights:
  – one through the weight decay cost
    - It always contributes 2λW(i) to the gradient on W(i)
  – the other through the cross-entropy cost
    - It is more complicated
Cross-entropy cost
- Let G be the gradient on the unnormalized log probabilities U(2), provided by the cross-entropy operation
- Backprop needs to explore two branches:
  – On the shorter branch it adds HTG to the gradient on W(2)
    - using the backpropagation rule for the second argument of the matrix multiplication operation
  – The other branch is the longer path descending along the network
    - First backprop computes ∇H J = G W(2)T
    - Next the relu operation uses its backpropagation rule to zero out the components of the gradient corresponding to entries of U(1) that were less than 0. Call the result G′
    - The backpropagation rule for the second argument of matmul then adds XTG′ to the gradient on W(1)
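The two branches can be traced in numpy for this example network. Using softmax cross-entropy to obtain G (softmax probabilities minus the one-hot targets, averaged over the minibatch) is my assumption for concreteness:

```python
# A sketch tracing the cross-entropy branch for H = relu(X W1), U2 = H W2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))               # minibatch design matrix
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))
y = np.array([0, 1, 0, 1, 1])                 # class labels

U1 = X @ W1                                   # pre-activation
H = np.maximum(0.0, U1)                       # relu
U2 = H @ W2                                   # unnormalized log probabilities

# G: gradient of the (mean) cross-entropy cost on U2
P = np.exp(U2 - U2.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
G = P.copy(); G[np.arange(len(y)), y] -= 1.0; G /= len(y)

grad_W2 = H.T @ G                             # shorter branch: H^T G
GH = G @ W2.T                                 # grad on H: G W2^T
Gp = GH * (U1 > 0)                            # relu bprop zeroes U1 < 0 entries
grad_W1 = X.T @ Gp                            # longer branch: X^T G'
print(grad_W1.shape, grad_W2.shape)           # (4, 3) (3, 2)
```

(The weight decay branch would simply add 2λW(i) to each gradient.)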
After Gradient Computation
- It is the responsibility of SGD or another optimization algorithm to use the gradients to update the parameters
Backpropagation as automatic differentiation
Summary
- Neurons are arranged into fully-connected layers
- The abstraction of a layer has the nice property that it
allows us to use efficient vectorized code (e.g. matrix multiplications)
- Neural networks are not really neural
- Neural networks: bigger = better (but might have to
regularize more strongly)
- Backpropagation is a type of automatic differentiation in which all computations are local
- Disadvantage: a bprop rule must be defined for every operation – limits generality
- Advantage: a customized rule for every operation – enables speed
- Whether other methods can be used is an open question
References
- Calculus on Computational Graphs: Backpropagation
- https://colah.github.io/posts/2015-08-Backprop/