SLIDE 1
TensorFlow Workshop 2018
Understanding Neural Networks Part I: Artificial Neurons and Network Optimization
Nick Winovich
Department of Mathematics, Purdue University
July 2018
SLIDE 2
SLIDE 3
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 5
Artificial Neural Networks
Neural networks are a class of simple, yet effective, “computing systems” with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations.
Deep Learning by Goodfellow, Bengio, and Courville: http://www.deeplearningbook.org/
Convolutional Neural Networks for Visual Recognition at Stanford: http://cs231n.stanford.edu/
SLIDE 6
Artificial Neurons
Diagram modified from Stack Exchange post answered by Gonzalo Medina.
Weights are first used to scale inputs; the results are summed with a bias term and passed through an activation function.
SLIDE 7
Formula and Vector Representation
The diagram from the previous slide can be interpreted as:
y = f ( x1 · w1 + x2 · w2 + x3 · w3 + b )
which can be conveniently represented in vector form via:
y = f( w^T x + b )
by interpreting the neuron inputs and weights as column vectors.
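As a concrete illustration, here is a minimal NumPy sketch of this computation (the input values, weights, bias, and the tanh activation are illustrative assumptions, not values from the slides):

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])   # neuron inputs (x1, x2, x3); illustrative values
    w = np.array([0.1, 0.4, -0.2])   # weights (w1, w2, w3)
    b = 0.3                          # bias term
    f = np.tanh                      # activation function (one possible choice)

    y = f(w @ x + b)                 # y = f( w^T x + b )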
SLIDE 8
Artificial Neurons: Multiple Outputs
SLIDE 9
Matrix Representation
This corresponds to a pair of equations, one for each output:
y1 = f( w1^T x + b1 ) ,  y2 = f( w2^T x + b2 )
which can be represented in matrix form by the system:
y = f( W x + b )
where we assume the activation function has been vectorized.
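The matrix form translates directly into code; a minimal NumPy sketch with three inputs and two outputs (shapes and values are illustrative):

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])      # input vector, shape (3,)
    W = np.array([[0.1, 0.4, -0.2],     # row 1 holds w1^T
                  [0.3, -0.5, 0.8]])    # row 2 holds w2^T
    b = np.array([0.3, -0.1])           # bias vector (b1, b2)

    y = np.tanh(W @ x + b)              # y = f( W x + b ), f applied entrywise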
SLIDE 10
Fully-Connected Neural Layers
The resulting layers, referred to as fully-connected or dense, can be visualized as a collection of nodes connected by edges corresponding to weights (biases/activations are typically omitted).
SLIDE 11
Floating Point Operation Count
Matrix-Vector Multiplication: for a weight matrix W ∈ R^{M×N} and input x ∈ R^N, each of the M entries of Wx is computed as w_{i1}·x1 + . . . + w_{iN}·xN. This requires MN multiplications in total (one per matrix entry) and M(N − 1) additions (summing N products for each of the M outputs).
SLIDE 12
Floating Point Operation Count
So we see that when bias terms are omitted, the FLOPs required for a neural connection between N inputs and M outputs is:
2 MN − M = MN multiplies + M(N − 1) adds
When bias terms are included, an additional M addition operations are required, resulting in a total of 2MN FLOPs.
Note: This omits the computation required for applying the activation function to the M values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity.
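For a sense of scale, a dense connection from N = 1024 inputs to M = 512 outputs (with biases) therefore costs 2MN = 2 · 512 · 1024 = 1,048,576 FLOPs per forward evaluation; these specific layer sizes are illustrative, not taken from the slides.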
SLIDE 13
Activation Functions
Activation functions are a fundamental component of neural network architectures; these functions are responsible for:
- Providing all of the network's non-linear modeling capacity
- Controlling the gradient flows that guide the training process
While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs), with the most trivial function and derivative definitions, often suffice.
SLIDE 14
Activation Functions
Rectified Linear Unit (ReLU):
f(x) = x for x ≥ 0 ,  f(x) = 0 for x < 0
SoftPlus Activation:
f(x) = ln( 1 + exp(x) )
SLIDE 15
Activation Functions
Sigmoidal Unit:
f(x) = 1 / ( 1 + exp(−x) )
Hyperbolic Tangent Unit:
f(x) = tanh(x)
SLIDE 16
Activation Functions (Parameterized)
Exponential Linear Unit (ELU):
f_α(x) = x for x ≥ 0 ,  f_α(x) = α · ( exp(x) − 1 ) for x < 0
Leaky Rectified Linear Unit:
f_α(x) = x for x ≥ 0 ,  f_α(x) = α · x for x < 0
SLIDE 17
Activation Functions (Learnable Parameters)
Parameterized ReLU (PReLU):
f_β(x) = x for x ≥ 0 ,  f_β(x) = β · x for x < 0  (with β learned during training)
Swish Units:
f_β(x) = x / ( 1 + exp(−β · x) )
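For reference, a minimal NumPy sketch of the activations discussed above (the α and β parameters are placeholders; in the PReLU/Swish case they would be learned):

    import numpy as np

    def relu(x):              return np.maximum(x, 0.0)
    def softplus(x):          return np.log1p(np.exp(x))        # ln(1 + e^x)
    def sigmoid(x):           return 1.0 / (1.0 + np.exp(-x))
    def elu(x, alpha):        return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
    def leaky_relu(x, alpha): return np.where(x >= 0, x, alpha * x)
    def prelu(x, beta):       return np.where(x >= 0, x, beta * x)
    def swish(x, beta):       return x * sigmoid(beta * x)

    z = np.linspace(-3.0, 3.0, 7)
    print(relu(z), swish(z, beta=1.0))   # np.tanh covers the hyperbolic tangent unit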
SLIDE 18
Hidden Layers
Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing.
For example, we can first define a layer such as:
h = f1 ( W1 x + b1 )
and construct a subsequent layer to produce the final output:
y = f2 ( W2 h + b2 )
SLIDE 19
Hidden Layers
SLIDE 20
Multiple Hidden Layers
SLIDE 21
Multiple Hidden Layers
Multiple hidden layers can easily be defined in the same way:
h1 = f1( W1 x + b1 )
h2 = f2( W2 h1 + b2 )
y = f3( W3 h2 + b3 )
One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
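A minimal NumPy sketch of this three-layer forward pass (the layer widths, random initialization, and ReLU/identity activation choices are illustrative assumptions):

    import numpy as np

    np.random.seed(0)
    # 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs (illustrative sizes)
    W1, b1 = np.random.randn(4, 3), np.zeros(4)
    W2, b2 = np.random.randn(4, 4), np.zeros(4)
    W3, b3 = np.random.randn(2, 4), np.zeros(2)

    relu = lambda z: np.maximum(z, 0.0)

    x  = np.array([0.5, -1.0, 2.0])
    h1 = relu(W1 @ x + b1)      # h1 = f1( W1 x + b1 )
    h2 = relu(W2 @ h1 + b2)     # h2 = f2( W2 h1 + b2 )
    y  = W3 @ h2 + b3           # y  = f3( W3 h2 + b3 ), identity output activation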
SLIDE 22
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 23
Universal Approximators: Cybenko (1989)
Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp.303-314.
Basic Idea of Result: Let I_n denote the unit hypercube in R^n; the collection of functions which can be expressed in the form
G(x) = Σ_{i=1}^{N} α_i · σ( w_i^T x + b_i ) ,  x ∈ I_n
is dense in the space of continuous functions C(I_n) defined on I_n: i.e. for all f ∈ C(I_n) and ε > 0, there exist constants N, α_i, w_i, b_i such that
| f(x) − Σ_{i=1}^{N} α_i · σ( w_i^T x + b_i ) | < ε   for all x ∈ I_n
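To see the form of the theorem in action, a small NumPy experiment: fix random inner weights w_i and biases b_i, and fit the outer coefficients α_i by least squares to approximate a target f on I_1 = [0, 1]. The target function, N, and the random-feature construction are illustrative assumptions; the theorem guarantees existence of an approximation, not this particular recipe:

    import numpy as np

    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))       # sigmoidal activation

    np.random.seed(0)
    N = 50                                           # number of terms in the sum
    w = 10.0 * np.random.randn(N)                    # fixed random weights w_i
    b = 10.0 * np.random.randn(N)                    # fixed random biases b_i

    x = np.linspace(0.0, 1.0, 200)                   # grid on I_1 = [0, 1]
    f = np.sin(2.0 * np.pi * x) + 0.5 * x            # a target function in C(I_1)

    Phi = sigma(np.outer(x, w) + b)                  # Phi[k, i] = sigma(w_i x_k + b_i)
    alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # fit the coefficients alpha_i
    print(np.abs(f - Phi @ alpha).max())             # sup-norm error; small for modest N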
SLIDE 24
Universal Approximators: Hornik et al. / Funahashi
Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp.359-366.
Funahashi, K.I., 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), pp.183-192.
Summary of Results: For any compact set K ⊂ R^n, multi-layer feedforward neural networks are dense in the space of continuous functions C(K) on K, with respect to the supremum norm, provided that the activation function used for the network layers is:
- Continuous and increasing
- Non-constant and bounded
SLIDE 25
Universal Approximators: Leshno et al. (1992)
Leshno, M., Lin, V.Y., Pinkus, A. and Schocken, S., 1992. Multilayer feedforward networks with a non-polynomial activation function can approximate any function.
“A standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial.” (Leshno et al.)
- Here the notion of "approximation" is also defined in terms of the supremum norm, and the domains are assumed to be compact
- The result does not hold without thresholds (i.e. bias terms)
SLIDE 26
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 27
Overfitting
In some cases the network is capable of learning "too much" from the specific training data used; this phenomenon is referred to as overfitting and occurs when the model performs well on the training dataset, but does not generalize to accurate predictions on data which has not been seen during training. Consider, for example:
SLIDE 28
L1 and L2 Weight Regularization
One simple technique to help avoid overfitting is to add a penalty for network parameters with large L1 or L2 norms. This is similar to the underlying idea behind LASSO regression and can be loosely interpreted as a form of applying the principle of Ockham’s Razor: i.e. the simplest solution often turns out to be the correct solution.
- L2 regularization is a fairly general regularization technique which places an emphasis on reducing the largest weights
- L1 regularization helps to encourage sparsity in the network and improves performance when the problem has a sparse solution
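As a minimal sketch of how such penalties are typically attached to a base loss (the NumPy weights and the λ values are illustrative assumptions):

    import numpy as np

    weights = [np.random.randn(4, 3), np.random.randn(2, 4)]    # illustrative layers
    lam1, lam2 = 1e-4, 1e-3                                     # penalty strengths

    l1_penalty = lam1 * sum(np.abs(W).sum() for W in weights)   # encourages sparsity
    l2_penalty = lam2 * sum((W ** 2).sum() for W in weights)    # shrinks large weights

    # total objective minimized during training (base_loss is the data-fit term)
    # loss = base_loss + l1_penalty + l2_penalty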
SLIDE 29
Applying Dropout
Applying dropout to hidden network layers also helps to avoid overfitting. This technique consists of removing, or dropping, units/nodes randomly at each step of the training process.
- A fixed drop rate p ∈ (0, 1) is specified prior to training, and nodes in the layer are dropped according to a collection of i.i.d. Bernoulli samples drawn at each training step
- Since all nodes will be used after training, the outputs of the remaining nodes are rescaled by a factor of 1/(1 − p) to ensure that the expected values during training and testing coincide
Loosely speaking, this can be thought of as a way to ensure that no individual node plays too large of a role in the final prediction.
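A minimal NumPy sketch of this training-time procedure, sometimes called "inverted dropout" (the layer values are placeholders):

    import numpy as np

    def dropout(h, p, training=True):
        # Drop each node with probability p; rescale survivors by 1/(1 - p)
        if not training:
            return h                              # all nodes are used at test time
        keep = np.random.rand(*h.shape) >= p      # i.i.d. Bernoulli keep-mask
        return h * keep / (1.0 - p)               # match expected values at test time

    h = np.ones(8)                                # placeholder hidden-layer outputs
    print(dropout(h, p=0.25))                     # ~25% zeroed, survivors scaled by 4/3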
SLIDE 30
Example: Dropout with Rate = 0.25 [ Training ]
SLIDE 31
Example: Dropout with Rate = 0.25 [ Training ]
SLIDE 32
Example: Dropout with Rate = 0.25 [ Training ]
SLIDE 33
Example: Dropout with Rate = 0.25 [ Testing ]
SLIDE 34
Motivation for Batch Normalization
Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Internal Covariate Shift: As network parameters change during training, the distributions of the input values to each layer change.
- Training could be more efficient if the layers were receiving inputs with a fixed distribution throughout the entire process
- Achieving this using normalization requires a technique which is compatible with gradient-based optimization
SLIDE 35
Batch Normalization
The proposed batch normalization technique corresponds to first performing a normalization with respect to the batch statistics:
x̂ = ( x − µ_B ) / √( σ²_B + ε )
with  µ_B = (1/m) Σ_{x∈B} x  and  σ²_B = (1/m) Σ_{x∈B} ( x − µ_B )²
where m is a fixed batch size, and ε > 0 is included for numerical stability. A linear map with learnable parameters γ and β is then applied:
y_i = γ · x̂_i + β
and the normalized values {y_i} are passed to the activation function to apply the non-linear transformation for the layer.
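A minimal NumPy sketch of the training-time computation for a batch of feature vectors (shapes and values are illustrative):

    import numpy as np

    def batch_norm_train(X, gamma, beta, eps=1e-5):
        mu  = X.mean(axis=0)                     # batch mean, per feature
        var = X.var(axis=0)                      # batch variance, per feature
        X_hat = (X - mu) / np.sqrt(var + eps)    # normalize with batch statistics
        return gamma * X_hat + beta, mu, var     # learnable map y = gamma * x_hat + beta

    X = 5.0 * np.random.randn(32, 4) + 100.0     # batch of m = 32 examples, 4 features
    y, mu, var = batch_norm_train(X, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0), y.std(axis=0))         # approximately 0 and 1 per feature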
SLIDE 36
Batch Normalization after Training
After training, we need a way to freeze the model in place for making predictions. This is accomplished by specifying a fixed normalization rule for each layer; rather than use sample statistics from a specific batch, it is natural to incorporate the entire dataset:
x̂ = ( x − µ ) / √( σ² + ε )
where µ is the empirical mean E_D[x] and σ² is the variance Var_D[x] taken with respect to the complete training dataset D. These values can be tracked using moving averages during training to avoid direct computation and provide accurate estimates when parameter changes are small near the end of the training process.
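A short sketch of how the running statistics might be tracked and then frozen for predictions (the momentum value 0.99 and the synthetic batches are illustrative assumptions):

    import numpy as np

    momentum, eps = 0.99, 1e-5
    running_mu, running_var = np.zeros(4), np.ones(4)      # tracked during training

    for step in range(100):                                # training loop (illustrative)
        Xb = 5.0 * np.random.randn(32, 4) + 100.0          # current mini-batch
        mu, var = Xb.mean(axis=0), Xb.var(axis=0)          # batch statistics
        running_mu  = momentum * running_mu  + (1.0 - momentum) * mu
        running_var = momentum * running_var + (1.0 - momentum) * var

    # After training: frozen normalization rule used for all predictions
    gamma, beta = np.ones(4), np.zeros(4)
    X_test = 5.0 * np.random.randn(8, 4) + 100.0
    y = gamma * (X_test - running_mu) / np.sqrt(running_var + eps) + beta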
SLIDE 37
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 39
Evaluating a Network
Up to this point, we have not discussed how to quantify the performance of a neural network. The most common strategy is to define loss functions for the network, with "low loss" corresponding to "high performance". In supervised training, where the true labels/solutions are known, the network loss function is typically composed of:
- A primary loss term corresponding to a measure of how close the network predictions are to the true solutions
- Auxiliary loss components, such as weight regularization penalties, designed to help guide the training process
Once the network performance is quantified, we can specify an optimization algorithm designed to minimize the network loss.
SLIDE 40
Loss Functions
Two of the most common applications of neural networks are regression (for predicting continuous properties/values) and classification (for predicting discrete properties or labels).
For regression, a standard loss is given by the mean squared error:
Loss = (1/|I|) Σ_{i∈I} ( ŷ_i − y_i )²
where I are the indices of the output data (e.g. pixels of an image).
For classification, softmax cross entropy can be used when labels are mutually exclusive (e.g. classifying a digit as "0", "1", "2", etc.); sigmoid cross entropy can be used when labels are not mutually exclusive (e.g. determining which objects are in an image).
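Minimal NumPy sketches of these losses (placeholder inputs; the max-subtraction in the softmax is a standard trick for numerical stability):

    import numpy as np

    def mse(y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    def softmax_cross_entropy(logits, one_hot):
        # Mutually exclusive labels; logits has shape (batch, classes)
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.sum(one_hot * log_probs, axis=1))

    def sigmoid_cross_entropy(logits, targets):
        # Independent (non-exclusive) binary labels
        p = 1.0 / (1.0 + np.exp(-logits))
        return -np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))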
SLIDE 41
One Hot Encoding
It is also fundamentally important to consider how the data will be represented within the network. When classifying digits, for example, networks will typically perform extremely poorly if the labels are represented as a single number: "0.0", "1.0", "2.0", etc. To better distinguish the differences between e.g. "0", "1", and "2", it is useful to instead store the values using a one hot encoding:
"0" = [1, 0, 0]^T   "1" = [0, 1, 0]^T   "2" = [0, 0, 1]^T
One hot encodings are also typically used for word prediction (by specifying a dictionary of possibilities) and character level predictions (by specifying the admissible character set).
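In NumPy, one compact way to build such an encoding (three classes assumed for illustration):

    import numpy as np

    labels  = np.array([0, 1, 2, 1])    # integer class labels
    one_hot = np.eye(3)[labels]         # row k of eye(3) encodes class k
    # one_hot[0] -> [1., 0., 0.]  ("0")
    # one_hot[1] -> [0., 1., 0.]  ("1")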
SLIDE 42
Data Preparation
In general, it is also good practice to process/prepare the input values of a dataset before training. For example, if the input values are centered around 100 and all lie within the interval [99.9, 100.1], it is typically better to center and rescale these values beforehand:
x̂ = ( x − µ ) / √( σ² + ε )
where  µ = (1/|D|) Σ_{x′∈D} x′  and  σ² = (1/|D|) Σ_{x′∈D} ( x′ − µ )²
The values of µ and σ² can then be saved, and predictions can be made on arbitrary inputs during testing by applying the above normalization before passing the test inputs to the network.
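A minimal sketch of this preprocessing step in NumPy (the synthetic data mimics the interval described above):

    import numpy as np

    X_train = 99.9 + 0.2 * np.random.rand(1000, 1)    # inputs clustered near 100
    mu, var = X_train.mean(axis=0), X_train.var(axis=0)

    def standardize(X, mu, var, eps=1e-8):
        return (X - mu) / np.sqrt(var + eps)          # reuse saved mu/var at test time

    X_train_n = standardize(X_train, mu, var)
    # later:  X_test_n = standardize(X_test, mu, var)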
SLIDE 43
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 44
Gradient Descent
Gradient descent provides a simple, iterative algorithm for finding local minima of a real-valued function F numerically. The main idea behind gradient descent is relatively straightforward: compute the gradient of the function that we want to minimize and take a step in the direction of steepest descent ( i.e. −∇F(x) ). The iteration step of the algorithm is defined in terms of a step size parameter α ( or by a decreasing sequence {αi} ) by setting:
xi+1 = xi − α · ∇F(xi)
Note: Convergence is only guaranteed under certain assumptions on the function F (e.g. convexity, Lipschitz continuity, etc.).
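A minimal sketch of the iteration on a simple convex function (the quadratic, step size, and iteration count are illustrative):

    grad_F = lambda x: 2.0 * (x - 3.0)     # gradient of F(x) = (x - 3)^2
    x, alpha = 10.0, 0.1                   # initial point and step size

    for i in range(100):
        x = x - alpha * grad_F(x)          # x_{i+1} = x_i - alpha * grad F(x_i)

    print(x)                               # approaches the minimizer x* = 3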
SLIDE 45
Loss Functions for Large Datasets
In the context of neural networks, gradient descent appears to provide a reasonable approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector θ0, and the iteration steps from the previous slide could, in theory, be used to identify the optimal parameters θ∗ for the model. The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset D:
F(θ) = (1/|D|) Σ_{x∈D} f(x | θ)
where f(x | θ) denotes the loss for a single example x when using the model parameters θ. So the standard algorithm would require computing the average loss over the full dataset at each step of the iterative scheme...
SLIDE 46
Stochastic Gradient Descent
Since computing the true gradient ∇F(θ) at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data:
∇F(θ) ≈ ∇F̂_i(θ) = (1/|B_i|) Σ_{x∈B_i} ∇f(x | θ)
where the batches {B_i} partition the dataset into smaller subsets (typically of equal size). The iteration step is then taken to be:
θ_{i+1} = θ_i − α · ∇F̂_i(θ_i)
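A sketch of the resulting mini-batch loop for a least-squares model (the data, linear model, and batch size are illustrative assumptions):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(1000, 3)                  # dataset D of inputs
    y = X @ np.array([1.0, -2.0, 0.5])            # targets from a known linear map
    theta, alpha, m = np.zeros(3), 0.1, 32        # parameters, step size, batch size

    for epoch in range(20):
        perm = np.random.permutation(len(X))      # shuffle, then partition into batches
        for s in range(0, len(X), m):
            idx = perm[s:s + m]                   # indices of mini-batch B_i
            r = X[idx] @ theta - y[idx]           # batch residuals
            grad = 2.0 * X[idx].T @ r / len(idx)  # gradient of the batch MSE
            theta -= alpha * grad                 # theta_{i+1} = theta_i - alpha * grad

    print(theta)                                  # recovers ~ [1.0, -2.0, 0.5]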
SLIDE 47
Potential Obstacles
- Fixed learning rates typically lead to suboptimal performance
- Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem in consideration
- Different parameters often require different learning rates
- Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process
SLIDE 48
Importance of Selecting the Correct Learning Rate
[Figure: side-by-side panels contrasting "High Learning Rate" and "Low Learning Rate"]
SLIDE 49
Nesterov Momentum
Nesterov, Y.E., 1983. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR (Vol. 269, pp. 543-547).
One method for learning from previous steps is to incorporate momentum into the update policy. This can be done by setting:
v_i = γ · v_{i−1} + α · ∇F(θ_i)
θ_{i+1} = θ_i − v_i
An accelerated form was introduced by Nesterov in 1983 which leverages the value of "looking ahead" before making updates:
v_i = γ · v_{i−1} + α · ∇F( θ_i − γ · v_{i−1} )
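A sketch of both update rules on a simple quadratic (the gradient, rates, and iteration counts are illustrative):

    grad_F = lambda th: 2.0 * (th - 3.0)      # gradient of F(th) = (th - 3)^2
    alpha, gamma = 0.05, 0.9                  # step size and momentum coefficient

    th, v = 10.0, 0.0                         # classical momentum
    for i in range(200):
        v  = gamma * v + alpha * grad_F(th)
        th = th - v

    th_n, v = 10.0, 0.0                       # Nesterov: gradient at the look-ahead point
    for i in range(200):
        v    = gamma * v + alpha * grad_F(th_n - gamma * v)
        th_n = th_n - v

    print(th, th_n)                           # both approach the minimizer 3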
SLIDE 50
AdaGrad and RMSProp
Duchi, J., Hazan, E. and Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), pp.2121-2159.
Hinton, G., Srivastava, N. and Swersky, K. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
- AdaGrad defines parameter-specific updates which are normalized by the sum of the squares of previous gradients; this leads to a natural learning rate decay (often too much)
- RMSProp keeps moving averages of the squared gradients for each parameter which are used to rescale updates
- AdaDelta provides another method for rescaling updates, and many other variants (e.g. including momentum) exist ...
SLIDE 51
Exponential Moving Averages
One common method for estimating an average incrementally is to keep an exponential moving average of the values. This method applies an exponential decay to terms in the average which places an emphasis on the most recent values; this allows the average to move, or correct itself, as the distribution of the values changes. To track the gradient gt = ∇F(θt−1) of the loss with respect to the parameters θ, we can define an average recursively by setting:
m_0 = 0 ,  m_t = β · m_{t−1} + (1 − β) · g_t   ⇒   m_t = (1 − β) Σ_{τ=1}^{t} β^{t−τ} g_τ
where the parameter β is used to specify the exponential decay rate and is typically taken to be close to, but smaller than, 1.
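A short sketch of the recursion (the stream of "gradients" is a synthetic stand-in):

    import numpy as np

    np.random.seed(0)
    g_stream = np.random.randn(1000) + 5.0   # noisy values with mean ~5
    beta, m = 0.9, 0.0

    for g in g_stream:
        m = beta * m + (1.0 - beta) * g      # m_t = beta * m_{t-1} + (1 - beta) * g_t

    print(m)                                 # hovers near the recent mean, ~5.0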
SLIDE 52
The Adam Optimization Algorithm
Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
The Adam optimizer, derived from "adaptive moment estimation", proposes keeping exponential moving averages of both the first moment g_t and the (uncentered) second moment g_t² of the gradient. In addition, a bias correction is introduced to address the issue of arbitrarily initializing the exponential moving averages with zero. The first moment average m_t and second moment average v_t are defined with decay rates β_1 and β_2, respectively, and the bias correction procedure is defined by the rescaling:
m̂_t = m_t / (1 − β_1^t) ,  v̂_t = v_t / (1 − β_2^t)
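Collecting the pieces, a minimal NumPy sketch of the Adam update on a toy problem (hyperparameter values follow common defaults; this is a sketch of the published algorithm, not a production implementation):

    import numpy as np

    grad_F = lambda th: 2.0 * (th - 3.0)           # illustrative gradient
    th, alpha = 10.0, 0.1
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    m, v = 0.0, 0.0                                # moving averages initialized at zero

    for t in range(1, 1001):
        g = grad_F(th)
        m = beta1 * m + (1.0 - beta1) * g          # first moment average
        v = beta2 * v + (1.0 - beta2) * g**2       # second (uncentered) moment average
        m_hat = m / (1.0 - beta1**t)               # bias corrections
        v_hat = v / (1.0 - beta2**t)
        th -= alpha * m_hat / (np.sqrt(v_hat) + eps)

    print(th)                                      # approaches the minimizer 3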
SLIDE 53
The Adam Optimization Algorithm
SLIDE 54
Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
SLIDE 55
Backpropagation
Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1986. Learning representations by back-propagating errors. Nature, 323(6088), p.533.
While this theoretical framework for neural network optimization may seem complete, one fundamental question still remains: "How are the gradients of network parameters actually computed?" One approach, referred to as backpropagation, was proposed in 1986; it dealt with sigmoidal activations σ and defined the loss E in terms of predictions y_j and true values d_j via:
E = (1/2) Σ_j ( y_j − d_j )²
SLIDE 56
Backpropagation: Rumelhart, Hinton, and Williams
SLIDE 57
Backpropagation: Rumelhart, Hinton, and Williams
SLIDE 58
Backpropagation: Rumelhart, Hinton, and Williams
SLIDE 59
Backpropagation: Rumelhart, Hinton, and Williams
SLIDE 60
Backpropagation
Now that the error contribution associated with y_i is known, i.e.
∂E/∂y_i = Σ_j ( ∂E/∂x_j ) · w_{ji}
contributions from network parameters of the previous layer can be computed using the same methodology that was applied to y_j, e.g.
∂E/∂x_i = ( ∂E/∂y_i ) · ( dy_i / dx_i ) = ( ∂E/∂y_i ) · (d/dx_i) σ(x_i)
In this way, gradient calculations for all network parameters can be computed by propagating back the error contributions from the parameters in subsequent layers which depend on their values.
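A minimal NumPy sketch of these propagation rules for one sigmoid hidden layer followed by a linear output, with the squared error E above (network sizes and data are illustrative):

    import numpy as np

    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

    np.random.seed(0)
    W1, W2 = np.random.randn(4, 3), np.random.randn(2, 4)
    x, d = np.random.randn(3), np.random.randn(2)

    # Forward pass
    x1 = W1 @ x                        # hidden pre-activations
    y1 = sigma(x1)                     # hidden outputs
    y2 = W2 @ y1                       # predictions (linear output layer)
    E  = 0.5 * np.sum((y2 - d) ** 2)

    # Backward pass: propagate error contributions layer by layer
    dE_dy2 = y2 - d                    # dE/dy_j at the output
    dE_dW2 = np.outer(dE_dy2, y1)
    dE_dy1 = W2.T @ dE_dy2             # dE/dy_i = sum_j (dE/dx_j) * w_ji
    dE_dx1 = dE_dy1 * y1 * (1.0 - y1)  # through sigma: sigma'(x) = sigma(x)(1 - sigma(x))
    dE_dW1 = np.outer(dE_dx1, x)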
SLIDE 61
Symbolic and Numeric Differentiation
Two commonly used methods for automating the process of computing derivatives are symbolic differentiation and numeric differentiation; however, both of these methods have severe practical limitations in the context of training neural networks.
- Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; the resulting expressions can be lengthy and contain unnecessary computations, however, and are inefficient unless additional "expression simplification" steps are included
- Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks
SLIDE 62
Automatic Differentiation
Automatic differentiation (AD) in "reverse mode" provides a generalization of backpropagation and gives us a way to carry out the required gradient calculations exactly and efficiently.
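As a sketch of what this looks like in practice, using the TensorFlow 1.x graph API that was current for this 2018 workshop (shapes are illustrative), the framework records the operations used to build the loss and derives exact reverse-mode gradients automatically:

    import tensorflow as tf                           # TensorFlow 1.x style API

    x = tf.placeholder(tf.float32, shape=[None, 3])   # network inputs
    d = tf.placeholder(tf.float32, shape=[None, 2])   # true values
    W = tf.Variable(tf.random_normal([3, 2]))
    b = tf.Variable(tf.zeros([2]))

    y    = tf.sigmoid(tf.matmul(x, W) + b)            # network predictions
    loss = tf.reduce_mean(tf.square(y - d))           # squared-error loss

    grads = tf.gradients(loss, [W, b])                # exact gradients via reverse-mode AD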