Understanding Neural Networks Part I: Artificial Neurons and Network Optimization

SLIDE 1

TensorFlow Workshop 2018

Understanding Neural Networks

Part I: Artificial Neurons and Network Optimization

Nick Winovich
Department of Mathematics, Purdue University

July 2018

SLIDE 2

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 5

Artificial Neural Networks

Neural networks are a class of simple, yet effective, “computing systems” with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations.

Deep Learning by Goodfellow, Bengio, and Courville: http://www.deeplearningbook.org/
Convolutional Neural Networks for Visual Recognition at Stanford: http://cs231n.stanford.edu/

SLIDE 6

Artificial Neurons

Diagram modified from Stack Exchange post answered by Gonzalo Medina.

Weights are first used to scale the inputs; the results are summed with a bias term and passed through an activation function.

SLIDE 7

Formula and Vector Representation

The diagram from the previous slide can be interpreted as:

\[ y \;=\; f\left( x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + b \right) \]

which can be conveniently represented in vector form via:

\[ y \;=\; f\left( w^T x + b \right) \]

by interpreting the neuron inputs and weights as column vectors.
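For concreteness, a single neuron of this form can be evaluated in a few lines of NumPy (a minimal sketch; the input, weight, and bias values below are arbitrary placeholders, and tanh stands in for a generic activation f):

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single artificial neuron: y = f(w^T x + b)."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1, x2, x3
w = np.array([0.1,  0.4, -0.2])  # weights w1, w2, w3
b = 0.05                         # bias term
print(neuron(x, w, b))
```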

SLIDE 8

Artificial Neurons: Multiple Outputs

SLIDE 9

Matrix Representation

This corresponds to a pair of equations, one for each output:

\[ y_1 = f\left( w_1^T x + b_1 \right), \qquad y_2 = f\left( w_2^T x + b_2 \right) \]

which can be represented in matrix form by the system:

y = f ( W x + b )

where we assume the activation function has been vectorized.
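A hedged NumPy sketch of this matrix form (the weights, biases, and activation below are again arbitrary placeholders):

```python
import numpy as np

def dense_layer(x, W, b, f=np.tanh):
    """Fully-connected layer: y = f(W x + b), with f applied elementwise."""
    return f(W @ x + b)

W = np.array([[0.1,  0.4, -0.2],
              [0.3, -0.1,  0.5]])   # shape (M, N) = (2, 3): one row per output
b = np.array([0.05, -0.02])         # one bias per output
x = np.array([0.5, -1.2, 3.0])
print(dense_layer(x, W, b))         # two outputs y1, y2
```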

SLIDE 10

Fully-Connected Neural Layers

The resulting layers, referred to as fully-connected or dense, can be visualized as a collection of nodes connected by edges corresponding to weights (biases/activations are typically omitted).

SLIDE 11

Floating Point Operation Count

Matrix-Vector Multiplication

\[
\begin{pmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{M1} & \cdots & w_{MN} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}
\;\xrightarrow{\;\text{Mult: } MN\;}\;
\begin{pmatrix} w_{11} x_1 & \cdots & w_{1N} x_N \\ \vdots & \ddots & \vdots \\ w_{M1} x_1 & \cdots & w_{MN} x_N \end{pmatrix}
\;\xrightarrow{\;\text{Add: } M(N-1)\;}\;
\begin{pmatrix} w_{11} x_1 + \cdots + w_{1N} x_N \\ \vdots \\ w_{M1} x_1 + \cdots + w_{MN} x_N \end{pmatrix}
\]

SLIDE 12

Floating Point Operation Count

So we see that, when bias terms are omitted, the number of FLOPs required for a fully-connected layer mapping N inputs to M outputs is:

2MN − M  =  MN multiplies + M(N − 1) adds

When bias terms are included, an additional M addition operations are required, resulting in a total of 2 MN FLOPs. Note: This omits the computation required for applying the activation function to M values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity.
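The count above translates directly into a small helper function (a sketch; as noted, it ignores the cost of the activation function):

```python
def dense_layer_flops(n_inputs, m_outputs, include_bias=True):
    """FLOPs for one fully-connected layer, excluding the activation function."""
    mults = m_outputs * n_inputs                 # M*N multiplications
    adds = m_outputs * (n_inputs - 1)            # M*(N-1) additions
    if include_bias:
        adds += m_outputs                        # +M additions for the bias terms
    return mults + adds

print(dense_layer_flops(3, 2))  # 2*M*N = 12 for N=3 inputs, M=2 outputs
```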

SLIDE 13

Activation Functions

Activation functions are a fundamental component of neural network architectures; these functions are responsible for:

  - Providing all of the network's non-linear modeling capacity
  - Controlling the gradient flows that guide the training process

While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs) with the most trivial function and derivative definitions often suffice.

SLIDE 14

Activation Functions

Rectified Linear Unit (ReLU):
\[ f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases} \]

SoftPlus Activation:
\[ f(x) = \ln\left( 1 + \exp(x) \right) \]
SLIDE 15

Activation Functions

Sigmoidal Unit:
\[ f(x) = \frac{1}{1 + \exp(-x)} \]

Hyperbolic Tangent Unit:
\[ f(x) = \tanh(x) \]
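These four activations can be written directly in NumPy (a simple reference sketch):

```python
import numpy as np

relu     = lambda x: np.maximum(x, 0.0)            # x for x >= 0, else 0
softplus = lambda x: np.log1p(np.exp(x))           # ln(1 + exp(x))
sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))      # 1 / (1 + exp(-x))
tanh     = np.tanh                                 # hyperbolic tangent

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), softplus(x), sigmoid(x), tanh(x), sep="\n")
```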

SLIDE 16

Activation Functions (Parameterized)

Exponential Linear Unit (ELU):
\[ f_\alpha(x) = \begin{cases} x, & x \ge 0 \\ \alpha \, (e^{x} - 1), & x < 0 \end{cases} \]

Leaky Rectified Linear Unit:
\[ f_\alpha(x) = \begin{cases} x, & x \ge 0 \\ \alpha \, x, & x < 0 \end{cases} \]

SLIDE 17

Activation Functions (Learnable Parameters)

Parameterized ReLU (learnable slope β):
\[ f_\beta(x) = \begin{cases} x, & x \ge 0 \\ \beta \, x, & x < 0 \end{cases} \]

Swish Units:
\[ f_\beta(x) = \frac{x}{1 + \exp(-\beta \, x)} \]
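The parameterized variants follow the same pattern; in a NumPy sketch, α is a fixed hyperparameter while β would be a trainable parameter (the default values below are arbitrary placeholders):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def prelu(x, beta):
    return np.where(x >= 0, x, beta * x)           # beta is learned during training

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))           # x * sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 7)
print(elu(x), leaky_relu(x), prelu(x, beta=0.25), swish(x), sep="\n")
```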

SLIDE 18

Hidden Layers

Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing.

For example, we can first define a layer such as:

h = f1 ( W1 x + b1 )

and construct a subsequent layer to produce the final output:

y = f2 ( W2 h + b2 )

SLIDE 19

Hidden Layers

SLIDE 20

Multiple Hidden Layers

SLIDE 21

Multiple Hidden Layers

Multiple hidden layers can easily be defined in the same way:

h1 = f1 ( W1 x + b1 )
h2 = f2 ( W2 h1 + b2 )
y  = f3 ( W3 h2 + b3 )

One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
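Stacking the layer computation from earlier gives a complete forward pass; a minimal sketch (the layer sizes, random weights, and tanh activations are arbitrary choices):

```python
import numpy as np

def dense(x, W, b, f=np.tanh):
    return f(W @ x + b)

def forward(x, params):
    """Forward pass through a stack of fully-connected layers."""
    h = x
    for W, b, f in params:
        h = dense(h, W, b, f)
    return h

rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 3)), np.zeros(8), np.tanh),      # x  -> h1
          (rng.normal(size=(8, 8)), np.zeros(8), np.tanh),      # h1 -> h2
          (rng.normal(size=(1, 8)), np.zeros(1), lambda z: z)]  # h2 -> y (linear output)
print(forward(np.array([0.5, -1.2, 3.0]), params))
```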

SLIDE 22

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 23

Universal Approximators: Cybenko (1989)

Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp.303-314.

Basic Idea of Result: Let I_n denote the unit hypercube in R^n; the collection of functions which can be expressed in the form

\[ \sum_{i=1}^{N} \alpha_i \cdot \sigma\!\left( w_i^T x + b_i \right), \qquad \forall x \in I_n \]

is dense in the space of continuous functions C(I_n) defined on I_n: i.e. for all f ∈ C(I_n) and ε > 0 there exist constants N, α_i, w_i, b_i such that

\[ \left| \, f(x) \;-\; \sum_{i=1}^{N} \alpha_i \cdot \sigma\!\left( w_i^T x + b_i \right) \right| \;<\; \varepsilon \qquad \forall x \in I_n \]
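A small numerical illustration of this form (not Cybenko's argument): with the inner weights and biases drawn at random and only the outer coefficients α_i fit by least squares, a modest sum of sigmoids typically tracks a smooth target quite well. All specific values here are arbitrary choices.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))       # sigmoidal activation
target = lambda x: np.sin(2 * np.pi * x)         # continuous function on [0, 1]

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
N = 50
w = rng.normal(scale=10.0, size=N)               # random inner weights
b = rng.uniform(-10.0, 10.0, size=N)             # random inner biases

Phi = sigma(np.outer(x, w) + b)                  # columns: sigma(w_i * x + b_i)
alpha, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
print("max abs error:", np.abs(Phi @ alpha - target(x)).max())
```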

SLIDE 24

Universal Approximators: Hornik et al. / Funahashi

Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp.359-366.
Funahashi, K.I., 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), pp.183-192.

Summary of Results: For any compact set K ⊂ R^n, multi-layer feedforward neural networks are dense in the space of continuous functions C(K) on K, with respect to the supremum norm, provided that the activation function used for the network layers is:

  - Continuous and increasing
  - Non-constant and bounded

SLIDE 25

Universal Approximators: Leshno et al. (1992)

Leshno, M., Lin, V.Y., Pinkus, A. and Schocken, S., 1992. Multilayer feedforward networks with a non-polynomial activation function can approximate any function.

"A standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial." (Leshno et al.)

  - Here the notion of "approximation" is also defined in terms of the supremum norm, and the domains are assumed to be compact
  - The result does not hold without thresholds (i.e. bias terms)

SLIDE 26

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 27

Overfitting

In some cases the network is capable of learning "too much" from the specific training data used; this phenomenon is referred to as overfitting and occurs when the model performs well on the training dataset, but does not generalize to accurate predictions on data which has not been seen during training. Consider, for example:

SLIDE 28

L1 and L2 Weight Regularization

One simple technique to help avoid overfitting is to add a penalty for network parameters with large L1 or L2 norms. This is similar to the underlying idea behind LASSO regression and can be loosely interpreted as a form of applying the principle of Ockham’s Razor: i.e. the simplest solution often turns out to be the correct solution.

  - L2 regularization is a fairly general regularization technique which places an emphasis on reducing the largest weights
  - L1 regularization helps to encourage sparsity in the network and improves performance when the problem has a sparse solution
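In code, these penalties are simply added to the data-fitting loss; a minimal sketch (the penalty coefficient in the comment is an arbitrary placeholder):

```python
import numpy as np

def regularization_penalty(weight_matrices, l1=0.0, l2=0.0):
    """Sum of L1 and L2 penalties over all weight matrices in the network."""
    return sum(l1 * np.abs(W).sum() + l2 * (W ** 2).sum() for W in weight_matrices)

# total_loss = data_loss + regularization_penalty([W1, W2, W3], l2=1e-4)
```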

SLIDE 29

Applying Dropout

Applying dropout to hidden network layers also helps to avoid overfitting. This technique consists of removing, or dropping, units/nodes randomly at each step of the training process.

  - A fixed drop rate p ∈ (0, 1) is specified prior to training, and nodes in the layer are dropped according to a collection of i.i.d. random Bernoulli samples drawn at each training step
  - Since all nodes will be used after training, the outputs of the remaining nodes are rescaled by a factor of 1/(1 − p) to ensure that the expected values during training and testing coincide

Loosely speaking, this can be thought of as a way to ensure that no individual node plays too large of a role in the final prediction.
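A sketch of this "inverted dropout" convention in NumPy, rescaling by 1/(1 − p) during training so that no change is needed at test time:

```python
import numpy as np

def dropout(h, p, training=True, rng=None):
    """Randomly zero units with probability p during training; rescale survivors."""
    if not training or p == 0.0:
        return h                           # test time: layer is unchanged
    rng = rng or np.random.default_rng()
    keep = rng.random(h.shape) >= p        # i.i.d. Bernoulli keep-mask
    return h * keep / (1.0 - p)            # expected value matches test time
```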

SLIDE 30

Example: Dropout with Rate = 0.25 [ Training ]

SLIDE 33

Example: Dropout with Rate = 0.25 [ Testing ]

SLIDE 34

Motivation for Batch Normalization

Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Internal Covariate Shift: As network parameters change during training, the distributions of the input values to each layer change.

  - Training could be more efficient if the layers were receiving inputs with a fixed distribution throughout the entire process
  - Achieving this using normalization requires a technique which is compatible with gradient-based optimization

SLIDE 35

Batch Normalization

The proposed batch normalization technique corresponds to first performing a normalization with respect to the batch statistics:

\[ \hat{x} \;=\; \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \qquad \text{with} \quad \mu_B = \frac{1}{m} \sum_{x \in B} x , \qquad \sigma_B^2 = \frac{1}{m} \sum_{x \in B} (x - \mu_B)^2 \]

where m is a fixed batch size, and ε > 0 is included for numerical stability. A linear map with learnable parameters γ and β is then applied:

\[ y_i \;=\; \gamma \cdot \hat{x}_i + \beta \]

and the normalized values {y_i} are passed to the activation function to apply the non-linear transformation for the layer.
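A NumPy sketch of this training-time computation over a mini-batch of shape (m, features); the batch statistics are returned as well so they can be tracked, as described on the next slide:

```python
import numpy as np

def batch_norm_train(x_batch, gamma, beta, eps=1e-5):
    """Batch normalization in training mode, using the batch statistics."""
    mu_B = x_batch.mean(axis=0)                      # per-feature batch mean
    var_B = x_batch.var(axis=0)                      # per-feature batch variance
    x_hat = (x_batch - mu_B) / np.sqrt(var_B + eps)  # normalize
    return gamma * x_hat + beta, mu_B, var_B         # learnable scale/shift applied
```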

SLIDE 36

Batch Normalization after Training

After training, we need a way to freeze the model in place for making predictions. This is accomplished by specifying a fixed normalization rule for each layer; rather than use sample statistics from a specific batch, it is natural to incorporate the entire dataset:

\[ \hat{x} \;=\; \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \]

where µ is the empirical mean E_D[x] and σ² is the variance Var_D[x] taken with respect to the complete training dataset D. These values can be tracked using moving averages during training to avoid direct computation and provide accurate estimates when parameter changes are small near the end of the training process.
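A sketch of the corresponding bookkeeping: exponential moving averages of the batch statistics are maintained during training, then frozen and reused at test time (the momentum value 0.9 is an arbitrary choice here, not a value prescribed by the slide):

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu_B, var_B, momentum=0.9):
    """Track dataset-level statistics with exponential moving averages."""
    return (momentum * running_mu + (1 - momentum) * mu_B,
            momentum * running_var + (1 - momentum) * var_B)

def batch_norm_test(x, running_mu, running_var, gamma, beta, eps=1e-5):
    """Batch normalization at test time, using the frozen statistics."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```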

SLIDE 37

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 39

Evaluating a Network

Up until now, we have still not discussed how to quantify the performance of a neural network. The most common strategy for quantifying performance is to define loss functions for the network with “low loss” corresponding to “high performance”. In supervised training, where the true labels/solutions are known, the network loss function is typically composed of:

  - A primary loss term corresponding to a measure of how close the network predictions are to the true solutions
  - Auxiliary loss components, such as weight regularization penalties, designed to help guide the training process

Once the network performance is quantified, we can specify an optimization algorithm designed to minimize the network loss.

SLIDE 40

Loss Functions

Two of the most common applications of neural networks are regression (for predicting continuous properties/values) and classification (for predicting discrete properties or labels).

For regression, a standard loss is given by the mean squared error:

\[ \text{Loss} \;=\; \frac{1}{|I|} \sum_{i \in I} \left( \hat{y}_i - y_i \right)^2 \]

where I are the indices of the output data (e.g. pixels of an image).

For classification, softmax cross entropy can be used when labels are mutually exclusive (e.g. classifying a digit as "0", "1", "2", etc.); sigmoid cross entropy can be used when labels are not mutually exclusive (e.g. determining which objects are in an image).
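Reference sketches of two of these losses (the softmax version assumes one-hot labels and works from unnormalized logits):

```python
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Mean squared error over all output entries (e.g. pixels of an image)."""
    return np.mean((y_pred - y_true) ** 2)

def softmax_cross_entropy(logits, one_hot_labels):
    """Cross entropy for mutually exclusive classes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.mean((one_hot_labels * log_probs).sum(axis=-1))
```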

SLIDE 41

One Hot Encoding

It is also fundamentally important to consider how the data will be represented within the network. When classifying digits, for example, networks will typically perform extremely poorly if the labels are represented as a single number: "0.0", "1.0", "2.0", etc. To better distinguish the differences between e.g. "0", "1", and "2", it is useful to instead store the values using a one hot encoding:

"0" = [1, 0, 0, ...]    "1" = [0, 1, 0, ...]    "2" = [0, 0, 1, ...]

One hot encodings are also typically used for word prediction (by specifying a dictionary of possibilities) and character level predictions (by specifying the admissible character set).
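A minimal one-hot encoder (sketch):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels into one hot row vectors."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

print(one_hot([0, 1, 2], num_classes=10))   # digit labels -> rows of length 10
```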

SLIDE 42

Data Preparation

In general, it is also good practice to first process/prepare the input values of a dataset before training. For example, if the input values are centered around 100 and all lie within the interval [99.9, 100.1], it is typically better to center and rescale these values beforehand:

\[ \hat{x} \;=\; \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \qquad \text{with} \quad \mu = \frac{1}{|D|} \sum_{x' \in D} x' , \qquad \sigma^2 = \frac{1}{|D|} \sum_{x' \in D} (x' - \mu)^2 \]

The values of µ and σ² can then be saved, and predictions can be made on arbitrary inputs during testing by applying the above normalization before passing the test inputs to the network.
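A sketch of this preprocessing step, fitting the statistics once on the training inputs and reusing them for any later inputs:

```python
import numpy as np

def fit_normalizer(train_inputs, eps=1e-8):
    """Compute dataset statistics once; return a reusable normalization function."""
    mu = train_inputs.mean(axis=0)
    var = train_inputs.var(axis=0)
    return lambda x: (x - mu) / np.sqrt(var + eps)

# normalize = fit_normalizer(train_inputs)
# predictions = network(normalize(test_inputs))
```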

SLIDE 43

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 44

Gradient Descent

Gradient descent provides a simple, iterative algorithm for finding local minima of a real-valued function F numerically. The main idea behind gradient descent is relatively straightforward: compute the gradient of the function that we want to minimize and take a step in the direction of steepest descent ( i.e. −∇F(x) ). The iteration step of the algorithm is defined in terms of a step size parameter α ( or by a decreasing sequence {αi} ) by setting:

xi+1 = xi − α · ∇F(xi)

Note: Convergence is only guaranteed under certain assumptions on the function F (e.g. convexity, Lipschitz continuity, etc.).
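A direct translation of the iteration into code (a sketch; the quadratic example, step size, and step count are arbitrary placeholders):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, num_steps=100):
    """Gradient descent: x_{i+1} = x_i - alpha * grad F(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_F(x)
    return x

# Minimize F(x) = ||x||^2, whose gradient is 2x; iterates approach the origin
print(gradient_descent(lambda x: 2.0 * x, x0=[3.0, -2.0]))
```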

SLIDE 45

Loss Functions for Large Datasets

In the context of neural networks, gradient descent appears to provide a reasonable approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector θ0, and the iteration steps from the previous slide could, in theory, be used to identify the optimal parameters θ∗ for the model. The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset D:

\[ F(\theta) \;=\; \frac{1}{|D|} \sum_{x \in D} f(x \,|\, \theta) \]

where f(x|θ) denotes the loss for a single example x when using the model parameters θ. So the standard algorithm would require computing the average loss at each step of the iterative scheme...

SLIDE 46

Stochastic Gradient Descent

Since computing the true gradient ∇F(θ) at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data:

\[ \nabla F(\theta) \;\approx\; \nabla \widetilde{F}_i(\theta) \;=\; \frac{1}{|B_i|} \sum_{x \in B_i} \nabla f(x \,|\, \theta) \]

where the batches {B_i} partition the dataset into smaller subsets (typically of equal size). The iteration step is then taken to be:

\[ \theta_{i+1} \;=\; \theta_i - \alpha \cdot \nabla \widetilde{F}_i(\theta_i) \]
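A sketch of the resulting training loop; it assumes grad_f(x, theta) returns the gradient of the single-example loss and that data is an array of training examples, and the batch size, step size, and epoch count are placeholders:

```python
import numpy as np

def sgd(grad_f, theta0, data, alpha=0.01, batch_size=32, num_epochs=10, seed=0):
    """Mini-batch SGD: average per-example gradients over each batch, then step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_epochs):
        order = rng.permutation(len(data))                   # reshuffle each epoch
        for start in range(0, len(data), batch_size):
            batch = data[order[start:start + batch_size]]
            grad = np.mean([grad_f(x, theta) for x in batch], axis=0)
            theta = theta - alpha * grad
    return theta
```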

SLIDE 47

Potential Obstacles

  - Fixed learning rates typically lead to suboptimal performance
  - Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem in consideration
  - Different parameters often require different learning rates
  - Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process

SLIDE 48

Importance of Selecting the Correct Learning Rate

[Figure: optimization trajectories under a "High Learning Rate" vs. a "Low Learning Rate"]

SLIDE 49

Nesterov Momentum

Nesterov, Y.E., 1983. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR (Vol. 269, pp. 543-547).

One method for learning from previous steps is to incorporate momentum into the update policy. This can be done by setting:

vi = γ · vi−1 + α · ∇F(θi)
θi+1 = θi − vi

An accelerated form was introduced by Nesterov in 1983 which leverages the value of "looking ahead" before making updates:

vi = γ · vi−1 + α · ∇F(θi − γ · vi−1)
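One step of each update in code (a sketch; the γ and α defaults are arbitrary placeholders):

```python
def momentum_step(theta, v, grad_F, alpha=0.01, gamma=0.9):
    """Classical momentum update."""
    v = gamma * v + alpha * grad_F(theta)
    return theta - v, v

def nesterov_step(theta, v, grad_F, alpha=0.01, gamma=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    v = gamma * v + alpha * grad_F(theta - gamma * v)
    return theta - v, v
```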

SLIDE 50

AdaGrad and RMSProp

Duchi, J., Hazan, E. and Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), pp.2121-2159.
Hinton, G., Srivastava, N. and Swersky, K. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.

  - AdaGrad defines parameter-specific updates which are normalized by the sum of the squares of previous gradients; this leads to a natural learning rate decay (often too much)
  - RMSProp keeps moving averages of the squared gradients for each parameter which are used to rescale updates
  - AdaDelta provides another method for rescaling updates, and many other variants (e.g. including momentum) exist ...
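For instance, a sketch of the RMSProp rescaling described above (the hyperparameter values are typical defaults, not values prescribed by the slide):

```python
import numpy as np

def rmsprop_step(theta, avg_sq_grad, grad, alpha=0.001, rho=0.9, eps=1e-8):
    """RMSProp: rescale updates by a moving average of squared gradients."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    theta = theta - alpha * grad / (np.sqrt(avg_sq_grad) + eps)
    return theta, avg_sq_grad
```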

SLIDE 51

Exponential Moving Averages

One common method for estimating an average incrementally is to keep an exponential moving average of the values. This method applies an exponential decay to terms in the average which places an emphasis on the most recent values; this allows the average to move, or correct itself, as the distribution of the values changes. To track the gradient gt = ∇F(θt−1) of the loss with respect to the parameters θ, we can define an average recursively by setting:

\[ m_0 = 0, \qquad m_t = \beta \cdot m_{t-1} + (1 - \beta) \cdot g_t \quad\Longrightarrow\quad m_t = (1 - \beta) \sum_{\tau=1}^{t} \beta^{\,t-\tau} \, g_\tau \]

where the parameter β is used to specify the exponential decay rate and is typically taken to be close to, but smaller than, 1.

SLIDE 52

The Adam Optimization Algorithm

Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

The Adam optimizer, derived from "adaptive moment estimation", proposes keeping exponential moving averages of both the first moment g_t and the (uncentered) second moment g_t² of the gradient.

In addition, a bias correction is introduced to address the issue of arbitrarily initializing the exponential moving averages with zero. The first moment average m_t and second moment average v_t are defined with decay rates β_1 and β_2, respectively, and the bias correction procedure is defined by the rescaling:

\[ \hat{m}_t \;=\; \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t \;=\; \frac{v_t}{1 - \beta_2^{\,t}} \]
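Putting the pieces together, one Adam step looks roughly as follows (a sketch; the default hyperparameters are those suggested in the paper):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step index starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # first moment moving average
    v = beta2 * v + (1 - beta2) * grad ** 2      # second (uncentered) moment average
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```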

SLIDE 53

The Adam Optimization Algorithm

SLIDE 54

Outline

1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm

2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation

SLIDE 55

Backpropagation

Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1986. Learning representations by back-propagating errors. Nature, 323(6088), p.533.

While this theoretical framework for neural network optimization may seem complete, one fundamental question still remains: "How are the gradients of network parameters actually computed?" One approach, referred to as backpropagation, was proposed in 1986 which dealt with sigmoidal activations σ and defined the loss E in terms of predictions y_j and true values d_j via:

\[ E \;=\; \frac{1}{2} \sum_{j} \left( y_j - d_j \right)^2 \]

SLIDE 56

Backpropagation: Rumelhart, Hinton, and Williams

SLIDE 60

Backpropagation

Now that the error contribution associated with yi is known:

\[ \text{i.e.} \qquad \frac{\partial E}{\partial y_i} \;=\; \sum_{j} \frac{\partial E}{\partial x_j} \cdot w_{ji} \]

contributions from network parameters of the previous layer can be computed using the same methodology that was applied to y_j:

\[ \text{e.g.} \qquad \frac{\partial E}{\partial x_i} \;=\; \frac{\partial E}{\partial y_i} \cdot \frac{d y_i}{d x_i} \;=\; \frac{\partial E}{\partial y_i} \cdot \frac{d}{d x_i}\,\sigma(x_i) \]

In this way, gradient calculations for all network parameters can be computed by propagating back the error contributions from the parameters in subsequent layers which depend on their values.
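A hand-coded sketch of these formulas for a two-layer sigmoid network with the squared-error loss from the earlier slide (layer shapes and inputs are left to the caller; nothing here is the paper's original code):

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

def backprop_two_layer(x, d, W1, b1, W2, b2):
    """Gradients of E = 0.5 * sum((y - d)^2) via backpropagation."""
    # Forward pass (keep the pre-activations x1, x2 for the backward pass)
    x1 = W1 @ x + b1;  h = sigma(x1)
    x2 = W2 @ h + b2;  y = sigma(x2)
    # Backward pass: propagate error contributions layer by layer
    dE_dy  = y - d                                  # dE/dy_j for the squared error
    dE_dx2 = dE_dy * dsigma(x2)                     # through the output activation
    dE_dW2 = np.outer(dE_dx2, h);  dE_db2 = dE_dx2
    dE_dh  = W2.T @ dE_dx2                          # sum_j dE/dx_j * w_ji
    dE_dx1 = dE_dh * dsigma(x1)
    dE_dW1 = np.outer(dE_dx1, x);  dE_db1 = dE_dx1
    return dE_dW1, dE_db1, dE_dW2, dE_db2
```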

SLIDE 61

Symbolic and Numeric Differentiation

Two commonly used methods for automating the process of computing derivatives are symbolic differentiation and numeric differentiation; however, both of these methods have severe practical limitations in the context of training neural networks.

  - Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; the resulting expressions can be lengthy and contain unnecessary computations, however, and are inefficient unless additional "expression simplification" steps are included
  - Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks

SLIDE 62

Automatic Differentiation

Automatic differentiation (AD) in "reverse mode" provides a generalization of backpropagation and gives us a way to carry out the required gradient calculations exactly and efficiently.

  - Computes derivatives using the underlying computational graph
  - Very efficient with respect to evaluation time
  - A trace of all elementary operations is stored on an evaluation "tape", or "Wengert list"; potentially large storage requirements (see the toy sketch below)
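A toy, hand-rolled illustration of the reverse-mode idea (this is not how TensorFlow implements it; real systems record the trace non-recursively and replay it in reverse topological order):

```python
class Var:
    """Each operation records its parents and local derivatives; backward()
    then propagates error signals back through the recorded computation."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(2.0), Var(3.0)
z = x * y + x                 # z = x*y + x
z.backward()
print(x.grad, y.grad)         # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```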

Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2015. Automatic differentiation in machine learning: a survey.
University of Washington CSE599W, Spring 2018 - Slides from "Lecture 4: Backpropagation and Automatic Differentiation": http://dlsys.cs.washington.edu/pdf/lecture4.pdf
