Neural Networks Learning the network: Part 3
11-785, Fall 2020 Lecture 5
Recap: Training the network
• Given a training set of input-output pairs (X_t, d_t), learn the network parameters W to minimize the loss
  Loss(W) = (1/T) Σ_t Div(f(X_t; W), d_t)
  w.r.t. W
• This is a problem of function minimization – an instance of optimization

Before we can train, we must settle:
• What are these input-output pairs?
• What is f() and what are its parameters W?
• What is the divergence Div()?
The network: a layered composition of units with differentiable activations, organized into input units, hidden units and output units.

The desired output d:
• For real-valued prediction: a vector of reals
• For classification: a one-hot vector representation of the label

The actual output Y of the network:
• For real-valued prediction: a vector of reals
• For classification: a probability distribution over labels
For real-valued outputs, the L2 divergence is popular:
  Div(Y, d) = (1/2) Σ_i (y_i − d_i)^2
• The derivative: dDiv/dy_i = y_i − d_i

[Figure: network outputs y_1…y_4 compared to targets d_1…d_4 by the L2 divergence]

For binary classification, where the output Y is a scalar probability and d is 0/1, the Kullback-Leibler (KL) divergence between the output probability and the ideal output probability is popular:
  Div(Y, d) = −d log Y − (1 − d) log(1 − Y)
• Minimum when Y = d
• The derivative:
  dDiv(Y, d)/dY = −1/Y        for d = 1
                =  1/(1 − Y)  for d = 0
• Note: the derivative is not 0 even when Y equals d, its target value – encouraging faster convergence of gradient descent
  – It is 0 for L2, though

[Figure: KL(Y, d) = −d log Y − (1 − d) log(1 − Y) and L2(Y, d) = (Y − d)^2 plotted against Y, for d = 0 and d = 1]
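A small numerical sketch of these two divergences and their derivatives (illustrative code, not from the slides; the variable names are arbitrary):

```python
import numpy as np

def l2_div(y, d):
    """L2 divergence between output y and target d (both vectors)."""
    return 0.5 * np.sum((y - d) ** 2)

def l2_div_grad(y, d):
    """Derivative of the L2 divergence w.r.t. y: simply y - d."""
    return y - d

def binary_kl(y, d):
    """Binary KL divergence: -d log y - (1-d) log(1-y), for scalar y in (0,1), d in {0,1}."""
    return -d * np.log(y) - (1 - d) * np.log(1 - y)

def binary_kl_grad(y, d):
    """Derivative w.r.t. y: -1/y if d == 1, 1/(1-y) if d == 0."""
    return -d / y + (1 - d) / (1 - y)

y, d = 0.999, 1.0
print(binary_kl(y, d), binary_kl_grad(y, d))   # ~0.001 and ~-1.0: still pushing y toward 1
print(l2_div(np.array([y]), np.array([d])),
      l2_div_grad(np.array([y]), np.array([d])))  # both nearly 0 at the target
```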
For multi-class classification, where the desired output d is a one-hot vector, the Kullback-Leibler (KL) divergence between the output probability distribution Y and the ideal output probability d is popular:
  Div(Y, d) = Σ_i d_i log d_i − Σ_i d_i log y_i
• Note: Σ_i d_i log d_i is not a function of Y (and is 0 for one-hot d), so for one-hot d with correct class c this reduces to
  Div(Y, d) = −Σ_i d_i log y_i = −log y_c
• Minimum when Y = d
• The derivative:
  dDiv(Y, d)/dy_i = −1/y_c  for the c-th component
                 = 0       for the remaining components
  i.e.  ∇_Y Div(Y, d) = [0  0  …  −1/y_c  …  0  0]

[Figure: network outputs y_1…y_4 and a one-hot target d compared by the KL divergence]

• Note: the derivative is not 0 when Y = d, even though the divergence attains its minimum there
• The slope is negative w.r.t. y_c: increasing y_c will reduce the divergence
A note on the KL divergence and the cross-entropy (Xent):
  KL(d, Y) = Σ_i d_i log d_i − Σ_i d_i log y_i   and   Xent(d, Y) = −Σ_i d_i log y_i
• The two differ only by a term that does not depend on Y, so the Y that minimizes cross-entropy will minimize the KL divergence
  – In fact, for one-hot d, Xent = KL divergence
• The Xent is not a divergence, though: although it attains its minimum when Y = d, its minimum value is not 0 in general
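A quick check (illustrative, not from the slides) that the multi-class KL divergence and the cross-entropy share the same gradient w.r.t. Y, and coincide for one-hot targets:

```python
import numpy as np

def kl_div(y, d, eps=1e-12):
    """KL divergence: sum_i d_i log(d_i / y_i)."""
    return np.sum(d * (np.log(d + eps) - np.log(y + eps)))

def xent(y, d, eps=1e-12):
    """Cross-entropy: -sum_i d_i log y_i."""
    return -np.sum(d * np.log(y + eps))

def xent_grad(y, d, eps=1e-12):
    """Gradient w.r.t. y -- identical for KL and Xent, since they differ by a constant in y."""
    return -d / (y + eps)

y = np.array([0.7, 0.2, 0.1])   # network output (a probability distribution)
d = np.array([1.0, 0.0, 0.0])   # one-hot target, correct class c = 0

print(kl_div(y, d), xent(y, d))  # equal for one-hot d
print(xent_grad(y, d))           # [-1/0.7, 0, 0]: nonzero only at the c-th component
```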
It is sometimes useful to set the target d to a smoothed version of the one-hot vector: the value 1 − (K − 1)ε in the c-th position (for the correct class c) and ε elsewhere, for some small ε
• "Label smoothing" – aids gradient descent
• The derivative:
  dDiv(Y, d)/dy_i = −(1 − (K − 1)ε)/y_c   for the c-th component
                 = −ε/y_i                for the remaining components

[Figure: network outputs y_1…y_4 and a smoothed target d compared by the KL divergence]

• Negative derivatives encourage increasing the probabilities of all classes, including incorrect classes! (Seems wrong, no?)
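A minimal sketch of a label-smoothed target and the resulting gradient (the values of K and ε are illustrative):

```python
import numpy as np

def smooth_one_hot(c, K, eps):
    """Target with 1-(K-1)*eps at the correct class c and eps elsewhere."""
    d = np.full(K, eps)
    d[c] = 1.0 - (K - 1) * eps
    return d

def xent_grad(y, d):
    """dDiv/dy_i = -d_i / y_i: now negative for *every* class, not just the correct one."""
    return -d / y

y = np.array([0.7, 0.2, 0.1])
d = smooth_one_hot(c=0, K=3, eps=0.01)
print(d)                # [0.98, 0.01, 0.01]
print(xent_grad(y, d))  # all components negative: every class probability is pushed up
```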
ALL TERMS HAVE BEEN DEFINED
• Networks are trained by adjusting their parameters to minimize the average divergence between their actual output and the desired output at a set of "training instances"
  – Input-output samples from the function to be learned
  – The average divergence is the "Loss" to be minimized
• We have chosen:
  – The network itself
  – The manner in which inputs are represented as numbers
  – The manner in which outputs are represented as numbers
  – The divergence function that computes the error between actual and desired outputs
The problem of training: minimize the total loss
  Loss(W) = (1/T) Σ_t Div(f(X_t; W), d_t)
w.r.t. the weights W.

Gradient descent: to minimize any function L(W) w.r.t. W
• Start with an initial guess for W
• Repeatedly move against the gradient of L w.r.t. W
  – For every component w: w = w − η dL/dw
• Until L(W) has converged

Explicitly stating it by component, for the network:
• Using the extended notation: the bias is also a weight
• For every layer k, for all i, j, update:
  w_{i,j}^{(k)} = w_{i,j}^{(k)} − η dLoss/dw_{i,j}^{(k)}
• Until the loss has converged
Training by gradient descent
• Total training Loss:
  Loss = (1/T) Σ_t Div(Y_t, d_t)
• Assuming the bias is also represented as a weight:
  – Initialize all weights
  – Do: for every layer k, for all i, j, update
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − η dLoss/dw_{i,j}^{(k)}
  – Until the loss has converged
• Total derivative:
  dLoss/dw_{i,j}^{(k)} = (1/T) Σ_t dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
Training by gradient descent, in pseudocode:
• Initialize all weights
• Do:
  – For all i, j, k, initialize dLoss/dw_{i,j}^{(k)} = 0
  – For all t = 1…T (over the training instances):
    • Compute the network output Y_t for input X_t
    • Compute dDiv(Y_t, d_t)/dw_{i,j}^{(k)} for all i, j, k
    • dLoss/dw_{i,j}^{(k)} += dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
  – For every layer k, for all i, j, update:
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T) dLoss/dw_{i,j}^{(k)}
• Until the loss has converged

The derivative of the total training loss is the average of the derivatives of the divergences of the individual training inputs:
  dLoss/dw_{i,j}^{(k)} = (1/T) Σ_t dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
So the key remaining computation is dDiv(Y_t, d_t)/dw_{i,j}^{(k)}, the derivative of the divergence for a single training instance.
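A sketch of this batch gradient-descent loop in Python. Here `backward` is a hypothetical stand-in for the per-instance derivative computation developed later in the lecture; only the structure of the loop is the point:

```python
import numpy as np

def train(weights, data, backward, eta, n_epochs):
    """Batch gradient descent, as in the pseudocode above.

    weights : list of per-layer weight arrays
    data    : list of (x, d) training pairs
    backward: hypothetical function (weights, x, d) -> per-layer arrays dDiv/dW
              (computed by the forward + backward passes described later)
    """
    T = len(data)
    for _ in range(n_epochs):
        grads = [np.zeros_like(W) for W in weights]   # accumulators for sum_t dDiv/dW
        for x, d in data:
            for g, gi in zip(grads, backward(weights, x, d)):
                g += gi                               # sum the per-instance derivatives
        for W, g in zip(weights, grads):
            W -= eta * g / T                          # w = w - (eta/T) * sum_t dDiv/dw
    return weights
```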
A brief calculus refresher:
• For any differentiable function y = f(x) with derivative dy/dx, a small perturbation of the input produces, by definition, Δy ≈ (dy/dx) Δx
• For any differentiable function y = f(x_1, x_2, …, x_M) with partial derivatives ∂y/∂x_i:
  Δy ≈ Σ_i (∂y/∂x_i) Δx_i
• For any nested function y = f(g(x)), the chain rule:
  dy/dx = (df/dg)(dg/dx)
  – Check: we can confirm that a perturbation of x perturbs g(x), which in turn perturbs y through df/dg
• Distributed chain rule: for y = f(g_1(x), g_2(x), …, g_M(x)):
  dy/dx = Σ_i (∂f/∂g_i)(dg_i/dx)
  – Check: a perturbation of x causes perturbations in each of g_1(x)…g_M(x), each of which individually additively perturbs y
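A quick finite-difference check of the chain rule and the distributed chain rule (the functions are purely illustrative):

```python
import numpy as np

# Nested function: y = f(g(x)) with f(u) = u**2, g(x) = sin(x)
f  = lambda u: u ** 2
g  = np.sin
df = lambda u: 2 * u        # f'(u)
dg = np.cos                 # g'(x)

x, h = 0.7, 1e-6
analytic = df(g(x)) * dg(x)                       # chain rule: dy/dx = f'(g(x)) g'(x)
numeric  = (f(g(x + h)) - f(g(x - h))) / (2 * h)  # central finite difference
print(analytic, numeric)                          # agree to ~1e-9

# Distributed chain rule: y = f(g1(x), g2(x)) with f(a, b) = a*b, g1 = sin, g2 = cos
y        = lambda x: np.sin(x) * np.cos(x)
analytic = np.cos(x) * np.cos(x) + np.sin(x) * (-np.sin(x))  # sum over both paths
numeric  = (y(x + h) - y(x - h)) / (2 * h)
print(analytic, numeric)
```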
Our problem: for a single training instance (X, d), compute the derivative dDiv(Y, d)/dw_{i,j}^{(k)} for every weight of the network.

[Figure repeated over the next slides: a small example network – the actual network would have many more neurons and inputs. Each neuron computes an affine combination of its inputs (the "+" blocks) followed by a differentiable activation g(·); the weight w_{i,j}^{(k)} connects the j-th unit of layer k−1 to the i-th unit of layer k; the final outputs are compared to the target d by Div(Y, d).]

What is:
  dDiv(Y, d)/dw_{i,j}^{(k)} ?
• Computing it requires the intermediate and final output values of the network in response to the input
The forward pass
• We will refer to the process of computing the output from an input as the forward pass
• We will illustrate the forward pass in the following slides

[Figure repeated over the next slides: the layered network; layer k computes the affine values z^{(k)} from y^{(k−1)}, and the activation f_k produces y^{(k)}; the computation runs y^{(0)} → z^{(1)} → y^{(1)} → … → z^{(N)} → y^{(N)}, with the final activation f_N at the output layer and constant "1" nodes for the biases]

• Assuming y_0^{(k)} = 1, we extend the output of every layer by a constant 1, to account for the biases – the bias is then simply the weight on this constant input
• We set y^{(0)} = x, the input, for notational convenience
• Each layer first computes its affine combinations and then its activations:
  z_i^{(k)} = Σ_j w_{i,j}^{(k)} y_j^{(k−1)}
  y_i^{(k)} = f_k(z_i^{(k)})
• This is iterated over the entire network:
  ITERATE FOR k = 1:N, for j = 1:layer-width
• The output of the network is Y = y^{(N)}
The forward pass (scalar form):
• Input: a D_0-dimensional vector x, where D_0 is the width of the 0th (input) layer
• Set y^{(0)} = x; set y_0^{(k)} = 1 for all k (the bias terms)
• For k = 1 … N:
  – For i = 1 … D_k:
    z_i^{(k)} = Σ_j w_{i,j}^{(k)} y_j^{(k−1)}
    y_i^{(k)} = f_k(z_i^{(k)})
• Output Y = y^{(N)}
(D_k is the size of the k-th layer.)

We have computed all these intermediate values in the forward computation. We must remember them – we will need them to compute the derivatives.
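A minimal numpy sketch of this forward pass (illustrative: the layer sizes and the tanh activation are arbitrary, and explicit bias vectors are used instead of the constant-1 extended notation):

```python
import numpy as np

def forward(Ws, bs, x, f=np.tanh):
    """Forward pass: z(k) = W(k) y(k-1) + b(k), y(k) = f(z(k)).
    Returns all intermediate zs and ys -- the backward pass will need them."""
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(f(z))
    return zs, ys

# A tiny 2-3-1 network with random weights
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
zs, ys = forward(Ws, bs, np.array([0.5, -1.0]))
print(ys[-1])   # the network output Y = y(N)
```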
The backward pass

[Figure repeated over the next slides: the same layered network, y^{(0)} → z^{(1)} → … → z^{(N−1)} → y^{(N−1)} → z^{(N)} → y^{(N)}, with the "1" bias nodes, the final activation f_N, and the divergence Div(Y, d) computed from the output]

• First, we compute the divergence Div(Y, d) between the output of the net Y = y^{(N)} and the desired output d
• We then compute dDiv/dy_i^{(N)}, the derivative of the divergence w.r.t. the final output of the network y^{(N)}
  – The derivative w.r.t. the actual output of the final layer of the network is simply the derivative w.r.t. the output of the network
• We then compute dDiv/dz_i^{(N)}, the derivative of the divergence w.r.t. the pre-activation affine combination z^{(N)}, using the chain rule:
  dDiv/dz_i^{(N)} = f_N'(z_i^{(N)}) · dDiv/dy_i^{(N)}
  – dDiv/dy_i^{(N)} has already been computed; f_N'(·) is the derivative of the activation function, evaluated at the z_i^{(N)} computed (and stored) in the forward pass
• Continuing on, we compute dDiv/dw_{i,j}^{(N)}, the derivative of the divergence with respect to the weights of the connections to the output layer:
  dDiv/dw_{i,j}^{(N)} = y_j^{(N−1)} · dDiv/dz_i^{(N)}
  – Because z_i^{(N)} = Σ_j w_{i,j}^{(N)} y_j^{(N−1)}; dDiv/dz_i^{(N)} was just computed, and y_j^{(N−1)} was computed in the forward pass
  – For the bias term (j = 0, with y_0^{(N−1)} = 1): dDiv/dw_{i,0}^{(N)} = dDiv/dz_i^{(N)}
• We then continue with the chain rule to compute dDiv/dy_j^{(N−1)}, the derivative of the divergence w.r.t. the output of the (N−1)-th layer:
  dDiv/dy_j^{(N−1)} = Σ_i w_{i,j}^{(N)} · dDiv/dz_i^{(N)}
  – Because y_j^{(N−1)} influences the divergence through every z_i^{(N)} it feeds (the distributed chain rule); the w_{i,j}^{(N)} are already known
• We continue our way backwards in the order shown, repeating the same three steps at every layer k = N−1, N−2, …, 1:
  dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
  dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
  dDiv/dy_j^{(k−1)} = Σ_i w_{i,j}^{(k)} · dDiv/dz_i^{(k)}
  until we reach the input y^{(0)}
• To initialize the recursion we only need the gradient w.r.t. the network output, dDiv/dy^{(N)}
• We do not need derivatives w.r.t. the constant "1" bias nodes
The backward pass (scalar form):
• Initialize: compute dDiv/dy_i^{(N)} for all i (the gradient w.r.t. the network output)
• For k = N … 1:
  – For i = 1 … D_k:
    dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
    – For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}

• This is called "Backpropagation" because the derivative of the loss is propagated "backwards" through the network
• It is very analogous to the forward pass:
  – The backward weighted combination Σ_i w_{i,j}^{(k)} dDiv/dz_i^{(k)} mirrors the forward affine combination
  – Multiplying by the activation derivative f_k'(z_i^{(k)}) is the backward equivalent of applying the activation
• Using the notation dDiv/dv for the derivative of Div w.r.t. a variable v, the pass starts from a D_N-dimensional derivative at the output and works back to the D_0-dimensional derivative at the input (D_0 is the width of the 0th, input, layer)
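A minimal numpy sketch of the scalar-form backward pass just described (illustrative: it uses explicit bias vectors rather than the constant-1 extended notation, and numpy products in place of the explicit i, j loops):

```python
import numpy as np

def backward(Ws, zs, ys, dDiv_dY, fprime):
    """Backward pass implementing the three recursions above.
    zs, ys are the values stored during the forward pass (ys[0] is the input x);
    dDiv_dY is the derivative of the divergence w.r.t. the network output y(N).
    Returns per-layer derivatives dDiv/dW and dDiv/db."""
    N = len(Ws)
    dWs = [None] * N
    dbs = [None] * N
    dy = dDiv_dY                       # dDiv/dy(k), starting at the output, k = N
    for k in reversed(range(N)):
        dz = fprime(zs[k]) * dy        # dDiv/dz_i(k)   = f'(z_i(k)) * dDiv/dy_i(k)
        dWs[k] = np.outer(dz, ys[k])   # dDiv/dw_ij(k)  = y_j(k-1) * dDiv/dz_i(k)
        dbs[k] = dz                    # dDiv/db_i(k)   = dDiv/dz_i(k)
        dy = Ws[k].T @ dz              # dDiv/dy_j(k-1) = sum_i w_ij(k) * dDiv/dz_i(k)
    return dWs, dbs
```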
This derivation of the backward pass assumed that:
1. The computation of the output of one neuron does not directly affect computation of other neurons in the same (or previous) layers
2. Inputs to neurons only combine through weighted addition
3. Activations are actually differentiable
– All of these conditions are frequently not applicable
– Will appear in quiz. Please read the slides.
Vector activations
• So far, each output y_i^{(k)} has depended only on the corresponding z_i^{(k)} (a scalar activation). Some activations (e.g. softmax) operate on the entire vector z^{(k)}: every output depends on all inputs.

[Figure: a layer with scalar activations (each z_i^{(k)} → y_i^{(k)}) contrasted with a layer with a vector activation (all of z^{(k)} → all of y^{(k)})]

• Scalar activation: modifying a z_i^{(k)} only changes the corresponding y_i^{(k)}; each z_i^{(k)} influences one y_i^{(k)}
• Vector activation: modifying a z_i^{(k)} potentially changes all of the outputs; each z_i^{(k)} influences all of y^{(k)}
• The number of outputs of a vector activation is generally the same as the number of inputs (z^{(k)})
• For scalar activations, the derivative of the error w.r.t. the input to the unit is a simple product of derivatives:
  dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
• For vector activations, the derivative of the error w.r.t. any input is a sum of partial derivatives, regardless of the number of outputs:
  dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
• Note: derivatives of scalar activations are just a special case of vector activations, with ∂y_j^{(k)}/∂z_i^{(k)} = 0 for i ≠ j
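A sketch of the vector-activation case for softmax: the derivative w.r.t. each z_i is a sum over all outputs, weighted by the softmax Jacobian (illustrative code, checked against a finite difference):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = dy_i/dz_j = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, -0.5, 0.3])
dDiv_dy = np.array([0.2, -1.0, 0.4])      # some upstream derivative dDiv/dy

# Vector activation: dDiv/dz_i = sum_j (dy_j/dz_i) * dDiv/dy_j
dDiv_dz = softmax_jacobian(z).T @ dDiv_dy
print(dDiv_dz)

# Finite-difference check of the first component
h, i = 1e-6, 0
zp, zm = z.copy(), z.copy()
zp[i] += h
zm[i] -= h
print((softmax(zp) @ dDiv_dy - softmax(zm) @ dDiv_dy) / (2 * h))  # matches dDiv_dz[0]
```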
The backward pass with vector activations:
• For k = N … 1:
  – For i = 1 … D_k:
    dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
    – For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}

[Figure: a softmax output layer y^{(N)} = softmax(z^{(N)}) compared to the target d by the KL divergence]

• Special case: for a softmax output layer trained with the KL divergence, the two steps at the output combine into the simple form dDiv/dz_i^{(N)} = y_i − d_i
• Such special cases are worked out on the slides – please look up – will appear in quiz!
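A numerical check (illustrative) that combining a softmax output with the KL/cross-entropy divergence gives the simple gradient y − d at the pre-activation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def xent(z, d):
    return -np.sum(d * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
d = np.array([0.0, 1.0, 0.0])       # one-hot target
y = softmax(z)

# Closed form from the special case: dDiv/dz = y - d
print(y - d)

# Finite-difference check, component by component
h = 1e-6
num = np.array([(xent(z + h * np.eye(3)[i], d) - xent(z - h * np.eye(3)[i], d)) / (2 * h)
                for i in range(3)])
print(num)                           # matches y - d
```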
More generally, units may compute other differentiable functions of their inputs – e.g. linear combinations, polynomials, the logistic (softmax), etc.

Multiplicative combination
• Some units combine their inputs multiplicatively – in contrast to the additive (affine) combination we have seen so far

[Figure: a unit in layer k whose value z_i^{(k)} is the product of two outputs of layer k−1]

• Forward:
  z_i^{(k)} = y_j^{(k−1)} · y_l^{(k−1)}
• Backward – each input's derivative is routed through the other input:
  dDiv/dy_j^{(k−1)} = y_l^{(k−1)} · dDiv/dz_i^{(k)}
  dDiv/dy_l^{(k−1)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
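A one-line check (illustrative values) that the gradient of a product unit routes each input's derivative through the other input:

```python
yj, yl = 0.8, -1.5                # two outputs of layer k-1
dDiv_dz = 2.0                     # upstream derivative dDiv/dz_i(k)

# Backward rule for z = yj * yl
dDiv_dyj = yl * dDiv_dz           # = -3.0
dDiv_dyl = yj * dDiv_dz           # =  1.6

# Finite-difference check on dDiv/dyj, treating Div as dDiv_dz * z locally
h = 1e-6
print(dDiv_dyj, dDiv_dz * ((yj + h) * yl - (yj - h) * yl) / (2 * h))
```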
The general backward pass pseudocode:
• Compute the output Y, the divergence Div(Y, d), and dDiv/dy^{(N)}
• For k = N … 1:
  – For i = 1 : layer width D_k:
    If the layer has a vector activation:
      dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
    Else if the activation is scalar:
      dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
    For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}
Non-differentiable activations
• Activations need not be differentiable everywhere
  – E.g. the RELU (Rectified Linear Unit): g(z) = z for z ≥ 0 and g(z) = 0 for z < 0 – not differentiable at z = 0
  – E.g. the "max" function
  – At such points we use subgradients, or "secants"
• A subgradient of a function at a point is any vector v such that f(x) ≥ f(x_0) + v·(x − x_0)
  – Any direction such that moving in that direction increases the function
  – Strictly defined for convex ("bowl" shaped) functions – for non-convex functions, the equivalent concept is a "quasi-secant"
  – The gradient is not always the subgradient, though
• For the RELU we use the subderivative: 1 for z > 0, 0 otherwise
  – At the differentiable points on the curve, this is the same as the gradient
  – Typically, we will use the equation given
• For the max function y = max(z_1, …, z_N), the (sub)derivative is:
  – 1 w.r.t. the largest incoming input
  – 0 for the rest
• Max pooling over subsets of inputs – will be seen in convolutional networks:
  – 1 for the specific component that is maximum in the corresponding input subset
  – 0 otherwise
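A sketch of the sub-derivatives actually used for RELU and max units (the tie-breaking choice at z = 0 and at ties is a convention, not dictated by the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_subgrad(z):
    """Subderivative used in practice: 1 for z > 0, 0 otherwise (0 chosen at z = 0)."""
    return (z > 0).astype(float)

def max_backward(z, dDiv_dy):
    """Max unit y = max(z_1..z_N): derivative is dDiv/dy for the largest input, 0 for the rest."""
    g = np.zeros_like(z)
    g[np.argmax(z)] = dDiv_dy
    return g

z = np.array([-0.3, 1.2, 0.4])
print(relu(z), relu_subgrad(z))      # [0, 1.2, 0.4], [0, 1, 0]
print(max_backward(z, dDiv_dy=2.0))  # [0, 2, 0]: only the max input receives gradient
```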
[Figure: a max-style unit selecting among inputs z_1 … z_N to produce outputs y_1 … y_M]

The backward pass itself is unchanged; the activation-derivative terms it uses may now be subgradients.

Overall training: for each training instance
• Forward pass: pass the instance forward through the net. Store all intermediate outputs of all computation.
• Backward pass: sweep backward through the net, iteratively compute all derivatives w.r.t. the weights.
• The per-instance derivatives are averaged over all T training instances to obtain the derivative of the total loss.
Training the network (complete pseudocode, scalar form):
• Initialize all weights w_{i,j}^{(k)} for all layers
• Do:
  – Initialize Loss = 0; for all i, j, k, initialize dLoss/dw_{i,j}^{(k)} = 0
  – For all t (iterate over training instances):
    • Forward pass: compute the output Y_t; Loss += Div(Y_t, d_t)
    • Backward pass: compute dDiv(Y_t, d_t)/dw_{i,j}^{(k)} for all i, j, k
    • dLoss/dw_{i,j}^{(k)} += dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
  – For all i, j, k, update:
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T) dLoss/dw_{i,j}^{(k)}
• Until the loss has converged

Vector formulation
• It is much more convenient to think of the process in terms of vector and matrix operations
  – Simpler arithmetic – fast matrix libraries make the operations much faster
  – This is what is actually used in any real system
The vector formulation
• Arrange the weights of layer k into a matrix W^{(k)}, with w_{i,j}^{(k)} in the i-th row and j-th column; similarly arrange the biases into a vector b^{(k)}
• The computation of a layer can then be written in vector notation as (setting y^{(0)} = x):
  z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}
  y^{(k)} = f_k(z^{(k)})

The forward pass in vector notation:
• Initialize y^{(0)} = x
• For k = 1 to N:
  z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}
  y^{(k)} = f_k(z^{(k)})
• Output Y = y^{(N)}; compute the divergence Div(Y, d)

Using vector notation for the backward pass requires a few facts from vector calculus. Check the following:
• The derivative of a vector activation y = f(z) w.r.t. its (vector) input is a Jacobian matrix J_y(z)
  – The number of outputs is identical to the number of inputs, so the Jacobian is square
  – Entries are partial derivatives of individual outputs w.r.t. individual inputs: [J_y(z)]_{i,j} = ∂y_i/∂z_j
  – Not showing the superscript "(k)" in equations, for brevity
• For scalar (component-wise) activations, the Jacobian is a diagonal matrix
  – Diagonal entries are the individual derivatives of outputs w.r.t. inputs
• For the affine function z = Wy + b, with weight matrix W and bias b producing the vector z, the Jacobian of z w.r.t. y is simply W
• Chain rule for vector functions: for Div(y(z)),
  ∇_z Div = ∇_y Div · J_y(z)
  – Check: note the order – the derivative of the outer function comes first
• For z = Wy + b (y is a vector):
  ∇_y Div = ∇_z Div · W
  – Check: again, the derivative of the outer function comes first
• Derivatives w.r.t. parameters:
  ∇_W Div = y ∇_z Div      ∇_b Div = ∇_z Div
  – Note the reversal of order in ∇_W Div; this is in fact a simplification, using ∇_W z to represent the Jacobian of z w.r.t. W to explicitly illustrate the chain rule
  – In general, ∇_a b represents the derivative of b w.r.t. a
• N.B.: derivatives here are row vectors; the gradient is the transpose of the derivative

The vector backward pass for the full network, derived one new term at a time in the slides:
• Start at the output with ∇_Y Div – the actual derivative depends on the divergence function
• ∇_{z^{(N)}} Div = ∇_Y Div · J_{y^{(N)}}(z^{(N)})
• ∇_{W^{(N)}} Div = y^{(N−1)} ∇_{z^{(N)}} Div ;  ∇_{b^{(N)}} Div = ∇_{z^{(N)}} Div
• ∇_{y^{(N−1)}} Div = ∇_{z^{(N)}} Div · W^{(N)}
• …and so on, backwards through the layers; the Jacobian J_{y^{(k)}}(z^{(k)}) reduces to a diagonal matrix for scalar activations
• In the final step we compute the derivative w.r.t. the input, ∇_{y^{(0)}} Div
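A shape-level sketch of one backward step in this row-vector convention (the sizes are arbitrary, and the tanh Jacobian stands in for a generic scalar activation):

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_out = 4, 3
W = rng.standard_normal((D_out, D_in))     # layer weights: z = W y + b
y_prev = rng.standard_normal(D_in)         # y(k-1), stored in the forward pass
z = W @ y_prev                             # z(k) (bias omitted for brevity)

dDiv_dy = rng.standard_normal((1, D_out))  # row vector: derivative w.r.t. y(k)
J = np.diag(1.0 - np.tanh(z) ** 2)         # Jacobian of a scalar (tanh) activation: diagonal

dDiv_dz = dDiv_dy @ J                      # (1 x D_out): outer derivative comes first
dDiv_dW = np.outer(y_prev, dDiv_dz)        # (D_in x D_out): y(k-1) times dDiv/dz
dDiv_dyprev = dDiv_dz @ W                  # (1 x D_in): backward recursion step

print(dDiv_dz.shape, dDiv_dW.shape, dDiv_dyprev.shape)
# The gradient used for the weight update is the transpose of dDiv_dW (D_out x D_in).
```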
The backward pass in vector form:
• Initialize: compute ∇_Y Div (= ∇_{y^{(N)}} Div)
• For k = N … 1:
  – Compute ∇_{z^{(k)}} Div = ∇_{y^{(k)}} Div · J_{y^{(k)}}(z^{(k)})
  – Compute ∇_{W^{(k)}} Div = y^{(k−1)} ∇_{z^{(k)}} Div  and  ∇_{b^{(k)}} Div = ∇_{z^{(k)}} Div
  – Backward recursion step: ∇_{y^{(k−1)}} Div = ∇_{z^{(k)}} Div · W^{(k)}
• Note the analogy to the forward pass
  – Forward recursion step: z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}

Training the network (complete pseudocode, vector form):
• Initialize all W^{(k)}, b^{(k)}
• Do:
  – Loss = 0; for all k, initialize ∇_{W^{(k)}} Loss = 0, ∇_{b^{(k)}} Loss = 0
  – For all t (loop through training instances):
    • Forward pass: compute the output Y(X_t) and the divergence Div(Y_t, d_t); Loss += Div(Y_t, d_t)
    • Backward pass, for k = N … 1:
      – ∇_{z^{(k)}} Div = ∇_{y^{(k)}} Div · J_{y^{(k)}}(z^{(k)})
      – ∇_{W^{(k)}} Div(Y_t, d_t) = y^{(k−1)} ∇_{z^{(k)}} Div ;  ∇_{b^{(k)}} Div(Y_t, d_t) = ∇_{z^{(k)}} Div
      – ∇_{y^{(k−1)}} Div = ∇_{z^{(k)}} Div · W^{(k)}
      – ∇_{W^{(k)}} Loss += ∇_{W^{(k)}} Div(Y_t, d_t) ;  ∇_{b^{(k)}} Loss += ∇_{b^{(k)}} Div(Y_t, d_t)
  – For all k, update:
    W^{(k)} = W^{(k)} − (η/T) (∇_{W^{(k)}} Loss)ᵀ ;  b^{(k)} = b^{(k)} − (η/T) (∇_{b^{(k)}} Loss)ᵀ
• Until the loss has converged
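Putting it all together, a compact numpy sketch of the vectorized training loop above, on a toy problem. All choices here (layer sizes, data, learning rate, tanh hidden layer with a linear output, L2 divergence) are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 8, 1]                                   # layer widths, input through output
Ws = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

X = rng.standard_normal((50, 2))                    # 50 training instances
D = (X[:, :1] * X[:, 1:]).copy()                    # toy targets d_t

eta, T = 0.05, len(X)
for epoch in range(500):
    dWs = [np.zeros_like(W) for W in Ws]
    dbs = [np.zeros_like(b) for b in bs]
    loss = 0.0
    for x, d in zip(X, D):
        # Forward pass (store z's and y's; the last layer is linear)
        ys, zs = [x], []
        for k, (W, b) in enumerate(zip(Ws, bs)):
            zs.append(W @ ys[-1] + b)
            ys.append(np.tanh(zs[-1]) if k < len(Ws) - 1 else zs[-1])
        loss += 0.5 * np.sum((ys[-1] - d) ** 2)
        # Backward pass
        dy = ys[-1] - d                             # dDiv/dy(N) for the L2 divergence
        for k in reversed(range(len(Ws))):
            dz = dy if k == len(Ws) - 1 else (1 - np.tanh(zs[k]) ** 2) * dy
            dWs[k] += np.outer(dz, ys[k])           # gradient (transpose of the row-vector derivative)
            dbs[k] += dz
            dy = Ws[k].T @ dz
    for k in range(len(Ws)):                        # update with (eta/T) * accumulated gradients
        Ws[k] -= eta / T * dWs[k]
        bs[k] -= eta / T * dbs[k]
print(loss / T)                                     # average divergence after training
```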
Example: MNIST digit recognition
• Training data: images of handwritten digits along with their labels
• Output layer: sigmoid output neurons
  – The first ten outputs correspond to the ten digits
  – Ideal output: one of the outputs goes to 1, the others go to 0

Summary
• Neural nets are trained to minimize the average divergence between the output of the network and the desired output, over a set of training instances, with respect to the network parameters
• The gradient of the divergence (for each training instance) w.r.t. the network parameters can be computed using backpropagation
  – Which requires a "forward" pass of inference followed by a "backward" pass of gradient computation
• Still to come: does this really work – and how can we improve it? And how do we deal with large amounts of training data?