Backpropagation

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley, Lecture 12, Oct 10, 2018
Background: A Recipe for Machine Learning. Choose each of these:
– Decision function
– Loss function
Training then amounts to optimizing the loss, which requires its gradient with respect to the parameters.
Approaches to Differentiation:

1. Finite Difference Method
– Pro: Great for testing implementations of backpropagation
– Con: Slow for high dimensional inputs / outputs
– Required: Ability to call the function f(x) on any input x

2. Symbolic Differentiation
– Note: The method you learned in high school
– Note: Used by Mathematica / Wolfram Alpha / Maple
– Pro: Yields easily interpretable derivatives
– Con: Leads to exponential computation time if not carefully implemented
– Required: Mathematical expression that defines f(x)

3. Automatic Differentiation - Reverse Mode
– Note: Called backpropagation when applied to neural nets
– Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional outputs (e.g. vector-valued functions)
– Required: Algorithm for computing f(x)

4. Automatic Differentiation - Forward Mode
– Note: Easy to implement. Uses dual numbers (a minimal sketch follows below).
– Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional inputs (e.g. vector-valued x)
– Required: Algorithm for computing f(x)
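Forward mode is simple enough to sketch directly. Below is a minimal Python illustration of dual numbers (not from the lecture; the class and its operator coverage are pared down to what the example needs): each value carries its derivative with respect to one chosen input, so a single forward evaluation yields one column of the Jacobian.

```python
import math

class Dual:
    """A dual number v + d*eps with eps^2 = 0: `val` is the value,
    `dot` is its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def cos(x):
    return Dual(math.cos(x.val), -math.sin(x.val) * x.dot)

# One forward pass computes f(x) and df/dx together.
x = Dual(2.0, 1.0)                 # seed dx/dx = 1
J = cos(sin(x * x) + 3 * x * x)    # f(x) = cos(sin(x^2) + 3x^2)
print(J.val, J.dot)
```

Seeding `dot = 1` selects which input the derivatives are taken with respect to; for a vector-valued x one such pass is needed per input, which is exactly the "slow for high dimensional inputs" caveat above.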
The Finite Difference Method approximates each partial derivative numerically:

∂f/∂x_i ≈ (f(x + ε·e_i) − f(x − ε·e_i)) / (2ε),  where e_i is the i-th standard basis vector.

Notes:
– It suffers from issues of floating point precision in practice.
– It is typically only appropriate to use on small examples with an appropriately chosen epsilon.
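As an illustration of both the method and its role as a correctness check for backpropagation, here is a minimal sketch (the function name is ours, not the lecture's) of the centered-difference approximation above:

```python
import numpy as np

def finite_diff_grad(f, x, eps=1e-5):
    """Centered-difference approximation to the gradient of f: R^D -> R at x.
    Needs 2*D evaluations of f, hence slow for high dimensional inputs."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Example: f(x) = ||x||^2 has true gradient 2x.
f = lambda x: np.sum(x ** 2)
print(finite_diff_grad(f, np.array([1.0, -2.0, 3.0])))  # ~[2, -4, 6]
```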
Chain Rule: Given y = g(u) and u = h(x), two functions composed in a modular manner, the partial derivatives of the composition are

dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k),  ∀ i, k

where the u_j are the intermediate quantities.
Backpropagation is just repeated application of the chain rule from Calculus 101.
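Equivalently, the chain rule above can be stated as a product of Jacobian matrices; this standard restatement (not on the slides) is what makes "repeated application" mechanical:

```latex
\frac{dy}{dx} \;=\; \frac{dy}{du}\,\frac{du}{dx},
\qquad
\Big[\tfrac{dy}{dx}\Big]_{ik} = \frac{dy_i}{dx_k},\quad
\Big[\tfrac{dy}{du}\Big]_{ij} = \frac{dy_i}{du_j},\quad
\Big[\tfrac{du}{dx}\Big]_{jk} = \frac{du_j}{dx_k}
```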
[Figure slides from (Stoyanov & Eisner, 2012): a worked illustration of computing the prediction p(y|x(i)) for a training example y(i) on the forward pass and propagating gradients back through the intermediate quantities z.]
Automatic Differentiation - Reverse Mode (a.k.a. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
   a. Compute u_i = g_i(v_1, …, v_N)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)

Return partial derivatives dy/du_i for all variables.
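The procedure fits in a few dozen lines. The following Python sketch assumes a computation graph given as a topologically ordered list of nodes, each carrying its function g_i and its local partials; the dict-based representation and all names are purely illustrative.

```python
def backprop(nodes):
    """Reverse-mode AD over a computation graph.

    `nodes` is a topologically ordered list of dicts:
      'inputs': indices of parent nodes (empty for graph inputs)
      'g':      computes this node's value from its parents' values
      'dg':     returns the local partials du_i/dv_j, one per parent
    Returns node values and the partials dy/du_i, where y is the last node.
    """
    # Forward computation: visit each node in topological order.
    vals = []
    for node in nodes:
        parents = [vals[j] for j in node['inputs']]
        vals.append(node['g'](*parents))

    # Backward computation: initialize dy/dy = 1, all other partials 0.
    grads = [0.0] * len(nodes)
    grads[-1] = 1.0
    for i in reversed(range(len(nodes))):     # reverse topological order
        node = nodes[i]
        if not node['inputs']:
            continue                          # graph inputs have no parents
        parents = [vals[j] for j in node['inputs']]
        local = node['dg'](*parents)          # du_i/dv_j for each parent j
        for j, d in zip(node['inputs'], local):
            grads[j] += grads[i] * d          # dy/dv_j += (dy/du_i)(du_i/dv_j)
    return vals, grads

# Hypothetical usage: y = (a + b) * b at a = 3, b = 2.
nodes = [
    dict(inputs=[],     g=lambda: 3.0,        dg=None),                    # a
    dict(inputs=[],     g=lambda: 2.0,        dg=None),                    # b
    dict(inputs=[0, 1], g=lambda a, b: a + b, dg=lambda a, b: [1.0, 1.0]), # s
    dict(inputs=[2, 1], g=lambda s, b: s * b, dg=lambda s, b: [b, s]),     # y
]
vals, grads = backprop(nodes)
print(grads[0], grads[1])   # dy/da = 2, dy/db = a + 2b = 7
```

Note how `grads[j]` is incremented rather than assigned: a variable that feeds several downstream nodes (like b here, or t in the next example) accumulates one contribution per consumer.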
Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.

Forward            Backward
J = cos(u)         dJ/du = −sin(u)
u = u1 + u2        dJ/du1 = (dJ/du)(du/du1),   du/du1 = 1
                   dJ/du2 = (dJ/du)(du/du2),   du/du2 = 1
u1 = sin(t)        dJ/dt = (dJ/du1)(du1/dt),   du1/dt = cos(t)
u2 = 3t            dJ/dt += (dJ/du2)(du2/dt),  du2/dt = 3
t = x^2            dJ/dx = (dJ/dt)(dt/dx),     dt/dx = 2x
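Transcribing the table directly into code (with a concrete input, say x = 2, chosen here purely for illustration) and checking against the closed-form derivative:

```python
import math

x = 2.0   # any concrete input

# Forward pass (topological order)
t  = x ** 2
u1 = math.sin(t)
u2 = 3 * t
u  = u1 + u2
J  = math.cos(u)

# Backward pass (reverse topological order)
dJ_du  = -math.sin(u)
dJ_du1 = dJ_du * 1.0            # du/du1 = 1
dJ_du2 = dJ_du * 1.0            # du/du2 = 1
dJ_dt  = dJ_du1 * math.cos(t)   # du1/dt = cos(t)
dJ_dt += dJ_du2 * 3.0           # du2/dt = 3; t feeds two nodes, so increment
dJ_dx  = dJ_dt * 2 * x          # dt/dx = 2x

# Check against the closed form dJ/dx = -sin(sin(x^2) + 3x^2) * (2x cos(x^2) + 6x)
print(dJ_dx, -math.sin(math.sin(x**2) + 3*x**2) * (2*x*math.cos(x**2) + 6*x))
```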
Case 1: Logistic Regression
[Figure: logistic regression drawn as a network, with inputs feeding a single output unit through parameters θ1, θ2, θ3, …, θM.]
Forward                               Backward
J = y* log y + (1 − y*) log(1 − y)    dJ/dy = y*/y + (1 − y*)/(y − 1)
y = 1/(1 + exp(−a))                   dJ/da = (dJ/dy)(dy/da),    dy/da = exp(−a)/(exp(−a) + 1)^2
a = Σ_{j=0}^{D} θj xj                 dJ/dθj = (dJ/da)(da/dθj),  da/dθj = xj
                                      dJ/dxj = (dJ/da)(da/dxj),  da/dxj = θj
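A direct implementation of this table might look as follows (the function name and the choice to fold the bias into x[0] are our assumptions, not the slide's):

```python
import numpy as np

def lr_forward_backward(x, y_star, theta):
    """Forward and backward pass for Case 1, following the table above.
    x[0] is assumed to be 1 so that theta[0] acts as the j = 0 bias term."""
    # Forward
    a = theta @ x                          # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + np.exp(-a))           # sigmoid
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2   # equals y * (1 - y)
    dJ_da = dJ_dy * dy_da                        # simplifies to y_star - y
    dJ_dtheta = dJ_da * x                        # da/dtheta_j = x_j
    return J, dJ_dtheta

J, g = lr_forward_backward(np.array([1.0, 0.5, -1.2]), 1, np.array([0.1, -0.3, 0.8]))
print(J, g)
```

Multiplying out the backward pass shows dJ/dθ = (y* − y) x, the familiar logistic regression gradient.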
Case 2: Neural Network
[Figure: a network with inputs x1, …, xM, one hidden layer z1, …, zD, and a single output y.]

(F) Loss:             J = ½ (y − y*)^2
(E) Output (sigmoid): y = 1/(1 + exp(−b))
(D) Output (linear):  b = Σ_{j=0}^{D} βj zj
(C) Hidden (sigmoid): zj = 1/(1 + exp(−aj)), ∀j
(B) Hidden (linear):  aj = Σ_{i=0}^{M} αji xi, ∀j
(A) Input:            given xi, ∀i
[Figure slides: the same network redrawn as a chain of modules, Linear → Sigmoid → Linear → Sigmoid → Loss; a code sketch of the full forward and backward pass follows.]
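Putting steps (A)-(F) and their backward counterparts together, here is a sketch of the full forward and backward pass (the weight shapes and the omission of bias terms are our simplifications, not the slide's):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_forward_backward(x, y_star, alpha, beta):
    """One-hidden-layer network: Linear -> Sigmoid -> Linear -> Sigmoid -> Loss.
    alpha: (D, M) hidden weights, beta: (D,) output weights; the slide's
    i = 0 / j = 0 bias terms are omitted here for brevity."""
    # Forward pass, steps (A)-(F)
    a = alpha @ x                   # (B) a_j = sum_i alpha_ji x_i
    z = sigmoid(a)                  # (C)
    b = beta @ z                    # (D)
    y = sigmoid(b)                  # (E)
    J = 0.5 * (y - y_star) ** 2     # (F)

    # Backward pass: chain rule, module by module
    dJ_dy = y - y_star
    dJ_db = dJ_dy * y * (1 - y)     # sigmoid' in terms of its output
    dJ_dbeta = dJ_db * z            # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta            # db/dz_j = beta_j
    dJ_da = dJ_dz * z * (1 - z)     # elementwise sigmoid'
    dJ_dalpha = np.outer(dJ_da, x)  # da_j/dalpha_ji = x_i
    return J, dJ_dalpha, dJ_dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=3)
J, g_alpha, g_beta = nn_forward_backward(x, 1.0, rng.normal(size=(4, 3)), rng.normal(size=4))
print(J, g_alpha.shape, g_beta.shape)
```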
Backpropagation (Automatic Differentiation - Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)

Return partial derivatives dy/du_i for all variables.
Summary: recall the recipe for machine learning. Choose a decision function and a loss function, then train by following the gradient of the loss with respect to the parameters. Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
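Modern deep learning libraries implement exactly this. For instance, assuming PyTorch is available, reverse-mode AD recovers the derivative of the earlier worked example in a few lines:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
J = torch.cos(torch.sin(x ** 2) + 3 * x ** 2)  # forward pass records the graph
J.backward()                                   # reverse-mode AD
print(x.grad)                                  # dJ/dx at x = 2
```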
You should be able to…
– construct a computation graph for a function as specified by an algorithm, and carry out the backpropagation algorithm on it
– construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
– instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
– use the finite difference method to check the gradient computations of a neural network
– identify when the gradient of a function can be computed at all and when it can be computed efficiently