
Backpropagation. Matt Gormley, 10-601 Introduction to Machine Learning, Lecture 12, Oct 10, 2018.



  1. 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Backpropagation. Matt Gormley, Lecture 12, Oct 10, 2018

  2. Q&A

  3. BACKPROPAGATION

  4. A Recipe for Machine Learning (Background). 1. Given training data; 2. Choose each of these: decision function, loss function; 3. Define goal; 4. Train with SGD (take small steps opposite the gradient). The SGD update is written out below.
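Written out, the SGD step the recipe alludes to is the following (the step size \gamma_t and objective J(\theta) are standard notation I am supplying, not symbols taken from the slide):

\theta^{(t+1)} = \theta^{(t)} - \gamma_t \, \nabla_{\theta} J(\theta^{(t)})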

  5. Approaches to Differentiation (Training). Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient?

  6. Approaches to Differentiation (Training).
     1. Finite Difference Method
        – Pro: Great for testing implementations of backpropagation
        – Con: Slow for high-dimensional inputs / outputs
        – Required: Ability to call the function f(x) on any input x
     2. Symbolic Differentiation
        – Note: The method you learned in high school; used by Mathematica / Wolfram Alpha / Maple
        – Pro: Yields easily interpretable derivatives
        – Con: Leads to exponential computation time if not carefully implemented
        – Required: Mathematical expression that defines f(x)
     3. Automatic Differentiation - Reverse Mode
        – Note: Called backpropagation when applied to neural nets
        – Pro: Computes partial derivatives of one output f_i(x) with respect to all inputs x_j in time proportional to computation of f(x)
        – Con: Slow for high-dimensional outputs (e.g. vector-valued functions)
        – Required: Algorithm for computing f(x)
     4. Automatic Differentiation - Forward Mode
        – Note: Easy to implement; uses dual numbers (a minimal sketch follows below)
        – Pro: Computes partial derivatives of all outputs f_i(x) with respect to one input x_j in time proportional to computation of f(x)
        – Con: Slow for high-dimensional inputs (e.g. vector-valued x)
        – Required: Algorithm for computing f(x)
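To make the forward-mode item concrete, here is a minimal dual-number sketch I am adding (class and function names are my own, not from the lecture); it differentiates the same simple example used later in the deck:

```python
import math

class Dual:
    """A dual number a + b*eps with eps**2 = 0:
    `val` carries the function value, `dot` carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u*v)' = u'*v + u*v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)
def cos(u): return Dual(math.cos(u.val), -math.sin(u.val) * u.dot)

# Differentiate J(x) = cos(sin(x^2) + 3*x^2) at x = 2 with respect to x.
x = Dual(2.0, 1.0)                  # seed dx/dx = 1
J = cos(sin(x * x) + 3 * x * x)
print(J.val, J.dot)                 # value of J and dJ/dx
```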

  7. Finite Difference Method (Training). Notes: suffers from issues of floating point precision in practice; typically only appropriate to use on small examples with an appropriately chosen epsilon. A gradient-check sketch follows below.
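The formula itself is an image on the slide; a minimal central-difference gradient check under the usual definition (the test function and step size below are my own illustrative choices):

```python
import math

def finite_difference_grad(f, x, eps=1e-5):
    """Approximate dJ/dx_i at x with the central difference
    (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) for each coordinate i."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i]  += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Example: check dJ/dx for J(x) = cos(sin(x^2) + 3*x^2) at x = 2.
f = lambda v: math.cos(math.sin(v[0] ** 2) + 3 * v[0] ** 2)
print(finite_difference_grad(f, [2.0]))
```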

  8. Symbolic Differentiation (Training). Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

  9. Symbolic Differentiation (Training). Differentiation Quiz #2: …

  10. Chain Rule (Training). Whiteboard – Chain Rule of Calculus

  11. Chain Rule (Training). Given: y = g(u) and u = h(x). Chain Rule: \frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k

  12. Chain Rule (Training). Given: y = g(u) and u = h(x). Chain Rule: \frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k. Backpropagation is just repeated application of the chain rule from Calculus 101. (A worked scalar instance follows below.)
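As a quick scalar illustration I am adding (not on the slide): with y = \cos(u) and u = x^2, the chain rule gives \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} = -\sin(u) \cdot 2x = -2x \sin(x^2).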

  13.–22. Error Back-Propagation (sequence of illustration slides showing error gradients propagated step by step through a computation graph; the final frame labels p(y | x^{(i)}), parameters \theta, intermediate quantity z, and label y^{(i)}). Slides from (Stoyanov & Eisner, 2012)

  23. Backpropagation (Training). Whiteboard – Example: Backpropagation for Chain Rule #1. Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

  24. Backpropagation (Training). Automatic Differentiation – Reverse Mode (aka. Backpropagation).
      Forward Computation
      1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
      2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
         a. Compute u_i = g_i(v_1, …, v_N)
         b. Store the result at the node
      Backward Computation
      1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
      2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
         a. We already know dy/du_i
         b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (choice of algorithm ensures computing du_i/dv_j is easy)
      Return partial derivatives dy/du_i for all variables. (A minimal code sketch follows below.)
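Here is a minimal sketch of this procedure, assuming a tiny scalar computation-graph class of my own devising (names like `Node` and `backward` are not from the lecture):

```python
import math

class Node:
    """One variable in the computation graph: stores its value, its
    parents, and the local partials du_i/dv_j with respect to each parent."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # the v_j this node depends on
        self.local_grads = local_grads  # du_i/dv_j for each parent
        self.grad = 0.0                 # accumulates dy/du_i

def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))
def sin(a):    return Node(math.sin(a.value), (a,), (math.cos(a.value),))
def cos(a):    return Node(math.cos(a.value), (a,), (-math.sin(a.value),))

def backward(y):
    """Reverse-mode sweep: visit nodes in reverse topological order and
    increment dy/dv_j by (dy/du_i) * (du_i/dv_j)."""
    order, seen = [], set()
    def topo(u):                        # build a topological order
        if id(u) not in seen:
            seen.add(id(u))
            for v in u.parents:
                topo(v)
            order.append(u)
    topo(y)
    y.grad = 1.0                        # dy/dy = 1
    for u in reversed(order):
        for v, d in zip(u.parents, u.local_grads):
            v.grad += u.grad * d

# The deck's simple example: J = cos(sin(x^2) + 3*x^2) at x = 2.
x = Node(2.0)
t = mul(x, x)
J = cos(add(sin(t), mul(Node(3.0), t)))
backward(J)
print(J.value, x.grad)   # forward value and dJ/dx
```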

  25. Backpropagation (Training). Simple Example: The goal is to compute J = \cos(\sin(x^2) + 3x^2) on the forward pass and the derivative \frac{dJ}{dx} on the backward pass.
      Forward: J = \cos(u); u = u_1 + u_2; u_1 = \sin(t); u_2 = 3t; t = x^2.

  26. Backpropagation (Training). Simple Example: The goal is to compute J = \cos(\sin(x^2) + 3x^2) on the forward pass and the derivative \frac{dJ}{dx} on the backward pass.
      Forward: J = \cos(u); u = u_1 + u_2; u_1 = \sin(t); u_2 = 3t; t = x^2.
      Backward:
      \frac{dJ}{du} = -\sin(u)
      \frac{dJ}{du_1} = \frac{dJ}{du}\frac{du}{du_1}, \quad \frac{du}{du_1} = 1
      \frac{dJ}{du_2} = \frac{dJ}{du}\frac{du}{du_2}, \quad \frac{du}{du_2} = 1
      \frac{dJ}{dt} = \frac{dJ}{du_1}\frac{du_1}{dt} + \frac{dJ}{du_2}\frac{du_2}{dt}, \quad \frac{du_1}{dt} = \cos(t), \quad \frac{du_2}{dt} = 3
      \frac{dJ}{dx} = \frac{dJ}{dt}\frac{dt}{dx}, \quad \frac{dt}{dx} = 2x
      (A numerical check follows below.)
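A direct transcription of these forward and backward formulas into code (my own sketch; x = 2 is chosen arbitrarily), checked against a central difference:

```python
import math

x = 2.0

# Forward pass, following the slide's intermediate variables.
t  = x ** 2
u1 = math.sin(t)
u2 = 3 * t
u  = u1 + u2
J  = math.cos(u)

# Backward pass: apply each local derivative from the slide.
dJ_du  = -math.sin(u)
dJ_du1 = dJ_du * 1.0                             # du/du1 = 1
dJ_du2 = dJ_du * 1.0                             # du/du2 = 1
dJ_dt  = dJ_du1 * math.cos(t) + dJ_du2 * 3.0     # du1/dt = cos(t), du2/dt = 3
dJ_dx  = dJ_dt * 2 * x                           # dt/dx = 2x

# Sanity check with a central difference.
eps = 1e-6
f = lambda v: math.cos(math.sin(v ** 2) + 3 * v ** 2)
approx = (f(x + eps) - f(x - eps)) / (2 * eps)
print(J, dJ_dx, approx)
```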

  27. Backpropagation (Training). Case 1: Logistic Regression (diagram: input layer with parameters \theta_1, \theta_2, \theta_3, …, \theta_M feeding a single output).
      Forward: J = y^* \log y + (1 - y^*) \log(1 - y); \quad y = \frac{1}{1 + \exp(-a)}; \quad a = \sum_{j=0}^{D} \theta_j x_j
      Backward:
      \frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}
      \frac{dJ}{da} = \frac{dJ}{dy}\frac{dy}{da}, \quad \frac{dy}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2}
      \frac{dJ}{d\theta_j} = \frac{dJ}{da}\frac{da}{d\theta_j}, \quad \frac{da}{d\theta_j} = x_j
      \frac{dJ}{dx_j} = \frac{dJ}{da}\frac{da}{dx_j}, \quad \frac{da}{dx_j} = \theta_j
      (A code sketch follows below.)
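A sketch of these logistic-regression forward and backward computations (my own code; the toy inputs and weights are invented for illustration):

```python
import math

def forward_backward(x, theta, y_star):
    """Forward and backward pass for logistic regression with
    objective J = y* log y + (1 - y*) log(1 - y)."""
    # Forward
    a = sum(th * xj for th, xj in zip(theta, x))     # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + math.exp(-a))                   # sigmoid
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = math.exp(-a) / (math.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * xj for xj in x]             # da/dtheta_j = x_j
    dJ_dx     = [dJ_da * th for th in theta]         # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

# Toy example (made-up numbers): bias feature x_0 = 1.
J, dJ_dtheta, dJ_dx = forward_backward([1.0, 0.5, -1.2], [0.1, -0.3, 0.8], 1.0)
print(J, dJ_dtheta)
```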

  28. Backpropagation (Training). Case 2: Neural Network (one hidden layer), forward computation from input to loss:
      (A) Input: given x_i, \forall i
      (B) Hidden (linear): a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \forall j
      (C) Hidden (sigmoid): z_j = \frac{1}{1 + \exp(-a_j)}, \forall j
      (D) Output (linear): b = \sum_{j=0}^{D} \beta_j z_j
      (E) Output (sigmoid): y = \frac{1}{1 + \exp(-b)}
      (F) Loss

  29. Backpropagation (Training). Case 2: Neural Network (one hidden layer), forward computation from input to loss:
      (A) Input: given x_i, \forall i
      (B) Hidden (linear): a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \forall j
      (C) Hidden (sigmoid): z_j = \frac{1}{1 + \exp(-a_j)}, \forall j
      (D) Output (linear): b = \sum_{j=0}^{D} \beta_j z_j
      (E) Output (sigmoid): y = \frac{1}{1 + \exp(-b)}
      (F) Loss: J = \frac{1}{2}(y - y^*)^2
      (A forward-pass code sketch follows below.)
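A sketch of this forward pass (my own code; the sizes and weights below are invented for illustration, and the bias terms are folded into the inputs):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha, beta, y_star):
    """Forward pass for the deck's one-hidden-layer network:
    (B) a_j = sum_i alpha[j][i] * x_i   (C) z_j = sigmoid(a_j)
    (D) b   = sum_j beta[j] * z_j       (E) y   = sigmoid(b)
    (F) J   = 0.5 * (y - y*)**2
    """
    a = [sum(w * xi for w, xi in zip(row, x)) for row in alpha]
    z = [sigmoid(aj) for aj in a]
    b = sum(bj * zj for bj, zj in zip(beta, z))
    y = sigmoid(b)
    J = 0.5 * (y - y_star) ** 2
    return J, y, b, z, a

# Made-up example: 3 inputs (including a bias x_0 = 1), 2 hidden units.
x     = [1.0, 0.5, -1.0]
alpha = [[0.1, 0.2, -0.3],
         [0.0, -0.4, 0.5]]
beta  = [0.7, -0.6]
print(forward(x, alpha, beta, y_star=1.0))
```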

  30. Backpropagation (Training). Case 2: Neural Network (diagram)

  31. Backpropagation (Training). Case 2: Neural Network (diagram: Linear → Sigmoid → Linear → Sigmoid → Loss)

  32. Derivative of a Sigmoid
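The slide's derivation is an image; the standard identity it arrives at (stated here for completeness, not extracted from the slide) is: if y = \frac{1}{1 + \exp(-a)}, then \frac{dy}{da} = \frac{\exp(-a)}{(1 + \exp(-a))^2} = y(1 - y).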

  33.–34. Backpropagation (Training). Case 2: Neural Network (diagram repeated: Linear → Sigmoid → Linear → Sigmoid → Loss)

  35. Backpropagation (Training). Whiteboard – SGD for Neural Network – Example: Backpropagation for Neural Network. (A backward-pass sketch follows below.)
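Continuing the forward-pass sketch after slide item 29, a hedged backward pass for the same network, derived by applying the chain rule layer by layer (my own code, not a transcription of the whiteboard; it assumes `forward`, `x`, `alpha`, and `beta` from that earlier sketch):

```python
def backward(x, beta, y_star, y, z):
    """Backward pass for the one-hidden-layer network, layer by layer.
    Returns gradients of J = 0.5*(y - y*)**2 w.r.t. beta and alpha."""
    dJ_dy = y - y_star                                    # (F) squared loss
    dJ_db = dJ_dy * y * (1 - y)                           # (E) dy/db = y(1 - y)
    dJ_dbeta = [dJ_db * zj for zj in z]                   # (D) db/dbeta_j = z_j
    dJ_dz    = [dJ_db * bj for bj in beta]                # (D) db/dz_j = beta_j
    dJ_da    = [dz * zj * (1 - zj) for dz, zj in zip(dJ_dz, z)]  # (C) sigmoid
    dJ_dalpha = [[da * xi for xi in x] for da in dJ_da]   # (B) da_j/dalpha_ji = x_i
    return dJ_dbeta, dJ_dalpha

# One SGD step (step size 0.1 chosen arbitrarily):
J, y, b, z, a = forward(x, alpha, beta, y_star=1.0)
dJ_dbeta, dJ_dalpha = backward(x, beta, 1.0, y, z)
beta  = [bj - 0.1 * g for bj, g in zip(beta, dJ_dbeta)]
alpha = [[w - 0.1 * g for w, g in zip(row, grow)]
         for row, grow in zip(alpha, dJ_dalpha)]
```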

  36. Backpropagation (Training). Backpropagation (Auto. Diff. - Reverse Mode).
      Forward Computation
      1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
      2. Visit each node in topological order.
         a. Compute the corresponding variable's value
         b. Store the result at the node
      Backward Computation
      1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
      2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
         a. We already know dy/du_i
         b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (choice of algorithm ensures computing du_i/dv_j is easy)
      Return partial derivatives dy/du_i for all variables.

  37. A Recipe for Machine Learning (Gradients). 1. Given training data; 2. Choose each of these: decision function, loss function; 3. Define goal; 4. Train with SGD (take small steps opposite the gradient). Callout: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

  38. Summary.
      1. Neural Networks … provide a way of learning features; are highly nonlinear prediction functions; (can be) a highly parallel network of logistic regression classifiers; discover useful hidden representations of the input.
      2. Backpropagation … provides an efficient way to compute gradients; is a special case of reverse-mode automatic differentiation.
