
Artificial Neural Networks (Part 2) Gradient Descent Learning and Backpropagation

Christian Jacob

CPSC 533 — Winter 2001

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $\mu = 1, \dots, P$) between

- an input pattern $x^\mu = (x_1^\mu, \dots, x_N^\mu)$ and
- an associated target pattern $T^\mu$.


Figure 1. Perceptron

The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

(1) $O_i^\mu = \sum_k w_{ki} \, x_k^\mu$

The goal of the learning procedure is that eventually the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

(2) $O_i^\mu \overset{!}{=} T_i^\mu = \sum_k w_{ki} \, x_k^\mu$

Explicit Solution (Linear Network)

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

(3) $w_{ik} = \frac{1}{P} \sum_{\mu\lambda} T_i^\mu \, (Q^{-1})_{\mu\lambda} \, x_k^\lambda$
05.2-Backprop-Printout.nb


(4) $Q_{\mu\lambda} = \frac{1}{P} \sum_k x_k^\mu \, x_k^\lambda$

Correlation Matrix

Here $Q_{\mu\lambda}$ is a component of the correlation matrix $Q$ of the input patterns:

(5) $Q = \frac{1}{P} \sum_k \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & & & \vdots \\ x_k^P x_k^1 & \cdots & \cdots & x_k^P x_k^P \end{pmatrix}$

You can check that this is indeed a solution by verifying

(6) $\sum_k w_{ik} \, x_k^\mu = T_i^\mu.$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means, if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \dots, N$

(7) $a_1 x_k^1 + a_2 x_k^2 + \dots + a_P x_k^P = 0,$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.
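As an illustration of Equations (3) to (6), here is a small pure-Python sketch, added to this printout rather than taken from it, that solves a tiny linear network exactly via the inverse correlation matrix; the patterns and targets are made-up example values:

```python
# Hypothetical illustration of Eqs. (3)-(6): explicit solution for a
# linear network with two linearly independent patterns and one output.
P, N = 2, 2
x = [[1.0, 0.0],                 # x[mu][k]: input pattern mu
     [0.0, 1.0]]
T = [0.5, -0.3]                  # target T^mu for the single output unit

# Correlation matrix, Eq. (4): Q[mu][lam] = (1/P) * sum_k x_k^mu * x_k^lam
Q = [[sum(x[m][k] * x[l][k] for k in range(N)) / P for l in range(P)]
     for m in range(P)]

# Invert the 2x2 matrix Q (it exists because the patterns are independent)
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Qinv = [[ Q[1][1] / det, -Q[0][1] / det],
        [-Q[1][0] / det,  Q[0][0] / det]]

# Weights, Eq. (3): w_k = (1/P) * sum_{mu,lam} T^mu * Qinv[mu][lam] * x_k^lam
w = [sum(T[m] * Qinv[m][l] * x[l][k] for m in range(P) for l in range(P)) / P
     for k in range(N)]

# Check Eq. (6): the network reproduces every target exactly
outputs = [sum(w[k] * x[m][k] for k in range(N)) for m in range(P)]
print(outputs, T)
```

With linearly dependent patterns, `det` would be zero and the inversion would fail, which is exactly the caveat above.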

Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $\vec{w}_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(\vec{w})$:


(8) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - \sum_k w_{km} \, x_k^\mu \Bigr)^2$

$E(\vec{w}) \ge 0$ will approach zero as $\vec{w} = \{w_{km}\}$ satisfies Equation (2). This cost function is a quadratic function in weight space.

Paraboloid

Therefore, $E(\vec{w})$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];


ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
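To make the descent concrete, here is a small pure-Python sketch (an addition, not part of the original notebook) that follows the negative gradient of the plotted paraboloid $E(w_1, w_2) = w_1^2 + w_2^2$ and converges to its single global minimum at the origin:

```python
# Gradient descent on the paraboloid E(w1, w2) = w1^2 + w2^2.
# The gradient is (2*w1, 2*w2); stepping against it shrinks E every step.
eta = 0.1                          # learning rate
w = [4.0, -3.0]                    # arbitrary starting point
for step in range(100):
    grad = [2 * w[0], 2 * w[1]]    # dE/dw1, dE/dw2
    w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]

E = w[0] ** 2 + w[1] ** 2
print(w, E)                        # both coordinates approach 0
```

Each step multiplies every coordinate by $(1 - 2\eta)$, so the error decays geometrically toward the minimum.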


If the pattern vectors are linearly independent, i.e., a solution for Equation (2) exists, the minimum is at $E = 0$.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(\vec{w})$ in weight space by following the negative gradient

(9) $-\nabla_{\vec{w}} E(\vec{w}) = -\frac{\partial E(\vec{w})}{\partial \vec{w}}$

We can implement this gradient strategy as follows:

Changing a Weight

Each weight $w_{ki} \in \vec{w}$ is changed by $\Delta w_{ki}$, proportionate to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):


(10) $\Delta w_{ki} = -\eta \, \frac{\partial E(\vec{w})}{\partial w_{ki}}$

Steps Towards the Solution

(11) $\Delta w_{ki} = -\eta \, \frac{\partial}{\partial w_{ki}} \Biggl( \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \Bigr)^2 \Biggr)$

$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} \frac{\partial}{\partial w_{ki}} \Biggl( \sum_{m=1}^{M} \Bigl( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \Bigr)^2 \Biggr)$

$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} 2 \, \Bigl( T_i^\mu - \sum_n w_{ni} \, x_n^\mu \Bigr) \, (-x_k^\mu)$

Weight Adaptation Rule

(12) $\Delta w_{ki} = \eta \sum_{\mu=1}^{P} (T_i^\mu - O_i^\mu) \, x_k^\mu$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptations of the weights are accumulated over all patterns.

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

(13) $\Delta w_{ki} = \eta \, (T_i^\mu - O_i^\mu) \, x_k^\mu$

or

(14) $\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu$

with

(15) $\delta_i^\mu = T_i^\mu - O_i^\mu.$


This learning rule has several names:

- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean square) rule.
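The online update of Equations (13) to (15) can be sketched as follows; this is an illustrative pure-Python addition (not the notebook's own code), with made-up patterns whose unique solution is $w = (1, -1)$:

```python
# Delta / LMS rule for one linear output unit: Dw_k = eta * (T - O) * x_k,
# applied after each pattern presentation (online learning).
eta = 0.2
patterns = [([1.0, 0.0], 1.0),     # (input vector x, target T)
            ([0.0, 1.0], -1.0),
            ([1.0, 1.0], 0.0)]
w = [0.0, 0.0]                     # initial weights

for epoch in range(200):           # repeated presentation of all patterns
    for x, T in patterns:
        O = sum(wk * xk for wk, xk in zip(w, x))             # Eq. (1)
        delta = T - O                                        # Eq. (15)
        w = [wk + eta * delta * xk for wk, xk in zip(w, x)]  # Eq. (14)

print([round(wk, 3) for wk in w])  # converges toward w = [1, -1]
```

Because the three patterns are consistent with a single weight vector, the online updates settle on that exact solution.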

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.

- The input function is denoted by $h(x)$.
- The output function $g(h(x))$ is assumed to be differentiable in $x$.

Rewriting the Error Function

The definition of the error function (Equation (8)) can be simply rewritten as follows:

(16) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_k w_{km} \, x_k^\mu \Bigr) \Bigr)^2$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

(17) $\frac{\partial E(\vec{w})}{\partial w_{ki}} = -\sum_{\mu=1}^{P} (T_i^\mu - g(h_i^\mu)) \cdot g'(h_i^\mu) \cdot x_k^\mu$


From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

(18) $\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu$

where

(19) $\delta_i^\mu = (T_i^\mu - O_i^\mu) \cdot g'(h_i^\mu)$
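A sketch of Equations (18) and (19) for a single nonlinear unit, as an illustrative Python addition; the choice of $g = \tanh$, the single training pattern, and its target are assumptions for the example, not values from the notebook:

```python
import math

# Gradient descent for one nonlinear unit: delta = (T - O) * g'(h).
def g(h):                            # activation: tanh (beta = 1)
    return math.tanh(h)

def g_prime(h):                      # g'(h) = 1 - tanh^2(h), cf. Eq. (20)
    return 1.0 - math.tanh(h) ** 2

eta = 0.5
x, T = [0.5, -1.0], 0.8              # one training pattern and its target
w = [0.1, 0.1]                       # initial weights

for step in range(500):
    h = sum(wk * xk for wk, xk in zip(w, x))
    delta = (T - g(h)) * g_prime(h)  # Eq. (19)
    w = [wk + eta * delta * xk for wk, xk in zip(w, x)]  # Eq. (18)

O_final = g(sum(wk * xk for wk, xk in zip(w, x)))
print(O_final)                       # the output approaches the target 0.8
```

The only change from the linear delta rule is the extra factor $g'(h)$ in the error term.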

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic Tangent:

(20) $g(x) = \tanh(\beta x), \qquad g'(x) = \beta \, (1 - g^2(x))$

Hyperbolic tangent plot:

Plot[Tanh[x], {x, -5, 5}];



Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];

Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];


Influence of the $\beta$ parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]


Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];


Sigmoid:

(21) $g(x) = \frac{1}{1 + e^{-2 \beta x}}, \qquad g'(x) = 2 \beta \, g(x) \, (1 - g(x))$

Sigmoid plot:

sigmoid[x_, b_] := 1 / (1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];


Plot of the first derivative:


D[sigmoid[x, b], x]

$\frac{2 \, b \, e^{-2 b x}}{(1 + e^{-2 b x})^2}$

Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];


Check for equality with $2 \, g \, (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];
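The same equality checks can be done numerically. This Python fragment, added here alongside the Mathematica plots, compares central finite differences of $\tanh(\beta x)$ and the sigmoid against the closed forms $\beta(1 - g^2)$ and $2\beta g(1 - g)$ from Equations (20) and (21):

```python
import math

def tanh_g(x, b):
    return math.tanh(b * x)

def sigmoid(x, b):
    return 1.0 / (1.0 + math.exp(-2.0 * b * x))

eps, b = 1e-6, 1.5
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    # numeric derivatives via central differences
    num_t = (tanh_g(x + eps, b) - tanh_g(x - eps, b)) / (2 * eps)
    num_s = (sigmoid(x + eps, b) - sigmoid(x - eps, b)) / (2 * eps)
    # closed forms from Equations (20) and (21)
    ana_t = b * (1.0 - tanh_g(x, b) ** 2)
    ana_s = 2.0 * b * sigmoid(x, b) * (1.0 - sigmoid(x, b))
    assert abs(num_t - ana_t) < 1e-5
    assert abs(num_s - ana_s) < 1e-5
print("derivative identities hold")
```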


Influence of the $\beta$ parameter:


p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];


$\delta$ Update Rule for Sigmoid Units

Using the sigmoidal activation function (with $2\beta = 1$), the $\delta$ update rule takes the simple form:

(22) $\delta_i^\mu = O_i^\mu \, (1 - O_i^\mu) \, (T_i^\mu - O_i^\mu).$
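In code, Equation (22) means the derivative never has to be evaluated separately: the unit's own output suffices. A small illustrative Python check (an addition, assuming the standard sigmoid, i.e., $2\beta = 1$, with made-up values for $h$ and $T$):

```python
import math

def sigmoid(h):                    # g(h) = 1 / (1 + e^(-h)), i.e. 2*beta = 1
    return 1.0 / (1.0 + math.exp(-h))

h, T = 0.7, 1.0                    # example net input and target
O = sigmoid(h)

# delta via the derivative g'(h) = g(h) * (1 - g(h)) ...
delta_from_gprime = (T - O) * sigmoid(h) * (1.0 - sigmoid(h))
# ... equals delta computed from the output alone, Eq. (22)
delta_from_output = O * (1.0 - O) * (T - O)

print(delta_from_gprime, delta_from_output)
```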

Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements can solve a much wider range of classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $\vec{w}$ in a feedforward multilayer network. The network has to learn input-output patterns $\{x_k^\mu, T_i^\mu\}$.

The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.

Notation

We use the following notation:

- $x_k^\mu$: value of input unit $k$ for training pattern $\mu$; $k = 1, \dots, N$; $\mu = 1, \dots, P$
- $H_j$: output of hidden unit $j$


- $O_i$: output of output unit $i$, $i = 1, \dots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $\mu$ the hidden unit $j$ receives the input

(23) $h_j^\mu = \sum_{k=1}^{N} w_{kj} \, x_k^\mu$

and generates the output

(24) $H_j^\mu = g(h_j^\mu) = g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr).$

These signals are propagated to the output cells, which receive the signals

(25) $h_i^\mu = \sum_j W_{ji} \, H_j^\mu = \sum_j W_{ji} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr)$

and generate the output

(26) $O_i^\mu = g(h_i^\mu) = g\Bigl( \sum_j W_{ji} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr) \Bigr)$

Error function

We use the known quadratic function as our error function:

(27) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

Continuing the calculations, we get:


(28) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - g(h_m^\mu))^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_j W_{jm} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr) \Bigr) \Bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_j W_{jm} \, H_j^\mu \Bigr) \Bigr)^2$

Updating the weights: hidden to output layer

For the connections from hidden to output cells we can use the delta weight update rule:

(29) $\Delta W_{ji} = -\eta \, \frac{\partial E}{\partial W_{ji}}$

$\Delta W_{ji} = \eta \sum_\mu (T_i^\mu - O_i^\mu) \, g'(h_i^\mu) \, H_j^\mu$

$\Delta W_{ji} = \eta \sum_\mu \delta_i^\mu \, H_j^\mu$

with

(30) $\delta_i^\mu = g'(h_i^\mu) \, (T_i^\mu - O_i^\mu)$

Updating the weights: input to hidden layer

(31) $\Delta w_{kj} = -\eta \, \frac{\partial E}{\partial w_{kj}}$

$\Delta w_{kj} = -\eta \sum_\mu \Bigl( \frac{\partial E}{\partial H_j^\mu} \cdot \frac{\partial H_j^\mu}{\partial w_{kj}} \Bigr)$

After a few more calculations we get the following weight update rule:


(32) $\Delta w_{kj} = \eta \sum_\mu \delta_j^\mu \, x_k^\mu$

with

(33) $\delta_j^\mu = g'(h_j^\mu) \sum_i W_{ji} \, \delta_i^\mu$

The Backpropagation Algorithm

For the BP algorithm we use the following notation:

- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$

Backpropagation Algorithm

Step 1: Initialize all weights with random values.

Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

(34) $V_j^0 = x_j^\mu \quad \forall j$

Step 3: Propagate the signals through all layers:

(35) $V_i^m = g(h_i^m) = g\Bigl( \sum_j w_{ji}^m \, V_j^{m-1} \Bigr) \quad \forall i, \forall m$

Step 4: Calculate the $\delta$'s of the output layer:

(36) $\delta_i^M = g'(h_i^M) \, (T_i - V_i^M)$

Step 5: Calculate the $\delta$'s for the inner layers by error backpropagation:

(37) $\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m \, \delta_j^m, \qquad m = M, M-1, \dots, 2$

Step 6: Adapt all connection weights:

(38) $w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}^m$ with $\Delta w_{ji}^m = \eta \, \delta_i^m \, V_j^{m-1}$


Step 7: Go back to Step 2 for the next training pattern.
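Steps 1 through 7 can be put together in a compact sketch. The following pure-Python program is an illustrative addition, not the original notebook's code; it assumes a single hidden layer, the standard sigmoid ($2\beta = 1$), and bias weights fed by a constant input, and it trains a small network on XOR, one of the example tasks below. The helper names `forward` and `error` are made up for this sketch.

```python
import math, random

def g(h):                                    # standard sigmoid, 2*beta = 1
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, w, W):
    """Steps 2-3: propagate one pattern; returns inputs, hidden and output values."""
    xb = x + [1.0]                           # constant bias input (an assumption)
    H = [g(sum(w[j][k] * xb[k] for k in range(3))) for j in range(len(w))]
    Hb = H + [1.0]                           # bias unit for the output layer
    O = g(sum(W[j] * Hb[j] for j in range(len(Hb))))
    return xb, H, Hb, O

random.seed(1)
nh = 3                                       # number of hidden units
# Step 1: initialize all weights with random values
w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(nh)]  # input->hidden
W = [random.uniform(-1, 1) for _ in range(nh + 1)]                  # hidden->output
eta = 0.5
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def error(w, W):                             # quadratic cost, Eq. (27)
    return 0.5 * sum((T - forward(x, w, W)[3]) ** 2 for x, T in data)

err_before = error(w, W)
for epoch in range(10000):                   # Step 7: keep presenting patterns
    for x, T in data:
        xb, H, Hb, O = forward(x, w, W)
        dO = O * (1 - O) * (T - O)           # Step 4, Eqs. (22)/(36)
        dH = [H[j] * (1 - H[j]) * W[j] * dO  # Step 5, Eqs. (33)/(37)
              for j in range(nh)]
        W = [W[j] + eta * dO * Hb[j] for j in range(nh + 1)]        # Step 6
        for j in range(nh):
            w[j] = [w[j][k] + eta * dH[j] * xb[k] for k in range(3)]

err_after = error(w, W)
print(err_before, err_after)                 # the error should have dropped
```

Note that the hidden-layer deltas must be computed with the old hidden-to-output weights before those weights are updated, which is why Step 5 precedes Step 6 inside the loop.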

Examples

- TC Learning Task
- XOR
