SLIDE 1

Pattern Recognition Introduction to Gradient Descent

Ad Feelders

Universiteit Utrecht

SLIDE 2

Optimization (single variable)

Suppose we want to find the value of x for which the function y = f(x) is minimized (or maximized). From calculus we know that a necessary condition for a minimum is:

$$\frac{df}{dx} = 0 \qquad (1)$$

This condition is not sufficient, since maxima and points of inflection also satisfy equation (1). Together with the second-order condition

$$\frac{d^2 f}{dx^2} > 0 \qquad (2)$$

we have a sufficient condition for a local minimum.
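As a quick illustration (an assumed example, not from the slides): for f(x) = x² − 4x,

$$\frac{df}{dx} = 2x - 4 = 0 \;\Rightarrow\; x = 2, \qquad \frac{d^2 f}{dx^2} = 2 > 0,$$

so x = 2 is a local (in fact global) minimum.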

SLIDE 3

Optimization (single variable)

The equation df/dx = 0 may not have a closed-form solution, however. In such cases we have to resort to iterative numerical procedures such as gradient descent.

SLIDE 4

Optimization (single variable)

[Figure: graph of f(x) with the tangent line at x = x*; the slope df/dx(x = x*) is positive.]

The derivative at x = x* is positive, so to increase the function value we should increase the value of x, i.e. make a step in the direction of the derivative. To decrease the function value, we should make a step in the opposite direction.

SLIDE 5

Optimization (single variable)

Also, the tangent line to the graph at x = x* is a local linear approximation to f:

$$\Delta f \approx \frac{df}{dx}(x = x^*)\,\Delta x$$

The closer we are to x*, the better the approximation.

SLIDE 6

Gradient Descent Algorithm (single variable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x(0).
2. Determine the derivative

   $$\frac{df}{dx}(x = x^{(i)})$$

   of f(x) at x(i) and update

   $$x^{(i+1)} \leftarrow x^{(i)} - \eta\,\frac{df}{dx}(x = x^{(i)})$$

   Set i ← i + 1.
3. Repeat the previous step until df/dx = 0 and check if a (local) minimum has been reached.

Here η > 0 is the step size (or learning rate).
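A minimal R sketch of this single-variable procedure (not part of the original slides; the example function f, its derivative df, and the settings eta, tol and maxiter are assumptions for illustration):

f  <- function(x) x^2 - 4*x            # example function, minimum at x = 2
df <- function(x) 2*x - 4              # its derivative

gd1 <- function(x0, eta = 0.1, tol = 1e-8, maxiter = 1000) {
  x <- x0
  for (i in 1:maxiter) {
    g <- df(x)
    if (abs(g) < tol) break            # derivative (numerically) zero: stop
    x <- x - eta * g                   # x(i+1) <- x(i) - eta * df/dx(x(i))
  }
  x
}

gd1(x0 = 0)                            # converges to approximately 2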

SLIDE 7

Optimization (multiple variables)

Suppose we want to find the values of x1, …, xm for which the function y = f(x1, …, xm) is minimized (or maximized). Analogous to the single-variable case, a necessary condition for a minimum is:

$$\frac{\partial f}{\partial x_j} = 0, \qquad j = 1, \ldots, m \qquad (3)$$

Again this condition is not sufficient, since maxima and saddle points also satisfy (3). For the second-order condition, define the Hessian matrix H, with

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Together with the second-order condition that H is positive definite, i.e.

$$z^\top H z > 0 \quad \text{for all } z \neq 0 \qquad (4)$$

we have a sufficient condition for a local minimum.
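Positive definiteness of H can be checked via its eigenvalues (all must be positive). A small R sketch for the assumed example f(x1, x2) = x1² + x2² − x1·x2, which is not from the slides:

H <- matrix(c( 2, -1,
              -1,  2), nrow = 2, byrow = TRUE)   # constant Hessian of the example
eigen(H)$values                                  # 3 and 1: all positive, so H is positive definite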

SLIDE 8

Linear Functions

Consider a linear function

$$f(x) = a + \sum_{i=1}^{m} b_i x_i = a + b^\top x$$

The contour lines of f are given by f(x) = a + b⊤x = c, for different values of the constant c. For linear functions the contours are parallel straight lines.

SLIDE 9

The Gradient

The gradient of f(x1, x2, …, xm) is the vector of partial derivatives

$$\nabla f = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_m} \end{pmatrix}$$

SLIDE 10

Gradient of a Linear Function

The gradient of a linear function f(x) = a + b⊤x is given by

$$\nabla f = b$$

Furthermore, for linear functions we have:

$$\Delta f = b^\top \Delta x = \nabla f^\top \Delta x$$

In which direction should we move to maximize ∆f?

SLIDE 11

The direction of steepest ascent (descent) is perpendicular to the contour line

The direction of steepest ascent (descent) is an increasing (decreasing) direction perpendicular to the contour line. The direction of steepest ascent (descent) from the point x∗ is where the contour line is tangent to a circle of radius one around x∗.

SLIDE 12

The gradient is also perpendicular to the contour line

Consider two points xA and xB, both of which lie on the same contour line. Because f(xA) = f(xB) = c, we have f(xA) − f(xB) = 0. Therefore

$$(a + b^\top x_A) - (a + b^\top x_B) = b^\top (x_A - x_B) = 0$$

and so the gradient is perpendicular to the contour line, because

1. The vector xA − xB runs parallel to the contour line.
2. Vectors are perpendicular if their dot product is zero.

SLIDE 13

The gradient is also perpendicular to the contour line

SLIDE 14

The gradient is perpendicular to the contour line

For linear functions the direction of steepest increase is perpendicular to the contour line, as is the gradient. From

$$\Delta f = b^\top \Delta x = \nabla f^\top \Delta x$$

we conclude that the gradient points in an increasing direction, since filling in ∇f for ∆x gives

$$\Delta f = \nabla f^\top \nabla f = \|\nabla f\|^2$$

Therefore:

1. The gradient points in the direction of fastest increase of f.
2. Minus the gradient points in the direction of fastest decrease of f.

SLIDE 15

Linear Approximation

This reasoning works for arbitrary functions by considering a local linear approximation to the function at x*, given by the tangent plane:

$$(y - y^*) = \frac{\partial f}{\partial x_1}(x^*)(x_1 - x^*_1) + \frac{\partial f}{\partial x_2}(x^*)(x_2 - x^*_2),$$

and using the linear approximation

$$\Delta f \approx \frac{\partial f}{\partial x_1}(x^*)\,\Delta x_1 + \frac{\partial f}{\partial x_2}(x^*)\,\Delta x_2 = \nabla f(x^*)^\top \Delta x.$$

Here ∂f/∂x1(x*) and ∂f/∂x2(x*) are the slopes of the tangent lines in the directions of x1 and x2, respectively, at the point x = x*.

SLIDE 16

Local Linear Approximation by Tangent Plane

The white dot represents the point (x*, f(x*)).

SLIDE 17

Gradient Descent Algorithm (multivariable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x(0).
2. Determine the gradient ∇f(x(i)) of f(x) at x(i) and update

   $$x^{(i+1)} \leftarrow x^{(i)} - \eta \nabla f(x^{(i)})$$

   Set i ← i + 1.
3. Repeat the previous step until ∇f(x(i)) = 0 and check if a (local) minimum has been reached.

Here η > 0 is the step size (or learning rate).
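The same procedure in a minimal R sketch (not from the slides; the quadratic example and the names gd, grad_f, eta, tol and maxiter are assumptions for illustration):

gd <- function(grad, x0, eta = 0.1, tol = 1e-8, maxiter = 10000) {
  x <- x0
  for (i in 1:maxiter) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break    # stop when the gradient is (numerically) zero
    x <- x - eta * g                   # x(i+1) <- x(i) - eta * grad f(x(i))
  }
  x
}

# Example objective f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2 with gradient:
grad_f <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3))
gd(grad_f, x0 = c(0, 0))               # converges to approximately (1, -3)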

SLIDE 18

Example of gradient descent for linear regression

Note: w0 and w1 are the variables here!

n   x   t   y = w0 + w1*x    e = t - y
1   0   1   w0               1 - w0
2   1   3   w0 + w1          3 - w0 - w1
3   2   4   w0 + 2w1         4 - w0 - 2w1
4   3   3   w0 + 3w1         3 - w0 - 3w1
5   4   5   w0 + 4w1         5 - w0 - 4w1

$$SSE(w_0, w_1) = (1 - w_0)^2 + (3 - w_0 - w_1)^2 + (4 - w_0 - 2w_1)^2 + (3 - w_0 - 3w_1)^2 + (5 - w_0 - 4w_1)^2$$

SLIDE 19

Example of gradient descent

$$\frac{\partial SSE}{\partial w_0} = 2(1 - w_0)(-1) + 2(3 - w_0 - w_1)(-1) + 2(4 - w_0 - 2w_1)(-1) + 2(3 - w_0 - 3w_1)(-1) + 2(5 - w_0 - 4w_1)(-1) = -32 + 10 w_0 + 20 w_1$$

$$\frac{\partial SSE}{\partial w_1} = 0 + 2(3 - w_0 - w_1)(-1) + 2(4 - w_0 - 2w_1)(-2) + 2(3 - w_0 - 3w_1)(-3) + 2(5 - w_0 - 4w_1)(-4) = -80 + 20 w_0 + 60 w_1$$

SLIDE 20

Example of gradient descent

So the gradient is:

$$\nabla SSE = \begin{pmatrix} \dfrac{\partial SSE}{\partial w_0} \\[4pt] \dfrac{\partial SSE}{\partial w_1} \end{pmatrix} = \begin{pmatrix} -32 + 10 w_0 + 20 w_1 \\ -80 + 20 w_0 + 60 w_1 \end{pmatrix}$$

Let w(0) = (0, 0). Then the gradient evaluated in the point w(0) is:

$$\nabla SSE(w^{(0)}) = \begin{pmatrix} -32 + 10 \times 0 + 20 \times 0 \\ -80 + 20 \times 0 + 60 \times 0 \end{pmatrix} = \begin{pmatrix} -32 \\ -80 \end{pmatrix}$$

SLIDE 21

Example of gradient descent

Let η = 1/50. Then we get the following update:

$$w^{(1)}_0 = w^{(0)}_0 - \eta \frac{\partial SSE}{\partial w_0} = 0 - \tfrac{1}{50} \times (-32) = 0.64$$

$$w^{(1)}_1 = w^{(0)}_1 - \eta \frac{\partial SSE}{\partial w_1} = 0 - \tfrac{1}{50} \times (-80) = 1.6$$

Or both at once:

$$w^{(1)} = w^{(0)} - \eta \nabla SSE(w^{(0)}) = \begin{pmatrix} 0 \\ 0 \end{pmatrix} - \tfrac{1}{50} \begin{pmatrix} -32 \\ -80 \end{pmatrix} = \begin{pmatrix} 0.64 \\ 1.6 \end{pmatrix}$$
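The arithmetic above is easy to check in R; a minimal sketch (not from the slides) using the (x, t) pairs from the table and an assumed helper sse_grad:

x <- c(0, 1, 2, 3, 4)
t <- c(1, 3, 4, 3, 5)

sse_grad <- function(w) {              # gradient of SSE(w0, w1), derived above
  e <- t - w[1] - w[2] * x             # residuals t_n - w0 - w1*x_n
  c(-2 * sum(e), -2 * sum(e * x))
}

eta <- 1/50
w <- c(0, 0)
sse_grad(w)                            # -32 -80, the gradient at w(0)
w <- w - eta * sse_grad(w)
w                                      # 0.64 1.60, i.e. w(1)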

SLIDE 22

Gradient Descent with step size η = 0.02

[Figure: contour plot of SSE over (w0, w1), with the successive gradient-descent iterates marked as points.]

SLIDE 23

Gradient Ascent for Logistic Regression

The logistic regression log-likelihood function is:

$$\ell(w) = \sum_{n=1}^{N} \{ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \}$$

with

$$p_n = (1 + e^{-w^\top x_n})^{-1}, \qquad 1 - p_n = (1 + e^{w^\top x_n})^{-1},$$

where p_n ≡ P(t = 1 | x_n). Filling this in gives:

$$\ell(w) = \sum_{n=1}^{N} \left\{ t_n \ln \frac{1}{1 + e^{-w^\top x_n}} + (1 - t_n) \ln \frac{1}{1 + e^{w^\top x_n}} \right\}$$

SLIDE 24

Determining the gradient

$$\ell(w) = \sum_{n=1}^{N} \{ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \}$$

$$g(w_j) = \frac{\partial \ell(w)}{\partial w_j} = \sum_{n=1}^{N} \left\{ \frac{t_n}{p_n} \cdot \frac{\partial p_n}{\partial w_j} - \frac{1 - t_n}{1 - p_n} \cdot \frac{\partial p_n}{\partial w_j} \right\} \qquad (1)$$

$$\frac{\partial p_n}{\partial w_j} = p_n (1 - p_n) x_{nj} \qquad (2)$$

Filling in (2) in equation (1) gives (verify this!):

$$g(w_j) = \sum_{n=1}^{N} (t_n - p_n) x_{nj}$$

SLIDE 25

Determining the gradient

Recall that:

$$p_n = (1 + e^{-w^\top x_n})^{-1}$$

so we have (apply the chain rule twice):

$$\frac{\partial p_n}{\partial w_j} = -(1 + e^{-w^\top x_n})^{-2} \cdot e^{-w^\top x_n} \cdot (-x_{nj}) = \frac{e^{-w^\top x_n}}{(1 + e^{-w^\top x_n})^2}\, x_{nj} = \frac{1}{1 + e^{-w^\top x_n}} \cdot \frac{e^{-w^\top x_n}}{1 + e^{-w^\top x_n}}\, x_{nj} = p_n (1 - p_n) x_{nj}$$
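The result can also be verified numerically; a small R sketch (not from the slides; the values of w, x, j and the step h are assumptions):

p_fun <- function(w, x) 1 / (1 + exp(-sum(w * x)))   # p = (1 + exp(-w'x))^(-1)
w <- c(0.5, -1.2); x <- c(1, 2); j <- 2; h <- 1e-6
w_h <- w; w_h[j] <- w_h[j] + h
(p_fun(w_h, x) - p_fun(w, x)) / h                    # finite-difference derivative
p_fun(w, x) * (1 - p_fun(w, x)) * x[j]               # analytic p*(1-p)*x_j; the two agree up to O(h)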

SLIDE 26

Optimization by Gradient Ascent

The gradient points in the direction of the steepest ascent of the function. We can perform optimization by the method of gradient ascent as follows. The new estimate of w_j based on processing the n-th observation is:

$$w^{(i+1)}_j = w^{(i)}_j + \eta \, g_n(w_j) = w^{(i)}_j + \eta \,(t_n - p^{(i)}_n)\, x_{nj},$$

where

$$p^{(i)}_n = (1 + e^{-w^{(i)\top} x_n})^{-1}$$

is the current estimate of P(t = 1 | x_n), that is, using w(i).

SLIDE 27

Optimization by Gradient Ascent (batch version)

The basic gradient-ascent algorithm applied to logistic regression:

1. Choose an initial value w(0) (e.g. at random); i ← 0.
2. Determine the gradient and update

   $$w^{(i+1)}_j \leftarrow w^{(i)}_j + \eta \sum_{n=1}^{N} (t_n - p^{(i)}_n)\, x_{nj}, \qquad j = 0, \ldots, m$$

   Set i ← i + 1.
3. Repeat the previous step until g(wj) = 0 for all j = 0, …, m and check if a (local) maximum has been reached.
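A minimal batch-version sketch in R, in the style of the online implementation shown later (not from the slides; the name logreg.batch and the settings maxiter and eta are assumptions):

logreg.batch <- function(x, y, maxiter = 1000, eta = 0.01) {
  N <- length(y)
  x <- cbind(rep(1, N), x)                       # append column of ones for w0
  w <- runif(ncol(x), -1, 1)                     # random start weights
  for (i in 1:maxiter) {
    p <- 1 / (1 + exp(-as.numeric(x %*% w)))     # current estimates p_n
    g <- as.numeric(t(x) %*% (y - p))            # gradient: sum_n (t_n - p_n) x_nj
    w <- w + eta * g                             # batch update of all weights
  }
  w
}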

SLIDE 28

Optimization by Gradient Ascent (online version)

You can also update the weights after processing each single observation:

1. Choose an initial value w(0). Set i ← 0.
2. For n = 1 to N:

   $$w^{(i+1)}_j \leftarrow w^{(i)}_j + \eta \,(t_n - p^{(i)}_n)\, x_{nj}, \qquad j = 0, \ldots, m$$

   i ← i + 1.
3. Repeat step 2 until convergence and check if a (local) maximum has been reached.

SLIDE 29

Simple R implementation (online version)

function (x, y, maxepoch = 100, eta = 0.01) {
  N <- length(y)
  x <- cbind(rep(1, N), x)          # append column of ones
  M <- ncol(x)
  W <- matrix(nrow = maxepoch, ncol = M)
  w <- runif(M, -1, 1)              # random start weights
  p <- vector(length = N)
  for (i in 1:maxepoch) {
    W[i, ] <- w                     # store weights of this epoch
    index <- sample(N)              # process in random order
    for (n in index) {
      z <- as.numeric(-w %*% x[n, ])
      p[n] <- 1 / (1 + exp(z))      # p_n = 1 / (1 + exp(-w'x_n))
      for (j in 1:M) {
        s <- (y[n] - p[n]) * x[n, j]   # gradient contribution (t_n - p_n) x_nj
        w[j] <- w[j] + eta * s         # online update of w_j
      }
    }
  }
  return(W)
}
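A usage sketch (not from the slides), assuming the function above has been assigned the hypothetical name logreg.online and is run on simulated data:

set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-1 + 2 * x)))      # true weights: w0 = -1, w1 = 2
y <- rbinom(200, 1, p)
W <- logreg.online(x, y, maxepoch = 500, eta = 0.05)
tail(W, 1)                             # final weight estimates; near (-1, 2) up to sampling noise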

SLIDE 30

Fitting the programming assignment data

> prog.logreg <- glm(succes ~ month.exp, data = prog.dat, family = binomial)
> summary(prog.logreg)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.05970    1.25935  -2.430   0.0151 *
month.exp     0.16149    0.06498   2.485   0.0129 *

Hence, the maximum likelihood estimates are w0 ≈ −3.06 and w1 ≈ 0.16. Final values obtained with a run of gradient ascent were:

w0 = -3.0596346
w1 = 0.1623779

See the next two slides for graphs of the iterations.

SLIDE 31

w1: 100,000 epochs with η = 0.00015

[Plot: trajectory of w1 against iteration number over the 100,000 epochs.]

SLIDE 32

w0: 100,000 epochs with η = 0.00015

[Plot: trajectory of w0 against iteration number over the 100,000 epochs.]