SLIDE 1

Minimization Using Descent Information

  • We will consider the minimization of unconstrained functions of several variables, where we now assume we have some derivative information such as the gradient vector or the Hessian matrix.

  • Recall that Powell’s method used the powerful concept of conjugate directions and performed a series of line searches.

  • We will see how these conjugate directions are related to the gradient directions, and we will introduce a very powerful method called the conjugate-gradient technique.

  • Recall that Taylor’s expansion uses such information (a small numerical check follows this list):

$$f(\mathbf{x} + \Delta\mathbf{x}) \cong f(\mathbf{x}) + \Delta\mathbf{x}^T \nabla f(\mathbf{x}) + \tfrac{1}{2}\,\Delta\mathbf{x}^T \mathbf{H}(\mathbf{x})\,\Delta\mathbf{x}$$

  • Methods using only first derivatives are called first-order methods.
  • Methods using second derivatives are called second-order methods.
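As a concrete check of the expansion (not from the slides; the test function and step are arbitrary choices), the sketch below builds the second-order Taylor model from the gradient and Hessian. Python is assumed throughout these examples.

```python
import numpy as np

# Hypothetical test function with known derivatives:
# f(x) = x1^2 + 3*x2^2 + x1*x2
def f(x):
    return x[0]**2 + 3*x[1]**2 + x[0]*x[1]

def grad(x):
    return np.array([2*x[0] + x[1], 6*x[1] + x[0]])

def hess(x):
    return np.array([[2.0, 1.0], [1.0, 6.0]])

x = np.array([1.0, -1.0])
dx = np.array([0.1, 0.05])

# Second-order Taylor model: f(x+dx) ~ f(x) + dx'.g + 0.5*dx'.H.dx
taylor = f(x) + dx @ grad(x) + 0.5 * dx @ hess(x) @ dx
print(f(x + dx), taylor)  # identical here, since f is itself quadratic
```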
SLIDE 2

The Gradient Vector Re-examined

  • Recall that the gradient vector of a function $f(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^n$, is

$$\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \;\; \cdots \;\; \frac{\partial f}{\partial x_n} \right]$$

  • Consider a differential length

$$d\mathbf{x} = \left[ dx_1 \;\; dx_2 \;\; \cdots \;\; dx_n \right] = \mathbf{u}\, ds$$

where $\|\mathbf{u}\| = 1$ and $\mathbf{u}$ holds the directional information of the differential.

  • Now a change in the function $f(\mathbf{x})$ along $d\mathbf{x}$ is given by

$$df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\, dx_i = (\nabla f(\mathbf{x}), d\mathbf{x}) = (\nabla f(\mathbf{x}), \mathbf{u}\, ds) = ds\, (\nabla f(\mathbf{x}), \mathbf{u}),$$

so the rate of change of $f(\mathbf{x})$ along the arbitrary direction $\mathbf{u}$ is given by

SLIDE 3

$$\frac{df}{ds} = (\nabla f(\mathbf{x}), \mathbf{u}).$$

  • On the other hand, along a direction $\mathbf{u}$ the function $f(\mathbf{x})$ is described as $f(\mathbf{x} + \alpha\mathbf{u})$, and thus the rate of change along this direction can also be written as

$$\frac{df}{ds} = (\nabla f(\mathbf{x}), \mathbf{u}) = \frac{d}{d\alpha} f(\mathbf{x} + \alpha\mathbf{u}) \tag{1}$$

  • We can examine (1) to see which direction gives the maximum rate of increase.
  • A well-known inequality in functional analysis, the Cauchy-Schwarz inequality, says

$$(\mathbf{a}, \mathbf{b}) \le \|\mathbf{a}\|\,\|\mathbf{b}\|. \tag{2}$$

Applying this to (1) we find

$$(\nabla f(\mathbf{x}), \mathbf{u}) \le \|\nabla f(\mathbf{x})\|\,\|\mathbf{u}\| = \|\nabla f(\mathbf{x})\|, \tag{3}$$

and letting

$$\mathbf{u} = \frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}$$

we have

$$\left( \nabla f(\mathbf{x}),\; \frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|} \right) = \frac{\|\nabla f(\mathbf{x})\|^2}{\|\nabla f(\mathbf{x})\|} = \|\nabla f(\mathbf{x})\|, \tag{4}$$

which shows that for this choice of direction the directional derivative $(\nabla f(\mathbf{x}), \mathbf{u})$ reaches its maximum value.

  • Therefore we say that $\nabla f(\mathbf{x})$ is the direction of maximum increase and $-\nabla f(\mathbf{x})$ is the direction of maximum decrease (also called steepest ascent and steepest descent, respectively).
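A quick numerical check of (2)–(4), reusing the hypothetical test function above (any smooth $f$ would do): the directional derivative $(\nabla f, \mathbf{u})$ over random unit vectors never exceeds $\|\nabla f\|$, and attains it for $\mathbf{u} = \nabla f / \|\nabla f\|$.

```python
import numpy as np

def grad(x):  # gradient of the hypothetical f(x) = x1^2 + 3*x2^2 + x1*x2
    return np.array([2*x[0] + x[1], 6*x[1] + x[0]])

x = np.array([1.0, -1.0])
g = grad(x)

rng = np.random.default_rng(0)
for _ in range(5):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)                  # random unit direction
    print(g @ u, "<=", np.linalg.norm(g))   # Cauchy-Schwarz bound, eq. (3)

u_star = g / np.linalg.norm(g)              # steepest-ascent direction
print(g @ u_star, "==", np.linalg.norm(g))  # bound attained, eq. (4)
```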

SLIDE 4

Cauchy’s Method (Steepest Descent)

  • A logical minimization strategy is to use the direction of steepest descent and perform a line search in that direction.

  • Assume we are at a point $\mathbf{x}_k$ and that we have calculated the gradient at this point, $\mathbf{g}_k = \mathbf{g}(\mathbf{x}_k) = \nabla f(\mathbf{x}_k)$.

  • Then we can minimize along $\mathbf{g}_k$ starting from $\mathbf{x}_k$:

$$\alpha_k = \arg\min_{\alpha} f(\mathbf{x}_k + \alpha \mathbf{g}_k) \tag{5}$$

and arrive at the new point

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{g}_k. \tag{6}$$

Algorithm: Cauchy’s Method of Steepest Descent

  1. input: $f(\mathbf{x})$, $\mathbf{g}(\mathbf{x})$, $\mathbf{x}_0$, gtol, kmax
  2. set: $k = 0$, $\mathbf{x} = \mathbf{x}_0$, $\mathbf{g} = \mathbf{g}(\mathbf{x})$
  3. while $k <$ kmax and $\|\mathbf{g}\| >$ gtol
  4.   set: $\alpha = \arg\min_{\alpha} f(\mathbf{x} + \alpha \mathbf{g})$
  5.   set: $\mathbf{x} = \mathbf{x} + \alpha \mathbf{g}$
  6.   set: $\mathbf{g} = \mathbf{g}(\mathbf{x})$, $k = k + 1$
  7. end
  • Note that we don’t bother to take the negative of $\mathbf{g}$ in the line search, since this is automatically accomplished by allowing negative values of $\alpha$.

  • The iterations end when either the maximum number of iterations has been reached or the norm of the gradient at the current point is less than a user-defined value gtol, which is a value close to zero. A runnable sketch follows below.
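A minimal Python sketch of the algorithm above, with scipy’s scalar minimizer standing in for the (unspecified) line search; the function names and test problem are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_steepest_descent(f, g, x0, gtol=1e-6, kmax=200):
    """Cauchy's method of steepest descent (sketch).

    As noted above, we search along +g and let negative alpha
    values take care of the descent direction.
    """
    x = np.asarray(x0, dtype=float)
    gk = g(x)
    k = 0
    while k < kmax and np.linalg.norm(gk) > gtol:
        alpha = minimize_scalar(lambda a: f(x + a * gk)).x  # line search
        x = x + alpha * gk
        gk = g(x)
        k += 1
    return x, k

# Illustrative quadratic test problem:
f = lambda x: x[0]**2 + 3*x[1]**2
g = lambda x: np.array([2*x[0], 6*x[1]])
print(cauchy_steepest_descent(f, g, [1.0, 1.0]))
```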

SLIDE 5
  • The successive directions of minimization in the method of steepest descent are orthogonal to each other.
  • This can be shown as follows: assume that we are at a point $\mathbf{x}_k$ and we need to find

$$\alpha_k = \arg\min_{\alpha} f(\mathbf{x}_k + \alpha \mathbf{g}_k) \tag{7}$$

in order to arrive at the new point

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{g}_k. \tag{8}$$

  • We can find $\alpha_k$ by setting the derivative of $F(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{g}_k)$ with respect to $\alpha$ equal to zero, which from (1) we can write as

$$\frac{d}{d\alpha} F(\alpha) = (\nabla f(\mathbf{x}_{k+1}), \mathbf{g}_k) = (\mathbf{g}_{k+1}, \mathbf{g}_k) = 0. \tag{9}$$

This immediately shows the orthogonality between the successive directions $\mathbf{g}_{k+1}$ and $\mathbf{g}_k$.
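Equation (9) is easy to verify numerically: after one exact line search, the new gradient is orthogonal to the old one. A minimal check with the illustrative quadratic used earlier:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x[0]**2 + 3*x[1]**2        # illustrative quadratic
g = lambda x: np.array([2*x[0], 6*x[1]])

x = np.array([1.0, 1.0])
gk = g(x)
alpha = minimize_scalar(lambda a: f(x + a * gk)).x  # exact line search
x_next = x + alpha * gk
print(g(x_next) @ gk)  # ~0: successive gradients are orthogonal, eq. (9)
```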

SLIDE 6
  • Now, although it may seem like a good idea to minimize along the direction of steepest descent, it turns out that this method is not very efficient.

  • For relatively complicated functions, this method will tend to zig-zag towards the minimum.

[Figure: zig-zagging effect of Cauchy’s method, showing successive orthogonal search directions $\mathbf{g}_1$, $\mathbf{g}_2$ in the $(x_1, x_2)$ plane.]

  • Note that since successive directions of descent are always orthogonal, in two dimensions the algorithm searches in only two directions.

  • This is what makes the algorithm slow to converge to the minimum; its use is generally not recommended. The sketch below illustrates the effect on an ill-conditioned quadratic.
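An arbitrary illustration of the slowdown: for $f = \frac{1}{2}\mathbf{x}^T \mathbf{C} \mathbf{x}$ the exact line-search step has the closed form $\alpha = -(\mathbf{g}, \mathbf{s})/(\mathbf{s}, \mathbf{C}\mathbf{s})$, so steepest descent can be run without a numerical line search and the iterations counted.

```python
import numpy as np

C = np.diag([1.0, 100.0])   # condition number 100
g = lambda x: C @ x         # gradient of f(x) = 0.5 * x'.C.x

x = np.array([1.0, 1.0])
k = 0
while np.linalg.norm(g(x)) > 1e-8 and k < 10_000:
    s = -g(x)                          # steepest-descent direction
    alpha = -(g(x) @ s) / (s @ C @ s)  # exact minimizing step
    x = x + alpha * s
    k += 1
print(k)  # hundreds of orthogonal zig-zag steps on this simple problem
```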

SLIDE 7

The Conjugate-Gradient Method

  • It turns out that if we have access to the gradient of the objective function, then we can determine conjugate directions relatively efficiently.

  • Recall that Powell’s conjugate direction method requires $n$ single-variable minimizations per iteration in order to determine one new conjugate direction at the end of the iteration. This results in approximately $n^2$ line minimizations to find the minimum of a quadratic function.

  • In the conjugate-gradient algorithm, access to the gradient of $f(\mathbf{x})$, that is $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})$, allows us to set up a new conjugate direction after every line minimization.

  • Consider the quadratic function

$$Q(\mathbf{x}) = a + \mathbf{b}^T \mathbf{x} + \tfrac{1}{2}\, \mathbf{x}^T \mathbf{C} \mathbf{x} \tag{10}$$

where $\mathbf{x} \in \mathbb{R}^n$.

  • We want to perform successive line minimizations along conjugate directions, say $\mathbf{s}(\mathbf{x}_k)$, where $\mathbf{x}_k$ is the current search point and we minimize to the next point as

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \lambda_k \mathbf{s}(\mathbf{x}_k). \tag{11}$$

  • Now, how do we find these conjugate directions $\mathbf{s}_k = \mathbf{s}(\mathbf{x}_k)$ given previous information?

SLIDE 8
  • Expand the search directions in terms of the gradient at the current point $\mathbf{g}_k = \nabla f(\mathbf{x}_k)$ and a linear combination of the previous search directions:

$$\mathbf{s}_k = -\mathbf{g}_k + \sum_{i=0}^{k-1} \gamma_i \mathbf{s}_i \tag{12}$$

where we start with the initial search direction as the steepest-descent direction $\mathbf{s}_0 = -\mathbf{g}_0$.

  • The coefficients of the expansion, $\gamma_i$, $i = 0, \ldots, k-1$, are to be chosen so that the $\mathbf{s}_k$ are C-conjugate.

  • Therefore, we have

$$\mathbf{s}_1 = -\mathbf{g}_1 + \gamma_0 \mathbf{s}_0 = -\mathbf{g}_1 - \gamma_0 \mathbf{g}_0 \tag{13}$$

and we require that

$$(\mathbf{s}_1, \mathbf{C}\mathbf{s}_0) = (-\mathbf{g}_1 - \gamma_0 \mathbf{g}_0, \mathbf{C}\mathbf{s}_0) = 0, \tag{14}$$

but $\mathbf{x}_1 = \mathbf{x}_0 + \lambda_0 \mathbf{s}_0$, which can be solved for $\mathbf{s}_0$ as

$$\mathbf{s}_0 = \frac{\mathbf{x}_1 - \mathbf{x}_0}{\lambda_0} = \frac{\Delta \mathbf{x}_0}{\lambda_0} \tag{15}$$

where $\Delta \mathbf{x}_0 = \mathbf{x}_1 - \mathbf{x}_0$ is the forward difference operator. Using this in (14) we have

$$\left( -\mathbf{g}_1 - \gamma_0 \mathbf{g}_0,\; \mathbf{C}\, \frac{\Delta \mathbf{x}_0}{\lambda_0} \right) = 0, \tag{16}$$

but $\mathbf{g}(\mathbf{x}) = \mathbf{C}\mathbf{x} + \mathbf{b}$, which means that we can set

$$\Delta \mathbf{g}_0 = \mathbf{g}_1 - \mathbf{g}_0 = \mathbf{C}(\mathbf{x}_1 - \mathbf{x}_0) = \mathbf{C}\, \Delta \mathbf{x}_0. \tag{17}$$

SLIDE 9

which, substituting into (16), gives

$$\left( -\mathbf{g}_1 - \gamma_0 \mathbf{g}_0,\; \frac{\Delta \mathbf{g}_0}{\lambda_0} \right) = 0 \quad\Rightarrow\quad -(\Delta \mathbf{g}_0, \mathbf{g}_1) = \gamma_0 (\Delta \mathbf{g}_0, \mathbf{g}_0),$$

and finally

$$\gamma_0 = -\frac{(\Delta \mathbf{g}_0, \mathbf{g}_1)}{(\Delta \mathbf{g}_0, \mathbf{g}_0)}. \tag{18}$$

  • We can further reduce the numerator as follows:

$$(\Delta \mathbf{g}_0, \mathbf{g}_1) = (\mathbf{g}_1 - \mathbf{g}_0, \mathbf{g}_1) = (\mathbf{g}_1, \mathbf{g}_1) - (\mathbf{g}_0, \mathbf{g}_1) = \|\mathbf{g}_1\|^2 \tag{19}$$

since $(\mathbf{g}_0, \mathbf{g}_1) = 0$ (successive directions of minimization are orthogonal).

  • The denominator can also be rewritten as

$$(\Delta \mathbf{g}_0, \mathbf{g}_0) = (\mathbf{g}_1 - \mathbf{g}_0, \mathbf{g}_0) = (\mathbf{g}_1, \mathbf{g}_0) - (\mathbf{g}_0, \mathbf{g}_0) = -\|\mathbf{g}_0\|^2, \tag{20}$$

and therefore the coefficient $\gamma_0$ can be calculated from

$$\gamma_0 = \frac{\|\mathbf{g}_1\|^2}{\|\mathbf{g}_0\|^2}. \tag{21}$$

  • The conjugate direction $\mathbf{s}_1$ can then be calculated as

$$\mathbf{s}_1 = -\mathbf{g}_1 + \frac{\|\mathbf{g}_1\|^2}{\|\mathbf{g}_0\|^2}\, \mathbf{s}_0. \tag{22}$$
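The whole derivation (13)–(22) can be checked numerically: build a small quadratic, take one exact line-search step from $\mathbf{x}_0$ along $\mathbf{s}_0 = -\mathbf{g}_0$, form $\mathbf{s}_1$ via (21)–(22), and confirm that $(\mathbf{s}_1, \mathbf{C}\mathbf{s}_0) \approx 0$. The matrix and starting point below are arbitrary illustrations.

```python
import numpy as np

C = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])
g = lambda x: C @ x + b                 # gradient of Q(x) in (10)

x0 = np.array([2.0, -1.0])
g0 = g(x0)
s0 = -g0                                # steepest-descent start

lam0 = -(g0 @ s0) / (s0 @ C @ s0)       # exact step on a quadratic
x1 = x0 + lam0 * s0
g1 = g(x1)

gamma0 = (g1 @ g1) / (g0 @ g0)          # eq. (21)
s1 = -g1 + gamma0 * s0                  # eq. (22)
print(s1 @ C @ s0)                      # ~0: s1 is C-conjugate to s0
```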

SLIDE 10
  • Now although we derived this for $\mathbf{s}_1$, we could have derived it for any $\mathbf{s}_k$ to get

$$\mathbf{s}_k = -\mathbf{g}_k + \frac{\|\mathbf{g}_k\|^2}{\|\mathbf{g}_{k-1}\|^2}\, \mathbf{s}_{k-1}. \tag{23}$$

  • This represents the Fletcher-Reeves scheme (1964) for determining conjugate search directions from the gradient at the current point, $\mathbf{g}_k$, the previous search direction, $\mathbf{s}_{k-1}$, and the magnitude of the gradient at the previous point, $\|\mathbf{g}_{k-1}\|$.

  • Two alternative update equations are the Hestenes-Stiefel method (1952):

$$\mathbf{s}_k = -\mathbf{g}_k + \frac{(\Delta \mathbf{g}_{k-1}, \mathbf{g}_k)}{(\Delta \mathbf{g}_{k-1}, \mathbf{s}_{k-1})}\, \mathbf{s}_{k-1} \tag{24}$$

and the Polak-Ribière method (1969):

$$\mathbf{s}_k = -\mathbf{g}_k + \frac{(\Delta \mathbf{g}_{k-1}, \mathbf{g}_k)}{\|\mathbf{g}_{k-1}\|^2}\, \mathbf{s}_{k-1}. \tag{25}$$

  • All three schemes are identical for quadratic functions but will be different for non-quadratic functions.

  • For quadratic functions, these methods will find the exact minimum in $n$ iterations.

  • For non-quadratic functions, the search direction is reset to the steepest-descent direction every $n$ iterations. Compact code for the three updates follows below.
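The three schemes differ only in the coefficient multiplying $\mathbf{s}_{k-1}$; a compact sketch of (23)–(25) with illustrative names (the unused arguments keep a uniform interface):

```python
import numpy as np

def beta_fletcher_reeves(g_new, g_old, s_old):
    return (g_new @ g_new) / (g_old @ g_old)    # eq. (23)

def beta_hestenes_stiefel(g_new, g_old, s_old):
    dg = g_new - g_old
    return (dg @ g_new) / (dg @ s_old)          # eq. (24)

def beta_polak_ribiere(g_new, g_old, s_old):
    dg = g_new - g_old
    return (dg @ g_new) / (g_old @ g_old)       # eq. (25)

def next_direction(g_new, g_old, s_old, beta_fn):
    # identical structure for all three update schemes
    return -g_new + beta_fn(g_new, g_old, s_old) * s_old
```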

SLIDE 11

Algorithm: Conjugate Gradient Method (Fletcher-Reeves)

  1. input: $f(\mathbf{x})$, $\mathbf{g}(\mathbf{x})$, $\mathbf{x}_0$, gtol, kmax, $n$
  2. set: $k = 0$, $\mathbf{x} = \mathbf{x}_0$
  3. set: $\mathbf{g}_{old} = \mathbf{g}_{new} = \mathbf{g}(\mathbf{x})$
  4. set: $\mathbf{s}_{old} = \mathbf{s}_{new} = \mathbf{g}_{old}$
  5. while $\|\mathbf{g}_{new}\| >$ gtol and $k <$ kmax
  6.   $k = k + 1$
  7.   for $i = 1(1)n$
  8.     if $i = 1$ then
  9.       set: $\mathbf{s}_{new} = -\mathbf{g}_{new}$
  10.    else
  11.      set: $\gamma = \|\mathbf{g}_{new}\|^2 / \|\mathbf{g}_{old}\|^2$
  12.      set: $\mathbf{s}_{new} = -\mathbf{g}_{new} + \gamma\, \mathbf{s}_{old}$
  13.    end
  14.    set: $\lambda = \arg\min_{\lambda} f(\mathbf{x} + \lambda\, \mathbf{s}_{new})$
  15.    set: $\mathbf{x} = \mathbf{x} + \lambda\, \mathbf{s}_{new}$
  16.    set: $\mathbf{g}_{old} = \mathbf{g}_{new}$, $\mathbf{g}_{new} = \mathbf{g}(\mathbf{x})$
  17.    set: $\mathbf{s}_{old} = \mathbf{s}_{new}$
  18.  end
  19. end

  • Notes: line 11 contains the Fletcher-Reeves update scheme.
  • This can be changed to use the Hestenes-Stiefel or the Polak-Ribière scheme as desired (i.e. (24) or (25), respectively); the algorithm terminates once the norm of the gradient is smaller than a user-specified value gtol or the number of iterations grows beyond kmax.
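A runnable sketch of the algorithm above, again assuming scipy’s scalar minimizer for the line searches; the reset to steepest descent every $n$ inner iterations follows the slides, while the test function is an illustrative non-quadratic choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_fr(f, g, x0, gtol=1e-6, kmax=200):
    """Fletcher-Reeves conjugate gradient with restart every n steps."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    g_new = g(x)
    k = 0
    while np.linalg.norm(g_new) > gtol and k < kmax:
        k += 1
        for i in range(1, n + 1):
            if i == 1:
                s = -g_new          # restart with steepest descent
            else:
                gamma = (g_new @ g_new) / (g_old @ g_old)  # line 11, eq. (23)
                s = -g_new + gamma * s
            lam = minimize_scalar(lambda a: f(x + a * s)).x  # line search
            x = x + lam * s
            g_old, g_new = g_new, g(x)
    return x, k

# Rosenbrock function as a non-quadratic test (illustrative choice):
f = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
g = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                        200*(x[1] - x[0]**2)])
print(conjugate_gradient_fr(f, g, [-1.2, 1.0]))
```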