Conjugate Direction Minimization


Conjugate Direction minimization. Lectures for the PhD course on Numerical Optimization. Enrico Bertolazzi, DIMS, Università di Trento. November 21 – December 14, 2011. Conjugate Direction minimization 1 / 106. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension.


1. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Proof (3/5). STEP 3: problem reduction. Using a Lagrange multiplier, the maxima and minima are the stationary points of
$$g(\alpha_1,\ldots,\alpha_n,\mu) = h(\alpha_1,\ldots,\alpha_n) + \mu\Big(\sum_{k=1}^n \alpha_k^2 - 1\Big).$$
Setting $A = \sum_{k=1}^n \alpha_k^2\lambda_k$ and $B = \sum_{k=1}^n \alpha_k^2\lambda_k^{-1}$ we have
$$\frac{\partial g(\alpha_1,\ldots,\alpha_n,\mu)}{\partial\alpha_k} = 2\alpha_k\big(\lambda_k B + \lambda_k^{-1}A + \mu\big) = 0,$$
so that either (1) $\alpha_k = 0$, or (2) $\lambda_k$ is a root of the quadratic polynomial $\lambda^2 B + \lambda\mu + A$. In any case at most 2 of the coefficients $\alpha_k$ are non-zero. (Footnote: the argument should be improved in the case of multiple eigenvalues.) Conjugate Direction minimization 20 / 106

2. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Proof (4/5). STEP 4: problem reformulation. Say $\alpha_i$ and $\alpha_j$ are the only non-zero coefficients; then $\alpha_i^2 + \alpha_j^2 = 1$ and we can write
$$h(\alpha_1,\ldots,\alpha_n) = \big(\alpha_i^2\lambda_i + \alpha_j^2\lambda_j\big)\big(\alpha_i^2\lambda_i^{-1} + \alpha_j^2\lambda_j^{-1}\big) = \alpha_i^4 + \alpha_j^4 + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i}\Big)$$
$$= \alpha_i^2(1-\alpha_j^2) + \alpha_j^2(1-\alpha_i^2) + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i}\Big) = 1 + \alpha_i^2\alpha_j^2\Big(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i} - 2\Big) = 1 + \alpha_i^2(1-\alpha_i^2)\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}.$$
Conjugate Direction minimization 21 / 106

3. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Proof (5/5). STEP 5: bounding maxima and minima. Notice that $0 \le \beta(1-\beta) \le \tfrac14$ for all $\beta\in[0,1]$, hence
$$1 \le 1 + \alpha_i^2(1-\alpha_i^2)\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j} \le 1 + \frac{(\lambda_i-\lambda_j)^2}{4\lambda_i\lambda_j} = \frac{(\lambda_i+\lambda_j)^2}{4\lambda_i\lambda_j}.$$
To bound $(\lambda_i+\lambda_j)^2/(4\lambda_i\lambda_j)$ consider the function $f(x) = (1+x)^2/x$, which is increasing for $x\ge 1$, so that
$$\frac{(\lambda_i+\lambda_j)^2}{4\lambda_i\lambda_j} \le \frac{(M+m)^2}{4Mm},$$
and finally $1 \le h(\alpha_1,\ldots,\alpha_n) \le \dfrac{(M+m)^2}{4Mm}$. Conjugate Direction minimization 22 / 106

4. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Convergence rate of Steepest Descent. The Kantorovich inequality allows us to prove: Theorem (Convergence rate of Steepest Descent). Let $A\in\mathbb{R}^{n\times n}$ be an SPD matrix; then the steepest descent method
$$x_{k+1} = x_k + \frac{r_k^T r_k}{r_k^T A r_k}\, r_k$$
converges to the solution $x^\star = A^{-1}b$ with at least linear q-rate in the norm $\|\cdot\|_A$. Moreover we have the error estimate
$$\|x_{k+1} - x^\star\|_A \le \frac{\kappa-1}{\kappa+1}\,\|x_k - x^\star\|_A,$$
where $\kappa = M/m$ is the condition number, $m = \lambda_1$ is the smallest eigenvalue of $A$ and $M = \lambda_n$ is the biggest eigenvalue of $A$. Conjugate Direction minimization 23 / 106
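The following short NumPy check (not part of the slides; the test matrix, eigenvalue range and iteration count are arbitrary choices of mine) runs steepest descent on a small SPD system and verifies the contraction factor $(\kappa-1)/(\kappa+1)$ of the theorem at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]       # random orthogonal matrix
lam = np.linspace(1.0, 50.0, n)                        # eigenvalues: m = 1, M = 50
A = Q @ np.diag(lam) @ Q.T                             # SPD with known spectrum
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = lam[-1] / lam[0]
rate = (kappa - 1.0) / (kappa + 1.0)                   # (kappa-1)/(kappa+1)

err_A = lambda x: np.sqrt((x - x_star) @ A @ (x - x_star))   # energy-norm error
x = np.zeros(n)
for k in range(20):
    r = b - A @ x
    x_new = x + (r @ r) / (r @ (A @ r)) * r            # steepest descent step
    assert err_A(x_new) <= rate * err_A(x) + 1e-12     # theorem's error estimate
    x = x_new
```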

5. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Proof. Remember from slide N.16
$$\|x^\star - x_{k+1}\|_A^2 = \|x^\star - x_k\|_A^2\left(1 - \frac{(r_k^T r_k)^2}{(r_k^T A^{-1} r_k)(r_k^T A r_k)}\right).$$
From the Kantorovich inequality
$$1 - \frac{(r_k^T r_k)^2}{(r_k^T A^{-1} r_k)(r_k^T A r_k)} \le 1 - \frac{4Mm}{(M+m)^2} = \frac{(M-m)^2}{(M+m)^2},$$
so that $\|x^\star - x_{k+1}\|_A \le \dfrac{M-m}{M+m}\,\|x^\star - x_k\|_A$. Conjugate Direction minimization 24 / 106

6. Convergence rate of Steepest Descent iterative scheme. The steepest descent convergence rate. Remark (One step convergence). The steepest descent method can converge in one iteration if $\kappa = 1$ or when $r_0 = u_k$, where $u_k$ is an eigenvector of $A$. (1) In the first case ($\kappa = 1$) we have $A = \beta I$ for some $\beta > 0$, so it is not interesting. (2) In the second case we have
$$\frac{(u_k^T u_k)^2}{(u_k^T A^{-1} u_k)(u_k^T A u_k)} = \frac{(u_k^T u_k)^2}{\lambda_k^{-1}(u_k^T u_k)\,\lambda_k(u_k^T u_k)} = 1.$$
In both cases we have $r_1 = 0$, i.e. we have found the solution. Conjugate Direction minimization 25 / 106

7. Conjugate direction method. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension. Conjugate Direction minimization 26 / 106

8. Conjugate direction method. Conjugate vectors. Conjugate direction method. Definition (Conjugate vectors). Two vectors $p$ and $q$ in $\mathbb{R}^n$ are conjugate with respect to $A$ if they are orthogonal with respect to the scalar product induced by $A$; i.e.,
$$p^T A q = \sum_{i,j=1}^n A_{ij}\,p_i q_j = 0.$$
Clearly, $n$ vectors $p_1, p_2, \ldots, p_n \in \mathbb{R}^n$ that are pairwise conjugate with respect to $A$ form a basis of $\mathbb{R}^n$. Conjugate Direction minimization 27 / 106

9. Conjugate direction method. Conjugate vectors. Problem (Linear system). Finding the minimum of $q(x) = \tfrac12 x^T A x - b^T x + c$ is equivalent to solving the first order necessary condition, i.e. find $x^\star\in\mathbb{R}^n$ such that $Ax^\star = b$. Observation. Consider $x_0\in\mathbb{R}^n$ and decompose the error $e_0 = x^\star - x_0$ along the conjugate vectors $p_1, p_2, \ldots, p_n\in\mathbb{R}^n$:
$$e_0 = x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n.$$
Evaluating the coefficients $\sigma_1,\sigma_2,\ldots,\sigma_n\in\mathbb{R}$ is equivalent to solving the problem $Ax^\star = b$, because knowing $e_0$ we have $x^\star = x_0 + e_0$. Conjugate Direction minimization 28 / 106

10. Conjugate direction method. Conjugate vectors. Observation. Using conjugacy the coefficients $\sigma_1,\sigma_2,\ldots,\sigma_n\in\mathbb{R}$ can be computed as
$$\sigma_i = \frac{p_i^T A e_0}{p_i^T A p_i}, \qquad i = 1,2,\ldots,n.$$
In fact, for all $1\le i\le n$ we have
$$p_i^T A e_0 = p_i^T A(\sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n) = \sigma_1\, p_i^T A p_1 + \sigma_2\, p_i^T A p_2 + \cdots + \sigma_n\, p_i^T A p_n = \sigma_i\, p_i^T A p_i,$$
because $p_i^T A p_j = 0$ for $i\ne j$. Conjugate Direction minimization 29 / 106
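As an illustration of the observation above (my own sketch, not from the slides): the eigenvectors of an SPD matrix are mutually orthogonal and therefore A-conjugate, so they can serve as the directions $p_i$; the coefficients $\sigma_i$ then reconstruct $x^\star$ exactly. In practice one would use $p_i^T A e_0 = p_i^T r_0$ (see slide 43), since $e_0$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.standard_normal((n, n))
A = W @ W.T + n * np.eye(n)              # SPD test matrix
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

_, P = np.linalg.eigh(A)                 # columns p_1, ..., p_n are A-conjugate
x0 = rng.standard_normal(n)
e0 = x_star - x0                         # known here only for the demonstration

x = x0.copy()
for i in range(n):
    p = P[:, i]
    sigma_i = (p @ (A @ e0)) / (p @ (A @ p))   # sigma_i = p_i'Ae_0 / p_i'Ap_i
    x = x + sigma_i * p                        # x* = x_0 + sum_i sigma_i p_i
assert np.allclose(x, x_star)
```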

11. Conjugate direction method. Conjugate vectors. The conjugate direction method evaluates the coefficients $\sigma_1,\sigma_2,\ldots,\sigma_n\in\mathbb{R}$ recursively in $n$ steps, solving for $k\ge 0$ the minimization problem: Conjugate direction method.
Given $x_0$; $k\leftarrow 0$;
repeat
  $k\leftarrow k+1$;
  find $x_k\in x_0 + V_k$ such that $x_k = \arg\min_{x\in x_0+V_k}\|x^\star - x\|_A$
until $k = n$
where $V_k$ is the subspace of $\mathbb{R}^n$ generated by the first $k$ conjugate directions, i.e. $V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\}$. Conjugate Direction minimization 30 / 106

12. Conjugate direction method. First step. Step $x_0 \to x_1$. At the first step we consider the subspace $x_0 + \mathrm{span}\{p_1\}$, which consists of vectors of the form $x(\alpha) = x_0 + \alpha p_1$, $\alpha\in\mathbb{R}$. The minimization problem becomes: Minimization step $x_0\to x_1$. Find $x_1 = x_0 + \alpha_1 p_1$ (i.e., find $\alpha_1$!) such that
$$\|x^\star - x_1\|_A = \min_{\alpha\in\mathbb{R}} \|x^\star - (x_0 + \alpha p_1)\|_A.$$
Conjugate Direction minimization 31 / 106

13. Conjugate direction method. First step. Solving the first step, method 1. The minimization problem is the minimum with respect to $\alpha$ of the quadratic:
$$\Phi(\alpha) = \|x^\star - (x_0+\alpha p_1)\|_A^2 = (x^\star - (x_0+\alpha p_1))^T A (x^\star - (x_0+\alpha p_1)) = (e_0 - \alpha p_1)^T A (e_0 - \alpha p_1) = e_0^T A e_0 - 2\alpha\, p_1^T A e_0 + \alpha^2 p_1^T A p_1.$$
The minimum is found by imposing
$$\frac{d\Phi(\alpha)}{d\alpha} = -2\,p_1^T A e_0 + 2\alpha\, p_1^T A p_1 = 0 \;\Rightarrow\; \alpha_1 = \frac{p_1^T A e_0}{p_1^T A p_1}.$$
Conjugate Direction minimization 32 / 106

14. Conjugate direction method. First step. Solving the first step, method 2 (1/2). Remember the error expansion $x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$. Let $x(\alpha) = x_0 + \alpha p_1$; the difference $x^\star - x(\alpha)$ becomes
$$x^\star - x(\alpha) = (\sigma_1 - \alpha)p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n.$$
Due to conjugacy the error $\|x^\star - x(\alpha)\|_A$ becomes
$$\|x^\star - x(\alpha)\|_A^2 = \Big((\sigma_1-\alpha)p_1 + \sum_{i=2}^n \sigma_i p_i\Big)^T A \Big((\sigma_1-\alpha)p_1 + \sum_{j=2}^n \sigma_j p_j\Big) = (\sigma_1-\alpha)^2\, p_1^T A p_1 + \sum_{j=2}^n \sigma_j^2\, p_j^T A p_j.$$
Conjugate Direction minimization 33 / 106

15. Conjugate direction method. First step. Solving the first step, method 2 (2/2). Because
$$\|x^\star - x(\alpha)\|_A^2 = (\sigma_1-\alpha)^2\|p_1\|_A^2 + \sum_{i=2}^n \sigma_i^2\|p_i\|_A^2,$$
we have that
$$\|x^\star - x(\alpha_1)\|_A^2 = \sum_{i=2}^n \sigma_i^2\|p_i\|_A^2 \le \|x^\star - x(\alpha)\|_A^2 \quad \text{for all } \alpha \ne \sigma_1,$$
so the minimum is found by imposing $\alpha_1 = \sigma_1$:
$$\alpha_1 = \frac{p_1^T A e_0}{p_1^T A p_1}.$$
This argument can be generalized to all $k > 1$ (see the next slides). Conjugate Direction minimization 34 / 106

16. Conjugate direction method. k-th step. Step $x_{k-1}\to x_k$. For the step from $k-1$ to $k$ we consider the subspace of $\mathbb{R}^n$, $V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\}$, which contains vectors of the form
$$x(\alpha^{(1)},\alpha^{(2)},\ldots,\alpha^{(k)}) = x_0 + \alpha^{(1)}p_1 + \alpha^{(2)}p_2 + \cdots + \alpha^{(k)}p_k.$$
The minimization problem becomes: Minimization step $x_{k-1}\to x_k$. Find $x_k = x_0 + \alpha_1 p_1 + \alpha_2 p_2 + \cdots + \alpha_k p_k$ (i.e. $\alpha_1,\alpha_2,\ldots,\alpha_k$) such that
$$\|x^\star - x_k\|_A = \min_{\alpha^{(1)},\alpha^{(2)},\ldots,\alpha^{(k)}\in\mathbb{R}} \|x^\star - x(\alpha^{(1)},\alpha^{(2)},\ldots,\alpha^{(k)})\|_A.$$
Conjugate Direction minimization 35 / 106

17. Conjugate direction method. k-th step. Solving the k-th step $x_{k-1}\to x_k$ (1/2). Remember the error expansion $x^\star - x_0 = \sigma_1 p_1 + \sigma_2 p_2 + \cdots + \sigma_n p_n$. Consider a vector of the form $x(\alpha^{(1)},\ldots,\alpha^{(k)}) = x_0 + \alpha^{(1)}p_1 + \cdots + \alpha^{(k)}p_k$; the error $x^\star - x(\alpha^{(1)},\ldots,\alpha^{(k)})$ can be written as
$$x^\star - x(\alpha^{(1)},\ldots,\alpha^{(k)}) = x^\star - x_0 - \sum_{i=1}^k \alpha^{(i)}p_i = \sum_{i=1}^k\big(\sigma_i - \alpha^{(i)}\big)p_i + \sum_{i=k+1}^n \sigma_i p_i.$$
Conjugate Direction minimization 36 / 106

18. Conjugate direction method. k-th step. Solving the k-th step $x_{k-1}\to x_k$ (2/2). Using the conjugacy of the $p_i$ we obtain the norm of the error:
$$\|x^\star - x(\alpha^{(1)},\ldots,\alpha^{(k)})\|_A^2 = \sum_{i=1}^k\big(\sigma_i - \alpha^{(i)}\big)^2\|p_i\|_A^2 + \sum_{i=k+1}^n \sigma_i^2\|p_i\|_A^2.$$
So the minimum is found by imposing $\alpha_i = \sigma_i$ for $i = 1,2,\ldots,k$:
$$\alpha_i = \frac{p_i^T A e_0}{p_i^T A p_i}, \qquad i = 1,2,\ldots,k.$$
Conjugate Direction minimization 37 / 106

19. Conjugate direction method. Successive one dimensional minimization (1/3). Notice that $\alpha_i = \sigma_i$ and that
$$x_k = x_0 + \alpha_1 p_1 + \cdots + \alpha_k p_k = x_{k-1} + \alpha_k p_k,$$
so that $x_{k-1}$ already contains $k-1$ of the coefficients $\alpha_i$ of the minimization. If we consider the one dimensional minimization on the subspace $x_{k-1} + \mathrm{span}\{p_k\}$ we find again $x_k$! Conjugate Direction minimization 38 / 106

20. Conjugate direction method. Successive one dimensional minimization (2/3). Consider a vector of the form $x(\alpha) = x_{k-1} + \alpha p_k$ and remember that $x_{k-1} = x_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1}p_{k-1}$, so that the error $x^\star - x(\alpha)$ can be written as
$$x^\star - x(\alpha) = x^\star - x_0 - \sum_{i=1}^{k-1}\alpha_i p_i - \alpha p_k = \sum_{i=1}^{k-1}\big(\sigma_i - \alpha_i\big)p_i + \big(\sigma_k - \alpha\big)p_k + \sum_{i=k+1}^n \sigma_i p_i.$$
Due to the equality $\sigma_i = \alpha_i$ the first sum (the blue part of the expression on the slide) is $0$. Conjugate Direction minimization 39 / 106

21. Conjugate direction method. Successive one dimensional minimization (3/3). Using the conjugacy of the $p_i$ we obtain the norm of the error:
$$\|x^\star - x(\alpha)\|_A^2 = \big(\sigma_k - \alpha\big)^2\|p_k\|_A^2 + \sum_{i=k+1}^n \sigma_i^2\|p_i\|_A^2.$$
So the minimum is found by imposing $\alpha = \sigma_k$:
$$\alpha_k = \frac{p_k^T A e_0}{p_k^T A p_k}.$$
Remark. This observation permits performing the minimization on the $k$-dimensional space $x_0 + V_k$ as successive one dimensional minimizations along the conjugate directions $p_k$! Conjugate Direction minimization 40 / 106

22. Conjugate direction method. Successive one dimensional minimization. Problem (one dimensional successive minimization). Find $x_k = x_{k-1} + \alpha_k p_k$ such that
$$\|x^\star - x_k\|_A = \min_{\alpha\in\mathbb{R}} \|x^\star - (x_{k-1} + \alpha p_k)\|_A.$$
The solution is the minimum with respect to $\alpha$ of the quadratic:
$$\Phi(\alpha) = (x^\star - (x_{k-1}+\alpha p_k))^T A (x^\star - (x_{k-1}+\alpha p_k)) = (e_{k-1} - \alpha p_k)^T A (e_{k-1} - \alpha p_k) = e_{k-1}^T A e_{k-1} - 2\alpha\, p_k^T A e_{k-1} + \alpha^2 p_k^T A p_k.$$
The minimum is found by imposing
$$\frac{d\Phi(\alpha)}{d\alpha} = -2\,p_k^T A e_{k-1} + 2\alpha\, p_k^T A p_k = 0 \;\Rightarrow\; \alpha_k = \frac{p_k^T A e_{k-1}}{p_k^T A p_k}.$$
Conjugate Direction minimization 41 / 106

23. Conjugate direction method. Successive one dimensional minimization. In the case of minimization on the subspace $x_0 + V_k$ we have $\alpha_k = p_k^T A e_0 / p_k^T A p_k$. In the case of one dimensional minimization on the subspace $x_{k-1} + \mathrm{span}\{p_k\}$ we have $\alpha_k = p_k^T A e_{k-1} / p_k^T A p_k$. Apparently these are different results; however, by using the conjugacy of the vectors $p_i$ we have
$$p_k^T A e_{k-1} = p_k^T A (x^\star - x_{k-1}) = p_k^T A\big(x^\star - (x_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1}p_{k-1})\big) = p_k^T A e_0 - \alpha_1\, p_k^T A p_1 - \cdots - \alpha_{k-1}\, p_k^T A p_{k-1} = p_k^T A e_0.$$
Conjugate Direction minimization 42 / 106

24. Conjugate direction method. Successive one dimensional minimization. The one step minimization in the space $x_0 + V_n$ and the successive minimizations in the spaces $x_{k-1} + \mathrm{span}\{p_k\}$, $k = 1,2,\ldots,n$, are equivalent if the $p_i$ are conjugate. The successive minimization is useful when the $p_i$ are not known in advance but must be computed as the minimization process proceeds. The evaluation of $\alpha_k$ is apparently not computable because $e_{k-1}$ is not known. However, noticing that
$$A e_k = A(x^\star - x_k) = b - A x_k = r_k,$$
we can write
$$\alpha_k = p_k^T A e_{k-1} / p_k^T A p_k = p_k^T r_{k-1} / p_k^T A p_k.$$
Finally, the residual satisfies the recurrence
$$r_k = b - A x_k = b - A(x_{k-1} + \alpha_k p_k) = r_{k-1} - \alpha_k A p_k.$$
Conjugate Direction minimization 43 / 106

25. Conjugate direction method. Conjugate direction minimization. Algorithm (Conjugate direction minimization).
$k\leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$;
while not converged do
  $k \leftarrow k+1$;
  $\alpha_k \leftarrow p_k^T r_{k-1} / p_k^T A p_k$;
  $x_k \leftarrow x_{k-1} + \alpha_k p_k$;
  $r_k \leftarrow r_{k-1} - \alpha_k A p_k$;
end while
Observation (Computational cost). The conjugate direction minimization requires at each step one matrix–vector product for the evaluation of $\alpha_k$ and two AXPY updates, for $x_k$ and $r_k$. Conjugate Direction minimization 44 / 106
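A minimal sketch of the algorithm above, assuming the conjugate directions are known in advance (here, again, eigenvectors of the test matrix; this setup is my own choice, not from the slides). It exercises the formula $\alpha_k = p_k^T r_{k-1}/p_k^T A p_k$ and the residual recurrence, and checks that the error vanishes after $n$ steps.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
W = rng.standard_normal((n, n))
A = W @ W.T + n * np.eye(n)       # SPD test matrix
b = rng.standard_normal(n)
_, P = np.linalg.eigh(A)          # columns: A-conjugate directions p_1, ..., p_n

x = np.zeros(n)
r = b - A @ x                     # r_0
for k in range(n):
    p = P[:, k]
    Ap = A @ p
    alpha = (p @ r) / (p @ Ap)    # alpha_k = p_k' r_{k-1} / p_k' A p_k
    x = x + alpha * p             # x_k = x_{k-1} + alpha_k p_k
    r = r - alpha * Ap            # r_k = r_{k-1} - alpha_k A p_k
assert np.allclose(A @ x, b)      # e_n = 0: exact solution after n steps
```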

26. Conjugate direction method. Conjugate direction minimization. Remark (Monotonic behavior of the error). The energy norm of the error $\|e_k\|_A$ is monotonically decreasing in $k$. In fact
$$e_k = x^\star - x_k = \alpha_{k+1}p_{k+1} + \cdots + \alpha_n p_n,$$
and by conjugacy
$$\|e_k\|_A^2 = \|x^\star - x_k\|_A^2 = \sigma_{k+1}^2\|p_{k+1}\|_A^2 + \cdots + \sigma_n^2\|p_n\|_A^2.$$
Finally, from this relation we have $e_n = 0$. Conjugate Direction minimization 45 / 106

27. Conjugate Gradient method. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension. Conjugate Direction minimization 46 / 106

28. Conjugate Gradient method. The Conjugate Gradient method combines the Conjugate Direction method with an orthogonalization process (like Gram–Schmidt) applied to the residuals to construct the conjugate directions. In fact, because $A$ defines a scalar product, in the next slides we prove: each residual is orthogonal to the previous conjugate directions, and consequently linearly independent from the previous conjugate directions; if the residual is not null, it can be used to construct a new conjugate direction. Conjugate Direction minimization 47 / 106

29. Conjugate Gradient method. Orthogonality of the residual $r_k$ with respect to $V_k$. The residual $r_k$ is orthogonal to $p_1, p_2, \ldots, p_k$. In fact, from the error expansion
$$e_k = \alpha_{k+1}p_{k+1} + \alpha_{k+2}p_{k+2} + \cdots + \alpha_n p_n,$$
and because $r_k = A e_k$, for $i = 1,2,\ldots,k$ we have
$$p_i^T r_k = p_i^T A e_k = p_i^T A\sum_{j=k+1}^n \alpha_j p_j = \sum_{j=k+1}^n \alpha_j\, p_i^T A p_j = 0.$$
Conjugate Direction minimization 48 / 106

30. Conjugate Gradient method. Building a new conjugate direction (1/2). The conjugate direction method builds one new direction at each step. If $r_k \ne 0$ it can be used to build the new direction $p_{k+1}$ by a Gram–Schmidt orthogonalization process:
$$p_{k+1} = r_k + \beta_1^{(k+1)}p_1 + \beta_2^{(k+1)}p_2 + \cdots + \beta_k^{(k+1)}p_k,$$
where the $k$ coefficients $\beta_1^{(k+1)}, \beta_2^{(k+1)}, \ldots, \beta_k^{(k+1)}$ must satisfy
$$p_i^T A p_{k+1} = 0, \qquad i = 1,2,\ldots,k.$$
Conjugate Direction minimization 49 / 106

31. Conjugate Gradient method. Building a new conjugate direction (2/2). (Repeating from the previous slide.)
$$p_{k+1} = r_k + \beta_1^{(k+1)}p_1 + \beta_2^{(k+1)}p_2 + \cdots + \beta_k^{(k+1)}p_k;$$
expanding the expression:
$$0 = p_i^T A p_{k+1} = p_i^T A\big(r_k + \beta_1^{(k+1)}p_1 + \beta_2^{(k+1)}p_2 + \cdots + \beta_k^{(k+1)}p_k\big) = p_i^T A r_k + \beta_i^{(k+1)}\,p_i^T A p_i$$
$$\Rightarrow\quad \beta_i^{(k+1)} = -\frac{p_i^T A r_k}{p_i^T A p_i}, \qquad i = 1,2,\ldots,k.$$
Conjugate Direction minimization 50 / 106
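The two slides above translate into a full Gram–Schmidt conjugation routine; the sketch below is my own illustration (note the O(k) work and storage per new direction, which is exactly what the three-point formula of slide 54 later removes).

```python
import numpy as np

def conjugate_from_residual(r_k, P, A):
    """Build p_{k+1} = r_k + sum_i beta_i p_i with beta_i = -p_i'A r_k / p_i'A p_i.

    The columns of P are assumed to be mutually A-conjugate already."""
    p_new = r_k.copy()
    for i in range(P.shape[1]):
        p_i = P[:, i]
        beta_i = -(p_i @ (A @ r_k)) / (p_i @ (A @ p_i))
        p_new = p_new + beta_i * p_i
    return p_new

# usage check: grow a set of mutually A-conjugate directions
rng = np.random.default_rng(3)
n = 5
W = rng.standard_normal((n, n))
A = W @ W.T + n * np.eye(n)                       # SPD test matrix
P = rng.standard_normal((n, 1))                   # p_1: any non-zero vector
for _ in range(n - 1):
    P = np.column_stack([P, conjugate_from_residual(rng.standard_normal(n), P, A)])
C = P.T @ A @ P
assert np.allclose(C - np.diag(np.diag(C)), 0.0, atol=1e-8)   # off-diagonal entries vanish
```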

32. Conjugate Gradient method. The choice of the residual $r_k \ne 0$ for the construction of the new conjugate direction $p_{k+1}$ has three important consequences: (1) simplification of the expression for $\alpha_k$; (2) orthogonality of the residual $r_k$ to the previous residuals $r_0, r_1, \ldots, r_{k-1}$; (3) a three point formula and simplification of the coefficients $\beta_i^{(k+1)}$. These facts will be examined in the next slides. Conjugate Direction minimization 51 / 106

33. Conjugate Gradient method. Simplification of the expression for $\alpha_k$. Writing the expression for $p_k$ from the orthogonalization process,
$$p_k = r_{k-1} + \beta_1^{(k)}p_1 + \beta_2^{(k)}p_2 + \cdots + \beta_{k-1}^{(k)}p_{k-1},$$
and using the orthogonality of $r_{k-1}$ with the vectors $p_1, p_2, \ldots, p_{k-1}$ (see slide N.48), we have
$$r_{k-1}^T p_k = r_{k-1}^T\big(r_{k-1} + \beta_1^{(k)}p_1 + \beta_2^{(k)}p_2 + \cdots + \beta_{k-1}^{(k)}p_{k-1}\big) = r_{k-1}^T r_{k-1}.$$
Recalling the definition of $\alpha_k$ it follows:
$$\alpha_k = \frac{e_{k-1}^T A p_k}{p_k^T A p_k} = \frac{r_{k-1}^T p_k}{p_k^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}.$$
Conjugate Direction minimization 52 / 106

34. Conjugate Gradient method. Orthogonality of the residual $r_k$ to $r_0, r_1, \ldots, r_{k-1}$. From the definition of $p_{i+1}$ it follows
$$p_{i+1} = r_i + \beta_1^{(i+1)}p_1 + \beta_2^{(i+1)}p_2 + \cdots + \beta_i^{(i+1)}p_i \;\Rightarrow\; r_i \in \mathrm{span}\{p_1, p_2, \ldots, p_i, p_{i+1}\} = V_{i+1}.$$
Then, using the orthogonality of $r_k$ with the vectors $p_1, p_2, \ldots, p_k$ (see slide N.48), for $i < k$ we have
$$r_k^T r_i = r_k^T\Big(p_{i+1} - \sum_{j=1}^i \beta_j^{(i+1)}p_j\Big) = r_k^T p_{i+1} - \sum_{j=1}^i \beta_j^{(i+1)}\, r_k^T p_j = 0.$$
Conjugate Direction minimization 53 / 106

35. Conjugate Gradient method. Three point formula and simplification of $\beta_i^{(k+1)}$. From the relation $r_k^T r_i = r_k^T(r_{i-1} - \alpha_i A p_i)$ we deduce
$$r_k^T A p_i = \frac{r_k^T r_{i-1} - r_k^T r_i}{\alpha_i} = \begin{cases} -\,r_k^T r_k/\alpha_k & \text{if } i = k;\\ 0 & \text{if } i < k;\end{cases}$$
remembering that $\alpha_k = r_{k-1}^T r_{k-1}/p_k^T A p_k$ we obtain
$$\beta_i^{(k+1)} = -\frac{p_i^T A r_k}{p_i^T A p_i} = \begin{cases} \dfrac{r_k^T r_k}{r_{k-1}^T r_{k-1}} & i = k;\\[4pt] 0 & i < k;\end{cases}$$
i.e. there is only one non-zero coefficient, $\beta_k^{(k+1)}$, so we write $\beta_k = \beta_k^{(k+1)}$ and obtain the three point formula:
$$p_{k+1} = r_k + \beta_k p_k.$$
Conjugate Direction minimization 54 / 106

36. Conjugate Gradient method. Conjugate gradient algorithm.
Initial step: $k\leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$; $p_1 \leftarrow r_0$;
while $\|r_k\| > \epsilon$ do
  $k \leftarrow k+1$;
  Conjugate direction method:
    $\alpha_k \leftarrow r_{k-1}^T r_{k-1} / p_k^T A p_k$;
    $x_k \leftarrow x_{k-1} + \alpha_k p_k$;
    $r_k \leftarrow r_{k-1} - \alpha_k A p_k$;
  Residual orthogonalization:
    $\beta_k \leftarrow r_k^T r_k / r_{k-1}^T r_{k-1}$;
    $p_{k+1} \leftarrow r_k + \beta_k p_k$;
end while
Conjugate Direction minimization 55 / 106
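A compact NumPy transcription of the algorithm above; treat it as a sketch rather than production code (dense matrix–vector products, no breakdown checks, arbitrary tolerance and test problem).

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-10, itmax=None):
    """Conjugate Gradient for an SPD matrix A, following the slide's notation."""
    n = b.size
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                    # r_0
    p = r.copy()                     # p_1 <- r_0
    itmax = n if itmax is None else itmax
    for _ in range(itmax):
        if np.linalg.norm(r) <= eps:
            break
        Ap = A @ p
        rr_old = r @ r
        alpha = rr_old / (p @ Ap)    # alpha_k = r'r / p'Ap
        x = x + alpha * p
        r = r - alpha * Ap
        beta = (r @ r) / rr_old      # beta_k = r_k'r_k / r_{k-1}'r_{k-1}
        p = r + beta * p             # three-point formula (slide 54)
    return x

# quick check on a small SPD system
rng = np.random.default_rng(4)
W = rng.standard_normal((8, 8))
A = W @ W.T + 8.0 * np.eye(8)
b = rng.standard_normal(8)
assert np.allclose(A @ conjugate_gradient(A, b), b)
```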

37. Conjugate Gradient convergence rate. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension. Conjugate Direction minimization 56 / 106

38. Conjugate Gradient convergence rate. Polynomial residual expansions (1/5). From the Conjugate Gradient iterative scheme on slide 55 we have: Lemma. There exist $k$-degree polynomials $P_k(x)$ and $Q_k(x)$ such that
$$r_k = P_k(A)\,r_0, \quad k = 0,1,\ldots,n, \qquad p_k = Q_{k-1}(A)\,r_0, \quad k = 1,2,\ldots,n.$$
Moreover $P_k(0) = 1$ for all $k$. Proof (1/2). The proof is by induction. Base $k = 0$: $p_1 = r_0$, so that $P_0(x) = 1$ and $Q_0(x) = 1$. Conjugate Direction minimization 57 / 106

39. Conjugate Gradient convergence rate. Polynomial residual expansions (2/5). Proof (2/2). Let the expansion be valid for $k-1$. Consider the recursion for the residual:
$$r_k = r_{k-1} - \alpha_k A p_k = P_{k-1}(A)r_0 - \alpha_k A\,Q_{k-1}(A)r_0 = \big(P_{k-1}(A) - \alpha_k A\,Q_{k-1}(A)\big)r_0,$$
then $P_k(x) = P_{k-1}(x) - \alpha_k x\,Q_{k-1}(x)$ and $P_k(0) = P_{k-1}(0) = 1$. Consider the recursion for the conjugate direction:
$$p_{k+1} = r_k + \beta_k p_k = P_k(A)r_0 + \beta_k Q_{k-1}(A)r_0 = \big(P_k(A) + \beta_k Q_{k-1}(A)\big)r_0,$$
then $Q_k(x) = P_k(x) + \beta_k Q_{k-1}(x)$. Conjugate Direction minimization 58 / 106

40. Conjugate Gradient convergence rate. Polynomial residual expansions (3/5). We have the following trivial equalities:
$$V_k = \mathrm{span}\{p_1, p_2, \ldots, p_k\} = \mathrm{span}\{r_0, r_1, \ldots, r_{k-1}\} = \{q(A)r_0 \mid q\in\mathcal{P}_{k-1}\} = \{p(A)e_0 \mid p\in\mathcal{P}_k,\ p(0) = 0\}.$$
In this way the optimality of the CG step can be written as
$$\|x^\star - x_k\|_A \le \|x^\star - x\|_A, \quad \forall x\in x_0 + V_k;$$
$$\|x^\star - x_k\|_A \le \|x^\star - (x_0 + p(A)e_0)\|_A, \quad \forall p\in\mathcal{P}_k,\ p(0) = 0;$$
$$\|x^\star - x_k\|_A \le \|P(A)e_0\|_A, \quad \forall P\in\mathcal{P}_k,\ P(0) = 1.$$
Conjugate Direction minimization 59 / 106

41. Conjugate Gradient convergence rate. Polynomial residual expansions (4/5). Recalling that $A^{-1}r_k = A^{-1}(b - Ax_k) = x^\star - x_k = e_k$, we can write
$$e_k = x^\star - x_k = A^{-1}r_k = A^{-1}P_k(A)r_0 = P_k(A)A^{-1}r_0 = P_k(A)(x^\star - x_0) = P_k(A)e_0.$$
Due to the optimality of the conjugate gradient we then have the estimate of the next slide. Conjugate Direction minimization 60 / 106

42. Conjugate Gradient convergence rate. Polynomial residual expansions (5/5). Using the results of slides 59 and 60 we can write
$$e_k = P_k(A)e_0, \qquad \|e_k\|_A = \|P_k(A)e_0\|_A \le \|P(A)e_0\|_A, \quad \forall P\in\mathcal{P}_k,\ P(0) = 1,$$
and from this we have the estimate
$$\|e_k\|_A \le \inf_{P\in\mathcal{P}_k,\,P(0)=1} \|P(A)e_0\|_A.$$
So an estimate of the form
$$\inf_{P\in\mathcal{P}_k,\,P(0)=1} \|P(A)e_0\|_A \le C_k\,\|e_0\|_A$$
can be used to prove a convergence rate theorem, as for the steepest descent algorithm. Conjugate Direction minimization 61 / 106

43. Conjugate Gradient convergence rate. Convergence rate calculation. Lemma. Let $A\in\mathbb{R}^{n\times n}$ be an SPD matrix and $p\in\mathcal{P}_k$ a polynomial; then
$$\|p(A)x\|_A \le \|p(A)\|_2\,\|x\|_A.$$
Proof (1/2). The matrix $A$ is SPD, so we can write $A = U^T\Lambda U$, $\Lambda = \mathrm{diag}\{\lambda_1,\lambda_2,\ldots,\lambda_n\}$, where $U$ is an orthogonal matrix (i.e. $U^T U = I$) and $\Lambda \ge 0$ is diagonal. We can define the SPD matrix $A^{1/2}$ as
$$A^{1/2} = U^T\Lambda^{1/2}U, \qquad \Lambda^{1/2} = \mathrm{diag}\{\lambda_1^{1/2},\lambda_2^{1/2},\ldots,\lambda_n^{1/2}\},$$
and obviously $A^{1/2}A^{1/2} = A$. Conjugate Direction minimization 62 / 106

44. Conjugate Gradient convergence rate. Convergence rate calculation. Proof (2/2). Notice that
$$\|x\|_A^2 = x^T A x = x^T A^{1/2}A^{1/2}x = \|A^{1/2}x\|_2^2,$$
so that
$$\|p(A)x\|_A = \|A^{1/2}p(A)x\|_2 = \|p(A)A^{1/2}x\|_2 \le \|p(A)\|_2\,\|A^{1/2}x\|_2 = \|p(A)\|_2\,\|x\|_A.$$
Conjugate Direction minimization 63 / 106

45. Conjugate Gradient convergence rate. Convergence rate calculation. Lemma. Let $A\in\mathbb{R}^{n\times n}$ be an SPD matrix and $p\in\mathcal{P}_k$ a polynomial; then
$$\|p(A)\|_2 = \max_{\lambda\in\sigma(A)} |p(\lambda)|.$$
Proof. The matrix $p(A)$ is symmetric, and for a generic symmetric matrix $B$ we have $\|B\|_2 = \max_{\lambda\in\sigma(B)}|\lambda|$; observing that if $\lambda$ is an eigenvalue of $A$ then $p(\lambda)$ is an eigenvalue of $p(A)$, the thesis easily follows. Conjugate Direction minimization 64 / 106

46. Conjugate Gradient convergence rate. Convergence rate calculation. Starting from the error estimate
$$\|e_k\|_A \le \inf_{P\in\mathcal{P}_k,\,P(0)=1} \|P(A)e_0\|_A,$$
and combining the last two lemmas, we easily obtain the estimate
$$\|e_k\|_A \le \Big(\inf_{P\in\mathcal{P}_k,\,P(0)=1}\ \max_{\lambda\in\sigma(A)} |P(\lambda)|\Big)\,\|e_0\|_A.$$
The convergence rate is estimated by bounding the constant $\inf_{P\in\mathcal{P}_k,\,P(0)=1}\max_{\lambda\in\sigma(A)}|P(\lambda)|$. Conjugate Direction minimization 65 / 106

47. Conjugate Gradient convergence rate. Finite termination of Conjugate Gradient. Theorem (Finite termination of Conjugate Gradient). Let $A\in\mathbb{R}^{n\times n}$ be an SPD matrix; then the Conjugate Gradient method applied to the linear system $Ax = b$ terminates, finding the exact solution, in at most $n$ steps. Proof. From the estimate
$$\|e_k\|_A \le \Big(\inf_{P\in\mathcal{P}_k,\,P(0)=1}\ \max_{\lambda\in\sigma(A)} |P(\lambda)|\Big)\,\|e_0\|_A,$$
choosing
$$P(x) = \prod_{\lambda\in\sigma(A)} (x-\lambda)\Big/\prod_{\lambda\in\sigma(A)} (0-\lambda),$$
we have $\max_{\lambda\in\sigma(A)}|P(\lambda)| = 0$ and $\|e_n\|_A = 0$. Conjugate Direction minimization 66 / 106

48. Conjugate Gradient convergence rate. Convergence rate of Conjugate Gradient. (1) The constant $\inf_{P\in\mathcal{P}_k,\,P(0)=1}\max_{\lambda\in\sigma(A)}|P(\lambda)|$ is not easy to evaluate. (2) The following bound is useful:
$$\max_{\lambda\in\sigma(A)} |P(\lambda)| \le \max_{\lambda\in[\lambda_1,\lambda_n]} |P(\lambda)|.$$
(3) In particular the final estimate will be obtained from
$$\inf_{P\in\mathcal{P}_k,\,P(0)=1}\ \max_{\lambda\in\sigma(A)} |P(\lambda)| \le \max_{\lambda\in[\lambda_1,\lambda_n]} |\bar P_k(\lambda)|,$$
where $\bar P_k(x)$ is a suitable $k$-degree polynomial for which $\bar P_k(0) = 1$ and $\max_{\lambda\in[\lambda_1,\lambda_n]}|\bar P_k(\lambda)|$ is easy to evaluate. Conjugate Direction minimization 67 / 106

49. Conjugate Gradient convergence rate. Chebyshev Polynomials (1/4). (1) The Chebyshev polynomials of the first kind are the right polynomials for this estimate. They have the following definition on the interval $[-1,1]$:
$$T_k(x) = \cos(k\arccos(x)).$$
(2) Another equivalent definition, valid on $(-\infty,\infty)$, is the following:
$$T_k(x) = \frac12\Big[\big(x + \sqrt{x^2-1}\big)^k + \big(x - \sqrt{x^2-1}\big)^k\Big].$$
(3) In spite of these definitions, $T_k(x)$ is effectively a polynomial. Conjugate Direction minimization 68 / 106

50. Conjugate Gradient convergence rate. Chebyshev Polynomials (2/4). Some examples of Chebyshev polynomials: plots of $T_1$, $T_2$, $T_3$, $T_4$, $T_{12}$ and $T_{20}$ on the interval $[-1,1]$ (figure omitted). Conjugate Direction minimization 69 / 106

51. Conjugate Gradient convergence rate. Chebyshev Polynomials (3/4). (1) It is easy to show that $T_k(x)$ is a polynomial by using $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$ and $\cos(\alpha+\beta) + \cos(\alpha-\beta) = 2\cos\alpha\cos\beta$. Let $\theta = \arccos(x)$:
1. $T_0(x) = \cos(0\,\theta) = 1$;
2. $T_1(x) = \cos(1\,\theta) = x$;
3. $T_2(x) = \cos(2\theta) = \cos(\theta)^2 - \sin(\theta)^2 = 2\cos(\theta)^2 - 1 = 2x^2 - 1$;
4. $T_{k+1}(x) + T_{k-1}(x) = \cos((k+1)\theta) + \cos((k-1)\theta) = 2\cos(k\theta)\cos(\theta) = 2x\,T_k(x)$.
(2) In general we have the following recurrence: $T_0(x) = 1$; $T_1(x) = x$; $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$. Conjugate Direction minimization 70 / 106
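The recurrence is easy to check numerically against the cosine definition on $[-1,1]$; the sketch below is only an illustration (degrees and sample points are arbitrary).

```python
import numpy as np

def chebyshev_T(k, x):
    """Evaluate T_k(x) with the three-term recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    T_prev = np.ones_like(x)             # T_0
    T_curr = np.asarray(x, dtype=float)  # T_1
    if k == 0:
        return T_prev
    for _ in range(k - 1):
        T_prev, T_curr = T_curr, 2.0 * x * T_curr - T_prev
    return T_curr

x = np.linspace(-1.0, 1.0, 201)
for k in range(8):
    assert np.allclose(chebyshev_T(k, x), np.cos(k * np.arccos(x)))
```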

52. Conjugate Gradient convergence rate. Chebyshev Polynomials (4/4). Solving the recurrence $T_0(x) = 1$, $T_1(x) = x$, $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$, we obtain the explicit form of the Chebyshev polynomials
$$T_k(x) = \frac12\Big[\big(x + \sqrt{x^2-1}\big)^k + \big(x - \sqrt{x^2-1}\big)^k\Big].$$
The translated and scaled polynomial
$$T_k(x; a, b) = T_k\Big(\frac{a + b - 2x}{b - a}\Big)$$
is useful in the study of the conjugate gradient method; we have $|T_k(x; a, b)| \le 1$ for all $x\in[a,b]$. Conjugate Direction minimization 71 / 106

53. Conjugate Gradient convergence rate. Convergence rate of Conjugate Gradient method. Theorem (Convergence rate of Conjugate Gradient method). Let $A\in\mathbb{R}^{n\times n}$ be an SPD matrix; then the Conjugate Gradient method converges to the solution $x^\star = A^{-1}b$ with at least linear r-rate in the norm $\|\cdot\|_A$. Moreover we have the error estimate
$$\|e_k\|_A \lesssim 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|e_0\|_A,$$
where $\kappa = M/m$ is the condition number, $m = \lambda_1$ is the smallest eigenvalue of $A$ and $M = \lambda_n$ is the biggest eigenvalue of $A$. The expression $a_k \lesssim b_k$ means that for all $\epsilon > 0$ there exists $k_0 > 0$ such that $a_k \le (1-\epsilon)\,b_k$ for all $k > k_0$. Conjugate Direction minimization 72 / 106

54. Conjugate Gradient convergence rate. Convergence rate of Conjugate Gradient method. Proof. From the estimate
$$\|e_k\|_A \le \max_{\lambda\in[m,M]} |P(\lambda)|\ \|e_0\|_A, \qquad P\in\mathcal{P}_k,\ P(0) = 1,$$
choosing $P(x) = T_k(x; m, M)/T_k(0; m, M)$ and using the fact that $|T_k(x; m, M)| \le 1$ for $x\in[m,M]$, we have
$$\|e_k\|_A \le T_k(0; m, M)^{-1}\,\|e_0\|_A = T_k\Big(\frac{M+m}{M-m}\Big)^{-1}\|e_0\|_A.$$
Observe that $\dfrac{M+m}{M-m} = \dfrac{\kappa+1}{\kappa-1}$ and
$$T_k\Big(\frac{\kappa+1}{\kappa-1}\Big)^{-1} = 2\left[\left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^k + \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k\right]^{-1};$$
finally notice that $\big((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\big)^k \to 0$ as $k\to\infty$. Conjugate Direction minimization 73 / 106
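A numerical sanity check of the estimate just proved (my own sketch, with arbitrary problem data): since $T_k(0;m,M)^{-1} \le 2\rho^k$ with $\rho = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, the proof actually gives the non-asymptotic bound $\|e_k\|_A \le 2\rho^k\|e_0\|_A$, and the CG iterates of slide 55 can be checked against it directly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
lam = np.linspace(1.0, 400.0, n)                  # m = 1, M = 400, kappa = 400
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
rho = (np.sqrt(400.0) - 1.0) / (np.sqrt(400.0) + 1.0)   # (sqrt(kappa)-1)/(sqrt(kappa)+1)

err_A = lambda x: np.sqrt((x - x_star) @ A @ (x - x_star))
x = np.zeros(n)
r = b - A @ x
p = r.copy()
e0 = err_A(x)
for k in range(1, 31):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + (r_new @ r_new) / (r @ r) * p
    r = r_new
    assert err_A(x) <= 2.0 * rho**k * e0 + 1e-12    # error estimate of the theorem
```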

55. Preconditioning the Conjugate Gradient method. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension. Conjugate Direction minimization 74 / 106

56. Preconditioning the Conjugate Gradient method. Preconditioning. Problem (Preconditioned linear system). Given $A, P\in\mathbb{R}^{n\times n}$, with $A$ an SPD matrix and $P$ a non-singular matrix, and $b\in\mathbb{R}^n$, find $x^\star\in\mathbb{R}^n$ such that
$$P^{-T}Ax^\star = P^{-T}b.$$
A good choice for $P$ should be such that $M = P^T P \approx A$, where $\approx$ denotes that $M$ is an approximation of $A$ in some sense to be made precise later. Notice that: $P$ non-singular implies $P^{-T}(b - Ax) = 0 \iff b - Ax = 0$; $A$ SPD implies that $\tilde{A} = P^{-T}AP^{-1}$ is also SPD (obvious proof). Conjugate Direction minimization 75 / 106

57. Preconditioning the Conjugate Gradient method. Preconditioning. Now we reformulate the preconditioned system. Problem (Preconditioned linear system). Given $A, P\in\mathbb{R}^{n\times n}$, with $A$ an SPD matrix and $P$ a non-singular matrix, and $b\in\mathbb{R}^n$, the preconditioned problem is the following: find $\tilde{x}^\star\in\mathbb{R}^n$ such that
$$\tilde{A}\tilde{x}^\star = \tilde{b}, \qquad \tilde{A} = P^{-T}AP^{-1}, \qquad \tilde{b} = P^{-T}b.$$
Notice that if $x^\star$ is the solution of the linear system $Ax = b$, then $\tilde{x}^\star = Px^\star$ is the solution of the linear system $\tilde{A}\tilde{x} = \tilde{b}$. Conjugate Direction minimization 76 / 106

58. Preconditioning the Conjugate Gradient method. Preconditioning. PCG: preliminary version.
Initial step: $k\leftarrow 0$; $x_0$ assigned; $\tilde{x}_0 \leftarrow P x_0$; $\tilde{r}_0 \leftarrow \tilde{b} - \tilde{A}\tilde{x}_0$; $\tilde{p}_1 \leftarrow \tilde{r}_0$;
while $\|\tilde{r}_k\| > \epsilon$ do
  $k \leftarrow k+1$;
  Conjugate direction method:
    $\tilde{\alpha}_k \leftarrow \tilde{r}_{k-1}^T\tilde{r}_{k-1} / \tilde{p}_k^T\tilde{A}\tilde{p}_k$;
    $\tilde{x}_k \leftarrow \tilde{x}_{k-1} + \tilde{\alpha}_k\tilde{p}_k$;
    $\tilde{r}_k \leftarrow \tilde{r}_{k-1} - \tilde{\alpha}_k\tilde{A}\tilde{p}_k$;
  Residual orthogonalization:
    $\tilde{\beta}_k \leftarrow \tilde{r}_k^T\tilde{r}_k / \tilde{r}_{k-1}^T\tilde{r}_{k-1}$;
    $\tilde{p}_{k+1} \leftarrow \tilde{r}_k + \tilde{\beta}_k\tilde{p}_k$;
end while
Final step: $x_k \leftarrow P^{-1}\tilde{x}_k$. Conjugate Direction minimization 77 / 106

59. Preconditioning the Conjugate Gradient method. CG reformulation. The conjugate gradient algorithm applied to $\tilde{A}\tilde{x} = \tilde{b}$ requires the evaluation of quantities like
$$\tilde{A}\tilde{p}_k = P^{-T}AP^{-1}\tilde{p}_k.$$
This can be done without evaluating the matrix $\tilde{A}$ directly, by the following operations: (1) solve $P s'_k = \tilde{p}_k$ for $s'_k = P^{-1}\tilde{p}_k$; (2) evaluate $s''_k = A s'_k$; (3) solve $P^T s'''_k = s''_k$ for $s'''_k = P^{-T}s''_k$. Steps 1 and 3 require the solution of two auxiliary linear systems. This is not a big problem if $P$ and $P^T$ are triangular matrices (see e.g. incomplete Cholesky). Conjugate Direction minimization 78 / 106
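A sketch of the three-step application of $\tilde{A}$ (my own illustration). For clarity it uses dense numpy.linalg.solve; in practice, e.g. with incomplete Cholesky, $P$ and $P^T$ are triangular and the two solves reduce to cheap forward/backward substitutions.

```python
import numpy as np

def apply_A_tilde(A, P, p_tilde):
    """Compute P^{-T} A P^{-1} p_tilde without forming A_tilde explicitly."""
    s1 = np.linalg.solve(P, p_tilde)   # step 1: solve P s'  = p_tilde
    s2 = A @ s1                        # step 2: s'' = A s'
    s3 = np.linalg.solve(P.T, s2)      # step 3: solve P^T s''' = s''
    return s3

# consistency check against the explicitly formed A_tilde
rng = np.random.default_rng(6)
n = 5
W = rng.standard_normal((n, n))
A = W @ W.T + n * np.eye(n)                                  # SPD
P = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)     # non-singular (triangular here)
v = rng.standard_normal(n)
A_tilde = np.linalg.inv(P).T @ A @ np.linalg.inv(P)
assert np.allclose(apply_A_tilde(A, P, v), A_tilde @ v)
```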

60. Preconditioning the Conjugate Gradient method. CG reformulation. However... we can reformulate the algorithm using only the matrices $A$ and $P$! Definition. For all $k\ge 1$ we introduce the vector $q_k = P^{-1}\tilde{p}_k$. Observation. If the vectors $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_k$, for all $1\le k\le n$, are $\tilde{A}$-conjugate, then the corresponding vectors $q_1, q_2, \ldots, q_k$ are $A$-conjugate. In fact:
$$q_j^T A q_i = \tilde{p}_j^T P^{-T}AP^{-1}\tilde{p}_i = \tilde{p}_j^T\tilde{A}\tilde{p}_i = 0 \quad \text{if } i\ne j,$$
which is a consequence of the $\tilde{A}$-conjugacy of the vectors $\tilde{p}_i$. Conjugate Direction minimization 79 / 106

61. Preconditioning the Conjugate Gradient method. CG reformulation. Definition. For all $k\ge 1$ we introduce the vectors $x_k = x_{k-1} + \tilde{\alpha}_k q_k$. Observation. If we assume, by construction, $\tilde{x}_0 = P x_0$, then we have $\tilde{x}_k = P x_k$ for all $k$ with $1\le k\le n$. In fact, if $\tilde{x}_{k-1} = P x_{k-1}$ (inductive hypothesis), then
$$\tilde{x}_k = \tilde{x}_{k-1} + \tilde{\alpha}_k\tilde{p}_k \ \text{[preconditioned CG]} = P x_{k-1} + \tilde{\alpha}_k P q_k \ \text{[inductive hyp., def. of } q_k\text{]} = P(x_{k-1} + \tilde{\alpha}_k q_k) = P x_k \ \text{[def. of } x_k\text{]}.$$
Conjugate Direction minimization 80 / 106

62. Preconditioning the Conjugate Gradient method. CG reformulation. Observation. Because $\tilde{x}_k = P x_k$ for all $k\ge 0$, we have the relation between the corresponding residuals $\tilde{r}_k = \tilde{b} - \tilde{A}\tilde{x}_k$ and $r_k = b - A x_k$:
$$\tilde{r}_k = P^{-T} r_k.$$
In fact,
$$\tilde{r}_k = \tilde{b} - \tilde{A}\tilde{x}_k = P^{-T}b - P^{-T}AP^{-1}Px_k = P^{-T}(b - Ax_k) = P^{-T}r_k.$$
Conjugate Direction minimization 81 / 106

63. Preconditioning the Conjugate Gradient method. CG reformulation. Definition. For all $k$ with $1\le k\le n$, the vector $z_k$ is the solution of the linear system $M z_k = r_k$, where $M = P^T P$; formally, $z_k = M^{-1}r_k = P^{-1}P^{-T}r_k$. Using the vectors $\{z_k\}$: we can express $\tilde{\alpha}_k$ and $\tilde{\beta}_k$ in terms of $A$, the residual $r_k$, and the conjugate direction $q_k$; we can build a recurrence relation for the $A$-conjugate directions $q_k$. Conjugate Direction minimization 82 / 106

64. Preconditioning the Conjugate Gradient method. CG reformulation. Observation.
$$\tilde{\alpha}_k = \frac{\tilde{r}_{k-1}^T\tilde{r}_{k-1}}{\tilde{p}_k^T\tilde{A}\tilde{p}_k} = \frac{r_{k-1}^T P^{-1}P^{-T}r_{k-1}}{q_k^T P^T P^{-T}AP^{-1}P q_k} = \frac{r_{k-1}^T M^{-1}r_{k-1}}{q_k^T A q_k} = \frac{r_{k-1}^T z_{k-1}}{q_k^T A q_k}.$$
Observation.
$$\tilde{\beta}_k = \frac{\tilde{r}_k^T\tilde{r}_k}{\tilde{r}_{k-1}^T\tilde{r}_{k-1}} = \frac{r_k^T P^{-1}P^{-T}r_k}{r_{k-1}^T P^{-1}P^{-T}r_{k-1}} = \frac{r_k^T M^{-1}r_k}{r_{k-1}^T M^{-1}r_{k-1}} = \frac{r_k^T z_k}{r_{k-1}^T z_{k-1}}.$$
Conjugate Direction minimization 83 / 106

65. Preconditioning the Conjugate Gradient method. CG reformulation. Observation. Using the vector $z_k = M^{-1}r_k$, the following recurrence holds:
$$q_{k+1} = z_k + \tilde{\beta}_k q_k.$$
In fact:
$$\tilde{p}_{k+1} = \tilde{r}_k + \tilde{\beta}_k\tilde{p}_k \quad\text{[preconditioned CG]}$$
$$P^{-1}\tilde{p}_{k+1} = P^{-1}\tilde{r}_k + \tilde{\beta}_k P^{-1}\tilde{p}_k \quad\text{[left multiplication by } P^{-1}\text{]}$$
$$P^{-1}\tilde{p}_{k+1} = P^{-1}P^{-T}r_k + \tilde{\beta}_k P^{-1}\tilde{p}_k \quad\text{[}\tilde{r}_k = P^{-T}r_k\text{]}$$
$$P^{-1}\tilde{p}_{k+1} = M^{-1}r_k + \tilde{\beta}_k P^{-1}\tilde{p}_k \quad\text{[}M^{-1} = P^{-1}P^{-T}\text{]}$$
$$q_{k+1} = z_k + \tilde{\beta}_k q_k \quad\text{[}q_k = P^{-1}\tilde{p}_k\text{]}$$
Conjugate Direction minimization 84 / 106

66. Preconditioning the Conjugate Gradient method. CG reformulation. PCG: final version.
Initial step: $k\leftarrow 0$; $x_0$ assigned; $r_0 \leftarrow b - A x_0$; $z_0 \leftarrow M^{-1}r_0$; $q_1 \leftarrow z_0$;
while $\|z_k\| > \epsilon$ do
  $k \leftarrow k+1$;
  Conjugate direction method:
    $\tilde{\alpha}_k \leftarrow r_{k-1}^T z_{k-1} / q_k^T A q_k$;
    $x_k \leftarrow x_{k-1} + \tilde{\alpha}_k q_k$;
    $r_k \leftarrow r_{k-1} - \tilde{\alpha}_k A q_k$;
  Preconditioning:
    $z_k \leftarrow M^{-1} r_k$;
  Residual orthogonalization:
    $\tilde{\beta}_k \leftarrow r_k^T z_k / r_{k-1}^T z_{k-1}$;
    $q_{k+1} \leftarrow z_k + \tilde{\beta}_k q_k$;
end while
Conjugate Direction minimization 85 / 106
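A sketch of the final PCG loop, using a simple diagonal (Jacobi) preconditioner M = diag(A) as a stand-in for the $M = P^T P$ of the slides; this preconditioner choice and the test problem are assumptions of mine, not part of the lecture.

```python
import numpy as np

def pcg(A, b, M_diag, x0=None, eps=1e-10, itmax=None):
    """Preconditioned CG; M is diagonal here, so z = M^{-1} r is a cheap division."""
    n = b.size
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = r / M_diag                     # z_0 = M^{-1} r_0
    q = z.copy()                       # q_1 <- z_0
    rz_old = r @ z
    itmax = n if itmax is None else itmax
    for _ in range(itmax):
        if np.linalg.norm(z) <= eps:
            break
        Aq = A @ q
        alpha = rz_old / (q @ Aq)      # alpha_k = r'z / q'Aq
        x = x + alpha * q
        r = r - alpha * Aq
        z = r / M_diag                 # preconditioning: M z_k = r_k
        rz_new = r @ z
        beta = rz_new / rz_old         # beta_k = r_k'z_k / r_{k-1}'z_{k-1}
        q = z + beta * q
        rz_old = rz_new
    return x

rng = np.random.default_rng(7)
n = 40
W = rng.standard_normal((n, n))
A = W @ W.T + n * np.eye(n)            # SPD test matrix
b = rng.standard_normal(n)
assert np.allclose(A @ pcg(A, b, np.diag(A)), b)
```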

67. Nonlinear Conjugate Gradient extension. Outline: 1 Introduction; 2 Convergence rate of Steepest Descent iterative scheme; 3 Conjugate direction method; 4 Conjugate Gradient method; 5 Conjugate Gradient convergence rate; 6 Preconditioning the Conjugate Gradient method; 7 Nonlinear Conjugate Gradient extension. Conjugate Direction minimization 86 / 106

68. Nonlinear Conjugate Gradient extension. (1) The conjugate gradient algorithm can be extended to nonlinear minimization. (2) Fletcher and Reeves extended CG to the minimization of a general nonlinear function $f(x)$ as follows: (a) substitute the evaluation of $\alpha_k$ by a line search; (b) substitute the residual $r_k$ with the gradient $\nabla f(x_k)$. (3) We also translate the index of the search direction $p_k$ to be more consistent with the gradients. The resulting algorithm is on the next slide. Conjugate Direction minimization 87 / 106

69. Nonlinear Conjugate Gradient extension. Fletcher and Reeves. Fletcher and Reeves Nonlinear Conjugate Gradient.
Initial step: $k\leftarrow 0$; $x_0$ assigned; $f_0 \leftarrow f(x_0)$; $g_0 \leftarrow \nabla f(x_0)^T$; $p_0 \leftarrow -g_0$;
while $\|g_k\| > \epsilon$ do
  $k \leftarrow k+1$;
  Conjugate direction method:
    compute $\alpha_k$ by line search;
    $x_k \leftarrow x_{k-1} + \alpha_k p_{k-1}$;
    $g_k \leftarrow \nabla f(x_k)^T$;
  Residual orthogonalization:
    $\beta_k^{FR} \leftarrow g_k^T g_k / g_{k-1}^T g_{k-1}$;
    $p_k \leftarrow -g_k + \beta_k^{FR} p_{k-1}$;
end while
Conjugate Direction minimization 88 / 106
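A sketch of the Fletcher–Reeves iteration on a smooth strictly convex test function (the function, starting point and tolerances are assumptions of mine, not from the lecture). The strong Wolfe line search is delegated to scipy.optimize.line_search with $c_2 < 1/2$, as required by the analysis on the next slides; there is no restart logic or safeguard beyond raising an error if the line search fails.

```python
import numpy as np
from scipy.optimize import line_search

# strictly convex test function: f(x) = 1/2 x'Bx + sum(cosh(x_i))
B = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ B @ x + np.sum(np.cosh(x))
grad = lambda x: B @ x + np.sinh(x)

x = np.array([2.0, -3.0])
g = grad(x)
p = -g                                           # p_0 = -g_0
for k in range(200):
    if np.linalg.norm(g) <= 1e-8:
        break
    # strong Wolfe line search with 0 < c1 < c2 < 1/2
    alpha = line_search(f, grad, x, p, gfk=g, c1=1e-4, c2=0.4)[0]
    if alpha is None:
        raise RuntimeError("strong Wolfe line search failed")
    x = x + alpha * p                            # x_k = x_{k-1} + alpha_k p_{k-1}
    g_new = grad(x)
    beta_fr = (g_new @ g_new) / (g @ g)          # beta_k^FR
    p = -g_new + beta_fr * p                     # p_k = -g_k + beta_k^FR p_{k-1}
    g = g_new
assert np.linalg.norm(grad(x)) <= 1e-8
```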

70. Nonlinear Conjugate Gradient extension. Fletcher and Reeves. (1) To ensure convergence and apply the Zoutendijk global convergence theorem we need to ensure that $p_k$ is a descent direction. (2) $p_0$ is a descent direction by construction; for $p_k$ we have
$$g_k^T p_k = -\|g_k\|^2 + \beta_k^{FR}\,g_k^T p_{k-1}.$$
If the line search is exact then $g_k^T p_{k-1} = 0$, because $p_{k-1}$ is the direction of the line search; so by induction $p_k$ is a descent direction. (3) An exact line search is expensive; however, if we use an inexact line search with the strong Wolfe conditions, (a) sufficient decrease: $f(x_k + \alpha_k p_k) \le f(x_k) + c_1\alpha_k\nabla f(x_k)p_k$; (b) curvature condition: $|\nabla f(x_k + \alpha_k p_k)p_k| \le c_2|\nabla f(x_k)p_k|$; with $0 < c_1 < c_2 < 1/2$, then we can prove that $p_k$ is a descent direction. Conjugate Direction minimization 89 / 106

71. Nonlinear Conjugate Gradient extension. Convergence analysis. The previous considerations permit saying that the Fletcher and Reeves nonlinear conjugate gradient method with strong Wolfe line search is globally convergent (1). To prove global convergence we need the following lemma: Lemma (descent direction bound). Suppose we apply the Fletcher and Reeves nonlinear conjugate gradient method to $f(x)$ with a strong Wolfe line search with $0 < c_2 < 1/2$. Then the method generates descent directions $p_k$ that satisfy the following inequality:
$$-\frac{1}{1-c_2} \le \frac{g_k^T p_k}{\|g_k\|^2} \le -\frac{1-2c_2}{1-c_2}, \qquad k = 0,1,2,\ldots$$
(Footnote 1: globally here means that a Zoutendijk-like theorem applies.) Conjugate Direction minimization 90 / 106

72. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (1/3). The proof is by induction. First notice that the function $t(\xi) = \dfrac{2\xi - 1}{1 - \xi}$ is monotonically increasing on the interval $[0, 1/2]$ and that $t(0) = -1$ and $t(1/2) = 0$. Hence, because $c_2\in(0,1/2)$, we have
$$-1 < \frac{2c_2 - 1}{1 - c_2} < 0. \qquad (\star)$$
Base of the induction, $k = 0$: for $k = 0$ we have $p_0 = -g_0$, so that $g_0^T p_0/\|g_0\|^2 = -1$. From $(\star)$ the lemma inequality is trivially satisfied. Conjugate Direction minimization 91 / 106

73. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (2/3). Using the update direction formula of the algorithm,
$$p_k = -g_k + \beta_k^{FR}p_{k-1}, \qquad \beta_k^{FR} = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}},$$
we can write
$$\frac{g_k^T p_k}{\|g_k\|^2} = -1 + \beta_k^{FR}\frac{g_k^T p_{k-1}}{\|g_k\|^2} = -1 + \frac{g_k^T p_{k-1}}{\|g_{k-1}\|^2},$$
and by using the second strong Wolfe condition:
$$-1 + c_2\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \le \frac{g_k^T p_k}{\|g_k\|^2} \le -1 - c_2\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2}.$$
Conjugate Direction minimization 92 / 106

74. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (3/3). By induction we have
$$0 < -\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \le \frac{1}{1-c_2},$$
so that
$$\frac{g_k^T p_k}{\|g_k\|^2} \le -1 - c_2\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \le -1 + \frac{c_2}{1-c_2} = \frac{2c_2-1}{1-c_2}$$
and
$$\frac{g_k^T p_k}{\|g_k\|^2} \ge -1 + c_2\frac{g_{k-1}^T p_{k-1}}{\|g_{k-1}\|^2} \ge -1 - \frac{c_2}{1-c_2} = -\frac{1}{1-c_2}.$$
Conjugate Direction minimization 93 / 106

75. Nonlinear Conjugate Gradient extension. Convergence analysis. (1) The inequality of the previous lemma can be written as
$$\frac{1}{1-c_2}\frac{\|g_k\|}{\|p_k\|} \ge -\frac{g_k^T p_k}{\|g_k\|\,\|p_k\|} \ge \frac{1-2c_2}{1-c_2}\frac{\|g_k\|}{\|p_k\|} > 0.$$
(2) Remembering the Zoutendijk theorem we have
$$\sum_{k=1}^{\infty} (\cos\theta_k)^2\,\|g_k\|^2 < \infty, \qquad \text{where } \cos\theta_k = -\frac{g_k^T p_k}{\|g_k\|\,\|p_k\|}.$$
(3) So if $\|g_k\|/\|p_k\|$ is bounded from below, we have $\cos\theta_k \ge \delta$ for all $k$, and then from the Zoutendijk theorem the scheme converges. (4) Unfortunately this bound cannot be proved, so the Zoutendijk theorem cannot be applied directly. However it is possible to prove a weaker result, i.e. that $\liminf_{k\to\infty}\|g_k\| = 0$! Conjugate Direction minimization 94 / 106

76. Nonlinear Conjugate Gradient extension. Convergence analysis. Convergence of the Fletcher and Reeves method. Assumption (Regularity assumption). We assume $f\in C^1(\mathbb{R}^n)$ with Lipschitz continuous gradient, i.e. there exists $\gamma > 0$ such that
$$\|\nabla f(x)^T - \nabla f(y)^T\| \le \gamma\,\|x - y\|, \qquad \forall x, y\in\mathbb{R}^n.$$
Conjugate Direction minimization 95 / 106

77. Nonlinear Conjugate Gradient extension. Convergence analysis. Theorem (Convergence of the Fletcher and Reeves method). Suppose the method of Fletcher and Reeves is implemented with a strong Wolfe line search with $0 < c_1 < c_2 < 1/2$. If $f(x)$ and $x_0$ satisfy the previous regularity assumptions, then
$$\liminf_{k\to\infty} \|g_k\| = 0.$$
Proof (1/4). From the previous lemma we have
$$\cos\theta_k \ge \frac{1-2c_2}{1-c_2}\,\frac{\|g_k\|}{\|p_k\|}, \qquad k = 1,2,\ldots;$$
substituting into the Zoutendijk condition we have $\sum_{k=1}^{\infty} \|g_k\|^4/\|p_k\|^2 < \infty$. The proof is by contradiction: in fact, if the theorem were not true, this series would diverge. Next we want to bound $\|p_k\|$. Conjugate Direction minimization 96 / 106

78. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (bounding $\|p_k\|$) (2/4). Using the second Wolfe condition and the previous lemma,
$$|g_k^T p_{k-1}| \le -c_2\,g_{k-1}^T p_{k-1} \le \frac{c_2}{1-c_2}\,\|g_{k-1}\|^2.$$
Using $p_k \leftarrow -g_k + \beta_k^{FR}p_{k-1}$ we have
$$\|p_k\|^2 \le \|g_k\|^2 + 2\beta_k^{FR}|g_k^T p_{k-1}| + (\beta_k^{FR})^2\|p_{k-1}\|^2 \le \|g_k\|^2 + \frac{2c_2}{1-c_2}\,\beta_k^{FR}\|g_{k-1}\|^2 + (\beta_k^{FR})^2\|p_{k-1}\|^2;$$
then recall that $\beta_k^{FR} = \|g_k\|^2/\|g_{k-1}\|^2$, so
$$\|p_k\|^2 \le \frac{1+c_2}{1-c_2}\,\|g_k\|^2 + (\beta_k^{FR})^2\|p_{k-1}\|^2.$$
Conjugate Direction minimization 97 / 106

79. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (bounding $\|p_k\|$) (3/4). Setting $c_3 = \dfrac{1+c_2}{1-c_2}$ and using the last inequality repeatedly we obtain:
$$\|p_k\|^2 \le c_3\|g_k\|^2 + (\beta_k^{FR})^2\big[c_3\|g_{k-1}\|^2 + (\beta_{k-1}^{FR})^2\|p_{k-2}\|^2\big]$$
$$= c_3\|g_k\|^4\big[\|g_k\|^{-2} + \|g_{k-1}\|^{-2}\big] + \frac{\|g_k\|^4}{\|g_{k-2}\|^4}\|p_{k-2}\|^2$$
$$\le c_3\|g_k\|^4\big[\|g_k\|^{-2} + \|g_{k-1}\|^{-2} + \|g_{k-2}\|^{-2}\big] + \frac{\|g_k\|^4}{\|g_{k-3}\|^4}\|p_{k-3}\|^2 \le \cdots \le c_3\|g_k\|^4\sum_{j=1}^k \|g_j\|^{-2}.$$
Conjugate Direction minimization 98 / 106

80. Nonlinear Conjugate Gradient extension. Convergence analysis. Proof (4/4). Suppose now by contradiction that there exists $\delta > 0$ such that $\|g_k\| \ge \delta$ (a); by using the regularity assumptions we have
$$\|p_k\|^2 \le c_3\|g_k\|^4\sum_{j=1}^k \|g_j\|^{-2} \le c_3\|g_k\|^4\,\delta^{-2}k.$$
Substituting into the Zoutendijk condition we have
$$\infty > \sum_{k=1}^{\infty}\frac{\|g_k\|^4}{\|p_k\|^2} \ge \frac{\delta^2}{c_3}\sum_{k=1}^{\infty}\frac{1}{k} = \infty,$$
a contradiction; hence the assumption cannot hold. (Footnote a: the correct assumption is that there exists $k_0$ such that $\|g_k\| \ge \delta$ for $k \ge k_0$, but this complicates the inequalities a little without introducing new ideas.) Conjugate Direction minimization 99 / 106

81. Nonlinear Conjugate Gradient extension. Weakness of the Fletcher and Reeves method. Suppose that $p_k$ is a bad search direction, i.e. $\cos\theta_k \approx 0$. From the descent direction bound lemma (see slide 90) we have
$$\frac{1}{1-c_2}\frac{\|g_k\|}{\|p_k\|} \ge \cos\theta_k \ge \frac{1-2c_2}{1-c_2}\frac{\|g_k\|}{\|p_k\|} > 0,$$
so that to have $\cos\theta_k \approx 0$ we need $\|p_k\| \gg \|g_k\|$. Since $p_k$ is a bad direction, nearly orthogonal to $g_k$, it is likely that the step is small and $x_{k+1}\approx x_k$. If so we also have $g_{k+1}\approx g_k$ and $\beta_{k+1}^{FR}\approx 1$. But remember that $p_{k+1} \leftarrow -g_{k+1} + \beta_{k+1}^{FR}p_k$, so that $p_{k+1}\approx p_k$. This means that a long sequence of unproductive iterates will follow. Conjugate Direction minimization 100 / 106
