

SLIDE 1

Conjugate Direction minimization

Lectures for the PhD course on Numerical Optimization
Enrico Bertolazzi
DIMS – Università di Trento
November 21 – December 14, 2011

Conjugate Direction minimization 1 / 106

SLIDE 2

Outline

1. Introduction
2. Convergence rate of Steepest Descent iterative scheme
3. Conjugate direction method
4. Conjugate Gradient method
5. Conjugate Gradient convergence rate
6. Preconditioning the Conjugate Gradient method
7. Nonlinear Conjugate Gradient extension

SLIDE 3

Outline (current section: Introduction)

SLIDE 4

Generic minimization algorithm

In the following we study the convergence rate of the generic minimization algorithm applied to a quadratic function q(x) with exact line search. The function

  q(x) = (1/2) x^T A x − b^T x + c

can be viewed as an n-dimensional generalization of the 1-dimensional parabolic model.

Generic minimization algorithm:
  Given an initial guess x_0, let k = 0;
  while not converged do
    Find a descent direction p_k at x_k;
    Compute a step size α_k using a line search along p_k;
    Set x_{k+1} = x_k + α_k p_k and increase k by 1;
  end while

SLIDE 5

Assumption (Symmetry)

The matrix A is assumed to be symmetric. In fact, A = A_Symm + A_Skew, where

  A_Symm = (1/2)(A + A^T),  A_Symm = (A_Symm)^T,
  A_Skew = (1/2)(A − A^T),  A_Skew = −(A_Skew)^T.

Moreover,

  x^T A x = x^T A_Symm x + x^T A_Skew x = x^T A_Symm x,

so that only the symmetric part of A contributes to q(x).

SLIDE 6

Assumption (SPD)

The matrix A is assumed to be symmetric and positive definite. In fact,

  ∇q(x)^T = (1/2)(A + A^T) x − b = A x − b  and  ∇²q(x) = (1/2)(A + A^T) = A.

From the conditions for a minimum we have ∇q(x⋆)^T = 0, i.e. A x⋆ = b, and ∇²q(x⋆) = A is SPD.

SLIDE 7

The toy problem (1/3)

In the following we study the convergence rate of the Steepest Descent and Conjugate Gradient methods applied to

  q(x) = (1/2) x^T A x − b^T x + c,

where A is an SPD matrix. This assumption simplifies the analysis, but it is also useful in the nonlinear case. In fact, by expanding a generic function f(x) near its minimum x⋆ we have

  f(x) = f(x⋆) + ∇f(x⋆)(x − x⋆) + (1/2)(x − x⋆)^T ∇²f(x⋆)(x − x⋆) + O(‖x − x⋆‖³)

SLIDE 8

The toy problem (2/3)

By setting

  A = ∇²f(x⋆),
  b = ∇²f(x⋆) x⋆ − ∇f(x⋆)^T,
  c = f(x⋆) − ∇f(x⋆) x⋆ + (1/2) x⋆^T ∇²f(x⋆) x⋆,

we have

  f(x) = (1/2) x^T A x − b^T x + c + O(‖x − x⋆‖³),

so we expect that when an iterate x_k is near x⋆ we can neglect the O(‖x − x⋆‖³) term, and the asymptotic behavior is the same as for the quadratic problem.

SLIDE 9

The toy problem (3/3)

We can rewrite the quadratic problem in many different ways, as follows:

  q(x) = (1/2)(x − x⋆)^T A (x − x⋆) + c′ = (1/2)(A x − b)^T A⁻¹ (A x − b) + c′,

where c′ = c − (1/2) x⋆^T A x⋆. These last forms are useful in the study of the steepest descent method.

SLIDE 10

Outline (current section: Convergence rate of Steepest Descent iterative scheme)

SLIDE 11

The steepest descent for quadratic functions (1/3)

The steepest descent minimization algorithm:
  Given an initial guess x_0, let k = 0;
  while not converged do
    Choose as descent direction p_k = −∇q(x_k)^T = b − A x_k;
    Compute a step size α_k using a line search along p_k;
    Set x_{k+1} = x_k + α_k p_k and increase k by 1;
  end while

Definition (Residual)

The expressions r(x) = b − A x and r_k = b − A x_k are called the residual. We obviously have r(x) = −∇q(x)^T and r(x⋆) = 0.

SLIDE 12

The steepest descent for quadratic functions (2/3)

We can solve the problem

  α_k = arg min_{α ≥ 0} q(x_k + α r_k)

exactly, because p(α) = q(x_k + α r_k) is a parabola. In fact,

  dp(α)/dα = dq(x_k + α r_k)/dα = ∇q(x_k + α r_k) r_k = 0,

but

  0 = ∇q(x_k + α r_k) r_k = (A(x_k + α r_k) − b)^T r_k = (α A r_k − r_k)^T r_k,

and the minimum is at α = r_k^T r_k / (r_k^T A r_k).

SLIDE 13

The steepest descent for quadratic functions (3/3)

The steepest descent minimization algorithm:
  Given an initial guess x_0, let k = 0;
  while not converged do
    Compute r_k = b − A x_k;
    Compute the step size α_k = r_k^T r_k / (r_k^T A r_k);
    Set x_{k+1} = x_k + α_k r_k and increase k by 1;
  end while

Or, more compactly,

  x_{k+1} = x_k + (r_k^T r_k / (r_k^T A r_k)) r_k
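As a sanity check, the compact iteration above can be sketched in pure Python; the 2×2 SPD matrix A and vector b below are illustrative values, not data from the lecture.

```python
# Steepest descent with exact line search for q(x) = 1/2 x^T A x - b^T x.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def steepest_descent(A, b, x0, tol=1e-12, max_iter=10_000):
    x = list(x0)
    for _ in range(max_iter):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # r_k = b - A x_k
        rr = dot(r, r)
        if rr < tol ** 2:
            break
        alpha = rr / dot(r, matvec(A, r))                   # exact line search
        x = [xi + alpha * ri for xi, ri in zip(x, r)]       # x_{k+1} = x_k + alpha r_k
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # SPD, illustrative
b = [1.0, 2.0]
x = steepest_descent(A, b, [0.0, 0.0])
```

On this system the iterates approach the exact solution A⁻¹b = (1/11, 7/11) to the requested tolerance in a few dozen steps, since A is well conditioned.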

SLIDE 14

The steepest descent reduction step (1/3)

We want to bound q(x_{k+1}) by q(x_k):

  q(x_{k+1}) = q(x_k + α_k r_k)
             = (1/2)(A x_k + α_k A r_k − b)^T A⁻¹ (A x_k + α_k A r_k − b) + c′
             = (1/2)(α_k A r_k − r_k)^T A⁻¹ (α_k A r_k − r_k) + c′
             = (1/2) r_k^T A⁻¹ r_k + (1/2) α_k² r_k^T A r_k − α_k r_k^T r_k + c′
             = q(x_k) + (1/2) α_k (α_k r_k^T A r_k − 2 r_k^T r_k)

SLIDE 15

The steepest descent reduction step (2/3)

Substituting α_k = r_k^T r_k / (r_k^T A r_k) we obtain

  q(x_{k+1}) = q(x_k) − (1/2) (r_k^T r_k)² / (r_k^T A r_k).

This shows that the steepest descent method reduces the objective function q(x) at each step. Using the expression q(x) = (1/2) r(x)^T A⁻¹ r(x) + c′ we can write:

  (1/2) r_{k+1}^T A⁻¹ r_{k+1} = (1/2) r_k^T A⁻¹ r_k − (1/2) (r_k^T r_k)² / (r_k^T A r_k)

SLIDE 16

The steepest descent reduction step (3/3)

Or better,

  r_{k+1}^T A⁻¹ r_{k+1} = r_k^T A⁻¹ r_k [ 1 − (r_k^T r_k)² / ((r_k^T A⁻¹ r_k)(r_k^T A r_k)) ].

Noticing that r_k = b − A x_k = A x⋆ − A x_k = A(x⋆ − x_k), we have

  ‖x⋆ − x_{k+1}‖²_A = ‖x⋆ − x_k‖²_A [ 1 − (r_k^T r_k)² / ((r_k^T A⁻¹ r_k)(r_k^T A r_k)) ],

where ‖x‖_A = √(x^T A x) is the energy norm induced by the SPD matrix A.

SLIDE 17

The steepest descent convergence rate

The estimate of the convergence rate for the steepest descent method is linked to the estimate of the term

  (r_k^T r_k)² / ((r_k^T A⁻¹ r_k)(r_k^T A r_k));

in particular we can prove

Lemma (Kantorovich)

Let A ∈ ℝ^{n×n} be an SPD matrix; then the following inequality is valid for all x ≠ 0:

  1 ≤ (x^T A x)(x^T A⁻¹ x) / (x^T x)² ≤ (M + m)² / (4 M m),

where m = λ_1 is the smallest eigenvalue of A and M = λ_n is the biggest eigenvalue of A.
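The inequality can be checked numerically on a diagonal SPD matrix, where the eigenvalues are explicit; a small sketch (the eigenvalues and test vectors are arbitrary choices, not from the lecture):

```python
# Kantorovich inequality check for A = diag(1, 2, 5): m = 1, M = 5,
# so the upper bound is (M + m)^2 / (4 M m) = 36 / 20 = 1.8.
lam = [1.0, 2.0, 5.0]
m, M = min(lam), max(lam)
bound = (M + m) ** 2 / (4 * M * m)

def kantorovich_ratio(x):
    xx = sum(xi * xi for xi in x)
    xAx = sum(xi * xi * l for xi, l in zip(x, lam))       # x^T A x
    xAinvx = sum(xi * xi / l for xi, l in zip(x, lam))    # x^T A^{-1} x
    return xAx * xAinvx / xx ** 2

samples = [[1, 0, 0], [1, 1, 1], [1, 0, 1], [0.3, -2, 0.7], [1, 2, 3]]
ratios = [kantorovich_ratio(x) for x in samples]
ok = all(1 - 1e-12 <= r <= bound + 1e-12 for r in ratios)
```

The vector (1, 0, 1), an equal mix of the extreme eigenvectors, attains the upper bound exactly, consistent with the two-nonzero-coefficients argument in the proof below.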

SLIDE 18

Proof (1/5).

STEP 1: problem reformulation. First of all notice that

  (x^T A x)(x^T A⁻¹ x) / (x^T x)² = (y^T A y)(y^T A⁻¹ y) / (y^T y)²

for all y = αx with α ≠ 0. Choosing α = ‖x‖⁻¹ we have:

  min_{‖z‖=1} (z^T A z)(z^T A⁻¹ z) ≤ (x^T A x)(x^T A⁻¹ x) / (x^T x)² ≤ max_{‖z‖=1} (z^T A z)(z^T A⁻¹ z)

SLIDE 19

Proof (2/5).

STEP 2: eigenvector expansions. The matrix A ∈ ℝ^{n×n} is SPD, so there exists a complete orthonormal set of eigenvectors u_1, u_2, …, u_n with corresponding eigenvalues 0 < λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n. Let x ∈ ℝ^n; then

  x = Σ_{k=1}^n α_k u_k,   x^T x = Σ_{k=1}^n α_k²,

so that (x^T A x)(x^T A⁻¹ x) = h(α_1, …, α_n), where

  h(α_1, …, α_n) = (Σ_{k=1}^n α_k² λ_k)(Σ_{k=1}^n α_k² λ_k⁻¹);

then the lemma can be reformulated:

  Find the maxima and minima of h(α_1, …, α_n) subject to Σ_{k=1}^n α_k² = 1.

SLIDE 20

Proof (3/5).

STEP 3: problem reduction. By using a Lagrange multiplier, maxima and minima are among the stationary points of

  g(α_1, …, α_n, µ) = h(α_1, …, α_n) + µ (Σ_{k=1}^n α_k² − 1);

setting 𝒜 = Σ_{k=1}^n α_k² λ_k and ℬ = Σ_{k=1}^n α_k² λ_k⁻¹ we have

  ∂g(α_1, …, α_n, µ)/∂α_k = 2 α_k (λ_k ℬ + λ_k⁻¹ 𝒜 + µ) = 0,

so that

  1. either α_k = 0;
  2. or λ_k is a root of the quadratic polynomial λ² ℬ + λ µ + 𝒜.

In any case there are at most 2 nonzero coefficients α. (The argument should be refined in the case of multiple eigenvalues.)

SLIDE 21

Proof (4/5).

STEP 4: problem reformulation. Say α_i and α_j are the only nonzero coefficients; then α_i² + α_j² = 1 and we can write

  h(α_1, …, α_n) = (α_i² λ_i + α_j² λ_j)(α_i² λ_i⁻¹ + α_j² λ_j⁻¹)
                 = α_i⁴ + α_j⁴ + α_i² α_j² (λ_i/λ_j + λ_j/λ_i)
                 = α_i²(1 − α_j²) + α_j²(1 − α_i²) + α_i² α_j² (λ_i/λ_j + λ_j/λ_i)
                 = 1 + α_i² α_j² (λ_i/λ_j + λ_j/λ_i − 2)
                 = 1 + α_i²(1 − α_i²)(λ_i − λ_j)² / (λ_i λ_j)

SLIDE 22

Proof (5/5).

STEP 5: bounding maxima and minima. Notice that 0 ≤ β(1 − β) ≤ 1/4 for all β ∈ [0, 1], so

  1 ≤ 1 + α_i²(1 − α_i²)(λ_i − λ_j)²/(λ_i λ_j) ≤ 1 + (λ_i − λ_j)²/(4 λ_i λ_j) = (λ_i + λ_j)²/(4 λ_i λ_j).

To bound (λ_i + λ_j)²/(4 λ_i λ_j), consider the function f(x) = (1 + x)²/x, which is increasing for x ≥ 1, so that

  (λ_i + λ_j)²/(4 λ_i λ_j) ≤ (M + m)²/(4 M m),

and finally 1 ≤ h(α_1, …, α_n) ≤ (M + m)²/(4 M m).

SLIDE 23

Convergence rate of Steepest Descent

The Kantorovich inequality permits us to prove:

Theorem (Convergence rate of Steepest Descent)

Let A ∈ ℝ^{n×n} be an SPD matrix; then the steepest descent method

  x_{k+1} = x_k + (r_k^T r_k / (r_k^T A r_k)) r_k

converges to the solution x⋆ = A⁻¹ b with at least linear q-rate in the norm ‖·‖_A. Moreover we have the error estimate

  ‖x_{k+1} − x⋆‖_A ≤ ((κ − 1)/(κ + 1)) ‖x_k − x⋆‖_A,

where κ = M/m is the condition number, m = λ_1 is the smallest eigenvalue of A, and M = λ_n is the biggest eigenvalue of A.

SLIDE 24

Proof.

Remember from slide 16 that

  ‖x⋆ − x_{k+1}‖²_A = ‖x⋆ − x_k‖²_A [ 1 − (r_k^T r_k)² / ((r_k^T A⁻¹ r_k)(r_k^T A r_k)) ];

from the Kantorovich inequality

  1 − (r_k^T r_k)² / ((r_k^T A⁻¹ r_k)(r_k^T A r_k)) ≤ 1 − 4 M m/(M + m)² = (M − m)²/(M + m)²,

so that

  ‖x⋆ − x_{k+1}‖_A ≤ ((M − m)/(M + m)) ‖x⋆ − x_k‖_A

SLIDE 25

Remark (One step convergence)

The steepest descent method can converge in one iteration if κ = 1 or when r_0 = u_k, where u_k is an eigenvector of A.

1. In the first case (κ = 1) we have A = βI for some β > 0, so it is not interesting.
2. In the second case we have

  (u_k^T u_k)² / ((u_k^T A⁻¹ u_k)(u_k^T A u_k)) = (u_k^T u_k)² / (λ_k⁻¹ (u_k^T u_k) λ_k (u_k^T u_k)) = 1.

In both cases we have r_1 = 0, i.e. we have found the solution.

SLIDE 26

Outline (current section: Conjugate direction method)

SLIDE 27

Conjugate direction method

Definition (Conjugate vectors)

Two vectors p and q in ℝ^n are conjugate with respect to A if they are orthogonal with respect to the scalar product induced by A, i.e.

  p^T A q = Σ_{i,j=1}^n A_{ij} p_i q_j = 0.

Clearly, n vectors p_1, p_2, …, p_n ∈ ℝ^n that are pairwise conjugate with respect to A form a basis of ℝ^n.

SLIDE 28

Problem (Linear system)

Finding the minimum of q(x) = (1/2) x^T A x − b^T x + c is equivalent to solving the first order necessary condition, i.e. find x⋆ ∈ ℝ^n such that A x⋆ = b.

Observation

Consider x_0 ∈ ℝ^n and decompose the error e_0 = x⋆ − x_0 along the conjugate vectors p_1, p_2, …, p_n ∈ ℝ^n:

  e_0 = x⋆ − x_0 = σ_1 p_1 + σ_2 p_2 + ⋯ + σ_n p_n.

Evaluating the coefficients σ_1, σ_2, …, σ_n ∈ ℝ is equivalent to solving the problem A x⋆ = b, because knowing e_0 we have x⋆ = x_0 + e_0.

SLIDE 29

Observation

Using conjugacy, the coefficients σ_1, σ_2, …, σ_n ∈ ℝ can be computed as

  σ_i = p_i^T A e_0 / (p_i^T A p_i),  for i = 1, 2, …, n.

In fact, for all 1 ≤ i ≤ n we have

  p_i^T A e_0 = p_i^T A (σ_1 p_1 + σ_2 p_2 + ⋯ + σ_n p_n)
              = σ_1 p_i^T A p_1 + σ_2 p_i^T A p_2 + ⋯ + σ_n p_i^T A p_n
              = σ_i p_i^T A p_i,

because p_i^T A p_j = 0 for i ≠ j.
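The σ-decomposition can be exercised on a 2×2 example: build a second direction A-conjugate to p_1 = (1, 0) with one Gram–Schmidt step, compute the σ_i as above, and recover x⋆. The matrix, right-hand side, and starting point are illustrative choices, not from the lecture.

```python
# Decompose e0 = x* - x0 along two A-conjugate directions and recover x*.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

A = [[4.0, 1.0], [1.0, 3.0]]
x_star = [1.0 / 11.0, 7.0 / 11.0]            # solves A x = (1, 2)
x0 = [0.0, 0.0]
e0 = [s - z for s, z in zip(x_star, x0)]

p1 = [1.0, 0.0]
# A-orthogonalize the second canonical basis vector against p1
beta = dot(p1, matvec(A, [0.0, 1.0])) / dot(p1, matvec(A, p1))
p2 = [0.0 - beta * p1[0], 1.0 - beta * p1[1]]
assert abs(dot(p1, matvec(A, p2))) < 1e-14   # conjugacy: p1^T A p2 = 0

sigmas = [dot(p, matvec(A, e0)) / dot(p, matvec(A, p)) for p in (p1, p2)]
x_rec = [x0[i] + sigmas[0] * p1[i] + sigmas[1] * p2[i] for i in range(2)]
```

The reconstructed point x_rec matches x⋆ to machine precision, confirming that the σ_i solve the system.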

SLIDE 30

The conjugate direction method evaluates the coefficients σ_1, σ_2, …, σ_n ∈ ℝ recursively in n steps, solving for k ≥ 0 the minimization problem:

Conjugate direction method
  Given x_0; k ← 0;
  repeat
    k ← k + 1;
    Find x_k ∈ x_0 + V_k such that x_k = arg min_{x ∈ x_0 + V_k} ‖x⋆ − x‖_A;
  until k = n

where V_k is the subspace of ℝ^n generated by the first k conjugate directions, i.e. V_k = span{p_1, p_2, …, p_k}.

SLIDE 31

Step: x_0 → x_1

At the first step we consider the subspace x_0 + span{p_1}, which consists of vectors of the form

  x(α) = x_0 + α p_1,  α ∈ ℝ.

The minimization problem becomes:

Minimization step x_0 → x_1
  Find x_1 = x_0 + α_1 p_1 (i.e., find α_1!) such that
  ‖x⋆ − x_1‖_A = min_{α ∈ ℝ} ‖x⋆ − (x_0 + α p_1)‖_A

SLIDE 32

Solving the first step, method 1

The minimization problem is the minimum with respect to α of the quadratic:

  Φ(α) = ‖x⋆ − (x_0 + α p_1)‖²_A
        = (x⋆ − (x_0 + α p_1))^T A (x⋆ − (x_0 + α p_1))
        = (e_0 − α p_1)^T A (e_0 − α p_1)
        = e_0^T A e_0 − 2α p_1^T A e_0 + α² p_1^T A p_1.

The minimum is found by imposing

  dΦ(α)/dα = −2 p_1^T A e_0 + 2α p_1^T A p_1 = 0  ⇒  α_1 = p_1^T A e_0 / (p_1^T A p_1)

SLIDE 33

Solving the first step, method 2 (1/2)

Remember the error expansion x⋆ − x_0 = σ_1 p_1 + σ_2 p_2 + ⋯ + σ_n p_n. Let x(α) = x_0 + α p_1; the difference x⋆ − x(α) becomes

  x⋆ − x(α) = (σ_1 − α) p_1 + σ_2 p_2 + ⋯ + σ_n p_n.

Due to conjugacy, the error ‖x⋆ − x(α)‖_A becomes

  ‖x⋆ − x(α)‖²_A = ((σ_1 − α) p_1 + Σ_{i=2}^n σ_i p_i)^T A ((σ_1 − α) p_1 + Σ_{j=2}^n σ_j p_j)
                 = (σ_1 − α)² p_1^T A p_1 + Σ_{j=2}^n σ_j² p_j^T A p_j

SLIDE 34

Solving the first step, method 2 (2/2)

Because

  ‖x⋆ − x(α)‖²_A = (σ_1 − α)² ‖p_1‖²_A + Σ_{i=2}^n σ_i² ‖p_i‖²_A,

we have that

  ‖x⋆ − x(α_1)‖²_A = Σ_{i=2}^n σ_i² ‖p_i‖²_A ≤ ‖x⋆ − x(α)‖²_A

for all α ≠ σ_1, so the minimum is found by imposing α_1 = σ_1:

  α_1 = p_1^T A e_0 / (p_1^T A p_1)

This argument can be generalized to all k > 1 (see next slides).

SLIDE 35

Step: x_{k−1} → x_k

For the step from k − 1 to k we consider the subspace of ℝ^n

  V_k = span{p_1, p_2, …, p_k},

which contains vectors of the form

  x(α^{(1)}, α^{(2)}, …, α^{(k)}) = x_0 + α^{(1)} p_1 + α^{(2)} p_2 + ⋯ + α^{(k)} p_k.

The minimization problem becomes:

Minimization step x_{k−1} → x_k
  Find x_k = x_0 + α_1 p_1 + α_2 p_2 + ⋯ + α_k p_k (i.e. α_1, α_2, …, α_k) such that
  ‖x⋆ − x_k‖_A = min_{α^{(1)},…,α^{(k)} ∈ ℝ} ‖x⋆ − x(α^{(1)}, α^{(2)}, …, α^{(k)})‖_A

SLIDE 36

Solving the kth step: x_{k−1} → x_k (1/2)

Remember the error expansion x⋆ − x_0 = σ_1 p_1 + σ_2 p_2 + ⋯ + σ_n p_n. Consider a vector of the form

  x(α^{(1)}, …, α^{(k)}) = x_0 + α^{(1)} p_1 + ⋯ + α^{(k)} p_k;

the error x⋆ − x(α^{(1)}, …, α^{(k)}) can be written as

  x⋆ − x(α^{(1)}, …, α^{(k)}) = x⋆ − x_0 − Σ_{i=1}^k α^{(i)} p_i
                              = Σ_{i=1}^k (σ_i − α^{(i)}) p_i + Σ_{i=k+1}^n σ_i p_i.

SLIDE 37

Solving the kth step: x_{k−1} → x_k (2/2)

Using the conjugacy of the p_i we obtain the norm of the error:

  ‖x⋆ − x(α^{(1)}, …, α^{(k)})‖²_A = Σ_{i=1}^k (σ_i − α^{(i)})² ‖p_i‖²_A + Σ_{i=k+1}^n σ_i² ‖p_i‖²_A.

So the minimum is found by imposing α_i = σ_i for i = 1, 2, …, k:

  α_i = p_i^T A e_0 / (p_i^T A p_i),  i = 1, 2, …, k

SLIDE 38

Successive one dimensional minimization (1/3)

Notice that α_i = σ_i and that

  x_k = x_0 + α_1 p_1 + ⋯ + α_k p_k = x_{k−1} + α_k p_k,

so that x_{k−1} already contains the first k − 1 coefficients α_i of the minimization. If we consider the one dimensional minimization on the subspace x_{k−1} + span{p_k}, we find again x_k!

SLIDE 39

Successive one dimensional minimization (2/3)

Consider a vector of the form x(α) = x_{k−1} + α p_k, and remember that x_{k−1} = x_0 + α_1 p_1 + ⋯ + α_{k−1} p_{k−1}, so that the error x⋆ − x(α) can be written as

  x⋆ − x(α) = x⋆ − x_0 − Σ_{i=1}^{k−1} α_i p_i − α p_k
            = Σ_{i=1}^{k−1} (σ_i − α_i) p_i + (σ_k − α) p_k + Σ_{i=k+1}^n σ_i p_i.

Due to the equality σ_i = α_i, the first sum in the expression is 0.

SLIDE 40

Successive one dimensional minimization (3/3)

Using the conjugacy of the p_i we obtain the norm of the error:

  ‖x⋆ − x(α)‖²_A = (σ_k − α)² ‖p_k‖²_A + Σ_{i=k+1}^n σ_i² ‖p_i‖²_A.

So the minimum is found by imposing α = σ_k:

  α_k = p_k^T A e_0 / (p_k^T A p_k)

Remark

This observation permits us to perform the minimization on the k-dimensional space x_0 + V_k as successive one dimensional minimizations along the conjugate directions p_k!

SLIDE 41

Problem (one dimensional successive minimization)

Find x_k = x_{k−1} + α_k p_k such that

  ‖x⋆ − x_k‖_A = min_{α ∈ ℝ} ‖x⋆ − (x_{k−1} + α p_k)‖_A.

The solution is the minimum with respect to α of the quadratic:

  Φ(α) = (x⋆ − (x_{k−1} + α p_k))^T A (x⋆ − (x_{k−1} + α p_k))
        = (e_{k−1} − α p_k)^T A (e_{k−1} − α p_k)
        = e_{k−1}^T A e_{k−1} − 2α p_k^T A e_{k−1} + α² p_k^T A p_k.

The minimum is found by imposing

  dΦ(α)/dα = −2 p_k^T A e_{k−1} + 2α p_k^T A p_k = 0  ⇒  α_k = p_k^T A e_{k−1} / (p_k^T A p_k)

SLIDE 42

In the case of minimization on the subspace x_0 + V_k we have

  α_k = p_k^T A e_0 / (p_k^T A p_k).

In the case of one dimensional minimization on the subspace x_{k−1} + span{p_k} we have

  α_k = p_k^T A e_{k−1} / (p_k^T A p_k).

Apparently these are different results; however, by using the conjugacy of the vectors p_i we have

  p_k^T A e_{k−1} = p_k^T A (x⋆ − x_{k−1})
                  = p_k^T A (x⋆ − (x_0 + α_1 p_1 + ⋯ + α_{k−1} p_{k−1}))
                  = p_k^T A e_0 − α_1 p_k^T A p_1 − ⋯ − α_{k−1} p_k^T A p_{k−1}
                  = p_k^T A e_0

SLIDE 43

The one step minimization in the space x_0 + V_n and the successive minimizations in the spaces x_{k−1} + span{p_k}, k = 1, 2, …, n, are equivalent if the p_i are conjugate. The successive minimization is useful when the p_i are not known in advance but must be computed as the minimization process proceeds. The evaluation of α_k is apparently not computable because e_{k−1} is not known. However, noticing that

  A e_k = A(x⋆ − x_k) = b − A x_k = r_k,

we can write

  α_k = p_k^T A e_{k−1} / (p_k^T A p_k) = p_k^T r_{k−1} / (p_k^T A p_k).

Finally, the residual satisfies the recurrence

  r_k = b − A x_k = b − A(x_{k−1} + α_k p_k) = r_{k−1} − α_k A p_k.

SLIDE 44

Conjugate direction minimization

Algorithm (Conjugate direction minimization)
  k ← 0; x_0 assigned; r_0 ← b − A x_0;
  while not converged do
    k ← k + 1;
    α_k ← r_{k−1}^T p_k / (p_k^T A p_k);
    x_k ← x_{k−1} + α_k p_k;
    r_k ← r_{k−1} − α_k A p_k;
  end while

Observation (Computational cost)

The conjugate direction minimization requires at each step one matrix–vector product for the evaluation of α_k and two AXPY updates for x_k and r_k.
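A minimal pure-Python sketch of the algorithm, run with two precomputed A-conjugate directions; the system and the directions are illustrative choices, not lecture data. After n = 2 steps the residual vanishes.

```python
# Conjugate direction iteration: one direction per step, alpha_k = r_{k-1}^T p_k / p_k^T A p_k.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
dirs = [[1.0, 0.0], [-0.25, 1.0]]            # pairwise A-conjugate: p1^T A p2 = 0

x = [0.0, 0.0]
r = [bi - axi for bi, axi in zip(b, matvec(A, x))]   # r_0 = b - A x_0
for p in dirs:                                        # n steps, one direction each
    Ap = matvec(A, p)
    alpha = dot(r, p) / dot(p, Ap)                    # alpha_k
    x = [xi + alpha * pi for xi, pi in zip(x, p)]     # x_k = x_{k-1} + alpha_k p_k
    r = [ri - alpha * api for ri, api in zip(r, Ap)]  # r_k = r_{k-1} - alpha_k A p_k
```

The final iterate is x⋆ = (1/11, 7/11) and r = 0, matching the remark below that e_n = 0.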

SLIDE 45

Remark (Monotonic behavior of the error)

The energy norm of the error ‖e_k‖_A is monotonically decreasing in k. In fact,

  e_k = x⋆ − x_k = σ_{k+1} p_{k+1} + ⋯ + σ_n p_n,

and by conjugacy

  ‖e_k‖²_A = ‖x⋆ − x_k‖²_A = σ_{k+1}² ‖p_{k+1}‖²_A + ⋯ + σ_n² ‖p_n‖²_A.

Finally, from this relation we have e_n = 0.

SLIDE 46

Outline (current section: Conjugate Gradient method)

SLIDE 47

Conjugate Gradient method

The Conjugate Gradient method combines the Conjugate Direction method with an orthogonalization process (like Gram–Schmidt) applied to the residuals to construct the conjugate directions. In fact, because A defines a scalar product, in the next slides we prove:

- each residual is orthogonal to the previous conjugate directions, and consequently linearly independent from them;
- if the residual is not null, it can be used to construct a new conjugate direction.

SLIDE 48

Orthogonality of the residual r_k with respect to V_k

The residual r_k is orthogonal to p_1, p_2, …, p_k. In fact, from the error expansion

  e_k = α_{k+1} p_{k+1} + α_{k+2} p_{k+2} + ⋯ + α_n p_n,

and because r_k = A e_k, for i = 1, 2, …, k we have

  p_i^T r_k = p_i^T A e_k = p_i^T A Σ_{j=k+1}^n α_j p_j = Σ_{j=k+1}^n α_j p_i^T A p_j = 0

SLIDE 49

Building new conjugate directions (1/2)

The conjugate direction method builds one new direction at each step. If r_k ≠ 0, it can be used to build the new direction p_{k+1} by a Gram–Schmidt orthogonalization process:

  p_{k+1} = r_k + β_1^{(k+1)} p_1 + β_2^{(k+1)} p_2 + ⋯ + β_k^{(k+1)} p_k,

where the k coefficients β_1^{(k+1)}, β_2^{(k+1)}, …, β_k^{(k+1)} must satisfy

  p_i^T A p_{k+1} = 0,  for i = 1, 2, …, k.

SLIDE 50

Building new conjugate directions (2/2)

(Repeating from the previous slide)

  p_{k+1} = r_k + β_1^{(k+1)} p_1 + β_2^{(k+1)} p_2 + ⋯ + β_k^{(k+1)} p_k;

expanding the expression:

  0 = p_i^T A p_{k+1}
    = p_i^T A (r_k + β_1^{(k+1)} p_1 + β_2^{(k+1)} p_2 + ⋯ + β_k^{(k+1)} p_k)
    = p_i^T A r_k + β_i^{(k+1)} p_i^T A p_i,

  ⇒ β_i^{(k+1)} = −p_i^T A r_k / (p_i^T A p_i),  i = 1, 2, …, k

SLIDE 51

The choice of the residual r_k ≠ 0 for the construction of the new conjugate direction p_{k+1} has three important consequences:

1. simplification of the expression for α_k;
2. orthogonality of the residual r_k to the previous residuals r_0, r_1, …, r_{k−1};
3. a three point formula and simplification of the coefficients β_i^{(k+1)}.

These facts will be examined in the next slides.

SLIDE 52

Simplification of the expression for α_k

Writing the expression for p_k from the orthogonalization process,

  p_k = r_{k−1} + β_1^{(k)} p_1 + β_2^{(k)} p_2 + ⋯ + β_{k−1}^{(k)} p_{k−1},

and using the orthogonality of r_{k−1} to the vectors p_1, p_2, …, p_{k−1} (see slide 48), we have

  r_{k−1}^T p_k = r_{k−1}^T (r_{k−1} + β_1^{(k)} p_1 + β_2^{(k)} p_2 + ⋯ + β_{k−1}^{(k)} p_{k−1})
                = r_{k−1}^T r_{k−1}.

Recalling the definition of α_k it follows that

  α_k = e_{k−1}^T A p_k / (p_k^T A p_k) = r_{k−1}^T p_k / (p_k^T A p_k) = r_{k−1}^T r_{k−1} / (p_k^T A p_k)

SLIDE 53

Orthogonality of the residual r_k to r_0, r_1, …, r_{k−1}

From the definition of p_{i+1} it follows that

  p_{i+1} = r_i + β_1^{(i+1)} p_1 + β_2^{(i+1)} p_2 + ⋯ + β_i^{(i+1)} p_i
  ⇒ r_i ∈ span{p_1, p_2, …, p_i, p_{i+1}} = V_{i+1}.

Using the orthogonality of r_k to the vectors p_1, p_2, …, p_k (see slide 48), for i < k we have

  r_k^T r_i = r_k^T (p_{i+1} − Σ_{j=1}^i β_j^{(i+1)} p_j)
            = r_k^T p_{i+1} − Σ_{j=1}^i β_j^{(i+1)} r_k^T p_j = 0.

SLIDE 54

Three point formula and simplification of β_i^{(k+1)}

From the relation r_k^T r_i = r_k^T (r_{i−1} − α_i A p_i) we deduce

  r_k^T A p_i = (r_k^T r_{i−1} − r_k^T r_i) / α_i = { −r_k^T r_k / α_k  if i = k;  0  if i < k }.

Remembering that α_k = r_{k−1}^T r_{k−1} / (p_k^T A p_k), we obtain

  β_i^{(k+1)} = −r_k^T A p_i / (p_i^T A p_i) = { r_k^T r_k / (r_{k−1}^T r_{k−1})  if i = k;  0  if i < k },

i.e. there is only one nonzero coefficient, β_k^{(k+1)}, so we write β_k = β_k^{(k+1)} and obtain the three point formula:

  p_{k+1} = r_k + β_k p_k

SLIDE 55

Conjugate gradient algorithm

  Initial step: k ← 0; x_0 assigned; r_0 ← b − A x_0; p_1 ← r_0;
  while ‖r_k‖ > ε do
    k ← k + 1;
    (Conjugate direction method)
    α_k ← r_{k−1}^T r_{k−1} / (p_k^T A p_k);
    x_k ← x_{k−1} + α_k p_k;
    r_k ← r_{k−1} − α_k A p_k;
    (Residual orthogonalization)
    β_k ← r_k^T r_k / (r_{k−1}^T r_{k−1});
    p_{k+1} ← r_k + β_k p_k;
  end while
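The loop above translates almost line by line into code; a pure-Python sketch on an illustrative 3×3 SPD system (the tolerance `eps` and the data are assumptions, not from the slides):

```python
# Conjugate Gradient for A x = b, following the slide's loop structure.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def conjugate_gradient(A, b, x0, eps=1e-12):
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # r_0 = b - A x_0
    p = list(r)                                          # p_1 = r_0
    rr = dot(r, r)
    for _ in range(len(b)):                              # at most n steps (finite termination)
        if rr ** 0.5 <= eps:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                          # alpha_k = r^T r / p^T A p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr                               # beta_k = r_k^T r_k / r_{k-1}^T r_{k-1}
        p = [ri + beta * pi for ri, pi in zip(r, p)]     # p_{k+1} = r_k + beta_k p_k
        rr = rr_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # SPD, illustrative
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
residual = max(abs(bi - axi) for bi, axi in zip(b, matvec(A, x)))
```

Note the cost per step: one matrix–vector product (Ap) plus a few dot products and AXPY updates, exactly as for the conjugate direction method.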

SLIDE 56

Outline (current section: Conjugate Gradient convergence rate)

SLIDE 57

Polynomial residual expansions (1/5)

From the Conjugate Gradient iterative scheme on slide 55 we have:

Lemma

There exist k-degree polynomials P_k(x) and Q_k(x) such that

  r_k = P_k(A) r_0,  k = 0, 1, …, n,
  p_k = Q_{k−1}(A) r_0,  k = 1, 2, …, n.

Moreover P_k(0) = 1 for all k.

Proof (1/2).

The proof is by induction. Base k = 0: p_1 = r_0, so that P_0(x) = 1 and Q_0(x) = 1.

SLIDE 58

Polynomial residual expansions (2/5)

Proof (2/2).

Let the expansion be valid for k − 1. Consider the recursion for the residual:

  r_k = r_{k−1} − α_k A p_k = P_{k−1}(A) r_0 − α_k A Q_{k−1}(A) r_0 = (P_{k−1}(A) − α_k A Q_{k−1}(A)) r_0;

then P_k(x) = P_{k−1}(x) − α_k x Q_{k−1}(x), and P_k(0) = P_{k−1}(0) = 1. Consider the recursion for the conjugate direction:

  p_{k+1} = r_k + β_k p_k = P_k(A) r_0 + β_k Q_{k−1}(A) r_0 = (P_k(A) + β_k Q_{k−1}(A)) r_0;

then Q_k(x) = P_k(x) + β_k Q_{k−1}(x).

SLIDE 59

Polynomial residual expansions (3/5)

We have the following trivial equalities:

  V_k = span{p_1, p_2, …, p_k} = span{r_0, r_1, …, r_{k−1}}
      = { q(A) r_0 | q ∈ P_{k−1} } = { p(A) e_0 | p ∈ P_k, p(0) = 0 }.

In this way the optimality of the CG step can be written as

  ‖x⋆ − x_k‖_A ≤ ‖x⋆ − x‖_A,  ∀x ∈ x_0 + V_k,
  ‖x⋆ − x_k‖_A ≤ ‖x⋆ − (x_0 + p(A) e_0)‖_A,  ∀p ∈ P_k, p(0) = 0,
  ‖x⋆ − x_k‖_A ≤ ‖P(A) e_0‖_A,  ∀P ∈ P_k, P(0) = 1

SLIDE 60

Polynomial residual expansions (4/5)

Recalling that A⁻¹ r_k = A⁻¹(b − A x_k) = x⋆ − x_k = e_k, we can write

  e_k = x⋆ − x_k = A⁻¹ r_k = A⁻¹ P_k(A) r_0 = P_k(A) A⁻¹ r_0 = P_k(A)(x⋆ − x_0) = P_k(A) e_0.

Due to the optimality of the conjugate gradient, we then have the estimates on the next slide.

SLIDE 61

Polynomial residual expansions (5/5)

Using the results of slides 59 and 60 we can write

  e_k = P_k(A) e_0,
  ‖e_k‖_A = ‖P_k(A) e_0‖_A ≤ ‖P(A) e_0‖_A,  ∀P ∈ P_k, P(0) = 1,

and from this equation we have the estimate

  ‖e_k‖_A ≤ inf_{P ∈ P_k, P(0)=1} ‖P(A) e_0‖_A.

So an estimate of the form

  inf_{P ∈ P_k, P(0)=1} ‖P(A) e_0‖_A ≤ C_k ‖e_0‖_A

can be used to prove a convergence rate theorem, as for the steepest descent algorithm.

SLIDE 62

Convergence rate calculation

Lemma

Let A ∈ ℝ^{n×n} be an SPD matrix and p ∈ P_k a polynomial; then

  ‖p(A) x‖_A ≤ ‖p(A)‖_2 ‖x‖_A

Proof (1/2).

The matrix A is SPD, so we can write

  A = U^T Λ U,  Λ = diag{λ_1, λ_2, …, λ_n},

where U is an orthogonal matrix (i.e. U^T U = I) and Λ > 0 is diagonal. We can define the SPD matrix A^{1/2} as

  A^{1/2} = U^T Λ^{1/2} U,  Λ^{1/2} = diag{λ_1^{1/2}, λ_2^{1/2}, …, λ_n^{1/2}},

and obviously A^{1/2} A^{1/2} = A.

SLIDE 63

Proof (2/2).

Notice that

  ‖x‖²_A = x^T A x = x^T A^{1/2} A^{1/2} x = ‖A^{1/2} x‖²_2,

so that

  ‖p(A) x‖_A = ‖A^{1/2} p(A) x‖_2 = ‖p(A) A^{1/2} x‖_2 ≤ ‖p(A)‖_2 ‖A^{1/2} x‖_2 = ‖p(A)‖_2 ‖x‖_A

SLIDE 64

Lemma

Let A ∈ ℝ^{n×n} be an SPD matrix and p ∈ P_k a polynomial; then

  ‖p(A)‖_2 = max_{λ ∈ σ(A)} |p(λ)|

Proof.

The matrix p(A) is symmetric, and for a generic symmetric matrix B we have

  ‖B‖_2 = max_{λ ∈ σ(B)} |λ|.

Observing that if λ is an eigenvalue of A then p(λ) is an eigenvalue of p(A), the thesis easily follows.

SLIDE 65

Starting from the error estimate

  ‖e_k‖_A ≤ inf_{P ∈ P_k, P(0)=1} ‖P(A) e_0‖_A

and combining the last two lemmas, we easily obtain the estimate

  ‖e_k‖_A ≤ inf_{P ∈ P_k, P(0)=1} ( max_{λ ∈ σ(A)} |P(λ)| ) ‖e_0‖_A.

The convergence rate is estimated by bounding the constant

  inf_{P ∈ P_k, P(0)=1} max_{λ ∈ σ(A)} |P(λ)|

slide-66
SLIDE 66

Conjugate Gradient convergence rate Finite termination of Conjugate Gradient

Finite termination of Conjugate Gradient

Theorem (Finite termination of Conjugate Gradient)

Let A ∈ ❘n×n an SPD matrix, the the Conjugate Gradient applied to the linear system Ax = b terminate finding the exact solution in at most n-step.

Proof.

From the estimate

‖e_k‖_A ≤ inf_{P∈P_k, P(0)=1} [ max_{λ∈σ(A)} |P(λ)| ] ‖e_0‖_A

choosing

P(x) = ∏_{λ∈σ(A)} (x − λ) / ∏_{λ∈σ(A)} (0 − λ)

we have max_{λ∈σ(A)} |P(λ)| = 0 and hence ‖e_n‖_A = 0.
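Finite termination can be observed with a plain CG implementation. A NumPy sketch (illustrative, not the slides' code; in floating point "exact" means up to roundoff):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=None):
    """Plain CG for SPD A; returns the iterate and the iteration count."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol:
            return x, k
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)   # residual orthogonalization
        p = r_new + beta * p
        r = r_new
    return x, max_iter

rng = np.random.default_rng(2)
n = 8
B = rng.standard_normal((n, n))
A = B.T @ B + np.eye(n)               # SPD test matrix
b = rng.standard_normal(n)
x, iters = conjugate_gradient(A, b, np.zeros(n))
assert iters <= n                     # at most n steps
assert np.allclose(A @ x, b)          # exact solution (up to roundoff)
```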

Conjugate Direction minimization 66 / 106

slide-67
SLIDE 67

Conjugate Gradient convergence rate Convergence rate of Conjugate Gradient

Convergence rate of Conjugate Gradient

1 The constant

inf_{P∈P_k, P(0)=1} max_{λ∈σ(A)} |P(λ)|

is not easy to evaluate.

2 The following bound is useful:

max_{λ∈σ(A)} |P(λ)| ≤ max_{λ∈[λ_1,λ_n]} |P(λ)|

3 In particular, the final estimate will be obtained by

inf_{P∈P_k, P(0)=1} max_{λ∈σ(A)} |P(λ)| ≤ max_{λ∈[λ_1,λ_n]} |P̄_k(λ)|

where P̄_k(x) is a suitable polynomial of degree k with P̄_k(0) = 1, for which max_{λ∈[λ_1,λ_n]} |P̄_k(λ)| is easy to evaluate.

Conjugate Direction minimization 67 / 106

slide-68
SLIDE 68

Conjugate Gradient convergence rate Chebyshev Polynomials

Chebyshev Polynomials

(1/4)

1 The Chebyshev Polynomials of the First Kind are the right polynomials for this estimate. They have the following definition on the interval [−1, 1]:

T_k(x) = cos(k arccos(x))

2 Another, equivalent, definition, valid on the whole interval (−∞, ∞), is the following:

T_k(x) = (1/2) [ (x + √(x^2 − 1))^k + (x − √(x^2 − 1))^k ]

3 In spite of these definitions, T_k(x) is effectively a polynomial.

Conjugate Direction minimization 68 / 106

slide-69
SLIDE 69

Conjugate Gradient convergence rate Chebyshev Polynomials

Chebyshev Polynomials

(2/4)

Some examples of Chebyshev Polynomials: [figure: plots of T_1, T_2, T_3, T_4, T_12 and T_20 on the interval [−1, 1], all bounded by 1 in absolute value].

Conjugate Direction minimization 69 / 106

slide-70
SLIDE 70

Conjugate Gradient convergence rate Chebyshev Polynomials

Chebyshev Polynomials

(3/4)

1 It is easy to show that T_k(x) is a polynomial by using

cos(α + β) = cos α cos β − sin α sin β
cos(α + β) + cos(α − β) = 2 cos α cos β

Let θ = arccos(x); then:

1 T_0(x) = cos(0·θ) = 1;
2 T_1(x) = cos(1·θ) = x;
3 T_2(x) = cos(2θ) = cos(θ)^2 − sin(θ)^2 = 2 cos(θ)^2 − 1 = 2x^2 − 1;
4 T_{k+1}(x) + T_{k−1}(x) = cos((k+1)θ) + cos((k−1)θ) = 2 cos(kθ) cos(θ) = 2 x T_k(x).

2 In general we have the following recurrence:

1 T_0(x) = 1;
2 T_1(x) = x;
3 T_{k+1}(x) = 2 x T_k(x) − T_{k−1}(x).
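The recurrence translates directly into code. A small Python sketch (illustrative, not from the slides) that evaluates T_k by the three-term recurrence and checks it against the trigonometric definition on [−1, 1]:

```python
import math

def cheb_T(k, x):
    """Evaluate T_k(x) via the three-term recurrence of the slides."""
    t_prev, t = 1.0, x              # T_0, T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0 * x * t - t_prev   # T_{j+1} = 2x T_j - T_{j-1}
    return t

# Check against the trigonometric definition T_k(x) = cos(k arccos x).
for k in range(6):
    for x in [-0.9, -0.3, 0.0, 0.5, 1.0]:
        assert math.isclose(cheb_T(k, x),
                            math.cos(k * math.acos(x)), abs_tol=1e-12)
```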

Conjugate Direction minimization 70 / 106

slide-71
SLIDE 71

Conjugate Gradient convergence rate Chebyshev Polynomials

Chebyshev Polynomials

(4/4)

Solving the recurrence

1 T_0(x) = 1;
2 T_1(x) = x;
3 T_{k+1}(x) = 2 x T_k(x) − T_{k−1}(x);

we obtain the explicit form of the Chebyshev Polynomials:

T_k(x) = (1/2) [ (x + √(x^2 − 1))^k + (x − √(x^2 − 1))^k ]

The translated and scaled polynomial

T_k(x; a, b) = T_k( (a + b − 2x) / (b − a) )

is useful in the study of the conjugate gradient method; we have |T_k(x; a, b)| ≤ 1 for all x ∈ [a, b].

Conjugate Direction minimization 71 / 106

slide-72
SLIDE 72

Conjugate Gradient convergence rate Convergence rate of Conjugate Gradient method

Convergence rate of Conjugate Gradient method

Theorem (Convergence rate of Conjugate Gradient method)

Let A ∈ ℝ^{n×n} be an SPD matrix; then the Conjugate Gradient method converges to the solution x⋆ = A^{-1}b with at least linear r-rate in the norm ‖·‖_A. Moreover, we have the error estimate

‖e_k‖_A ⪅ 2 ( (√κ − 1)/(√κ + 1) )^k ‖e_0‖_A

where κ = M/m is the condition number, m = λ_1 is the smallest eigenvalue of A and M = λ_n is the biggest eigenvalue of A. The expression a_k ⪅ b_k means that for all ǫ > 0 there exists k_0 > 0 such that a_k ≤ (1 − ǫ) b_k for all k > k_0.

Conjugate Direction minimization 72 / 106

slide-73
SLIDE 73

Conjugate Gradient convergence rate Convergence rate of Conjugate Gradient method

Proof.

From the estimate

‖e_k‖_A ≤ max_{λ∈[m,M]} |P(λ)| ‖e_0‖_A,   P ∈ P_k, P(0) = 1,

choosing P(x) = T_k(x; m, M)/T_k(0; m, M), from the fact that |T_k(x; m, M)| ≤ 1 for x ∈ [m, M] we have

‖e_k‖_A ≤ |T_k(0; m, M)|^{-1} ‖e_0‖_A = T_k( (M + m)/(M − m) )^{-1} ‖e_0‖_A

Observe that (M + m)/(M − m) = (κ + 1)/(κ − 1) and

T_k( (κ + 1)/(κ − 1) )^{-1} = 2 [ ((√κ + 1)/(√κ − 1))^k + ((√κ − 1)/(√κ + 1))^k ]^{-1}

Finally, notice that ((√κ − 1)/(√κ + 1))^k → 0 as k → ∞.
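The resulting bound ‖e_k‖_A ≤ 2 ((√κ−1)/(√κ+1))^k ‖e_0‖_A can be observed numerically. A NumPy sketch (illustrative; the spectrum spread over [1, 100], so κ = 100, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
# SPD matrix with a prescribed spectrum in [1, 100], so kappa = 100.
lam = np.linspace(1.0, 100.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

kappa = lam[-1] / lam[0]
rho = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
norm_A = lambda v: np.sqrt(v @ A @ v)

# Run CG and check ||e_k||_A <= 2 rho^k ||e_0||_A at every step.
x = np.zeros(n)
r = b - A @ x
p = r.copy()
e0 = norm_A(x_star - x)
ok = True
for k in range(1, n + 1):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p, r = r_new + beta * p, r_new
    ok = ok and norm_A(x_star - x) <= 2.0 * rho**k * e0 + 1e-10
assert ok
```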

Conjugate Direction minimization 73 / 106

slide-74
SLIDE 74

Preconditioning the Conjugate Gradient method

Outline

1

Introduction

2

Convergence rate of Steepest Descent iterative scheme

3

Conjugate direction method

4

Conjugate Gradient method

5

Conjugate Gradient convergence rate

6

Preconditioning the Conjugate Gradient method

7

Nonlinear Conjugate Gradient extension

Conjugate Direction minimization 74 / 106

slide-75
SLIDE 75

Preconditioning the Conjugate Gradient method Preconditioning

Preconditioning

Problem (Preconditioned linear system)

Given A, P ∈ ℝ^{n×n}, with A an SPD matrix and P a nonsingular matrix, and b ∈ ℝ^n, find x⋆ ∈ ℝ^n such that

P^{-T} A x⋆ = P^{-T} b.

A good choice for P should be such that M = P^T P ≈ A, where ≈ denotes that M is an approximation of A in some sense, to be made precise later. Notice that:

P nonsingular implies P^{-T}(b − Ax) = 0 ⟺ b − Ax = 0;
A SPD implies Ã = P^{-T} A P^{-1} is also SPD (obvious proof).

Conjugate Direction minimization 75 / 106

slide-76
SLIDE 76

Preconditioning the Conjugate Gradient method Preconditioning

Now we reformulate the preconditioned system:

Problem (Preconditioned linear system)

Given A, P ∈ ℝ^{n×n}, with A an SPD matrix and P a nonsingular matrix, and b ∈ ℝ^n, the preconditioned problem is the following: find x̃⋆ ∈ ℝ^n such that

Ã x̃⋆ = b̃,   where Ã = P^{-T} A P^{-1},   b̃ = P^{-T} b.

Notice that if x⋆ is the solution of the linear system Ax = b, then x̃⋆ = P x⋆ is the solution of the linear system Ã x̃ = b̃.

Conjugate Direction minimization 76 / 106

slide-77
SLIDE 77

Preconditioning the Conjugate Gradient method Preconditioning

PCG: preliminary version

initial step: k ← 0; x_0 assigned; x̃_0 ← P x_0; r̃_0 ← b̃ − Ã x̃_0; p̃_1 ← r̃_0;
while ‖r̃_k‖ > ǫ do
  k ← k + 1;
  Conjugate direction method
  α_k ← (r̃_{k-1}^T r̃_{k-1}) / (p̃_k^T Ã p̃_k);
  x̃_k ← x̃_{k-1} + α_k p̃_k;
  r̃_k ← r̃_{k-1} − α_k Ã p̃_k;
  Residual orthogonalization
  β_k ← (r̃_k^T r̃_k) / (r̃_{k-1}^T r̃_{k-1});
  p̃_{k+1} ← r̃_k + β_k p̃_k;
end while
final step: x_k ← P^{-1} x̃_k;

Conjugate Direction minimization 77 / 106

slide-78
SLIDE 78

Preconditioning the Conjugate Gradient method CG reformulation

The conjugate gradient algorithm applied to Ã x̃ = b̃ requires the evaluation of products like

Ã p̃_k = P^{-T} A P^{-1} p̃_k.

This can be done without evaluating the matrix Ã directly, by the following operations:

1 solve P s′_k = p̃_k for s′_k = P^{-1} p̃_k;
2 evaluate s′′_k = A s′_k;
3 solve P^T s′′′_k = s′′_k for s′′′_k = P^{-T} s′′_k.

Steps 1 and 3 require the solution of two auxiliary linear systems. This is not a big problem if P and P^T are triangular matrices (see e.g. incomplete Cholesky).
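A NumPy sketch of steps 1 to 3 (illustrative). Here P is taken as the transposed exact Cholesky factor of A, so P^T P = A and the preconditioned operator reduces to the identity; in practice one would use an incomplete factor:

```python
import numpy as np

def back_substitution(U, y):
    """Solve U x = y for an upper-triangular U (step 1: P s' = p)."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def forward_substitution(Lo, y):
    """Solve Lo x = y for a lower-triangular Lo (step 3: P^T s''' = s'')."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (y[i] - Lo[i, :i] @ x[:i]) / Lo[i, i]
    return x

rng = np.random.default_rng(4)
n = 7
B = rng.standard_normal((n, n))
A = B.T @ B + np.eye(n)

# Extreme test case: P = L^T from the exact Cholesky factorization
# A = L L^T, so P^T P = A and P^{-T} A P^{-1} = I.
L = np.linalg.cholesky(A)    # lower triangular
P = L.T                      # upper triangular

def apply_preconditioned(A, P, p):
    s1 = back_substitution(P, p)          # step 1: s'  = P^{-1} p
    s2 = A @ s1                           # step 2: s'' = A s'
    s3 = forward_substitution(P.T, s2)    # step 3: s''' = P^{-T} s''
    return s3

v = rng.standard_normal(n)
out = apply_preconditioned(A, P, v)
assert np.allclose(out, v)   # the preconditioned operator is the identity
```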

Conjugate Direction minimization 78 / 106

slide-79
SLIDE 79

Preconditioning the Conjugate Gradient method CG reformulation

However... we can reformulate the algorithm using only the matrices A and P!

Definition

For all k ≥ 1, we introduce the vectors q_k = P^{-1} p̃_k.

Observation

If the vectors p̃_1, p̃_2, ..., p̃_k, 1 ≤ k ≤ n, are Ã-conjugate, then the corresponding vectors q_1, q_2, ..., q_k are A-conjugate. In fact,

q_j^T A q_i = p̃_j^T P^{-T} A P^{-1} p̃_i = p̃_j^T Ã p̃_i = 0,   if i ≠ j,

which is a consequence of the Ã-conjugation of the vectors p̃_i.

Conjugate Direction minimization 79 / 106

slide-80
SLIDE 80

Preconditioning the Conjugate Gradient method CG reformulation

Definition

For all k ≥ 1, we introduce the vectors xk = xk−1 + αkqk.

Observation

If we assume, by construction, x̃_0 = P x_0, then we have x̃_k = P x_k for all k with 1 ≤ k ≤ n. In fact, if x̃_{k-1} = P x_{k-1} (inductive hypothesis), then

x̃_k = x̃_{k-1} + α_k p̃_k   [preconditioned CG]
    = P x_{k-1} + α_k P q_k   [inductive hyp., def. of q_k]
    = P (x_{k-1} + α_k q_k)   [obvious]
    = P x_k   [def. of x_k]

Conjugate Direction minimization 80 / 106

slide-81
SLIDE 81

Preconditioning the Conjugate Gradient method CG reformulation

Observation

Because x̃_k = P x_k for all k ≥ 0, we have the relation between the corresponding residuals r̃_k = b̃ − Ã x̃_k and r_k = b − A x_k:

r̃_k = P^{-T} r_k.

In fact,

r̃_k = b̃ − Ã x̃_k   [def. of r̃_k]
    = P^{-T} b − P^{-T} A P^{-1} P x_k   [defs. of b̃, Ã, x̃_k]
    = P^{-T} (b − A x_k)   [obvious]
    = P^{-T} r_k.   [def. of r_k]

Conjugate Direction minimization 81 / 106

slide-82
SLIDE 82

Preconditioning the Conjugate Gradient method CG reformulation

Definition

For all k with 1 ≤ k ≤ n, the vector z_k is the solution of the linear system M z_k = r_k, where M = P^T P. Formally, z_k = M^{-1} r_k = P^{-1} P^{-T} r_k. Using the vectors {z_k} we can express α_k and β_k in terms of A, the residual r_k and the conjugate direction q_k, and we can build a recurrence relation for the A-conjugate directions q_k.

Conjugate Direction minimization 82 / 106

slide-83
SLIDE 83

Preconditioning the Conjugate Gradient method CG reformulation

Observation

α_k = (r̃_{k-1}^T r̃_{k-1}) / (p̃_k^T Ã p̃_k) = (r_{k-1}^T P^{-1} P^{-T} r_{k-1}) / (q_k^T P^T P^{-T} A P^{-1} P q_k) = (r_{k-1}^T M^{-1} r_{k-1}) / (q_k^T A q_k) = (r_{k-1}^T z_{k-1}) / (q_k^T A q_k).

Observation

β_k = (r̃_k^T r̃_k) / (r̃_{k-1}^T r̃_{k-1}) = (r_k^T P^{-1} P^{-T} r_k) / (r_{k-1}^T P^{-1} P^{-T} r_{k-1}) = (r_k^T M^{-1} r_k) / (r_{k-1}^T M^{-1} r_{k-1}) = (r_k^T z_k) / (r_{k-1}^T z_{k-1}).

Conjugate Direction minimization 83 / 106

slide-84
SLIDE 84

Preconditioning the Conjugate Gradient method CG reformulation

Observation

Using the vector z_k = M^{-1} r_k, the following recurrence holds:

q_{k+1} = z_k + β_k q_k

In fact,

p̃_{k+1} = r̃_k + β_k p̃_k   [preconditioned CG]
P^{-1} p̃_{k+1} = P^{-1} r̃_k + β_k P^{-1} p̃_k   [left multiply by P^{-1}]
P^{-1} p̃_{k+1} = P^{-1} P^{-T} r_k + β_k P^{-1} p̃_k   [r̃_k = P^{-T} r_k]
P^{-1} p̃_{k+1} = M^{-1} r_k + β_k P^{-1} p̃_k   [M^{-1} = P^{-1} P^{-T}]
q_{k+1} = z_k + β_k q_k   [q_k = P^{-1} p̃_k]

Conjugate Direction minimization 84 / 106

slide-85
SLIDE 85

Preconditioning the Conjugate Gradient method CG reformulation

PCG: final version

initial step: k ← 0; x_0 assigned; r_0 ← b − A x_0; z_0 ← M^{-1} r_0; q_1 ← z_0;
while ‖r_k‖ > ǫ do
  k ← k + 1;
  Conjugate direction method
  α_k ← (r_{k-1}^T z_{k-1}) / (q_k^T A q_k);
  x_k ← x_{k-1} + α_k q_k;
  r_k ← r_{k-1} − α_k A q_k;
  Preconditioning
  z_k ← M^{-1} r_k;
  Residual orthogonalization
  β_k ← (r_k^T z_k) / (r_{k-1}^T z_{k-1});
  q_{k+1} ← z_k + β_k q_k;
end while
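The final version maps directly to code: only products with A and solves with M are needed, never the factor P itself. A NumPy sketch (illustrative; the Jacobi preconditioner M = diag(A) is my choice here, not prescribed by the slides):

```python
import numpy as np

def pcg(A, b, M_inv_apply, x0=None, tol=1e-10, max_iter=None):
    """Preconditioned CG in the 'final version' form: only products
    with A and applications of M^{-1} (via M_inv_apply) are needed."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = M_inv_apply(r)
    q = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Aq = A @ q
        alpha = rz / (q @ Aq)
        x = x + alpha * q
        r = r - alpha * Aq
        z = M_inv_apply(r)            # preconditioning step
        rz_new = r @ z
        beta = rz_new / rz            # residual orthogonalization
        q = z + beta * q
        rz = rz_new
    return x

rng = np.random.default_rng(5)
n = 40
B = rng.standard_normal((n, n))
A = B.T @ B + np.diag(np.linspace(1.0, 50.0, n))   # SPD, uneven diagonal
b = rng.standard_normal(n)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d)   # Jacobi preconditioner M = diag(A)
assert np.allclose(A @ x, b)
```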

Conjugate Direction minimization 85 / 106

slide-86
SLIDE 86

Nonlinear Conjugate Gradient extension

Outline

1

Introduction

2

Convergence rate of Steepest Descent iterative scheme

3

Conjugate direction method

4

Conjugate Gradient method

5

Conjugate Gradient convergence rate

6

Preconditioning the Conjugate Gradient method

7

Nonlinear Conjugate Gradient extension

Conjugate Direction minimization 86 / 106

slide-87
SLIDE 87

Nonlinear Conjugate Gradient extension

Nonlinear Conjugate Gradient extension

1 The conjugate gradient algorithm can be extended to nonlinear minimization.

2 Fletcher and Reeves extended CG to the minimization of a general nonlinear function f(x) as follows:

1 substitute the evaluation of α_k with a line search;
2 substitute the residual r_k with the gradient ∇f(x_k).

3 We also translate the index of the search direction p_k to be more consistent with the gradients. The resulting algorithm is on the next slide.

Conjugate Direction minimization 87 / 106

slide-88
SLIDE 88

Nonlinear Conjugate Gradient extension Fletcher and Reeves

Fletcher and Reeves Nonlinear Conjugate Gradient

initial step: k ← 0; x_0 assigned; f_0 ← f(x_0); g_0 ← ∇f(x_0)^T; p_0 ← −g_0;
while ‖g_k‖ > ǫ do
  k ← k + 1;
  Conjugate direction method
  Compute α_k by line-search;
  x_k ← x_{k-1} + α_k p_{k-1};
  g_k ← ∇f(x_k)^T;
  Residual orthogonalization
  β_k^FR ← (g_k^T g_k) / (g_{k-1}^T g_{k-1});
  p_k ← −g_k + β_k^FR p_{k-1};
end while
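A minimal Python sketch of the scheme above (illustrative). As a stand-in for the line-search it uses the exact minimizing step of a convex quadratic, for which Fletcher and Reeves reduces to linear CG; a real implementation would use a strong Wolfe line-search:

```python
import numpy as np

def fletcher_reeves(grad, x0, line_search, tol=1e-10, max_iter=200):
    """Fletcher-Reeves nonlinear CG skeleton; line_search(x, p)
    returns the step alpha_k along the direction p."""
    x = x0.copy()
    g = grad(x)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = line_search(x, p)
        x = x + alpha * p
        g_new = grad(x)
        beta_fr = (g_new @ g_new) / (g @ g)   # beta_k^FR
        p = -g_new + beta_fr * p
        g = g_new
    return x

# Demo on the convex quadratic f(x) = 1/2 x^T Q x - c^T x, for which
# the exact minimizing step along p is alpha = -g^T p / (p^T Q p).
rng = np.random.default_rng(6)
n = 10
B = rng.standard_normal((n, n))
Q = B.T @ B + np.eye(n)
c = rng.standard_normal(n)

grad = lambda x: Q @ x - c
exact_alpha = lambda x, p: -(grad(x) @ p) / (p @ Q @ p)
x = fletcher_reeves(grad, np.zeros(n), exact_alpha)
assert np.allclose(Q @ x, c)
```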

Conjugate Direction minimization 88 / 106

slide-89
SLIDE 89

Nonlinear Conjugate Gradient extension Fletcher and Reeves

1 To ensure convergence and apply the Zoutendijk global convergence theorem we need to ensure that p_k is a descent direction.

2 p_0 is a descent direction by construction; for p_k we have

g_k^T p_k = −‖g_k‖^2 + β_k^FR g_k^T p_{k-1}

If the line-search is exact then g_k^T p_{k-1} = 0, because p_{k-1} is the direction of the line-search; so, by induction, p_k is a descent direction.

3 Exact line-search is expensive; however, if we use an inexact line-search with the strong Wolfe conditions

1 sufficient decrease: f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k) p_k;
2 curvature condition: |∇f(x_k + α_k p_k) p_k| ≤ c_2 |∇f(x_k) p_k|;

with 0 < c_1 < c_2 < 1/2, then we can prove that p_k is a descent direction.

Conjugate Direction minimization 89 / 106

slide-90
SLIDE 90

Nonlinear Conjugate Gradient extension convergence analysis

The previous considerations permit us to say that the Fletcher and Reeves nonlinear conjugate gradient method with strong Wolfe line-search is globally convergent¹. To prove global convergence we need the following lemma:

Lemma (descent direction bound)

Suppose we apply the Fletcher and Reeves nonlinear conjugate gradient method to f(x) with a strong Wolfe line-search with 0 < c_2 < 1/2. Then the method generates descent directions p_k that satisfy the following inequality:

−1/(1 − c_2) ≤ (g_k^T p_k)/‖g_k‖^2 ≤ −(1 − 2c_2)/(1 − c_2),   k = 0, 1, 2, ...

¹ globally here means that a Zoutendijk-like theorem applies Conjugate Direction minimization 90 / 106

slide-91
SLIDE 91

Nonlinear Conjugate Gradient extension convergence analysis

Proof.

(1/3).

The proof is by induction. First notice that the function

t(ξ) = (2ξ − 1)/(1 − ξ)

is monotonically increasing on the interval [0, 1/2], with t(0) = −1 and t(1/2) = 0. Hence, because c_2 ∈ (0, 1/2), we have

−1 < (2c_2 − 1)/(1 − c_2) < 0.   (⋆)

Base of induction, k = 0: for k = 0 we have p_0 = −g_0, so that g_0^T p_0 / ‖g_0‖^2 = −1. From (⋆) the lemma inequality is trivially satisfied.

Conjugate Direction minimization 91 / 106

slide-92
SLIDE 92

Nonlinear Conjugate Gradient extension convergence analysis

Proof.

(2/3).

Using the update direction formulas of the algorithm,

β_k^FR = (g_k^T g_k)/(g_{k-1}^T g_{k-1}),   p_k = −g_k + β_k^FR p_{k-1},

we can write

(g_k^T p_k)/‖g_k‖^2 = −1 + β_k^FR (g_k^T p_{k-1})/‖g_k‖^2 = −1 + (g_k^T p_{k-1})/‖g_{k-1}‖^2

and, by using the second strong Wolfe condition,

−1 + c_2 (g_{k-1}^T p_{k-1})/‖g_{k-1}‖^2 ≤ (g_k^T p_k)/‖g_k‖^2 ≤ −1 − c_2 (g_{k-1}^T p_{k-1})/‖g_{k-1}‖^2

Conjugate Direction minimization 92 / 106

slide-93
SLIDE 93

Nonlinear Conjugate Gradient extension convergence analysis

Proof.

(3/3).

By induction we have

1/(1 − c_2) ≥ −(g_{k-1}^T p_{k-1})/‖g_{k-1}‖^2 > 0

so that

(g_k^T p_k)/‖g_k‖^2 ≤ −1 − c_2 (g_{k-1}^T p_{k-1})/‖g_{k-1}‖^2 ≤ −1 + c_2/(1 − c_2) = (2c_2 − 1)/(1 − c_2)

and

(g_k^T p_k)/‖g_k‖^2 ≥ −1 + c_2 (g_{k-1}^T p_{k-1})/‖g_{k-1}‖^2 ≥ −1 − c_2/(1 − c_2) = −1/(1 − c_2)

Conjugate Direction minimization 93 / 106

slide-94
SLIDE 94

Nonlinear Conjugate Gradient extension convergence analysis

1 The inequality of the previous lemma can be written as

(1/(1 − c_2)) ‖g_k‖/‖p_k‖ ≥ −(g_k^T p_k)/(‖g_k‖ ‖p_k‖) ≥ ((1 − 2c_2)/(1 − c_2)) ‖g_k‖/‖p_k‖ > 0

2 Remembering the Zoutendijk theorem we have

∑_{k=1}^∞ (cos θ_k)^2 ‖g_k‖^2 < ∞,   where cos θ_k = −(g_k^T p_k)/(‖g_k‖ ‖p_k‖)

3 So, if ‖g_k‖/‖p_k‖ is bounded from below, we have cos θ_k ≥ δ for all k, and then from the Zoutendijk theorem the scheme converges.

4 Unfortunately this bound cannot be proved, so the Zoutendijk theorem cannot be applied directly. However, it is possible to prove a weaker result, i.e. that lim inf_{k→∞} ‖g_k‖ = 0!

Conjugate Direction minimization 94 / 106

slide-95
SLIDE 95

Nonlinear Conjugate Gradient extension convergence analysis

Convergence of Fletcher and Reeves method

Assumption (Regularity assumption)

We assume f ∈ C¹(ℝ^n) with Lipschitz continuous gradient, i.e. there exists γ > 0 such that

‖∇f(x)^T − ∇f(y)^T‖ ≤ γ ‖x − y‖,   ∀x, y ∈ ℝ^n

Conjugate Direction minimization 95 / 106

slide-96
SLIDE 96

Nonlinear Conjugate Gradient extension convergence analysis

Theorem (Convergence of Fletcher and Reeves method)

Suppose the method of Fletcher and Reeves is implemented with a strong Wolfe line-search with 0 < c_1 < c_2 < 1/2. If f(x) and x_0 satisfy the previous regularity assumption, then

lim inf_{k→∞} ‖g_k‖ = 0

Proof.

(1/4).

From the previous lemma we have

cos θ_k ≥ ((1 − 2c_2)/(1 − c_2)) ‖g_k‖/‖p_k‖,   k = 1, 2, ...

Substituting in the Zoutendijk condition we have

∑_{k=1}^∞ ‖g_k‖^4/‖p_k‖^2 < ∞.

The proof is by contradiction: if the theorem were not true, the series would diverge. Next we want to bound ‖p_k‖.

Conjugate Direction minimization 96 / 106

slide-97
SLIDE 97

Nonlinear Conjugate Gradient extension convergence analysis

Proof (bounding ‖p_k‖).

(2/4).

Using the second Wolfe condition and the previous lemma,

|g_k^T p_{k-1}| ≤ −c_2 g_{k-1}^T p_{k-1} ≤ (c_2/(1 − c_2)) ‖g_{k-1}‖^2.

Using p_k ← −g_k + β_k^FR p_{k-1} we have

‖p_k‖^2 ≤ ‖g_k‖^2 + 2 β_k^FR |g_k^T p_{k-1}| + (β_k^FR)^2 ‖p_{k-1}‖^2
        ≤ ‖g_k‖^2 + (2c_2/(1 − c_2)) β_k^FR ‖g_{k-1}‖^2 + (β_k^FR)^2 ‖p_{k-1}‖^2

Recalling that β_k^FR = ‖g_k‖^2 / ‖g_{k-1}‖^2, then

‖p_k‖^2 ≤ ((1 + c_2)/(1 − c_2)) ‖g_k‖^2 + (β_k^FR)^2 ‖p_{k-1}‖^2

Conjugate Direction minimization 97 / 106

slide-98
SLIDE 98

Nonlinear Conjugate Gradient extension convergence analysis

Proof (bounding ‖p_k‖).

(3/4).

Setting c_3 = (1 + c_2)/(1 − c_2) and using the last inequality repeatedly, we obtain

‖p_k‖^2 ≤ c_3 ‖g_k‖^2 + (β_k^FR)^2 [ c_3 ‖g_{k-1}‖^2 + (β_{k-1}^FR)^2 ‖p_{k-2}‖^2 ]
        = c_3 ‖g_k‖^4 ( ‖g_k‖^{-2} + ‖g_{k-1}‖^{-2} ) + (‖g_k‖^4/‖g_{k-2}‖^4) ‖p_{k-2}‖^2
        ≤ c_3 ‖g_k‖^4 ( ‖g_k‖^{-2} + ‖g_{k-1}‖^{-2} + ‖g_{k-2}‖^{-2} ) + (‖g_k‖^4/‖g_{k-3}‖^4) ‖p_{k-3}‖^2
        ≤ c_3 ‖g_k‖^4 ∑_{j=1}^{k} ‖g_j‖^{-2}
Conjugate Direction minimization 98 / 106

slide-99
SLIDE 99

Nonlinear Conjugate Gradient extension convergence analysis

Proof.

(4/4).

Suppose now, by contradiction, that there exists δ > 0 such that ‖g_k‖ ≥ δ.^a By using the regularity assumptions we have

‖p_k‖^2 ≤ c_3 ‖g_k‖^4 ∑_{j=1}^{k} ‖g_j‖^{-2} ≤ c_3 ‖g_k‖^4 δ^{-2} k

Substituting in the Zoutendijk condition we have

∞ > ∑_{k=1}^∞ ‖g_k‖^4/‖p_k‖^2 ≥ (δ^2/c_3) ∑_{k=1}^∞ 1/k = ∞

which contradicts the assumption.

^a The correct assumption is that there exists k_0 such that ‖g_k‖ ≥ δ for k ≥ k_0, but this complicates the inequalities a little without introducing new ideas.

Conjugate Direction minimization 99 / 106

slide-100
SLIDE 100

Nonlinear Conjugate Gradient extension

Weakness of Fletcher and Reeves method

Suppose that p_k is a bad search direction, i.e. cos θ_k ≈ 0. From the descent direction bound lemma (see slide 90) we have

(1/(1 − c_2)) ‖g_k‖/‖p_k‖ ≥ cos θ_k ≥ ((1 − 2c_2)/(1 − c_2)) ‖g_k‖/‖p_k‖ > 0

so that to have cos θ_k ≈ 0 we need ‖p_k‖ ≫ ‖g_k‖. Since p_k is a bad direction, nearly orthogonal to g_k, it is likely that the step is small and x_{k+1} ≈ x_k. If so, we also have g_{k+1} ≈ g_k and β_{k+1}^FR ≈ 1; but remember that p_{k+1} ← −g_{k+1} + β_{k+1}^FR p_k, so that p_{k+1} ≈ p_k. This means that a long sequence of unproductive iterates will follow.

Conjugate Direction minimization 100 / 106

slide-101
SLIDE 101

Nonlinear Conjugate Gradient extension Polak and Ribière

Polak and Ribière Nonlinear Conjugate Gradient

1 The previous problem can be avoided if we restart anew when the iterates stagnate.

2 Restarting is obtained by simply setting β_k^FR = 0.

3 A more elegant solution can be obtained with a new definition of β_k due to Polak and Ribière:

β_k^PR = g_k^T (g_k − g_{k-1}) / (g_{k-1}^T g_{k-1})

4 This definition of β_k^PR is identical to β_k^FR in the case of a quadratic function, because g_k^T g_{k-1} = 0. The definitions differ in the nonlinear case; in particular, when there is stagnation, i.e. g_k ≈ g_{k-1}, we have β_k^PR ≈ 0, i.e. we have an automatic restart.

Conjugate Direction minimization 101 / 106

slide-102
SLIDE 102

Nonlinear Conjugate Gradient extension Polak and Ribière

Polak and Ribière Nonlinear Conjugate Gradient

initial step: k ← 0; x_0 assigned; f_0 ← f(x_0); g_0 ← ∇f(x_0)^T; p_0 ← −g_0;
while ‖g_k‖ > ǫ do
  k ← k + 1;
  Conjugate direction method
  Compute α_k by line-search;
  x_k ← x_{k-1} + α_k p_{k-1};
  g_k ← ∇f(x_k)^T;
  Residual orthogonalization
  β_k^PR ← g_k^T (g_k − g_{k-1}) / (g_{k-1}^T g_{k-1});
  p_k ← −g_k + β_k^PR p_{k-1};
end while
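A small Python sketch (illustrative) contrasting the two coefficients: under stagnation β^PR vanishes (automatic restart) while β^FR stays near 1, and for orthogonal successive gradients, as in the quadratic case, the two coincide:

```python
import numpy as np

def beta_fr(g, g_prev):
    """Fletcher-Reeves coefficient."""
    return (g @ g) / (g_prev @ g_prev)

def beta_pr(g, g_prev):
    """Polak-Ribiere coefficient."""
    return (g @ (g - g_prev)) / (g_prev @ g_prev)

rng = np.random.default_rng(7)

# Stagnation: the new gradient is almost the old one.
g_prev = rng.standard_normal(5)
g = g_prev + 1e-8 * rng.standard_normal(5)
assert abs(beta_pr(g, g_prev)) < 1e-6           # automatic restart
assert abs(beta_fr(g, g_prev) - 1.0) < 1e-6     # no restart

# Quadratic case: successive CG gradients are orthogonal, so the
# two formulas coincide (g @ g_prev = 0).
g_prev = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
g = np.array([0.0, 2.0, 0.0, 0.0, 0.0])
assert np.isclose(beta_fr(g, g_prev), beta_pr(g, g_prev))
```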

Conjugate Direction minimization 102 / 106

slide-103
SLIDE 103

Nonlinear Conjugate Gradient extension Polak and Ribière

Weakness of Polak and Ribière method

(1/2)

Although the modification is minimal, for the Polak and Ribière method with strong Wolfe line-search it can happen that p_k is not a descent direction. If p_k is not a descent direction we can restart, i.e. set β_k^PR = 0, or modify β_k^PR as follows:

β_k^PR+ = max{β_k^PR, 0}

This new coefficient, together with a modified Wolfe line-search, ensures that p_k is a descent direction.

Conjugate Direction minimization 103 / 106

slide-104
SLIDE 104

Nonlinear Conjugate Gradient extension Polak and Ribière

Weakness of Polak and Ribière method

(2/2)

The Polak and Ribière choice on average performs better than Fletcher and Reeves, but there are no convergence results! Although there are no convergence results, there is a negative result due to Powell:

Theorem

Consider the Polak and Ribière method with exact line-search. There exists a twice continuously differentiable function f : ℝ³ → ℝ and a starting point x_0 such that the sequence of gradients {‖g_k‖} is bounded away from zero.

However, in spite of this result, Polak and Ribière is the first choice among conjugate direction methods.

Conjugate Direction minimization 104 / 106

slide-105
SLIDE 105

Nonlinear Conjugate Gradient extension

Other choices

There are many other modifications of the coefficient β_k that collapse to the same coefficient in the case of a quadratic function. One important choice is the Hestenes and Stiefel choice:

β_k^HS = g_k^T (g_k − g_{k-1}) / ((g_k − g_{k-1})^T p_{k-1})

For this choice there are convergence results similar to those of Fletcher and Reeves, and similar performance.

Conjugate Direction minimization 105 / 106

slide-106
SLIDE 106

References

References

J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Classics in Applied Mathematics, 16, 1996.

J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, 1999.

J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer-Verlag, Texts in Applied Mathematics, 12, 2002.

Conjugate Direction minimization 106 / 106