
MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye - PowerPoint PPT Presentation


  1. MATH 4211/6211 – Optimization: Quasi-Newton Method. Xiaojing Ye, Department of Mathematics & Statistics, Georgia State University

  2. Quasi-Newton Method

     Motivation: Approximate the inverse Hessian $(\nabla^2 f(x^{(k)}))^{-1}$ in Newton's method by some $H_k$:

     $$x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$$

     That is, the search direction is set to $d^{(k)} = -H_k g^{(k)}$. Based on $H_k$, $x^{(k)}$, and $g^{(k)}$, the quasi-Newton method generates the next $H_{k+1}$, and so on.

  3. Proposition. If $f \in C^1$, $g^{(k)} \neq 0$, and $H_k \succ 0$, then $d^{(k)} = -H_k g^{(k)}$ is a descent direction.

     Proof. Let $x^{(k+1)} = x^{(k)} - \alpha H_k g^{(k)}$ for some $\alpha > 0$. Since $H_k \succ 0$ and $g^{(k)} \neq 0$, we have $g^{(k)\top} H_k g^{(k)} > 0$, so by Taylor's expansion

     $$f(x^{(k+1)}) = f(x^{(k)}) - \alpha\, g^{(k)\top} H_k g^{(k)} + o(\|H_k g^{(k)}\| \alpha) < f(x^{(k)})$$

     for $\alpha$ sufficiently small.
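
As a small numerical illustration of the proposition, the sketch below checks that $f$ decreases along $-Hg$ for a few small step sizes. The test function, the matrix `H`, the starting point, and the step sizes are arbitrary choices made for this example, not part of the slides.

```python
def f(x):
    # Hypothetical smooth test function: f(x) = x0^4 + x0*x1 + (1 + x1)^2
    return x[0] ** 4 + x[0] * x[1] + (1 + x[1]) ** 2

def grad(x):
    # Gradient of the test function above
    return [4 * x[0] ** 3 + x[1], x[0] + 2 * (1 + x[1])]

def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

H = [[2.0, 0.5], [0.5, 1.0]]       # symmetric positive definite (assumed example)
x = [1.0, 1.0]
g = grad(x)
d = [-c for c in matvec(H, g)]     # search direction d = -H g

# Directional derivative g^T d = -g^T H g, negative because H is positive definite.
slope = sum(gi * di for gi, di in zip(g, d))

# f should decrease for each of these small step sizes.
decreases = [
    f([xi + alpha * di for xi, di in zip(x, d)]) < f(x)
    for alpha in (1e-1, 1e-2, 1e-3)
]
```

Here `slope` plays the role of $-g^{(k)\top} H_k g^{(k)}$ in the Taylor expansion; its negativity is what drives the decrease.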

  4. Recall that for quadratic functions with $Q \succ 0$, the Hessian is $H^{(k)} = Q$ for all $k$, and

     $$g^{(k+1)} - g^{(k)} = Q (x^{(k+1)} - x^{(k)})$$

     For notational simplicity, we denote

     $$\Delta x^{(k)} = x^{(k+1)} - x^{(k)} \quad \text{and} \quad \Delta g^{(k)} = g^{(k+1)} - g^{(k)}$$

     Then we can write the identity above as

     $$\Delta g^{(k)} = Q \Delta x^{(k)}$$

     or equivalently

     $$Q^{-1} \Delta g^{(k)} = \Delta x^{(k)}$$
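
To see where this identity comes from, here is a short derivation, assuming the quadratic is written as $f(x) = \frac{1}{2} x^\top Q x - b^\top x$ with symmetric $Q$ (the vector $b$ belongs to that assumed form):

```latex
\begin{aligned}
g^{(k)} &= \nabla f(x^{(k)}) = Q x^{(k)} - b, \\
g^{(k+1)} - g^{(k)} &= (Q x^{(k+1)} - b) - (Q x^{(k)} - b) = Q (x^{(k+1)} - x^{(k)}).
\end{aligned}
```

The term $b$ cancels, so the relation holds for every pair of consecutive iterates regardless of how the step was chosen.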

  5. In the quasi-Newton method, $H_k$ takes the place of $Q^{-1}$:

     $$\text{Newton:} \quad x^{(k+1)} = x^{(k)} - \alpha_k Q^{-1} g^{(k)}$$

     $$\text{Quasi-Newton:} \quad x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$$

     Therefore we would like to have a sequence of $H_k$ with the same property as $Q^{-1}$:

     $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

     for all $k = 0, 1, 2, \dots$.

  6. If this is true, then at iteration $n$ we have

     $$H_n \Delta g^{(0)} = \Delta x^{(0)}, \quad H_n \Delta g^{(1)} = \Delta x^{(1)}, \quad \dots, \quad H_n \Delta g^{(n-1)} = \Delta x^{(n-1)}$$

     or

     $$H_n [\Delta g^{(0)}, \dots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \dots, \Delta x^{(n-1)}].$$

     On the other hand,

     $$Q^{-1} [\Delta g^{(0)}, \dots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \dots, \Delta x^{(n-1)}].$$

     If $[\Delta g^{(0)}, \dots, \Delta g^{(n-1)}]$ is invertible, then we have $H_n = Q^{-1}$. Then at iteration $n+1$ we have

     $$x^{(n+1)} = x^{(n)} - \alpha_n H_n g^{(n)} = x^*$$

     since this is the same as Newton's update. Hence for quadratic functions, the quasi-Newton method would converge in at most $n$ steps.

  7. Quasi-Newton method

     $$\begin{aligned}
     d^{(k)} &= -H_k g^{(k)} \\
     \alpha_k &= \arg\min_{\alpha \ge 0} f(x^{(k)} + \alpha d^{(k)}) \\
     x^{(k+1)} &= x^{(k)} + \alpha_k d^{(k)}
     \end{aligned}$$

     where $H_0, H_1, \dots$ are symmetric. Moreover, for quadratic functions of the form $f(x) = \frac{1}{2} x^\top Q x - b^\top x$, the matrices $H_0, H_1, \dots$ are required to satisfy

     $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$
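
For the quadratic form above, the exact line search has a closed form. The following derivation is a sketch under that assumption; $\phi$ is an auxiliary name introduced here.

```latex
\begin{aligned}
\phi(\alpha) &= f(x^{(k)} + \alpha d^{(k)}), \\
\phi'(\alpha) &= \big(Q(x^{(k)} + \alpha d^{(k)}) - b\big)^\top d^{(k)}
              = g^{(k)\top} d^{(k)} + \alpha\, d^{(k)\top} Q d^{(k)}, \\
\phi'(\alpha_k) = 0 \;&\Longrightarrow\;
\alpha_k = -\frac{g^{(k)\top} d^{(k)}}{d^{(k)\top} Q d^{(k)}}
         = \frac{g^{(k)\top} H_k g^{(k)}}{d^{(k)\top} Q d^{(k)}}.
\end{aligned}
```

When $H_k \succ 0$, $Q \succ 0$, and $g^{(k)} \neq 0$, this $\alpha_k$ is positive, so the constraint $\alpha \ge 0$ is inactive.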

  8. Theorem. Consider a quasi-Newton algorithm applied to a quadratic function with symmetric $Q \succ 0$, such that for all $k = 0, 1, \dots, n-1$,

     $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

     and the $H_k$ are all symmetric. If $\alpha_i \neq 0$ for $0 \le i \le k$, then $d^{(0)}, \dots, d^{(n)}$ are $Q$-conjugate.

  9. Proof. We prove by induction. It is trivial to show $g^{(1)\top} d^{(0)} = 0$, which gives the base case by the same computation as below. Assume the claim holds for some $k < n-1$. We have for $i \le k$ that

     $$\begin{aligned}
     d^{(k+1)\top} Q d^{(i)} &= -(H_{k+1} g^{(k+1)})^\top Q d^{(i)} \\
     &= -g^{(k+1)\top} H_{k+1} \frac{Q \Delta x^{(i)}}{\alpha_i} \\
     &= -g^{(k+1)\top} H_{k+1} \frac{\Delta g^{(i)}}{\alpha_i} \\
     &= -g^{(k+1)\top} \frac{\Delta x^{(i)}}{\alpha_i} \\
     &= -g^{(k+1)\top} d^{(i)}
     \end{aligned}$$

     Since $d^{(0)}, \dots, d^{(k)}$ are $Q$-conjugate, we know $g^{(k+1)\top} d^{(i)} = 0$ for all $i \le k$. Hence $d^{(0)}, \dots, d^{(k)}, d^{(k+1)}$ are $Q$-conjugate. By induction the claim holds.

  10. The theorem above also shows that the quasi-Newton method is a conjugate direction method, and hence converges in $n$ steps for quadratic objective functions. In practice, there are various ways to generate $H_k$ such that

      $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

      Now we learn three algorithms that produce such $H_k$.

  11. Rank one correction formula

      Suppose we would like to update $H_k$ to $H_{k+1}$ by adding a rank one matrix $a_k z^{(k)} z^{(k)\top}$ for some $a_k \in \mathbb{R}$ and $z^{(k)} \in \mathbb{R}^n$:

      $$H_{k+1} = H_k + a_k z^{(k)} z^{(k)\top}$$

      Now let us derive what this $a_k z^{(k)} z^{(k)\top}$ should be. Since we need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i \le k$, we at least need $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. That is,

      $$\Delta x^{(k)} = H_{k+1} \Delta g^{(k)} = (H_k + a_k z^{(k)} z^{(k)\top}) \Delta g^{(k)} = H_k \Delta g^{(k)} + a_k (z^{(k)\top} \Delta g^{(k)}) z^{(k)}$$

  12. Therefore

      $$z^{(k)} = \frac{\Delta x^{(k)} - H_k \Delta g^{(k)}}{a_k (z^{(k)\top} \Delta g^{(k)})}$$

      and hence

      $$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{a_k (z^{(k)\top} \Delta g^{(k)})^2}$$

      On the other hand, multiplying both sides of $\Delta x^{(k)} - H_k \Delta g^{(k)} = a_k (z^{(k)\top} \Delta g^{(k)}) z^{(k)}$ on the left by $\Delta g^{(k)\top}$, we obtain

      $$\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)}) = a_k (z^{(k)\top} \Delta g^{(k)})^2.$$

      Hence

      $$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})}$$

      This is the rank one correction formula.
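
The formula translates directly into a few lines of code. The sketch below is plain Python with lists as vectors; the helper names (`matvec`, `dot`, `rank_one_update`), the zero-denominator guard with its tolerance, and the test data are illustrative assumptions, not something taken from the slides.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(H, dx, dg, tol=1e-12):
    """Return H + u u^T / (dg^T u), where u = dx - H dg."""
    u = [a - b for a, b in zip(dx, matvec(H, dg))]
    denom = dot(dg, u)
    if abs(denom) < tol:
        # Assumed safeguard: the formula is undefined when the denominator vanishes.
        raise ZeroDivisionError("rank one denominator is (nearly) zero")
    n = len(H)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]

# Hypothetical data: dg = Q dx with Q = diag(2, 4).
H0 = [[1.0, 0.0], [0.0, 1.0]]
dx = [1.0, 1.0]
dg = [2.0, 4.0]

H1 = rank_one_update(H0, dx, dg)
secant_residual = [a - b for a, b in zip(matvec(H1, dg), dx)]  # H1 dg - dx
```

By construction `secant_residual` should vanish up to rounding, which is the condition $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$ used to derive the formula.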

  13. We obtained the formula by requiring $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. However, we also need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$. This turns out to be true automatically:

      Theorem. For the rank one algorithm applied to quadratic functions with symmetric Hessian $Q$, we have

      $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

      for $k = 0, 1, \dots, n-1$.

  14. Proof. We have shown $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$ for all $k = 0, 1, 2, \dots$. We use induction: assume the identities hold up to $k$, that is, $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$. We then only need to show $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$:

      $$\begin{aligned}
      H_{k+1} \Delta g^{(i)} &= \left( H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})} \right) \Delta g^{(i)} \\
      &= \Delta x^{(i)} + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top \Delta g^{(i)}}{\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})}
      \end{aligned}$$

      Note that

      $$(H_k \Delta g^{(k)})^\top \Delta g^{(i)} = \Delta g^{(k)\top} H_k \Delta g^{(i)} = \Delta g^{(k)\top} \Delta x^{(i)} = \Delta x^{(k)\top} Q \Delta x^{(i)} = \Delta x^{(k)\top} \Delta g^{(i)}$$

      Hence $(\Delta x^{(k)} - H_k \Delta g^{(k)})^\top \Delta g^{(i)} = 0$, so the second term on the right is zero, and we obtain

      $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$$

      This completes the proof.
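
The theorem can be checked numerically on a small problem. The sketch below runs the rank one algorithm with exact line search on a hypothetical 2-by-2 quadratic (the matrix `Q`, vector `b`, starting point, and $H_0 = I$ are assumptions chosen for illustration) and records every $(\Delta x, \Delta g)$ pair so that all secant equations can be tested against the final matrix.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(H, dx, dg):
    # H + u u^T / (dg^T u) with u = dx - H dg (no safeguard in this sketch)
    u = [a - b for a, b in zip(dx, matvec(H, dg))]
    denom = dot(dg, u)
    n = len(H)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]

# Hypothetical problem: f(x) = 1/2 x^T Q x - b^T x, Q symmetric positive definite.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]                # H_0 = I
pairs = []

for k in range(len(x)):
    g = [q - bi for q, bi in zip(matvec(Q, x), b)]   # gradient Qx - b
    d = [-c for c in matvec(H, g)]                   # d = -H g
    alpha = -dot(g, d) / dot(d, matvec(Q, d))        # exact line search on a quadratic
    dx = [alpha * di for di in d]
    dg = matvec(Q, dx)                               # dg = Q dx
    H = rank_one_update(H, dx, dg)
    x = [xi + si for xi, si in zip(x, dx)]
    pairs.append((dx, dg))

# Largest violation of H_n dg^(i) = dx^(i) over all stored pairs.
max_residual = max(
    abs(a - b) for dx, dg in pairs for a, b in zip(matvec(H, dg), dx)
)
Q_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
```

For this problem the final `H` should agree with `Q_inv` up to rounding, and `x` should land on the minimizer $Q^{-1} b = (1/11, 7/11)$.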

  15. Issues with the rank one correction formula:

      • $H_{k+1}$ may not be positive definite even if $H_k$ is. Hence $-H_k g^{(k)}$ may not be a descent direction;

      • the denominator in the rank one correction is $\Delta g^{(k)\top} (\Delta x^{(k)} - H_k \Delta g^{(k)})$, which can be close to 0 and make the computation unstable.
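
The first issue is easy to reproduce. In the sketch below the data are hypothetical: $H_k = I$, $\Delta x = (0.5, 1)$, $\Delta g = (1, 0)$, chosen so that $\Delta x^\top \Delta g > 0$ (as it would be for a nonzero step on a quadratic with $Q \succ 0$) while the rank one denominator is negative.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(H, dx, dg):
    # H + u u^T / (dg^T u) with u = dx - H dg
    u = [a - b for a, b in zip(dx, matvec(H, dg))]
    denom = dot(dg, u)
    n = len(H)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]

H = [[1.0, 0.0], [0.0, 1.0]]      # positive definite
dx = [0.5, 1.0]
dg = [1.0, 0.0]

H_new = rank_one_update(H, dx, dg)   # [[0.5, 1.0], [1.0, -1.0]]

# The secant equation still holds ...
secant_ok = matvec(H_new, dg) == dx
# ... but e^T H_new e = -1 for e = (0, 1), so H_new is not positive definite.
e = [0.0, 1.0]
quad_form = dot(e, matvec(H_new, e))
```

A negative diagonal entry is enough to rule out positive definiteness, so a direction $-H g$ built from `H_new` is no longer guaranteed to be a descent direction.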

  16. We now study the DFP algorithm, which improves on the rank one correction formula by ensuring positive definiteness of $H_k$.

      DFP algorithm [Davidon 1959, Fletcher and Powell 1963]

      $$H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)\top}}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^\top}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$
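
A direct transcription of the DFP formula might look as follows. This is a minimal sketch in plain Python; the function and helper names and the test data are assumptions made for illustration, and no safeguards against zero denominators are included.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp_update(H, dx, dg):
    """Return H + dx dx^T / (dx^T dg) - (H dg)(H dg)^T / (dg^T H dg)."""
    Hdg = matvec(H, dg)
    sy = dot(dx, dg)          # dx^T dg
    yHy = dot(dg, Hdg)        # dg^T H dg
    n = len(H)
    return [[H[i][j] + dx[i] * dx[j] / sy - Hdg[i] * Hdg[j] / yHy
             for j in range(n)] for i in range(n)]

# Hypothetical data: dg = Q dx with Q = diag(2, 4).
H0 = [[1.0, 0.0], [0.0, 1.0]]
dx = [1.0, 1.0]
dg = [2.0, 4.0]

H1 = dfp_update(H0, dx, dg)
secant_residual = [a - b for a, b in zip(matvec(H1, dg), dx)]  # H1 dg - dx
```

The residual should vanish up to rounding: the second term maps $\Delta g$ to $\Delta x$, and the third term cancels $H_k \Delta g$.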

  17. We first show that DFP is a quasi-Newton method.

      Theorem. The DFP algorithm applied to quadratic functions satisfies

      $$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$

      for all $k$.

  18. Proof. We prove this by induction. It is trivial for $k = 0$. Assume the claim is true for $k$, i.e., $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for all $i \le k-1$. We first have $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i = k$ by direct computation. For $i < k$, we have

      $$H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} + \frac{\Delta x^{(k)} (\Delta x^{(k)\top} \Delta g^{(i)})}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^\top \Delta g^{(i)}}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$

      Note that, by the induction assumption, $d^{(0)}, \dots, d^{(k)}$ are $Q$-conjugate, and hence

      $$\Delta x^{(k)\top} \Delta g^{(i)} = \Delta x^{(k)\top} Q \Delta x^{(i)} = \alpha_k \alpha_i\, d^{(k)\top} Q d^{(i)} = 0$$

      and similarly

      $$\Delta g^{(k)\top} H_k \Delta g^{(i)} = \Delta g^{(k)\top} \Delta x^{(i)} = 0.$$

      Hence the last two terms vanish and $H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} = \Delta x^{(i)}$. This completes the proof.
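
The statement, together with the earlier conjugacy theorem, can be checked on a small example. The sketch below runs DFP with exact line search on a hypothetical 2-by-2 quadratic; the problem data (`Q`, `b`, starting point, $H_0 = I$) and all helper names are assumptions chosen for illustration.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp_update(H, dx, dg):
    # H + dx dx^T / (dx^T dg) - (H dg)(H dg)^T / (dg^T H dg)
    Hdg = matvec(H, dg)
    sy, yHy = dot(dx, dg), dot(dg, Hdg)
    n = len(H)
    return [[H[i][j] + dx[i] * dx[j] / sy - Hdg[i] * Hdg[j] / yHy
             for j in range(n)] for i in range(n)]

# Hypothetical problem: f(x) = 1/2 x^T Q x - b^T x, Q symmetric positive definite.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]                # H_0 = I
directions = []

for k in range(len(x)):
    g = [q - bi for q, bi in zip(matvec(Q, x), b)]   # gradient Qx - b
    d = [-c for c in matvec(H, g)]                   # d = -H g
    alpha = -dot(g, d) / dot(d, matvec(Q, d))        # exact line search on a quadratic
    dx = [alpha * di for di in d]
    dg = matvec(Q, dx)                               # dg = Q dx
    H = dfp_update(H, dx, dg)
    x = [xi + si for xi, si in zip(x, dx)]
    directions.append(d)

conjugacy = dot(directions[0], matvec(Q, directions[1]))   # d0^T Q d1
Q_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
```

After $n = 2$ steps, `conjugacy` should be zero up to rounding, `H` should match `Q_inv`, and `x` should equal the minimizer $(1/11, 7/11)$.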

  19. Next we show that $H_{k+1}$ inherits the positive definiteness of $H_k$ in the DFP algorithm.

      Theorem. Suppose $g^{(k)} \neq 0$. Then $H_k \succ 0$ implies $H_{k+1} \succ 0$ in DFP.

      Proof. For any $x \in \mathbb{R}^n$, we have

      $$x^\top H_{k+1} x = x^\top H_k x + \frac{(x^\top \Delta x^{(k)})^2}{\Delta x^{(k)\top} \Delta g^{(k)}} - \frac{(x^\top H_k \Delta g^{(k)})^2}{\Delta g^{(k)\top} H_k \Delta g^{(k)}}$$

      For notational simplicity, we denote

      $$a = H_k^{1/2} x \quad \text{and} \quad b = H_k^{1/2} \Delta g^{(k)}$$

      where $H_k = H_k^{1/2} H_k^{1/2}$ (we know $H_k^{1/2}$ exists since $H_k$ is SPD).
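
As a numerical illustration of the theorem's claim (not a substitute for the proof), the sketch below applies one DFP update to hypothetical data with $H_k = I$ and $\Delta x^\top \Delta g > 0$, then tests positive definiteness of the 2-by-2 result via its leading principal minors. The data and helper names are assumptions made for this example.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp_update(H, dx, dg):
    # H + dx dx^T / (dx^T dg) - (H dg)(H dg)^T / (dg^T H dg)
    Hdg = matvec(H, dg)
    sy, yHy = dot(dx, dg), dot(dg, Hdg)
    n = len(H)
    return [[H[i][j] + dx[i] * dx[j] / sy - Hdg[i] * Hdg[j] / yHy
             for j in range(n)] for i in range(n)]

H = [[1.0, 0.0], [0.0, 1.0]]      # positive definite
dx = [0.5, 1.0]
dg = [1.0, 0.0]                   # dx^T dg = 0.5 > 0

H_new = dfp_update(H, dx, dg)     # [[0.5, 1.0], [1.0, 3.0]]

# Sylvester's criterion for a symmetric 2x2 matrix: both leading minors positive.
minor1 = H_new[0][0]
minor2 = H_new[0][0] * H_new[1][1] - H_new[0][1] * H_new[1][0]
is_positive_definite = minor1 > 0 and minor2 > 0
```

Here both minors equal 0.5, so `H_new` is positive definite, and it still maps `dg` to `dx`.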
