Calculating Hypergradient


  1. Calculating Hypergradient. Jingchang Liu, HKUST, November 13, 2019

  2. Table of Contents: Background; Bilevel Optimization; Forward and Reverse Gradient-Based Hyperparameter Optimization; Conclusion; Q & A

  3. Background

  4. Hyperparameter Optimization
Tradeoff parameter
• The dataset is split in two: S_train and S_test.
• Suppose we add an ℓ2-norm regularization term; then
  argmin_{λ ∈ D} loss(S_test, X(λ))    (1)
  s.t.  X(λ) ∈ argmin_{x ∈ R^p} loss(S_train, x) + e^λ ‖x‖².
Stepsize
For gradient descent with momentum:
  v_t = μ v_{t-1} + ∇J_t(w_{t-1}),
  w_t = w_{t-1} − η (μ v_{t-1} + ∇J_t(w_{t-1})),  i.e.  w_t = w_{t-1} − η v_t.
The hyperparameters are μ and η.

  5. Group Lasso
Traditional group lasso
To induce a group-sparse effect on the parameter w, we solve
  ŵ ∈ argmin_{w ∈ R^p} ½ ‖y − Xw‖² + λ Σ_{l=1}^{L} ‖w_{G_l}‖₂,    (2)
where the features are partitioned into L groups {G_1, G_2, ..., G_L}.
• But we need to specify the partition ourselves beforehand.
• How can we learn the partition?
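As a concrete reference point, here is a minimal sketch of evaluating the objective in (2) for a fixed, hand-chosen partition; the data shapes, the partition, and λ = 0.1 are all illustrative assumptions:

```python
import numpy as np

def group_lasso_objective(w, X, y, groups, lam):
    # 0.5 * ||y - Xw||^2 + lam * sum_l ||w_{G_l}||_2 for a fixed partition.
    fit = 0.5 * np.sum((y - X @ w) ** 2)
    penalty = sum(np.linalg.norm(w[g]) for g in groups)
    return fit + lam * penalty

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 6)), rng.normal(size=20)
w = rng.normal(size=6)
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]  # hand-chosen G_1, G_2
print(group_lasso_objective(w, X, y, groups, lam=0.1))
```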

  6. Group Lasso
• Encapsulate the group structure in a hyperparameter θ = [θ_1, θ_2, ..., θ_L] ∈ {0, 1}^{P×L}, where L is the maximum number of groups and P is the number of features.
• θ_{p,l} = 1 if the p-th feature belongs to the l-th group, and 0 otherwise.
Formulation for learning θ:
  θ̂ ∈ argmin_{θ ∈ {0,1}^{P×L}} C(ŵ(θ)),    (3)
where C(ŵ(θ)) can be the validation error C(ŵ(θ)) = ½ ‖y′ − X′ ŵ(θ)‖², and
  ŵ(θ) = argmin_{w ∈ R^P} ½ ‖y − Xw‖² + λ Σ_{l=1}^{L} ‖θ_l ⊙ w‖₂.    (4)
A small sketch of the θ-parameterized penalty follows.
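A minimal sketch of the penalty term in (4); the shapes, the particular 0/1 matrix θ, and the weight vector are illustrative assumptions:

```python
import numpy as np

# theta has shape (P, L); column l is the 0/1 indicator of group l.
P, L = 4, 2
theta = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])            # features 0,1 in group 1; features 2,3 in group 2
w = np.array([0.5, -1.0, 0.0, 2.0])

# sum_l || theta_l (elementwise *) w ||_2
penalty = sum(np.linalg.norm(theta[:, l] * w) for l in range(L))
print(penalty)  # ||(0.5, -1, 0, 0)||_2 + ||(0, 0, 0, 2)||_2
```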

  7. Bilevel Optimization

  8. Bilevel Optimization
Both examples can be cast as the following optimization problem:
  min_x f_U(x, y)
  s.t.  y ∈ argmin_{y′} f_L(x, y′).    (5)
• f_U is the upper-level objective, over the two variables x and y.
• f_L is the lower-level objective, which binds y as a function of x.
• (5) can simply be viewed as a special case of constrained optimization.
• If we can get the analytic solution y*(x) of y, then we just need to solve the single-level problem min_x f_U(x, y*(x)), as in the sketch below.
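A toy sketch of the "analytic y*(x)" case; both objectives here are illustrative assumptions. With f_L(x, y′) = (y′ − x)², the lower level gives y*(x) = x, so the bilevel problem collapses to the single-level problem min_x f_U(x, x):

```python
# Toy bilevel problem with an analytic lower-level solution.
def y_star(x):
    return x                       # argmin over y' of (y' - x)^2

def single_level_grad(x):
    # d/dx f_U(x, y*(x)) = df_U/dx + (df_U/dy) * (dy*/dx), with
    # f_U(x, y) = (y - 1)^2 + 0.1 * x^2 (assumption) and dy*/dx = 1.
    y = y_star(x)
    return 0.2 * x + 2.0 * (y - 1.0)

x = 3.0
for _ in range(200):
    x -= 0.1 * single_level_grad(x)
print(x)  # approaches 2/2.2 = 0.909..., the minimizer of f_U(x, x)
```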

  9. Gradient
Compute the gradient of the solution to the lower-level problem with respect to the variables in the upper-level problem:
  x ← x − η (∂f_U/∂x + (∂f_U/∂y)(∂y/∂x)) |_{(x, y*)}.    (6)
How to calculate ∂y/∂x?
Theorem. Let f : R × R → R be a continuous function with first and second derivatives, and let g(x) = argmin_y f(x, y). Then the derivative of g with respect to x is
  dg(x)/dx = − f_xy(x, g(x)) / f_yy(x, g(x)),    (7)
where f_xy = ∂²f/(∂x ∂y) and f_yy = ∂²f/∂y².
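A quick numerical sanity check of (7) on a function where g is known in closed form; the choice of f is an illustrative assumption. For f(x, y) = (y − 3x)² we have g(x) = 3x, so dg/dx = 3:

```python
# f(x, y) = (y - 3x)^2  =>  f_xy = -6, f_yy = 2, so -f_xy/f_yy = 3 = dg/dx.
def f_xy(x, y):
    return -6.0          # d^2 f / (dx dy) for this f

def f_yy(x, y):
    return 2.0           # d^2 f / dy^2 for this f

x = 1.7
y = 3.0 * x              # g(x)
print(-f_xy(x, y) / f_yy(x, y))  # 3.0, matching dg/dx exactly
```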

  10. Proof
1. Since g(x) = argmin_y f(x, y), the first-order condition gives ∂f(x, y)/∂y |_{y = g(x)} = 0.
2. Differentiating both sides with respect to x gives (d/dx) ∂f(x, g(x))/∂y = 0.
3. By the chain rule,
  (d/dx) ∂f(x, g(x))/∂y = ∂²f(x, g(x))/(∂x ∂y) + (∂²f(x, g(x))/∂y²) (dg(x)/dx).    (8)
Equating to zero and rearranging gives
  dg(x)/dx = − (∂²f(x, g(x))/∂y²)^{-1} ∂²f(x, g(x))/(∂x ∂y)    (9)
           = − f_xy(x, g(x)) / f_yy(x, g(x)).    (10)

  11. Lemma
Lemma 1. Let f : R × R^n → R be a continuous function with first and second derivatives, and let g(x) = argmin_{y ∈ R^n} f(x, y). Then the derivative of g with respect to x is
  g′(x) = − f_YY(x, g(x))^{-1} f_XY(x, g(x)),    (11)
where f_YY = ∇²_{yy} f(x, y) ∈ R^{n×n} and f_XY = (∂/∂x) ∇_y f(x, y) ∈ R^n.

  12. Application to hyperparameter optimization (ICML'16)
Hyperparameter optimization:
  argmin_{λ ∈ D} loss(S_test, X(λ))    (12)
  s.t.  X(λ) ∈ argmin_{x ∈ R^p} loss(S_train, x) + e^λ ‖x‖².
Gradient descent for the bilevel problem:
  x ← x − η (∂f_U/∂x + (∂f_U/∂y)(∂y/∂x)) |_{(x, y*)}    (13)
    = x − η (∂f_U/∂x − (∂f_U/∂y) (∂²f_L/∂y²)^{-1} ∂²f_L/(∂x ∂y)).    (14)
Gradient (with subscript 1 indexing the inner variable and 2 the hyperparameter; g is the outer loss and h the inner loss):
  ∇f = ∇₂ g − (∇²_{1,2} h)ᵀ (∇²₁ h)^{-1} ∇₁ g.
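A minimal sketch of this recipe for (12), assuming squared losses on both splits so the inner problem is ridge regression with penalty e^λ ‖w‖²; the data, shapes, and the exact linear solves are illustrative assumptions (HOAG itself solves these systems only approximately):

```python
import numpy as np

rng = np.random.default_rng(0)
A_tr, y_tr = rng.normal(size=(30, 5)), rng.normal(size=30)
A_te, y_te = rng.normal(size=(15, 5)), rng.normal(size=15)

def hypergradient(lam):
    # Inner Hessian and exact inner solution w(lambda) for
    # 0.5 * ||y_tr - A_tr w||^2 + exp(lam) * ||w||^2.
    H = A_tr.T @ A_tr + 2.0 * np.exp(lam) * np.eye(5)
    w = np.linalg.solve(H, A_tr.T @ y_tr)
    # Implicit differentiation: dw/dlam = -H^{-1} * d^2(inner)/(dw dlam).
    dw_dlam = -np.linalg.solve(H, 2.0 * np.exp(lam) * w)
    # Chain rule through the test loss 0.5 * ||y_te - A_te w||^2.
    return (A_te.T @ (A_te @ w - y_te)) @ dw_dlam

print(hypergradient(0.0))  # d(test loss)/d lambda at lambda = 0
```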

  13. HOAG

  14. Analysis
Conclusion
• If the sequence {ε_i}_{i=1}^∞ is summable, then HOAG converges to a stationary point of f.
Theorem. If the sequence {ε_i}_{i=1}^∞ is positive and satisfies
  Σ_{i=1}^∞ ε_i < ∞,
then the sequence λ_k of iterates in the HOAG algorithm has a limit λ* ∈ D. In particular, if λ* belongs to the interior of D, then ∇f(λ*) = 0.

  15. Forward and Reverse Gradient-Based Hyperparameter Optimization

  16. Formulation I
• Focus on the training procedure for an objective function J(w) with respect to w.
• The training procedure of SGD or its variants, such as momentum, RMSProp, and Adam, can be regarded as a dynamical system with state s_t ∈ R^d:
  s_t = Φ_t(s_{t-1}, λ),  t = 1, ..., T.
• For gradient descent with momentum:
  v_t = μ v_{t-1} + ∇J_t(w_{t-1}),
  w_t = w_{t-1} − η (μ v_{t-1} + ∇J_t(w_{t-1})).
  1. s_t = (w_t, v_t), s_t ∈ R^d;
  2. λ = (μ, η), λ ∈ R^m;
  3. Φ_t : R^d × R^m → R^d.
A minimal sketch of this state map appears below.
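The sketch below writes the momentum dynamics above as an explicit map Φ; the toy objective J(w) = ½ ‖w‖² is an illustrative assumption:

```python
import numpy as np

def Phi(s, lam, grad_J):
    # One application of s_t = Phi_t(s_{t-1}, lambda) for GD with momentum:
    # state s = (w, v), hyperparameters lambda = (mu, eta).
    w, v = s
    mu, eta = lam
    v_new = mu * v + grad_J(w)
    w_new = w - eta * v_new
    return (w_new, v_new)

grad_J = lambda w: w                 # toy objective J(w) = 0.5 * ||w||^2 (assumption)
s = (np.ones(3), np.zeros(3))        # s_0 = (w_0, v_0)
for t in range(50):                  # unroll the dynamical system for T = 50 steps
    s = Phi(s, lam=(0.9, 0.1), grad_J=grad_J)
print(s[0])                          # w_T decays toward the minimizer 0
```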

  17. Formulation II
• The iterates s_1, ..., s_T implicitly depend on the vector of hyperparameters λ.
• Goal: optimize the hyperparameters according to an error function E evaluated at the last iterate s_T.
• We wish to solve the problem min_{λ ∈ Λ} f(λ), where the set Λ ⊂ R^m incorporates constraints on the hyperparameters.
• The response function f : R^m → R is defined at λ ∈ R^m by f(λ) = E(s_T(λ)).

  18. Diagram
Figure 1: the iterates s_1, ..., s_T depend on the hyperparameters λ.
• Change the bilevel program to use the parameters at the last iterate s_T rather than ŵ:
  min_{λ ∈ Λ} f(λ),  where f(λ) = E(s_T(λ)).
• The hypergradient is ∇f(λ) = ∇E(s_T) (ds_T/dλ).

  19. Forward-Mode to calculate the hypergradient
• Chain rule: ∇f(λ) = ∇E(s_T) (ds_T/dλ), where ds_T/dλ is a d × m matrix.
• Since s_t = Φ_t(s_{t-1}, λ), Φ_t depends on λ both directly and indirectly through the state s_{t-1}:
  ds_t/dλ = (∂Φ_t(s_{t-1}, λ)/∂s_{t-1}) (ds_{t-1}/dλ) + ∂Φ_t(s_{t-1}, λ)/∂λ.
• Defining Z_t = ds_t/dλ, A_t = ∂Φ_t/∂s_{t-1}, and B_t = ∂Φ_t/∂λ, we rewrite it as
  Z_t = A_t Z_{t-1} + B_t,  t ∈ {1, ..., T},
as in the sketch below.
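A minimal sketch of the recurrence for plain gradient descent with a single hyperparameter λ = η; the quadratic J and the error E are illustrative assumptions:

```python
import numpy as np

# Forward mode for s_t = s_{t-1} - eta * grad_J(s_{t-1}) with lambda = eta,
# so Z_t = ds_t/d eta is just a d-vector propagated alongside s.
H = np.diag([1.0, 4.0])            # Hessian of J(s) = 0.5 * s^T H s (assumption)
grad_J = lambda s: H @ s

eta, T = 0.1, 30
s, Z = np.array([2.0, -1.0]), np.zeros(2)   # Z_0 = ds_0/d eta = 0
for t in range(T):
    A_t = np.eye(2) - eta * H      # A_t = d Phi_t / d s_{t-1}
    B_t = -grad_J(s)               # B_t = d Phi_t / d eta
    Z = A_t @ Z + B_t              # Z_t = A_t Z_{t-1} + B_t
    s = s - eta * grad_J(s)
E_grad = s - np.ones(2)            # grad E for E(s) = 0.5 * ||s - 1||^2 (assumption)
print(E_grad @ Z)                  # hypergradient dE(s_T)/d eta
```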

  20. Forward-mode Recurrence
Figure 2: the forward-mode recurrence.

  21. Forward-HG algorithm
Figure 3: the Forward-HG algorithm.

  22. Reverse-Mode to calculate the hypergradient
Reformulate the original problem as the constrained optimization problem
  min_{λ, s_1, ..., s_T} E(s_T)
  s.t.  s_t = Φ_t(s_{t-1}, λ),  t ∈ {1, ..., T}.
Lagrangian:
  L(s, λ, α) = E(s_T) + Σ_{t=1}^{T} α_t (Φ_t(s_{t-1}, λ) − s_t).

  23. Partial derivatives of the Lagrangian

  24. Derivations
Notation:
  A_t = ∂Φ_t(s_{t-1}, λ)/∂s_{t-1} ∈ R^{d×d},  B_t = ∂Φ_t(s_{t-1}, λ)/∂λ ∈ R^{d×m}.
Setting ∂L/∂s_t = 0 and ∂L/∂s_T = 0 gives
  α_t = ∇E(s_T)                   if t = T,
  α_t = ∇E(s_T) A_T ⋯ A_{t+1}     if t ∈ {1, ..., T − 1}.    (15)
Since ∂L/∂λ = Σ_{t=1}^{T} α_t B_t,
  ∂L/∂λ = Σ_{t=1}^{T} ∇E(s_T) (Π_{s=t+1}^{T} A_s) B_t.
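The same toy setup as the forward-mode sketch, now in reverse mode (all the same assumptions): a forward pass stores the trajectory, then α_t is propagated backwards while α_t B_t is accumulated; it returns exactly the same hypergradient:

```python
import numpy as np

H = np.diag([1.0, 4.0])                 # toy quadratic J (assumption, as before)
grad_J = lambda s: H @ s

eta, T = 0.1, 30
s = np.array([2.0, -1.0])
states = [s]
for t in range(T):                      # forward pass: store s_0, ..., s_T
    s = s - eta * grad_J(s)
    states.append(s)

alpha = states[-1] - np.ones(2)         # alpha_T = grad E(s_T) for E = 0.5*||s - 1||^2
A = np.eye(2) - eta * H                 # A_t is constant for a quadratic J
hypergrad = 0.0
for t in range(T, 0, -1):               # backward pass
    B_t = -grad_J(states[t - 1])        # B_t = d Phi_t / d eta
    hypergrad += alpha @ B_t            # accumulate alpha_t B_t
    alpha = alpha @ A                   # alpha_{t-1} = alpha_t A_t
print(hypergrad)                        # matches the forward-mode result
```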

  25. Reverse-HG algorithm
Figure 4: the Reverse-HG algorithm.
Truncated back-propagation: run the backward recursion only for t = T − 1 down to T − k.

  26. Real-Time HO
• For t ∈ {1, ..., T}, define f_t(λ) = E(s_t(λ)).
• Partial hypergradients are available in forward mode:
  ∇f_t(λ) = dE(s_t)/dλ = ∇E(s_t) Z_t.
• Significance: we can update the hyperparameters within a single epoch, without having to wait until time T; see the sketch below.
Figure 5: the iterates s_1, ..., s_T depend on the hyperparameters λ.
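A toy sketch of the real-time scheme under the same assumptions as the earlier sketches, updating η from the partial hypergradient ∇f_t(η) = ∇E(s_t) Z_t at every step and projecting onto Λ; the hyper-learning-rate β and the box Λ = [0.01, 0.4] are illustrative assumptions:

```python
import numpy as np

H = np.diag([1.0, 4.0])                  # toy quadratic J(s) = 0.5 * s^T H s
grad_J = lambda s: H @ s
beta = 0.01                              # hyper-learning-rate (assumption)

eta = 0.05
s, Z = np.array([2.0, -1.0]), np.zeros(2)
for t in range(200):
    A_t = np.eye(2) - eta * H            # d Phi_t / d s_{t-1}
    B_t = -grad_J(s)                     # d Phi_t / d eta
    Z = A_t @ Z + B_t                    # forward-mode accumulation of ds_t/d eta
    s = s - eta * grad_J(s)
    hg = s @ Z                           # grad f_t(eta) for E(s) = 0.5 * ||s||^2
    eta = float(np.clip(eta - beta * hg, 0.01, 0.4))  # project onto Lambda
print(eta)  # climbs toward the largest step size Lambda allows for this quadratic
```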

  27. Real-Time HO algorithm
Figure 6: the Real-Time HO algorithm.

  28. Analysis
• Forward and reverse mode have different time/space tradeoffs.
• Reverse mode needs to store the whole history of iterates s_1, ..., s_T.
• Forward mode needs a matrix-matrix multiplication (A_t Z_{t-1}) at every step.

  29. Conclusion

  30. Conclusions
• Calculating hypergradients, the gradients with respect to hyperparameters, is central to selecting good hyperparameters.
• We discussed two ways of calculating hypergradients: bilevel optimization and forward/reverse mode.
• In bilevel optimization we assume access to an optimal-solution set of the lower-level problem; in forward/reverse mode we consider the whole sequence of lower-level iterations.
• Calculating hypergradients via bilevel optimization involves solving the lower-level problem and two second-order derivatives, both of which are computationally expensive.
• Forward/reverse mode uses the chain rule, just as in training deep nets.

  31. Q & A
