  1. Anytime Online-to-Batch, Optimism, and Acceleration Ashok Cutkosky Google Research

  2. Stochastic Optimization First-Order Stochastic Optimization Find the minimum of some convex function F : W → ℝ using a stochastic gradient oracle: given w we can obtain a random variable g where E[g] = ∇F(w).

  3. Example: Stochastic Gradient Descent A popular algorithm is gradient descent: w_1 = 0, w_{t+1} = w_t − η_t g_t

  4. Example: Stochastic Gradient Descent A popular algorithm is gradient descent: w_1 = 0, w_{t+1} = w_t − η_t g_t. How should we analyze its convergence?
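The update above can be sketched in a few lines of Python. The quadratic objective and the Gaussian gradient noise here are illustrative stand-ins for the oracle, not anything from the talk:

```python
import numpy as np

def sgd(grad_oracle, T, eta0=1.0, dim=2):
    """Stochastic gradient descent: w_1 = 0, w_{t+1} = w_t - eta_t * g_t,
    with the common decaying step eta_t = eta0 / sqrt(t)."""
    w = np.zeros(dim)                      # w_1 = 0
    for t in range(1, T + 1):
        g = grad_oracle(w)                 # random g with E[g] = grad F(w)
        w = w - (eta0 / np.sqrt(t)) * g
    return w

# Hypothetical example: F(w) = 0.5 * ||w - c||^2 with a noisy gradient oracle.
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0])
oracle = lambda w: (w - c) + 0.1 * rng.standard_normal(2)
w_T = sgd(oracle, T=1000)
```

With this step-size schedule the final iterate lands close to the minimizer c, which is what the convergence analysis on the next slides quantifies.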

  5. Online Optimization For t = 1 . . . T, repeat: 1. Learner chooses a point w_t. 2. Environment presents learner with a gradient g_t (think E[g_t] = ∇F(w_t)). 3. Learner suffers loss ⟨g_t, w_t⟩. The objective is to minimize regret: R_T(w⋆) = Σ_{t=1}^T ⟨g_t, w_t⟩ (loss suffered) − ⟨g_t, w⋆⟩ (benchmark loss)

  6. Back to Gradient Descent w_{t+1} = w_t − η_t g_t. The simplest analysis chooses η_t ∝ 1/√T, but one can also do more complicated things like η_t ∝ 1/√(Σ_{i=1}^t ‖g_i‖²).

  7. Back to Gradient Descent w_{t+1} = w_t − η_t g_t. The simplest analysis chooses η_t ∝ 1/√T, but one can also do more complicated things like η_t ∝ 1/√(Σ_{i=1}^t ‖g_i‖²). These yield R_T(w⋆) ≤ ‖w⋆‖√T and R_T(w⋆) ≤ ‖w⋆‖√(Σ_{t=1}^T ‖g_t‖²), respectively.

  8. Back to Gradient Descent w_{t+1} = w_t − η_t g_t. The simplest analysis chooses η_t ∝ 1/√T, but one can also do more complicated things like η_t ∝ 1/√(Σ_{i=1}^t ‖g_i‖²). These yield R_T(w⋆) ≤ ‖w⋆‖√T and R_T(w⋆) ≤ ‖w⋆‖√(Σ_{t=1}^T ‖g_t‖²), respectively. We want to use regret bounds to solve stochastic optimization.
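The online protocol and regret can be made concrete with a small sketch: online gradient descent with the adaptive (AdaGrad-style) step, played against a fixed gradient sequence. The example sequence is mine; note the slide's regret bounds hide constants and, for the adaptive step, a bounded-domain assumption:

```python
import numpy as np

def online_gd(gs, eta0=1.0):
    """Online gradient descent with eta_t = eta0 / sqrt(sum_{i<=t} ||g_i||^2).
    Returns the iterates w_1, ..., w_T that the learner played."""
    w = np.zeros_like(gs[0])
    played, sq = [], 0.0
    for g in gs:
        played.append(w.copy())            # w_t is chosen before g_t is revealed
        sq += float(g @ g)
        w = w - (eta0 / np.sqrt(sq)) * g
    return played

def regret(ws, gs, w_star):
    """R_T(w*) = sum_t <g_t, w_t> - <g_t, w*>."""
    return sum(float(g @ (w - w_star)) for w, g in zip(ws, gs))

# Hypothetical example: a constant adversarial gradient sequence.
gs = [np.array([1.0, 0.0]) for _ in range(10)]
ws = online_gd(gs)
total_loss = sum(float(g @ w) for w, g in zip(ws, gs))
```

Against a consistent gradient direction the learner drifts the right way, so its cumulative loss is negative even though it never sees a loss function, only gradients.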

  9. What We Hope Happens

  10. What Could Happen Instead

  11. Online-to-Batch Conversion ◮ Run an online learner for T steps on gradients E[g_t] = ∇F(w_t). ◮ Pick ŵ = (1/T) Σ_{t=1}^T w_t. ◮ E[F(ŵ) − F(w⋆)] ≤ E[R_T(w⋆)]/T

  12. Online-to-Batch Conversion ◮ Run an online learner for T steps on gradients E[g_t] = ∇F(w_t). ◮ Pick ŵ = (1/T) Σ_{t=1}^T w_t. ◮ E[F(ŵ) − F(w⋆)] ≤ E[R_T(w⋆)]/T ◮ For example: ‖w⋆‖√(Σ_{t=1}^T ‖g_t‖²)/T = O(1/√T).
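A minimal sketch of this classic conversion, using online gradient descent as the learner. The quadratic objective and the step-size constant are illustrative choices of mine:

```python
import numpy as np

def classic_online_to_batch(grad_oracle, T, eta0=0.5, dim=2):
    """Classic conversion: run online GD on g_t = grad F(w_t),
    then return the plain average w_hat of all T iterates."""
    w = np.zeros(dim)
    total = np.zeros(dim)
    for t in range(1, T + 1):
        total += w                          # accumulate w_t
        g = grad_oracle(w)                  # gradient at the iterate itself
        w = w - (eta0 / np.sqrt(t)) * g
    return total / T                        # w_hat: only valid after stopping

c = np.array([1.0, -2.0])                   # illustrative F(w) = 0.5 ||w - c||^2
w_hat = classic_online_to_batch(lambda w: w - c, T=1000)
```

Note that the guarantee applies only to the final average, which is exactly the "not anytime" limitation the next slides address.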

  13. Averages Converge

  14. Something That Could Be Better ◮ The conversion is not “anytime”: you must stop and average in order to get a convergence guarantee. ◮ The iterates w_t are still not well-behaved. For example, ‖∇F(w_T)‖ may be much larger than ‖∇F(ŵ)‖.

  15. Simple Fix Just evaluate gradients at running averages! ◮ Let x_t = (1/t) Σ_{i=1}^t w_i. ◮ Let g_t be a stochastic gradient at x_t. ◮ Send g_t to the online learner and get w_{t+1}.
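The fix can be sketched as follows, again with online GD as the base learner and an illustrative quadratic oracle of mine. The only change from the classic conversion is where the gradient is queried:

```python
import numpy as np

def anytime_online_to_batch(grad_oracle, T, eta0=1.0, dim=2):
    """Anytime conversion: maintain x_t = (1/t) sum_{i<=t} w_i incrementally,
    query the gradient at x_t, and feed it to the online learner."""
    w = np.zeros(dim)                       # learner iterate w_t
    x = np.zeros(dim)                       # running average x_t
    for t in range(1, T + 1):
        x = x + (w - x) / t                 # x_t = ((t-1) x_{t-1} + w_t) / t
        g = grad_oracle(x)                  # gradient at x_t, NOT at w_t
        w = w - (eta0 / np.sqrt(t)) * g     # online GD step on g_t
    return x                                # a valid estimate at every t

c = np.array([1.0, -2.0])                   # illustrative F(w) = 0.5 ||w - c||^2
x_T = anytime_online_to_batch(lambda w: w - c, T=2000)
```

Because x_t is maintained as you go, the loop can be stopped at any t and x_t carries the convergence guarantee, with no final averaging step.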

  16. Using Running Averages

  17. Notation Recap ◮ x_t: where we evaluate the gradients g_t. ◮ w_t: iterate of the online learner (now exists only for analysis). ◮ R_T(w⋆) = Σ_{t=1}^T ⟨g_t, w_t − w⋆⟩. It is no longer clear how R_T relates to the original loss function F, since g_t is no longer a gradient at w_t.

  18. Online-To-Batch Is Unchanged Theorem. Define R_T(x⋆) = Σ_{t=1}^T ⟨α_t g_t, w_t − x⋆⟩ and x_t = (Σ_{i=1}^t α_i w_i)/(Σ_{i=1}^t α_i). Then for all x⋆ and all T, E[F(x_T) − F(x⋆)] ≤ E[R_T(x⋆)/(Σ_{t=1}^T α_t)].
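The weighted average x_t in the theorem can be maintained incrementally via x_t = x_{t−1} + (α_t/A_t)(w_t − x_{t−1}) with A_t = Σ_{i=1}^t α_i. A small helper (the function name is mine):

```python
def weighted_running_average(ws, alphas):
    """x_t = (sum_{i<=t} alpha_i w_i) / (sum_{i<=t} alpha_i), maintained
    incrementally as x_t = x_{t-1} + (alpha_t / A_t) * (w_t - x_{t-1})."""
    A = 0.0
    x = None
    xs = []
    for w, a in zip(ws, alphas):
        A += a
        x = w if x is None else x + (a / A) * (w - x)
        xs.append(x)
    return xs

# Sanity check against the direct formula: (1*1 + 2*2 + 3*3) / (1+2+3) = 7/3.
xs = weighted_running_average([1.0, 2.0, 3.0], alphas=[1.0, 2.0, 3.0])
```

The incremental form is what makes the conversion anytime: x_t is available at every step without storing past iterates.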

  19. Proof Sketch Suppose α_t = 1 for simplicity. E[Σ_{t=1}^T F(x_t) − F(x⋆)] ≤ E[Σ_{t=1}^T ⟨g_t, x_t − x⋆⟩] ≤ E[Σ_{t=1}^T ⟨g_t, x_t − w_t⟩ + ⟨g_t, w_t − x⋆⟩], where x_t − w_t = (t − 1)(x_{t−1} − x_t) and the second sum is R_T(x⋆). By convexity this is ≤ E[R_T(x⋆) + Σ_{t=1}^T (t − 1)(F(x_{t−1}) − F(x_t))]. Subtract Σ_{t=1}^T F(x_t) from both sides, and telescope.

  20. Stability It’s clear that F(x_t) → F(x⋆). But (in a bounded domain) we also have: x_t − x_{t−1} = α_t(w_t − x_t)/Σ_{i=1}^{t−1} α_i = O(1/t). In contrast, the iterates of the base online learner are less stable: w_t − w_{t−1} = O(1/√t) usually (because the learning rate is η_t ∝ 1/√t).

  21. An Algorithm That Likes Stability Optimistic online learning algorithms can obtain [RS13; HK10; MY16]: R_T(w⋆) ≤ √(Σ_{t=1}^T ‖g_t − g_{t−1}‖²) ◮ This algorithm does better if the gradients are stable.

  22. An Algorithm That Likes Stability Optimistic online learning algorithms can obtain [RS13; HK10; MY16]: R_T(w⋆) ≤ √(Σ_{t=1}^T ‖g_t − g_{t−1}‖²) ◮ This algorithm does better if the gradients are stable. ◮ When F is smooth, gradient stability is implied by iterate stability!

  23. Using Optimism with Stability ◮ With the previous conversion, we might hope that w_t − w_{t−1} = O(1/√t). This implies E[F(ŵ_T) − F(x⋆)] ≤ O(1/T + σ/√T). ◮ In the new conversion, g_t − g_{t−1} ≈ x_t − x_{t−1} = O(1/t), so we can do much better.

  24. Faster Rates with Optimism Theorem. Suppose R_T(x⋆) ≤ √(Σ_{t=1}^T α_t² ‖g_t − g_{t−1}‖²). Set α_t = t for all t. Suppose each g_t has variance at most σ², and F is L-smooth. Then E[F(x_T) − F(x⋆)] ≤ O(L/T^{3/2} + σ/√T).
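A rough sketch of how the pieces combine. This is my illustrative construction, not the paper's exact algorithm: the base learner is an optimistic gradient step that uses g_{t−1} as a hint for g_t, with an ad-hoc adaptive step size, and all constants are guesses:

```python
import numpy as np

def anytime_optimistic(grad_oracle, T, dim=2):
    """Sketch: anytime online-to-batch with weights alpha_t = t and an
    optimistic learner using g_{t-1} as a hint. Step sizes are ad hoc;
    the paper's tuned learner differs."""
    y = np.zeros(dim)                       # learner's internal point
    x = np.zeros(dim)                       # weighted average of played w_t
    A = 0.0
    g_prev = np.zeros(dim)
    sq = 1e-8                               # sum of alpha_t^2 ||g_t - g_{t-1}||^2
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(sq)
        w = y - eta * t * g_prev            # optimistic play using the hint
        A += t
        x = x + (t / A) * (w - x)           # weighted running average x_t
        g = grad_oracle(x)                  # gradient evaluated at x_t
        sq += t * t * float((g - g_prev) @ (g - g_prev))
        y = y - (1.0 / np.sqrt(sq)) * t * g # underlying adaptive step
        g_prev = g
    return x

c = np.array([1.0, -2.0])                   # illustrative F(w) = 0.5 ||w - c||^2
x_T = anytime_optimistic(lambda w: w - c, T=500)
```

On a smooth problem with a noiseless oracle the gradient differences shrink like the iterate differences, so the accumulated sum sq stays small and the regret, and hence the suboptimality, drops quickly.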

  25. Acceleration The optimal rate is E[F(x_T) − F(x⋆)] ≤ L/T² + σ/√T

  26. Acceleration The optimal rate is E[F(x_T) − F(x⋆)] ≤ L/T² + σ/√T ◮ A small change to the algorithm can get this rate too. ◮ The algorithm does not know L or σ. ◮ Unfortunately, the algebra no longer fits on a slide.

  27. Online-to-Batch Summary ◮ Evaluate gradients at running averages. ◮ Keeps the same convergence guarantee, but is anytime. ◮ Stabilizes the iterates → faster rates on smooth problems.

  28. Online-to-Batch Summary ◮ Evaluate gradients at running averages. ◮ Keeps the same convergence guarantee, but is anytime. ◮ Stabilizes the iterates → faster rates on smooth problems. Thank you!
