Anytime Online-to-Batch, Optimism, and Acceleration Ashok Cutkosky - PowerPoint PPT Presentation

Anytime Online-to-Batch, Optimism, and Acceleration
Ashok Cutkosky
Google Research

Stochastic Optimization
First-Order Stochastic Optimization
Find the minimum of some convex function $F : W \to \mathbb{R}$ using a stochastic gradient oracle: given $w$, we can obtain a random variable $g$ with $\mathbb{E}[g] = \nabla F(w)$.

Example: Stochastic Gradient Descent
A popular algorithm is gradient descent:
$$w_1 = 0, \qquad w_{t+1} = w_t - \eta_t g_t$$
How should we analyze its convergence?
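The update above can be sketched in a few lines of Python. This is only an illustration: the 1-D objective $F(w) = \frac{1}{2}(w-3)^2$, the unit-variance Gaussian noise, and the step-size schedule $\eta_t = 1/\sqrt{t}$ are assumptions chosen for the example and do not come from the slides.

```python
import random

def sgd(grad_oracle, step_size, steps, w1=0.0):
    """Iterate w_{t+1} = w_t - eta_t * g_t and return [w_1, ..., w_{steps+1}]."""
    w = w1
    iterates = [w]
    for t in range(1, steps + 1):
        g = grad_oracle(w)            # stochastic gradient with E[g] = F'(w)
        w = w - step_size(t) * g      # gradient step with step size eta_t
        iterates.append(w)
    return iterates

# Hypothetical objective F(w) = 0.5 * (w - 3)^2 with unit-variance Gaussian noise.
rng = random.Random(0)
noisy_grad = lambda w: (w - 3.0) + rng.gauss(0.0, 1.0)

iterates = sgd(noisy_grad, step_size=lambda t: 1.0 / t ** 0.5, steps=2000)
print(iterates[-1])   # last iterate; it should sit near the minimizer w = 3
```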

Online Optimization
For $t = 1, \dots, T$, repeat:
1. Learner chooses a point $w_t$.
2. Environment presents learner with a gradient $g_t$ (think $\mathbb{E}[g_t] = \nabla F(w_t)$).
3. Learner suffers loss $\langle g_t, w_t \rangle$.
The objective is to minimize regret:
$$R_T(w_\star) = \sum_{t=1}^T \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w_\star \rangle}_{\text{benchmark loss}}$$
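The three-step protocol and the regret definition can be sketched as follows. The names `run_protocol`, `regret`, and `learner_step` are hypothetical, the setting is 1-D so that inner products become ordinary products, and the learner shown is a plain gradient step with a fixed step size of 0.1 chosen only for illustration.

```python
def run_protocol(learner_step, gradients, w1=0.0):
    """Play the online protocol in 1-D; return (iterates, total_loss)."""
    w = w1
    iterates, total_loss = [], 0.0
    for g in gradients:
        iterates.append(w)          # 1. learner commits to w_t
        total_loss += g * w         # 2.-3. gradient g_t is revealed, loss <g_t, w_t>
        w = learner_step(w, g)      # learner uses g_t to choose w_{t+1}
    return iterates, total_loss

def regret(gradients, iterates, w_star):
    """R_T(w_star) = sum_t <g_t, w_t - w_star>."""
    return sum(g * (w - w_star) for g, w in zip(gradients, iterates))

# Hypothetical gradient sequence and a gradient-descent learner with step 0.1.
gradients = [1.0, -1.0, 0.5]
iterates, total_loss = run_protocol(lambda w, g: w - 0.1 * g, gradients)
print(iterates, total_loss, regret(gradients, iterates, w_star=2.0))
```

Regret is the learner's total loss minus the loss of the fixed comparator, so `regret(...)` equals `total_loss - w_star * sum(gradients)` in this linear setting.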

Back to Gradient Descent
$$w_{t+1} = w_t - \eta_t g_t$$
The simplest analysis chooses $\eta_t \propto 1/\sqrt{T}$, but one can also do more complicated things like $\eta_t \propto 1/\sqrt{\sum_{t=1}^T \|g_t\|^2}$. These yield, respectively,
$$R_T(w_\star) \le \|w_\star\| \sqrt{T}$$
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$
We want to use regret bounds to solve stochastic optimization.
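As a sanity check on the first bound, the sketch below runs unconstrained gradient descent in 1-D against an arbitrary gradient sequence and compares the regret with $\|w_\star\|\sqrt{T}$. It assumes $|g_t| \le 1$ and a step size $\eta = |w_\star|/\sqrt{T}$ tuned with knowledge of $|w_\star|$; both are illustrative assumptions, as is the random gradient sequence. The intermediate bound $\frac{w_\star^2}{2\eta} + \frac{\eta}{2}\sum_t g_t^2$ is the standard one for a constant step size.

```python
import math
import random

def ogd_regret(gradients, eta, w_star):
    """Regret of w_{t+1} = w_t - eta * g_t started at w_1 = 0 (1-D)."""
    w, reg = 0.0, 0.0
    for g in gradients:
        reg += g * (w - w_star)     # <g_t, w_t - w_star>
        w -= eta * g
    return reg

T = 1000
rng = random.Random(1)
gradients = [rng.uniform(-1.0, 1.0) for _ in range(T)]   # arbitrary, |g_t| <= 1
w_star = 2.0
eta = abs(w_star) / math.sqrt(T)     # assumes |w_star| is known in advance

reg = ogd_regret(gradients, eta, w_star)
exact_bound = w_star ** 2 / (2 * eta) + 0.5 * eta * sum(g * g for g in gradients)
simple_bound = abs(w_star) * math.sqrt(T)
print(reg, exact_bound, simple_bound)
```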

What We Hope Happens

What Could Happen Instead

Online-to-Batch Conversion
◮ Run an online learner for $T$ steps on gradients $\mathbb{E}[g_t] = \nabla F(w_t)$.
◮ Pick $\hat{w} = \frac{1}{T} \sum_{t=1}^T w_t$.
◮ $\mathbb{E}[F(\hat{w}) - F(w_\star)] \le \frac{\mathbb{E}[R_T(w_\star)]}{T}$
◮ For example: $\frac{\|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}}{T} = O(1/\sqrt{T})$.
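A minimal sketch of this conversion, assuming gradient descent with a constant step size as the online learner and the hypothetical objective $F(w) = \frac{1}{2}(w-3)^2$ with unit-variance noise (neither is specified on the slide):

```python
import random

def online_to_batch(grad_oracle, steps, eta, w1=0.0):
    """Run online gradient descent and return the average iterate w_hat."""
    w, total = w1, 0.0
    for _ in range(steps):
        total += w                     # accumulate w_t before the update
        w -= eta * grad_oracle(w)      # gradient is evaluated at w_t itself
    return total / steps

F = lambda w: 0.5 * (w - 3.0) ** 2     # hypothetical objective, minimized at w = 3
rng = random.Random(0)
noisy_grad = lambda w: (w - 3.0) + rng.gauss(0.0, 1.0)

T = 2000
w_hat = online_to_batch(noisy_grad, T, eta=1.0 / T ** 0.5)
gap = F(w_hat) - F(3.0)
print(w_hat, gap)
```

Note that the guarantee attaches only to `w_hat`, which is computed after the run ends; the individual iterates carry no such guarantee.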

Averages Converge

Something That Could Be Better
◮ The conversion is not “anytime”: you must stop and average in order to get a convergence guarantee.
◮ The iterates $w_t$ are still not well-behaved. For example, $\|\nabla F(w_T)\|$ may be much larger than $\|\nabla F(\hat{w})\|$.

Simple Fix
Just evaluate gradients at running averages!
◮ Let $x_t = \frac{1}{t} \sum_{i=1}^t w_i$.
◮ Let $g_t$ be a stochastic gradient at $x_t$.
◮ Send $g_t$ to the online learner and get $w_{t+1}$.
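The three steps above can be sketched as a loop. The base learner here is constant-step gradient descent, and the objective and noise model are hypothetical; any online learner could take the place of the update on the last line of the loop.

```python
import random

def anytime_online_to_batch(grad_oracle, steps, eta, w1=0.0):
    """Return (ws, xs): learner iterates w_t and running averages x_t."""
    w, x = w1, w1
    ws, xs = [], []
    for t in range(1, steps + 1):
        x += (w - x) / t               # x_t = (1/t) * sum_{i<=t} w_i
        ws.append(w)
        xs.append(x)
        g = grad_oracle(x)             # gradient at the average x_t, not at w_t
        w -= eta * g                   # base learner (gradient descent) outputs w_{t+1}
    return ws, xs

# Hypothetical objective F(x) = 0.5 * (x - 3)^2 with unit-variance Gaussian noise.
rng = random.Random(0)
noisy_grad = lambda x: (x - 3.0) + rng.gauss(0.0, 1.0)

T = 2000
ws, xs = anytime_online_to_batch(noisy_grad, T, eta=1.0 / T ** 0.5)
print(xs[-1])   # the current average is itself the output at any time t
```

Because the oracle is only ever queried at `x`, the point returned at any stopping time is one the algorithm has already evaluated.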

Using Running Averages

Notation Recap
◮ $x_t$: where we evaluate gradients $g_t$.
◮ $w_t$: iterate of online learner (now exists only for analysis).
◮ $R_T(w_\star) = \sum_{t=1}^T \langle g_t, w_t - w_\star \rangle$.
It is no longer clear how $R_T$ relates to the original loss function $F$, since $g_t$ is no longer a gradient at $w_t$.

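Online-To-Batch is unchanged
Theorem
Define
$$R_T(x_\star) = \sum_{t=1}^T \langle \alpha_t g_t, w_t - x_\star \rangle, \qquad x_t = \frac{\sum_{i=1}^t \alpha_i w_i}{\sum_{i=1}^t \alpha_i}$$
Then for all $x_\star$ and all $T$,
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \mathbb{E}\left[\frac{R_T(x_\star)}{\sum_{t=1}^T \alpha_t}\right]$$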
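The weighted average $x_t$ in the theorem does not require storing past iterates; it can be maintained incrementally. The helper name `weighted_running_averages` and the sample values below are hypothetical, and the sketch assumes every $\alpha_t$ is positive.

```python
def weighted_running_averages(ws, alphas):
    """Return x_t = sum_{i<=t} alpha_i w_i / sum_{i<=t} alpha_i for every t."""
    xs, x, weight_sum = [], 0.0, 0.0
    for w, a in zip(ws, alphas):
        weight_sum += a                      # running sum of alpha_1..alpha_t (assumed > 0)
        x += (a / weight_sum) * (w - x)      # incremental form of the weighted mean
        xs.append(x)
    return xs

ws = [1.0, 3.0, 2.0]
alphas = [1.0, 2.0, 3.0]                     # alpha_t = t
print(weighted_running_averages(ws, alphas))
```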

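Proof Sketch
Suppose $\alpha_t = 1$ for simplicity.
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^T F(x_t) - F(x_\star)\right] &\le \mathbb{E}\left[\sum_{t=1}^T \langle g_t, x_t - x_\star \rangle\right] \\
&\le \mathbb{E}\Bigg[\sum_{t=1}^T \langle g_t, \underbrace{x_t - w_t}_{(t-1)(x_{t-1} - x_t)} \rangle + \underbrace{\sum_{t=1}^T \langle g_t, w_t - x_\star \rangle}_{R_T(x_\star)}\Bigg] \\
&\le \mathbb{E}\left[R_T(x_\star) + \sum_{t=1}^T (t-1)\left(F(x_{t-1}) - F(x_t)\right)\right]
\end{aligned}$$
Subtract $\sum_{t=1}^T F(x_t)$ from both sides, and telescope.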
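The final "subtract and telescope" step of the sketch, still with $\alpha_t = 1$, can be written out as follows. This is a worked expansion of the step rather than text from the slides; note that the $t = 1$ term carries the coefficient $t - 1 = 0$, so $x_0$ never needs to be defined.

```latex
\begin{align*}
% Subtract E[\sum_t F(x_t)] from both sides of the last inequality:
-T\,F(x_\star)
  &\le \mathbb{E}\Big[R_T(x_\star)
       + \sum_{t=1}^T \big((t-1)F(x_{t-1}) - t\,F(x_t)\big)\Big] \\
% The sum telescopes: consecutive terms cancel, leaving 0 \cdot F(x_0) - T F(x_T).
  &= \mathbb{E}\big[R_T(x_\star)\big] - T\,\mathbb{E}\big[F(x_T)\big] \\
% Rearrange and divide by T:
\mathbb{E}\big[F(x_T) - F(x_\star)\big]
  &\le \frac{\mathbb{E}\big[R_T(x_\star)\big]}{T}
\end{align*}
```

This matches the theorem's bound, since $\sum_{t=1}^T \alpha_t = T$ when every $\alpha_t = 1$.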

Stability
It’s clear that $F(x_t) \to F(x_\star)$. But (in a bounded domain) we also have:
$$x_t - x_{t-1} = \frac{\alpha_t (w_t - x_t)}{\sum_{i=1}^{t-1} \alpha_i} = O(1/t)$$
In contrast, the iterates of the base online learner are less stable: usually $w_t - w_{t-1} = O(1/\sqrt{t})$ (because the learning rate is $\eta_t \propto 1/\sqrt{t}$).
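The $O(1/t)$ claim can be checked numerically in the uniform case $\alpha_t = 1$. The sketch assumes the bounded domain $[-1, 1]$ and, to stress the point, draws the $w_t$ at random instead of taking them from a learner: even when consecutive $w_t$ jump arbitrarily, the averages satisfy $|x_t - x_{t-1}| = |w_t - x_{t-1}|/t \le 2/t$.

```python
import random

rng = random.Random(0)
T = 1000
ws = [rng.uniform(-1.0, 1.0) for _ in range(T)]   # arbitrary points in [-1, 1]

xs, x = [], 0.0
for t, w in enumerate(ws, start=1):
    x += (w - x) / t                              # uniform running average x_t
    xs.append(x)

# avg_steps[k] = |x_{k+2} - x_{k+1}|, i.e. the movement of the average at t = k + 2
avg_steps = [abs(xs[k + 1] - xs[k]) for k in range(T - 1)]
# raw_steps[k] = |w_{k+2} - w_{k+1}|, which does not shrink with t here
raw_steps = [abs(ws[k + 1] - ws[k]) for k in range(T - 1)]
print(max(avg_steps[-100:]), max(raw_steps[-100:]))
```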

An Algorithm That Likes Stability
Optimistic online learning algorithms can obtain [RS13; HK10; MY16]:
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \|g_t - g_{t-1}\|^2}$$
◮ This algorithm does better if the gradients are stable.
◮ When $F$ is smooth, gradient stability is implied by iterate stability!
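One simple instance of optimism, shown here only as an illustration and not necessarily the algorithm of the cited works, is unconstrained gradient descent that uses the previous gradient as a hint for the next one. With a constant step size $\eta$ it satisfies $R_T(w_\star) \le \frac{\|w_\star\|^2}{2\eta} + \frac{\eta}{2} \sum_t \|g_t - g_{t-1}\|^2$ (taking $g_0 = 0$); the square-root form on the slide additionally requires tuning $\eta$, which this sketch does not attempt. The slowly varying gradient sequence is a hypothetical example.

```python
import math

def optimistic_ogd_regret(gradients, eta, w_star):
    """Regret of optimistic gradient descent that uses g_{t-1} as its hint (1-D)."""
    z, hint, reg = 0.0, 0.0, 0.0
    for g in gradients:
        w = z - eta * hint          # play as if the next gradient equals the hint
        reg += g * (w - w_star)     # <g_t, w_t - w_star>
        z -= eta * g                # ordinary gradient step on the true gradient
        hint = g                    # next hint is the gradient just seen
    return reg

T, eta, w_star = 1000, 0.1, 2.0
gradients = [math.cos(t / 100.0) for t in range(1, T + 1)]   # slowly varying
reg = optimistic_ogd_regret(gradients, eta, w_star)

previous = [0.0] + gradients[:-1]                            # g_0 = 0
variation = sum((g - h) ** 2 for g, h in zip(gradients, previous))
bound = w_star ** 2 / (2 * eta) + 0.5 * eta * variation
print(reg, bound, variation, sum(g * g for g in gradients))
```

For stable gradients the variation term is far smaller than $\sum_t \|g_t\|^2$, which is what makes the optimistic bound attractive.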

Using Optimism with Stability
◮ With the previous conversion, we might hope that $w_t - w_{t-1} = O(1/\sqrt{t})$. This implies
$$\mathbb{E}[F(\hat{w}_T) - F(x_\star)] \le O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right)$$
◮ In the new conversion, $g_t - g_{t-1} \approx x_t - x_{t-1} = O(1/t)$, so we can do much better.

Faster Rates with Optimism
Theorem
Suppose
$$R_T(x_\star) \le \sqrt{\sum_{t=1}^T \alpha_t^2 \|g_t - g_{t-1}\|^2}$$
Set $\alpha_t = t$ for all $t$. Suppose each $g_t$ has variance at most $\sigma^2$, and $F$ is $L$-smooth. Then
$$\mathbb{E}[F(x_T) - F(x_\star)] \le O\left(\frac{L}{T^{3/2}} + \frac{\sigma}{\sqrt{T}}\right)$$

Acceleration
The optimal rate is
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \frac{L}{T^2} + \frac{\sigma}{\sqrt{T}}$$
◮ A small change to the algorithm can get this rate too.
◮ The algorithm does not know $L$ or $\sigma$.
◮ Unfortunately, the algebra no longer fits on a slide.

Online-to-Batch Summary
◮ Evaluate gradients at running averages.
◮ Keeps the same convergence guarantee, but is anytime.
◮ Stabilizes the iterates, which gives faster rates on smooth problems.
Thank you!
