SLIDE 1

Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm

Jacob Steinhardt and Percy Liang

Stanford University

{jsteinhardt,pliang}@cs.stanford.edu

Jun 11, 2013

SLIDE 2

Setup

Setting is learning from experts: n experts, T rounds. For t = 1, ..., T:

  • Learner chooses a distribution $w_t \in \Delta_n$ over the experts
  • Nature reveals losses $z_t \in [-1,1]^n$ of the experts
  • Learner suffers loss $w_t^\top z_t$

Goal: minimize $\mathrm{Regret} \overset{\text{def}}{=} \sum_{t=1}^{T} w_t^\top z_t - \sum_{t=1}^{T} z_{t,i^*}$, where $i^*$ is the best fixed expert.

Typical algorithm: multiplicative weights (a.k.a. exponentiated gradient): $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$.
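To make the setup concrete, here is a minimal Python sketch of the experts game with the multiplicative-weights update above. It is not from the slides; the function name, the step size, and the random losses are illustrative.

```python
import numpy as np

def mw_regret(losses, eta):
    """Play the experts game with the update w_{t+1,i} ∝ w_{t,i} exp(-eta * z_{t,i})."""
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)              # uniform initial distribution over experts
    learner_loss = 0.0
    for z in losses:                      # z_t in [-1, 1]^n
        learner_loss += w @ z             # learner suffers w_t^T z_t
        w = w * np.exp(-eta * z)          # multiplicative-weights update
        w /= w.sum()                      # renormalize onto the simplex
    return learner_loss - losses.sum(axis=0).min()   # regret vs. best fixed expert

# Illustrative usage on random losses:
rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(1000, 5))
print(mw_regret(Z, eta=np.sqrt(np.log(5) / 1000)))
```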

SLIDE 3

Outline

  • Compare two variants of the multiplicative weights (exponentiated gradient) algorithm.
  • Understand the difference through the lens of adaptive mirror descent (Orabona et al., 2013).
  • Combine with the machinery of optimistic updates (Rakhlin & Sridharan, 2012) to beat the best existing bounds.

SLIDE 4

Two Types of Updates

In the literature there are two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):

  $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$  (MW1)
  $w_{t+1,i} \propto w_{t,i}(1 - \eta z_{t,i})$  (MW2)

The regret is bounded as

  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} \|z_t\|_\infty^2$  (Regret:MW1)
  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} z_{t,i^*}^2$  (Regret:MW2)

If the best expert $i^*$ has loss close to zero, then the second bound is better than the first. The gap can be $\Theta(\sqrt{T})$ (in actual performance, not just in the upper bounds).
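A small sketch (illustrative, not the authors' code) contrasting the two updates on the same loss sequence; note that (MW2) needs $\eta < 1$ so the weights stay positive when $z_t \in [-1,1]^n$.

```python
import numpy as np

def run_variant(losses, eta, variant):
    """Regret of MW1 (exponential update) or MW2 (linearized update) on one loss sequence."""
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)
    total = 0.0
    for z in losses:
        total += w @ z
        if variant == "MW1":
            w = w * np.exp(-eta * z)       # w_{t+1,i} ∝ w_{t,i} exp(-eta * z_{t,i})
        else:                              # "MW2"
            w = w * (1.0 - eta * z)        # w_{t+1,i} ∝ w_{t,i} (1 - eta * z_{t,i}); needs eta < 1
        w /= w.sum()
    return total - losses.sum(axis=0).min()
```

On loss sequences where the best expert's cumulative squared loss is small, the (MW2) run typically incurs visibly lower regret, consistent with the gap noted above.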

SLIDE 5

Two Types of Updates

Recall the two updates $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$ (MW1) and $w_{t+1,i} \propto w_{t,i}(1 - \eta z_{t,i})$ (MW2). Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?

SLIDE 6

Two Types of Updates

Mirror descent is the gold-standard meta-algorithm for online learning; how do (MW1) and (MW2) relate to it? (MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$.

SLIDE 7

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$. (MW2) is NOT mirror descent for any fixed regularizer.

SLIDE 8

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, but (MW2) is NOT mirror descent for any fixed regularizer. Unsettling: should we abandon mirror descent as a gold standard?

SLIDE 9

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, but (MW2) is NOT mirror descent for any fixed regularizer. Should we abandon mirror descent as a gold standard?

No: we can cast (MW2) as adaptive mirror descent (Orabona et al., 2013).

SLIDE 10

Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + \sum_{s=1}^{t-1} w^\top z_s.$

For $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1).
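As a quick sanity check (not from the slides): with the entropy regularizer, the argmin over the simplex has a softmax closed form, $w_t \propto \exp(-\eta \sum_{s<t} z_s)$, which is exactly the (MW1) iterate. A minimal sketch with illustrative random losses:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, eta = rng.uniform(-1, 1, size=(50, 4)), 0.1

w_mw1 = np.full(4, 0.25)          # iterative MW1 weights, started uniform
cumulative = np.zeros(4)          # running sum of past losses
for z in Z:
    logits = -eta * cumulative
    w_md = np.exp(logits - logits.max())   # stable softmax
    w_md /= w_md.sum()                     # closed-form mirror-descent solution
    assert np.allclose(w_md, w_mw1)        # matches the MW1 iterate
    cumulative += z
    w_mw1 = w_mw1 * np.exp(-eta * z)
    w_mw1 /= w_mw1.sum()
```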

SLIDE 11

Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm $w_t = \operatorname*{argmin}_{w} \psi(w) + \sum_{s=1}^{t-1} w^\top z_s$; for $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1).

Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm

  $w_t = \operatorname*{argmin}_{w} \; \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s.$

SLIDE 13

Adaptive Mirror Descent to the Rescue

Recall: mirror descent is $w_t = \operatorname*{argmin}_{w} \psi(w) + \sum_{s=1}^{t-1} w^\top z_s$; for $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1). Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm $w_t = \operatorname*{argmin}_{w} \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s$.

For $\psi_t(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i) + \eta \sum_{i=1}^{n}\sum_{s=1}^{t-1} w_i z_{s,i}^2$, we approximately recover (MW2):

  Update: $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i} - \eta^2 z_{t,i}^2) \approx w_{t,i}(1 - \eta z_{t,i})$

This is enough to achieve the better regret bound. We can recover (MW2) exactly with a more complicated $\psi_t$.
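A sketch of this corrected update in code (illustrative; the log-weight vector and the stability shift are implementation details, not from the slides):

```python
import numpy as np

def adaptive_eg(losses, eta):
    """Exponentiated gradient with a second-order correction:
    w_{t+1,i} ∝ w_{t,i} exp(-eta*z_{t,i} - eta^2*z_{t,i}^2), which ≈ w_{t,i}(1 - eta*z_{t,i})."""
    beta = np.zeros(losses.shape[1])       # log-weights
    total = 0.0
    for z in losses:
        w = np.exp(beta - beta.max())      # shift by the max for numerical stability
        w /= w.sum()
        total += w @ z
        beta += -eta * z - (eta * z) ** 2  # the adaptive-mirror-descent update
    return total - losses.sum(axis=0).min()
```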

SLIDE 14

Advantages of Our Perspective

So far, we have cast (MW2) as adaptive mirror descent, with regularizer

  $\psi_t(w) = \sum_{i=1}^{n} w_i \left( \frac{1}{\eta}\log(w_i) + \eta \sum_{s=1}^{t-1} z_{s,i}^2 \right).$

This explains the better regret bound while staying within the mirror descent framework, which is nice. Our new perspective also allows us to apply lots of modern machinery:

  • optimistic updates (Rakhlin & Sridharan, 2012)
  • matrix multiplicative weights (Tsuda et al., 2005; Arora & Kale, 2007)

By "turning the crank", we get results that beat the state of the art!

SLIDE 15

Beating State of the Art

Existing regret bounds can be organized along two axes: optimism and adaptivity. The table is built up over the next few slides.

SLIDE 16

Beating State of the Art

Optimism / Adaptivity table so far: $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$.

SLIDE 17

Beating State of the Art

Optimism / Adaptivity table so far: $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$ and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 18

Beating State of the Art

Optimism / Adaptivity table so far: $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008); $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 19

Beating State of the Art

Optimism / Adaptivity table so far: $D_\infty$ (Chiang et al., 2012); $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008); $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $D_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - z_{t-1}\|_\infty^2$, $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 20

Beating State of the Art

Full Optimism / Adaptivity table, with each bound attributed to its source (the optimism axis runs across the $S$/$V$/$D$ families; the adaptivity axis runs across the $\infty$-norm, $\max_i$, and best-expert $i^*$ variants):

  • $D_{i^*}$, $\max_i D_i$ (this work); $D_\infty$ (Chiang et al., 2012)
  • $V_{i^*}$ (this work); $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008)
  • $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997)

In the above we let
  $D_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - z_{t-1}\|_\infty^2$, $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$,
  $D_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - z_{t-1,i})^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.
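For concreteness, a small sketch (not from the slides) computing these quantities from a T x n loss matrix; taking $z_0 = 0$ is an assumption the slides do not spell out.

```python
import numpy as np

def table_quantities(Z):
    """Given losses Z with shape (T, n), return the per-expert and infinity-norm quantities."""
    prev = np.vstack([np.zeros((1, Z.shape[1])), Z[:-1]])   # z_{t-1}, with z_0 := 0
    S_i = (Z ** 2).sum(axis=0)                              # sum_t z_{t,i}^2
    V_i = ((Z - Z.mean(axis=0)) ** 2).sum(axis=0)           # sum_t (z_{t,i} - mean_i)^2
    D_i = ((Z - prev) ** 2).sum(axis=0)                     # sum_t (z_{t,i} - z_{t-1,i})^2
    S_inf = (np.abs(Z).max(axis=1) ** 2).sum()              # sum_t ||z_t||_inf^2
    V_inf = (np.abs(Z - Z.mean(axis=0)).max(axis=1) ** 2).sum()
    D_inf = (np.abs(Z - prev).max(axis=1) ** 2).sum()
    return S_i, V_i, D_i, S_inf, V_inf, D_inf
```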

SLIDE 21

Optimistic Updates: A Brief Review

(Normal) mirror descent:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$

SLIDE 22

Optimistic Updates: A Brief Review

(Normal) mirror descent: $w_t = \operatorname*{argmin}_{w} \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$.

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right)$

The hint $m_t$ guesses the next term $z_t$ in the cost function, so we pay regret in terms of $z_t - m_t$ rather than $z_t$.

SLIDE 23

Optimistic Updates: A Brief Review

(Normal) mirror descent: $w_t = \operatorname*{argmin}_{w} \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$.

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right)$

The hint $m_t$ guesses the next term $z_t$ in the cost function, so we pay regret in terms of $z_t - m_t$ rather than $z_t$. Examples: $m_t = z_{t-1}$, or $m_t = \frac{1}{t}\sum_{s=1}^{t-1} z_s$.
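A minimal sketch of the optimistic update over the simplex with the entropy regularizer, using the hint $m_t = z_{t-1}$ (illustrative assumptions: the softmax closed form of the argmin and $m_1 = 0$; not the authors' code):

```python
import numpy as np

def optimistic_md(losses, eta):
    """w_t ∝ exp(-eta * (m_t + sum_{s<t} z_s)) with hint m_t = z_{t-1} (m_1 = 0)."""
    n = losses.shape[1]
    cumulative = np.zeros(n)
    m = np.zeros(n)                       # hint for the upcoming loss
    total = 0.0
    for z in losses:
        logits = -eta * (cumulative + m)  # play against the guessed cost m_t + sum_{s<t} z_s
        w = np.exp(logits - logits.max())
        w /= w.sum()
        total += w @ z
        cumulative += z
        m = z                             # next round's hint: the most recent loss
    return total - losses.sum(axis=0).min()
```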

SLIDE 24

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$

SLIDE 25

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$

SLIDE 26

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW3:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$;  $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

SLIDE 27

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW3:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$;  $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

Regret of MW3:

  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} (z_{t,i^*} - z_{t-1,i^*})^2$

This dominates all existing bounds in this setting!
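A sketch of MW3 following the auxiliary-update/prediction split in the table (an illustrative translation of the two displayed formulas; taking $z_0 = 0$ is an assumption):

```python
import numpy as np

def mw3(losses, eta):
    """MW3: beta_{t+1,i} = beta_{t,i} - eta*z_{t,i} - eta^2*(z_{t,i} - z_{t-1,i})^2,
    prediction w_{t,i} ∝ exp(beta_{t,i} - eta*z_{t-1,i})."""
    n = losses.shape[1]
    beta = np.zeros(n)
    z_prev = np.zeros(n)                         # z_0 := 0
    total = 0.0
    for z in losses:
        logits = beta - eta * z_prev             # optimistic prediction uses the last loss as hint
        w = np.exp(logits - logits.max())
        w /= w.sum()
        total += w @ z
        beta += -eta * z - eta**2 * (z - z_prev) ** 2   # auxiliary update
        z_prev = z
    return total - losses.sum(axis=0).min()
```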

SLIDE 28

Summary

  • Cast the multiplicative weights algorithm as adaptive mirror descent.
  • Applied the machinery of optimistic updates to beat the best existing bounds.
  • Also in the paper:
      - extension to general convex losses
      - extension to matrices
      - generalization of the FTRL lemma to convex cones
