  1. Online Algorithms: Learning & Optimization with No Regret
     CS/CNS/EE 253, Daniel Golovin

  2. The Setup
     Optimization:
     ● Model the problem (objective, constraints)
     ● Pick the best decision from a feasible set.
     Learning:
     ● Model the problem (objective, hypothesis class)
     ● Pick the best hypothesis from a feasible set.

  3. Online Learning/Optimization
     In each round $t$: choose an action $x_t \in X$, then get reward $f_t(x_t)$ and feedback, where $f_t : X \to [0, 1]$.
     ● Same feasible set $X$ in each round $t$.
     ● Different reward models: stochastic; arbitrary but oblivious; adaptive and arbitrary.
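
     As a rough illustration of this protocol (not from the lecture), here is a minimal Python sketch of the full-information interaction loop; the `Learner` interface and the uniform-random placeholder strategy are assumptions for demonstration only:

```python
import random

class Learner:
    """A placeholder online learner over a finite feasible set X."""
    def __init__(self, actions):
        self.actions = list(actions)

    def choose(self):
        # Placeholder strategy: pick an action uniformly at random.
        return random.choice(self.actions)

    def update(self, f_t):
        # Full-information feedback: the learner gets to see f_t on all of X.
        pass

def play(learner, reward_fns):
    """Run the protocol: commit to x_t, then observe f_t and collect f_t(x_t)."""
    total_reward = 0.0
    for f_t in reward_fns:            # one reward function f_t : X -> [0, 1] per round
        x_t = learner.choose()        # choose an action before seeing f_t
        total_reward += f_t(x_t)
        learner.update(f_t)           # feedback arrives only after the choice
    return total_reward
```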

  4. Concrete Example: Commuting
     Pick a path $x_t$ from home to school.
     Pay cost $f_t(x_t) := \sum_{e \in x_t} c_t(e)$.
     Then see all edge costs for that round.
     Dealing with limited feedback: later in the course.
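
     A tiny illustrative computation of the commuting cost above; the edge names and cost values are made up:

```python
# Hypothetical edge costs c_t(e), revealed at the end of round t.
c_t = {("home", "A"): 0.3, ("A", "school"): 0.5, ("home", "school"): 0.9}

# The path x_t we committed to, as a list of edges.
x_t = [("home", "A"), ("A", "school")]

# f_t(x_t) = sum of c_t(e) over edges e in the path x_t.
cost = sum(c_t[e] for e in x_t)
print(cost)  # 0.8
```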

  5. Other Applications
     ● Sequential decision problems
     ● Streaming algorithms for optimization/learning with large data sets
     ● Combining weak learners into strong ones (“boosting”)
     ● Fast approximate solvers for certain classes of convex programs
     ● Playing repeated games

  6. Binary Prediction with a Perfect Expert
     ● $n$ hypotheses (“experts”) $h_1, h_2, \ldots, h_n$.
     ● Guaranteed that some hypothesis is perfect.
     ● Each round, get a data point $p_t$ and classifications $h_i(p_t) \in \{0, 1\}$.
     ● Output a binary prediction $x_t$, observe the correct label.
     ● Minimize the number of mistakes.
     Any suggestions?

  7. A Weighted Majority Algorithm
     ● Each expert “votes” for its classification.
     ● Only votes from experts who have never been wrong are counted.
     ● Go with the majority.
     Claim: # mistakes $M \le \log_2(n)$.
     Weights $w_{it} = I(h_i \text{ correct on first } t \text{ rounds})$, and $W_t = \sum_i w_{it}$. Then $W_0 = n$ and $W_T \ge 1$.
     A mistake on round $t$ implies $W_{t+1} \le W_t / 2$.
     So $1 \le W_T \le W_0 / 2^M = n / 2^M$.
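
     A minimal sketch of this rule (the “halving algorithm”), assuming the expert predictions and true labels arrive as lists; the function and argument names are mine:

```python
def halving_algorithm(expert_preds, labels):
    """Predict the majority vote of the experts that have never been wrong.

    expert_preds[t][i] is expert i's 0/1 prediction in round t; labels[t] is
    the true label. Assumes some expert is perfect, so the set of never-wrong
    experts stays nonempty.
    """
    n = len(expert_preds[0])
    alive = [True] * n            # w_{it} = 1 iff expert i is correct on the first t rounds
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        ones = sum(1 for i in range(n) if alive[i] and preds[i] == 1)
        zeros = sum(1 for i in range(n) if alive[i] and preds[i] == 0)
        x_t = 1 if ones >= zeros else 0          # majority of surviving experts
        if x_t != y:
            mistakes += 1                        # surviving weight at least halves
        alive = [alive[i] and preds[i] == y for i in range(n)]
    return mistakes                              # at most log2(n)
```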

  8. Weighted Majority [Littlestone & Warmuth '89]
     What if there's no perfect expert?
     ● Each expert $i$ has a weight $w(i)$ and “votes” for its classification in $\{-1, 1\}$. Go with the weighted majority: predict $\mathrm{sign}(\sum_i w_i x_i)$. Halve the weights of wrong experts.
     Let $m$ = # mistakes of the best expert. How many mistakes $M$ do we make?
     Weights $w_{it} = (1/2)^{\#\text{ mistakes by } i \text{ on first } t \text{ rounds}}$. Let $W_t := \sum_i w_{it}$. Note $W_0 = n$ and $W_T \ge (1/2)^m$.
     A mistake on round $t$ implies $W_{t+1} \le \frac{3}{4} W_t$.
     So $(1/2)^m \le W_T \le W_0 (3/4)^M = n \cdot (3/4)^M$.
     Thus $(4/3)^M \le n \cdot 2^m$ and $M \le 2.41(m + \log_2(n))$.
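
     A sketch of the deterministic Weighted Majority update with labels in {-1, +1}; as above, the function and argument names are illustrative assumptions:

```python
def weighted_majority(expert_preds, labels):
    """Predict the sign of the weighted vote, then halve the weights of wrong experts.

    expert_preds[t][i] and labels[t] are in {-1, +1}.
    """
    n = len(expert_preds[0])
    w = [1.0] * n                                 # initial weights w_{i0} = 1
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        vote = sum(w[i] * preds[i] for i in range(n))
        x_t = 1 if vote >= 0 else -1
        if x_t != y:
            mistakes += 1        # when we err, at least 1/4 of the total weight disappears
        w = [w[i] / 2 if preds[i] != y else w[i] for i in range(n)]
    return mistakes              # bounded by 2.41 * (m + log2(n))
```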

  9. Can we do better?
     $M \le 2.41(m + \log_2(n))$
     Experts \ Time:   1  2  3  4
     $e_1 \equiv -1$:  0  1  0  1
     $e_2 \equiv +1$:  1  0  1  0
     ● No deterministic algorithm can get $M < 2m$.
     ● What if there are more than 2 choices?

  10. Regret
     “Maybe all one can do is hope to end up with the right regrets.” – Arthur Miller
     ● Notation: Define loss or cost functions $c_t$ and define the regret of $x_1, x_2, \ldots, x_T$ as
       $R_T = \sum_{t=1}^T c_t(x_t) - \sum_{t=1}^T c_t(x^*)$
       where $x^* = \mathrm{argmin}_{x \in X} \sum_{t=1}^T c_t(x)$.
       A sequence has “no regret” if $R_T = o(T)$.
     ● Questions:
       ● How can we improve Weighted Majority?
       ● What is the lowest regret we can hope for?
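
     For concreteness, a small sketch of computing the regret of a played sequence in hindsight (an illustrative helper, not part of the slides):

```python
def regret(cost_fns, played, feasible_set):
    """R_T = sum_t c_t(x_t) - min over x in X of sum_t c_t(x)."""
    algorithm_cost = sum(c_t(x_t) for c_t, x_t in zip(cost_fns, played))
    best_fixed_cost = min(sum(c_t(x) for c_t in cost_fns) for x in feasible_set)
    return algorithm_cost - best_fixed_cost
```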

  11. The Hedge/WMR Algorithm* [Freund & Schapire '97]
     Hedge($\epsilon$):
       Initialize $w_{i0} = 1$ for all $i$.
       In each round $t$:
         Let $p_t(i) := w_{it} / \sum_j w_{jt}$.
         Choose expert $e_t$ from the categorical distribution $p_t$.
         Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
         For each $i$, set $w_{i,t+1} = w_{it} (1 - \epsilon)^{c_t(x(e_i, t))}$.
     ● How does this compare to WM?
     * Pedantic note: Hedge is often called “Randomized Weighted Majority” and abbreviated “WMR”, though WMR was published in the context of binary classification, unlike Hedge.

  12. The Hedge/WMR Algorithm
     Hedge($\epsilon$):
       Initialize $w_{i0} = 1$ for all $i$.
       In each round $t$:
         Let $p_t(i) := w_{it} / \sum_j w_{jt}$.
         Choose expert $e_t$ from the categorical distribution $p_t$.
         Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
         For each $i$, set $w_{i,t+1} = w_{it} (1 - \epsilon)^{c_t(x(e_i, t))}$.
     Randomization: an expert's influence shrinks exponentially with its cumulative loss.
     Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.
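
     A minimal Python sketch of Hedge($\epsilon$) over $n$ experts with costs in [0, 1]; the cost-matrix interface is an assumption made for illustration:

```python
import random

def hedge(epsilon, cost_matrix):
    """Hedge(eps): sample expert i with probability w_i / sum_j w_j, then
    multiply each weight by (1 - eps) raised to that expert's cost this round.

    cost_matrix[t][i] = c_t(x(e_i, t)) in [0, 1]. Returns the chosen expert indices.
    """
    n = len(cost_matrix[0])
    w = [1.0] * n                                     # w_{i0} = 1 for all i
    choices = []
    for costs in cost_matrix:
        total = sum(w)
        p = [wi / total for wi in w]                  # p_t(i) = w_it / sum_j w_jt
        e_t = random.choices(range(n), weights=p)[0]  # sample from the categorical p_t
        choices.append(e_t)
        w = [w[i] * (1 - epsilon) ** costs[i] for i in range(n)]
    return choices
```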

  13. Hedge Performance
     Theorem: Let $x_1, x_2, \ldots$ be the choices of Hedge($\epsilon$). Then
       $E\left[\sum_{t=1}^T c_t(x_t)\right] \le \frac{\mathrm{OPT}_T}{1 - \epsilon} + \frac{\ln(n)}{\epsilon}$
     where $\mathrm{OPT}_T := \min_i \sum_{t=1}^T c_t(x(e_i, t))$.
     If $\epsilon = \Theta\left(\sqrt{\ln(n)/\mathrm{OPT}}\right)$, the regret is $\Theta\left(\sqrt{\mathrm{OPT} \ln(n)}\right)$.
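
     Filling in the step-size calculation the theorem alludes to (a sketch, assuming $\epsilon \le 1/2$ and that $\mathrm{OPT}_T$ is known or guessed in advance):

     $$
     E\Big[\sum_{t=1}^T c_t(x_t)\Big] - \mathrm{OPT}_T
     \;\le\; \frac{\epsilon}{1-\epsilon}\,\mathrm{OPT}_T + \frac{\ln n}{\epsilon}
     \;\le\; 2\epsilon\,\mathrm{OPT}_T + \frac{\ln n}{\epsilon},
     $$

     and setting $\epsilon = \sqrt{\ln(n)/\mathrm{OPT}_T}$ (assuming this is at most $1/2$) makes both terms $O\!\left(\sqrt{\mathrm{OPT}_T \ln n}\right)$.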

  14. Hedge Analysis
     Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.
     Let $W_t := \sum_i w_{it}$. Then $W_0 = n$ and $W_{T+1} \ge (1 - \epsilon)^{\mathrm{OPT}}$.
     $W_{t+1} = \sum_i w_{it} (1 - \epsilon)^{c_t(x_{it})}$   (1)
     $\quad\;\; = W_t \sum_i p_t(i) (1 - \epsilon)^{c_t(x_{it})}$   [def of $p_t(i)$]   (2)
     $\quad\;\; \le W_t \sum_i p_t(i) \left(1 - \epsilon \cdot c_t(x_{it})\right)$   [Bernoulli's ineq: if $x > -1$ and $r \in (0, 1)$, then $(1+x)^r \le 1 + rx$]   (3)
     $\quad\;\; = W_t \left(1 - \epsilon \cdot E[c_t(x_t)]\right)$   (4)
     $\quad\;\; \le W_t \cdot \exp\left(-\epsilon \cdot E[c_t(x_t)]\right)$   [$1 - x \le e^{-x}$]   (5)

  15. Hedge Analysis
     $W_{T+1} / W_0 \le \exp\left(-\epsilon \sum_{t=1}^T E[c_t(x_t)]\right)$
     $W_0 / W_{T+1} \ge \exp\left(\epsilon \sum_{t=1}^T E[c_t(x_t)]\right)$
     Recall $W_0 = n$ and $W_{T+1} \ge (1 - \epsilon)^{\mathrm{OPT}}$.
     $E\left[\sum_{t=1}^T c_t(x_t)\right] \le \frac{1}{\epsilon} \ln\left(\frac{W_0}{W_{T+1}}\right) \le \frac{\ln(n)}{\epsilon} + \mathrm{OPT} \cdot \frac{-\ln(1 - \epsilon)}{\epsilon} \le \frac{\ln(n)}{\epsilon} + \frac{\mathrm{OPT}}{1 - \epsilon}$

  16. Lower Bound
     If $\epsilon = \Theta\left(\sqrt{\ln(n)/\mathrm{OPT}}\right)$, the regret is $\Theta\left(\sqrt{\mathrm{OPT} \ln(n)}\right)$. Can we do better?
     Let $c_t(x) \sim \mathrm{Bernoulli}(1/2)$ for all $x$ and $t$. Let $Z_i := \sum_{t=1}^T c_t(x(e_i, t))$.
     Then $Z_i \sim \mathrm{Bin}(T, 1/2)$ is roughly normally distributed, with $\sigma = \frac{1}{2}\sqrt{T}$, and
     $P[Z_i \le \mu - k\sigma] = \exp\left(-\Theta(k^2)\right)$.
     We get about $\mu = T/2$; the best choice is likely to get $\mu - \Theta\left(\sqrt{T \ln(n)}\right) = \mu - \Theta\left(\sqrt{\mathrm{OPT} \ln(n)}\right)$.

  17. What have we shown?
     ● Simple algorithm that learns to do nearly as well as the best fixed choice.
     ● Hedge can exploit any pattern that the best choice does.
     ● Works for adaptive adversaries.
     ● Suitable for playing repeated games. Related ideas appear in the Algorithmic Game Theory literature.

  18. Related Questions
     ● Optimize and get no regret against richer classes of strategies/experts:
       – All distributions over experts
       – All sequences of experts that have K transitions [Auer et al '02]
       – Various classes of functions of input features [Blum & Mansour '05]
         ● E.g., consider time of day when choosing a driving route.
       – Arbitrary convex set of experts, metric space of experts, etc., with linear, convex, or Lipschitz costs [Zinkevich '03, Kleinberg et al '08]
       – All policies of a K-state, initially unknown Markov Decision Process that models the world [Auer et al '08]
       – Arbitrary sets of strategies in $\mathbb{R}^n$ with linear costs that we can optimize offline [Hannan '57, Kalai & Vempala '02]

  19. Related Questions
     ● Other notions of regret (see e.g. [Blum & Mansour '05]):
     ● Time selection functions:
       – get low regret on Mondays, rainy days, etc.
     ● Sleeping experts:
       – if the rule “if (P) then predict Q” is right 90% of the time it applies, be right 89% of the time P applies.
     ● Internal regret & swap regret:
       – if you played $x_1, \ldots, x_T$, then have no regret against $g(x_1), \ldots, g(x_T)$ for every $g: X \to X$.

  20. Sleeping Experts [Freund et al '97, Blum '97, Blum & Mansour '05]
     ● If the rule “if (P) then predict Q” is right 90% of the time it applies, be right 89% of the time P applies. Get this for every rule simultaneously.
     ● Idea: Generate lots of hypotheses that “specialize” on certain inputs, some good, some lousy, and combine them into a great classifier.
     ● Many applications: document classification, spam filtering, adaptive UIs, ...
       – if (“physics” in D) then classify D as “science”.
     ● Predicates can overlap.

  21. Sleeping Experts
     ● Predicates can overlap.
     ● E.g., predict college major given the classes C you're enrolled in:
       – if (ML-101, CS-201 in C) then CS
       – if (ML-101, Stats-201 in C) then Stats
     ● What do we predict for students enrolled in ML-101, CS-201, and Stats-201?

  22. Sleeping Experts [Algorithm from Blum & Mansour '05]
     SleepingExperts($\beta$, E, F)
     Input: $\beta \in (0, 1)$, experts $E$, time selection functions $F$.
     Initialize $w^0_{e,f} = 1$ for all $e \in E$, $f \in F$.
     In each round $t$:
       Let $w^t_e = \sum_f f(t)\, w^t_{e,f}$. Let $W^t = \sum_e w^t_e$. Let $p^t_e = w^t_e / W^t$.
       Choose expert $e_t$ from the categorical distribution $p^t$.
       Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
       For each $e \in E$, $f \in F$: $w^{t+1}_{e,f} = w^t_{e,f} \cdot \beta^{f(t)\left(c_t(e) - \beta E[c_t(e_t)]\right)}$
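
     A Python sketch of this update (a rough rendering: the expected cost $E[c_t(e_t)]$ is computed under $p^t$, and it assumes at least one time selection function is awake each round; the data-structure choices are mine):

```python
import random

def sleeping_experts(beta, n_experts, selectors, cost_matrix):
    """Sleeping-experts style update sketched from the slide above.

    selectors[f](t) in {0, 1} says whether time selection function f is awake
    in round t; cost_matrix[t][e] = c_t(e) in [0, 1].
    """
    m = len(selectors)
    w = {(e, f): 1.0 for e in range(n_experts) for f in range(m)}
    for t, costs in enumerate(cost_matrix):
        awake = [f for f in range(m) if selectors[f](t)]    # assumed nonempty
        w_e = [sum(w[(e, f)] for f in awake) for e in range(n_experts)]
        W = sum(w_e)
        p = [we / W for we in w_e]                          # p^t_e = w^t_e / W^t
        e_t = random.choices(range(n_experts), weights=p)[0]  # expert we follow this round
        expected_cost = sum(p[e] * costs[e] for e in range(n_experts))  # E[c_t(e_t)]
        for e in range(n_experts):
            for f in awake:                                 # f(t) = 0 leaves w_{e,f} unchanged
                w[(e, f)] *= beta ** (costs[e] - beta * expected_cost)
    return w
```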

  23. Sleeping Experts [Algorithm from Blum & Mansour '05]
     The update $w^{t+1}_{e,f} = w^t_{e,f} \cdot \beta^{f(t)\left(c_t(e) - \beta E[c_t(e_t)]\right)}$ ensures that the total sum of weights $\sum_{e,f} w^t_{e,f}$ can never increase, so it is $\le nm$ for all $t$.
     $w^T_{e,f} = \prod_{t \ge 0} \beta^{f(t)\left(c_t(e) - \beta E[c_t(e_t)]\right)} = \beta^{\sum_{t \ge 0} f(t)\left(c_t(e) - \beta E[c_t(e_t)]\right)} \le nm$
