Linear (and contextual) Bandits: Rich decision sets (and side information)


  1. Linear (and contextual) Bandits: Rich decision sets (and side information). Sham M. Kakade, Machine Learning for Big Data, CSE547/STAT548, University of Washington.

  2. Announcements... Poster session: June 1, 9-11:30am. Request: CSE grad students, could you please help others with poster printing? Aravind: ask by 2pm on Wednesday for help with printing. Prepare at most a 2-minute verbal summary. Come early to set up. Submit your poster on Canvas. Due dates: please be on time. Today: review of linear bandits, then contextual bandits, and perhaps game trees.

  3. Review

  4. Bandits in practice: two major issues. First, the decision space is very large (e.g. drug cocktails, ad design). Second, we often have "side information" when making a decision (e.g. the history of a user).

  5. More real motivations...

  6. Linear bandits: an additive-effects model. Suppose that each round we take a decision $x \in D \subset \mathbb{R}^d$: $x$ may be a path on a graph, a feature vector of properties of an ad, or an indicator of which drugs are being taken. Upon taking action $x$, we get reward $r$ with expectation $E[r \mid x] = \mu^\top x$, so there are only $d$ unknown parameters (and "effectively" $2^d$ actions). We desire an algorithm $\mathcal{A}$ (mapping histories to decisions) which has low regret: $T \mu^\top x^* - \sum_{t=1}^{T} E[\mu^\top x_t \mid \mathcal{A}] \le\ ??$, where $x^*$ is the best decision.
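To make the feedback model concrete, here is a minimal sketch of a linear-bandit environment: rewards are noisy linear functions of the chosen feature vector. All names (mu, decisions, the noise level) are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu = rng.normal(size=d)                # unknown parameter (hidden from the learner)
decisions = rng.normal(size=(20, d))   # decision set D: 20 feasible feature vectors

def pull(x, noise=0.1):
    """Observe a reward r with E[r | x] = mu^T x."""
    return mu @ x + noise * rng.normal()

# The best fixed decision x*, used only to measure regret.
x_star = decisions[np.argmax(decisions @ mu)]
```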

  7. Example: shortest paths...

  8. Algorithm idea: again, let's think of optimism in the face of uncertainty. We have observed rewards $r_1, \ldots, r_{t-1}$ and have taken decisions $x_1, \ldots, x_{t-1}$. Questions: what is an estimate of the expected reward $E[r \mid x]$, and what is our uncertainty in it? What is an estimate of $\mu$, and what is our uncertainty in it?

  9. Regression! Define $A_t := \sum_{\tau < t} x_\tau x_\tau^\top + \lambda I$ and $b_t := \sum_{\tau < t} x_\tau r_\tau$. Our estimate of $\mu$ is $\hat{\mu}_t = A_t^{-1} b_t$. Confidence of our estimate: $\|\mu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t)$.
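A minimal sketch of this ridge-regression estimate, maintaining the sufficient statistics $A_t$ and $b_t$ incrementally. Variable and function names are mine, not from the slides.

```python
import numpy as np

d, lam = 5, 1.0
A = lam * np.eye(d)   # A_t, initialized to lambda * I
b = np.zeros(d)       # b_t, initialized to zero

def update(x, r):
    """Fold one (decision, reward) pair into the sufficient statistics."""
    global A, b
    A += np.outer(x, x)   # A_t accumulates x_tau x_tau^T
    b += r * x            # b_t accumulates x_tau r_tau

def mu_hat():
    """Current estimate mu_hat_t = A_t^{-1} b_t."""
    return np.linalg.solve(A, b)
```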

  10. LinUCB: again, optimism in the face of uncertainty. Define the confidence region $B_t := \{\nu \mid \|\nu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t)\}$. (LinUCB) take action $x_t = \operatorname{argmax}_{x \in D} \max_{\nu \in B_t} \nu^\top x$, then update $A_t$, $B_t$, $b_t$, and $\hat{\mu}_t$. Equivalently, take action $x_t = \operatorname{argmax}_{x \in D}\, \hat{\mu}_t^\top x + \sqrt{(d \log t)\, x^\top A_t^{-1} x}$.
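A minimal sketch of the closed-form selection rule above, with `beta` standing in for the $O(d \log t)$ confidence radius (the exact constant is an assumption here):

```python
import numpy as np

def linucb_choose(decisions, A, b, beta):
    """decisions: (n, d) array of feasible x's; A, b as on the previous slide."""
    mu_hat = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    means = decisions @ mu_hat                        # exploitation term mu_hat^T x
    # exploration bonus sqrt(beta * x^T A^{-1} x), computed per row
    widths = np.sqrt(beta * np.einsum('ij,jk,ik->i', decisions, A_inv, decisions))
    return decisions[np.argmax(means + widths)]
```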

  11. LinUCB: Geometry

  12. LinUCB: Confidence intervals

  13. Today

  14. Regret bound of LinUCB: $T \mu^\top x^* - \sum_{t=1}^{T} E[\mu^\top x_t] \le \tilde{O}(d \sqrt{T})$ (this is the best possible, up to log factors). Compare to $O(\sqrt{KT})$ for $K$ arms: the bound is independent of the number of actions, and the $K$-armed case is a special case. Thompson sampling is also a good algorithm in practice.
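Since the slide notes that Thompson sampling works well in practice, here is a minimal sketch for the linear model: sample a parameter from a Gaussian posterior $N(\hat{\mu}_t, \sigma^2 A_t^{-1})$ and act greedily on the sample. The noise scale and posterior form are illustrative assumptions.

```python
import numpy as np

def thompson_choose(decisions, A, b, sigma=0.1, rng=np.random.default_rng()):
    """Pick a decision by sampling from the posterior over mu."""
    mu_hat = np.linalg.solve(A, b)
    cov = sigma**2 * np.linalg.inv(A)                  # posterior covariance
    mu_sample = rng.multivariate_normal(mu_hat, cov)   # one posterior sample
    return decisions[np.argmax(decisions @ mu_sample)]  # greedy w.r.t. the sample
```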

  15. Proof idea... Statistics: we need to show that $B_t$ is a valid confidence region. Geometric lemma: the regret is upper bounded by the log of the ratio of the volume of the posterior covariance to the volume of the prior covariance. Then just bound the worst-case log volume change.

  16. What about context?

  17. The Contextual Bandit Game. For $t = 1, 2, \ldots$: at each time $t$, we obtain a context $c_t$ (e.g. side information, user information); our feasible action set is $A_t$; we choose arm $a_t \in A_t$ and receive reward $r_{t, a_t}$. (What assumptions do we make on the reward process?) Goal: an algorithm $\mathcal{A}$ with low regret, $E[\sum_t (r_{t, a_t^*} - r_t) \mid \mathcal{A}] \le\ ??$, where $E[r_{t, a_t^*}]$ is the optimal expected reward at time $t$. A skeletal version of this protocol is sketched below.
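A skeleton of the interaction protocol, as promised above. The environment methods (`get_context`, `feasible_actions`, `get_reward`) are hypothetical placeholders for whatever supplies contexts and rewards.

```python
def contextual_bandit_game(policy, env, T):
    """Run the contextual bandit protocol for T rounds."""
    history = []
    for t in range(T):
        c_t = env.get_context()            # side information arrives
        A_t = env.feasible_actions(c_t)    # feasible action set may vary with t
        a_t = policy(history, c_t, A_t)    # algorithm maps history + context to an arm
        r_t = env.get_reward(c_t, a_t)     # reward observed only for the chosen arm
        history.append((c_t, a_t, r_t))
    return history
```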

  18. How should we model outcomes? Example: ad (or movie, song, etc.) prediction. What is the probability that a user $u$ clicks on an ad $a$? How should we model the click probability of $a$ for user $u$? Featurizations: suppose we have $\phi_{\mathrm{ad}}(a) \in \mathbb{R}^{d_{\mathrm{ad}}}$ and $\phi_{\mathrm{user}}(u) \in \mathbb{R}^{d_{\mathrm{user}}}$. We could make an "outer product" feature vector $x$ as $x(a, u) = \mathrm{vec}(\phi_{\mathrm{ad}}(a)\, \phi_{\mathrm{user}}(u)^\top) \in \mathbb{R}^{d_{\mathrm{ad}} d_{\mathrm{user}}}$, and model the probabilities as $E[\mathrm{click} = 1 \mid a, u] = \mu^\top x(a, u)$ (or log-linear). How do we estimate $\mu$?
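A minimal sketch of this "outer product" featurization; the dimensions and feature values are illustrative only.

```python
import numpy as np

phi_ad = np.array([1.0, 0.0, 0.5])      # phi_ad(a) in R^{d_ad}
phi_user = np.array([0.2, 0.8])         # phi_user(u) in R^{d_user}

# x(a, u) = vec(phi_ad(a) phi_user(u)^T) in R^{d_ad * d_user}
x = np.outer(phi_ad, phi_user).ravel()
assert x.shape == (phi_ad.size * phi_user.size,)
# The click probability is then modeled as mu @ x for an unknown mu
# of matching dimension.
```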

  19. Contextual linear bandits. Suppose that each round $t$ we take a decision $x \in D_t \subset \mathbb{R}^d$, where $D_t$ may be time-varying: map each ad/user pair to $x(a, u_t)$, so $D_t = \{x(a, u_t) \mid a \text{ is a feasible ad at time } t\}$. Our decision is a feature vector $x \in D_t$. Upon taking action $x_t \in D_t$, we get reward $r_t$ with expectation $E[r_t \mid x_t] = \mu^\top x_t$ (here $\mu$ is assumed constant over time). Our regret: $E[\sum_t (\mu^\top x_{t, a_t^*} - \mu^\top x_t) \mid \mathcal{A}] \le\ ??$, where $x_{t, a_t^*}$ is the best decision at time $t$.

  20. Algorithm: let's just run LinUCB (or Thompson sampling). Nothing really changes: $A_t$ and $b_t$ have the same updating rules; now our decision is $x_t = \operatorname{argmax}_{x \in D_t} \max_{\nu \in B_t} \nu^\top x$, i.e. $x_t = \operatorname{argmax}_{x \in D_t}\, \hat{\mu}_t^\top x + \sqrt{(d \log t)\, x^\top A_t^{-1} x}$. The regret bound is still $\tilde{O}(d \sqrt{T})$.
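Putting the pieces together, a minimal sketch of LinUCB run over a time-varying decision set $D_t$, as the slide describes. The environment methods (`decision_set`, `reward`) are hypothetical stand-ins; the statistics and selection rule match the earlier slides, and the choice `beta = d log(t+1)` is an illustrative stand-in for the $O(d \log t)$ radius.

```python
import numpy as np

def contextual_linucb(env, d, T, lam=1.0):
    """Run LinUCB for T rounds over time-varying decision sets D_t."""
    A, b = lam * np.eye(d), np.zeros(d)
    for t in range(1, T + 1):
        D_t = env.decision_set(t)           # (n_t, d) array, may change each round
        beta = d * np.log(t + 1)            # stand-in for the O(d log t) radius
        mu_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        ucb = D_t @ mu_hat + np.sqrt(beta * np.einsum('ij,jk,ik->i', D_t, A_inv, D_t))
        x_t = D_t[np.argmax(ucb)]           # optimistic choice from D_t
        r_t = env.reward(x_t)               # noisy reward with E[r | x] = mu^T x
        A += np.outer(x_t, x_t)             # same updates as plain LinUCB
        b += r_t * x_t
    return np.linalg.solve(A, b)            # final estimate of mu
```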

  21. Acknowledgements:
      http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
      https://sites.google.com/site/banditstutorial/
      http://www.yisongyue.com/courses/cs159/lectures/LinUCB.pdf
