Linear (and contextual) Bandits: Rich decision sets (and side information)
Sham M. Kakade
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Announcements...
Poster session: June 1, 9-11:30a.
Request: CSE grad students, could you please help others with poster printing? (Aravind: ask by 2p on Wednesday for help printing.)
Prepare, at most, a 2 minute verbal summary. Come early to set up. Submit your poster on Canvas.
Due dates: please be on time.
Today: review of linear bandits, then contextual bandits (and game trees?).
The decision space is very large:
- Drug cocktails
- Ad design
We often have “side information” when making a decision:
- history of a user
An additive effects model. Suppose each round we take a decision $x \in D \subset \mathbb{R}^d$:
- $x$ is a path in a graph
- $x$ is a feature vector of properties of an ad
- $x$ indicates which drugs are being taken
Upon taking action $x$, we get reward $r$ with expectation:
$$\mathbb{E}[r \mid x] = \mu^\top x$$
We desire an algorithm $A$ (mapping histories to decisions) which has low regret:
$$T \mu^\top x^* - \sum_{t=1}^{T} \mathbb{E}[\mu^\top x_t \mid A] \;\le\; ??$$
(where $x^*$ is the best decision)
Again, let's think of optimism in the face of uncertainty. We have observed rewards $r_1, \ldots, r_{t-1}$ and taken decisions $x_1, \ldots, x_{t-1}$. Questions:
- What is an estimate of the expected reward $\mathbb{E}[r \mid x]$, and what is our uncertainty?
- What is an estimate of $\mu$, and what is our uncertainty?
Define:
$$A_t := \sum_{\tau < t} x_\tau x_\tau^\top + \lambda I, \qquad b_t := \sum_{\tau < t} x_\tau r_\tau$$
Our estimate of $\mu$:
$$\hat{\mu}_t = A_t^{-1} b_t$$
Confidence of our estimate:
$$\|\mu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t)$$
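A minimal numpy sketch of these statistics and the estimate $\hat{\mu}_t = A_t^{-1} b_t$ (the names `A`, `b`, `lam`, and the dimension `d` are illustrative choices, not from the slides):

```python
import numpy as np

# Initialization: A_0 = lambda * I, b_0 = 0 (lam and d are illustrative).
d, lam = 5, 1.0
A, b = lam * np.eye(d), np.zeros(d)

def update_statistics(A, b, x, r):
    """Rank-one update after observing reward r for decision x:
    A_{t+1} = A_t + x x^T,  b_{t+1} = b_t + r x."""
    return A + np.outer(x, x), b + r * x

def ridge_estimate(A, b):
    """mu_hat_t = A_t^{-1} b_t."""
    return np.linalg.solve(A, b)
```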
Again, optimism in the face of uncertainty. Define the confidence region:
$$B_t := \{\nu \;:\; \|\nu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t)\}$$
(LinUCB) Take action:
$$x_t = \arg\max_{x \in D} \max_{\nu \in B_t} \nu^\top x$$
then update $A_t$, $B_t$, $b_t$, and $\hat{\mu}_t$. Equivalently, take action:
$$x_t = \arg\max_{x \in D} \; \hat{\mu}_t^\top x + O\big(\sqrt{d \log t}\big) \sqrt{x^\top A_t^{-1} x}$$
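A sketch of one round of LinUCB over a finite decision set, using the closed form above; the exploration coefficient `beta` stands in for the $O(\sqrt{d \log t})$ radius and is an assumption, not a value from the slides:

```python
import numpy as np

def linucb_action(D, A, b, beta):
    """One round of LinUCB over a finite decision set.
    D: (n, d) array whose rows are the feasible decisions x in D.
    Picks argmax_x  mu_hat^T x + beta * sqrt(x^T A^{-1} x)."""
    mu_hat = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    # sqrt(x^T A^{-1} x) for each row x of D
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D, A_inv, D))
    return D[np.argmax(D @ mu_hat + beta * widths)]
```

Each round, this pairs with the `update_statistics` step sketched earlier: play the returned $x_t$, observe $r_t$, then update $A_t$ and $b_t$.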
Regret bound of LinUCB:
$$T \mu^\top x^* - \sum_{t=1}^{T} \mathbb{E}[\mu^\top x_t] \;\le\; \tilde{O}(d \sqrt{T})$$
(this is the best possible, up to log factors). Compare to the $O(\sqrt{KT})$ bound for the $K$-armed case.
The bound is independent of the number of actions; the $K$-armed case is a special case.
Thompson sampling: This is a good algorithm in practice.
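For comparison, a sketch of linear Thompson sampling under a Gaussian posterior, reusing the $A_t$, $b_t$ statistics above (the scale `v` is a tunable assumption):

```python
import numpy as np

def thompson_action(D, A, b, v=1.0):
    """Sample nu ~ N(mu_hat_t, v^2 * A_t^{-1}), then play argmax_x nu^T x."""
    mu_hat = np.linalg.solve(A, b)
    nu = np.random.multivariate_normal(mu_hat, v**2 * np.linalg.inv(A))
    return D[np.argmax(D @ nu)]
```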
- Stats: need to show that $B_t$ is a valid confidence region.
- Geometric lemma: the regret is upper bounded by the log ratio (volume of posterior covariance) / (volume of prior covariance). Then just bound the worst-case log-volume change.
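One standard way to make the volume argument precise is the elliptical potential lemma; a sketch of the statement, assuming $\|x_t\| \le 1$ and $A_t$ as defined above:
$$\sum_{t=1}^{T} \min\big(1, \|x_t\|_{A_t^{-1}}^2\big) \;\le\; 2 \log \frac{\det A_{T+1}}{\det(\lambda I)} \;\le\; 2d \log\Big(1 + \frac{T}{\lambda d}\Big)$$
Combining this worst-case log-determinant growth with the per-round confidence width via Cauchy-Schwarz yields the $\tilde{O}(d\sqrt{T})$ regret bound.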
Game: for $t = 1, 2, \ldots$
- At each time $t$, we obtain context (e.g. side information, user information) $c_t$.
- Our feasible action set is $A_t$. We choose arm $a_t \in A_t$ and receive reward $r_{t, a_t}$. (What assumptions on the reward process?)
Goal: an algorithm $A$ with low regret:
$$\mathbb{E}\Big[\sum_t (r_{t, a_t^*} - r_t) \,\Big|\, A\Big] \;\le\; ??$$
where $\mathbb{E}[r_{t, a_t^*}]$ is the optimal expected reward at time $t$.
Example: ad (or movie, song, etc.) prediction. What is the probability that a user $u$ clicks on an ad $a$? How should we model the click probability of $a$ for user $u$?
Featurization: suppose we have $\phi_{\text{ad}}(a) \in \mathbb{R}^{d_{\text{ad}}}$ and $\phi_{\text{user}}(u) \in \mathbb{R}^{d_{\text{user}}}$. We could make an “outer product” feature vector $x$ as:
$$x(a, u) = \mathrm{vec}\big(\phi_{\text{ad}}(a)\, \phi_{\text{user}}(u)^\top\big) \in \mathbb{R}^{d_{\text{ad}} d_{\text{user}}}$$
We could model the click probability as:
$$\mathbb{E}[\text{click} = 1 \mid a, u] = \mu^\top x(a, u)$$
(or log-linear). How do we estimate $\mu$?
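A sketch of the outer-product featurization (the feature vectors here are hypothetical placeholders):

```python
import numpy as np

def joint_features(phi_ad, phi_user):
    """x(a, u) = vec(phi_ad(a) phi_user(u)^T), of dimension d_ad * d_user."""
    return np.outer(phi_ad, phi_user).ravel()

# Hypothetical example: 3 ad features and 2 user features give a 6-dim x(a, u).
x = joint_features(np.array([1.0, 0.5, 0.0]), np.array([0.2, 0.8]))
```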
Suppose each round $t$ we take a decision $x \in D_t \subset \mathbb{R}^d$ ($D_t$ may be time varying):
- Map each ad/user pair to $x(a, u_t)$, so $D_t = \{x(a, u_t) \mid a \text{ is a feasible ad at time } t\}$. Our decision is a feature vector $x \in D_t$.
- Upon taking action $x_t \in D_t$, we get reward $r_t$ with expectation $\mathbb{E}[r_t \mid x_t] = \mu^\top x_t$ (here $\mu$ is assumed constant over time).
Our regret:
$$\mathbb{E}\Big[\sum_t (\mu^\top x_{t, a_t^*} - \mu^\top x_t) \,\Big|\, A\Big] \;\le\; ??$$
(where $x_{t, a_t^*}$ is the best decision at time $t$)
Let's just run LinUCB (or Thompson sampling). Nothing really changes:
- $A_t$ and $b_t$ have the same update rules.
- Now our decision is:
$$x_t = \arg\max_{x \in D_t} \max_{\nu \in B_t} \nu^\top x, \quad \text{i.e.} \quad x_t = \arg\max_{x \in D_t} \; \hat{\mu}_t^\top x + O\big(\sqrt{d \log t}\big) \sqrt{x^\top A_t^{-1} x}$$
Regret bound is still $\tilde{O}(d \sqrt{T})$.
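Putting the pieces together, a sketch of contextual LinUCB with a time-varying decision set, reusing the helpers sketched earlier; `get_decision_set` and `pull` are hypothetical environment hooks, and `c` is an assumed tuning constant:

```python
import numpy as np

def contextual_linucb(T, d, get_decision_set, pull, lam=1.0, c=1.0):
    """Run LinUCB for T rounds with a time-varying decision set D_t.
    get_decision_set(t): (n_t, d) array whose rows are the feasible x in D_t.
    pull(x): returns the observed reward (mean mu^T x)."""
    A, b = lam * np.eye(d), np.zeros(d)
    for t in range(1, T + 1):
        D = get_decision_set(t)
        mu_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        beta = c * np.sqrt(d * np.log(t + 1))   # stand-in for the O(sqrt(d log t)) radius
        ucb = D @ mu_hat + beta * np.sqrt(np.einsum('ij,jk,ik->i', D, A_inv, D))
        x = D[np.argmax(ucb)]
        r = pull(x)
        A += np.outer(x, x)                      # same updates as the fixed-D case
        b += r * x
    return np.linalg.solve(A, b)
```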
References:
- http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
- https://sites.google.com/site/banditstutorial/
- http://www.yisongyue.com/courses/cs159/lectures/LinUCB.pdf