Reducing contextual bandits to supervised learning
Daniel Hsu
Columbia University
Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire
Learning to interact, example #1: practicing physician.
Loop: observe a patient (context), prescribe a treatment (action), observe the patient's health outcome (reward).
Goal: prescribe treatments that yield good health outcomes.
Learning to interact, example #2: website operator.
Loop: a user arrives (context), choose content to display (action), observe the user's response (reward).
Goal: choose content that yields desired user behavior.
The contextual bandit setting. For t = 1, 2, . . . , T:
1. Nature draws (xt, rt) from a distribution D over X × [0, 1]^A.
2. Observe context xt [e.g., user profile, search query].
3. Pick action at ∈ A [e.g., ad to display].
4. Collect reward rt(at) [e.g., 1 if click, 0 otherwise].
Task: choose at's that yield high expected reward (w.r.t. D).
Contextual: use features xt to choose good actions at.
Bandit: rt(a) for a ≠ at is not observed.
(Non-bandit setting: the whole reward vector rt ∈ [0, 1]^A is observed.)
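To make the protocol concrete, here is a minimal simulation sketch in Python; the context distribution, reward model, and the placeholder `choose_action` learner are made-up for illustration and are not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 3, 1000   # |A| actions, context dimension, number of rounds

# Hypothetical stand-in for nature's distribution D over (context, reward vector).
true_weights = rng.uniform(size=(K, d))

def draw_from_D():
    x = rng.uniform(size=d)                           # context x_t
    mean_r = (true_weights @ x) / d                   # expected rewards in [0, 1]
    r = (rng.uniform(size=K) < mean_r).astype(float)  # realized reward vector r_t
    return x, r

def choose_action(x):
    # Placeholder learner: uniformly random (no learning yet).
    return int(rng.integers(K))

total_reward = 0.0
for t in range(T):
    x_t, r_t = draw_from_D()      # nature draws (x_t, r_t) ~ D
    a_t = choose_action(x_t)      # learner sees only x_t when choosing a_t
    total_reward += r_t[a_t]      # bandit feedback: only r_t(a_t) is observed

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```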
◮ Use what you've already learned (exploit), but also learn about actions that could be good (explore).
◮ Must balance exploration and exploitation to get good statistical performance.
◮ Want to do as well as the best policy (i.e., decision rule) π: context x → action a from some policy class Π (a set of decision rules).
◮ Computationally constrained w/ large Π (e.g., all decision trees).
Regret (i.e., relative performance) to a policy class Π:

    max_{π∈Π} (1/T) Σ_{t=1}^T rt(π(xt)) − (1/T) Σ_{t=1}^T rt(at)

Strong benchmark if Π contains a policy w/ high expected reward!
Goal: regret → 0 as fast as possible as T → ∞.
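As an illustration (not from the talk), empirical regret against a small, explicitly enumerable policy class can be computed directly in simulation, where the full reward vectors are available to the evaluator; the constant policies below are made-up placeholders.

```python
import numpy as np

def empirical_regret(contexts, rewards, actions, policy_class):
    """rewards[t] is the full reward vector r_t (known only to the simulator);
    actions[t] is the action a_t the learner actually played."""
    T = len(rewards)
    learner_avg = np.mean([rewards[t][actions[t]] for t in range(T)])
    best_avg = max(np.mean([rewards[t][pi(contexts[t])] for t in range(T)])
                   for pi in policy_class)
    return best_avg - learner_avg

# Toy usage: 2 actions, a policy class of two constant decision rules.
rng = np.random.default_rng(1)
T = 200
contexts = [rng.uniform(size=3) for _ in range(T)]
rewards = [rng.uniform(size=2) for _ in range(T)]
actions = [int(rng.integers(2)) for _ in range(T)]
policy_class = [lambda x: 0, lambda x: 1]
print(empirical_regret(contexts, rewards, actions, policy_class))
```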
New fast and simple algorithm for contextual bandits.
◮ Operates via a computationally-efficient reduction to supervised learning.
◮ Statistically (near) optimal regret bound.
No-exploration approach:
1. Learn to predict reward ˆr(a | x) for each action a ∈ A based on context x ∈ X.
2. Deploy the policy ˆπ given by ˆπ(x) := arg max_{a∈A} ˆr(a | x), and collect more data.
Suffers from selection bias.
Example: two contexts {X, Y}, two actions {A, B}. Suppose the initial policy says ˆπ(X) = A and ˆπ(Y) = B.

Observed rewards:
      A     B
X    0.7    —
Y     —    0.1

Reward estimates:
      A     B
X    0.7   0.5
Y    0.5   0.1

New policy: ˆπ′(X) = ˆπ′(Y) = A.

Observed rewards:
      A     B
X    0.7    —
Y    0.3   0.1

Reward estimates:
      A     B
X    0.7   0.5
Y    0.3   0.1

True rewards:
      A     B
X    0.7   1.0
Y    0.3   0.1

Never try action B in context X. Ω(1) regret.
Feedback in round t: reward of chosen action rt(at).
◮ Tells us about policies π ∈ Π s.t. π(xt) = at.
◮ Not informative about other policies!
Possible approach: track average reward of each π ∈ Π.
◮ Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
◮ Statistically optimal regret bound O(√(TK log N)), with K := |A| and N := |Π|.
◮ Explicit bookkeeping is computationally intractable for large N.
But perhaps the policy class Π has some structure . . .
If we observed rewards for all actions . . .
◮ Like supervised learning: we have labeled data after t rounds, (x1, ρ1), . . . , (xt, ρt) ∈ X × R^A.
    context → features, actions → classes, rewards → −costs, policy → classifier.
◮ Can often exploit structure of Π to get tractable algorithms.
Abstraction: arg max oracle (AMO):

    AMO({(xi, ρi)}_{i=1}^t) := arg max_{π∈Π} Σ_{i=1}^t ρi(π(xi)).

Can't directly use this in the bandit setting.
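A minimal sketch of the AMO interface for a tiny, explicitly enumerable policy class; practical implementations would instead call a cost-sensitive classification learner, and the threshold policies here are made-up placeholders.

```python
import numpy as np

def amo(data, policy_class):
    """Arg max oracle: return the policy maximizing the cumulative reward
    sum_i rho_i(pi(x_i)) over labeled data [(x_1, rho_1), ..., (x_t, rho_t)]."""
    return max(policy_class, key=lambda pi: sum(rho[pi(x)] for x, rho in data))

# Toy usage: scalar contexts, 2 actions, threshold decision rules.
policy_class = [(lambda thr: (lambda x: int(x > thr)))(thr)
                for thr in np.linspace(0.0, 1.0, 11)]
rng = np.random.default_rng(2)
data = [(rng.uniform(), rng.uniform(size=2)) for _ in range(50)]
best_policy = amo(data, policy_class)
```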
Explore-then-exploit:
1. Explore: pick actions uniformly at random for rounds t = 1, . . . , τ, giving unbiased estimates ˆrt of rt for all t ≤ τ.
2. Compute ˆπ := AMO({(xt, ˆrt)}_{t=1}^τ).
3. Exploit: play at := ˆπ(xt) for t = τ+1, τ+2, . . . , T.
Regret bound with the best choice of τ: ∼ T^{−1/3} (sub-optimal).
(Dependencies on |A| and |Π| hidden.)
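A sketch of the explore-then-exploit scheme, reusing an `amo` like the one sketched above; the `env` callable standing in for draws from D is a made-up placeholder.

```python
import numpy as np

def explore_then_exploit(tau, T, K, env, amo, policy_class, seed=0):
    """Uniform exploration for tau rounds with importance-weighted reward
    estimates, a single AMO call, then exploitation of the learned policy.
    env(t) -> (x_t, r_t) is a simulator stand-in for nature's draws from D."""
    rng = np.random.default_rng(seed)
    data, total = [], 0.0
    for t in range(tau):                 # explore: a_t uniform over A
        x, r = env(t)
        a = int(rng.integers(K))
        total += r[a]
        r_hat = np.zeros(K)
        r_hat[a] = K * r[a]              # importance weight 1/p_t(a) = K
        data.append((x, r_hat))
    pi_hat = amo(data, policy_class)     # one call to the oracle
    for t in range(tau, T):              # exploit: follow pi_hat
        x, r = env(t)
        total += r[pi_hat(x)]
    return pi_hat, total / T
```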
◮ Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995): optimal regret, but explicitly enumerates Π.
◮ Greedy (Langford & Zhang, NIPS 2007): sub-optimal regret, but one call to AMO.
◮ Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011): near-optimal regret, but O(T^6) calls to AMO.
Let K := |A| and N := |Π|. Our result: a new, fast and simple algorithm.
◮ Regret bound: Õ(√(KT log N)). Near optimal.
◮ # calls to AMO: Õ(√(KT / log N)). Less than once per round!
Components of the new algorithm:
Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS
Recall: for t = 1, 2, . . . , T, observe context xt [e.g., user profile, search query], pick action at [e.g., ad to display], and collect reward rt(at) [e.g., 1 if click, 0 otherwise].
Q: How do I learn about rt(a) for actions a I don't actually take?
A: Randomize. Draw at ∼ pt for some pre-specified prob. dist. pt.
Importance-weighted estimate of reward from round t:

    ∀a ∈ A: ˆrt(a) := rt(at) · 1{a = at} / pt(at) = rt(at)/pt(at) if a = at, and 0 otherwise.

Unbiasedness:

    E_{at∼pt}[ˆrt(a)] = Σ_{a′∈A} pt(a′) · rt(a′) · 1{a = a′} / pt(a′) = rt(a).

Range and variance: upper-bounded by 1/pt(a).

Estimate the average reward of a policy: Rewt(π) := (1/t) Σ_{i=1}^t ˆri(π(xi)).

How should we choose the pt?
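The importance-weighted estimator and the plug-in policy-value estimate, written out in Python; the quick Monte Carlo check at the end (with made-up numbers) only illustrates unbiasedness.

```python
import numpy as np

def ips_estimate(K, a_t, r_at, p_t):
    """Importance-weighted reward vector for one round:
    r_hat(a) = r_t(a_t) * 1{a == a_t} / p_t(a_t)."""
    r_hat = np.zeros(K)
    r_hat[a_t] = r_at / p_t[a_t]
    return r_hat

def estimated_policy_value(history, pi):
    """Rew_t(pi) = (1/t) * sum_i r_hat_i(pi(x_i)); history = [(x_i, r_hat_i), ...]."""
    return float(np.mean([r_hat[pi(x)] for x, r_hat in history]))

# Unbiasedness check: averaging the estimates recovers the true reward vector.
rng = np.random.default_rng(3)
K, p = 3, np.array([0.5, 0.3, 0.2])
r_true = np.array([0.9, 0.4, 0.1])
draws = rng.choice(K, size=100_000, p=p)
print(np.mean([ips_estimate(K, a, r_true[a], p) for a in draws], axis=0))  # ~ r_true
```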
Get action distributions via policy distributions:
    (Q, x) [policy distribution, context] → p [action distribution]
Policy distribution: Q = (Q(π) : π ∈ Π), a probability distribution over policies π in the policy class Π.

1: Pick initial distribution Q1 over policies Π.
2: for round t = 1, 2, . . . do
3:   Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
4:   Observe context xt.
5:   Compute distribution pt over A (using Qt and xt).
6:   Pick action at ∼ pt.
7:   Collect reward rt(at).
8:   Compute new distribution Qt+1 over policies Π.
9: end for
Q: How do we choose Qt for good exploration/exploitation?
Caveat: Qt must be efficiently computable + representable!
Our approach: set up a convex feasibility problem over policy distributions Q such that its solutions yield (near) optimal regret bounds.
The algorithm only accesses Π via calls to AMO ⟹ nnz(Q) = O(# AMO calls).
Convex feasibility problem for policy distribution Q:

    Σ_{π∈Π} Q(π) · Regt(π) ≲ K·µt                                   (Low regret)
    E_x[ 1 / Q^{µt}(π(x) | x) ] ≲ 2K + Regt(π)/µt   ∀π ∈ Π          (Low variance)

(Here µt ≈ √(log N / (Kt)), Regt(π) is the estimated regret of π, Q^{µt} is the µt-smoothed version of Q, and constants are omitted.)

Using a feasible Qt in round t gives near-optimal regret.
But there are |Π| variables and >|Π| constraints, . . .
Solver for the "good policy distribution" problem. Start with some Q (e.g., Q := 0), then repeat:
1. If the "low regret" constraint is violated, shrink: Q := cQ for some c < 1.
2. Find a violated "low variance" constraint—corresponding to some policy π—and update Q(π) := Q(π) + α. (If there is no such violated constraint, stop and return Q.)
(c < 1 and α > 0 have closed-form formulae.)
(Technical detail: Q can be a sub-distribution that sums to less than one.)
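A schematic sketch of the solver's control flow; the two constraint checks and the closed-form `c` and `alpha` are abstracted behind placeholder callables rather than the talk's exact formulae (how the violated-constraint search reduces to one AMO call is described next).

```python
def solve_good_policy_distribution(low_regret_ok, find_violated_variance,
                                   shrink_factor, step_size, max_iters=10_000):
    """Q is a sparse sub-distribution stored as {policy: weight}.
    low_regret_ok(Q)          -> bool: is the "low regret" constraint satisfied?
    find_violated_variance(Q) -> a policy whose "low variance" constraint is
                                 violated (one AMO call hides here), or None.
    shrink_factor(Q), step_size(Q, pi) -> the closed-form c < 1 and alpha > 0."""
    Q = {}                                      # start with Q := 0
    for _ in range(max_iters):
        if not low_regret_ok(Q):                # "low regret" constraint violated
            c = shrink_factor(Q)
            Q = {pi: c * w for pi, w in Q.items()}
        pi = find_violated_variance(Q)
        if pi is None:                          # all "low variance" constraints hold
            return Q                            # feasible (sub-)distribution
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)   # nnz(Q) grows by <= 1 per step
    return Q
```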
Finding a "low variance" constraint violation (one AMO call):
1. Form modified rewards r̃i(a) := ˆri(a) + µ / Q(a | xi) for all a ∈ A, where µ ≈ √(log N / (Kt)).
2. Compute π̃ := AMO({(xi, r̃i)}_{i=1}^t).
3. Rewt(π̃) > threshold iff π̃'s "low variance" constraint is violated.
Solver is coordinate descent for minimizing a potential function

    Φ(Q) := c1 · E_x[ RE( uniformA ‖ Q(·|x) ) ] + c2 · Σ_{π∈Π} Q(π) · Regt(π),

where RE denotes relative entropy and c1, c2 are appropriate constants.
(Partial derivative w.r.t. Q(π) is the "low variance" constraint for π.)
Returns a feasible solution after Õ(√(Kt / log N)) steps.
(Actually use (1 − ε) · Q + ε · uniform inside the RE expression.)
1: Pick initial distribution Q1 over policies Π.
2: for round t = 1, 2, . . . do
3:   Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
4:   Observe context xt.
5:   Compute action distribution pt := Qt( · | xt).
6:   Pick action at ∼ pt.
7:   Collect reward rt(at).
8:   Compute new policy distribution Qt+1 using coordinate descent + AMO.
9: end for
A feasible solution to the "good policy distribution" problem gives a near-optimal regret bound.
New coordinate descent algorithm: repeatedly find a violated constraint and adjust Q to satisfy it.
Analysis: in round t, nnz(Qt+1) = O(# AMO calls) = Õ(√(Kt / log N)).
In round t, coordinate descent for computing Qt+1 requires Õ(√(Kt / log N)) AMO calls.
Computing Qt+1 in every round t = 1, 2, . . . , T would therefore need Õ(√(K / log N) · T^{1.5}) AMO calls over T rounds.
Warm start: to compute Qt+1 with coordinate descent, initialize with Qt.
◮ Then the total number of coordinate descent steps across all T rounds is Õ(√(KT / log N)) (w.h.p.—exploiting the i.i.d. assumption).
◮ ⟹ total # calls to AMO ≤ Õ(√(KT / log N)).
But we still need an AMO call just to check whether Qt is feasible!
Regret analysis: Qt has low instantaneous per-round regret—this also crucially relies on the i.i.d. assumption.
⟹ the same Qt can be used for O(t) more rounds!
Epoch trick: split the T rounds into epochs, and only compute a new Qt at the start of each epoch.
◮ Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, . . . : log T updates.
◮ Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, . . . : √T updates, and still Õ(√(KT / log N)) AMO calls in total.
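A tiny helper illustrating the two update schedules; only the schedule itself is shown, not the regret accounting.

```python
def is_update_round(t, schedule="squares"):
    """Epoch trick: recompute the policy distribution only on these rounds."""
    if schedule == "doubling":          # rounds 2, 4, 8, 16, ...
        return t >= 2 and (t & (t - 1)) == 0
    if schedule == "squares":           # rounds 1, 4, 9, 16, ...
        return round(t ** 0.5) ** 2 == t
    raise ValueError(schedule)

print([t for t in range(1, 101) if is_update_round(t, "squares")])
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```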
Over all T rounds:
◮ Update the policy distribution on rounds 1^2, 2^2, 3^2, 4^2, . . . , i.e., a total of √T times.
◮ Total # calls to AMO: Õ(√(KT / log N)).
◮ # AMO calls per update (on average): Õ(√(K / log N)).
◮ A convex feasibility problem characterizes policy distributions Q that are good for exploration/exploitation: regret ≤ Õ(√(KT log N)).
◮ Coordinate descent finds an Õ(√(Kt / log N))-sparse feasible solution.
◮ Update the policy distribution infrequently (epoch trick); combine with warm start for computational improvements.
Given a policy distribution Q and context x,

    ∀a ∈ A: Q(a | x) := Σ_{π∈Π} Q(π) · 1{π(x) = a}

(so Q → Q(·|x) is a linear map).
We actually use pt := Qt^{µt}( · | xt) := (1 − Kµt) · Qt( · | xt) + µt · 1, so every action has probability at least µt (µt to be determined).
Φ(Q) := t·µt · E_{x∈Ht}[ RE( uniformA ‖ Q^{µt}(·|x) ) ] + Σ_{π∈Π} Q(π) · Regt(π) / (K·t·µt)