Bandits and Exploration: How do we (optimally) gather information?

SLIDE 1

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Machine Learning for Big Data CSE547/STAT548 University of Washington

S. M. Kakade (UW), Optimization for Big data, 1 / 18

SLIDE 2

Announcements...

HW 4 posted soon (short)
Poster session: June 1, 9-11:30a; ask TA/CSE students for help printing
Projects: the term is approaching the end...

Today:
Quick overview: parallelization and deep learning
Bandits:
1. Vanilla k-arm setting
2. Linear bandits and ad-placement
3. Game trees?

SLIDE 3

The problem

In unsupervised learning, we just have data...
In supervised learning, we have inputs X and labels Y (often we spend resources to get these labels).
In reinforcement learning (very general), we act in the world, there is "state", and we observe rewards.
Bandit settings: we have K decisions each round, and we only receive feedback for the chosen decision...

SLIDE 4

Gambling in a casino...

SLIDE 5

Multi-Armed Bandit Game

K independent arms: a ∈ {1, . . . , K}.
Each arm a returns a random reward R_a if pulled.
(Simpler case) Assume R_a is not time varying.
Game: You choose arm a_t at time t. You then observe X_t = R_{a_t}, where R_{a_t} is sampled from the underlying distribution of that arm.
The distribution of R_a is not known.
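The game loop above can be sketched in a few lines; the Bernoulli reward distributions below are hypothetical stand-ins for the unknown R_a, and the round-robin arm choice is just a placeholder for a real strategy:

```python
import random

# A minimal simulation of the k-armed bandit game from this slide.
# The per-arm means are hypothetical; the player never sees them.
K = 3
true_means = [0.2, 0.5, 0.8]  # unknown to the player

def pull(a, rng):
    """Pulling arm a returns a Bernoulli reward with mean true_means[a]."""
    return 1.0 if rng.random() < true_means[a] else 0.0

rng = random.Random(0)
# Play a few rounds with an arbitrary (here: round-robin) choice of arms.
T = 9
rewards = [pull(t % K, rng) for t in range(T)]
print(rewards)  # each X_t is observed only for the arm actually pulled
```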

SLIDE 6

More real motivations...

SLIDE 7

Ad placement...

SLIDE 8

The Goal

We would like to maximize our long-term future reward.
Our (possibly randomized) sequential strategy/algorithm A is:
a_t = A(a_1, X_1, a_2, X_2, . . . , a_{t−1}, X_{t−1})
In T rounds, our expected reward is:
E[ Σ_{t=1}^T X_t | A ]
where the expectation is with respect to the reward process and our algorithm.
Objective: What is a strategy which maximizes our long-term reward?

SLIDE 9

Our Regret

Suppose µ_a = E[R_a]. Assume 0 ≤ µ_a ≤ 1. Let µ* = max_a µ_a.
In expectation, the best we can do is obtain µ*T reward in T steps.
In T rounds, our regret is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ ??
Objective: What is a strategy which makes our regret small?

SLIDE 10

A Naive Strategy

For the first τ rounds, sample each arm τ/K times.
For the remainder of the rounds, choose the arm with the best observed empirical reward.
How good is this strategy? How do we set τ? Let's look at confidence intervals.
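A minimal sketch of this explore-then-exploit strategy, assuming (hypothetically) Bernoulli arms; the arm means and the value of τ below are illustrative, not from the slides:

```python
import random

def explore_then_exploit(means, T, tau, seed=0):
    """Naive strategy: split the first tau rounds evenly over the K arms,
    then commit to the empirically best arm for the remaining rounds.
    `means` are hypothetical Bernoulli parameters, unseen by the strategy."""
    rng = random.Random(seed)
    K = len(means)

    def pull(a):
        return 1.0 if rng.random() < means[a] else 0.0

    total, counts, sums = 0.0, [0] * K, [0.0] * K
    # Exploration: each arm gets tau // K pulls.
    for a in range(K):
        for _ in range(tau // K):
            x = pull(a)
            counts[a] += 1
            sums[a] += x
            total += x
    # Exploitation: commit to the best empirical mean.
    best = max(range(K), key=lambda a: sums[a] / max(counts[a], 1))
    for _ in range(T - K * (tau // K)):
        total += pull(best)
    return total, best

reward, best = explore_then_exploit([0.2, 0.5, 0.8], T=10_000, tau=300)
```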

SLIDE 11

Hoeffding’s bound

If we pull arm a N_a times, our empirical estimate for arm a is:
µ̂_a = (1/N_a) Σ_{t: a_t = a} X_t
By Hoeffding's bound, with probability greater than 1 − δ,
|µ̂_a − µ_a| ≤ O( √( log(1/δ) / N_a ) )
By the union bound, with probability greater than 1 − δ, for all a,
|µ̂_a − µ_a| ≤ O( √( log(K/δ) / N_a ) )
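A quick empirical check of the Hoeffding interval, written with the explicit constant √(log(2/δ)/(2N)) (the O(·) above hides constants); the arm mean, N, and δ are illustrative choices:

```python
import math
import random

# Empirical coverage check: the true mean should fall inside the Hoeffding
# interval at least a (1 - delta) fraction of the time.
mu, N, delta = 0.6, 200, 0.05
width = math.sqrt(math.log(2 / delta) / (2 * N))  # two-sided Hoeffding width

rng = random.Random(1)
trials, covered = 2000, 0
for _ in range(trials):
    mu_hat = sum(rng.random() < mu for _ in range(N)) / N
    if abs(mu_hat - mu) <= width:
        covered += 1
coverage = covered / trials
print(coverage)  # typically well above 1 - delta; Hoeffding is conservative
```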

SLIDE 12

Our regret

(Exploration rounds) What is our regret for the first τ rounds?
(Exploitation rounds) What is our regret for the remaining T − τ rounds?
Our total regret is:
µ*T − Σ_{t=1}^T X_t ≤ τ + O( √( log(K/δ) / (τ/K) ) ) · (T − τ)
How do we choose τ?

SLIDE 13

The Naive Strategy’s Regret

Choose τ = K^{1/3} T^{2/3} and δ = 1/T.
Theorem: Our total (expected) regret is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ O( K^{1/3} T^{2/3} (log(KT))^{1/3} )
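A numeric sanity check (ignoring log factors) that this τ balances the exploration cost τ against the exploitation cost √(K/τ) · T from the previous slide; K and T are illustrative:

```python
# With tau = K^(1/3) T^(2/3), the exploration cost tau and the exploitation
# cost T * sqrt(K / tau) are both of order K^(1/3) T^(2/3), so neither term
# dominates. K and T below are arbitrary example values.
K, T = 10, 1_000_000
tau = (K ** (1 / 3)) * (T ** (2 / 3))
explore_cost = tau
exploit_cost = T * (K / tau) ** 0.5
print(explore_cost, exploit_cost)  # the two costs coincide for this tau
```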

SLIDE 14

Can we be more adaptive?

Are we still pulling arms that we know are sub-optimal? How do we know this?
Let N_{a,t} be the number of times we pulled arm a up to time t.
Confidence interval at time t: with probability greater than 1 − δ,
|µ̂_{a,t} − µ_a| ≤ O( √( log(1/δ) / N_{a,t} ) )
With δ → δ/(TK), the above bound will hold for all arms a ∈ [K] and all timesteps t ≤ T.

SLIDE 15

Example

SLIDE 16

Example

SLIDE 17

Example

SLIDE 18

Confidence Bounds...

SLIDE 19

UCB: a reasonable state of our uncertainty...

SLIDE 20

Upper Confidence Bound (UCB) Algorithm

At each time t,

Pull arm:
a_t = argmax_a [ µ̂_{a,t} + c √( log(KT/δ) / N_{a,t} ) ] =: argmax_a [ µ̂_{a,t} + ConfBound_{a,t} ]
(where c ≤ 10 is a constant).
Observe reward X_t. Update µ̂_{a,t}, N_{a,t}, and ConfBound_{a,t}.

How well does this do?
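A sketch of the UCB rule above, assuming (hypothetically) Bernoulli arms; the constant c = 2, δ, and the arm means are illustrative choices, not from the slides:

```python
import math
import random

def ucb(means, T, delta=0.01, c=2.0, seed=0):
    """UCB: pull the arm maximizing mu_hat(a) + c * sqrt(log(K*T/delta) / N(a)).
    `means` are hypothetical Bernoulli parameters, unseen by the algorithm."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums, total = [0] * K, [0.0] * K, 0.0

    def value(a):
        if counts[a] == 0:
            return float("inf")  # force every arm to be pulled at least once
        mu_hat = sums[a] / counts[a]
        return mu_hat + c * math.sqrt(math.log(K * T / delta) / counts[a])

    for _ in range(T):
        a = max(range(K), key=value)       # arm with largest upper bound
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1                     # update N_{a,t}
        sums[a] += x                       # update the empirical mean
        total += x
    return total, counts

total, counts = ucb([0.2, 0.5, 0.8], T=5000)
```

In a typical run the best arm accumulates the vast majority of the pulls, while clearly sub-optimal arms stop being pulled once their upper confidence bounds fall below the best arm's value.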

SLIDE 21

Instantaneous Regret

With probability greater than 1 − δ, all the confidence bounds will hold.
Question: If µ̂_{a,t} + ConfBound_{a,t} ≤ µ*, could UCB pull arm a at time t?
Question: If we pull arm a at time t, how much regret do we pay? i.e., µ* − µ_{a_t} ≤ ??

SLIDE 22

Total Regret

Theorem: The total (expected) regret of UCB is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ O( √( KT log(KT) ) )
This is better than the naive strategy. Up to log factors, it is optimal.
Practical algorithm?

SLIDE 23

Proof Idea: for K = 2

Suppose arm a = 2 is not optimal.
Claim 1: All confidence intervals will be valid (with probability ≥ 1 − δ).
Claim 2: If we pull arm a = 1, then we pay no regret.
Claim 3: If we pull arm a = 2, then we pay at most 2C_{a,t} regret. To see this:
Why? µ̂_{a,t} + C_{a,t} ≥ µ̂_{1,t} + C_{1,t} ≥ µ*
Why? µ_a ≥ µ̂_{a,t} − C_{a,t}
The total regret is:
Σ_t C_{a,t} ≲ Σ_t √( 1 / N_{a,t} )
Note that N_{a,t} ≤ t (and increasing).
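The last sum can be made concrete with a standard integral comparison plus Cauchy-Schwarz, which recovers the √(KT log(KT)) bound from the theorem; a sketch:

```latex
% Sum the per-pull bonuses over the n-th pull of each arm, then compare
% with an integral and apply Cauchy-Schwarz (using \sum_a N_{a,T} = T):
\sum_{n=1}^{N} \frac{1}{\sqrt{n}}
  \;\le\; 1 + \int_{1}^{N} \frac{dx}{\sqrt{x}}
  \;\le\; 2\sqrt{N},
\qquad\text{so}\qquad
\sum_{a=1}^{K} \sum_{n=1}^{N_{a,T}} \sqrt{\frac{\log(KT/\delta)}{n}}
  \;\le\; 2\sqrt{\log(KT/\delta)} \sum_{a=1}^{K} \sqrt{N_{a,T}}
  \;\le\; 2\sqrt{KT \log(KT/\delta)}.
```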

SLIDE 24

Acknowledgements

http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
https://sites.google.com/site/banditstutorial/
