Exploration/Exploitation in Multi-armed Bandits
Deep Reinforcement Learning and Control Katerina Fragkiadaki
Carnegie Mellon School of Computer Science Spring 2019, CMU 10-403
Used Materials Disclaimer: Some of the material and slides are borrowed from Russ Salakhutdinov, who in turn borrowed from Rich Sutton’s class and David Silver’s class on Reinforcement Learning.
Supervision suggests correct actions. Reward, in contrast, only provides a signal of whether the actions the agent selects are good or bad; it does not say which actions are correct, or even how far the selected actions are from the optimal ones!
Agent and environment interact at discrete time steps $t = 0, 1, 2, 3, \ldots$. The agent observes the state at step $t$, $S_t \in \mathcal{S}$, produces an action at step $t$, $A_t \in \mathcal{A}(S_t)$, gets the resulting reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and the resulting next state, $S_{t+1} \in \mathcal{S}$, yielding the trajectory
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, S_{t+3}, \ldots$$
[Figure: agent-environment interaction loop. The agent sends action $A_t$ to the environment; the environment returns reward $R_{t+1}$ and next state $S_{t+1}$.]
A closer look at exploration-exploitation balancing in a simplified RL setup
In the multi-armed bandit setup the interaction is the same, except that the state does not change: the agent observes the single fixed state $S_t = s$, and the experience reduces to the sequence
$$A_t, R_{t+1}, A_{t+1}, R_{t+2}, A_{t+2}, R_{t+3}, \ldots$$
[Figure: the same agent-environment loop, with a fixed state.]
One-armed bandit = slot machine (English slang)
At each timestep $t$ the agent chooses one of the $K$ arms and plays it. The $i$-th arm, when played at timestep $t$, produces a reward $r_{i,t}$ drawn from a probability distribution $\mathcal{Q}_i$ with mean $\mu_i$. The agent knows neither the arms' reward distributions nor their means.
Agent's objective: maximize the cumulative reward collected over time, which requires identifying and playing the arm with the highest mean reward.
Alternative notation for the mean arm rewards: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \; \forall a \in \{1, \ldots, k\}$.
[Figure: three slot machines (Bernoulli bandits) with win probabilities 0.6, 0.4, and 0.45.]
Recall: the Bernoulli distribution is the discrete probability distribution of a random variable that takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$, i.e., the probability distribution of any single experiment that asks a yes-no question.
[Figure: one bandit task from the 10-armed testbed. Each action $a \in \{1, \ldots, 10\}$ has its own reward distribution, a Gaussian with mean $q_*(a)$ and unit variance: $R_t \sim \mathcal{N}(q_*(a), 1)$.]
Netflix artwork: for a particular movie, we want to decide which image to show (to all Netflix users). Each arm is a candidate image; at each timestep some user (not necessarily the same user) is shown the image, and the reward is whether the user clicks the title and watches the movie (so the image should be informative, not clickbait).
We maintain action-value estimates $Q_t(a) \approx q_*(a), \; \forall a$, and define the greedy action
$$A_t^* \doteq \arg\max_a Q_t(a).$$
If $A_t = A_t^*$ you are exploiting; if $A_t \neq A_t^*$ you are exploring. You cannot do both at any single timestep, and as the estimates improve we would like to explore less with time.
The action value (expected reward) of action $a$: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \; \forall a \in \{1, \ldots, k\}$.
The optimal value: $v_* = q_*(a^*) = \max_{a \in \mathcal{A}} q_*(a)$.
Maximizing cumulative reward is equivalent to minimizing regret ($\text{reward} = -\text{regret}$).
The instantaneous regret at step $t$: $I_t = \mathbb{E}[\, v_* - q_*(a_t) \,]$.
The total regret up to time $T$: $L_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \big( v_* - q_*(a_t) \big)\right]$.
Define the gap of action $a$ relative to the optimal action $a^*$: $\Delta_a = v_* - q_*(a)$. With $N_T(a)$ the number of times action $a$ has been selected up to time $T$, the total regret decomposes as
$$L_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \big( v_* - q_*(a_t) \big)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)]\,\big(v_* - q_*(a)\big) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)]\,\Delta_a.$$
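As a quick worked illustration of this decomposition (using the Bernoulli arms with means 0.6, 0.4, 0.45 from the earlier example, and hypothetical pull counts), suppose that over $T = 1000$ steps the agent pulls arm 2 a hundred times and arm 3 fifty times in expectation. Then $v_* = 0.6$, the gaps are $\Delta_1 = 0$, $\Delta_2 = 0.2$, $\Delta_3 = 0.15$, and
$$L_T = \sum_{a} \mathbb{E}[N_T(a)]\,\Delta_a = 850 \cdot 0 + 100 \cdot 0.2 + 50 \cdot 0.15 = 27.5.$$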
Sample-average estimate of the action value:
$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}},$$
where $N_t(a) = \sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}$ is the number of times action $a$ has been taken by time $t$. If the action is taken an infinite number of times, the estimate converges to the true value:
$$\lim_{N_t(a) \to \infty} Q_t(a) = q_*(a).$$
Write the sample average of an action's first $n-1$ rewards as
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}.$$
Instead of storing all past rewards, the estimate can be updated incrementally:
$$Q_{n+1} = Q_n + \frac{1}{n}\big[ R_n - Q_n \big],$$
an instance of the general rule
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\big[ \text{Target} - \text{OldEstimate} \big],$$
where the bracketed term is the error of the current estimate.
Derivation:
$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big( R_n + \sum_{i=1}^{n-1} R_i \Big) = \frac{1}{n}\Big( R_n + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} R_i \Big) = \frac{1}{n}\big( R_n + (n-1) Q_n \big) = \frac{1}{n}\big( R_n + n Q_n - Q_n \big) = Q_n + \frac{1}{n}\big[ R_n - Q_n \big].$$
With a constant step size $\alpha \in (0, 1]$ instead of $1/n$:
$$Q_{n+1} \doteq Q_n + \alpha\big[ R_n - Q_n \big] = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i.$$
The smaller the $i$, the smaller the multiplier $\alpha(1-\alpha)^{n-i}$, so earlier rewards are forgotten.
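A minimal sketch of both update rules in code (the function names and the toy reward stream are illustrative, not from the slides):

```python
import numpy as np

# Incremental action-value updates, matching the formulas above:
#   sample average   Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
#   constant alpha   Q_{n+1} = Q_n + alpha [R_n - Q_n]

def update_sample_average(q, n, reward):
    """Exact running mean of the first n rewards, stored incrementally."""
    n += 1
    q += (reward - q) / n
    return q, n

def update_constant_alpha(q, reward, alpha=0.1):
    """Exponential recency-weighted average: old rewards are gradually forgotten."""
    return q + alpha * (reward - q)

# Sanity check: the sample-average update reproduces np.mean of the reward stream.
rewards = np.random.default_rng(0).normal(loc=0.5, scale=1.0, size=1000)
q, n = 0.0, 0
for r in rewards:
    q, n = update_sample_average(q, n, r)
print(q, rewards.mean())   # the two numbers should agree up to floating-point error
```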
We have seen how to form estimates of the bandits' mean rewards. Next we discuss the action-selection strategy (the policy).
A first strategy, the greedy method with an initial exploration phase (a code sketch follows the list below):
1. Allocate a fixed time period to exploration, during which you try the bandits uniformly at random.
2. Estimate the mean rewards of all actions: $Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t-1} r_i \mathbf{1}(A_i = a)$.
3. Select the action that is optimal for the estimated mean rewards given all data thus far, breaking ties at random: $a_t = \arg\max_{a \in \mathcal{A}} Q_t(a)$.
4. GOTO 3.
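As promised, a minimal sketch of this explore-then-greedy scheme, assuming Bernoulli arms; the arm means, step counts, and variable names are illustrative:

```python
import numpy as np

# Explore uniformly for a fixed period, then act greedily on the running estimates.
rng = np.random.default_rng(0)
true_means = np.array([0.6, 0.4, 0.45])   # hidden from the agent
k, explore_steps, total_steps = len(true_means), 30, 1000

Q = np.zeros(k)   # estimated mean reward per arm
N = np.zeros(k)   # number of pulls per arm

for t in range(total_steps):
    if t < explore_steps:
        a = int(rng.integers(k))                           # step 1: uniform exploration
    else:
        a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # step 3: greedy, random tie-break
    r = float(rng.random() < true_means[a])                # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                              # step 2: incremental mean estimate

print("estimates:", Q.round(3), " pulls:", N.astype(int))
```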
After the fixed exploration period we have formed the following reward estimates: $Q_t(a_1) = 0.3$, $Q_t(a_2) = 0.5$, $Q_t(a_3) = 0.1$. Q: Will the greedy method always pick the second action?
ε-greedy: with probability $1 - \varepsilon$ pick the greedy action (exploitation); with probability $\varepsilon$ instead pick an action at random, possibly the greedy action again (exploration).
A simple bandit algorithm
Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Repeat forever:
    A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly), or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) [R − Q(A)]
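A minimal Python sketch of this ε-greedy loop on the 10-armed Gaussian testbed; the choices of ε, the number of steps, and the function name are illustrative:

```python
import numpy as np

# epsilon-greedy on a k-armed Gaussian bandit: q*(a) ~ N(0,1), R_t ~ N(q*(a), 1).
def eps_greedy_bandit(q_star, eps=0.1, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    k = len(q_star)
    Q, N = np.zeros(k), np.zeros(k)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))                           # explore: random action
        else:
            a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit: greedy, random tie-break
        r = rng.normal(q_star[a], 1.0)                         # R ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                              # incremental sample average
        total_reward += r
    return Q, N, total_reward / steps

q_star = np.random.default_rng(0).normal(0.0, 1.0, size=10)   # q*(a) ~ N(0, 1)
print("average reward:", round(eps_greedy_bandit(q_star, eps=0.1)[2], 3))
```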
We sample 10-armed bandit instantiations with $q_*(a) \sim \mathcal{N}(0, 1)$ and $R_t \sim \mathcal{N}(q_*(a), 1)$, and compare $\varepsilon = 0$ (greedy), $\varepsilon = 0.01$, and $\varepsilon = 0.1$.
[Figure: average reward over the first 1000 steps for the three methods.]
Q: In the limit (after an infinite number of steps), which method will result in the largest average reward?
[Figure: % optimal action over the first 1000 steps for the three methods.]
Q: Which method will find the optimal action in the limit?
Q: Does the performance of these methods depend on the initialization of the action-value estimates?
Greedy action selection can get stuck playing a suboptimal action if rewards are stochastic. One remedy is optimistic initial values: initialize the action-value estimates optimistically, so that the running estimate is just an incremental estimate that includes one 'hallucinated' initial optimistic value:
$$Q_t(a_t) = Q_{t-1}(a_t) + \frac{1}{N_t(a_t)} \big( r_t - Q_{t-1}(a_t) \big).$$
We initialize with the following reward estimates for Bernoulli bandits: $Q_t(a_1) = 1$, $Q_t(a_2) = 1$, $Q_t(a_3) = 1$. Q: When is it possible that greedy action selection will not try out all the actions?
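A minimal sketch of optimistic initialization with purely greedy selection on the 10-armed Gaussian testbed; the initial value $Q_1 = 5$ matches the figure below, while the hallucinated-sample bookkeeping and other names are illustrative assumptions:

```python
import numpy as np

# Greedy selection (eps = 0) still explores early if every estimate starts at an
# optimistic value; the initial value is treated as one 'hallucinated' observation.
rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, size=10)   # q*(a) ~ N(0, 1)
k, steps, q1 = len(q_star), 1000, 5.0

Q = np.full(k, q1)    # optimistic initial estimates
N = np.ones(k)        # each arm starts with one hallucinated sample

for _ in range(steps):
    a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # purely greedy, random tie-break
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]    # each real reward pulls the estimate down toward the truth

print("every arm tried at least once:", bool(np.all(N > 1)))
```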
[Figure: % optimal action over 1000 plays on the 10-armed testbed ($q_*(a) \sim \mathcal{N}(0, 1)$, $R_t \sim \mathcal{N}(q_*(a), 1)$), comparing optimistic greedy ($Q_1(a) = 5$, $\varepsilon = 0$) with realistic ε-greedy ($Q_1(a) = 0$, $\varepsilon = 0.1$).]
To achieve that, we need to reason about the uncertainty of our action-value estimates. Upper Confidence Bound (UCB) action selection picks the action with the largest estimated upper confidence bound on the mean reward conditioned on selecting action $a$; the bound shrinks the more often an action has been selected.
[Figure: per-action estimated mean and estimated upper confidence bound.]
$$A_t \doteq \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \,\right]$$
(note: $c$ is a hyper-parameter that trades off exploration and exploitation)
[Figure: average reward over steps on the 10-armed testbed, UCB with $c = 2$ vs ε-greedy with $\varepsilon = 0.1$.]
The problem with using only mean estimates is that we cannot reason about uncertainty. For example: arm 1 has 1000 pulls and 600 wins, so $Q_t(a_1) = 0.6$; arm 2 has 1000 pulls and 400 wins, so $Q_t(a_2) = 0.4$; arm 3 has 10 pulls and 4 wins, so $Q_t(a_3) = 0.4$. Arms 2 and 3 have the same mean estimate, but the estimate for arm 3 is far less certain.
Bayes' rule enables us to reverse conditional probabilities:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
(Slide from Nando de Freitas)
The doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don't have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people. What are the chances that you actually have the disease?
The test is 99% accurate: $P(T{=}1 \mid D{=}1) = 0.99$ and $P(T{=}0 \mid D{=}0) = 0.99$, where $T$ denotes the test result and $D$ the disease. The disease affects 1 in 10,000 people: $P(D{=}1) = 0.0001$. Applying Bayes' rule:
$$P(D{=}1 \mid T{=}1) = \frac{P(T{=}1 \mid D{=}1)\, P(D{=}1)}{P(T{=}1 \mid D{=}1)\, P(D{=}1) + P(T{=}1 \mid D{=}0)\, P(D{=}0)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098.$$
Despite the positive test, the probability that you actually have the disease is below 1%.
The Bayesian recipe. Step 1: Given $n$ data points, $\mathcal{D} = x_{1:n} = \{x_1, x_2, \ldots, x_n\}$, write down the expression for the likelihood $p(\mathcal{D} \mid \theta)$. Step 2: Specify a prior $p(\theta)$. Step 3: Compute the posterior
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$
Let's consider a Beta distribution prior over the mean rewards of the Bernoulli bandits: $\mathrm{Beta}(\alpha, \beta)$. Its mean is $\frac{\alpha}{\alpha + \beta}$, and the larger $\alpha + \beta$ is, the more concentrated the distribution.
Computing the posterior $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$ under this prior, the posterior is also a Beta, because the Beta is the conjugate prior for the Bernoulli distribution. A closed-form solution for the Bayesian update is possible only for conjugate prior-likelihood pairs: after observing $a$ successes and $b$ failures, the posterior is $\mathrm{Beta}(\alpha + a, \beta + b)$.
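A minimal sketch of this conjugate Beta-Bernoulli update in code; the prior parameters and the observed counts are illustrative:

```python
import numpy as np

# Prior Beta(alpha, beta); after `wins` successes and `losses` failures, the
# posterior is Beta(alpha + wins, beta + losses) in closed form (conjugacy).
alpha, beta = 1.0, 1.0        # uniform prior over the arm's mean reward
wins, losses = 4, 6           # e.g. 10 pulls of a Bernoulli arm, 4 of them rewarded

post_alpha, post_beta = alpha + wins, beta + losses
print("posterior mean:", post_alpha / (post_alpha + post_beta))   # 5/12, about 0.417

# Sampling from the posterior is what the posterior-sampling strategy (next) relies on.
rng = np.random.default_rng(0)
samples = rng.beta(post_alpha, post_beta, size=10_000)
print("posterior std (Monte Carlo):", round(float(samples.std()), 3))
```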
Thompson sampling: represent a posterior distribution over the mean rewards of the arms, $p(\theta_1, \theta_2, \ldots, \theta_k)$. At each step, draw one sample $\theta_1, \theta_2, \ldots, \theta_k \sim \hat{p}(\theta_1, \theta_2, \ldots, \theta_k)$ and play the action that is best under the sampled parameters: $a = \arg\max_a \mathbb{E}_\theta[r(a)]$.
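A minimal sketch of this posterior-sampling strategy for Bernoulli bandits with independent Beta posteriors per arm; the arm means and step count are illustrative:

```python
import numpy as np

# Thompson sampling: sample a mean for every arm from its posterior, play the argmax,
# then apply the conjugate Beta update to the played arm.
rng = np.random.default_rng(0)
true_means = np.array([0.6, 0.4, 0.45])   # hidden from the agent
k, steps = len(true_means), 2000

alpha = np.ones(k)   # Beta(1, 1) prior for every arm
beta = np.ones(k)

for _ in range(steps):
    theta = rng.beta(alpha, beta)           # one posterior sample per arm
    a = int(np.argmax(theta))               # best arm under the sampled means
    r = int(rng.random() < true_means[a])   # Bernoulli reward
    alpha[a] += r                           # conjugate update: success
    beta[a] += 1 - r                        # conjugate update: failure

print("posterior means:", (alpha / (alpha + beta)).round(3))
print("pulls per arm:", (alpha + beta - 2).astype(int))
```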
Looking ahead: the equivalent of mean arm rewards in general MDPs are Q-functions. Contextual bandits sit in between: they add states (or "contexts"), and the distribution over rewards then depends on the context as well as on the chosen action.
Netflix artwork, revisited: for a particular title and a particular user, we can use the contextual multi-armed bandit formulation to decide which image to show, per title per user. The arms are the candidate images, the context describes the user (e.g., movies previously watched, time and day of the week, etc.), and the reward is whether the user clicks the image and watches the movie.