SLIDE 1

Multi-armed bandits

S. Bubeck, N. Cesa-Bianchi, Foundations and Trends in Machine Learning, 2012

* Real title: regret analysis of stochastic and nonstochastic multi-armed bandit problems

SLIDE 2

Overview

  • Stochastic, adversarial, extensions & connections.
  • Applications:
  • Clinical trails, Ad placement, Package routing, Video games
  • Flexible theoretical framework with rigorous guarantees
  • Feedback is partially observable
  • This is not supervised learning!
SLIDE 3

Multi-armed bandit setting

  • Casino, 𝐿 slot machines
  • π‘ˆ rounds, timesteps 𝑒 = 1, … , π‘ˆ
  • For 𝑒 = 1, … π‘ˆ we choose a slot machine / arm to pull / ad to show
  • If we pull arm 𝑙 at timestep t, we receive (only) reward 𝑠

𝑒 𝑙

  • Stochastic: reward 𝑠

𝑒 𝑙 i.i.d. sampled from distribution of arm 𝑙

  • For example normal: 𝑂 πœˆπ‘™, 1
  • Generally, we don’t know the distribution, or we may only know the class.

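To make the protocol concrete, here is a minimal Python sketch of the stochastic setting above. The arm means MUS, the values K = 3 and T = 1000, and the uniformly random placeholder policy are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of a stochastic bandit environment (illustrative values).
import random

MUS = [0.2, 0.5, 0.8]   # hypothetical (unknown to the learner) arm means mu_i
K = len(MUS)            # number of arms
T = 1000                # number of rounds

def pull(i: int) -> float:
    """Pull arm i and receive one i.i.d. reward X_{i,t} ~ N(mu_i, 1)."""
    return random.gauss(MUS[i], 1.0)

for t in range(T):
    i = random.randrange(K)   # placeholder policy: pull a uniformly random arm
    x = pull(i)               # only the reward of the pulled arm is observed
```

The learner never sees the rewards of the arms it did not pull; this is the partial feedback that distinguishes bandits from supervised learning.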

SLIDE 4

Multi-armed bandit setting

  • We want to maximize our gains.
  • Say our algorithm chooses machine 𝐡𝑒 in round 𝑒
  • Gain: 𝐻𝐡 = σ𝑒=1

π‘ˆ

𝑠

𝑒 𝐡𝑒

  • Performance measure: regret.
  • Regret compares our performance to best fixed action.
  • Always play best arm οƒ  in expectation gain is π‘ˆπœˆβˆ—
  • Regret: 𝑆 = π‘ˆπœˆβˆ— βˆ’ 𝐻𝐡. Low regret = High gain = Good.

[Figure: arm reward distributions; ΞΌ* marks the best action]
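Below is a minimal sketch of this gain/regret bookkeeping, reusing the hypothetical Gaussian arms and the random placeholder policy from the previous sketch (all constants are illustrative).

```python
# Minimal sketch of the gain G and regret R defined on this slide.
import random

MUS = [0.2, 0.5, 0.8]            # hypothetical (unknown) arm means
K, T = len(MUS), 1000
mu_star = max(MUS)               # mu*: mean reward of the best arm

gain = 0.0                       # G = sum over t of X_{I_t, t}
for t in range(T):
    I_t = random.randrange(K)    # arm chosen by our (placeholder) algorithm
    gain += random.gauss(MUS[I_t], 1.0)   # observed reward X_{I_t, t}

regret = T * mu_star - gain      # R = T*mu* - G
print(f"gain G = {gain:.1f}, regret R = {regret:.1f}")
```

With the uniform placeholder policy the regret grows linearly in T; the point of the algorithms on the next slides is to keep it sublinear.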

SLIDE 5

Stochastic setting

  • NaΓ―ve greedy strategy:
  • Try all arms once (explore)
  • Afterward, continue pulling best arm (exploit)
  • Why can this fail?
  • We may get unlucky and observe the blue samples in the first round
  • Balance exploration & exploitation
  • Exploration: try enough arms to be certain you have a β€˜good’ one
  • Exploitation: pull the good arm enough to maximize your reward
  • Algorithms: UCB, Thompson Sampling. Optimism in face of uncertainty.

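As an illustration of optimism in the face of uncertainty, here is a minimal UCB1 sketch. The slides name UCB in general, not this exact index or these constants, and the Gaussian environment is the same illustrative assumption as before.

```python
# Minimal UCB1 sketch: empirical mean plus a confidence bonus per arm.
import math
import random

MUS = [0.2, 0.5, 0.8]            # hypothetical (unknown) arm means
K, T = len(MUS), 1000

counts = [0] * K                 # n_i: number of times arm i was pulled
sums = [0.0] * K                 # running sum of rewards observed from arm i

for t in range(1, T + 1):
    if t <= K:
        i = t - 1                # explore: try every arm once
    else:
        # Pull the arm with the highest optimistic index:
        # empirical mean + sqrt(2 ln t / n_i).
        i = max(range(K),
                key=lambda a: sums[a] / counts[a]
                              + math.sqrt(2.0 * math.log(t) / counts[a]))
    x = random.gauss(MUS[i], 1.0)   # observe the chosen arm's reward only
    counts[i] += 1
    sums[i] += x
```

The confidence bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited: exactly the exploration/exploitation balance described above.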

SLIDE 6

Adversarial setting

  • Problem with stochastic setting:
  • We assume the distribution of the arms are fixed.
  • For example for the advertisements, the reward distribution can change over time.
  • Adversary chooses rewards: no assumptions made about rewards 𝑠

𝑙 𝑒 ∈ [0,1]

  • If we choose arm k at time t, adversary can set 𝑠

𝑙 𝑒 very low.

  • Solve by randomization! For each t, choose distribution π‘žπ‘’ over all k arms.
  • By surprising adversary, we can still get low regret.
  • Algorithms: EXP3, EXP3.P, FPL.
  • Pessimistic / β€˜play it safe’ / always spread your chances.

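Here is a minimal EXP3 sketch for the adversarial setting. Only the general scheme (a randomized distribution p_t, exponential weights, and importance-weighted reward estimates) comes from the slide; the exploration parameter gamma, the horizon, and the toy adversarial_reward function are illustrative assumptions.

```python
# Minimal EXP3 sketch for adversarial rewards X_{i,t} in [0, 1].
import math
import random

K, T = 3, 1000
gamma = 0.1                      # exploration parameter (illustrative value)
weights = [1.0] * K

def adversarial_reward(i: int, t: int) -> float:
    """Hypothetical stand-in for whatever reward the adversary fixed for arm i."""
    return 1.0 if (t + i) % K == 0 else 0.0

for t in range(T):
    total = sum(weights)
    # Mix the weight distribution with uniform exploration to get p_t.
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    i = random.choices(range(K), weights=probs)[0]   # sample I_t ~ p_t
    x = adversarial_reward(i, t)                     # only this reward is observed
    x_hat = x / probs[i]                             # importance-weighted estimate
    weights[i] *= math.exp(gamma * x_hat / K)        # exponential weight update
```

Sampling from p_t is what keeps the adversary from predicting, and therefore punishing, the next pull.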

SLIDE 7

Extensions & Connections

  • Contextual bandit: use β€˜side information’.
  • In each round we receive information from the user (in advertising example)
  • NaΓ―ve solution: run a bandit for each user category (cluster users first)
  • More related to supervised learning since we can use β€˜features’
  • Non-stationary bandit:
  • Distributions change slowly instead of rewards being determined by adversary.
  • Adversary assumption may be too pessimistic, but i.i.d. too optimistic.
  • Connection to reinforcement learning: MDP with only 1 state.
  • Full-information setting
  • We observe all rewards of all arms in all rounds
  • Online learning / Hedge algorithm / Exponential weights algorithm, EXP2.