

SLIDE 1

Multi-armed Bandits

  • Prof. Kuan-Ting Lai
  • 2020/3/12

SLIDE 2

k-armed Bandit Problem

  • Play k armed bandit machines and find a way to win the most money!
  • Note: assume you have unlimited money and never go bankrupt!

https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b

SLIDE 3

10-armed Testbed

  • Each bandit machine has its own reward distribution
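
A minimal Python sketch of such a testbed, following the setup in the Sutton & Barto reference (each true value q*(a) drawn from N(0, 1), each reward drawn from N(q*(a), 1)); the Testbed class name is just illustrative:

```python
import numpy as np

class Testbed:
    """k-armed testbed: each arm's true value q*(a) is drawn once from
    N(0, 1); pulling arm a returns a noisy reward from N(q*(a), 1)."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # hidden true action values

    def pull(self, a):
        return self.rng.normal(self.q_star[a], 1.0)      # one noisy reward

bandit = Testbed(seed=0)
print(bandit.q_star)   # the 10 hidden reward means
print(bandit.pull(3))  # a sample reward from arm 3
```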

SLIDE 4

Action-value Function

  • Let π‘Ÿβˆ— 𝑏 be the true (optimal) action-value function

π‘Ÿβˆ— 𝑏 ← 𝐹[𝑆𝑒|𝐡𝑒 = 𝑏]

  • 𝑅𝑒 𝑏 : The estimated value (reward) of action a at time t
SLIDE 5

ε-greedy

  • Greedy action
    − Always select the action with the max estimated value:

A_t ≐ argmax_a Q_t(a)

  • ε-greedy
    − Select the greedy action (1 − ε) of the time; select a random action ε of the time
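
A one-function sketch of ε-greedy selection (epsilon_greedy is an illustrative name; Q is the array of current estimates Q_t(a)):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick the arm with the highest estimate (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```

With epsilon = 0 this reduces to the pure greedy rule above.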

SLIDE 6

Performance of ε-greedy

  • Average rewards over 2000 runs with ε = 0, 0.1, 0.01

SLIDE 7

Optimal Actions Selected by ε-greedy

  • Percentage of optimal actions selected over 2000 runs with ε = 0, 0.1, 0.01
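
A sketch of the experiment behind these two figures, combining the testbed and ε-greedy sketches above (sample-average updates are used here and derived on the next slides; runs is reduced from the slides' 2000 for speed):

```python
import numpy as np

def run_experiment(eps, runs=200, steps=1000, k=10, seed=0):
    """Average reward per time step for epsilon-greedy on the k-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(size=k)           # fresh bandit per run
        Q, N = np.zeros(k), np.zeros(k)
        for t in range(steps):
            a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]         # incremental sample average
            avg_reward[t] += r
    return avg_reward / runs

for eps in (0.0, 0.1, 0.01):
    print(eps, run_experiment(eps)[-1])       # average reward at the last step
```
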
SLIDE 8

Update Q_t(a)

  • Let Q_n denote the estimate of an action's value after it has been selected n − 1 times, i.e. the sample average of its first n − 1 rewards:

Q_n ≐ (R_1 + R_2 + ⋯ + R_{n−1}) / (n − 1)

SLIDE 9

Deriving Update Rule

  • Rewriting the sample average incrementally:

Q_{n+1} = Q_n + (1/n) (R_n − Q_n)

  • Requires only memory of Q_n and R_n
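
A direct translation of the incremental rule (class name illustrative):

```python
class IncrementalAverage:
    """Maintains Q_n as a running sample average:
    Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""

    def __init__(self):
        self.q = 0.0   # current estimate Q_n
        self.n = 0     # number of rewards seen so far

    def update(self, reward):
        self.n += 1
        self.q += (reward - self.q) / self.n
        return self.q
```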

SLIDE 10

Tracking a Nonstationary Problem

  • Use a constant step-size α ∈ (0, 1]:

Q_{n+1} = Q_n + α (R_n − Q_n)

  • A constant step-size doesn't converge, so the estimate keeps tracking the most recent rewards

SLIDE 11

Exponential Recency-weighted Average

Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i

  • The weights sum to 1:

(1 − α)^n + Σ_{i=1}^{n} α (1 − α)^{n−i} = 1
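
A quick numerical check of both identities, assuming the constant step-size update Q_{n+1} = Q_n + α(R_n − Q_n) from the previous slide:

```python
import numpy as np

alpha, n = 0.1, 50
rng = np.random.default_rng(0)
rewards = rng.normal(size=n)

q = q1 = 0.0
for r in rewards:                 # incremental constant step-size updates
    q += alpha * (r - q)

# Direct exponential recency-weighted average: newer rewards get larger weights
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
q_direct = (1 - alpha) ** n * q1 + weights @ rewards

print(np.isclose(q, q_direct))                            # True
print(np.isclose((1 - alpha) ** n + weights.sum(), 1.0))  # weights sum to 1
```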

SLIDE 12

Optimistic Initial Values

  • Setting initial estimates optimistically encourages early exploration, since every sampled reward is a disappointment
  • Still, we should not care too much about initial values in practice
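
A sketch of optimistic initialization (reusing the hypothetical Testbed above): pure greedy selection still explores at first, because every sampled reward falls short of the optimistic estimate and pulls it back down:

```python
import numpy as np

k = 10
Q = np.full(k, 5.0)   # true values are ~N(0, 1), so +5 is wildly optimistic
N = np.zeros(k)

def greedy_step(bandit):
    a = int(np.argmax(Q))        # pure greedy; optimism drives early exploration
    r = bandit.pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]    # sample-average update deflates Q[a]
    return a, r
```
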
SLIDE 13

Upper-Confidence-Bound Action Selection

  • N_t(a): the number of times that action a has been selected prior to time t
  • Not practical for large state spaces

A_t ≐ argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]
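
A sketch of the selection rule (ucb_action is an illustrative name; arms with N_t(a) = 0 are treated as maximizing, as in the book's convention):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ]; untried arms go first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```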

SLIDE 14

Gradient Bandit Algorithms

  • Learn a numerical preference H_t(a) for each action and turn it into action probabilities with the soft-max function:

π_t(a) ≐ e^{H_t(a)} / Σ_b e^{H_t(b)}

  • π_t(a) is the probability of taking action a at time t
SLIDE 15

Selecting Actions Based on π_t(a)

SLIDE 16

Gradient Ascent

  • Update each preference in the direction of the gradient of the expected reward:

H_{t+1}(a) ≐ H_t(a) + α ∂E[R_t] / ∂H_t(a)

SLIDE 17

Calculating Gradient

  • Expand the expectation: E[R_t] = Σ_x π_t(x) q*(x)
  • Adding a baseline B_t does not change the gradient, because Σ_x ∂π_t(x)/∂H_t(a) = 0:

∂E[R_t]/∂H_t(a) = Σ_x ( q*(x) − B_t ) ∂π_t(x)/∂H_t(a)
SLIDE 18

Convert Equation into Expectation

  • Multiply by π_t(x)/π_t(x) to rewrite the sum as an expectation over A_t
  • Choose the baseline B_t = R̄_t, the average of the rewards up through time t
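
Written out, the two bullets correspond to this step of the derivation (a reconstruction, following the Sutton & Barto reference):

```latex
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
  = \sum_x \pi_t(x)\,\bigl(q_*(x) - B_t\bigr)\,
    \frac{\partial \pi_t(x)/\partial H_t(a)}{\pi_t(x)}
  = \mathbb{E}\!\left[\bigl(q_*(A_t) - B_t\bigr)\,
    \frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)}\right]
```
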
SLIDE 19

Calculating Gradient of Softmax
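
The identity this slide derives (a reconstruction; 1_{a=x} is 1 when a = x and 0 otherwise):

```latex
\frac{\partial \pi_t(x)}{\partial H_t(a)}
  = \pi_t(x)\,\bigl(\mathbf{1}_{a=x} - \pi_t(a)\bigr)
```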

SLIDE 20

Final Result

  • Gradient bandit algorithm = stochastic gradient ascent on the expected reward!

H_{t+1}(a) = H_t(a) + α ( R_t − R̄_t ) ( 1{a = A_t} − π_t(a) )
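
A sketch of one step of the resulting algorithm, reusing the hypothetical softmax and Testbed sketches above (the caller increments t and keeps H and r_bar between steps):

```python
import numpy as np

def gradient_bandit_step(H, r_bar, t, bandit, rng, alpha=0.1):
    """H[a] += alpha * (R_t - R_bar_t) * (1{a == A_t} - pi_t[a])."""
    pi = softmax(H)
    a = int(rng.choice(len(H), p=pi))     # sample A_t ~ pi_t
    r = bandit.pull(a)                    # reward from the testbed
    r_bar += (r - r_bar) / t              # incremental baseline R_bar_t
    one_hot = np.zeros(len(H))
    one_hot[a] = 1.0
    H += alpha * (r - r_bar) * (one_hot - pi)
    return H, r_bar
```
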
SLIDE 21

Reference

  • Chapter 2, Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018