

SLIDE 1

MVA-RL Course

The Multi-Arm Bandit Framework

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 2

In This Lecture

SLIDE 3

In This Lecture

Question: which route should we take?

Problem: each day we obtain only limited feedback: the traveling time of the chosen route.

Result: if we do not repeatedly try different options, we cannot learn.

Solution: trade off between optimization and learning.

SLIDE 4

Mathematical Tools

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 5

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let $X_i \in [a_i, b_i]$ be $n$ independent r.v. with mean $\mu_i = \mathbb{E}[X_i]$. Then

$$\mathbb{P}\left(\Big|\sum_{i=1}^n (X_i - \mu_i)\Big| \ge \epsilon\right) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

SLIDE 6

Mathematical Tools

Concentration Inequalities

Proof.

$$\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\right) = \mathbb{P}\left(e^{s\sum_{i=1}^n (X_i - \mu_i)} \ge e^{s\epsilon}\right)$$
$$\le e^{-s\epsilon}\,\mathbb{E}\left[e^{s\sum_{i=1}^n (X_i - \mu_i)}\right] \quad \text{(Markov inequality)}$$
$$= e^{-s\epsilon}\prod_{i=1}^n \mathbb{E}\left[e^{s(X_i - \mu_i)}\right] \quad \text{(independent random variables)}$$
$$\le e^{-s\epsilon}\prod_{i=1}^n e^{s^2 (b_i - a_i)^2/8} \quad \text{(Hoeffding inequality)}$$
$$= e^{-s\epsilon + s^2 \sum_{i=1}^n (b_i - a_i)^2/8}$$

If we choose $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, the result follows. A similar argument holds for $\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i) \le -\epsilon\right)$.

SLIDE 7

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\Bigg(\underbrace{\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big|}_{\text{deviation}} > \underbrace{\epsilon}_{\text{accuracy}}\Bigg) \le \underbrace{2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)}_{\text{confidence}}$$

SLIDE 8

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > (b - a)\sqrt{\frac{\log 2/\delta}{2n}}\right) \le \delta$$

SLIDE 9

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > \epsilon\right) \le \delta \quad \text{if} \quad n \ge \frac{(b - a)^2 \log 2/\delta}{2\epsilon^2}.$$
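As a quick sanity check of this sample-size rule, here is a minimal Python sketch; the function name and the Bernoulli test distribution are my own choices for illustration.

```python
import numpy as np

def hoeffding_sample_size(b_minus_a, eps, delta):
    """Smallest n guaranteeing P(|empirical mean - E[X]| > eps) <= delta."""
    return int(np.ceil(b_minus_a**2 * np.log(2.0 / delta) / (2.0 * eps**2)))

n = hoeffding_sample_size(1.0, eps=0.05, delta=0.05)  # rewards in [0, 1]

# Empirical check on Bernoulli(0.3) samples: the observed failure rate
# should be (well) below delta, since the Hoeffding bound is not tight.
rng = np.random.default_rng(0)
fails = sum(abs(rng.binomial(1, 0.3, n).mean() - 0.3) > 0.05
            for _ in range(5_000))
print(n, fails / 5_000)
```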

SLIDE 10

The General Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 11

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game

The learner has i = 1, . . . , N arms (options, experts, ...). At each round t = 1, . . . , n, at the same time:

◮ The environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^N$
◮ The learner chooses an arm $I_t$
◮ The learner receives a reward $X_{I_t,t}$
◮ The environment does not reveal the rewards of the other arms
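This protocol translates directly into a simulation loop. A minimal sketch, assuming a `policy` object with `select`/`update` methods (an interface of my own, not from the lecture):

```python
def play_bandit(policy, reward_fn, n_rounds):
    """Bandit protocol: at each round only the chosen arm's reward is revealed."""
    total = 0.0
    for t in range(n_rounds):
        arm = policy.select(t)        # learner picks I_t
        x = reward_fn(arm, t)         # environment generates X_{I_t,t}
        policy.update(arm, x)         # learner observes only this reward
        total += x
    return total
```

The concrete policies sketched later in these notes (UCB, Thompson sampling, Exp3) all fit this interface.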

SLIDE 12

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game (cont’d)

The regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$

The expectation summarizes any possible source of randomness (either in X or in the algorithm).

SLIDE 13

The General Multi-arm Bandit Problem

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration

Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation

Challenge: The learner should solve two opposite problems at once: the exploration–exploitation dilemma!

SLIDE 14

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game (cont’d)

Examples

◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...

SLIDE 15

The Stochastic Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 16

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem

Definition

The environment is stochastic:

◮ Each arm has a distribution $\nu_i$ bounded in [0, 1] and characterized by an expected value $\mu_i$
◮ The rewards are i.i.d.: $X_{i,t} \sim \nu_i$

SLIDE 17

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds:

$$T_{i,n} = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$$

◮ Regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}(n\mu_i) - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}(n\mu_i) - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i = n\mu_{i^*} - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i$$

SLIDE 18

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

$$R_n(\mathcal{A}) = \sum_{i \ne i^*} \mathbb{E}[T_{i,n}]\,\Delta_i, \qquad \Delta_i = \mu_{i^*} - \mu_i$$

⇒ we only need to study the expected number of pulls of the suboptimal arms.
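To make the decomposition concrete, a small worked example (the numbers below are my own, not from the lecture):

```latex
\[
N = 2,\quad \mu_1 = 0.6,\ \mu_2 = 0.5 \;\Rightarrow\; i^* = 1,\ \Delta_2 = 0.1.
\]
\[
\text{If } \mathbb{E}[T_{2,n}] = 200 \text{ over } n = 10^4 \text{ rounds, then }
R_n(\mathcal{A}) = \mathbb{E}[T_{2,n}]\,\Delta_2 = 200 \times 0.1 = 20,
\]
```

i.e., the regret is driven entirely by how often the 0.1-worse arm is pulled.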

SLIDE 19

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL)

Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

Why it works:

◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

SLIDE 20

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

[Figure: empirical reward distributions of four arms after 100, 200, 50, and 20 pulls]

SLIDE 21

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in face of uncertainty

[Figure: the same four empirical reward distributions, illustrating the optimistic choice]

SLIDE 22

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm

The idea

[Figure: four arms with pull counts 10, 73, 3, and 23, with estimated rewards and confidence intervals]

SLIDE 23

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm

Show time!

SLIDE 24

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n:

◮ Compute the score of each arm i: $B_i$ = (optimistic score of arm i)
◮ Pull arm $I_t = \arg\max_{i=1,\dots,N} B_{i,s,t}$
◮ Update the number of pulls: $T_{I_t,t} = T_{I_t,t-1} + 1$

SLIDE 25

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ):

$B_i$ = (optimistic score of arm i)
$B_{i,s,t}$ = (optimistic score of arm i if pulled s times up to round t)
$B_{i,s,t}$ = knowledge + optimism (uncertainty)

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log 1/\delta}{2s}}$$

Optimism in face of uncertainty:
Current knowledge: average reward $\hat\mu_{i,s}$
Current uncertainty: number of pulls s
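Putting the pieces together, a minimal Python sketch of UCB. The class layout is mine, and it uses the anytime choice δ = 1/t discussed a few slides below; it plugs into the protocol loop sketched earlier.

```python
import numpy as np

class UCB:
    """UCB index: B_{i,s,t} = mu_hat_{i,s} + rho * sqrt(log(1/delta) / (2s)),
    with the anytime choice delta = 1/t."""

    def __init__(self, n_arms, rho=1.0):
        self.rho = rho
        self.counts = np.zeros(n_arms)   # T_{i,t}
        self.sums = np.zeros(n_arms)     # cumulative reward per arm

    def select(self, t):
        if np.any(self.counts == 0):     # pull every arm once first
            return int(np.argmin(self.counts))
        mu_hat = self.sums / self.counts
        bonus = self.rho * np.sqrt(np.log(t + 1) / (2 * self.counts))
        return int(np.argmax(mu_hat + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

# Two Bernoulli arms with means 0.5 and 0.6: pulls concentrate on arm 1
rng = np.random.default_rng(0)
means = np.array([0.5, 0.6])
policy = UCB(n_arms=2, rho=1.0)
for t in range(10_000):
    arm = policy.select(t)
    policy.update(arm, float(rng.binomial(1, means[arm])))
print(policy.counts)
```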

SLIDE 26

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Do you remember Chernoff-Hoeffding?

Theorem

Let $X_1, \dots, X_n$ be i.i.d. samples from a distribution bounded in [a, b]; then for any δ ∈ (0, 1)

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > (b - a)\sqrt{\frac{\log 2/\delta}{2n}}\right) \le \delta$$

SLIDE 27

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i:

$$\mathbb{P}\left(\mathbb{E}[X_i] \le \frac{1}{s}\sum_{t=1}^s X_{i,t} + \sqrt{\frac{\log 1/\delta}{2s}}\right) \ge 1 - \delta$$

$$\mathbb{P}\left(\mu_i \le \hat\mu_{i,s} + \sqrt{\frac{\log 1/\delta}{2s}}\right) \ge 1 - \delta$$

⇒ UCB uses an upper confidence bound on the expectation.

SLIDE 28

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

For any set of N arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

$$R_n(\mathcal{A}) \le \sum_{i \ne i^*}\left[\frac{4b^2}{\Delta_i}\,\rho\log(n) + \Delta_i\left(\frac{3}{2} + \frac{1}{2(\rho - 1)}\right)\right]$$

SLIDE 29

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let N = 2 with $i^* = 1$:

$$R_n(\mathcal{A}) \le O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$

Remark 1: the cumulative regret slowly increases as log(n).
Remark 2: the smaller the gap, the bigger the regret... why?

SLIDE 30

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Show time (again)!

SLIDE 31

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Remark: the regret bound is distribution–dependent:

$$R_n(\mathcal{A}; \Delta) \le O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst–case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution–free performance of UCB?

$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta)$$

SLIDE 32

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Problem: it seems that if ∆ → 0 then the regret tends to infinity...

... nonsense, because the regret is defined as $R_n(\mathcal{A}; \Delta) = \mathbb{E}[T_{2,n}]\,\Delta$, so if ∆ is small, the regret is also small...

In fact

$$R_n(\mathcal{A}; \Delta) = \min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ \mathbb{E}[T_{2,n}]\,\Delta\right\}$$

SLIDE 33

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Then

$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta) = \sup_{\Delta}\min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ n\Delta\right\} \approx \sqrt{n} \quad \text{for } \Delta = \sqrt{1/n}$$

SLIDE 34

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm with δ = 1/t:

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}}$$

Remark: if the time horizon n is known, then the optimal choice is δ = 1/n:

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

SLIDE 35

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best
◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ):

◮ Big 1 − δ: high level of exploration
◮ Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.

SLIDE 36

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Let’s dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

$$E = \left\{\forall i, s:\ \big|\hat\mu_{i,s} - \mu_i\big| \le \sqrt{\frac{\log 1/\delta}{2s}}\right\}$$

By Chernoff-Hoeffding, $\mathbb{P}[E] \ge 1 - nN\delta$.

At time t we pull arm i [algorithm]:

$$B_{i,T_{i,t-1}} \ge B_{i^*,T_{i^*,t-1}}$$

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \ge \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i^*,t-1}}}$$

On the event E we have [math]:

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \ge \mu_{i^*}$$

SLIDE 37

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Assume t is the last time i is pulled, so $T_{i,n} = T_{i,t-1} + 1$; thus

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2(T_{i,n} - 1)}} \ge \mu_{i^*}$$

Reordering [math]:

$$T_{i,n} \le \frac{\log 1/\delta}{2\Delta_i^2} + 1$$

under the event E, and thus with probability at least 1 − nNδ.

Moving to the expectation [statistics]:

$$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\mathbb{I}_E] + \mathbb{E}[T_{i,n}\mathbb{I}_{E^C}] \le \frac{\log 1/\delta}{2\Delta_i^2} + 1 + n(nN\delta)$$

Trading off the two terms with δ = 1/n², we obtain

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2T_{i,t-1}}}$$

and ...

SLIDE 38

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Trading off the two terms with δ = 1/n², we obtain

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2T_{i,t-1}}}$$

and

$$\mathbb{E}[T_{i,n}] \le \frac{\log n}{\Delta_i^2} + 1 + N$$

SLIDE 39

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n...
... almost (i.e., in expectation)

SLIDE 40

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

The value–at–risk of the regret for UCB-anytime

SLIDE 41

The Stochastic Multi-arm Bandit Problem

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm):

$$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice.

SLIDE 42

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: UCB-V

Idea: use Bernstein bounds with the empirical variance.

Algorithm:

$$B_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\log t}{2s}} \qquad\Rightarrow\qquad R_n \le O\left(\frac{1}{\Delta}\log n\right)$$

$$B^V_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\hat\sigma^2_{i,s}\log t}{s}} + \frac{8\log t}{3s} \qquad\Rightarrow\qquad R_n \le O\left(\frac{\sigma^2}{\Delta}\log n\right)$$

SLIDE 43

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: KL-UCB

Idea: use Kullback–Leibler bounds, which are tighter than other bounds.

Algorithm: still index–based, but a bit more complicated.

$$R_n \le O\left(\frac{1}{\Delta}\log n\right) \qquad\Rightarrow\qquad R_n \le O\left(\frac{1}{KL(\nu_i, \nu_{i^*})}\log n\right)$$

SLIDE 44

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: Thompson strategy

Idea: keep a distribution over the possible values of $\mu_i$.

Algorithm: Bayesian approach; compute the posterior distributions given the samples.
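The slide does not fix a reward model; a minimal sketch for the standard Bernoulli instantiation with Beta priors (the class name and test setup are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonBernoulli:
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""

    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # 1 + number of observed successes
        self.beta = np.ones(n_arms)   # 1 + number of observed failures

    def select(self, t):
        # Draw one plausible mean per arm from its posterior, play the best
        return int(np.argmax(rng.beta(self.alpha, self.beta)))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

means = [0.5, 0.6]
policy = ThompsonBernoulli(2)
for t in range(10_000):
    arm = policy.select(t)
    policy.update(arm, rng.binomial(1, means[arm]))
print(policy.alpha + policy.beta - 2)  # pull counts; most go to arm 1
```

Randomizing over posterior draws, rather than adding a deterministic bonus, is what produces the exploration here.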

SLIDE 45

The Stochastic Multi-arm Bandit Problem

Back to UCB: the Lower Bound

Theorem

For any stochastic bandit $\{\nu_i\}$, any algorithm $\mathcal{A}$ has a regret

$$\lim_{n\to\infty}\frac{R_n}{\log n} \ge \sum_{i \ne i^*}\frac{\Delta_i}{\inf_{\nu} KL(\nu_i, \nu)}$$

Problem: this is just asymptotic.
Open Question: what is the finite-time lower bound?

SLIDE 46

The Non-Stochastic Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 47

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem

Definition

The environment is adversarial:

◮ Arms have no fixed distribution
◮ The rewards $X_{i,t}$ are arbitrarily chosen by the environment

SLIDE 48

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

The (non–stochastic bandit) regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$

SLIDE 49

The Non-Stochastic Multi-arm Bandit Problem

The Exponentially Weighted Average Forecaster

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Observe the rewards $\{X_{i,t}\}$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$

SLIDE 50

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

Problem: we only observe the reward of the specific arm chosen at time t! (i.e., only $X_{I_t,t}$ is observed)

SLIDE 51

The Non-Stochastic Multi-arm Bandit Problem

The Exponentially Weighted Average Forecaster

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Observe the rewards $\{X_{i,t}\}$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$ ⇒ this update is not possible

SLIDE 52

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

We use the importance weight trick:

$$\hat X_{i,t} = \begin{cases}\dfrac{X_{i,t}}{\hat p_{i,t}} & \text{if } i = I_t\\[4pt] 0 & \text{otherwise}\end{cases}$$

Why it is a good idea:

$$\mathbb{E}\big[\hat X_{i,t}\big] = \frac{X_{i,t}}{\hat p_{i,t}}\,\hat p_{i,t} + 0\,(1 - \hat p_{i,t}) = X_{i,t}$$

$\hat X_{i,t}$ is an unbiased estimator of $X_{i,t}$.

SLIDE 53

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Exp3: Exponential-weight algorithm for Exploration and Exploitation.

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \hat X_{i,t}\big)$

SLIDE 54

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Question: is this enough? Is this algorithm actually exploring enough?

Answer: more or less...

◮ Exp3 has a small regret in expectation
◮ Exp3 might have large deviations with high probability (i.e., from time to time it may concentrate $\hat p_t$ on the wrong arm for too long and then incur a large regret)

SLIDE 55

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Fix: add some extra uniform exploration.

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = (1 - \gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \hat X_{i,t}\big)$
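A minimal Python sketch of this exploration-augmented Exp3; the class interface matches the earlier protocol sketch, and the toy reward sequence and parameter values are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3:
    """Exp3 with uniform exploration: p_i = (1 - gamma) w_i / W + gamma / N."""

    def __init__(self, n_arms, gamma=0.07):
        self.gamma = gamma
        self.eta = gamma                # theorem below suggests eta = gamma
        self.log_w = np.zeros(n_arms)   # log-weights, for numerical stability
        self.p = np.full(n_arms, 1.0 / n_arms)

    def select(self, t):
        w = np.exp(self.log_w - self.log_w.max())
        self.p = (1 - self.gamma) * w / w.sum() + self.gamma / len(w)
        return int(rng.choice(len(w), p=self.p))

    def update(self, arm, reward):
        x_hat = reward / self.p[arm]          # importance-weighted estimate
        self.log_w[arm] += self.eta * x_hat   # only the pulled arm's weight moves

# Toy run with rewards in [0, 1]: arm 1 pays more often than arm 0
policy = Exp3(n_arms=2, gamma=0.07)
picks = np.zeros(2)
for t in range(20_000):
    arm = policy.select(t)
    x = float(t % 2 == 0) if arm == 0 else float(t % 3 != 0)
    policy.update(arm, x)
    picks[arm] += 1
print(picks / picks.sum())  # should favor arm 1 (average reward 2/3 vs 1/2)
```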

SLIDE 56

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Theorem

If Exp3 is run with γ = η, then it achieves a regret

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] \le (e - 1)\gamma G_{\max} + \frac{N\log N}{\gamma}$$

with $G_{\max} = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t}$.

SLIDE 57

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Theorem

If Exp3 is run with

$$\gamma = \eta = \sqrt{\frac{N\log N}{(e - 1)n}}$$

then it achieves a regret $R_n(\mathcal{A}) \le O\big(\sqrt{nN\log N}\big)$.

SLIDE 58

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Comparison with online learning:

$$R_n(\text{Exp3}) \le O\big(\sqrt{nN\log N}\big) \qquad R_n(\text{EWA}) \le O\big(\sqrt{n\log N}\big)$$

Intuition: in online learning at each round we obtain N feedbacks, while in bandits we receive only 1 feedback.

SLIDE 59

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = (1 - \gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Compute

$$\tilde X_{i,t} = \hat X_{i,t} + \frac{\beta}{\hat p_{i,t}}$$

◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \tilde X_{i,t}\big)$

SLIDE 60

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Theorem

If Improved-Exp3 is run with parameters in the ranges

$$\gamma \le \frac{1}{2},\qquad 0 \le \eta \le \frac{\gamma}{2N},\qquad \sqrt{\frac{1}{nN}\log\frac{N}{\delta}} \le \beta \le 1$$

then it achieves a regret

$$R^{HP}_n(\mathcal{A}) \le n\big(\gamma + \eta(1 + \beta)N\big) + \frac{\log N}{\eta} + 2nN\beta$$

with probability at least 1 − δ.

SLIDE 61

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Theorem

If Improved-Exp3 is run with

$$\beta = \sqrt{\frac{1}{nN}\log\frac{N}{\delta}},\qquad \gamma = \frac{4N\beta}{3 + \beta},\qquad \eta = \frac{\gamma}{2N}$$

then it achieves a regret

$$R^{HP}_n(\mathcal{A}) \le \frac{11}{2}\sqrt{nN\log(N/\delta)} + \frac{\log N}{2}$$

with probability at least 1 − δ.

SLIDE 62

Connections to Game Theory

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 63

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

A two–player zero–sum game (Red chooses a row, Blue chooses a column; each cell lists Red's payoff, Blue's payoff):

        A          B          C
1    30, −30   −10, 10    20, −20
2    10, −10    20, −20   −20, 20

Nash equilibrium: A set of strategies is a Nash equilibrium if no player can do better by unilaterally changing his strategy.

Red: take action 1 with prob. 4/7 and action 2 with prob. 3/7
Blue: take action A with prob. 0, action B with prob. 4/7, and action C with prob. 3/7

Value of the game: V = 20/7 (reward of Red at the equilibrium)

SLIDE 64

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

At each round t:

◮ Row player computes a mixed strategy $\hat p_t = (\hat p_{1,t}, \dots, \hat p_{N,t})$
◮ Column player computes a mixed strategy $\hat q_t = (\hat q_{1,t}, \dots, \hat q_{M,t})$
◮ Row player selects action $I_t \in \{1, \dots, N\}$
◮ Column player selects action $J_t \in \{1, \dots, M\}$
◮ Row player suffers $\ell(I_t, J_t)$
◮ Column player suffers $-\ell(I_t, J_t)$

Value of the game:

$$V = \max_q \min_p \bar\ell(p, q) \quad \text{with} \quad \bar\ell(p, q) = \sum_{i=1}^N\sum_{j=1}^M p_i q_j \,\ell(i, j)$$

SLIDE 65

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Question: what if the two players are both bandit algorithms (e.g., Exp3)?

Row player: a bandit algorithm is able to minimize

$$R_n(\text{row}) = \sum_{t=1}^n \ell_{I_t,J_t} - \min_{i=1,\dots,N}\sum_{t=1}^n \ell_{i,J_t}$$

Column player: a bandit algorithm is able to minimize

$$R_n(\text{col}) = \sum_{t=1}^n \ell_{I_t,J_t} - \min_{j=1,\dots,M}\sum_{t=1}^n \ell_{I_t,j}$$

SLIDE 66

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Theorem

If both the row and the column player play according to a Hannan-consistent strategy, then

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) = V$$
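As a concrete illustration of this theorem, a small self-play experiment on the payoff matrix of the earlier example: two Exp3-style players, each observing only its own payoff for the joint play. The rescaling of payoffs to [0, 1] and all parameter values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Red's payoffs from the slide-63 example; Blue's payoff is the negation
PAYOFF = np.array([[30.0, -10.0, 20.0],
                   [10.0,  20.0, -20.0]])

def sample(log_w, gamma):
    """Exp3 distribution: normalized weights mixed with uniform exploration."""
    w = np.exp(log_w - log_w.max())
    p = (1 - gamma) * w / w.sum() + gamma / len(w)
    return int(rng.choice(len(w), p=p)), p

n, gamma = 200_000, 0.02
lw_red, lw_blue = np.zeros(2), np.zeros(3)
counts_red, counts_blue = np.zeros(2), np.zeros(3)
total = 0.0
for t in range(n):
    i, p_red = sample(lw_red, gamma)
    j, p_blue = sample(lw_blue, gamma)
    g = PAYOFF[i, j]
    # importance-weighted updates on payoffs rescaled from [-30, 30] to [0, 1]
    lw_red[i] += gamma * ((g + 30) / 60) / p_red[i]
    lw_blue[j] += gamma * ((30 - g) / 60) / p_blue[j]
    counts_red[i] += 1
    counts_blue[j] += 1
    total += g

print(counts_red / n)   # should approach (4/7, 3/7)
print(counts_blue / n)  # should approach (0, 4/7, 3/7)
print(total / n)        # should approach the value V = 20/7 ≈ 2.86
```

Note that neither player ever sees the payoff matrix, the opponent's loss, or the opponent's action, matching the generality remarks below.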

SLIDE 67

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Theorem

The empirical distributions of play

$$\hat p_{i,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{I_t = i\} \qquad \hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$

induce a product distribution $\hat p_n \times \hat q_n$ which converges to the set of Nash equilibria $p \times q$.

SLIDE 68

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Proof idea. Since $\bar\ell(p, J_t)$ is linear over the simplex, the minimum is at one of the corners [math]:

$$\min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t) = \min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t)$$

We consider the empirical probability of the column player [def]:

$$\hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$

Elaborating on it [math]:

$$\min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t) = \min_p \sum_{j=1}^M \hat q_{j,n}\,\bar\ell(p, j) = \min_p \bar\ell(p, \hat q_n) \le \max_q\min_p \bar\ell(p, q) = V$$

SLIDE 69

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Proof idea. By the definition of a Hannan-consistent strategy [def]:

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) = \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)$$

Then

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \le V$$

If we do the same for the other player [zero–sum game]:

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \ge V$$

SLIDE 70

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Question: how fast do they converge to the Nash equilibrium?

Answer: it depends on the specific algorithm. For EWA(η), we know that

$$\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\sum_{t=1}^n \ell(i, J_t) \le \frac{\log N}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2}\log\frac{1}{\delta}}$$

SLIDE 71

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Generality of the results:

◮ Players do not know the payoff matrix
◮ Players do not observe the loss of the other player
◮ Players do not even observe the action of the other player

SLIDE 72

Connections to Game Theory

Internal Regret and Correlated Equilibria

External (expected) regret:

$$R_n = \sum_{t=1}^n \bar\ell(\hat p_t, y_t) - \min_{i=1,\dots,N}\sum_{t=1}^n \ell(i, y_t) = \max_{i=1,\dots,N}\sum_{t=1}^n\sum_{j=1}^N \hat p_{j,t}\big(\ell(j, y_t) - \ell(i, y_t)\big)$$

Internal (expected) regret:

$$R^I_n = \max_{i,j=1,\dots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$

SLIDE 73

Connections to Game Theory

Internal Regret and Correlated Equilibria

Internal (expected) regret:

$$R^I_n = \max_{i,j=1,\dots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$

Intuition: an algorithm has small internal regret if, for each pair of experts (i, j), the learner does not regret not having followed expert j each time it followed expert i.

SLIDE 74

Connections to Game Theory

Internal Regret and Correlated Equilibria

Theorem

Given a K–person game with a set of correlated equilibria C, if all the players are internal–regret minimizers, then the distance between the empirical distribution of plays and the set of correlated equilibria C converges to 0.

SLIDE 75

Connections to Game Theory

Nash Equilibria in Extensive Form Games

A powerful model for sequential games:

◮ Checkers / Chess / Go
◮ Poker
◮ Bargaining
◮ Monitoring
◮ Patrolling
◮ ...

SLIDE 76

Connections to Game Theory

Nash Equilibria in Extensive Form Games

SLIDE 77

Connections to Game Theory

Nash Equilibria in Extensive Form Games

SLIDE 78

Connections to Game Theory

Nash Equilibria in Extensive Form Games

No details about the algorithm... but...

Theorem

If player k selects actions according to the counterfactual regret minimization algorithm, then it achieves a regret

$$R_{k,T} \le \frac{\#\text{states}\,\sqrt{\#\text{actions}}}{\sqrt{T}}$$

Theorem

In a two–player zero–sum extensive form game, counterfactual regret minimization achieves a 2ε-Nash equilibrium, with

$$\epsilon \le \frac{\#\text{states}\,\sqrt{\#\text{actions}}}{\sqrt{T}}$$

SLIDE 79

Other Stochastic Multi-arm Bandit Problems

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 80

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Motivating Examples

◮ Find the best shortest path in a limited number of days
◮ Maximize the confidence about the best treatment after a finite number of patients
◮ Discover the best advertisements after a training phase
◮ ...

SLIDE 81

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm $i^* = \arg\max_i \mu_i$ at the end of the experiment.

Measure of performance: the probability of error $\mathbb{P}[J_n \ne i^*]$:

$$\mathbb{P}[J_n \ne i^*] \le \sum_{i=1}^N \exp\left(-T_{i,n}\Delta_i^2\right)$$

Algorithm idea: mimic the behavior of the optimal strategy

$$T_{i,n} = \frac{1/\Delta_i^2}{\sum_{j=1}^N 1/\Delta_j^2}\, n$$

SLIDE 82

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define (with $\overline{\log}(N) = \frac{1}{2} + \sum_{i=2}^N 1/i$)

$$n_k = \left\lceil\frac{1}{\overline{\log}(N)}\,\frac{n - N}{N + 1 - k}\right\rceil$$

◮ Set of active arms $A_k$ at phase k ($A_1 = \{1, \dots, N\}$)
◮ For each phase k = 1, . . . , N − 1 (see the sketch after this list):
  ◮ For each arm $i \in A_k$, pull arm i for $n_k - n_{k-1}$ rounds
  ◮ Remove the worst arm: $A_{k+1} = A_k \setminus \arg\min_{i\in A_k}\hat\mu_{i,n_k}$
◮ Return the only remaining arm $J_n = A_N$
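A minimal Python sketch of Successive Reject; the Bernoulli arms and function layout are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def successive_reject(means, budget):
    """Successive Reject: eliminate the worst empirical arm phase by phase."""
    N = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    n_k = lambda k: int(np.ceil((budget - N) / (log_bar * (N + 1 - k))))
    active = list(range(N))
    pulls, sums = np.zeros(N), np.zeros(N)
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(n_k(k) - prev):     # pull each active arm
                sums[i] += rng.binomial(1, means[i])
                pulls[i] += 1
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)                    # reject the worst arm
    return active[0]                            # the surviving arm J_n

print(successive_reject([0.3, 0.4, 0.5, 0.7], budget=4000))  # likely arm 3
```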

SLIDE 83

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

Theorem

The Successive Reject algorithm has a probability of error

$$\mathbb{P}[J_n \ne i^*] \le \frac{N(N - 1)}{2}\exp\left(-\frac{n - N}{\overline{\log}(N)\,H_2}\right)$$

with $H_2 = \max_{i=1,\dots,N} i\,\Delta_{(i)}^{-2}$.

SLIDE 84

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

$$B_{i,s} = \hat\mu_{i,s} + \sqrt{\frac{a}{s}}$$

◮ Select $I_t = \arg\max_i B_{i,T_{i,t-1}}$
◮ At the end return $J_n = \arg\max_i \hat\mu_{i,T_{i,n}}$

SLIDE 85

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

Theorem

The UCB-E algorithm with $a = \frac{25}{36}\,\frac{n - N}{H_1}$ has a probability of error

$$\mathbb{P}[J_n \ne i^*] \le 2nN\exp\left(-\frac{2a}{25}\right)$$

with $H_1 = \sum_{i=1}^N 1/\Delta_i^2$.

SLIDE 86

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

SLIDE 87

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Motivating Examples

◮ N production lines
◮ The test of the performance of a line is expensive
◮ We want an accurate estimation of the performance of each production line

SLIDE 88

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means $\hat\mu_{i,t}$ which is as accurate as possible for all the arms.

Notice: given an arm with mean $\mu_i$ and variance $\sigma_i^2$, if it is pulled $T_{i,n}$ times, then

$$L_{i,n} = \mathbb{E}\big[(\hat\mu_{i,T_{i,n}} - \mu_i)^2\big] = \frac{\sigma_i^2}{T_{i,n}} \qquad L_n = \max_i L_{i,n}$$

SLIDE 89

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: which numbers of pulls $(T_{1,n}, \dots, T_{N,n})$ (such that $\sum_i T_{i,n} = n$) minimize the loss?

$$(T^*_{1,n}, \dots, T^*_{N,n}) = \arg\min_{(T_{1,n},\dots,T_{N,n})} L_n$$

Answer:

$$T^*_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n \qquad L^*_n = \frac{\sum_{i=1}^N \sigma_i^2}{n} = \frac{\Sigma}{n}$$

SLIDE 90

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means $\hat\mu_{i,t}$ which is as accurate as possible for all the arms.

Measure of performance: the regret on the quadratic error

$$R_n(\mathcal{A}) = L_n(\mathcal{A}) - \frac{\sum_{i=1}^N \sigma_i^2}{n}$$

Algorithm idea: mimic the behavior of the optimal strategy

$$T_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n = \lambda_i n$$

SLIDE 91

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

A UCB–based strategy. At each time step t = 1, . . . , n:

◮ Estimate

$$\hat\sigma^2_{i,T_{i,t-1}} = \frac{1}{T_{i,t-1}}\sum_{s=1}^{T_{i,t-1}} X_{i,s}^2 - \hat\mu^2_{i,T_{i,t-1}}$$

◮ Compute

$$B_{i,t} = \frac{1}{T_{i,t-1}}\left(\hat\sigma^2_{i,T_{i,t-1}} + 5\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}}\right)$$

◮ Pull arm $I_t = \arg\max_i B_{i,t}$
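A minimal Python sketch of this variance-driven allocation; the Gaussian test arms, function name, and δ value are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_allocation(sample_fns, n, delta=0.01):
    """Pull the arm with the largest optimistic index
    B_{i,t} = (var_hat_i + 5 sqrt(log(1/delta) / (2 T_i))) / T_i."""
    N = len(sample_fns)
    xs = [[fn()] for fn in sample_fns]   # one initial pull per arm
    for t in range(N, n):
        idx = []
        for i in range(N):
            s = len(xs[i])
            var_hat = np.mean(np.square(xs[i])) - np.mean(xs[i]) ** 2
            idx.append((var_hat + 5 * np.sqrt(np.log(1 / delta) / (2 * s))) / s)
        i = int(np.argmax(idx))
        xs[i].append(sample_fns[i]())
    return [float(np.mean(x)) for x in xs], [len(x) for x in xs]

# Two Gaussian arms with variances 1 and 4: pulls should track the variances
fns = [lambda: rng.normal(0.0, 1.0), lambda: rng.normal(0.0, 2.0)]
means, counts = active_allocation(fns, n=5_000)
print(counts)  # roughly proportional to (1, 4), mimicking T*_{i,n}
```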

SLIDE 92

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

$$R_n(\mathcal{A}) \le \frac{98\log(n)}{n^{3/2}\lambda_{\min}^{5/2}} + O\left(\frac{\log n}{n^2}\right)$$


SLIDE 94

Other Stochastic Multi-arm Bandit Problems

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr