SLIDE 1

The Nonstochastic Multi Armed Bandit Problem Part 2 and counting...

Shahaf Nacson

TAU

Nov 15, 2017

SLIDE 2

Reminder from last week

SLIDE 3

Background

Problem setup: K arms

Assume K is known to the player in advance

Rewards $x_i(t)$ are bounded in [0, 1]

Generalizes trivially to [a, b] via $(b - a)x + a$

Partial information

The player learns only the rewards of the arms they chose

The slot machines need not have a fixed distribution

They can even be adversarial

SLIDE 4

Background

Problem setup: the rewards assignment is determined in advance

I.e., before the first arm is pulled. Assignments can be picked after the player's strategy is already known.

We want to minimize the regret

SLIDE 5

Notations

K - number of possible actions (i.e., arms)

denoted usually by $i \in \{1, \dots, K\}$

T - total time

denoted usually by time $t \in \{1, \dots, T\}$; one action per time t

$x_i(t)$ - reward of arm i at time t

$x_i(t) \in [0, 1]$

A - player's strategy

Chooses arm $i_t$ at time t (and receives reward $x_{i_t}(t)$). The player only knows the rewards $x_{i_1}(1), \dots, x_{i_t}(t)$ of the previously chosen actions $i_1, \dots, i_t$

SLIDE 6

Notations take 2

K - number of possible actions (i.e., arms)

denoted usually by $i \in \{1, \dots, K\}$

T - total time

denoted usually by time $t \in \{1, \dots, T\}$; one action per time t

$x_i(t)$ - reward of arm i at time t

$x_i(t) \in [0, 1]$

A - player's strategy

Can be viewed as a sequence $I_1, I_2, \dots$ where each $I_t$ is a mapping $(\{1, \dots, K\} \times [0, 1])^{t-1} \to \{1, \dots, K\}$, i.e., from the previously chosen action indices and their observed rewards to the index of the next action
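To make the protocol concrete, here is a minimal sketch of the interaction loop in Python (illustrative only; the function and strategy names are ours, not from the slides):

    import random

    # rewards[t][i] is x_i(t), assigned by the adversary before play begins.
    def play(rewards, strategy, K, T):
        history = []                    # [(i_1, r_1), ..., (i_{t-1}, r_{t-1})]
        total = 0.0
        for t in range(T):
            i_t = strategy(history, K)  # I_t maps past (action, reward) pairs to an action
            r_t = rewards[t][i_t]       # partial information: only the chosen arm's reward
            history.append((i_t, r_t))
            total += r_t
        return total                    # G_A(T)

    # A trivial strategy: pick an arm uniformly at random each round.
    uniform_strategy = lambda history, K: random.randrange(K)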

SLIDE 7

Notations take 3

$G_A(T)$ - total reward of strategy A at time horizon T:

$$G_A(T) := \sum_{t=1}^{T} x_{i_t}(t)$$

denoted $G_A$ when the context of T is obvious

SLIDE 8

Notations take 4

Regret take 1: Given a sequence of actions $(j_1, \dots, j_T)$, we denote

$$G_{(j_1, \dots, j_T)} := \sum_{t=1}^{T} x_{j_t}(t)$$

as the return of the sequence. The (worst-case) regret is defined as $G_{(j_1, \dots, j_T)} - G_A(T)$

SLIDE 9

Notations take 5

Regret take 2: $G_{\max}(T)$ - total reward of the best arm at time horizon T:

$$G_{\max}(T) := \max_{j} \sum_{t=1}^{T} x_j(t)$$

denoted $G_{\max}$ as well

The weak regret is defined as $G_{\max} - G_A(T)$. We will consider the weak regret from now on and will refer to it simply as "the regret"

SLIDE 10

Goals

SLIDE 11

Goals

The lower bound on the weak regret is $\Omega(\sqrt{KT})$

This does not match the upper bound of previous week's algorithm, which was $O(\sqrt{KT \ln K})$

Closing the gap is still an open problem (today??)

SLIDE 12

Goals

Lower bound on the weak regret: $\Omega(\sqrt{KT})$

Upper bounds on the weak regret that hold with probability 1 (if time permits...)

SLIDE 13

Lower bounds on the weak regret

SLIDE 14

Theorem 5.1. For any number of actions $K \ge 2$ and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak regret of any algorithm is $\Omega(\sqrt{KT})$

SLIDE 15

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

SLIDE 16

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

SLIDE 17

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

SLIDE 18

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

Deduce a lower bound on their difference

SLIDE 19

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

Deduce a lower bound on their difference

Proof by notations :)

SLIDE 20

Constructing the distribution

Before play begins, one action I is chosen uniformly at random to be the "good" action. Define binary rewards:

if $j = I$, meaning j is the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2} + \epsilon$, $\Pr[x_j(t) = 0] = \frac{1}{2} - \epsilon$

if $j \neq I$, meaning j is not the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2}$, $\Pr[x_j(t) = 0] = \frac{1}{2}$

for some small, fixed $\epsilon \in (0, \frac{1}{2}]$ to be chosen later down the road

SLIDE 21

Constructing the distribution

Before play begins, one action I is chosen uniformly at random to be the "good" action. Define binary rewards:

if $j = I$, meaning j is the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2} + \epsilon$, $\Pr[x_j(t) = 0] = \frac{1}{2} - \epsilon$

if $j \neq I$, meaning j is not the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2}$, $\Pr[x_j(t) = 0] = \frac{1}{2}$

Then the expected reward of the best action is $(\frac{1}{2} + \epsilon)T$
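As a quick sanity check on the construction (an illustration, not part of the proof), we can sample such a reward table and watch the good arm's empirical mean concentrate around $\frac{1}{2} + \epsilon$:

    import random

    def sample_rewards(K, T, eps, rng=random.Random(0)):
        I = rng.randrange(K)          # the "good" action, chosen uniformly at random
        p = [0.5 + eps if j == I else 0.5 for j in range(K)]
        # rewards[t][j] = x_j(t): Bernoulli(1/2 + eps) for the good arm, Bernoulli(1/2) otherwise
        rewards = [[1 if rng.random() < p[j] else 0 for j in range(K)] for _ in range(T)]
        return I, rewards

    K, T, eps = 5, 20000, 0.05
    I, rewards = sample_rewards(K, T, eps)
    print(sum(r[I] for r in rewards) / T)   # concentrates around 1/2 + eps = 0.55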

SLIDE 22

Constructing the distribution

Translation of our problem: our goal now is to show that for any given strategy A, we can find an $\epsilon$ s.t. A's expected regret is $\Omega(\sqrt{KT})$. We will soon see that $\epsilon$ depends only on the number of actions K and the total time T.
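To see the phenomenon numerically (again an illustration, with a hypothetical follow-the-leader player of our choosing): against this construction the expected weak regret w.r.t. the good arm is $\epsilon(T - E[N_I])$, which is at most $\epsilon T = \sqrt{KT}$ when $\epsilon = \sqrt{K/T}$:

    import math, random

    def ftl_regret(K, T, n_runs=20, rng=random.Random(1)):
        eps = math.sqrt(K / T)
        regrets = []
        for _ in range(n_runs):
            I = rng.randrange(K)
            sums = [0.0] * K
            counts = [0] * K
            gain, best = 0.0, 0.0
            for t in range(T):
                # explore each arm once, then follow the empirical leader
                i = t if t < K else max(range(K), key=lambda j: sums[j] / counts[j])
                p = 0.5 + (eps if i == I else 0.0)
                r = 1.0 if rng.random() < p else 0.0
                sums[i] += r
                counts[i] += 1
                gain += r
                # an independent draw of the good arm, to estimate G_max in expectation
                best += 1.0 if rng.random() < 0.5 + eps else 0.0
            regrets.append(best - gain)
        return sum(regrets) / n_runs

    print(ftl_regret(5, 20000))   # empirically on the order of sqrt(K*T) ~ 316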

SLIDE 23

Some more notations

$P_*\{\cdot\}$ - probability w.r.t. the aforementioned distribution

$P_i\{\cdot\}$ - probability conditioned on i being the good action:

$P_i\{\cdot\} = P_*\{\cdot \mid i = I\}$

$P_{unif}\{\cdot\}$ - probability w.r.t. the uniform distribution

Same for expectations $E_*[\cdot]$, $E_i[\cdot]$, $E_{unif}[\cdot]$

SLIDE 24

Some more notations

$P_*\{\cdot\}$ - probability w.r.t. the aforementioned distribution

$P_i\{\cdot\}$ - probability conditioned on i being the good action:

$P_i\{\cdot\} = P_*\{\cdot \mid i = I\}$

$P_{unif}\{\cdot\}$ - probability w.r.t. the uniform distribution

Same for expectations $E_*[\cdot]$, $E_i[\cdot]$, $E_{unif}[\cdot]$

We want to show: $E_*[G_{\max} - G_A] \ge \Omega(\sqrt{KT})$

SLIDE 25

Some more notations (...)

A - as before, the player's strategy

$r_t = x_{i_t}(t)$ - random variable denoting the reward received at time t

$\mathbf{r}^t = (r_1, \dots, r_t)$ - sequence of rewards up to time t

$\mathbf{r} = \mathbf{r}^T$ - the entire sequence

$N_i$ - number of times action i is chosen by A

SLIDE 26

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

SLIDE 27

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$ (an explicit constant, $c \approx 0.925$, comes out of the proof, but let's keep it simple).

SLIDE 28

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

Our first lemma bounds the difference between expectations when measured under our constructed distribution $E_i[\cdot]$ vs. the uniform distribution $E_{unif}[\cdot]$

SLIDE 29

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant c.

We can view $N_i$ as a function from reward sequences $\mathbf{r}$ to $[0, T]$ and apply Lemma A.1 to it.

SLIDE 30

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

SLIDE 31

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

The more time we play the machines (i.e., the larger T), the more information we have on the reward distribution, and so we can guess the "good" arm with better probability, or more times over the sequence.

SLIDE 32

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

The more time we play the machines (i.e., the larger T), the more information we have on the reward distribution, and so we can guess the "good" arm with better probability, or more times over the sequence.

A larger K means we choose each arm fewer times (i.e., smaller $E_{unif}[N_i]$): the more arms we have to choose from, the harder it is to distinguish between our distribution and the uniform one.

SLIDE 33

Lemma A.1

Proof of Lemma A.1.

SLIDE 34

Lemma A.1

Proof of Lemma A.1. That will have to wait a bit... Let's see what it gives us first! We will assume its correctness for now, and keep it in mind as our "technical debt"

SLIDE 35

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

SLIDE 36

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Notice that the lhs is exactly what we are trying to bound!

SLIDE 37

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Let's start with the easy one, and bound $E_*[G_{\max}]$

SLIDE 38

Proof of Theorem A.2

Let's start with the easy one, and bound $E_*[G_{\max}]$: the expected gain of the best action is at least the expected gain of the good action, so $E_*[G_{\max}] \ge T(\frac{1}{2} + \epsilon)$

SLIDE 39

Proof of Theorem A.2

The expected gain of the best action is at least the expected gain of the good action, so $E_*[G_{\max}] \ge T(\frac{1}{2} + \epsilon)$

Now let's find an upper bound for $E_*[G_A]$, yielding the theorem.

SLIDE 40

Proof of Theorem A.2

If action I is chosen to be the good action, then clearly the expected reward at time t is $\frac{1}{2} + \epsilon$ if $i_t = I$ and $\frac{1}{2}$ if $i_t \neq I$:

$$E_i[r_t] = \left(\tfrac{1}{2} + \epsilon\right) P_i\{i_t = I\} + \tfrac{1}{2} P_i\{i_t \neq I\} = \tfrac{1}{2} + \epsilon P_i\{i_t = I\}$$

The last equality is obtained by inserting $P_i\{i_t \neq I\} = 1 - P_i\{i_t = I\}$

SLIDE 41

Proof of Theorem A.2

If action I is chosen to be the good action, then clearly the expected reward at time t is $\frac{1}{2} + \epsilon$ if $i_t = I$ and $\frac{1}{2}$ if $i_t \neq I$:

$$E_i[r_t] = \left(\tfrac{1}{2} + \epsilon\right) P_i\{i_t = I\} + \tfrac{1}{2} P_i\{i_t \neq I\} = \tfrac{1}{2} + \epsilon P_i\{i_t = I\}$$

Thus, the expected gain of algorithm A is:

$$E_i[G_A] = \sum_{t=1}^{T} E_i[r_t] = \frac{T}{2} + \epsilon E_i[N_i] \quad (31)$$

SLIDE 42

Proof of Theorem A.2

$$E_i[G_A] = \sum_{t=1}^{T} E_i[r_t] = \frac{T}{2} + \epsilon E_i[N_i]$$

Note that:

$$E_*[G_A] = \frac{1}{K} \sum_{i=1}^{K} E_i[G_A] = \frac{T}{2} + \epsilon \frac{1}{K} \sum_{i=1}^{K} E_i[N_i]$$

And so, by applying Lemma A.1 to $N_i$ we will get an upper bound for $E_i[N_i]$, and using the equality above, we will get an upper bound for $E_*[G_A]$

SLIDE 43

Proof of Theorem A.2

Applying Lemma A.1 to $N_i$ (note $N_i \in [0, T]$):

$$E_i[N_i] \le E_{unif}[N_i] + c \epsilon T \sqrt{E_{unif}[N_i]}$$

$$\sum_{i=1}^{K} E_i[N_i] \le \sum_{i=1}^{K} \left( E_{unif}[N_i] + c \epsilon T \sqrt{E_{unif}[N_i]} \right) \le T + c \epsilon T \sqrt{TK}$$

relying on $\sum_{i=1}^{K} E_{unif}[N_i] = T$, and so $\sum_{i=1}^{K} \sqrt{E_{unif}[N_i]} \le \sqrt{TK}$ by the Cauchy-Schwarz inequality
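The Cauchy-Schwarz step says that for any nonnegative $n_i$ summing to T, $\sum_i \sqrt{n_i} \le \sqrt{TK}$; a quick numeric check (sketch, with a random split of T pulls among K arms):

    import math, random

    rng = random.Random(0)
    K, T = 10, 1000
    cuts = sorted(rng.randrange(T + 1) for _ in range(K - 1))
    n = [b - a for a, b in zip([0] + cuts, cuts + [T])]        # random split: sum(n) == T
    print(sum(math.sqrt(v) for v in n) <= math.sqrt(T * K))    # True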

SLIDE 44

Proof of Theorem A.2

$$\sum_{i=1}^{K} E_i[N_i] \le T + c \epsilon T \sqrt{TK}$$

Combining with (31) we get:

$$E_*[G_A] = \frac{T}{2} + \epsilon \frac{1}{K} \sum_{i=1}^{K} E_i[N_i] \le \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right)$$

And we got an upper bound :)

SLIDE 45

Proof of Theorem A.2

$$E_*[G_A] \le \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right)$$

And finally, the regret:

$$E_*[G_{\max}] - E_*[G_A] \ge T\left(\tfrac{1}{2} + \epsilon\right) - \left( \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right) \right) = \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

SLIDE 46

Proof of Theorem A.2

$$E_*[G_{\max}] - E_*[G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Since $c < 1$, choosing $\epsilon$ proportional to $\sqrt{K/T}$ (with a small enough constant) gives a lower bound of $\Omega(\sqrt{KT})$.
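Plugging in concrete numbers (a quick check of the algebra; we treat the constant c as a parameter and set $\epsilon = \alpha\sqrt{K/T}$ with $\alpha$ tuned to keep the bracket positive):

    import math

    def regret_lower_bound(K, T, alpha, c=0.925):
        # epsilon = alpha*sqrt(K/T) turns the bound into sqrt(K*T) * alpha * (1 - 1/K - c*alpha)
        eps = alpha * math.sqrt(K / T)
        return eps * (T - T / K - c * eps * T * math.sqrt(T / K))

    K, T, c = 10, 100_000, 0.925
    alpha = (1 - 1 / K) / (2 * c)                          # maximizes alpha*(1 - 1/K - c*alpha)
    print(regret_lower_bound(K, T, alpha))                 # ~ 218.9
    print(math.sqrt(K * T) * (1 - 1 / K) ** 2 / (4 * c))   # same value in closed form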

SLIDE 47

Back to Lemma A.1

SLIDE 48

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

SLIDE 49

Statistical distance

For any distributions P and Q, let

$$\|P - Q\|_1 := \sum_{x \in \{0,1\}^n} |P\{x\} - Q\{x\}|$$

denote the statistical distance between P and Q. Intuitively, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.

SLIDE 50

Statistical distance

$$\|P - Q\|_1 := \sum_{x \in \{0,1\}^n} |P\{x\} - Q\{x\}|$$

Thus,

$$\tfrac{1}{2}\|P - Q\|_1 = \sum_{x : P\{x\} \ge Q\{x\}} \left( P\{x\} - Q\{x\} \right)$$

Proof on board
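A small numeric sanity check of this identity on a random example (sketch; the domain size and distributions are arbitrary):

    import random

    rng = random.Random(0)
    n = 8
    p = [rng.random() for _ in range(n)]; p = [v / sum(p) for v in p]   # normalize to a distribution
    q = [rng.random() for _ in range(n)]; q = [v / sum(q) for v in q]

    lhs = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    rhs = sum(a - b for a, b in zip(p, q) if a >= b)
    print(lhs, rhs)   # equal, since both distributions sum to 1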

SLIDE 51

Proof of Lemma A.1.

We have that

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] = \sum_{\mathbf{r}} f(\mathbf{r}) \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right) \le \sum_{\mathbf{r} : P_i\{\mathbf{r}\} \ge P_{unif}\{\mathbf{r}\}} f(\mathbf{r}) \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right)$$

Since $f(\mathbf{r}) \in [0, T]$,

$$\le T \sum_{\mathbf{r} : P_i\{\mathbf{r}\} \ge P_{unif}\{\mathbf{r}\}} \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right)$$

and from the property we just proved on board,

$$= \frac{T}{2} \|P_i - P_{unif}\|_1$$

SLIDE 52

Proof of Lemma A.1.

Lemma A.1: $E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$

What we've got so far: $E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le \frac{T}{2} \|P_i - P_{unif}\|_1$

We're left to prove that $\|P_i - P_{unif}\|_1 \le 2 c \epsilon \sqrt{E_{unif}[N_i]}$

SLIDE 53

Proof of Lemma A.1.

We're left to prove that $\|P_i - P_{unif}\|_1 \le 2 c \epsilon \sqrt{E_{unif}[N_i]}$

We will actually see

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

And since $-\ln(1 - x) \le 4 \ln(\frac{4}{3}) x$ for $x \in [0, \frac{1}{4}]$, the lemma will hold.
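The elementary inequality $-\ln(1 - x) \le 4\ln(\frac{4}{3})x$ on $[0, \frac{1}{4}]$ is easy to spot-check numerically (sketch; the small tolerance only absorbs floating-point rounding at the endpoint):

    import math

    c4 = 4 * math.log(4 / 3)                             # ~ 1.1507
    xs = [i / 4000 for i in range(1, 1001)]              # grid over (0, 1/4]
    print(all(-math.log(1 - x) <= c4 * x + 1e-12 for x in xs))   # True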

SLIDE 54

Story time

Let P, Q be any two distributions. Define a random variable Y and a distribution $\bar{P}$ as follows:

$$y = \log \frac{P\{x\}}{Q\{x\}}, \qquad \bar{P}\{y\} = P\{x\}$$

Then

$$\sum_y \bar{P}\{y\} = \sum_x P\{x\} = 1, \qquad 0 \le \bar{P}\{y\} \le 1$$

meaning $\bar{P}\{y\}$ is indeed a well defined probability function.

SLIDE 55

Story time

Let P, Q be any two distributions. Define a random variable Y and a distribution $\bar{P}$ as follows:

$$y = \log \frac{P\{x\}}{Q\{x\}}, \qquad \bar{P}\{y\} = P\{x\}$$

Then the expectation of Y is given by

$$E[Y] = \sum_y y \bar{P}\{y\} = \sum_x P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

$E[Y]$ is defined as the KL divergence, or relative entropy, of the distributions P, Q, denoted $KL(P \| Q)$

SLIDE 56

KL divergence (a.k.a relative entropy)

Let

$$KL(P \| Q) := \sum_{x \in \{0,1\}^n} P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

be the Kullback-Leibler divergence, or relative entropy, between the two distributions. In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.

SLIDE 57

KL divergence (a.k.a relative entropy)

$$KL(P \| Q) := \sum_{x \in \{0,1\}^n} P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

Or in our notations:

$$KL(P \| Q) := \sum_{\mathbf{r} \in \{0,1\}^T} P\{\mathbf{r}\} \log \frac{P\{\mathbf{r}\}}{Q\{\mathbf{r}\}}$$

SLIDE 58

KL divergence (a.k.a relative entropy)

So naturally, the conditional relative entropy of $r_t$ given $\mathbf{r}^{t-1}$ is

$$KL(P\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, Q\{r_t \mid \mathbf{r}^{t-1}\}) := \sum_{\mathbf{r}^{t-1} \in \{0,1\}^{t-1}} P\{\mathbf{r}^{t-1}\} \sum_{r_t \in \{0,1\}} P\{r_t \mid \mathbf{r}^{t-1}\} \log \frac{P\{r_t \mid \mathbf{r}^{t-1}\}}{Q\{r_t \mid \mathbf{r}^{t-1}\}} = \sum_{\mathbf{r}^t \in \{0,1\}^t} P\{\mathbf{r}^t\} \log \frac{P\{r_t \mid \mathbf{r}^{t-1}\}}{Q\{r_t \mid \mathbf{r}^{t-1}\}}$$

SLIDE 59

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
SLIDE 60

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2})$?

SLIDE 61

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2})$? 0

Note that this is the divergence between the uniform distribution and our own conditional distribution, given we didn't choose the good action.

SLIDE 62

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon)$?

SLIDE 63

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon)$? $-\frac{1}{2} \ln(1 - 4\epsilon^2)$

Note that this is the divergence between the uniform distribution and our own conditional distribution, given we chose the good action.
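A numeric check of this closed form (sketch; KL taken here in nats, i.e., with the natural log, matching the ln in the formula):

    import math

    def kl_bern(p, q):   # KL(p || q) for Bernoulli distributions, in nats
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    eps = 0.1
    print(kl_bern(0.5, 0.5 + eps))                 # ~ 0.02041
    print(-0.5 * math.log(1 - 4 * eps * eps))      # same: -(1/2) ln(1 - 4 eps^2)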

SLIDE 64

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We will now see

$$\|P_{unif} - P_i\|_1^2 \le (2 \ln 2) KL(P_{unif} \| P_i)$$

Remember, $P_{unif}$, $P_i$ are Bernoulli distributions with $p_{unif} = \frac{1}{2}$ and $p_i = \frac{1}{2} + \epsilon$

SLIDE 65

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

SLIDE 66

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

Actually, this is true for any two distributions, but we only need it for Bernoulli distributions here.

SLIDE 67

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

Recall that

$$KL(P \| Q) := KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

SLIDE 68

Proof of Lemma A.1., Bernoulli case

Reminder:

$$\tfrac{1}{2}\|P - Q\|_1 = \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right)$$

Thus,

$$\|P - Q\|_1 = 2 \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right), \qquad \|P - Q\|_1^2 = 4 \left( \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right) \right)^2$$

SLIDE 69

Proof of Lemma A.1., Bernoulli case

So in our case,

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(p \| q)$$

translates to

$$4 \left( \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right) \right)^2 \le (2 \ln 2) KL(p \| q)$$

But the lhs is simply $4(p - q)^2$ for Bernoulli distributions. So we are left to prove

$$4(p - q)^2 \le (2 \ln 2) KL(p \| q)$$

SLIDE 70

Proof of Lemma A.1., Bernoulli case

$$4(p - q)^2 \le (2 \ln 2) KL(p \| q)$$

Indeed, denote $g(p, q) = (2 \ln 2) KL(p \| q) - 4(p - q)^2$. Then

$$\frac{\partial g(p, q)}{\partial q} = (2 \ln 2)\left( -\frac{p}{q \ln 2} + \frac{1 - p}{(1 - q) \ln 2} \right) + 8(p - q) = \frac{2(q - p)}{q(1 - q)} - 8(q - p) \le 0$$

since $q(1 - q) \le \frac{1}{4}$ and $q \le p$. For $q = p$ we have $g(p, q) = 0$; hence, as g only grows when q decreases from p, $g(p, q) \ge 0$ for $q \le p$, which concludes the proof for the Bernoulli case.
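The Bernoulli case can also be spot-checked on a grid (sketch; KL taken here in bits, i.e., base-2 log, matching the $(2\ln 2)$ factor):

    import math

    def kl2(p, q):   # Bernoulli KL in bits (log base 2)
        return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

    ok = True
    for i in range(1, 100):
        for j in range(1, i + 1):          # q <= p, both strictly inside (0, 1)
            p, q = i / 100, j / 100
            l1_sq = 4 * (p - q) ** 2       # ||P - Q||_1^2 = (2|p - q|)^2 for Bernoullis
            ok &= l1_sq <= (2 * math.log(2)) * kl2(p, q) + 1e-12
    print(ok)   # True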

SLIDE 71

Finishing the proof

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We have just proven

$$\|P_{unif} - P_i\|_1^2 \le (2 \ln 2) KL(P_{unif} \| P_i) \le 2 KL(P_{unif} \| P_i)$$

(as $\ln 2 < 1$). Thus,

$$\|P_{unif} - P_i\|_1 \le \sqrt{2 KL(P_{unif} \| P_i)}$$
Shahaf Nacson The Nonstochastic Multi Armed Bandit Problem

slide-72
SLIDE 72

Reminder from last week Goals Lower bounds on the weak regret

Finishing the proof

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We have just proven

$$\|P_{unif} - P_i\|_1 \le \sqrt{2 KL(P_{unif} \| P_i)}$$

So, by proving

$$KL(P_{unif} \| P_i) \le -\tfrac{1}{2} E_{unif}[N_i] \ln(1 - 4\epsilon^2)$$

we conclude the proof and all is well ;)

SLIDE 73

Proof.finalize()

To prove $KL(P_{unif} \| P_i) \le -\frac{1}{2} E_{unif}[N_i] \ln(1 - 4\epsilon^2)$ we will use the "chain rule for relative entropy":

$$KL(P\{x, y\} \,\|\, Q\{x, y\}) = KL(P\{x\} \,\|\, Q\{x\}) + KL(P\{y \mid x\} \,\|\, Q\{y \mid x\})$$

Intuitively, this is the relative-entropy analogue of conditional probability. Its correctness follows directly from the definitions of conditional probability and conditional relative entropy, together with the logarithm product rule.

SLIDE 74

Proof.finalize()

As $P_{unif} = P_{unif}\{r_1, \dots, r_T\}$ and $P_i = P_i\{r_1, \dots, r_T\}$, the chain rule for relative entropy gives

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

SLIDE 75

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

This equality is given by the law of total expectation (i.e., the smoothing theorem)

SLIDE 76

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

$P_{unif}$ is always $\frac{1}{2}$ regardless of past results. $P_i$ is $\frac{1}{2}$ if we didn't choose the good action, and $\frac{1}{2} + \epsilon$ if we did.

SLIDE 77

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

(Note that $KL(\frac{1}{2} \| \frac{1}{2}) = 0$ and $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon) = -\frac{1}{2} \ln(1 - 4\epsilon^2)$)

SLIDE 78

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

$$= \sum_{t=1}^{T} P_{unif}\{i_t = i\} \left( -\tfrac{1}{2} \ln(1 - 4\epsilon^2) \right) = E_{unif}[N_i] \left( -\tfrac{1}{2} \ln(1 - 4\epsilon^2) \right)$$
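To put numbers on the final identity (an illustration: we hypothetically take a player that pulls arms uniformly at random, so $E_{unif}[N_i] = T/K$; any other strategy just changes $E_{unif}[N_i]$):

    import math

    K, T, eps = 5, 1000, 0.05
    E_Ni = T / K                                   # uniform play: ~T/K pulls of arm i
    kl_total = E_Ni * (-0.5 * math.log(1 - 4 * eps ** 2))
    print(kl_total)                                # KL(P_unif || P_i) ~ 1.005
    # Plugging into ||P_unif - P_i||_1 <= sqrt(2 * KL):
    print(math.sqrt(2 * kl_total))                 # = sqrt(-E[N_i] ln(1 - 4 eps^2)) ~ 1.418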

SLIDE 79

Proof.finalize()

Q.E.D
