Empirical Likelihood Upper Confidence Bounds For Bandit Models
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz
Institut de Mathématique de Toulouse, Université Paul Sabatier
June 10th, 2014

Outline
1 Bandit Problems
2 Lower Bounds for the Regret
3 Optimistic Algorithms
4 The Kullback-Leibler UCB Algorithm
5 Non-parametric setting: Empirical Likelihood

Bandit Problems (Idealized) Motivation: Clinical Trials
Imagine you are a doctor:
• patients visit you one after another for a given disease
• you prescribe one of the (say) 5 treatments available
• the treatments are not equally efficient
• you do not know which one is the best; you observe the effect of the prescribed treatment on each patient
⇒ What do you do?
• You must choose each prescription using only the previous observations
• Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible

Bandit Problems The (stochastic) Multi-Armed Bandit Model
Environment: $K$ arms $\nu = (\nu_1, \dots, \nu_K)$ such that, for any possible choice of arm $a_t \in \{1, \dots, K\}$ at time $t$, the reward is
$X_t = X_{a_t, n_{a_t}(t)}$, where $n_a(t) = \sum_{s \le t} \mathbb{1}\{a_s = a\}$,
and for any $1 \le a \le K$, $n \ge 1$, $X_{a,n} \sim \nu_a$, and the $(X_{a,n})_{a,n}$ are independent.
Reward distributions: $\nu_a \in \mathcal{F}_a$, a parametric family (canonical exponential family) or not (general bounded rewards).
Example: Bernoulli rewards, $\nu_a = \mathcal{B}(\theta_a)$.
Strategy: the agent's actions follow a dynamical strategy $\pi = (\pi_1, \pi_2, \dots)$ such that $A_t = \pi_t(X_1, \dots, X_{t-1})$.
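
The model above can be simulated in a few lines. This is a minimal sketch assuming Bernoulli arms; the names `BernoulliBandit` and `run` are hypothetical and chosen only for illustration. The strategy receives the past (arm, reward) pairs, which carry the same information as the past rewards for a deterministic strategy.

```python
import random


class BernoulliBandit:
    """Stochastic K-armed bandit with Bernoulli(theta_a) rewards (illustrative)."""

    def __init__(self, thetas, seed=0):
        self.thetas = list(thetas)          # success probability of each arm
        self.counts = [0] * len(thetas)     # number of draws of each arm so far
        self.rng = random.Random(seed)      # private generator for reproducibility

    def pull(self, arm):
        """Draw arm `arm` and return a reward in {0, 1}."""
        self.counts[arm] += 1
        return 1 if self.rng.random() < self.thetas[arm] else 0


def run(bandit, strategy, horizon):
    """Play `horizon` rounds; `strategy` maps the past history to the next arm."""
    history = []                            # list of (arm, reward) pairs
    for _ in range(horizon):
        arm = strategy(history)
        history.append((arm, bandit.pull(arm)))
    return history


# Usage: a round-robin strategy over 3 arms for 30 rounds.
bandit = BernoulliBandit([0.2, 0.5, 0.9], seed=1)
history = run(bandit, lambda past: len(past) % 3, 30)
```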

Bandit Problems Real challenges
• Randomized clinical trials: original motivation since the 1930's; dynamic strategies can save resources
• Recommender systems: advertisement, website optimization, news, blog posts, ...
• Computer experiments: large systems can be simulated in order to optimize some criterion over a set of parameters, but the simulation cost may be high, so that only few choices are possible for the parameters
• Games and planning (tree-structured options)

Bandit Problems Performance Evaluation, Regret
Cumulated reward: $S_T = \sum_{t=1}^{T} X_t$.
Our goal: choose $\pi$ so as to maximize
$\mathbb{E}[S_T] = \sum_{t=1}^{T} \sum_{a=1}^{K} \mathbb{E}\big[\mathbb{E}[X_t \mathbb{1}\{A_t = a\} \mid X_1, \dots, X_{t-1}]\big] = \sum_{a=1}^{K} \mu_a \mathbb{E}[N_a^\pi(T)]$,
where $N_a^\pi(T) = \sum_{t \le T} \mathbb{1}\{A_t = a\}$ is the number of draws of arm $a$ up to time $T$, and $\mu_a = E(\nu_a)$.
Regret minimization: equivalent to minimizing
$R_T = T\mu^* - \mathbb{E}[S_T] = \sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a) \mathbb{E}[N_a^\pi(T)]$,
where $\mu^* = \max\{\mu_a : 1 \le a \le K\}$.
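
The regret decomposition can be checked on a small numerical example. The sketch below uses made-up means and draw counts (hypothetical values, not from the slides) and the hypothetical helper name `expected_regret`.

```python
def expected_regret(means, expected_draws):
    """R_T = sum over arms of (mu* - mu_a) * E[N_a(T)]."""
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, expected_draws))


# Hypothetical example: 3 arms, T = 100 draws in total.
means = [0.2, 0.5, 0.9]
draws = [10, 20, 70]
regret = expected_regret(means, draws)      # 0.7 * 10 + 0.4 * 20 + 0 * 70

# Same quantity written as T * mu* minus the expected cumulated reward.
total = sum(draws)
regret_alt = total * max(means) - sum(mu * n for mu, n in zip(means, draws))
```

Both expressions give a regret of 15 (up to floating-point rounding): only draws of sub-optimal arms contribute.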

Lower Bounds for the Regret Asymptotically Optimal Strategies
• A strategy $\pi$ is said to be consistent if, for any $\nu \in \mathcal{F}$, $\frac{1}{T} \mathbb{E}[S_T] \to \mu^*$.
• The strategy is uniformly efficient if, for all $\nu \in \mathcal{F}$ and all $\alpha > 0$, $R_T = o(T^\alpha)$.
• There are uniformly efficient strategies, and we consider the best achievable asymptotic performance among uniformly efficient strategies.

Lower Bounds for the Regret The Lower Bound of Lai and Robbins
One-parameter reward distribution: $\nu_a = \nu_{\theta_a}$, $\theta_a \in \Theta \subset \mathbb{R}$.
Theorem [Lai and Robbins, '85]: If $\pi$ is a uniformly efficient strategy, then for any $\theta \in \Theta^K$,
$\liminf_{T \to \infty} \frac{R_T}{\log(T)} \ge \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{KL}(\nu_a, \nu^*)}$,
where $\mathrm{KL}(\nu, \nu')$ denotes the Kullback-Leibler divergence.
For example, in the Bernoulli case:
$\mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q)) = d_{ber}(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}$.
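
A short sketch of how the constant in this lower bound can be evaluated for Bernoulli arms. The function names and the clipping constant `eps` are assumptions made for the example, not part of the slides.

```python
import math


def d_ber(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)   # clip away from 0 and 1 to avoid log(0)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over sub-optimal arms of (mu* - mu_a) / d_ber(mu_a, mu*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / d_ber(mu, mu_star)
               for mu in means if mu < mu_star)


# Hypothetical Bernoulli means: the regret of any uniformly efficient
# strategy grows at least like this constant times log(T).
constant = lai_robbins_constant([0.2, 0.5, 0.9])
```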

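Lower Bounds for the Regret Generalization by Burnetas and Katehakis
More general reward distributions: $\nu_a \in \mathcal{F}_a$.
Theorem [Burnetas and Katehakis, '96]: If $\pi$ is an efficient strategy, then, for any $\nu \in \mathcal{F}$,
$\liminf_{T \to \infty} \frac{R_T}{\log(T)} \ge \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{K_{\inf}(\nu_a, \mu^*)}$,
where
$K_{\inf}(\nu_a, \mu^*) = \inf\{\mathrm{KL}(\nu_a, \nu') : \nu' \in \mathcal{F}_a, E(\nu') \ge \mu^*\}$.
[Figure: probability simplex with vertices $\delta_0$, $\delta_{1/2}$, $\delta_1$, showing $\nu_a$, $\nu^*$, the level $\mu^*$ and $K_{\inf}(\nu_a, \mu^*)$]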

Lower Bounds for the Regret Intuition
• First assume that $\mu^*$ is known and that $T$ is fixed.
• How many draws $n_a$ of $\nu_a$ are necessary to know that $\mu_a < \mu^*$ with probability at least $1 - 1/T$?
• Test: $H_0 : \mu_a = \mu^*$ against $H_1 : \nu = \nu_a$.
• Stein's Lemma: if the type I error satisfies $\alpha_{n_a} \le 1/T$, then $\beta_{n_a} \gtrsim \exp(-n_a K_{\inf}(\nu_a, \mu^*))$
⇒ it can be smaller than $1/T$ if $n_a \ge \frac{\log(T)}{K_{\inf}(\nu_a, \mu^*)}$.
• How to do as well without knowing $\mu^*$ and $T$ in advance? Not asymptotically?
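
To get a feeling for the threshold log(T) / K_inf, here is a small numerical sketch. It assumes Bernoulli arms, for which K_inf(ν_a, µ*) reduces to d_ber(µ_a, µ*); the means and the horizon are hypothetical values.

```python
import math


def d_ber(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def draws_needed(mu_a, mu_star, horizon):
    """Smallest integer n with n >= log(T) / K_inf, taking K_inf = d_ber(mu_a, mu*)."""
    return math.ceil(math.log(horizon) / d_ber(mu_a, mu_star))


# Hypothetical numbers: an arm close to the best one needs many more draws
# than a clearly bad arm before it can be discarded.
n_close = draws_needed(0.8, 0.9, 10_000)
n_far = draws_needed(0.2, 0.9, 10_000)
```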

Optimistic Algorithms Optimism in the Face of Uncertainty
• Optimism is a heuristic principle popularized by [Lai & Robbins '85; Agrawal '95], which consists in letting the agent play as if the environment were the most favorable among all environments that are sufficiently likely given the observations accumulated so far.
• Surprisingly, this simple heuristic principle can be instantiated into algorithms that are robust, efficient and easy to implement in many scenarios pertaining to reinforcement learning.

Optimistic Algorithms Upper Confidence Bound Strategies
UCB [Lai & Robbins '85; Agrawal '95; Auer et al. '02]
• Construct an upper confidence bound for the expected reward of each arm:
$\frac{S_a(t)}{N_a(t)} + \sqrt{\frac{\log(t)}{2 N_a(t)}}$,
where the first term is the estimated reward and the second term is the exploration bonus.
• Choose the arm with the highest UCB.
• It is an index strategy [Gittins '79].
• Its behavior is easily interpretable and intuitively appealing.
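
The index above translates directly into code. This is a minimal sketch; the function names are hypothetical, and playing each untried arm first is an assumption made so that the index is always well defined.

```python
import math


def ucb_index(reward_sum, n_draws, t):
    """Empirical mean plus the exploration bonus sqrt(log(t) / (2 * N_a(t)))."""
    return reward_sum / n_draws + math.sqrt(math.log(t) / (2 * n_draws))


def choose_arm(reward_sums, draw_counts, t):
    """Return the arm with the highest index; untried arms are played first."""
    for arm, n in enumerate(draw_counts):
        if n == 0:
            return arm
    indices = [ucb_index(s, n, t) for s, n in zip(reward_sums, draw_counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```

A rarely drawn arm keeps a large bonus, so it is eventually tried again even if its empirical mean is low.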

Optimistic Algorithms UCB in Action

Optimistic Algorithms Performance of UCB
For rewards in $[0, 1]$, the regret of UCB is upper-bounded as
$\mathbb{E}[R_T] = O(\log(T))$
(finite-time regret bound) and
$\limsup_{T \to \infty} \frac{\mathbb{E}[R_T]}{\log(T)} \le \sum_{a : \mu_a < \mu^*} \frac{1}{2(\mu^* - \mu_a)}$.
Yet, in the case of Bernoulli variables, the right-hand side is greater than suggested by the bound of Lai & Robbins.
Many variants have been suggested to incorporate an estimate of the variance in the exploration bonus (e.g., [Audibert et al. '07]).
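
The gap between the two constants can be made concrete for Bernoulli arms. The sketch below is illustrative only: the means are hypothetical, and the function names are not from the slides. By Pinsker's inequality, d_ber(p, q) ≥ 2(p − q)², so the UCB constant is never smaller than the Lai & Robbins one.

```python
import math


def d_ber(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def ucb_constant(means):
    """Sum over sub-optimal arms of 1 / (2 * (mu* - mu_a))."""
    mu_star = max(means)
    return sum(1 / (2 * (mu_star - mu)) for mu in means if mu < mu_star)


def lai_robbins_constant(means):
    """Sum over sub-optimal arms of (mu* - mu_a) / d_ber(mu_a, mu*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / d_ber(mu, mu_star)
               for mu in means if mu < mu_star)


# Hypothetical low-mean arms, where the variance is small and UCB is loose.
means = [0.05, 0.1]
ratio = ucb_constant(means) / lai_robbins_constant(means)
```

For means near 1/2 the two constants are close; for means near 0 or 1 the ratio grows, which is the motivation for variance-aware or KL-based indices.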

The Kullback-Leibler UCB Algorithm The KL-UCB algorithm
Parameters: an operator $\Pi_{\mathcal{F}} : M_1(S) \to \mathcal{F}$; a non-decreasing function $f : \mathbb{N} \to \mathbb{R}$
Initialization: pull each arm of $\{1, \dots, K\}$ once
for $t = K$ to $T - 1$ do
• compute for each arm $a$ the quantity
$U_a(t) = \sup\Big\{E(\nu) : \nu \in \mathcal{F} \text{ and } \mathrm{KL}\big(\Pi_{\mathcal{F}}(\hat\nu_a(t)), \nu\big) \le \frac{f(t)}{N_a(t)}\Big\}$
• pick an arm $A_{t+1} \in \arg\max_{a \in \{1, \dots, K\}} U_a(t)$
end for

The Kullback-Leibler UCB Algorithm Sketch of analysis
• For every sub-optimal arm $a$,
$\{A_{t+1} = a\} \subseteq \{\mu^* \ge U_{a^*}(t)\} \cup \{\mu^* < U_a(t) \text{ and } A_{t+1} = a\}$.
• Choose $f(t)$ such that for all $a$, $P(\mu_a \ge U_a(t)) \le 1/t$.
• $\{\mu^* < U_a(t)\} = \{\hat\nu_{a, N_a(t)} \in C_{\mu^*, f(t)/N_a(t)}\}$, where for $\mu \in \mathbb{R}$ and $\gamma > 0$,
$C_{\mu, \gamma} \subseteq \{\nu \in M_1(S) : K_{\inf}(\Pi_{\mathcal{F}}(\nu), \mu) \le \gamma\}$.
• This event is typical iff $N_a(t) \le f(T) / K_{\inf}(\nu_a, \mu^*)$:
$\sum_{n > f(T)/K_{\inf}(\nu_a, \mu^*)} P\big(\hat\nu_{a,n} \in C_{\mu^*, f(T)/n}\big) = o(\log(T))$.
[Figure: probability simplex with vertices $\delta_0$, $\delta_{1/2}$, $\delta_1$, showing $\nu_a$, $\nu^*$, $\nu_\gamma$, the set $C_{\mu^*, \gamma}$, $\kappa_a(\gamma)$, the level $\mu^*$ and $K_{\inf}(\nu_a, \mu^*)$]

The Kullback-Leibler UCB Algorithm Parametric setting: Exponential Families
Assume that $\mathcal{F}_a$ is a canonical one-dimensional exponential family, i.e. such that the pdf of the rewards is given by
$p_{\theta_a}(x) = \exp\big(x\theta_a - b(\theta_a) + c(x)\big)$, $1 \le a \le K$,
for a parameter $\theta \in \mathbb{R}^K$, with expectation $\mu_a = \dot{b}(\theta_a)$.
The KL-UCB index is simply:
$U_a(t) = \sup\Big\{\mu \in I : d\big(\hat\mu_a(t), \mu\big) \le \frac{f(t)}{N_a(t)}\Big\}$
For instance,
• for Bernoulli rewards: $d_{ber}(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}$
• for exponential rewards $p_{\theta_a}(x) = \theta_a e^{-\theta_a x}$, with $u$ and $v$ denoting the means: $d_{exp}(u, v) = \frac{u}{v} - 1 - \log\frac{u}{v}$
The analysis is generic and yields a non-asymptotic regret bound that is optimal in the sense of Lai and Robbins.
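
In a canonical exponential family, the divergence can be computed from the log-partition function b alone, through the standard identity KL(p_θ, p_θ') = b(θ') − b(θ) − ḃ(θ)(θ' − θ), which is not stated on the slide. The sketch below checks this identity numerically for Bernoulli rewards, using θ = log(p / (1 − p)) and b(θ) = log(1 + e^θ); all function names are hypothetical.

```python
import math


def exp_family_kl(b, b_dot, theta, theta_prime):
    """KL(p_theta, p_theta') = b(theta') - b(theta) - b_dot(theta) * (theta' - theta)."""
    return b(theta_prime) - b(theta) - b_dot(theta) * (theta_prime - theta)


def b_ber(theta):
    """Log-partition function of the Bernoulli family."""
    return math.log1p(math.exp(theta))


def b_dot_ber(theta):
    """Derivative of b_ber: the mean p as a function of theta."""
    return 1 / (1 + math.exp(-theta))


def logit(p):
    """Natural parameter of Bernoulli(p)."""
    return math.log(p / (1 - p))


def d_ber(p, q):
    """Direct formula for the Bernoulli divergence."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


p, q = 0.3, 0.6
via_family = exp_family_kl(b_ber, b_dot_ber, logit(p), logit(q))
direct = d_ber(p, q)    # agrees with via_family up to rounding
```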

The Kullback-Leibler UCB Algorithm The kl-UCB algorithm
Parameters: $\mathcal{F}$ parameterized by the expectation $\mu \in I \subset \mathbb{R}$ with divergence $d$; a non-decreasing function $f : \mathbb{N} \to \mathbb{R}$
Initialization: pull each arm of $\{1, \dots, K\}$ once
for $t = K$ to $T - 1$ do
• compute for each arm $a$ the quantity
$U_a(t) = \sup\Big\{\mu \in I : d\big(\hat\mu_a(t), \mu\big) \le \frac{f(t)}{N_a(t)}\Big\}$
• pick an arm $A_{t+1} \in \arg\max_{a \in \{1, \dots, K\}} U_a(t)$
end for
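
The supremum defining U_a(t) has no closed form in general, but since d(µ̂, ·) is increasing to the right of µ̂ it can be found by bisection. The following is a sketch for Bernoulli rewards only; the function names, the tolerance `tol`, and the choice f(t) = log(t) in the usage lines are assumptions made for the example.

```python
import math


def d_ber(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)   # clip away from 0 and 1 to avoid log(0)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, n_draws, budget, tol=1e-9):
    """Largest q in [mean, 1] with n_draws * d_ber(mean, q) <= budget, by bisection."""
    level = budget / n_draws
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d_ber(mean, mid) <= level:
            lo = mid        # mid is still inside the confidence region
        else:
            hi = mid        # mid is outside: the bound lies below it
    return lo


# Usage with the assumed exploration function f(t) = log(t), at t = 100.
u_few = kl_ucb_index(0.5, 10, math.log(100))      # arm drawn 10 times
u_many = kl_ucb_index(0.5, 1000, math.log(100))   # arm drawn 1000 times
```

The index shrinks towards the empirical mean as the arm is drawn more often, and it always stays within [0, 1], unlike the additive UCB bonus.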
