Weighted Linear Bandits for Non-Stationary Environments

Yoan Russac¹, Claire Vernade², and Olivier Cappé¹
¹ CNRS, Inria, ENS, Université PSL   ² DeepMind
Roadmap
1. The Model
2. Related work
3. Concentration Result
4. Application to Non-Stationary Linear Bandits
5. Empirical Performances
The Model
The Non-Stationary Linear Model
At time $t$, the learner has access to a time-dependent finite set of arbitrary actions $\mathcal{A}_t = \{A_{t,1}, \dots, A_{t,K_t}\}$, where $A_{t,k} \in \mathbb{R}^d$ (with $\|A_{t,k}\|_2 \le L$). The actions can only be probed one at a time: the learner chooses an action $A_t \in \mathcal{A}_t$ and observes only the noisy linear reward
$$X_t = A_t^\top \theta_t^\star + \eta_t,$$
where $\eta_t$ is a $\sigma$-subgaussian random noise.

Specificity of the model:
- Non-stationarity: $\theta_t^\star$ depends on $t$
- Unstructured action set
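To make the setting concrete, here is a minimal simulation sketch of one round of the model; the construction (Gaussian noise, which is $\sigma$-subgaussian, and a uniform placeholder policy) is ours, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 2, 10, 0.1  # illustrative dimension, number of actions, noise level

def play_round(theta_star):
    """One round: draw a finite action set A_t, pick an action, observe a noisy reward."""
    actions = rng.normal(size=(K, d))
    # Enforce the bounded-action assumption ||A_{t,k}||_2 <= L, here with L = 1.
    actions /= np.maximum(1.0, np.linalg.norm(actions, axis=1, keepdims=True))
    k = rng.integers(K)  # placeholder policy: a real learner chooses A_t from the history
    reward = actions[k] @ theta_star + sigma * rng.normal()  # X_t = <A_t, theta*_t> + eta_t
    return actions, k, reward
```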
Optimality Criteria
Dynamic Regret Minimization
$$\max\;\mathbb{E}\Big[\sum_{t=1}^T X_t\Big] \;\Longleftrightarrow\; \min\;\mathbb{E}\Big[\sum_{t=1}^T \max_{a\in\mathcal{A}_t}\langle a, \theta_t^\star\rangle - \sum_{t=1}^T X_t\Big] \;\Longleftrightarrow\; \min\;\underbrace{\mathbb{E}\Big[\sum_{t=1}^T \max_{a\in\mathcal{A}_t}\langle a - A_t, \theta_t^\star\rangle\Big]}_{\text{dynamic regret}}$$
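In simulation, where the sequence $(\theta_t^\star)_{t\ge1}$ is known, the dynamic regret can be accumulated directly from this definition; a short sketch (names are ours):

```python
import numpy as np

def dynamic_regret(action_sets, chosen, thetas):
    """Dynamic regret: sum over t of max_{a in A_t} <a - A_t, theta*_t>.

    action_sets: list of (K_t, d) arrays; chosen: indices of the played actions;
    thetas: (T, d) array of true parameters (known only in simulation).
    """
    total = 0.0
    for A_t, k, theta in zip(action_sets, chosen, thetas):
        values = A_t @ theta                 # <a, theta*_t> for every a in A_t
        total += values.max() - values[k]    # instantaneous regret of round t
    return total
```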
Difference to Specific Cases
1. When $\mathcal{A}_t \to I_d = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix}$ (the action set is the canonical basis), the model reduces to the (non-stationary) multi-armed bandit model. If $\theta_t^\star = \theta^\star$, there is a single best action $a^\star$: it is only necessary to control the deviations of $\hat\theta_t$ in the principal directions.

2. If $\mathcal{A}_t \to I_d \otimes A_t = \begin{pmatrix} A_t & & \\ & \ddots & \\ & & A_t \end{pmatrix}$, with $(A_t)_{t\ge1}$ i.i.d., $\varepsilon$-greedy exploration may be efficient.
Non-Stationarity and Bandits
Two different approaches are commonly used to deal with non-stationary bandits:
- detecting changes in the distribution of the arms;
- building methods that are (somewhat) robust to variations of the environment.

Their performance depends on the assumptions made on the sequence of environment parameters $(\theta_t^\star)_{t\ge1}$. In abruptly changing environments, changepoint-detection methods are more efficient, but they may fail in slowly changing environments. We expect robust policies to perform well in both types of environment.
Our Approach
We focus only on robust policies. With that in mind, the non-stationarity of the parameter $\theta_t^\star$ is measured with the variation budget
$$\sum_{s=1}^{T-1} \|\theta_s^\star - \theta_{s+1}^\star\|_2 \le B_T$$
↪ A large variation budget can be due either to large but scarce changes of $\theta_t^\star$ or to frequent but small deviations.
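For instance, the variation budget of a hypothetical piecewise-constant parameter path can be computed directly from the definition:

```python
import numpy as np

# Hypothetical piecewise-constant path in dimension d = 2:
# one abrupt change at t = 500 over a horizon of T = 1000 rounds.
T, d = 1000, 2
thetas = np.zeros((T, d))
thetas[:500] = [1.0, 0.0]
thetas[500:] = [0.0, 1.0]

# Variation budget: sum of consecutive L2 deviations of the parameter.
B_T = np.sum(np.linalg.norm(np.diff(thetas, axis=0), axis=1))
print(B_T)  # sqrt(2) ≈ 1.414: one large, scarce change
```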
Related work
Some references
- Garivier et al. (2011), On upper-confidence bound policies for switching bandit problems, COLT. Introduces the sliding-window and exponential-discounting algorithms, analyzes them in the abrupt-changes setting, and provides an $\Omega(T^{1/2})$ lower bound.
- Besbes et al. (2014), Stochastic multi-armed-bandit problem with non-stationary rewards, NeurIPS. Considers the variation budget, proves an $\Omega(T^{2/3})$ lower bound, and analyzes an epoch-based variant of Exp3.
- Wu et al. (2018), Learning contextual bandits in a non-stationary environment, ACM SIGIR. Introduces an algorithm (called dLinUCB) based on change detection for the linear bandit.
- Cheung et al. (2019), Learning to optimize under non-stationarity, AISTATS. Adapts the sliding-window algorithm to the linear bandit.
Garivier et al. paper
Sliding-Window UCB algorithm
At time $t$, the SW-UCB policy selects the action
$$A_t = \arg\max_{i\in\{1,\dots,K\}} \frac{\sum_{s=t-\tau+1}^{t} X_s\,\mathbb{1}\{I_s=i\}}{\sum_{s=t-\tau+1}^{t} \mathbb{1}\{I_s=i\}} + \sqrt{\frac{\xi\log(\min(t,\tau))}{\sum_{s=t-\tau+1}^{t} \mathbb{1}\{I_s=i\}}}$$
Discounted UCB algorithm
At time $t$, the D-UCB policy selects the action
$$A_t = \arg\max_{i\in\{1,\dots,K\}} \frac{\sum_{s=1}^{t} \gamma^{t-s}X_s\,\mathbb{1}\{I_s=i\}}{\sum_{s=1}^{t} \gamma^{t-s}\mathbb{1}\{I_s=i\}} + 2\sqrt{\frac{\xi\log\big((1-\gamma^{t})/(1-\gamma)\big)}{\sum_{s=1}^{t} \gamma^{t-s}\mathbb{1}\{I_s=i\}}}, \qquad \gamma < 1$$
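As an illustration of the D-UCB index above, a minimal sketch (our own transcription; it assumes every arm has already been pulled at least once):

```python
import numpy as np

def ducb_select(rewards, arms, K, gamma, xi):
    """Sketch of the D-UCB arm selection at the current round.

    rewards: observed rewards X_1..X_t; arms: pulled arms I_1..I_t in {0..K-1};
    gamma: discount factor in (0, 1); xi: exploration parameter of the index.
    """
    rewards, arms = np.asarray(rewards, float), np.asarray(arms)
    t = len(rewards)
    w = gamma ** np.arange(t - 1, -1, -1)       # gamma^(t-s) for s = 1..t
    n_t = (1 - gamma ** t) / (1 - gamma)        # total discounted pull count
    index = np.empty(K)
    for i in range(K):
        mask = (arms == i)
        N_i = np.sum(w[mask])                   # discounted count of pulls of arm i
        mean_i = np.sum(w[mask] * rewards[mask]) / N_i   # discounted empirical mean
        index[i] = mean_i + 2 * np.sqrt(xi * np.log(n_t) / N_i)
    return int(np.argmax(index))
```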
Concentration Result
Assumptions
At each round $t \ge 1$, the learner:
- receives a finite set of arbitrary feasible actions $\mathcal{A}_t \subset \mathbb{R}^d$;
- selects an $\mathcal{F}_t$-measurable action $A_t \in \mathcal{A}_t$, where $\mathcal{F}_t = \sigma(X_1, A_1, \dots, X_{t-1}, A_{t-1})$.

Other assumptions:
- Sub-Gaussian random noise: $\eta_t$ is, conditionally on the past, $\sigma$-subgaussian.
- Bounded actions: $\forall t \ge 1, \forall a \in \mathcal{A}_t$, $\|a\|_2 \le L$.
- Bounded parameters: $\forall t \ge 1$, $\|\theta_t^\star\|_2 \le S$.
- Bounded rewards: $\forall t \ge 1, \forall a \in \mathcal{A}_t$, $|\langle a, \theta_t^\star\rangle| \le 1$.
Weighted Least Squares Estimator
Least squares estimator:
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^t (X_s - A_s^\top\theta)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$

Weighted least squares estimator:
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^t w_s(X_s - A_s^\top\theta)^2 + \frac{\lambda_t}{2}\|\theta\|_2^2$$
Scale-Invariance Property
The weighted least squares estimator is given in closed form by
$$\hat\theta_t = \Big(\sum_{s=1}^t w_s A_s A_s^\top + \lambda_t I_d\Big)^{-1}\sum_{s=1}^t w_s A_s X_s$$
↪ $\hat\theta_t$ is unchanged if all the weights $w_s$ and the regularization parameter $\lambda_t$ are multiplied by the same constant $\alpha$.
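The closed form and the scale-invariance property are easy to check numerically; a sketch on synthetic data (all names and values are illustrative):

```python
import numpy as np

def weighted_ls(actions, rewards, weights, lam):
    """Closed-form weighted least squares estimate, following the slide's formula."""
    d = actions.shape[1]
    V = actions.T @ (weights[:, None] * actions) + lam * np.eye(d)  # sum w_s A_s A_s^T + lam I
    b = actions.T @ (weights * rewards)                             # sum w_s A_s X_s
    return np.linalg.solve(V, b)

# Scale-invariance check: multiplying all the weights and the regularization
# parameter by the same alpha leaves the estimate unchanged.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
X = A @ np.array([0.5, -0.2, 0.1]) + 0.1 * rng.normal(size=50)
w = 0.9 ** np.arange(49, -1, -1)
alpha = 7.3
est1 = weighted_ls(A, X, w, lam=1.0)
est2 = weighted_ls(A, X, alpha * w, lam=alpha * 1.0)
print(np.allclose(est1, est2))  # True
```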
The Case of Exponential Weights
Exponential discount (time-dependent weights):
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^t \underbrace{\gamma^{t-s}}_{w_{t,s}}(X_s - A_s^\top\theta)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$
Time-independent weights:
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^t \frac{1}{\gamma^s}(X_s - A_s^\top\theta)^2 + \frac{\lambda}{2\gamma^t}\|\theta\|_2^2$$
↪ The two are equivalent, due to scale invariance: multiplying all the weights $1/\gamma^s$ and the regularization $\lambda/\gamma^t$ by $\gamma^t$ recovers the first form.
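Continuing the previous sketch (it reuses the hypothetical `weighted_ls` helper and the data `A`, `X` defined there), the equivalence can be checked numerically:

```python
import numpy as np  # assumes weighted_ls, A, X from the previous sketch are in scope

t, gamma, lam = 50, 0.9, 1.0
w_dep   = gamma ** (t - 1 - np.arange(t))   # w_{t,s} = gamma^(t-s), s = 1..t
w_indep = gamma ** (-(np.arange(t) + 1))    # w_s = 1 / gamma^s
est_dep   = weighted_ls(A, X, w_dep, lam)
est_indep = weighted_ls(A, X, w_indep, lam / gamma ** t)
print(np.allclose(est_dep, est_indep))      # True: the two forms coincide
```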
Concentration Result
Theorem 1
Assuming that $\theta_t^\star = \theta^\star$, for any $\mathcal{F}_t$-predictable sequence of actions $(A_t)_{t\ge1}$ and positive weights $(w_t)_{t\ge1}$, and for all $\delta > 0$, with probability higher than $1-\delta$:
$$\forall t, \quad \|\hat\theta_t - \theta^\star\|_{V_t \widetilde V_t^{-1} V_t} \le \frac{\lambda_t}{\sqrt{\mu_t}}\, S + \sigma\sqrt{2\log(1/\delta) + d\log\Big(1 + \frac{L^2\sum_{s=1}^t w_s^2}{d\mu_t}\Big)},$$
where $V_t = \sum_{s=1}^t w_s A_s A_s^\top + \lambda_t I_d$ and $\widetilde V_t = \sum_{s=1}^t w_s^2 A_s A_s^\top + \mu_t I_d$.
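For reference, a direct transcription of the radius of Theorem 1 (a sketch; the inputs are the quantities named in the theorem):

```python
import numpy as np

def confidence_radius(weights, lam_t, mu_t, S, sigma, L, d, delta):
    """Radius of the confidence region in Theorem 1, transcribed term by term."""
    log_det_term = d * np.log(1 + L**2 * np.sum(np.asarray(weights)**2) / (d * mu_t))
    return lam_t / np.sqrt(mu_t) * S + sigma * np.sqrt(2 * np.log(1 / delta) + log_det_term)
```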
On the Control of Deviations in the $V_t \widetilde V_t^{-1} V_t$ Norm
For the unweighted least squares estimator, the deviation bound of [Abbasi-Yadkori et al., 2011] features the $\|\hat\theta_t - \theta^\star\|_{V_t}$ norm. Here, the $V_t \widetilde V_t^{-1} V_t$ norm comes from the observation that:
- the variance terms are related to the $w_s^2$, which are featured in $\widetilde V_t$;
- the weighted least squares estimator (and the matrix $V_t$) is defined with the $w_s$.

Remark: when $w_t \equiv 1$, taking $\lambda_t = \mu_t$ yields $V_t \widetilde V_t^{-1} V_t = V_t$ and recovers the usual concentration inequality.
On the Role of µt
The sequence of parameters $(\mu_t)_{t\ge1}$ is instrumental (it results from the use of the method of mixtures) and could in theory be chosen completely independently of $\lambda_t$ and $w_t$. But taking $\mu_t$ proportional to $\lambda_t^2$ ensures that:
- the $V_t \widetilde V_t^{-1} V_t$ norm becomes scale-invariant;
- $\lambda_t/\sqrt{\mu_t}$ becomes scale-invariant;
- $\sum_{s=1}^t w_s^2/\mu_t$ becomes scale-invariant.

↪ A scale-invariant concentration inequality!
On the Use of Time-Dependent Regularization Parameters
Using a time-dependent regularization parameter $\lambda_t$ is required to avoid vanishing regularization, in the sense that
$$d\log\Big(1 + \frac{L^2\sum_{s=1}^t w_s^2}{d\mu_t}\Big)$$
should not dominate the radius of the confidence region as $t$ increases. In the setting with exponentially increasing weights ($w_s = \gamma^{-s}$):
- $\lambda_t \propto w_t$
- $\mu_t \propto \lambda_t^2$
Application to Non-Stationary Linear Bandits
D-LinUCB Algorithm (1)
Algorithm 1: D-LinUCB
Input: probability $\delta$, subgaussianity constant $\sigma$, dimension $d$, regularization $\lambda$, upper bound for actions $L$, upper bound for parameters $S$, discount factor $\gamma$.
Initialization: $b = 0_{\mathbb{R}^d}$, $V = \lambda I_d$, $\widetilde V = \lambda I_d$, $\hat\theta = 0_{\mathbb{R}^d}$
for $t \ge 1$ do
    Receive $\mathcal{A}_t$, compute
    $$\beta_{t-1} = \sqrt{\lambda}\,S + \sigma\sqrt{2\log\frac{1}{\delta} + d\log\Big(1 + \frac{L^2(1-\gamma^{2(t-1)})}{\lambda d(1-\gamma^2)}\Big)}$$
    for $a \in \mathcal{A}_t$ do compute $\mathrm{UCB}(a) = a^\top\hat\theta + \beta_{t-1}\sqrt{a^\top V^{-1}\widetilde V V^{-1} a}$
    $A_t = \arg\max_a \mathrm{UCB}(a)$
    Play action $A_t$ and receive reward $X_t$
    Updating phase:
    $V = \gamma V + A_t A_t^\top + (1-\gamma)\lambda I_d$, $\quad \widetilde V = \gamma^2 \widetilde V + A_t A_t^\top + (1-\gamma^2)\lambda I_d$
    $b = \gamma b + X_t A_t$, $\quad \hat\theta = V^{-1} b$
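Below is a compact Python sketch of Algorithm 1; the class interface is ours, but the updates follow the pseudocode line by line:

```python
import numpy as np

class DLinUCB:
    """Sketch of Algorithm 1 (D-LinUCB), following the slide's pseudocode."""

    def __init__(self, d, delta, sigma, lam, L, S, gamma):
        self.d, self.delta, self.sigma = d, delta, sigma
        self.lam, self.L, self.S, self.gamma = lam, L, S, gamma
        self.t = 1
        self.b = np.zeros(d)
        self.V = lam * np.eye(d)    # V  (weights gamma^(t-s))
        self.V2 = lam * np.eye(d)   # V~ (squared weights gamma^(2(t-s)))
        self.theta = np.zeros(d)

    def select(self, actions):
        """actions: array of shape (K, d). Returns the index of the chosen action."""
        g, lam, d, L = self.gamma, self.lam, self.d, self.L
        beta = np.sqrt(lam) * self.S + self.sigma * np.sqrt(
            2 * np.log(1 / self.delta)
            + d * np.log(1 + L**2 * (1 - g ** (2 * (self.t - 1))) / (lam * d * (1 - g**2)))
        )
        Vinv = np.linalg.inv(self.V)
        M = Vinv @ self.V2 @ Vinv                    # V^{-1} V~ V^{-1}
        widths = np.sqrt(np.einsum('kd,de,ke->k', actions, M, actions))
        ucb = actions @ self.theta + beta * widths   # UCB(a) for every a in A_t
        return int(np.argmax(ucb))

    def update(self, a, x):
        """Recursive updating phase after playing action a and observing reward x."""
        g, lam, d = self.gamma, self.lam, self.d
        self.V = g * self.V + np.outer(a, a) + (1 - g) * lam * np.eye(d)
        self.V2 = g**2 * self.V2 + np.outer(a, a) + (1 - g**2) * lam * np.eye(d)
        self.b = g * self.b + x * a
        self.theta = np.linalg.solve(self.V, self.b)
        self.t += 1
```

On each round one would call `select` with the current action set and then `update` with the played action and the observed reward.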
D-LinUCB Algorithm (2)
Thanks to the scale-invariance property, for numerical stability of the implementation we consider the time-dependent weights $w_{t,s} = \gamma^{t-s}$ for $1 \le s \le t$. The weighted least squares estimator is then the solution of
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^t \gamma^{t-s}(X_s - \langle A_s, \theta\rangle)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$
↪ This form is numerically stable and can be implemented recursively (but we revert to the standard form for the analysis).
D-LinUCB Algorithm (3)
As usual, we consider optimistic arm selection, in the sense that
$$A_t = \arg\max_{a\in\mathcal{A}_t}\; \max_{\theta\in\mathcal{C}_t} \langle a, \theta\rangle, \quad \text{where } \mathcal{C}_t = \big\{\theta : \|\theta - \hat\theta_{t-1}\|_{V_{t-1}\widetilde V_{t-1}^{-1}V_{t-1}} \le \beta_{t-1}\big\},$$
which is equivalent to
$$A_t = \arg\max_{a\in\mathcal{A}_t}\; \langle a, \hat\theta_{t-1}\rangle + \beta_{t-1}\,\|a\|_{V_{t-1}^{-1}\widetilde V_{t-1}V_{t-1}^{-1}}.$$
Theoretical Analysis
Theorem 3
Assuming that $\sum_{s=1}^{T-1}\|\theta_s^\star - \theta_{s+1}^\star\|_2 \le B_T$, the regret of the D-LinUCB algorithm may be bounded, for all $\gamma\in(0,1)$ and integer $D\ge1$, with probability at least $1-\delta$, by
$$R_T \le 2LDB_T + \frac{4L^3S}{\lambda}\,\frac{\gamma^D}{1-\gamma}\,T + 2\sqrt{2}\,\beta_T\sqrt{dT}\sqrt{T\log(1/\gamma) + \log\Big(1 + \frac{L^2}{d\lambda(1-\gamma)}\Big)}$$
Optimal Asymptotic Regret
Theorem 4
By choosing $\gamma = 1 - (B_T/(dT))^{2/3}$*, the regret of the D-LinUCB algorithm is asymptotically upper bounded with high probability by $O(d^{2/3}B_T^{1/3}T^{2/3})$ when $T \to \infty$.
*And $D = \log(T)/(1-\gamma)$.
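Under the (usually unrealistic) assumption that $B_T$ is known in advance, the tuning of Theorem 4 is immediate:

```python
import numpy as np

def tune_dlinucb(B_T, d, T):
    """Parameter choice of Theorem 4: gamma = 1 - (B_T/(dT))^(2/3), D = log(T)/(1-gamma)."""
    gamma = 1 - (B_T / (d * T)) ** (2 / 3)
    D = np.log(T) / (1 - gamma)
    return gamma, D

# Example: horizon T = 10^5, dimension d = 2, variation budget B_T = 1.
print(tune_dlinucb(1.0, 2, 10**5))  # gamma ≈ 0.99971, D ≈ 3.9e4
```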
Empirical Performances
Performance in Abruptly-Changing Environment
Figure: Performance of the algorithms in the abruptly-changing environment. The left plot corresponds to the estimated parameter and the right one to the accumulated regret, averaged over N = 100 independent experiments.
Performance in Slowly-Changing Environment
Figure: Performance of the algorithms in the slowly-varying environment. The left plot corresponds to the estimated parameter and the right one to the accumulated regret, averaged over N = 100 independent experiments.