Restless bandits with controlled restarts: Indexability and computation of Whittle index
Nima Akbarzadeh, Aditya Mahajan
McGill University, Electrical and Computer Engineering Department
Dec. 13, 2019

Whack a Mole

Applications
Applications: queueing, channel scheduling, machine maintenance, and clinical care.
1. A repairman is responsible for maintaining several machines. Each machine deteriorates stochastically. There is a state-dependent cost for running and for repairing a machine. Only one machine can be repaired at a time.
2. Multiple data queues are scheduled over a shared communication channel. There is a cost for holding packets and for transmitting them. A fixed number of data queues can be selected at a time.
The machine/queue restarts upon being repaired/selected.
Goal: Find an optimal or near-optimal scheduling policy.

Model
$n$ available arms (controlled Markov processes), $\mathcal{N} = \{1, \dots, n\}$.
$m$ arms have to be selected ($m < n$).
State space of each arm: $\mathcal{X}^i$, $i \in \mathcal{N}$.
Action space for each arm: $\{0, 1\}$.
Passive action: $a^i_t = 0$ → Markov chain with transition matrix $P^i_{xy}$.
Active action: $a^i_t = 1$ → reset according to the PMF $Q^i_y$.
Cost: $c^i(x^i_t, a^i_t)$.

Objective
Problem: Given the discount factor $\beta$, the total number $n$ of arms, the number $m$ of active arms, the state spaces $\{\mathcal{X}^i\}_{i \in \mathcal{N}}$, the transition matrices $\{P^i\}_{i \in \mathcal{N}}$, the reset PMFs $\{Q^i\}_{i \in \mathcal{N}}$, and the cost functions $\{c^i(\cdot, \cdot)\}_{i \in \mathcal{N}}$, choose a time-homogeneous Markov policy $\mathbf{g}$, $\mathbf{A}_t = \mathbf{g}(\mathbf{X}_t)$, such that $\sum_{i \in \mathcal{N}} A^i_t = m$, that minimizes
$$J(\mathbf{g}) := (1 - \beta)\, \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t \sum_{i \in \mathcal{N}} c^i(X^i_t, A^i_t) \Big].$$

Challenge & Solution
Challenge: The dynamic program suffers from the curse of dimensionality. The size of the state space is $|\mathcal{X}|^n$.
Example: 100 machines with 3 states each result in a system with $3^{100} \approx 5.15 \times 10^{47}$ states.
Solution: Index-based heuristic policy (Whittle index [1988]).
Drawback: Suboptimal.
Advantage: Problem decomposition ⇒ 100 problems with 3 states each.
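The arithmetic in the example can be checked directly. The short sketch below only reproduces the two counts from the slide; the variable names are illustrative and not part of the talk.

```python
# Size of the joint state space versus the decomposed problem,
# using the numbers from the example (100 machines, 3 states each).
num_arms = 100        # n
states_per_arm = 3    # |X|

joint_states = states_per_arm ** num_arms       # |X|^n, exact integer arithmetic
decomposed_states = num_arms * states_per_arm   # n single-arm problems of size |X|

print(f"joint state space:  {joint_states:.3e} states")
print(f"after decomposition: {decomposed_states} states in total")
```

Python integers are arbitrary precision, so the joint count is exact before it is formatted for printing.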

Whittle Index policy
The Whittle index heuristic assigns a dynamic index to each arm and, at each time, activates the $m$ arms with the largest indices.
The Whittle index exists if the indexability condition is satisfied for all arms.
The Whittle index policy has been reported to perform close to optimal in many applications in the literature.
There is no general framework to check indexability and, correspondingly, to obtain the Whittle indices.
Objectives:
Prove that our problem is indexable.
Provide a closed-form solution for the Whittle index.

Problem Decomposition
Define $c^i_\lambda(x^i_t, a^i_t) := c^i(x^i_t, a^i_t) + \lambda a^i_t$, $a^i_t \in \{0, 1\}$, for arm $i$.
Problem: Given an arm $i \in \mathcal{N}$, the discount factor $\beta$, the state space $\mathcal{X}^i$, the transition probability matrix $P^i$, the reset probability mass function $Q^i$, the cost function $c^i(\cdot, \cdot)$ and the penalty $\lambda \in \mathbb{R}$, choose a policy $g^i : \mathcal{X}^i \to \{0, 1\}$ to minimize
$$J^i(g^i) := (1 - \beta)\, \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t c^i_\lambda(X^i_t, A^i_t) \Big].$$

Dynamic Programming
Theorem: Let $V^i_\lambda : \mathcal{X}^i \to \mathbb{R}$ be the unique fixed point of the following:
$$V^i_\lambda(x) = \min\big\{ H^i_\lambda(x, 0),\, H^i_\lambda(x, 1) \big\}, \quad \forall x \in \mathcal{X}^i,$$
where
$$H^i_\lambda(x, 0) = (1 - \beta)\, c^i(x, 0) + \beta \sum_{y \in \mathcal{X}^i} P^i_{xy} V^i_\lambda(y),$$
$$H^i_\lambda(x, 1) = (1 - \beta) \big( c^i(x, 1) + \lambda \big) + \beta \sum_{y \in \mathcal{X}^i} Q^i_y V^i_\lambda(y).$$
Let $g^i_\lambda(x)$ denote the minimizer of the right-hand side. Then $g^i_\lambda$ is optimal for arm $i$.
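The fixed point in the theorem can be approximated by value iteration. The following is a minimal sketch, not the authors' implementation: the function name `solve_arm`, the 0-based state labels, the tie-breaking rule and the 3-state toy arm are all assumptions made for illustration.

```python
import numpy as np

def solve_arm(P, Q, c, lam, beta, tol=1e-10, max_iter=100_000):
    """Value iteration for the single-arm problem with activation penalty lam.

    P: (S, S) passive transition matrix; Q: (S,) reset PMF;
    c: (S, 2) array with c[x, a] the cost of action a in state x.
    Returns (V, g): the approximate fixed point and a greedy policy in {0, 1}.
    """
    P, Q, c = (np.asarray(a, dtype=float) for a in (P, Q, c))

    def q_values(V):
        # H(x, 0) and H(x, 1) exactly as in the theorem.
        H0 = (1 - beta) * c[:, 0] + beta * (P @ V)
        H1 = (1 - beta) * (c[:, 1] + lam) + beta * (Q @ V)
        return H0, H1

    V = np.zeros(len(Q))
    for _ in range(max_iter):
        V_new = np.minimum(*q_values(V))
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    H0, H1 = q_values(V)
    g = (H1 < H0).astype(int)  # assumption: ties are broken toward the passive action
    return V, g

# Hypothetical 3-state arm (states 0, 1, 2), chosen only for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
Q = np.array([1.0, 0.0, 0.0])                         # deterministic restart to state 0
c = np.array([[0.0, 2.0], [1.0, 2.0], [4.0, 2.0]])    # columns: passive cost, active cost

V, g = solve_arm(P, Q, c, lam=1.0, beta=0.9)
print("V =", V)
print("g =", g)
```

Because the operator is a contraction with modulus $\beta$, the iteration converges from any starting point; the stopping tolerance is a design choice and would need tightening for $\beta$ close to 1.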

Indexability
Let the passive set for arm $i$ be
$$\Pi^i_\lambda := \big\{ x \in \mathcal{X}^i : g^i_\lambda(x) = 0 \big\}.$$
Definition (Indexability): Arm $i$ is indexable if, for any $\lambda_1, \lambda_2 \in \mathbb{R}$,
$$\lambda_1 < \lambda_2 \implies \Pi^i_{\lambda_1} \subseteq \Pi^i_{\lambda_2}.$$
Definition (Whittle index): The Whittle index of state $x$ of arm $i$ is defined as
$$w^i(x) = \inf\big\{ \lambda \in \mathbb{R} : x \in \Pi^i_\lambda \big\}.$$
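Both definitions can be probed numerically. The sketch below is a brute-force illustration under stated assumptions, not the closed-form method of the talk: it checks that passive sets are nested on a finite grid of penalties (which can refute indexability but not prove it) and locates $w(x)$ by bisection on $\lambda$. The function names, the bracket $[-100, 100]$, the fixed iteration count and the toy arm are all hypothetical.

```python
import numpy as np

def passive_set(P, Q, c, lam, beta, iters=500):
    """Boolean mask of states where the passive action is optimal for penalty lam.

    Uses a fixed number of value-iteration sweeps (enough for beta = 0.9;
    an assumption that would need revisiting for beta close to 1).
    """
    V = np.zeros(len(Q))
    for _ in range(iters):
        H0 = (1 - beta) * c[:, 0] + beta * (P @ V)
        H1 = (1 - beta) * (c[:, 1] + lam) + beta * (Q @ V)
        V = np.minimum(H0, H1)
    H0 = (1 - beta) * c[:, 0] + beta * (P @ V)
    H1 = (1 - beta) * (c[:, 1] + lam) + beta * (Q @ V)
    return H0 <= H1

def looks_indexable(P, Q, c, beta, lams):
    """True if the passive sets are nested along the sorted grid of penalties."""
    sets = [passive_set(P, Q, c, lam, beta) for lam in sorted(lams)]
    return all(bool(np.all(~a | b)) for a, b in zip(sets, sets[1:]))

def whittle_index(P, Q, c, beta, x, lo=-100.0, hi=100.0, tol=1e-6):
    """Bisection for inf{lam : x is passive}; assumes indexability and lo < w(x) < hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if passive_set(P, Q, c, mid, beta)[x]:
            hi = mid   # x is already passive, so the index is at most mid
        else:
            lo = mid   # x is still active, so the index is above mid
    return 0.5 * (lo + hi)

# Hypothetical 3-state arm (states 0, 1, 2), chosen only for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
Q = np.array([1.0, 0.0, 0.0])
c = np.array([[0.0, 2.0], [1.0, 2.0], [4.0, 2.0]])
beta = 0.9

nested = looks_indexable(P, Q, c, beta, np.linspace(-5.0, 15.0, 81))
w = [whittle_index(P, Q, c, beta, x) for x in range(3)]
print("nested on grid:", nested)
print("Whittle indices:", w)
```

Each bisection step solves a full dynamic program, which is the cost the closed-form expression later in the talk is meant to avoid.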

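Indexability Proof Sketch
Theorem: Each arm is indexable.
Lemma:
$$\Pi_\lambda = \Big\{ x \in \mathcal{X} : (1 - \beta) \inf_{\tau} \frac{L(x, \tau) - c(x, 1)}{1 - \beta^{\tau}} < W_\lambda \Big\}.$$
Lemma: $W_\lambda = \lambda + \beta \sum_{y \in \mathcal{X}} Q_y V_\lambda(y)$ is increasing in $\lambda$.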

Whittle index
By definition,
$$w^i(x) = \inf\Big\{ \lambda \in \mathbb{R} : (1 - \beta) \inf_{\tau} \frac{L(x, \tau) - c(x, 1)}{1 - \beta^{\tau}} < \lambda + \beta \sum_{y \in \mathcal{X}^i} Q^i_y V^i_\lambda(y) \Big\}.$$
Challenge: Computing the Whittle index directly from this characterization is inefficient.
Solution: To obtain a closed-form expression, we consider threshold-based policies.

Threshold Policies
The optimal policy for each subproblem is a threshold-based policy, i.e.,
$$g^{(k)}(x) := \begin{cases} 0, & \text{if } x < k, \\ 1, & \text{otherwise.} \end{cases}$$
$$C^{(k)}_\lambda := (1 - \beta)\, \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t c_\lambda(X_t, g^{(k)}(X_t)) \,\Big|\, X_0 \sim Q \Big] = D^{(k)} + \lambda N^{(k)},$$
where
$$D^{(k)} := (1 - \beta)\, \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t c(X_t, g^{(k)}(X_t)) \,\Big|\, X_0 \sim Q \Big],$$
$$N^{(k)} := (1 - \beta)\, \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t g^{(k)}(X_t) \,\Big|\, X_0 \sim Q \Big].$$

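Computation of $D^{(k)}$ and $N^{(k)}$
Let
$$L^{(k)} := \mathbb{E}\Big[ \sum_{t=0}^{\tau_k - 1} \beta^t c(X_t, 0) + \beta^{\tau_k} c(X_{\tau_k}, 1) \,\Big|\, X_0 \sim Q \Big],$$
$$M^{(k)} := \mathbb{E}\Big[ \sum_{t=0}^{\tau_k} \beta^t \,\Big|\, X_0 \sim Q \Big].$$
Theorem: For all thresholds $k$,
$$D^{(k)} = \frac{L^{(k)}}{M^{(k)}} \quad \text{and} \quad N^{(k)} = \frac{1}{\beta M^{(k)}} - \frac{1 - \beta}{\beta}.$$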
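One way to evaluate $L^{(k)}$ and $M^{(k)}$ for a finite arm is to solve two small linear systems over the passive states. The sketch below assumes that $\tau_k$ is the first time the state is at or above the threshold $k$ (the slide does not define it), uses 0-based state labels, and introduces the helper vectors `l` and `m` (per-state versions of $L^{(k)}$ and $M^{(k)}$) purely for illustration.

```python
import numpy as np

def threshold_costs(P, Q, c, beta, k):
    """Return (D, N) for the threshold policy that is active iff x >= k.

    l[x]: expected discounted cost from x up to and including the first activation.
    m[x]: expected value of sum_{t=0}^{tau} beta^t when starting from x.
    """
    S = len(Q)
    passive = np.flatnonzero(np.arange(S) < k)
    active = np.flatnonzero(np.arange(S) >= k)

    l = c[:, 1].astype(float).copy()   # active states: activation happens immediately
    m = np.ones(S)                     # active states: the sum has the single term beta^0
    if passive.size:
        # Passive states satisfy l = c(.,0) + beta * P l and m = 1 + beta * P m.
        A = np.eye(passive.size) - beta * P[np.ix_(passive, passive)]
        to_active = P[np.ix_(passive, active)]
        l[passive] = np.linalg.solve(A, c[passive, 0] + beta * (to_active @ l[active]))
        m[passive] = np.linalg.solve(A, 1.0 + beta * (to_active @ m[active]))

    L, M = Q @ l, Q @ m
    D = L / M
    N = 1.0 / (beta * M) - (1.0 - beta) / beta
    return D, N

# Hypothetical 3-state arm (states 0, 1, 2), chosen only for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
Q = np.array([1.0, 0.0, 0.0])
c = np.array([[0.0, 2.0], [1.0, 2.0], [4.0, 2.0]])
beta = 0.9

for k in range(4):
    D_k, N_k = threshold_costs(P, Q, c, beta, k)
    print(f"k={k}: D={D_k:.6f}, N={N_k:.6f}")
```

The threshold $k = 0$ activates in every state, so $N^{(0)} = 1$; the threshold equal to the number of states never activates, so $N = 0$. Both are useful sanity checks on the formula.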

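Property
Lemma: $k_\lambda := \arg\min_{k \in \mathcal{X}} C^{(k)}_\lambda$ is increasing in $\lambda$.
Figure: $k_\lambda$ as a function of $\lambda$ (values $k$ and $k + 1$ on the vertical axis; $w(k - 1)$ and $w(k)$ marked on the $\lambda$ axis).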

Whittle Index
Theorem: The Whittle index for threshold policies at state $k \in \mathcal{X}$ is
$$w(k) = \frac{D^{(k+1)} - D^{(k)}}{N^{(k)} - N^{(k+1)}}.$$
Proof. Key ideas:
$C^{(k)}_\lambda$ is continuous in $\lambda$.
$C^{(k)}_{w(k)} = C^{(k+1)}_{w(k)}$, i.e., $D^{(k)} + w(k) N^{(k)} = D^{(k+1)} + w(k) N^{(k+1)}$.
Figure: $C^{(k)}_\lambda$ and $C^{(k+1)}_\lambda$ as functions of $\lambda$, with $D^{(k)}$, $D^{(k+1)}$ and $w(k)$ marked.
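Given the sequences $D^{(k)}$ and $N^{(k)}$, the theorem reduces to one division per state. The sketch below uses exact rational arithmetic so the indifference condition from the proof can be checked without rounding. The function name and the numerical values of `D` and `N` are illustrative assumptions for a hypothetical 3-state arm with 0-based thresholds; they are not results from the talk.

```python
from fractions import Fraction

def whittle_from_thresholds(D, N):
    """w[k] = (D[k+1] - D[k]) / (N[k] - N[k+1]) for each pair of consecutive thresholds."""
    w = []
    for k in range(len(D) - 1):
        denom = N[k] - N[k + 1]
        if denom == 0:
            # The formula is undefined when both thresholds activate equally often.
            raise ValueError(f"N[{k}] equals N[{k + 1}]; index undefined at k={k}")
        w.append((D[k + 1] - D[k]) / denom)
    return w

# Illustrative values of D and N for thresholds k = 0, 1, 2, 3 (assumed, toy arm).
D = [Fraction(2), Fraction(18, 29), Fraction(342, 481), Fraction(342, 121)]
N = [Fraction(1), Fraction(9, 29), Fraction(81, 481), Fraction(0)]

w = whittle_from_thresholds(D, N)
for k, w_k in enumerate(w):
    # At lambda = w(k), thresholds k and k+1 have the same total cost.
    same_cost = D[k] + w_k * N[k] == D[k + 1] + w_k * N[k + 1]
    print(f"w({k}) = {w_k} ({float(w_k):.4f}), indifferent: {same_cost}")
```

The guard on the denominator reflects that the expression only makes sense when moving the threshold from $k$ to $k + 1$ actually changes how often the arm is activated.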

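Whittle Index policy
Compute the Whittle indices offline.
At each time instant, observe the state of each arm and activate the $m$ arms whose current states have the largest Whittle indices (with $\lambda$ a penalty on the active action, a larger index indicates a stronger case for activation).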
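The online step is a table lookup followed by a top-$m$ selection. Under the convention of these slides, where $\lambda$ penalizes the active action and $w(x) = \inf\{\lambda : x \in \Pi_\lambda\}$, a larger index means the arm is still worth activating at a higher penalty, so this sketch activates the $m$ arms whose current indices are largest. The function name, the index table and the tie-breaking rule (lower arm number first) are assumptions.

```python
import numpy as np

def whittle_policy(index_tables, states, m):
    """Choose which arms to activate.

    index_tables[i][x]: precomputed Whittle index of state x of arm i.
    states[i]: current state of arm i.
    Returns a 0/1 action vector with exactly m ones.
    """
    current = np.array([index_tables[i][x] for i, x in enumerate(states)], dtype=float)
    # Stable sort on the negated indices: largest index first, ties by arm number.
    chosen = np.argsort(-current, kind="stable")[:m]
    actions = np.zeros(len(states), dtype=int)
    actions[chosen] = 1
    return actions

# Four identical arms with a made-up index table (illustrative values only).
table = np.array([-2.0, 0.6, 12.5])
index_tables = [table, table, table, table]

actions = whittle_policy(index_tables, states=[0, 2, 1, 2], m=2)
print(actions)
```

Since the tables are computed offline, the per-step cost is dominated by the sort over $n$ arms.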

Experiment Setup
Deterministic restart: $Q = [1, 0, \dots, 0]$.
$c(x, 0) = (x - 1)^2$ and $c(x, 1) = 0.5\,(|\mathcal{X}| - 1)^2$, $\beta = 0.9$.
We consider structured and randomly generated stochastic monotone matrices for $P$.
Monte Carlo simulations: 5000 iterations with 250 time steps in each one.
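The cost functions and the restart distribution of this setup are easy to write down, and the monotonicity requirement on $P$ can be checked mechanically. In the sketch below, stochastic monotonicity is taken in the standard sense (the tail sums $\sum_{y \ge z} P_{xy}$ are nondecreasing in $x$ for every $z$), which is assumed to be what the slides mean. The function names and the particular structured matrix are hypothetical; the slides do not say how their matrices were generated.

```python
import numpy as np

def setup_costs(num_states):
    """Costs and restart PMF from the setup, with states labeled x = 1, ..., |X|."""
    x = np.arange(1, num_states + 1)
    c = np.empty((num_states, 2))
    c[:, 0] = (x - 1) ** 2                     # passive cost c(x, 0)
    c[:, 1] = 0.5 * (num_states - 1) ** 2      # active cost c(x, 1), same for all x
    Q = np.zeros(num_states)
    Q[0] = 1.0                                 # deterministic restart to the first state
    return c, Q

def is_stochastic_monotone(P, tol=1e-12):
    """Check that each row is a PMF and that row x+1 stochastically dominates row x."""
    P = np.asarray(P, dtype=float)
    rows_ok = np.all(P >= -tol) and np.allclose(P.sum(axis=1), 1.0)
    tails = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]   # tails[x, z] = sum over y >= z of P[x, y]
    return bool(rows_ok and np.all(np.diff(tails, axis=0) >= -tol))

def structured_matrix(num_states, p=0.3):
    """Hypothetical example: stay with prob. 1 - p, move up one state with prob. p."""
    P = (1 - p) * np.eye(num_states) + p * np.eye(num_states, k=1)
    P[-1, -1] = 1.0                            # the last state is absorbing when passive
    return P

c, Q = setup_costs(5)
P = structured_matrix(5)
print(c)
print(is_stochastic_monotone(P))
```

A check like this is useful before running simulations, because a randomly generated matrix that is not stochastic monotone falls outside the setting described in the setup.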

Experiments (1) & (2)
Comparison with the optimal policy for small-scale models:
$$\alpha_{\mathrm{opt}} = \frac{J(\mathrm{opt})}{J(\mathrm{wip})} \times 100.$$
For $|\mathcal{X}| = 5$, $n = 5$, $m \in \{1, 2\}$ → $\alpha_{\mathrm{opt}} \in [95.5\%, 100\%]$.
Figure: 100 randomly generated stochastic monotone matrices with $m = 1$ (horizontal axis: $\alpha_{\mathrm{opt}}$).

Experiments (3) & (4)
Comparison with the myopic policy for large-scale models:
$$\varepsilon_{\mathrm{myp}} = \frac{J(\mathrm{myp}) - J(\mathrm{wip})}{J(\mathrm{myp})} \times 100.$$
For $|\mathcal{X}| = 25$, $n \in \{25, 50, 75\}$, $m \in \{1, 2, 5\}$ → $\varepsilon_{\mathrm{myp}} \in [0\%, 12\%]$.
Figure: 100 randomly generated stochastic monotone matrices with $n = 75$, $m = 2$ (horizontal axis: $\varepsilon_{\mathrm{myp}}$).

Conclusion
A model for restless bandits with controlled restarts.
The model is indexable.
A closed-form expression to compute the Whittle indices when the optimal policy is threshold-based.
Numerical experiments show that the Whittle index policy performs very close to the optimal policy and better than a myopic policy.

Q&A
Thank you!

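Q&A
Figure: $J^{(k)}_\lambda$, $J^{(k+1)}_\lambda$, $J^{(k+2)}_\lambda$ and $J^{(k+3)}_\lambda$ as functions of $\lambda$, with $D^{(k)}, \dots, D^{(k+3)}$ marked and points $\lambda^{\circ}$ labeled by the threshold pairs $(k, k+1)$, $(k+1, k+2)$, $(k+2, k+3)$, $(k, k+2)$ and $(k+1, k+3)$.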
