Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of - PowerPoint PPT Presentation

Resource Allocation for Sequential Decision Making under Uncertainty: Studies in Vehicular Traffic Control, Service Systems, Sensor Networks and Mechanism Design Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of Computer Science and Automation Indian Institute of Science Bangalore March, 2013 1 / 68

Outline 1 Introduction 2 Part I - Vehicular Traffic Control Traffic control MDP Qlearning based TLC algorithms Threshold tuning using SPSA Feature adaptation 3 Part II - Service Systems Background Labor cost optimization problem Simulation Optimization Methods 4 Part III - Sensor Networks Sleep–wake control POMDP Sleep–wake scheduling algorithms – discounted setting Sleep–wake scheduling algorithms – average setting 5 Part IV - Mechanism Design Static Mechanism with Capacity Constraints Dynamic Mechanism with Capacity Constraints 2 / 68

Introduction The problem Question:“how to allocate resources amongst competing entities so as to maximize the rewards accumulated in the long run?” Resources: may be abstract (e.g. time) or concrete (e.g. manpower) The sequential decision making setting: involves one or more agents interacting with an environment to procure rewards at every time instant, and the goal is to find an optimal policy for choosing actions Uncertainties in the system the stochastic noise and partial observability in a single-agent setting or private information of the agents in a multi-agent setting Real-world problems: high-dimensional state and action spaces and hence, the choice of knowledge representation is crucial 3 / 68

Introduction The studies conducted Vehicular Traffic Control Here we optimize the ‘green time’ resource of the lanes in a road network so that traffic flow is maximized in the long term Service Systems Here we optimize the ‘workforce’, while complying to queue stability as well as aggregate service level agreement (SLA) constraints Wireless Sensor Networks Here we allocate the ‘sleep time’ (resource) of the individual sensors in an object tracking application such that the energy consumption from the sensors is reduced, while keeping the tracking error to a minimum Mechanism Design In a setting of multiple self-interested agents with limited capacities, we attempt to find an incentive compatible transfer scheme following a socially efficient allocation 4 / 68

Part I - Vehicular Traffic Control Traffic control MDP Outline 1 Introduction 2 Part I - Vehicular Traffic Control Traffic control MDP Qlearning based TLC algorithms Threshold tuning using SPSA Feature adaptation 3 Part II - Service Systems Background Labor cost optimization problem Simulation Optimization Methods 4 Part III - Sensor Networks Sleep–wake control POMDP Sleep–wake scheduling algorithms – discounted setting Sleep–wake scheduling algorithms – average setting 5 Part IV - Mechanism Design Static Mechanism with Capacity Constraints Dynamic Mechanism with Capacity Constraints 5 / 68

Part I - Vehicular Traffic Control Traffic control MDP The problem 6 / 68

Part I - Vehicular Traffic Control Traffic control MDP Traffic Signal Control 1 The problem we are looking at Maximizing traffic flow: adaptive control of traffic lights at intersections Control decisions based on: coarse estimates of the queue lengths at intersecting roads time elapsed since last light switch over to red how do we solve it? Apply reinforcement learning (RL) Works with real data i.e., system model not assumed Simple, efficient and convergent! Use Green Light District (GLD) simulator for performance comparisons 1 work as a project associate with DIT-ASTec 7 / 68

Part I - Vehicular Traffic Control Traffic control MDP Reinforcement Learning (RL) Combines Dynamic programming - optimization and control Supervised learning - training a parametrized function approximator Operation: Environment: evolves probabilistically over states Policy: determines which action to be taken in each state Reinforcement: the reward received after performing an action in a given state Goal: maximize the expected cumulative reward Using trial-and-error process the RL agent learns the policy that achieves the goal 8 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Outline 1 Introduction 2 Part I - Vehicular Traffic Control Traffic control MDP Qlearning based TLC algorithms Threshold tuning using SPSA Feature adaptation 3 Part II - Service Systems Background Labor cost optimization problem Simulation Optimization Methods 4 Part III - Sensor Networks Sleep–wake control POMDP Sleep–wake scheduling algorithms – discounted setting Sleep–wake scheduling algorithms – average setting 5 Part IV - Mechanism Design Static Mechanism with Capacity Constraints Dynamic Mechanism with Capacity Constraints 9 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Traffic Signal Control Problem The MDP specifics State: vector of queue lengths and elapsed times s n = ( q 1 , ··· , q N , t 1 , ··· , t N ) Actions: a n = {feasible sign configurations in state s n } Cost: r 1 ∗ ( � i ∈ I p r 2 ∗ q i ( n )+ � k ( s n , a n ) = ∈ I p s 2 ∗ q i ( n )) i / s 1 ∗ ( � i ∈ I p r 2 ∗ t i ( n )+ � (1) + ∈ I p s 2 ∗ t i ( n )) , i / where r i , s i ≥ 0 and r i + s i = 1 , i = 1 , 2. more weightage to main road traffic 10 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Qlearning based TLC algorithm Q-learning An off-policy temporal difference based control algorithm � � Q ( s n + 1 , a n + 1 ) = Q ( s n , a n )+ α ( n ) k ( s n , a n )+ γ min a Q ( s n + 1 , a ) − Q ( s n , a n ) (2) . Why function approximation? need look-up table to store Q-value for every ( s , a ) in (2) Computationally expensive (Why?) two-junction corridor: 10 signalled lanes, 20 vehicles on each lane | S × A ( S ) | ∼ 10 14 Situation aggravated when we consider larger road networks 11 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Q-learning with Function Approximation [1] Approximate Q ( s , a ) ≈ θ T σ s , a , where σ s , a : d -dimensional feature vector, with d << | S × A ( S ) | θ is a tunable d -dimensional parameter Feature-based analog of Q-learning: v ∈ A ( s n + 1 ) θ T n σ s n + 1 , v − θ T θ n + 1 = θ n + α ( n ) σ s n , a n ( k ( s n , a n )+ γ min n σ s n , a n ) σ s n , a n : is graded and assigns a value for each lane based on its congestion level (low, medium or high) 12 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Q-learning with Function Approximation [2] Feature Selection State ( s n ) Action ( a n ) Feature ( σ s n , a n ) RED 0 q i ( n ) < L 1 and t i ( n ) < T 1 GREEN 1 RED 0.2 q i ( n ) < L 1 and t i ( n ) ≥ T 1 GREEN 0.8 RED 0.4 L 1 ≤ q i ( n ) < L 2 and t i ( n ) < T 1 GREEN 0.6 RED 0.6 L 1 ≤ q i ( n ) < L 2 and t i ( n ) ≥ T 1 GREEN 0.4 RED 0.8 q i ( n ) ≥ L 2 and t i ( n ) < T 1 GREEN 0.2 RED 1 q i ( n ) ≥ L 2 and t i ( n ) ≥ T 1 GREEN 0 13 / 68

Part I - Vehicular Traffic Control Qlearning based TLC algorithms Results on a 3x3-Grid Network 70 16000 QTLC-FA QTLC-FA Fixed10 Fixed10 Fixed20 Fixed20 Fixed30 14000 Fixed30 60 SOTL SOTL 12000 50 Number of Road Users 10000 40 Delay 8000 30 6000 20 4000 10 2000 0 0 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 Cycles Cycles (a) Average junction waiting time (b) Total Arrived Road Users Full state RL algorithms (cf. [B. Abdulhai et al. 2003] a ) are not feasible as | S × A ( S ) | ∼ 10 101 , whereas dim( σ s n , a n ) ∼ 200 Self Organizing TLC (SOTL) b switches a lane to green if elapsed time crosses a threshold, provided the # of vehicles crosses another threshold a B. Abdulhai et al, “Reinforcement learning for true adaptive traffic signal control,” Journal of Transportation Engineering , 2003. b S. Cools et al, “Self-organizing traffic lights: A realistic simulation,” Advances in Applied Self-organizing Systems ,2008 14 / 68

Part I - Vehicular Traffic Control Threshold tuning using SPSA Outline 1 Introduction 2 Part I - Vehicular Traffic Control Traffic control MDP Qlearning based TLC algorithms Threshold tuning using SPSA Feature adaptation 3 Part II - Service Systems Background Labor cost optimization problem Simulation Optimization Methods 4 Part III - Sensor Networks Sleep–wake control POMDP Sleep–wake scheduling algorithms – discounted setting Sleep–wake scheduling algorithms – average setting 5 Part IV - Mechanism Design Static Mechanism with Capacity Constraints Dynamic Mechanism with Capacity Constraints 15 / 68

Part I - Vehicular Traffic Control Threshold tuning using SPSA Threshold tuning using stochastic optimization Thresholds are L 1 and L 2 on the waiting queue lengths TLC algorithm uses broad congestion estimates instead of exact queue lengths congestion is low, medium or high if the queue length falls below L 1 or between L 1 and L 2 or above L 2 How to tune Li ’s? Use stochastic optimization Combine the tuning algorithm with A full state Q-learning algorithm with state aggregation A function approximation Q-learning TLC with a novel feature selection scheme A priority based scheduling scheme 16 / 68

Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of - PowerPoint PPT Presentation

Resource Allocation for Sequential Decision Making under Uncertainty: Studies in Vehicular Traffic Control, Service Systems, Sensor Networks and Mechanism Design Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of Computer Science

Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games H.L. Prasad

RUBY ON RAILS CSCI 5448 Fall 2012 Presentation Prashanth Mannar 11/16/2012 Ruby on Rails -

Policy Gradients for CVaR-Constrained MDPs Prashanth L.A. INRIA Lille Team SequeL Prashanth

EU Strategy in Partitioning & Transm utation and its I m plem entation w ithin the EURATOM

002 - Motivating Examples EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of Epidemiology,

005 - Data Graphics EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of Epidemiology,

004 - Exploring Data - Part II EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of

Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks

Insights into non-projectivity in Hindi Prashanth Mannem, Himani Chaudhry and Akshar Bharati

New Temporal-Difference Methods Based on Gradient Descent Rich Sutton Hamid Maei Doina Precup

Environment Remote Sensing Instructor: Prof. Prashanth Reddy Marpu SAMPLE METEOSAT SEVIRI

Vignette Session C: Patient Safety Moderator: Deepa Bhatnagar, MD Unknown Vignette Discussant:

CSIR NEWS Total visitors: 6,229 since 15-8-03 Volume 53 no 14, 30 July 2003 ISSN 04097467

Finding that dress at scale @Rent The Runway Saurabh Bhatnagar Bio 17 years in ML/data Prev:

Rohit Chawla Anshika Bhatnagar Priyanka Nigam Diliraj Dhabhole Krishna Rao Raja Bhantia Pema

ALMA Future Sc. Program Development Workshop Aug. 2016, Charlottesville Wide-fjeld Wide-band

How was it for Wessex Identifying Issues Chris Brown WAHSN PINCER Support The PINCER

URBAN FOREST MANAGEMENT PLAN SCOPE OF WORK DISCUSSION CITY COUNCIL STUDY SESSION JUNE 5, 2017

FLORIDA DEEP DIVE ANALYSIS 1 Focus Groups Online Survey 2 Global Strategy Group and Garin Hart

480 million records 2.9 billion holdings WorldCat gets one new record every second As of

Multiagent Systems: Spring 2006 Ulle Endriss Institute for Logic, Language and Computation

fbRads Facebook marketing R felhasznlknak Darczi Gergely CARD.com 2015-09-30 When to

Programming Languages First Class Func3ons Func3ons An

Bertha: Tunneling Through the Network API Akshay Narayan Aurojit Panda Mohammad Alizadeh Hari

Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of - PowerPoint PPT Presentation

Resource Allocation for Sequential Decision Making under Uncertainty: Studies in Vehicular Traffic Control, Service Systems, Sensor Networks and Mechanism Design Prashanth. L.A. Advisor: Prof. Shalabh Bhatnagar Department of Computer Science

Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games H.L. Prasad

RUBY ON RAILS CSCI 5448 Fall 2012 Presentation Prashanth Mannar 11/16/2012 Ruby on Rails -

Policy Gradients for CVaR-Constrained MDPs Prashanth L.A. INRIA Lille Team SequeL Prashanth

EU Strategy in Partitioning &amp; Transm utation and its I m plem entation w ithin the EURATOM

002 - Motivating Examples EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of Epidemiology,

005 - Data Graphics EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of Epidemiology,

004 - Exploring Data - Part II EPIB 607 - FALL 2020 Sahir Rai Bhatnagar Department of

Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks

Insights into non-projectivity in Hindi Prashanth Mannem, Himani Chaudhry and Akshar Bharati

New Temporal-Difference Methods Based on Gradient Descent Rich Sutton Hamid Maei Doina Precup

Environment Remote Sensing Instructor: Prof. Prashanth Reddy Marpu SAMPLE METEOSAT SEVIRI

Vignette Session C: Patient Safety Moderator: Deepa Bhatnagar, MD Unknown Vignette Discussant:

CSIR NEWS Total visitors: 6,229 since 15-8-03 Volume 53 no 14, 30 July 2003 ISSN 04097467

Finding that dress at scale @Rent The Runway Saurabh Bhatnagar Bio 17 years in ML/data Prev:

Rohit Chawla Anshika Bhatnagar Priyanka Nigam Diliraj Dhabhole Krishna Rao Raja Bhantia Pema

ALMA Future Sc. Program Development Workshop Aug. 2016, Charlottesville Wide-fjeld Wide-band

How was it for Wessex Identifying Issues Chris Brown WAHSN PINCER Support The PINCER

URBAN FOREST MANAGEMENT PLAN SCOPE OF WORK DISCUSSION CITY COUNCIL STUDY SESSION JUNE 5, 2017

FLORIDA DEEP DIVE ANALYSIS 1 Focus Groups Online Survey 2 Global Strategy Group and Garin Hart

480 million records 2.9 billion holdings WorldCat gets one new record every second As of

Multiagent Systems: Spring 2006 Ulle Endriss Institute for Logic, Language and Computation

fbRads Facebook marketing R felhasznlknak Darczi Gergely CARD.com 2015-09-30 When to

Programming Languages First Class Func3ons Func3ons An

Bertha: Tunneling Through the Network API Akshay Narayan Aurojit Panda Mohammad Alizadeh Hari

EU Strategy in Partitioning & Transm utation and its I m plem entation w ithin the EURATOM