SLIDE 1

Reinforcement Learning in Configurable Continuous Environments

Alberto Maria Metelli, Emanuele Ghelfi and Marcello Restelli

36th International Conference on Machine Learning, 13th June 2019

SLIDE 2

Non-Configurable Environments

Markov Decision Process (MDP, Puterman, 2014)

[Diagram: the agent (policy) sends action A_t to the environment and receives reward R_{t+1} and next state S_{t+1}.]

$\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, p)$, with $S_0 \sim \mu$, $A_t \sim \pi_\theta(\cdot|S_t)$, $S_{t+1} \sim p(\cdot|S_t, A_t)$.

Learn the policy parameters θ under the fixed environment p:

$$\theta^* = \arg\max_{\theta \in \Theta} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
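For concreteness, the objective can be read as a Monte Carlo average of discounted returns over sampled trajectories. The sketch below is a minimal, hypothetical illustration; the env/policy interfaces are assumptions, not the authors' code:

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, n_episodes=50, horizon=200):
    """Monte Carlo estimate of J(theta) = E[sum_t gamma^t R_{t+1}].
    Assumes env.reset() -> state, env.step(a) -> (state, reward, done),
    and policy(state) sampling A_t ~ pi_theta(.|S_t)."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                # A_t ~ pi_theta(.|S_t)
            s, r, done = env.step(a)     # S_{t+1} ~ p(.|S_t, A_t), reward R_{t+1}
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```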


SLIDES 3-4

Configurable Environments

Configurable Markov Decision Process (Conf-MDP, Metelli et al., 2018)

[Diagram: the agent (policy) sends action A_t to the environment (configuration), which returns reward R_{t+1} and next state S_{t+1}.]

$\mathcal{CM} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, \mathcal{P}, \Pi)$, with $S_0 \sim \mu$, $A_t \sim \pi_\theta(\cdot|S_t)$, $S_{t+1} \sim p_\omega(\cdot|S_t, A_t)$.

Learn the policy parameters θ together with the environment configuration ω:

$$\theta^*, \omega^* = \arg\max_{\theta \in \Theta,\, \omega \in \Omega} J(\theta, \omega) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
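The only change from the MDP case is that the transition model itself has free parameters, so the return is a function of both θ and ω. A toy sketch of what "configurable" means in code; the chain environment and all names here are illustrative assumptions, not the paper's domains:

```python
import numpy as np

class ConfigurableChain:
    """Toy Conf-MDP sketch: a 1-D chain whose slip probability is the
    configurable parameter omega (illustrative, not the paper's chain domain)."""
    def __init__(self, n_states=10, omega=0.1):
        self.n_states, self.omega = n_states, omega
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # With probability omega the intended move (a in {-1, +1}) is
        # flipped: this is the configurable dynamics p_omega(.|s, a).
        move = a if np.random.rand() > self.omega else -a
        self.s = int(np.clip(self.s + move, 0, self.n_states - 1))
        r = 1.0 if self.s == self.n_states - 1 else 0.0
        return self.s, r, False

def joint_return(theta, omega, policy_fn, gamma=0.99, episodes=30, horizon=100):
    """Monte Carlo estimate of J(theta, omega): the policy parameters and
    the environment configuration are both free variables of the objective."""
    env = ConfigurableChain(omega=omega)
    total = 0.0
    for _ in range(episodes):
        s, disc = env.reset(), 1.0
        for _ in range(horizon):
            s, r, _ = env.step(policy_fn(theta, s))
            total += disc * r
            disc *= gamma
    return total / episodes
```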


SLIDE 5

State of the Art

Safe Policy Model Iteration (SPMI, Metelli et al., 2018): optimizes a lower bound on the performance improvement.

Limitations:
  • finite state-action spaces
  • full knowledge of the environment dynamics

Similar approaches: Keren et al. (2017) and Silva et al. (2018).


SLIDES 6-9

Relative Entropy Model Policy Search (REMPS)

Optimization

Find a new stationary distribution d′ in a trust region centered at $d^{\pi_\theta, p_\omega}$:

$$\max_{d'} \; J_{d'} = \mathbb{E}_{S, A, S' \sim d'}\left[ r(S, A, S') \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\left(d' \,\middle\|\, d^{\pi_\theta, p_\omega}\right) \le \kappa$$

Projection

Find a policy $\pi_{\theta'}$ and configuration $p_{\omega'}$ (p can also be an approximated model) inducing a stationary distribution close to d′:

$$\min_{\theta' \in \Theta,\, \omega' \in \Omega} D_{\mathrm{KL}}\left(d' \,\middle\|\, d^{\pi_{\theta'}, p_{\omega'}}\right)$$

[Diagram: within the space of stationary distributions induced by Θ × Ω, the optimization step moves from d^{π_θ,p_ω} to d′ inside the trust region D_KL ≤ κ; the projection step maps d′ back onto the manifold, yielding the new pair (π_{θ′}, p_{ω′}).]
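In REPS-style derivations, the optimization step has the exponential-tilting solution d′ ∝ d · exp(r/η), with the temperature η found by minimizing a one-dimensional convex dual. The sample-based sketch below assumes that standard form carries over; interfaces and names are illustrative (the authors' actual implementation is at github.com/albertometelli/remps):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimization_step(rewards, kappa):
    """Sample-based sketch of the REMPS optimization step.
    Solves max_{d'} E_{d'}[r] s.t. KL(d' || d) <= kappa, whose solution is
    d'(s,a,s') ∝ d(s,a,s') * exp(r(s,a,s')/eta), with eta minimizing the
    dual g(eta) = eta*kappa + eta*log E_d[exp(r/eta)] (Monte Carlo estimate)."""
    r = np.asarray(rewards, dtype=float)

    def dual(eta):
        z = r / eta
        m = z.max()  # log-sum-exp shift for numerical stability
        return eta * kappa + eta * (m + np.log(np.mean(np.exp(z - m))))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x
    w = np.exp((r - r.max()) / eta)  # unnormalized importance weights for d'
    return eta, w / w.sum()
```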

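Over samples, minimizing the projection KL divergence reduces (up to terms independent of θ′ and ω′) to weighted maximum likelihood of the policy and the transition model under the weights computed above. A generic sketch under that assumption; loglik_fn and its parameterization are hypothetical, and a real implementation would use automatic differentiation rather than finite differences:

```python
import numpy as np

def projection_step(weights, loglik_fn, params0, lr=0.1, iters=300, eps=1e-5):
    """Sketch of the REMPS projection step: maximize the weighted
    log-likelihood L(params) = sum_i w_i * [log pi(a_i|s_i; theta')
    + log p(s'_i|s_i, a_i; omega')], abstracted here as
    loglik_fn(params) -> vector of per-sample log-likelihoods."""
    params = np.asarray(params0, dtype=float)
    w = np.asarray(weights, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(params)
        base = np.dot(w, loglik_fn(params))
        for j in range(len(params)):
            pert = params.copy()
            pert[j] += eps
            # Finite-difference gradient of the weighted log-likelihood
            grad[j] = (np.dot(w, loglik_fn(pert)) - base) / eps
        params += lr * grad
    return params
```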

SLIDE 10

Experiments

Chain Domain

[Plot: average reward vs. iteration; REMPS (0.01), REMPS (0.1), REMPS (10) compared with G(PO)MDP.]

Cartpole

Configure the cart force

[Plot: average return vs. iteration; REMPS compared with G(PO)MDP.]

TORCS

Configure the front-rear wing orientation and brake repartition

[Plot: average reward vs. iteration; REMPS compared with REPS and a bot baseline.]
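As an example of what configuring the cart force can look like, here is a minimal sketch assuming OpenAI Gym's classic-control CartPoleEnv, which stores the force magnitude in force_mag; the class and method names below are mine, not the paper's:

```python
from gym.envs.classic_control import CartPoleEnv

class ConfigurableCartPole(CartPoleEnv):
    """CartPole with the cart force exposed as the configuration omega.
    Assumes gym's classic-control CartPoleEnv, where force_mag is the
    magnitude of the force applied to the cart at each step."""

    def set_configuration(self, omega):
        self.force_mag = float(omega)  # omega parameterizes the dynamics p_omega

# usage: env = ConfigurableCartPole(); env.set_configuration(15.0)
```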


SLIDE 11

Thank You for Your Attention!

Poster: Pacific Ballroom #37
Code: github.com/albertometelli/remps
Web page: albertometelli.github.io/ICML2019-REMPS
Contact: albertomaria.metelli@polimi.it

SLIDE 12

References

Keren, S., Pineda, L., Gal, A., Karpas, E., and Zilberstein, S. (2017). Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4353–4360.

Metelli, A. M., Mutti, M., and Restelli, M. (2018). Configurable Markov decision processes. In Dy, J. G. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Silva, R., Melo, F. S., and Veloso, M. (2018). What if the world were different? Gradient-based exploration for new optimal policies. EPiC Series in Computing, 55:229–242.
