SLIDE 1

Reinforcement Learning in Configurable Continuous Environments

Alberto Maria Metelli, Emanuele Ghelfi and Marcello Restelli

36th International Conference on Machine Learning, 13th June 2019

SLIDE 2

Non-Configurable Environments

Markov Decision Process (MDP, Puterman, 2014)

[Diagram: the agent (policy) sends action A_t to the environment and receives reward R_{t+1} and next state S_{t+1}.]

$\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, p)$, with $S_0 \sim \mu$, $A_t \sim \pi_\theta(\cdot|S_t)$, $S_{t+1} \sim p(\cdot|S_t, A_t)$.

Learn the policy parameters θ under the fixed environment p:

$$\theta^* = \arg\max_{\theta \in \Theta} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
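For concreteness, the objective can be read as a Monte Carlo average of discounted returns over sampled trajectories. The sketch below is a minimal, hypothetical illustration; the env/policy interfaces are assumptions, not the authors' code:

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, n_episodes=50, horizon=200):
    """Monte Carlo estimate of J(theta) = E[sum_t gamma^t R_{t+1}].
    Assumes env.reset() -> state, env.step(a) -> (state, reward, done),
    and policy(state) sampling A_t ~ pi_theta(.|S_t)."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                # A_t ~ pi_theta(.|S_t)
            s, r, done = env.step(a)     # S_{t+1} ~ p(.|S_t, A_t), reward R_{t+1}
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```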


SLIDES 3-4

Configurable Environments

Configurable Markov Decision Process (Conf-MDP, Metelli et al., 2018)

[Diagram: the agent (policy) sends action A_t to the environment (configuration), which returns reward R_{t+1} and next state S_{t+1}.]

$\mathcal{CM} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, \mathcal{P}, \Pi)$, with $S_0 \sim \mu$, $A_t \sim \pi_\theta(\cdot|S_t)$, $S_{t+1} \sim p_\omega(\cdot|S_t, A_t)$.

Learn the policy parameters θ together with the environment configuration ω:

$$\theta^*, \omega^* = \arg\max_{\theta \in \Theta,\, \omega \in \Omega} J(\theta, \omega) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
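The only change from the MDP case is that the transition model itself has free parameters, so the return is a function of both θ and ω. A toy sketch of what "configurable" means in code; the chain environment and all names here are illustrative assumptions, not the paper's domains:

```python
import numpy as np

class ConfigurableChain:
    """Toy Conf-MDP sketch: a 1-D chain whose slip probability is the
    configurable parameter omega (illustrative, not the paper's chain domain)."""
    def __init__(self, n_states=10, omega=0.1):
        self.n_states, self.omega = n_states, omega
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # With probability omega the intended move (a in {-1, +1}) is
        # flipped: this is the configurable dynamics p_omega(.|s, a).
        move = a if np.random.rand() > self.omega else -a
        self.s = int(np.clip(self.s + move, 0, self.n_states - 1))
        r = 1.0 if self.s == self.n_states - 1 else 0.0
        return self.s, r, False

def joint_return(theta, omega, policy_fn, gamma=0.99, episodes=30, horizon=100):
    """Monte Carlo estimate of J(theta, omega): the policy parameters and
    the environment configuration are both free variables of the objective."""
    env = ConfigurableChain(omega=omega)
    total = 0.0
    for _ in range(episodes):
        s, disc = env.reset(), 1.0
        for _ in range(horizon):
            s, r, _ = env.step(policy_fn(theta, s))
            total += disc * r
            disc *= gamma
    return total / episodes
```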


SLIDE 5

State of the Art

Safe Policy Model Iteration (SPMI, Metelli et al., 2018): optimizes a lower bound on the performance improvement.

Limitations:
  • finite state-action spaces
  • full knowledge of the environment dynamics

Similar approaches: Keren et al. (2017) and Silva et al. (2018).


SLIDES 6-9

Relative Entropy Model Policy Search (REMPS)

Optimization

Find a new stationary distribution d′ in a trust region centered at $d^{\pi_\theta, p_\omega}$:

$$\max_{d'} \; J_{d'} = \mathbb{E}_{S, A, S' \sim d'}\left[ r(S, A, S') \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\left(d' \,\middle\|\, d^{\pi_\theta, p_\omega}\right) \le \kappa$$

Projection

Find a policy $\pi_{\theta'}$ and configuration $p_{\omega'}$ (p can also be an approximated model) inducing a stationary distribution close to d′:

$$\min_{\theta' \in \Theta,\, \omega' \in \Omega} D_{\mathrm{KL}}\left(d' \,\middle\|\, d^{\pi_{\theta'}, p_{\omega'}}\right)$$

[Diagram: within the space of stationary distributions induced by Θ × Ω, the optimization step moves from d^{π_θ,p_ω} to d′ inside the trust region D_KL ≤ κ; the projection step maps d′ back onto the manifold, yielding the new pair (π_{θ′}, p_{ω′}).]
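In REPS-style derivations, the optimization step has the exponential-tilting solution d′ ∝ d · exp(r/η), with the temperature η found by minimizing a one-dimensional convex dual. The sample-based sketch below assumes that standard form carries over; interfaces and names are illustrative (the authors' actual implementation is at github.com/albertometelli/remps):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimization_step(rewards, kappa):
    """Sample-based sketch of the REMPS optimization step.
    Solves max_{d'} E_{d'}[r] s.t. KL(d' || d) <= kappa, whose solution is
    d'(s,a,s') ∝ d(s,a,s') * exp(r(s,a,s')/eta), with eta minimizing the
    dual g(eta) = eta*kappa + eta*log E_d[exp(r/eta)] (Monte Carlo estimate)."""
    r = np.asarray(rewards, dtype=float)

    def dual(eta):
        z = r / eta
        m = z.max()  # log-sum-exp shift for numerical stability
        return eta * kappa + eta * (m + np.log(np.mean(np.exp(z - m))))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x
    w = np.exp((r - r.max()) / eta)  # unnormalized importance weights for d'
    return eta, w / w.sum()
```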

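Over samples, minimizing the projection KL divergence reduces (up to terms independent of θ′ and ω′) to weighted maximum likelihood of the policy and the transition model under the weights computed above. A generic sketch under that assumption; loglik_fn and its parameterization are hypothetical, and a real implementation would use automatic differentiation rather than finite differences:

```python
import numpy as np

def projection_step(weights, loglik_fn, params0, lr=0.1, iters=300, eps=1e-5):
    """Sketch of the REMPS projection step: maximize the weighted
    log-likelihood L(params) = sum_i w_i * [log pi(a_i|s_i; theta')
    + log p(s'_i|s_i, a_i; omega')], abstracted here as
    loglik_fn(params) -> vector of per-sample log-likelihoods."""
    params = np.asarray(params0, dtype=float)
    w = np.asarray(weights, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(params)
        base = np.dot(w, loglik_fn(params))
        for j in range(len(params)):
            pert = params.copy()
            pert[j] += eps
            # Finite-difference gradient of the weighted log-likelihood
            grad[j] = (np.dot(w, loglik_fn(pert)) - base) / eps
        params += lr * grad
    return params
```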

SLIDE 10

Experiments

Chain Domain

[Plot: average reward vs. iteration; REMPS (0.01), REMPS (0.1), REMPS (10) compared with G(PO)MDP.]

Cartpole

Configure the cart force

[Plot: average return vs. iteration; REMPS compared with G(PO)MDP.]

TORCS

Configure the front-rear wing orientation and brake repartition

[Plot: average reward vs. iteration; REMPS compared with REPS and a bot baseline.]
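As an example of what configuring the cart force can look like, here is a minimal sketch assuming OpenAI Gym's classic-control CartPoleEnv, which stores the force magnitude in force_mag; the class and method names below are mine, not the paper's:

```python
from gym.envs.classic_control import CartPoleEnv

class ConfigurableCartPole(CartPoleEnv):
    """CartPole with the cart force exposed as the configuration omega.
    Assumes gym's classic-control CartPoleEnv, where force_mag is the
    magnitude of the force applied to the cart at each step."""

    def set_configuration(self, omega):
        self.force_mag = float(omega)  # omega parameterizes the dynamics p_omega

# usage: env = ConfigurableCartPole(); env.set_configuration(15.0)
```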


SLIDE 11

Thank You for Your Attention!

Poster: Pacific Ballroom #37
Code: github.com/albertometelli/remps
Web page: albertometelli.github.io/ICML2019-REMPS
Contact: albertomaria.metelli@polimi.it

SLIDE 12

References

Keren, S., Pineda, L., Gal, A., Karpas, E., and Zilberstein, S. (2017). Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4353–4360.

Metelli, A. M., Mutti, M., and Restelli, M. (2018). Configurable Markov decision processes. In Dy, J. G. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Silva, R., Melo, F. S., and Veloso, M. (2018). What if the world were different? Gradient-based exploration for new optimal policies. EPiC Series in Computing, 55:229–242.
