Reinforcement Learning in Configurable Continuous Environments
Alberto Maria Metelli, Emanuele Ghelfi and Marcello Restelli
36th International Conference on Machine Learning 13th June 2019
Slide 1: Non-Configurable Environments

In the standard RL setting, the agent (a policy π_θ, θ ∈ Θ) interacts with a fixed environment: at each step it plays action A_t, and the environment returns reward R_{t+1} and next state S_{t+1}.
Reinforcement Learning in Configurable Continuous Environments POLITECNICO DI MILANO 1863
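The diagram on this slide is the standard agent–environment loop. As a minimal sketch (hypothetical function names, not the authors' code):

```python
def rollout(policy, env_step, init_state, horizon=100):
    """Standard (non-configurable) RL interaction: at each step the agent
    draws A_t from its policy, and the fixed environment returns the
    reward R_{t+1} and the next state S_{t+1}."""
    s, total = init_state, 0.0
    for _ in range(horizon):
        a = policy(s)          # A_t ~ pi_theta(. | S_t)
        s, r = env_step(s, a)  # environment emits S_{t+1}, R_{t+1}
        total += r
    return total

# Usage sketch: a toy 1-D environment whose dynamics are fixed;
# the only thing the learner could ever tune here is the policy.
ret = rollout(policy=lambda s: -s,
              env_step=lambda s, a: (s + a, -abs(s + a)),
              init_state=1.0,
              horizon=10)
```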
Slide 2: Configurable Environments

In a configurable environment, the agent (a policy π_θ, θ ∈ Θ) interacts with an environment whose dynamics are themselves parametrized by a configuration (p_ω, ω ∈ Ω): at each step the agent plays action A_t and receives reward R_{t+1} and next state S_{t+1}, but now both the policy parameters θ and the configuration parameters ω can be adapted to maximize performance.
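The same loop with a configurable transition model can be sketched as follows (a toy illustration with hypothetical names, assuming ω scales how strongly the action moves the state, e.g. a force applied to a cart; not the authors' code):

```python
def make_env_step(omega):
    """Configurable transition model p_omega: omega scales the effect of
    the action on the state."""
    def env_step(s, a):
        s_next = s + omega * a        # S_{t+1} ~ p_omega(. | S_t, A_t)
        return s_next, -abs(s_next)   # R_{t+1}: stay close to the origin
    return env_step

def episode_return(policy, omega, init_state=1.0, horizon=50):
    """Return of one episode; depends on BOTH theta (the policy) and
    omega (the configuration) -- tuning them jointly is the
    configurable-environment learning problem."""
    s, total = init_state, 0.0
    step = make_env_step(omega)
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += r
    return total

# With policy a = -s, the configuration omega = 1 cancels the state in
# one step, so it outperforms omega = 0.5 from the same policy.
good = episode_return(policy=lambda s: -s, omega=1.0)
worse = episode_return(policy=lambda s: -s, omega=0.5)
```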
Slide 4: REMPS (Relative Entropy Model Policy Search)

Optimization: find a new stationary distribution d′ in a trust region centered in d^{π_θ, p_ω}:

    max_{d′} J_{d′} = E_{(S, A, S′) ∼ d′}[r(S, A, S′)]
    s.t. D_KL(d′ ‖ d^{π_θ, p_ω}) ≤ κ

Projection: find a policy π_{θ′} and a configuration p_{ω′} inducing a stationary distribution close to d′:

    min_{θ′ ∈ Θ, ω′ ∈ Ω} D_KL(d′ ‖ d^{π_{θ′}, p_{ω′}})

The transition model p used in the projection can also be an approximated model.
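A sample-based sketch of the optimization step (my own illustration, not the authors' implementation): the trust-region problem max_{d′} E_{d′}[r] s.t. D_KL(d′ ‖ d) ≤ κ has the closed-form solution d′ ∝ d · exp(r/η), where η ≥ 0 minimizes the standard REPS-style dual g(η) = ηκ + η log E_d[exp(r/η)]. With samples drawn from the current d^{π_θ, p_ω}, the expectation becomes an empirical mean, and the one-dimensional dual can be minimized by a simple grid search:

```python
import math

def remps_optimization_weights(rewards, kappa):
    """Sketch of the REMPS optimization step on samples.
    Returns normalized weights w_i ∝ exp(r_i / eta*) representing d',
    together with the dual-optimal temperature eta*."""
    r = list(map(float, rewards))
    rmax = max(r)

    def dual(eta):
        # g(eta) = eta*kappa + eta*log E[exp(r/eta)], with a stable
        # log-mean-exp (shift by rmax to avoid overflow).
        lme = rmax / eta + math.log(
            sum(math.exp((ri - rmax) / eta) for ri in r) / len(r))
        return eta * kappa + eta * lme

    # Log-spaced grid search over eta in [1e-3, 1e3].
    etas = [10 ** (-3 + 6 * i / 399) for i in range(400)]
    best_eta = min(etas, key=dual)

    w = [math.exp((ri - rmax) / best_eta) for ri in r]
    total = sum(w)
    return [wi / total for wi in w], best_eta

# Projection step (sketch): with these weights, minimizing
# D_KL(d' || d^{pi_theta', p_omega'}) over theta', omega' reduces, up to
# terms independent of theta' and omega', to a weighted maximum-likelihood
# fit of the policy and of the (possibly approximated) model:
#   max_{theta', omega'} sum_i w_i * ( log pi_theta'(a_i | s_i)
#                                    + log p_omega'(s'_i | s_i, a_i) ).
```

A tighter trust region (smaller κ) yields a larger η* and therefore flatter weights, keeping d′ closer to the current stationary distribution.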
Slide 5: Experiments

[Figure: "Configure the cart force" — average reward vs. iteration for REMPS (0.01), REMPS (0.1), REMPS (10), and G(PO)MDP]
[Figure: "Configure the front-rear wing" — average return vs. iteration for REMPS and G(PO)MDP]
[Figure: average reward vs. iteration for REMPS, REPS, and Bot]
Slide 7: References

Keren, S., Pineda, L., Gal, A., Karpas, E., and Zilberstein, S. (2017). Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 4353–4360.

Metelli, A. M., Mutti, M., and Restelli, M. (2018). Configurable Markov decision processes. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Silva, R., Melo, F. S., and Veloso, M. (2018). What if the world were different? Gradient-based exploration for new optimal policies. EPiC Series in Computing, 55:229–242.