
Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning
Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli
Thirty-seventh International Conference on Machine Learning, July 2020


  1. Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning. Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli. July 2020, Thirty-seventh International Conference on Machine Learning.

  2. Motivations
     Problem: How to select the control frequency for a system?
     [Figure: an axis running from lower frequencies to higher frequencies, showing a trade-off between control opportunities and sample complexity.]
     Research Question: Can we exploit this trade-off to find an optimal control frequency?

  6. Control Frequency and Action Persistence
     Idea: persisting each action for $k$ steps.
     - continuous-time MDP $M_0$: time step $0$, control frequency $\infty$
     - discrete-time MDP $M_{\Delta t}$ (after time discretization $\Delta t$): time step $\Delta t$, control frequency $f = 1/\Delta t$
     - $k$-persistent MDP $M_{k \Delta t}$ (after action persistence $k$): time step $k \Delta t$, control frequency $f/k$
     Action persistence as a form of environment configurability (Metelli et al., 2018).
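The idea of persisting each action for k steps can be sketched in a few lines. This is only an illustration under stated assumptions: `base_step` is a hypothetical one-step simulator (a 1-D integrator with a made-up reward), and the names `persistent_step` and `control_frequency` and all parameter values are not from the slides.

```python
def base_step(state, action, dt=0.1):
    """Hypothetical base simulator: a 1-D integrator sampled every dt seconds."""
    next_state = state + action * dt
    reward = -abs(next_state)  # illustrative reward: stay close to the origin
    return next_state, reward


def persistent_step(state, action, k, gamma=0.99, dt=0.1):
    """Apply the same action for k base steps.

    Returns the state reached after k steps and the sum of the k
    intermediate rewards, discounted with the base discount gamma.
    """
    total = 0.0
    for i in range(k):
        state, reward = base_step(state, action, dt)
        total += (gamma ** i) * reward
    return state, total


def control_frequency(dt, k):
    """Control frequency f / k for base time step dt and persistence k."""
    return 1.0 / (dt * k)


if __name__ == "__main__":
    final_state, discounted_reward = persistent_step(state=1.0, action=-1.0, k=4)
    print(final_state, discounted_reward, control_frequency(dt=0.1, k=4))
```

With `k = 1` the function reduces to a single call of the base simulator, which corresponds to controlling at the base frequency $f = 1/\Delta t$.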

  10. Outline
      1. Action persistence formalization
         [Figure: states $S_0, S_1, S_2, S_3$ with the action $A_0$ repeated at each transition.]
      2. Performance loss due to persistence
         [Figure: plot of $\frac{1 - \gamma^{k-1}}{1 - \gamma^{k}}$ against $k$, with $k$ ranging up to 20.]
      3. Persistent Fitted Q-Iteration
         [Figure: diagram with $T^{\delta}$, $T^{*}$, $\Pi_{\mathcal{F}}$, $\mathcal{F}$, and the iterates $Q^{(j)}$, $Q^{(j+1)}$, $Q^{(j+k)}$.]

  13. No Action Persistence
      $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ and $\pi$
      $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$ is Markovian and stationary (Puterman, 2014; Sutton and Barto, 2018).
      [Figure: trajectory $S_0, \dots, S_6$ over $t = 0, \dots, 6$, with an action $A_t \sim \pi(\cdot \mid S_t)$ sampled at every step $t = 0, \dots, 5$.]

  14. Action Persistence: Policy View
      Change the policy → $k$-persistent policy
      $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ and $\pi_k$
      $$\pi_{t,k}(a \mid h_t) = \begin{cases} \pi(a \mid s_t) & \text{if } t \bmod k = 0 \\ \delta_{a_{t-1}}(a) & \text{otherwise} \end{cases}$$
      History: $h_t = (s_0, a_0, \dots, s_{t-1}, a_{t-1}, s_t)$
      $\pi_k$ is non-Markovian and non-stationary.
      [Figure: trajectory $S_0, \dots, S_6$ over $t = 0, \dots, 6$. $A_0 \sim \pi(\cdot \mid S_0)$ is sampled at $t = 0$ and repeated at $t = 1, 2$; $A_3 \sim \pi(\cdot \mid S_3)$ is sampled at $t = 3$ and repeated at $t = 4, 5$.]
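The policy view can be illustrated with a small wrapper around a Markovian stationary policy. This is a hedged sketch, not the authors' implementation: the class name `KPersistentPolicy`, its methods, and the toy base policy are hypothetical. The wrapper samples from the base policy when `t mod k == 0` and otherwise repeats the previous action.

```python
import random


class KPersistentPolicy:
    """Wrap a Markovian stationary policy so each sampled action is held for k steps."""

    def __init__(self, base_policy, k):
        if k < 1:
            raise ValueError("k must be a positive integer")
        self.base_policy = base_policy  # callable: state -> action
        self.k = k
        self.t = 0
        self.last_action = None

    def reset(self):
        """Start a new episode: the step counter goes back to t = 0."""
        self.t = 0
        self.last_action = None

    def act(self, state):
        # Sample a new action only at decision steps (t mod k == 0);
        # at every other step return the previous action a_{t-1}.
        if self.t % self.k == 0:
            self.last_action = self.base_policy(state)
        self.t += 1
        return self.last_action


if __name__ == "__main__":
    rng = random.Random(0)
    policy = KPersistentPolicy(lambda s: rng.choice(["left", "right"]), k=3)
    print([policy.act(state=t) for t in range(7)])
```

The step counter and the stored previous action are exactly what make the wrapped policy depend on time and on history rather than on the current state alone.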

  17. Action Persistence: Environment View
      Change the MDP → $k$-persistent MDP
      $M_k = (\mathcal{S}, \mathcal{A}, P_k, R_k, \gamma^k)$ and $\pi$
      $$P_k(s' \mid s, a) = \big( (P^{\delta})^{k-1} P \big)(s' \mid s, a)$$
      $$R_k(s' \mid s, a) = \sum_{i=0}^{k-1} \gamma^{i} \big( (P^{\delta})^{i} R \big)(s' \mid s, a)$$
      Persistent state-action kernel: $P^{\delta}(s', a' \mid s, a) = \delta_{a'}(a) \, P(s' \mid s, a)$
      $M_k$ has the smaller discount factor $\gamma^k$.
      [Figure: trajectory $S_0, \dots, S_6$ over $t = 0, \dots, 6$, with actions sampled only at $t = 0$ and $t = 3$: $A_0 \sim \pi(\cdot \mid S_0)$ and $A_3 \sim \pi(\cdot \mid S_3)$.]
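For a finite MDP, the environment view can be computed directly with matrices. The sketch below makes several assumptions that are not in the slides: states and actions are finite, the transition model is stored as an array `P[s, a, s2]`, and the reward is represented by its expectation `r[s, a]` instead of the reward model $R$. The function name `persistent_mdp` is hypothetical.

```python
import numpy as np


def persistent_mdp(P, r, gamma, k):
    """Build the k-persistent transition model and expected reward.

    P: array of shape (S, A, S), P[s, a, s2] = P(s2 | s, a)
    r: array of shape (S, A), expected reward (assumption, see lead-in)
    Returns (P_k, r_k, gamma ** k), with P_k and r_k shaped like P and r.
    """
    S, A, _ = P.shape
    # Persistent kernel on state-action pairs: the action is carried over,
    # P_delta[(s, a), (s2, a2)] = 1{a2 == a} * P(s2 | s, a).
    P_delta = np.zeros((S, A, S, A))
    for a in range(A):
        P_delta[:, a, :, a] = P[:, a, :]
    P_delta = P_delta.reshape(S * A, S * A)

    # P_k = (P_delta)^(k-1) P, with P viewed as an (S*A, S) matrix.
    P_k = np.linalg.matrix_power(P_delta, k - 1) @ P.reshape(S * A, S)

    # r_k = sum_{i=0}^{k-1} gamma^i (P_delta)^i r.
    r_flat = r.reshape(S * A)
    r_k = np.zeros(S * A)
    power = np.eye(S * A)
    for i in range(k):
        r_k += (gamma ** i) * (power @ r_flat)
        power = power @ P_delta
    return P_k.reshape(S, A, S), r_k.reshape(S, A), gamma ** k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((3, 2, 3))
    P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
    r = rng.random((3, 2))
    P_k, r_k, gamma_k = persistent_mdp(P, r, gamma=0.9, k=3)
    print(P_k.sum(axis=2))  # every row of P_k is still a distribution
    print(gamma_k)
```

For `k = 1` the function returns the original `P`, `r`, and `gamma`, which matches the case of no persistence.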

  20. Persistent Bellman Operators
      MDP $M$
        Bellman Operator (Bertsekas, 2005):
        $$(T^{*} f)(s, a) = r(s, a) + \gamma \int_{\mathcal{S}} P(\mathrm{d}s' \mid s, a) \max_{a' \in \mathcal{A}} f(s', a')$$
        $T^{*}$ is a $\gamma$-contraction in $L_\infty$-norm.
        $Q^{*}$ is the unique fixed point of $T^{*}$: $T^{*} Q^{*} = Q^{*}$.
      $k$-persistent MDP $M_k$
        Persistence Operator:
        $$(T^{\delta} f)(s, a) = r(s, a) + \gamma \int_{\mathcal{S}} P(\mathrm{d}s' \mid s, a) \, f(s', a)$$
        $k$-persistent Bellman Operator:
        $$T^{*}_{k} = (T^{\delta})^{k-1} T^{*}$$
        $T^{*}_{k}$ is a $\gamma^{k}$-contraction in $L_\infty$-norm.
        $Q^{*}_{k}$ is the unique fixed point of $T^{*}_{k}$: $T^{*}_{k} Q^{*}_{k} = Q^{*}_{k}$.
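The operators on this slide can be written out for a finite MDP, where the integral over next states becomes a sum. This is a minimal tabular sketch under assumptions not found in the slides: `P[s, a, s2]` and `r[s, a]` are arrays, and the function names are hypothetical. It applies $T^{*}$ once and then $T^{\delta}$ for $k - 1$ times, which is the composition $(T^{\delta})^{k-1} T^{*}$.

```python
import numpy as np


def bellman_optimal(Q, P, r, gamma):
    """(T* Q)(s, a) = r(s, a) + gamma * sum_s2 P(s2 | s, a) * max_a2 Q(s2, a2)."""
    return r + gamma * (P @ Q.max(axis=1))


def bellman_persistence(Q, P, r, gamma):
    """(T^delta Q)(s, a) = r(s, a) + gamma * sum_s2 P(s2 | s, a) * Q(s2, a)."""
    return r + gamma * np.einsum("sax,xa->sa", P, Q)


def bellman_k_persistent(Q, P, r, gamma, k):
    """T*_k = (T^delta)^(k-1) T*: one T* step followed by k - 1 T^delta steps."""
    Q = bellman_optimal(Q, P, r, gamma)
    for _ in range(k - 1):
        Q = bellman_persistence(Q, P, r, gamma)
    return Q


def fixed_point(P, r, gamma, k, iters=2000):
    """Approximate Q*_k by repeatedly applying T*_k (k = 1 approximates Q*)."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = bellman_k_persistent(Q, P, r, gamma, k)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((4, 2, 4))
    P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
    r = rng.random((4, 2))
    Q_star = fixed_point(P, r, gamma=0.9, k=1)
    Q_star_3 = fixed_point(P, r, gamma=0.9, k=3)
    # Residuals of the fixed-point equations T* Q* = Q* and T*_k Q*_k = Q*_k.
    print(np.abs(bellman_k_persistent(Q_star, P, r, 0.9, 1) - Q_star).max())
    print(np.abs(bellman_k_persistent(Q_star_3, P, r, 0.9, 3) - Q_star_3).max())
```

Because each application of `bellman_k_persistent` shrinks distances by a factor of at most $\gamma^{k}$, plain iteration from any starting point converges to the fixed point in this tabular setting.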
