Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning
Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli
July 2020, Thirty-seventh International Conference on Machine Learning (ICML 2020)
Trade-Off

A continuous-time MDP is discretized in time into a discrete-time MDP with control time-step ∆t, i.e. control frequency f = 1/∆t. Applying action persistence k (each action is held for k consecutive steps) yields the k-persistent MDP, with effective control time-step k∆t and control frequency f/k. The choice of k trades off reactivity of control against the difficulty of the learning problem.
Outline

1. Action persistence formalization
2. Performance loss due to persistence
3. Persistent Fitted Q-Iteration
In a discrete-time MDP, the agent's policy π : S → P(A) is Markovian and stationary (Puterman, 2014; Sutton and Barto, 2018): a fresh action A_t ∼ π(·|S_t) is drawn at every step t of the trajectory S_0, S_1, ..., S_6.
Policy View

With persistence k (here k = 3), the agent samples A_0 ∼ π(·|S_0), replays it at t = 1, 2, then samples A_3 ∼ π(·|S_3), replays it at t = 4, 5, and so on. Seen as a policy of the base MDP, the resulting k-persistent policy π_k depends on the whole history h_t = (s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t): π_k is non-Markovian and non-stationary.
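The policy view above can be sketched as a thin wrapper that re-samples from the underlying Markovian policy only every k steps. This is a minimal illustrative sketch, not the paper's code; the base policy and its two actions are hypothetical stand-ins.

```python
import random

class PersistentPolicy:
    """Policy view of action persistence: sample from the base policy
    pi(.|s) only when t is a multiple of k; otherwise replay the last
    action. The resulting behavior is non-Markovian (it depends on when
    the last resample happened) and non-stationary (it depends on t)."""

    def __init__(self, base_policy, k):
        self.base_policy = base_policy  # callable: state -> action
        self.k = k
        self.t = 0
        self.last_action = None

    def act(self, state):
        if self.t % self.k == 0:        # decision step: draw a fresh action
            self.last_action = self.base_policy(state)
        self.t += 1                     # otherwise: persistence step, replay it
        return self.last_action

# Toy usage: a two-action random base policy, persisted with k = 3.
random.seed(0)
pi = PersistentPolicy(lambda s: random.choice([0, 1]), k=3)
actions = [pi.act(state=None) for _ in range(9)]
# The action sequence comes in runs of length k = 3.
```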
Environment View

Equivalently, persistence can be moved into the environment: every k steps the agent interacts with the k-persistent MDP M_k, in which
• the reward accumulates the discounted rewards collected while persisting, r_k(s, a) = E[ Σ_{i=0}^{k−1} γ^i r(S_i, a) | S_0 = s ];
• transitions are obtained by composing the base kernel with k−1 applications of the persistent state-action kernel P^δ(ds′, da′ | s, a) = δ_a(da′) P(ds′ | s, a);
• M_k has the smaller discount factor γ^k.
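Dually, the environment view can be sketched as a wrapper around a one-step simulator: each agent decision is held for k inner steps, the k rewards are γ-discounted and summed, and the effective discount becomes γ^k. A minimal sketch under toy assumptions; `step` is a hypothetical one-step transition function, not an API from the paper.

```python
def persist(step, state, action, k, gamma):
    """Environment view: executing `action` for k consecutive steps of the
    base MDP yields one transition of the k-persistent MDP M_k, with reward
    r_k = sum_{i=0}^{k-1} gamma^i * r_i and discount factor gamma**k."""
    total_reward = 0.0
    for i in range(k):
        state, reward = step(state, action)    # one base-MDP transition
        total_reward += (gamma ** i) * reward  # discount inside the window
    return state, total_reward, gamma ** k     # next state, r_k, effective discount

# Toy chain MDP: the action shifts the state, the reward equals the new state.
def step(state, action):
    nxt = state + action
    return nxt, float(nxt)

s, r_k, gamma_k = persist(step, state=0, action=1, k=3, gamma=0.9)
# s == 3; r_k = 1 + 0.9*2 + 0.81*3 = 5.23; gamma_k = 0.9**3 = 0.729
```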
Bellman Operator (Bertsekas, 2005)

(T* f)(s, a) = r(s, a) + γ ∫_S P(ds′ | s, a) max_{a′ ∈ A} f(s′, a′)

• T* is a γ-contraction in L∞-norm
• Q* is the unique fixed point of T*: T* Q* = Q*

Persistence Operator

(T^δ f)(s, a) = r(s, a) + γ ∫_S P(ds′ | s, a) f(s′, a)

(as T*, but the bootstrap re-evaluates the persisted action a instead of maximizing).

k-persistent Bellman Operator

T*_k = (T^δ)^{k−1} T*

• T*_k is a γ^k-contraction in L∞-norm
• Q*_k is the unique fixed point of T*_k: T*_k Q*_k = Q*_k
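On a small finite MDP the two operators and their fixed points can be checked numerically. The sketch below (a toy 2-state MDP with hypothetical numbers, not from the paper) iterates T*_k = (T^δ)^{k−1} T* to its fixed point Q*_k and verifies that Q*_k ≤ Q* everywhere:

```python
# Toy 2-state, 2-action MDP: P[s][a] = list of (prob, next_state), r[s][a].
P = {0: {0: [(1.0, 0)], 1: [(0.5, 0), (0.5, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
r = {0: {0: 0.0, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}
gamma, states, actions = 0.9, [0, 1], [0, 1]

def T_star(Q):
    # Bellman optimal operator: bootstrap with a max over next actions.
    return {s: {a: r[s][a] + gamma * sum(p * max(Q[s2][b] for b in actions)
                                         for p, s2 in P[s][a])
                for a in actions} for s in states}

def T_delta(Q):
    # Persistence operator: bootstrap with the SAME action, no max.
    return {s: {a: r[s][a] + gamma * sum(p * Q[s2][a] for p, s2 in P[s][a])
                for a in actions} for s in states}

def T_star_k(Q, k=3):
    Q = T_star(Q)              # one greedy backup ...
    for _ in range(k - 1):     # ... followed by k-1 persistent backups
        Q = T_delta(Q)
    return Q

def fixed_point(op, iters=2000):
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(iters):     # both operators are contractions, so this converges
        Q = op(Q)
    return Q

Q_star = fixed_point(T_star)
Q_star_k = fixed_point(T_star_k)
# Persisting actions can only lose value: Q*_k <= Q* for all (s, a).
assert all(Q_star_k[s][a] <= Q_star[s][a] + 1e-9
           for s in states for a in actions)
```

Here state 1 with action 1 is a self-loop with reward 1, so Q*(1, 1) = 1/(1 − γ) = 10, which the iteration recovers.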
Performance Loss due to Persistence

• Q*_k ≤ Q* for all k ≥ 1: persisting actions can never gain value in the base MDP.
• The loss ‖Q* − Q*_k‖_{p,μ} is bounded by a coefficient that grows with k as (1 − γ^{k−1})/(1 − γ^k) (equal to 0 at k = 1 and increasing toward 1; plotted over k = 1, ..., 20 on the slide), multiplying a dissimilarity term that measures how far the persistence operator T^δ is from T* on Q*.
• The dissimilarity term can in turn be bounded under Lipschitz conditions (Rachelson and Lagoudakis, 2010), in terms of the Lipschitz constants of the reward (L_r), policy (L_π), and transition kernel (L_T), plus a time-smoothness term (σ).
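The k-dependence of the loss bound can be eyeballed directly; this small snippet tabulates the coefficient (1 − γ^{k−1})/(1 − γ^k) from the slide's plot and confirms it starts at 0 and increases toward 1:

```python
gamma = 0.9

def coeff(k, gamma):
    # Coefficient governing the performance loss of persistence k:
    # zero at k = 1 (no persistence, no loss), strictly increasing in k,
    # and approaching 1 as k grows.
    return (1 - gamma ** (k - 1)) / (1 - gamma ** k)

values = [coeff(k, gamma) for k in (1, 2, 4, 8, 16)]
# values[0] == 0.0 and the sequence is strictly increasing, staying below 1.
```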
Persistent Fitted Q-Iteration (PFQI)

Given a batch dataset {(S_i, A_i, R_i, S_{i+1})}_{i=1}^n with S_i ∼ ν, approximate the k-persistent Bellman operator by alternating empirical backups with projections Π_F onto the chosen function space F:

T*_k = (T^δ)^{k−1} T* ≈ (Π_F T̂^δ)^{k−1} Π_F T̂*

Each macro-iteration maps Q^(j) to Q^(j+k):
• one regression with greedy targets Y_i = R_i + γ max_{a ∈ A} Q^(j)(S_{i+1}, a)  (the Π_F T̂* step);
• then k − 1 regressions with persistent targets Y_i = R_i + γ Q(S_{i+1}, A_i)  (the Π_F T̂^δ steps).
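A minimal tabular sketch of the PFQI scheme above: one macro-iteration performs a greedy-target regression (the Π_F T̂* step) followed by k − 1 same-action-target regressions (the Π_F T̂^δ steps). Here, as a simplifying assumption, the function space F is the table of (s, a) cells, so "projection" reduces to averaging targets per cell; the dataset and MDP are toy stand-ins, not the paper's experiments.

```python
from collections import defaultdict

def regress(dataset, targets, n_states, n_actions):
    # Tabular 'projection onto F': average the regression targets per (s, a).
    sums, counts = defaultdict(float), defaultdict(int)
    for (s, a, _, _), y in zip(dataset, targets):
        sums[(s, a)] += y
        counts[(s, a)] += 1
    return {(s, a): sums[(s, a)] / counts[(s, a)] if counts[(s, a)] else 0.0
            for s in range(n_states) for a in range(n_actions)}

def pfqi(dataset, k, gamma, n_states, n_actions, macro_iters=50):
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
    for _ in range(macro_iters):
        # Pi_F T^* step: greedy targets, max over next actions.
        targets = [r + gamma * max(Q[(s2, b)] for b in range(n_actions))
                   for (s, a, r, s2) in dataset]
        Q = regress(dataset, targets, n_states, n_actions)
        # k-1 Pi_F T^delta steps: persistent targets, bootstrap the same action.
        for _ in range(k - 1):
            targets = [r + gamma * Q[(s2, a)] for (s, a, r, s2) in dataset]
            Q = regress(dataset, targets, n_states, n_actions)
    return Q

# Toy batch of transitions (s, a, r, s') from a 2-state MDP where
# action 1 in state 1 is a self-loop with reward 1 (value 1/(1-0.9) = 10).
data = [(0, 0, 0.0, 0), (0, 1, 0.5, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
Q2 = pfqi(data, k=2, gamma=0.9, n_states=2, n_actions=2)
```

In the practical algorithm the per-cell averaging is replaced by a regressor over a continuous state space, e.g. the extremely randomized trees used in Fitted Q-Iteration (Ernst et al., 2005; Geurts et al., 2006).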
Error Propagation

After J iterations, the gap between Q*_k and Q^{π^(J)}_k, the Q-function of the returned greedy policy π^(J), satisfies a bound ‖Q*_k − Q^{π^(J)}_k‖_{p,μ} ≤ (...), where:
• the bound is decreasing with k: larger persistence makes the batch problem easier to solve;
• it involves the per-iteration approximation errors ε^(j) and concentrability coefficients (Farahmand, 2011).
Putting the Bounds Together

‖Q* − Q^{π^(J)}_k‖_{p,μ} ≤ ‖Q* − Q*_k‖_{p,μ} + ‖Q*_k − Q^{π^(J)}_k‖_{p,μ}

The first term (loss due to persistence) increases with k, while the second (learning error) decreases with k, so an intermediate persistence can minimize the overall bound.
Persistence Selection

Given a set of candidate persistences k ∈ K and the batch dataset D, select the persistence that maximizes an estimated index B_k of the expected return of the policy greedy w.r.t. the learned Q_k, in the spirit of model selection for batch RL (Farahmand and Szepesvári, 2011).
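A crude sketch of the selection step: learn a Q-function for each candidate k and rank the candidates by an estimated greedy value at the initial states. This is a deliberately simplified stand-in for the paper's index B_k (which incorporates corrections in the style of Farahmand and Szepesvári, 2011); the function name, Q-tables, and estimator here are illustrative assumptions.

```python
def select_persistence(q_functions, initial_states, n_actions):
    """Pick the persistence whose learned Q-function promises the highest
    greedy value averaged over the initial states. `q_functions` maps each
    candidate k to a learned table Q_k[(s, a)]; this average is a
    simplified stand-in for the paper's selection index B_k."""
    def index(Q):
        return sum(max(Q[(s, a)] for a in range(n_actions))
                   for s in initial_states) / len(initial_states)
    return max(q_functions, key=lambda k: index(q_functions[k]))

# Hypothetical learned Q-tables for candidates k in {1, 2, 4},
# with a single initial state s = 0.
Qs = {1: {(0, 0): 1.0, (0, 1): 2.0},
      2: {(0, 0): 1.5, (0, 1): 3.0},
      4: {(0, 0): 0.5, (0, 1): 1.0}}
best_k = select_persistence(Qs, initial_states=[0], n_actions=2)
# best_k == 2: the candidate with the highest estimated initial-state value.
```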
Best Persistences

Environment    Best persistence(s)
Cartpole       4
Mountain Car   8, 16, 32
LunarLander    4, 8
Pendulum       1, 2, 4
Acrobot        2, 4
Swimmer        2, 4, 8
Hopper         64
Walker2D       8, 16, 32, 64
Cartpole

[Three plots over the iterations, for persistences k = 1, 2, 4, 8, 16: the expected return J_k, the estimated return Ĵ_k, and the selection index B_k.]
Open Questions

1. Can persistence improve exploration?
2. Persistence in on-line RL
3. Dynamic persistence selection
References

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 3rd Edition. Athena Scientific, 2005. ISBN 1886529264.
Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
Amir Massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.
Amir Massoud Farahmand and Csaba Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011. doi: 10.1007/s10994-011-5254-7.
Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
Alberto Maria Metelli, Mirco Mutti, and Marcello Restelli. Configurable Markov decision processes. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, volume 80 of Proceedings of Machine Learning Research.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
Emmanuel Rachelson and Michail G. Lagoudakis. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010), Fort Lauderdale, Florida, USA, January 6–8, 2010.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.