Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning

Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli

Thirty-seventh International Conference on Machine Learning (ICML 2020), July 2020


Motivations

Problem: How to select the control frequency for a system?

Trade-off:
- Higher frequencies: more control opportunities
- Lower frequencies: lower sample complexity

Research Question: Can we exploit this trade-off to find an optimal control frequency?


Control Frequency and Action Persistence

Idea: persist each action for $k$ steps.

continuous-time MDP $\mathcal{M}_0$ --(time discretization)--> discrete-time MDP $\mathcal{M}_{\Delta t}$ --(action persistence $k$)--> $k$-persistent MDP $\mathcal{M}_{k\Delta t}$

control time-step: continuous, $\Delta t$, $k\Delta t$
control frequency: $\infty$, $f = 1/\Delta t$, $f/k$

Action persistence is a form of environment configurability (Metelli et al., 2018).
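The construction above, repeating each chosen action for $k$ base steps, can be sketched as an environment wrapper. This is an illustrative sketch, not the authors' code: `ToyEnv` is a made-up deterministic environment, and the wrapper aggregates the $k$ base rewards with discount $\gamma$, so the wrapped environment behaves like $\mathcal{M}_{k\Delta t}$ (with effective discount $\gamma^k$).

```python
class ToyEnv:
    """Hypothetical deterministic chain environment: the state increases by
    the chosen action at every base step, and the reward equals the state."""
    def __init__(self):
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s += action
        return self.s, float(self.s), self.s >= 10, {}


class PersistedEnv:
    """k-persistent view of a base environment: each decision repeats the
    chosen action for k base steps; the k base rewards are aggregated with
    discount gamma, matching the k-persistent reward, and the effective
    discount factor of the wrapped MDP becomes gamma**k."""
    def __init__(self, env, k, gamma):
        self.env, self.k, self.gamma = env, k, gamma

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total = 0.0
        for i in range(self.k):
            s, r, done, info = self.env.step(action)
            total += self.gamma ** i * r  # discounted aggregation over the k base steps
            if done:
                break
        return s, total, done, info
```

For example, with $k = 2$ and $\gamma = 0.9$, one persisted step from the initial state with action 1 visits states 1 and 2 and returns the aggregated reward $1 + 0.9 \cdot 2 = 2.8$.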


Outline

1. Action persistence formalization
2. Performance loss due to persistence
3. Persistent Fitted Q-Iteration


No Action Persistence

MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ and policy $\pi$, with $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ Markovian and stationary (Puterman, 2014; Sutton and Barto, 2018).

A fresh action is sampled from the policy at every step:

$S_0 \xrightarrow{A_0 \sim \pi(\cdot\mid S_0)} S_1 \xrightarrow{A_1 \sim \pi(\cdot\mid S_1)} S_2 \xrightarrow{A_2 \sim \pi(\cdot\mid S_2)} \cdots$


Action Persistence: Policy View

Change the policy into the $k$-persistent policy: $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ and $\pi_k$, where

$$\pi_{t,k}(a \mid h_t) = \begin{cases} \pi(a \mid s_t) & \text{if } t \bmod k = 0 \\ \delta_{a_{t-1}}(a) & \text{otherwise} \end{cases}$$

with history $h_t = (s_0, a_0, \ldots, s_{t-1}, a_{t-1}, s_t)$. Note that $\pi_k$ is non-Markovian and non-stationary.

A new action is sampled only every $k$ steps (here $k = 3$):

$A_0 \sim \pi(\cdot \mid S_0)$, then $A_0$, $A_0$, then $A_3 \sim \pi(\cdot \mid S_3)$, then $A_3$, $A_3$, $\ldots$
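The $k$-persistent policy can be written as a small stateful wrapper around any base policy. A sketch, assuming the base policy is a plain callable (the names are illustrative):

```python
class PersistentPolicy:
    """k-persistent policy pi_k: query the base policy only when t mod k == 0,
    otherwise replay the previously sampled action (the Dirac term
    delta_{a_{t-1}} in the definition above)."""
    def __init__(self, base_policy, k):
        self.base_policy, self.k = base_policy, k
        self.t, self.last_action = 0, None

    def reset(self):
        self.t, self.last_action = 0, None

    def act(self, state):
        if self.t % self.k == 0:          # decision step: sample a fresh action
            self.last_action = self.base_policy(state)
        self.t += 1                        # persisted step: keep the old action
        return self.last_action
```

With $k = 3$ and an identity base policy over states $0, \ldots, 5$, the produced actions are `[0, 0, 0, 3, 3, 3]`, i.e. exactly the pattern $A_0, A_0, A_0, A_3, A_3, A_3$ of the trajectory above.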


Action Persistence: Environment View

Change the MDP into the $k$-persistent MDP: $\mathcal{M}_k = (\mathcal{S}, \mathcal{A}, P_k, R_k, \gamma^k)$ and $\pi$, where

$$P_k(s' \mid s, a) = \big((P^\delta)^{k-1} P\big)(s' \mid s, a), \qquad R_k(s, a) = \sum_{i=0}^{k-1} \gamma^i \big((P^\delta)^i R\big)(s, a)$$

with persistent state-action kernel $P^\delta(s', a' \mid s, a) = \delta_a(a')\, P(s' \mid s, a)$. Note that $\mathcal{M}_k$ has the smaller discount factor $\gamma^k$.

In $\mathcal{M}_k$, each decision corresponds to $k$ base steps (here $k = 3$): $A_0 \sim \pi(\cdot \mid S_0)$ drives the system from $S_0$ to $S_3$, then $A_3 \sim \pi(\cdot \mid S_3)$ drives it to $S_6$, and so on.
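For a finite MDP, the $k$-persistent model $(P_k, R_k)$ can be computed explicitly from the definitions above. A tabular sketch (the array layout and function name are my own choices):

```python
import numpy as np

def persistent_mdp(P, r, gamma, k):
    """Build the k-persistent model from a tabular MDP.
    P: (S, A, S) transition kernel, r: (S, A) reward, k >= 1.
    Returns P_k = (P^delta)^(k-1) P  and  R_k = sum_i gamma^i (P^delta)^i r,
    where P^delta(s', a' | s, a) = delta_a(a') P(s' | s, a) keeps the action."""
    S, A, _ = P.shape
    # persistent state-action kernel on the flattened (s, a) space
    Pd = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                Pd[s * A + a, s2 * A + a] = P[s, a, s2]
    # R_k: accumulate gamma^i (P^delta)^i r for i = 0, ..., k-1
    rv = r.reshape(S * A).astype(float)
    Rk, term = np.zeros(S * A), rv.copy()
    for i in range(k):
        Rk += gamma ** i * term
        term = Pd @ term
    # P_k: k-1 persisted steps followed by one ordinary transition
    M = np.linalg.matrix_power(Pd, k - 1)
    Pk = (M @ P.reshape(S * A, S)).reshape(S, A, S)
    return Pk, Rk.reshape(S, A)
```

Sanity checks: $k = 1$ returns the original model, and every row of $P_k$ remains a probability distribution.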


Persistent Bellman Operators

MDP $\mathcal{M}$: Bellman optimal operator (Bertsekas, 2005)

$$(T^* f)(s, a) = r(s, a) + \gamma \int_{\mathcal{S}} P(\mathrm{d}s' \mid s, a) \max_{a' \in \mathcal{A}} f(s', a')$$

$T^*$ is a $\gamma$-contraction in $L^\infty$-norm, and $Q^*$ is its unique fixed point: $T^* Q^* = Q^*$.

$k$-persistent MDP $\mathcal{M}_k$: persistence operator

$$(T^\delta f)(s, a) = r(s, a) + \gamma \int_{\mathcal{S}} P(\mathrm{d}s' \mid s, a)\, f(s', a)$$

and $k$-persistent Bellman operator

$$T^*_k = (T^\delta)^{k-1} T^*$$

$T^*_k$ is a $\gamma^k$-contraction in $L^\infty$-norm, and $Q^*_k$ is its unique fixed point: $T^*_k Q^*_k = Q^*_k$.
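On a finite MDP the two operators are one line each. The sketch below (tabular arrays, names are mine) also lets one check numerically that the composition $(T^\delta)^{k-1} T^*$ behaves as a $\gamma^k$-contraction:

```python
import numpy as np

def bellman_opt(Q, P, r, gamma):
    """(T* Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    return r + gamma * np.einsum('sak,k->sa', P, Q.max(axis=1))

def persistence_op(Q, P, r, gamma):
    """(T^delta Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) Q(s',a):
    like T*, but the current action is kept instead of maximised over."""
    return r + gamma * np.einsum('sak,ka->sa', P, Q)

def persistent_bellman(Q, P, r, gamma, k):
    """k-persistent Bellman operator T*_k = (T^delta)^(k-1) T*."""
    Q = bellman_opt(Q, P, r, gamma)        # one optimal step
    for _ in range(k - 1):                 # followed by k-1 persisted steps
        Q = persistence_op(Q, P, r, gamma)
    return Q
```

Iterating `persistent_bellman` from any initial $Q$ converges to $Q^*_k$; with $k = 1$ it is exactly value iteration towards $Q^*$.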



Bounding the Performance Loss

$Q^*_k \le Q^*$ for all $k \ge 1$. How much do we lose by persisting the actions of policy $\pi$ for $k$ steps?

$$\left\| Q^\pi - Q^\pi_k \right\|_{p,\mu} \le \frac{\gamma}{1-\gamma} \cdot \frac{1-\gamma^{k-1}}{1-\gamma^k} \left\| d(P^\pi, P^\delta) \right\|_{p,\mu}$$

- The factor $\frac{\gamma}{1-\gamma}\frac{1-\gamma^{k-1}}{1-\gamma^k}$ is increasing with $k$.
- $d(P^\pi, P^\delta)$ is the discrepancy between the transition kernels $P^\pi(s', a' \mid s, a) = \pi(a' \mid s')\, P(s' \mid s, a)$ and $P^\delta(s', a' \mid s, a) = \delta_a(a')\, P(s' \mid s, a)$.
- Under Lipschitz conditions (Rachelson and Lagoudakis, 2010) it can be bounded as $\left\| d(P^\pi, P^\delta) \right\|_{p,\mu} \le L\big[(L_\pi + 1) L_T + \sigma_p\big]$.
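To see how the algorithm-independent factor in the bound above grows with $k$, a one-line helper (purely illustrative):

```python
def persistence_loss_factor(gamma, k):
    """gamma/(1-gamma) * (1-gamma**(k-1))/(1-gamma**k): the factor multiplying
    ||d(P^pi, P^delta)||_{p,mu} in the performance-loss bound."""
    return gamma / (1 - gamma) * (1 - gamma ** (k - 1)) / (1 - gamma ** k)
```

For $k = 1$ the factor is $0$ (no persistence, no loss); it increases with $k$ and approaches $\gamma/(1-\gamma)$ as $k$ grows.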


Persistent Fitted Q-Iteration (PFQI)

Fitted Q-Iteration (Ernst et al., 2005): approximation space $\mathcal{F}$, initial estimate $Q^{(0)}$, dataset $\mathcal{D} = \{(S_i, A_i, S_{i+1}, R_i)\}_{i=1}^n \sim \nu$.

$$Q^{(j+1)} = \Pi_{\mathcal{F}}\, \widehat{T}^* Q^{(j)}, \qquad Q^{(j)} \rightsquigarrow Q^*$$

so that $T^* \approx \Pi_{\mathcal{F}} \widehat{T}^*$. What about $Q^*_k$?

Empirical Bellman operators:

$$(\widehat{T}^* f)(S_i, A_i) = R_i + \gamma \max_{a \in \mathcal{A}} f(S_{i+1}, a), \qquad (\widehat{T}^\delta f)(S_i, A_i) = R_i + \gamma f(S_{i+1}, A_i)$$


Persistent Fitted Q-Iteration: approximation space $\mathcal{F}$, initial estimate $Q^{(0)}$, dataset $\mathcal{D} = \{(S_i, A_i, S_{i+1}, R_i)\}_{i=1}^n \sim \nu$.

$$Q^{(j+1)} = \begin{cases} \Pi_{\mathcal{F}}\, \widehat{T}^* Q^{(j)} & \text{if } j \bmod k = 0 \\ \Pi_{\mathcal{F}}\, \widehat{T}^\delta Q^{(j)} & \text{otherwise} \end{cases} \qquad Q^{(j)} \rightsquigarrow Q^*_k$$

Empirical Bellman operators:

$$(\widehat{T}^* f)(S_i, A_i) = R_i + \gamma \max_{a \in \mathcal{A}} f(S_{i+1}, a), \qquad (\widehat{T}^\delta f)(S_i, A_i) = R_i + \gamma f(S_{i+1}, A_i)$$

The target operator is approximated as $T^*_k = (T^\delta)^{k-1} T^* \approx (\Pi_{\mathcal{F}} \widehat{T}^\delta)^{k-1}\, \Pi_{\mathcal{F}} \widehat{T}^*$.
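A minimal PFQI loop on a tabular representation, so that the projection $\Pi_{\mathcal{F}}$ reduces to a trivial per-pair regression. This is a sketch of the scheme above, not the authors' implementation; `tabular_fit` stands in for an arbitrary regressor:

```python
import numpy as np

def tabular_fit(targets, shape):
    """Trivial regression oracle standing in for Pi_F: average the targets
    observed for each (s, a) pair."""
    Q, n = np.zeros(shape), np.zeros(shape)
    for (s, a), y in targets:
        Q[s, a] += y
        n[s, a] += 1
    return Q / np.maximum(n, 1)

def pfqi(dataset, n_states, n_actions, gamma, k, n_iters, fit=tabular_fit):
    """Persistent Fitted Q-Iteration sketch: every k-th iteration applies the
    empirical optimal operator T*, the others apply the empirical persistence
    operator T^delta; each step is followed by a projection (regression).
    dataset: iterable of (s, a, s_next, reward) transitions."""
    Q = np.zeros((n_states, n_actions))
    for j in range(n_iters):
        targets = []
        for s, a, s2, r in dataset:
            if j % k == 0:
                y = r + gamma * Q[s2].max()   # empirical T*
            else:
                y = r + gamma * Q[s2, a]      # empirical T^delta
            targets.append(((s, a), y))
        Q = fit(targets, Q.shape)
    return Q
```

On a small deterministic MDP with full data coverage, $k = 1$ recovers plain FQI (here exact value iteration), while larger $k$ converges to $Q^*_k$, which is dominated by $Q^*$ as noted earlier.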


PFQI Analysis

Computational complexity: monotonically decreasing with $k$,

$$O\left( Jn \left( 1 + \frac{|\mathcal{A}| - 1}{k} \right) \right) \text{ for } J \text{ iterations.}$$

Error propagation:

$$\left\| Q^*_k - Q^{\pi^{(J)}}_k \right\|_{p,\mu} \le \frac{2}{1-\gamma} \cdot \frac{\gamma^k}{1-\gamma^k}\, \mathcal{E}(J, \mu, \nu, p)$$

- Decreasing with $k$.
- $\mathcal{E}$ depends on the approximation errors $\epsilon^{(j)}$ and on concentrability coefficients (Farahmand, 2011), where

$$\epsilon^{(j)} = \begin{cases} T^* Q^{(j)} - Q^{(j+1)} & \text{if } j \bmod k = 0 \\ T^\delta Q^{(j)} - Q^{(j+1)} & \text{otherwise} \end{cases}$$
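The complexity figure comes from counting Q-function evaluations: a $T^*$-iteration scans all $|\mathcal{A}|$ actions per sample, while a $T^\delta$-iteration evaluates only the stored action. A small illustrative counter:

```python
def pfqi_evaluations(J, n, n_actions, k):
    """Q-evaluations over J PFQI iterations on n samples: every k-th
    iteration costs n*|A| (empirical T*), the others cost n (empirical
    T^delta), giving roughly J*n*(1 + (|A|-1)/k) when k divides J."""
    return sum(n * (n_actions if j % k == 0 else 1) for j in range(J))
```

For $J = 4$, $n = 10$, $|\mathcal{A}| = 3$: $k = 1$ costs $120$ evaluations, $k = 2$ costs $80$, and $k = 4$ costs $60$, matching $Jn(1 + (|\mathcal{A}|-1)/k)$.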

Control Frequency Trade-Off

$$\left\| Q^* - Q^{\pi^{(J)}_k} \right\|_{p,\mu} \le \left\| Q^* - Q^*_k \right\|_{p,\mu} + \left\| Q^*_k - Q^{\pi^{(J)}_k} \right\|_{p,\mu}$$

- First term (control opportunities): algorithm-independent, increasing with $k$.
- Second term (sample complexity): algorithm-dependent, decreasing with $k$.

How to identify the optimal persistence?


Persistence Selection

How to identify the optimal persistence in a batch setting? Given the estimated Q-functions $\{Q_k : k \in \mathcal{K}\}$, select

$$\widetilde{k} \in \operatorname*{arg\,max}_{k \in \mathcal{K}} B_k, \qquad B_k = \widehat{J}_k - \frac{1}{1-\gamma^k} \left\| \widetilde{Q}_k - Q_k \right\|_{\mathcal{D}}$$

where $\widehat{J}_k$ is the estimated performance derived from $Q_k$, and the second term is a Bellman-residual penalty ($\widetilde{Q}_k \approx T^*_k Q_k$) (Farahmand and Szepesvári, 2011).
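The heuristic can be sketched in a few lines. The empirical norm over $\mathcal{D}$ is taken here as a mean absolute deviation (an assumption: the slide does not fix the norm), and all names are illustrative:

```python
import numpy as np

def persistence_index(J_hat, Q_tilde, Q, gamma, k):
    """B_k = J_hat_k - ||Q_tilde_k - Q_k||_D / (1 - gamma**k): estimated
    performance penalised by the empirical Bellman residual of Q_k
    (Q_tilde_k approximates T*_k Q_k on the dataset D)."""
    residual = np.abs(np.asarray(Q_tilde) - np.asarray(Q)).mean()
    return J_hat - residual / (1 - gamma ** k)

def select_persistence(stats, gamma):
    """stats maps k -> (J_hat_k, Q_tilde_k, Q_k); return the k maximising B_k."""
    return max(stats, key=lambda k: persistence_index(*stats[k], gamma, k))
```

A persistence with a slightly lower estimated return can still be selected if its Q-function has a much smaller Bellman residual.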


Experimental Evaluation: Best Persistences

PFQI with ExtraTrees (Geurts et al., 2006):

Environment      Best persistence
Cartpole         4
Mountain Car     8, 16, 32
LunarLander      4, 8
Pendulum         1, 2, 4
Acrobot          2, 4
Swimmer          2, 4, 8
Hopper           64
Walker2D         8, 16, 32, 64

The best persistence is usually not 1, but increasing the persistence too much prevents control altogether.


Experimental Evaluation: Cartpole

[Figure: expected return $J_k$, estimated return $\widehat{J}_k$, and selection index $B_k$ as a function of the iteration, for $k = 1, 2, 4, 8, 16$.]

The Q-functions at lower persistences are overestimated; the persistence selection heuristic correctly selects $k = 4$.


Conclusions

Research Question: Can we exploit this trade-off to find an optimal control frequency? Yes!

Open Questions:
1. Can persistence improve exploration?
2. Persistence in online RL
3. Dynamic persistence selection


Thank You for Your Attention!


References

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 3rd edition. Athena Scientific, 2005.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.

Amir Massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Amir Massoud Farahmand and Csaba Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299-332, 2011.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.

Alberto Maria Metelli, Mirco Mutti, and Marcello Restelli. Configurable Markov decision processes. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pages 3488-3497. PMLR, 2018.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Emmanuel Rachelson and Michail G. Lagoudakis. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010), Fort Lauderdale, Florida, USA, 2010.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.