slide-1
SLIDE 1

Demystifying the efficiency of reinforcement learning: A few recent stories

Yuxin Chen EE, Princeton University

slide-2
SLIDE 2

Acknowledgement

2/ 74

slide-3
SLIDE 3

Reinforcement learning (RL)

3/ 74

slide-4
SLIDE 4

RL challenges

In RL, an agent learns by interacting with an environment

  • unknown or changing environments
  • delayed rewards or feedback
  • enormous state and action space
  • nonconvexity

4/ 74

slide-5
SLIDE 5

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

5/ 74

slide-6
SLIDE 6

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

Calls for design of sample-efficient RL algorithms!

5/ 74

slide-7
SLIDE 7

Computational efficiency

Running RL algorithms might take a long time . . .

  • enormous state-action space
  • nonconvexity

6/ 74

slide-8
SLIDE 8

Computational efficiency

Running RL algorithms might take a long time . . .

  • enormous state-action space
  • nonconvexity

Calls for computationally efficient RL algorithms!

6/ 74

slide-9
SLIDE 9

This talk: three recent stories

(large-scale) optimization

(high-dimensional) statistics

Demystify sample- and computational efficiency of RL algorithms

7/ 74

slide-10
SLIDE 10

This talk: three recent stories

(large-scale) optimization

(high-dimensional) statistics

Demystify sample- and computational efficiency of RL algorithms

  • 1. model-based RL
  • 2. value-based RL
  • 3. policy-based RL

7/ 74

slide-11
SLIDE 11

This talk: three recent stories

(large-scale) optimization

(high-dimensional) statistics

Demystify sample- and computational efficiency of RL algorithms

  • 1. model-based RL: breaking a sample size barrier
  • 2. value-based RL: Q-learning over Markovian samples
  • 3. policy-based RL: natural policy gradient (NPG) methods

7/ 74

slide-12
SLIDE 12

Background: Markov decision processes

slide-13
SLIDE 13

Markov decision process (MDP)

  • S: state space
  • A: action space

9/ 74

slide-14
SLIDE 14

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward

9/ 74

slide-15
SLIDE 15

Markov decision process (MDP)

  • state space S: positions in the maze
  • action space A: up, down, left, right
  • immediate reward r(s, a): cheese, electricity shocks, cats

10/ 74

slide-16
SLIDE 16

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)

11/ 74

slide-17
SLIDE 17

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)
  • P(·|s, a): unknown transition probabilities

11/ 74

slide-18
SLIDE 18

Value function

Value of policy π: long-term discounted reward

    ∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]

12/ 74
slide-19
SLIDE 19

Value function

Value of policy π: long-term discounted reward

    ∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]

  • γ ∈ [0, 1): discount factor
  • (a_0, s_1, a_1, s_2, a_2, · · · ): generated under policy π

12/ 74
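As an aside, for a small tabular MDP with known P and r this definition can be evaluated in closed form via V^π = (I − γ P_π)^{-1} r_π. The sketch below is illustrative only (not from the talk); the array shapes, function name, and toy numbers are assumptions.

```python
# Minimal sketch (illustrative): exact policy evaluation for a tiny tabular MDP,
# solving the linear system V^pi = (I - gamma * P_pi)^{-1} r_pi.
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """P: (S, A, S) transition probabilities, r: (S, A) rewards in [0, 1],
    pi: (S, A) action probabilities, gamma: discount factor in [0, 1)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum("sa,sa->s", pi, r)     # r_pi[s]     = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Toy usage: 2 states, 2 actions, uniform random policy (numbers are made up)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, r, pi, gamma=0.9))
```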

slide-20
SLIDE 20

Q-function

Q-function of policy π:

    ∀(s, a) ∈ S × A :  Q^π(s, a) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]

  • (s_1, a_1, s_2, a_2, · · · ): generated under policy π (a_0 is fixed)

13/ 74

slide-21
SLIDE 21

Optimal policy and optimal value

14/ 74

slide-22
SLIDE 22

Optimal policy and optimal value

  • optimal policy π⋆: maximizing values for all states

14/ 74

slide-23
SLIDE 23

Optimal policy and optimal value

  • optimal policy π⋆: maximizing values for all states
  • optimal values: V⋆ := V^{π⋆}, Q⋆ := Q^{π⋆}

14/ 74

slide-24
SLIDE 24

Story 1: breaking the sample size barrier via model-based RL under a generative model

Gen Li (Tsinghua EE), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

slide-25
SLIDE 25
slide-26
SLIDE 26

Need to learn the optimal policy from data samples

slide-27
SLIDE 27

Sampling from a generative model

For each state-action pair (s, a), collect N samples {(s, a, s′_(i))}_{1≤i≤N}

— Kearns, Singh ’99

17/ 74

slide-28
SLIDE 28

Sampling from a generative model

For each state-action pair (s, a), collect N samples {(s, a, s′_(i))}_{1≤i≤N}

— Kearns, Singh ’99

How many samples are sufficient to learn an ε-optimal policy?

17/ 74

slide-29
SLIDE 29

An incomplete list of prior art

  • Kearns & Singh ’99
  • Kakade ’03
  • Kearns, Mansour & Ng ’02
  • Azar, Munos & Kappen ’12
  • Azar, Munos, Ghavamzadeh & Kappen ’13
  • Sidford, Wang, Wu, Yang & Ye ’18
  • Sidford, Wang, Wu & Ye ’18
  • Wang ’17
  • Agarwal, Kakade & Yang ’19
  • Wainwright ’19a
  • Wainwright ’19b
  • Pananjady & Wainwright ’20
  • Yang & Wang ’19
  • Khamaru, Pananjady, Ruan, Wainwright & Jordan ’20
  • Mou, Li, Wainwright, Bartlett & Jordan ’20
  • . . .

18/ 74

slide-30
SLIDE 30

An even shorter list of prior art

algorithm (paper) | sample size range | sample complexity | ε-range

  • empirical QVI (Azar et al. ’13):  [ |S|²|A|/(1−γ)², ∞ )  |  |S||A| / ((1−γ)³ε²)  |  (0, 1/√((1−γ)|S|) ]
  • sublinear randomized VI (Sidford et al. ’18a):  [ |S||A|/(1−γ)², ∞ )  |  |S||A| / ((1−γ)⁴ε²)  |  (0, 1/(1−γ) ]
  • variance-reduced QVI (Sidford et al. ’18b):  [ |S||A|/(1−γ)³, ∞ )  |  |S||A| / ((1−γ)³ε²)  |  (0, 1 ]
  • randomized primal-dual (Wang ’17):  [ |S||A|/(1−γ)², ∞ )  |  |S||A| / ((1−γ)⁴ε²)  |  (0, 1/(1−γ) ]
  • empirical MDP + planning (Agarwal et al. ’19):  [ |S||A|/(1−γ)², ∞ )  |  |S||A| / ((1−γ)³ε²)  |  (0, 1/√(1−γ) ]

19/ 74

slide-31
SLIDE 31

20/ 74

slide-32
SLIDE 32

20/ 74

slide-33
SLIDE 33

All prior theory requires sample size > |S||A|/(1 − γ)²  — the “sample size barrier”

20/ 74

slide-34
SLIDE 34

Is it possible to break such a sample size barrier?

slide-35
SLIDE 35

Two approaches

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ of P
  • 2. planning based on the empirical P̂

22/ 74

slide-36
SLIDE 36

Two approaches

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ of P
  • 2. planning based on the empirical P̂

Model-free approach — learning w/o constructing model explicitly

22/ 74

slide-37
SLIDE 37

Two approaches

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ of P
  • 2. planning based on the empirical P̂

Model-free approach — learning w/o constructing model explicitly

22/ 74

slide-38
SLIDE 38

Model estimation

Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1≤i≤N}

23/ 74

slide-39
SLIDE 39

Model estimation

Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1≤i≤N}

Empirical estimates: estimate P(s′|s, a) by the empirical frequency

    P̂(s′|s, a) := (1/N) ∑_{i=1}^N 1{ s′_(i) = s′ }

23/ 74
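A minimal sketch, assuming access to a generative-model oracle sample_next_state(s, a, rng) (a hypothetical name), of how the empirical-frequency estimate P̂ above could be formed; it is illustrative, not the talk's implementation.

```python
# Minimal sketch (assumed interface): build the empirical model P_hat from a
# generative model that, given (s, a), draws s' ~ P(.|s, a).
import numpy as np

def estimate_model(sample_next_state, S, A, N, rng):
    """sample_next_state(s, a, rng) -> s' is the assumed generative-model oracle."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):                       # N i.i.d. draws per (s, a)
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return P_hat / N                                 # empirical frequencies
```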

slide-40
SLIDE 40

Our method: plug-in estimator + perturbation

24/ 74

slide-41
SLIDE 41

Our method: plug-in estimator + perturbation

24/ 74

slide-42
SLIDE 42

Our method: plug-in estimator + perturbation

24/ 74

slide-43
SLIDE 43

Our method: plug-in estimator + perturbation

24/ 74

slide-44
SLIDE 44

The sample-starved regime?

truth: P ∈ ℝ^{|S||A|×|S|};  empirical estimate: P̂

  • cannot recover P faithfully if sample size ≪ |S|²|A|!

25/ 74

slide-45
SLIDE 45

The sample-starved regime?

truth: P ∈ ℝ^{|S||A|×|S|};  empirical estimate: P̂

  • cannot recover P faithfully if sample size ≪ |S|²|A|!
  • can we trust our policy estimate w/o reliable model estimation?

25/ 74

slide-46
SLIDE 46

Main result: ℓ∞-based sample size

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

    ‖V^{π̂⋆_p} − V⋆‖∞ ≤ ε

with sample complexity at most

    Õ( |S||A| / ((1−γ)³ ε²) )

26/ 74
slide-47
SLIDE 47

Main result: ℓ∞-based sample size

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

    ‖V^{π̂⋆_p} − V⋆‖∞ ≤ ε

with sample complexity at most

    Õ( |S||A| / ((1−γ)³ ε²) )

  • π̂⋆_p: computed by empirical QVI or PI within Õ(1/(1−γ)) iterations

26/ 74

slide-48
SLIDE 48

Main result: ℓ∞-based sample size

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

    ‖V^{π̂⋆_p} − V⋆‖∞ ≤ ε

with sample complexity at most

    Õ( |S||A| / ((1−γ)³ ε²) )

  • π̂⋆_p: computed by empirical QVI or PI within Õ(1/(1−γ)) iterations
  • minimax lower bound: Ω( |S||A| / ((1−γ)³ε²) )   (Azar et al. ’13)

26/ 74

slide-49
SLIDE 49

27/ 74

slide-50
SLIDE 50

Proof ideas

Elementary decomposition:

    V⋆ − V^{π̂⋆}  =  ( V⋆ − V̂^{π⋆} ) + ( V̂^{π⋆} − V̂^{π̂⋆} ) + ( V̂^{π̂⋆} − V^{π̂⋆} )

28/ 74

slide-51
SLIDE 51

Proof ideas

Elementary decomposition:

    V⋆ − V^{π̂⋆}  =  ( V⋆ − V̂^{π⋆} ) + ( V̂^{π⋆} − V̂^{π̂⋆} ) + ( V̂^{π̂⋆} − V^{π̂⋆} )
                  ≤  ( V⋆ − V̂^{π⋆} ) + 0 + ( V̂^{π̂⋆} − V^{π̂⋆} )

  • Step 1: control V^π − V̂^π for a fixed π (Bernstein inequality + high-order decomposition)

28/ 74

slide-52
SLIDE 52

Proof ideas

Elementary decomposition:

    V⋆ − V^{π̂⋆}  =  ( V⋆ − V̂^{π⋆} ) + ( V̂^{π⋆} − V̂^{π̂⋆} ) + ( V̂^{π̂⋆} − V^{π̂⋆} )
                  ≤  ( V⋆ − V̂^{π⋆} ) + 0 + ( V̂^{π̂⋆} − V^{π̂⋆} )

  • Step 1: control V^π − V̂^π for a fixed π (Bernstein inequality + high-order decomposition)
  • Step 2: extend it to control V̂^{π̂⋆} − V^{π̂⋆} (decouple statistical dependence)

28/ 74

slide-53
SLIDE 53

Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20)
Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖∞ ≤ ε with sample complexity at most

    Õ( |S| / ((1−γ)³ ε²) )

29/ 74
slide-54
SLIDE 54

Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20)
Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖∞ ≤ ε with sample complexity at most

    Õ( |S| / ((1−γ)³ ε²) )

  • key idea 1: high-order decomposition of V̂^π − V^π

29/ 74

slide-55
SLIDE 55

Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20)
Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖∞ ≤ ε with sample complexity at most

    Õ( |S| / ((1−γ)³ ε²) )

  • key idea 1: high-order decomposition of V̂^π − V^π
  • minimax optimal (Azar et al. ’13, Pananjady & Wainwright ’19)

29/ 74

slide-56
SLIDE 56

Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20)
Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖∞ ≤ ε with sample complexity at most

    Õ( |S| / ((1−γ)³ ε²) )

  • key idea 1: high-order decomposition of V̂^π − V^π
  • minimax optimal (Azar et al. ’13, Pananjady & Wainwright ’19)
  • breaks the sample size barrier |S|/(1−γ)² in prior work (Agarwal et al. ’19, Pananjady & Wainwright ’19, Khamaru et al. ’20)

29/ 74

slide-57
SLIDE 57

Step 2: controlling V̂^{π̂⋆} − V^{π̂⋆}

key idea 2: a leave-one-out argument to decouple stat. dependency
— inspired by Agarwal et al. ’19, but different . . .

30/ 74

slide-58
SLIDE 58

Step 2: controlling V̂^{π̂⋆} − V^{π̂⋆}

key idea 2: a leave-one-out argument to decouple stat. dependency
— inspired by Agarwal et al. ’19, but different . . .

Caveat: requires the optimal policy to stand out from other policies

30/ 74

slide-59
SLIDE 59

Step 2: controlling V̂^{π̂⋆} − V^{π̂⋆}

key idea 3: tie-breaking via perturbation

  • perturb the rewards r by a tiny bit ⇒ π̂⋆_p

31/ 74
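Putting the pieces together, a hedged sketch of the plug-in pipeline described in this story: perturb the rewards slightly to break ties, then plan on the empirical MDP (P̂, r_p) by Q-value iteration. The perturbation size, iteration count, and function names here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (illustrative) of "plug-in estimator + perturbation":
# perturb rewards, then run Q-value iteration on the empirical MDP.
import numpy as np

def plan_on_empirical_mdp(P_hat, r, gamma, xi=1e-6, iters=1000, seed=0):
    """P_hat: (S, A, S) empirical transitions; r: (S, A) rewards."""
    rng = np.random.default_rng(seed)
    r_p = r + xi * rng.random(r.shape)     # tiny random reward perturbation (tie-breaking)
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):                 # Q-value iteration on the empirical MDP
        V = Q.max(axis=1)
        Q = r_p + gamma * P_hat.dot(V)
    return Q.argmax(axis=1)                # greedy policy of the perturbed empirical MDP
```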

slide-60
SLIDE 60

Summary

Model-based RL is minimax optimal and does not suffer from a sample size barrier!

32/ 74

slide-61
SLIDE 61

Summary

Model-based RL is minimax optimal and does not suffer from a sample size barrier!

future directions

  • finite-horizon episodic MDPs
  • Markov games

32/ 74

slide-62
SLIDE 62

Story 2: sample complexity of (asynchronous) Q-learning on Markovian samples

Gen Li (Tsinghua EE), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

slide-63
SLIDE 63

Model-based vs. model-free RL

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ of P
  • 2. planning based on the empirical P̂

Model-free approach — learning w/o modeling & estimating environment explicitly

34/ 74

slide-64
SLIDE 64

A classical example: Q-learning on Markovian samples

slide-65
SLIDE 65

Markovian samples and behavior policy

Observed: a Markovian trajectory {s_t, a_t, r_t}_{t≥0} generated by the behavior policy π_b

Goal: learn the optimal values V⋆ and Q⋆ based on this sample trajectory

36/ 74

slide-66
SLIDE 66

Markovian samples and behavior policy

Key quantities of the sample trajectory

  • minimum state-action occupancy probability µ_min := min_{s,a} µ_{π_b}(s, a), where µ_{π_b} is the stationary distribution of the trajectory
  • mixing time: t_mix

36/ 74
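For a small tabular chain whose transition kernel is known, these quantities can be computed directly; below is an illustrative sketch (not from the talk) of µ_min as the minimum entry of the stationary state-action occupancy under π_b. The function name and interface are assumptions.

```python
# Minimal sketch (illustrative): stationary state-action occupancy of the behavior
# policy pi_b and the key quantity mu_min = min_{s,a} mu_{pi_b}(s, a).
import numpy as np

def min_occupancy(P, pi_b):
    """P: (S, A, S) transitions, pi_b: (S, A) behavior policy."""
    P_pi = np.einsum("sa,sat->st", pi_b, P)          # state transition matrix under pi_b
    evals, evecs = np.linalg.eig(P_pi.T)
    mu_s = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu_s = mu_s / mu_s.sum()                         # stationary state distribution
    mu_sa = mu_s[:, None] * pi_b                     # state-action occupancy mu(s, a)
    return mu_sa.min()
```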

slide-67
SLIDE 67

Q-learning: a classical model-free algorithm

— Chris Watkins, Peter Dayan

Stochastic approximation (Robbins & Monro ’51) for solving the Bellman equation Q = T(Q)

37/ 74

slide-68
SLIDE 68

Aside: Bellman optimality principle

Bellman operator (one-step look-ahead):

    T(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{a′∈A} Q(s′, a′) ]

  • r(s, a): immediate reward;  max_{a′∈A} Q(s′, a′): next state’s value

38/ 74

slide-69
SLIDE 69

Aside: Bellman optimality principle

Bellman operator (one-step look-ahead):

    T(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{a′∈A} Q(s′, a′) ]

  • r(s, a): immediate reward;  max_{a′∈A} Q(s′, a′): next state’s value

Bellman equation: Q⋆ is the unique solution to T(Q⋆) = Q⋆

— Richard Bellman

38/ 74
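A minimal sketch of this operator for a tabular Q-function, with assumed array shapes; it simply mirrors the formula above.

```python
# Minimal sketch: the Bellman optimality operator T on a tabular Q-function,
# T(Q)(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[ max_{a'} Q(s', a') ].
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """Q, r: (S, A); P: (S, A, S)."""
    V_next = Q.max(axis=1)            # max_{a'} Q(s', a') for every next state s'
    return r + gamma * P.dot(V_next)  # expectation over s' ~ P(.|s, a)
```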

slide-70
SLIDE 70

Q-learning: a classical model-free algorithm

— Chris Watkins, Peter Dayan

Stochastic approximation for solving the Bellman equation Q = T(Q):

    Q_{t+1}(s_t, a_t) = (1 − η_t) Q_t(s_t, a_t) + η_t T_t(Q_t)(s_t, a_t),   t ≥ 0

  • only the (s_t, a_t)-th entry is updated

39/ 74

slide-71
SLIDE 71

Q-learning: a classical model-free algorithm

— Chris Watkins, Peter Dayan

Stochastic approximation for solving the Bellman equation Q = T(Q):

    Q_{t+1}(s_t, a_t) = (1 − η_t) Q_t(s_t, a_t) + η_t T_t(Q_t)(s_t, a_t),   t ≥ 0

  • only the (s_t, a_t)-th entry is updated

where

    T_t(Q)(s_t, a_t) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)
    T(Q)(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{a′} Q(s′, a′) ]

39/ 74
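A hedged sketch of asynchronous Q-learning along a single Markovian trajectory, mirroring the update above; env_step and behavior_policy are hypothetical placeholders and the constant stepsize is only illustrative.

```python
# Minimal sketch (assumed interfaces): asynchronous Q-learning on one Markovian
# trajectory generated by a behavior policy pi_b.
import numpy as np

def async_q_learning(env_step, behavior_policy, S, A, gamma, eta, T, s0, rng):
    Q = np.zeros((S, A))
    s = s0
    for _ in range(T):
        a = behavior_policy(s, rng)                   # a_t ~ pi_b(.|s_t)
        r, s_next = env_step(s, a, rng)               # observe r_t and s_{t+1}
        target = r + gamma * Q[s_next].max()          # T_t(Q_t)(s_t, a_t)
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target  # update only the (s_t, a_t)-th entry
        s = s_next                                    # continue along the trajectory
    return Q
```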

slide-72
SLIDE 72

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration

40/ 74

slide-73
SLIDE 73

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration
  • resembles Markov-chain coordinate descent

40/ 74

slide-74
SLIDE 74

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration
  • resembles Markov-chain coordinate descent
  • off-policy: target policy π⋆ ≠ behavior policy πb

40/ 74

slide-75
SLIDE 75

A highly incomplete list of prior work

  • Watkins, Dayan ’92
  • Tsitsiklis ’94
  • Jaakkola, Jordan, Singh ’94
  • Szepesvári ’98

  • Kearns, Singh ’99
  • Borkar, Meyn ’00
  • Even-Dar, Mansour ’03
  • Beck, Srikant ’12
  • Chi, Zhu, Bubeck, Jordan ’18
  • Shah, Xie ’18
  • Lee, He ’18
  • Wainwright ’19
  • Chen, Zhang, Doan, Maguluri, Clarke ’19
  • Yang, Wang ’19
  • Du, Lee, Mahajan, Wang ’20
  • Chen, Maguluri, Shakkottai, Shanmugam ’20
  • Qu, Wierman ’20
  • Devraj, Meyn ’20
  • Weng, Gupta, He, Ying, Srikant ’20
  • ...

41/ 74

slide-76
SLIDE 76

What is sample complexity of (async) Q-learning?

slide-77
SLIDE 77

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q̂ − Q⋆‖∞ ≤ ε?

paper (learning rate) → sample complexity:

  • Even-Dar & Mansour ’03 (linear: 1/t):  t_cover^{1/(1−γ)} / ((1−γ)⁴ε²)
  • Even-Dar & Mansour ’03 (poly: 1/t^ω, ω ∈ (1/2, 1)):  ( t_cover^{1+3ω} / ((1−γ)⁴ε²) )^{1/ω} + ( t_cover/(1−γ) )^{1/(1−ω)}
  • Beck & Srikant ’12 (constant):  t_cover³ |S||A| / ((1−γ)⁵ε²)
  • Qu & Wierman ’20 (rescaled linear):  t_mix / ( µ_min² (1−γ)⁵ ε² )
43/ 74

slide-78
SLIDE 78

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q̂ − Q⋆‖∞ ≤ ε?

if we take µ_min ≍ 1/(|S||A|) and t_cover ≍ t_mix/µ_min . . .

43/ 74

slide-79
SLIDE 79

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q̂ − Q⋆‖∞ ≤ ε?

if we take µ_min ≍ 1/(|S||A|) and t_cover ≍ t_mix/µ_min . . .

All prior results require a sample size of at least t_mix|S|²|A|²!

43/ 74

slide-80
SLIDE 80

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q̂ − Q⋆‖∞ ≤ ε?

if we take µ_min ≍ 1/(|S||A|) and t_cover ≍ t_mix/µ_min . . .

All prior results require a sample size of at least t_mix|S|²|A|²!

43/ 74

slide-81
SLIDE 81

Main result: ℓ∞-based sample complexity

Theorem 3 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1/(1−γ), the sample complexity of async Q-learning to yield ‖Q̂ − Q⋆‖∞ ≤ ε is at most (up to some log factor)

    1 / ( µ_min (1−γ)⁵ ε² )  +  t_mix / ( µ_min (1−γ) )

44/ 74

slide-82
SLIDE 82

Main result: ℓ∞-based sample complexity

Theorem 3 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1/(1−γ), the sample complexity of async Q-learning to yield ‖Q̂ − Q⋆‖∞ ≤ ε is at most (up to some log factor)

    1 / ( µ_min (1−γ)⁵ ε² )  +  t_mix / ( µ_min (1−γ) )

  • improves upon prior art by at least a factor of |S||A|!

— prior art: t_mix / ( µ_min² (1−γ)⁵ ε² ) (Qu & Wierman ’20)

44/ 74

slide-83
SLIDE 83

Effect of mixing time on sample complexity

    1 / ( µ_min (1−γ)⁵ ε² )  +  t_mix / ( µ_min (1−γ) )

  • the t_mix term reflects the cost taken to reach steady state
  • a one-time expense (almost independent of ε) — it becomes amortized as the algorithm runs

45/ 74

slide-84
SLIDE 84

Effect of mixing time on sample complexity

    1 / ( µ_min (1−γ)⁵ ε² )  +  t_mix / ( µ_min (1−γ) )

  • the t_mix term reflects the cost taken to reach steady state
  • a one-time expense (almost independent of ε) — it becomes amortized as the algorithm runs

— prior art: t_mix / ( µ_min² (1−γ)⁵ ε² ) (Qu & Wierman ’20)

45/ 74

slide-85
SLIDE 85

Learning rates

Our choice: constant stepsize η_t ≡ min{ (1−γ)⁴ε²/γ², 1/t_mix }

  • Qu & Wierman ’20: rescaled linear η_t = ( 1/(µ_min(1−γ)) ) / ( t + max{ 1/(µ_min(1−γ)), t_mix } )
  • Beck & Srikant ’12: constant η_t ≡ (1−γ)⁴ε² / ( |S||A| t_cover² ) — too conservative
  • Even-Dar & Mansour ’03: polynomial η_t = t^{−ω} (ω ∈ (1/2, 1])

46/ 74

slide-86
SLIDE 86

Minimax lower bound

minimax lower bound (Azar et al. ’13):   1 / ( µ_min (1−γ)³ ε² )

async Q-learning (ignoring dependency on t_mix):   1 / ( µ_min (1−γ)⁵ ε² )

47/ 74

slide-87
SLIDE 87

Minimax lower bound

minimax lower bound (Azar et al. ’13):   1 / ( µ_min (1−γ)³ ε² )

async Q-learning (ignoring dependency on t_mix):   1 / ( µ_min (1−γ)⁵ ε² )

Can we improve the dependency on the discount complexity 1/(1−γ)?

47/ 74

slide-88
SLIDE 88

One strategy: variance reduction

— inspired by Johnson & Zhang ’13, Wainwright ’19

Variance-reduced Q-learning updates:

    Q_t(s_t, a_t) = (1 − η) Q_{t−1}(s_t, a_t) + η [ T_t(Q_{t−1}) − T_t(Q̄) + T̄(Q̄) ](s_t, a_t)

  • Q̄: some reference Q-estimate, used to help reduce variability
  • T̄: empirical Bellman operator (constructed using a batch of samples)

48/ 74

slide-89
SLIDE 89

Variance-reduced Q-learning

— inspired by Johnson & Zhang ’13, Sidford et al. ’18, Wainwright ’19

for each epoch:

  • 1. update Q̄ and T̄(Q̄)
  • 2. run variance-reduced Q-learning updates

49/ 74
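An illustrative sketch of one such epoch, assuming the reference Q̄ and its batch estimate T̄(Q̄) have already been formed; the interfaces and the way the trajectory is supplied are assumptions, not the paper's exact recipe.

```python
# Minimal sketch (illustrative): variance-reduced Q-learning updates within one epoch,
# using a fixed reference Q_bar and a batch estimate T_bar_Qbar of T(Q_bar).
import numpy as np

def vr_q_learning_epoch(Q, Q_bar, T_bar_Qbar, trajectory, gamma, eta):
    """trajectory: iterable of (s, a, r, s_next) transitions; Q, Q_bar, T_bar_Qbar: (S, A)."""
    for s, a, r, s_next in trajectory:
        t_Q = r + gamma * Q[s_next].max()        # T_t(Q_{t-1})(s, a)
        t_Qbar = r + gamma * Q_bar[s_next].max() # T_t(Q_bar)(s, a)
        Q[s, a] = (1 - eta) * Q[s, a] + eta * (t_Q - t_Qbar + T_bar_Qbar[s, a])
    return Q
```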

slide-90
SLIDE 90

Main result: ℓ∞-based sample complexity

Theorem 4 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1, the sample complexity for (async) variance-reduced Q-learning to yield ‖Q̂ − Q⋆‖∞ ≤ ε is at most on the order of

    1 / ( µ_min (1−γ)³ ε² )  +  t_mix / ( µ_min (1−γ) )

  • more aggressive learning rates: η_t ≡ min{ (1−γ)²/γ², 1/t_mix } (the (1−γ)⁴ε² factor is no longer needed)

50/ 74
slide-91
SLIDE 91

Main result: ℓ∞-based sample complexity

Theorem 4 (Li, Wei, Chi, Gu, Chen ’20)
For any 0 < ε ≤ 1, the sample complexity for (async) variance-reduced Q-learning to yield ‖Q̂ − Q⋆‖∞ ≤ ε is at most on the order of

    1 / ( µ_min (1−γ)³ ε² )  +  t_mix / ( µ_min (1−γ) )

  • more aggressive learning rates: η_t ≡ min{ (1−γ)²/γ², 1/t_mix } (the (1−γ)⁴ε² factor is no longer needed)
  • minimax-optimal for 0 < ε ≤ 1

50/ 74

slide-92
SLIDE 92

Summary

Sharpens finite-sample understanding of Q-learning on Markovian data

51/ 74

slide-93
SLIDE 93

Summary

Sharpens finite-sample understanding of Q-learning on Markovian data

future directions

  • function approximation
  • on-policy algorithms like SARSA
  • general Markov-chain-based optimization algorithms

51/ 74

slide-94
SLIDE 94

Story 3: fast global convergence of entropy-regularized natural policy gradient (NPG) methods

Shicong Cen (CMU ECE), Chen Cheng (Stanford Stats), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE)

slide-95
SLIDE 95

Policy optimization: a major contributor to these successes

53/ 74

slide-96
SLIDE 96

Policy gradient (PG) methods

Given an initial state distribution s ∼ ρ:

    maximize_π  V^π(ρ) := E_{s∼ρ}[ V^π(s) ]

54/ 74

slide-97
SLIDE 97

Policy gradient (PG) methods

Given an initial state distribution s ∼ ρ:

    maximize_π  V^π(ρ) := E_{s∼ρ}[ V^π(s) ]

softmax parameterization:  π_θ(a|s) = exp(θ(s, a)) / ∑_{a′} exp(θ(s, a′))

54/ 74

slide-98
SLIDE 98

Policy gradient (PG) methods

Given an initial state distribution s ∼ ρ:

    maximize_π  V^π(ρ) := E_{s∼ρ}[ V^π(s) ]

softmax parameterization:  π_θ(a|s) = exp(θ(s, a)) / ∑_{a′} exp(θ(s, a′))

    maximize_θ  V^{π_θ}(ρ) := E_{s∼ρ}[ V^{π_θ}(s) ]

54/ 74

slide-99
SLIDE 99

Policy gradient (PG) methods

Given an initial state distribution s ∼ ρ:

    maximize_π  V^π(ρ) := E_{s∼ρ}[ V^π(s) ]

softmax parameterization:  π_θ(a|s) = exp(θ(s, a)) / ∑_{a′} exp(θ(s, a′))

    maximize_θ  V^{π_θ}(ρ) := E_{s∼ρ}[ V^{π_θ}(s) ]

PG method (Sutton et al. ’00):

    θ^{(t+1)} = θ^{(t)} + η ∇_θ V^{π_{θ^{(t)}}}(ρ),   t = 0, 1, · · ·

  • η: learning rate

54/ 74
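A minimal sketch (assumptions: tabular MDP, exact gradients) of one softmax PG step, using the standard expression ∂V^{π_θ}(ρ)/∂θ(s, a) = (1/(1−γ)) d_ρ^{π_θ}(s) π_θ(a|s) A^{π_θ}(s, a); the helper names and step size are illustrative.

```python
# Minimal sketch (illustrative): one exact policy-gradient step with softmax policies.
import numpy as np

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def pg_step(theta, P, r, rho, gamma, eta):
    """theta, r: (S, A); P: (S, A, S); rho: (S,) initial state distribution."""
    S, _ = theta.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)                    # V^pi
    Q = r + gamma * P.dot(V)                                               # Q^pi
    d_rho = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho) # visitation
    grad = d_rho[:, None] * pi * (Q - V[:, None]) / (1 - gamma)            # gradient wrt theta
    return theta + eta * grad
```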

slide-100
SLIDE 100

Booster 1: natural policy gradient (NPG)

precondition gradients to improve search directions  ⇒  natural gradient

NPG method (Kakade ’02):

    θ^{(t+1)} = θ^{(t)} + η ( F_ρ^θ )† ∇_θ V^{π_{θ^{(t)}}}(ρ),   t = 0, 1, · · ·

where F_ρ^θ := E[ ∇_θ log π_θ(a|s) ( ∇_θ log π_θ(a|s) )^⊤ ] is the Fisher information matrix

55/ 74

slide-101
SLIDE 101

Booster 2: entropy regularization

accelerate convergence by regularizing the objective function:

    V_τ^π(s_0) := E[ ∑_{t=0}^∞ γ^t ( r_t − τ log π(a_t|s_t) ) | s_0 ]
                = V^π(s_0) + τ/(1−γ) · E_{s∼d^π_{s_0}}[ − ∑_a π(a|s) log π(a|s) ]   (entropy)

  • τ: regularization parameter
  • d^π_{s_0}: discounted state visitation distribution

56/ 74

slide-102
SLIDE 102

Booster 2: entropy regularization

accelerate convergence by regularizing the objective function:

    V_τ^π(s_0) := E[ ∑_{t=0}^∞ γ^t ( r_t − τ log π(a_t|s_t) ) | s_0 ]
                = V^π(s_0) + τ/(1−γ) · E_{s∼d^π_{s_0}}[ − ∑_a π(a|s) log π(a|s) ]   (entropy)

  • τ: regularization parameter
  • d^π_{s_0}: discounted state visitation distribution

entropy-regularized value maximization:

    maximize_θ  V_τ^{π_θ}(ρ) := E_{s∼ρ}[ V_τ^{π_θ}(s) ]     (“soft” value function)

56/ 74

slide-103
SLIDE 103

Entropy-regularized natural gradient helps!

A toy bandit example: 3 arms with rewards 1, 0.9, and 0.1; increase regularization

[Figure: trajectories of Policy Gradient vs. Natural Policy Gradient in the (log π(a_1), log π(a_2)) plane, from the initialization π^(0) toward π⋆_τ]

57/ 74

slide-104
SLIDE 104

Unreasonable effectiveness in practice

Advantages of policy gradient type methods

  • allow for flexible parameterizations of policies
  • accommodate both continuous and discrete problems

TRPO = NPG + line search (Schulman et al. ’15);  A3C (Mnih et al. ’16);  SAC (Haarnoja et al. ’18)

58/ 74

slide-105
SLIDE 105

Challenge: non-concavity

59/ 74

slide-106
SLIDE 106

Challenge: non-concavity

Recent advances

  • PG for control (Fazel et al. ’18, Bhandari and Russo ’19)
  • PG for tabular MDPs (Agarwal et al. ’19, Bhandari and Russo ’19, Mei et al. ’20)
  • unregularized NPG for tabular MDPs (Agarwal et al. ’19, Bhandari and Russo ’20)
  • . . .

59/ 74

slide-107
SLIDE 107

This work: understanding entropy-regularized NPG methods in tabular settings

slide-108
SLIDE 108

Entropy-regularized NPG in tabular settings

An alternative expression in policy space (tabular setting):

    π^{(t+1)}(a|s) ∝ π^{(t)}(a|s)^{1 − ητ/(1−γ)} exp( η Q_τ^{(t)}(s, a) / (1−γ) ),   t = 0, 1, · · ·

  • Q_τ^{(t)}: soft Q-function of π^{(t)};  0 < η ≤ (1−γ)/τ: learning rate
  • invariant to the choice of initial state distribution ρ

61/ 74
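A minimal sketch of this multiplicative update in the exact-gradient setting; the soft Q-function Q_τ^{(t)} is assumed to be supplied by some evaluation routine, and the log-space normalization is an implementation choice for numerical stability, not part of the stated update.

```python
# Minimal sketch (illustrative): one tabular entropy-regularized NPG update,
# pi^{(t+1)}(a|s) ∝ pi^{(t)}(a|s)^{1 - eta*tau/(1-gamma)} * exp(eta * Q_tau(s,a)/(1-gamma)).
import numpy as np

def npg_entropy_step(pi, Q_tau, eta, tau, gamma):
    """pi, Q_tau: (S, A) arrays; requires 0 < eta <= (1 - gamma)/tau."""
    log_pi_new = (1 - eta * tau / (1 - gamma)) * np.log(pi) + eta * Q_tau / (1 - gamma)
    log_pi_new -= log_pi_new.max(axis=1, keepdims=True)   # stabilize before exponentiating
    pi_new = np.exp(log_pi_new)
    return pi_new / pi_new.sum(axis=1, keepdims=True)     # renormalize row-wise
```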

slide-109
SLIDE 109

Linear convergence with exact gradients

  • optimal policy: π⋆_τ;  optimal “soft” Q-function: Q⋆_τ := Q_τ^{π⋆_τ}

Exact oracle: perfect gradient evaluation

Theorem 5 (Cen, Cheng, Chen, Wei, Chi ’20)
For any 0 < η ≤ (1 − γ)/τ, entropy-regularized NPG achieves

    ‖Q⋆_τ − Q_τ^{(t+1)}‖∞ ≤ C_1 γ (1 − ητ)^t ,   t = 0, 1, · · ·

where C_1 := ‖Q⋆_τ − Q_τ^{(0)}‖∞ + 2τ ( 1 − ητ/(1−γ) ) ‖log π⋆_τ − log π^{(0)}‖∞

62/ 74

slide-110
SLIDE 110

Implications: iteration complexity

number of iterations needed to reach ‖Q⋆_τ − Q_τ^{(t+1)}‖∞ ≤ ε is at most

  • general learning rates (0 < η < (1−γ)/τ):   (1/(ητ)) log( C_1 γ / ε )
  • soft policy iteration (η = (1−γ)/τ):   (1/(1−γ)) log( ‖Q⋆_τ − Q_τ^{(0)}‖∞ γ / ε )

63/ 74
slide-111
SLIDE 111

Implications: iteration complexity

number of iterations needed to reach ‖Q⋆_τ − Q_τ^{(t+1)}‖∞ ≤ ε is at most

  • general learning rates (0 < η < (1−γ)/τ):   (1/(ητ)) log( C_1 γ / ε )
  • soft policy iteration (η = (1−γ)/τ):   (1/(1−γ)) log( ‖Q⋆_τ − Q_τ^{(0)}‖∞ γ / ε )

Nearly dimension-free global linear convergence!

63/ 74

slide-112
SLIDE 112

Regularized NPG vs. unregularized NPG

regularized NPG (τ = 0.001) vs. unregularized NPG (τ = 0)

[Figure: ‖Q⋆_τ − Q_τ^{(t)}‖∞ (regularized) and ‖Q⋆ − Q^{(t)}‖∞ (unregularized) vs. #iterations, for η = 0.01, 0.1, 1]

  • linear rate (ours):  (1/(ητ)) log(1/ε)
  • sublinear rate (Agarwal et al. ’19):  1 / ( min{η, (1−γ)²} ε )

64/ 74

slide-113
SLIDE 113

Regularized NPG vs. unregularized NPG

regularized NPG (τ = 0.001) vs. unregularized NPG (τ = 0)

[Figure: ‖Q⋆_τ − Q_τ^{(t)}‖∞ (regularized) and ‖Q⋆ − Q^{(t)}‖∞ (unregularized) vs. #iterations, for η = 0.01, 0.1, 1]

  • linear rate (ours):  (1/(ητ)) log(1/ε)
  • sublinear rate (Agarwal et al. ’19):  1 / ( min{η, (1−γ)²} ε )

Entropy regularization enables faster convergence!

64/ 74

slide-114
SLIDE 114

Returning to the original MDP?

How to employ entropy-regularized NPG to find an ε-optimal policy for the original (unregularized) MDP?

  • it suffices to find an ε/2-optimal policy of the regularized MDP with regularization parameter τ = (1−γ)ε / (4 log |A|)
  • the iteration complexity is the same as before (up to log factors)

65/ 74

slide-115
SLIDE 115

Entropy-regularized NPG with inexact gradients

Inexact oracle: inexact evaluation of Q_τ^{(t)}, which returns Q̂_τ^{(t)} s.t.

    ‖ Q̂_τ^{(t)} − Q_τ^{(t)} ‖∞ ≤ δ,

e.g. using sample-based estimators

66/ 74

slide-116
SLIDE 116

Entropy-regularized NPG with inexact gradients

Inexact oracle: inexact evaluation of Q_τ^{(t)}, which returns Q̂_τ^{(t)} s.t.

    ‖ Q̂_τ^{(t)} − Q_τ^{(t)} ‖∞ ≤ δ,

e.g. using sample-based estimators

Inexact entropy-regularized NPG:

    π^{(t+1)}(a|s) ∝ π^{(t)}(a|s)^{1 − ητ/(1−γ)} exp( η Q̂_τ^{(t)}(s, a) / (1−γ) )

66/ 74
slide-117
SLIDE 117

Entropy-regularized NPG with inexact gradients

Inexact oracle: inexact evaluation of Q_τ^{(t)}, which returns Q̂_τ^{(t)} s.t.

    ‖ Q̂_τ^{(t)} − Q_τ^{(t)} ‖∞ ≤ δ,

e.g. using sample-based estimators

Inexact entropy-regularized NPG:

    π^{(t+1)}(a|s) ∝ π^{(t)}(a|s)^{1 − ητ/(1−γ)} exp( η Q̂_τ^{(t)}(s, a) / (1−γ) )

Question: stability vis-à-vis inexact gradient evaluation?

66/ 74

slide-118
SLIDE 118

Linear convergence with inexact gradients

Suppose ‖ Q̂_τ^{(t)} − Q_τ^{(t)} ‖∞ ≤ δ for all t.

Theorem 6 (Cen, Cheng, Chen, Wei, Chi ’20)
For any stepsize 0 < η ≤ (1 − γ)/τ, inexact entropy-regularized NPG attains

    ‖Q⋆_τ − Q_τ^{(t+1)}‖∞ ≤ γ (1 − ητ)^t C_1 + C_2

  • C_1 := ‖Q⋆_τ − Q_τ^{(0)}‖∞ + 2τ ( 1 − ητ/(1−γ) ) ‖log π⋆_τ − log π^{(0)}‖∞
  • C_2 := 2γ ( 1 + γ/(ητ) ) δ / (1−γ) : error floor

  • converges linearly at the same rate until an error floor is hit

67/ 74

slide-119
SLIDE 119

A little analysis when η = (1−γ)/τ

slide-120
SLIDE 120

A key lemma: monotonic performance improvement

performance improvement: V_τ^{(t)} → V_τ^{(t+1)}

    V_τ^{(t+1)}(ρ) − V_τ^{(t)}(ρ) = E_{s∼d_ρ^{(t+1)}}[ ( 1/η − τ/(1−γ) ) KL( π^{(t+1)}(·|s) ‖ π^{(t)}(·|s) ) + (1/η) KL( π^{(t)}(·|s) ‖ π^{(t+1)}(·|s) ) ]

69/ 74
slide-121
SLIDE 121

A key lemma: monotonic performance improvement

performance improvement: V_τ^{(t)} → V_τ^{(t+1)}

    V_τ^{(t+1)}(ρ) − V_τ^{(t)}(ρ) = E_{s∼d_ρ^{(t+1)}}[ ( 1/η − τ/(1−γ) ) KL( π^{(t+1)}(·|s) ‖ π^{(t)}(·|s) ) + (1/η) KL( π^{(t)}(·|s) ‖ π^{(t+1)}(·|s) ) ]

    ≥ 0   if 0 < η ≤ (1 − γ)/τ

69/ 74
slide-122
SLIDE 122

“Soft” Bellman operator

    T_τ(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{π(·|s′)} E_{a′∼π(·|s′)}[ Q(s′, a′) − τ log π(a′|s′) ] ]

  • r(s, a): immediate reward;  inner max: next state’s value with an entropy regularizer

70/ 74
slide-123
SLIDE 123

“Soft” Bellman operator

    T_τ(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{π(·|s′)} E_{a′∼π(·|s′)}[ Q(s′, a′) − τ log π(a′|s′) ] ]

  • r(s, a): immediate reward;  inner max: next state’s value with an entropy regularizer

Soft Bellman equation: Q⋆_τ is the unique solution to T_τ(Q⋆_τ) = Q⋆_τ

γ-contraction of the soft Bellman operator: ‖T_τ(Q_1) − T_τ(Q_2)‖∞ ≤ γ ‖Q_1 − Q_2‖∞

70/ 74
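Since the inner maximization has the closed form max_{π(·|s′)} E_{a′∼π}[ Q(s′, a′) − τ log π(a′|s′) ] = τ log ∑_{a′} exp( Q(s′, a′)/τ ), the soft Bellman operator can be sketched as follows; array shapes are assumptions.

```python
# Minimal sketch: the soft Bellman operator T_tau, using the log-sum-exp closed form
# of the entropy-regularized inner maximization.
import numpy as np
from scipy.special import logsumexp

def soft_bellman_operator(Q, P, r, gamma, tau):
    """Q, r: (S, A); P: (S, A, S); tau > 0."""
    soft_V = tau * logsumexp(Q / tau, axis=1)   # soft value of each next state s'
    return r + gamma * P.dot(soft_V)            # expectation over s' ~ P(.|s, a)
```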

slide-124
SLIDE 124

policy iteration

[Diagram: policy iteration alternates “evaluate” steps (π^(t) → Q^{π^(t)}, via the Bellman operator) and “greedy” steps (Q^{π^(t)} → π^(t+1)), converging to Q⋆ and π⋆]

71/ 74

slide-125
SLIDE 125

policy iteration

[Diagram: policy iteration alternates “evaluate” and “greedy” steps via the Bellman operator, converging to Q⋆ and π⋆]

soft policy iteration (η = (1−γ)/τ)

[Diagram: soft policy iteration alternates “evaluate” steps (π^(t) → Q_τ^{π^(t)}, via the soft Bellman operator) and “soft greedy” steps, converging to Q⋆_τ and π⋆_τ]

71/ 74

slide-126
SLIDE 126

Summary

Global linear convergence of entropy-regularized NPG methods for tabular discounted MDPs

[Figures: NPG trajectory toward π⋆_τ on the toy bandit example; linear decay of ‖Q⋆_τ − Q_τ^{(t)}‖∞ over iterations for η = 0.01, 0.1, 1]

future directions:

  • function approximation
  • sample complexities
  • soft actor-critic algorithms

72/ 74

slide-127
SLIDE 127

Concluding remarks

Understanding RL requires modern statistics and optimization

73/ 74

slide-128
SLIDE 128

Concluding remarks

Understanding RL requires modern statistics and optimization

future directions

  • beyond tabular settings
  • finite-horizon episodic MDPs
  • multi-agent RL (e.g. Markov games)
  • . . .

73/ 74

slide-129
SLIDE 129

Papers:

  • “Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020
  • “Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020
  • “Fast global convergence of natural policy gradient methods with entropy regularization,” S. Cen, C. Cheng, Y. Chen, Y. Wei, Y. Chi, arXiv:2007.06558, 2020