

SLIDE 1

Sample Complexity of Asynchronous Q-Learning:
Sharper non-asymptotic analysis and variance reduction

Yuxin Chen, EE, Princeton University

SLIDE 2

Gen Li (Tsinghua EE), Yuting Wei (CMU Statistics), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

“Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020

SLIDE 3

Reinforcement learning (RL)

SLIDE 4

RL challenges

  • Unknown or changing environments
  • Delayed rewards
  • Enormous state and action space

SLIDE 5

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

SLIDE 6

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

Calls for in-depth understanding about sample efficiency of RL algorithms

SLIDE 8

This talk: a classical example — Q-learning

SLIDE 9

Background: Markov decision processes

SLIDE 10

Markov decision process (MDP)

  • S: state space
  • A: action space

SLIDE 11

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward

SLIDE 12

Markov decision process (MDP)

  • state space S: positions in the maze
  • action space A: up, down, left, right
  • immediate reward r(s, a): cheese, electricity shocks, cats

SLIDE 13

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)

SLIDE 14

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)
  • P(·|s, a): unknown transition probabilities
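To make these components concrete, here is a minimal sketch of a toy tabular MDP in Python; the sizes, rewards, and transition probabilities below are made up for illustration and are not from the talk.

```python
import numpy as np

# A toy tabular MDP (sizes and values are hypothetical, for illustration only).
num_states, num_actions = 4, 2
rng = np.random.default_rng(0)

r = rng.uniform(0.0, 1.0, size=(num_states, num_actions))    # r(s, a) in [0, 1]
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
# P[s, a] plays the role of the (unknown-to-the-learner) transition law P(. | s, a)

pi_b = np.full((num_states, num_actions), 1.0 / num_actions)  # a policy pi(. | s)

def step(s, a):
    """Sample one transition of the MDP: returns (reward, next state)."""
    return r[s, a], rng.choice(num_states, p=P[s, a])
```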

SLIDE 15

Value function

Value of policy π: long-term discounted reward

∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r_t | s_0 = s ]
SLIDE 16

Value function

Value of policy π: long-term discounted reward

∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r_t | s_0 = s ]

  • γ ∈ [0, 1): discount factor
  • (a_0, s_1, a_1, s_2, a_2, · · · ): generated under policy π
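As a sanity check on this definition, V^π(s) can be approximated by rolling out the policy and averaging truncated discounted returns. A minimal sketch, reusing the hypothetical `step` and `pi_b` objects from the toy MDP snippet above:

```python
import numpy as np

def monte_carlo_value(s0, policy, step_fn, gamma=0.9, num_episodes=1000, horizon=200):
    """Average of truncated discounted returns approximates V^pi(s0)."""
    rng = np.random.default_rng(1)
    returns = []
    for _ in range(num_episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite sum
            a = rng.choice(policy.shape[1], p=policy[s])
            reward, s = step_fn(s, a)
            ret += discount * reward
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))

# e.g. monte_carlo_value(0, pi_b, step) estimates V^{pi_b}(0) for the toy MDP
```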

SLIDE 17

Action-value function (a.k.a. Q-function)

Q-function of policy π

∀(s, a) ∈ S × A :  Q^π(s, a) := E[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]

  • (s_1, a_1, s_2, a_2, · · · ): generated under policy π, with a_0 = a fixed

SLIDE 19

Optimal policy and optimal value

SLIDE 20

Optimal policy and optimal value

  • optimal policy π⋆: maximizing value

SLIDE 21

Optimal policy and optimal value

  • optimal policy π⋆: maximizing value
  • optimal value / Q-function: V⋆ := V^{π⋆}, Q⋆ := Q^{π⋆}
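For reference, the standard relations behind these definitions (standard MDP facts, left implicit on the slide) can be written as:

```latex
\[
V^\star(s) = \max_{\pi} V^{\pi}(s), \qquad
Q^\star(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad
\pi^\star(s) \in \arg\max_{a \in \mathcal{A}} Q^\star(s,a),
\]
% so recovering Q^* immediately yields an optimal (deterministic) policy.
```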

SLIDE 23

Need to learn optimal value / policy from data samples

SLIDE 24

Markovian samples and behavior policy

Observed: {s_t, a_t, r_t}_{t≥0}

  • Markovian trajectory generated by behavior policy π_b

Goal: learn optimal value V⋆ and Q⋆ based on sample trajectory

SLIDE 25

Markovian samples and behavior policy

Key quantities of sample trajectory

  • minimum state-action occupancy probability: µ_min := min_{(s,a)} µ_{π_b}(s, a), where µ_{π_b} is the stationary distribution of the sample trajectory
  • mixing time: t_mix
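For reference, these two quantities are conventionally defined as below; the 1/4 total-variation threshold in the mixing-time definition is the usual convention and is stated here as an assumption, not a quote from the slides.

```latex
\[
\mu_{\min} := \min_{(s,a) \in \mathcal{S}\times\mathcal{A}} \mu_{\pi_b}(s,a),
\qquad
t_{\mathsf{mix}} := \min\Big\{ t : \max_{(s_0,a_0)}
  d_{\mathrm{TV}}\big(P^{t}(\cdot \mid s_0,a_0),\, \mu_{\pi_b}\big) \le \tfrac{1}{4} \Big\},
\]
% where \mu_{\pi_b} is the stationary state-action distribution induced by \pi_b
% and d_{TV} denotes total-variation distance.
```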

SLIDE 26

Asynchronous Q-learning (on Markovian samples)

SLIDE 27

Model-based vs. model-free RL

Model-based approach (“plug-in”)

  • 1. build empirical estimate P̂ for P
  • 2. planning based on empirical P̂

SLIDE 28

Model-based vs. model-free RL

Model-based approach (“plug-in”)

  • 1. build empirical estimate P̂ for P
  • 2. planning based on empirical P̂

Model-free approach — learning w/o modeling & estimating environment explicitly

SLIDE 29

Q-learning: a classical model-free algorithm

Chris Watkins Peter Dayan

Stochastic approximation (Robbins & Monro ’51) for solving Bellman equation Q = T(Q)

SLIDE 30

Aside: Bellman optimality principle

Bellman operator:

T(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)} [ max_{a′∈A} Q(s′, a′) ]

  • r(s, a): immediate reward
  • max_{a′∈A} Q(s′, a′): next state’s value
  • one-step look-ahead

SLIDE 31

Aside: Bellman optimality principle

Bellman operator:

T(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)} [ max_{a′∈A} Q(s′, a′) ]

  • r(s, a): immediate reward
  • max_{a′∈A} Q(s′, a′): next state’s value
  • one-step look-ahead

Bellman equation: Q⋆ is unique solution to T(Q⋆) = Q⋆

Richard Bellman
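To connect the formula to computation, here is a small sketch of the Bellman operator on the toy tabular MDP from earlier (the arrays `r` and `P` are the hypothetical ones defined above); because T is a γ-contraction in the sup norm, iterating it converges to Q⋆.

```python
import numpy as np

def bellman_operator(Q, r, P, gamma):
    """T(Q)(s, a) = r(s, a) + gamma * E_{s'~P(.|s,a)} [ max_a' Q(s', a') ]."""
    return r + gamma * P @ Q.max(axis=1)      # P has shape (S, A, S); result is (S, A)

def q_value_iteration(r, P, gamma=0.9, tol=1e-8):
    """Iterate T to (numerical) fixed point, which is Q*."""
    Q = np.zeros_like(r)
    while True:
        Q_next = bellman_operator(Q, r, P, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
```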

SLIDE 32

Q-learning: a classical model-free algorithm

Chris Watkins Peter Dayan

Stochastic approximation for solving Bellman equation Q = T(Q):

Q_{t+1}(s_t, a_t) = (1 − η_t) Q_t(s_t, a_t) + η_t T_t(Q_t)(s_t, a_t),   t ≥ 0

  • only update the (s_t, a_t)-th entry

SLIDE 33

Q-learning: a classical model-free algorithm

Chris Watkins Peter Dayan

Stochastic approximation for solving Bellman equation Q = T(Q):

Q_{t+1}(s_t, a_t) = (1 − η_t) Q_t(s_t, a_t) + η_t T_t(Q_t)(s_t, a_t),   t ≥ 0

  • only update the (s_t, a_t)-th entry

T_t(Q)(s_t, a_t) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)

T(Q)(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)} [ max_{a′} Q(s′, a′) ]
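Putting the update rule into code, a minimal sketch of asynchronous Q-learning along a single Markovian trajectory; the behavior policy, constant stepsize, and iteration count below are illustrative choices, not the ones analyzed in the paper.

```python
import numpy as np

def async_q_learning(step_fn, behavior_policy, num_states, num_actions,
                     gamma=0.9, eta=0.1, num_iters=100_000, s0=0):
    """Asynchronous Q-learning: each iteration updates only the (s_t, a_t)-th entry
    of Q with the empirical Bellman target r + gamma * max_a' Q(s_{t+1}, a')."""
    rng = np.random.default_rng(2)
    Q = np.zeros((num_states, num_actions))
    s = s0
    for _ in range(num_iters):
        a = rng.choice(num_actions, p=behavior_policy[s])   # a_t ~ pi_b(. | s_t)
        reward, s_next = step_fn(s, a)
        target = reward + gamma * Q[s_next].max()           # T_t(Q_t)(s_t, a_t)
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target        # update one entry only
        s = s_next                                          # stay on the same trajectory
    return Q
```

With the toy MDP above, `async_q_learning(step, pi_b, num_states, num_actions)` should approach the Q⋆ returned by `q_value_iteration` as the number of iterations grows.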

SLIDE 34

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration

SLIDE 35

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration
  • resembles Markov-chain coordinate descent

SLIDE 36

Q-learning on Markovian samples

  • asynchronous: only a single entry is updated each iteration
  • resembles Markov-chain coordinate descent
  • off-policy: target policy π⋆ ≠ behavior policy π_b

SLIDE 37

A highly incomplete list of prior work

  • Watkins, Dayan ’92
  • Tsitsiklis ’94
  • Jaakkola, Jordan, Singh ’94
  • Szepesvári ’98

  • Kearns, Singh ’99
  • Borkar, Meyn ’00
  • Even-Dar, Mansour ’03
  • Beck, Srikant ’12
  • Chi, Zhu, Bubeck, Jordan ’18
  • Shah, Xie ’18
  • Lee, He ’18
  • Wainwright ’19
  • Chen, Zhang, Doan, Maguluri, Clarke ’19
  • Yang, Wang ’19
  • Du, Lee, Mahajan, Wang ’20
  • Chen, Maguluri, Shakkottai, Shanmugam ’20
  • Qu, Wierman ’20
  • Devraj, Meyn ’20
  • Weng, Gupta, He, Ying, Srikant ’20
  • ...

SLIDE 38

What is sample complexity of (async) Q-learning?

SLIDE 39

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q − Q⋆‖∞ ≤ ε?

paper / sample complexity / learning rate:

  • Even-Dar & Mansour ’03:  (t_cover)^{1/(1−γ)} / ((1−γ)^4 ε^2)   (linear: 1/t)
  • Even-Dar & Mansour ’03:  ( t_cover^{1+3ω} / ((1−γ)^4 ε^2) )^{1/ω} + ( t_cover / (1−γ) )^{1/(1−ω)}   (polynomial: 1/t^ω, ω ∈ (1/2, 1))
  • Beck & Srikant ’12:  t_cover^3 |S||A| / ((1−γ)^5 ε^2)   (constant)
  • Qu & Wierman ’20:  t_mix / (µ_min^2 (1−γ)^5 ε^2)   (rescaled linear)

SLIDE 40

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q − Q⋆‖∞ ≤ ε?

if we take µ_min ≍ 1/(|S||A|),  t_cover ≍ t_mix / µ_min

SLIDE 41

Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q − Q⋆‖∞ ≤ ε?

if we take µ_min ≍ 1/(|S||A|),  t_cover ≍ t_mix / µ_min

All prior results require sample size of at least t_mix |S|^2 |A|^2!

SLIDE 43

Main result: ℓ∞-based sample complexity

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), sample complexity of async Q-learning to yield ‖Q − Q⋆‖∞ ≤ ε is at most (up to some log factor)

1 / (µ_min (1 − γ)^5 ε^2) + t_mix / (µ_min (1 − γ))

SLIDE 44

Main result: ℓ∞-based sample complexity

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), sample complexity of async Q-learning to yield ‖Q − Q⋆‖∞ ≤ ε is at most (up to some log factor)

1 / (µ_min (1 − γ)^5 ε^2) + t_mix / (µ_min (1 − γ))

  • Improves upon prior art by at least |S||A|!

— prior art: t_mix / (µ_min^2 (1 − γ)^5 ε^2) (Qu & Wierman ’20)

SLIDE 45

Effect of mixing time on sample complexity

1 / (µ_min (1 − γ)^5 ε^2) + t_mix / (µ_min (1 − γ))

  • reflects the cost taken to reach steady state
  • one-time expense (almost independent of ε)

— it becomes amortized as the algorithm runs

SLIDE 46

Effect of mixing time on sample complexity

1 / (µ_min (1 − γ)^5 ε^2) + t_mix / (µ_min (1 − γ))

  • reflects the cost taken to reach steady state
  • one-time expense (almost independent of ε)

— it becomes amortized as the algorithm runs
— prior art: t_mix / (µ_min^2 (1 − γ)^5 ε^2) (Qu & Wierman ’20)

SLIDE 47

Learning rates

Our choice: constant stepsize η_t ≡ min{ (1−γ)^4 ε^2 / γ^2 , 1/t_mix }

  • Qu & Wierman ’20: rescaled linear η_t = [1/(µ_min(1−γ))] / ( t + max{ 1/(µ_min(1−γ)) , t_mix } )
  • Beck & Srikant ’12: constant η_t ≡ (1−γ)^4 ε^2 / (|S||A| t_cover^2) (too conservative)
  • Even-Dar & Mansour ’03: polynomial η_t = t^{−ω} (ω ∈ (1/2, 1])

SLIDE 48

Minimax lower bound

minimax lower bound (Azar et al. ’13):  1 / (µ_min (1 − γ)^3 ε^2)

async Q-learning (ignoring dependency on t_mix):  1 / (µ_min (1 − γ)^5 ε^2)

SLIDE 49

Minimax lower bound

minimax lower bound (Azar et al. ’13):  1 / (µ_min (1 − γ)^3 ε^2)

async Q-learning (ignoring dependency on t_mix):  1 / (µ_min (1 − γ)^5 ε^2)

Can we improve dependency on discount complexity 1/(1−γ)?

SLIDE 50

One strategy: variance reduction

— inspired by Johnson & Zhang ’13, Wainwright ’19

Variance-reduced Q-learning updates:

Q_t(s_t, a_t) = (1 − η) Q_{t−1}(s_t, a_t) + η ( T_t(Q_{t−1}) − T_t(Q̄) + T̃(Q̄) )(s_t, a_t)

  • use Q̄ to help reduce variability
  • Q̄: some reference Q-estimate
  • T̃: empirical Bellman operator (using a batch of samples)

SLIDE 51

Variance-reduced Q-learning

— inspired by Johnson & Zhang ’13, Wainwright ’19

for each epoch:

  • 1. update reference Q̄ and T̃(Q̄)
  • 2. run variance-reduced Q-learning updates
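A stylized sketch of this epoch structure in code; it mirrors the two steps above, but the epoch count, batch size, inner iteration count, and stepsize are illustrative placeholders, not the schedule analyzed in the paper.

```python
import numpy as np

def vr_q_learning(step_fn, behavior_policy, num_states, num_actions,
                  gamma=0.9, eta=0.5, num_epochs=10, batch_size=5_000,
                  inner_iters=20_000, s0=0):
    """Stylized variance-reduced Q-learning on one Markovian trajectory."""
    rng = np.random.default_rng(3)
    Q = np.zeros((num_states, num_actions))
    s = s0
    for _ in range(num_epochs):
        # 1. update the reference Q_bar and a batch estimate T_tilde(Q_bar)
        Q_bar = Q.copy()
        T_tilde = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for _ in range(batch_size):
            a = rng.choice(num_actions, p=behavior_policy[s])
            reward, s_next = step_fn(s, a)
            T_tilde[s, a] += reward + gamma * Q_bar[s_next].max()
            counts[s, a] += 1
            s = s_next
        # average per entry; unvisited (s, a) pairs fall back to Q_bar as a crude default
        T_tilde = np.divide(T_tilde, counts, out=Q_bar.copy(), where=counts > 0)

        # 2. run variance-reduced updates: T_t(Q) - T_t(Q_bar) + T_tilde(Q_bar)
        for _ in range(inner_iters):
            a = rng.choice(num_actions, p=behavior_policy[s])
            reward, s_next = step_fn(s, a)
            target = (reward + gamma * Q[s_next].max()) \
                     - (reward + gamma * Q_bar[s_next].max()) + T_tilde[s, a]
            Q[s, a] = (1 - eta) * Q[s, a] + eta * target
            s = s_next
    return Q
```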

SLIDE 52

Main result: ℓ∞-based sample complexity

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1, sample complexity for (async) variance-reduced Q-learning to yield ‖Q − Q⋆‖∞ ≤ ε is at most on the order of

1 / (µ_min (1 − γ)^3 ε^2) + t_mix / (µ_min (1 − γ))

  • more aggressive learning rates: η_t ≡ min{ (1−γ)^2 / γ^2 , 1/t_mix } (the (1−γ)^4 ε^2 factor from before is crossed out)
SLIDE 53

Main result: ℓ∞-based sample complexity

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1, sample complexity for (async) variance-reduced Q-learning to yield ‖Q − Q⋆‖∞ ≤ ε is at most on the order of

1 / (µ_min (1 − γ)^3 ε^2) + t_mix / (µ_min (1 − γ))

  • more aggressive learning rates: η_t ≡ min{ (1−γ)^2 / γ^2 , 1/t_mix } (the (1−γ)^4 ε^2 factor from before is crossed out)

  • minimax-optimal for 0 < ε ≤ 1

SLIDE 54

Concluding remarks

Understanding RL requires modern statistics and optimization

SLIDE 55

Concluding remarks

Understanding RL requires modern statistics and optimization

Future directions:

  • function approximation
  • finite-horizon episodic MDPs
  • on-policy algorithms like SARSA
  • general Markov-chain-based optimization algorithms

SLIDE 56

Paper:

“Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2006.03041, NeurIPS 2020