Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods

SLIDE 1

Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods

Yuxin Chen, EE, Princeton University

SLIDE 2

SLIDE 3

Gen Li (Tsinghua EE), Yuting Wei (CMU Statistics), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

“Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

SLIDE 4

Gen Li (Tsinghua EE), Yuting Wei (Berkeley Stat Ph.D.), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

“Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

SLIDE 5

Reinforcement learning (RL)

SLIDE 6

RL challenges

In RL, an agent learns by interacting with an environment

  • unknown or changing environments
  • delayed rewards or feedback
  • enormous state and action space
  • nonconvexity

SLIDE 7

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

SLIDE 8

Sample efficiency

Collecting data samples might be expensive or time-consuming:

  • clinical trials
  • online ads

Calls for design of sample-efficient RL algorithms!

SLIDE 9

Background: Markov decision processes

SLIDE 10

Markov decision process (MDP)

  • S: state space
  • A: action space

SLIDE 11

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward

SLIDE 12

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)

SLIDE 13

Markov decision process (MDP)

  • S: state space
  • A: action space
  • r(s, a) ∈ [0, 1]: immediate reward
  • π(·|s): policy (or action selection rule)
  • P(·|s, a): unknown transition probabilities

SLIDE 14

Value function

Value of policy π: long-term discounted reward

∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]

SLIDE 15

Value function

Value of policy π: long-term discounted reward

∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]

  • (a_0, s_1, a_1, s_2, a_2, · · · ): generated under policy π

SLIDE 16

Value function

Value of policy π: long-term discounted reward

∀s ∈ S :  V^π(s) := E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]

  • (a_0, s_1, a_1, s_2, a_2, · · · ): generated under policy π
  • γ ∈ [0, 1): discount factor
  • take γ → 1 to approximate long-horizon MDPs
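
To make the definition concrete, here is a minimal Monte Carlo sketch (not from the talk) that estimates V^π(s) by averaging truncated discounted returns; the `step` and `policy` interfaces and the truncation horizon are illustrative assumptions.

```python
# Minimal sketch (not from the talk): Monte Carlo estimate of V^pi(s0) by averaging
# truncated discounted returns over sampled rollouts.
import numpy as np

def mc_value_estimate(step, policy, s0, gamma=0.99, horizon=500, n_rollouts=1000, seed=0):
    """step(s, a, rng) -> (s_next, reward) and policy(s, rng) -> a are assumed interfaces."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite discounted sum
            a = policy(s, rng)
            s, r = step(s, a, rng)
            ret += discount * r
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))        # approximates V^pi(s0)
```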

SLIDE 17

Optimal policy and optimal values

  • Optimal policy π⋆: maximizing the value function

SLIDE 18

Optimal policy and optimal values

  • Optimal policy π⋆: maximizing the value function
  • Optimal values: V ⋆ := V π⋆

SLIDE 19

When the model is known . . .

[Diagram: MDP specification (P, r) → planning oracle (e.g. policy iteration) → π⋆]

Planning: computing the optimal policy π⋆ given the MDP specification
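
For concreteness, a minimal sketch of one such planning oracle, policy iteration on a tabular MDP with a known model; the array shapes (P of shape (S, A, S), r of shape (S, A)) are assumptions for the sketch.

```python
# Minimal planning-oracle sketch: policy iteration on a tabular MDP with a known model.
# Assumed shapes: P is (S, A, S) with P[s, a, s'] = P(s' | s, a); r is (S, A).
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                      # arbitrary initial deterministic policy
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = P[np.arange(S), pi]                   # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), pi]                   # (S,) reward vector under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement: greedy with respect to the one-step lookahead Q
        Q = r + gamma * P @ V                        # (S, A)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):               # fixed point => optimal policy
            return pi, V
        pi = pi_new
```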

SLIDE 20

When the model is unknown . . .

Need to learn optimal policy from samples w/o model specification

SLIDE 21

This talk: RL with a generative model / simulator

— Kearns, Singh ’99

For each state-action pair (s, a), collect N samples {(s, a, s′_{(i)})}_{1≤i≤N}

SLIDE 22

Question: how many samples are sufficient to learn an ε-optimal policy π̂ ?

SLIDE 23

Question: how many samples are sufficient to learn an ε-optimal policy π̂, i.e.,

∀s :  V^{π̂}(s) ≥ V⋆(s) − ε ?

SLIDE 24

An incomplete list of prior art

  • Kearns & Singh ’99
  • Kakade ’03
  • Kearns, Mansour & Ng ’02
  • Azar, Munos & Kappen ’12
  • Azar, Munos, Ghavamzadeh & Kappen ’13
  • Sidford, Wang, Wu, Yang & Ye ’18
  • Sidford, Wang, Wu & Ye ’18
  • Wang ’17
  • Agarwal, Kakade & Yang ’19
  • Wainwright ’19a
  • Wainwright ’19b
  • Pananjady & Wainwright ’20
  • Yang & Wang ’19
  • Khamaru, Pananjady, Ruan, Wainwright & Jordan ’20
  • Mou, Li, Wainwright, Bartlett & Jordan ’20
  • . . .

SLIDE 25

An even shorter list of prior art

  • empirical QVI (Azar et al. ’13): sample size range [|S|²|A|/(1−γ)², ∞); sample complexity |S||A| / ((1−γ)³ε²); ε-range (0, 1/√((1−γ)|S|)]
  • sublinear randomized VI (Sidford et al. ’18a): sample size range [|S||A|/(1−γ)², ∞); sample complexity |S||A| / ((1−γ)⁴ε²); ε-range (0, 1/(1−γ)]
  • variance-reduced QVI (Sidford et al. ’18b): sample size range [|S||A|/(1−γ)³, ∞); sample complexity |S||A| / ((1−γ)³ε²); ε-range (0, 1]
  • empirical MDP + planning (Agarwal et al. ’19): sample size range [|S||A|/(1−γ)², ∞); sample complexity |S||A| / ((1−γ)³ε²); ε-range (0, 1/√(1−γ)]

— see also Wainwright ’19 (for estimating optimal values)

SLIDE 26

SLIDE 27

SLIDE 28

All prior theory requires sample size > |S||A| / (1 − γ)²

  • sample size barrier

SLIDE 29

Is it possible to close the gap?

SLIDE 30

Two approaches

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ for P
  • 2. planning based on the empirical P̂

SLIDE 31

Two approaches

Model-based approach (“plug-in”)

  • 1. build an empirical estimate P̂ for P
  • 2. planning based on the empirical P̂

Model-free approach (e.g. Q-learning, SARSA) — learning w/o estimating the model explicitly

SLIDE 33

Model estimation

Sampling: for each (s, a), collect N ind. samples {(s, a, s′_{(i)})}_{1≤i≤N}

SLIDE 34

Model estimation

Sampling: for each (s, a), collect N ind. samples {(s, a, s′_{(i)})}_{1≤i≤N}

Empirical estimates: estimate P(s′|s, a) by the empirical frequency

P̂(s′|s, a) := (1/N) ∑_{i=1}^{N} 1{s′_{(i)} = s′}
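
A minimal sketch of this estimation step with a generative model; `sample_next_state(s, a, rng)` is an assumed simulator interface, not part of the talk.

```python
# Minimal sketch: empirical transition model P_hat(s' | s, a) from N generative-model
# samples per (s, a), i.e. the empirical frequencies above.
import numpy as np

def estimate_model(sample_next_state, S, A, N, seed=0):
    """sample_next_state(s, a, rng) -> s' is an assumed generative-model call."""
    rng = np.random.default_rng(seed)
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0   # count each observed s'
    return P_hat / N                                               # empirical frequency
```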

SLIDE 35

Model-based (plug-in) estimator

— Azar et al. ’13, Agarwal et al. ’19, Pananjady et al. ’20

[Diagram: empirical MDP (P̂, r) → planning oracle (e.g. policy iteration) → π̂⋆]

Planning based on the empirical MDP

SLIDE 36

Our method: plug-in estimator + perturbation

— Li, Wei, Chi, Gu, Chen ’20

[Diagram: perturb rewards r → r_p; empirical MDP (P̂, r_p) → planning oracle (e.g. policy iteration) → π̂⋆_p]

Run planning algorithms based on the empirical MDP with slightly perturbed rewards
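
A minimal end-to-end sketch of this pipeline, reusing estimate_model and policy_iteration from the earlier sketches; the uniform perturbation scale `xi` is an illustrative choice, not the paper's exact prescription.

```python
# Minimal sketch of the perturbed plug-in pipeline (reuses estimate_model and
# policy_iteration from the earlier sketches). The perturbation scale xi is an
# illustrative choice, not the exact prescription from the paper.
import numpy as np

def perturbed_plugin_policy(sample_next_state, r, S, A, N, gamma=0.9, xi=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    P_hat = estimate_model(sample_next_state, S, A, N, seed)   # 1. empirical model
    r_p = r + xi * rng.random((S, A))                          # 2. slightly perturb rewards
    pi_hat_p, _ = policy_iteration(P_hat, r_p, gamma)          # 3. plan on the empirical MDP
    return pi_hat_p
```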

SLIDE 37

Challenges in the sample-starved regime

truth: P ∈ R^{|S||A|×|S|};  empirical estimate: P̂

  • Can’t recover P faithfully if sample size ≪ |S|²|A|!

SLIDE 38

Challenges in the sample-starved regime

truth: P ∈ R^{|S||A|×|S|};  empirical estimate: P̂

  • Can’t recover P faithfully if sample size ≪ |S|²|A|!
  • Can we trust our policy estimate when reliable model estimation is infeasible?

SLIDE 39

Main result

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

‖ V^{π̂⋆_p} − V⋆ ‖_∞ ≤ ε

with sample complexity at most

Õ( |S||A| / ((1 − γ)³ ε²) )

SLIDE 40

Main result

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

‖ V^{π̂⋆_p} − V⋆ ‖_∞ ≤ ε

with sample complexity at most

Õ( |S||A| / ((1 − γ)³ ε²) )

  • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations

SLIDE 41

Main result

Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the perturbed empirical MDP achieves

‖ V^{π̂⋆_p} − V⋆ ‖_∞ ≤ ε

with sample complexity at most

Õ( |S||A| / ((1 − γ)³ ε²) )

  • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations
  • Minimax lower bound: Ω( |S||A| / ((1−γ)³ ε²) )   (Azar et al. ’13)
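
For concreteness, a minimal sketch of empirical Q-value iteration (QVI) run on the estimated model; the fixed iteration budget and array shapes are illustrative assumptions.

```python
# Minimal sketch: Q-value iteration on the empirical MDP (P_hat, r), assumed shapes
# P_hat (S, A, S) and r (S, A). Each update is a gamma-contraction, so on the order of
# log(1/eps) / (1 - gamma) iterations give an eps-accurate Q-function.
import numpy as np

def q_value_iteration(P_hat, r, gamma=0.9, num_iters=500):
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                    # V(s) = max_a Q(s, a)
        Q = r + gamma * P_hat @ V            # empirical Bellman optimality update
    return Q.argmax(axis=1), Q               # greedy policy and its Q-function
```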

SLIDE 42

SLIDE 43

Analysis

SLIDE 44

Notation and Bellman equation

  • V^π: true value function under policy π
  • Bellman equation: V^π = (I − γP_π)^{-1} r

SLIDE 45

Notation and Bellman equation

  • V^π: true value function under policy π
  • Bellman equation: V^π = (I − γP_π)^{-1} r

  • V̂^π: estimate of value function under policy π
  • Bellman equation: V̂^π = (I − γP̂_π)^{-1} r

SLIDE 46

Notation and Bellman equation

  • V^π: true value function under policy π
  • Bellman equation: V^π = (I − γP_π)^{-1} r

  • V̂^π: estimate of value function under policy π
  • Bellman equation: V̂^π = (I − γP̂_π)^{-1} r

  • π⋆: optimal policy w.r.t. true value function
  • π̂⋆: optimal policy w.r.t. empirical value function

SLIDE 47

Notation and Bellman equation

  • V^π: true value function under policy π
  • Bellman equation: V^π = (I − γP_π)^{-1} r

  • V̂^π: estimate of value function under policy π
  • Bellman equation: V̂^π = (I − γP̂_π)^{-1} r

  • π⋆: optimal policy w.r.t. true value function
  • π̂⋆: optimal policy w.r.t. empirical value function

  • V⋆ := V^{π⋆}: optimal values under the true model
  • V̂⋆ := V̂^{π̂⋆}: optimal values under the empirical model
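
To illustrate this notation, a minimal sketch that evaluates the matrix-form Bellman equation under both the true and the empirical model, and returns the sup-norm gap that Step 1 of the analysis controls; array shapes are assumed as in the earlier sketches.

```python
# Minimal sketch: Bellman equation in matrix form, V^pi = (I - gamma * P_pi)^{-1} r_pi,
# evaluated under the true model P and an empirical model P_hat (assumed shapes as before).
import numpy as np

def evaluate_policy(P, r, pi, gamma=0.9):
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]                       # transition matrix under pi, (S, S)
    r_pi = r[np.arange(S), pi]                       # reward vector under pi, (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def value_gap(P, P_hat, r, pi, gamma=0.9):
    """sup-norm gap || V_hat^pi - V^pi ||_inf for a fixed policy pi."""
    return float(np.abs(evaluate_policy(P_hat, r, pi, gamma) - evaluate_policy(P, r, pi, gamma)).max())
```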

SLIDE 48

Proof ideas

Elementary decomposition:

V⋆ − V^{π̂⋆} = (V⋆ − V̂^{π⋆}) + (V̂^{π⋆} − V̂^{π̂⋆}) + (V̂^{π̂⋆} − V^{π̂⋆})

SLIDE 49

Proof ideas

Elementary decomposition:

V⋆ − V^{π̂⋆} = (V⋆ − V̂^{π⋆}) + (V̂^{π⋆} − V̂^{π̂⋆}) + (V̂^{π̂⋆} − V^{π̂⋆})
            ≤ (V^{π⋆} − V̂^{π⋆}) + 0 + (V̂^{π̂⋆} − V^{π̂⋆})

  • Step 1: control V^π − V̂^π for a fixed π (Bernstein inequality + high-order decomposition)

SLIDE 50

Proof ideas

Elementary decomposition:

V⋆ − V^{π̂⋆} = (V⋆ − V̂^{π⋆}) + (V̂^{π⋆} − V̂^{π̂⋆}) + (V̂^{π̂⋆} − V^{π̂⋆})
            ≤ (V^{π⋆} − V̂^{π⋆}) + 0 + (V̂^{π̂⋆} − V^{π̂⋆})

  • Step 1: control V^π − V̂^π for a fixed π (Bernstein inequality + high-order decomposition)
  • Step 2: extend it to control V̂^{π̂⋆} − V^{π̂⋆} (π̂⋆ depends on samples) (decouple statistical dependency)

SLIDE 51

Step 1: improved theory for policy evaluation

Model-based policy evaluation: given a fixed policy π, estimate V^π via the plug-in estimate V̂^π

SLIDE 52

Step 1: improved theory for policy evaluation

Model-based policy evaluation: given a fixed policy π, estimate V^π via the plug-in estimate V̂^π

[Figure: sample complexity of model-based policy evaluation vs. target accuracy ε, against the minimax lower bound; prior results include Pananjady & Wainwright, Yang & Wang, Khamaru et al., Mou et al. 2020]

  • A sample size barrier |S| / (1−γ)² already appeared in prior work (Agarwal et al. ’19, Pananjady & Wainwright ’19, Khamaru et al. ’20)

SLIDE 53

Step 1: improved theory for policy evaluation

Model-based policy evaluation: given a fixed policy π, estimate V^π via the plug-in estimate V̂^π

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20). Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖_∞ ≤ ε with sample complexity at most

Õ( |S| / ((1 − γ)³ ε²) )

SLIDE 54

Step 1: improved theory for policy evaluation

Model-based policy evaluation: given a fixed policy π, estimate V^π via the plug-in estimate V̂^π

Theorem 2 (Li, Wei, Chi, Gu, Chen ’20). Fix any policy π. For any 0 < ε ≤ 1/(1−γ), the plug-in estimator V̂^π obeys ‖V̂^π − V^π‖_∞ ≤ ε with sample complexity at most

Õ( |S| / ((1 − γ)³ ε²) )

  • Minimax optimal for all ε (Azar et al. ’13, Pananjady & Wainwright ’19)

SLIDE 55

Key idea 1: a peeling argument

Agarwal, Kakade, Yang ’19: first-order expansion

V̂^π − V^π = γ (I − γP̂_π)^{-1} (P̂_π − P_π) V^π   (⋆)

Ours: higher-order expansion → tighter control

V̂^π − V^π = γ (I − γP_π)^{-1} (P̂_π − P_π) V^π + · · ·

SLIDE 56

Key idea 1: a peeling argument

Agarwal, Kakade, Yang ’19: first-order expansion

V̂^π − V^π = γ (I − γP̂_π)^{-1} (P̂_π − P_π) V^π   (⋆)

Ours: higher-order expansion → tighter control

V̂^π − V^π = γ (I − γP_π)^{-1} (P̂_π − P_π) V^π + γ (I − γP_π)^{-1} (P̂_π − P_π) (V̂^π − V^π)

SLIDE 57

Key idea 1: a peeling argument

Agarwal, Kakade, Yang ’19: first-order expansion

V̂^π − V^π = γ (I − γP̂_π)^{-1} (P̂_π − P_π) V^π   (⋆)

Ours: higher-order expansion → tighter control

V̂^π − V^π = γ (I − γP_π)^{-1} (P̂_π − P_π) V^π
            + γ² ((I − γP_π)^{-1} (P̂_π − P_π))² V^π
            + γ³ ((I − γP_π)^{-1} (P̂_π − P_π))³ V^π + . . .
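
A minimal numerical sanity check (not from the talk) that the partial sums of this expansion recover V̂^π − V^π on a small synthetic example; the particular P_π, P̂_π, and truncation depth are arbitrary choices.

```python
# Minimal numerical check: partial sums of the peeling expansion recover V_hat - V.
# The small synthetic P_pi, P_hat_pi, and the truncation depth are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)     # true P_pi (row-stochastic)
P_hat_pi = 0.9 * P_pi + 0.1 * np.eye(S)                                # a nearby "empirical" P_hat_pi
r_pi = rng.random(S)

V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)                    # V^pi
V_hat = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)            # V_hat^pi

M = gamma * np.linalg.solve(np.eye(S) - gamma * P_pi, P_hat_pi - P_pi) # gamma (I - gamma P_pi)^{-1} (P_hat_pi - P_pi)
expansion, term = np.zeros(S), V.copy()
for _ in range(50):                                                    # sum_{k >= 1} M^k V^pi
    term = M @ term
    expansion += term

print(np.max(np.abs(V_hat - V - expansion)))                           # ~1e-16: expansion matches
```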

SLIDE 58

Step 2: controlling V̂^{π̂⋆} − V^{π̂⋆}

A natural idea: apply our policy evaluation theory + union bound

SLIDE 59

Step 2: controlling V̂^{π̂⋆} − V^{π̂⋆}

A natural idea: apply our policy evaluation theory + union bound

  • highly suboptimal! (there are exponentially many policies)

SLIDE 60

Key idea 2: leave-one-out analysis

Decouple the dependency by introducing auxiliary state-action absorbing MDPs, dropping the randomness for each (s, a)

[Diagram: empirical MDP (P̂, r) → leave-one-out MDP (P̂^{(s,a)}, r^{(s,a)}) to decouple dependency]

— inspired by Agarwal et al. ’19 but quite different . . .

SLIDE 61

Key idea 2: leave-one-out analysis

  • Stein ’72
  • El Karoui, Bean, Bickel, Lim, Yu ’13
  • El Karoui ’15
  • Javanmard, Montanari ’15
  • Zhong, Boumal ’17
  • Lei, Bickel, El Karoui ’17
  • Sur, Chen, Candès ’17

  • Abbe, Fan, Wang, Zhong ’17
  • Chen, Fan, Ma, Wang ’17
  • Ma, Wang, Chi, Chen ’17
  • Chen, Chi, Fan, Ma ’18
  • Ding, Chen ’18
  • Dong, Shi ’18
  • Chen, Chi, Fan, Ma, Yan ’19
  • Chen, Fan, Ma, Yan ’19
  • Cai, Li, Poor, Chen ’19
  • Agarwal, Kakade, Yang ’19
  • Pananjady, Wainwright ’19
  • Ling ’20

SLIDE 62

Key idea 2: leave-one-out analysis

[Diagram: empirical MDP (P̂, r) → leave-one-out MDP (P̂^{(s,a)}, r^{(s,a)})]

  • 1. embed all randomness from P̂_{s,a} into a single scalar (i.e. r^{(s,a)}_{s,a})

SLIDE 63

Key idea 2: leave-one-out analysis

[Diagram: empirical MDP (P̂, r) → leave-one-out MDP (P̂^{(s,a)}, r^{(s,a)})]

  • 1. embed all randomness from P̂_{s,a} into a single scalar (i.e. r^{(s,a)}_{s,a})
  • 2. build an ε-net for this scalar

SLIDE 64

Key idea 2: leave-one-out analysis

[Diagram: empirical MDP (P̂, r) → leave-one-out MDP (P̂^{(s,a)}, r^{(s,a)})]

  • 1. embed all randomness from P̂_{s,a} into a single scalar (i.e. r^{(s,a)}_{s,a})
  • 2. build an ε-net for this scalar
  • 3. π̂⋆ can be determined by this ε-net under the separation condition

∀s ∈ S :  Q̂⋆(s, π̂⋆(s)) − max_{a: a ≠ π̂⋆(s)} Q̂⋆(s, a) > 0

SLIDE 65

Key idea 2: leave-one-out analysis

[Diagram: empirical MDP (P̂, r) → leave-one-out MDP (P̂^{(s,a)}, r^{(s,a)})]

Our decoupling argument vs. Agarwal, Kakade, Yang ’19

  • Agarwal et al. ’19: dependency btw value V̂ & samples
  • Ours: dependency btw policy π̂ & samples

SLIDE 66

Key idea 3: tie-breaking via perturbation

  • How to ensure separation between the optimal policy and the others?

∀s ∈ S :  Q̂⋆(s, π̂⋆(s)) − max_{a: a ≠ π̂⋆(s)} Q̂⋆(s, a) > 0

SLIDE 67

Key idea 3: tie-breaking via perturbation

  • How to ensure separation between the optimal policy and the others?

∀s ∈ S :  Q̂⋆(s, π̂⋆(s)) − max_{a: a ≠ π̂⋆(s)} Q̂⋆(s, a) > 0

  • Solution: slightly perturb the rewards r ⇒ π̂⋆_p
  • ensures π̂⋆_p can be differentiated from the others
  • V̂^{π̂⋆_p} ≈ V̂^{π̂⋆}

SLIDE 68

Key idea 3: tie-breaking via perturbation

  • How to ensure separation between the optimal policy and the others?

∀s ∈ S :  Q̂⋆(s, π̂⋆(s)) − max_{a: a ≠ π̂⋆(s)} Q̂⋆(s, a) > (1−γ)ε / (|S|⁵|A|⁵)

  • Solution: slightly perturb the rewards r ⇒ π̂⋆_p
  • ensures π̂⋆_p can be differentiated from the others
  • V̂^{π̂⋆_p} ≈ V̂^{π̂⋆}

SLIDE 69

Other stories: sharpened analysis of Q-learning

Improves existing sample complexity guarantees for asynchronous Q-learning by at least a factor of |S||A|!

“Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020

SLIDE 70

Other stories: efficiency of natural policy gradient

NPG method with entropy regularization converges linearly!

[Figure: optimization trajectories from π(0) toward π⋆_τ in the (log π(a1), log π(a2)) plane; left panel: Policy Gradient, η = 0.1; right panel: Natural Policy Gradient, η = 0.1]

“Fast global convergence of natural policy gradient methods with entropy regularization,” S. Cen, C. Cheng, Y. Chen, Y. Wei, Y. Chi, arXiv:2007.06558, 2020

SLIDE 71

Concluding remarks

Understanding RL requires modern statistics and optimization

SLIDE 72

Concluding remarks

Understanding RL requires modern statistics and optimization

Future directions:

  • beyond the tabular settings
  • finite-horizon episodic MDPs
  • Markov games

“Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020
