SLIDE 1

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning

Yuting Wei, Carnegie Mellon University, November 2020

SLIDE 2

Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuxin Chen (Princeton EE)

SLIDE 3

Reinforcement learning (RL)

SLIDE 4

RL challenges

  • Unknown or changing environment
  • Credit assignment problem
  • Enormous state and action space

SLIDE 5

Provable efficiency

  • Collecting samples might be expensive or impossible: sample efficiency
  • Training deep RL algorithms might take a long time: computational efficiency

SLIDE 6

This talk

Question: can we design sample- and computation-efficient RL algorithms?

Inspired by numerous prior works [Kearns and Singh, 1999, Sidford et al., 2018a, Agarwal et al., 2019] ...

SLIDE 7

Background: Markov decision processes

SLIDE 8

Markov decision process (MDP)

  • $\mathcal{S}$: state space
  • $\mathcal{A}$: action space
  • $r(s, a) \in [0, 1]$: immediate reward
  • $\pi(\cdot \mid s)$: policy (or action selection rule)
  • $P(\cdot \mid s, a)$: unknown transition probabilities
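
To make these objects concrete, here is a minimal tabular-MDP container in Python. This is an illustrative sketch of the setting; the class name and example values are ours, not from the talk:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """Minimal tabular MDP (illustrative): |S| states, |A| actions."""
    P: np.ndarray       # transitions, shape (S, A, S); each P[s, a] sums to 1
    r: np.ndarray       # rewards in [0, 1], shape (S, A)
    gamma: float        # discount factor in [0, 1)

# A tiny 2-state, 2-action example
mdp = TabularMDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.0, 1.0]]]),
    r=np.array([[1.0, 0.0],
                [0.0, 0.5]]),
    gamma=0.9,
)
```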

SLIDE 12

Help the mouse!

  • state space $\mathcal{S}$: positions in the maze
  • action space $\mathcal{A}$: up, down, left, right
  • immediate reward $r$: cheese, electric shocks, cats
  • policy $\pi(\cdot \mid s)$: the way to find cheese

SLIDE 17

Value function

Value function of policy $\pi$: long-term discounted reward

$$\forall s \in \mathcal{S}: \quad V^{\pi}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s\right]$$

  • $\gamma \in [0, 1)$: discount factor
  • $(a_0, s_1, a_1, s_2, a_2, \cdots)$: trajectory generated under policy $\pi$

SLIDE 19

Action-value function (a.k.a. Q-function)

Q-function of policy $\pi$:

$$\forall (s, a) \in \mathcal{S} \times \mathcal{A}: \quad Q^{\pi}(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s, a_0 = a\right]$$

  • $(a_0, s_1, a_1, s_2, a_2, \cdots)$: trajectory generated under policy $\pi$
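
These definitions translate directly into a Monte Carlo rollout estimate. Below is a sketch (ours, reusing the TabularMDP container above) that truncates the infinite sum at a horizon where $\gamma^t$ is negligible:

```python
def rollout_value(mdp, policy, s0, horizon=500, n_rollouts=2000, seed=0):
    """Monte Carlo estimate of V^pi(s0): average truncated discounted return.

    policy: array of shape (S, A); policy[s] is the distribution pi(.|s).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = mdp.r.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])    # a_t ~ pi(.|s_t)
            ret += discount * mdp.r[s, a]             # accumulate gamma^t * r_t
            discount *= mdp.gamma
            s = rng.choice(n_states, p=mdp.P[s, a])   # s_{t+1} ~ P(.|s_t, a_t)
        total += ret
    return total / n_rollouts
```

Conditioning on $a_0 = a$ in the first step instead of sampling it gives the analogous estimate of $Q^{\pi}(s, a)$.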

SLIDE 21

Optimal policy

  • optimal policy $\pi^{\star}$: maximizes the value function
  • optimal value / Q-function: $V^{\star} := V^{\pi^{\star}}$; $Q^{\star} := Q^{\pi^{\star}}$

SLIDE 24

Practically, learn the optimal policy from data samples ...

SLIDE 25

This talk: sampling from a generative model

For each state-action pair $(s, a)$, collect $N$ samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$.

How many samples are sufficient to learn an $\varepsilon$-optimal policy?
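
In code, a generative model is just an oracle that draws i.i.d. next states for any queried pair. A minimal sketch (the function name is ours):

```python
def sample_generative_model(mdp, N, seed=0):
    """For each (s, a), draw N i.i.d. samples s'^(i) ~ P(.|s, a)."""
    rng = np.random.default_rng(seed)
    S, A = mdp.r.shape
    samples = np.empty((S, A, N), dtype=np.int64)
    for s in range(S):
        for a in range(A):
            samples[s, a] = rng.choice(S, size=N, p=mdp.P[s, a])
    return samples    # samples[s, a, i] = i-th sampled next state from (s, a)
```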

SLIDE 28

An incomplete list of prior art

  • [Kearns and Singh, 1999]
  • [Kakade, 2003]
  • [Kearns et al., 2002]
  • [Azar et al., 2012]
  • [Azar et al., 2013]
  • [Sidford et al., 2018a]
  • [Sidford et al., 2018b]
  • [Wang, 2019]
  • [Agarwal et al., 2019]
  • [Wainwright, 2019a]
  • [Wainwright, 2019b]
  • [Pananjady and Wainwright, 2019]
  • [Yang and Wang, 2019]
  • [Khamaru et al., 2020]
  • [Mou et al., 2020]
  • . . .

SLIDE 29

An even shorter list of prior art

| algorithm | sample size range | sample complexity | $\varepsilon$-range |
| --- | --- | --- | --- |
| Empirical QVI [Azar et al., 2013] | $\big[\frac{|S|^2|A|}{(1-\gamma)^2}, \infty\big)$ | $\frac{|S||A|}{(1-\gamma)^3 \varepsilon^2}$ | $\big(0, \frac{1}{\sqrt{(1-\gamma)|S|}}\big]$ |
| Sublinear randomized VI [Sidford et al., 2018b] | $\big[\frac{|S||A|}{(1-\gamma)^2}, \infty\big)$ | $\frac{|S||A|}{(1-\gamma)^4 \varepsilon^2}$ | $\big(0, \frac{1}{1-\gamma}\big]$ |
| Variance-reduced QVI [Sidford et al., 2018a] | $\big[\frac{|S||A|}{(1-\gamma)^3}, \infty\big)$ | $\frac{|S||A|}{(1-\gamma)^3 \varepsilon^2}$ | $(0, 1]$ |
| Randomized primal-dual [Wang, 2019] | $\big[\frac{|S||A|}{(1-\gamma)^2}, \infty\big)$ | $\frac{|S||A|}{(1-\gamma)^4 \varepsilon^2}$ | $\big(0, \frac{1}{1-\gamma}\big]$ |
| Empirical MDP + planning [Agarwal et al., 2019] | $\big[\frac{|S||A|}{(1-\gamma)^2}, \infty\big)$ | $\frac{|S||A|}{(1-\gamma)^3 \varepsilon^2}$ | $\big(0, \frac{1}{\sqrt{1-\gamma}}\big]$ |

Important parameters:

  • # states $|S|$, # actions $|A|$
  • the discounted complexity $\frac{1}{1-\gamma}$
  • approximation error $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$
SLIDE 32

All prior theory requires sample size $> \underbrace{\tfrac{|S||A|}{(1-\gamma)^{2}}}_{\text{sample size barrier}}$
SLIDE 33

This talk: break the sample complexity barrier

SLIDE 34

Two approaches

Model-based approach ("plug-in"):
  1. build an empirical estimate $\widehat{P}$ of $P$
  2. planning based on the empirical $\widehat{P}$

Model-free approach: learning without constructing a model explicitly

SLIDE 37

Model estimation

Sampling: for each $(s, a)$, collect $N$ independent samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$

Empirical estimates: estimate $P(s' \mid s, a)$ by

$$\widehat{P}(s' \mid s, a) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{s'^{(i)} = s'\}}_{\text{empirical frequency}}$$
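
The empirical frequencies above take only a few lines of code; a sketch reusing sample_generative_model from earlier:

```python
def empirical_transitions(samples):
    """Estimate P_hat(s'|s, a) as the empirical frequency of each next state."""
    S, A, N = samples.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(samples[s, a], minlength=S)
            P_hat[s, a] = counts / N    # empirical frequency of each s'
    return P_hat
```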

SLIDE 39

Our method: plug-in estimator + perturbation
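
Putting the pieces together, here is a minimal sketch of the perturbed plug-in pipeline under our reading of the method: perturb the rewards slightly, build the empirical MDP, and plan on it with Q-value iteration. The perturbation scale xi and the stopping rule are illustrative choices, not the paper's exact tuning:

```python
def perturbed_plugin_policy(samples, r, gamma, xi=1e-6, seed=0):
    """Plug-in approach: plan on (P_hat, r + small perturbation) via Q-value iteration."""
    rng = np.random.default_rng(seed)
    S, A, _ = samples.shape
    P_hat = empirical_transitions(samples)
    r_p = r + xi * rng.random((S, A))     # tiny random perturbation: breaks ties w.h.p.
    n_iters = int(np.ceil(np.log(1e12) / (1.0 - gamma)))  # ~ O(1/(1-gamma)) iterations
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)                 # V_hat(s) = max_a Q_hat(s, a)
        Q = r_p + gamma * P_hat @ V       # empirical Bellman optimality update
    return Q.argmax(axis=1)               # greedy (deterministic) policy
```

Running QVI on the empirical MDP is exactly the "planning" step of the model-based approach; the main theorem below says the resulting policy is $\varepsilon$-optimal once the per-pair sample size $N$ exceeds roughly $\widetilde{O}\big(\tfrac{1}{(1-\gamma)^3 \varepsilon^2}\big)$.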

SLIDE 43

Challenges in the sample-starved regime

truth: $P \in \mathbb{R}^{|S||A| \times |S|}$; empirical estimate: $\widehat{P}$

  • one cannot recover $P$ faithfully if the sample size $\ll |S|^2 |A|$!

Can we trust our policy estimate when reliable model estimation is infeasible?

SLIDE 45

Main result

Theorem (Li, Wei, Chi, Gu, Chen '20). For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{p}$ of the perturbed empirical MDP achieves

$$\big\|V^{\widehat{\pi}^{\star}_{p}} - V^{\star}\big\|_{\infty} \le \varepsilon \quad \text{and} \quad \big\|Q^{\widehat{\pi}^{\star}_{p}} - Q^{\star}\big\|_{\infty} \le \varepsilon$$

with sample complexity at most $\widetilde{O}\left(\frac{|S||A|}{(1-\gamma)^{3}\varepsilon^{2}}\right)$.

  • $\widehat{\pi}^{\star}_{p}$: obtained by empirical QVI or PI within $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations
  • matches the minimax lower bound $\widetilde{\Omega}\big(\frac{|S||A|}{(1-\gamma)^{3}\varepsilon^{2}}\big)$ [Azar et al., 2013]


SLIDE 49

A sketch of the main proof ingredients

SLIDE 50

Notation and Bellman equation

  • $V^{\pi}$: true value function under policy $\pi$
    - Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018]
  • $\widehat{V}^{\pi}$: estimate of the value function under policy $\pi$
    - Bellman equation: $\widehat{V}^{\pi} = (I - \gamma \widehat{P}_{\pi})^{-1} r$
  • $\pi^{\star}$: optimal policy w.r.t. the true value function
  • $\widehat{\pi}^{\star}$: optimal policy w.r.t. the empirical value function
  • $V^{\star} := V^{\pi^{\star}}$: optimal values under the true model
  • $\widehat{V}^{\star} := \widehat{V}^{\widehat{\pi}^{\star}}$: optimal values under the empirical model
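
Since the policy-evaluation Bellman equation is a linear system, $V^{\pi}$ can be computed exactly with one linear solve; a sketch (ours):

```python
def evaluate_policy(P, r, gamma, policy):
    """Solve the Bellman equation V = (I - gamma * P_pi)^{-1} r_pi exactly."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', policy, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum('sa,sa->s', policy, r)     # r_pi[s]    = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

Substituting $\widehat{P}$ for $P$ in the same solve gives $\widehat{V}^{\pi}$, i.e. the plug-in estimate in the second bullet.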

SLIDE 54

Proof ideas (cont.)

Elementary decomposition:

$$V^{\star} - V^{\widehat{\pi}^{\star}} = \big(V^{\star} - \widehat{V}^{\pi^{\star}}\big) + \big(\widehat{V}^{\pi^{\star}} - \widehat{V}^{\widehat{\pi}^{\star}}\big) + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big) \;\le\; \big(V^{\star} - \widehat{V}^{\pi^{\star}}\big) + 0 + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big)$$

(the middle term is $\le 0$ since $\widehat{\pi}^{\star}$ is optimal for the empirical MDP)

  • Step 1: control $V^{\pi} - \widehat{V}^{\pi}$ for a fixed $\pi$ (Bernstein's inequality + high-order decomposition)
  • Step 2: control $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$ (decouple statistical dependence)

SLIDE 58

Step 1: high-order decomposition

Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$

[Agarwal et al., 2019]:

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big(I - \gamma P_{\pi}\big)^{-1} \big(\widehat{P}_{\pi} - P_{\pi}\big) \widehat{V}^{\pi} \qquad (\star)$$

[ours] expand to higher orders:

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big(I - \gamma P_{\pi}\big)^{-1} \big(\widehat{P}_{\pi} - P_{\pi}\big) V^{\pi} + \gamma^{2} \Big(\big(I - \gamma P_{\pi}\big)^{-1} \big(\widehat{P}_{\pi} - P_{\pi}\big)\Big)^{2} V^{\pi} + \gamma^{3} \Big(\big(I - \gamma P_{\pi}\big)^{-1} \big(\widehat{P}_{\pi} - P_{\pi}\big)\Big)^{3} V^{\pi} + \ldots$$

Bernstein's inequality:

$$\Big|\big(\widehat{P}_{\pi} - P_{\pi}\big) V^{\pi}\Big| \lesssim \sqrt{\frac{\mathrm{Var}[V^{\pi}]}{N}} + \frac{\|V^{\pi}\|_{\infty}}{N}$$
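
To see where the expansion comes from (our paraphrase of the step, writing $\Delta_{\pi} := \widehat{P}_{\pi} - P_{\pi}$): subtracting the two Bellman identities $(I - \gamma P_{\pi}) V^{\pi} = r$ and $(I - \gamma \widehat{P}_{\pi}) \widehat{V}^{\pi} = r$ gives $(I - \gamma P_{\pi})(\widehat{V}^{\pi} - V^{\pi}) = \gamma \Delta_{\pi} \widehat{V}^{\pi}$, which is exactly $(\star)$. Substituting $\widehat{V}^{\pi} = V^{\pi} + (\widehat{V}^{\pi} - V^{\pi})$ into the right-hand side of $(\star)$ and applying $(\star)$ to the residual yields

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big(I - \gamma P_{\pi}\big)^{-1} \Delta_{\pi} V^{\pi} + \gamma^{2} \Big(\big(I - \gamma P_{\pi}\big)^{-1} \Delta_{\pi}\Big)^{2} \widehat{V}^{\pi},$$

and repeating the substitution $k$ times produces the $k$-th order expansion above. Crucially, the leading terms now involve the deterministic $V^{\pi}$ rather than the data-dependent $\widehat{V}^{\pi}$, so Bernstein's inequality applies directly to them.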

SLIDE 65

Byproduct: policy evaluation

Theorem (Li, Wei, Chi, Gu, Chen '20). Fix any policy $\pi$. For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the plug-in estimator $\widehat{V}^{\pi}$ obeys $\|\widehat{V}^{\pi} - V^{\pi}\|_{\infty} \le \varepsilon$ with sample complexity at most $\widetilde{O}\left(\frac{|S|}{(1-\gamma)^{3}\varepsilon^{2}}\right)$.

  • matches the minimax lower bound [Azar et al., 2013, Pananjady and Wainwright, 2019]
  • tackles the sample size barrier: prior work requires sample size $> \frac{|S|}{(1-\gamma)^{2}}$ [Agarwal et al., 2019, Pananjady and Wainwright, 2019, Khamaru et al., 2020]

SLIDE 68

Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

A natural idea: apply our policy evaluation theory + a union bound

  • highly suboptimal!

Key idea 2: a leave-one-out argument to decouple the statistical dependency between $\widehat{\pi}^{\star}$ and the samples (inspired by [Agarwal et al., 2019] but quite different ...)

SLIDE 71

Key idea 2: leave-one-out argument

  • state-action absorbing MDP for each $(s, a)$: $\big(\mathcal{S}, \mathcal{A}, \widehat{P}^{(s,a)}, r, \gamma\big)$
  • $\big(\widehat{P} - P\big)_{s,a} \widehat{V}^{\widehat{\pi}^{\star}} = \big(\widehat{P} - P\big)_{s,a} \widehat{V}^{\widehat{\pi}^{\star}_{s,a}}$ (where $\widehat{\pi}^{\star}_{s,a}$ is optimal for the new MDP)

Caveat: requires $\widehat{\pi}^{\star}$ to stand out from other policies

SLIDE 74

Key idea 3: tie-breaking via perturbation

  • How do we ensure the optimal policy stands out from other policies? We need a strict gap:

$$\forall s \in \mathcal{S}: \quad \widehat{Q}^{\star}\big(s, \widehat{\pi}^{\star}(s)\big) - \max_{a \,:\, a \neq \widehat{\pi}^{\star}(s)} \widehat{Q}^{\star}(s, a) > 0$$

  • Solution: slightly perturb the rewards $r \;\Longrightarrow\; \widehat{\pi}^{\star}_{p}$
    - ensures the uniqueness of $\widehat{\pi}^{\star}_{p}$
    - $V^{\widehat{\pi}^{\star}_{p}} \approx V^{\widehat{\pi}^{\star}}$
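
A toy illustration of the tie-breaking effect (ours; for brevity the perturbation is applied directly to a Q-table, whereas the talk perturbs the rewards $r$, which in turn perturbs $\widehat{Q}^{\star}$):

```python
rng = np.random.default_rng(0)
Q_hat = np.array([0.5, 0.5, 0.3])       # two exactly tied optimal actions at some state
Q_pert = Q_hat + 1e-9 * rng.random(3)   # arbitrarily small random perturbation

print(np.flatnonzero(Q_hat == Q_hat.max()))    # [0 1]: ambiguous argmax (a tie)
print(np.flatnonzero(Q_pert == Q_pert.max()))  # a single index: unique w.p. 1
```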

SLIDE 76

Concluding remarks

Understanding RL requires modern statistics and optimization.

Future directions:

  • beyond the tabular setting [Feng et al., 2020, Jin et al., 2019, Duan and Wang, 2020]
  • finite-horizon episodic MDPs [Dann and Brunskill, 2015, Jiang and Agarwal, 2018, Wang et al., 2020]

SLIDE 78

Paper:

"Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020.
