
slide-1
SLIDE 1

The Provable Effectiveness of Policy Gradient Methods in Reinforcement Learning
 Sham Kakade

University of Washington & Microsoft Research

(with Alekh Agarwal, Jason Lee, and Gaurav Mahajan)


slide-2
SLIDE 2

Policy Optimization in RL

[AlphaZero, Silver et al. ’17] [OpenAI Five ’18] [OpenAI ’19]

slide-3
SLIDE 3

Markov Decision Processes:


a framework for RL

slide-4
SLIDE 4

Markov Decision Processes:


a framework for RL

  • A policy:


π : States → Actions

slide-5
SLIDE 5

Markov Decision Processes:


a framework for RL

  • A policy:


π : States → Actions

  • We execute π to obtain a trajectory:


s0, a0, r0, s1, a1, r1, …

slide-6
SLIDE 6

Markov Decision Processes:


a framework for RL

  • A policy:


π : States → Actions

  • We execute π to obtain a trajectory:


s0, a0, r0, s1, a1, r1, …

  • Total γ-discounted reward:


Vπ(s0) = E_π[∑_{t=0}^∞ γ^t r_t]

slide-7
SLIDE 7

Markov Decision Processes:


a framework for RL

  • A policy:


π : States → Actions

  • We execute π to obtain a trajectory:


s0, a0, r0, s1, a1, r1, …

  • Total γ-discounted reward:


Vπ(s0) = E_π[∑_{t=0}^∞ γ^t r_t]

  • Goal: Find a policy π that maximizes our value Vπ(s0)
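
For concreteness, here is a minimal sketch (not from the talk) of estimating Vπ(s0) by Monte Carlo rollouts; the `env` and `policy` interfaces are illustrative assumptions.

```python
import numpy as np

def estimate_value(env, policy, s0, gamma=0.99, horizon=1000, n_rollouts=100):
    """Monte Carlo estimate of V^pi(s0) = E_pi[ sum_t gamma^t r_t ]."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset(s0)             # assumed: env can be reset to state s0
        total, discount = 0.0, 1.0
        for _ in range(horizon):      # truncate the infinite discounted sum
            a = policy(s)             # pi : States -> Actions
            s, r, done = env.step(a)  # assumed environment interface
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```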

slide-8
SLIDE 8

Challenges in RL

slide-9
SLIDE 9

Challenges in RL

  • 1. Exploration


(the environment may be unknown)

slide-10
SLIDE 10

Challenges in RL

  • 1. Exploration


(the environment may be unknown)

  • 2. Credit assignment problem


(due to delayed rewards)

slide-11
SLIDE 11

Challenges in RL

  • 1. Exploration


(the environment may be unknown)

  • 2. Credit assignment problem


(due to delayed rewards)

  • 3. Large state/action spaces:


hand state: joint angles/velocities
 cube state: configuration 
 actions: forces applied to actuators

Dexterous Robotic Hand Manipulation OpenAI, 2019

slide-12
SLIDE 12

Part 0: Background

RL, Deep RL, and Supervised Learning (SL)

slide-13
SLIDE 13

The “Tabular” Dynamic Programming approach

  • “Tabular” dynamic programming approach: (with known model)
  • 1. For every entry in the table, compute the state-action value:

Qπ(s, a) = E_π[∑_{t=0}^∞ γ^t r_t | s0 = s, a0 = a]

  • 2. Update the policy π to be greedy:

π(s) ← argmax_a Qπ(s, a)
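
A minimal tabular sketch of these two steps, assuming the model is known as a transition tensor `P[s, a, s']` and reward table `R[s, a]` (names are illustrative, not from the talk).

```python
import numpy as np

def policy_iteration(P, R, gamma, n_iters=100):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # deterministic policy: state -> action
    for _ in range(n_iters):
        # 1. Policy evaluation: solve V = R_pi + gamma * P_pi V, then form Q^pi
        P_pi = P[np.arange(S), pi]              # (S, S)
        R_pi = R[np.arange(S), pi]              # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        Q = R + gamma * (P @ V)                 # Q^pi(s, a), shape (S, A)
        # 2. Greedy improvement: pi(s) <- argmax_a Q^pi(s, a)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, Q
```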

slide-14
SLIDE 14

The “Tabular” Dynamic Programming approach

  • “Tabular” dynamic programming approach: (with known model)
  • 1. For every entry in the table, compute the state-action value:

  • 2. Update the policy to be greedy:

Qπ(s, a) = 피π[

t=0

γtrt|s0 = s, a0 = a]

π π(s) ← argmaxa Qπ(s, a)

  • Generalization: how can we deal with this infinite table? 


Use sampling/supervised learning + deep learning.

slide-15
SLIDE 15

The “Tabular” Dynamic Programming approach

  • “Tabular” dynamic programming approach: (with known model)
  • 1. For every entry in the table, compute the state-action value:

  • 2. Update the policy to be greedy:

Qπ(s, a) = 피π[

t=0

γtrt|s0 = s, a0 = a]

π π(s) ← argmaxa Qπ(s, a)

  • Generalization: how can we deal with this infinite table? 


Use sampling/supervised learning + deep learning.

“deep RL”?


[Bertsekas & Tsitsiklis ’97] provides the first systematic analysis of RL with (worst-case) “function approximation”. 


slide-16
SLIDE 16

In practice, policy gradient methods rule…

slide-17
SLIDE 17

In practice, policy gradient methods rule…

  • They are the most effective method for obtaining state-of-the-art results.



 


θ ← θ + η∇θVπθ(s0)

slide-18
SLIDE 18

In practice, policy gradient methods rule…

  • They are the most effective method for obtaining state-of-the-art results.



 


θ ← θ + η∇θVπθ(s0)

  • Why do we like them?
slide-19
SLIDE 19

In practice, policy gradient methods rule…

  • They are the most effective method for obtaining state-of-the-art results.



 


θ ← θ + η∇θVπθ(s0)

  • Why do we like them?
  • they easily deal with large state/action spaces


(through the neural net parameterization)

slide-20
SLIDE 20

In practice, policy gradient methods rule…

  • They are the most effective method for obtaining state-of-the-art results.



 


θ ← θ + η∇θVπθ(s0)

  • Why do we like them?
  • they easily deal with large state/action spaces


(through the neural net parameterization)

  • We can estimate the gradient using only simulation of our current policy πθ


 (the expectation is under the states/actions visited under πθ)

slide-21
SLIDE 21

In practice, policy gradient methods rule…

  • They are the most effective method for obtaining state-of-the-art results.



 


θ ← θ + η∇θVπθ(s0)

  • Why do we like them?
  • they easily deal with large state/action spaces


(through the neural net parameterization)

  • We can estimate the gradient using only simulation of our current policy πθ


 (the expectation is under the states/actions visited under πθ)

  • They directly optimize the cost function of interest!
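
As a concrete illustration of estimating the gradient from simulation alone (per the bullets above), here is a sketch of the standard score-function (REINFORCE) estimator for a tabular softmax policy; the environment interface is an assumption for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(env, s0, theta, gamma=0.99, horizon=200):
    """One-rollout REINFORCE estimate of grad_theta V^{pi_theta}(s0)
    for a tabular softmax policy theta[s, a] (env interface is assumed)."""
    grad = np.zeros_like(theta)
    s = env.reset(s0)
    traj = []
    for t in range(horizon):
        probs = softmax(theta[s])
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        traj.append((s, a, r, probs))
        s = s_next
        if done:
            break
    # grad V(s0) = E[ sum_t gamma^t * G_t * grad log pi(a_t|s_t) ],  G_t = reward-to-go
    G = 0.0
    for t in reversed(range(len(traj))):
        s_t, a_t, r_t, probs = traj[t]
        G = r_t + gamma * G
        grad_logp = -probs            # grad of log-softmax w.r.t. logits of state s_t
        grad_logp[a_t] += 1.0
        grad[s_t] += (gamma ** t) * G * grad_logp
    return grad

# ascent step:  theta <- theta + eta * reinforce_gradient(...)
```
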
slide-22
SLIDE 22

The Optimization Landscape

Supervised Learning:

  • Gradient descent tends to ‘just work’


in practice (not sensitive to initialization)

  • Saddle points not a problem…

Reinforcement Learning:

  • In many real RL problems, we have

“very” flat regions.

  • Gradients can be exponentially small in

the “horizon” due to lack of exploration.

slide-23
SLIDE 23

The Optimization Landscape

Supervised Learning:

  • Gradient descent tends to ‘just work’


in practice (not sensitive to initialization)

  • Saddle points not a problem…

Reinforcement Learning:

  • In many real RL problems, we have

“very” flat regions.

  • Gradients can be exponentially small in

the “horizon” due to lack of exploration.


 
 Lemma [Higher-order vanishing gradients]: Suppose there are S states in the MDP, with S ≤ 1/(1 − γ). With random initialization, for all k < S/log(S), the spectral norm of the k-th order gradient is bounded by 2^(−S/2).

[Thrun ’92]

slide-24
SLIDE 24

The Optimization Landscape

Supervised Learning:

  • Gradient descent tends to ‘just work’


in practice (not sensitive to initialization)

  • Saddle points not a problem…

Reinforcement Learning:

  • In many real RL problems, we have

“very” flat regions.

  • Gradients can be exponentially small in

the “horizon” due to lack of exploration.

This talk: Policy gradient methods are one of the most widely used practical tools; can we get any theoretical handle on them?

 Lemma [Higher-order vanishing gradients]: Suppose there are S states in the MDP, with S ≤ 1/(1 − γ). With random initialization, for all k < S/log(S), the spectral norm of the k-th order gradient is bounded by 2^(−S/2).

[Thrun ’92]

slide-25
SLIDE 25

This talk

We provide provable global convergence and generalization guarantees of (nonconvex) policy gradient methods.

slide-26
SLIDE 26

This talk

We provide provable global convergence and generalization guarantees of (nonconvex) policy gradient methods.

  • Part – I: small state spaces + exact gradients


curvature + non-convexity

  • Vanilla PG
  • PG with regularization
  • Natural Policy Gradient

slide-27
SLIDE 27

This talk

We provide provable global convergence and generalization guarantees of (nonconvex) policy gradient methods.

  • Part – I: small state spaces + exact gradients


curvature + non-convexity

  • Vanilla PG
  • PG with regularization
  • Natural Policy Gradient

  • Part – II: large state spaces


generalization and distribution shift

  • Function approximation/deep nets? Why use PG?
slide-28
SLIDE 28

Part I: Small State Spaces

(and the softmax policy class)

slide-29
SLIDE 29

Policy Optimization over the ”softmax” policy class

(let’s start simple!)

  • Simplest way to parameterize the simplex, without constraints.

slide-30
SLIDE 30

Policy Optimization over the ”softmax” policy class

(let’s start simple!)

  • Simplest way to parameterize the simplex, without constraints.

  • πθ(a|s) is the probability of action a given state s:


πθ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′})

slide-31
SLIDE 31

Policy Optimization over the ”softmax” policy class

(let’s start simple!)

  • Simplest way to parameterize the simplex, without constraints.

  • πθ(a|s) is the probability of action a given state s:


πθ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′})

  • Complete class: contains every stationary policy
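
A small sketch of this tabular softmax parameterization, one unconstrained parameter per state-action pair (the array shapes are illustrative).

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) = exp(theta[s, .]) / sum_{a'} exp(theta[s, a']).

    theta: (S, A) table of unconstrained parameters; every stationary policy
    is approached within this class by pushing logits toward +/- infinity."""
    logits = theta[s]
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

theta = np.zeros((5, 3))                # 5 states, 3 actions -> uniform policy
print(softmax_policy(theta, s=2))       # [1/3, 1/3, 1/3]
```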


slide-32
SLIDE 32

Policy Optimization over the ”softmax” policy class

(let’s start simple!)

  • Simplest way to parameterize the simplex, without constraints.

  • πθ(a|s) is the probability of action a given state s:


πθ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′})

  • Complete class: contains every stationary policy


The policy optimization problem max_θ Vπθ(s0) is non-convex.
 Do we have global convergence?

slide-33
SLIDE 33

Global Convergence of PG for Softmax


 Theorem [Vanilla PG for Softmax Policy class]: Let Vθ(μ) = E_{s∼μ}[Vθ(s)] and consider the update θ ← θ + η∇θVθ(μ). Suppose the starting state distribution μ has full support over the state space. Then, for all states s, Vθ(s) → V⋆(s).

slide-34
SLIDE 34

Global Convergence of PG for Softmax

  • Even though problem is non-convex, we have global convergence.
  • proof is detailed/asymptotic


 Theorem [Vanilla PG for Softmax Policy class]: Let Vθ(μ) = E_{s∼μ}[Vθ(s)] and consider the update θ ← θ + η∇θVθ(μ). Suppose the starting state distribution μ has full support over the state space. Then, for all states s, Vθ(s) → V⋆(s).

slide-35
SLIDE 35

Global Convergence of PG for Softmax

  • Even though problem is non-convex, we have global convergence.
  • proof is detailed/asymptotic
  • Rate could be exponentially slow in terms of #states
  • Issue: the softmax can have very flat gradients

 Theorem [Vanilla PG for Softmax Policy class]: Let Vθ(μ) = E_{s∼μ}[Vθ(s)] and consider the update θ ← θ + η∇θVθ(μ). Suppose the starting state distribution μ has full support over the state space. Then, for all states s, Vθ(s) → V⋆(s).

slide-36
SLIDE 36

Global Convergence: Softmax + Log Barrier regularization


 Theorem [PG: Softmax + Log Barrier]:
 Define Lλ(θ) := Vθ(μ) + (λ/(SA)) ∑_{s,a} log πθ(a|s) and run θ ← θ + η∇Lλ(θ), with μ = uniform over states and appropriate settings of λ and η. (S: #states, A: #actions, H: Horizon = 1/(1 − γ).) After S⁴A²H⁶/ϵ² iterations, we have for all states s, Vθ(s) ≥ V⋆(s) − ϵ.

slide-37
SLIDE 37

Global Convergence: Softmax + Log Barrier regularization

  • Even though problem is non-convex, we have a poly iteration complexity.

 Theorem [PG: Softmax + Log Barrier]:
 Define Lλ(θ) := Vθ(μ) + (λ/(SA)) ∑_{s,a} log πθ(a|s) and run θ ← θ + η∇Lλ(θ), with μ = uniform over states and appropriate settings of λ and η. (S: #states, A: #actions, H: Horizon = 1/(1 − γ).) After S⁴A²H⁶/ϵ² iterations, we have for all states s, Vθ(s) ≥ V⋆(s) − ϵ.

slide-38
SLIDE 38

Global Convergence: Softmax + Log Barrier regularization

  • Even though problem is non-convex, we have a poly iteration complexity.
  • Log barrier and uniform μ help with conditioning problems.
  • Proof is succinct; requires showing πθ(a|s) doesn’t become too small.
  • Log barrier reg = KL-regularization ≠ entropy regularization

 Theorem [PG: Softmax + Log Barrier]:
 Define Lλ(θ) := Vθ(μ) + (λ/(SA)) ∑_{s,a} log πθ(a|s) and run θ ← θ + η∇Lλ(θ), with μ = uniform over states and appropriate settings of λ and η. (S: #states, A: #actions, H: Horizon = 1/(1 − γ).) After S⁴A²H⁶/ϵ² iterations, we have for all states s, Vθ(s) ≥ V⋆(s) − ϵ.
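
A sketch of the regularized objective Lλ(θ) for a tabular softmax policy; `value` and `log_policy` are assumed helper functions (returning Vθ(μ) and the S×A table of log πθ(a|s)), not part of the talk.

```python
import numpy as np

def log_barrier_objective(theta, mu, lam, value, log_policy):
    """L_lambda(theta) = V_theta(mu) + (lam / (S*A)) * sum_{s,a} log pi_theta(a|s).

    The barrier term penalizes pi_theta(a|s) -> 0, which keeps the softmax
    gradients from vanishing; gradient ascent is run on L_lambda instead of V_theta.
    """
    S, A = theta.shape
    barrier = log_policy(theta).sum()     # sum_{s,a} log pi_theta(a|s)  (<= 0)
    return value(theta, mu) + lam / (S * A) * barrier
```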

slide-39
SLIDE 39

Preconditioning: The Natural Policy Gradient (NPG)

  • Practice: most methods are gradient based, usually variants of:


NPG [K. ‘01]; TRPO [Schulman ‘15]; PPO [Schulman ‘17]

slide-40
SLIDE 40

Preconditioning: The Natural Policy Gradient (NPG)

  • Practice: most methods are gradient based, usually variants of:


NPG [K. ‘01]; TRPO [Schulman ‘15]; PPO [Schulman ‘17]

  • NPG warps the distance metric to stretch the corners out (using the Fisher

information metric) to move ‘more’ near the boundaries. The update is:


F(θ) = E_{s,a∼πθ}[∇log πθ(a|s) ∇log πθ(a|s)⊤]

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)
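
A sketch of this preconditioned update, estimating the Fisher matrix from on-policy samples and applying a pseudo-inverse; `sample_state_actions`, `grad_log_pi`, and the gradient estimate `grad_V` are assumed inputs for illustration.

```python
import numpy as np

def npg_step(theta, grad_V, sample_state_actions, grad_log_pi, eta=0.1, n_samples=1000):
    """theta <- theta + eta * F(theta)^{-1} grad_V, with
    F(theta) = E_{s,a ~ pi_theta}[ grad log pi(a|s) grad log pi(a|s)^T ]."""
    d = theta.size
    F = np.zeros((d, d))
    for s, a in sample_state_actions(theta, n_samples):   # on-policy samples
        g = grad_log_pi(theta, s, a).ravel()               # shape (d,)
        F += np.outer(g, g) / n_samples
    direction = np.linalg.pinv(F) @ grad_V.ravel()          # pseudo-inverse handles singular F
    return theta + eta * direction.reshape(theta.shape)
```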

slide-41
SLIDE 41

NPG and “soft” policy iteration

  • The softmax policy class: πθ(a|s) ∝ exp(θs,a)
slide-42
SLIDE 42

NPG and “soft” policy iteration

  • The softmax policy class: πθ(a|s) ∝ exp(θs,a)
  • At iteration t, the NPG update rule

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)

 is equivalent to a “soft” policy iteration update rule:

π(a|s) ← π(a|s) exp(η Qπ(s, a)) / Z

slide-43
SLIDE 43

NPG and “soft” policy iteration

  • The softmax policy class: πθ(a|s) ∝ exp(θs,a)
  • At iteration t, the NPG update rule

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)

 is equivalent to a “soft” policy iteration update rule:

π(a|s) ← π(a|s) exp(η Qπ(s, a)) / Z

What happens for this non-convex update rule?

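A minimal sketch of this soft policy iteration step applied directly to the tabular policy probabilities (for the softmax class it coincides with the NPG step, since it just adds ηQπ to the logits up to a per-state constant); `pi` and `Q` are assumed S×A tables.

```python
import numpy as np

def soft_policy_iteration_step(pi, Q, eta):
    """pi(a|s) <- pi(a|s) * exp(eta * Q^pi(s, a)) / Z_s,  pi and Q of shape (S, A)."""
    # subtracting the per-state max of Q only changes Z_s, so it is a safe stabilizer
    unnormalized = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```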

slide-44
SLIDE 44

Global Convergence of NPG

Theorem [NPG]: Set η = (1 − γ)² log A.
 For the softmax policy class, we have after T iterations,

V(T)(ρ) ≥ V⋆(ρ) − 2/((1 − γ)² T)

slide-45
SLIDE 45

Global Convergence of NPG

  • Dimension-free iteration complexity (no dependence on S, A, μ)

Theorem [NPG]: Set η = (1 − γ)² log A.
 For the softmax policy class, we have after T iterations,

V(T)(ρ) ≥ V⋆(ρ) − 2/((1 − γ)² T)

slide-46
SLIDE 46

Global Convergence of NPG

  • Dimension-free iteration complexity (no dependence on S, A, μ)

  • Also a “fast rate”.

Theorem [NPG]: Set η = (1 − γ)² log A.
 For the softmax policy class, we have after T iterations,

V(T)(ρ) ≥ V⋆(ρ) − 2/((1 − γ)² T)

slide-47
SLIDE 47

Global Convergence of NPG

  • Dimension-free iteration complexity (no dependence on S, A, μ)

  • Also a “fast rate”.
  • Even though problem is non-convex, a mirror descent analysis applies.


Analysis idea from [Even-Dar, K., Mansour 2009]

Theorem [NPG]: Set η = (1 − γ)² log A.
 For the softmax policy class, we have after T iterations,

V(T)(ρ) ≥ V⋆(ρ) − 2/((1 − γ)² T)

slide-48
SLIDE 48

Global Convergence of NPG

  • Dimension-free iteration complexity (no dependence on S, A, μ)

  • Also a “fast rate”.
  • Even though problem is non-convex, a mirror descent analysis applies.


Analysis idea from [Even-Dar, K., Mansour 2009]

What about approximate/sampled gradients and large state spaces?

Theorem [NPG]: Set η = (1 − γ)² log A.
 For the softmax policy class, we have after T iterations,

V(T)(ρ) ≥ V⋆(ρ) − 2/((1 − γ)² T)

slide-49
SLIDE 49

Taking stock: “measures” and related work


What is the role of the “coverage measure” μ?

slide-50
SLIDE 50

Brittle policies if we train only from one configuration!

  • [Rajeswaran, Lowrey, Todorov, K. 2017]: showed policies optimized for a single

starting configuration are not robust!


slide-53
SLIDE 53

Brittle policies if we train only from one configuration!

  • [Rajeswaran, Lowrey, Todorov, K. 2017]: showed policies optimized for a single

starting configuration are not robust!


  • How to fix this?

Training from different starting configurations sampled as s0 ∼ μ fixes this.

slide-54
SLIDE 54

OpenAI: progress on dexterous hand manipulation

Trained with “domain randomization”. Basically, the measure over starting configurations s0 ∼ μ was diverse.

slide-57
SLIDE 57

OpenAI: progress on dexterous hand manipulation

Trained with “domain randomization”. Basically, the measure over starting configurations s0 ∼ μ was diverse.

  • How should we think about approximation/generalization?


(this is not an issue in supervised learning)

  • How should we think about the measure μ in the infinite state space case?


(μ lets us sidestep exploration…)

slide-58
SLIDE 58

Related Work:


  • optimization and generalization

Generalization:

slide-59
SLIDE 59

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

slide-60
SLIDE 60

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

  • [K. & Langford ’02] conservative policy iteration (CPI):
 provable guarantees in terms of ‘supervised learning’ error + μ
  • Related: PSDP [Bagnell et al. ’04], [Scherer & Geist ’14], MD-MPI [Geist et al. ’19]…
  • imitation learning

slide-61
SLIDE 61

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

  • [K. & Langford ’02] conservative policy iteration (CPI):
 provable guarantees in terms of ‘supervised learning’ error + μ
  • Related: PSDP [Bagnell et al. ’04], [Scherer & Geist ’14], MD-MPI [Geist et al. ’19]…
  • imitation learning

Optimization and global convergence:

slide-62
SLIDE 62

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

  • [K. & Langford ’02] conservative policy iteration (CPI):
 provable guarantees in terms of ‘supervised learning’ error + μ
  • Related: PSDP [Bagnell et al. ’04], [Scherer & Geist ’14], MD-MPI [Geist et al. ’19]…
  • imitation learning

Optimization and global convergence:

  • roots in [Even-Dar, K., Mansour 2009]
slide-63
SLIDE 63

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

  • [K. & Langford ’02] conservative policy iteration (CPI):
 provable guarantees in terms of ‘supervised learning’ error + μ
  • Related: PSDP [Bagnell et al. ’04], [Scherer & Geist ’14], MD-MPI [Geist et al. ’19]…
  • imitation learning

Optimization and global convergence:

  • roots in [Even-Dar, K., Mansour 2009]
  • CPI also gives a subgradient condition for policy search
  • [Scherer & Geist, ’14], [Bhandari & Russo]
slide-64
SLIDE 64

Related Work:


  • optimization and generalization

Generalization:

  • approx. dynamic programming requires worst-case ℓ∞ guarantees on errors.
 some relaxations possible: [Munos, 2005; Antos et al., 2008]

  • [K. & Langford ’02] conservative policy iteration (CPI):
 provable guarantees in terms of ‘supervised learning’ error + μ
  • Related: PSDP [Bagnell et al. ’04], [Scherer & Geist ’14], MD-MPI [Geist et al. ’19]…
  • imitation learning

Optimization and global convergence:

  • roots in [Even-Dar, K., Mansour 2009]
  • CPI also gives a subgradient condition for policy search
  • [Scherer & Geist, ’14], [Bhandari & Russo]
  • [Fazel et al. ’18]: global convergence for LQRs
slide-65
SLIDE 65

Part-II: Large State Spaces

Approximation and Generalization

slide-66
SLIDE 66

Policy Classes

πθ(a|s) is the probability of action a given state s, parameterized by θ:


πθ(a|s) ∝ exp(fθ(s, a))

slide-67
SLIDE 67

Policy Classes

πθ(a|s) is the probability of action a given state s, parameterized by θ:


πθ(a|s) ∝ exp(fθ(s, a))

  • Softmax policy class: fθ(s, a) = θs,a
slide-68
SLIDE 68

Policy Classes

πθ(a|s) is the probability of action a given state s, parameterized by θ:


πθ(a|s) ∝ exp(fθ(s, a))

  • Softmax policy class: fθ(s, a) = θs,a
  • Log-linear policy class: fθ(s, a) = θ ⋅ ϕ(s, a), where ϕ(s, a) ∈ ℝᵈ

slide-69
SLIDE 69

Policy Classes

πθ(a|s) is the probability of action a given state s, parameterized by θ:


πθ(a|s) ∝ exp(fθ(s, a))

  • Softmax policy class: fθ(s, a) = θs,a
  • Log-linear policy class: fθ(s, a) = θ ⋅ ϕ(s, a), where ϕ(s, a) ∈ ℝᵈ

  • Neural policy class: fθ(s, a) is a neural network
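
A small sketch of the three choices of fθ(s, a); the feature map and network forward function are illustrative assumptions.

```python
import numpy as np

def f_softmax(theta_table, s, a):
    return theta_table[s, a]            # one parameter per (s, a) pair

def f_log_linear(theta, phi, s, a):
    return theta @ phi(s, a)            # phi(s, a) in R^d, theta in R^d

def f_neural(net_params, forward, s, a):
    return forward(net_params, s, a)    # any differentiable network

def policy_from_f(f_values_for_state):
    """pi_theta(. | s) proportional to exp(f_theta(s, .))."""
    z = np.exp(f_values_for_state - f_values_for_state.max())
    return z / z.sum()
```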

slide-70
SLIDE 70

NPG & Log Linear Policy Classes

  • Feature vector ϕ(s, a) ∈ ℝᵈ,  πθ(a|s) ∝ exp(θ ⋅ ϕ_{s,a})

slide-71
SLIDE 71

NPG & Log Linear Policy Classes

  • Feature vector ϕ(s, a) ∈ ℝᵈ,  πθ(a|s) ∝ exp(θ ⋅ ϕ_{s,a})

  • At iteration t, the NPG update rule

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)

 is equivalent to the “soft” + approximate policy iteration update:

slide-72
SLIDE 72

NPG & Log Linear Policy Classes

  • Feature vector ϕ(s, a) ∈ ℝᵈ,  πθ(a|s) ∝ exp(θ ⋅ ϕ_{s,a})

  • At iteration t, the NPG update rule

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)

 is equivalent to the “soft” + approximate policy iteration update:

  • 1. Approximate Qθ with the features:

w⋆ ∈ argmin_w E_{s,a∼d(⋅|π,μ)}[(Qθ(s, a) − w ⋅ ϕ_{s,a})²]

 where d(⋅|π, μ) is the “on-policy” distribution starting from s0, a0 ∼ μ

slide-73
SLIDE 73

NPG & Log Linear Policy Classes

  • Feature vector ϕ(s, a) ∈ ℝᵈ,  πθ(a|s) ∝ exp(θ ⋅ ϕ_{s,a})

  • At iteration t, the NPG update rule

θ ← θ + η F(θ)⁻¹ ∇Vθ(s0)

 is equivalent to the “soft” + approximate policy iteration update:

  • 1. Approximate Qθ with the features:

w⋆ ∈ argmin_w E_{s,a∼d(⋅|π,μ)}[(Qθ(s, a) − w ⋅ ϕ_{s,a})²]

 where d(⋅|π, μ) is the “on-policy” distribution starting from s0, a0 ∼ μ

  • 2. Policy update:

π(a|s) ← π(a|s) exp(w⋆ ⋅ ϕ_{s,a}) / Zs

 (Zs is the normalizing constant)
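
A sketch of one NPG iteration for the log-linear class: fit w⋆ by least squares to Qπ on on-policy samples, then shift the parameters (which is exactly the multiplicative policy update above); `sample_onpolicy` is an assumed helper returning feature/Q-value pairs drawn from d(⋅|π, μ).

```python
import numpy as np

def npg_log_linear_step(theta, sample_onpolicy, eta=0.1, n_samples=5000):
    """One NPG iteration for pi_theta(a|s) proportional to exp(theta . phi(s, a)).

    Step 1: w* ~= argmin_w E_{s,a ~ d(.|pi,mu)} [(Q^pi(s,a) - w . phi(s,a))^2]
    Step 2: theta <- theta + eta * w*, i.e. pi(a|s) <- pi(a|s) exp(eta w*.phi(s,a)) / Z_s
    """
    Phi, Q = sample_onpolicy(theta, n_samples)        # Phi: (n, d) features, Q: (n,) values
    w_star, *_ = np.linalg.lstsq(Phi, Q, rcond=None)  # least-squares fit of Q with the features
    return theta + eta * w_star
```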

slide-74
SLIDE 74

Realizable case: NPG + log linear policy classes

slide-75
SLIDE 75

Realizable case: NPG + log linear policy classes

  • Realizability: Suppose that Qθ(s, a) is a linear function in ϕ(s, a)

slide-76
SLIDE 76

Realizable case: NPG + log linear policy classes

  • Realizability: Suppose that Qθ(s, a) is a linear function in ϕ(s, a)

  • Supervised learning error: our estimate ŵ_t has bounded regression error (say due to sampling):


E_{s,a∼d(⋅|π,μ)}[(Qθ(s, a) − ŵ_t ⋅ ϕ_{s,a})²] ≤ ϵ_stat

slide-77
SLIDE 77

Realizable case: NPG + log linear policy classes

  • Realizability: Suppose that Qθ(s, a) is a linear function in ϕ(s, a)

  • Supervised learning error: our estimate ŵ_t has bounded regression error (say due to sampling):


E_{s,a∼d(⋅|π,μ)}[(Qθ(s, a) − ŵ_t ⋅ ϕ_{s,a})²] ≤ ϵ_stat

  • Conditioning (i.e. “feature coverage”): ∥ϕ_{s,a}∥ ≤ 1, and define


κ = 1/σ_min(E_{s,a∼μ}[ϕ_{s,a} ϕ_{s,a}⊤])

slide-78
SLIDE 78

Realizable case: NPG + log linear policy classes

  • Realizability: Suppose that Qθ(s, a) is a linear function in ϕ(s, a)

  • Supervised learning error: our estimate ŵ_t has bounded regression error (say due to sampling):


E_{s,a∼d(⋅|π,μ)}[(Qθ(s, a) − ŵ_t ⋅ ϕ_{s,a})²] ≤ ϵ_stat

  • Conditioning (i.e. “feature coverage”): ∥ϕ_{s,a}∥ ≤ 1, and define


κ = 1/σ_min(E_{s,a∼μ}[ϕ_{s,a} ϕ_{s,a}⊤])

Theorem [NPG]: Let A: #actions, H: Horizon = 1/(1 − γ), and suppose the norm bound ∥ŵ_t∥ ≤ W.
 After T iterations, the NPG algorithm returns a π s.t.

V(T)(ρ) ≥ V⋆(ρ) − HW√(2 log A / T) − √(4AH³ κ ϵ_stat)

(the first term is the optimization error, the second the statistical error)

slide-79
SLIDE 79

NPG+Log Linear Case


(just notation for sample based approach)

slide-80
SLIDE 80

NPG+Log Linear Case


(just notation for sample based approach)

  • For a state-action distribution υ, define:


L(w; θ, υ) := E_{s,a∼υ}[(Qπθ(s, a) − w ⋅ ϕ_{s,a})²]

slide-81
SLIDE 81

NPG+Log Linear Case


(just notation for sample based approach)

  • For a state-action distribution υ, define:


L(w; θ, υ) := E_{s,a∼υ}[(Qπθ(s, a) − w ⋅ ϕ_{s,a})²]

  • The NPG update is equivalent to:
slide-82
SLIDE 82

NPG+Log Linear Case


(just notation for sample based approach)

  • For a state-action distribution υ, define:


L(w; θ, υ) := E_{s,a∼υ}[(Qπθ(s, a) − w ⋅ ϕ_{s,a})²]

  • The NPG update is equivalent to:
  • 1. Approximate Qθ with the features:

ŵ_t ≈ argmin_w L(w; θ, d(⋅|π, μ))

 where d(⋅|π, μ) is the “on-policy” distribution starting from s0, a0 ∼ μ

slide-83
SLIDE 83

NPG+Log Linear Case


(just notation for sample based approach)

  • For a state-action distribution υ, define:


L(w; θ, υ) := E_{s,a∼υ}[(Qπθ(s, a) − w ⋅ ϕ_{s,a})²]

  • The NPG update is equivalent to:
  • 1. Approximate Qθ with the features:

ŵ_t ≈ argmin_w L(w; θ, d(⋅|π, μ))

 where d(⋅|π, μ) is the “on-policy” distribution starting from s0, a0 ∼ μ

  • 2. Policy update:

π(a|s) ← π(a|s) exp(w⋆ ⋅ ϕ_{s,a}) / Zs

 (Zs is the normalizing constant)

slide-84
SLIDE 84

NPG: Conv. Rate with Approx+Est. Errors

slide-85
SLIDE 85

NPG: Conv. Rate with Approx+Est. Errors

  • Supervised learning error: Suppose the excess risk and approximation error are bounded as:


L(w(t); θ(t), d(t)) − L(w⋆(t); θ(t), d(t)) ≤ ϵ_stat
L(w⋆(t); θ(t), d(t)) ≤ ϵ_approx

slide-86
SLIDE 86

NPG: Conv. Rate with Approx+Est. Errors

  • Supervised learning error: Suppose the excess risk and approximation error are bounded as:


L(w(t); θ(t), d(t)) − L(w⋆(t); θ(t), d(t)) ≤ ϵ_stat
L(w⋆(t); θ(t), d(t)) ≤ ϵ_approx

  • Conditioning (i.e. “feature coverage”): ∥ϕ_{s,a}∥ ≤ 1, and define


κ = 1/σ_min(E_{s,a∼μ}[ϕ_{s,a} ϕ_{s,a}⊤])

slide-87
SLIDE 87

NPG: Conv. Rate with Approx+Est. Errors

  • Supervised learning error: Suppose the excess risk and approximation error are bounded as:


L(w(t); θ(t), d(t)) − L(w⋆(t); θ(t), d(t)) ≤ ϵ_stat
L(w⋆(t); θ(t), d(t)) ≤ ϵ_approx

  • Conditioning (i.e. “feature coverage”): ∥ϕ_{s,a}∥ ≤ 1, and define


κ = 1/σ_min(E_{s,a∼μ}[ϕ_{s,a} ϕ_{s,a}⊤])

Theorem [NPG]: Let A: #actions, H: Horizon = 1/(1 − γ).
 After T iterations, the NPG algorithm returns a π s.t.

V(T)(s0) ≥ V⋆(s0) − HW√(2 log A / T) − √(4AH³ (κ ⋅ ϵ_stat + ∥d⋆/μ∥∞ ⋅ ϵ_approx))

where ∥d⋆/μ∥∞ := max_{s,a} d⋆(s, a)/μ(s, a) and d⋆ is the state-action distribution of the comparator (optimal) policy.

slide-88
SLIDE 88

Thank you!

  • theory foundations of PG methods: optimization and approximation guarantees

Alekh Agarwal Jason Lee Gaurav Mahajan

slide-89
SLIDE 89

Thank you!

  • theory foundations of PG methods: optimization and approximation guarantees
  • PG methods effective due to their approximation power

Alekh Agarwal Jason Lee Gaurav Mahajan

slide-90
SLIDE 90

Thank you!

  • theory foundations of PG methods: optimization and approximation guarantees
  • PG methods effective due to their approximation power
  • conceptually (and technically) important issues for progress:

Alekh Agarwal Jason Lee Gaurav Mahajan

slide-91
SLIDE 91

Thank you!

  • theory foundations of PG methods: optimization and approximation guarantees
  • PG methods effective due to their approximation power
  • conceptually (and technically) important issues for progress:
  • exploration: we assumed a good coverage “μ” but this should be learned


(see the PC-PG paper!)

Alekh Agarwal Jason Lee Gaurav Mahajan

slide-92
SLIDE 92

Thank you!

  • theory foundations of PG methods: optimization and approximation guarantees
  • PG methods effective due to their approximation power
  • conceptually (and technically) important issues for progress:
  • exploration: we assumed a good coverage “μ” but this should be learned


(see the PC-PG paper!)

  • representation/transfer learning: theory of RL is different from SL.

Alekh Agarwal Jason Lee Gaurav Mahajan