The Provable Effectiveness of Policy Gradient Methods in Reinforcement Learning
Sham Kakade
University of Washington & Microsoft Research
(with Alekh Agarwal, Jason Lee, and Gaurav Mahajan)
Policy Optimization in RL
[AlphaZero, Silver et al., 2017] [OpenAI Five, 2018] [OpenAI, 2019]
A framework for RL
A policy maps states to actions: π : States → Actions.
Running π from s0 generates a trajectory s0, a0, r0, s1, a1, r1, …
With discount factor γ, the value of π is
$$V^\pi(s_0) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right].$$
Goal: find a policy π maximizing V^π(s0).
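To make the objective concrete, here is a minimal Monte Carlo sketch of estimating V^π(s0) in a small tabular MDP; the arrays P, R, the policy, and all names are illustrative placeholders, not anything from the talk.

```python
import numpy as np

def estimate_value(P, R, policy, s0, gamma=0.99, horizon=200, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of V^pi(s0) = E_pi[ sum_t gamma^t r_t ].

    P[s, a] is a probability vector over next states, R[s, a] is the reward,
    and policy[s] is a probability vector over actions (all illustrative).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):  # truncate the infinite sum; bias is at most gamma^horizon / (1 - gamma)
            a = rng.choice(n_actions, p=policy[s])
            ret += discount * R[s, a]
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])
        total += ret
    return total / n_rollouts
```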
Challenges in RL
- Exploration (the environment may be unknown)
- Credit assignment (due to delayed rewards)
Example: Dexterous Robotic Hand Manipulation [OpenAI, 2019]
- hand state: joint angles/velocities
- cube state: configuration
- actions: forces applied to actuators
The state-action value of a policy π:
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\; a_0 = a\right].$$
Policy improvement: update π(s) ← argmax_a Q^π(s, a).
Use sampling/supervised learning + deep learning to approximate these quantities ("deep RL"?).
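A minimal sketch of this policy-iteration idea in the tabular, known-model case (the arrays P, R and the function name are illustrative): evaluate Q^π exactly by solving a linear system, then improve greedily.

```python
import numpy as np

def policy_iteration_step(P, R, policy, gamma=0.99):
    """One exact policy-iteration step: evaluate Q^pi, then act greedily.

    P has shape (S, A, S), R has shape (S, A), policy has shape (S,) with integer actions.
    """
    S, A = R.shape
    P_pi = P[np.arange(S), policy]                         # (S, S): transitions under pi
    R_pi = R[np.arange(S), policy]                         # (S,):  rewards under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)    # V^pi = (I - gamma P_pi)^{-1} R_pi
    Q = R + gamma * P @ V                                  # Q^pi(s, a) = R(s, a) + gamma E[V^pi(s')]
    return Q.argmax(axis=1), Q                             # greedy improvement: argmax_a Q^pi(s, a)
```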
[Bertsekas & Tsitsiklis ’97] provides first systematic analysis of RL with (worst case) “function approximation”.
Policy gradient methods: directly do gradient ascent on the value,
$$\theta \leftarrow \theta + \eta \nabla_\theta V^{\pi_\theta}(s_0)$$
(the gradient is taken through the neural net parameterization of πθ)
(the expectation is under the states and actions visited under πθ)
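A minimal sketch of how this gradient is estimated from rollouts: a REINFORCE-style estimator, specialized to a tabular softmax policy for simplicity. The `sample_episode` callback and all names are illustrative placeholders.

```python
import numpy as np

def reinforce_gradient(theta, sample_episode, gamma=0.99, n_episodes=100, seed=0):
    """Monte Carlo estimate of grad_theta V^{pi_theta}(s0).

    theta has shape (S, A) and pi_theta(a|s) = softmax(theta[s]).
    sample_episode(pi, rng) should return a list of (s, a, r) tuples obtained
    by running pi from s0 (an illustrative callback, not a library function).
    """
    rng = np.random.default_rng(seed)
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        episode = sample_episode(pi, rng)
        rewards = np.array([r for _, _, r in episode])
        discounts = gamma ** np.arange(len(rewards))
        for t, (s, a, _) in enumerate(episode):
            weighted_return = np.sum(discounts[t:] * rewards[t:])  # gamma^t * (return-to-go from t)
            grad_logpi = -pi[s].copy()                             # grad_theta[s] log pi_theta(a|s)
            grad_logpi[a] += 1.0                                   #   = e_a - pi(.|s) for the softmax
            grad[s] += weighted_return * grad_logpi
    return grad / n_episodes
```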
Supervised Learning: gradient descent works well in practice (not sensitive to initialization).

Reinforcement Learning: the optimization landscape has "very" flat regions, and gradients can be exponentially small in the "horizon" due to lack of exploration.

Lemma [Higher-order vanishing gradients] (see also Thrun '92): Suppose there are S states in the MDP, with S ≤ 1/(1 − γ). With random initialization, for all k-th higher-order gradients with k < S/log(S), the spectral norm of the gradient is bounded by 2^{−S/2}.

This talk: policy gradient methods are among the most widely used practical tools. Can we get any provable handle on them?
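For a sense of scale (an illustrative calculation, not from the lemma itself): with S = 40 states, the bound on every gradient of order k < S/log(S) is

```latex
2^{-S/2}\Big|_{S=40} \;=\; 2^{-20} \;=\; \tfrac{1}{1{,}048{,}576} \;\approx\; 9.5\times 10^{-7},
```

so at a random initialization the landscape looks essentially flat to any gradient-based method.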
We provide provable global convergence and generalization guarantees, addressing:
- curvature + non-convexity
- generalization and distribution shift
Policy Optimization over the "softmax" policy class (let's start simple!)
πθ(a|s), the probability of action a given state s, is
$$\pi_\theta(a|s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})}.$$
The policy optimization problem $\max_\theta V^{\pi_\theta}(s_0)$ is non-convex. Do we have global convergence?
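A minimal sketch of this tabular softmax parameterization (all names illustrative): θ is an S×A table, and each row is mapped through a softmax.

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) = exp(theta[s, a]) / sum_{a'} exp(theta[s, a']), computed row-wise.

    theta has shape (S, A); returns an (S, A) array whose rows sum to 1.
    """
    z = theta - theta.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

# Example: 2 states, 3 actions.
theta = np.zeros((2, 3))
theta[0, 1] = 2.0                                  # raising theta[0, 1] makes action 1 likelier in state 0
print(softmax_policy(theta))
```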
Theorem [Vanilla PG for the softmax policy class]: Let Vθ(μ) = E_{s∼μ}[Vθ(s)] and run θ ← θ + η∇θVθ(μ). Suppose μ has full support over the state space. Then, for all states s, Vθ(s) → V⋆(s).
Global Convergence: Softmax + Log Barrier regularization
Add a log barrier (related to entropy regularization) so that πθ(a|s) doesn't become too small:
$$L_\lambda(\theta) := V_\theta(\mu) + \frac{\lambda}{SA}\sum_{s,a} \log \pi_\theta(a|s), \qquad \theta \leftarrow \theta + \eta \nabla L_\lambda(\theta).$$
Theorem [PG: Softmax + Log Barrier]: Let S = #states, A = #actions, H = horizon = 1/(1 − γ), and take μ = uniform over states. With appropriate settings of λ and η, after $S^4 A^2 H^6 / \epsilon^2$ iterations we have, for all s, Vθ(s) ≥ V⋆(s) − ϵ.
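A minimal sketch of one log-barrier-regularized gradient step, assuming some estimator `grad_value` of ∇θVθ(μ) is available; that callback and all names are illustrative.

```python
import numpy as np

def log_barrier_pg_step(theta, grad_value, lam, eta):
    """theta <- theta + eta * grad L_lambda(theta), with
    L_lambda(theta) = V_theta(mu) + (lam / (S*A)) * sum_{s,a} log pi_theta(a|s).

    theta has shape (S, A); grad_value(theta) returns an estimate of grad V_theta(mu).
    """
    S, A = theta.shape
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)              # pi_theta(a|s)
    # d/d theta[s, a'] of sum_a log pi_theta(a|s) equals 1 - A * pi_theta(a'|s),
    # so the barrier term contributes (1 - A * pi) / (S * A) to the gradient.
    grad_barrier = (1.0 - A * pi) / (S * A)
    return theta + eta * (grad_value(theta) + lam * grad_barrier)
```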
Preconditioning: The Natural Policy Gradient (NPG)
NPG [K. '01]; TRPO [Schulman '15]; PPO [Schulman '17]
Precondition the gradient with the Fisher information metric (to move 'more' near the boundaries of the simplex). The update is:
$$F(\theta) = \mathbb{E}_{s,a\sim\pi_\theta}\left[\nabla \log \pi_\theta(a|s)\,\nabla \log \pi_\theta(a|s)^\top\right], \qquad \theta \leftarrow \theta + \eta F(\theta)^{-1}\nabla V_\theta(s_0).$$
For the softmax class, this is equivalent to a "soft" policy iteration update rule:
$$\pi(a|s) \leftarrow \frac{\pi(a|s)\,\exp(\eta Q^\pi(s,a))}{Z_s}.$$
What happens for this non-convex update rule?
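A minimal sketch of the equivalent "soft" policy iteration update in the tabular case (here Q is an estimate of Q^π; names illustrative):

```python
import numpy as np

def npg_soft_update(pi, Q, eta):
    """pi(a|s) <- pi(a|s) * exp(eta * Q(s, a)) / Z_s for the tabular softmax class.

    pi and Q both have shape (S, A); Q is (an estimate of) Q^pi.
    """
    # Shifting Q by its per-state max only rescales each row, and that rescaling
    # cancels in the normalization, so this is the same update done stably.
    new_pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    return new_pi / new_pi.sum(axis=1, keepdims=True)   # divide by Z_s
```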
Theorem [NPG]: Set η = (1 − γ)² log A. For the softmax policy class, after T iterations,
$$V^{(T)}(\rho) \;\ge\; V^\star(\rho) - \frac{2}{(1-\gamma)^2\, T}$$
(note: the rate has no dependence on S, A, or μ).
Analysis idea from [Even-Dar, K., Mansour 2009].
What about approximate/sampled gradients and large state spaces? And what is the role of the "coverage measure" μ?
Brittle policies if we train only from one configuration!
Policies trained from a single starting configuration s0 are not robust.
Training from different starting configurations, sampled as s0 ∼ μ, fixes this.
Trained with "domain randomization": basically, the starting measure μ was diverse.
(this is not an issue in supervised learning)
(μ lets us sidestep exploration…)
Generalization:
- prior guarantees are stated in terms of ℓ∞ errors; some relaxations are possible [Munos, 2005; Antos et al., 2008]
- here: provable guarantees in terms of a 'supervised learning' error + distribution shift relative to μ

Optimization and global convergence:
πθ(a|s), the probability of action a given s, parameterized by θ:
$$\pi_\theta(a|s) \propto \exp(f_\theta(s,a))$$
Log-linear policies: $f_\theta(s,a) = \theta \cdot \phi(s,a)$, where $\phi(s,a) \in \mathbb{R}^d$.
Neural policies: $f_\theta(s,a)$ is a neural network.
NPG with log-linear policies: features ϕ(s, a) ∈ ℝ^d and πθ(a|s) ∝ exp(θ · ϕ_{s,a}).
The update θ ← θ + ηF(θ)^{-1}∇Vθ(s0) is equivalent to the "soft" + approximate policy iteration update.
Fit Qθ with the features:
$$w^\star \in \arg\min_w \; \mathbb{E}_{s,a\sim d(\cdot|\pi,\mu)}\left[(Q^\theta(s,a) - w\cdot\phi_{s,a})^2\right],$$
where d(·|π, μ) is the "on-policy" distribution starting from s0, a0 ∼ μ.
Then update the policy:
$$\pi(a|s) \leftarrow \pi(a|s)\exp(w^\star\cdot\phi_{s,a})/Z_s$$
(Z_s is the normalizing constant).
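A minimal sketch of one such NPG step with log-linear policies. The on-policy sampler, its Q-value estimates, and all names are illustrative placeholders, and the step size η is made explicit here.

```python
import numpy as np

def q_npg_step(theta, phi, sample_onpolicy, eta, n_samples=10_000, seed=0):
    """One NPG step for pi_theta(a|s) proportional to exp(theta . phi[s, a]).

    phi has shape (S, A, d). sample_onpolicy(pi, n, rng) -> (states, actions, q_values),
    with (s, a) drawn from the on-policy distribution d(.|pi, mu) and q_values
    estimating Q^{pi_theta}(s, a); the sampler is an illustrative callback.
    """
    rng = np.random.default_rng(seed)
    logits = phi @ theta                                  # (S, A)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    # 1) Regression: w_hat ~ argmin_w E_{s,a ~ d}[ (Q(s,a) - w . phi[s,a])^2 ]
    states, actions, q_values = sample_onpolicy(pi, n_samples, rng)
    X = phi[states, actions]                              # (n, d) feature matrix
    w_hat, *_ = np.linalg.lstsq(X, q_values, rcond=None)
    # 2) The multiplicative update pi(a|s) <- pi(a|s) exp(eta * w_hat . phi[s,a]) / Z_s
    #    is, for this policy class, just a step in parameter space:
    return theta + eta * w_hat
```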
Assumptions:
- Qθ(s, a) is a linear function in ϕ(s, a).
- ŵ_t has bounded regression error (say, due to sampling):
$$\mathbb{E}_{s,a\sim d(\cdot|\pi,\mu)}\left[(Q^\theta(s,a) - \hat{w}_t\cdot\phi_{s,a})^2\right] \le \epsilon_{\mathrm{stat}}.$$
- ‖ϕ_{s,a}‖ ≤ 1, and define $\kappa = 1/\sigma_{\min}\!\left(\mathbb{E}_{s,a\sim\mu}[\phi_{s,a}\phi_{s,a}^\top]\right)$.

Theorem [NPG]: Let A = #actions, H = horizon = 1/(1 − γ), and assume the norm bound ‖ŵ_t‖ ≤ W. After T iterations, the NPG algorithm returns a π such that
$$V^{(T)}(\rho) \;\ge\; V^\star(\rho) - HW\sqrt{\frac{2\log A}{T}} - \sqrt{4AH^3\,\kappa\,\epsilon_{\mathrm{stat}}}.$$
Sample-based version (just notation for the sample-based approach): for a distribution υ over state-action pairs, define
$$L(w;\theta,\upsilon) := \mathbb{E}_{s,a\sim\upsilon}\left[(Q^{\pi_\theta}(s,a) - w\cdot\phi_{s,a})^2\right].$$
Fit Qθ with the features: $\hat{w}_t \approx \arg\min_w L(w;\theta, d(\cdot|\pi,\mu))$, where d(·|π, μ) is the "on-policy" distribution starting from s0, a0 ∼ μ.
Then update π(a|s) ← π(a|s) exp(w⋆ · ϕ_{s,a})/Z_s (Z_s is the normalizing constant).
Assume, for each iteration t:
- excess risk (statistical error): $L(w^{(t)};\theta^{(t)},d^{(t)}) - L(w^{(t)}_\star;\theta^{(t)},d^{(t)}) \le \epsilon_{\mathrm{stat}}$
- approximation error: $L(w^{(t)}_\star;\theta^{(t)},d^{(t)}) \le \epsilon_{\mathrm{approx}}$
- ‖ϕ_{s,a}‖ ≤ 1, and define $\kappa = 1/\sigma_{\min}\!\left(\mathbb{E}_{s,a\sim\mu}[\phi_{s,a}\phi_{s,a}^\top]\right)$.
Theorem [NPG]: Let A = #actions and H = horizon = 1/(1 − γ). After T iterations, the NPG algorithm returns a π such that
$$V^{(T)}(s_0) \;\ge\; V^\star(s_0) - HW\sqrt{\frac{2\log A}{T}} - \sqrt{4AH^3\left(\kappa\,\epsilon_{\mathrm{stat}} + \left\|\frac{d^\star}{\mu}\right\|_\infty \epsilon_{\mathrm{approx}}\right)},$$
where $\left\|\frac{a}{b}\right\|_\infty = \max_i \frac{a_i}{b_i}$.
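As a simple illustrative bound on the distribution-mismatch term (not from the theorem itself): if μ is uniform over N state-action pairs, then since d⋆ is a probability distribution,

```latex
\left\|\frac{d^\star}{\mu}\right\|_\infty
  \;=\; \max_{s,a}\frac{d^\star(s,a)}{\mu(s,a)}
  \;\le\; \frac{1}{1/N} \;=\; N ,
```

so a diverse (e.g., uniform) coverage measure μ keeps this coefficient under control.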
Alekh Agarwal, Jason Lee, Gaurav Mahajan
On constructing a good coverage measure μ: see the PC-PG paper!