

SLIDE 1


Hessian Aided Policy Gradient

Z. Shen¹, A. Ribeiro², H. Hassani², H. Qian¹, C. Mi¹

¹ Department of Computer Science and Technology, Zhejiang University
² Department of Electrical and Systems Engineering, University of Pennsylvania

International Conference on Machine Learning, 2019


SLIDE 2


Outline

1. Motivation
   • Reinforcement Learning via Policy Optimization
   • Variance Reduction for Oblivious Optimization

2. Our Results/Contribution
   • Variance Reduction for Non-oblivious Optimization
   • Unbiased Policy Hessian Estimator



SLIDE 4


Policy Optimization as Stochastic Maximization

max_{θ ∈ R^d} J(θ) := E_{τ∼π_θ}[R(τ)]

MDP := (S, A, P, r, ρ_0, γ), with transition kernel P : S × A × S → [0, 1] and reward r : S × A → R.

Policy: π_θ(·|s) : A → [0, 1], ∀s ∈ S.

Trajectory: τ := (s_0, a_0, ..., a_{H−1}, s_H) ∼ π_θ, where a_h ∼ π_θ(·|s_h), s_{h+1} ∼ P(·|s_h, a_h), s_0 ∼ ρ_0(·).

Probability and discounted cumulative reward of a trajectory:

p(τ) := ρ_0(s_0) ∏_{h=0}^{H−1} P(s_{h+1}|s_h, a_h) π_θ(a_h|s_h)

R(τ) := ∑_{h=0}^{H−1} γ^h r(s_h, a_h)
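To make these objects concrete, here is a minimal sketch (Python/NumPy) that samples a trajectory τ ∼ π_θ from a small made-up MDP with a tabular softmax policy and evaluates p(τ) and R(τ); the MDP itself (two states, two actions, horizon 5) and all its numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, 2 actions, horizon H = 5 (all values are illustrative).
S, A, H, gamma = 2, 2, 5, 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],              # P[s, a, s'] = transition probability
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])               # r[s, a] = per-step reward
rho0 = np.array([1.0, 0.0])                          # initial-state distribution rho_0

def policy(theta, s):
    """Tabular softmax policy pi_theta(.|s); theta has shape (S, A)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory(theta):
    """tau = (s_0, a_0, ..., a_{H-1}, s_H): s_0 ~ rho_0, a_h ~ pi_theta(.|s_h), s_{h+1} ~ P(.|s_h, a_h)."""
    s = rng.choice(S, p=rho0)
    states, actions = [s], []
    for _ in range(H):
        a = rng.choice(A, p=policy(theta, s))
        actions.append(a)
        s = rng.choice(S, p=P[s, a])
        states.append(s)
    return states, actions

def traj_prob(theta, states, actions):
    """p(tau) = rho_0(s_0) * prod_h P(s_{h+1}|s_h, a_h) * pi_theta(a_h|s_h)."""
    p = rho0[states[0]]
    for h in range(H):
        p *= policy(theta, states[h])[actions[h]] * P[states[h], actions[h], states[h + 1]]
    return p

def discounted_return(states, actions):
    """R(tau) = sum_h gamma^h r(s_h, a_h)."""
    return sum(gamma ** h * r[states[h], actions[h]] for h in range(H))

theta = np.zeros((S, A))                             # uniform random policy
states, actions = sample_trajectory(theta)
print("p(tau) =", traj_prob(theta, states, actions), "  R(tau) =", discounted_return(states, actions))
```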


SLIDE 5


Policy Optimization with REINFORCE

max_{θ ∈ R^d} J(θ) := E_{τ∼π_θ}[R(τ)]

Non-oblivious: p(τ) depends on θ.

REINFORCE (SGD): θ_{t+1} := θ_t + η g(θ_t; S_τ) finds an ε-FOSP, i.e. a point θ_ε with ‖∇J(θ_ε)‖ ≤ ε, using O(1/ε^4) samples of τ, where

g(θ; S_τ) := (1/|S_τ|) ∑_{τ∈S_τ} R(τ) ∇ log p(τ; π_θ),   τ ∈ S_τ ∼ π_θ,

and ∇ log p(τ; π_θ) = ∑_{h=0}^{H−1} ∇ log π_θ(a_h|s_h), since the transition kernel does not depend on θ.
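A minimal sketch of the REINFORCE estimator g(θ; S_τ) for the same hypothetical toy MDP and tabular softmax policy as above (all values illustrative); for a softmax policy the per-step score ∇_θ log π_θ(a|s) has the closed form e_a − π_θ(·|s), added into row s of the parameter table.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical toy MDP and softmax policy as in the previous sketch (illustrative values).
S, A, H, gamma = 2, 2, 5, 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
rho0 = np.array([1.0, 0.0])

def policy(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory(theta):
    s = rng.choice(S, p=rho0)
    states, actions = [s], []
    for _ in range(H):
        a = rng.choice(A, p=policy(theta, s))
        actions.append(a)
        s = rng.choice(S, p=P[s, a])
        states.append(s)
    return states, actions

def grad_log_traj(theta, states, actions):
    """grad_theta log p(tau; pi_theta) = sum_h grad_theta log pi_theta(a_h|s_h);
    for a softmax policy the h-th term is e_{a_h} - pi_theta(.|s_h), added into row s_h."""
    g = np.zeros_like(theta)
    for s, a in zip(states[:-1], actions):
        g[s] -= policy(theta, s)
        g[s, a] += 1.0
    return g

def reinforce_gradient(theta, batch_size):
    """g(theta; S_tau) = (1/|S_tau|) sum_{tau in S_tau} R(tau) * grad log p(tau; pi_theta)."""
    g = np.zeros_like(theta)
    for _ in range(batch_size):
        states, actions = sample_trajectory(theta)
        R = sum(gamma ** h * r[states[h], actions[h]] for h in range(H))
        g += R * grad_log_traj(theta, states, actions)
    return g / batch_size

# One REINFORCE step of stochastic gradient ascent: theta_{t+1} = theta_t + eta * g(theta_t; S_tau).
theta, eta = np.zeros((S, A)), 0.1
theta = theta + eta * reinforce_gradient(theta, batch_size=64)
print(theta)
```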



SLIDE 7


Oblivious Stochastic Optimization

min_{θ ∈ R^d} L(θ) := E_{z∼p(z)}[L̃(θ; z)]   (1)

Oblivious: p(z) is independent of θ.

Stochastic Gradient Descent (SGD): θ_{t+1} := θ_t − η ∇L̃(θ_t; S_z) finds an ε-FOSP, i.e. ‖∇L(θ_ε)‖ ≤ ε, using O(1/ε^4) samples of z, where

L̃(θ; S_z) := (1/|S_z|) ∑_{z∈S_z} L̃(θ; z).
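For contrast with the policy-optimization setting, a minimal oblivious sketch on a made-up synthetic least-squares problem (all values illustrative): z = (x, y) is drawn from a fixed distribution that does not depend on θ, and SGD follows mini-batch gradients ∇L̃(θ_t; S_z).

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta_star = 5, np.ones(5)

def sample_z(n):
    """z = (x, y) drawn from a fixed distribution p(z) that does not depend on theta (oblivious)."""
    x = rng.normal(size=(n, d))
    y = x @ theta_star + 0.1 * rng.normal(size=n)
    return x, y

def grad_batch(theta, x, y):
    """grad of L~(theta; S_z) = (1/|S_z|) sum_z 0.5 (x^T theta - y)^2."""
    return x.T @ (x @ theta - y) / len(y)

theta, eta = np.zeros(d), 0.1
for t in range(200):
    x, y = sample_z(16)                              # fresh mini-batch S_z each step
    theta = theta - eta * grad_batch(theta, x, y)    # theta_{t+1} = theta_t - eta grad L~(theta_t; S_z)
print("distance to theta*:", np.linalg.norm(theta - theta_star))
```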


SLIDE 8


Variance Reduction

Oblivious Case

min_{θ ∈ R^d} L(θ) := E_{z∼p(z)}[L̃(θ; z)]   (2)

Oblivious: p(z) is independent of θ.

SPIDER: g_t := g_{t−1} + Δ_t, where

Δ_t := ∇L̃(θ_t; S_z) − ∇L̃(θ_{t−1}; S_z),   E_{S_z}[Δ_t] = ∇L(θ_t) − ∇L(θ_{t−1}),

θ_{t+1} := θ_t − η g_t,   (E[g_t] = ∇L(θ_t)),

finds an ε-FOSP, i.e. ‖∇L(θ_ε)‖ ≤ ε, using O(1/ε^3) samples of z, where

L̃(θ; S_z) := (1/|S_z|) ∑_{z∈S_z} L̃(θ; z).
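A minimal sketch of the SPIDER recursion on the same synthetic least-squares problem. The key point is that the same mini-batch S_z is evaluated at both θ_t and θ_{t−1}, so Δ_t is an unbiased estimate of ∇L(θ_t) − ∇L(θ_{t−1}). SPIDER-style methods also periodically re-estimate the gradient with a larger batch; that detail is not on the slide, so the restart period q and the batch sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, theta_star = 5, np.ones(5)

def sample_z(n):
    """z = (x, y) from a fixed distribution p(z), independent of theta (oblivious)."""
    x = rng.normal(size=(n, d))
    y = x @ theta_star + 0.1 * rng.normal(size=n)
    return x, y

def grad_batch(theta, x, y):
    """grad of L~(theta; S_z) for the squared loss 0.5 (x^T theta - y)^2, averaged over the batch."""
    return x.T @ (x @ theta - y) / len(y)

eta, q, big_batch, small_batch = 0.1, 10, 256, 8     # arbitrary illustrative choices
theta_prev, theta = None, np.zeros(d)
for t in range(200):
    if t % q == 0:
        x, y = sample_z(big_batch)
        g = grad_batch(theta, x, y)                  # periodic large-batch gradient estimate
    else:
        # Delta_t = grad L~(theta_t; S_z) - grad L~(theta_{t-1}; S_z) on the SAME batch S_z,
        # so E[Delta_t] = grad L(theta_t) - grad L(theta_{t-1}); then g_t = g_{t-1} + Delta_t.
        x, y = sample_z(small_batch)
        g = g + grad_batch(theta, x, y) - grad_batch(theta_prev, x, y)
    theta_prev = theta
    theta = theta - eta * g                          # theta_{t+1} = theta_t - eta g_t
print("distance to theta*:", np.linalg.norm(theta - theta_star))
```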


SLIDE 9


Variance Reduction

Non-oblivious Case?

max_{θ ∈ R^d} J(θ) := E_{τ∼π_θ}[R(τ)]   (3)

Non-oblivious: p(τ) depends on θ.

SPIDER applied naively: g_t := g_{t−1} + Δ_t, where

Δ_t := g(θ_t; S_τ) − g(θ_{t−1}; S_τ),   τ ∈ S_τ ∼ π_{θ_t},

θ_{t+1} := θ_t + η g_t,   g(θ; S_τ) := (1/|S_τ|) ∑_{τ∈S_τ} R(τ) ∇ log p(τ; π_θ).

Does E_{S_τ}[Δ_t] = ∇J(θ_t) − ∇J(θ_{t−1}) still hold? The trajectories in S_τ are drawn from π_{θ_t}, not from π_{θ_{t−1}}, so g(θ_{t−1}; S_τ) is no longer an unbiased estimate of ∇J(θ_{t−1}) and the naive difference Δ_t is biased.



SLIDE 11


Variance Reduction for Non-oblivious Optimization

Goal: an update θ_{t+1} := θ_t + η g_t with E[g_t] = ∇J(θ_t), built from the recursion g_t := g_{t−1} + Δ_t with E[Δ_t] = ∇J(θ_t) − ∇J(θ_{t−1}).

Let θ_a := a·θ_t + (1 − a)·θ_{t−1}, a ∈ [0, 1]. Then

∇J(θ_t) − ∇J(θ_{t−1}) = ∫_0^1 ∇²J(θ_a)·(θ_t − θ_{t−1}) da
                      = (∫_0^1 ∇²J(θ_a) da)·(θ_t − θ_{t−1})
                      = E_{a∼Uni([0,1])}[∇²J(θ_a)]·(θ_t − θ_{t−1})
                      = E[∇̃²(θ_a; τ_a)·(θ_t − θ_{t−1})],

where ∇̃²(θ_a; τ_a) is any estimator satisfying E_{τ_a}[∇̃²(θ_a; τ_a)] = ∇²J(θ_a).
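The chain of equalities is the fundamental theorem of calculus applied to ∇J along the segment from θ_{t−1} to θ_t, followed by rewriting the integral over a as an expectation. Here is a quick numerical sanity check of the identity on an arbitrary smooth test function (chosen here only as a stand-in for J):

```python
import numpy as np

rng = np.random.default_rng(4)

def grad(theta):
    """Gradient of the test function f(theta) = sum(sin(theta)) + 0.5 ||theta||^2 (a stand-in for J)."""
    return np.cos(theta) + theta

def hessian(theta):
    """Hessian of the same test function."""
    return np.diag(1.0 - np.sin(theta))

theta_prev, theta_cur = rng.normal(size=3), rng.normal(size=3)
dtheta = theta_cur - theta_prev

# Monte-Carlo estimate of E_{a ~ Uni([0,1])}[ Hessian(theta_a) ] (theta_t - theta_{t-1}),
# with theta_a = a * theta_t + (1 - a) * theta_{t-1}.
est = np.zeros(3)
n = 20_000
for _ in range(n):
    a = rng.uniform()
    est += hessian(a * theta_cur + (1 - a) * theta_prev) @ dtheta
est /= n

print("gradient difference :", grad(theta_cur) - grad(theta_prev))
print("Hessian average (MC):", est)                  # should agree up to Monte-Carlo error
```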


SLIDE 12


Variance Reduction

Non-oblivious Case!

max_{θ ∈ R^d} J(θ) := E_{τ∼π_θ}[R(τ)]   (4)

HAPG:

g_t := g_{t−1} + ∇̃²(θ_t, θ_{t−1}; S_{a,τ})·(θ_t − θ_{t−1})

θ_{t+1} := θ_t + η g_t,   (E[g_t] = ∇J(θ_t))

∇̃²(θ_t, θ_{t−1}; S_{a,τ}) := (1/|S_{a,τ}|) ∑_{(a,τ_a)∈S_{a,τ}} ∇̃²(θ_a; τ_a), where a ∼ Uni([0, 1]), τ_a ∼ π_{θ_a}, and θ_a := a·θ_t + (1 − a)·θ_{t−1}.

HAPG finds an ε-FOSP, i.e. ‖∇J(θ_ε)‖ ≤ ε, using O(1/ε^3) samples of τ.
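A minimal sketch of this recursion on the simplest possible instance, a one-step softmax bandit (H = 1, so a trajectory is a single action and R(τ) = r(a)). The rewards, step size, restart period, and batch sizes are made-up illustrative values, and the periodic large-batch gradient re-estimate is an addition in the spirit of SPIDER-type methods rather than a detail read off this slide. The single-trajectory Hessian estimator ∇̃²(θ_a; τ_a) used in the correction is the one made explicit on the next slide; for a softmax policy both ∇ log π_θ and ∇² log π_θ have closed forms.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical one-step bandit: K actions, deterministic reward r[a]; a trajectory is a single action.
K = 4
r = np.array([1.0, 0.2, -0.5, 0.7])

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad log pi_theta(a) = e_a - pi_theta (softmax policy)."""
    return np.eye(K)[a] - pi(theta)

def hess_estimate(theta, a):
    """Single-trajectory Hessian estimator nabla~^2(theta; tau); see the next slide for the formula."""
    p = pi(theta)
    score = np.eye(K)[a] - p
    hess_log = np.outer(p, p) - np.diag(p)           # grad^2 log pi_theta(a) for a softmax policy
    return r[a] * (np.outer(score, score) + hess_log)

def grad_estimate(theta, batch):
    """REINFORCE estimate g(theta; S_tau) = (1/|S_tau|) sum R(tau) grad log pi_theta(a)."""
    g = np.zeros(K)
    for _ in range(batch):
        a = rng.choice(K, p=pi(theta))
        g += r[a] * grad_log_pi(theta, a)
    return g / batch

eta, q, big_batch, small_batch = 0.5, 10, 500, 20    # arbitrary illustrative choices
theta_prev, theta = None, np.zeros(K)
for t in range(100):
    if t % q == 0:
        g = grad_estimate(theta, big_batch)          # periodic large-batch gradient estimate
    else:
        # HAPG correction: g_t = g_{t-1} + (1/|S_{a,tau}|) sum nabla~^2(theta_a; tau_a) (theta_t - theta_{t-1}),
        # with a ~ Uni([0,1]), theta_a = a*theta_t + (1-a)*theta_{t-1}, and tau_a ~ pi_{theta_a}.
        d = theta - theta_prev
        corr = np.zeros(K)
        for _ in range(small_batch):
            mix = rng.uniform()
            theta_a = mix * theta + (1 - mix) * theta_prev
            act = rng.choice(K, p=pi(theta_a))
            corr += hess_estimate(theta_a, act) @ d
        g = g + corr / small_batch
    theta_prev = theta
    theta = theta + eta * g                          # ascent step theta_{t+1} = theta_t + eta g_t
print("learned policy:", np.round(pi(theta), 3))     # mass should shift toward the best arm (a = 0)
```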



SLIDE 14


Unbiased Policy Hessian Estimator

∇J(θ) = ∫_τ R(τ) ∇p(τ; π_θ) dτ = ∫_τ p(τ; π_θ) · [R(τ) ∇ log p(τ; π_θ)] dτ

∇²J(θ) = ∫_τ R(τ) ∇p(τ; π_θ)[∇ log p(τ; π_θ)]^⊤ + p(τ; π_θ) · [R(τ) ∇² log p(τ; π_θ)] dτ
       = ∫_τ R(τ) p(τ; π_θ) {∇ log p(τ; π_θ)[∇ log p(τ; π_θ)]^⊤ + ∇² log p(τ; π_θ)} dτ

∇̃²(θ; τ) := R(τ){∇ log p(τ; π_θ)[∇ log p(τ; π_θ)]^⊤ + ∇² log p(τ; π_θ)},   τ ∼ π_θ.
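A minimal numerical check of this unbiasedness claim on the same hypothetical one-step softmax bandit as in the previous sketch, where p(τ; π_θ) = π_θ(a) and R(τ) = r(a): the Monte-Carlo average of the single-trajectory estimator ∇̃²(θ; τ) is compared against a finite-difference Hessian of J(θ) = ∑_a π_θ(a) r(a).

```python
import numpy as np

rng = np.random.default_rng(6)

# One-step softmax bandit (hypothetical rewards): p(tau; pi_theta) = pi_theta(a), R(tau) = r[a].
K = 4
r = np.array([1.0, 0.2, -0.5, 0.7])

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def exact_grad(theta):
    """Exact gradient of J(theta) = sum_a pi_theta(a) r[a]: grad J = (diag(pi) - pi pi^T) r."""
    p = pi(theta)
    return (np.diag(p) - np.outer(p, p)) @ r

def hess_estimate(theta, a):
    """nabla~^2(theta; tau) = R(tau) {grad log pi (grad log pi)^T + grad^2 log pi}."""
    p = pi(theta)
    score = np.eye(K)[a] - p                         # grad log pi_theta(a) = e_a - pi
    hess_log = np.outer(p, p) - np.diag(p)           # grad^2 log pi_theta(a), same for all a under softmax
    return r[a] * (np.outer(score, score) + hess_log)

theta = rng.normal(size=K)

# Monte-Carlo average of the single-trajectory estimator over tau ~ pi_theta.
n = 100_000
mc = np.zeros((K, K))
for _ in range(n):
    a = rng.choice(K, p=pi(theta))
    mc += hess_estimate(theta, a)
mc /= n

# Ground-truth grad^2 J via central finite differences of the exact gradient.
eps = 1e-5
H = np.zeros((K, K))
for j in range(K):
    e = np.zeros(K); e[j] = eps
    H[:, j] = (exact_grad(theta + e) - exact_grad(theta - e)) / (2 * eps)

# The gap should be small, on the order of Monte-Carlo noise (~1e-2 with this sample size).
print("max |MC average - finite-difference Hessian| =", np.abs(mc - H).max())
```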


SLIDE 15


Summary

First method that provably reduces the sample complexity of reaching an ε-FOSP of the RL objective from O(1/ε^4) to O(1/ε^3).
