SLIDE 1

Nonlinear Distributional Gradient Temporal Difference Learning

Chao Qu¹  Shie Mannor²  Huan Xu³

¹Ant Financial Services Group
²Faculty of Electrical Engineering, Technion
³H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech
SLIDE 2

Distributional reinforcement learning has gained much attention recently [Bellemare et al., 2017]. It explicitly considers the stochastic nature of the long-term return Z(s, a). The recursion of Z(s, a) is described by the distributional Bellman equation

  Z(s, a) =_D R(s, a) + γ Z(s′, a′),

where =_D stands for "equal in distribution".

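As an illustration (not from the slides), here is a minimal numpy sketch of applying the distributional Bellman operator to an empirical sample of the return distribution; the sample size and distribution parameters are hypothetical:

```python
import numpy as np

def bellman_target_samples(rewards, next_return_samples, gamma=0.99):
    """Apply the distributional Bellman operator T empirically:
    each target sample is r + gamma * z', with z' drawn from Z(s', a').
    `rewards` and `next_return_samples` both have shape (n,)."""
    return rewards + gamma * next_return_samples

# Hypothetical example: Z(s', a') approximated by 5 Monte Carlo samples.
rng = np.random.default_rng(0)
z_next = rng.normal(loc=10.0, scale=2.0, size=5)  # samples of Z(s', a')
r = np.full(5, 1.0)                               # observed rewards R(s, a)
print(bellman_target_samples(r, z_next))          # samples of (T Z)(s, a)
```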
SLIDE 3

Distributional gradient temporal difference learning

We consider a distributional counterpart of Gradient Temporal Difference Learning [Sutton et al., 2008]. Properties:

  • Convergence in the off-policy setting.
  • Convergence with nonlinear function approximation.
  • Captures the distributional nature of the long-term return.
SLIDE 4

To measure the distance between the distributions Z(s, a) and T Z(s, a), we introduce the Cramér distance. Suppose there are two distributions P and Q whose cumulative distribution functions are F_P and F_Q respectively; then the square root of the Cramér distance between P and Q is

  ℓ2(P, Q) := ( ∫_{−∞}^{∞} (F_P(x) − F_Q(x))² dx )^{1/2}.

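A small illustration (not part of the slides): for discrete distributions supported on a common uniform grid, the integral reduces to a sum over grid cells. A minimal numpy sketch, where the grid and probabilities are hypothetical:

```python
import numpy as np

def cramer_distance_sq(p, q, dx):
    """Squared Cramér distance between two discrete distributions given as
    probability vectors p, q on a uniform grid with spacing dx: the integral
    of (F_P - F_Q)^2 becomes a sum over grid cells."""
    F_p = np.cumsum(p)
    F_q = np.cumsum(q)
    return np.sum((F_p - F_q) ** 2) * dx

# Hypothetical example: two distributions on 5 atoms spaced 0.5 apart.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.0, 0.3, 0.3, 0.3, 0.1])
print(np.sqrt(cramer_distance_sq(p, q, dx=0.5)))   # this is l2(P, Q)
```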
SLIDE 5

Denote the (cumulative) distribution function of Z(s) by F_θ(s, z), and let G_θ(s, z) be the distribution function of T Z(s). The distributional mean squared projected Bellman error (D-MSPBE) objective is

  minimize_θ  J(θ) := ‖Φ_θᵀ D (F_θ − G_θ)‖²_{(Φ_θᵀ D Φ_θ)⁻¹},

where ‖x‖²_M := xᵀ M x and D is the diagonal matrix of state visitation probabilities d(s).

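To make the objective concrete, here is a minimal numpy sketch (my own illustration, with hypothetical sizes) that evaluates the D-MSPBE from stacked CDF values and their Jacobian:

```python
import numpy as np

def d_mspbe(Phi, D, F, G):
    """J(theta) = || Phi^T D (F - G) ||^2 in the (Phi^T D Phi)^{-1} norm.
    Phi: (n, d) Jacobian of CDF values w.r.t. theta, rows indexed by (s_i, z_j).
    D:   (n, n) diagonal weighting (state distribution d(s)).
    F, G: (n,) stacked CDF values of Z and of T Z."""
    v = Phi.T @ D @ (F - G)             # Phi_theta^T D (F_theta - G_theta)
    M = Phi.T @ D @ Phi                 # Phi_theta^T D Phi_theta
    return v @ np.linalg.solve(M, v)    # v^T M^{-1} v

# Hypothetical toy sizes: n = 6 (state, atom) pairs, d = 2 parameters.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 2))
D = np.diag(np.full(6, 1 / 6))
F, G = rng.uniform(size=6), rng.uniform(size=6)
print(d_mspbe(Phi, D, F, G))
```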
SLIDE 6
  • The value distribution F_θ(s, z) is discrete within the range [V_min, V_max] with m atoms (see the sketch below).
  • φ_θ(s, z) = ∂F_θ(s, z)/∂θ and (Φ_θ)_((i,j),l) = ∂F_θ(s_i, z_j)/∂θ_l.
  • Project onto the space spanned by Φ_θ w.r.t. the Cramér distance to obtain the D-MSPBE.
  • Use SGD and the weight duplication trick to optimize it.
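One concrete way to realize such a parametrization (a sketch under assumptions of my own, not necessarily the paper's network): a softmax over m atoms yields a discrete value distribution whose CDF at each atom is a cumulative sum; φ_θ(s, z_j) would then come from differentiating these CDF values w.r.t. the parameters, e.g. with autodiff.

```python
import numpy as np

V_MIN, V_MAX, M = -10.0, 10.0, 51        # hypothetical support range and atom count
atoms = np.linspace(V_MIN, V_MAX, M)     # z_1, ..., z_m

def cdf_values(logits):
    """F_theta(s, z_j) for all atoms: softmax probabilities, then a cumsum.
    `logits` (shape (M,)) would come from a network evaluated at state s."""
    p = np.exp(logits - logits.max())    # numerically stable softmax
    p /= p.sum()
    return np.cumsum(p)                  # monotone discrete CDF

# Hypothetical logits for one state; F holds F_theta(s, z_j) for each atom.
rng = np.random.default_rng(2)
F = cdf_values(rng.normal(size=M))
print(F[0], F[-1])                       # F[-1] equals 1.0 up to float error
```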
SLIDE 7

Distributional GTD2

Input: step sizes α_t, β_t; policy π.
for t = 0, 1, ... do

  w_{t+1} = w_t + β_t Σ_{j=1}^m ( −φ_θt(s_t, z_j)ᵀ w_t + δ_θt ) φ_θt(s_t, z_j)

  θ_{t+1} = Γ[ θ_t + α_t { Σ_{j=1}^m ( φ_θt(s_t, z_j) − φ_θt(s_{t+1}, (z_j − r_t)/γ) ) φ_θt(s_t, z_j)ᵀ w_t − h_t } ]

end for

Here Γ : ℝ^d → ℝ^d is a projection onto a compact set C with a smooth boundary,

  h_t = Σ_{j=1}^m ( δ_θt − w_tᵀ φ_θt(s_t, z_j) ) ∇²F_θt(s_t, z_j) w_t,

and δ_θt = F_θt(s_{t+1}, (z_j − r_t)/γ) − F_θt(s_t, z_j).

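A minimal numpy sketch of one distributional GTD2 iteration (my own illustration; for simplicity it uses a fixed feature map, i.e. the linear case where h_t = 0, it omits the projection Γ, and all shapes are hypothetical):

```python
import numpy as np

def gtd2_step(theta, w, phi_t, phi_next, F_t, F_next, alpha, beta):
    """One distributional GTD2 update in the linear case (h_t = 0).
    phi_t, phi_next: (m, d) rows phi(s_t, z_j) and phi(s_{t+1}, (z_j - r_t)/gamma).
    F_t, F_next: (m,) CDF values at the same points, so delta = F_next - F_t.
    The projection Gamma onto the compact set C is omitted here."""
    delta = F_next - F_t                                  # temporal distribution difference
    w_new = w + beta * (phi_t.T @ (delta - phi_t @ w))    # fast timescale
    theta_new = theta + alpha * ((phi_t - phi_next).T @ (phi_t @ w))  # slow timescale
    return theta_new, w_new

# Hypothetical toy problem: m = 4 atoms, d = 3 parameters.
rng = np.random.default_rng(3)
theta, w = np.zeros(3), np.zeros(3)
phi_t, phi_next = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
F_t, F_next = np.sort(rng.uniform(size=4)), np.sort(rng.uniform(size=4))
theta, w = gtd2_step(theta, w, phi_t, phi_next, F_t, F_next, alpha=0.01, beta=0.1)
print(theta, w)
```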
SLIDE 8

Some remarks:

  • Use the temporal distribution difference δ_θt instead of the temporal difference in GTD2.
  • Summation over z_j, which corresponds to the integral in the Cramér distance.
  • h_t results from the nonlinear function approximation and is zero in the linear case. It can be evaluated using forward and backward propagation (see the sketch below).

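Since h_t is a weighted sum of Hessian-vector products ∇²F_θ(s_t, z_j) w_t, one way to evaluate it is double backpropagation (one forward pass, two backward passes), without ever forming the Hessian. A minimal PyTorch sketch, where the toy CDF head and all shapes are hypothetical assumptions of mine:

```python
import torch

def compute_h(F_values, theta, w, delta):
    """h_t = sum_j (delta_j - w^T phi_j) * Hessian(F_theta(s_t, z_j)) @ w.
    F_values: m scalar CDF outputs, each differentiable w.r.t. theta."""
    h = torch.zeros_like(theta)
    for j, F_j in enumerate(F_values):
        # phi_j = dF_theta(s_t, z_j)/dtheta, kept in the graph for the 2nd backward
        (phi_j,) = torch.autograd.grad(F_j, theta, create_graph=True)
        # Hessian-vector product: d(phi_j . w)/dtheta = Hess(F_j) @ w
        (hvp,) = torch.autograd.grad(phi_j @ w, theta, retain_graph=True)
        h = h + (delta[j] - w @ phi_j.detach()) * hvp
    return h

# Hypothetical toy: theta in R^3, a made-up smooth CDF head per atom z_j.
theta = torch.randn(3, requires_grad=True)
w = torch.randn(3)
atom_weights = torch.randn(4, 3)                 # one row per atom
F_values = [torch.sigmoid(atom_weights[j] @ theta) for j in range(4)]
delta = torch.randn(4)                           # placeholder for delta_theta_t
print(compute_h(F_values, theta, w, delta))
```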
SLIDE 9

Theoretical Result

Theorem

Let (s_t, r_t, s′_t)_{t≥0} be a sequence of transitions. The positive step sizes in the algorithm satisfy Σ_{t=0}^∞ α_t = ∞, Σ_{t=0}^∞ β_t = ∞, Σ_{t=0}^∞ α_t² < ∞, Σ_{t=0}^∞ β_t² < ∞, and α_t/β_t → 0 as t → ∞. Assume that for any θ ∈ C and s ∈ S such that d(s) > 0, F_θ is three times continuously differentiable. Further assume that for each θ ∈ C, E[ Σ_{j=1}^m φ_θ(s, z_j) φ_θ(s, z_j)ᵀ ] is nonsingular. Then the algorithm converges with probability one as t → ∞.

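For instance (an example of mine, not from the slides), polynomial schedules α_t = (t+1)^(−0.9) and β_t = (t+1)^(−0.6) satisfy all of these conditions: both sums diverge, both squared sums converge, and α_t/β_t = (t+1)^(−0.3) → 0.

```python
def step_sizes(t, a=0.9, b=0.6):
    """Example schedules satisfying the theorem's conditions:
    any 0.5 < b < a <= 1 gives sum(alpha) = sum(beta) = infinity,
    sum(alpha^2), sum(beta^2) < infinity, and alpha_t / beta_t -> 0."""
    alpha = (t + 1) ** (-a)
    beta = (t + 1) ** (-b)
    return alpha, beta

for t in (0, 10, 1000):
    alpha, beta = step_sizes(t)
    print(t, alpha, beta, alpha / beta)   # the ratio shrinks like (t+1)^(-0.3)
```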
SLIDE 10

Distributional Greedy-GQ

Input: step sizes α_t, β_t; 0 ≤ η ≤ 1.
for t = 0, 1, ... do

  Q(s_{t+1}, a) = Σ_{j=1}^m z_j p_j(s_{t+1}, a), where p_j(s_{t+1}, a) is the density function w.r.t. F_θ((s_{t+1}, a)).

  a* = arg max_a Q(s_{t+1}, a)

  w_{t+1} = w_t + β_t Σ_{j=1}^m ( −φ_θt((s_t, a_t), z_j)ᵀ w_t + δ_θt ) φ_θt((s_t, a_t), z_j)

  θ_{t+1} = θ_t + α_t { Σ_{j=1}^m δ_θt φ_θt((s_t, a_t), z_j) − η φ_θt((s_{t+1}, a*), (z_j − r_t)/γ) ( φ_θt((s_t, a_t), z_j)ᵀ w_t ) }

end for

where δ_θt = F_θt((s_{t+1}, a*), (z_j − r_t)/γ) − F_θt((s_t, a_t), z_j).

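A small numpy illustration of the greedy step (mine, with hypothetical shapes): recover Q(s_{t+1}, a) as the mean of each action's value distribution, then pick the maximizing action.

```python
import numpy as np

def greedy_action(probs, atoms):
    """probs: (num_actions, m) probability mass p_j(s_{t+1}, a) per action.
    Q(s_{t+1}, a) = sum_j z_j * p_j(s_{t+1}, a); return argmax_a Q and Q."""
    q = probs @ atoms                   # expected return per action
    return int(np.argmax(q)), q

# Hypothetical setup: 3 actions, m = 5 atoms.
atoms = np.linspace(-1.0, 1.0, 5)       # z_1, ..., z_m
rng = np.random.default_rng(4)
probs = rng.dirichlet(np.ones(5), size=3)
a_star, q = greedy_action(probs, atoms)
print(a_star, q)
```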
SLIDE 11

Experimental Result

[Figures: (1) D-MSPBE vs. time step for Distributional GTD2 and Distributional TDC; (2) cumulative reward vs. episodes for C51, DQN, and Distributional Greedy-GQ; (3) kill counts vs. episodes for C51, DQN, and Distributional Greedy-GQ.]

SLIDE 12

Thank you!

Visit our poster today at Pacific Ballroom #33.