SLIDE 1 Nonlinear Distributional Gradient Temporal Difference Learning
Chao Qu 1 Shie Mannor 2 Huan Xu 3
1 Ant Financial Services Group   2 Faculty of Electrical Engineering, Technion   3 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech
SLIDE 2
Distributional reinforcement learning has gained much attention recently [Bellemare et al., 2017]. It explicitly considers the stochastic nature of the long-term return Z(s, a). The recursion of Z(s, a) is described by the distributional Bellman equation Z(s, a) =_D R(s, a) + γZ(s′, a′), where =_D stands for “equal in distribution”.
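As a concrete illustration of the recursion, here is a minimal numpy sketch (all variable names hypothetical) of applying the distributional Bellman operator to a categorical value distribution: the probabilities are kept and each atom is shifted and scaled.

```python
import numpy as np

# Hypothetical setup: Z(s', a') is a categorical distribution over m atoms.
m = 5
z = np.linspace(-10.0, 10.0, m)      # support atoms z_1, ..., z_m
p_next = np.full(m, 1.0 / m)         # probabilities of Z(s', a')
r, gamma = 1.0, 0.99

# Distributional Bellman operator: T Z(s, a) =_D r + gamma * Z(s', a').
# The pushforward keeps the probabilities but moves each atom.
atoms_target = r + gamma * z         # new support of T Z(s, a)
# (Categorical algorithms such as C51 then project these atoms back onto
#  the fixed support z; here we only report the pushforward.)
print(list(zip(atoms_target, p_next)))
```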
SLIDE 4 Distributional gradient temporal difference learning
We consider a distributional counterpart of Gradient Temporal Difference Learning [Sutton et al., 2008]. Properties:
- Convergence in the off-policy setting.
- Convergence with nonlinear function approximation.
- Captures the distributional nature of the long-term return.
SLIDE 6
To measure the distance between distributions Z(s, a) and T Z(s, a), we introduce the Cramér distance. Suppose there are two distributions P and Q with cumulative distribution functions FP and FQ, respectively; then the square root of the Cramér distance between P and Q is
ℓ2(P, Q) := ( ∫_{−∞}^{∞} (FP(x) − FQ(x))² dx )^{1/2}.
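For discrete distributions supported on a common sorted set of atoms, the integral reduces to a finite sum over the gaps between atoms. A small numpy sketch (function name hypothetical):

```python
import numpy as np

def cramer_distance(atoms, p, q):
    """Square root of the Cramér distance between two discrete
    distributions sharing the same sorted support `atoms`."""
    Fp = np.cumsum(p)                  # CDF of P at the atoms
    Fq = np.cumsum(q)                  # CDF of Q at the atoms
    dx = np.diff(atoms)                # CDFs are constant between atoms
    sq = np.sum((Fp[:-1] - Fq[:-1]) ** 2 * dx)  # integral of (F_P - F_Q)^2
    return np.sqrt(sq)                 # ell_2(P, Q)

atoms = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.10, 0.40, 0.40, 0.10])
print(cramer_distance(atoms, p, q))
```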
SLIDE 8 Denote the (cumulative) distribution function of Z(s) by Fθ(s, z) and the distribution function of T Z(s) by Gθ(s, z). D-MSPBE:
minimize over θ:  J(θ) := ‖Φθᵀ D (Fθ − Gθ)‖²_{(Φθᵀ DΦθ)⁻¹}
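A minimal numpy sketch of evaluating this objective, assuming Φθ, the stationary weights on the diagonal of D, and the CDF vectors Fθ, Gθ are given as arrays (all names hypothetical):

```python
import numpy as np

def d_mspbe(Phi, d, F, G):
    """Hypothetical D-MSPBE evaluation.
    Phi: (n, k) gradient features, one row per (state, atom) pair;
    d:   (n,)   stationary weights (diagonal of D);
    F,G: (n,)   current and Bellman-target CDF values at those pairs."""
    D = np.diag(d)
    b = Phi.T @ D @ (F - G)            # Phi_theta^T D (F_theta - G_theta)
    A = Phi.T @ D @ Phi                # Phi_theta^T D Phi_theta
    return b @ np.linalg.solve(A, b)   # ||b||^2 in the A^{-1} norm

rng = np.random.default_rng(0)
Phi = rng.normal(size=(12, 3))
d = np.full(12, 1.0 / 12)
F, G = rng.uniform(size=12), rng.uniform(size=12)
print(d_mspbe(Phi, d, F, G))
```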
SLIDE 10
- The value distribution Fθ(s, z) is discrete, supported on [Vmin, Vmax] with m atoms.
- Define φθ(s, z) := ∂Fθ(s, z)/∂θ, and let Φθ be the matrix with entries (Φθ)_{((i,j),l)} = ∂Fθ(si, zj)/∂θl (a numerical sketch follows below).
- Project onto the space spanned by Φθ w.r.t. the Cramér distance, then obtain D-MSPBE.
- Use SGD and the weight-duplication trick to optimize it.
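A toy numerical sketch of these quantities: `phi_fd` (hypothetical) approximates one row of Φθ by central finite differences on a made-up two-parameter CDF model; in practice the gradient comes from backpropagation through the network.

```python
import numpy as np

def phi_fd(F, theta, s, z, eps=1e-6):
    """Finite-difference sketch of phi_theta(s, z) = dF_theta(s, z)/dtheta.
    `F` is any callable F(theta, s, z) returning a CDF value."""
    grad = np.zeros_like(theta)
    for l in range(theta.size):        # one entry of Phi_theta per theta_l
        e = np.zeros_like(theta)
        e[l] = eps
        grad[l] = (F(theta + e, s, z) - F(theta - e, s, z)) / (2 * eps)
    return grad

# Toy CDF model: sigmoid in z with learnable shift and log-scale
# (the state argument is unused in this toy).
F = lambda th, s, z: 1.0 / (1.0 + np.exp(-(z - th[0]) * np.exp(th[1])))
theta = np.array([0.5, 0.0])
print(phi_fd(F, theta, s=None, z=1.0))
```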
SLIDE 11 Distributional GTD2
Input: step sizes αt, βt; policy π.
for t = 0, 1, ... do
  wt+1 = wt + βt Σ_{j=1}^m (δθt − φθt(st, zj)ᵀ wt) φθt(st, zj)
  θt+1 = Γ[θt + αt {Σ_{j=1}^m (φθt(st, zj) − φθt(st+1, (zj − rt)/γ)) (φθt(st, zj)ᵀ wt) − ht}]
end for
Here δθt = Fθt(st+1, (zj − rt)/γ) − Fθt(st, zj),
ht = Σ_{j=1}^m (δθt − wtᵀ φθt(st, zj)) ∇²Fθt(st, zj) wt,
and Γ : Rd → Rd is a projection onto a compact set C with a smooth boundary.
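A compact numpy sketch of one update on these two timescales in the linear case, where ht = 0 (function and argument names hypothetical; the projection Γ is omitted):

```python
import numpy as np

def dgtd2_step(theta, w, phi, phi_next, delta, alpha, beta, h=None):
    """One hypothetical Distributional GTD2 step.
    phi, phi_next: (m, d) arrays of phi_theta(s_t, z_j) and
                   phi_theta(s_{t+1}, (z_j - r_t)/gamma);
    delta:         (m,)  temporal distribution differences delta_theta_t."""
    m, d = phi.shape
    h = np.zeros(d) if h is None else h    # nonzero only for nonlinear F_theta
    # Slow timescale: pseudo-gradient step on D-MSPBE (uses the current w).
    theta_new = theta + alpha * ((phi - phi_next).T @ (phi @ w) - h)
    # Fast timescale: w tracks a least-squares solution.
    w_new = w + beta * ((delta - phi @ w) @ phi)
    return theta_new, w_new

rng = np.random.default_rng(1)
theta, w = np.zeros(4), np.zeros(4)
phi, phi_next = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
delta = rng.normal(size=5)
theta, w = dgtd2_step(theta, w, phi, phi_next, delta, alpha=0.01, beta=0.1)
print(theta, w)
```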
SLIDE 12 Some remarks:
- Use the temporal distribution difference δθt instead of the temporal difference in GTD2.
- The summation over zj corresponds to the integral in the Cramér distance.
- ht results from the nonlinear function approximation and is zero in the linear case. It can be evaluated using forward and backward propagation (see the sketch below).
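The nonlinear correction involves the Hessian-vector product ∇²Fθ(st, zj) wt, which double backpropagation computes without ever forming the Hessian. A hypothetical PyTorch sketch:

```python
import torch

def hvp(F_value, params, w):
    """Hessian-vector product (grad^2 F) w via double backpropagation,
    a sketch of how the h_t term can be evaluated (names hypothetical).
    F_value: scalar tensor F_theta(s, z_j) built from `params`."""
    (g,) = torch.autograd.grad(F_value, params, create_graph=True)
    (hv,) = torch.autograd.grad(g @ w, params)  # backprop through the grad
    return hv

# Toy CDF value from a three-parameter model (hypothetical).
theta = torch.randn(3, requires_grad=True)
z = torch.tensor(0.7)
F_value = torch.sigmoid(theta[0] * z + theta[1]) * torch.sigmoid(theta[2])
w = torch.randn(3)
print(hvp(F_value, theta, w))
```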
SLIDE 13 Theoretical Result
Theorem
Let (st, rt, s′t)_{t≥0} be a sequence of transitions. Suppose the positive step sizes in the algorithm satisfy Σ_{t=0}^∞ αt = ∞, Σ_{t=0}^∞ βt = ∞, Σ_{t=0}^∞ αt² < ∞, Σ_{t=1}^∞ βt² < ∞, and αt/βt → 0 as t → ∞. Assume that for any θ ∈ C and s ∈ S such that d(s) > 0, Fθ is three times continuously differentiable. Further assume that for each θ ∈ C, E[Σ_{j=1}^m φθ(s, zj) φθ(s, zj)ᵀ] is nonsingular. Then the algorithm converges with probability one as t → ∞.
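For instance (an illustrative choice, not necessarily the paper's), αt = 1/(t+1) and βt = 1/(t+1)^{2/3} satisfy all of these conditions; a quick numpy check:

```python
import numpy as np

# alpha_t = (t+1)^{-1}, beta_t = (t+1)^{-2/3}:
#   sum alpha_t = sum beta_t = infinity   (exponents <= 1),
#   sum alpha_t^2, sum beta_t^2 < infinity (exponents > 1/2),
#   alpha_t / beta_t = (t+1)^{-1/3} -> 0,  so theta moves on the slow timescale.
t = np.arange(1, 10**6 + 1, dtype=float)
alpha, beta = 1.0 / t, t ** (-2.0 / 3.0)
print((alpha / beta)[[0, 10**3 - 1, 10**6 - 1]])  # ratio tends to zero
```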
SLIDE 14 Distributional Greedy-GQ
Input: step sizes αt, βt, 0 ≤ η ≤ 1.
for t = 0, 1, ... do
  Q(st+1, a) = Σ_{j=1}^m zj pj(st+1, a), where pj(st+1, a) is the density w.r.t. Fθ((st+1, a))
  a∗ = arg maxa Q(st+1, a)
  wt+1 = wt + βt Σ_{j=1}^m (δθt − φθt((st, at), zj)ᵀ wt) φθt((st, at), zj)
  θt+1 = θt + αt {Σ_{j=1}^m δθt φθt((st, at), zj) − η φθt((st+1, a∗), (zj − rt)/γ) (φθt((st, at), zj)ᵀ wt)}
end for
Here δθt = Fθt((st+1, a∗), (zj − rt)/γ) − Fθt((st, at), zj).
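A small numpy sketch (names hypothetical) of the greedy step: recover the atom masses from the CDF values, form Q(st+1, a), and take the argmax.

```python
import numpy as np

def greedy_action(F_next, atoms):
    """Greedy action for a Distributional Greedy-GQ-style update (sketch).
    F_next: (num_actions, m) CDF values F_theta((s_{t+1}, a), z_j)."""
    p = np.diff(F_next, axis=1, prepend=0.0)  # p_j = F(z_j) - F(z_{j-1})
    Q = p @ atoms                             # Q(s_{t+1}, a) = sum_j z_j p_j
    return int(np.argmax(Q)), Q

atoms = np.array([-1.0, 0.0, 1.0, 2.0])
F_next = np.array([[0.2, 0.5, 0.9, 1.0],      # action 0
                   [0.1, 0.2, 0.4, 1.0]])     # action 1
a_star, Q = greedy_action(F_next, atoms)
print(a_star, Q)
```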
SLIDE 15 Experimental Result
[Figures: D-MSPBE vs. time step for Distributional GTD2 and Distributional TDC; cumulative reward and kill counts per episode for C51, DQN, and Distributional Greedy-GQ.]
SLIDE 16
Thank you!
Visit our poster today at Pacific Ballroom #33.