SLIDE 1

Nonlinear Distributional Gradient Temporal Difference Learning

Chao Qu¹  Shie Mannor²  Huan Xu³

¹Ant Financial Services Group
²Faculty of Electrical Engineering, Technion
³H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech
SLIDE 2

Distributional reinforcement learning has gained much attention recently [Bellemare et al., 2017]. It explicitly considers the stochastic nature of the long-term return Z(s, a). The recursion of Z(s, a) is described by the distributional Bellman equation

  Z(s, a) =_D R(s, a) + γ Z(s′, a′),

where =_D stands for "equal in distribution".

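As an illustration (not from the slides), here is a minimal numpy sketch of applying the distributional Bellman operator to an empirical sample of the return distribution; the sample size and distribution parameters are hypothetical:

```python
import numpy as np

def bellman_target_samples(rewards, next_return_samples, gamma=0.99):
    """Apply the distributional Bellman operator T empirically:
    each target sample is r + gamma * z', with z' drawn from Z(s', a').
    `rewards` and `next_return_samples` both have shape (n,)."""
    return rewards + gamma * next_return_samples

# Hypothetical example: Z(s', a') approximated by 5 Monte Carlo samples.
rng = np.random.default_rng(0)
z_next = rng.normal(loc=10.0, scale=2.0, size=5)  # samples of Z(s', a')
r = np.full(5, 1.0)                               # observed rewards R(s, a)
print(bellman_target_samples(r, z_next))          # samples of (T Z)(s, a)
```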
SLIDE 3

Distributional gradient temporal difference learning

We consider a distributional counterpart of Gradient Temporal Difference Learning [Sutton et al., 2008]. Properties:

  • Convergence in the off-policy setting.
  • Convergence with nonlinear function approximation.
  • Captures the distributional nature of the long-term return.
SLIDE 4

To measure the distance between the distributions Z(s, a) and T Z(s, a), we introduce the Cramér distance. Suppose there are two distributions P and Q whose cumulative distribution functions are F_P and F_Q respectively; then the square root of the Cramér distance between P and Q is

  ℓ2(P, Q) := ( ∫_{−∞}^{∞} (F_P(x) − F_Q(x))² dx )^{1/2}.

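A small illustration (not part of the slides): for discrete distributions supported on a common uniform grid, the integral reduces to a sum over grid cells. A minimal numpy sketch, where the grid and probabilities are hypothetical:

```python
import numpy as np

def cramer_distance_sq(p, q, dx):
    """Squared Cramér distance between two discrete distributions given as
    probability vectors p, q on a uniform grid with spacing dx: the integral
    of (F_P - F_Q)^2 becomes a sum over grid cells."""
    F_p = np.cumsum(p)
    F_q = np.cumsum(q)
    return np.sum((F_p - F_q) ** 2) * dx

# Hypothetical example: two distributions on 5 atoms spaced 0.5 apart.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.0, 0.3, 0.3, 0.3, 0.1])
print(np.sqrt(cramer_distance_sq(p, q, dx=0.5)))   # this is l2(P, Q)
```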
SLIDE 5

Denote the (cumulative) distribution function of Z(s) by F_θ(s, z), and let G_θ(s, z) be the distribution function of T Z(s). The distributional mean squared projected Bellman error (D-MSPBE) objective is

  minimize_θ  J(θ) := ‖Φ_θᵀ D (F_θ − G_θ)‖²_{(Φ_θᵀ D Φ_θ)⁻¹},

where ‖x‖²_M := xᵀ M x and D is the diagonal matrix of state visitation probabilities d(s).

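To make the objective concrete, here is a minimal numpy sketch (my own illustration, with hypothetical sizes) that evaluates the D-MSPBE from stacked CDF values and their Jacobian:

```python
import numpy as np

def d_mspbe(Phi, D, F, G):
    """J(theta) = || Phi^T D (F - G) ||^2 in the (Phi^T D Phi)^{-1} norm.
    Phi: (n, d) Jacobian of CDF values w.r.t. theta, rows indexed by (s_i, z_j).
    D:   (n, n) diagonal weighting (state distribution d(s)).
    F, G: (n,) stacked CDF values of Z and of T Z."""
    v = Phi.T @ D @ (F - G)             # Phi_theta^T D (F_theta - G_theta)
    M = Phi.T @ D @ Phi                 # Phi_theta^T D Phi_theta
    return v @ np.linalg.solve(M, v)    # v^T M^{-1} v

# Hypothetical toy sizes: n = 6 (state, atom) pairs, d = 2 parameters.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 2))
D = np.diag(np.full(6, 1 / 6))
F, G = rng.uniform(size=6), rng.uniform(size=6)
print(d_mspbe(Phi, D, F, G))
```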
SLIDE 6
  • The value distribution F_θ(s, z) is discrete within the range [V_min, V_max] with m atoms (see the sketch below).
  • φ_θ(s, z) = ∂F_θ(s, z)/∂θ and (Φ_θ)_((i,j),l) = ∂F_θ(s_i, z_j)/∂θ_l.
  • Project onto the space spanned by Φ_θ w.r.t. the Cramér distance to obtain the D-MSPBE.
  • Use SGD and the weight duplication trick to optimize it.
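One concrete way to realize such a parametrization (a sketch under assumptions of my own, not necessarily the paper's network): a softmax over m atoms yields a discrete value distribution whose CDF at each atom is a cumulative sum; φ_θ(s, z_j) would then come from differentiating these CDF values w.r.t. the parameters, e.g. with autodiff.

```python
import numpy as np

V_MIN, V_MAX, M = -10.0, 10.0, 51        # hypothetical support range and atom count
atoms = np.linspace(V_MIN, V_MAX, M)     # z_1, ..., z_m

def cdf_values(logits):
    """F_theta(s, z_j) for all atoms: softmax probabilities, then a cumsum.
    `logits` (shape (M,)) would come from a network evaluated at state s."""
    p = np.exp(logits - logits.max())    # numerically stable softmax
    p /= p.sum()
    return np.cumsum(p)                  # monotone discrete CDF

# Hypothetical logits for one state; F holds F_theta(s, z_j) for each atom.
rng = np.random.default_rng(2)
F = cdf_values(rng.normal(size=M))
print(F[0], F[-1])                       # F[-1] equals 1.0 up to float error
```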
SLIDE 7

Distributional GTD2

Input: step sizes α_t, β_t; policy π.
for t = 0, 1, ... do

  w_{t+1} = w_t + β_t Σ_{j=1}^m ( −φ_θt(s_t, z_j)ᵀ w_t + δ_θt ) φ_θt(s_t, z_j)

  θ_{t+1} = Γ[ θ_t + α_t { Σ_{j=1}^m ( φ_θt(s_t, z_j) − φ_θt(s_{t+1}, (z_j − r_t)/γ) ) φ_θt(s_t, z_j)ᵀ w_t − h_t } ]

end for

Here Γ : ℝ^d → ℝ^d is a projection onto a compact set C with a smooth boundary,

  h_t = Σ_{j=1}^m ( δ_θt − w_tᵀ φ_θt(s_t, z_j) ) ∇²F_θt(s_t, z_j) w_t,

and δ_θt = F_θt(s_{t+1}, (z_j − r_t)/γ) − F_θt(s_t, z_j).

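A minimal numpy sketch of one distributional GTD2 iteration (my own illustration; for simplicity it uses a fixed feature map, i.e. the linear case where h_t = 0, it omits the projection Γ, and all shapes are hypothetical):

```python
import numpy as np

def gtd2_step(theta, w, phi_t, phi_next, F_t, F_next, alpha, beta):
    """One distributional GTD2 update in the linear case (h_t = 0).
    phi_t, phi_next: (m, d) rows phi(s_t, z_j) and phi(s_{t+1}, (z_j - r_t)/gamma).
    F_t, F_next: (m,) CDF values at the same points, so delta = F_next - F_t.
    The projection Gamma onto the compact set C is omitted here."""
    delta = F_next - F_t                                  # temporal distribution difference
    w_new = w + beta * (phi_t.T @ (delta - phi_t @ w))    # fast timescale
    theta_new = theta + alpha * ((phi_t - phi_next).T @ (phi_t @ w))  # slow timescale
    return theta_new, w_new

# Hypothetical toy problem: m = 4 atoms, d = 3 parameters.
rng = np.random.default_rng(3)
theta, w = np.zeros(3), np.zeros(3)
phi_t, phi_next = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
F_t, F_next = np.sort(rng.uniform(size=4)), np.sort(rng.uniform(size=4))
theta, w = gtd2_step(theta, w, phi_t, phi_next, F_t, F_next, alpha=0.01, beta=0.1)
print(theta, w)
```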
SLIDE 8

Some remarks:

  • Use the temporal distribution difference δ_θt instead of the temporal difference in GTD2.
  • Summation over z_j, which corresponds to the integral in the Cramér distance.
  • h_t results from the nonlinear function approximation and is zero in the linear case. It can be evaluated using forward and backward propagation (see the sketch below).

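Since h_t is a weighted sum of Hessian-vector products ∇²F_θ(s_t, z_j) w_t, one way to evaluate it is double backpropagation (one forward pass, two backward passes), without ever forming the Hessian. A minimal PyTorch sketch, where the toy CDF head and all shapes are hypothetical assumptions of mine:

```python
import torch

def compute_h(F_values, theta, w, delta):
    """h_t = sum_j (delta_j - w^T phi_j) * Hessian(F_theta(s_t, z_j)) @ w.
    F_values: m scalar CDF outputs, each differentiable w.r.t. theta."""
    h = torch.zeros_like(theta)
    for j, F_j in enumerate(F_values):
        # phi_j = dF_theta(s_t, z_j)/dtheta, kept in the graph for the 2nd backward
        (phi_j,) = torch.autograd.grad(F_j, theta, create_graph=True)
        # Hessian-vector product: d(phi_j . w)/dtheta = Hess(F_j) @ w
        (hvp,) = torch.autograd.grad(phi_j @ w, theta, retain_graph=True)
        h = h + (delta[j] - w @ phi_j.detach()) * hvp
    return h

# Hypothetical toy: theta in R^3, a made-up smooth CDF head per atom z_j.
theta = torch.randn(3, requires_grad=True)
w = torch.randn(3)
atom_weights = torch.randn(4, 3)                 # one row per atom
F_values = [torch.sigmoid(atom_weights[j] @ theta) for j in range(4)]
delta = torch.randn(4)                           # placeholder for delta_theta_t
print(compute_h(F_values, theta, w, delta))
```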
SLIDE 9

Theoretical Result

Theorem

Let (s_t, r_t, s′_t)_{t≥0} be a sequence of transitions. The positive step sizes in the algorithm satisfy Σ_{t=0}^∞ α_t = ∞, Σ_{t=0}^∞ β_t = ∞, Σ_{t=0}^∞ α_t² < ∞, Σ_{t=0}^∞ β_t² < ∞, and α_t/β_t → 0 as t → ∞. Assume that for any θ ∈ C and s ∈ S such that d(s) > 0, F_θ is three times continuously differentiable. Further assume that for each θ ∈ C, E[ Σ_{j=1}^m φ_θ(s, z_j) φ_θ(s, z_j)ᵀ ] is nonsingular. Then the algorithm converges with probability one as t → ∞.

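For instance (an example of mine, not from the slides), polynomial schedules α_t = (t+1)^(−0.9) and β_t = (t+1)^(−0.6) satisfy all of these conditions: both sums diverge, both squared sums converge, and α_t/β_t = (t+1)^(−0.3) → 0.

```python
def step_sizes(t, a=0.9, b=0.6):
    """Example schedules satisfying the theorem's conditions:
    any 0.5 < b < a <= 1 gives sum(alpha) = sum(beta) = infinity,
    sum(alpha^2), sum(beta^2) < infinity, and alpha_t / beta_t -> 0."""
    alpha = (t + 1) ** (-a)
    beta = (t + 1) ** (-b)
    return alpha, beta

for t in (0, 10, 1000):
    alpha, beta = step_sizes(t)
    print(t, alpha, beta, alpha / beta)   # the ratio shrinks like (t+1)^(-0.3)
```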
SLIDE 10

Distributional Greedy-GQ

Input: step sizes α_t, β_t; 0 ≤ η ≤ 1.
for t = 0, 1, ... do

  Q(s_{t+1}, a) = Σ_{j=1}^m z_j p_j(s_{t+1}, a), where p_j(s_{t+1}, a) is the density function w.r.t. F_θ((s_{t+1}, a)).

  a* = arg max_a Q(s_{t+1}, a)

  w_{t+1} = w_t + β_t Σ_{j=1}^m ( −φ_θt((s_t, a_t), z_j)ᵀ w_t + δ_θt ) φ_θt((s_t, a_t), z_j)

  θ_{t+1} = θ_t + α_t { Σ_{j=1}^m δ_θt φ_θt((s_t, a_t), z_j) − η φ_θt((s_{t+1}, a*), (z_j − r_t)/γ) ( φ_θt((s_t, a_t), z_j)ᵀ w_t ) }

end for

where δ_θt = F_θt((s_{t+1}, a*), (z_j − r_t)/γ) − F_θt((s_t, a_t), z_j).

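A small numpy illustration of the greedy step (mine, with hypothetical shapes): recover Q(s_{t+1}, a) as the mean of each action's value distribution, then pick the maximizing action.

```python
import numpy as np

def greedy_action(probs, atoms):
    """probs: (num_actions, m) probability mass p_j(s_{t+1}, a) per action.
    Q(s_{t+1}, a) = sum_j z_j * p_j(s_{t+1}, a); return argmax_a Q and Q."""
    q = probs @ atoms                   # expected return per action
    return int(np.argmax(q)), q

# Hypothetical setup: 3 actions, m = 5 atoms.
atoms = np.linspace(-1.0, 1.0, 5)       # z_1, ..., z_m
rng = np.random.default_rng(4)
probs = rng.dirichlet(np.ones(5), size=3)
a_star, q = greedy_action(probs, atoms)
print(a_star, q)
```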
SLIDE 11

Experimental Result

[Figures: (1) D-MSPBE vs. time step for Distributional GTD2 and Distributional TDC; (2) cumulative reward vs. episodes for C51, DQN, and Distributional Greedy-GQ; (3) kill counts vs. episodes for C51, DQN, and Distributional Greedy-GQ.]

SLIDE 12

Thank you!

Visit our poster today at Pacific Ballroom #33.