

  1. Nonlinear Distributional Gradient Temporal Difference Learning. Chao Qu (1), Shie Mannor (2), Huan Xu (3). (1) Ant Financial Services Group; (2) Faculty of Electrical Engineering, Technion; (3) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech.

  2. Distributional reinforcement learning has gained much attention recently [Bellemare et al., 2017]. It explicitly considers the stochastic nature of the long-term return Z(s, a). The recursion of Z(s, a) is described by the distributional Bellman equation $Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(s',a')$, where $\overset{D}{=}$ stands for "equal in distribution".

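To make the recursion concrete, here is a minimal NumPy sketch (our own illustration, not from the slides) of one distributional Bellman backup for a value distribution supported on m fixed atoms; the categorical projection back onto the fixed support is a C51-style construction and every name here is hypothetical.

```python
import numpy as np

def distributional_backup(probs_next, atoms, r, gamma):
    """One distributional Bellman backup T Z(s,a) =_D R + gamma * Z(s',a')
    for a categorical distribution `probs_next` over fixed `atoms`,
    projected back onto the same support (illustrative choice)."""
    m = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    new_probs = np.zeros(m)
    # Shift and scale each atom, then spread its mass to its neighbours.
    for p, z in zip(probs_next, atoms):
        tz = np.clip(r + gamma * z, v_min, v_max)
        b = (tz - v_min) / dz            # fractional index on the support
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        new_probs[lo] += p * (hi - b) if hi != lo else p
        if hi != lo:
            new_probs[hi] += p * (b - lo)
    return new_probs

atoms = np.linspace(-10.0, 10.0, 51)   # m = 51 atoms in [Vmin, Vmax]
probs = np.full(51, 1.0 / 51)          # Z(s', a'): uniform to start
print(distributional_backup(probs, atoms, r=1.0, gamma=0.99).sum())  # ~1.0
```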

  3. Distributional gradient temporal difference learning. We consider a distributional counterpart of Gradient Temporal Difference Learning [Sutton et al., 2008]. Properties: • Convergence in the off-policy setting. • Convergence with nonlinear function approximation. • Includes the distributional nature of the long-term reward.


  4. To measure the distance between the distributions Z(s, a) and TZ(s, a), we introduce the Cramér distance. Suppose P and Q are two distributions whose cumulative distribution functions are F_P and F_Q respectively; then the square root of the Cramér distance between P and Q is $\ell_2(P,Q) := \left( \int_{-\infty}^{\infty} \left( F_P(x) - F_Q(x) \right)^2 dx \right)^{1/2}$.

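A minimal sketch, assuming two categorical distributions on a shared uniform grid of atoms (our own setup), of computing ℓ2(P, Q) from the CDFs:

```python
import numpy as np

def cramer_distance(p, q, grid):
    """l2(P, Q): square root of the Cramer distance between two
    categorical distributions p, q on a common grid of atoms."""
    F_p, F_q = np.cumsum(p), np.cumsum(q)
    dx = np.diff(grid, prepend=grid[0] - (grid[1] - grid[0]))
    # Discretized integral of (F_P(x) - F_Q(x))^2 dx over the support.
    return np.sqrt(np.sum((F_p - F_q) ** 2 * dx))

grid = np.linspace(-10, 10, 51)
p = np.full(51, 1.0 / 51)              # uniform over the atoms
q = np.zeros(51); q[25] = 1.0          # point mass at 0
print(cramer_distance(p, q, grid))
```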

  5. Denote the (cumulative) distribution function of Z(s) as F_θ(s, z) and that of TZ(s) as G_θ(s, z). The D-MSPBE objective is $\operatorname*{minimize}_\theta \; J(\theta) := \left\| \Phi_\theta^\top D \left( F_\theta - G_\theta \right) \right\|^2_{(\Phi_\theta^\top D \Phi_\theta)^{-1}}$.

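For intuition only, a toy NumPy sketch (entirely our own construction, with random placeholder data) that evaluates J(θ) in the linear case, where Φ stacks the features over (state, atom) pairs and D holds the state weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 4                       # n = |S| * m rows of (state, atom) pairs
Phi = rng.normal(size=(n, d))      # features; rows indexed by (s_i, z_j)
D = np.diag(rng.dirichlet(np.ones(n)))  # state-visitation weights
F = rng.uniform(size=n)            # F_theta, stacked over (s_i, z_j)
G = rng.uniform(size=n)            # G_theta: CDF of T Z(s), stacked likewise

# J(theta) = || Phi^T D (F - G) ||^2 in the (Phi^T D Phi)^{-1} metric.
v = Phi.T @ D @ (F - G)
M = np.linalg.inv(Phi.T @ D @ Phi)
print(v @ M @ v)
```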

  6. • The value distribution (F_θ(s, z)) is discrete within the range [V_min, V_max] with m atoms. • $\varphi_\theta(s,z) = \partial F_\theta(s,z) / \partial\theta$ and $(\Phi_\theta)_{((i,j),\ell)} = \partial F_\theta(s_i, z_j) / \partial\theta_\ell$ (see the sketch below). • Project onto the space spanned by Φ_θ w.r.t. the Cramér distance and then obtain the D-MSPBE. • Use SGD and the weight-duplication trick to optimize it.
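To make the gradient features concrete, here is a hedged sketch under a toy parameterization of our own choosing (softmax logits over the m atoms), forming one column φ_θ(s, z_j) = ∂F_θ(s, z_j)/∂θ by central finite differences:

```python
import numpy as np

def F_theta(theta, j):
    """Toy CDF: theta are logits over m atoms; F(z_j) is the sum of the
    first j+1 softmax probabilities (illustrative parameterization)."""
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p[: j + 1].sum()

def phi(theta, j, eps=1e-6):
    """phi_theta(s, z_j) = dF_theta(s, z_j)/dtheta via central differences."""
    g = np.zeros_like(theta)
    for l in range(len(theta)):
        e = np.zeros_like(theta); e[l] = eps
        g[l] = (F_theta(theta + e, j) - F_theta(theta - e, j)) / (2 * eps)
    return g

theta = np.zeros(5)                # m = 5 atoms
print(phi(theta, j=1))             # one column (i, j) of Phi_theta
```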

  7. Distributional GTD2. Input: step sizes α_t, β_t; policy π. For t = 0, 1, ...:
$w_{t+1} = w_t + \beta_t \sum_{j=1}^{m} \left( \delta_{\theta_t, j} - \varphi_{\theta_t}(s_t, z_j)^\top w_t \right) \varphi_{\theta_t}(s_t, z_j)$
$\theta_{t+1} = \Gamma\!\left[ \theta_t + \alpha_t \left\{ \sum_{j=1}^{m} \left( \varphi_{\theta_t}(s_t, z_j) - \varphi_{\theta_t}\!\left( s_{t+1}, \tfrac{z_j - r_t}{\gamma} \right) \right) \left( \varphi_{\theta_t}(s_t, z_j)^\top w_t \right) - h_t \right\} \right]$
where Γ: R^d → R^d is a projection onto a compact set C with a smooth boundary,
$h_t = \sum_{j=1}^{m} \left( \delta_{\theta_t, j} - w_t^\top \varphi_{\theta_t}(s_t, z_j) \right) \nabla^2 F_{\theta_t}(s_t, z_j) \, w_t$,
and $\delta_{\theta_t, j} = F_{\theta_t}\!\left( s_{t+1}, \tfrac{z_j - r_t}{\gamma} \right) - F_{\theta_t}(s_t, z_j)$.
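A simplified NumPy sketch of one Distributional GTD2 step. The callable CDF model, the linear demo, and the omission of the Hessian correction h_t (which vanishes in the linear case, per the remarks on the next slide) are all our own simplifications, not the authors' implementation:

```python
import numpy as np

def gtd2_step(theta, w, s, r, s_next, atoms, gamma, alpha, beta, F, grad_F):
    """One Distributional GTD2 update. F(theta, s, z) is the model CDF and
    grad_F its gradient in theta. h_t is dropped and the Gamma-projection
    onto C is omitted, so this is only a sketch of the update direction."""
    d_theta = np.zeros_like(theta)
    d_w = np.zeros_like(w)
    for z in atoms:
        phi = grad_F(theta, s, z)
        phi_next = grad_F(theta, s_next, (z - r) / gamma)
        delta = F(theta, s_next, (z - r) / gamma) - F(theta, s, z)
        d_w += (delta - phi @ w) * phi            # fast timescale (w)
        d_theta += (phi - phi_next) * (phi @ w)   # slow timescale (theta)
    return theta + alpha * d_theta, w + beta * d_w

# Tiny demo with a linear-in-theta CDF model (illustrative only).
feat = lambda s, z: np.array([1.0, s, z])
F = lambda th, s, z: feat(s, z) @ th
grad_F = lambda th, s, z: feat(s, z)
theta, w = np.array([0.1, 0.0, 0.2]), np.zeros(3)
theta, w = gtd2_step(theta, w, s=0.1, r=1.0, s_next=0.2,
                     atoms=np.linspace(-1, 1, 5), gamma=0.9,
                     alpha=0.01, beta=0.1, F=F, grad_F=grad_F)
print(theta, w)
```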

  8. Some remarks: • Use the temporal distribution difference δ_{θ_t} instead of the temporal difference in GTD2. • The summation over z_j corresponds to the integral in the Cramér distance. • h_t results from the nonlinear function approximation and is zero in the linear case; it can be evaluated using forward and backward propagation (see the sketch below).
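The Hessian-vector product inside h_t never requires forming ∇²F_θ explicitly. The sketch below (our own illustration) approximates (∇²F) w by finite differences of the gradient; in an autodiff framework the same quantity comes from double backpropagation, which is what "forward and backward propagation" refers to:

```python
import numpy as np

def hessian_vector_product(grad_F, theta, w, eps=1e-5):
    """Approximate (grad^2 F(theta)) @ w without forming the Hessian:
    finite difference of the gradient along direction w."""
    return (grad_F(theta + eps * w) - grad_F(theta - eps * w)) / (2 * eps)

# Toy check on F(theta) = 0.5 * ||theta||^2, whose Hessian is the identity.
grad_F = lambda th: th
theta, w = np.ones(3), np.array([1.0, 2.0, 3.0])
print(hessian_vector_product(grad_F, theta, w))   # ~ [1, 2, 3] = I @ w
```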

  9. Theoretical Result. Theorem. Let (s_t, r_t, s'_t)_{t≥0} be a sequence of transitions. Suppose the positive step sizes in the algorithm satisfy $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$, and $\alpha_t / \beta_t \to 0$ as $t \to \infty$. Assume that for any θ ∈ C and any s ∈ S with d(s) > 0, F_θ is three times continuously differentiable. Further assume that for each θ ∈ C, $\mathbb{E}\left[ \sum_{j=1}^{m} \varphi_\theta(s, z_j) \varphi_\theta(s, z_j)^\top \right]$ is nonsingular. Then the algorithm converges with probability one as t → ∞.
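As a concrete example (our own choice of exponents), the polynomial schedules below satisfy all of the theorem's step-size conditions: both sums diverge, both squared sums converge, and α_t/β_t → 0, so θ moves on the slower timescale:

```python
# Two-timescale step sizes satisfying the theorem's conditions:
# sum(a_t) = sum(b_t) = inf; sum(a_t^2), sum(b_t^2) < inf; a_t / b_t -> 0.
def alpha(t):
    return 1.0 / (t + 1)              # exponent 1: square-summable

def beta(t):
    return 1.0 / (t + 1) ** (2 / 3)   # exponent 2/3: sum diverges, sum of
                                      # squares (exponent 4/3 > 1) converges

print([alpha(t) / beta(t) for t in (1, 10, 100, 1000)])  # ratio -> 0
```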

  10. Distributional Greedy-GQ. Input: step sizes α_t, β_t; 0 ≤ η ≤ 1. For t = 0, 1, ...:
$Q(s_{t+1}, a) = \sum_{j=1}^{m} z_j \, p_j(s_{t+1}, a)$, where $p_j(s_{t+1}, a)$ is the density function w.r.t. $F_\theta((s_{t+1}, a))$;
$a^* = \arg\max_a Q(s_{t+1}, a)$;
$w_{t+1} = w_t + \beta_t \sum_{j=1}^{m} \left( \delta_{\theta_t, j} - \varphi_{\theta_t}((s_t, a_t), z_j)^\top w_t \right) \varphi_{\theta_t}((s_t, a_t), z_j)$;
$\theta_{t+1} = \theta_t + \alpha_t \left\{ \sum_{j=1}^{m} \delta_{\theta_t, j} \, \varphi_{\theta_t}((s_t, a_t), z_j) - \eta \, \varphi_{\theta_t}\!\left( (s_{t+1}, a^*), \tfrac{z_j - r_t}{\gamma} \right) \left( \varphi_{\theta_t}((s_t, a_t), z_j)^\top w_t \right) \right\}$,
where $\delta_{\theta_t, j} = F_{\theta_t}\!\left( (s_{t+1}, a^*), \tfrac{z_j - r_t}{\gamma} \right) - F_{\theta_t}((s_t, a_t), z_j)$.
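A small sketch (our own, with hypothetical placeholder numbers) of the greedy step: recovering Q(s_{t+1}, a) from the categorical value distribution and selecting a*:

```python
import numpy as np

def greedy_action(probs, atoms):
    """probs[a, j] = p_j(s', a), the mass on atom z_j under F_theta((s', a)).
    Q(s', a) = sum_j z_j p_j(s', a); a* = argmax_a Q(s', a)."""
    q = probs @ atoms
    return int(np.argmax(q)), q

atoms = np.linspace(-10, 10, 5)
probs = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],    # action 0: uniform, Q = 0
                  [0.0, 0.0, 0.0, 0.5, 0.5]])   # action 1: mass high, Q = 7.5
print(greedy_action(probs, atoms))
```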

  11. Experimental Result. [Figure: three panels. D-MSPBE vs. time step for Distributional GTD2 and Distributional TDC; cumulative reward vs. episodes and kill counts vs. episodes for C51, DQN, and Distributional Greedy-GQ.]

  12. Thank you! Visit our poster today at Pacific Ballroom #33.
