Discount Factor as a Regularizer in RL - Ron Amit, Ron Meir - PowerPoint PPT Presentation

  1. ICML 2020. Discount Factor as a Regularizer in RL. Ron Amit, Ron Meir (Technion), Kamil Ciosek (MSR), Microsoft Research, Cambridge UK

  2. RL problem objectives
  • The expected γ_e-discounted return (value function), where γ_e is the evaluation discount factor: V_{γ_e}(s) = E[ Σ_{t≥0} γ_e^t r_t | s_0 = s ]
  • Policy evaluation
  • Policy optimization
  How can we improve performance in the limited-data regime?

  3. Discount regularization
  • Discount regularization: use a "guidance discount factor" γ < γ_e (Jiang '15) as an algorithm hyperparameter
  • Theoretical analysis:
    • Petrik and Scherrer '09: approximate DP
    • Jiang '15: model-based; better performance for limited data
  • Regularization effect (see the decomposition sketched below):
    • ↑ bias: V_γ − V_{γ_e}
    • ↓ variance: V̂ − V_γ
  • Our work:
    • In TD learning, discount regularization is equivalent to an explicit added regularizer
    • When is discount regularization effective?
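The bias/variance arrows can be read through a standard error decomposition; the sketch below is just the triangle inequality written in the slide's notation (V̂ the estimate from data, V_γ the true γ-discounted value, V_{γ_e} the evaluation target), not a result specific to the paper.

```latex
\| \widehat{V} - V_{\gamma_e} \|
  \;\le\;
  \underbrace{\| \widehat{V} - V_{\gamma} \|}_{\text{estimation error: shrinks with stronger discounting}}
  \;+\;
  \underbrace{\| V_{\gamma} - V_{\gamma_e} \|}_{\text{bias from the shorter horizon: grows as } \gamma \text{ decreases}}
```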

  4. Temporal Difference (TD) Learning
  • Policy evaluation with a value-function model V̂_w
  • Batch TD(0), with the discount factor γ as an algorithm hyperparameter (a code sketch follows below):
    w ← w + α [ r + γ V̂_w(s′) − V̂_w(s) ] ∇_w V̂_w(s), applied over the batch of observed tuples (s, r, s′)
  • Aim: minimize the error of V̂_w with respect to the evaluation target V_{γ_e}
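A minimal sketch of batch TD(0) policy evaluation with a linear value model and the discount factor exposed as a hyperparameter; the function and argument names (`batch_td0`, `phi`, `alpha`, `n_sweeps`) are illustrative, not from the slides.

```python
import numpy as np

def batch_td0(transitions, phi, n_features, gamma, alpha=0.05, n_sweeps=200):
    """Batch TD(0) policy evaluation with a linear value model V_w(s) = w . phi(s).

    transitions: list of (s, r, s_next) tuples collected under the evaluated policy.
    gamma: the discount-factor hyperparameter (the 'guidance' discount, possibly < gamma_e).
    """
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            v_s, v_next = w @ phi(s), w @ phi(s_next)
            td_error = r + gamma * v_next - v_s      # semi-gradient TD(0) error
            w += alpha * td_error * phi(s)           # update only through V_w(s)
    return w
```

Running this with γ below γ_e biases V̂_w toward a shorter horizon but reduces its variance in the limited-data regime, which is the trade-off the previous slide describes.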

  5. Equivalent Form
  • Equivalent update steps (algebra sketched below):
    discount regularization (using γ < γ_e)
    ⇕
    using γ_e plus an added regularization term (activation regularization), whose gradient is the extra update term
  • A similar equivalence holds for:
    • (expected) SARSA
    • LSTD
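A sketch of the algebra behind the equivalence for linear TD(0) with V_w(s) = wᵀφ(s): splitting γ = γ_e − (γ_e − γ) separates the γ_e update from an extra term, which the slides identify as the gradient of the added (activation) regularization term.

```latex
\Delta w \;=\; \alpha\big[\, r + \gamma\, w^{\top}\phi(s') - w^{\top}\phi(s) \,\big]\phi(s)
  \;=\; \underbrace{\alpha\big[\, r + \gamma_e\, w^{\top}\phi(s') - w^{\top}\phi(s) \,\big]\phi(s)}_{\text{TD(0) step with } \gamma_e}
  \;-\; \underbrace{\alpha\,(\gamma_e - \gamma)\,\big(w^{\top}\phi(s')\big)\,\phi(s)}_{\text{gradient of the added regularization term}}
```

The extra term penalizes the magnitude of the predicted value ("activation") at the observed next states, and it only acts on states that actually appear in the data.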

  6. The Equivalent Regularizer
  • The added term is an activation regularizer (it penalizes value magnitudes), in contrast to plain L2 weight regularization
  • Tabular case: the equivalent penalty is weighted by the empirical state distribution, so discount regularization is sensitive to the empirical distribution (a small numerical check follows below)
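A small numerical check of the tabular case (φ(s) = e_s): the γ-update equals the γ_e-update minus an extra term that accumulates once per visit to each state, which is why the equivalent regularizer follows the empirical distribution. The toy data and constants below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma_e, gamma, alpha = 5, 0.99, 0.9, 0.1
V = rng.normal(size=n_states)                         # current tabular value estimate
data = [(rng.integers(n_states), rng.normal(), rng.integers(n_states)) for _ in range(200)]

upd_gamma   = np.zeros(n_states)                      # total update using the guidance discount
upd_gamma_e = np.zeros(n_states)                      # total update using the evaluation discount
extra       = np.zeros(n_states)                      # the equivalent regularizer's gradient
for s, r, s_next in data:
    upd_gamma[s]   += alpha * (r + gamma   * V[s_next] - V[s])
    upd_gamma_e[s] += alpha * (r + gamma_e * V[s_next] - V[s])
    extra[s]       += alpha * (gamma_e - gamma) * V[s_next]

assert np.allclose(upd_gamma, upd_gamma_e - extra)    # exact equivalence of the two updates
# 'extra' grows with the number of visits to each state, i.e. with the empirical distribution.
```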

  7. Tabular Experiments: 4x4 GridWorld
  • Policy evaluation of a uniform policy π(a|s), with γ_e = 0.99
  • Goal: find V̂ that estimates V^π_{γ_e}
  • Loss measures (see the code sketch below):
    • L2 loss: ‖ V̂ − V^π_{γ_e} ‖²_2 = Σ_{s∈S} ( V̂(s) − V^π_{γ_e}(s) )²
    • Ranking loss: −Kendall's Tau( V̂, V^π_{γ_e} ) (~ number of order switches between state ranks)
  • In each MDP instance:
    • Draw the expected rewards 𝔼[R(s)]
    • Draw the transitions P(·|s, a)
  • Average over 1000 MDP instances
  • Data: trajectories of 50 time-steps
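A sketch of the two loss measures, assuming NumPy/SciPy; `scipy.stats.kendalltau` provides the rank correlation, and the array names are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

def evaluation_losses(v_hat, v_true):
    """Loss measures from the tabular experiments (sketch).

    v_hat:  estimated values V_hat(s) for all states.
    v_true: ground-truth values V^pi_{gamma_e}(s) computed from the known MDP.
    """
    l2_loss = np.sum((v_hat - v_true) ** 2)        # squared L2 loss over states
    tau, _ = kendalltau(v_hat, v_true)             # rank correlation of state values
    ranking_loss = -tau                            # lower is better: fewer rank switches
    return l2_loss, ranking_loss
```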

  8. TD(0) Results [figure: L2 loss and ranking loss for discount regularization vs. L2 regularization, γ_e = 0.99]

  9. Effect of the Empirical Distribution
  • The equivalent regularizer is weighted by the state-sampling distribution
  • Tuple (s, s′, r) generation: s ~ q(s), s′ ~ P^π(·|s), r ~ R^π(s) (a sampling sketch follows below)
  • For each MDP: draw the distribution q(s) at a total-variation distance d_TV from uniform
  [figure: L2 regularization vs. discount regularization, uniform vs. non-uniform q(s), γ_e = 0.99]
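A sketch of the tuple-generation step under an assumed reward model (unit-variance Gaussian noise around the drawn means); only the sampling structure s ~ q, s′ ~ P^π(·|s), r ~ R^π(s) comes from the slide, the rest is illustrative.

```python
import numpy as np

def generate_tuples(P, r_mean, q, n_tuples, rng):
    """Generate (s, s_next, r) tuples: s ~ q, s_next ~ P[s], r ~ N(r_mean[s], 1).

    P:      (n_states, n_states) state-to-state transition matrix under the evaluated policy.
    r_mean: expected reward per state; the Gaussian noise model is an assumption here.
    q:      state-sampling distribution (uniform or skewed), i.e. the empirical distribution knob.
    """
    n_states = P.shape[0]
    states = rng.choice(n_states, size=n_tuples, p=q)
    next_states = np.array([rng.choice(n_states, p=P[s]) for s in states])
    rewards = r_mean[states] + rng.normal(size=n_tuples)
    return list(zip(states, next_states, rewards))
```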

  10. Effect of the Mixing Time
  • Slower mixing (longer mixing time) → higher estimation variance → more regularization is needed (see the proxy sketch below)
  [figure: L2 regularization vs. discount regularization, slow vs. fast mixing, γ_e = 0.99; LSTD, 2 trajectories]
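One common proxy for mixing speed is the second-largest eigenvalue modulus of the policy's transition matrix: the closer it is to 1, the slower the chain mixes and the more correlated (higher-variance) the trajectory data. The self-loop construction below is an assumed illustration, not necessarily how the slides' slow- and fast-mixing MDPs were built.

```python
import numpy as np

def mixing_proxy(P):
    """Second-largest eigenvalue modulus of a stochastic matrix (illustrative mixing proxy)."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(P)))
    return eigvals[-2]            # the largest eigenvalue of a stochastic matrix is 1

# Example: slow the chain down by adding self-loop probability (assumed construction).
P_fast = np.full((4, 4), 0.25)                 # jumps to a uniformly random state
P_slow = 0.9 * np.eye(4) + 0.1 * P_fast        # mostly stays in place
print(mixing_proxy(P_fast), mixing_proxy(P_slow))   # ~0.0 (fast) vs ~0.9 (slow)
```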

  11. Policy Optimization
  • Goal: min_π ‖ V^{π*}_{γ_e} − V^π_{γ_e} ‖_1
  • Policy iteration (a sketch in code follows below). For each episode:
    • Get data
    • Q̂ ← policy evaluation (e.g., SARSA)
    • Improvement step (e.g., ε-greedy)
  • Activation regularization term: the same equivalence as in policy evaluation, now applied to Q̂
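A minimal sketch of the policy-iteration loop with SARSA evaluation and ε-greedy improvement, with the guidance discount γ as the hyperparameter; the environment interface (`env.reset()`, `env.step(a)`) and all names are assumptions for illustration.

```python
import numpy as np

def sarsa_policy_iteration(env, n_states, n_actions, gamma, alpha=0.1,
                           epsilon=0.1, n_episodes=200, rng=None):
    """Approximate policy iteration: SARSA evaluation + epsilon-greedy improvement.

    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done).
    gamma is the guidance discount hyperparameter (possibly below the evaluation discount).
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))

    def act(s):
        if rng.random() < epsilon:                # epsilon-greedy improvement step
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()                           # get data for this episode
        a = act(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])  # SARSA evaluation update
            s, a = s_next, a_next
    return Q
```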

  12. Deep RL Experiments
  • Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18)
  • MuJoCo continuous control (Todorov '12)
  • Goal: undiscounted sum of rewards (γ_e = 1)
  • Limited number of time-steps (2e5 or less)
  • Tested cases (see the sketch below):
    • Discount regularization (and no L2)
    • L2 regularization (and γ = 0.999)
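A minimal sketch of where the two tested knobs enter a DDPG/TD3-style critic update, assuming PyTorch: either lower the critic's discount (discount regularization) or keep the discount high and add `weight_decay` to the critic optimizer (L2 regularization). The tiny networks, dimensions, and batch shapes are placeholders, not the slides' setup.

```python
import torch
import torch.nn as nn

# Stand-in critic Q(s, a) for illustration: state dim 3, action dim 1 (assumed).
def make_critic():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

critic, critic_target = make_critic(), make_critic()
critic_target.load_state_dict(critic.state_dict())

# Case 1: discount regularization - lower discount, no explicit weight penalty.
gamma = 0.99            # guidance discount, below the evaluation discount gamma_e = 1
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4, weight_decay=0.0)
# Case 2: explicit L2 regularization - keep gamma = 0.999 and set weight_decay > 0 instead.

def critic_update(state, action, reward, next_state, next_action):
    """One TD-style critic step as used inside DDPG/TD3 (reward shaped (batch, 1))."""
    with torch.no_grad():
        target = reward + gamma * critic_target(torch.cat([next_state, next_action], dim=-1))
    q = critic(torch.cat([state, action], dim=-1))
    loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```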

  13. [figures: learning curves for discount regularization vs. L2 regularization with the full 2e5-step budget and with fewer steps; environments: HalfCheetah-v2, Ant-v2, Hopper-v2; discount settings shown include γ = 0.99 and γ = 0.8; reduced budgets shown include 2.5e2, 1e5, and 5e4 steps]

  14. Conclusions
  • Discount regularization in TD is equivalent to adding an explicit regularization term
  • Regularization effectiveness is closely related to the data distribution and the mixing rate
  • Generalization in deep RL is strongly affected by regularization
  • Future work: theory is needed
  Thanks for listening!
