SLIDE 1

Discount Factor as a Regularizer in RL

Ron Amit, Ron Meir (Technion), Kamil Ciosek (MSR). ICML 2020.

Microsoft Research, Cambridge UK

SLIDE 2

RL problem objectives

  • The expected γ_e-discounted return (value function)
  • Policy evaluation
  • Policy optimization

γ_e is the evaluation discount factor.

How can we improve performance in the limited-data regime?
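
For reference, the γ-discounted value function these objectives are built on can be written as below; this is the standard textbook definition, not a formula recovered from the slide.

```latex
% Expected \gamma-discounted return (value function) of policy \pi:
V^{\pi}_{\gamma}(s) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; s_0 = s,\ \pi \right],
\qquad 0 \le \gamma < 1 .
```

Evaluation uses γ = γ_e, while the algorithm may be run with a smaller discount γ < γ_e.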

SLIDE 3

Discount regularization

  • Discount regularization: solve using a smaller discount γ < γ_e
  • Theoretical analysis:
    • Petrik and Scherrer '09: approximate dynamic programming
    • Jiang '15: model-based
  • Regularization effect:
    • ↑ bias
    • ↓ variance
  • Our work:
    • In TD learning, discount regularization is equivalent to an explicit added regularizer
    • When is discount regularization effective?

Better performance for limited data

"guidance discount factor" (Jiang '15)

$\|\widehat{V}_{\gamma} - V_{\gamma_e}\|$, where the discount γ used by the algorithm is a hyperparameter.

SLIDE 4

Temporal Difference (TD) Learning

  • Policy evaluation with value-function model
  • Batch TD(0)

The discount factor γ is an algorithm hyperparameter. The aim is to minimize the error of the estimate V̂ relative to the true value $V^{\pi}_{\gamma_e}$.
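
A minimal tabular sketch of batch TD(0), assuming a fixed batch of (s, r, s') transitions and a guidance discount gamma; the function and argument names are illustrative, not from the talk's code.

```python
import numpy as np

def batch_td0(transitions, n_states, gamma, alpha=0.1, n_sweeps=100):
    """Tabular batch TD(0): sweep a fixed batch of (s, r, s') transitions,
    nudging V(s) toward the one-step bootstrap target r + gamma * V(s')."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            td_error = r + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error
    return V
```

Running this with gamma < γ_e is exactly the discount regularization discussed on the next slide.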

SLIDE 5

Equivalent Form

  • Equivalent update steps:

Discount regularization (using γ < γ_e)
⇕
TD(0) with γ_e plus an added regularization term

The added term enters the update as the gradient of a regularization term on the value estimates, i.e., activation regularization (a numerical check of this identity follows the list below). A similar equivalence holds for:

  • (expected) SARSA
  • LSTD
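
As a sanity check, the tabular identity behind the equivalence can be verified numerically; this sketch is mine and uses only the algebraic rearrangement of the TD target, r + γ·V(s') − V(s) = (r + γ_e·V(s') − V(s)) − (γ_e − γ)·V(s').

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma_e, gamma = 5, 0.1, 0.99, 0.9
V = rng.normal(size=n_states)        # arbitrary current value estimates
s, s_next, r = 1, 3, 0.7             # one sampled transition

# TD(0) update using the smaller (guidance) discount gamma:
upd_discounted = alpha * (r + gamma * V[s_next] - V[s])

# TD(0) update using gamma_e, plus the added regularization term:
upd_regularized = (alpha * (r + gamma_e * V[s_next] - V[s])
                   - alpha * (gamma_e - gamma) * V[s_next])

assert np.isclose(upd_discounted, upd_regularized)
```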
SLIDE 6

The Equivalent Regularizer

  • Activation regularization
  • Tabular case: the regularizer reduces to an L2 regularization of the value estimates, weighted by the empirical state distribution.

As a result, discount regularization is sensitive to the empirical distribution.

SLIDE 7

Tabular Experiments

  • Policy evaluation with a uniform policy π(a|s).
  • Goal: find V̂ that estimates $V^{\pi}_{\gamma_e}$ (γ_e = 0.99).
  • Loss measures:
    • L2 loss: $\|\widehat{V} - V^{\pi}_{\gamma_e}\|_2^2 = \sum_{s \in S} \big(\widehat{V}(s) - V^{\pi}_{\gamma_e}(s)\big)^2$
    • Ranking loss: $-\text{Kendall's }\tau\big(\widehat{V}, V^{\pi}_{\gamma_e}\big)$ (roughly, the number of order switches between state rankings)

  • Average over 1000 MDP instances
  • Data: trajectories of 50 time-steps

4x4 GridWorld. In each MDP instance (a sketch of the loss computation follows):

  • Draw rewards R(s)
  • Draw transitions P(·|s, a)
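
A minimal sketch of the two loss measures, assuming V_hat and V_true are arrays of state values; kendalltau comes from SciPy, and the negation matches the ranking-loss convention above.

```python
import numpy as np
from scipy.stats import kendalltau

def l2_loss(V_hat, V_true):
    """Squared L2 distance between estimated and true state values."""
    return float(np.sum((np.asarray(V_hat) - np.asarray(V_true)) ** 2))

def ranking_loss(V_hat, V_true):
    """Negative Kendall's tau: larger when the estimate switches the
    order of states relative to the true value ranking."""
    tau, _ = kendalltau(V_hat, V_true)
    return -tau
```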
SLIDE 8

TD(0) Results

[Figure: TD(0) results: ranking loss and L2 loss, for discount regularization vs. L2 regularization (γ_e = 0.99).]

SLIDE 9

Effect of the Empirical Distribution

  • The equivalent regularizer depends on the empirical state distribution.
  • Tuple (s, s', r) generation: s ~ q(s), s' ~ P^π(s'|s), r ~ R^π(s).
  • For each MDP, the sampling distribution q(s) is either uniform or drawn at random (non-uniform).

π‘΄πŸ‘ regularization Discount regularization

(𝛿𝑓 = 0.99) Uniform Uniform Non-uniform Non-uniform

SLIDE 10

Effect of the Mixing Time

  • Slower mixing (a longer mixing time) → higher estimation variance → more regularization is needed.

π‘΄πŸ‘ regularization Discount regularization

(LSTD, 2 trajectories) (𝛿𝑓 = 0.99) Fast mixing Slow mixing Slow mixing Fast mixing

SLIDE 11

Policy Optimization

Goal: $\min_{\pi} \big\| V^{\pi^*}_{\gamma_e} - V^{\pi}_{\gamma_e} \big\|_1$

Policy iteration:

  • For each episode:
    • Get data
    • Q̂ ← policy evaluation (e.g., SARSA)
    • Improvement step (e.g., ε-greedy)

An activation regularization term is added in the policy-evaluation step (see the sketch below).
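
A compact sketch of the loop above for the tabular case, assuming a user-supplied env_step(s, a) -> (reward, next_state) function; the structure (SARSA evaluation, ε-greedy improvement) follows the slide, everything else is illustrative.

```python
import numpy as np

def epsilon_greedy(Q, eps):
    """Action probabilities of an epsilon-greedy policy derived from Q."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - eps
    return pi

def policy_iteration(env_step, n_states, n_actions, gamma,
                     n_episodes=100, horizon=50, alpha=0.1, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        pi = epsilon_greedy(Q, eps)                  # improvement step
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=pi[s])
        for _ in range(horizon):                     # get data + SARSA evaluation
            r, s_next = env_step(s, a)               # hypothetical environment call
            a_next = rng.choice(n_actions, p=pi[s_next])
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```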

SLIDE 12

Deep RL Experiments

  • Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18)
  • Mujoco continuous control (Todorov '12)
  • Goal: undiscounted sum of rewards (γ_e = 1)
  • Limited number of time-steps (2e5 or less)
  • Tested cases (a configuration sketch follows the list):
    • Discount regularization (and no L2)
    • L2 regularization (and γ = 0.999)
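
A minimal sketch of the two tested configurations, written as a critic loss with an optional L2 penalty on the critic's value outputs; the coefficient value and function names are my illustrative assumptions, not the talk's code.

```python
import torch.nn.functional as F

def critic_loss(q_pred, q_target, l2_coef=0.0):
    """TD regression loss with an optional L2 penalty on the critic's
    value estimates (activation regularization)."""
    return F.mse_loss(q_pred, q_target) + l2_coef * (q_pred ** 2).mean()

# Case 1: discount regularization -- smaller gamma, no explicit penalty.
gamma, l2_coef = 0.99, 0.0

# Case 2: explicit L2 regularization -- gamma = 0.999 plus the penalty
# (the 1e-2 coefficient is an illustrative placeholder).
gamma, l2_coef = 0.999, 1e-2
```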
SLIDE 13

[Figure: learning curves on HalfCheetah-v2, Ant-v2, and Hopper-v2. Rows: discount regularization at 2e5 steps and with fewer steps (2.5e2 / 1e5 / 5e4), and L2 regularization at 2e5 steps and with fewer steps. Discounts shown include γ = 0.8, 0.99, 0.995.]

SLIDE 14

Conclusions

  • Discount regularization in TD is equivalent to adding an explicit regularization term.
  • The effectiveness of regularization is closely tied to the data distribution and the mixing rate.
  • Generalization in deep RL is strongly affected by regularization.
  • Future work: more theory is needed.

Thanks for listening