SLIDE 1

Discount Factor as a Regularizer in RL

Ron Amit, Ron Meir (Technion), Kamil Ciosek (MSR). ICML 2020.

Microsoft Research, Cambridge UK

SLIDE 2

RL problem objectives

  • The expected γ_e-discounted return (value function)
  • Policy evaluation
  • Policy optimization

γ_e is the evaluation discount factor.

How can we improve performance in the limited-data regime?
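
For reference, the γ-discounted value function these objectives are built on can be written as below; this is the standard textbook definition, not a formula recovered from the slide.

```latex
% Expected \gamma-discounted return (value function) of policy \pi:
V^{\pi}_{\gamma}(s) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; s_0 = s,\ \pi \right],
\qquad 0 \le \gamma < 1 .
```

Evaluation uses γ = γ_e, while the algorithm may be run with a smaller discount γ < γ_e.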

SLIDE 3

Discount regularization

  • Discount regularization: solve using a smaller discount γ < γ_e
  • Theoretical analysis:
    • Petrik and Scherrer '09: approximate dynamic programming
    • Jiang '15: model-based
  • Regularization effect:
    • ↑ bias
    • ↓ variance
  • Our work:
    • In TD learning, discount regularization is equivalent to an explicit added regularizer
    • When is discount regularization effective?

Better performance for limited data

"guidance discount factor" (Jiang '15)

$\|\widehat{V}_{\gamma} - V_{\gamma_e}\|$, where the discount γ used by the algorithm is a hyperparameter.

SLIDE 4

Temporal Difference (TD) Learning

  • Policy evaluation with value-function model
  • Batch TD(0)

The discount factor γ is an algorithm hyperparameter. The aim is to minimize the error of the estimate V̂ relative to the true value $V^{\pi}_{\gamma_e}$.
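
A minimal tabular sketch of batch TD(0), assuming a fixed batch of (s, r, s') transitions and a guidance discount gamma; the function and argument names are illustrative, not from the talk's code.

```python
import numpy as np

def batch_td0(transitions, n_states, gamma, alpha=0.1, n_sweeps=100):
    """Tabular batch TD(0): sweep a fixed batch of (s, r, s') transitions,
    nudging V(s) toward the one-step bootstrap target r + gamma * V(s')."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            td_error = r + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error
    return V
```

Running this with gamma < γ_e is exactly the discount regularization discussed on the next slide.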

SLIDE 5

Equivalent Form

  • Equivalent update steps:

Discount regularization (using γ < γ_e)
⇕
TD(0) with γ_e plus an added regularization term

The added term enters the update as the gradient of a regularization term on the value estimates, i.e., activation regularization (a numerical check of this identity follows the list below). A similar equivalence holds for:

  • (expected) SARSA
  • LSTD
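
As a sanity check, the tabular identity behind the equivalence can be verified numerically; this sketch is mine and uses only the algebraic rearrangement of the TD target, r + γ·V(s') − V(s) = (r + γ_e·V(s') − V(s)) − (γ_e − γ)·V(s').

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma_e, gamma = 5, 0.1, 0.99, 0.9
V = rng.normal(size=n_states)        # arbitrary current value estimates
s, s_next, r = 1, 3, 0.7             # one sampled transition

# TD(0) update using the smaller (guidance) discount gamma:
upd_discounted = alpha * (r + gamma * V[s_next] - V[s])

# TD(0) update using gamma_e, plus the added regularization term:
upd_regularized = (alpha * (r + gamma_e * V[s_next] - V[s])
                   - alpha * (gamma_e - gamma) * V[s_next])

assert np.isclose(upd_discounted, upd_regularized)
```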
SLIDE 6

The Equivalent Regularizer

  • Activation regularization
  • Tabular case: the regularizer reduces to an L2 regularization of the value estimates, weighted by the empirical state distribution.

As a result, discount regularization is sensitive to the empirical distribution.

SLIDE 7

Tabular Experiments

  • Policy evaluation with a uniform policy π(a|s).
  • Goal: find V̂ that estimates $V^{\pi}_{\gamma_e}$ (γ_e = 0.99).
  • Loss measures:
    • L2 loss: $\|\widehat{V} - V^{\pi}_{\gamma_e}\|_2^2 = \sum_{s \in S} \big(\widehat{V}(s) - V^{\pi}_{\gamma_e}(s)\big)^2$
    • Ranking loss: $-\text{Kendall's }\tau\big(\widehat{V}, V^{\pi}_{\gamma_e}\big)$ (roughly, the number of order switches between state rankings)

  • Average over 1000 MDP instances
  • Data: trajectories of 50 time-steps

4x4 GridWorld. In each MDP instance (a sketch of the loss computation follows):

  • Draw rewards R(s)
  • Draw transitions P(·|s, a)
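
A minimal sketch of the two loss measures, assuming V_hat and V_true are arrays of state values; kendalltau comes from SciPy, and the negation matches the ranking-loss convention above.

```python
import numpy as np
from scipy.stats import kendalltau

def l2_loss(V_hat, V_true):
    """Squared L2 distance between estimated and true state values."""
    return float(np.sum((np.asarray(V_hat) - np.asarray(V_true)) ** 2))

def ranking_loss(V_hat, V_true):
    """Negative Kendall's tau: larger when the estimate switches the
    order of states relative to the true value ranking."""
    tau, _ = kendalltau(V_hat, V_true)
    return -tau
```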
SLIDE 8

TD(0) Results

[Figure: TD(0) results: ranking loss and L2 loss, for discount regularization vs. L2 regularization (γ_e = 0.99).]

SLIDE 9

Effect of the Empirical Distribution

  • The equivalent regularizer depends on the empirical state distribution.
  • Tuple (s, s', r) generation: s ~ q(s), s' ~ P^π(s'|s), r ~ R^π(s).
  • For each MDP, the sampling distribution q(s) is either uniform or drawn at random (non-uniform).

π‘΄πŸ‘ regularization Discount regularization

(𝛿𝑓 = 0.99) Uniform Uniform Non-uniform Non-uniform

SLIDE 10

Effect of the Mixing Time

  • Slower mixing (a longer mixing time) → higher estimation variance → more regularization is needed.

π‘΄πŸ‘ regularization Discount regularization

(LSTD, 2 trajectories) (𝛿𝑓 = 0.99) Fast mixing Slow mixing Slow mixing Fast mixing

SLIDE 11

Policy Optimization

Goal: $\min_{\pi} \big\| V^{\pi^*}_{\gamma_e} - V^{\pi}_{\gamma_e} \big\|_1$

Policy iteration:

  • For each episode:
    • Get data
    • Q̂ ← policy evaluation (e.g., SARSA)
    • Improvement step (e.g., ε-greedy)

An activation regularization term is added in the policy-evaluation step (see the sketch below).
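
A compact sketch of the loop above for the tabular case, assuming a user-supplied env_step(s, a) -> (reward, next_state) function; the structure (SARSA evaluation, ε-greedy improvement) follows the slide, everything else is illustrative.

```python
import numpy as np

def epsilon_greedy(Q, eps):
    """Action probabilities of an epsilon-greedy policy derived from Q."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - eps
    return pi

def policy_iteration(env_step, n_states, n_actions, gamma,
                     n_episodes=100, horizon=50, alpha=0.1, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        pi = epsilon_greedy(Q, eps)                  # improvement step
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=pi[s])
        for _ in range(horizon):                     # get data + SARSA evaluation
            r, s_next = env_step(s, a)               # hypothetical environment call
            a_next = rng.choice(n_actions, p=pi[s_next])
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```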

SLIDE 12

Deep RL Experiments

  • Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18)
  • Mujoco continuous control (Todorov '12)
  • Goal: undiscounted sum of rewards (γ_e = 1)
  • Limited number of time-steps (2e5 or less)
  • Tested cases (a configuration sketch follows the list):
    • Discount regularization (and no L2)
    • L2 regularization (and γ = 0.999)
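
A minimal sketch of the two tested configurations, written as a critic loss with an optional L2 penalty on the critic's value outputs; the coefficient value and function names are my illustrative assumptions, not the talk's code.

```python
import torch.nn.functional as F

def critic_loss(q_pred, q_target, l2_coef=0.0):
    """TD regression loss with an optional L2 penalty on the critic's
    value estimates (activation regularization)."""
    return F.mse_loss(q_pred, q_target) + l2_coef * (q_pred ** 2).mean()

# Case 1: discount regularization -- smaller gamma, no explicit penalty.
gamma, l2_coef = 0.99, 0.0

# Case 2: explicit L2 regularization -- gamma = 0.999 plus the penalty
# (the 1e-2 coefficient is an illustrative placeholder).
gamma, l2_coef = 0.999, 1e-2
```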
SLIDE 13

[Figure: learning curves on HalfCheetah-v2, Ant-v2, and Hopper-v2. Rows: discount regularization at 2e5 steps and with fewer steps (2.5e2 / 1e5 / 5e4), and L2 regularization at 2e5 steps and with fewer steps. Discounts shown include γ = 0.8, 0.99, 0.995.]

SLIDE 14

Conclusions

  • Discount regularization in TD is equivalent to adding an explicit regularization term.
  • The effectiveness of regularization is closely tied to the data distribution and the mixing rate.
  • Generalization in deep RL is strongly affected by regularization.
  • Future work: more theory is needed.

Thanks for listening