Discount Factor as a Regularizer in RL
Ron Amit, Ron Meir (Technion), Kamil Ciosek (MSR). ICML 2020
Microsoft Research, Cambridge UK
RL problem objective: the expected $\gamma_e$-discounted return (value function)
Evaluation discount factor
Better performance for limited data
"guidance discount factor" (Jiang '15)
Figure residue: $\hat{V} - V$, $\gamma$, $\gamma_e$
Algorithm hyperparameter
Discount factor hyperparameter
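To make the two roles of the discount factor concrete, here is a minimal sketch that computes the discounted return of one reward sequence under an evaluation discount factor and under a lower algorithm hyperparameter. The function name and the reward values are illustrative assumptions, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]                       # hypothetical rewards
g_eval = discounted_return(rewards, 0.99)       # evaluation discount factor
g_guidance = discounted_return(rewards, 0.5)    # lower discount hyperparameter
print(g_eval, g_guidance)
```

The lower discount factor shortens the effective horizon, which is the effect the slides treat as a form of regularization.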
Discount regularization (using $\gamma < \gamma_e$) is equivalent to using $\gamma_e$ plus a regularization term
The regularization term gradient is similar to activation regularization
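For semi-gradient TD(0), the equivalence follows from one line of algebra: the update with $\gamma$ equals the update with $\gamma_e$ minus $(\gamma_e - \gamma)\,\hat{V}(s')\,\nabla\hat{V}(s)$. The sketch below checks this numerically for a linear value estimate. The feature vectors, reward, and variable names are assumptions made for illustration; the slides do not specify a function class.

```python
import numpy as np


def td0_step(theta, phi_s, phi_next, reward, gamma):
    """Semi-gradient TD(0) update direction for V(s) = theta @ phi(s)."""
    td_error = reward + gamma * (theta @ phi_next) - (theta @ phi_s)
    return td_error * phi_s  # gradient of V(s) w.r.t. theta is phi(s)


rng = np.random.default_rng(0)
theta = rng.normal(size=4)           # hypothetical weights
phi_s = rng.normal(size=4)           # hypothetical features of s
phi_next = rng.normal(size=4)        # hypothetical features of s'
reward, gamma, gamma_e = 1.0, 0.9, 0.99

step_low = td0_step(theta, phi_s, phi_next, reward, gamma)
step_eval = td0_step(theta, phi_s, phi_next, reward, gamma_e)
# Extra term that appears when gamma_e is replaced by gamma
reg_grad = (gamma_e - gamma) * (theta @ phi_next) * phi_s

print(np.allclose(step_low, step_eval - reg_grad))
```

The extra term scales with the value estimate itself, which is why it resembles the gradient of a penalty on the network's output. It pairs $\hat{V}(s')$ with $\nabla\hat{V}(s)$, so under this sketch it is similar to, rather than identical to, a plain activation penalty.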
Discount regularization is sensitive to the empirical distribution
$L_2$ regularization
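4x4 GridWorld. In each MDP instance:
Find $\hat{V}$ that estimates $V^{\pi}_{\gamma_e}$ ($\gamma_e = 0.99$)
$L_2$ loss: $\|\hat{V} - V^{\pi}_{\gamma_e}\|_2^2 = \sum_{s \in S} \big(\hat{V}(s) - V^{\pi}_{\gamma_e}(s)\big)^2$
Ranking loss between $\hat{V}$ and $V^{\pi}_{\gamma_e}$ (~ number of order switches between state ranks)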
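The two evaluation measures used in this experiment, a squared $L_2$ loss and a ranking loss that counts order switches between state ranks, can be sketched as follows. Counting discordant state pairs is one plausible reading of "number of order switches" and is an assumption here; the value vectors are made-up numbers.

```python
from itertools import combinations


def l2_loss(v_hat, v_true):
    """Squared L2 distance between two value vectors (one entry per state)."""
    return sum((a - b) ** 2 for a, b in zip(v_hat, v_true))


def ranking_loss(v_hat, v_true):
    """Number of state pairs ordered differently by v_hat and v_true."""
    switches = 0
    for i, j in combinations(range(len(v_true)), 2):
        # A negative product means the pair (i, j) is ranked in opposite order
        if (v_hat[i] - v_hat[j]) * (v_true[i] - v_true[j]) < 0:
            switches += 1
    return switches


v_true = [0.1, 0.5, 0.9, 0.3]    # hypothetical true values
v_hat = [0.2, 0.4, 0.3, 0.35]    # hypothetical estimates
print(l2_loss(v_hat, v_true), ranking_loss(v_hat, v_true))
```

The ranking loss ignores the scale of the estimate, which matters because a lower discount factor shrinks all values while possibly preserving their order.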
Figure: ranking loss and $L_2$ loss for discount regularization and $L_2$ regularization ($\gamma_e = 0.99$)
Figure: $L_2$ regularization vs. discount regularization ($\gamma_e = 0.99$); panels: Uniform, Non-uniform
→ more regularization is needed
Figure: $L_2$ regularization vs. discount regularization (LSTD, 2 trajectories; $\gamma_e = 0.99$); panels: Fast mixing, Slow mixing
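As a rough illustration of how the two regularizers enter LSTD, the sketch below solves the LSTD(0) linear system either with a lowered discount factor or with a ridge term added to the matrix. The ridge form `A + l2_coef * I` is one common variant and is an assumption, as are the 3-state chain, the one-hot features, and all names.

```python
import numpy as np


def lstd(transitions, features, gamma, l2_coef=0.0):
    """LSTD(0) weights from (state, reward, next_state) samples.

    features: array of shape (n_states, d), one feature row per state.
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, r, s_next in transitions:
        phi, phi_next = features[s], features[s_next]
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    # l2_coef > 0 adds a ridge term; l2_coef = 0 is plain LSTD
    return np.linalg.solve(A + l2_coef * np.eye(d), b)


# Hypothetical 3-state chain with one-hot (tabular) features
features = np.eye(3)
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2), (2, 1.0, 2)]
gamma_e = 0.99

theta_discount = lstd(transitions, features, gamma=0.8)              # discount regularization
theta_l2 = lstd(transitions, features, gamma=gamma_e, l2_coef=0.1)   # L2 regularization
print(theta_discount, theta_l2)
```

Both variants shrink the estimated values relative to plain LSTD with $\gamma_e$, but they do so through different parts of the linear system.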
Goal: $\min_{\pi} \left\| V^{\pi}_{\gamma_e} - V^{\pi^*}_{\gamma_e} \right\|_1$
Policy iteration:
$Q \leftarrow$ Policy evaluation (e.g., SARSA)
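The policy evaluation step can be sketched with a tabular SARSA update that takes the discount factor as an explicit argument, so a value below $\gamma_e$ can be passed in as discount regularization. The dictionary-based Q-table, the learning rate, and the function name are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, gamma, lr):
    """One tabular SARSA step on the Q-table `q`, keyed by (state, action)."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += lr * (td_target - q[(s, a)])
    return q


q = defaultdict(float)   # unseen (state, action) pairs default to 0.0
q[(1, 0)] = 2.0          # hypothetical current estimate for the next pair
sarsa_update(q, s=0, a=0, r=1.0, s_next=1, a_next=0, gamma=0.5, lr=0.1)
print(q[(0, 0)])
```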
Activation regularization term:
Figure: discount regularization vs. L2 regularization on HalfCheetah-v2, Ant-v2, and Hopper-v2; panels for 2e5 steps and for fewer steps (2.5e2, 1e5, 5e4 steps); marked discount factors: $\gamma = 0.99$, $\gamma = 0.99$, $\gamma = 0.995$, $\gamma = 0.8$
mixing rate.