Lipschitz Continuity in Model-based Reinforcement Learning


  1. Lipschitz Continuity in Model-based Reinforcement Learning
     Kavosh Asadi*, Dipendra Misra*, Michael L. Littman
     * denotes equal contribution

  2. Model-based RL
     [Diagram: experience → model learning → model → planning → value/policy → acting → experience]
     Model learning: $T(s' \mid s, a) \approx \widehat{T}(s' \mid s, a)$ and $R(s, a) \approx \widehat{R}(s, a)$
     Planning: roll the learned model $\widehat{T}$ forward from $s_0$ to predicted states $\widehat{s}_1, \widehat{s}_2, \widehat{s}_3, \ldots$ (rollout diagram on slide)
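The Python sketch below illustrates this loop under simplifying assumptions: the names fit_model and rollout are hypothetical, and a nearest-neighbor lookup stands in for a learned model, so this shows the structure of the loop rather than the paper's method.

```python
# Illustrative sketch of the model-based RL loop on this slide; the helper
# names and the nearest-neighbor "model" are placeholders, not the paper's method.
import numpy as np

def fit_model(transitions):
    """Build approximate T_hat, R_hat from (s, a, r, s') tuples."""
    def nearest(s, a):
        # return the stored transition whose (state, action) is closest to (s, a)
        return min(transitions,
                   key=lambda t: np.linalg.norm(t[0] - s) + abs(t[1] - a))

    def T_hat(s, a):
        return nearest(s, a)[3]   # predicted next state

    def R_hat(s, a):
        return nearest(s, a)[2]   # predicted reward

    return T_hat, R_hat

def rollout(T_hat, R_hat, policy, s0, horizon=3, gamma=0.99):
    """Plan by unrolling the learned model: s0 -> s1_hat -> s2_hat -> ..."""
    s, ret = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        ret += (gamma ** t) * R_hat(s, a)
        s = T_hat(s, a)   # each step feeds back a *predicted* state
    return ret
```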

  3. Compounding Error [Talvitie 2014; Venkatraman et al. 2015]
     ‣ occurs when models are imperfect, which is almost always the case
     ‣ due to estimation error or partial observability
     ‣ agnostic setting
     [Video comparing ground-truth and model rollouts; credit to Matt Cooper, github.com/dyelax]

  4. Main Takeaway
     Lipschitz continuity plays a key role in compounding errors and, more generally, in the theory of model-based RL.
     Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, a function $f : M_1 \mapsto M_2$ is Lipschitz if the Lipschitz constant defined below is finite:
     $$K_{d_1, d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}$$
     [Illustration: nearby inputs $s_1, s_2$ map to outputs $f(s_1), f(s_2)$ whose distance is controlled by $K$]
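As a concrete illustration of the definition, the sketch below estimates a lower bound on a Lipschitz constant by sampling input pairs under the Euclidean metric; lipschitz_lower_bound is a hypothetical helper, and since the constant is a supremum, sampling can only underestimate it.

```python
# Empirically lower-bound the Lipschitz constant of f by sampling pairs;
# the true constant is a supremum, so this only gives a lower bound.
import numpy as np

def lipschitz_lower_bound(f, sampler, n_pairs=10_000):
    best = 0.0
    for _ in range(n_pairs):
        s1, s2 = sampler(), sampler()
        d1 = np.linalg.norm(s1 - s2)
        if d1 > 1e-12:
            d2 = np.linalg.norm(f(s1) - f(s2))
            best = max(best, d2 / d1)
    return best

# Example: f(s) = 3s is Lipschitz with constant 3 under the Euclidean metric.
print(lipschitz_lower_bound(lambda s: 3 * s, lambda: np.random.randn(2)))  # ~3.0
```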

  5. Wasserstein Metric [Villani, 2008]
     In stochastic domains, we need to quantify the difference between two distributions $\mu_1$ and $\mu_2$:
     $$W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \int\!\!\int j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1$$
     where $\Lambda$ is the set of joint distributions with marginals $\mu_1$ and $\mu_2$.
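For intuition, the snippet below computes the Wasserstein distance between two one-dimensional empirical distributions with SciPy; the definition above with the coupling $j \in \Lambda$ is the general form, while scipy.stats.wasserstein_distance handles the 1-D case directly.

```python
# 1-D Wasserstein distance between empirical samples (SciPy handles the 1-D
# case in closed form; the coupling-based definition above is the general one).
import numpy as np
from scipy.stats import wasserstein_distance

mu1 = np.random.normal(loc=0.0, scale=1.0, size=5000)   # samples from mu_1
mu2 = np.random.normal(loc=1.0, scale=1.0, size=5000)   # samples from mu_2

# Two equal-variance Gaussians: W equals the distance between means (~1.0).
print(wasserstein_distance(mu1, mu2))
```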

  6. Three Theorems
     ‣ multi-step prediction error
     ‣ value function estimation error
     ‣ Lipschitz continuity of value function

  7. Multi-step Prediction Error
     Assume a $\Delta$-accurate model:
     $$W\big(\widehat{T}(\cdot \mid s, a),\, T(\cdot \mid s, a)\big) \le \Delta \quad \forall s\, \forall a$$
     Given a $\Delta$-accurate model with Lipschitz constant $K(\widehat{T})$, a true model with Lipschitz constant $K(T)$, and a state distribution $\mu(s)$:
     $$\delta(n) := W\big(\widehat{T}^{\,n}(\cdot \mid \mu),\, T^{\,n}(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} \bar{k}^{\,i}$$
     $\delta$: prediction error, $n$: prediction horizon, $\bar{k} := \min\big(K(T), K(\widehat{T})\big)$
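The bound is easy to evaluate numerically; the sketch below (hypothetical helper name) just transcribes the inequality and shows how $\bar{k} > 1$ makes the error compound with the horizon.

```python
# Transcription of the bound: delta(n) <= Delta * sum_{i=0}^{n-1} k_bar^i,
# with k_bar = min(K(T), K(T_hat)).
def prediction_error_bound(delta_one_step, K_T, K_T_hat, n):
    k_bar = min(K_T, K_T_hat)
    return delta_one_step * sum(k_bar ** i for i in range(n))

# A 0.01-accurate model: with k_bar < 1 the error stays bounded,
# with k_bar > 1 it grows exponentially in the horizon n.
print(prediction_error_bound(0.01, K_T=1.2, K_T_hat=0.9, n=10))   # ~0.065
print(prediction_error_bound(0.01, K_T=1.2, K_T_hat=1.1, n=10))   # ~0.16
```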

  8. Value Function Estimation Error
     $K(R)$: Lipschitz constant of the reward function. How inaccurate can the value function be?
     $$\big| V_T(s) - V_{\widehat{T}}(s) \big| \le \frac{\gamma\, K(R)\, \Delta}{(1 - \gamma)(1 - \gamma \bar{k})} \quad \forall s$$
     $\bar{k} := \min\big(K(T), K(\widehat{T})\big)$
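The sketch below (hypothetical helper name) transcribes this bound; note it is only meaningful when $\gamma \bar{k} < 1$.

```python
# Transcription of the bound:
# |V_T(s) - V_T_hat(s)| <= gamma * K(R) * Delta / ((1 - gamma) * (1 - gamma * k_bar)).
def value_error_bound(delta, K_R, K_T, K_T_hat, gamma):
    k_bar = min(K_T, K_T_hat)
    assert gamma * k_bar < 1, "bound requires gamma * k_bar < 1"
    return gamma * K_R * delta / ((1 - gamma) * (1 - gamma * k_bar))

print(value_error_bound(delta=0.01, K_R=1.0, K_T=1.2, K_T_hat=0.9, gamma=0.95))
```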

  9. Lipschitz Continuity of Value Function
     ‣ Generalized VI [Littman and Szepesvári, 96] with a Lipschitz backup operator $f$; repeat until convergence:
     $$Q(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a)\, f\big(Q(s', \cdot)\big)\, ds'$$
     ‣ the value function is Lipschitz at every iteration (including the fixed point):
     $$K(Q) \le \frac{K(R)}{1 - \gamma K(T)}$$
     ‣ one implication: value-aware model learning [Farahmand et al., 2017] is equivalent to Wasserstein (will appear in the PGMRL workshop later in the conference)
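A hedged tabular sketch of this update, assuming a finite MDP stored as arrays and using $f = \max$ (a non-expansion, hence 1-Lipschitz) as the backup operator; the toy transition and reward arrays are illustrative, not from the paper.

```python
# Tabular Generalized VI with a Lipschitz backup operator f; f = max recovers
# standard value iteration. T[s, a, s'] and R[s, a] are toy arrays.
import numpy as np

def generalized_vi(T, R, gamma=0.9, f=np.max, n_iters=200):
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        for s in range(n_states):
            for a in range(n_actions):
                # Q(s,a) <- R(s,a) + gamma * sum_s' T(s'|s,a) f(Q(s',.))
                Q[s, a] = R[s, a] + gamma * sum(
                    T[s, a, sp] * f(Q[sp]) for sp in range(n_states))
    return Q

# Toy two-state, two-action MDP.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
print(generalized_vi(T, R))
```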

  10. Controlling the Lipschitz Constant with Neural Nets
      For each layer, ensure the weights lie in a desired norm ball: the Lipschitz constant of the entire net is then bounded by the product of the Lipschitz constants of its layers.
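A NumPy sketch of one way to realize this, assuming fully connected layers with 1-Lipschitz activations (e.g. ReLU) and the Euclidean metric: after each update, rescale any weight matrix whose spectral norm exceeds a chosen radius c. The helper project_to_norm_ball is hypothetical and not necessarily the paper's exact scheme.

```python
# Keep each layer's weights in a spectral-norm ball of radius c; the network's
# Lipschitz constant is then bounded by the product of the per-layer constants
# (assuming 1-Lipschitz activations such as ReLU).
import numpy as np

def project_to_norm_ball(weights, c=1.0):
    projected = []
    for W in weights:
        spectral_norm = np.linalg.norm(W, ord=2)   # largest singular value
        if spectral_norm > c:
            W = W * (c / spectral_norm)
        projected.append(W)
    return projected

weights = [np.random.randn(4, 8), np.random.randn(8, 2)]
weights = project_to_norm_ball(weights, c=1.0)
# Upper bound on the net's Lipschitz constant:
print(np.prod([np.linalg.norm(W, ord=2) for W in weights]))  # <= 1.0
```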

  11. Is Controlling the Lipschitz Constant of Transition Models Useful?
      ‣ Cartpole (left) and Pendulum (right)
      ‣ learn a model offline using random samples
      ‣ perform policy gradient using the model
      ‣ test the policy in the environment
      ‣ an intermediate Lipschitz constant improves reward (higher is better)
      [Plots: average return per episode for each Lipschitz-constant setting]
      more experiments (including on stochastic domains) in the paper

  12. Contributions:
      ‣ key role of the Lipschitz constant in model-based RL:
        ‣ compounding error
        ‣ value function estimation error
        ‣ Lipschitz continuity of the value function
      ‣ learning stochastic models using EM (skipped, details in the paper)
      ‣ quantifying the Lipschitz constant of neural nets (skipped, details in the paper)
      ‣ model regularization by controlling the Lipschitz constant
      ‣ usefulness of Wasserstein for model-based RL (skipped, details in the paper)
      Questions?

  13. References:
      ★ Littman and Szepesvári, "A Generalized Reinforcement-Learning Model: Convergence and Applications", 1996
      ★ Villani, "Optimal Transport: Old and New", 2008
      ★ Talvitie, "Model Regularization for Stable Sample Rollouts", 2014
      ★ Venkatraman, Hebert, and Bagnell, "Improving Multi-Step Prediction of Learned Time Series Models", 2015
