SLIDE 1

Lipschitz Continuity in Model-based Reinforcement Learning

Kavosh Asadi*, Dipendra Misra*, Michael L. Littman

* denotes equal contribution

SLIDE 2

Model-based RL


model learning: $\hat{T}(s' \mid s, a) \approx T(s' \mid s, a)$ and $\hat{R}(s, a) \approx R(s, a)$

planning: use the learned model to compute a value function or policy

[Diagram: the model-based RL loop — acting produces experience, experience drives model learning, the model supports planning, and planning yields the value/policy used for acting; alongside it, an imagined rollout $s_0 \to \hat{s}_1 \to \hat{s}_2 \to \hat{s}_3$ under $\hat{T}$, compared with the true states $s_1, s_2, s_3$.]
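To make "model learning" concrete, here is a minimal sketch (ours, not the authors' implementation) that fits a one-step dynamics model $\hat{T}$ by least squares; the linear model class and function names are illustrative assumptions.

```python
# A minimal, illustrative sketch of one-step model learning (not the paper's
# code): regress next states on (state, action) pairs to get a linear T_hat.
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of s' ~ A @ [s; a]; returns the matrix A."""
    X = np.hstack([states, actions])                # shape (N, d_s + d_a)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W.T                                      # shape (d_s, d_s + d_a)

def predict_next_state(A, s, a):
    """One-step model prediction s_hat' = A @ [s; a]."""
    return A @ np.concatenate([s, a])
```

For a linear model like this, the Lipschitz constant of $\hat{T}$ with respect to the state (under Euclidean metrics) is just the spectral norm of the state block of `A`, which connects model learning to the analysis on the following slides.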

SLIDE 3

Compounding Error

  • happens when models are imperfect, which is almost always the case
  • causes include estimation error and partial observability
  • also arises in the agnostic setting, where the model class cannot represent the true dynamics

[Video: side-by-side rollouts of the true environment ("truth") and the learned model ("model"); small one-step errors compound over time.]

credit to Matt Cooper for the video github.com/dyelax

[Talvitie 2014, Venkatraman et al. 2015]
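As a toy illustration of compounding error (our sketch, with made-up one-dimensional dynamics), roll a slightly wrong model forward and watch the gap to the true trajectory grow with the horizon:

```python
# Toy example: a small one-step model error compounds over an n-step rollout.
true_step  = lambda s: 1.00 * s + 0.1   # true dynamics T
model_step = lambda s: 1.05 * s + 0.1   # slightly inaccurate learned model T_hat

s, s_hat = 1.0, 1.0
for n in range(1, 11):
    s, s_hat = true_step(s), model_step(s_hat)
    print(f"step {n:2d}: |s - s_hat| = {abs(s - s_hat):.4f}")
```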

SLIDE 4

Main Takeaway

Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, a function $f : M_1 \mapsto M_2$ is Lipschitz if the Lipschitz constant defined below is finite:

$$K_{d_1, d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}$$

Lipschitz continuity plays a key role in compounding errors and, more generally, in the theory of model-based RL.

[Diagram: a function $f$ with $f(s_1)$ marked at $s_1$, illustrating the bounded-slope condition.]
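A hedged sketch of the definition in code: lower-bound $K(f)$ by maximizing the ratio over sampled pairs. Since the true constant is a supremum, sampling only ever underestimates it; the test function and sample sizes are arbitrary choices of ours.

```python
# Empirically lower-bound a function's Lipschitz constant by maximizing
# d2(f(x), f(y)) / d1(x, y) over sample pairs (Euclidean metrics assumed).
import numpy as np

def empirical_lipschitz(f, samples):
    """Max pairwise ratio ||f(x) - f(y)|| / ||x - y||; a lower bound on K(f)."""
    best = 0.0
    for i, x in enumerate(samples):
        for y in samples[i + 1:]:
            d1 = np.linalg.norm(x - y)
            if d1 > 0:
                best = max(best, np.linalg.norm(f(x) - f(y)) / d1)
    return best

# Example: f(s) = 3s + 1 has K(f) = 3, and the estimate recovers it exactly.
rng = np.random.default_rng(0)
print(empirical_lipschitz(lambda s: 3 * s + 1, rng.normal(size=(50, 2))))
```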

SLIDE 5

Wasserstein Metric

in stochastic domains, we need to quantify the difference between two distributions $\mu_1$ and $\mu_2$

$$W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \int\!\!\int j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1$$

where $\Lambda$ is the set of joint distributions (couplings) whose marginals are $\mu_1$ and $\mu_2$.

[Diagram: two probability densities $\mu_1$ and $\mu_2$ being compared.]

[Villani, 2008]
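For one-dimensional distributions, SciPy ships an exact Wasserstein-1 implementation, which we can use to sanity-check the coupling definition above; the Gaussian example is our own.

```python
# Sanity check: for two Gaussians differing only in mean, W1 equals the
# gap between the means (here ~0.5).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu1 = rng.normal(loc=0.0, scale=1.0, size=10_000)
mu2 = rng.normal(loc=0.5, scale=1.0, size=10_000)
print(wasserstein_distance(mu1, mu2))
```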

SLIDE 6

Three Theorems

  • multi-step prediction error
  • value function estimation error
  • Lipschitz continuity of value function

SLIDE 7

Multi-step Prediction Error

given a $\Delta$-accurate model $\hat{T}$ with Lipschitz constant $K(\hat{T})$, a true model $T$ with Lipschitz constant $K(T)$, and a state distribution $\mu(s)$:

assume the model is $\Delta$-accurate ($\Delta$ : one-step error):

$$\forall s\ \forall a \quad W\big(\hat{T}(\cdot \mid s, a),\, T(\cdot \mid s, a)\big) \le \Delta$$

then the $n$-step prediction error ($n$ : prediction horizon) satisfies

$$\delta(n) := W\big(\hat{T}^{n}(\cdot \mid \mu),\, T^{n}(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} \bar{K}^{i}, \qquad \bar{K} := \min\big\{K(T),\, K(\hat{T})\big\}$$
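Plugging numbers into the bound (a sketch with arbitrary values of ours): the bound grows linearly in $n$ when $\bar{K} = 1$ and geometrically when $\bar{K} > 1$, which is exactly the compounding-error picture from slide 3.

```python
# Evaluate the n-step prediction-error bound for a few Lipschitz constants.
delta, n = 0.1, 10

def multistep_bound(delta, k_bar, n):
    """delta * sum_{i=0}^{n-1} k_bar**i, the n-step Wasserstein error bound."""
    return delta * sum(k_bar**i for i in range(n))

for k_bar in (0.9, 1.0, 1.1):
    print(f"K_bar={k_bar}: delta({n}) <= {multistep_bound(delta, k_bar, n):.3f}")
```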

SLIDE 8

Value Function Estimation Error


how inaccurate can the value function be?

with $K(R)$ : the Lipschitz constant of the reward, the error is bounded:

$$\forall s \quad \big| V_{T}(s) - V_{\hat{T}}(s) \big| \le \frac{\gamma K(R)\, \Delta}{(1 - \gamma)(1 - \gamma \bar{K})}, \qquad \bar{K} := \min\big\{K(T),\, K(\hat{T})\big\}$$

[Diagram: the model-based RL loop — model error propagates through planning into the value function.]
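A quick numeric reading of the bound (our own example values): as $\gamma \to 1$ the bound blows up, so small model errors matter most for long-horizon problems. Note the bound needs $\gamma \bar{K} < 1$.

```python
# Evaluate the value-error bound for a few discount factors.
def value_error_bound(gamma, k_reward, delta, k_bar):
    assert gamma * k_bar < 1, "bound requires gamma * K_bar < 1"
    return gamma * k_reward * delta / ((1 - gamma) * (1 - gamma * k_bar))

for gamma in (0.9, 0.95, 0.99):
    print(f"gamma={gamma}: |V_T - V_That| <= "
          f"{value_error_bound(gamma, k_reward=1.0, delta=0.1, k_bar=1.0):.2f}")
```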

SLIDE 9
Lipschitz Continuity of Value Function

  • Generalized VI [Littman and Szepesvári, 96]: repeat until convergence:

$$Q(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a)\, f\big(Q(s', \cdot)\big)\, ds'$$

where $f$ is a Lipschitz operator (e.g., max).

  • the value function is Lipschitz at every iteration (including at the fixed point), assuming $\gamma K(T) < 1$:

$$K(Q) \le \frac{K(R)}{1 - \gamma K(T)}$$

  • one implication: value-aware model learning [Farahmand et al., 2017] is equivalent to minimizing Wasserstein (will appear at the PGMRL workshop later in the conference)
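A tabular sketch of Generalized VI (ours, not the authors' code): with $f = \max$ the update above is standard value iteration; the small random MDP below is purely illustrative.

```python
# Tabular Generalized VI with f = max on a random MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))   # T[s, a] is a distribution over s'
R = rng.uniform(size=(nS, nA))

Q = np.zeros((nS, nA))
for _ in range(500):
    # Q(s,a) <- R(s,a) + gamma * sum_{s'} T(s'|s,a) * f(Q(s', .)),  f = max
    Q = R + gamma * T @ Q.max(axis=1)
print(Q)
```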

SLIDE 10

Controlling Lipschitz Constant with Neural Nets


for each layer, keep the weights inside a chosen norm ball: the Lipschitz constant of the entire network is then bounded by the product of the Lipschitz constants of its layers (a minimal sketch follows below)
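A minimal sketch, assuming a plain ReLU MLP and the spectral norm as the per-layer norm ball (the paper's exact construction may differ): project each weight matrix back into the ball after every update, and the product of per-layer spectral norms bounds the whole network's Lipschitz constant.

```python
# Bound a ReLU MLP's Lipschitz constant by the product of layer spectral
# norms, and project each weight matrix into a spectral-norm ball.
import numpy as np

def project_spectral(W, c):
    """Clip singular values at c so that ||W||_2 <= c."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt

def lipschitz_upper_bound(weights):
    """ReLU is 1-Lipschitz, so K(net) <= product of layer spectral norms."""
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0] for W in weights]))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]
weights = [project_spectral(W, c=1.5) for W in weights]
print(lipschitz_upper_bound(weights))  # <= 1.5 * 1.5 = 2.25
```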

SLIDE 11

Is Controlling the Lipschitz Constant of Transition Models Useful?

[Plot: average return per episode on Cartpole (left) and Pendulum (right), for a range of Lipschitz-constant caps.]

  • learn a model offline using random samples
  • perform policy gradient using the model
  • test the policy in the environment
  • reward is improved (higher is better) by an intermediate Lipschitz value

more experiments (including on stochastic domains) in the paper

SLIDE 12

Contributions:

  • key role of Lipschitz constant in model-based RL:
  • compounding error
  • value function estimation error
  • Lipschitz continuity of value function
  • learning stochastic models using EM (skipped, details in the paper)
  • quantifying Lipschitz constant of neural nets (skipped, details in the paper)
  • model regularization by controlling the Lipschitz constant
  • usefulness of Wasserstein for model-based RL (skipped, details in the paper)


Questions?

SLIDE 13

References:

Littman and Szepesvári, "A Generalized Reinforcement-Learning Model: Convergence and Applications", 1996

Villani, "Optimal Transport, Old and New", 2008

Talvitie, "Model Regularization for Stable Sample Rollouts", 2014

Venkatraman, Hebert, and Bagnell, "Improving Multi-Step Prediction of Learned Time Series Models", 2015
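
Farahmand, Barreto, and Nikovski, "Value-Aware Loss Function for Model-based Reinforcement Learning", 2017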
