Bayesian RL Tutorial: Gaussian Process Temporal Difference Learning - PowerPoint PPT Presentation



SLIDE 1

Bayesian RL Tutorial

SLIDE 2

Gaussian Process Temporal Difference Learning

Yaakov Engel

Collaborators: Shie Mannor, Ron Meir

SLIDE 3

Why use GPs in RL?

  • A Bayesian approach to value estimation
  • Forces us to make our assumptions explicit
  • Non-parametric – priors are placed and inference is performed directly in function space (kernels)
  • But can also be defined parametrically
  • Domain knowledge intuitively coded in priors
  • Provides full posterior over values, not just point estimates
  • Efficient, on-line implementations, suitable for large problems

SLIDE 4

Gaussian Processes

Definition: “An indexed set of jointly Gaussian random variables.”

Note: The index set X may be just about any set.

Example: F(x), with index x ∈ [0, 1]^n.

F’s distribution is specified by its mean and covariance:

  E[F(x)] = m(x) ,   Cov[F(x), F(x′)] = k(x, x′)

Conditions on k: Symmetric, positive definite ⇒ k is a Mercer kernel
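To make the definition concrete, here is a minimal numerical sketch (assuming a squared-exponential kernel, one standard choice of Mercer kernel): any finite set of index points yields a jointly Gaussian vector whose covariance matrix is built from k.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel: symmetric, positive definite (a Mercer kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Any finite collection of index points gives a jointly Gaussian random vector
xs = np.linspace(0.0, 1.0, 50)        # index points in [0, 1]
m = np.zeros_like(xs)                 # mean function m(x) = 0
K = rbf_kernel(xs, xs)                # covariance matrix K[i, j] = k(x_i, x_j)
f_samples = np.random.multivariate_normal(m, K + 1e-10 * np.eye(len(xs)), size=3)
```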

SLIDE 5

Example: Parametric GP

A linear combination of basis functions: F(x) = φ(x)⊤W

[Diagram: the basis functions φ1(x), φ2(x), . . . , φn(x) are weighted by W1, W2, . . . , Wn and summed (Σ) to form F(x).]

If W ∼ N {mw, Cw}, then F is a GP: E[F(x)] = φ(x)⊤mw, Cov[F(x), F(x′)] = φ(x)⊤Cwφ(x′)
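A short sketch of this parametric construction (the particular basis functions and weight prior below are illustrative assumptions): since W is Gaussian, every value F(x) = φ(x)⊤W is Gaussian, with the mean and covariance given above.

```python
import numpy as np

# Illustrative features: n Gaussian bumps on [0, 1]
centers = np.linspace(0.0, 1.0, 10)

def phi(x):
    """Feature vector φ(x) of n basis functions."""
    return np.exp(-0.5 * ((x - centers) / 0.1) ** 2)

m_w = np.zeros(len(centers))   # prior mean of the weights W
C_w = np.eye(len(centers))     # prior covariance of the weights W

def prior_mean(x):
    return phi(x) @ m_w                 # E[F(x)] = φ(x)ᵀ m_w

def prior_cov(x, x_prime):
    return phi(x) @ C_w @ phi(x_prime)  # Cov[F(x), F(x')] = φ(x)ᵀ C_w φ(x')

# A single draw of W gives one sample function F(x) = φ(x)ᵀ W
W = np.random.multivariate_normal(m_w, C_w)
F = lambda x: phi(x) @ W
```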

SLIDE 6

Conditioning – Gauss-Markov Thm.

Theorem: Let Z and Y be random vectors, jointly distributed according to the multivariate normal distribution

  (Z, Y) ∼ N{ (mz, my) , [ Czz  Czy ; Cyz  Cyy ] } .

Then Z|Y ∼ N{Ẑ, P}, where

  Ẑ = mz + Czy Cyy⁻¹ (Y − my)
  P = Czz − Czy Cyy⁻¹ Cyz .
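The theorem is a few lines of linear algebra in code; a minimal sketch (using a linear solve rather than an explicit matrix inverse):

```python
import numpy as np

def condition_gaussian(m_z, m_y, C_zz, C_zy, C_yy, y):
    """Posterior moments of Z given Y = y, for jointly Gaussian (Z, Y)."""
    # Zhat = m_z + C_zy C_yy^{-1} (y - m_y)
    z_hat = m_z + C_zy @ np.linalg.solve(C_yy, y - m_y)
    # P = C_zz - C_zy C_yy^{-1} C_yz   (with C_yz = C_zy^T)
    P = C_zz - C_zy @ np.linalg.solve(C_yy, C_zy.T)
    return z_hat, P
```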

SLIDE 7

GP Regression

Sample: ((x1, y1), . . . , (xt, yt))
Model equation: Y(xi) = F(xi) + N(xi)
GP prior on F: F ∼ N{0, k(·, ·)}

[Diagram: graphical model in which each observation Y(xi) is the latent value F(xi) plus an independent noise term N(xi), for i = 1, . . . , t.]

N: IID zero-mean Gaussian noise, with variance σ2

SLIDE 8

GP Regression (ctd.)

Denote:

  Yt = (Y(x1), . . . , Y(xt))⊤ ,
  kt(x) = (k(x1, x), . . . , k(xt, x))⊤ ,
  Kt = [kt(x1), . . . , kt(xt)] .

Then

  (F(x), Yt) ∼ N{ 0 , [ k(x, x)  kt(x)⊤ ; kt(x)  Kt + σ2I ] } .

Now apply the conditioning formula to compute the posterior moments of F(x), given Yt = yt = (y1, . . . , yt)⊤.
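Putting the pieces together gives GP regression in a few lines; a minimal, self-contained sketch (the RBF kernel, noise level, and sinc target below are illustrative choices matching the example on the next slide):

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Illustrative Mercer kernel k(x, x')."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma=0.1):
    """Posterior mean and variance of F at x_test given noisy observations y_train."""
    K_t = rbf_kernel(x_train, x_train)              # K_t
    k_star = rbf_kernel(x_train, x_test)            # k_t(x) for each test point
    A = K_t + sigma ** 2 * np.eye(len(x_train))     # K_t + sigma^2 I
    mean = k_star.T @ np.linalg.solve(A, y_train)   # k_t(x)^T (K_t + sigma^2 I)^-1 y_t
    var = rbf_kernel(x_test, x_test).diagonal() \
          - np.sum(k_star * np.linalg.solve(A, k_star), axis=0)
    return mean, var                                # k(x,x) - k_t(x)^T (K_t + sigma^2 I)^-1 k_t(x)

# Example: recover sin(x)/x from noisy samples, as in the figure on the next slide
x_tr = np.random.uniform(-10, 10, 100)
y_tr = np.sinc(x_tr / np.pi) + 0.1 * np.random.randn(100)
mu, v = gp_posterior(x_tr, y_tr, np.linspace(-10, 10, 200))
```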

SLIDE 9

Example

[Figure: GP regression on the sinc function over x ∈ [−10, 10]; shown are the training set, the true SINC curve, the SGPR estimate, and a σ confidence band. Test err = 0.131.]

SLIDE 10

Markov Decision Processes

[Diagram: the controller observes the state xt and reward rt emitted by the MDP and selects the action at; z⁻¹ denotes a one-step delay.]

State space: X, state x ∈ X
Action space: A, action a ∈ A
Joint state-action space: Z = X × A, z = (x, a)
Transition prob. density: xt+1 ∼ p(·|xt, at)
Reward prob. density: R(xt, at) ∼ q(·|xt, at)

SLIDE 11

Control and Returns

Stationary policy: at ∼ µ(·|xt)
Path: ξµ = (z0, z1, . . .)
Discounted return: D(ξµ) = Σ_{i=0}^{∞} γ^i R(zi)
Value function: V µ(x) = Eµ[D(ξµ) | x0 = x]
State-action value func.: Qµ(z) = Eµ[D(ξµ) | z0 = z]

Goal: Find a policy µ∗ maximizing V µ(x) ∀x ∈ X

Note: If Q∗(x, a) = Qµ∗(x, a) is available, then an optimal action for state x is given by any a∗ ∈ argmaxa Q∗(x, a).

SLIDE 12

Value-Based RL

[Diagram: value-based RL. The policy µ(a|x) acts on the MDP (together they form a Markov reward process, MRP); the observed transitions and rewards are fed as learning data to a value estimator that outputs V̂(x) or Q̂(x, a), which in turn is used to update the policy.]

SLIDE 13

Bellman’s Equation

For a fixed policy µ:

  V µ(x) = E_{x′,a|x}[ R̄(x, a) + γ V µ(x′) ]

Optimal value and policy:

  V ∗(x) = max_µ V µ(x) ,   µ∗ = argmax_µ V µ(x)

How to solve it?

  • Methods based on Value Iteration (e.g. Q-learning)
  • Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)

SLIDE 14

Solution Method Taxonomy

[Diagram: taxonomy of RL algorithms. Value-Function based methods split into Value Iteration type (Q-Learning) and Policy Iteration type (Actor-Critic, OPI, SARSA); the other branch is Purely Policy based methods (Policy Gradient).]

PI methods need a “subroutine” for policy evaluation
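For a small finite MDP, this policy-evaluation subroutine can be written in closed form by solving the linear Bellman system; a minimal sketch (the transition matrix and rewards below are illustrative). GPTD, introduced next, generalizes exactly this step to large or continuous state spaces.

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma=0.9):
    """Exact policy evaluation for a finite MDP: solve V = r_pi + gamma * P_pi V."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Illustrative 3-state Markov reward process induced by a fixed policy mu
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.7, 0.2],
                 [0.0, 0.3, 0.7]])   # P_pi[i, j] = Pr(next state j | state i) under mu
r_pi = np.array([0.0, 1.0, -1.0])    # expected reward in each state under mu
V = evaluate_policy(P_pi, r_pi)
```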

SLIDE 15

What’s Missing?

Shortcomings of current policy evaluation methods:

  • Some methods can only be applied to small problems
  • No probabilistic interpretation - how good is the estimate?
  • Only parametric methods are capable of operating on-line
  • Non-parametric methods are more flexible but only work off-line
  • Small-step-size (stoch. approx.) methods use data inefficiently
  • Finite-time solutions lack interpretability; all statements are asymptotic

  • Convergence issues

SLIDE 16

GP Temporal Difference Learning

Model equations: R(xi) = V(xi) − γV(xi+1) + N(xi, xi+1)

Or, in compact form: Rt = Ht+1 Vt+1 + Nt , where

  Ht = [ 1  −γ   0  · · ·   0
         0   1  −γ  · · ·   0
         ·   ·   ·  · · ·   ·
         0   0  · · ·   1  −γ ]

Our (Bayesian) goal: Find the posterior distribution of V, given a sequence of observed states and rewards.
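A minimal sketch of the compact form: building the matrix Ht whose rows encode the temporal differences V(xi) − γV(xi+1) (one row per observed reward, one column per visited state; the exact indexing convention here is an illustrative choice).

```python
import numpy as np

def make_H(t, gamma):
    """Build the t x (t+1) GPTD matrix H: row i is [0 ... 1, -gamma ... 0]."""
    H = np.zeros((t, t + 1))
    idx = np.arange(t)
    H[idx, idx] = 1.0         # coefficient of V(x_i)
    H[idx, idx + 1] = -gamma  # coefficient of V(x_{i+1})
    return H

# make_H(3, 0.9):
# [[ 1.  -0.9  0.   0. ]
#  [ 0.   1.  -0.9  0. ]
#  [ 0.   0.   1.  -0.9]]
```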

SLIDE 17

Deterministic Dynamics

Bellman’s equation: V(xi) = R̄(xi) + γV(xi+1)

Define: N(x) = R(x) − R̄(x)

Assumption: the N(xi) are Normal, IID, with variance σ2.

Model equations: R(xi) = V(xi) − γV(xi+1) + N(xi)

In compact form: Rt = Ht+1 Vt+1 + Nt , with Nt ∼ N{0, σ2 I}

SLIDE 18

Stochastic Dynamics

The discounted return: D(xi) = EµD(xi) + (D(xi) − EµD(xi)) = V(xi) + ∆V(xi)

For a stationary MDP: D(xi) = R(xi) + γD(xi+1) , where xi+1 ∼ p(·|xi, ai), ai ∼ µ(·|xi).

Substitute and rearrange: R(xi) = V(xi) − γV(xi+1) + N(xi, xi+1) , where N(xi, xi+1) := ∆V(xi) − γ∆V(xi+1).

Assumption: the ∆V(xi) are Normal, i.i.d., with variance σ2.

In compact form: Rt = Ht+1 Vt+1 + Nt , with Nt ∼ N{0, σ2 Ht+1 Ht+1⊤}

SLIDE 19

The Posterior

General noise covariance: Cov[Nt] = Σt

Joint distribution:

  (Rt−1, V(x)) ∼ N{ 0 , [ Ht Kt Ht⊤ + Σt    Ht kt(x) ; kt(x)⊤ Ht⊤    k(x, x) ] }

Condition on Rt−1:

  E[V(x) | Rt−1 = rt−1] = kt(x)⊤ αt
  Cov[V(x), V(x′) | Rt−1 = rt−1] = k(x, x′) − kt(x)⊤ Ct kt(x′)

where

  αt = Ht⊤ (Ht Kt Ht⊤ + Σt)⁻¹ rt−1 ,
  Ct = Ht⊤ (Ht Kt Ht⊤ + Σt)⁻¹ Ht .
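A minimal batch sketch of these posterior formulas (the RBF kernel over states is an illustrative choice, and the stochastic-dynamics noise covariance Σt = σ2 Ht Ht⊤ from the previous slide is assumed):

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Illustrative Mercer kernel over (one-dimensional) states."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def gptd_posterior(states, rewards, x_query, gamma=0.95, sigma=1.0):
    """Batch GPTD: posterior mean/variance of V at x_query.
    states: x_0 .. x_t (length t+1); rewards: r_0 .. r_{t-1} (length t)."""
    t = len(rewards)
    H = np.zeros((t, t + 1))                      # rows encode V(x_i) - gamma V(x_{i+1})
    H[np.arange(t), np.arange(t)] = 1.0
    H[np.arange(t), np.arange(t) + 1] = -gamma
    K = rbf_kernel(states, states)                # K_t over visited states
    Sigma = sigma ** 2 * H @ H.T                  # Sigma_t = sigma^2 H H^T (stochastic dynamics)
    A = H @ K @ H.T + Sigma                       # H K H^T + Sigma_t
    alpha = H.T @ np.linalg.solve(A, rewards)     # alpha_t
    C = H.T @ np.linalg.solve(A, H)               # C_t
    k_q = rbf_kernel(states, x_query)             # k_t(x) for the query points
    v_mean = k_q.T @ alpha                        # k_t(x)^T alpha_t
    v_var = rbf_kernel(x_query, x_query).diagonal() - np.sum(k_q * (C @ k_q), axis=0)
    return v_mean, v_var
```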

SLIDE 20

Learning State-Action Values

Under a fixed stationary policy µ, the state-action pairs zt form a Markov chain, just like the states xt. Consequently Qµ(z) behaves similarly to V µ(x):

  R(zi) = Q(zi) − γQ(zi+1) + N(zi, zi+1)

Posterior moments:

  E[Q(z) | Rt−1 = rt−1] = kt(z)⊤ αt
  Cov[Q(z), Q(z′) | Rt−1 = rt−1] = k(z, z′) − kt(z)⊤ Ct kt(z′)

SLIDE 21

Policy Improvement

Optimistic Policy Iteration algorithms work by maintaining a policy evaluator Q̂t and selecting the action at time t semi-greedily w.r.t. the current state-action value estimates Q̂t(xt, ·).

  Policy evaluator             Parameters   OPI algorithm
  Online TD(λ) (Sutton)        wt           SARSA (Rummery & Niranjan)
  Online GPTD (Engel et al.)   αt, Ct       GPSARSA (Engel et al.)

SLIDE 22

GPSARSA Algorithm

Initialize α0 = 0, C0 = 0, D0 = {z0}, c0 = 0, d0 = 0, 1/s0 = 0

for t = 1, 2, . . .
  Observe xt−1, at−1, rt−1, xt
  at = SemiGreedyAction(xt, Dt−1, αt−1, Ct−1)
  dt = (γσ²t−1 / st−1) dt−1 + temporal difference
  ct = . . . ,  st = . . .
  αt = (αt−1 ; 0) + (ct / st) dt
  Ct = [ Ct−1  0 ; 0⊤  0 ] + (1 / st) ct ct⊤
  Dt = Dt−1 ∪ {zt}
end for

return αt, Ct, Dt
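The slides leave SemiGreedyAction unspecified; one possible instantiation, sketched below as an assumption rather than the tutorial’s definition, is to exploit the Q posterior’s uncertainty by Thompson-style sampling (or an optimistic bonus) over the candidate actions:

```python
import numpy as np

def semi_greedy_action(q_mean, q_std, rng, kind="thompson", kappa=1.0):
    """Pick an action using the GP posterior over Q(x_t, a).
    q_mean[a], q_std[a]: posterior mean and std of Q(x_t, a) for each action a."""
    if kind == "thompson":
        scores = rng.normal(q_mean, q_std)   # one sample per action from the posterior
    else:
        scores = q_mean + kappa * q_std      # optimistic (UCB-style) score
    return int(np.argmax(scores))

# Usage with illustrative posterior moments for three actions
rng = np.random.default_rng(0)
a_t = semi_greedy_action(np.array([0.2, 0.5, 0.4]), np.array([0.30, 0.05, 0.20]), rng)
```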

SLIDE 23

A 2D Navigation Task

[Figure: contour plot of the value function learned for the 2D navigation task; contour levels range from −60 to −10.]

SLIDE 24

Challenges

  • How to use value uncertainty?
  • What’s a disciplined way to select actions?
  • What’s the best noise covariance?
  • Bias, variance, learning curves
  • POMDPs
  • More complicated tasks

Questions?
