

SLIDE 1

Gaussian Process Temporal Difference Learning - Theory and Practice

Yaakov Engel

Collaborators: Shie Mannor, Ron Meir, Peter Szabo, Dmitry Volkinshtein, Nadav Aharony, Tzachi Zehavi

SLIDE 2

Kernel-RL workshop – ICML’06

Timeline

  • ICML'03: "Bayes meets Bellman" paper – GPTD model for MDPs with deterministic transitions
  • ICML'05: "RL with GPs" paper – GPTD model for general MDPs + GPSARSA for learning control
  • NIPS'05: "Learning to Control an Octopus Arm" – GPTD applied to a high-dimensional control problem
  • OPNET'05: Network association control with GPSARSA

SLIDE 3

Why use GPs in RL?

  • A Bayesian approach to value estimation
  • Forces us to make our assumptions explicit
  • Non-parametric – priors are placed and inference is performed directly in function space (kernels)
  • But can also be defined parametrically
  • Domain knowledge is intuitively encoded in priors
  • Provides a full posterior, not just point estimates
  • Efficient on-line implementations, suitable for large problems

SLIDE 4

The Bayesian Approach

(Figure: graphical model – hidden process Z generating observable Y)

  • Z – hidden process, Y – observable
  • We want to infer Z from measurements of Y
  • The statistical dependence between Z and Y is known: P(Y|Z)
  • Place a prior over Z, reflecting our uncertainty: P(Z)
  • Observe Y = y
  • Compute the posterior:

    P(Z|Y = y) = P(y|Z)P(Z) / ∫ dZ′ P(y|Z′)P(Z′)

SLIDE 5

Gaussian Processes

Definition: "An indexed set of jointly Gaussian random variables."
Note: the index set X may be just about any set.
Example: F(x), with index x ∈ [0, 1]^n.

F's distribution is specified by its mean and covariance:

    E[F(x)] = m(x) ,   Cov[F(x), F(x′)] = k(x, x′)

m is a function X → R, and k is a function X × X → R.
Conditions on k: symmetric and positive definite ⇒ k is a Mercer kernel.
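The Mercer-kernel conditions above can be checked numerically; a minimal sketch (the RBF kernel, its lengthscale, and the sample points are illustrative choices, not from the slides):

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    # Gaussian (RBF) kernel -- a standard example of a Mercer kernel.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

def gram_matrix(points, kernel):
    # Covariance (Gram) matrix K with K[i, j] = k(x_i, x_j).
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

# Index set: a few points in [0, 1]^2 (illustrative)
X = [np.array([0.1, 0.2]), np.array([0.5, 0.9]), np.array([0.3, 0.3])]
K = gram_matrix(X, rbf_kernel)

# Symmetric and positive definite, as required of a Mercer kernel
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > 0)
```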

SLIDE 6

GP Regression

Model equation: Y(x) = F(x) + N(x)
Prior: F ∼ N{0, k(·, ·)}
Noise: N ∼ N{0, σ²δ(· − ·)}
Goal: find the posterior distribution of F, given a sample of Y (via Bayes' rule).
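The GP regression posterior has a closed form; a minimal sketch on the noisy-sinc setting of the next slide (the RBF kernel, its lengthscale, and the noise level are illustrative assumptions, and a real implementation would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a - b) ** 2 / (2.0 * ell ** 2))

def gp_posterior(x_train, y, x_test, sigma2=0.1):
    # Posterior mean/variance of F at x_test under
    # Y(x) = F(x) + N(x), F ~ N{0, k}, N ~ N{0, sigma2 * I}.
    K = rbf(x_train[:, None], x_train[None, :])
    ks = rbf(x_train[:, None], x_test[None, :])      # cross-covariances
    G = np.linalg.inv(K + sigma2 * np.eye(len(x_train)))
    mean = ks.T @ G @ y
    var = rbf(x_test, x_test) - np.sum(ks * (G @ ks), axis=0)
    return mean, var

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 30)
y = np.sinc(x) + 0.1 * rng.standard_normal(30)       # noisy sinc samples
mean, var = gp_posterior(x, y, np.array([0.0]))      # posterior at x = 0
```

The posterior variance at a well-sampled input drops well below the prior variance k(x, x) = 1, which is exactly the "σ confidence" band shown on the next slide.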

SLIDE 7

Example

(Figure: GP regression on noisy samples of the sinc function over [−10, 10], showing the training set, the true SINC curve, the sparse GP estimate (SGPR) and σ confidence intervals; test error = 0.131)

SLIDE 8

Markov Decision Processes

X: state space
U: action space
Transitions: p: X × X × U → [0, 1] ,  x_{t+1} ∼ p(·|x_t, u_t)
Rewards: q: R × X × U → [0, 1] ,  R(x_t, u_t) ∼ q(·|x_t, u_t)
Stationary policy: µ: U × X → [0, 1] ,  u_t ∼ µ(·|x_t)
Discounted return: D^µ(x) = Σ_{i=0}^∞ γ^i R(x_i, u_i) | (x_0 = x)
Value function: V^µ(x) = E_µ[D^µ(x)]
Goal: find a policy µ* maximizing V^µ(x) for all x ∈ X
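The discounted return of one sampled trajectory can be accumulated backwards in a single pass; a small sketch (the reward sequence and γ are made up):

```python
def discounted_return(rewards, gamma):
    # D(x_0) for one sampled trajectory: sum_i gamma^i * r_i,
    # accumulated backwards so each reward is multiplied only once.
    d = 0.0
    for r in reversed(rewards):
        d = r + gamma * d
    return d

# Example: three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25
assert discounted_return([1.0, 1.0, 1.0], 0.5) == 1.75
```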

SLIDE 9

Bellman's Equation

For a fixed policy µ:

    V^µ(x) = E_{x′,u|x}[R(x, u) + γV^µ(x′)]

Optimal value and policy:

    V*(x) = max_µ V^µ(x) ,   µ* = argmax_µ V^µ(x)

How to solve it?

  • Methods based on Value Iteration (e.g. Q-learning)
  • Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
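For a tabular MDP, the Value Iteration route can be sketched in a few lines; the two-state, two-action MDP below is invented purely for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma, iters=500):
    # Tabular value iteration: P[a] is the transition matrix under action a,
    # R[a] the expected-reward vector. Repeatedly applies the Bellman
    # optimality operator V <- max_a (R_a + gamma * P_a V).
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        V = np.max(R + gamma * P @ V, axis=0)
    return V

# Two-state, two-action toy MDP (numbers are made up)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch states
R = np.array([[0.0, 1.0],                 # action 0 rewards per state
              [0.5, 0.0]])                # action 1 rewards per state
V = value_iteration(P, R, gamma=0.9)      # -> approximately [9.5, 10.0]
```

The fixed point is V* = (9.5, 10): state 1 keeps collecting reward 1 forever (1/(1−0.9) = 10), and state 0's best move is to switch (0.5 + 0.9·10 = 9.5).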

SLIDE 10

Solution Method Taxonomy

(Diagram: RL algorithms divide into value-function-based and purely policy-based (policy gradient) methods; the value-function-based methods divide into Policy Iteration type (Actor-Critic, OPI, SARSA) and Value Iteration type (Q-Learning))

PI methods need a "subroutine" for policy evaluation.

SLIDE 11

What's Missing?

Shortcomings of current policy evaluation methods:

  • Some methods can only be applied to small problems
  • No probabilistic interpretation – how good is the estimate?
  • Only parametric methods are capable of operating on-line
  • Non-parametric methods are more flexible, but only work off-line
  • Small-step-size (stochastic approximation) methods use data inefficiently
  • Finite-time solutions lack interpretability; all statements are asymptotic
  • Convergence issues

SLIDE 12

Gaussian Process Temporal Difference Learning

Model equations: R(x_i) = V(x_i) − γV(x_{i+1}) + N(x_i, x_{i+1})

Or, in compact form: R_t = H_{t+1}V_{t+1} + N_t , where

    H_{t+1} = ⎡ 1  −γ           ⎤
              ⎢    1  −γ        ⎥
              ⎢       ⋱   ⋱     ⎥
              ⎣          1  −γ  ⎦

Our (Bayesian) goal: find the posterior distribution of V, given a sequence of observed states and rewards.
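The bidiagonal structure of H_{t+1} is easy to verify numerically; a sketch (the trajectory values are made up) checking that R_t = H_{t+1}V_{t+1} reproduces R_i = V_i − γV_{i+1} in the noiseless case:

```python
import numpy as np

def h_matrix(t, gamma):
    # The t x (t+1) matrix H_{t+1}: row i has 1 at column i, -gamma at i+1.
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i] = 1.0
        H[i, i + 1] = -gamma
    return H

gamma = 0.9
H = h_matrix(3, gamma)
V = np.array([4.0, 3.0, 2.0, 1.0])   # values along a 4-state trajectory (made up)
R = H @ V                            # noiseless rewards: R_i = V_i - gamma * V_{i+1}
assert np.allclose(R, V[:-1] - gamma * V[1:])
```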

SLIDE 13

Deterministic Dynamics

Bellman’s Equation: V (xi) = ¯ R(xi) + γV (xi+1) Define: N(x) = R(x) − ¯ R(x) Assumption: N(xi) are Normal, i.i.d., with variance σ2. Model Equations: R(xi) = V (xi) − γV (xi+1) + N(xi) In compact form: Rt = Ht+1Vt+1 + Nt , with Nt ∼ N

  • 0, σ2I
  • 13/30
SLIDE 14

Stochastic Dynamics

The discounted return:

    D(x_i) = E_µ D(x_i) + (D(x_i) − E_µ D(x_i)) = V(x_i) + ∆V(x_i)

For a stationary MDP:

    D(x_i) = R(x_i) + γD(x_{i+1})   (where x_{i+1} ∼ p(·|x_i, u_i), u_i ∼ µ(·|x_i))

Substitute and rearrange:

    R(x_i) = V(x_i) − γV(x_{i+1}) + N(x_i, x_{i+1}) ,   N(x_i, x_{i+1}) := ∆V(x_i) − γ∆V(x_{i+1})

Assumption: the ∆V(x_i) are Normal, i.i.d., with variance σ².
In compact form: R_t = H_{t+1}V_{t+1} + N_t , with N_t ∼ N{0, σ²H_{t+1}H_{t+1}⊤}
SLIDE 15

The Posterior

General noise covariance: Cov[N_t] = Σ_t

Joint distribution:

    ⎡ R_{t−1} ⎤      ⎛ ⎡ 0 ⎤   ⎡ H_t K_t H_t⊤ + Σ_t   H_t k_t(x) ⎤ ⎞
    ⎣ V(x)    ⎦ ∼ N ⎝ ⎣ 0 ⎦ , ⎣ k_t(x)⊤H_t⊤          k(x, x)    ⎦ ⎠

Invoke Bayes' rule:

    E[V(x) | R_{t−1} = r_{t−1}] = k_t(x)⊤α_t
    Cov[V(x), V(x′) | R_{t−1} = r_{t−1}] = k(x, x′) − k_t(x)⊤C_t k_t(x′)

where

    k_t(x) = (k(x_0, x), …, k(x_t, x))⊤ ,   K_t = [k_t(x_0), …, k_t(x_t)]
    α_t = H_t⊤(H_t K_t H_t⊤ + Σ_t)⁻¹ r_{t−1} ,   C_t = H_t⊤(H_t K_t H_t⊤ + Σ_t)⁻¹ H_t
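The posterior equations above translate directly into a naive O(t³) batch computation; a sketch under illustrative choices of kernel, trajectory, and noise level (the on-line, sparsified algorithms in the papers avoid this direct matrix inversion):

```python
import numpy as np

def gptd_posterior(states, rewards, x, k, gamma, sigma2):
    # Batch GPTD posterior for one trajectory (x_0..x_t, r_0..r_{t-1}),
    # with the stochastic-dynamics noise covariance Sigma = sigma2 * H H^T.
    t = len(rewards)
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    K = np.array([[k(a, b) for b in states] for a in states])
    kx = np.array([k(s, x) for s in states])
    Ginv = np.linalg.inv(H @ K @ H.T + sigma2 * H @ H.T)
    alpha = H.T @ Ginv @ np.asarray(rewards)     # alpha_t
    C = H.T @ Ginv @ H                           # C_t
    mean = kx @ alpha                            # E[V(x) | r]
    var = k(x, x) - kx @ C @ kx                  # Cov[V(x), V(x) | r]
    return mean, var

rbf = lambda a, b: np.exp(-(a - b) ** 2)
states = [0.0, 0.5, 1.0, 1.5]                    # observed trajectory (made up)
rewards = [1.0, 0.8, 0.6]
m, v = gptd_posterior(states, rewards, 0.5, rbf, gamma=0.9, sigma2=0.1)
```

Note that, unlike TD(0) with function approximation, this returns a posterior variance v alongside the value estimate m.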

SLIDE 16

A Parametric Gaussian Process Model

A linear combination of features: V(x) = φ(x)⊤W

(Figure: a linear architecture – the features φ_1(x), …, φ_n(x) are weighted by W_1, …, W_n and summed)

Prior on W: Gaussian, with E[W] = 0, Cov[W, W] = I
Prior on V: Gaussian, with E[V(x)] = 0, Cov[V(x), V(x′)] = φ(x)⊤φ(x′)

SLIDE 17

Comparison of Models

                        Parametric            Nonparametric
  Parametrization       V(x) = φ(x)⊤W         None, V is V
  Prior                 W ∼ N{0, I}           V ∼ N{0, k(·, ·)}
  E[V(x)]               0                     0
  Cov[V(x), V(x′)]      φ(x)⊤φ(x′)            k(x, x′)
  We seek               W | R_{t−1}           V(x) | R_{t−1}

If we can find a set of basis functions satisfying φ(x)⊤φ(x′) = k(x, x′), the two models become equivalent. In fact, such a set always exists [Mercer]. However, it may be infinite.
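The equivalence can be illustrated with a finite feature map: for example, φ(x) = (1, √2·x, x²) reproduces the polynomial kernel k(x, x′) = (1 + xx′)². This particular kernel/feature pair is an illustrative choice, not one used in the slides:

```python
import numpy as np

def phi(x):
    # Finite feature map whose inner product equals the polynomial kernel
    # k(x, x') = (1 + x * x')^2 -- one concrete instance of Mercer's theorem.
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

k = lambda x, y: (1.0 + x * y) ** 2

# phi(x)^T phi(x') = 1 + 2 x x' + x^2 x'^2 = (1 + x x')^2
for x, y in [(0.3, -1.2), (2.0, 0.5), (0.0, 7.0)]:
    assert np.isclose(phi(x) @ phi(y), k(x, y))
```

For the Gaussian kernel, the corresponding feature set is infinite, which is why the nonparametric, kernel-space view is the more general one.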

SLIDE 18

Relation to Monte-Carlo Estimation

In the stochastic model: Σ_t = σ²H_{t+1}H_{t+1}⊤

Also, let (Y_t)_i = Σ_{j=i}^{t} γ^{j−i} R(x_j, u_j)

Then:

    E[W | R_t] = (Φ_t Φ_t⊤ + σ²I)⁻¹ Φ_t Y_t
    Cov[W | R_t] = σ²(Φ_t Φ_t⊤ + σ²I)⁻¹

That is exactly the solution to GP regression on Monte-Carlo samples of the discounted return.

SLIDE 19

MAP / ML Solutions

Since the posterior is Gaussian:

    ŵ_{t+1}^{MAP} = E[W | R_t] = (Φ_t Φ_t⊤ + σ²I)⁻¹ Φ_t Y_t

Performing ML inference with the same model, we get:

    ŵ_{t+1}^{ML} = (Φ_t Φ_t⊤)⁻¹ Φ_t Y_t

That is the LSTD(1) (least-squares Monte-Carlo) solution.

SLIDE 20

Policy Improvement

How can we perform policy improvement?
State values? Not without a transition model (and even then it is tricky).
State-action (Q-) values? Yes!

Idea: use a state-action value GP. How?

  • Define a state-action kernel: k((x, u), (x′, u′))
  • Run GPTD on state-action pairs
  • Use some semi-greedy action selection rule

We call this GPSARSA.
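The slides leave the semi-greedy rule open; one plausible instance, enabled by having a full posterior rather than point estimates, samples each action's Q-value from the GP posterior and acts greedily on the sample (a Thompson-sampling-style rule; the numbers below are hypothetical):

```python
import numpy as np

def semi_greedy_action(q_mean, q_std, rng):
    # Draw one sample of each action's Q-value from its posterior and act
    # greedily on the sample: uncertain actions still get explored.
    samples = rng.normal(q_mean, q_std)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# Hypothetical posterior over the Q-values of 3 actions at the current state
q_mean = np.array([1.0, 0.9, 0.2])
q_std = np.array([0.1, 0.3, 0.05])
action = semi_greedy_action(q_mean, q_std, rng)

# With zero posterior uncertainty the rule reduces to plain greedy selection
assert semi_greedy_action(np.array([0.0, 1.0, 0.0]), np.zeros(3), rng) == 1
```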

SLIDE 21

A Simple Experiment

(Figure: contour plot of the learned value function, with contour levels ranging from −60 up to −10)

SLIDE 22

The Octopus Arm

  • Can bend and twist at any point
  • Can do this in any direction
  • Can be elongated and shortened
  • Can change its cross section
  • Can grab using any part of the arm
  • Virtually infinitely many degrees of freedom

SLIDE 23

Our Arm Model

(Figure: 2D arm model – a chain of compartments C_1, …, C_N between muscle pairs #1 to #N+1, each with dorsal and ventral longitudinal muscles and a transverse muscle, running from the arm base to the arm tip)

SLIDE 24

Actions

Each action specifies a set of fixed activations – one for each muscle in the arm.

(Figure: six basic actions, Action #1 through Action #6)

Base rotation adds duplicates of actions 1, 2, 4 and 5, with positive and negative torques applied to the base.

SLIDE 25

The Control Problem

Starting from a random position, bring {any part, tip} of the arm into contact with a goal region, optimally.

Optimality criteria: time, energy, obstacle avoidance
Constraint: we only have access to sampled trajectories
Our approach: define the problem as an MDP and solve it using a GPTD algorithm

SLIDE 26

The Task

(Figure: snapshot of the simulated arm reaching for the goal region, at t = 1.38)

SLIDE 27

Movies


SLIDE 28

Association Control in WLANs


SLIDE 29

Association Control in WLANs

Setting: n users, m ≪ n access points (APs)
The problem: associate users with APs, optimally.
Complications: users are not all alike; they move around and change their behavior over time; and what is meant by "optimally"?
Idea: model the system as an MDP and solve it using GPSARSA

Results:

  • Tested on simple networks using the OPNET simulator
  • Preliminary results look promising
  • More work is needed

SLIDE 30

Challenges

  • How to use value uncertainty?
  • What’s a disciplined way to select actions?
  • What’s the best noise covariance?
  • Bias, variance, learning curves
  • POMDPs
  • More complicated tasks

Questions?
