SLIDE 1

Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration

Tobias Jung and Peter Stone

Department of Computer Science University of Texas at Austin

{tjung,pstone}@cs.utexas.edu

Outline:

1. Motivation & framework
2. Technical implementation
3. Experiments

GP-RMAX – ECML 09/21/10 – p.1/18

SLIDE 2

Part I: Motivation & Overview

This is what we want to do (and why)

SLIDE 3

Objective: dynamic programming

Consider: a time-discrete decision process t = 0, 1, 2, . . . with

  • X ⊂ R^D: state space (continuous)
  • A: action space (finite)
  • Transition function x_{t+1} = f(x_t, a_t) (deterministic)
  • Reward function r(x_t, a_t) (immediate payoff)

Goal: For any x_0, find actions a_0, a_1, . . . such that

Σ_{t≥0} γ^t r(x_t, a_t)

is maximized.

Dynamic programming (value iteration): If the transitions f and the reward r are known, we can solve

Q = TQ, where (TQ)(x, a) := r(x, a) + γ max_{a′} Q(f(x, a), a′)  ∀x, a

to obtain Q*, the optimal value function. Once Q* is calculated, the best action in x_t is simply argmax_a Q*(x_t, a).

Problems: Usually f and r are not known a priori ⇒ must be learned from samples. (The state-action space may also be too big to do VI; we will largely ignore this.)

Our goal: we want to improve sample efficiency.
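The fixed point Q = TQ above can be illustrated on a toy problem. The 5-state chain, its dynamics, and all constants below are illustrative assumptions (not from the paper); only the update Q ← TQ mirrors the slide.

```python
import numpy as np

# Toy deterministic MDP: a 5-state chain. Action 0 moves left, action 1
# moves right; reward 1 whenever the move lands on the last state.
n_states, n_actions, gamma = 5, 2, 0.9

def f(x, a):          # deterministic transition function x' = f(x, a)
    return max(x - 1, 0) if a == 0 else min(x + 1, n_states - 1)

def r(x, a):          # immediate reward (known a priori)
    return 1.0 if f(x, a) == n_states - 1 else 0.0

Q = np.zeros((n_states, n_actions))
for _ in range(100):  # iterate Q <- TQ toward the fixed point Q*
    Q = np.array([[r(x, a) + gamma * Q[f(x, a)].max()
                   for a in range(n_actions)] for x in range(n_states)])

policy = Q.argmax(axis=1)  # best action in x is argmax_a Q*(x, a)
```

After convergence the greedy policy moves right in every state, toward the rewarding end of the chain.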

SLIDE 4

Model-based reinforcement learning

Remark: throughout the paper we will assume that the reward function is specified a priori.

⇒ Sample efficiency of RL wholly depends on the sample efficiency of the model learner.

SLIDE 5

Overview of the talk

Benefits of model-based RL:

  • More sample efficient than model-free (however, also more computationally expensive): samples are only used to learn the model, not as test-points in value iteration.
  • Sample efficiency of RL wholly depends on the sample efficiency of the model learner.
  • (The model can be reused to solve different tasks in the same environment.)

Model-based RL requires us to worry about 3 things:

1. How to implement the planner? Here: simple interpolation on a grid. (not part of this paper)
2. How to implement the model learner?
3. How to implement exploration?

Our contribution, GP-RMAX: model learner = Gaussian process regression

  • Fully Bayesian: provides natural (un)certainty for each prediction.
  • Automated, data-driven hyperparameter selection.
  • Framework for feature selection: find & eliminate irrelevant variables/directions:
      • improves generalization & prediction performance ⇒ faster model learning.
      • improves uncertainty estimates ⇒ more efficient exploration.

Experiments indicate that highly sample-efficient online RL is possible.

SLIDE 6

Motivation: GP+ARD Can Reduce Need for Exploration

Example: compare three approaches for model learning in a 100 × 100 gridworld.

  • 100 × 100 cells, with Start and Goal; actions: Right, Up
  • Right: x_new = x_old + 0.01, y_new = y_old
  • Up: x_new = x_old, y_new = y_old + 0.01

After observing 20 transitions, we plot how certain each model is about its predictions for Right:

[Figure: three uncertainty maps over the (x, y) state space: a 10 × 10 grid model, a hand-tuned uniform RBF model, and a GP with ARD kernel]

GP+ARD detects that the y-coordinate is irrelevant ⇒ reduced exploration ⇒ faster learning.

SLIDE 7

Part II: Technical implementation

This is how we do it

SLIDE 8
a. Model learning with GPs

SLIDE 9

Model learning with GPs

General idea: We have to learn the D-dimensional transition function x′ = f(x, a). To do this, we combine multiple univariate GPs.

Training: The data consists of transitions {(x_t, a_t, x′_t)}_{t=1}^N, where x′_t = f(x_t, a_t) and x_t, x′_t ∈ R^D. We train one GP independently for each state variable and each action:

  • GP_ij models the i-th state variable under action a = j.
  • GP_ij has hyperparameters θ_ij found by maximizing the log marginal likelihood

    L(θ_ij) = − ½ log det(K_θij + σI) − ½ yᵀ(K_θij + σI)⁻¹ y − (n/2) log 2π.

Once trained, GP_ij produces for any state x∗:

  • Prediction: f̃_i(x∗, a = j) := k_θij(x∗)ᵀ (K_θij + σI)⁻¹ y.
  • Uncertainty: c̃_i(x∗, a = j) := k_θij(x∗, x∗) − k_θij(x∗)ᵀ (K_θij + σI)⁻¹ k_θij(x∗).

At the end, the predictions of the individual state variables are stacked together.

SLIDE 10

Automatic relevance determination

An automated procedure for hyperparameter selection:

⇒ can use covariance functions with a larger number of hyperparameters (infeasible to set by hand)
⇒ better fit the regularities of the data, remove what is irrelevant

Covariance: We consider three variants of the form

k_θ(x, x′) = v0 exp(− ½ (x − x′)ᵀ Ω (x − x′)) + b

with scalar hyperparameters v0, b and matrix Ω given by:

  • Variant I: Ω = hI.
  • Variant II: Ω = diag(a_1, . . . , a_D).
  • Variant III: Ω = M_k M_kᵀ + diag(a_1, . . . , a_D).

Note: Variants (II) and (III) contain adjustable parameters for every state variable. Setting them automatically from the data ⇒ model selection automatically determines their relevance. We can use likelihood scores to prune irrelevant state variables.

[Figure: contour plots of the three covariance variants: (I) isotropic with scale h; (II) axis-aligned with scales a_1, a_2; (III) with scales s_1, s_2 along directions u_1, u_2]
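A minimal sketch of Variant II (the ARD kernel): with one inverse-squared-lengthscale a_d per dimension, a near-zero a_d removes that dimension from the distance, so states differing only in an irrelevant coordinate look (almost) identical to the kernel. The numbers below are illustrative assumptions.

```python
import numpy as np

def ard_kernel(A, B, a, v0=1.0, b=0.0):
    # Variant II: k(x, x') = v0 * exp(-1/2 (x - x')^T diag(a) (x - x')) + b
    diff = A[:, None, :] - B[None, :, :]
    d2 = (diff**2 * a).sum(-1)
    return v0 * np.exp(-0.5 * d2) + b

# Two states that differ only in the (irrelevant) y-coordinate:
x1 = np.array([[0.3, 0.1]])
x2 = np.array([[0.3, 0.9]])
a_uniform = np.array([10.0, 10.0])  # isotropic: the y-difference hurts similarity
a_ard = np.array([10.0, 1e-6])      # ARD pruned y: the states look identical
k_iso = ard_kernel(x1, x2, a_uniform)[0, 0]
k_ard = ard_kernel(x1, x2, a_ard)[0, 0]
```

Under the uniform scales the two states are nearly uncorrelated; under the ARD scales they are nearly perfectly correlated, which is exactly what reduces exploration along irrelevant directions.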

SLIDE 11
b. Planning (with approximate model)

SLIDE 12

Value iteration in R^D

Remember: the input to the planner is the current model. The current model produces for any (x, a):

  • f̃(x, a), the predicted successor state
  • c̃(x, a), the associated uncertainty (0 = certain, 1 = uncertain)

General idea: value iteration on a grid Γ_h + multidimensional interpolation. Instead of the true transition function, simulate transitions with the current model. As in RMAX, integrate exploration into the value updates. (Nouri & Littman 2009)

Algorithm: iterate k = 1, 2, . . . : for every node ξ_i ∈ Γ_h and action a,

Q_{k+1}(ξ_i, a) = (1 − c̃(ξ_i, a)) · [ r(ξ_i, a) + γ max_{a′} Q_k(f̃(ξ_i, a), a′) ] + c̃(ξ_i, a) · V_MAX

where r(ξ_i, a) is given a priori and Q_k(f̃(ξ_i, a), a′) is evaluated by interpolation in R^D.

Note: If c̃(ξ_i, a) ≈ 0, there is no exploration. If c̃(ξ_i, a) ≈ 1, the state is artificially made more attractive ⇒ exploration.
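The blended backup above can be sketched on a toy 1-D grid. The dynamics f̃, the uncertainty c̃, and all constants are hand-made assumptions standing in for a learned model; only the update rule mirrors the slide.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 11)   # the grid Gamma_h
gamma, V_MAX = 0.9, 10.0

def f_tilde(x, a):                 # model's predicted successor: step left/right
    return np.clip(x + (0.1 if a == 1 else -0.1), 0.0, 1.0)

def c_tilde(x, a):                 # pretend the model is unsure near x = 1
    return 1.0 if x > 0.75 else 0.0

def r(x, a):                       # reward, given a priori
    return 1.0 if f_tilde(x, a) >= 1.0 else 0.0

def interp_Q(Q, x):                # interpolation of Q between grid nodes
    return np.array([np.interp(x, grid, Q[:, a]) for a in range(2)])

Q = np.zeros((len(grid), 2))
for k in range(200):
    Q_new = np.empty_like(Q)
    for i, xi in enumerate(grid):
        for a in range(2):
            c = c_tilde(xi, a)
            backup = r(xi, a) + gamma * interp_Q(Q, f_tilde(xi, a)).max()
            Q_new[i, a] = (1 - c) * backup + c * V_MAX  # RMAX-like blend
    Q = Q_new
```

Uncertain nodes are pinned at V_MAX, and their inflated values propagate leftward through the backups, pulling the greedy policy toward the unexplored region.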

SLIDE 13

Part III: Experiments

These are the results

SLIDE 14

Experimental setup

What we examine: the online learning performance of GP-RMAX, that is, sample complexity and quality of the learned behavior, in various popular benchmark domains.

Domains:

  • Mountain car (2D state space)
  • Inverted pendulum (2D state space)
  • Bicycle balancing (4D state space)
  • Acrobot (swing-up) (4D state space)

Contestants:

  • Sarsa(λ) + tile coding
  • GP-RMAXexp (exploration where uncertainty is determined from the GP)
  • GP-RMAXnoexp (no exploration)
  • GP-RMAXgrid (exploration where uncertainty is determined from the grid)

SLIDE 15

Results 2D domains

[Figure: four learning curves.
  • Mountain car (GP-RMAX): steps to goal vs. episodes (lower is better) for GP-RMAXexp, GP-RMAXnoexp, GP-RMAXgrid5, GP-RMAXgrid10, vs. optimal.
  • Mountain car (Sarsa): Sarsa(λ) tile coding 10 and 20, vs. optimal.
  • Inverted pendulum (GP-RMAX): total reward vs. episodes (higher is better) for the same GP-RMAX variants, vs. optimal.
  • Inverted pendulum (Sarsa): Sarsa(λ) tile coding 10 and 40, vs. optimal.]

SLIDE 16

Results 4D domains

[Figure: four learning curves.
  • Bicycle balancing (GP-RMAX): total reward vs. episodes (higher is better) for GP-RMAXexp, GP-RMAXnoexp, GP-RMAXgrid5, vs. optimal; "bicycle not balanced" level marked.
  • Bicycle balancing (Sarsa): Sarsa(λ) tile coding 7 and 10, vs. optimal.
  • Acrobot (GP-RMAX): steps to goal vs. episodes (lower is better) for the same GP-RMAX variants, vs. optimal.
  • Acrobot (Sarsa): Sarsa(λ) tile coding 7 and 20, vs. optimal.]

SLIDE 17

Finish

GP-RMAX: Online model-based RL that separates function approximation in the model learner (which requires samples) from interpolation in the planner (which does not require samples). It employs GPs with data-driven, automatic hyperparameter selection (feature selection):

  • improves generalization & prediction performance ⇒ faster model learning
  • improves uncertainty estimates ⇒ more efficient exploration

⇒ Large gains over model-free RL are possible (if model learning is easier than value-function learning).

Limitations & future work:

  • Major problem: the planner relies on global value iteration. A naive grid is limited to low dimensionality. Fancier grids (sparse, adaptive) might scale to higher dimensionality, but this is largely open research.
  • Minor problems: doing away with our simplifying assumptions:
      • deterministic state transitions (experiments done with well-behaved simulations)
      • known reward function
      • discrete (finite) actions

SLIDE 18

Related work

Closely related:

[1] A. Nouri and M. L. Littman. Dimension reduction and its application to model-based exploration in continuous spaces. ECML, 2010.
[2] S. Davies. Multidimensional triangulation and interpolation for reinforcement learning. NIPS, 1996.
[3] T. Hester, M. Quinlan, and P. Stone. Generalized model learning for reinforcement learning on a humanoid robot. ICRA, 2010.
[4] N. K. Jong and P. Stone. Model-based exploration in continuous state spaces. 7th Symposium on Abstraction, Reformulation and Approximation, 2007.

Related:

[5] A. Bernstein and N. Shimkin. Adaptive-resolution reinforcement learning with efficient exploration. Machine Learning (published online 5 May 2010).
[6] R. Brafman and M. Tennenholtz. R-MAX, a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213-231, 2002.
[7] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomputing, 72(7-9):1508-1524, 2009.
[8] L. Li, M. L. Littman, and C. R. Mansley. Online exploration in least-squares policy iteration. AAMAS, 2009.
[9] A. Nouri and M. L. Littman. Multi-resolution exploration in continuous spaces. NIPS, 2008.
