 
Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration
Tobias Jung and Peter Stone
Department of Computer Science, University of Texas at Austin
{tjung,pstone}@cs.utexas.edu

Outline:
1. Motivation & framework
2. Technical implementation
3. Experiments

GP-RMAX – ECML 09/21/10 – p.1/18
Part I: Motivation & Overview
This is what we want to do (and why)
Objective: dynamic programming

Consider: Time-discrete decision process $t = 0, 1, 2, \ldots$ with
- $X \subset \mathbb{R}^D$ state space (continuous), $A$ action space (finite)
- Transition function $x_{t+1} = f(x_t, a_t)$ (deterministic)
- Reward function $r(x_t, a_t)$ (immediate payoff)

Goal: For any $x_0$ find actions $a_0, a_1, \ldots$ such that $\sum_{t \ge 0} \gamma^t r(x_t, a_t)$ is maximized.

Dynamic programming (value iteration): If transitions $f$ and reward $r$ are known, we can solve

$$Q = TQ, \qquad \text{where } (TQ)(x, a) := r(x, a) + \gamma \max_{a'} Q(f(x, a), a') \quad \forall x, a$$

to obtain $Q^*$, the optimal value function. Once $Q^*$ is calculated, the best action in $x_t$ is simply $\operatorname{argmax}_a Q^*(x_t, a)$.

Problems:
- Usually $f$ and $r$ are not known a priori ⇒ learned from samples.
- (State-action space "too big" to do VI ⇒ will largely ignore this)

Our goal: want to improve sample efficiency.
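The fixed-point iteration Q = TQ can be sketched on a toy problem. The example is a minimal illustration, not the planner used in the talk: it assumes a finite state set (a hypothetical 5-state chain) rather than a continuous state space, and the names `value_iteration`, `f`, `r` and the chain itself are made up for this sketch.

```python
def value_iteration(states, actions, f, r, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterate Q <- TQ, where (TQ)(x, a) = r(x, a) + gamma * max_a' Q(f(x, a), a')."""
    Q = {(x, a): 0.0 for x in states for a in actions}
    for _ in range(max_iter):
        Q_new = {(x, a): r(x, a) + gamma * max(Q[(f(x, a), b)] for b in actions)
                 for x in states for a in actions}
        delta = max(abs(Q_new[key] - Q[key]) for key in Q)
        Q = Q_new
        if delta < tol:  # stop once successive iterates agree
            break
    return Q


# Hypothetical toy problem: a chain of states 0..4 with actions "left" (-1) and "right" (+1).
states = range(5)
actions = (-1, +1)


def f(x, a):
    """Known deterministic transition: move along the chain, clipped at the ends."""
    return min(max(x + a, 0), 4)


def r(x, a):
    """Known reward: 1 whenever the successor is the right end of the chain."""
    return 1.0 if f(x, a) == 4 else 0.0


Q_star = value_iteration(states, actions, f, r)
# Greedy policy: argmax_a Q*(x, a) in every state.
policy = {x: max(actions, key=lambda a: Q_star[(x, a)]) for x in states}
```

In this toy chain the greedy policy moves right in every state, since reward is only collected at the right end.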
Model-based reinforcement learning

Remark: throughout the paper we will assume that the reward function is specified a priori.
⇒ Sample efficiency of RL wholly depends on sample efficiency of the model learner.
Overview of the talk

Benefits of model-based RL:
- More sample efficient than model-free (however, also more computationally expensive):
  - Samples only used to learn the model, but not as "test-points" in value iteration.
  - Sample efficiency of RL wholly depends on sample efficiency of the model learner.
- (Model can be reused to solve different tasks in the same environment.)

Model-based RL requires us to worry about 3 things:
1. How to implement the planner? Here: simple interpolation on grid. (not part of this paper)
2. How to implement the model-learner?
3. How to implement exploration?

Our contribution GP-RMAX: model-learner = Gaussian process regression
- Fully Bayesian: provides natural (un)certainty for each prediction.
- Automated, data-driven hyperparameter selection.
- Framework for feature selection: find & eliminate irrelevant variables/directions:
  - improves generalization & prediction performance ⇒ faster model learning.
  - improves uncertainty estimates ⇒ more efficient exploration.

Experiments indicate highly sample-efficient online RL is possible.
Motivation: GP+ARD Can Reduce Need for Exploration

Example: compare three approaches for model learning in a 100 × 100 gridworld.

[Figure: gridworld of 100 x 100 cells with Start and Goal marked.]

Actions:
- Right: $x^{\text{right}}_{\text{new}} = x_{\text{old}} + 0.01$, $y^{\text{right}}_{\text{new}} = y_{\text{old}}$
- Up: $x^{\text{up}}_{\text{new}} = x_{\text{old}}$, $y^{\text{up}}_{\text{new}} = y_{\text{old}} + 0.01$

After observing 20 transitions, we plot how certain each model is about its predictions for "right":

[Figure: three uncertainty maps over x coordinate and y coordinate (both 0 to 1). Panels: 10 × 10 grid; hand-tuned uniform RBF; GP with ARD kernel.]

GP+ARD detects that the y-coordinate is irrelevant ⇒ reduced exploration ⇒ faster learning.
Part II: Technical implementation
This is how we do it
a. Model learning with GPs
Model learning with GPs

General idea: Have to learn the $D$-dim transition function $x' = f(x, a)$. To do this, we combine multiple univariate GPs.

Training: Data consists of transitions $\{(x_t, a_t, x'_t)\}_{t=1}^{N}$, where $x'_t = f(x_t, a_t)$ and $x_t, x'_t \in \mathbb{R}^D$. Train independently one GP for each state variable and action.
- $GP_{ij}$ models the $i$-th state variable under action $a = j$
- $GP_{ij}$ has hyperparameters $\theta_{ij}$ found from maximizing the log marginal likelihood

$$\max_{\theta_{ij}} L(\theta_{ij}) = -\tfrac{1}{2} \log\det(K_{\theta_{ij}} + \sigma I) - \tfrac{1}{2}\, y^T (K_{\theta_{ij}} + \sigma I)^{-1} y - \tfrac{n}{2} \log 2\pi$$

Once trained, $GP_{ij}$ produces for any state $x^*$:
- Prediction: $\tilde f_i(x^*, a = j) := k_{\theta_{ij}}(x^*)^T (K_{\theta_{ij}} + \sigma I)^{-1} y$
- Uncertainty: $\tilde c_i(x^*, a = j) := k_{\theta_{ij}}(x^*, x^*) - k_{\theta_{ij}}(x^*)^T (K_{\theta_{ij}} + \sigma I)^{-1} k_{\theta_{ij}}(x^*)$

At the end, predictions of individual state variables are stacked together.
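The prediction and uncertainty formulas, and the stacking of one univariate GP per state variable and action, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the hyperparameters (`v0`, `h`, `noise`) are fixed by hand instead of being optimized through the marginal likelihood, the covariance is a plain isotropic squared exponential, and all names (`ScalarGP`, `fit_transition_model`, `predict_transition`) and the toy data are hypothetical.

```python
import math


def rbf(x, y, v0=1.0, h=0.3):
    """Squared-exponential covariance; v0 and h are hand-picked, not learned."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return v0 * math.exp(-0.5 * sq / h ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


class ScalarGP:
    """One GP_ij: predicts a single state variable i under a single action j."""

    def __init__(self, X, y, noise=1e-4):
        n = len(X)
        self.X = X
        K = [[rbf(X[p], X[q]) + (noise if p == q else 0.0) for q in range(n)]
             for p in range(n)]
        self.L = cholesky(K)
        self.alpha = chol_solve(self.L, y)  # (K + noise*I)^{-1} y
        # Log marginal likelihood, using log det(K + noise*I) = 2 * sum(log L_ii).
        self.log_lik = (-sum(math.log(self.L[i][i]) for i in range(n))
                        - 0.5 * sum(yi * ai for yi, ai in zip(y, self.alpha))
                        - 0.5 * n * math.log(2.0 * math.pi))

    def predict(self, x_star):
        """Return (prediction, uncertainty) at x_star."""
        k = [rbf(x_star, xt) for xt in self.X]
        mean = sum(ki * ai for ki, ai in zip(k, self.alpha))
        v = chol_solve(self.L, k)
        var = rbf(x_star, x_star) - sum(ki * vi for ki, vi in zip(k, v))
        return mean, max(var, 0.0)  # clamp tiny negative round-off


def fit_transition_model(transitions, D):
    """transitions: list of (x, a, x_next). Returns a dict {(i, a): ScalarGP}."""
    models = {}
    for a in {act for _, act, _ in transitions}:
        X = [x for x, act, _ in transitions if act == a]
        for i in range(D):
            y = [xn[i] for _, act, xn in transitions if act == a]
            models[(i, a)] = ScalarGP(X, y)
    return models


def predict_transition(models, x, a, D):
    """Stack the D univariate predictions into one successor state."""
    preds = [models[(i, a)].predict(x) for i in range(D)]
    return [m for m, _ in preds], [c for _, c in preds]


# Toy data (made up): action "right" shifts the first coordinate by 0.1.
data = [((0.1, 0.2), "right", (0.2, 0.2)),
        ((0.3, 0.8), "right", (0.4, 0.8)),
        ((0.5, 0.5), "right", (0.6, 0.5)),
        ((0.7, 0.1), "right", (0.8, 0.1))]
models = fit_transition_model(data, D=2)
mean_near, unc_near = predict_transition(models, (0.5, 0.5), "right", D=2)
mean_far, unc_far = predict_transition(models, (5.0, 5.0), "right", D=2)
```

At an observed state the uncertainty is close to zero, while far from all data it returns to the prior variance `v0`. In a fuller version, `log_lik` would be the quantity optimized over the hyperparameters.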
Automatic relevance determination

Automated procedure for hyperparameter selection
⇒ can use covariance with a larger number of hyperparameters (infeasible to set by hand)
⇒ better fit regularities of data, remove what is irrelevant

Covariance: We consider three variants of the form

$$k_\theta(x, x') = v_0 \exp\left\{ -\tfrac{1}{2} (x - x')^T \Omega\, (x - x') \right\} + b$$

with scalar hyperparameters $v_0$, $b$ and matrix $\Omega$ given by
- Variant I: $\Omega = hI$
- Variant II: $\Omega = \operatorname{diag}(a_1, \ldots, a_D)$
- Variant III: $\Omega = M_k M_k^T + \operatorname{diag}(a_1, \ldots, a_D)$

[Figure: sketches of the three covariance shapes, labeled (I) with $h$, (II) with $a_1, a_2$, and (III) with $u_1, u_2, s_1, s_2$.]

Note:
- (II), (III) contain adjustable parameters for every state variable
- Setting them automatically from data ⇒ model selection automatically determines their relevance
- Can use likelihood scores to prune irrelevant state variables.
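The three choices of Ω can be compared on two points that differ only in a direction the kernel is supposed to ignore. The sketch is illustrative only: the helper names (`omega_isotropic`, `omega_ard`, `omega_factor`, `kernel`) and all numeric values are assumptions, and the hyperparameters are set by hand here rather than selected from data.

```python
import math


def omega_isotropic(h, D):
    """Variant I: Omega = h * I (one shared scale for all state variables)."""
    return [[h if i == j else 0.0 for j in range(D)] for i in range(D)]


def omega_ard(a):
    """Variant II: Omega = diag(a_1, ..., a_D) (one scale per state variable)."""
    D = len(a)
    return [[a[i] if i == j else 0.0 for j in range(D)] for i in range(D)]


def omega_factor(M, a):
    """Variant III: Omega = M M^T + diag(a), with M given as D rows of k entries."""
    D = len(a)
    k = len(M[0])
    return [[sum(M[i][c] * M[j][c] for c in range(k)) + (a[i] if i == j else 0.0)
             for j in range(D)] for i in range(D)]


def kernel(x, y, omega, v0=1.0, b=0.0):
    """k(x, x') = v0 * exp(-0.5 * (x - x')^T Omega (x - x')) + b."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    quad = sum(d[i] * omega[i][j] * d[j] for i in range(n) for j in range(n))
    return v0 * math.exp(-0.5 * quad) + b


# Two states with the same first coordinate but very different second coordinate.
p, q = (0.3, 0.1), (0.3, 0.9)
k_iso = kernel(p, q, omega_isotropic(25.0, 2))   # treats both coordinates alike
k_ard = kernel(p, q, omega_ard([25.0, 0.0]))     # a_2 = 0: second coordinate ignored

# Variant III with a single direction M = (1, -1): displacements along (1, 1) are ignored.
k_fa = kernel((0.0, 0.0), (0.5, 0.5), omega_factor([[1.0], [-1.0]], [0.0, 0.0]))
```

With `a_2 = 0` the two states are fully correlated under variant II, so an observation at one is informative about the other; under variant I they are nearly uncorrelated. Variant III extends this to irrelevant directions that are not axis-aligned.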
b. Planning (with approximate model)
Value iteration in $\mathbb{R}^D$

Remember: Input to the planner is the current model. The current model "produces" for any $(x, a)$
- $\tilde f(x, a)$, the predicted successor state
- $\tilde c(x, a)$, the associated uncertainty (0 = certain, 1 = uncertain)

General idea: Value iteration on grid $\Gamma_h$ + multidimensional interpolation.
- Instead of the true transition function, simulate transitions with the current model.
- As in RMAX, integrate "exploration" into value updates. (Nouri & Littman 2009)

Algorithm: iterate $k = 1, 2, \ldots$: $\forall$ node $\xi_i \in \Gamma_h$, action $a$

$$Q_{k+1}(\xi_i, a) = (1 - \tilde c(\xi_i, a)) \cdot \left[ r(\xi_i, a) + \gamma \max_{a'} Q_k\big(\tilde f(\xi_i, a), a'\big) \right] + \tilde c(\xi_i, a) \cdot V_{MAX}$$

where $Q_k(\tilde f(\xi_i, a), a')$ is evaluated by interpolation in $\mathbb{R}^D$ and $V_{MAX}$ is given a priori.

Note:
- If $\tilde c(\xi_i, a) \approx 0$, no exploration.
- If $\tilde c(\xi_i, a) \approx 1$, state is artificially made more attractive ⇒ exploration.
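The RMAX-style update can be sketched on a one-dimensional grid with piecewise-linear interpolation. This is only a sketch under assumptions: the talk works in R^D with multidimensional interpolation, whereas the grid, the stand-in model `f_tilde`, the hand-written uncertainty `c_tilde`, the reward and `v_max` here are all hypothetical toy choices.

```python
def interpolate(grid, values, x):
    """Piecewise-linear interpolation of node values on a sorted 1-D grid."""
    x = min(max(x, grid[0]), grid[-1])  # clamp to the grid range
    for i in range(len(grid) - 1):
        if x <= grid[i + 1]:
            w = (x - grid[i]) / (grid[i + 1] - grid[i])
            return (1.0 - w) * values[i] + w * values[i + 1]
    return values[-1]


def rmax_value_iteration(grid, actions, f_tilde, c_tilde, r, v_max,
                         gamma=0.9, sweeps=200):
    """Q_{k+1} = (1 - c) * [r + gamma * max_a' Q_k(f_tilde, a')] + c * v_max."""
    Q = {a: [0.0] * len(grid) for a in actions}
    for _ in range(sweeps):
        Q_next = {}
        for a in actions:
            row = []
            for xi in grid:
                x_next = f_tilde(xi, a)  # simulated successor, usually off-grid
                backup = r(xi, a) + gamma * max(
                    interpolate(grid, Q[b], x_next) for b in actions)
                c = c_tilde(xi, a)
                row.append((1.0 - c) * backup + c * v_max)
            Q_next[a] = row
        Q = Q_next
    return Q


# Toy setup (all assumed): states in [0, 1], actions step left or right by 0.1.
grid = [i / 10 for i in range(11)]
actions = (-1, +1)
f_tilde = lambda x, a: min(max(x + 0.1 * a, 0.0), 1.0)  # stand-in learned model
c_tilde = lambda x, a: 1.0 if x > 0.55 else 0.0         # right part is unexplored
r = lambda x, a: -1.0                                   # every step costs 1
Q = rmax_value_iteration(grid, actions, f_tilde, c_tilde, r, v_max=0.0)
greedy = [max(actions, key=lambda a: Q[a][i]) for i in range(len(grid))]
```

Because the uncertain nodes receive the optimistic value `v_max`, the greedy policy in the well-known region heads toward the unexplored region, which is the intended exploration effect.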
Part III: Experiments
These are the results
Experimental setup

Examine what: online learning performance of GP-RMAX, that is, sample complexity and quality of learned behavior in various popular benchmark domains.

Domains:
- Mountain car (2D state space)
- Inverted pendulum (2D state space)
- Bicycle balancing (4D state space)
- Acrobot (swing-up) (4D state space)

Contestants:
- Sarsa(λ) + tile coding
- GP-RMAXexp (exploration where uncertainty is determined from GP)
- GP-RMAXnoexp (no exploration)
- GP-RMAXgrid (exploration where uncertainty is determined from grid)
Results 2D domains

[Figure: four panels of learning curves.
- Mountain car (GP-RMAX) and Mountain car (Sarsa): steps to goal (lower is better) vs. episodes.
- Inverted pendulum (GP-RMAX) and Inverted pendulum (Sarsa): total reward (higher is better) vs. episodes.
Curves: optimal; GP-RMAX exp, GP-RMAX noexp, GP-RMAX grid5, GP-RMAX grid10; Sarsa(λ) Tilecoding 10 and 20 (mountain car), Sarsa(λ) Tilecoding 10 and 40 (inverted pendulum).
The GP-RMAX panels span 0 to 20 episodes; the Sarsa panels span 0 to 1000 episodes (mountain car) and 0 to 500 episodes (inverted pendulum).]