Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration


1. Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration
Tobias Jung and Peter Stone
Department of Computer Science, University of Texas at Austin
{tjung,pstone}@cs.utexas.edu
Outline: 1. Motivation & framework 2. Technical implementation 3. Experiments

2. Part I: Motivation & Overview. This is what we want to do (and why).

3. Consider: a time-discrete decision process t = 0, 1, 2, ... with continuous state space X ⊂ R^D and finite action space A.
Transition function: x_{t+1} = f(x_t, a_t) (deterministic)
Reward function: r(x_t, a_t) (immediate payoff)
Goal: for any x_0, find actions a_0, a_1, ... such that Σ_{t≥0} γ^t r(x_t, a_t) is maximized.
Dynamic programming (value iteration): if the transitions f and reward r are known, we can solve Q = TQ for all x, a, where
(TQ)(x, a) := r(x, a) + γ max_{a'} Q(f(x, a), a'),
to obtain Q*, the optimal value function. Once Q* is calculated, the best action in x_t is simply argmax_a Q*(x_t, a).
Problems: usually f and r are not known a priori ⇒ they must be learned from samples. (The state-action space can also be "too big" to do VI; we will largely ignore this.)
Our goal: we want to improve sample efficiency.
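To make the known-model case on this slide concrete, here is a minimal value-iteration sketch. It assumes a small, already-discretized problem where states and actions are integer indices; the names (`value_iteration`, `f`, `r`) are illustrative, not from the paper.

```python
# Minimal value-iteration sketch for a finite, deterministic MDP.
# Hypothetical setup: `f[s, a]` is the known successor-state index and
# `r[s, a]` the known immediate reward -- the situation the slide assumes
# before f and r have to be learned from samples.
import numpy as np

def value_iteration(f, r, gamma=0.95, tol=1e-6):
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Bellman backup: (TQ)(x, a) = r(x, a) + gamma * max_a' Q(f(x, a), a')
        Q_new = r + gamma * Q[f].max(axis=-1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Greedy policy: the best action in state x is argmax_a Q*(x, a)
# policy = value_iteration(f, r).argmax(axis=1)
```

GP-RMAX keeps this planning loop but replaces the exact f lookup with model predictions, as the later slides show.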

4. Model-based reinforcement learning
Remark: throughout the paper we will assume that the reward function is specified a priori.
⇒ The sample efficiency of RL wholly depends on the sample efficiency of the model learner.

5. Overview of the talk
Benefits of model-based RL: more sample efficient than model-free (however, also more computationally expensive):
- Samples are only used to learn the model, not as "test points" in value iteration.
- The sample efficiency of RL wholly depends on the sample efficiency of the model learner.
- (The model can be reused to solve different tasks in the same environment.)
Model-based RL requires us to worry about 3 things:
1. How to implement the planner? Here: simple interpolation on a grid (not part of this paper).
2. How to implement the model learner?
3. How to implement exploration?
Our contribution, GP-RMAX: model learner = Gaussian process regression.
- Fully Bayesian: provides natural (un)certainty for each prediction.
- Automated, data-driven hyperparameter selection.
- Framework for feature selection: find & eliminate irrelevant variables/directions:
  - improves generalization & prediction performance ⇒ faster model learning.
  - improves uncertainty estimates ⇒ more efficient exploration.
Experiments indicate that highly sample-efficient online RL is possible.

6. Motivation: GP+ARD can reduce the need for exploration
Example: compare three approaches for model learning in a 100 × 100 cell gridworld: a 10 × 10 grid model, a hand-tuned uniform RBF, and a GP with ARD kernel.
Actions: "right": x_new = x_old + 0.01, y_new = y_old; "up": x_new = x_old, y_new = y_old + 0.01.
After observing 20 transitions, we plot how certain each model is about its predictions for "right":
[Figure: gridworld with Start and Goal marked; three certainty maps over the unit square (x and y coordinates), one per model.]
GP+ARD detects that the y-coordinate is irrelevant ⇒ reduced exploration ⇒ faster learning.

7. Part II: Technical implementation. This is how we do it.

8. a. Model learning with GPs

9. Model learning with GPs
General idea: we have to learn the D-dimensional transition function x' = f(x, a). To do this, we combine multiple univariate GPs.
Training: the data consists of transitions {(x_t, a_t, x'_t)}_{t=1}^N, where x'_t = f(x_t, a_t) and x_t, x'_t ∈ R^D. We train one GP independently for each state variable and action: GP_ij models the i-th state variable under action a = j and has hyperparameters θ_ij found by maximizing the log marginal likelihood
L(θ_ij) = -½ y^T (K_θij + σ²I)^{-1} y - ½ log det(K_θij + σ²I) - (n/2) log 2π.
Prediction: once trained, GP_ij produces for any state x*:
- prediction: f̃_i(x*, a=j) := k_θij(x*)^T (K_θij + σ²I)^{-1} y
- uncertainty: c̃_i(x*, a=j) := k_θij(x*, x*) - k_θij(x*)^T (K_θij + σ²I)^{-1} k_θij(x*)
At the end, the predictions of the individual state variables are stacked together.
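As a concrete illustration of these two equations, the sketch below computes the posterior mean and variance of a single GP_ij with a fixed squared-exponential kernel. It is a minimal sketch under that assumption: in the paper the hyperparameters θ_ij come from optimizing the marginal likelihood, and the helper names here are not from the paper.

```python
# Sketch of GP regression as used per state-variable/action pair (GP_ij).
import numpy as np

def sq_exp_kernel(A, B, v0=1.0, h=1.0):
    # k(x, x') = v0 * exp(-0.5 * h * ||x - x'||^2), i.e. variant I: Omega = h*I
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return v0 * np.exp(-0.5 * h * d2)

def gp_fit(X, y, sigma2=1e-2):
    # Cholesky of (K + sigma^2 I); alpha = (K + sigma^2 I)^{-1} y
    K = sq_exp_kernel(X, X)
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, L, alpha

def gp_predict(model, x_star):
    X, L, alpha = model
    k_star = sq_exp_kernel(x_star[None, :], X)[0]   # vector k(x*)
    mean = k_star @ alpha                           # f~_i(x*, a=j)
    v = np.linalg.solve(L, k_star)
    var = 1.0 - v @ v                               # c~_i; k(x*, x*) = v0 = 1 here
    return mean, var
```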

10. Automatic relevance determination
Automated procedure for hyperparameter selection: we can use covariance functions with a larger number of hyperparameters (infeasible to set by hand) ⇒ better fit the regularities of the data and remove what is irrelevant.
Covariance: we consider three variants of the form
k_θ(x, x') = v_0 exp(-½ (x - x')^T Ω (x - x')) + b
with scalar hyperparameters v_0, b and matrix Ω given by:
- Variant I: Ω = h I
- Variant II: Ω = diag(a_1, ..., a_D)
- Variant III: Ω = M_k M_k^T + diag(a_1, ..., a_D)
Note: variants (II) and (III) contain adjustable parameters for every state variable. Setting them automatically from the data ⇒ model selection automatically determines their relevance. We can use likelihood scores to prune irrelevant state variables.
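A minimal sketch of variant II, assuming the per-dimension weights a_1, ..., a_D have already been fit; the function name and example values are illustrative, not from the paper.

```python
# Sketch of covariance variant II (ARD): Omega = diag(a_1, ..., a_D).
# A near-zero weight a_d means dimension d barely influences the covariance,
# which is how an irrelevant state variable reveals itself after
# hyperparameter optimization.
import numpy as np

def ard_kernel(A, B, a, v0=1.0, b=0.0):
    # k(x, x') = v0 * exp(-0.5 * (x - x')^T diag(a) (x - x')) + b
    diff = A[:, None, :] - B[None, :, :]
    d2 = (diff ** 2 * a).sum(-1)
    return v0 * np.exp(-0.5 * d2) + b

# Illustrative outcome: fitted weights a = [24.7, 0.003] would indicate the
# second state variable (e.g. the y-coordinate in the gridworld example)
# is irrelevant and can be pruned.
```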

11. b. Planning (with approximate model)

12. Value iteration in R^D
Remember: the input to the planner is the current model. The current model "produces" for any (x, a):
- f̃(x, a), the predicted successor state
- c̃(x, a), the associated uncertainty (0 = certain, 1 = uncertain)
General idea: value iteration on a grid Γ_h + multidimensional interpolation (Nouri & Littman 2009).
- Instead of the true transition function, simulate transitions with the current model.
- As in RMAX, integrate "exploration" into the value updates.
Algorithm: iterate k = 1, 2, ...: for every node ξ_i ∈ Γ_h and action a,
Q_{k+1}(ξ_i, a) = (1 - c̃(ξ_i, a)) · [ r(ξ_i, a) + γ max_{a'} Q̂_k( f̃(ξ_i, a), a' ) ] + c̃(ξ_i, a) · V_MAX
where Q̂_k denotes interpolation in R^D and V_MAX is given a priori.
Note: if c̃(ξ_i, a) ≈ 0, no exploration; if c̃(ξ_i, a) ≈ 1, the state is artificially made more attractive ⇒ exploration.
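The sketch below mirrors the update rule above. `model_predict`, `reward`, and `interpolate_V` are hypothetical helpers standing in for the GP model, the a-priori reward function, and the grid interpolation of max_{a'} Q̂_k; the default values are illustrative.

```python
# Sketch of the uncertainty-weighted backup over grid nodes xi_i.
# model_predict(xi, a) -> (predicted successor f~, certainty score c~ in [0,1])
# interpolate_V(Q, x)  -> interpolated max_a' Q_k at the off-grid state x
import numpy as np

def gprmax_backup(Q, nodes, n_actions, reward, model_predict, interpolate_V,
                  gamma=0.95, v_max=100.0):
    Q_new = np.empty_like(Q)                     # shape: (n_nodes, n_actions)
    for i, xi in enumerate(nodes):
        for a in range(n_actions):
            f_hat, c = model_predict(xi, a)      # successor + uncertainty
            backup = reward(xi, a) + gamma * interpolate_V(Q, f_hat)
            # Certain transitions get the normal Bellman backup; uncertain
            # ones are pulled toward V_MAX, making them attractive to explore.
            Q_new[i, a] = (1.0 - c) * backup + c * v_max
    return Q_new
```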

13. Part III: Experiments. These are the results.

14. Experimental setup
Examine what: the online learning performance of GP-RMAX, that is, its sample complexity and the quality of the learned behavior, in various popular benchmark domains.
Domains:
- Mountain car (2D state space)
- Inverted pendulum (2D state space)
- Bicycle balancing (4D state space)
- Acrobot (swing-up) (4D state space)
Contestants:
- GP-RMAXexp (exploration where uncertainty is determined from the GP)
- GP-RMAXnoexp (no exploration)
- GP-RMAXgrid (exploration where uncertainty is determined from a grid)
- Sarsa(λ) + tile coding

15. Results: 2D domains
[Figure: four panels.
Mountain car (GP-RMAX): steps to goal (lower is better) vs. episodes (0-20); curves: optimal, GP-RMAX exp, GP-RMAX noexp, GP-RMAX grid5, GP-RMAX grid10.
Mountain car (Sarsa): steps to goal (lower is better) vs. episodes (0-1000); curves: optimal, Sarsa(λ) tile coding 10, Sarsa(λ) tile coding 20.
Inverted pendulum (GP-RMAX): total reward (higher is better) vs. episodes (0-20); curves: optimal, GP-RMAX exp, GP-RMAX noexp, GP-RMAX grid5, GP-RMAX grid10.
Inverted pendulum (Sarsa): total reward (higher is better) vs. episodes (0-500); curves: optimal, Sarsa(λ) tile coding 10, Sarsa(λ) tile coding 40.]
