
  1. Reinforcement Learning – Policy Optimization Pieter Abbeel UC Berkeley EECS

  2. Policy Optimization
  - Consider a control policy parameterized by parameter vector θ:
    \[ \max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right] \]
  - Often a stochastic policy class (smooths out the problem):
    π_θ(u | s) = probability of taking action u in state s
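For concreteness, a minimal sketch (not from the slides) of one possible stochastic policy class: a Gaussian policy whose mean is linear in the state. The class name, the linear parameterization, and the fixed standard deviation are all illustrative assumptions.

```python
import numpy as np

class GaussianPolicy:
    """Stochastic policy pi_theta(u | s): Gaussian with a linear mean (illustrative choice)."""

    def __init__(self, state_dim, action_dim, std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((action_dim, state_dim))  # theta
        self.std = std

    def sample(self, s):
        # u ~ N(W s, std^2 I)
        mean = self.W @ s
        return mean + self.std * self.rng.standard_normal(mean.shape)

    def log_prob(self, s, u):
        # log pi_theta(u | s) for the Gaussian above
        mean = self.W @ s
        return (-0.5 * np.sum(((u - mean) / self.std) ** 2)
                - u.size * np.log(self.std * np.sqrt(2 * np.pi)))
```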

  3. Learning to Trot/Run
  Video captions: "Before learning (hand-tuned)" and "After learning"
  [Policy search was done through trials on the actual robot.]
  Kohl and Stone, ICRA 2004

  4. Learning to Trot/Run
  12 parameters define the Aibo's gait:
  - The front locus (3 parameters: height, x-pos., y-pos.)
  - The rear locus (3 parameters)
  - Locus length
  - Locus skew multiplier in the x-y plane (for turning)
  - The height of the front of the body
  - The height of the rear of the body
  - The time each foot takes to move through its locus
  - The fraction of time each foot spends on the ground
  Kohl and Stone, ICRA 2004

  5. [Policy search was done in simulation.]
  [Ng et al., ISER 2004]

  6. Learning to Hover

  7. Ball-In-A-Cup [Kober and Peters, 2009]

  8. Learning to Walk in 20 Minutes [Tedrake, Zhang, Seung 2005]

  9. Learning to Walk in 20 Minutes
  Figure annotations:
  - Passive hip joint [1 DOF]
  - Arms: coupled to the opposite leg to reduce yaw moment
  - Freely swinging load [1 DOF]
  - 44 cm
  - 2 x 2 (roll, pitch) position-controlled servo motors [4 DOF]
  - 9 DOFs: 6 internal DOFs + 3 DOFs for the robot's orientation
    (always assumed in contact with the ground at a single point; absolute (x, y) ignored)
  Natural gait down a 0.03 radian ramp: 0.8 Hz, 6.5 cm steps
  [Tedrake, Zhang, Seung 2005]

  10. Learning to Walk in 20 Minutes [Tedrake, Zhang, Seung 2005]

  11. Gradient-Free Methods
  \[ \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right] \]
  - Cross-Entropy Method (CEM) (sketch below)
  - Covariance Matrix Adaptation (CMA)
  - Dynamics model: stochastic: OK; unknown: OK
  - Policy class: stochastic: OK
  - Downside: gradient-free methods are slower than gradient-based methods
    → in practice OK if θ is low-dimensional and you are willing to do many runs
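A minimal cross-entropy method sketch for maximizing U(θ) when only rollout returns are available. Here `evaluate_return(theta)` is a hypothetical placeholder for running the policy π_θ and summing rewards; the population size, elite fraction, and noise floor are illustrative choices.

```python
import numpy as np

def cem(evaluate_return, theta_dim, iters=50, pop_size=100, elite_frac=0.2, seed=0):
    """Gradient-free policy search: repeatedly fit a Gaussian over theta to the elite samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(theta_dim), np.ones(theta_dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        thetas = mean + std * rng.standard_normal((pop_size, theta_dim))
        returns = np.array([evaluate_return(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]            # highest returns
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit + noise floor
    return mean
```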

  12. Gradient-Based Policy Optimization
  \[ \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right] \]
  [Figure: computational graph unrolled over time; the dynamics f maps (s_0, u_0) to s_1, then (s_1, u_1) to s_2, and so on; the policy π_θ produces u_0, u_1, u_2 from the states; the reward function R produces r_0, r_1, r_2.]

  13. Overview of Methods / Settings

                    Dynamics                        Policy
             D+K   D+U   S+K+R   S+K   S+U       D   S+R   S
      PD      +            +                     +    +
      LR      +     +      +      +     +             +    +

  D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable;
  PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

  14. Questions
  - When more than one method is applicable, which one is best?
  - When the dynamics is only available as a black box and derivatives aren't available:
    finite-differences-based derivatives of the dynamics?
    - Vs. finite differences / gradient-free methods directly on the policy
    - Note: finite differences are tricky (impractical?) when you can't control the random seed
  - What if the model is unknown, but an estimate is available?

  15. Gradient Computation – Unknown Model – Finite Differences

  16. Noise Can Dominate

  17. Finite Differences and Noise
  - Solution 1: Average over many samples
  - Solution 2: Fix the randomness (if possible)
    - Intuition by example: wind influence on a helicopter is stochastic, but if we assume
      the same wind pattern across trials, this makes the different choices of θ more
      readily comparable
    - General instantiation: fix the random seed, and the result is a deterministic system
      (see the sketch below)
  - Ng & Jordan, 2000 provide a theoretical analysis of the gains from fixing randomness
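A sketch of the "fix the randomness" idea combined with finite differences. Here `rollout_return(theta, seed)` is a hypothetical function that runs one episode under π_θ with all random draws determined by `seed`; reusing the same seeds across the perturbed θ values is what makes the noise largely cancel in the difference.

```python
import numpy as np

def fd_gradient(rollout_return, theta, eps=1e-2, n_seeds=10):
    """Central finite-difference estimate of grad U(theta) with common random numbers:
    each coordinate perturbation is evaluated under the SAME fixed seeds."""
    grad = np.zeros_like(theta, dtype=float)
    seeds = range(n_seeds)
    for i in range(len(theta)):
        e = np.zeros_like(theta, dtype=float)
        e[i] = eps
        up = np.mean([rollout_return(theta + e, seed=s) for s in seeds])
        dn = np.mean([rollout_return(theta - e, seed=s) for s in seeds])
        grad[i] = (up - dn) / (2 * eps)
    return grad
```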

  18. Path Derivative for Dynamics: D+K; Policy: D
  - Reminder of the optimization objective:
    \[ \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right] \]
  - Can compute a gradient estimate along the current roll-out (sketch below):
    \[ \frac{\partial U}{\partial \theta_i} = \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t)\, \frac{\partial s_t}{\partial \theta_i} \]
    \[ \frac{\partial s_t}{\partial \theta_i} = \frac{\partial f}{\partial s}(s_{t-1}, u_{t-1})\, \frac{\partial s_{t-1}}{\partial \theta_i} + \frac{\partial f}{\partial u}(s_{t-1}, u_{t-1})\, \frac{\partial u_{t-1}}{\partial \theta_i} \]
    \[ \frac{\partial u_t}{\partial \theta_i} = \frac{\partial \pi_\theta}{\partial \theta_i}(s_t, \theta) + \frac{\partial \pi_\theta}{\partial s}(s_t, \theta)\, \frac{\partial s_t}{\partial \theta_i} \]
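A sketch of that recursion for deterministic, known dynamics. The callables `f`, `df_ds`, `df_du` (dynamics and its Jacobians), `policy`, `dpi_dtheta`, `dpi_ds` (deterministic policy and its derivatives), and `dR_ds` (reward gradient) are assumed to be provided; all names are placeholders.

```python
import numpy as np

def path_derivative(s0, H, f, df_ds, df_du, policy, dpi_dtheta, dpi_ds, dR_ds, theta_dim):
    """Accumulate dU/dtheta along one rollout via the chain rule through known dynamics."""
    s = np.asarray(s0, dtype=float)
    ds_dth = np.zeros((len(s), theta_dim))          # d s_t / d theta (zero at t = 0)
    grad = dR_ds(s) @ ds_dth                        # t = 0 term (zero, since s_0 is fixed)
    for t in range(H):
        u = policy(s)
        du_dth = dpi_dtheta(s) + dpi_ds(s) @ ds_dth           # d u_t / d theta
        s_next = f(s, u)
        ds_dth = df_ds(s, u) @ ds_dth + df_du(s, u) @ du_dth  # d s_{t+1} / d theta
        grad += dR_ds(s_next) @ ds_dth
        s = s_next
    return grad
```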

  19. Path Derivative for Dynamics: S+K+R; Policy: S+R
  - Reminder of the optimization objective:
    \[ \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right] \]
  - (Reparameterized graph drawn on the board)
  - Plus: average over multiple samples
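The board drawing is not in this export; the standard reparameterized form it refers to writes both the stochastic dynamics and the stochastic policy as deterministic functions of θ-independent noise, so the path derivative of the previous slide applies sample-by-sample:

\[ s_{t+1} = f(s_t, u_t, \zeta_t), \qquad u_t = \pi_\theta(s_t, \xi_t), \qquad \zeta_t, \xi_t \sim \text{fixed noise distributions}, \]
\[ \max_\theta U(\theta) = \max_\theta \; \mathbb{E}_{\zeta,\xi}\!\left[\sum_{t=0}^{H} R(s_t)\right], \qquad \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \sum_{t=0}^{H} R\!\left(s_t^{(i)}\right). \]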

  20. Overview of Methods / Settings

                    Dynamics                        Policy
             D+K   D+U   S+K+R   S+K   S+U       D   S+R   S
      PD      +            +                     +    +
      LR      +     +      +      +     +             +    +

  D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable;
  PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

  21. Gradient Computation – Unknown Model – Likelihood Ratio

  22. Likelihood Ratio Gradient
  [Note: can also be derived/generalized through an importance sampling derivation – Tang and Abbeel, 2011]
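The derivation itself is on image slides not captured in this export; the standard likelihood ratio identity being derived is:

\[ \nabla_\theta U(\theta) = \nabla_\theta \int P(\tau;\theta)\, R(\tau)\, d\tau = \int P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim P(\cdot\,;\theta)}\!\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right]. \]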

  23. Importance Sampling
  - On board...

  24. Likelihood Ratio Gradient Estimate

  25. Likelihood Ratio Gradient Estimate
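The equations on these two slides are also image-only in this export; the estimator they present has the standard form, in which the (possibly unknown) dynamics terms drop out of the gradient:

\[ \hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\!\left(\tau^{(i)};\theta\right) R\!\left(\tau^{(i)}\right), \qquad \nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\!\left(u_t \mid s_t\right), \]

since \( \log P(\tau;\theta) = \log p(s_0) + \sum_t \log p(s_{t+1} \mid s_t, u_t) + \sum_t \log \pi_\theta(u_t \mid s_t) \) and the initial-state and dynamics terms do not depend on θ.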

  26. Likelihood Ratio Gradient Estimate
  - As formulated thus far: unbiased but very noisy
  - Fixes that lead to real-world practicality:
    - Baseline
    - Temporal structure
    - Also: KL-divergence trust region / natural gradient
      (a general trick, equally applicable to perturbation analysis and finite differences)

  27. Likelihood Ratio with Baseline
  - Gradient estimate with baseline:
    \[ \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)\left(R\!\left(\tau^{(i)}\right) - b\right) \]
  - Crudely: increasing the log-likelihood of paths with higher-than-baseline reward and
    decreasing the log-likelihood of paths with lower-than-baseline reward
  - Still unbiased? Yes!
    \[ \mathbb{E}\!\left[\frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P\!\left(\tau^{(i)};\theta\right) b\right] = 0 \]
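The one-line reason the baseline introduces no bias (for a baseline b that does not depend on the sampled path) is worth writing out:

\[ \mathbb{E}\!\left[\nabla_\theta \log P(\tau;\theta)\, b\right] = b \int P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, d\tau = b\, \nabla_\theta \int P(\tau;\theta)\, d\tau = b\, \nabla_\theta 1 = 0. \]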

  28. Likelihood Ratio and Temporal Structure
  - Current estimate:
    \[ \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)\left(R\!\left(\tau^{(i)}\right) - b\right) = \frac{1}{m} \sum_{i=1}^{m} \left(\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\!\left(u_t^{(i)} \mid s_t^{(i)}\right)\right) \left(\sum_{t=0}^{H-1} R\!\left(s_t^{(i)}, u_t^{(i)}\right) - b\right) \]
  - Future actions do not depend on past rewards, hence we can lower the variance by instead using:
    \[ \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\!\left(u_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{k=t}^{H-1} R\!\left(s_k^{(i)}, u_k^{(i)}\right) - b\right) \]
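A minimal sketch of this temporal-structure estimator, assuming trajectories collected as lists of (s, u, r) tuples and a policy exposing a hypothetical `grad_log_prob(s, u)` returning the per-step score ∇_θ log π_θ(u | s); all names are illustrative.

```python
import numpy as np

def pg_estimate(trajectories, grad_log_prob, baseline=0.0):
    """Likelihood ratio gradient with temporal structure:
    each score at time t is weighted by the reward-to-go from t, minus a baseline."""
    grads = []
    for traj in trajectories:                          # traj: list of (s, u, r) tuples
        rewards = np.array([r for (_, _, r) in traj])
        reward_to_go = np.cumsum(rewards[::-1])[::-1]  # sum_{k >= t} r_k
        g = sum(grad_log_prob(s, u) * (reward_to_go[t] - baseline)
                for t, (s, u, _) in enumerate(traj))
        grads.append(g)
    return np.mean(grads, axis=0)
```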

  29. Step-sizing and Trust Regions
  - Naïve step-sizing: line search
    - Step-sizing necessary, as the gradient is only a first-order approximation
    - Line search in the direction of the gradient
    - Simple, but expensive (evaluations along the line)
    - Naïve: ignores where the first-order approximation is good/poor

  30. Step-sizing and Trust Regions
  - Advanced step-sizing: trust regions
    - The first-order approximation from the gradient is good within a "trust region"
      → solve for the best point within the trust region:
    \[ \max_{\delta\theta} \; \hat{g}^{\top} \delta\theta \quad \text{s.t.} \quad \mathrm{KL}\!\left(P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta)\right) \le \varepsilon \]

  31. KL Trust Region (a.k.a. Natural Gradient)
  - Solve for the best point within the trust region:
    \[ \max_{\delta\theta} \; \hat{g}^{\top} \delta\theta \quad \text{s.t.} \quad \mathrm{KL}\!\left(P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta)\right) \le \varepsilon \]
  - The KL can be approximated efficiently with a 2nd-order expansion, where G is the Fisher information matrix
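Writing out the second-order approximation makes the connection to the natural gradient explicit:

\[ \mathrm{KL}\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right) \approx \tfrac{1}{2}\,\delta\theta^{\top} G\, \delta\theta, \qquad G = \mathbb{E}_{\tau}\!\left[\nabla_\theta \log P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)^{\top}\right], \]

so the subproblem \( \max_{\delta\theta} \hat{g}^{\top}\delta\theta \) s.t. \( \tfrac{1}{2}\delta\theta^{\top} G\,\delta\theta \le \varepsilon \) has the closed-form solution

\[ \delta\theta = \sqrt{\frac{2\varepsilon}{\hat{g}^{\top} G^{-1} \hat{g}}}\; G^{-1}\hat{g}, \]

i.e. a step along the natural gradient direction \( G^{-1}\hat{g} \).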

  32. Experiments in Locomotion [Schulman, Levine, Abbeel, 2014]

  33. Actor-Critic Variant
  - Current estimate:
    \[ \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\!\left(u_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{k=t}^{H-1} R\!\left(s_k^{(i)}, u_k^{(i)}\right) - b\right) \]
    Here the inner sum \( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) \) is a sample-based estimate of \( Q(s_t^{(i)}, u_t^{(i)}) \).
  - Actor-critic algorithms run an estimator for the Q-function in parallel and substitute in the
    estimated Q value (see the sketch below)
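A sketch of that substitution, assuming a critic object exposing a hypothetical `q_value(s, u)` method (however the critic is fitted); names and data layout follow the earlier sketch and are illustrative.

```python
import numpy as np

def actor_critic_pg_estimate(trajectories, grad_log_prob, critic):
    """Policy gradient where the Monte-Carlo reward-to-go is replaced by
    the critic's Q-value estimate at each (s_t, u_t)."""
    grads = []
    for traj in trajectories:                      # traj: list of (s, u, r) tuples
        g = sum(grad_log_prob(s, u) * critic.q_value(s, u)
                for (s, u, _) in traj)
        grads.append(g)
    return np.mean(grads, axis=0)
```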

  34. Learning Locomotion [Schulman, Moritz, Levine, Jordan, Abbeel, 2015]

  35. In Contrast: DARPA Robotics Challenge

  36. Thank you
