Maximum Entropy Framework: Inverse RL, Soft Optimality, and More


  1. Maximum Entropy Framework: Inverse RL, Soft Optimality, and More Chelsea Finn and Sergey Levine UC Berkeley 5/20/2017

  2. Introductions: Sergey Levine (assistant professor), Chelsea Finn (PhD student)

  3. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  4. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  5. What is the reward? Diagram: reinforcement learning agent receiving a reward (Mnih et al. '15). In the real world, humans don't get a score. (video from Montessori New Zealand)

  6. A reward function is essential for RL (Tesauro '95; Kohl & Stone '04; Mnih et al. '15; Silver et al. '16). In real-world domains, the reward/cost is often difficult to specify: robotic manipulation, autonomous driving, dialog systems, virtual assistants, and more…

  7. One approach: mimic the actions of a human expert (behavioral cloning). + Simple, sometimes works well. − No reasoning about outcomes or dynamics; the expert might have different degrees of freedom; the expert might not always be optimal. Can we reason about human decision-making?
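
As an aside, here is a minimal behavioral cloning sketch (my own illustration, not from the slides): fit a policy to expert state-action pairs with plain supervised regression. The names (`PolicyNet`, `expert_states`, `expert_actions`) and the PyTorch setup are illustrative assumptions.

```python
# Minimal behavioral cloning sketch: supervised regression onto expert actions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim))

    def forward(self, s):
        return self.net(s)

def behavioral_cloning(policy, expert_states, expert_actions, epochs=100):
    # expert_states: (N, state_dim) tensor; expert_actions: (N, action_dim) tensor
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = policy(expert_states)
        loss = ((pred - expert_actions) ** 2).mean()  # mean squared error on expert actions
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```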

  8. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  9. Optimal Control as a Model of Human Behavior (Muybridge c. 1870; Mombaur et al. '09; Li & Todorov '06; Ziebart '08): optimize this to explain the data.

  10. What if the data is not optimal? Some mistakes matter more than others! Behavior is stochastic, but good behavior is still the most likely.

  11. A probabilistic graphical model of decision making: no assumption of optimal behavior!
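
A hedged sketch of the standard formulation behind this slide (notation assumed, not copied from the deck): attach a binary optimality variable to each timestep; conditioning on optimality makes high-reward trajectories more likely without assuming the agent behaves optimally.

```latex
% MaxEnt / soft-optimality graphical model (standard form; notation assumed)
p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big), \qquad
p(\tau \mid \mathcal{O}_{1:T}) \;\propto\; p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp\big(r(s_t, a_t)\big)
```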

  12. Inference = planning. How do we do inference?

  13. A closer look at the backward pass: the "optimistic" transition (not a good idea!). Ziebart et al. '10, "Modeling Interaction via the Principle of Maximum Causal Entropy".

  14. Stochastic optimal control (MaxCausalEnt) summary: 1. Probabilistic graphical model for optimal control (and variants). 2. Control = inference (similar to HMM, EKF, etc.). 3. Very similar to dynamic programming, value iteration, etc. (but "soft").
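
To make the "but soft" remark concrete, here is a minimal tabular sketch (my own illustration, assuming a known reward table `R[s, a]` and transition tensor `P[s, a, s']`): the only change from standard value iteration is that the max over actions becomes a log-sum-exp.

```python
# Soft value iteration (soft backward pass) for a tabular MDP.
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, P, gamma=0.99, iters=500):
    # R: (n_states, n_actions) rewards; P: (n_states, n_actions, n_states) dynamics
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # soft Bellman backup: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * (P @ V)        # shape (n_states, n_actions)
        V = logsumexp(Q, axis=1)       # "soft" maximum over actions
    # corresponding stochastic policy: pi(a|s) = exp(Q(s,a) - V(s))
    pi = np.exp(Q - V[:, None])
    return V, Q, pi
```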

  15. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  16. Given a reward, we can model how a human might sub-optimally maximize it. How can this help us with learning?

  17. Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer the cost/reward function from demonstrations (Kalman '64; Ng & Russell '00). Given: state & action space, roll-outs from π*, and [sometimes] a dynamics model. Goal: recover the reward function, then use the reward to get a policy. Challenges: underdefined problem; difficult to evaluate a learned reward; demonstrations may not be precisely optimal.

  18. Early IRL approaches: assume a deterministic MDP, alternate between solving the MDP & updating the reward, and use heuristics for handling sub-optimality. Ng & Russell '00: expert actions should have higher value than other actions; a larger gap is better. Abbeel & Ng '04: the policy w.r.t. the learned cost should match the feature counts of the expert trajectories. Ratliff et al. '06: max-margin formulation between the value of expert actions and other actions. How to handle ambiguity and suboptimality?
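
For reference, a hedged sketch of the feature-matching idea attributed above to Abbeel & Ng '04 (notation is mine): with a linear reward r_θ(s) = θᵀf(s), a policy that matches the expert's expected feature counts achieves nearly the same expected reward for any θ.

```latex
% Feature-count matching (sketch; notation assumed)
\mu(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t} f(s_t)\Big], \qquad
\text{find } \pi \text{ such that } \big\|\mu(\pi) - \mu_E\big\| \le \varepsilon
\;\;\Rightarrow\;\;
\big|\,\theta^\top \mu(\pi) - \theta^\top \mu_E\,\big| \le \varepsilon \,\|\theta\|
```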

  19. Maximum Entropy Inverse RL (Ziebart et al. '08): handle ambiguity using a probabilistic model of behavior. Notation: r_θ is the reward with parameters θ (linear case: r_θ = θᵀf); D is the dataset of demonstrations. [Whiteboard]

  20. Maximum Entropy Inverse RL (Ziebart et al. ’08)
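
A hedged summary of the MaxEnt IRL objective in the notation above, writing p_θ(τ) ∝ exp(r_θ(τ)) for simplicity (the dynamics factors do not depend on θ): maximize the likelihood of the demonstrations, whose gradient is the difference between demonstration and model expectations.

```latex
% MaxEnt IRL as maximum likelihood (sketch; notation assumed)
\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|}\sum_{\tau_d \in \mathcal{D}} r_\theta(\tau_d) - \log Z(\theta),
\qquad Z(\theta) = \int \exp\big(r_\theta(\tau)\big)\, d\tau

% gradient: demonstration expectations minus model expectations
\nabla_\theta \mathcal{L}(\theta)
= \mathbb{E}_{\tau \sim \mathcal{D}}\big[\nabla_\theta r_\theta(\tau)\big]
- \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta r_\theta(\tau)\big]
```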

  21. What about unknown dynamics? Whiteboard
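
A hedged sketch of one sample-based workaround (the proposal distribution q is my notation): when dynamics are unknown, the partition function above can be estimated with importance sampling over trajectories generated by any proposal policy.

```latex
% Importance-sampled estimate of the partition function (sketch)
Z(\theta) = \mathbb{E}_{\tau \sim q}\!\left[\frac{\exp\big(r_\theta(\tau)\big)}{q(\tau)}\right]
\;\approx\; \frac{1}{N}\sum_{i=1}^{N} \frac{\exp\big(r_\theta(\tau_i)\big)}{q(\tau_i)},
\qquad \tau_i \sim q
```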

  22. Case Study: Guided Cost Learning (ICML 2016). Goals: remove the need to solve the MDP in the inner loop; handle unknown dynamics; handle continuous state & action spaces.

  23. Guided cost learning algorithm (diagram): generate samples from the current policy π; update the reward/cost c_θ(x) using samples & demos; update π w.r.t. the reward.

  24. Guided cost learning algorithm (diagram): the policy π plays the role of the generator, and the reward update using generator samples & demos plays the role of the discriminator; the policy is only partially optimized, i.e., the reward is updated in the inner loop of policy optimization.

  25. Guided cost learning algorithm (same diagram). Ho et al., ICML '16, NIPS '16.
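
A hedged sketch of the reward-update step from the diagrams above, combining the gradient and the importance-sampled partition function from earlier. It assumes a linear reward over trajectory features for simplicity (the actual method scales to neural-net rewards); all argument names are mine.

```python
# One sample-based MaxEnt IRL reward update (linear reward r_theta(tau) = theta @ f(tau)).
import numpy as np

def maxent_irl_reward_step(theta, demo_feats, sample_feats, sample_logq, lr=0.01):
    # demo_feats:   (n_demos, d)   feature counts of demonstration trajectories
    # sample_feats: (n_samples, d) feature counts of trajectories sampled from the proposal q
    # sample_logq:  (n_samples,)   log-probabilities of those samples under q
    r_samples = sample_feats @ theta                   # reward of each sample
    logw = r_samples - sample_logq                     # log importance weights exp(r)/q
    logw -= logw.max()                                 # shift for numerical stability
    w = np.exp(logw)
    w /= w.sum()                                       # self-normalized weights
    grad = demo_feats.mean(axis=0) - w @ sample_feats  # demo minus model expectations
    return theta + lr * grad                           # gradient ascent on the likelihood
```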

  26. GCL experiments, real-world tasks: dish placement (state includes the goal plate pose) and pouring almonds (state includes unsupervised visual features [Finn et al. '16]); actions: joint torques.

  27. Comparisons: Path Integral IRL (Kalakrishnan et al. '13) and Relative Entropy IRL (Boularias et al. '11). Diagram: samples are drawn from a fixed proposal q and used, together with the demos, to update the reward r.

  28. Dish placement, demos

  29. Dish placement, standard cost

  30. Dish placement, RelEnt IRL (video: baseline method)

  31. Dish placement, GCL policy (video: our method, samples & reoptimizing)

  32. Pouring, demos (video: pouring demos)

  33. Pouring, RelEnt IRL (video: baseline method)

  34. Pouring, GCL policy (video: our method, samples)

  35. Conclusion: We can recover successful policies for new positions. Is the reward function also useful for new scenarios?

  36. Dish placement, GCL reoptimization (video: our method, samples & reoptimizing)

  37. Pouring, GCL reoptimization (video: our method, reoptimization). Note: normally the GAN discriminator is discarded.

  38. Guided Cost Learning & Generative Adversarial Imitation Learning. Strengths: can handle unknown dynamics; scales to neural-net rewards; efficient enough for real robots. Limitations: adversarial optimization is hard; can't scale to raw pixel observations of demos; demonstrations are typically collected with kinesthetic teaching or teleoperation (first person).

  39. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  40. Generative Adversarial Networks (Goodfellow et al. '14; Arjovsky et al. '17; Zhu et al. '17; Isola et al. '17). Similarly, GANs learn an objective for generative modeling. Diagram: the generator G maps noise to generated samples; the discriminator D distinguishes real from generated.
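
For completeness, the standard GAN minimax objective from Goodfellow et al. '14 (not reproduced on the slide itself):

```latex
\min_{G}\max_{D}\;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```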

  41. Connection between Inverse RL and GANs: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; reward r ↔ discriminator D. The discriminator only needs to learn the data distribution; θ is independent of the generator density. Finn*, Christiano*, Abbeel, Levine, arXiv '16.

  42. Connection between Inverse RL and GANs: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; cost c ↔ discriminator D. The generator objective is entropy-regularized RL. Finn*, Christiano*, Abbeel, Levine, arXiv '16.
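
A hedged sketch of what "the generator objective is entropy-regularized RL" means here (following the cited Finn*, Christiano* et al. '16 connection; constants independent of q omitted): with a discriminator of the special form shown on the next slide, optimizing the generator q amounts to minimizing expected cost minus the entropy of q.

```latex
% Generator objective as entropy-regularized RL (sketch, up to constants)
\min_{q}\; \mathbb{E}_{\tau \sim q}\big[c_\theta(\tau)\big] \;-\; \mathcal{H}\big(q(\tau)\big)
```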

  43. GANs for training EBMs: MaxEnt IRL is an energy-based model. Sampler q(x) ↔ generator G; energy E ↔ discriminator D. Use the generator's density q(x) to form a consistent estimator of the energy function:

  D_\theta(x) = \frac{\frac{1}{Z}\exp(-E_\theta(x))}{\frac{1}{Z}\exp(-E_\theta(x)) + q(x)}

  Kim & Bengio, ICLR Workshop '16; Zhao et al., arXiv '16; Zhai et al., ICLR submission '17; Dai et al., ICLR submission '17; Finn*, Christiano*, Abbeel, Levine, arXiv '16.

  44. Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning

  45. Stochastic models for learning control: how can we track both hypotheses?

  46. Stochastic energy-based policies. Tuomas Haarnoja, Haoran Tang.

  47. Soft Q-learning
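
A hedged reminder of the quantities behind soft Q-learning (standard form with temperature α, my notation; for continuous actions the sum becomes an integral): the soft value is a log-sum-exp of Q, and the policy is energy-based in Q.

```latex
% Soft value and energy-based policy (standard form; notation assumed)
V(s) = \alpha \log \sum_{a} \exp\!\big(Q(s,a)/\alpha\big), \qquad
\pi(a \mid s) \propto \exp\!\big(Q(s,a)/\alpha\big)
```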

  48. Tractable amortized inference for continuous actions (Wang & Liu '17).

  49. Stochastic energy-based policies aid exploration.

  50. Stochastic energy-based policies provide pretraining.

  51. Stochastic Optimal Control & MaxEnt in RL: Sallans & Hinton, "Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task," 2000; Peters et al., "Relative Entropy Policy Search," 2010; Nachum et al., "Bridging the Gap Between Value and Policy Based Reinforcement Learning," 2017; O'Donoghue et al., "Combining Policy Gradient and Q-Learning," 2017.
