
Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330) - PowerPoint PPT Presentation



  1. Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330)

  2. Introduction (Karol Hausman). Some background: - Not a native English speaker, so please let me know if you don't understand something - I like robots ☺ - Studied classical robotics first - Got fascinated by deep RL in the middle of my PhD, after a talk by Sergey Levine - Research Scientist at Robotics @ Google

  3. Why Reinforcement Learning? Isolated action that doesn't affect the future?

  4. Why Reinforcement Learning? Isolated action that doesn't affect the future? Supervised learning? Common applications: robotics, language & dialog, autonomous driving, business operations, finance (most deployed ML systems), plus a key aspect of intelligence.

  5. The Plan: Multi-task reinforcement learning problem; Policy gradients & their multi-task counterparts; Q-learning (should be review); Multi-task Q-learning

  6. The Plan: Multi-task reinforcement learning problem (up next); Policy gradients & their multi-task counterparts; Q-learning (should be review); Multi-task Q-learning

  7. Terminology & notation: 1. run away 2. ignore 3. pet. Slide adapted from Sergey Levine

  8. Terminology & notation: 1. run away 2. ignore 3. pet. Slide adapted from Sergey Levine

  9. Imitation Learning: supervised learning on training data. Images: Bojarski et al. '16, NVIDIA. Slide adapted from Sergey Levine

  10. Imitation Learning: supervised learning on training data. Imitation Learning vs Reinforcement Learning? Images: Bojarski et al. '16, NVIDIA. Slide adapted from Sergey Levine

  11. Reward functions. Slide adapted from Sergey Levine

  12. The goal of reinforcement learning. Slide adapted from Sergey Levine

  13. Partial observability. Fully observable? - Simulated robot performing a reaching task, given the goal position and the positions and velocities of all of its joints - Indiscriminate robotic grasping from a bin, given an overhead image - A robot sorting trash, given a camera image

  14. The goal of reinforcement learning: infinite horizon case, finite horizon case. Slide adapted from Sergey Levine
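  The deck shows the objective only graphically; as a hedged reconstruction in standard notation (not copied verbatim from the slides), the objective the titles refer to is:

      % finite horizon case
      \theta^{\star} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \Big],
      \quad p_{\theta}(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)

      % infinite horizon case (expectation under the stationary state-action distribution)
      \theta^{\star} = \arg\max_{\theta} \; \mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim p_{\theta}(\mathbf{s},\mathbf{a})} \big[ r(\mathbf{s}, \mathbf{a}) \big]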

  15. What is a reinforcement learning task? Recall supervised learning, where a task is 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i}: data-generating distributions and a loss. In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s, a), r_i(s, a)}: state space, action space, initial state distribution, dynamics, and reward, i.e. a Markov decision process. Much more than the semantic meaning of "task"!
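  To make the task tuple concrete, here is a minimal Python sketch of how such a task could be represented; the class and field names are my own illustration, not from the lecture:

      from dataclasses import dataclass
      from typing import Callable
      import numpy as np

      @dataclass
      class RLTask:
          """One RL task T_i = {S_i, A_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)}, i.e. an MDP."""
          state_dim: int                                                      # S_i
          action_dim: int                                                     # A_i
          sample_initial_state: Callable[[], np.ndarray]                      # p_i(s_1)
          sample_next_state: Callable[[np.ndarray, np.ndarray], np.ndarray]   # p_i(s'|s,a)
          reward: Callable[[np.ndarray, np.ndarray], float]                   # r_i(s,a)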

  16. Examples of task distributions. A task: 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s, a), r_i(s, a)}. Character animation: r_i(s, a) varies across maneuvers. Across garments & initial states: p_i(s_1) and p_i(s'|s, a) vary. Multi-robot RL: 𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s, a) vary.

  17. What is a reinforcement learning task? In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s, a), r_i(s, a)}: state space, action space, initial state distribution, dynamics, reward. An alternative view: a task identifier is part of the state, s̄ = (s, z_i), where s is the original state. The collection of tasks then becomes a single task 𝒯 ≜ {𝒮 = ⋃_i 𝒮_i, 𝒜 = ⋃_i 𝒜_i, p(s̄_1), p(s̄'|s̄, a), r(s̄, a)}, with the initial state distribution mixing over the per-task p_i(s_1). It can be cast as a standard Markov decision process!

  18. The goal of multi-task reinforcement learning. Multi-task RL: the same as before, except that a task identifier is part of the state, s̄ = (s, z_i). Examples of z_i: a one-hot task ID, a language description, a desired goal state z_i = s_g ("goal-conditioned RL"). What is the reward? The same as before, or, for goal-conditioned RL, r(s) = r(s, s_g) = -d(s, s_g). Examples of the distance function d: Euclidean ℓ2, sparse 0/1. If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! But you can often do better.
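  A minimal sketch of the two goal-conditioned reward choices mentioned on the slide; the function name and the success threshold eps for the sparse variant are my own assumptions:

      import numpy as np

      def goal_conditioned_reward(s, s_g, kind="euclidean", eps=0.05):
          """r(s, s_g) = -d(s, s_g), for the two distance functions on the slide."""
          if kind == "euclidean":          # d = ||s - s_g||_2
              return -float(np.linalg.norm(s - s_g))
          elif kind == "sparse":           # d = 0/1 indicator of "not yet at the goal"
              return -float(np.linalg.norm(s - s_g) > eps)
          raise ValueError(kind)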

  19. The Plan: Multi-task reinforcement learning problem; Policy gradients & their multi-task counterparts; Q-learning; Multi-task Q-learning

  20. The anatomy of a reinforcement learning algorithm. This lecture: focus on model-free RL methods (policy gradient, Q-learning). 10/19: focus on model-based RL methods.

  21. On-policy vs Off-policy. On-policy: data comes from the current policy; compatible with all RL algorithms; can't reuse data from previous policies. Off-policy: data comes from any policy; works with specific RL algorithms; much more sample efficient, can reuse old data.

  22. Evaluating the objective. Slide adapted from Sergey Levine
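  The equation itself lives only in the slide image; a hedged reconstruction of the sample-based estimate it describes, using N trajectories sampled from the current policy:

      J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \Big]
      \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})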

  23. Direct policy differentiation: a convenient identity. Slide adapted from Sergey Levine

  24. Direct policy differentiation. Slide adapted from Sergey Levine
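  The identity and the gradient it yields appear only in the slide images; a hedged reconstruction in standard notation is the log-derivative trick and the resulting policy gradient:

      p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau) = \nabla_{\theta} p_{\theta}(\tau)

      \nabla_{\theta} J(\theta)
        = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[ \Big( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t) \Big) \Big( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \Big) \Big]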

  25. Evaluating the policy gradient: generate samples (i.e. run the policy); fit a model to estimate the return; improve the policy. Slide adapted from Sergey Levine

  26. Comparison to maximum likelihood (supervised learning on training data). Multi-task learning algorithms can readily be applied! Slide adapted from Sergey Levine
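  To make the comparison concrete, here is a minimal sketch (my own, with assumed input format) of the Monte-Carlo policy gradient estimate; it is the maximum-likelihood gradient with each sampled trajectory reweighted by its total return:

      import numpy as np

      def policy_gradient_estimate(logpi_grads, rewards):
          """Monte-Carlo policy gradient (REINFORCE-style) sketch.
          logpi_grads[i][t]: grad_theta log pi(a_t|s_t), a parameter-sized array,
          for step t of sampled trajectory i; rewards[i][t]: r(s_t, a_t).
          Setting every trajectory weight to 1 recovers the maximum-likelihood gradient."""
          grad = 0.0
          for grads_i, rews_i in zip(logpi_grads, rewards):
              weight = float(np.sum(rews_i))                    # R(tau_i): total return
              grad = grad + weight * np.sum(grads_i, axis=0)    # sum_t grad log pi, reweighted
          return grad / len(logpi_grads)                        # average over trajectories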

  27. What did we just do? Good stuff is made more likely; bad stuff is made less likely. It simply formalizes the notion of "trial and error"! Slide adapted from Sergey Levine

  28. Policy Gradients. Pros: + Simple + Easy to combine with existing multi-task & meta-learning algorithms. Cons: - Produces a high-variance gradient; can be mitigated with baselines (used by all algorithms in practice; see the note below) and trust regions - Requires on-policy data; cannot reuse existing experience to estimate the gradient! Importance weights can help, but also have high variance.
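  A hedged note on the baseline trick mentioned in the cons (standard result, not quoted from the deck): subtracting a constant b from the return reduces variance without biasing the gradient:

      \nabla_{\theta} J(\theta)
        = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \big[ \nabla_{\theta} \log p_{\theta}(\tau)\, (R(\tau) - b) \big],
      \qquad b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)

      % unbiased because E[ \nabla_{\theta} \log p_{\theta}(\tau) \, b ] = b \, \nabla_{\theta} \mathbb{E}[1] = 0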

  29. The Plan: Multi-task reinforcement learning problem; Policy gradients & their multi-task/meta counterparts; Q-learning; Multi-task Q-learning

  30. Value-Based RL: Definitions. Value function: V^π(s_t) = Σ_{t'=t}^{T} E_π[r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_π[r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). They're related: V^π(s_t) = E_{a_t∼π(·|s_t)}[Q^π(s_t, a_t)]. If you know Q^π, you can use it to improve π: set π'(a|s) ← 1 for a = argmax_a Q^π(s, a). The new policy is at least as good as the old policy.
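  As a tiny illustration of the improvement step (my own tabular sketch, assuming finite state and action spaces):

      import numpy as np

      def greedy_improvement(Q):
          """Given tabular Q^pi with shape [num_states, num_actions], return the
          improved deterministic policy: pi'(a|s) = 1 for a = argmax_a Q^pi(s, a)."""
          pi_new = np.zeros_like(Q, dtype=float)
          pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
          return pi_new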

  31. Value-Based RL: Definitions. Value function: V^π(s_t) = Σ_{t'=t}^{T} E_π[r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_π[r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). For the optimal policy π*: Q*(s_t, a_t) = E_{s'∼p(·|s, a)}[r(s, a) + γ max_{a'} Q*(s', a')], the Bellman equation.

  32. Value-Based RL. Value function: V^π(s_t) = ? Q function: Q^π(s_t, a_t) = ? Q* function: Q*(s_t, a_t) = ? V* function: V*(s_t) = ? Reward = 1 if I can play it in a month, 0 otherwise. [Figure: a state s_t with three candidate actions a_1, a_2, a_3; the current policy has π(a_1|s) = 1.] Set π'(a|s) ← 1 for a = argmax_a Q^π(s, a); the new policy is at least as good as the old policy.

  33. Fitted Q-iteration Algorithm (with its hyperparameters). Result: get a policy π(a|s) from argmax_a Q_ϕ(s, a). We can reuse data from previous policies (using replay buffers): it is an off-policy algorithm. Important note: this is not a gradient descent algorithm! Can be readily extended to multi-task / goal-conditioned RL. Slide adapted from Sergey Levine
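  A minimal sketch of fitted Q-iteration on a replay buffer; the helper functions q_fit and q_predict, the discrete-action assumption, and the hyperparameter values are my own, standing in for the algorithm box on the slide:

      import numpy as np

      def fitted_q_iteration(buffer, q_fit, q_predict, num_iters=10, gamma=0.99):
          """buffer: list of (s, a, r, s_next) transitions collected by any past policies.
          q_predict(s_batch, a_batch) -> current Q_phi estimates (assumed helper).
          q_fit(s_batch, a_batch, targets) -> regress Q_phi onto targets (assumed helper)."""
          s, a, r, s_next = (np.array(x) for x in zip(*buffer))
          actions = np.unique(a)                       # assumes a small discrete action set
          for _ in range(num_iters):
              # Bellman backup targets: y = r + gamma * max_a' Q_phi(s', a')
              next_q = np.stack([q_predict(s_next, np.full(len(s_next), a2)) for a2 in actions])
              targets = r + gamma * next_q.max(axis=0)
              q_fit(s, a, targets)                     # supervised regression step
          # final policy: pi(a|s) picks argmax_a Q_phi(s, a)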

  34. Example: Q-learning Applied to Robotics. Continuous action space? Use a simple optimization algorithm: the Cross-Entropy Method (CEM).
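  A minimal sketch of how CEM can approximate argmax_a Q(s, a) over a continuous action space; the function and parameter names are my own, and the population sizes are illustrative rather than QT-Opt's actual settings:

      import numpy as np

      def cem_argmax_q(q_fn, s, action_dim, num_iters=3, pop_size=64, num_elites=6):
          """Iteratively refit a Gaussian over actions toward high-Q samples."""
          mean, std = np.zeros(action_dim), np.ones(action_dim)
          for _ in range(num_iters):
              samples = np.random.randn(pop_size, action_dim) * std + mean   # candidate actions
              scores = np.array([q_fn(s, a) for a in samples])               # evaluate Q(s, a)
              elites = samples[np.argsort(scores)[-num_elites:]]             # keep the best
              mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6     # refit the Gaussian
          return mean                                                        # approximate argmax_a Q(s, a)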

  35. QT-Opt: Q-learning at Scale. [System diagram: CEM optimization; in-memory buffers; Bellman updaters; training jobs; stored data from all past experiments.] QT-Opt: Kalashnikov et al. '18, Google Brain. Slide adapted from D. Kalashnikov

  36. QT-Opt: MDP Definition for Grasping. State: over-the-shoulder RGB camera image, no depth. Action: 4-DOF pose change in Cartesian space + gripper control. Reward: sparse binary reward at the end of the episode if the object was lifted, with no shaping; success is detected automatically. Slide adapted from D. Kalashnikov

  37. QT-Opt: Setup and Results. 96% test success rate on unseen test objects; 7 robots collected 580k grasps.
