CS 330
Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned
Introduction. Some background (Karol): I'm not a native English speaker, so please let me know if you don't understand something. And I like robots.
[Figure: training data and supervised learning (imitation learning). Images: Bojarski et al. '16, NVIDIA]
Slide adapted from Sergey Levine
The RL objective: finite horizon case vs. infinite horizon case.
Slide adapted from Sergey Levine
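A standard way to write these two objectives (a sketch in the notation used in the rest of this section, not copied from the slide figures):

```latex
% RL objective, finite horizon (episodic) case:
\theta^\star = \arg\max_{\theta}\; \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]

% RL objective, infinite horizon case
% (expected reward under the stationary state-action distribution of the policy):
\theta^\star = \arg\max_{\theta}\; \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[\, r(s, a)\,\right]
```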
Recall the supervised multi-task setting, where a task is defined by data-generating distributions and a loss:
T_j ≜ { p_j(x), p_j(y|x), L_j }
In RL, a task is a Markov decision process:
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }
state space, action space, initial state distribution, dynamics, reward: much more than the semantic meaning of "task"!
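As a concrete illustration of these five ingredients, here is a minimal sketch of a task represented as a Python container; the class and field names are hypothetical, not from the lecture:

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class MDPTask:
    """One RL task T_j = {S_j, A_j, p_j(s1), p_j(s'|s,a), r_j(s,a)}; names are illustrative."""
    state_dim: int                                                      # stands in for the state space S_j
    action_dim: int                                                     # stands in for the action space A_j
    sample_initial_state: Callable[[], np.ndarray]                      # s1 ~ p_j(s1)
    sample_next_state: Callable[[np.ndarray, np.ndarray], np.ndarray]   # s' ~ p_j(s'|s, a)
    reward: Callable[[np.ndarray, np.ndarray], float]                   # r_j(s, a)

# toy example: a 1-D "move to the origin" task
reach_origin = MDPTask(
    state_dim=1,
    action_dim=1,
    sample_initial_state=lambda: np.random.uniform(-1.0, 1.0, size=1),
    sample_next_state=lambda s, a: s + 0.1 * a,
    reward=lambda s, a: -float(np.abs(s).sum()),
)
```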
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }
Examples: multi-robot RL; character animation across maneuvers; manipulation across garments & initial states. Depending on the example, the rewards r_j(s, a) vary, or the initial state distribution and dynamics p_j(s_1), p_j(s'|s, a) vary.
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }: state space, action space, initial state distribution, dynamics, reward.
An alternative view: make the task identifier part of the state, s̄ = (s, z_j). Then
T_j ≜ { S̄_j, A_j, p_j(s̄_1), p(s̄'|s̄, a), r(s̄, a) }
and the dynamics and reward no longer need a per-task subscript. It can be cast as a standard Markov decision process!
Combining all tasks into a single MDP:
T ≜ { ∪_j S̄_j, ∪_j A_j, (1/N) Σ_j p_j(s̄_1), p(s̄'|s̄, a), r(s̄, a) }
Multi-task RL: the same as before, except that a task identifier is part of the state: s̄ = (s, z_j).
The task identifier z_j can be, e.g., a one-hot task ID, a language description, or a desired goal state z_j = s_g.
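A minimal sketch of the "task identifier as part of the state" view, assuming a gym-style environment whose step() returns (obs, reward, done, info); the wrapper name and the one-hot choice are illustrative:

```python
import numpy as np

class TaskAugmentedEnv:
    """Wraps a single-task env so observations become s_bar = (s, z_j) with a one-hot task ID."""

    def __init__(self, env, task_index, num_tasks):
        self.env = env
        self.z = np.zeros(num_tasks, dtype=np.float32)
        self.z[task_index] = 1.0  # one-hot task identifier z_j

    def _augment(self, obs):
        # concatenate the task identifier onto the original observation
        return np.concatenate([np.asarray(obs, dtype=np.float32), self.z])

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```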
What is the reward? The same as before, or, for goal-conditioned RL:
r(s) = r(s, s_g) = −d(s, s_g)
for some distance function d; examples of the distance d are sketched below.
If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! You can often do better, though; this setting is called "goal-conditioned RL".
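A sketch of the goal-conditioned reward r(s, s_g) = −d(s, s_g). The two distance choices shown here (Euclidean, and a sparse 0/1-style indicator) are common options, not necessarily the ones on the slide:

```python
import numpy as np

def euclidean_reward(s, s_g):
    """r(s, s_g) = -||s - s_g||_2 : dense, decreases smoothly with distance to the goal."""
    return -float(np.linalg.norm(np.asarray(s) - np.asarray(s_g)))

def sparse_reward(s, s_g, eps=0.05):
    """r(s, s_g) = 0 if the goal is reached (within eps), -1 otherwise: sparse."""
    return 0.0 if np.linalg.norm(np.asarray(s) - np.asarray(s_g)) < eps else -1.0
```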
A convenient identity. Slide adapted from Sergey Levine
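The identity in question is presumably the log-derivative (likelihood-ratio) trick used in the policy gradient derivation; a sketch for reference:

```latex
% Log-derivative identity, as used in the REINFORCE derivation:
p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)
  = p_\theta(\tau)\,\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}
  = \nabla_\theta p_\theta(\tau)
```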
Generate samples (i.e. run the policy); fit a model to estimate return; improve the policy. Slide adapted from Sergey Levine
Value function: V^π(s_t) = Σ_{t'=t}^{T} E_π[ r(s_{t'}, a_{t'}) | s_t ], the total reward starting from s_t and following π ("how good is a state").
Q-function: Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_π[ r(s_{t'}, a_{t'}) | s_t, a_t ], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair").
Policy improvement: set the new policy to pick a = argmax_a Q^π(s, a). The new policy is at least as good as the old policy.
The optimal Q-function Q*(s, a) satisfies the Bellman equation:
Q*(s, a) = r(s, a) + γ E_{s' ~ p(s'|s, a)}[ max_{a'} Q*(s', a') ]
Example: reward = 1 if I can play it in a month, 0 otherwise. From state s_t, which action (a1, a2, or a3) should I take? Pick the one with the highest Q-value: argmax_a Q^π(s_t, a).
Slide adapted from Sergey Levine
Hyperparameters of the algorithm include how much data to collect, the policy used to collect it, and how many fitting iterations to run. This is not a gradient descent algorithm! Result: get a policy π(a|s) from argmax_a Q_φ(s, a).
Important notes:
- We can reuse data from previous policies (using replay buffers): it is an off-policy algorithm.
- It can be readily extended to multi-task / goal-conditioned RL.
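A minimal sketch of these points: tabular Q-learning fitted from a replay buffer. The env interface (integer states, gym-style step()) is assumed; this is a generic illustration, not the exact algorithm from the slides:

```python
import random
from collections import deque

import numpy as np

def q_learning_with_replay(env, num_states, num_actions, episodes=500,
                           gamma=0.99, lr=0.1, eps=0.1, batch_size=32):
    """Q-learning with a replay buffer: fit Q toward Bellman targets computed from
    stored transitions, which may have been collected by older policies (off-policy)."""
    Q = np.zeros((num_states, num_actions))
    buffer = deque(maxlen=10_000)  # replay buffer of (s, a, r, s', done)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = random.randrange(num_actions) if random.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            buffer.append((s, a, r, s_next, done))
            s = s_next

            # Bellman backup on a minibatch of (possibly old) transitions
            if len(buffer) >= batch_size:
                for (bs, ba, br, bs_next, bdone) in random.sample(buffer, batch_size):
                    target = br + (0.0 if bdone else gamma * np.max(Q[bs_next]))
                    Q[bs, ba] += lr * (target - Q[bs, ba])
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```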
Continuous action space? Use a simple optimization algorithm: the Cross-Entropy Method (CEM).
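A sketch of how CEM can approximately solve max_a Q(s, a) over a continuous action space, roughly in the spirit of QT-Opt; q_function(state, action) is an assumed callable:

```python
import numpy as np

def cem_argmax_q(q_function, state, action_dim, iterations=3, population=64,
                 elite_frac=0.1, action_low=-1.0, action_high=1.0):
    """Approximate argmax_a Q(s, a) by iteratively refitting a Gaussian to the best sampled actions."""
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    num_elite = max(1, int(population * elite_frac))

    for _ in range(iterations):
        # sample a population of candidate actions from the current Gaussian
        actions = np.clip(np.random.randn(population, action_dim) * std + mean,
                          action_low, action_high)
        q_values = np.array([q_function(state, a) for a in actions])
        elite = actions[np.argsort(q_values)[-num_elite:]]  # keep the top-scoring actions
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # approximate argmax_a Q(state, a)
```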
[QT-Opt system diagram: in-memory buffers, Bellman updaters, stored data from all past experiments, CEM optimization.]
QT-Opt: Kalashnikov et al. ‘18, Google Brain
Slide adapted from D. Kalashnikov
State: over-the-shoulder RGB camera image, no depth. Action: 4-DoF pose change in Cartesian space plus gripper control. Reward: binary reward at the end of the episode if the object was lifted; sparse, no shaping. Automatic success detection.
Slide adapted from D. Kalashnikov
7 robots collected 580k grasps. On unseen test objects: 96% test success rate!
Policy: π_θ(a|s) → π_θ(a|s, z_j)
Q-function: Q_φ(s, a) → Q_φ(s, a, z_j)
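A minimal sketch of this conditioning with a single shared network: just concatenate z_j with the inputs. The PyTorch module below is purely illustrative, not the architecture from the slides:

```python
import torch
import torch.nn as nn

class MultiTaskQFunction(nn.Module):
    """Q_phi(s, a, z_j): one network shared across tasks, conditioned on the task ID z_j."""

    def __init__(self, state_dim, action_dim, task_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, z):
        # the multi-task extension: concatenate the task identifier with (s, a)
        return self.net(torch.cat([state, action, z], dim=-1))
```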
What is different about reinforcement learning? The data distribution is controlled by the agent! Should we share data in addition to sharing weights?
Why mention it now?
Task 1: passing. Task 2: shooting goals. What if you accidentally perform a good pass when trying to shoot a goal? Store the experience as normal, *and* relabel the experience with the passing task's ID and reward and store that too. This is "hindsight relabeling", a.k.a. "hindsight experience replay" (HER).
Goal-conditioned hindsight relabeling (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data D_k = {(s_1:T, a_1:T, s_g, r_1:T)} using some policy, and add it to the replay buffer.
2. Relabel the trajectory using a state it actually reached as the goal, e.g. the last state (any state from the trajectory works):
   D_k' = {(s_1:T, a_1:T, s_T, r'_1:T)} where r'_t = −d(s_t, s_T)
   and add D_k' to the replay buffer as well.
3. Update the policy from the replay buffer, k++, and repeat.
Result: exploration challenges alleviated.
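A minimal sketch of the relabeling step above, assuming trajectories stored as (state, action) pairs and a distance function d as before; the names are illustrative:

```python
def relabel_with_final_state(trajectory, distance):
    """Given a trajectory [(s_t, a_t), ...] collected for some commanded goal, produce a
    relabeled copy that pretends the final state was the goal: r'_t = -d(s_t, s_T)."""
    states = [s for (s, a) in trajectory]
    new_goal = states[-1]  # last state of the trajectory (any visited state also works)
    return [(s, a, new_goal, -distance(s, new_goal)) for (s, a) in trajectory]
```

Both the original transitions (with the commanded goal) and the relabeled ones go into the replay buffer.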
Why mention it now? Task 1: close a drawer. Task 2: open a drawer. Can we use episodes from the drawer-opening task for the drawer-closing task? How does that answer change for Q-learning vs. policy gradient?
When can we apply relabeling?
Multi-task relabeling (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data D = {(s_1:T, a_1:T, z_j, r_1:T)} using some policy for task j, and add it to the replay buffer.
2. Relabel the trajectory for another task k:
   D' = {(s_1:T, a_1:T, z_k, r'_1:T)} where r'_t = r_k(s_t)
   and add D' to the replay buffer as well. Which k to choose? For example, a task for which the trajectory gets high reward.
3. Update the policy from the replay buffer and repeat.
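The same idea in code, assuming access to each task's reward function r_k (here a hypothetical dict reward_fns keyed by task ID):

```python
def relabel_with_task(trajectory, reward_fns, k):
    """Relabel a trajectory [(s_t, a_t), ...] as if it were collected for task k: r'_t = r_k(s_t)."""
    return [(s, a, k, reward_fns[k](s)) for (s, a) in trajectory]

def best_relabel_task(trajectory, reward_fns):
    """One heuristic for 'which k to choose': the task under which the trajectory gets the highest return."""
    return max(reward_fns, key=lambda k: sum(reward_fns[k](s) for (s, a) in trajectory))
```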
Eysenbach et al., Rewriting History with Inverse RL; Li et al., Generalized Hindsight for RL
Andrychowicz et al. Hindsight Experience Replay. NeurIPS ‘17
D' = {(s_1:T, a_1:T, s_T, r'_1:T)} where r'_t = −d(s_t, s_T)
Lynch, Khansari, Xiao, Kumar, Tompson, Levine, Sermanet. Learning Latent Plans from Play. ‘19