Data-Efficient Hierarchical Reinforcement Learning
Authors: Ofir Nachum, Shixiang Gu, Honglak Lee, Sergey Levine Presented by: Samuel Yigzaw
OUTLINE
§ Introduction
§ Background
§ Main Contributions
§ Related Work
§ Experiments
§ Conclusion
§ Deep reinforcement learning has performed well in areas with relatively small action and/or state spaces
  § Atari games
  § Go
  § Simple continuous control tasks
§ When action and state spaces are both large and continuous, much work remains to be done
§ If you told your robot maid to go pick up groceries for you, how would it do that?
§ A human maid would accomplish the task by breaking the goal down into the requirements needed to complete it:
  § Top level: go to store, buy groceries, come back
  § Breakdown of “go to store”: leave house, walk down street, enter store
  § Breakdown of “leave house”: walk to front door, open door, walk through door, lock door
  § Breakdown of “walk to front door”: basic motor actions which aren’t consciously processed
§ This is an inherently hierarchical way of accomplishing tasks
§ Hierarchical Reinforcement Learning (HRL) is the area of RL that focuses on bringing the benefits of hierarchical reasoning to RL
§ In HRL, multiple layers of policies are learned, where higher-level policies choose which lower-level policies to run at each moment
§ There are many presumed benefits to this:
  § Temporal and behavioral abstraction
  § Much smaller action spaces for higher-level policies
  § More reliable credit assignment
§ 2-layer design
  § Uses a high-level policy to select goals for a low-level policy
  § Provides the low-level policy with a goal to try to achieve (specified as a state observation)
§ Uses off-policy training with a correction in order to increase sample efficiency
§ The main learning algorithm used is TD3, a variant of DDPG (a minimal sketch of the objectives follows below)
§ DDPG
  § Q-function $Q_\theta$, parameterized by $\theta$
    § Trained to minimize $E_{s_t, a_t, s_{t+1}}\left[\left(Q_\theta(s_t, a_t) - R_t - \gamma\, Q_\theta(s_{t+1}, \mu_\phi(s_{t+1}))\right)^2\right]$
  § Deterministic policy $\mu_\phi$, parameterized by $\phi$
    § Trained to maximize $Q_\theta(s_t, \mu_\phi(s_t))$ over all $s_t$
§ Behaviour policy used to collect experience is augmented with Gaussian noise
  § Helpful for off-policy correction
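A minimal PyTorch sketch of the two DDPG objectives above (TD3 additionally uses twin critics, target-policy smoothing, and delayed actor updates). The network shapes and names here are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-function Q_theta(s, a); layer sizes are illustrative."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """Deterministic policy mu_phi(s) with actions squashed to [-1, 1]."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(),
                                 nn.Linear(256, a_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

def ddpg_losses(critic, actor, batch, gamma=0.99):
    s, a, r, s_next = batch
    with torch.no_grad():
        # TD target: R_t + gamma * Q_theta(s_{t+1}, mu_phi(s_{t+1}))
        target = r + gamma * critic(s_next, actor(s_next))
    critic_loss = ((critic(s, a) - target) ** 2).mean()  # minimize squared TD error
    actor_loss = -critic(s, actor(s)).mean()             # maximize Q_theta(s, mu_phi(s))
    return critic_loss, actor_loss
```

Experience would be collected by perturbing `actor(s)` with Gaussian noise, matching the behaviour policy described in the last bullet.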
§ High-level policy µhi and low-level policy µlo
§ µhi observes the state and produces a high-level action (a goal state) $g_t \in \mathbb{R}^{d_s}$
  § Every c steps, sample a new goal: $g_t \sim \mu_{hi}$
  § Otherwise, use a fixed goal transition function: $g_t = h(s_{t-1}, g_{t-1}, s_t)$
§ µlo observes $s_t$ and $g_t$ and produces a low-level action $a_t \sim \mu_{lo}(s_t, g_t)$
§ Environment yields reward $R_t$
§ Low-level policy receives intrinsic reward $r_t = r(s_t, g_t, a_t, s_{t+1})$
  § $r$ is a fixed, parameterized reward function
§ Low-level policy stores the experience $(s_t, g_t, a_t, r_t, s_{t+1}, h(s_t, g_t, s_{t+1}))$ for off-policy training
§ High-level policy stores the experience $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$ for off-policy training (see the rollout sketch below)
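A sketch of one rollout of this data flow, assuming a Gym-style `env.step` interface; `mu_hi`, `mu_lo`, `h`, and `r` are stand-ins for the two policies, the goal transition function, and the intrinsic reward (both defined on the next slide):

```python
import numpy as np

def collect_experience(env, mu_hi, mu_lo, h, r, c=10, T=500):
    """Run one rollout, filling a replay buffer for each level of the hierarchy."""
    lo_buffer, hi_buffer = [], []
    s = env.reset()
    for t in range(T):
        if t % c == 0:                     # sample a fresh goal every c steps
            g = mu_hi(s)
            states, goals, actions, rewards = [], [], [], []
        a = mu_lo(s, g)                    # low-level action a_t ~ mu_lo(s_t, g_t)
        s_next, R, done, _ = env.step(a)   # environment reward R_t
        g_next = h(s, g, s_next)           # goal transition between re-samples
        lo_buffer.append((s, g, a, r(s, g, a, s_next), s_next, g_next))
        states.append(s); goals.append(g); actions.append(a); rewards.append(R)
        if (t + 1) % c == 0:
            # one c-step transition for the high-level policy
            hi_buffer.append((states, goals, actions, rewards, s_next))
        s, g = s_next, g_next
        if done:
            break
    return lo_buffer, hi_buffer
```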
§ Goal $g_t$ is specified as the difference between the current state $s_t$ and a desired state $s_t + g_t$
  § A simple goal transition model could be $h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}$
  § This keeps the absolute desired state constant as $s_t$ changes
§ The intrinsic reward is a parameterized reward function based on the Euclidean distance between the current observation and the goal observation:
  § $r(s_t, g_t, a_t, s_{t+1}) = -\left\| s_t + g_t - s_{t+1} \right\|_2$
§ The low-level policy is trained with an input space that includes $s_t$ and $g_t$
§ Intrinsic rewards let the lower-level policy receive dense, relevant reward signals immediately, before any task-specific rewards are available (both $h$ and $r$ are sketched in code below)
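Both definitions are short enough to state directly in code; a sketch assuming goals live in the raw state space as NumPy vectors:

```python
import numpy as np

def h(s, g, s_next):
    """Goal transition: keeps the absolute desired state s + g fixed as s changes."""
    return s + g - s_next

def r(s, g, a, s_next):
    """Intrinsic reward: negative Euclidean distance to the desired state (a is unused)."""
    return -np.linalg.norm(s + g - s_next)

# Example: after moving from [0, 0] to [0.5, 0.2], the updated goal still
# points at the same absolute target s + g = [2, 1].
s, g, s_next = np.zeros(2), np.array([2.0, 1.0]), np.array([0.5, 0.2])
assert np.allclose(h(s, g, s_next), np.array([1.5, 0.8]))
```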
§ Previous two-level HRL algorithms use on-policy training
  § As the low-level policy is trained, the high-level policy is trying to solve a non-stationary problem
  § Old off-policy experience may contain different transitions for the same state and goal
§ Off-policy algorithms generally have better sample efficiency than on-policy algorithms
§ HIRO applies an off-policy correction during training
  § This allows HIRO to take advantage of the sample-efficiency benefits of off-policy training
§ Recall the higher-level policy experience $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$
§ Goal: convert it to state-action-reward transitions $(s_t, g_t, \sum R_{t:t+c-1}, s_{t+c})$
  § These can be pushed onto the replay buffer of any standard off-policy algorithm
§ Problem: past instances of the lower-level policy will exhibit different behavior given the same goal $g_t$, ending up in different states $s_{t+1:t+c}$
§ Idea: change the $g_t$ of past high-level policy experience in order to make the observed action sequence more likely under the current low-level policy
  § In the past: $a_{t:t+c-1} \sim \mu_{lo}(s_{t:t+c-1}, g_{t:t+c-1})$
§ Solution:
  § Find $\tilde{g}_t$ to maximize $\mu_{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$ for the current µlo
  § Relabel the high-level experience with $\tilde{g}_t$: $(s_t, \tilde{g}_t, \sum R_{t:t+c-1}, s_{t+c})$
  § $\log \mu_{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1}) \propto -\frac{1}{2} \sum_{i=t}^{t+c-1} \left\| a_i - \mu_{lo}(s_i, \tilde{g}_i) \right\|_2^2 + \text{const}$
§ To approximately maximize this, 10 candidate goals are considered (see the sketch below):
  § 8 candidate goals are sampled from a Gaussian centered at $s_{t+c} - s_t$
  § The original $g_t$ and $s_{t+c} - s_t$ are included as candidates as well
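A sketch of this relabelling step, reusing the `mu_lo` and `h` stand-ins from earlier; the Gaussian width `sigma` is an assumed hyperparameter:

```python
import numpy as np

def relabel_goal(mu_lo, h, states, actions, g_orig, s_end, sigma=1.0):
    """Pick, out of 10 candidates, the goal that maximizes the log-likelihood
    of the stored c-step action sequence under the current low-level policy."""
    diff = s_end - states[0]                       # s_{t+c} - s_t
    candidates = [g_orig, diff] + [np.random.normal(diff, sigma) for _ in range(8)]

    def log_prob(g0):
        # Roll the candidate goal forward through h and accumulate
        # -1/2 * ||a_i - mu_lo(s_i, g_i)||^2 over the c steps.
        g, total = g0, 0.0
        for s, s_next, a in zip(states, states[1:] + [s_end], actions):
            total -= 0.5 * np.sum((a - mu_lo(s, g)) ** 2)
            g = h(s, g, s_next)
        return total

    return max(candidates, key=log_prob)
```

The winning candidate replaces $g_t$ in the stored transition $(s_t, \tilde{g}_t, \sum R_{t:t+c-1}, s_{t+c})$ before it is pushed to the high-level replay buffer.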
§ To help learn useful lower-level policies, some recent work uses auxiliary rewards
  § Either hand-crafted rewards or exploration-encouraging rewards
  § HIRO instead uses a parameterized reward function
§ To produce semantically distinct behavior, some recent work pretrains the lower-level policy on diverse tasks
  § This requires suitably similar tasks and is not general
§ Hierarchical Actor-Critic uses off-policy training, but without the correction
§ FeUdal Networks also use goals and parameterized lower-level rewards
  § Their goals and rewards are computed in terms of a learned state representation, not raw states
  § HIRO uses raw goal and state representations, so it can train on intrinsic rewards immediately
§ A general approach for training a two-level HRL agent
  § Goals are specified as the difference between a desired state and the current state
  § The lower-level policy is trained with parameterized rewards
  § Both policies are trained concurrently in an off-policy manner
    § This leads to high sample efficiency
§ The off-policy correction allows past experience to be used for training the higher-level policy
§ The algorithm was evaluated on fairly simple tasks
  § State and action spaces were both low-dimensional
  § The environment was fully observed
§ Further work could apply this algorithm, or an improved version of it, to harder tasks