SLIDE 1

FeUdal Networks for Hierarchical Reinforcement Learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu (DeepMind)

Rene Bidart (rbbidart@uwaterloo.ca) CS885 June 22, 2018

SLIDE 2

Why do we care about HRL?

Reinforcement Learning is hard!

  • Long time horizons and sparse rewards are problematic for current methods
  • Many of the methods we use are not intuitively appealing
  • Look to the human decision-making process for inspiration
SLIDE 3

Hierarchies

How do we make decisions?

  • If we are hungry, do we reason in terms of small muscle movements?
  • To play the guitar, do we randomly jitter our fingers until we play a song?

No, we reason using hierarchies of abstraction.

  • We already use conv nets to learn hierarchical structure in images. Why not use hierarchical structure in policies?

SLIDE 4

Feudalism

  • Governance system in Europe in the Middle Ages
  • Extremely hierarchical, based on ownership of property
  • Higher-level people have control over the lower levels, but not over people many layers lower

SLIDE 5

Feudal RL (1993)

Reward Hiding

  • Managers reward sub-managers for satisfying their commands, not through an external reward
  • Managers have absolute control

Information Hiding

  • Observe the world at different resolutions
  • Managers don't know what happens at other levels of the hierarchy

[Figure: a stack of agents exchanging rewards down the hierarchy, with actions sent to the environment]

SLIDE 6

Feudal RL (1993)

  • Based on Q-learning
  • Used to solve a simple maze task
  • Didn't produce good results on more complex or less obviously hierarchical problems

SLIDE 7

FeUdal Networks (2017): Overview

Manager

  • Sets directional goals for the worker
  • Rewarded by environment
  • Does not directly act in environment

Worker

  • Higher temporal resolution
  • Reward for achieving manager’s goals
  • Produces primitive actions in environment

[Figure: the Manager sets goals for the Worker, the Worker takes actions in the environment, and rewards flow back from the environment]

SLIDE 8

FeUdal Networks (2017): Overview

Architecture

  • Both worker and manager share a state embedding
  • Both worker and manager use RNNs

Goals

  • Manager produces directional goals for the worker in latent space
  • Trained using a novel transition policy gradient

[Figure: Manager-Worker-Environment diagram, as on the previous slide]

SLIDE 9

FeUdal Network: Details

SLIDE 10

FeUdal Network

Shared Dense Embedding

  • Embedding of the input state
  • Used by both worker and manager to produce the goal and action
  • CNN (a sketch follows below):

    ○ 16 8x8 filters
    ○ 32 4x4 filters
    ○ 256-unit fully connected layer
    ○ ReLU activations
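A minimal PyTorch sketch of this shared perceptual embedding. The strides (4 and 2) and the 84x84 single-channel input are assumptions, following the standard Atari conv-net this architecture mirrors; class and variable names are mine.

```python
import torch.nn as nn

class PerceptualEmbedding(nn.Module):
    """Shared CNN embedding z_t used by both Manager and Worker (sketch)."""

    def __init__(self, d=256, in_channels=1):
        super().__init__()
        # Strides (4, 2) and the 84x84 input are assumptions, following
        # the standard Atari conv-net; the filter counts match the slide.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, d), nn.ReLU())

    def forward(self, x):
        # x: (batch, in_channels, 84, 84) -> z_t: (batch, d)
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))
```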

SLIDE 11

FeUdal Network

Manager: Goal embedding

  • Lower temporal resolution; goals are summed over the last 10 time steps
  • Uses a dilated LSTM
  • The goal lives in a low-dimensional latent space, not in the environment
  • Trained using the transition policy gradient
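For reference, the manager's goal computation as I recall it from the paper (s_t is the manager's internal state representation, h^M the dilated LSTM state; the normalization makes the goal purely directional):

\[
h^{M}_{t},\ \hat{g}_{t} = f^{\mathrm{Mrnn}}\!\left(s_{t}, h^{M}_{t-1}\right),
\qquad
g_{t} = \hat{g}_{t} / \lVert \hat{g}_{t} \rVert .
\]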
SLIDE 12

FeUdal Network

Worker: Action Embedding

  • LSTM on the shared embedding
  • Produces an action embedding matrix U:

    ○ Rows: actions [a]
    ○ Columns: embedding dimension [k]

SLIDE 13

FeUdal Network

Goal embedding: Worker

  • Compress the manager’s goal to dimension k using a linear transformation ɸ
  • Same dimension as the action embedding
  • Linear transformation with no bias:

    ○ Can’t produce a zero vector
    ○ Can’t ignore the manager’s input, so the manager’s goal will influence the final policy

SLIDE 14

FeUdal Network

Action: Worker

  • Product of the action embedding matrix (U) with the goal embedding (w)
  • Produces a distribution over actions
  • Policy: π = softmax(Uw); the action is sampled from this distribution (sketch below)
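A minimal PyTorch sketch of the Worker's policy head, combining slides 12-14: the LSTM output is read as U, the manager's last c goals are summed and projected by the bias-free ɸ to get w, and the policy is softmax(Uw). k = 16 and the |a|·k-unit LSTM follow the paper; the rest (names, default sizes) is illustrative.

```python
import torch
import torch.nn as nn

class WorkerHead(nn.Module):
    """Sketch of the Worker's policy head (names are mine, not the paper's)."""

    def __init__(self, d=256, k=16, num_actions=18):
        super().__init__()
        self.k, self.num_actions = k, num_actions
        self.lstm = nn.LSTMCell(d, num_actions * k)  # hidden state read as U_t, flattened
        self.phi = nn.Linear(d, k, bias=False)       # no bias: the goal can never be zeroed out

    def forward(self, z_t, hc, goals):
        # z_t: shared embedding (batch, d); hc: previous LSTM state;
        # goals: the manager's last c goals, shape (batch, c, d)
        h, c = self.lstm(z_t, hc)
        U = h.view(-1, self.num_actions, self.k)     # action embedding matrix, (batch, |a|, k)
        w = self.phi(goals.sum(dim=1))               # project the summed goals to R^k
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)
        return torch.softmax(logits, dim=-1), (h, c) # distribution over actions, new LSTM state
```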
SLIDE 15

Training Manager: Transition Policy Gradient

  • The worker’s goal is directional, rather than absolute
  • Instead of increasing the probability of an action, the manager’s update shifts the direction of the goal

Actor-critic, with the value function coming from an internal critic:
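The update that belongs here, as I recall it from the paper (d_cos is cosine similarity, c the manager's horizon, and A^M_t the manager's advantage computed from its internal critic):

\[
\nabla g_{t} = A^{M}_{t}\, \nabla_{\theta}\, d_{\cos}\!\left(s_{t+c} - s_{t},\; g_{t}(\theta)\right),
\qquad
A^{M}_{t} = R_{t} - V^{M}_{t}(x_{t}, \theta).
\]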

SLIDE 16

Training: Worker’s Intrinsic Reward

  • The intrinsic reward is based on whether the worker moves in the direction the manager set
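The intrinsic reward, as defined in the paper: an average of cosine similarities between the worker's recent state changes and the corresponding goals over the horizon c:

\[
r^{I}_{t} = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\!\left(s_{t} - s_{t-i},\; g_{t-i}\right).
\]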
SLIDE 17

Training: Worker’s Intrinsic Reward

Reward isn’t truly hierarchical

  • They use a weighted sum of the intrinsic reward and the environment reward

Actor-Critic:
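The worker's update that belongs here, as I recall it from the paper: a standard advantage actor-critic gradient, where the return mixes the environment reward with the intrinsic reward through a weight α, and the worker has its own internal critic V^D:

\[
\nabla \pi_{t} = A^{D}_{t}\, \nabla_{\theta} \log \pi\!\left(a_{t} \mid x_{t};\, \theta\right),
\qquad
A^{D}_{t} = \left(R_{t} + \alpha R^{I}_{t}\right) - V^{D}_{t}(x_{t};\, \theta).
\]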

SLIDE 18

Why directional goals?

Feasibility

  • The worker can more easily cause directional shifts than reach an arbitrary new location in state space

Structural Generalization

  • A single sub-goal (direction) can be useful in many different locations in the state space

SLIDE 19

More Details: Dilated LSTM

  • Better able to preserve memories over long periods
  • Output is summed over previous 10 steps
  • Specific type of Dilated RNN

Dilated RNN [Chang et al. 2017]
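A minimal PyTorch sketch of the dilated LSTM as I understand it from the paper: a single set of LSTM weights with r = 10 separate state groups, only one group advanced per time step, and the output pooled (summed) over the groups' latest hidden states. Names and the lazy state initialization are mine.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Sketch: one set of LSTM weights, r separate state groups; at step t
    only group t mod r is advanced, and the output sums the groups'
    latest hidden states."""

    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r = r
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.states = None  # r per-group (h, c) pairs, lazily initialized

    def forward(self, x_t, t):
        # x_t: (batch, input_size); t: global time step
        if self.states is None:
            zeros = lambda: x_t.new_zeros(x_t.size(0), self.cell.hidden_size)
            self.states = [(zeros(), zeros()) for _ in range(self.r)]
        i = t % self.r                    # advance only one state group this step
        self.states[i] = self.cell(x_t, self.states[i])
        # pool (sum) the most recent hidden state of every group
        return torch.stack([h for h, _ in self.states]).sum(dim=0)
```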

SLIDE 20

Results: Atari

  • Outperforms the LSTM baseline whenever rewards are more delayed

SLIDE 21

Results: Water Maze

  • Circular space with an invisible goal; the agent must find the goal
  • In the next episode the agent is placed at a random location and must find the goal again
  • Left: individual episodes; right: a visualization of the sub-policies
  • The agent learns meaningful sub-goals
SLIDE 22

Results: Temporal Resolution Ablations

  • Removing the dilations from the LSTM, or running the manager at the full temporal resolution, is significantly worse

SLIDE 23

Results: Intrinsic Reward Ablations

  • The right-hand plot uses only the intrinsic reward
  • The environment reward is not necessary for good performance
SLIDE 24

Results: Atari Action Repeat Transfer

  • One of the goals of HRL was better transfer learning
  • Transfer learning is tested with different numbers of action repeats
  • The manager’s policy does not depend on how the worker achieves its goals
SLIDE 25

Summary

  • Directional rather than absolute goals are useful
  • Dilated LSTM is crucial for high performance
  • Improves long-term credit assignment over baselines
  • Improves transfer across different action repeats
  • The manager’s goals correspond to meaningful low-level behaviors from the worker
SLIDE 26

Thoughts

  • Ablation studies were crucial for getting a better idea of what is going on, something missing in a lot of DL papers.

  • Why does the worker produce a goal embedding and an action embedding, rather than just feeding everything into a fully connected network?

SLIDE 27

Questions?