Modular multitask reinforcement learning with policy sketches

Jacob Andreas, Sergey Levine and Dan Klein

The learning problem: make planks, make sticks

Learning from sketches: (get wood, use saw) and (get wood, use axe)

The options framework (figure: reward of +1) [Sutton et al. 99, Bacon & Precup 16]

Learning from intermediate rewards (figure: rewards r at intermediate points) [Kearns & Singh 02, Kulkarni et al. 16]

Learning from demonstrations (policy π) [Stolle & Precup 02, Fox & Krishnan et al. 16]

Learning from policy sketches: get wood, use saw (policy π)

Why sketches? Easy to collect. Portable.

Crafting environment:
make plank: get wood, use toolshed
make stick: get wood, use workbench
make cloth: get grass, use factory
make rope: get grass, use toolshed
make bridge: get iron, get wood
make bed ∗: get wood, use toolshed
make axe ∗: get wood, use workbench
make shears: get wood, use workbench
get gold: get iron, get wood
get gem: get wood, use workbench

Learning from policy sketches

Learning from policy sketches: make planks (get wood, use saw)

Learning from policy sketches: make sticks (get wood, use axe)

Learning from policy sketches: π_a for (get wood, use saw); π_b for (get wood, use axe) [e.g. Branavan et al. 09, Oh et al. 17, Hermann et al. 17]

Learning from policy sketches: (get wood, use saw) and (get wood, use axe)

get wood, use saw: π_1, π_2; get wood, use axe: π_1, π_3

get wood: π_1
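The sharing shown on these slides, where the symbol get wood maps to the same subpolicy π_1 in both sketches, can be illustrated with a minimal Python sketch. The names `SKETCHES`, `assign_subpolicies` and `task_policy` are hypothetical and introduced only for this example; the snippet handles the bookkeeping of which subpolicy is reused and does not model the subpolicies themselves.

```python
from typing import Dict, List

# Each task is annotated with a sketch: a sequence of subtask symbols.
# These two sketches are the ones shown on the slides.
SKETCHES: Dict[str, List[str]] = {
    "make planks": ["get wood", "use saw"],
    "make sticks": ["get wood", "use axe"],
}


def assign_subpolicies(sketches: Dict[str, List[str]]) -> Dict[str, int]:
    """Give each distinct symbol one subpolicy index, shared across tasks."""
    index: Dict[str, int] = {}
    for symbols in sketches.values():
        for symbol in symbols:
            if symbol not in index:
                index[symbol] = len(index) + 1  # 1-based, as in pi_1, pi_2, ...
    return index


def task_policy(task: str, sketches: Dict[str, List[str]],
                index: Dict[str, int]) -> List[str]:
    """Return the sequence of subpolicy names that a task's sketch selects."""
    return [f"pi_{index[symbol]}" for symbol in sketches[task]]


index = assign_subpolicies(SKETCHES)
print(task_policy("make planks", SKETCHES, index))  # ['pi_1', 'pi_2']
print(task_policy("make sticks", SKETCHES, index))  # ['pi_1', 'pi_3']
```

Because the index is keyed by symbol rather than by task, anything learned by pi_1 while making planks is available when making sticks.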

Policy representation: π_1 (get wood) = ???

Policy representation: π_1 (get wood) produces action probabilities

Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$, where $a$ is the action, $s$ the state, $r_t$ the reward and $b$ the baseline

Policy search: the same expression, shown for the get wood step

Policy search: the same expression, shown for the use axe step

Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi_{\text{SUBPOLICY}}(a \mid s)\,(r_t - b)$ (plot: reward .40)
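The policy-search slides describe a gradient of the form: sum over tasks and steps of ∇ log π(action | state) · (r_t − b). As a rough illustration, the snippet below accumulates that quantity for a tabular softmax policy in plain Python. The tabular parameterization, the function names, and the treatment of r_t as a number supplied with each step are all assumptions made for this example; the slides do not specify them.

```python
import math
from typing import List, Tuple

Step = Tuple[int, int, float]  # (state, action, r_t)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits: List[List[float]],
                    episodes: List[List[Step]],
                    baseline: float) -> List[List[float]]:
    """Sum over episodes and steps of grad log pi(a|s) * (r_t - b).

    logits[s][a] parameterizes pi(a|s) = softmax(logits[s])[a]. For this
    parameterization, d log pi(a|s) / d logits[s][k] = 1[k == a] - pi(k|s).
    """
    grad = [[0.0] * len(row) for row in logits]
    for episode in episodes:
        for state, action, reward in episode:
            probs = softmax(logits[state])
            advantage = reward - baseline
            for k, p in enumerate(probs):
                indicator = 1.0 if k == action else 0.0
                grad[state][k] += (indicator - p) * advantage
    return grad


# One state, two actions, uniform policy; one step taking action 1 with r_t = 1.
grad = policy_gradient([[0.0, 0.0]], [[(0, 1, 1.0)]], baseline=0.4)
print([[round(g, 3) for g in row] for row in grad])  # [[-0.3, 0.3]]
```

The step was rewarded above the baseline, so the gradient raises the logit of the action taken and lowers the other one.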

Improving policy search

Improving policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$, where $a$ is the action, $s$ the state, $r_t$ the reward and $b$ the baseline

Improving policy search: one term $\nabla \log \pi(a \mid s)\,(r_t - b)$ for each subpolicy (use saw, use axe, get wood, get iron) under each task (make planks, make nails); the baseline slot is left open on this slide

Improving policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi_{\text{SUBPOLICY}}(a \mid s)\,(r_t - b_{\text{TASK}})$ (plot: reward .89, compared with .40)
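The final form on these slides appears to index the policy by subpolicy and the baseline by task. The snippet below illustrates why a per-task baseline can matter when subpolicies are shared across tasks with different typical rewards. The mean-reward baseline is an assumption chosen as a simple stand-in, since the slides do not say how the baseline is computed, and the reward values are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rollout = Tuple[str, float]  # (task name, reward)


def task_baselines(rollouts: List[Rollout]) -> Dict[str, float]:
    """Mean reward per task, used here as a stand-in for b_TASK."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for task, reward in rollouts:
        totals[task] += reward
        counts[task] += 1
    return {task: totals[task] / counts[task] for task in totals}


def advantages(rollouts: List[Rollout],
               baselines: Dict[str, float]) -> List[Tuple[str, float]]:
    """Compute r - b_TASK for every rollout."""
    return [(task, reward - baselines[task]) for task, reward in rollouts]


# Illustrative rewards only: one task is sometimes solved, the other never.
rollouts = [
    ("make planks", 1.0),
    ("make planks", 0.0),
    ("make nails", 0.0),
    ("make nails", 0.0),
]

per_task = task_baselines(rollouts)
print(per_task)  # {'make planks': 0.5, 'make nails': 0.0}
print(advantages(rollouts, per_task))

# A single shared baseline (the overall mean, 0.25) would instead penalize
# every "make nails" rollout, although each is typical for that task.
shared = sum(reward for _, reward in rollouts) / len(rollouts)
print([(task, reward - shared) for task, reward in rollouts])
```

With the per-task baseline, rollouts that are merely average for their task contribute no gradient to the shared subpolicies; with the shared baseline they push those subpolicies away from behavior that may be fine.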

Do sketches help?

The maze navigation task

The maze navigation task (plot: reward vs. episodes, 0 to 3 × 10^6; curves: Sketches: modular, Sketches: joint, Unsupervised)

The mini-craft task

The mini-craft task (plot: reward vs. episodes, 0 to 3 × 10^6; curves: Sketches: modular, Sketches: joint, Unsupervised)

The cliff-walking task

The cliff-walking task (plot: log reward vs. timesteps, 0 to 3 × 10^8; curves: Sketches: modular, Sketches: joint, Unsupervised)

Zero-shot generalization: what if I see a sketch I’ve never seen before? Example: get iron, use axe

Zero-shot generalization: what if I see a sketch I’ve never seen before? (bar chart, 0 to 100: Joint vs. Modular under Multitask and Zero-shot; bar labels 89, 77, 49, 1)
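One way to picture zero-shot use of an unseen sketch such as (get iron, use axe) is to run already-trained subpolicies one after another. The toy executor below is a sketch under strong assumptions: the `(action, done)` return convention, the `run_sketch` helper and the stand-in subpolicies are all hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

# Assumed convention: a subpolicy maps a state to (action, done), where
# done=True hands control to the next symbol in the sketch.
Subpolicy = Callable[[dict], Tuple[str, bool]]


def run_sketch(sketch: List[str], subpolicies: Dict[str, Subpolicy],
               state: dict, max_steps: int = 100) -> List[Tuple[str, str]]:
    """Execute the subpolicies named by a sketch, in order."""
    unknown = [symbol for symbol in sketch if symbol not in subpolicies]
    if unknown:
        raise KeyError(f"no trained subpolicy for: {unknown}")
    trace: List[Tuple[str, str]] = []
    for symbol in sketch:
        for _ in range(max_steps):
            action, done = subpolicies[symbol](state)
            trace.append((symbol, action))
            if done:
                break
    return trace


# Stand-in subpolicies that finish after one step (placeholders only).
subpolicies: Dict[str, Subpolicy] = {
    "get wood": lambda state: ("collect wood", True),
    "get iron": lambda state: ("collect iron", True),
    "use axe": lambda state: ("swing axe", True),
}

# This sketch was never paired with a training task, but every symbol in it
# already has a subpolicy, so it can be executed without further training.
print(run_sketch(["get iron", "use axe"], subpolicies, state={}))
# [('get iron', 'collect iron'), ('use axe', 'swing axe')]
```

The executor only works when every symbol in the new sketch was seen during training; a sketch with an unfamiliar symbol raises an error instead of guessing.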

Fast adaptation: what if I don’t get a sketch at test time? (sketch: ???)

Fast adaptation: what if I don’t get a sketch at test time? (bar chart, 0 to 100: Unsupervised vs. Sketches under Multitask and Adaptation; bar labels 89, 77, 47, 1)

Fast adaptation: what if I don’t get a sketch at test time? (bar chart, 0 to 100: Unsupervised vs. Sketches under Multitask and Adaptation; bar labels 89, 76, 47, 42)

Conclusions

A tiny bit of data goes a long way

Crafting environment:
make plank: get wood, use toolshed
make stick: get wood, use workbench
make cloth: get grass, use factory
make rope: get grass, use toolshed
make bridge: get iron, get wood, use factory
make bed ∗: get wood, use toolshed, get grass, use workbench
make axe ∗: get wood, use workbench, get iron, use toolshed
make shears: get wood, use workbench, get iron, use workbench
get gold: get iron, get wood, use factory, use bridge
get gem: get wood, use workbench, get iron, use toolshed, use axe
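To make the size of this annotation concrete, the crafting-environment sketches listed on this slide can be written down as a small Python dictionary and counted. The variable names are hypothetical, and the ∗ markers on make bed and make axe are omitted from the keys.

```python
from collections import Counter
from typing import Dict, List

# Sketches transcribed from the crafting-environment slide.
CRAFTING_SKETCHES: Dict[str, List[str]] = {
    "make plank": ["get wood", "use toolshed"],
    "make stick": ["get wood", "use workbench"],
    "make cloth": ["get grass", "use factory"],
    "make rope": ["get grass", "use toolshed"],
    "make bridge": ["get iron", "get wood", "use factory"],
    "make bed": ["get wood", "use toolshed", "get grass", "use workbench"],
    "make axe": ["get wood", "use workbench", "get iron", "use toolshed"],
    "make shears": ["get wood", "use workbench", "get iron", "use workbench"],
    "get gold": ["get iron", "get wood", "use factory", "use bridge"],
    "get gem": ["get wood", "use workbench", "get iron", "use toolshed",
                "use axe"],
}

# Distinct symbols, i.e. the number of subpolicies that would be needed.
vocabulary = sorted({s for sketch in CRAFTING_SKETCHES.values() for s in sketch})

# Total number of symbols written by the annotator across all tasks.
total_symbols = sum(len(sketch) for sketch in CRAFTING_SKETCHES.values())

# How many tasks each symbol appears in (counted once per task).
tasks_per_symbol = Counter(
    symbol for sketch in CRAFTING_SKETCHES.values() for symbol in set(sketch)
)

print(len(CRAFTING_SKETCHES), "tasks")      # 10 tasks
print(len(vocabulary), "distinct symbols")  # 8 distinct symbols
print(total_symbols, "symbols in total")    # 32 symbols in total
print(tasks_per_symbol.most_common(3))
```

Ten tasks are covered by 32 symbols drawn from a vocabulary of eight, and get wood alone appears in eight of the ten sketches, which is where sharing a subpolicy across tasks has the most to offer.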

Thank you! https://github.com/jacobandreas/psketch
