SLIDE 1

Advice-Based Exploration in Model-Based Reinforcement Learning

Rodrigo Toro Icarte1,2 Toryn Q. Klassen1 Richard Valenzano1,3 Sheila A. McIlraith1

1University of Toronto, Toronto, Canada

{rntoro,toryn,rvalenzano,sheila}@cs.toronto.edu

2Vector Institute, Toronto, Canada 3Element AI, Toronto, Canada

May 11, 2018

SLIDE 3

Motivation

Reinforcement Learning (RL) is a way of discovering how to act. It balances:

  • exploration by performing random actions
  • exploitation by performing actions that led to rewards

Applications include Atari games (Mnih et al., 2015), board games (Silver et al., 2017), and data center cooling1. However, very large amounts of training data are often needed.

¹www.technologyreview.com/s/601938/the-ai-that-cut-googles-energy-bill-could-soon-help-you/

SLIDE 4

Humans learning how to behave aren’t limited to pure RL.

Humans can use

  • demonstrations
  • feedback
  • advice

What is advice? Recommendations regarding behaviour that

  • may describe suboptimal ways of doing things,
  • may not be universally applicable,
  • or may even contain errors.

Even in these cases, people often extract value from advice, and we aim to have RL agents do likewise.

SLIDE 5

Our contributions

  • We make the first proposal to use Linear Temporal Logic (LTL) to advise reinforcement learners.
  • We show how to use LTL advice to do model-based RL faster (as demonstrated in experiments).

SLIDE 6

Outline

  • background
    • MDPs
    • reinforcement learning
    • model-based reinforcement learning
  • advice
    • the language of advice: LTL
    • using advice to guide exploration
  • experimental results

SLIDE 7

Running example

Actions:

  • move left, move right, move up, move down
  • They fail with probability 0.2

Rewards:

  • Door +1000; nail -10; step -1

Goal:

  • Maximize cumulative reward

SLIDE 8

Markov Decision Process

M = ⟨S, s0, A, γ, T, R⟩

  • S is a finite set of states.
  • s0 ∈ S is the initial state.
  • A is a finite set of actions.
  • γ is the discount factor.
  • T(s′|s, a) is the transition probability function.
  • R(s, a) is the reward function.

Goal: Find the optimal policy π∗(a|s)
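To ground the notation, here is one minimal way to represent such a tabular MDP in code (a sketch; the class name and array layout are our own, not from the talk):

```python
class MDP:
    """A finite MDP M = <S, s0, A, gamma, T, R> with tabular T and R."""
    def __init__(self, n_states, n_actions, s0, gamma, T, R):
        self.n_states = n_states    # |S|
        self.n_actions = n_actions  # |A|
        self.s0 = s0                # index of the initial state
        self.gamma = gamma          # discount factor
        self.T = T                  # T[s, a, s'] = T(s'|s, a), shape (|S|, |A|, |S|)
        self.R = R                  # R[s, a] = R(s, a), shape (|S|, |A|)
```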

SLIDE 9

Given the model, we can compute an optimal policy.

We can compute π∗ by solving the Bellman equation:

Q∗(s, a) = R(s, a) + γ Σ_{s′} T(s′|s, a) max_{a′} Q∗(s′, a′)

and then acting greedily: π∗(s) = argmax_a Q∗(s, a).
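For illustration, here is a value-iteration sketch that solves this fixed point for a tabular model (our own code, assuming NumPy arrays T[s, a, s′] and R[s, a] as above):

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate Q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) * max_a' Q(s',a')."""
    Q = np.zeros_like(R)
    while True:
        # T @ v contracts over s': result has shape (|S|, |A|).
        Q_new = R + gamma * (T @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=1)  # Q* and the greedy policy pi*
        Q = Q_new
```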

SLIDE 10

What if we don’t know T(s′|s, a) or R(s, a)?

Reinforcement learning methods try to find π∗(a|s) by sampling from T(s′|s, a) and R(s, a).

SLIDE 11

Reinforcement Learning

[Diagram: the agent–environment interaction loop, from Sutton and Barto (1998, Figure 3.1)]

SLIDE 16

Two kinds of reinforcement learning

model-free RL: a policy is learned without explicitly learning T and R

model-based RL: T and R are learned, and a policy is constructed based on them

SLIDE 17

Model-Based Reinforcement Learning

Idea: Estimate R and T from experience (by counting):

R̂(s, a) = (1 / n(s, a)) Σ_{i=1}^{n(s,a)} r_i

T̂(s′|s, a) = n(s, a, s′) / n(s, a)

While learning the model, how should the agent behave?
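Before turning to that question, here is a sketch of the counting estimates themselves (our own class and names; each observed transition (s, a, r, s′) updates the counts):

```python
import numpy as np

class EmpiricalModel:
    """Maximum-likelihood estimates of R and T from transition counts."""
    def __init__(self, n_states, n_actions):
        self.n_sa = np.zeros((n_states, n_actions))             # n(s, a)
        self.n_sas = np.zeros((n_states, n_actions, n_states))  # n(s, a, s')
        self.r_sum = np.zeros((n_states, n_actions))            # sum_i r_i

    def update(self, s, a, r, s_next):
        """Record one experienced transition (s, a, r, s')."""
        self.n_sa[s, a] += 1
        self.n_sas[s, a, s_next] += 1
        self.r_sum[s, a] += r

    def estimates(self):
        """Return R^(s, a) and T^(s'|s, a); unvisited pairs default to 0."""
        n = np.maximum(self.n_sa, 1)
        return self.r_sum / n, self.n_sas / n[:, :, None]
```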

SLIDE 18

Algorithms for Model-Based Reinforcement Learning

We’ll consider MBIE-EB (Strehl and Littman, 2008); the paper also covers R-MAX, another model-based algorithm.

  • Initialize Q̂(s, a) optimistically: Q̂(s, a) = Rmax / (1 − γ)
  • Compute the optimal policy with an exploration bonus:

Q̂∗(s, a) = R̂(s, a) + γ Σ_{s′} T̂(s′|s, a) max_{a′} Q̂∗(s′, a′) + β / √(n(s, a))

The first part is the Bellman equation (with estimates for R and T); the last term is the exploration bonus.
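Read as code, the computation might look like this (a simplified sketch of Strehl and Littman's method, reusing the empirical estimates above; the fixed iteration count stands in for solving the equations exactly):

```python
import numpy as np

def mbie_eb(R_hat, T_hat, n_sa, gamma, beta, r_max, iters=1000):
    """Q(s,a) = R^(s,a) + gamma * sum_s' T^(s'|s,a) max_a' Q(s',a')
                + beta / sqrt(n(s,a)), initialized optimistically."""
    Q = np.full(R_hat.shape, r_max / (1.0 - gamma))  # optimistic initialization
    bonus = beta / np.sqrt(np.maximum(n_sa, 1.0))    # exploration bonus
    for _ in range(iters):
        Q = R_hat + gamma * (T_hat @ Q.max(axis=1)) + bonus
    return Q  # act greedily w.r.t. Q to explore
```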

SLIDE 19

MBIE-EB in action

[Videos: the agent's behaviour during training and at test time.]

How can we help this agent?

SLIDE 20

Outline

  • background
    • MDPs
    • reinforcement learning
    • model-based reinforcement learning
  • advice
    • the language of advice: LTL
    • using advice to guide exploration
  • experimental results

SLIDE 21

Advice

Advice examples:

  • Get the key and then go to the door
  • Avoid nails

What we want to achieve with advice:

  • speed up learning (if the advice is good)
  • not rule out possible solutions (even if the advice is bad)

SLIDE 22

Vocabulary

To give advice, we need to be able to describe the MDP in a symbolic way.

  • Use a labeling function L : S → 2^Σ
  • e.g., at(key) ∈ L(s) iff the location of the agent equals the location of the key in state s.

SLIDE 23

The language: LTL advice

Linear Temporal Logic (LTL) (Pnueli, 1977) provides temporal operators: next ϕ, ϕ1 until ϕ2, always ϕ, eventually ϕ.

LTL advice examples:

  • “Get the key and then go to the door” becomes eventually(at(key) ∧ next eventually(at(door)))
  • “Avoid nails” becomes always(∀(x ∈ nails).¬at(x))

SLIDE 24

Tracking progress in following advice

LTL advice: “Get the key and then go to the door”, i.e., eventually(at(key) ∧ next eventually(at(door)))

Corresponding NFA: states u0 (initial), u1, u2; edges u0 → u1 on at(key) and u1 → u2 on at(door); every state has a self-loop on true.
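Progress can be tracked by running the NFA alongside the agent, keeping the set of states reachable so far; a sketch for this particular automaton (the string encodings of states and propositions are our own):

```python
def step_nfa(states, labels):
    """Advance the NFA for eventually(at(key) ∧ next eventually(at(door))).
    `states` is the current set of NFA states; `labels` is the set of
    propositions true in the current MDP state (from the labeling function L)."""
    nxt = set()
    for u in states:
        nxt.add(u)                      # every state has a self-loop on true
        if u == "u0" and "at(key)" in labels:
            nxt.add("u1")               # key reached: progress toward the door
        if u == "u1" and "at(door)" in labels:
            nxt.add("u2")               # door reached: advice satisfied
    return nxt

# Example: starting in {u0}, observing at(key) and then at(door).
states = step_nfa({"u0"}, {"at(key)"})   # -> {"u0", "u1"}
states = step_nfa(states, {"at(door)"})  # -> {"u0", "u1", "u2"}
```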

SLIDE 25

Tracking progress in following advice

LTL advice: “Avoid nails”, i.e., always(∀(x ∈ nails).¬at(x))

Corresponding NFA: states v0 (initial) and v1; edges v0 → v1 and v1 → v1, both labelled ∀(n ∈ nails).¬at(n).

SLIDE 26

Guidance and avoiding dead-ends

[The two NFAs from the previous slides: guidance toward the key and door, and avoidance of nails.]

From these, we can compute:

  • a guidance formula ϕ̂guide
  • a dead-end avoidance formula ϕ̂ok

SLIDE 27

The background knowledge function

We use a function h : S × A × LΣ → N to estimate the number of actions needed to make formulas true.

  • the value of h(s, a, ℓ) must be specified for every literal ℓ
  • e.g., we estimate the number of actions needed to make at(c) true using the Manhattan distance to c
  • estimates for conjunctions and disjunctions are computed by taking maximums and minimums, respectively (see the sketch below)
  • e.g., h(s, a, at(key1) ∨ at(key2)) = min{h(s, a, at(key1)), h(s, a, at(key2))}
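A sketch of the recursive evaluation (the tuple encoding of formulas and the `literal_h` callback are our own illustration):

```python
def h(s, a, formula, literal_h):
    """Estimate the number of actions needed to make `formula` true after
    taking action a in state s. Formulas are a literal string,
    ('and', f1, f2), or ('or', f1, f2). `literal_h(s, a, lit)` supplies the
    required estimate for each literal (e.g., Manhattan distance for at(c))."""
    if isinstance(formula, str):                 # literal: use supplied estimate
        return literal_h(s, a, formula)
    op, f1, f2 = formula
    if op == "and":                              # conjunction: take the maximum
        return max(h(s, a, f1, literal_h), h(s, a, f2, literal_h))
    if op == "or":                               # disjunction: take the minimum
        return min(h(s, a, f1, literal_h), h(s, a, f2, literal_h))
    raise ValueError(f"unknown operator: {op}")

# e.g., h(s, a, ("or", "at(key1)", "at(key2)"), literal_h)
#     = min{h(s, a, at(key1)), h(s, a, at(key2))}
```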

SLIDE 28

Using h with the guidance and avoidance formulas

ĥ(s, a) = h(s, a, ϕ̂guide) if h(s, a, ϕ̂ok) = 0, and h(s, a, ϕ̂guide) + C otherwise.

Example (the NFAs from earlier, in their initial states):

ϕ̂guide = at(key)    ϕ̂ok = ∀(x ∈ nails).¬at(x)
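In code, the combination is short (a sketch reusing the `h` function sketched above; the default value of the penalty constant C is our own placeholder):

```python
def h_hat(s, a, phi_guide, phi_ok, literal_h, C=1000):
    """Prefer actions that make progress on the guidance formula, adding a
    penalty C when the action is predicted to violate the dead-end
    avoidance formula (i.e., when h(s, a, phi_ok) != 0)."""
    penalty = 0 if h(s, a, phi_ok, literal_h) == 0 else C
    return h(s, a, phi_guide, literal_h) + penalty
```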

SLIDE 29

MBIE-EB with advice

  • Initialize Q̂(s, a) optimistically, blended with the advice estimate: Q̂(s, a) = α(−ĥ(s, a)) + (1 − α) Rmax / (1 − γ)
  • Compute the optimal policy with an exploration bonus:

Q̂∗(s, a) = α(−1) + (1 − α) R̂(s, a) + γ Σ_{s′} T̂(s′|s, a) max_{a′} Q̂∗(s′, a′) + β / √(n(s, a))
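Putting the pieces together, the advice-weighted variant changes only the initialization and the reward term of the MBIE-EB sketch above (our own code; `h_hat_arr[s, a]` is assumed precomputed from ĥ, and α trades advice against the learned model):

```python
import numpy as np

def mbie_eb_with_advice(R_hat, T_hat, n_sa, h_hat_arr, gamma, beta, r_max,
                        alpha, iters=1000):
    """MBIE-EB blended with advice: the advice contributes -h^(s,a) to the
    optimistic initialization and a per-step cost of -1 to the reward,
    each weighted by alpha."""
    Q = alpha * (-h_hat_arr) + (1 - alpha) * (r_max / (1.0 - gamma))
    reward = alpha * (-1.0) + (1 - alpha) * R_hat  # blended reward
    bonus = beta / np.sqrt(np.maximum(n_sa, 1.0))  # exploration bonus
    for _ in range(iters):
        Q = reward + gamma * (T_hat @ Q.max(axis=1)) + bonus
    return Q
```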

SLIDE 30

Advice in action

[Videos: the agent's behaviour during training and at test time.]

Advice: get the key and then go to the door.

SLIDE 31

Advice can improve performance.

[Plot: normalized reward (−2 to 1) vs. number of training steps (up to 15,000), comparing No advice and Using advice.]

Advice: get the key and then go to the door, and avoid nails

SLIDE 32

Less complete advice is also useful.

[Plot: normalized reward (−2 to 1) vs. number of training steps (up to 15,000), comparing No advice and Using advice.]

Advice: get the key and then go to the door

SLIDE 33

As advice quality declines, so do early results.

[Plot: normalized reward (−2 to 1) vs. number of training steps (up to 15,000), comparing No advice and Using advice.]

Advice: get the key

SLIDE 34

Bad advice can be recovered from.

[Plot: normalized reward (−2 to 1) vs. number of training steps (up to 15,000), comparing No advice and Using advice.]

Advice: go to every nail

SLIDE 35

A larger experiment (with an R-MAX-based algorithm)

Advice: for every key in the map, get it and then go to a door; avoid nails and holes; get all the cookies

SLIDE 36

Conclusion

  • Our approach can use LTL advice to reduce the training required, while being robust to misleading advice.
  • The R-MAX-based algorithm in the paper provably converges to the optimal policy for deterministic MDPs.
  • For using LTL to define tasks, see our AAMAS 2018 paper “Teaching Multiple Tasks to an RL Agent using LTL”.
  • Ideas for future work:
    • Learn the background knowledge function.
    • Use LTL advice in model-free RL as well.
    • Incorporate background knowledge that doesn’t just give numeric estimates but expresses propositions, e.g., that halls normally lead to doors.

Questions?

SLIDE 37

References

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi:10.1038/nature14236.

Amir Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 46–57, 1977. doi:10.1109/SFCS.1977.32.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017. URL http://arxiv.org/abs/1712.01815.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. doi:10.1016/j.jcss.2007.08.009.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.