Chapter 8: Planning and Learning
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (slide deck)


SLIDE 1: Unified View

[Figure: the space of backup operations, organized along two axes: the width of the backup (sample backups to full backups) and the height (depth) of the backup (one-step backups to full trajectories). Temporal-difference learning (sample, shallow), dynamic programming (full, shallow), Monte Carlo (sample, deep), and exhaustive search (full, deep) occupy the four corners.]

SLIDE 2: Chapter 8: Planning and Learning

Objectives of this chapter:
  • To think more generally about uses of environment models
  • Integration of (unifying) planning, learning, and execution
  • “Model-based reinforcement learning”

SLIDE 3: Paths to a policy

[Diagram: model-based RL shown as paths among Experience, Model, Value function, and Policy. Environmental interaction produces experience; model learning builds a model from experience; simulation generates experience from the model; direct RL methods update the value function from experience; direct planning updates it from the model; greedification derives the policy from the value function.]

SLIDE 4: Models

  • Model: anything the agent can use to predict how the environment will respond to its actions
  • Distribution model: description of all possibilities and their probabilities
      e.g., p(s′, r | s, a) for all s, a, s′, r
  • Sample model, a.k.a. a simulation model
      produces sample experiences for given s, a
      allows reset, exploring starts
      often much easier to come by
  • Both types of models can be used to produce hypothetical experience
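To make the distinction concrete, here is a minimal Python sketch (not from the slides): a tabular distribution model stored as p(s′, r | s, a), with a sample model built on top of it. The states, actions, and the name `sample_model` are illustrative.

```python
import random

# Distribution model: for each (s, a), every possible (s', r) outcome
# together with its probability -- a tabular p(s', r | s, a).
distribution_model = {
    ("s0", "right"): [(("s1", 0.0), 0.9), (("s0", 0.0), 0.1)],
    ("s1", "right"): [(("goal", 1.0), 1.0)],
}

def sample_model(state, action):
    """Sample model: return one sampled (s', r) for the given (s, a)."""
    outcomes = distribution_model[(state, action)]
    pairs, probs = zip(*outcomes)
    next_state, reward = random.choices(pairs, weights=probs)[0]
    return next_state, reward
```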

SLIDE 5: Planning

  • Planning: any computational process that uses a model to create or improve a policy
  • Planning in AI:
      state-space planning
      plan-space planning (e.g., partial-order planners)
  • We take the following (unusual) view:
      all state-space planning methods involve computing value functions, either explicitly or implicitly
      they all apply backups to simulated experience

SLIDE 6: Planning Cont.

  • Classical DP methods are state-space planning methods
  • Heuristic search methods are state-space planning methods
  • A planning method based on Q-learning:

Random-Sample One-Step Tabular Q-Planning

Do forever:
  1. Select a state, S ∈ 𝒮, and an action, A ∈ 𝒜(S), at random
  2. Send S, A to a sample model, and obtain a sample next reward, R, and a sample next state, S′
  3. Apply one-step tabular Q-learning to S, A, R, S′:
       Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
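A direct Python transcription of this loop (a sketch, not the book's code), assuming a `sample_model(s, a)` function like the one on Slide 4; `n_steps` truncates the "Do forever" loop, and `alpha`/`gamma` are illustrative parameter choices:

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning."""
    Q = defaultdict(float)                 # Q[(s, a)], initialized to 0
    for _ in range(n_steps):               # "Do forever", truncated
        s = random.choice(states)          # 1. random state and action
        a = random.choice(actions)
        s_next, r = sample_model(s, a)     # 2. query the sample model
        # 3. one-step tabular Q-learning backup on the simulated transition
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```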

SLIDE 7: Paths to a policy

[Diagram: the same paths-to-a-policy diagram as Slide 3, now labeled Dyna: Dyna follows all of the paths at once, combining environmental interaction, model learning, simulation, direct RL, planning, and greedification.]

SLIDE 8: Learning, Planning, and Acting

  • Two uses of real experience:
      model learning: to improve the model
      direct RL: to directly improve the value function and policy
  • Improving the value function and/or policy via a model is sometimes called indirect RL. Here, we call it planning.

SLIDE 9: Direct (model-free) vs. Indirect (model-based) RL

  • Indirect methods:
      make fuller use of experience: get a better policy with fewer environment interactions
  • Direct methods:
      simpler
      not affected by bad models
  • But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

SLIDE 10: The Dyna Architecture

[Figure: the Dyna architecture, combining direct RL updates from real experience with planning updates from model-generated (simulated) experience.]

SLIDE 11: The Dyna-Q Algorithm

Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
Do forever:
  (a) S ← current (nonterminal) state
  (b) A ← ε-greedy(S, Q)
  (c) Execute action A; observe resultant reward, R, and state, S′
  (d) Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]        ← direct RL
  (e) Model(S, A) ← R, S′  (assuming deterministic environment)     ← model learning
  (f) Repeat n times:                                               ← planning
        S ← random previously observed state
        A ← random action previously taken in S
        R, S′ ← Model(S, A)
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
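The same algorithm as a Python sketch. The `env` interface (`reset()` and `step(s, a)` returning `(s_next, r, done)`) is an assumed, illustrative one; terminal-state values stay at their zero initialization, which serves as the terminal backup here:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_planning, episodes, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q for a deterministic environment."""
    Q = defaultdict(float)
    model = {}                       # Model[(s, a)] = (r, s')
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # (b) epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            # (c) act in the real environment
            s_next, r, done = env.step(s, a)
            # (d) direct RL: one-step Q-learning on the real transition
            best = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s_next)
            # (f) planning: n backups on simulated transitions
            for _ in range(n_planning):
                ps, pa = random.choice(list(model))   # previously observed pair
                pr, ps_next = model[(ps, pa)]
                pbest = max(Q[(ps_next, a2)] for a2 in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s_next
    return Q
```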

SLIDE 12: Dyna-Q on a Simple Maze

Rewards are 0 on all transitions until the goal is reached, when the reward is 1.

SLIDE 13: Dyna-Q Snapshots: Midway in 2nd Episode

[Figure: two maze snapshots from start S to goal G, comparing policies midway through the second episode: without planning (n = 0) and with planning (n = 50). With planning, a much more extensive policy has already been learned.]

SLIDE 14: When the Model is Wrong: Blocking Maze

The changed environment is harder.

SLIDE 15: When the Model is Wrong: Shortcut Maze

The changed environment is easier.

SLIDE 16: What is Dyna-Q+?

  • Uses an “exploration bonus”:
      Keeps track of the time since each state-action pair was tried for real
      An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
      The agent actually “plans” how to visit long-unvisited states
  • Planning backups use the reward R + κ√τ, where τ is the time since last visiting the state-action pair
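As a sketch, the only change from the Dyna-Q planning loop is the bonus added to the model's reward. The table `tau` (steps since each pair was last tried for real) and the parameter `kappa` are illustrative names:

```python
import math
import random

def dyna_q_plus_planning(Q, model, tau, actions, n_planning,
                         alpha=0.1, gamma=0.95, kappa=0.001):
    """Planning loop of Dyna-Q+: identical to Dyna-Q except that each
    simulated reward gets an exploration bonus kappa * sqrt(tau)."""
    for _ in range(n_planning):
        s, a = random.choice(list(model))
        r, s_next = model[(s, a)]
        r += kappa * math.sqrt(tau[(s, a)])   # exploration bonus
        best = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```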

SLIDE 17: Prioritized Sweeping

  • Which states or state-action pairs should be generated during planning?
  • Work backwards from states whose values have just changed:
      Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
      When a new backup occurs, insert predecessors according to their priorities
      Always perform backups from the first in the queue
  • Moore & Atkeson 1993; Peng & Williams 1993
      improved by McMahan & Gordon 2005; van Seijen 2013

SLIDE 18: Prioritized Sweeping

Initialize Q(s, a), Model(s, a), for all s, a, and PQueue to empty
Do forever:
  (a) S ← current (nonterminal) state
  (b) A ← policy(S, Q)
  (c) Execute action A; observe resultant reward, R, and state, S′
  (d) Model(S, A) ← R, S′
  (e) P ← |R + γ max_a Q(S′, a) − Q(S, A)|
  (f) if P > θ, then insert S, A into PQueue with priority P
  (g) Repeat n times, while PQueue is not empty:
        S, A ← first(PQueue)
        R, S′ ← Model(S, A)
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        Repeat, for all S̄, Ā predicted to lead to S:
          R̄ ← predicted reward for S̄, Ā, S
          P ← |R̄ + γ max_a Q(S, a) − Q(S̄, Ā)|
          if P > θ, then insert S̄, Ā into PQueue with priority P
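A Python sketch of the planning part of this loop, reusing the tabular `Q` and `model` from the Dyna-Q sketch. The `predecessors` table (pairs the model predicts can lead to a given state) and the threshold `theta` are illustrative names; Python's `heapq` is a min-heap, so priorities are stored negated. Unlike the boxed algorithm, this sketch seeds the queue from every modeled pair rather than only the just-experienced one:

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions,
                                  n_planning, alpha=0.1, gamma=0.95,
                                  theta=1e-4):
    """One round of prioritized-sweeping planning backups.

    predecessors[s] is the set of (s_bar, a_bar) pairs the model
    predicts can lead to state s."""
    pqueue = []  # entries are (-priority, (s, a)); heapq pops the smallest

    def push(s, a):
        r, s_next = model[(s, a)]
        p = abs(r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                - Q[(s, a)])
        if p > theta:
            heapq.heappush(pqueue, (-p, (s, a)))

    # Seed the queue (in Dyna, you would seed only from the transition
    # just experienced; seeding from all modeled pairs keeps this
    # sketch self-contained).
    for (s, a) in model:
        push(s, a)

    for _ in range(n_planning):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)      # highest-priority pair
        r, s_next = model[(s, a)]
        best = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        for (s_bar, a_bar) in predecessors.get(s, ()):  # work backwards
            push(s_bar, a_bar)
    return Q
```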

SLIDE 19: Prioritized Sweeping vs. Dyna-Q

Both use n = 5 backups per environmental interaction.

SLIDE 20: Rod Maneuvering (Moore and Atkeson 1993)

SLIDE 21: Improved Prioritized Sweeping with Small Backups

  • Planning is a form of state-space search: a massive computation which we want to control to maximize its efficiency
  • Prioritized sweeping is a form of search control: focusing the computation where it will do the most good
  • But can we focus better? Can we focus more tightly?
  • Small backups are perhaps the smallest unit of search work, and thus permit the most flexible allocation of effort

SLIDE 22: Full and Sample (One-Step) Backups

[Figure: one-step backup diagrams arranged by the value being estimated (rows) and the width of the backup (columns); see the formulas after the table.]

  Value estimated | Full backups (DP)   | Sample backups (one-step TD)
  ----------------|---------------------|-----------------------------
  vπ(s)           | policy evaluation   | TD(0)
  v*(s)           | value iteration     |
  qπ(s, a)        | Q-policy evaluation | Sarsa
  q*(s, a)        | Q-value iteration   | Q-learning
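Written out for q*, the two columns correspond to the following updates (the sample backup matches the Q-learning update used throughout the deck; the full backup is the standard expected DP update over the distribution model):

```latex
% Full backup (Q-value iteration): expectation under the distribution model
Q(s, a) \leftarrow \sum_{s', r} p(s', r \mid s, a)
         \left[ r + \gamma \max_{a'} Q(s', a') \right]

% Sample backup (Q-learning): a single sampled transition, step size alpha
Q(S, A) \leftarrow Q(S, A)
         + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
```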

SLIDE 23: Heuristic Search

  • Used for action selection, not for changing a value function (the heuristic evaluation function)
  • Backed-up values are computed, but typically discarded
  • Extension of the idea of a greedy policy, only deeper
  • Also suggests ways to select states to back up: smart focusing

SLIDE 24: Summary

  • Emphasized the close relationship between planning and learning
  • Important distinction between distribution models and sample models
  • Looked at some ways to integrate planning and learning:
      synergy among planning, acting, model learning
  • Distribution of backups: focus of the computation
      prioritized sweeping
      small backups
      sample backups
      trajectory sampling: backup along trajectories
      heuristic search
  • Size of backups: full/sample/small; deep/shallow