
Module 9: LAO* - CS 886 Sequential Decision Making and Reinforcement Learning - PowerPoint PPT Presentation



  1. Module 9 LAO* CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

  2. Large State Space • Value Iteration, Policy Iteration and Linear Programming – Complexity at least quadratic in |S| • Problem: |S| may be very large – Queuing problems: infinite state space – Factored problems: exponentially many states CS886 (c) 2013 Pascal Poupart
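As a concrete illustration of the per-sweep cost mentioned above, here is a hedged value-iteration sketch; the 3-state, 2-action MDP and the arrays `P` and `R` are invented for illustration, not from the course:

```python
import numpy as np

# Minimal value-iteration sketch on a made-up 3-state, 2-action MDP.
# Each Bellman sweep touches every (s, a, s') triple, so one sweep costs
# O(|S|^2 |A|) work -- at least quadratic in |S|, as the slide notes.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)      # one full backup, shape (|S|, |A|)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # sup-norm convergence test
        break
    V = V_new
```

For queuing problems (infinite |S|) or factored problems (exponential |S|) the dense arrays above cannot even be stored, which motivates the envelope-based approach in the rest of the deck.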

  3. Mitigate Size of State Space • Two ideas: • Exploit initial state – Not all states are reachable • Exploit heuristic h – approximation of the optimal value function – usually an upper bound: h(s) ≥ V*(s) ∀s

  4. State Space [Figure: nested regions – the state space S, the states reachable from s0, and the states reachable by π*]
  6. LAO* Algorithm • Related to – A*: path heuristic search – AO*: tree heuristic search – LAO*: cyclic graph heuristic search • LAO* alternates between – State space expansion – Policy optimization • value iteration, policy iteration, linear programming

  7. Terminology • S: state space • S_E ⊆ S: envelope – growing set of states • S_T ⊆ S_E: terminal states – states whose children are not in the envelope • S_s0^π ⊆ S_E: states reachable from s0 by following π • h(s): heuristic such that h(s) ≥ V*(s) ∀s – E.g., h(s) = max_{s,a} R(s,a) / (1 - γ)
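The example heuristic above is an upper bound because every discounted return sums rewards that are each at most max_{s,a} R(s,a). A tiny numeric sanity check (the values R_max = 1 and γ = 0.9 are made up):

```python
# Why h(s) = max_{s,a} R(s,a) / (1 - gamma) upper-bounds V*(s):
# every discounted return sums rewards each at most R_max, so
# V*(s) <= sum_t gamma^t * R_max = R_max / (1 - gamma).
gamma = 0.9
R_max = 1.0                      # assumed bound on rewards (made up)
h = R_max / (1 - gamma)          # admissible upper-bound heuristic
# Even a long run of maximal rewards stays below h:
ret = sum(gamma ** t * R_max for t in range(10_000))
assert ret <= h
```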

  8. LAO* Algorithm
     LAO*(MDP, heuristic h)
       S_E ← {s0}, S_T ← {s0}
       Repeat
         Let R_E(s, a) = h(s) if s ∈ S_T, R(s, a) otherwise
         Let T_E(s' | s, a) = 0 if s ∈ S_T, Pr(s' | s, a) otherwise
         Find optimal policy π for (S_E, R_E, T_E)
         Find reachable states S_s0^π
         Select reachable terminal states {s1, ..., sk} ⊆ S_s0^π ∩ S_T
         S_T ← (S_T \ {s1, ..., sk}) ∪ (children of {s1, ..., sk} \ S_E)
         S_E ← S_E ∪ children of {s1, ..., sk}
       Until S_s0^π ∩ S_T is empty
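The pseudocode above can be sketched in Python for a small enumerated MDP; this is a simplified reading, not the course's reference implementation. It freezes terminal states at their heuristic value V[s] = h(s) (equivalent to the slide's R_E(s,a) = h(s) and T_E(s'|s,a) = 0 for s ∈ S_T), uses value iteration restricted to the envelope as the policy-optimization step, and expands all reachable terminal states at once:

```python
import numpy as np

def lao_star(P, R, gamma, h, s0, n_vi_iters=200):
    """Sketch of LAO* for an enumerated MDP with P[s, a, s'] and R[s, a]."""
    n_states, n_actions = R.shape
    envelope = {s0}              # S_E: growing set of states
    terminal = {s0}              # S_T: states whose children are outside S_E
    V = h.astype(float).copy()   # initialized to the heuristic (upper bound)
    while True:
        # Policy optimization: value iteration over the envelope interior.
        # Terminal states keep V[s] = h(s).
        interior = envelope - terminal
        for _ in range(n_vi_iters):
            for s in interior:
                V[s] = max(R[s, a] + gamma * P[s, a] @ V
                           for a in range(n_actions))
        policy = {s: int(np.argmax([R[s, a] + gamma * P[s, a] @ V
                                    for a in range(n_actions)]))
                  for s in interior}
        # Reachable states S_s0^pi (traversal stops at terminal states).
        reachable, frontier = {s0}, [s0]
        while frontier:
            s = frontier.pop()
            if s in terminal:
                continue
            for s2 in map(int, np.flatnonzero(P[s, policy[s]])):
                if s2 not in reachable:
                    reachable.add(s2)
                    frontier.append(s2)
        expand = reachable & terminal
        if not expand:           # no reachable terminal states: converged
            return policy, V
        # Grow the envelope with the children of the expanded states.
        children = {int(s2) for s in expand for a in range(n_actions)
                    for s2 in np.flatnonzero(P[s, a])}
        terminal = (terminal - expand) | (children - envelope)
        envelope |= children
```

On a small MDP this converges after expanding only the states the greedy policy actually reaches; states never reached simply keep their heuristic value, which is exactly the complexity benefit the summary slide claims.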

  9. Efficiency • Efficiency influenced by 1. Choice of terminal states to add to the envelope 2. Algorithm to find the optimal policy – Can use value iteration, policy iteration, modified policy iteration, linear programming – Key: reuse previous computation • E.g., start with the previous policy or value function at each iteration

  10. Convergence • Theorem: LAO* converges to the optimal policy • Proof: – Fact: at each iteration, the value function V is an upper bound on V* due to the heuristic function h – Proof by contradiction: suppose the algorithm stops, but π is not optimal. • Since the algorithm stopped, all states reachable by π are in S_E \ S_T • Hence, the value function V is the value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*

  11. Summary • LAO* – Extension of basic solution algorithms (value iteration, policy iteration, linear programming) – Exploits the initial state and a heuristic function – Gradually grows an envelope of states – Complexity depends on the number of reachable states instead of the size of the state space
