Approximate Dynamic Programming

SLIDE 1

MVA-RL Course

Approximate Dynamic Programming

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

slide-2
SLIDE 2

Value Iteration: the Idea

  1. Let $V_0$ be any vector in $\mathbb{R}^N$
  2. At each iteration $k = 1, 2, \ldots, K$
     ◮ Compute $V_{k+1} = \mathcal{T} V_k$
  3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y} p(y|x,a)\, V_K(y) \Big].$$
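As a concrete illustration of the scheme above, here is a minimal value-iteration sketch for a small finite MDP, assuming the transitions and rewards are given as NumPy arrays `P[a, x, y] = p(y|x,a)` and `R[x, a] = r(x,a)` (these names are illustrative, not from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma, K):
    """Run K Bellman sweeps V_{k+1} = T V_k and return V_K with its greedy policy."""
    N = P.shape[1]
    V = np.zeros(N)                                     # step 1: any V_0 in R^N
    for _ in range(K):                                  # step 2: apply T
        Q = R + gamma * np.einsum('axy,y->xa', P, V)    # Q(x,a) = r(x,a) + gamma * sum_y p(y|x,a) V(y)
        V = Q.max(axis=1)
    pi = (R + gamma * np.einsum('axy,y->xa', P, V)).argmax(axis=1)  # step 3: greedy policy
    return V, pi
```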

SLIDE 3

Value Iteration: the Guarantees

◮ From the fixed point property of $\mathcal{T}$: $\lim_{k \to \infty} V_k = V^*$
◮ From the contraction property of $\mathcal{T}$: $\|V_{k+1} - V^*\|_\infty \leq \gamma^{k+1} \|V_0 - V^*\|_\infty \to 0$

Problem: what if $V_{k+1} \neq \mathcal{T} V_k$?

SLIDE 4

Policy Iteration: the Idea

  1. Let $\pi_0$ be any stationary policy
  2. At each iteration $k = 1, 2, \ldots, K$
     ◮ Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
     ◮ Policy improvement: compute the greedy policy
$$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y} p(y|x,a)\, V^{\pi_k}(y) \Big].$$
  3. Return the last policy $\pi_K$
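For concreteness, a minimal exact policy-iteration sketch under the same assumed array layout as the value-iteration example (`P[a, x, y]` and `R[x, a]` are illustrative names):

```python
import numpy as np

def policy_iteration(P, R, gamma, K):
    """Alternate exact policy evaluation and greedy improvement for K iterations."""
    A, N, _ = P.shape
    pi = np.zeros(N, dtype=int)                              # step 1: any stationary policy
    for _ in range(K):
        P_pi = P[pi, np.arange(N), :]                        # transition matrix under pi
        r_pi = R[np.arange(N), pi]
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)  # evaluation: V = (I - gamma P_pi)^{-1} r_pi
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        pi = Q.argmax(axis=1)                                # improvement: greedy w.r.t. V^{pi_k}
    return pi
```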

SLIDE 5

Policy Iteration: the Guarantees

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \geq V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations. Problem: what if $V_k \neq V^{\pi_k}$?

SLIDE 6

Sources of Error

◮ Approximation error: if $X$ is large or continuous, value functions $V$ cannot be represented exactly ⇒ use an approximation space $\mathcal{F}$.
◮ Estimation error: if the reward $r$ and dynamics $p$ are unknown, the Bellman operators $\mathcal{T}$ and $\mathcal{T}^\pi$ cannot be computed exactly ⇒ estimate the Bellman operators from samples.

SLIDE 7

In This Lecture

◮ Infinite horizon setting with discount $\gamma$
◮ Study the impact of the approximation error
◮ Study the impact of the estimation error in the next lecture

SLIDE 8

Performance Loss

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration

SLIDE 9

Performance Loss

From Approximation Error to Performance Loss

Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
$$\text{error} = V - V^*,$$
how does it translate into the (loss of) performance of the greedy policy
$$\pi(x) \in \arg\max_{a \in A} \sum_{y} p(y|x,a) \big[ r(x,a,y) + \gamma V(y) \big],$$
i.e.
$$\text{performance loss} = V^* - V^\pi \;?$$

SLIDE 10

Performance Loss

From Approximation Error to Performance Loss

Proposition

Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \;\leq\; \frac{2\gamma}{1-\gamma} \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \leq \epsilon$, then $\pi$ is optimal.

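For a sense of scale (a numeric illustration, not on the slides): with $\gamma = 0.9$ the factor is $\frac{2\gamma}{1-\gamma} = 18$, so an approximation error of $\|V^* - V\|_\infty = 0.1$ can already cost up to $1.8$ in the value of the greedy policy.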

SLIDE 11

Performance Loss

From Approximation Error to Performance Loss

Proof.
$$\begin{aligned}
\|V^* - V^\pi\|_\infty &\leq \|\mathcal{T} V^* - \mathcal{T}^\pi V\|_\infty + \|\mathcal{T}^\pi V - \mathcal{T}^\pi V^\pi\|_\infty \\
&\leq \|\mathcal{T} V^* - \mathcal{T} V\|_\infty + \gamma \|V - V^\pi\|_\infty \\
&\leq \gamma \|V^* - V\|_\infty + \gamma \big( \|V - V^*\|_\infty + \|V^* - V^\pi\|_\infty \big) \\
&\leq \frac{2\gamma}{1-\gamma} \|V^* - V\|_\infty,
\end{aligned}$$
where the second step uses $\mathcal{T}^\pi V = \mathcal{T} V$ (since $\pi$ is greedy w.r.t. $V$) and the last step follows by collecting the $\|V^* - V^\pi\|_\infty$ terms.

SLIDE 12

Performance Loss

From Approximation Error to Performance Loss

Question: how do we compute $V$?
Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function, i.e. $V^*$.
Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
$$V \approx \arg\inf_{f \in \mathcal{F}} \|V^* - f\|.$$

SLIDE 13

Approximate Value Iteration

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration

SLIDE 14

Approximate Value Iteration

Approximate Value Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.

  1. Let $V_0$ be any vector in $\mathbb{R}^N$
  2. At each iteration $k = 1, 2, \ldots, K$
     ◮ Compute $V_{k+1} = \mathcal{A}\mathcal{T} V_k$
  3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y} p(y|x,a)\, V_K(y) \Big].$$
SLIDE 15

Approximate Value Iteration

Approximate Value Iteration: the Idea

Let $\mathcal{A} = \Pi_\infty$ be the projection operator in $L_\infty$-norm, which corresponds to
$$V_{k+1} = \Pi_\infty \mathcal{T} V_k = \arg\inf_{V \in \mathcal{F}} \|\mathcal{T} V_k - V\|_\infty.$$

SLIDE 16

Approximate Value Iteration

Approximate Value Iteration: convergence

Proposition

The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty \mathcal{T}$ is a contraction. Then there exists a unique fixed point $\tilde{V} = \Pi_\infty \mathcal{T} \tilde{V}$, which guarantees the convergence of AVI.

SLIDE 17

Approximate Value Iteration

Approximate Value Iteration: performance loss

Proposition (Bertsekas & Tsitsiklis, 1996)
Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$$\|V^* - V^{\pi_K}\|_\infty \;\leq\; \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \leq k < K} \|\mathcal{T} V_k - \mathcal{A}\mathcal{T} V_k\|_\infty}_{\text{worst approx. error}} \;+\; \frac{2\gamma^{K+1}}{1-\gamma} \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$$

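Again for a sense of scale (a numeric illustration, not on the slides): with $\gamma = 0.9$ the leading factor is $\frac{2\gamma}{(1-\gamma)^2} = 180$, an order of magnitude larger than the $\frac{2\gamma}{1-\gamma} = 18$ of the one-shot bound above, which is the price paid for propagating the per-iteration approximation error through all the AVI iterations.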

SLIDE 18

Approximate Value Iteration

Approximate Value Iteration: performance loss

Proof. Let $\varepsilon = \max_{0 \leq k < K} \|\mathcal{T} V_k - \mathcal{A}\mathcal{T} V_k\|_\infty$. For any $0 \leq k < K$ we have
$$\|V^* - V_{k+1}\|_\infty \leq \|\mathcal{T} V^* - \mathcal{T} V_k\|_\infty + \|\mathcal{T} V_k - V_{k+1}\|_\infty \leq \gamma \|V^* - V_k\|_\infty + \varepsilon,$$
then
$$\|V^* - V_K\|_\infty \leq (1 + \gamma + \cdots + \gamma^{K-1})\, \varepsilon + \gamma^K \|V^* - V_0\|_\infty \leq \frac{1}{1-\gamma}\, \varepsilon + \gamma^K \|V^* - V_0\|_\infty.$$
Since from Proposition 1 we have $\|V^* - V^{\pi_K}\|_\infty \leq \frac{2\gamma}{1-\gamma} \|V^* - V_K\|_\infty$, we obtain
$$\|V^* - V^{\pi_K}\|_\infty \leq \frac{2\gamma}{(1-\gamma)^2}\, \varepsilon + \frac{2\gamma^{K+1}}{1-\gamma} \|V^* - V_0\|_\infty.$$

SLIDE 19

Approximate Value Iteration

Fitted Q-iteration with linear approximation

Assumption: access to a generative model.

[Generative model: given a state $x$ and an action $a$, it returns a reward $r(x,a)$ and a next state $y \sim p(\cdot|x,a)$.]

Idea: work with Q-functions and linear spaces.

◮ $Q^*$ is the unique fixed point of $\mathcal{T}$ defined over $X \times A$ as
$$\mathcal{T} Q(x,a) = \sum_{y} p(y|x,a) \big[ r(x,a,y) + \gamma \max_{b} Q(y,b) \big].$$
◮ $\mathcal{F}$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as
$$\mathcal{F} = \Big\{ Q_\alpha(x,a) = \sum_{j=1}^{d} \alpha_j \phi_j(x,a), \; \alpha \in \mathbb{R}^d \Big\}.$$

⇒ At each iteration compute $Q_{k+1} = \Pi_\infty \mathcal{T} Q_k$

SLIDE 20

Approximate Value Iteration

Fitted Q-iteration with linear approximation

⇒ At each iteration compute $Q_{k+1} = \Pi_\infty \mathcal{T} Q_k$.
Problems:
◮ the $\Pi_\infty$ operator cannot be computed efficiently
◮ the Bellman operator $\mathcal{T}$ is often unknown

SLIDE 21

Approximate Value Iteration

Fitted Q-iteration with linear approximation

Problem: the $\Pi_\infty$ operator cannot be computed efficiently. Let $\mu$ be a distribution over $X$. We use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
$$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \|Q - \mathcal{T} Q_k\|^2_\mu.$$

SLIDE 22

Approximate Value Iteration

Fitted Q-iteration with linear approximation

Problem: the Bellman operator $\mathcal{T}$ is often unknown.

  1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random,
  2. Simulate $Y_i \sim p(\cdot|X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model,
  3. Estimate $\mathcal{T} Q_k(X_i, A_i)$ with
$$Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$$
(unbiased: $\mathbb{E}[Z_i \mid X_i, A_i] = \mathcal{T} Q_k(X_i, A_i)$).

SLIDE 23

Approximate Value Iteration

Fitted Q-iteration with linear approximation

At each iteration $k$, compute $Q_{k+1}$ as
$$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\alpha(X_i, A_i) - Z_i \big)^2.$$
⇒ Since $Q_\alpha$ is a linear function in $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.

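A minimal sketch of one such fitted Q-iteration step, assuming a generative model `sample(x, a) -> (y, r)`, a feature map `phi(x, a)` returning a vector in $\mathbb{R}^d$, and lists of sampled states and available actions (all these names are illustrative):

```python
import numpy as np

def fitted_q_step(alpha, phi, sample, states, actions, gamma, rng):
    """One iteration of fitted Q-iteration with linear features:
    build targets Z_i = R_i + gamma * max_b Q_alpha(Y_i, b), then least-squares fit alpha_{k+1}."""
    rows, targets = [], []
    for x in states:                                    # X_i ~ mu (assumed already sampled)
        a = actions[rng.integers(len(actions))]         # A_i random
        y, r = sample(x, a)                             # generative model: Y_i, R_i
        z = r + gamma * max(phi(y, b) @ alpha for b in actions)   # unbiased estimate of T Q_k
        rows.append(phi(x, a))
        targets.append(z)
    Phi = np.vstack(rows)
    new_alpha, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return new_alpha
```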

SLIDE 24

Approximate Value Iteration

Other implementations

◮ K-nearest neighbour
◮ Regularized linear regression with $L_1$ or $L_2$ regularisation
◮ Neural network
◮ Support vector machine

SLIDE 25

Approximate Value Iteration

Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost:
◮ $c(x, R) = C$
◮ $c(x, K) = c(x)$: maintenance plus extra costs.
Dynamics:
◮ $p(\cdot|x, R) = \exp(\beta)$ with density $d(y) = \beta e^{-\beta y}\, \mathbb{I}\{y \geq 0\}$,
◮ $p(\cdot|x, K) = x + \exp(\beta)$ with density $d(y - x)$.
Problem: minimize the discounted expected cost over an infinite horizon.

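A minimal sketch of a generative model for this problem, assuming $\beta$, the replacement cost `C`, and a maintenance-cost function `cost(x)` are given (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, action, beta, C, cost):
    """One transition of the optimal replacement problem.
    Replace: wear restarts from an Exp(beta) draw; Keep: wear increases by an Exp(beta) increment."""
    if action == "R":
        y = rng.exponential(1.0 / beta)        # y ~ p(.|x, R), density beta * exp(-beta * y)
        return y, C
    y = x + rng.exponential(1.0 / beta)        # y ~ p(.|x, K), density d(y - x)
    return y, cost(x)
```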

SLIDE 26

Approximate Value Iteration

Example: the Optimal Replacement Problem

Optimal value function:
$$V^*(x) = \min\Big\{ c(x) + \gamma \int d(y - x)\, V^*(y)\, dy,\;\; C + \gamma \int d(y)\, V^*(y)\, dy \Big\}$$
Optimal policy: the action that attains the minimum.

[Figure: management cost $c(x)$ and optimal value function as a function of the wear $x$, with the regions where (R)eplace and (K)eep are optimal.]

Linear approximation space:
$$\mathcal{F} := \Big\{ V_n(x) = \sum_{k=1}^{20} \alpha_k \cos\big(k\pi \tfrac{x}{x_{\max}}\big) \Big\}.$$
SLIDE 27

Approximate Value Iteration

Example: the Optimal Replacement Problem

Collect $N$ samples on a uniform grid.


Figure: Left: the target values computed as $\{\mathcal{T} V_0(x_n)\}_{1 \leq n \leq N}$. Right: the approximation $V_1 \in \mathcal{F}$ of the target function $\mathcal{T} V_0$.

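A sketch of one such fitted-value-iteration sweep, assuming the cosine features above, Monte Carlo estimates of the two integrals, and the hypothetical `cost`, `C`, `beta` from the generative-model sketch:

```python
import numpy as np

def features(x, x_max, d=20):
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)                  # phi_k(x) = cos(k * pi * x / x_max)

def fitted_vi_sweep(alpha, grid, x_max, beta, C, cost, gamma, n_mc=200, rng=np.random.default_rng(0)):
    """Compute targets T V_alpha(x_n) on a grid by Monte Carlo, then refit by least squares."""
    V = lambda x: features(min(x, x_max), x_max) @ alpha  # crude truncation at x_max
    targets = []
    for x in grid:
        keep = cost(x) + gamma * np.mean([V(x + rng.exponential(1.0 / beta)) for _ in range(n_mc)])
        repl = C + gamma * np.mean([V(rng.exponential(1.0 / beta)) for _ in range(n_mc)])
        targets.append(min(keep, repl))                   # cost minimization: T uses a min
    Phi = np.vstack([features(x, x_max) for x in grid])
    new_alpha, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return new_alpha
```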

SLIDE 28

Approximate Value Iteration

Example: the Optimal Replacement Problem


Figure: Left: the target values computed as $\{\mathcal{T} V_1(x_n)\}_{1 \leq n \leq N}$. Center: the approximation $V_2 \in \mathcal{F}$ of $\mathcal{T} V_1$. Right: the approximation $V_n \in \mathcal{F}$ after $n$ iterations.

SLIDE 29

Approximate Policy Iteration

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration
  ◮ Linear Temporal-Difference
  ◮ Least-Squares Temporal Difference
  ◮ Bellman Residual Minimization

SLIDE 30

Approximate Policy Iteration

Approximate Policy Iteration: the Idea

Let A be an approximation operator.

◮ Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.
◮ Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y \in X} p(y|x,a)\, V_k(y) \Big].$$

Problem: the algorithm is no longer guaranteed to converge.

[Figure: the error $\|V^* - V^{\pi_k}\|$ oscillates around an asymptotic error level instead of converging.]

SLIDE 31

Approximate Policy Iteration

Approximate Policy Iteration: performance loss

Proposition

The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as:
$$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \;\leq\; \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}$$
SLIDE 32

Approximate Policy Iteration

Approximate Policy Iteration: performance loss

Proof. We introduce:
◮ the approximation error: $e_k = V_k - V^{\pi_k}$,
◮ the performance gain: $g_k = V^{\pi_{k+1}} - V^{\pi_k}$,
◮ the performance loss: $l_k = V^* - V^{\pi_k}$.

SLIDE 33

Approximate Policy Iteration

Approximate Policy Iteration: performance loss

Proof (cont'd). Since $\pi_{k+1}$ is greedy w.r.t. $V_k$ we have $\mathcal{T}^{\pi_{k+1}} V_k \geq \mathcal{T}^{\pi_k} V_k$. Then
$$g_k = \mathcal{T}^{\pi_{k+1}} V^{\pi_{k+1}} - \mathcal{T}^{\pi_{k+1}} V^{\pi_k} + \mathcal{T}^{\pi_{k+1}} V^{\pi_k} - \mathcal{T}^{\pi_{k+1}} V_k + \mathcal{T}^{\pi_{k+1}} V_k - \mathcal{T}^{\pi_k} V_k + \mathcal{T}^{\pi_k} V_k - \mathcal{T}^{\pi_k} V^{\pi_k} \overset{(a)}{\geq} \gamma P^{\pi_{k+1}} g_k - \gamma (P^{\pi_{k+1}} - P^{\pi_k})\, e_k,$$
which, rearranged using the monotonicity of $(I - \gamma P^{\pi_{k+1}})^{-1}$ (step (b)), leads to
$$g_k \geq -\gamma (I - \gamma P^{\pi_{k+1}})^{-1} (P^{\pi_{k+1}} - P^{\pi_k})\, e_k. \qquad (1)$$

SLIDE 34

Approximate Policy Iteration

Approximate Policy Iteration: performance loss

Proof (cont'd). Relationship between the performance losses at subsequent iterations. Since $\mathcal{T}^{\pi^*} V_k \leq \mathcal{T}^{\pi_{k+1}} V_k$ we have
$$l_{k+1} = \mathcal{T}^{\pi^*} V^* - \mathcal{T}^{\pi^*} V^{\pi_k} + \mathcal{T}^{\pi^*} V^{\pi_k} - \mathcal{T}^{\pi^*} V_k + \mathcal{T}^{\pi^*} V_k - \mathcal{T}^{\pi_{k+1}} V_k + \mathcal{T}^{\pi_{k+1}} V_k - \mathcal{T}^{\pi_{k+1}} V^{\pi_k} + \mathcal{T}^{\pi_{k+1}} V^{\pi_k} - \mathcal{T}^{\pi_{k+1}} V^{\pi_{k+1}} \leq \gamma \big[ P^{\pi^*} l_k - P^{\pi_{k+1}} g_k + (P^{\pi_{k+1}} - P^{\pi^*})\, e_k \big].$$
If we now plug in equation (1),
$$l_{k+1} \leq \gamma P^{\pi^*} l_k + \gamma \big[ \gamma P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (P^{\pi_{k+1}} - P^{\pi_k}) + P^{\pi_{k+1}} - P^{\pi^*} \big] e_k \leq \gamma P^{\pi^*} l_k + \gamma \big[ P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big] e_k.$$
Thus the performance loss evolves through the iterations as
$$l_{k+1} \leq \gamma P^{\pi^*} l_k + \gamma \big[ P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big] e_k.$$

SLIDE 35

Approximate Policy Iteration

Approximate Policy Iteration: performance loss

Proof (cont'd). Move to the asymptotic regime. Let
$$f_k = \gamma \big[ P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big] e_k,$$
so that $l_{k+1} \leq \gamma P^{\pi^*} l_k + f_k$. Taking the $\limsup$ we obtain
$$(I - \gamma P^{\pi^*}) \limsup_{k \to \infty} l_k \leq \limsup_{k \to \infty} f_k
\quad\Longrightarrow\quad
\limsup_{k \to \infty} l_k \leq (I - \gamma P^{\pi^*})^{-1} \limsup_{k \to \infty} f_k,$$
since $I - \gamma P^{\pi^*}$ is invertible. Finally, taking the $L_\infty$-norm on both sides,
$$\limsup_{k \to \infty} \|l_k\|_\infty \leq \frac{\gamma}{1-\gamma} \limsup_{k \to \infty} \big\| P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big\|_\infty \|e_k\|_\infty \leq \frac{\gamma}{1-\gamma} \Big( \frac{1+\gamma}{1-\gamma} + 1 \Big) \limsup_{k \to \infty} \|e_k\|_\infty = \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|e_k\|_\infty.$$

SLIDE 36

Approximate Policy Iteration Linear Temporal-Difference

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration
  ◮ Linear Temporal-Difference
  ◮ Least-Squares Temporal Difference
  ◮ Bellman Residual Minimization

SLIDE 37

Approximate Policy Iteration Linear Temporal-Difference

Linear TD(λ): the algorithm

Algorithm Definition
Given a linear space $\mathcal{F} = \{ V_\alpha(x) = \sum_{i=1}^{d} \alpha_i \phi_i(x), \; \alpha \in \mathbb{R}^d \}$. The trace vector $z \in \mathbb{R}^d$ and the parameter vector $\alpha \in \mathbb{R}^d$ are initialized to zero. Generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$. At each step $t$, the temporal difference is
$$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t),$$
and the parameters are updated as
$$\alpha_{t+1} = \alpha_t + \eta_t d_t z_t, \qquad z_{t+1} = \lambda\gamma z_t + \phi(x_{t+1}),$$
where $\eta_t$ is the learning step.

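A minimal sketch of these updates, assuming callables `phi(x)` (features in $\mathbb{R}^d$), `policy(x)`, and a simulator `env_step(x, a) -> (x_next, r)` (all illustrative names); the trace here is refreshed with $\phi(x_t)$ before the parameter update, one common indexing of the same recursion:

```python
import numpy as np

def linear_td_lambda(phi, policy, env_step, x0, d, gamma, lam, n_steps, eta=0.01):
    """Linear TD(lambda): eligibility trace z and parameters alpha, both starting at zero."""
    alpha, z = np.zeros(d), np.zeros(d)
    x = x0
    for _ in range(n_steps):
        a = policy(x)
        x_next, r = env_step(x, a)
        z = lam * gamma * z + phi(x)                              # trace update
        delta = r + gamma * phi(x_next) @ alpha - phi(x) @ alpha  # temporal difference d_t
        alpha = alpha + eta * delta * z                           # alpha_{t+1} = alpha_t + eta * d_t * z_t
        x = x_next
    return alpha
```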

SLIDE 38

Approximate Policy Iteration Linear Temporal-Difference

Linear TD(λ): approximation error

Proposition (Tsitsiklis and Van Roy, 1996)
Let the learning rate $\eta_t$ satisfy $\sum_{t \geq 0} \eta_t = \infty$ and $\sum_{t \geq 0} \eta_t^2 < \infty$. We assume that $\pi$ admits a stationary distribution $\mu_\pi$ and that the features $(\phi_i)_{1 \leq i \leq d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t \to \infty} \alpha_t = \alpha^*$. Furthermore,
$$\underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu_\pi}}_{\text{approximation error}} \;\leq\; \frac{1 - \lambda\gamma}{1 - \gamma} \underbrace{\inf_{\alpha} \|V_\alpha - V^\pi\|_{2,\mu_\pi}}_{\text{smallest approximation error}}.$$

SLIDE 39

Approximate Policy Iteration Linear Temporal-Difference

Linear TD(λ): approximation error

Remark: for λ = 1, we recover Monte-Carlo (or TD(1)) and the bound is the smallest! Problem: the bound does not consider the variance (i.e., samples needed for αt to converge to α∗).

SLIDE 40

Approximate Policy Iteration Linear Temporal-Difference

Linear TD(λ): implementation

◮ Pros: simple to implement, computational cost linear in $d$.
◮ Cons: very sample inefficient, many samples are needed to converge.

SLIDE 41

Approximate Policy Iteration Least-Squares Temporal Difference

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration
  ◮ Linear Temporal-Difference
  ◮ Least-Squares Temporal Difference
  ◮ Bellman Residual Minimization

SLIDE 42

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the algorithm

Recall: $V^\pi = \mathcal{T}^\pi V^\pi$. Intuition: compute $V = \mathcal{A} \mathcal{T}^\pi V$.

[Figure: $V^\pi$, its projection $\Pi_\mu V^\pi$ onto $\mathcal{F}$, and the LSTD fixed point $V_{TD} = \Pi_\mu \mathcal{T}^\pi V_{TD}$.]

Focus on the $L_{2,\mu}$-weighted norm and the corresponding projection $\Pi_\mu$:
$$\Pi_\mu g = \arg\min_{f \in \mathcal{F}} \|f - g\|_\mu.$$

SLIDE 43

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the algorithm

By construction, the Bellman residual of $V_{TD}$ is orthogonal to $\mathcal{F}$, thus for any $1 \leq i \leq d$
$$\langle \mathcal{T}^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0,$$
and
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = \langle r^\pi, \phi_i \rangle_\mu + \sum_{j=1}^{d} \langle \gamma P^\pi \phi_j - \phi_j, \phi_i \rangle_\mu \, \alpha_{TD,j} = 0$$
⇒ $\alpha_{TD}$ is the solution of a linear system of order $d$.

SLIDE 44

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the algorithm

Algorithm Definition

The LSTD solution $\alpha_{TD}$ is obtained by computing the matrix $A$ and the vector $b$ defined as
$$A_{i,j} = \langle \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i, r^\pi \rangle_\mu,$$
and then solving the system $A\alpha = b$.

SLIDE 45

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the approximation error

Problem: in general $\Pi_\mu \mathcal{T}^\pi$ does not admit a fixed point (i.e., the matrix $A$ is not invertible).
Solution: use the stationary distribution $\mu_\pi$ of policy $\pi$, that is $\mu_\pi P^\pi = \mu_\pi$, i.e.
$$\mu_\pi(y) = \sum_{x} p(y|x, \pi(x)) \, \mu_\pi(x).$$

SLIDE 46

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the approximation error

Proposition

The Bellman operator $\mathcal{T}^\pi$ is a contraction in the weighted $L_{2,\mu_\pi}$-norm. Thus the joint operator $\Pi_{\mu_\pi} \mathcal{T}^\pi$ is a contraction and it admits a unique fixed point $V_{TD}$. Then
$$\underbrace{\|V^\pi - V_{TD}\|_{\mu_\pi}}_{\text{approximation error}} \;\leq\; \frac{1}{\sqrt{1 - \gamma^2}} \underbrace{\inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu_\pi}}_{\text{smallest approximation error}}.$$

SLIDE 47

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the approximation error

Proof. We show that $\|P^\pi\|_{\mu_\pi} = 1$:
$$\|P^\pi V\|^2_{\mu_\pi} = \sum_{x} \mu_\pi(x) \Big( \sum_{y} p(y|x,\pi(x)) V(y) \Big)^2 \leq \sum_{x} \sum_{y} \mu_\pi(x)\, p(y|x,\pi(x))\, V(y)^2 = \sum_{y} \mu_\pi(y) V(y)^2 = \|V\|^2_{\mu_\pi},$$
using Jensen's inequality and the stationarity of $\mu_\pi$. It follows that $\mathcal{T}^\pi$ is a contraction in $L_{2,\mu_\pi}$, i.e.
$$\|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2\|_{\mu_\pi} = \gamma \|P^\pi (V_1 - V_2)\|_{\mu_\pi} \leq \gamma \|V_1 - V_2\|_{\mu_\pi}.$$
Thus $\Pi_{\mu_\pi} \mathcal{T}^\pi$ is the composition of a non-expansion and a contraction, hence a contraction in $L_{2,\mu_\pi}$, and it admits a unique fixed point $V_{TD} = \Pi_{\mu_\pi} \mathcal{T}^\pi V_{TD}$.

SLIDE 48

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the approximation error

Proof. By the Pythagorean theorem we have
$$\|V^\pi - V_{TD}\|^2_{\mu_\pi} = \|V^\pi - \Pi_{\mu_\pi} V^\pi\|^2_{\mu_\pi} + \|\Pi_{\mu_\pi} V^\pi - V_{TD}\|^2_{\mu_\pi},$$
but
$$\|\Pi_{\mu_\pi} V^\pi - V_{TD}\|^2_{\mu_\pi} = \|\Pi_{\mu_\pi} V^\pi - \Pi_{\mu_\pi} \mathcal{T}^\pi V_{TD}\|^2_{\mu_\pi} \leq \|\mathcal{T}^\pi V^\pi - \mathcal{T}^\pi V_{TD}\|^2_{\mu_\pi} \leq \gamma^2 \|V^\pi - V_{TD}\|^2_{\mu_\pi}.$$
Thus
$$\|V^\pi - V_{TD}\|^2_{\mu_\pi} \leq \|V^\pi - \Pi_{\mu_\pi} V^\pi\|^2_{\mu_\pi} + \gamma^2 \|V^\pi - V_{TD}\|^2_{\mu_\pi},$$
which gives the bound in the proposition after reordering (and taking square roots).

SLIDE 49

Approximate Policy Iteration Least-Squares Temporal Difference

Least-squares TD: the implementation

◮ Generate $(X_0, X_1, \ldots)$ by direct execution of $\pi$ and observe $R_t = r(X_t, \pi(X_t))$.
◮ Compute the estimates
$$\hat{A}_{ij} = \frac{1}{n} \sum_{t=1}^{n} \phi_i(X_t) \big[ \phi_j(X_t) - \gamma \phi_j(X_{t+1}) \big], \qquad \hat{b}_i = \frac{1}{n} \sum_{t=1}^{n} \phi_i(X_t) R_t.$$
◮ Solve $\hat{A}\alpha = \hat{b}$.

Remark:
◮ No need for a generative model.
◮ If the chain is ergodic, $\hat{A} \to A$ and $\hat{b} \to b$ as $n \to \infty$.

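A sketch of these sample-based estimates, assuming a single trajectory of states, the rewards observed along it, and a feature map `phi(x)` in $\mathbb{R}^d$ (illustrative names):

```python
import numpy as np

def lstd(states, rewards, phi, gamma, d):
    """Build A_hat, b_hat from one trajectory under pi and solve A_hat alpha = b_hat."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    n = len(states) - 1                        # number of transitions (X_t, X_{t+1})
    for t in range(n):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f, f - gamma * f_next) / n
        b_hat += f * rewards[t] / n
    return np.linalg.solve(A_hat, b_hat)       # alpha_TD
```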

SLIDE 50

Approximate Policy Iteration Bellman Residual Minimization

Outline

◮ Performance Loss
◮ Approximate Value Iteration
◮ Approximate Policy Iteration
  ◮ Linear Temporal-Difference
  ◮ Least-Squares Temporal Difference
  ◮ Bellman Residual Minimization

SLIDE 51

Approximate Policy Iteration Bellman Residual Minimization

Bellman Residual Minimization (BRM): the idea

[Figure: $V^\pi$, the space $\mathcal{F}$, the best approximation $\arg\min_{V \in \mathcal{F}} \|V^\pi - V\|$, and the Bellman-residual minimizer $V_{BR}$.]

Let $\mu$ be a distribution over $X$; $V_{BR}$ is the minimizer of the Bellman residual w.r.t. $\mathcal{T}^\pi$:
$$V_{BR} = \arg\min_{V \in \mathcal{F}} \|\mathcal{T}^\pi V - V\|_{2,\mu}.$$

SLIDE 52

Approximate Policy Iteration Bellman Residual Minimization

Bellman Residual Minimization (BRM): the idea

The mapping $\alpha \mapsto \mathcal{T}^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|\mathcal{T}^\pi V_\alpha - V_\alpha\|^2_\mu$ is quadratic.
⇒ The minimum is obtained by computing the gradient and setting it to zero:
$$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^{d} \phi_j \alpha_j, \; (\gamma P^\pi - I)\phi_i \Big\rangle_\mu = 0,$$
which can be rewritten as $A\alpha = b$, with
$$A_{i,j} = \langle \phi_i - \gamma P^\pi \phi_i, \; \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i - \gamma P^\pi \phi_i, \; r^\pi \rangle_\mu.$$

SLIDE 53

Approximate Policy Iteration Bellman Residual Minimization

Bellman Residual Minimization (BRM): the idea

Remark: the system admits a solution whenever the features $\phi_i$ are linearly independent w.r.t. $\mu$.
Remark: letting $\{\psi_i = \phi_i - \gamma P^\pi \phi_i\}_{i=1,\ldots,d}$, the previous system can be interpreted as the linear regression problem of minimizing $\|\alpha \cdot \psi - r^\pi\|_\mu$.

SLIDE 54

Approximate Policy Iteration Bellman Residual Minimization

BRM: the approximation error

Proposition

We have
$$\|V^\pi - V_{BR}\| \leq \|(I - \gamma P^\pi)^{-1}\| \big( 1 + \gamma \|P^\pi\| \big) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$
If $\mu_\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu_\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} \leq \frac{1}{1-\gamma}$, thus
$$\|V^\pi - V_{BR}\|_{\mu_\pi} \leq \frac{1+\gamma}{1-\gamma} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu_\pi}.$$

SLIDE 55

Approximate Policy Iteration Bellman Residual Minimization

BRM: the approximation error

Proof. We relate the Bellman residual to the approximation error:
$$V^\pi - V = V^\pi - \mathcal{T}^\pi V + \mathcal{T}^\pi V - V = \gamma P^\pi (V^\pi - V) + (\mathcal{T}^\pi V - V),$$
so that $(I - \gamma P^\pi)(V^\pi - V) = \mathcal{T}^\pi V - V$. Taking norms on both sides we obtain
$$\|V^\pi - V_{BR}\| \leq \|(I - \gamma P^\pi)^{-1}\| \, \|\mathcal{T}^\pi V_{BR} - V_{BR}\|,$$
and
$$\|\mathcal{T}^\pi V_{BR} - V_{BR}\| = \inf_{V \in \mathcal{F}} \|\mathcal{T}^\pi V - V\| \leq \big( 1 + \gamma \|P^\pi\| \big) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$

SLIDE 56

Approximate Policy Iteration Bellman Residual Minimization

BRM: the approximation error

Proof. If we consider the stationary distribution $\mu_\pi$, then $\|P^\pi\|_{\mu_\pi} = 1$. The matrix $(I - \gamma P^\pi)^{-1}$ can be written as the power series $\sum_{t} \gamma^t (P^\pi)^t$. Taking the norm we obtain
$$\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} \leq \sum_{t \geq 0} \gamma^t \|P^\pi\|^t_{\mu_\pi} \leq \frac{1}{1-\gamma}.$$

SLIDE 57

Approximate Policy Iteration Bellman Residual Minimization

BRM: the implementation

Assumption: a generative model is available.

◮ Draw $n$ states $X_t \sim \mu$.
◮ Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$, $Y_t \sim p(\cdot|X_t, A_t)$.
◮ Compute
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^{n} \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{\mathcal{T}} V(X_t)} \Big]^2.$$

SLIDE 58

Approximate Policy Iteration Bellman Residual Minimization

BRM: the implementation

Problem: this estimator is biased and not consistent! In fact,
$$\mathbb{E}[\hat{B}(V)] = \mathbb{E}\Big[ \big( V(X_t) - \mathcal{T}^\pi V(X_t) + \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big] = \|\mathcal{T}^\pi V - V\|^2_\mu + \mathbb{E}\Big[ \big( \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big]$$
⇒ minimizing $\hat{B}(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).

SLIDE 59

Approximate Policy Iteration Bellman Residual Minimization

BRM: the implementation

Solution. In each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot|X_t, A_t)$ and define
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^{n} \Big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \Big] \Big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \Big].$$
⇒ $\hat{B} \to B$ as $n \to \infty$.

SLIDE 60

Approximate Policy Iteration Bellman Residual Minimization

BRM: the implementation

The function $\alpha \mapsto \hat{B}(V_\alpha)$ is quadratic and we obtain the linear system
$$\hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^{n} \big[ \phi_i(X_t) - \gamma \phi_i(Y_t) \big] \big[ \phi_j(X_t) - \gamma \phi_j(Y'_t) \big],
\qquad
\hat{b}_i = \frac{1}{n} \sum_{t=1}^{n} \Big[ \phi_i(X_t) - \gamma \frac{\phi_i(Y_t) + \phi_i(Y'_t)}{2} \Big] R_t.$$
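A sketch of this double-sample construction, again assuming a generative model `sample(x, a)` that can be called twice independently for the same $(X_t, A_t)$, a reward function `reward(x, a)`, and features `phi(x)` in $\mathbb{R}^d$ (illustrative names):

```python
import numpy as np

def brm(states, policy, sample, reward, phi, gamma, d):
    """Build the double-sample BRM system A_hat alpha = b_hat and solve it."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    n = len(states)
    for x in states:
        a = policy(x)
        y1, y2 = sample(x, a), sample(x, a)            # two independent next states Y_t, Y'_t
        r = reward(x, a)
        u = phi(x) - gamma * phi(y1)
        v = phi(x) - gamma * phi(y2)
        A_hat += np.outer(u, v) / n
        b_hat += (phi(x) - gamma * (phi(y1) + phi(y2)) / 2) * r / n
    return np.linalg.solve(A_hat, b_hat)
```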

SLIDE 61

Approximate Policy Iteration Bellman Residual Minimization

LSTD vs BRM

◮ Different assumptions: BRM requires a generative model, while LSTD only needs a single trajectory.
◮ The performance is evaluated differently: BRM under any sampling distribution $\mu$, LSTD under the stationary distribution $\mu_\pi$.

SLIDE 62

Approximate Policy Iteration Bellman Residual Minimization

Bibliography I

SLIDE 63

Approximate Policy Iteration Bellman Residual Minimization

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr