Chapter 11: Off-policy Methods with Approximation

Recall that off-policy learning involves two policies:
- One policy whose value function we are learning: the target policy π
- Another policy that is used to select actions: the behavior policy µ
The parameter vector can diverge (elements of θ may increase to ±∞) because we are no longer updating according to the on-policy distribution.
[Figure: Baird's counterexample. The value of each of the six upper states is parameterized as 2θ_i + θ_8 (i = 1, ..., 6) and the seventh state as θ_7 + 2θ_8; transitions occur with probabilities 99% and 1%. The behavior policy takes the dashed action with probability 6/7 and the solid action with probability 1/7, while the target policy always takes the solid action: µ(dashed|·) = 6/7, µ(solid|·) = 1/7, π(solid|·) = 1. The plot shows the components θ_1–θ_6, θ_7, θ_8 of the parameter vector at the end of each episode, diverging over episodes.]
Divergence occurs under semi-gradient TD updating (and similarly for DP).
The deadly triad: the risk of divergence arises when we combine all three of
- function approximation,
- bootstrapping, as in dynamic programming and temporal-difference learning, and
- off-policy learning, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy.
(Why is dynamic programming off-policy?)
Any two of the three are OK; only the full combination is deadly.
A simpler illustration: a transition from a state with feature φ = 1 to a state with feature φ' = 2, with r = 0 and γ = 1.
TD update:   δ = r + γθ⊤φ' − θ⊤φ = 0 + 2θ − θ = θ,    Δθ = αδφ = αθ
TD fixpoint: θ* = 0
If this transition is updated repeatedly, as can happen under off-policy updating, θ moves away from the fixpoint and grows without bound.
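A minimal numeric sketch of this arithmetic (the step size and initial θ below are illustrative assumptions, not from the slides): each update multiplies θ by (1 + α), so any nonzero θ diverges while the fixpoint θ* = 0 is never approached.

```python
# Repeated semi-gradient TD(0) updates on the single transition above
# (phi = 1 -> phi' = 2, r = 0, gamma = 1). alpha and theta are illustrative.
alpha, gamma = 0.1, 1.0
phi, phi_next, r = 1.0, 2.0, 0.0
theta = 1.0  # any nonzero starting value

for step in range(50):
    delta = r + gamma * theta * phi_next - theta * phi   # = theta
    theta += alpha * delta * phi                          # theta <- (1 + alpha) * theta

print(theta)  # grows without bound; the TD fixpoint theta* = 0 is never reached
```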
[Figure: the geometry of value-function space. A plane represents the subspace of all value functions representable as v_θ (coordinates θ_1, θ_2), embedded in the space of all value functions, which contains the true value function v_π. Applying the Bellman operator B_π to v_θ leaves the subspace; projecting back gives ΠB_π v_θ. The Bellman error (BE) is the distance from v_θ to B_π v_θ, and the projected Bellman error (PBE) is its component within the subspace. Distinguished points in the subspace: Πv_π = v_θ*, which minimizes the value error (VE); the point achieving min ‖BE‖; and the TD fixpoint, where PBE = 0.]
Definitions. The Bellman operator for the target policy π:
(B_π v)(s) ≐ Σ_{a∈A} π(s,a) [ r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v(s') ]
and v_θ ≐ v̂(·,θ), viewed as a giant vector ∈ ℝ^|S|. VE denotes the Value Error, BE the Bellman Error, and PBE the Projected Bellman Error.
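For concreteness, a small sketch of applying the Bellman operator to a tabular value vector (the array names and shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def bellman_operator(v, pi, r, P, gamma):
    """Apply (B_pi v)(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ].

    v:  (S,)      value vector
    pi: (S, A)    target-policy probabilities pi(a|s)
    r:  (S, A)    expected rewards r(s, a)
    P:  (S, A, S) transition probabilities p(s'|s, a)
    """
    q = r + gamma * (P @ v)        # (S, A): one-step lookahead for each (s, a)
    return (pi * q).sum(axis=1)    # (S,):   average over actions under pi
```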
Bootstrapping harms the asymptotic performance of approximate methods. The degree of bootstrapping is controlled by the λ parameter, from λ=0 (full bootstrapping) to λ=1 (no bootstrapping). Yet empirical results suggest that λ=1 (no bootstrapping) is a very poor choice.
[Figure: performance as a function of λ on several tasks, from pure bootstrapping (λ=0) to no bootstrapping (λ=1). In all cases lower is better; the red points are the λ=1 cases.]
We need bootstrapping!
The TD(0) update is not the gradient of any objective function J(θ) (Etienne Barnard, 1993):
δ = r + γθ⊤φ' − θ⊤φ,    Δθ = αδφ
Assume there is a J such that ∂J/∂θ_i = δφ_i. Then look at the second derivatives:
∂²J/∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ'_j − φ_j)φ_i
∂²J/∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ'_i − φ_i)φ_j
Real second derivatives must be symmetric: ∂²J/∂θ_j∂θ_i = ∂²J/∂θ_i∂θ_j. But in general these two expressions are not equal. Contradiction!
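A quick numeric check of the asymmetry (with illustrative feature vectors, not from the slides): take φ = (1, 0)⊤, φ' = (0, 1)⊤, γ = 0.9. Then
∂²J/∂θ_2∂θ_1 = (γφ'_2 − φ_2)φ_1 = (0.9 − 0)·1 = 0.9
∂²J/∂θ_1∂θ_2 = (γφ'_1 − φ_1)φ_2 = (0 − 1)·0 = 0
so the mixed partials disagree and no such J can exist.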
[Figure: a small episodic MRP with states A and B. From A, the process goes to B with probability 50% or terminates with probability 50%, both with reward 0; from B it terminates with reward 1 (100%).]
True values: V(A) = 0.5, V(B) = 1.
Minimizing the squared TD error J(θ) = E[δ²] instead gives V(A) = 1/3, V(B) = 2/3.
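A short check of these numbers, assuming the transition structure read off the figure above. The three transition types (A to B, A to termination, B to termination) occur equally often, so
J = E[δ²] ∝ (V(B) − V(A))² + (0 − V(A))² + (1 − V(B))²
Setting the partial derivatives to zero gives V(A) = V(B)/2 and 2V(B) − V(A) = 1, hence V(B) = 2/3 and V(A) = 1/3, rather than the true values V(B) = 1, V(A) = 0.5.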
[Figure: two small episodic MDPs whose states are all represented by the same feature weight θ_1.] These two have different Value Errors, but the same Return Errors (both errors have the same minima).
J_RE(θ) = J_VE(θ) + E[ (G_t − v_π(S_t))² ]
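A one-line justification of this identity, assuming returns G_t are generated according to the target policy so that E[G_t | S_t] = v_π(S_t): expanding
E[(G_t − v_θ(S_t))²] = E[(v_π(S_t) − v_θ(S_t))²] + E[(G_t − v_π(S_t))²] + 2E[(v_π(S_t) − v_θ(S_t))(G_t − v_π(S_t))]
the cross term vanishes by conditioning on S_t, and the remaining noise term does not depend on θ, so J_RE and J_VE differ by a constant and share the same minima.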
These two MDPs have different Bellman Errors, but the same Projected Bellman Errors (and the errors have different minima).
[Figure: MDP1 and MDP2 generate the same data distribution P_µ(ξ) over trajectories ξ, yet their Bellman errors BE1 and BE2 are minimized at different parameters θ*_1, ..., θ*_4. Likewise their Value Errors VE1 and VE2 differ, while the Return Error RE is determined by the data distribution alone.]
[Figure: V_θ, TV_θ, and ΠTV_θ in value-function space; RMSBE is the distance from V_θ to TV_θ, and RMSPBE the distance from V_θ to ΠTV_θ.]
Let Φ be the matrix of the feature vectors for all states (one row per state), D the diagonal matrix of the state distribution, T the Bellman operator, and
Π = Φ(Φ⊤DΦ)⁻¹Φ⊤D
the projection onto the representable subspace. Two useful identities:
Φ⊤D(TV_θ − V_θ) = E[δφ]        Φ⊤DΦ = E[φφ⊤]
Then
MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D
         = ‖Π(V_θ − TV_θ)‖²_D
         = (Π(V_θ − TV_θ))⊤ D (Π(V_θ − TV_θ))
         = (V_θ − TV_θ)⊤ Π⊤DΠ (V_θ − TV_θ)
         = (V_θ − TV_θ)⊤ DΦ(Φ⊤DΦ)⁻¹Φ⊤D (V_θ − TV_θ)
         = (Φ⊤D(TV_θ − V_θ))⊤ (Φ⊤DΦ)⁻¹ Φ⊤D(TV_θ − V_θ)
         = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]
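A small sketch of this computation for a known MRP (names and shapes below are illustrative assumptions): form E[δφ] and E[φφ⊤] from the feature matrix Φ, the state distribution d, and the Bellman operator, then combine them as in the last line above.

```python
import numpy as np

def mspbe(theta, Phi, d, P_pi, r_pi, gamma):
    """MSPBE(theta) = E[delta*phi]^T E[phi phi^T]^{-1} E[delta*phi].

    Phi:  (S, n) feature matrix (one row per state)
    d:    (S,)   state distribution (diagonal of D)
    P_pi: (S, S) state-to-state transition matrix under the target policy
    r_pi: (S,)   expected one-step reward under the target policy
    """
    v = Phi @ theta                      # V_theta
    Tv = r_pi + gamma * (P_pi @ v)       # T V_theta (Bellman operator)
    b = Phi.T @ (d * (Tv - v))           # Phi^T D (T V_theta - V_theta) = E[delta*phi]
    A = Phi.T @ (d[:, None] * Phi)       # Phi^T D Phi = E[phi phi^T]
    return b @ np.linalg.solve(A, b)
```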
Consider a transition from s to s' with reward r and feature vectors φ = φ(s), φ' = φ(s').
This is the trick: w ∈ ℝⁿ is a second set of weights.
Δθ = −½ α ∇_θ J(θ) = −½ α ∇_θ ‖V_θ − ΠTV_θ‖²_D
   = −½ α ∇_θ ( E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ] )
   = −α (∇_θ E[δφ])⊤ E[φφ⊤]⁻¹ E[δφ]
   = −α E[ φ (γφ' − φ)⊤ ]⊤ E[φφ⊤]⁻¹ E[δφ]
   = α ( E[φφ⊤] − γ E[φ'φ⊤] ) E[φφ⊤]⁻¹ E[δφ]
   = α E[δφ] − αγ E[φ'φ⊤] E[φφ⊤]⁻¹ E[δφ]
   ≈ α E[δφ] − αγ E[φ'φ⊤] w            (w estimates E[φφ⊤]⁻¹ E[δφ])
   ≈ α δφ − αγ φ' (φ⊤w)                (sampling)
TDC (TD(0) with gradient correction):
θ ← θ + αδφ − αγφ'(φ⊤w)
w ← w + β(δ − φ⊤w)φ
δ = r + γθ⊤φ' − θ⊤φ
for a transition s → s' with reward r and features φ, φ'. Here φ⊤w is an estimate of the TD error (δ) for the current state (φ). Step-size conditions: α, β → 0 with α/β → 0, or α = β → 0.
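A minimal per-transition sketch of the TDC update above (the function name and arguments are illustrative; importance-sampling corrections for the off-policy case are omitted, as in the update as written):

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """One TDC (TD(0) with gradient correction) update for a transition
    with features phi -> phi_next and reward r."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```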
! "! #! $! %! &!! &"! &#! &$! &%! "!! ! " # $ % &! '()*+, )-../0 123 234 123!"
! "!!! #!!! $!!! %!!! &!!! "!
!"!
"!
!&
"!
!
"!
&
"!
"!
'()(*+,+)-.!/01 23++45 ! !
"!
67!
&
Computer Go as an experimental testbed: learning the value function (probability of winning) for 9x9 Go from self play, with features each corresponding to a template on a part of the Go board.
[Figure: RNEU (the root norm of the expected TD update, E[Δθ_TD]) as a function of the step size, from .000001 to .001, for TD, GTD, GTD2, and TDC.]
Summary of algorithms versus issues:
Algorithms: TD(λ), Sarsa(λ); Approx. DP; LSTD(λ), LSPE(λ); Fitted-Q; Residual gradient; GDP; GTD(λ), GQ(λ)
Issues: linear computation; nonlinear convergent; off-policy convergent; model-free; converges to PBE = 0
In conclusion: more work is needed.

Rupam Mahmood, Huizhen (Janey) Yu, Martha White, Rich Sutton
Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Canada
The idea of emphasis in on-policy and off-policy TD learning:
- With function approximation we may care about some states more than others in learning (i.e., when the optimal solution can't be approached exactly), for example about the value of the starting state.
- In Monte Carlo methods, states (or time steps) are estimated independently, and their importances can be assigned independently.
- With TD methods it is different because of bootstrapping: a state's value is estimated based on the estimated values of later states; if the state is important, then it becomes important to accurately value the later states even if they are not important on their own.
- The resulting degree of importance, the emphasis, is therefore assigned to each time step.
Emphatic TD: setting. The data stream
· · · φ(S_t), A_t, R_{t+1}, φ(S_{t+1}), A_{t+1}, R_{t+2}, · · ·
is generated by the behavior policy µ, while we learn about the target policy π.
Feature function φ : S → ℝⁿ, interest function i : S → ℝ⁺, parameter vector θ ∈ ℝⁿ.
The state distribution under the behavior policy is
d_µ(s) = lim_{t→∞} Pr{ S_t = s }
and the objective is
MSE(θ) = Σ_{s∈S} d_µ(s) i(s) ( v_π(s) − θ⊤φ(s) )²
where v_π is the true value function and ⊤ denotes transpose (θ⊤φ is an inner product).
The update is
θ_{t+1} = θ_t + α M_t ρ_t ( R_{t+1} + γθ_t⊤φ_{t+1} − θ_t⊤φ_t ) φ_t,    φ_t = φ(S_t)
where M_t > 0 is the emphasis and
ρ_t = π(A_t|S_t) / µ(A_t|S_t),    E[ρ_t] = 1
is the importance sampling ratio.
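A per-step sketch of this update (the emphasis M_t and the ratio ρ_t are supplied as inputs here; names are illustrative):

```python
import numpy as np

def emphatic_td_step(theta, phi, R, phi_next, rho, M, alpha, gamma):
    """theta_{t+1} = theta_t + alpha * M_t * rho_t * delta_t * phi_t."""
    delta = R + gamma * (theta @ phi_next) - (theta @ phi)
    return theta + alpha * M * rho * delta * phi
```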
In batch form, the corresponding least-squares solution is
θ_{t+1} = A_t⁻¹ b_t,    b_t = Σ_{k=0}^t M_k ρ_k R_{k+1} φ_k,    A_t = Σ_{k=0}^t M_k ρ_k φ_k (φ_k − γφ_{k+1})⊤
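A sketch of maintaining this least-squares solution incrementally, assuming the LSTD-style form of A_t written above (class and method names are illustrative):

```python
import numpy as np

class EmphaticLSTD:
    """Accumulate A_t and b_t as above and solve theta = A_t^{-1} b_t."""
    def __init__(self, n, gamma):
        self.A = np.zeros((n, n))
        self.b = np.zeros(n)
        self.gamma = gamma

    def update(self, phi, R, phi_next, rho, M):
        self.A += M * rho * np.outer(phi, phi - self.gamma * phi_next)
        self.b += M * rho * R * phi

    def theta(self):
        # Pseudo-inverse guards against A being singular early in learning
        return np.linalg.pinv(self.A) @ self.b
```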
The emphasis M_t is determined recursively from the interest (Sutton, Mahmood, Precup & van Hasselt 2014; Sutton, Mahmood & White 2015):
F_t = ρ_{t−1} γ_t F_{t−1} + i(S_t),    F_{−1} = 0,  F_t ≥ 0
M_t = λ_t i(S_t) + (1 − λ_t) F_t,      M_t ≥ 0
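The recursions translate directly into code (a minimal sketch; the state-dependent γ_t and λ_t are passed in as scalars here):

```python
def emphasis_step(F_prev, rho_prev, gamma_t, i_t, lambda_t):
    """Follow-on trace F_t and emphasis M_t, per the recursions above."""
    F = rho_prev * gamma_t * F_prev + i_t
    M = lambda_t * i_t + (1 - lambda_t) * F
    return F, M
```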
Key results:
- Emphatic TD(λ) is stable under the emphasis weighting, which plays the role of the on-policy weighting: it starts from the distribution d_µ(s)i(s) and follows states onward under the target policy (Sutton, Mahmood & White 2015).
- It converges under the emphasis weighting for arbitrary target and behavior policies (with coverage) (Yu 2015).
- It remains as simple as TD(λ) (one parameter, one learning rate).