Off-policy methods with approximation - PowerPoint PPT Presentation
  1. Chapter 11 Off-policy methods with approximation

  2. Recall off-policy learning involves two policies • One policy π whose value function we are learning • the target policy • Another policy 𝜈 that is used to select actions • the behavior policy

  3. Off-policy is much harder with Function Approximation • Even linear FA • Even for prediction (two fixed policies π and 𝜈) • Even for Dynamic Programming • The deadly triad: FA, TD, off-policy • Any two are OK, but not all three • With all three, we may get instability (elements of θ may increase to ±∞)

  4. There are really 2 off-policy problems: one we know how to solve, one we are not sure about; one about the future, one about the present • The easy problem is that of off-policy targets (the future) • We have been correcting for that since Chapters 5 and 6, using importance sampling in the target • The hard problem is that of the distribution of states to update (the present); we are no longer updating according to the on-policy distribution

  5. Baird’s counterexample illustrates the instability • Seven states, eight parameters: the six upper states have values 2θ1 + θ8, 2θ2 + θ8, …, 2θ6 + θ8, and the lower state has value θ7 + 2θ8 • The target policy always takes the solid action, π(solid|·) = 1; the behavior policy takes the dashed action with µ(dashed|·) = 6/7 and the solid action with µ(solid|·) = 1/7 • [Figure: components of the parameter vector θ over episodes under semi-gradient off-policy TD(0); they grow without bound (similar for DP)]
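
A minimal simulation sketch of this divergence, under the standard setup above (all rewards 0, γ = 0.99, behavior takes dashed with probability 6/7); the step size, seed, and step count are illustrative choices, not values from the slides:

```python
import numpy as np

np.random.seed(0)
GAMMA, ALPHA = 0.99, 0.01

# Feature matrix: upper states 0..5 have value 2*theta_s + theta_7 (0-indexed),
# the lower state 6 has value theta_6 + 2*theta_7.
PHI = np.zeros((7, 8))
for s in range(6):
    PHI[s, s], PHI[s, 7] = 2.0, 1.0
PHI[6, 6], PHI[6, 7] = 1.0, 2.0

theta = np.array([1, 1, 1, 1, 1, 1, 10, 1], dtype=float)
state = np.random.randint(7)

for step in range(1000):
    # Behavior: dashed (random upper state) w.p. 6/7, solid (lower state) w.p. 1/7
    if np.random.rand() < 6 / 7:
        action, next_state = "dashed", np.random.randint(6)
    else:
        action, next_state = "solid", 6
    rho = 7.0 if action == "solid" else 0.0   # pi(a|s) / mu(a|s); target always solid

    delta = 0.0 + GAMMA * PHI[next_state] @ theta - PHI[state] @ theta
    theta += ALPHA * rho * delta * PHI[state]   # semi-gradient off-policy TD(0)
    state = next_state

print(theta)   # the components keep growing: instability
```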

  6. What causes the instability? • It has nothing to do with learning or sampling • Even dynamic programming suffers from divergence with FA • It has nothing to do with exploration, greedification, or control • Even prediction alone can diverge • It has nothing to do with local minima 
 or complex non-linear approximators • Even simple linear approximators can produce instability

  7. The deadly triad • The risk of divergence arises whenever we combine three things: 1. Function approximation • significantly generalizing from large numbers of examples 2. Bootstrapping • learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning 3. Off-policy learning (Why is dynamic programming off-policy?) • learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy • Any two are OK, but not all three

  8. TD(0) can diverge: A simple example • A single transition, updated repeatedly, from a state with feature φ = 1 (value θ) to a state with feature φ′ = 2 (value 2θ), with reward 0 and γ = 1: δ = r + γθᵀφ′ − θᵀφ = 0 + 2θ − θ = θ • TD update: Δθ = αδφ = αθ • Diverges! • TD fixpoint: θ* = 0
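
A tiny numeric check of the example above, under the assumed setup φ = 1, φ′ = 2, r = 0, γ = 1, with an arbitrary step size of 0.1:

```python
# Repeatedly update the single transition described above (as can happen
# off-policy, when only this transition is used for updates).
theta, alpha, gamma = 1.0, 0.1, 1.0
phi, phi_next, r = 1.0, 2.0, 0.0
for _ in range(50):
    delta = r + gamma * theta * phi_next - theta * phi   # = theta
    theta += alpha * delta * phi                         # semi-gradient TD(0)
print(theta)   # ≈ (1 + alpha)**50 ≈ 117, growing without bound
```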

  9. Geometric intuition • Think of v_θ ≐ v̂(·, θ) as a giant vector in ℝ^|S| • The Bellman operator: (B_π v)(s) ≐ Σ_a π(s, a) [ r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′) ] • [Figure: the space of all value functions, containing the subspace of value functions representable as v_θ (coordinates θ1, θ2); v_π lies off this subspace, and its projection Π v_π is the point of minimum Value Error (min ‖VE‖); applying B_π to v_θ leaves the subspace, giving the Bellman error (min ‖BE‖ marks its minimum), while projecting B_π v_θ back onto the subspace gives the Projected Bellman Error, which is zero at the TD fixpoint Π B_π v_θ = v_θ]
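
A small sketch evaluating the Bellman operator formula above on a made-up two-state, two-action MDP; all the numbers (π, r, p, γ, v) are illustrative assumptions:

```python
import numpy as np

gamma = 0.9
pi = np.array([[0.5, 0.5], [0.2, 0.8]])     # pi[s, a]
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[s, a]
p = np.array([[[0.7, 0.3], [0.1, 0.9]],     # p[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
v = np.array([0.0, 1.0])                    # an arbitrary value function

# (B_pi v)(s) = sum_a pi(s, a) * ( r(s, a) + gamma * sum_s' p(s'|s, a) * v(s') )
Bv = np.einsum('sa,sa->s', pi, r + gamma * np.einsum('sap,p->sa', p, v))
print(Bv)
```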

  10. Can we do without bootstrapping? • Bootstrapping is critical to the computational efficiency of DP • Bootstrapping is critical to the data efficiency of TD methods • On the other hand, bootstrapping introduces bias, which harms the asymptotic performance of approximate methods • The degree of bootstrapping can be finely controlled via the λ parameter, from λ =0 (full bootstrapping) to λ =1 (no bootstrapping)

  11. 4 examples of the effect of bootstrapping suggest that λ = 1 (no bootstrapping) is a very poor choice • [Figure: four performance curves as a function of λ, from pure bootstrapping (λ = 0) to no bootstrapping (λ = 1); lower is better in all cases, and the red points mark the cases of no bootstrapping] • We need bootstrapping!

  12. Desiderata: We want a TD algorithm that • Bootstraps (genuine TD) • Works with linear function approximation 
 (stable, reliably convergent) • Is simple, like linear TD — O(n) • Learns fast, like linear TD • Can learn off-policy • Learns from online causal trajectories 
 (no repeat sampling from the same state)

  13. 4 easy steps to stochastic gradient descent 1. Pick an objective function J(θ), a parameterized function to be minimized 2. Use calculus to analytically compute the gradient ∇θ J(θ) 3. Find a "sample gradient" ∇θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient 4. Take small steps proportional to the sample gradient: θ ← θ − α ∇θ J_t(θ)
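
A minimal sketch of the four steps on an assumed toy objective, J(θ) = E[(xᵀθ − y)²] (linear regression); the data model, step size, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
theta = np.zeros(2)
alpha = 0.05

for t in range(2000):
    x = rng.normal(size=2)                    # sample one example
    y = x @ true_theta + 0.1 * rng.normal()   # noisy target
    sample_grad = 2 * (x @ theta - y) * x     # gradient of the per-sample loss
    theta -= alpha * sample_grad              # step 4: small step along -gradient
print(theta)   # ≈ [2, -1]
```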

  14. Conventional TD is not the gradient of anything • TD(0) algorithm: Δθ = αδφ, with δ = r + γθᵀφ′ − θᵀφ • Assume there is a J such that ∂J/∂θ_i = δφ_i • Then look at the second derivatives: ∂²J/∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ′_j − φ_j)φ_i, but ∂²J/∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ′_i − φ_i)φ_j • Contradiction: these are not equal in general, yet real 2nd derivatives must be symmetric (Etienne Barnard 1993)
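
A quick numeric check of the asymmetry above, with assumed concrete values φ = (1, 0), φ′ = (0, 1), γ = 0.9:

```python
gamma = 0.9
phi = [1.0, 0.0]
phi_next = [0.0, 1.0]

# d(delta * phi_i)/d theta_j = (gamma * phi'_j - phi_j) * phi_i
d2J_10 = (gamma * phi_next[1] - phi[1]) * phi[0]   # would-be d2J / dtheta_1 dtheta_0
d2J_01 = (gamma * phi_next[0] - phi[0]) * phi[1]   # would-be d2J / dtheta_0 dtheta_1
print(d2J_10, d2J_01)   # 0.9 vs 0.0: not symmetric, so no such J exists
```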

  15. A-split example (Dayan 1992) • [Figure: from A, 50% to B with reward 0 and 50% to a terminal state with reward 0; from B, 100% to a terminal state with reward 1] • Clearly, the true values are V(A) = 0.5, V(B) = 1 • But if you minimize the naive objective fn, J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3 • Even in the tabular case (no FA)
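
A small numeric check of the claim above: gradient descent on the naive objective E[δ²] in the tabular case, using the episode structure from the figure; the step size and iteration count are arbitrary:

```python
import numpy as np

V = np.zeros(2)          # V[0] = V(A), V[1] = V(B); terminal values are 0
alpha = 0.1

for _ in range(5000):
    grad = np.zeros(2)
    # Episode A -> B -> terminal, probability 0.5: deltas on (A,0,B) and (B,1,T)
    d1 = 0.0 + V[1] - V[0]
    d2 = 1.0 + 0.0 - V[1]
    grad[0] += 0.5 * 2 * d1 * (-1.0)
    grad[1] += 0.5 * (2 * d1 * 1.0 + 2 * d2 * (-1.0))
    # Episode A -> terminal with reward 0, probability 0.5: delta on (A,0,T)
    d3 = 0.0 + 0.0 - V[0]
    grad[0] += 0.5 * 2 * d3 * (-1.0)
    V -= alpha * grad

print(V)   # ≈ [1/3, 2/3], not the true values [0.5, 1.0]
```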

  16. Indistinguishable pairs of MDPs • [Figure, left pair:] These two have different Value Errors, but the same Return Errors (both errors have the same minima): J_RE(θ)² = J_VE(θ)² + E[ (v_π(S_t) − G_t)² | A_{t:∞} ∼ π ] • [Figure, right pair: two small MDPs over states A and B with rewards −1, 0, and 1:] These two have different Bellman Errors, but the same Projected Bellman Errors (the errors have different minima)

  17. Not all objectives can be estimated from data; not all minima can be found by learning • [Figure: two MDPs (MDP 1, MDP 2) that induce the same data distribution P_µ(ξ); the TDE, PBE, and RE objectives are determined by the data distribution, but the BE and VE objectives differ between the two MDPs and have different minima θ*] • No learning algorithm can find the minimum of the Bellman Error

  18. The Gradient-TD Family of Algorithms • True gradient-descent algorithms in the Projected Bellman Error • GTD( λ ) and GQ( λ ), for learning V and Q • Solve two open problems: • convergent linear-complexity off-policy TD learning • convergent non-linear TD • Extended to control variate, proximal forms by Mahadevan et al.

  19. First relate the geometry to the iid statistics • Φ is the matrix of the feature vectors for all states, D is the diagonal matrix of the state distribution, and Π = Φ(ΦᵀDΦ)⁻¹ΦᵀD is the projection • Two useful identities: ΦᵀD(TV_θ − V_θ) = E[δφ] and ΦᵀDΦ = E[φφᵀ] • [Figure: V_θ, TV_θ, and ΠTV_θ, with the RMSPBE as the distance from V_θ to ΠTV_θ] • Then
MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D = ‖Π(V_θ − TV_θ)‖²_D   (since ΠV_θ = V_θ)
= (Π(V_θ − TV_θ))ᵀ D (Π(V_θ − TV_θ))
= (V_θ − TV_θ)ᵀ Πᵀ D Π (V_θ − TV_θ)
= (V_θ − TV_θ)ᵀ Dᵀ Φ(ΦᵀDΦ)⁻¹ΦᵀD (V_θ − TV_θ)
= (ΦᵀD(TV_θ − V_θ))ᵀ (ΦᵀDΦ)⁻¹ ΦᵀD(TV_θ − V_θ)
= E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ]
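
A small sketch verifying the identity above numerically; the MRP (P, r), features Φ, state weighting d, and θ are all made-up test data (the identity holds for any positive state weighting):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_feats, gamma = 5, 3, 0.9

P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
r = rng.normal(size=n_states)                 # expected reward per state
Phi = rng.normal(size=(n_states, n_feats))
theta = rng.normal(size=n_feats)
d = rng.random(n_states); d /= d.sum()        # state distribution
D = np.diag(d)

V = Phi @ theta
TV = r + gamma * P @ V                        # Bellman operator applied to V
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

# Geometric form: squared D-weighted norm of the projected Bellman error
diff = Pi @ (TV - V)
mspbe_geom = diff @ D @ diff

# Statistical form: E[delta*phi]^T E[phi phi^T]^-1 E[delta*phi]
E_dphi = Phi.T @ D @ (TV - V)
E_phiphi = Phi.T @ D @ Phi
mspbe_stat = E_dphi @ np.linalg.inv(E_phiphi) @ E_dphi

print(np.isclose(mspbe_geom, mspbe_stat))     # True
```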

  20. Derivation of the TDC algorithm • For a transition s → s′ with reward r and features φ → φ′, and δ = r + γφ′ᵀθ − φᵀθ:
Δθ = −½ α ∇θ ‖V_θ − ΠTV_θ‖²_D = −½ α ∇θ J(θ)
= −½ α ∇θ ( E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ] )
= −α (∇θ E[δφ])ᵀ E[φφᵀ]⁻¹ E[δφ]
= −α E[(γφ′ − φ)φᵀ] E[φφᵀ]⁻¹ E[δφ]
= α ( E[φφᵀ] − γ E[φ′φᵀ] ) E[φφᵀ]⁻¹ E[δφ]
= α E[δφ] − αγ E[φ′φᵀ] E[φφᵀ]⁻¹ E[δφ]
≈ α E[δφ] − αγ E[φ′φᵀ] w   (this is the trick: w is a second set of weights estimating E[φφᵀ]⁻¹E[δφ])
≈ αδφ − αγφ′(φᵀw)   (sampling)

  21. TD with gradient correction (TDC) algorithm, aka GTD(0) • On each transition s → s′ with reward r and features φ → φ′, update two parameters:
θ ← θ + αδφ − αγφ′(φᵀw)   (TD(0) with gradient correction)
w ← w + β(δ − φᵀw)φ   (φᵀw is an estimate of the TD error δ for the current state φ)
• where, as usual, δ = r + γθᵀφ′ − θᵀφ
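
A minimal sketch of these updates on the same Baird setup as the earlier divergence sketch, with importance-sampling ratios ρ folded into the off-policy form of the updates (θ update scaled by ρ, and ρδ used in the w update); step sizes, step count, and the MSPBE check are illustrative choices:

```python
import numpy as np

np.random.seed(0)
GAMMA, ALPHA, BETA = 0.99, 0.005, 0.05

PHI = np.zeros((7, 8))
for s in range(6):
    PHI[s, s], PHI[s, 7] = 2.0, 1.0
PHI[6, 6], PHI[6, 7] = 1.0, 2.0

# Under the target policy (always "solid") every state leads to the lower state.
P_target = np.zeros((7, 7)); P_target[:, 6] = 1.0
D = np.eye(7) / 7.0              # behavior state distribution is uniform here

def mspbe(theta):
    v = PHI @ theta
    e_dphi = PHI.T @ D @ (GAMMA * P_target @ v - v)      # all rewards are 0
    return e_dphi @ np.linalg.inv(PHI.T @ D @ PHI) @ e_dphi

theta = np.array([1, 1, 1, 1, 1, 1, 10, 1], dtype=float)
w = np.zeros(8)
state = np.random.randint(7)
print("initial MSPBE:", mspbe(theta))

for step in range(200000):
    if np.random.rand() < 6 / 7:
        action, next_state = "dashed", np.random.randint(6)
    else:
        action, next_state = "solid", 6
    rho = 7.0 if action == "solid" else 0.0              # pi(a|s) / mu(a|s)

    phi, phi_next = PHI[state], PHI[next_state]
    delta = 0.0 + GAMMA * phi_next @ theta - phi @ theta
    theta += ALPHA * rho * (delta * phi - GAMMA * phi_next * (phi @ w))
    w += BETA * (rho * delta - phi @ w) * phi
    state = next_state

print("final MSPBE:", mspbe(theta))    # driven toward 0, unlike plain TD(0) above
```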

  22. Convergence theorems • All algorithms converge w.p.1 to the TD fixpoint: E[δφ] → 0 • GTD and GTD-2 converge on a single time scale: α = β → 0 • TDC converges in a two-time-scale sense: α, β → 0 and α/β → 0

  23. Off-policy result: Baird’s counter-example • [Figure: parameter components and √MSPBE plotted over sweeps for TD and the gradient algorithms] • Gradient algorithms converge. TD diverges.

  24. Computer Go experiment • Learn a linear value function (probability of winning) for 9x9 Go from self play • One million features, each corresponding to a template on a part of the Go board • An established experimental testbed • [Figure: ‖E[Δθ_TD]‖ versus step size α for TD, GTD, GTD2, and TDC]
