Q-learning Algorithms for Optimal Stopping Based on Least Squares

Q-learning Algorithms for Optimal Stopping Based on Least Squares - PowerPoint PPT Presentation



  1. Q-learning Algorithms for Optimal Stopping Based on Least Squares
     H. Yu [1] and D. P. Bertsekas [2]
     [1] Department of Computer Science, University of Helsinki
     [2] Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
     European Control Conference, Kos, Greece, 2007

  2. Outline
     • Introduction: optimal stopping problems; preliminaries
     • Least Squares Q-Learning: algorithm; convergence; convergence rate
     • Variants with Reduced Computation: motivation; first variant; second variant

  3. Basic Problem and Bellman Equation
     • An irreducible Markov chain with n states and transition matrix P. Action: stop or continue. Cost at state i: c(i) if stop, g(i) if continue. Minimize the expected discounted total cost until stopping.
     • Bellman equations in vector notation: [1]
           J∗ = min{c, g + αPJ∗},    Q∗ = g + αP min{c, Q∗}
       Optimal policy: stop as soon as the state hits the set D = {i | c(i) ≤ Q∗(i)}.
     • Applications: search, sequential hypothesis testing, finance.
     • Focus of this paper: Q-learning with linear function approximation. [2]
     [1] α: discount factor; J∗: optimal cost; Q∗: Q-factor for the continuation action (the cost of continuing for the first stage and using an optimal stopping policy in the remaining stages).
     [2] Q-learning aims to find the Q-factor of each state-action pair, i.e., the vector Q∗ (the Q-factor vector for the stop action is c).
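
     To make the Bellman equation concrete, here is a minimal sketch (not from the paper) that solves Q∗ = g + αP min{c, Q∗} by value iteration on a small chain; the transition matrix, costs, and discount factor are made-up illustrative assumptions.

     ```python
     import numpy as np

     # Hypothetical 4-state example (P, g, c, alpha are illustrative, not from the paper).
     P = np.array([[0.5, 0.5, 0.0, 0.0],
                   [0.2, 0.3, 0.5, 0.0],
                   [0.0, 0.2, 0.3, 0.5],
                   [0.5, 0.0, 0.0, 0.5]])
     g = np.array([1.0, 2.0, 1.5, 0.5])   # cost of continuing
     c = np.array([3.0, 1.0, 4.0, 2.0])   # cost of stopping
     alpha = 0.9

     # Value iteration on the Q-factor equation: Q* = g + alpha * P * min{c, Q*}.
     Q = np.zeros(4)
     for _ in range(1000):
         Q_new = g + alpha * P @ np.minimum(c, Q)
         if np.max(np.abs(Q_new - Q)) < 1e-10:
             break
         Q = Q_new

     J = np.minimum(c, Q)            # optimal cost J* = min{c, Q*}
     stop_set = np.where(c <= Q)[0]  # optimal policy: stop on D = {i : c(i) <= Q*(i)}
     print(Q, J, stop_set)
     ```

     Since the mapping on the right-hand side is a sup-norm contraction with modulus α, the iteration converges to Q∗ from any starting point.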

  4. Q-Learning with Function Approximation (Tsitsiklis and Van Roy 1999)
     • Subspace approximation: [3] Q ≈ Φr, where Φ is the n × s matrix with rows φ(i)′, i.e., Q(i, r) = φ(i)′r.
     • Weighted Euclidean projection: ΠQ = Φr̂, where r̂ = arg min_{r ∈ ℜ^s} ‖Q − Φr‖_π and π = (π(1), ..., π(n)) is the invariant distribution of P.
     • Key fact: the DP mapping F, defined by FQ = g + αP min{c, Q}, is a ‖·‖_π-contraction, and so is ΠF.
     • Temporal difference (TD) learning solves the projected Bellman equation: Φr∗ = ΠF(Φr∗).
     • Suboptimal policy µ: stop as soon as the state hits the set {i | c(i) ≤ φ(i)′r∗}. [4] Error bound:
           Σ_{i=1}^{n} (J_µ(i) − J∗(i)) π(i) ≤ ( 2 / ((1 − α)√(1 − α²)) ) ‖ΠQ∗ − Q∗‖_π
     [3] Assume that Φ has linearly independent columns.
     [4] Denote by J_µ the cost of this policy.
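
     A small sketch of the π-weighted projection onto the feature subspace; Phi is an assumed feature matrix with linearly independent columns, and the helper names (invariant_distribution, weighted_projection) are illustrative, not from the paper.

     ```python
     import numpy as np

     def invariant_distribution(P):
         """Stationary distribution pi of an irreducible chain: left eigenvector of P for eigenvalue 1."""
         w, v = np.linalg.eig(P.T)
         pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
         return pi / pi.sum()

     def weighted_projection(Q, Phi, pi):
         """Pi Q = Phi r_hat, with r_hat = argmin_r ||Q - Phi r||_pi (solved via the normal equations)."""
         D = np.diag(pi)
         r_hat = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ Q)
         return Phi @ r_hat
     ```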

  5. Basis of Least Squares Methods I: Projected Value Iteration
     • Simulation: (x_0, x_1, ...) is the unstopped state process; it implicitly approximates ΠF with increasing accuracy.
     • Projected value iteration and LSPE (Bertsekas and Ioffe 1996): [5]
           Projected value iteration:  Φr_{t+1} = ΠF(Φr_t)
           LSPE:                       Φr_{t+1} = Π̂_t F̂_t(Φr_t) = ΠF(Φr_t) + ε_t
     • [Figure: two panels, "Projected Value Iteration" and "Least Squares Policy Evaluation (LSPE)", each showing the value iterate F(Φr_t) projected onto S, the subspace spanned by the basis functions, to obtain Φr_{t+1}; the LSPE panel adds a simulation error.]
     [5] Roughly speaking, Π̂_t F̂_t → ΠF and ε_t → 0 as t → ∞.
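
     A sketch of the exact (model-based) projected value iteration Φr_{t+1} = ΠF(Φr_t); the simulation-based LSPE iteration would replace Π and F by their sample-based approximations. P, g, c, alpha, Phi, pi are the assumed quantities from the earlier sketches.

     ```python
     import numpy as np

     def projected_value_iteration(Phi, P, g, c, alpha, pi, num_iters=500):
         """Iterate Phi r_{t+1} = Pi F(Phi r_t), where F(Q) = g + alpha * P * min{c, Q}
         and Pi is the pi-weighted projection onto the column span of Phi."""
         D = np.diag(pi)
         A = Phi.T @ D @ Phi                  # Gram matrix of the projection
         r = np.zeros(Phi.shape[1])
         for _ in range(num_iters):
             FQ = g + alpha * P @ np.minimum(c, Phi @ r)
             r = np.linalg.solve(A, Phi.T @ D @ FQ)
         return r
     ```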

  6. Basis of Least Squares Methods II: Solving an Approximate Projected Bellman Equation
     • LSTD (Bradtke and Barto 1996, Boyan 1999): find r_{t+1} solving an approximate projected Bellman equation
           Φr_{t+1} = Π̂_t F̂_t(Φr_{t+1})
       Not viable for optimal stopping, because F is non-linear. [6]
     • Comparison with the temporal difference learning algorithm (Tsitsiklis and Van Roy 1999): [7]
           r_{t+1} = r_t + γ_t φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})′r_t} − φ(x_t)′r_t )
       – TD: uses each sample state only once; by averaging over a long time interval, it approximately performs the mapping ΠF.
       – Least squares (LS) methods: use the past information effectively; no need to store the past (in the policy evaluation context).
     [6] In the case of policy evaluation this is a linear equation and can be solved efficiently.
     [7] Abusing notation, we denote by g(i, j) the one-stage cost of transitioning from state i to j under the continuation action.
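
     A sketch of the TD recursion quoted above, run along a simulated unstopped trajectory; the per-state feature table phi, the transition-cost table g, and the diminishing stepsize schedule are assumptions made for illustration.

     ```python
     import numpy as np

     def td_optimal_stopping(traj, phi, g, c, alpha, s):
         """TD update of Tsitsiklis and Van Roy (1999) for optimal stopping:
         r_{t+1} = r_t + gamma_t * phi(x_t) * (g(x_t, x_{t+1})
                   + alpha * min{c(x_{t+1}), phi(x_{t+1})' r_t} - phi(x_t)' r_t)."""
         r = np.zeros(s)
         for t in range(len(traj) - 1):
             x, y = traj[t], traj[t + 1]
             gamma_t = 1.0 / (t + 1)          # assumed diminishing stepsize
             td_err = g[x, y] + alpha * min(c[y], phi[y] @ r) - phi[x] @ r
             r = r + gamma_t * phi[x] * td_err
         return r
     ```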

  7. Least Squares Q-Learning: The Algorithm
     • (x_0, x_1, ...) is the unstopped state process; γ ∈ (0, 2/(1 + α)) is a constant stepsize.
           r_{t+1} = r_t + γ (r̂_{t+1} − r_t)                                                            (1)
       where r̂_{t+1} is the least squares solution
           r̂_{t+1} = arg min_{r ∈ ℜ^s} Σ_{k=0}^{t} ( φ(x_k)′r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})′r_t} )²     (2)
     • r̂_{t+1} can be computed almost recursively:
           r̂_{t+1} = ( Σ_{k=0}^{t} φ(x_k)φ(x_k)′ )⁻¹ Σ_{k=0}^{t} φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})′r_t} )
       except that the calculation of min{c(x_{k+1}), φ(x_{k+1})′r_t}, k ≤ t, requires repartitioning the past states into stopping and continuation sets (a remedy will be discussed later).
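
     A sketch of iterations (1)-(2) in the matrix form above, reusing the assumed trajectory and feature/cost tables from the earlier sketches; the inner loop over past transitions makes the per-step repartitioning overhead explicit.

     ```python
     import numpy as np

     def ls_q_learning(traj, phi, g, c, alpha, s, gamma=1.0):
         """Least squares Q-learning: r_{t+1} = r_t + gamma * (r_hat_{t+1} - r_t),
         with r_hat_{t+1} the least squares solution (2) in its matrix form.
         The matrix B is accumulated recursively; the min{c, phi' r_t} terms must be
         recomputed over all past states at every iteration (the repartitioning cost)."""
         r = np.zeros(s)
         B = 1e-6 * np.eye(s)            # sum_k phi(x_k) phi(x_k)', regularized for invertibility
         for t in range(len(traj) - 1):
             x = traj[t]
             B += np.outer(phi[x], phi[x])
             b = np.zeros(s)
             for k in range(t + 1):      # repartition: reevaluate min{c, phi' r_t} for every k <= t
                 xk, yk = traj[k], traj[k + 1]
                 b += phi[xk] * (g[xk, yk] + alpha * min(c[yk], phi[yk] @ r))
             r_hat = np.linalg.solve(B, b)
             r = r + gamma * (r_hat - r) # gamma = 1 lies in the convergence range (0, 2/(1 + alpha))
         return r
     ```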

  8. Convergence Analysis
     • Express the LS solution in matrix notation as [8]
           Φr̂_{t+1} = Π̂_t F̂_t(Φr_t) = Π̂_t ( ĝ_t + α P̂_t min{c, Φr_t} )                                (3)
     • With probability 1 (w.p.1), for all t sufficiently large:
       – Π̂_t F̂_t is a ‖·‖_π-contraction with modulus α̂ ∈ (α, 1);
       – (1 − γ)I + γ Π̂_t F̂_t is a ‖·‖_π-contraction for γ ∈ (0, 2/(1 + α)).
     • Proposition: For all γ ∈ (0, 2/(1 + α)), r_t → r∗ as t → ∞, w.p.1.
     • Note: the unit stepsize is in the convergence range.
     [8] Here ĝ_t, Π̂_t, and P̂_t are increasingly accurate simulation-based approximations of g, Π, and P, respectively.

  9. Comparison to an LSTD Analogue
     • LS Q-learning:   Φr_{t+1} = (1 − γ)Φr_t + γ Π̂_t F̂_t(Φr_t)                                       (4)
     • LSTD analogue:   Φr̃_{t+1} = Π̂_t F̂_t(Φr̃_{t+1})                                                  (5)
     • Eq. (4) is one single fixed point iteration for solving Eq. (5). Yet the LS Q-learning algorithm and the idealized LSTD algorithm have the same convergence rate [two-time-scale argument, similar to a comparison analysis of LSPE/LSTD (Yu and Bertsekas 2006)]. [9]
     • Proposition: For all γ ∈ (0, 2/(1 + α)),   lim sup_{t→∞} t ‖Φr_t − Φr̃_t‖ < ∞, w.p.1.
     • Implications, for all stepsizes γ in the convergence range:
       – empirical phenomenon: r_t "tracks" r̃_t;
       – more precisely: r_t − r̃_t → 0 at the rate of O(1/t), faster than the rate O(1/√t) at which r_t and r̃_t converge to r∗.
     [9] A coarse explanation: r̃_{t+1} changes slowly, at the rate of O(1/t), and can be viewed as if "frozen" for iteration (4), which, being a contraction mapping, converges at a geometric rate to the vicinity of the "fixed point" r̃_{t+1}.

  10. Variants with Reduced Computation: Motivation
     • The LS solution
           r̂_{t+1} = ( Σ_{k=0}^{t} φ(x_k)φ(x_k)′ )⁻¹ Σ_{k=0}^{t} φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})′r_t} )
       requires extra overhead per iteration: the terms min{c(x_{k+1}), φ(x_{k+1})′r_t}, k ≤ t, must be recomputed, i.e., the past states must be repartitioned.
     • We introduce algorithms with limited repartitioning, at the expense of a likely worse asymptotic convergence rate.

  11. First Variant: Forgo Repartitioning (with an Optimistic Policy Iteration Flavor)
     • Set of past stopping decisions for the state samples:
           K = { k | c(x_{k+1}) ≤ φ(x_{k+1})′r_k }
     • Replace the terms min{c(x_{k+1}), φ(x_{k+1})′r_t}, k ≤ t, by
           q̃(x_{k+1}, r_t) = c(x_{k+1})       if k ∈ K,
                             φ(x_{k+1})′r_t    if k ∉ K.
     • Algorithm:
           r_{t+1} = ( Σ_{k=0}^{t} φ(x_k)φ(x_k)′ )⁻¹ [ Σ_{k=0}^{t} φ(x_k) g(x_k, x_{k+1}) + α Σ_{k≤t, k∈K} φ(x_k) c(x_{k+1}) + α ( Σ_{k≤t, k∉K} φ(x_k)φ(x_{k+1})′ ) r_t ]
     • Can be computed recursively; an LSTD approach is also applicable. [10] But we have no proof of convergence at present. [11]
     [10] This is because the right-hand side above is linear in r_t.
     [11] Note that if the algorithm converges, it converges to the correct solution r∗.
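
     A sketch of the first variant; the stop/continue classification of each sample is frozen when the sample arrives, so every accumulated quantity can be updated recursively. Variable names and the small regularization of B are illustrative assumptions.

     ```python
     import numpy as np

     def first_variant(traj, phi, g, c, alpha, s):
         """First variant: classify each sample once, at the time it is generated
         (k in K iff c(x_{k+1}) <= phi(x_{k+1})' r_k), so the right-hand side is
         linear in r_t and all sums are maintained recursively."""
         r = np.zeros(s)
         B = 1e-6 * np.eye(s)   # sum_k phi(x_k) phi(x_k)' (regularized)
         b_g = np.zeros(s)      # sum_k phi(x_k) g(x_k, x_{k+1})
         b_c = np.zeros(s)      # sum over k in K of phi(x_k) c(x_{k+1})
         C = np.zeros((s, s))   # sum over k not in K of phi(x_k) phi(x_{k+1})'
         for t in range(len(traj) - 1):
             x, y = traj[t], traj[t + 1]
             B += np.outer(phi[x], phi[x])
             b_g += phi[x] * g[x, y]
             if c[y] <= phi[y] @ r:          # classify this sample once, using r_t
                 b_c += phi[x] * c[y]
             else:
                 C += np.outer(phi[x], phi[y])
             r = np.linalg.solve(B, b_g + alpha * b_c + alpha * C @ r)
         return r
     ```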

  12. Second Variant: Repartition within a Finite Window
     • Repartition at most m times per state sample, where m ≥ 1 is the window size.
     • Replace the terms min{c(x_{k+1}), φ(x_{k+1})′r_t}, k ≤ t, by
           min{c(x_{k+1}), φ(x_{k+1})′r_{l_{k,t}}},    l_{k,t} = min{k + m − 1, t}
     • Algorithm:
           r_{t+1} = arg min_{r ∈ ℜ^s} Σ_{k=0}^{t} ( φ(x_k)′r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})′r_{l_{k,t}}} )²     (6)
     • Special cases:
       – m → ∞: the LS Q-learning algorithm;
       – m = 1: the fixed point Kalman filter (TD with scaling) of Choi and Van Roy (2006),
             r_{t+1} = r_t + (1/(t + 1)) B_t⁻¹ φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})′r_t} − φ(x_t)′r_t )
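
     A sketch of the m = 1 special case quoted above; B_t is taken here to be the running average of φ(x_k)φ(x_k)′, which is an assumption about notation not spelled out on the slide, and the small regularization only keeps the matrix invertible in the first few steps.

     ```python
     import numpy as np

     def fixed_point_kalman_filter(traj, phi, g, c, alpha, s):
         """m = 1 case of the windowed variant: each sample's min{c, phi' r} term is
         evaluated only once, with the parameter vector available at that time
         (fixed point Kalman filter / TD with scaling, Choi and Van Roy 2006)."""
         r = np.zeros(s)
         S = 1e-6 * np.eye(s)          # running sum of phi(x_k) phi(x_k)' (regularized)
         for t in range(len(traj) - 1):
             x, y = traj[t], traj[t + 1]
             S += np.outer(phi[x], phi[x])
             B = S / (t + 1)           # assumed: B_t = average of phi(x_k) phi(x_k)'
             td_err = g[x, y] + alpha * min(c[y], phi[y] @ r) - phi[x] @ r
             r = r + (1.0 / (t + 1)) * np.linalg.solve(B, phi[x]) * td_err
         return r
     ```

     Larger windows m would interpolate between this single-pass update and the full LS Q-learning iteration, trading repartitioning work per sample against asymptotic accuracy of the least squares fit.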
