

  1. Linear Programming for Large-Scale Markov Decision Problems
  Yasin Abbasi-Yadkori (1), Peter Bartlett (1,2), Alan Malek (2)
  (1) Queensland University of Technology, Brisbane, QLD, Australia
  (2) University of California, Berkeley, Berkeley, CA
  June 24th, 2014

  2. Outline
  1. Introduce MDPs and the linear program formulation
  2. Algorithm
  3. Oracle inequality
  4. Experiments

  3. Markov Decision Processes
  A Markov Decision Process is specified by:
  - State space $\mathcal{X} = \{1, \dots, X\}$
  - Action space $\mathcal{A} = \{1, \dots, A\}$
  - Transition kernel $P : \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$
  - Loss function $\ell : \mathcal{X} \times \mathcal{A} \to [0, 1]$
  Let $P^{\pi}$ be the state transition kernel under policy $\pi : \mathcal{X} \to \Delta_{\mathcal{A}}$. Our goal is to choose $\pi$ to minimize the average loss when $X$ and $A$ are very large, so we aim for optimality within a restricted family of policies.
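To make these objects concrete, here is a minimal NumPy sketch (not from the talk; the small sizes and the random instance are illustrative assumptions) of a tabular MDP: a transition kernel, a loss in [0, 1], a stochastic policy, the induced state kernel $P^{\pi}$, and the average loss under its stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A = 5, 3                                   # illustrative small sizes

# Transition kernel: P[x, a, :] is a distribution over next states.
P = rng.random((X, A, X))
P /= P.sum(axis=2, keepdims=True)

# Loss in [0, 1] for every state-action pair.
loss = rng.random((X, A))

# A stochastic policy: pi[x, :] is a distribution over actions.
pi = rng.random((X, A))
pi /= pi.sum(axis=1, keepdims=True)

# State transition kernel under pi: P_pi[x, x'] = sum_a pi(a|x) P(x'|x, a).
P_pi = np.einsum('xa,xay->xy', pi, P)

# Average loss of pi: stationary distribution of P_pi, weighted by the expected per-state loss.
evals, evecs = np.linalg.eig(P_pi.T)
stationary = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
stationary /= stationary.sum()
print("average loss:", stationary @ (pi * loss).sum(axis=1))
```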

  4. Linear Program Formulation
  LP formulation (Manne 1960):
  $$\max_{\lambda, h} \ \lambda \quad \text{s.t.} \quad B(\lambda \mathbf{1} + h) \le \ell + P h, \tag{1}$$
  where $B \in \{0,1\}^{XA \times X}$ is the marginalization matrix, $B_{(x,a),x'} = \mathbb{I}\{x' = x\}$.
  Primal variables: $h$ is the cost-to-go, $\lambda$ is the average cost.
  Dual:
  $$\min_{\mu \in \mathbb{R}^{XA}} \ \ell^{\top}\mu \quad \text{s.t.} \quad \mathbf{1}^{\top}\mu = 1, \ \mu \ge 0, \ (P - B)^{\top}\mu = 0. \tag{2}$$
  Dual variables: $\mu$ is a stationary distribution over $\mathcal{X} \times \mathcal{A}$; define the policy via $\pi(a \mid x) \propto \mu(x, a)$.
  Still a problem when $X$ and $A$ are very large.
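For a small instance, the dual LP (2) can be handed to an off-the-shelf solver. The sketch below is an illustration, not part of the talk: it builds $P$, the marginalization matrix $B$, and $\ell$ in flattened $(x,a)$ indexing and solves (2) with scipy.optimize.linprog; the random MDP is an assumption made for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X, A = 5, 3

# Flattened indexing: row (x, a) sits at x*A + a.
# P_mat[(x,a), x'] = P(x'|x, a);  B[(x,a), x'] = 1{x' = x};  ell[(x,a)] = loss(x, a).
P_mat = rng.random((X * A, X))
P_mat /= P_mat.sum(axis=1, keepdims=True)
B = np.repeat(np.eye(X), A, axis=0)
ell = rng.random(X * A)

# Dual LP (2): min ell^T mu  s.t.  1^T mu = 1,  (P - B)^T mu = 0,  mu >= 0.
A_eq = np.vstack([np.ones((1, X * A)), (P_mat - B).T])
b_eq = np.concatenate([[1.0], np.zeros(X)])
res = linprog(ell, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))

mu = res.x.reshape(X, A)
pi = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)   # pi(a|x) proportional to mu(x, a)
print("optimal average loss:", res.fun)
```

The solver handles O(XA) variables and constraints, which is exactly the dependence the rest of the talk works around.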

  5. The Dual ALP
  Feature matrix $\Phi \in \mathbb{R}^{XA \times d}$; constrain $\mu = \Phi\theta$:
  $$\min_{\theta} \ \ell^{\top}\Phi\theta \quad \text{s.t.} \quad \mathbf{1}^{\top}\Phi\theta = 1, \ \Phi\theta \ge 0, \ (P - B)^{\top}\Phi\theta = 0. \tag{3}$$
  Define the policy via $\pi_{\theta}(a \mid x) \propto [(\Phi\theta)(x, a)]_{+}$, where $[\cdot]_{+}$ is the positive part.
  $\mu_{\theta}$ is the stationary distribution of $P^{\pi_{\theta}}$, and $\mu_{\theta} \approx \Phi\theta$.
  $\ell^{\top}\mu_{\theta}$ is the average loss of policy $\pi_{\theta}$; we want to compete with $\min_{\theta} \ell^{\top}\mu_{\theta}$.
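The map from $\theta$ to a policy is the piece of the dual ALP needed at run time. A small helper might look like the sketch below; the uniform fallback for states where every entry of $\Phi\theta$ is clipped to zero is our assumption, since the slide leaves that corner case open.

```python
import numpy as np

def policy_from_theta(theta, Phi, X, A):
    """pi_theta(a|x) proportional to [(Phi theta)(x, a)]_+ ; rows of Phi indexed by x*A + a.

    Falling back to a uniform action distribution when every entry for a
    state is clipped to zero is an assumption made here, not the talk's rule.
    """
    v = np.maximum(Phi @ theta, 0.0).reshape(X, A)
    v[v.sum(axis=1) == 0] = 1.0
    return v / v.sum(axis=1, keepdims=True)

# Example: 4 states, 2 actions, 3 features.
rng = np.random.default_rng(0)
print(policy_from_theta(rng.standard_normal(3), rng.random((8, 3)), X=4, A=2))
```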

  6. Reducing Constraints
  Still intractable: a $d$-dimensional problem, but $O(XA)$ constraints.
  Form the convex cost function
  $$c(\theta) = \ell^{\top}\Phi\theta + \big\|[\Phi\theta]_{-}\big\|_{1} + \big\|(P - B)^{\top}\Phi\theta\big\|_{1} = \ell^{\top}\Phi\theta + \sum_{(x,a)} \big|[\Phi_{(x,a),:}\theta]_{-}\big| + \sum_{x'} \big|(\Phi\theta)^{\top}(P - B)_{:,x'}\big|.$$
  Sample $(x_t, a_t) \sim q_1$ and $y_t \sim q_2$. Unbiased subgradient estimate:
  $$g_t(\theta) = \ell^{\top}\Phi - \frac{\Phi_{(x_t,a_t),:}}{q_1(x_t,a_t)}\,\mathbb{I}\{\Phi_{(x_t,a_t),:}\theta < 0\} + \frac{\big(\Phi^{\top}(P - B)_{:,y_t}\big)^{\top}}{q_2(y_t)}\,\mathrm{sgn}\big((\Phi\theta)^{\top}(P - B)_{:,y_t}\big). \tag{4}$$
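For reference, the exact (unsampled) surrogate cost $c(\theta)$ is easy to write down on a small instance; the sampled estimate (4) is what makes the method scale. A short sketch, with the random MDP and random features being illustrative assumptions:

```python
import numpy as np

def dual_alp_cost(theta, Phi, ell, P, B):
    """c(theta) = ell^T Phi theta + ||[Phi theta]_-||_1 + ||(P - B)^T Phi theta||_1."""
    v = Phi @ theta                          # candidate occupancy vector in R^{XA}
    negativity = np.maximum(-v, 0.0).sum()   # ||[Phi theta]_-||_1
    stationarity = np.abs((P - B).T @ v).sum()
    return ell @ v + negativity + stationarity

# Tiny random instance (illustrative only), rows indexed by x*A + a.
rng = np.random.default_rng(0)
X, A, d = 4, 2, 3
P = rng.random((X * A, X)); P /= P.sum(axis=1, keepdims=True)
B = np.repeat(np.eye(X), A, axis=0)
ell, Phi = rng.random(X * A), rng.random((X * A, d))
print(dual_alp_cost(rng.standard_normal(d), Phi, ell, P, B))
```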

  7. The Stochastic Subgradient Method for MDPs
  Input: constants $S, H > 0$, number of rounds $T$. Let $\Pi_{\Theta}$ be the Euclidean projection onto the $S$-radius 2-norm ball.
  Initialize $\theta_1 \propto \mathbf{1}$.
  for $t = 1, 2, \dots, T$ do
      Sample $(x_t, a_t) \sim q_1$ and $y_t \sim q_2$.
      Compute the subgradient estimate $g_t$.
      Update $\theta_{t+1} = \Pi_{\Theta}(\theta_t - \eta_t g_t)$.
  end for
  $\hat{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$.
  Return the policy $\pi_{\hat{\theta}_T}$.
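A self-contained sketch of this loop on a random instance follows. The uniform sampling distributions $q_1$ and $q_2$, the fixed step size, the particular radius $S$, and the use of the unweighted estimate (4) (i.e., no $H$ factor) are assumptions made for illustration, not the talk's tuned setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, d = 20, 4, 6
S, T, eta = 10.0, 5000, 0.01                 # radius, rounds, fixed step size (assumed)

# Random MDP and random nonnegative features, flattened so row (x, a) is x*A + a.
P = rng.random((X * A, X)); P /= P.sum(axis=1, keepdims=True)
B = np.repeat(np.eye(X), A, axis=0)
ell = rng.random(X * A)
Phi = rng.random((X * A, d))

q1 = np.full(X * A, 1.0 / (X * A))           # uniform sampling distributions (assumed)
q2 = np.full(X, 1.0 / X)

theta = np.ones(d) / d                       # theta_1 proportional to 1
theta_sum = np.zeros(d)
for t in range(T):
    xa = rng.integers(X * A)                 # (x_t, a_t) ~ q1
    y = rng.integers(X)                      # y_t ~ q2
    g = ell @ Phi                            # first term of (4)
    if Phi[xa] @ theta < 0:                  # negativity-penalty term of (4)
        g = g - Phi[xa] / q1[xa]
    col = P[:, y] - B[:, y]                  # (P - B)_{:, y_t}
    g = g + (col @ Phi) * np.sign(col @ (Phi @ theta)) / q2[y]   # stationarity-penalty term
    theta = theta - eta * g
    norm = np.linalg.norm(theta)             # Euclidean projection onto the S-radius ball
    if norm > S:
        theta *= S / norm
    theta_sum += theta

theta_bar = theta_sum / T                    # averaged iterate
mu_hat = np.maximum(Phi @ theta_bar, 0.0).reshape(X, A)
pi_theta = mu_hat / np.maximum(mu_hat.sum(axis=1, keepdims=True), 1e-12)
print("ell^T Phi theta_bar:", ell @ (Phi @ theta_bar))
```

Each iteration touches only one sampled row of $\Phi$ and one sampled column of $P - B$, so the per-step cost is independent of $XA$.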

  8. Theorem
  Given some $\epsilon > 0$, the $\hat{\theta}_T$ produced by the stochastic subgradient method after $T = 1/\epsilon^{4}$ steps satisfies
  $$\ell^{\top}\mu_{\hat{\theta}_T} \le \min_{\theta \in \Theta}\left(\ell^{\top}\mu_{\theta} + \frac{V(\theta)}{\epsilon}\right) + O(\epsilon)$$
  with probability at least $1 - \delta$, where $V = O(V_1 + V_2)$ is a violation function defined by
  $$V_1(\theta) = \big\|[\Phi\theta]_{-}\big\|_{1}, \qquad V_2(\theta) = \big\|(P - B)^{\top}\Phi\theta\big\|_{1}.$$
  The big-O notation hides polynomials in $S$, $d$, $C_1$, $C_2$, and $\log(1/\delta)$.

  9. Comparison with Previous Techniques
  - We bound the performance of the returned policy directly (not through $J$).
  - Previous bounds were of the form $\inf_{\theta} \|J^{*} - \Psi\theta\|$.
  - Our bounds give performance relative to the best policy in the class, without requiring near-optimality of the class.
  - No knowledge of the optimal policy is assumed.
  - First method to make approximations in the dual.

  10. Discussion
  - The awkward $V(\theta)/\epsilon + O(\epsilon)$ can be removed by taking a grid of $\epsilon$ values.
  - Recall
  $$C_1 = \max_{(x,a) \in \mathcal{X}\times\mathcal{A}} \frac{\|\Phi_{(x,a),:}\|}{q_1(x,a)}, \qquad C_2 = \max_{x \in \mathcal{X}} \frac{\|(P - B)_{:,x}^{\top}\Phi\|}{q_2(x)}.$$
  - We also pick $\Phi$ and $q_1$, so we can make $C_1$ small.
  - Making $C_2$ small may require knowledge of $P$ (such as sparsity or some stability assumption).
  - A natural choice of features: state aggregation.
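State aggregation corresponds to a particularly simple feature matrix: each row $\Phi_{(x,a),:}$ is a one-hot indicator of (cluster($x$), $a$), so every row norm is 1. A small constructor sketch, with the cluster assignment being an arbitrary illustrative choice:

```python
import numpy as np

def aggregation_features(cluster_of_state, A):
    """Phi for state aggregation: row (x, a) is a one-hot indicator of (cluster(x), a)."""
    X = len(cluster_of_state)
    k = max(cluster_of_state) + 1            # number of clusters
    Phi = np.zeros((X * A, k * A))
    for x in range(X):
        for a in range(A):
            Phi[x * A + a, cluster_of_state[x] * A + a] = 1.0
    return Phi

# 5 states grouped into 3 clusters, 2 actions: Phi has shape (10, 6).
Phi = aggregation_features([0, 0, 1, 1, 2], A=2)
print(Phi.shape, np.linalg.norm(Phi, axis=1))    # every row norm is 1
```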

  11. Comparison with Constraint Sampling
  The constraint sampling approach of de Farias and Van Roy (2004):
  - must assume feasibility;
  - needs a vector $v$ with $v(x) \ge \big|\big((P - B)^{\top}\Phi\theta\big)(x)\big|$ as an envelope on the constraint violations;
  - has a bound that includes $\|v\|_{1}$, which could be very large;
  - requires specific knowledge about the problem.

  12. Analysis
  Assume fast mixing: for every policy $\pi$, there exists $\tau(\pi) > 0$ such that for all $d, d' \in \Delta_{\mathcal{X}}$,
  $$\|d P^{\pi} - d' P^{\pi}\|_{1} \le e^{-1/\tau(\pi)}\,\|d - d'\|_{1}.$$
  Define
  $$C_1 = \max_{(x,a) \in \mathcal{X}\times\mathcal{A}} \frac{\|\Phi_{(x,a),:}\|}{q_1(x,a)}, \qquad C_2 = \max_{x \in \mathcal{X}} \frac{\|(P - B)_{:,x}^{\top}\Phi\|}{q_2(x)}.$$
  The proof has three main parts:
  1. $V_1(\theta) \le \epsilon_1$ and $V_2(\theta) \le \epsilon_2$ imply $\|\mu_{\theta} - \Phi\theta\|_{1} \le O(\epsilon_1 + \epsilon_2)$.
  2. Bounding the gradient of $c(\theta)$ and checking that the estimate is unbiased.
  3. Applying the stochastic gradient descent theorem: $\ell^{\top}\Phi\hat{\theta} \le \min_{\theta \in \Theta} c(\theta) + O(\epsilon)$.

  13. Proof Part 1
  Lemma. Let $u \in \mathbb{R}^{XA}$ be a vector with
  $$\mathbf{1}^{\top}u = 1, \qquad \|u\|_{1} \le 1 + \epsilon_1, \qquad \|u^{\top}(P - B)\|_{1} \le \epsilon_2.$$
  For the stationary distribution $\mu_u$ of the policy defined by $u_{+} = [u]_{+}/\|[u]_{+}\|_{1}$, we have
  $$\|\mu_u - u\|_{1} \le \tau(u_{+})\,\log(1/\epsilon')\,(2\epsilon' + \epsilon'') + 3\epsilon'.$$
  Proof:
  - The two bounds give $\|(P - B)^{\top}u_{+}\|_{1} \le 2\epsilon_1 + \epsilon_2 =: \epsilon'$.
  - Also, $\|u_{+} - u\|_{1} \le 2\epsilon_1$.
  - Define $M_{u_{+}} \in \mathbb{R}^{X \times XA}$ as the matrix that encodes the policy $u_{+}$, e.g. $M_{u_{+}} P = P^{u_{+}}$.

  14. Proof (continued)
  Let $\mu_0 = u_{+}$, $\mu_t^{\top} = \mu_{t-1}^{\top} P M_{u_{+}}$, and $v_t = \mu_t^{\top}(P - B) = v_{t-1} M_{u_{+}} P$.
  Here $\mu_t$ is the state-action distribution after running the policy for $t$ steps.
  By the previous bound, $\|v_0\|_{1} \le \epsilon'$, which gives $\|v_t\|_{1} \le \epsilon'$ for all $t$.
  $$\mu_t^{\top} = \mu_{t-1}^{\top} P M_{u_{+}} = (\mu_{t-1}^{\top} B + v_{t-1}) M_{u_{+}} = \mu_{t-1}^{\top} + v_{t-1} M_{u_{+}}$$
  Telescoping: $\mu_k^{\top} = \mu_0^{\top} + \sum_{t=0}^{k-1} v_t M_{u_{+}}$.
  Thus, $\|\mu_k - u_{+}\|_{1} \le k\epsilon'$.
  By the mixing assumption, $\|\mu_k - \mu_u\|_{1} \le e^{-k/\tau(u_{+})}$.
  Take $k = \tau(u_{+})\log(1/\epsilon')$ and use the triangle inequality.

  15. Applying the SGD Theorem
  Theorem (Lemma 3.1 of Flaxman et al., 2005). Assume we have:
  - a convex set $\mathcal{Z} \subseteq B_2(Z, 0)$ (the 2-norm ball of radius $Z$ centered at the origin) and convex functions $(f_t)_{t=1,\dots,T}$ on $\mathcal{Z}$;
  - gradient estimates $f'_t$ with $\mathbb{E}[f'_t \mid z_t] = \nabla f_t(z_t)$ and bound $\|f'_t\|_{2} \le F$;
  - the sample path $z_1 = 0$ and $z_{t+1} = \Pi_{\mathcal{Z}}(z_t - \eta f'_t)$, where $\Pi_{\mathcal{Z}}$ is the Euclidean projection onto $\mathcal{Z}$.
  Then, for $\eta = Z/(F\sqrt{T})$ and any $\delta \in (0, 1)$, the following holds with probability at least $1 - \delta$:
  $$\sum_{t=1}^{T} f_t(z_t) - \min_{z \in \mathcal{Z}} \sum_{t=1}^{T} f_t(z) \le Z F \sqrt{T} + \sqrt{(1 + Z^2 T)\left(2\log\frac{1}{\delta} + d\log\frac{1 + 4Z^2 T}{d}\right)}. \tag{5}$$

  16. Checking the Conditions of the Theorem
  Recall the gradient estimate: for $(x_t, a_t) \sim q_1$ and $y_t \sim q_2$,
  $$g_t(\theta) = \ell^{\top}\Phi - H\,\frac{\Phi_{(x_t,a_t),:}}{q_1(x_t,a_t)}\,\mathbb{I}\{\Phi_{(x_t,a_t),:}\theta < 0\} + H\,\frac{(P - B)_{:,y_t}^{\top}\Phi}{q_2(y_t)}\,\mathrm{sgn}\big((\Phi\theta)^{\top}(P - B)_{:,y_t}\big).$$
  We can bound
  $$\|g_t(\theta)\|_{2} \le \|\ell^{\top}\Phi\|_{2} + H\left(\frac{\|\Phi_{(x_t,a_t),:}\|_{2}}{q_1(x_t,a_t)} + \frac{\|(P - B)_{:,y_t}^{\top}\Phi\|_{2}}{q_2(y_t)}\right) \le \sqrt{d} + H(C_1 + C_2) =: F,$$
  and $\mathbb{E}[g_t(\theta)] = \nabla c(\theta)$.

  17. Proof Conclusion
  The SGD theorem gives
  $$\ell^{\top}\Phi\hat{\theta}_T + H\big(V_1(\hat{\theta}_T) + V_2(\hat{\theta}_T)\big) \le \ell^{\top}\Phi\theta^{*} + H\big(V_1(\theta^{*}) + V_2(\theta^{*})\big) + b_T,$$
  where $b_T$ is the regret bound from the theorem (divided by $T$, with $Z = S$):
  $$b_T = \frac{S F}{\sqrt{T}} + \frac{1}{T}\sqrt{(1 + S^2 T)\left(2\log\frac{1}{\delta} + d\log\frac{1 + 4S^2 T}{d}\right)}.$$
  This yields
  $$V_1(\hat{\theta}_T),\ V_2(\hat{\theta}_T) \le \frac{1}{H}\Big(2(1 + S) + H V_1(\theta^{*}) + H V_2(\theta^{*}) + b_T\Big) =: \epsilon'.$$
