 
Linear Programming for Large-Scale Markov Decision Problems
Yasin Abbasi-Yadkori (1), Peter Bartlett (1, 2), Alan Malek (2)
(1) Queensland University of Technology, Brisbane, QLD, Australia
(2) University of California, Berkeley, Berkeley, CA
June 24th, 2014
Outline
1. Introduce MDPs and the Linear Program formulation
2. Algorithm
3. Oracle inequality
4. Experiments
Markov Decision Processes
A Markov Decision Process is specified by:
State space $\mathcal{X} = \{1, \dots, X\}$
Action space $\mathcal{A} = \{1, \dots, A\}$
Transition kernel $P : \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$
Loss function $\ell : \mathcal{X} \times \mathcal{A} \to [0, 1]$
Let $P^\pi$ be the state transition kernel under policy $\pi : \mathcal{X} \to \Delta_{\mathcal{A}}$.
Our goal is to choose $\pi$ to minimize the average loss when $X$ and $A$ are very large.
Aim for optimality within a restricted family of policies.
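As a concrete illustration of these definitions, the sketch below builds a two-state, two-action MDP, forms the kernel $P^\pi$ for a policy, and estimates the average loss by power iteration. The numbers and the helper names (`policy_kernel`, `stationary`, `average_loss`) are made up for this example and do not come from the slides.

```python
# Toy MDP (all numbers are made up for illustration).
# P[x][a] is a distribution over next states; loss[x][a] lies in [0, 1].
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1 under actions 0, 1
]
loss = [[0.0, 0.5], [1.0, 0.3]]

def policy_kernel(P, pi):
    """State-to-state kernel P^pi[x][y] = sum_a pi(a|x) P(y|x,a)."""
    n = len(P)
    return [[sum(pi[x][a] * P[x][a][y] for a in range(len(pi[x])))
             for y in range(n)] for x in range(n)]

def stationary(K, iters=2000):
    """Approximate the stationary distribution of kernel K by power iteration."""
    n = len(K)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[x] * K[x][y] for x in range(n)) for y in range(n)]
    return d

def average_loss(P, loss, pi):
    """Long-run average loss: sum_x d(x) sum_a pi(a|x) loss(x, a)."""
    d = stationary(policy_kernel(P, pi))
    return sum(d[x] * pi[x][a] * loss[x][a]
               for x in range(len(P)) for a in range(len(pi[x])))

always_a0 = [[1.0, 0.0], [1.0, 0.0]]
always_a1 = [[0.0, 1.0], [0.0, 1.0]]
```

On this toy instance the two deterministic policies have different average losses, which is the quantity the rest of the talk tries to minimize; enumerating policies like this is only feasible because the example is tiny.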
Linear Program Formulation
LP formulation (Manne 1960):
\[ \max_{\lambda, h} \ \lambda \quad \text{s.t.} \quad B^\top(\lambda \mathbf{1} + h) \le \ell + P^\top h, \tag{1} \]
where $B \in \{0, 1\}^{X \times XA}$ is the marginalization matrix.
Primal variables: $h$ is the cost-to-go, $\lambda$ is the average cost.
Dual:
\[ \min_{\mu \in \mathbb{R}^{XA}} \ \ell^\top \mu \quad \text{s.t.} \quad \mathbf{1}^\top \mu = 1,\ \mu \ge 0,\ (P - B)\mu = 0. \tag{2} \]
Define policy via $\pi(a \mid x) \propto \mu(x, a)$.
Dual variables: $\mu$ is a stationary distribution over $\mathcal{X} \times \mathcal{A}$.
Still a problem when $X$, $A$ very large.
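The dual constraints can be checked directly on a small example. The sketch below tests whether a candidate $\mu$ sums to one, is non-negative, and satisfies the flow condition $(P - B)\mu = 0$, then reads off the policy $\pi(a \mid x) \propto \mu(x, a)$. The toy numbers, the function names, and the uniform fallback for states with zero mass are assumptions of this example.

```python
# Toy MDP: P[x][a][y] = probability of moving to y from x under a (made-up numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
loss = [[0.0, 0.5], [1.0, 0.3]]

def flow_residual(P, mu):
    """Entries of (P - B) mu: inflow into each state minus the mass sitting on it."""
    n = len(P)
    res = []
    for y in range(n):
        inflow = sum(mu[x][a] * P[x][a][y]
                     for x in range(n) for a in range(len(P[x])))
        res.append(inflow - sum(mu[y]))
    return res

def is_dual_feasible(P, mu, tol=1e-9):
    """Check 1^T mu = 1, mu >= 0 and (P - B) mu = 0 up to a tolerance."""
    total = sum(sum(row) for row in mu)
    nonneg = all(m >= -tol for row in mu for m in row)
    balanced = all(abs(r) <= tol for r in flow_residual(P, mu))
    return abs(total - 1.0) <= tol and nonneg and balanced

def policy_from_mu(mu):
    """pi(a|x) proportional to mu(x, a); uniform where mu(x, .) is all zero."""
    pi = []
    for row in mu:
        s = sum(row)
        pi.append([m / s for m in row] if s > 0 else [1.0 / len(row)] * len(row))
    return pi

# Stationary state-action distribution of "always action 0": state masses (5/6, 1/6).
mu = [[5.0 / 6.0, 0.0], [1.0 / 6.0, 0.0]]
objective = sum(mu[x][a] * loss[x][a] for x in range(2) for a in range(2))
uniform = [[0.25, 0.25], [0.25, 0.25]]   # sums to one but violates the flow condition
```

Here `mu` is feasible and its objective value equals the average loss of the policy it induces, while `uniform` fails the flow check, which is why it is not a valid dual point.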
The Dual ALP
Feature matrix $\Phi \in \mathbb{R}^{XA \times d}$; constrain $\mu = \Phi\theta$:
\[ \min_{\theta \in \mathbb{R}^{d}} \ \ell^\top \Phi\theta \quad \text{s.t.} \quad \mathbf{1}^\top \Phi\theta = 1,\ \Phi\theta \ge 0,\ (P - B)^\top \Phi\theta = 0. \tag{3} \]
$[\cdot]_+$ is the positive part.
Define policy via $\pi_\theta(a \mid x) \propto [(\Phi\theta)(x, a)]_+$.
$\mu_\theta$ is the stationary distribution of $P^{\pi_\theta}$; $\mu_\theta \approx \Phi\theta$.
$\ell^\top \mu_\theta$ is the average loss of policy $\pi_\theta$.
Want to compete with $\min_\theta \ell^\top \mu_\theta$.
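The map from a parameter vector to a policy is simple enough to write out. The following sketch computes $\pi_\theta(a \mid x) \propto [(\Phi\theta)(x, a)]_+$ for a small feature matrix; the row layout (index `x * n_actions + a`), the feature values, and the uniform fallback for states where every entry is non-positive are assumptions of this example, not choices stated on the slide.

```python
def positive_part_policy(Phi, theta, n_states, n_actions):
    """pi_theta(a|x) proportional to [(Phi theta)(x, a)]_+.

    Phi is a list of XA rows (row index x * n_actions + a), each of length d.
    Falling back to the uniform policy when every entry for a state is
    non-positive is an assumption of this sketch.
    """
    pi = []
    for x in range(n_states):
        scores = []
        for a in range(n_actions):
            row = Phi[x * n_actions + a]
            value = sum(f * t for f, t in zip(row, theta))
            scores.append(max(value, 0.0))
        total = sum(scores)
        if total > 0:
            pi.append([s / total for s in scores])
        else:
            pi.append([1.0 / n_actions] * n_actions)
    return pi

# Two states, two actions, d = 2 features (made-up numbers).
Phi = [
    [0.5, 0.0],   # (x=0, a=0)
    [0.0, 0.5],   # (x=0, a=1)
    [0.5, -1.0],  # (x=1, a=0)
    [0.0, 0.5],   # (x=1, a=1)
]
theta = [0.6, 0.4]
pi = positive_part_policy(Phi, theta, n_states=2, n_actions=2)
```

With these numbers the entry for `(x=1, a=0)` is negative, so the positive part removes it and state 1 always plays action 1. This shows how an infeasible $\Phi\theta$ still yields a valid policy.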
Reducing Constraints
Still intractable: $d$-dimensional problem but $O(XA)$ constraints.
Form the convex cost function:
\[ c(\theta) = \ell^\top \Phi\theta + \big\| [\Phi\theta]_- \big\|_1 + \big\| (P - B)^\top \Phi\theta \big\|_1 = \ell^\top \Phi\theta + \sum_{(x, a)} \big| [\Phi_{(x,a),:}\theta]_- \big| + \sum_{x'} \big| (\Phi\theta)^\top (P - B)_{:,x'} \big| \]
Sample $(x_t, a_t) \sim q_1$ and $y_t \sim q_2$.
Unbiased subgradient estimate:
\[ g_t(\theta) = \ell^\top \Phi - \frac{\Phi_{(x_t,a_t),:}}{q_1(x_t, a_t)} \, \mathbb{I}\{\Phi_{(x_t,a_t),:}\theta < 0\} + \frac{\big(\Phi^\top (P - B)_{:,y_t}\big)^\top}{q_2(y_t)} \, \operatorname{sgn}\big( (\Phi\theta)^\top (P - B)_{:,y_t} \big) \tag{4} \]
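To see why the estimate is unbiased, it helps to average it exactly over both sampling distributions on a tiny instance and compare with the slope of $c(\theta)$. The sketch below does that. It stores $P - B$ as a matrix `D` with one row per state-action pair and one column per state, and all numbers and helper names are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(M, j):
    return [row[j] for row in M]

def t_matvec(M, v):
    """M^T v for a list-of-rows matrix M."""
    return [dot(column(M, j), v) for j in range(len(M[0]))]

def sign(z):
    return (z > 0) - (z < 0)

def cost(theta, loss, Phi, D):
    """c(theta) = loss^T Phi theta + ||[Phi theta]_-||_1 + ||D^T Phi theta||_1."""
    u = [dot(row, theta) for row in Phi]                 # u = Phi theta
    negativity = sum(-min(z, 0.0) for z in u)
    flow = sum(abs(dot(u, column(D, y))) for y in range(len(D[0])))
    return dot(loss, u) + negativity + flow

def subgradient_estimate(theta, loss, Phi, D, q1, q2, i, y):
    """Estimate built from one sampled state-action index i and one state y."""
    u = [dot(row, theta) for row in Phi]
    g = t_matvec(Phi, loss)                              # loss^T Phi
    if u[i] < 0:
        g = [gj - pj / q1[i] for gj, pj in zip(g, Phi[i])]
    d_col = column(D, y)
    s = sign(dot(u, d_col))
    w = t_matvec(Phi, d_col)                             # Phi^T (P - B)_{:, y}
    return [gj + s * wj / q2[y] for gj, wj in zip(g, w)]

def expected_estimate(theta, loss, Phi, D, q1, q2):
    """Exact expectation of the estimate over i ~ q1 and y ~ q2."""
    total = [0.0] * len(theta)
    for i in range(len(q1)):
        for y in range(len(q2)):
            g = subgradient_estimate(theta, loss, Phi, D, q1, q2, i, y)
            total = [t + q1[i] * q2[y] * gj for t, gj in zip(total, g)]
    return total

# Toy instance: 2 states, 2 actions, d = 2 (all numbers made up).
P_rows = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]   # row index 2*x + a
B_rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
D = [[p - b for p, b in zip(pr, br)] for pr, br in zip(P_rows, B_rows)]
loss = [0.0, 0.5, 1.0, 0.3]
Phi = [[0.5, 0.0], [0.0, 0.5], [0.5, -1.0], [0.0, 0.5]]
q1 = [0.25, 0.25, 0.25, 0.25]
q2 = [0.5, 0.5]
theta = [0.6, 0.4]
```

Dividing by $q_1$ and $q_2$ is what cancels the sampling probabilities in expectation. Each call to `subgradient_estimate` touches one row of $\Phi$ and one column of $P - B$, rather than all $XA$ constraints.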
The Stochastic Subgradient Method for MDPs
Input: constants $S, H > 0$, number of rounds $T$.
Let $\Pi_\Theta$ be the Euclidean projection onto the 2-norm ball of radius $S$.
Initialize $\theta_1 \propto \mathbf{1}$.
for $t := 1, 2, \dots, T$ do
    Sample $(x_t, a_t) \sim q_1$ and $x'_t \sim q_2$.
    Compute subgradient estimate $g_t$.
    Update $\theta_{t+1} = \Pi_\Theta(\theta_t - \eta_t g_t)$.
end for
$\hat\theta_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t$.
Return policy $\pi_{\hat\theta_T}$.
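The loop structure (projected step, then averaging of the iterates) can be sketched independently of the MDP. In the example below the subgradient oracle is a stand-in for a simple convex function observed with zero-mean noise, not the MDP cost. The function names, the constant step size, and the toy objective are all assumptions of this illustration.

```python
import math
import random

def project_l2_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta||_2 <= radius}."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)
    return [t * radius / norm for t in theta]

def stochastic_subgradient(oracle, theta1, radius, T, step, rng):
    """Run T projected steps and return the average of theta_1, ..., theta_T.

    oracle(theta, rng) returns a noisy subgradient; step(t) returns eta_t.
    """
    theta = project_l2_ball(theta1, radius)
    running_sum = [0.0] * len(theta)
    for t in range(1, T + 1):
        running_sum = [s + th for s, th in zip(running_sum, theta)]
        g = oracle(theta, rng)
        theta = project_l2_ball(
            [th - step(t) * gj for th, gj in zip(theta, g)], radius)
    return [s / T for s in running_sum]

# Stand-in objective (not the MDP cost): f(theta) = |theta_0 - 0.5| + |theta_1 + 0.5|,
# observed through a subgradient corrupted by zero-mean noise.
def noisy_oracle(theta, rng):
    targets = [0.5, -0.5]
    return [((th > c) - (th < c)) + rng.uniform(-1.0, 1.0)
            for th, c in zip(theta, targets)]

rng = random.Random(0)
T = 20000
theta_hat = stochastic_subgradient(
    noisy_oracle, theta1=[1.0, 1.0], radius=2.0, T=T,
    step=lambda t: 2.0 / math.sqrt(T), rng=rng)
```

Returning the running average rather than the last iterate mirrors the final step of the method. To apply the same loop to the MDP problem, the oracle would be replaced by an estimate of the subgradient of $c(\theta)$ computed from the two samples drawn in each round.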
Theorem
Given some $\epsilon > 0$, the $\hat\theta_T$ produced by the stochastic subgradient method after $T = 1/\epsilon^4$ steps satisfies
\[ \ell^\top \mu_{\hat\theta_T} \le \min_{\theta \in \Theta} \left( \ell^\top \mu_\theta + \frac{V(\theta)}{\epsilon} + O(\epsilon) \right) \]
with probability at least $1 - \delta$, where $V = O(V_1 + V_2)$ is a violation function defined by
\[ V_1(\theta) = \big\| [\Phi\theta]_- \big\|_1, \qquad V_2(\theta) = \big\| (P - B)^\top \Phi\theta \big\|_1. \]
The big-O notation hides polynomials in $S$, $d$, $C_1$, $C_2$, and $\log(1/\delta)$.
Comparison with previous techniques
We bound performance of the found policy directly (not through $J$).
Previous bounds were of the form $\inf_\theta \| J^* - \Psi\theta \|$.
Our bounds: performance w.r.t. the best in class, without assuming near-optimality of the class.
No knowledge of the optimal policy assumed.
First method to make approximations in the dual.
Discussion
Can remove the awkward $V(\theta)/\epsilon + O(\epsilon)$ by taking a grid of $\epsilon$.
Recall
\[ C_1 = \max_{(x, a) \in \mathcal{X} \times \mathcal{A}} \frac{\| \Phi_{(x,a),:} \|}{q_1(x, a)}, \qquad C_2 = \max_{x \in \mathcal{X}} \frac{\| (P - B)^\top_{:,x} \Phi \|}{q_2(x)} \]
We also pick $\Phi$ and $q_1$, so we can make $C_1$ small.
Making $C_2$ small may require knowledge of $P$ (such as sparsity or some stability assumption).
A natural choice: state aggregation.
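One way to read the remark about a grid of $\epsilon$: for a fixed violation level $V$ and a hypothetical constant $c$ standing in for the term hidden by $O(\epsilon)$, the expression $V/\epsilon + c\,\epsilon$ is minimized at $\epsilon = \sqrt{V/c}$ with value $2\sqrt{cV}$, and a geometric grid gets within a constant factor of that. The sketch below checks this numerically. The values of `V` and `c`, the function name, and the powers-of-two grid are assumptions of the example, not details from the talk.

```python
import math

def best_on_grid(V, c, k_max=40):
    """Minimise V/eps + c*eps over the grid eps = 2**-k, k = 0..k_max."""
    return min((V / eps + c * eps, eps)
               for eps in (2.0 ** -k for k in range(k_max + 1)))

V, c = 1e-4, 3.0            # hypothetical violation level and hidden constant
value, eps = best_on_grid(V, c)
continuous_optimum = 2.0 * math.sqrt(c * V)   # attained at eps = sqrt(V / c)
```

Because some grid point lies within a factor of two below $\sqrt{V/c}$, the grid value is at most $1.25$ times the continuous optimum whenever $\sqrt{V/c}$ falls inside the grid range. Since $T = 1/\epsilon^4$, each grid point corresponds to a separate run of the method.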
Comparison with Constraint Sampling
Use the constraint sampling of (de Farias and Van Roy, 2004).
Must assume feasibility.
Need a vector $v(x) \ge | (P - B)^\top \Phi\theta |$ as an envelope to constraint violations.
Bound includes $\| v(x) \|_1$; could be very large.
Requires specific knowledge about the problem.
Analysis
Assume fast mixing: for every policy $\pi$, $\exists\, \tau(\pi) > 0$ s.t. $\forall d, d' \in \Delta_{\mathcal{X}}$,
\[ \| d P^\pi - d' P^\pi \|_1 \le e^{-1/\tau(\pi)} \| d - d' \|_1 \]
Define
\[ C_1 = \max_{(x, a) \in \mathcal{X} \times \mathcal{A}} \frac{\| \Phi_{(x,a),:} \|}{q_1(x, a)}, \qquad C_2 = \max_{x \in \mathcal{X}} \frac{\| (P - B)^\top_{:,x} \Phi \|}{q_2(x)}. \]
The proof has three main parts:
1. $V_1(\theta) \le \epsilon_1$ and $V_2(\theta) \le \epsilon_2$ $\Rightarrow$ $\| \mu_\theta - \Phi\theta \|_1 \le O(\epsilon_1 + \epsilon_2)$
2. Bounding the gradient of $c(\theta)$; checking it is unbiased
3. Applying the stochastic gradient descent theorem: $\ell^\top \Phi\hat\theta \le \min_{\theta \in \Theta} c(\theta) + O(\epsilon)$
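For a finite chain the contraction factor in the mixing assumption can be computed explicitly. The sketch below uses the standard Dobrushin coefficient (the largest total-variation distance between two rows of the kernel), which is the smallest constant for which the 1-norm contraction holds over all pairs of distributions, and converts it to a mixing time through $e^{-1/\tau}$. The kernel and the helper names are made up for this example.

```python
import math

def dobrushin_coefficient(K):
    """Max over state pairs of the total-variation distance between rows of K."""
    n = len(K)
    return max(0.5 * sum(abs(K[x][z] - K[y][z]) for z in range(n))
               for x in range(n) for y in range(n))

def apply_kernel(d, K):
    """Row vector d times kernel K."""
    n = len(K)
    return [sum(d[x] * K[x][y] for x in range(n)) for y in range(n)]

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

K = [[0.9, 0.1], [0.5, 0.5]]          # made-up policy kernel P^pi
kappa = dobrushin_coefficient(K)      # contraction factor e^{-1/tau}
tau = -1.0 / math.log(kappa)          # valid because 0 < kappa < 1 here
d, d_prime = [1.0, 0.0], [0.2, 0.8]
lhs = l1(apply_kernel(d, K), apply_kernel(d_prime, K))
rhs = kappa * l1(d, d_prime)
```

A kernel whose rows are nearly identical has a small coefficient and therefore a small $\tau$. A kernel with two rows of disjoint support has coefficient 1 and does not satisfy the assumption in a single step.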
Proof part 1
Lemma
Let $u \in \mathbb{R}^{XA}$ be a vector with
\[ \mathbf{1}^\top u = 1, \qquad \| u \|_1 \le 1 + \epsilon_1, \qquad \| u^\top (P - B) \|_1 \le \epsilon_2. \]
For the stationary distribution $\mu_u$ of policy $u^+ = [u]_+ / \| [u]_+ \|_1$, we have
\[ \| \mu_u - u \|_1 \le \tau(\mu_u) \log(1/\epsilon') (2\epsilon' + \epsilon'') + 3\epsilon'. \]
Proof:
Two bounds give $\| (P - B)^\top u^+ \|_1 \le 2\epsilon_1 + \epsilon_2 := \epsilon'$.
Also, $\| u^+ - u \|_1 \le 2\epsilon_1$.
Define $M_{u^+} \in \mathbb{R}^{X \times XA}$ as the matrix that encodes policy $u^+$, i.e. $M_{u^+} P = P^{u^+}$.
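The bound on the distance between $u$ and its normalized positive part is easy to check numerically. The sketch below does so for one made-up vector, reading the norm condition in the lemma as a 1-norm bound (an assumption of this example); the function name is also hypothetical.

```python
def normalized_positive_part(u):
    """u^+ = [u]_+ / ||[u]_+||_1."""
    pos = [max(z, 0.0) for z in u]
    total = sum(pos)
    return [z / total for z in pos]

u = [0.55, 0.30, -0.05, 0.20]            # made-up vector with 1^T u = 1
eps1 = sum(abs(z) for z in u) - 1.0      # so that ||u||_1 = 1 + eps1
u_plus = normalized_positive_part(u)
gap = sum(abs(a - b) for a, b in zip(u_plus, u))
```

The reason the bound holds: if the entries sum to one and the 1-norm is at most $1 + \epsilon_1$, the negative mass is at most $\epsilon_1/2$, and clipping plus renormalizing moves at most twice that much mass.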
Proof (continued):
Let $\mu_0 = u^+$, $\mu_t^\top = \mu_{t-1}^\top P M_{u^+}$, $v_t = \mu_t^\top (P - B) = v_{t-1} M_{u^+} P$.
$\mu_t$ is the state-action distribution after running the policy for $t$ steps.
By the previous bound, $\| v_0 \|_1 \le \epsilon' \Rightarrow \| v_t \|_1 \le \epsilon'$.
\[ \mu_t^\top = \mu_{t-1}^\top P M_{u^+} = (\mu_{t-1}^\top B + v_{t-1}) M_{u^+} = \mu_{t-1}^\top + v_{t-1} M_{u^+} \]
Telescoping: $\mu_k^\top = \mu_0^\top + \sum_{t=0}^{k-1} v_t M_{u^+}$.
Thus, $\| \mu_k - u^+ \|_1 \le k \epsilon'$.
By the mixing assumption: $\| \mu_k - \mu_u \|_1 \le e^{-k/\tau(u^+)} \| \mu_0 - \mu_u \|_1$.
Take $k = \tau(u^+) \log(1/\epsilon')$ and use the triangle inequality.
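The recursion in this step can be traced numerically. The sketch below iterates $\mu_k^\top = \mu_{k-1}^\top P M_{u^+}$ on a toy instance and records how far $\mu_k$ drifts from $u^+$ and how large the flow residual $v_k$ stays. It stores $P$ and $B$ with one row per state-action pair and one column per state, and the numbers and helper names are assumptions of this illustration.

```python
def vec_mat(v, M):
    """Row vector times matrix (M given as a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def l1(v):
    return sum(abs(z) for z in v)

# Toy instance with 2 states and 2 actions; state-action index is 2*x + a.
P = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]   # XA x X
B = [[1, 0], [1, 0], [0, 1], [0, 1]]                   # XA x X marginalisation
u_plus = [0.4, 0.1, 0.2, 0.3]                          # made-up distribution

# M encodes the policy pi(a|x) = u_plus(x, a) / sum_a u_plus(x, a); shape X x XA.
M = [[0.0] * 4 for _ in range(2)]
for x in range(2):
    mass = u_plus[2 * x] + u_plus[2 * x + 1]
    for a in range(2):
        M[x][2 * x + a] = u_plus[2 * x + a] / mass

def flow(mu):
    """v = mu^T (P - B)."""
    return [p - b for p, b in zip(vec_mat(mu, P), vec_mat(mu, B))]

eps_prime = l1(flow(u_plus))           # plays the role of epsilon'
mu, drift, residuals = list(u_plus), [], []
for k in range(1, 11):
    mu = vec_mat(vec_mat(mu, P), M)    # mu_k^T = mu_{k-1}^T P M
    drift.append(l1([m - u for m, u in zip(mu, u_plus)]))
    residuals.append(l1(flow(mu)))
```

On this instance the drift after $k$ steps stays below $k\,\epsilon'$ and the residual never exceeds its initial value, matching the two facts the telescoping argument relies on (both $M_{u^+}$ and $M_{u^+} P$ are row-stochastic, so they do not increase the 1-norm).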
Applying SGD theorem
Theorem (Lemma 3.1 of (Flaxman et al., 2005))
Assume we have
Convex set $\mathcal{Z} \subseteq B_2(Z, 0)$ and $(f_t)_{t = 1, 2, \dots, T}$ convex functions on $\mathcal{Z}$.
Gradient estimates $f'_t$ with $\mathbb{E}[f'_t \mid z_t] = \nabla f(z_t)$ and bound $\| f'_t \|_2 \le F$.
Sample path $z_1 = 0$ and $z_{t+1} = \Pi_{\mathcal{Z}}(z_t - \eta f'_t)$ ($\Pi_{\mathcal{Z}}$ is the Euclidean projection).
Then, for $\eta = Z / (F \sqrt{T})$ and any $\delta \in (0, 1)$, the following holds with probability at least $1 - \delta$:
\[ \sum_{t=1}^{T} f_t(z_t) - \min_{z \in \mathcal{Z}} \sum_{t=1}^{T} f_t(z) \le Z F \sqrt{T} + \sqrt{ (1 + 4 Z^2 T) \left( 2 \log\frac{1}{\delta} + d \log\left( 1 + \frac{Z^2 T}{d} \right) \right) }. \tag{5} \]
Checking conditions of theorem
Recall gradient: for $(x_t, a_t) \sim q_1$ and $y_t \sim q_2$,
\[ g_t(\theta) = \ell^\top \Phi - H \frac{\Phi_{(x_t,a_t),:}}{q_1(x_t, a_t)} \, \mathbb{I}\{\Phi_{(x_t,a_t),:}\theta < 0\} + H \frac{(P - B)^\top_{:,y_t} \Phi}{q_2(y_t)} \, \operatorname{sgn}\big( (\Phi\theta)^\top (P - B)_{:,y_t} \big). \]
We can bound
\[ \| g_t(\theta) \|_2 \le \| \ell^\top \Phi \|_2 + H \left( \frac{\| \Phi_{(x_t,a_t),:} \|_2}{q_1(x_t, a_t)} + \frac{\| (P - B)^\top_{:,y_t} \Phi \|_2}{q_2(y_t)} \right) \le \sqrt{d} + H (C_1 + C_2) := F \]
and $\mathbb{E}[g_t(\theta)] = \nabla c(\theta)$.
Proof conclusion
The SGD theorem gives us:
\[ \ell^\top \Phi\hat\theta_T + H \big( V_1(\hat\theta) + V_2(\hat\theta) \big) \le \ell^\top \Phi\theta^* + H \big( V_1(\theta^*) + V_2(\theta^*) \big) + b_T \]
where $b_T$ is the regret bound from the theorem:
\[ b_T = \frac{S F}{\sqrt{T}} + \sqrt{ \frac{1 + 4 S^2 T}{T^2} \left( 2 \log\frac{1}{\delta} + d \log\frac{d + S^2 T}{d} \right) }. \]
We take
\[ V_1(\hat\theta),\ V_2(\hat\theta) \le \frac{1}{H} \big( 2(1 + S) + H V_1(\theta^*) + H V_2(\theta^*) + b_T \big) := \epsilon' \]
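To get a feel for how the regret term shrinks with the number of rounds, the quantities $F = \sqrt{d} + H(C_1 + C_2)$ and $b_T$ can be evaluated directly. The sketch below transcribes the two formulas as read from the slides; the function names and the parameter values in the usage lines are hypothetical.

```python
import math

def gradient_bound(d, H, C1, C2):
    """F = sqrt(d) + H * (C1 + C2)."""
    return math.sqrt(d) + H * (C1 + C2)

def regret_term(S, F, T, d, delta):
    """b_T = S F / sqrt(T)
           + sqrt((1 + 4 S^2 T) / T^2 * (2 log(1/delta) + d log((d + S^2 T) / d)))."""
    log_part = 2.0 * math.log(1.0 / delta) + d * math.log((d + S * S * T) / d)
    second = math.sqrt((1.0 + 4.0 * S * S * T) / (T * T) * log_part)
    return S * F / math.sqrt(T) + second

# Hypothetical constants, chosen only to show the trend in T.
F = gradient_bound(d=4, H=1.0, C1=1.0, C2=2.0)
b_values = [regret_term(S=1.0, F=F, T=T, d=4, delta=0.1)
            for T in (10 ** 2, 10 ** 4, 10 ** 6)]
```

Both terms decay roughly like $1/\sqrt{T}$ up to a logarithmic factor, so with these constants each hundredfold increase in $T$ reduces $b_T$ by a little less than a factor of ten.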