SLIDE 1

Linear Programming for Large-Scale Markov Decision Problems

Yasin Abbasi-Yadkori¹   Peter Bartlett¹ ²   Alan Malek²

¹ Queensland University of Technology, Brisbane, QLD, Australia

² University of California, Berkeley, Berkeley, CA

June 24th, 2014

Y. Abbasi-Yadkori, P. Bartlett, A. Malek · Linear Programming for Large-Scale Markov Decision Problems · 1 / 22

SLIDE 2

Outline

1. Introduce MDPs and the Linear Program formulation
2. Algorithm
3. Oracle inequality
4. Experiments

SLIDE 3

Markov Decision Processes

A Markov Decision Process is specified by:

• State space X = {1, . . . , X}
• Action space A = {1, . . . , A}
• Transition kernel P : X × A → Δ_X
• Loss function ℓ : X × A → [0, 1]

Let P_π be the state transition kernel under policy π : X → Δ_A. Our goal is to choose π to minimize the average loss when X and A are very large. Aim for optimality within a restricted family of policies.
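These objects are easy to write down concretely. Below is a minimal sketch for a toy instance (the sizes, random data, and variable names are illustrative, not from the talk), computing the average loss of a fixed policy:

```python
import numpy as np

# Toy MDP with X = 2 states and A = 2 actions (all numbers illustrative).
X, A = 2, 2
rng = np.random.default_rng(0)

# Transition kernel P : X x A -> Delta_X, stored as P[x, a, x'].
P = rng.random((X, A, X))
P /= P.sum(axis=2, keepdims=True)

# Loss function ell : X x A -> [0, 1].
ell = rng.random((X, A))

# A policy pi : X -> Delta_A (here: uniform over actions).
pi = np.full((X, A), 1.0 / A)

# State transition kernel under pi: P_pi[x, y] = sum_a pi(a|x) P(y|x, a).
P_pi = np.einsum('xa,xay->xy', pi, P)

# Stationary state distribution of P_pi, found by power iteration.
d = np.full(X, 1.0 / X)
for _ in range(1000):
    d = d @ P_pi

# Average loss of pi: sum_{x,a} d(x) pi(a|x) ell(x, a).
avg_loss = np.einsum('x,xa,xa->', d, pi, ell)
```

Minimizing this average loss over π is the objective of the talk; the LP formulation below does so without enumerating policies.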

SLIDE 4

Linear Program Formulation

LP formulation (Manne 1960):

max_{λ,h} λ ,    (1)
s.t. B⊤(λ1 + h) ≤ ℓ + P⊤h ,

where B ∈ {0, 1}^{X×XA} is the marginalization matrix. Primal variables: h is the cost-to-go, λ is the average cost.

Dual:

min_{μ∈R^{XA}} ℓ⊤μ ,    (2)
s.t. 1⊤μ = 1, μ ≥ 0, (P − B)μ = 0 .

Define the policy via π(a|x) ∝ μ(x, a). Dual variables: μ is a stationary distribution over X × A. Still a problem when X, A very large.

SLIDE 5

The Dual ALP

Feature matrix Φ ∈ R^{XA×d}; constrain μ = Φθ:

min_θ ℓ⊤Φθ ,    (3)
s.t. 1⊤Φθ = 1, Φθ ≥ 0, (P − B)⊤Φθ = 0 .

[·]_+ is the positive part. Define the policy via π_θ(a|x) ∝ [(Φθ)(x, a)]_+; μ_θ is the stationary distribution of P_{π_θ}, and μ_θ ≈ Φθ. ℓ⊤μ_θ is the average loss of policy π_θ. Want to compete with min_θ ℓ⊤μ_θ.

SLIDE 6

Reducing Constraints

Still intractable: d-dimensional problem but O(XA) constraints.

Form the convex cost function (with penalty constant H):

c(θ) = ℓ⊤Φθ + H ‖[Φθ]_−‖₁ + H ‖(P − B)⊤Φθ‖₁
     = ℓ⊤Φθ + H Σ_{(x,a)} [Φ_{(x,a),:} θ]_− + H Σ_{x′} |(Φθ)⊤(P − B)_{:,x′}|

Sample (x_t, a_t) ∼ q1 and y_t ∼ q2. Unbiased subgradient estimate:

g_t(θ) = ℓ⊤Φ − H (Φ_{(x_t,a_t),:} / q1(x_t, a_t)) I{Φ_{(x_t,a_t),:} θ < 0}    (4)
       + H ((Φ⊤(P − B)_{:,y_t})⊤ / q2(y_t)) sgn((Φθ)⊤(P − B)_{:,y_t})
SLIDE 7

The Stochastic Subgradient Method for MDPs

Input: constants S, H > 0, number of rounds T.
Let Π_Θ be the Euclidean projection onto the S-radius 2-norm ball.
Initialize θ_1 ∝ 1.
for t := 1, 2, . . . , T do
    Sample (x_t, a_t) ∼ q1 and x′_t ∼ q2.
    Compute the subgradient estimate g_t.
    Update θ_{t+1} = Π_Θ(θ_t − η_t g_t).
end for
θ̄_T = (1/T) Σ_{t=1}^T θ_t.
Return policy π_{θ̄_T}.
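The loop above can be sketched end-to-end on a toy MDP. Everything below (the toy problem, the uniform choices of q1 and q2, the fixed step size) is an illustrative choice, not the authors' experimental setup:

```python
import numpy as np

X, A, d = 3, 2, 4
S, H, T = 10.0, 10.0, 2000
rng = np.random.default_rng(0)

Pk = rng.random((X, A, X)); Pk /= Pk.sum(axis=2, keepdims=True)
Pmat = Pk.reshape(X * A, X)                  # P in R^{XA x X}
Bmat = np.kron(np.eye(X), np.ones((A, 1)))   # marginalization matrix
Phi = rng.random((X * A, d))                 # feature matrix
ell = rng.random(X * A)                      # losses in [0, 1]
q1 = np.full(X * A, 1.0 / (X * A))           # uniform sampling distributions
q2 = np.full(X, 1.0 / X)
Dm = (Pmat - Bmat).T @ Phi                   # Dm[y] = (P - B)_{:,y}^T Phi

theta = np.ones(d) / d                       # theta_1 proportional to 1
avg = np.zeros(d)
eta = S / np.sqrt(T)                         # fixed step size
for t in range(T):
    i = rng.choice(X * A, p=q1)              # (x_t, a_t) ~ q1
    y = rng.choice(X, p=q2)                  # x'_t ~ q2
    g = ell @ Phi                            # gradient of the linear term
    if Phi[i] @ theta < 0:                   # negative-part penalty active
        g = g - H * Phi[i] / q1[i]
    g = g + H * np.sign(Dm[y] @ theta) * Dm[y] / q2[y]
    theta = theta - eta * g
    nrm = np.linalg.norm(theta)              # project onto the S-radius ball
    if nrm > S:
        theta = theta * (S / nrm)
    avg = avg + theta
theta_bar = avg / T                          # averaged iterate theta_bar_T
```

The returned policy is then obtained from `theta_bar` via the positive-part normalization of the previous slide.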

SLIDE 8

Theorem

Given some ε > 0, the θ̄_T produced by the stochastic subgradient method after T = 1/ε⁴ steps satisfies

ℓ⊤μ_{θ̄_T} ≤ min_{θ∈Θ} ( ℓ⊤μ_θ + V(θ)/ε ) + O(ε)

with probability at least 1 − δ, where V = O(V1 + V2) is a violation function defined by

V1(θ) = ‖[Φθ]_−‖₁ ,   V2(θ) = ‖(P − B)⊤Φθ‖₁ .

The big-O notation hides polynomials in S, d, C1, C2, and log(1/δ).

SLIDE 9

Comparison with previous techniques

• We bound the performance of the found policy directly (not through J)
• Previous bounds were of the form inf_θ ‖J∗ − Ψθ‖
• Our bounds: performance w.r.t. the best in class, without assuming near-optimality of the class
• No knowledge of the optimal policy assumed
• First method to make approximations in the dual

SLIDE 10

Discussion

Can remove the awkward V(θ)/ε + O(ε) by taking a grid of ε.

Recall

C1 = max_{(x,a)∈X×A} ‖Φ_{(x,a),:}‖ / q1(x, a) ,   C2 = max_{x∈X} ‖(P − B)⊤_{:,x} Φ‖ / q2(x) .

We also pick Φ and q1, so we can make C1 small. Making C2 small may require knowledge of P (such as sparsity or some stability assumption). Natural selection: state aggregation.
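One way to read the "state aggregation" suggestion is the following illustrative construction (the partition is arbitrary, chosen only for the example): each feature indicates one (cluster, action) pair, so every row of Φ is a single unit indicator.

```python
import numpy as np

# Illustrative "state aggregation" features: partition the states into
# clusters and let feature j indicate one (cluster, action) pair.
X, A = 6, 2
cluster = np.array([0, 0, 1, 1, 2, 2])        # arbitrary partition of states
n_clusters = cluster.max() + 1

d = n_clusters * A
Phi = np.zeros((X * A, d))
for x in range(X):
    for a in range(A):
        Phi[x * A + a, cluster[x] * A + a] = 1.0

# Every row of Phi is a single indicator, so mu = Phi theta assigns the
# same value to all (x, a) pairs sharing a cluster and an action.
```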

SLIDE 11

Comparison with Constraint Sampling

• Use the constraint sampling of (de Farias and Van Roy, 2004)
• Must assume feasibility
• Need a vector v(x) ≥ |(P − B)⊤Φθ| as an envelope to the constraint violations
• Bound includes ‖v(x)‖₁, which could be very large
• Requires specific knowledge about the problem

SLIDE 12

Analysis

Assume fast mixing: for every policy π, there exists τ(π) > 0 such that for all d, d′ ∈ Δ_X,

‖dP_π − d′P_π‖₁ ≤ e^{−1/τ(π)} ‖d − d′‖₁ .

Define

C1 = max_{(x,a)∈X×A} ‖Φ_{(x,a),:}‖ / q1(x, a) ,   C2 = max_{x∈X} ‖(P − B)⊤_{:,x} Φ‖ / q2(x) .

The proof has three main parts:

1. V1(θ) ≤ ε1 and V2(θ) ≤ ε2 ⇒ ‖μ_θ − Φθ‖₁ ≤ O(ε1 + ε2)
2. Bounding the gradient of c(θ); checking it is unbiased
3. Applying the stochastic gradient descent theorem: ℓ⊤Φθ̄ ≤ min_{θ∈Θ} c(θ) + O(ε)

SLIDE 13

Proof part 1

Lemma

Let u ∈ R^{XA} be a vector with 1⊤u = 1, ‖[u]_−‖₁ ≤ ε1, and ‖u⊤(P − B)‖₁ ≤ ε2. Then, for the stationary distribution μ_u of the policy u_+ = [u]_+/‖[u]_+‖₁, with ε′ := 2ε1 + ε2,

‖μ_u − u‖₁ ≤ τ(u_+) log(1/ε′)(2ε1 + ε2) + 3ε′ .

Proof: Two bounds give ‖(P − B)⊤u_+‖₁ ≤ 2ε1 + ε2 = ε′. Also, ‖u_+ − u‖₁ ≤ 2ε1. Define M_{u_+} ∈ R^{X×XA} as the matrix that encodes the policy u_+, e.g. M_{u_+}P = P_{u_+}.

SLIDE 14

Proof (continued): Let μ_0 = u_+, μ⊤_t = μ⊤_{t−1} P M_{u_+}, and v_t = μ⊤_t (P − B) = v_{t−1} M_{u_+} P.

μ_t is the state-action distribution after running the policy for t steps. By the previous bound, ‖v_0‖₁ ≤ ε′ ⇒ ‖v_t‖₁ ≤ ε′ for all t.

μ⊤_t = μ⊤_{t−1} P M_{u_+} = (μ⊤_{t−1} B + v_{t−1}) M_{u_+} = μ⊤_{t−1} + v_{t−1} M_{u_+}

Telescoping: μ⊤_k = μ⊤_0 + Σ_{t=0}^{k−1} v_t M_{u_+}.

Thus, ‖μ_k − u_+‖₁ ≤ kε′. By the mixing assumption, ‖μ_k − μ_u‖₁ ≤ e^{−k/τ(u_+)}. Take k = τ(u_+) log(1/ε′) and use the triangle inequality. ∎

SLIDE 15

Applying SGD theorem

Theorem (Lemma 3.1 of (Flaxman et al., 2005))

Assume we have:

• a convex set Z ⊆ B₂(Z, 0) and convex functions (f_t)_{t=1,...,T} on Z,
• gradient estimates f′_t with E[f′_t | z_t] = ∇f_t(z_t) and bound ‖f′_t‖₂ ≤ F,
• the sample path z_1 = 0 and z_{t+1} = Π_Z(z_t − η f′_t), where Π_Z is the Euclidean projection onto Z.

Then, for η = Z/(F√T) and any δ ∈ (0, 1), the following holds with probability at least 1 − δ:

Σ_{t=1}^T f_t(z_t) − min_{z∈Z} Σ_{t=1}^T f_t(z) ≤ ZF√T + sqrt( (1 + 4Z²T)(2 log(1/δ) + d log(1 + Z²T/d)) ) .    (5)
SLIDE 16

Checking the conditions of the theorem

Recall the gradient: for (x_t, a_t) ∼ q1 and y_t ∼ q2,

g_t(θ) = ℓ⊤Φ − H (Φ_{(x_t,a_t),:} / q1(x_t, a_t)) I{Φ_{(x_t,a_t),:} θ < 0}
       + H ((P − B)⊤_{:,y_t} Φ / q2(y_t)) sgn((Φθ)⊤(P − B)_{:,y_t}) .

We can bound

‖g_t(θ)‖₂ ≤ ‖ℓ⊤Φ‖₂ + H ‖Φ_{(x_t,a_t),:}‖₂ / q1(x_t, a_t) + H ‖(P − B)⊤_{:,y_t} Φ‖₂ / q2(y_t) ≤ √d + H(C1 + C2) =: F,

and E[g_t(θ)] = ∇c(θ).

SLIDE 17

Proof conclusion

The SGD theorem gives us:

ℓ⊤Φθ̄_T + H(V1(θ̄_T) + V2(θ̄_T)) ≤ ℓ⊤Φθ∗ + H(V1(θ∗) + V2(θ∗)) + b_T ,

where b_T is the (time-averaged) regret bound from the theorem:

b_T = SF/√T + sqrt( ((1 + 4S²T)/T²)(2 log(1/δ) + d log(1 + S²T/d)) ) .

We take V1(θ̄_T), V2(θ̄_T) ≤ (1/H)(2(1 + S) + H V1(θ∗) + H V2(θ∗) + b_T) =: ε′ .

SLIDE 18

Applying the lemma twice:

ℓ⊤μ_{θ̄_T} − ℓ⊤μ_{θ∗} ≤ H V1(θ∗) + H V2(θ∗) + b_T + τ(μ_{θ̄_T}) log(1/ε′) 3ε′ + 3ε′
                       + τ(μ_{θ∗}) log(1/V(θ∗))(2V1(θ∗) + V2(θ∗)) + 3V1(θ∗)

Since b_T = O(H/√T), taking H = 1/ε and T = 1/ε⁴ yields:

ℓ⊤μ_{θ̄_T} − ℓ⊤μ_{θ∗} ≤ (1/ε)(V1(θ∗) + V2(θ∗)) + O(ε) .

SLIDE 19

Queueing network example (Rybko-Stolyar)

[Figure: the four-queue Rybko-Stolyar network, with queues μ1–μ4, arrival rates a1 and a3, departure rates d1–d4, and two servers]

• Customers arrive at μ1/μ3, then move to μ2/μ4
• Server 1 processes μ1 or μ4; server 2 processes μ2 or μ3
• Features: indicators of sub-blocks in state-action space, stationary distributions of the LONGER and LBFS heuristics
• Loss is the total queue size
• a1 = a3 = .08, d1 = d2 = .12, d3 = d4 = .28, X = 902500
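A discrete-time simulation of this network under the LONGER heuristic (each server works on its longer queue) can look like the sketch below. The dynamics are one plausible reading of the slide, with a single service attempt per server per step, not the authors' exact experimental setup:

```python
import numpy as np

# Illustrative simulation of the four-queue Rybko-Stolyar network.
a1 = a3 = 0.08                       # arrival probabilities into queues 1, 3
d = [0.12, 0.12, 0.28, 0.28]         # service-completion probabilities d1..d4
rng = np.random.default_rng(0)

q = np.zeros(4, dtype=int)           # queue lengths for mu1..mu4
total = 0.0
T = 100_000
for _ in range(T):
    # Arrivals into queues 1 and 3.
    q[0] += rng.random() < a1
    q[2] += rng.random() < a3
    # LONGER: server 1 serves queue 1 or 4, server 2 serves queue 2 or 3.
    s1 = 0 if q[0] >= q[3] else 3
    s2 = 1 if q[1] >= q[2] else 2
    for i in (s1, s2):
        if q[i] > 0 and rng.random() < d[i]:
            q[i] -= 1
            if i == 0: q[1] += 1     # mu1 -> mu2
            if i == 2: q[3] += 1     # mu3 -> mu4
    total += q.sum()                 # loss: total queue size

avg_loss = total / T                 # empirical average loss
```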

SLIDE 20

Results

[Figure: three plots over 2000–8000 iterations]

The left plot: linear objective of the running average, i.e. ℓ⊤Φθ̄_t. The center plot: sum of the two constraint violations of θ̄_t (log scale). The right plot: ℓ⊤μ_{θ̄_t}; the two horizontal lines correspond to the loss of the two heuristics, LONGER and LBFS.

SLIDE 21

Conclusion

Presented an algorithm to solve average-cost, large-scale MDPs:

◮ Restricted the dual LP to a subspace to reduce dimension
◮ Used stochastic gradient descent to sample constraints

Presented an oracle inequality guaranteeing we perform well w.r.t. the best policy in the subspace. Demonstrated the algorithm on a queueing network. Visit us at poster T75.

SLIDE 22

Bibliography

D. P. de Farias and B. Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29, 2004.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, 2005.