SLIDE 1

The simplex method is strongly polynomial for deterministic Markov decision processes

Ian Post and Yinyu Ye. Fields Institute, November 29, 2013.


SLIDE 2

Markov Decision Processes

A Markov decision process is a method of modeling repeated decision making over time in stochastic, changing environments.

[Figure: a state s with actions earning rewards r1, r2 and transition probabilities p1, p2, p3]

An MDP consists of states s and actions a, each action having a reward r_a and a probability distribution P_a over states. When action a is used, the process receives the reward r_a and transitions to a new state according to the distribution P_a.
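As a concrete illustration (not part of the talk), a tabular MDP and one step of its dynamics might look like the following minimal Python sketch; all names and numbers are hypothetical:

```python
import numpy as np

# Minimal tabular MDP sketch (hypothetical instance, not from the talk).
# Each action a has a reward r_a and a distribution P_a over next states.
rng = np.random.default_rng(0)
n = 3                                    # number of states
rewards = {"a1": 1.0, "a2": 0.5}         # r_a
P = {"a1": np.array([0.2, 0.5, 0.3]),    # P_a: next-state distribution
     "a2": np.array([1.0, 0.0, 0.0])}

def step(action: str):
    """Use `action`: receive its reward and sample the next state from P_a."""
    return rewards[action], rng.choice(n, p=P[action])
```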


SLIDE 3

Markov Decision Processes

We are also given a discount factor γ < 1 as part of the input.

Goal: pick actions so as to maximize

∑_{t=0}^∞ γ^t E_A[r(t)],

where r(t) is the reward at time t.

[Figure: a walk through states with rewards r1, ..., r5; the discounted reward accumulates as r1 + γ r5 + γ^2 r4 + ⋯]
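To make the objective concrete, here is a quick sketch of the discounted return of the reward sequence in the figure; γ and the reward values are made up:

```python
# Discounted return of the walk in the figure: r1, then r5, then r4, ...
gamma = 0.9
rewards_seen = [1.0, 5.0, 4.0]    # stand-ins for r1, r5, r4
ret = sum(gamma**t * r for t, r in enumerate(rewards_seen))
print(ret)                        # r1 + gamma*r5 + gamma^2*r4
```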


SLIDE 7

Motivation

MDPs are widely used in machine learning, operations research, economics, robotics and control, etc.

MDPs are also an interesting problem theoretically, in that they are essentially where our knowledge of how to solve LPs in strongly polynomial time stops.

◮ Close to being strongly polynomial [Ye05], and they possess a lot of structure that allows for powerful algorithms like policy iteration [How60]...
◮ ...but they also appear hard for powerful algorithms [Fea10] [FHZ11]

The performance of basis-exchange algorithms like policy iteration and simplex remains poorly understood.

◮ A number of open questions, including their performance on special cases like deterministic MDPs [HZ10]
◮ Important for developing new algorithms with better performance

SLIDE 10

Previous Work

Policy iteration [How60]
◮ Long conjectured to be strongly polynomial, but only exponential bounds were known [MS99]
◮ Recently shown to be exponential [Fea10]

Simplex lower bounds using MDPs [FHZ11] [Fri11] [MC94]

Discounted MDPs (bounds depend on 1/(1−γ))
◮ ε-approximation to the optimum [Bel57]
◮ True optimum [Ye11] [HMZ11]

Specialized algorithms for deterministic MDPs and other special cases [PT87] [HN94] [MTZ10] [Mad02]

SLIDE 11

Results

Theorem

The simplex method with Dantzig's most-negative reduced cost pivoting rule converges in O(n^3 m^2 log^2 n) iterations for deterministic MDPs, regardless of the discount factor.

Theorem

If each action can have a distinct discount, then the simplex method converges in O(n^5 m^3 log^2 n) iterations.

Subsequent work [HKZ13] has improved these bounds by a factor of n.

SLIDE 13

Value vector

Let π be a policy (a choice of action for each state)
◮ This defines a Markov chain

The value (dual variable) v^π_s of a state s is the expected reward for starting in the state and following π:

v^π_s = r_a + γ (P_a)^T v^π, where a = π(s)

[Figure: state s with value v_s, whose chosen action has reward r1 and transition probabilities p1, p2 to states with values v1, v2]

◮ Key property: increasing the value of one state only increases the values of others
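As a side note (not on the slides), the value vector of a fixed policy can be computed directly by solving the linear system (I − γ P^π) v = r^π; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

# Sketch: evaluate a fixed policy pi by solving the Bellman equations
#   v = r_pi + gamma * P_pi @ v,  i.e.  (I - gamma * P_pi) v = r_pi,
# where P_pi[s] is the next-state distribution of the action pi picks in
# state s, and r_pi[s] is that action's reward (illustrative names).
def policy_values(P_pi: np.ndarray, r_pi: np.ndarray, gamma: float) -> np.ndarray:
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```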

SLIDE 14

Flux vector

The flux (primal variable) x^π_a through an action a is the discounted number of times the action is used when starting in all the states:

x^π = ∑_{i≥0} (γ P^π)^i 1 = (I − γ P^π)^{-1} 1

◮ Flux through an action in π is always between 1 and n/(1−γ) = n ∑_{i=0}^∞ γ^i

[Figure: flux contributions 1, γ, γ^2, γ^3, ... summing to 1/(1−γ)]
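A sketch of computing the flux vector (not from the slides). The orientation is our assumption: with P_pi rows as next-state distributions, the primal LP constraint on the next slide reads x = 1 + γ P_pi^T x:

```python
import numpy as np

# Sketch: flux vector of a policy. With P_pi rows as next-state
# distributions, the primal constraint x = 1 + gamma * P_pi^T x gives
# x = (I - gamma * P_pi^T)^{-1} 1 (orientation is our assumption).
def policy_flux(P_pi: np.ndarray, gamma: float) -> np.ndarray:
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, np.ones(n))
```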

SLIDE 16

Linear Program

MDPs can be solved with the following primal/dual pair of LPs.

Primal: maximize ∑_a r_a x_a
subject to ∀s ∈ S: ∑_{a∈A_s} x_a = 1 + γ ∑_a P_{a,s} x_a
x ≥ 0

Dual: minimize ∑_s v_s
subject to ∀s ∈ S, a ∈ A_s: v_s ≥ r_a + γ ∑_{s′} P_{a,s′} v_s′
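As an aside, the primal LP can be handed to an off-the-shelf solver; here is a sketch for a tiny made-up instance using scipy's linprog (the instance and names are hypothetical, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: the primal LP above for a tiny hypothetical MDP.
# linprog minimizes, so we negate rewards to maximize sum_a r_a x_a.
gamma = 0.9
n = 2
# action list: (source state s, reward r_a, distribution P_a over next states)
actions = [(0, 1.0, np.array([0.0, 1.0])),
           (0, 0.5, np.array([1.0, 0.0])),
           (1, 2.0, np.array([1.0, 0.0]))]

m = len(actions)
A_eq = np.zeros((n, m))
for j, (s, r, P_a) in enumerate(actions):
    A_eq[s, j] += 1.0             # sum_{a in A_s} x_a ...
    A_eq[:, j] -= gamma * P_a     # ... minus gamma * sum_a P_{a,s} x_a
b_eq = np.ones(n)                 # = 1 for every state s
c = -np.array([r for (_, r, _) in actions])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
print(res.x)   # optimal flux; its support identifies an optimal policy
```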

SLIDE 17

Gain

The gain (reduced cost) r^π_a of an action a is the improvement for switching to that action for one step:

r^π_a = (r_a + γ P_a^T v^π) − v^π_s, where s is the state of action a

We will pivot on the action with the highest gain.

[Figure: r^π_1 = (r1 + γ(p1 v1 + p2 v2)) − v_s]
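Continuing the toy sketches above, gains and Dantzig's highest-gain pivot might look like this, with actions as the (state, reward, distribution) triples from the LP sketch (illustrative only):

```python
import numpy as np

# Sketch: gains (reduced costs) and Dantzig's pivoting rule, reusing the
# (state, reward, distribution) action triples from the LP sketch above.
def best_pivot(actions, v: np.ndarray, gamma: float):
    """Index of the highest-gain action, or None if no gain is positive."""
    gains = [r + gamma * P_a @ v - v[s] for (s, r, P_a) in actions]
    j = int(np.argmax(gains))
    return j if gains[j] > 1e-12 else None
```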

SLIDE 18

Discounted MDPs

Basic idea: all variables lie in an interval of polynomial size. As a result, the gap to the optimum shrinks by a polynomial factor each iteration. Suppose 1/(1−γ) is polynomial.

Let π be the current policy, ∆ = max_a r^π_a, and a = argmax_a r^π_a. Then

r^T x* − r^T x^π = (r^π)^T x* ≤ ∆ · n/(1−γ)

Using action a will increase the objective by at least ∆ (its flux is at least 1), so the distance to the optimum shrinks by a factor of 1 − (1−γ)/n.
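Putting the toy pieces together, the Dantzig-rule simplex loop on an MDP (equivalently, policy iteration that switches a single action per step) can be sketched as follows; this is our illustrative driver, not the paper's implementation:

```python
import numpy as np

# Sketch: Dantzig-rule simplex on a toy MDP, reusing policy_values() and
# best_pivot() from the sketches above (illustrative, not the paper's code).
def simplex_mdp(actions, n: int, gamma: float):
    # arbitrary initial policy: first action found for each state
    pi = {s: next(j for j, (s2, _, _) in enumerate(actions) if s2 == s)
          for s in range(n)}
    while True:
        P_pi = np.array([actions[pi[s]][2] for s in range(n)])
        r_pi = np.array([actions[pi[s]][1] for s in range(n)])
        v = policy_values(P_pi, r_pi, gamma)
        j = best_pivot(actions, v, gamma)
        if j is None:
            return pi, v                  # no positive gain: policy optimal
        pi[actions[j][0]] = j             # pivot: switch that state's action
```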

SLIDE 19

Discounted MDPs

Now consider the optimal gains r*_a (reduced costs with respect to the optimal values). Suppose ∆ = min_{a′∈π} r*_{a′} and a = argmin_{a′∈π} r*_{a′} (note ∆ ≤ 0). If a ∈ π, then

∆ ≥ r^T x^π − r^T x* ≥ ∆ · n/(1−γ)

Therefore once r^T x* − r^T x^π shrinks by a factor of n/(1−γ), action a can never again appear in a policy, and this happens after

log_{1−(1−γ)/n} ((1−γ)/n) = O( (n/(1−γ)) log(n/(1−γ)) )

rounds [Ye10].
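For completeness, the round count follows from the standard inequality ln(1 − x) ≤ −x (our added step, using the slide's quantities):

```latex
\log_{1-\frac{1-\gamma}{n}}\frac{1-\gamma}{n}
  = \frac{\ln\frac{1-\gamma}{n}}{\ln\!\left(1-\frac{1-\gamma}{n}\right)}
  \le \frac{\ln\frac{n}{1-\gamma}}{\frac{1-\gamma}{n}}
  = O\!\left(\frac{n}{1-\gamma}\,\log\frac{n}{1-\gamma}\right)
```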

SLIDE 20

Deterministic MDPs

In a deterministic policy, an action is either on a path or on a cycle.
◮ If a is on a path then x_a ∈ [1, n]
◮ If a is on a cycle then x_a ∈ [1/(1−γ), n/(1−γ)]

So if x_a ≠ 0, it must lie in one of two layers of polynomial size.

[Figure: the flux axis from 0 through 1 and n (path layer) up to 1/(1−γ) and n/(1−γ) (cycle layer)]
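A tiny numeric check of the two layers (a made-up instance, using the flux orientation assumed earlier): a deterministic policy whose graph is one path edge feeding a 2-cycle.

```python
import numpy as np

# Toy check of the flux layers: states 0 -> 1 -> 2 -> 1 under the policy,
# so the action at state 0 is a path action and those at 1, 2 lie on a cycle.
gamma = 0.9
P_pi = np.zeros((3, 3))
P_pi[0, 1] = P_pi[1, 2] = P_pi[2, 1] = 1.0
x = np.linalg.solve(np.eye(3) - gamma * P_pi.T, np.ones(3))
print(x)   # x[0] = 1 (path layer); x[1], x[2] ~ 1/(1-gamma) (cycle layer)
```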

SLIDE 21

Uniform discount

Lemma

If the algorithm updates a path action, it reduces the gap to the last policy before creating a new cycle by a factor of 1 − 1/n^2.

Lemma

After O(n^2 log n) iterations, either the algorithm finishes, creates a new cycle, breaks a cycle, or some action never again appears in a policy before a new cycle is created.

Lemma

After O(n^2 m log n) iterations, either the algorithm finishes or creates a new cycle.

SLIDE 24

Uniform discount

Lemma

If the algorithm creates a new cycle, it reduces the gap to the optimum by a factor of 1 − 1/n.

Lemma

After O(n log n) cycles are created, either the algorithm converges, or some action is eliminated from cycles for the remainder of the algorithm or eliminated entirely from future policies.

Theorem

The simplex method converges in O(n^3 m^2 log^2 n) iterations on deterministic MDPs.
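Reading the lemmas together, the iteration count composes as follows (our bookkeeping, assuming at most m elimination events, one per action):

```latex
\underbrace{O(m)}_{\text{eliminations}}
\times \underbrace{O(n\log n)}_{\text{cycles per elimination}}
\times \underbrace{O(n^2 m\log n)}_{\text{iterations per cycle}}
= O(n^3 m^2 \log^2 n)
```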

SLIDE 25

Nonuniform discount

Now each action a has its own discount γ_a.

Problem: no more conservation of flux! Previously we used the highest gain to bound the distance to the optimum, but now this is no longer possible.

[Figure: a state s on two cycles C and C′, with rewards r_a = 10 and r_a′ = 1 and discounts γ_C = 0.9 and γ_C′ = 0.9999]

Different cycles in a policy may have vastly different amounts of flux.

SLIDE 26

Nonuniform discount

Basic idea: the discount (and hence flux) in a cycle is roughly determined by the lowest-discount action on the cycle. When a cycle is created, we make a lot of progress towards the optimal value of some state, assuming its optimal flux is in that range.

[Figure: the state-action pairs (s1, a1), (s1, a2), ..., (sn, am) bucketed by flux layer]

Theorem

The algorithm terminates in O(n^5 m^3 log^2 n) iterations.

SLIDE 32

Open questions

◮ Analyze policy iteration on deterministic MDPs
◮ A strongly polynomial algorithm for general MDPs
◮ Apply the layer idea to other problems