
Lecture 2: Infinite Horizon and Indefinite Horizon MDPs (B9140)



  1. Lecture 2: Infinite Horizon and Indefinite Horizon MDPs
     B9140 Dynamic Programming & Reinforcement Learning, Prof. Daniel Russo

     Last time:
     • RL overview and motivation
     • Finite horizon MDPs: formulation and the DP algorithm

     Today:
     • Infinite horizon discounted MDPs
     • Basic theory of Bellman operators: contraction mappings, existence of optimal policies
     • Analogous theory for indefinite horizon (episodic) MDPs

  2. Warmup: Finite Horizon Discounted MDPs

     A special case of last time:
     • Finite state and control spaces.
     • Periods 0, 1, ..., N with controls u_0, ..., u_{N-1}.
     • Stationary transition dynamics: f_k(x, u, w) = f(x, u, w) for all k ∈ {0, ..., N-1}.
     • Stationary control spaces: U_k(x) = U(x) for all k ∈ {0, ..., N-1}.
     • Discounted costs: g_k(x, u, w) = γ^k g(x, u, w) for k ∈ {0, ..., N-1}.
     • Special terminal costs: g_N(x) = γ^N c(x).
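To fix ideas, the snippet below sketches one possible tabular encoding of such an MDP in Python. Everything here (the array names p, g, c, the sizes, and the random toy numbers) is a hypothetical illustration, not notation from the course.

```python
# A minimal toy encoding of a finite, stationary, discounted MDP.
# All sizes and values below are hypothetical illustrations.
import numpy as np

n_states, n_controls = 3, 2   # |X| = 3, two controls available in every state
gamma = 0.9                   # discount factor
N = 5                         # horizon length

rng = np.random.default_rng(0)

# p[u, x, x'] = P(f(x, u, w) = x'): one row-stochastic matrix per control.
p = rng.random((n_controls, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)

# g[x, u] = E[g(x, u, w)]: expected one-period cost.
g = rng.random((n_states, n_controls))

# c[x]: terminal cost.
c = rng.random(n_states)
```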

  3. Warmup: Finite Horizon Discounted MDPs

     A policy π = (μ_0, ..., μ_{N-1}) is a sequence of mappings where μ_k(x) ∈ U(x) for all x ∈ X.

     The expected cumulative "cost-to-go" of a policy π from starting state x is

        J_π(x) = E[ Σ_{k=0}^{N-1} γ^k g(x_k, μ_k(x_k), w_k) + γ^N c(x_N) ]

     where the expectation is over the i.i.d. disturbances w_0, ..., w_{N-1}.

     The optimal expected cost-to-go is

        J*(x) = min_{π ∈ Π} J_π(x)   ∀ x ∈ X.
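One way to make this definition concrete is to estimate J_π(x) by simulation. A sketch, assuming the toy arrays p, g, c, gamma from the previous snippet and a policy stored as a list of per-period integer arrays mu_k (all hypothetical names):

```python
# Monte Carlo estimate of J_pi(x0): average the discounted cost of many
# simulated trajectories. Assumes the toy arrays from the previous sketch.
import numpy as np

def simulate_cost(p, g, c, gamma, policy, x0, n_runs=10_000, seed=1):
    rng = np.random.default_rng(seed)
    n_states = len(c)
    total = 0.0
    for _ in range(n_runs):
        x, cost = x0, 0.0
        for k, mu in enumerate(policy):          # k = 0, ..., N-1
            u = mu[x]
            cost += gamma**k * g[x, u]           # expected stage cost (expectation is linear)
            x = rng.choice(n_states, p=p[u, x])  # draw x_{k+1} ~ p(.|x, u)
        cost += gamma**len(policy) * c[x]        # discounted terminal cost
        total += cost
    return total / n_runs
```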

  4. The Dynamic Programming Algorithm

     Set J*_N(x) = c(x) ∀ x ∈ X.
     For k = N-1, N-2, ..., 0, set

        J*_k(x) = min_{u ∈ U(x)} E[ g(x, u, w) + γ J*_{k+1}(f(x, u, w)) ]   ∀ x ∈ X.

     Main Proposition from last time
     For all initial states x ∈ X, the optimal cost-to-go is J*(x) = J*_0(x). This is attained by a policy π* = (μ*_0, ..., μ*_{N-1}) where for all k ∈ {0, ..., N-1} and x ∈ X,

        μ*_k(x) ∈ argmin_{u ∈ U(x)} E[ g(x, u, w) + γ J*_{k+1}(f(x, u, w)) ].
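The algorithm translates almost line by line into code. A sketch against the toy tabular representation assumed above (p, g, c, gamma, N are the hypothetical arrays from the first snippet):

```python
# Backward-induction DP: computes J*_0 and an optimal policy.
import numpy as np

def dp_backward(p, g, c, gamma, N):
    """Return (J*_0, [mu*_0, ..., mu*_{N-1}]) for the toy tabular MDP."""
    J = c.copy()                                   # J*_N = c
    policy = []
    for k in range(N - 1, -1, -1):                 # k = N-1, ..., 0
        # Q[x, u] = g(x, u) + gamma * sum_x' p(x'|x, u) * J(x')
        Q = g + gamma * np.einsum('uxy,y->xu', p, J)
        policy.append(Q.argmin(axis=1))            # mu*_k attains the minimum
        J = Q.min(axis=1)                          # J*_k = T J*_{k+1}
    policy.reverse()
    return J, policy
```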

  5. The DP Algorithm for Policy Evaluation

     How do we find the cost-to-go of an arbitrary policy π = (μ_0, ..., μ_{N-1})? J_π(x) = J_0(x), where J_0 is the output of the following iterative algorithm.

     Set J_N(x) = c(x) ∀ x ∈ X.
     For k = N-1, N-2, ..., 0, set

        J_k(x) = E[ g(x, μ_k(x), w) + γ J_{k+1}(f(x, μ_k(x), w)) ]   ∀ x ∈ X.
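A matching sketch for policy evaluation, again against the assumed toy arrays; the only change from the previous snippet is that the minimization over u is replaced by the policy's own choice μ_k(x):

```python
# Policy evaluation by backward recursion: J_k = T_{mu_k} J_{k+1}.
import numpy as np

def evaluate_policy(p, g, c, gamma, policy):
    """Return J_pi = J_0 for a policy given as a list [mu_0, ..., mu_{N-1}]."""
    J = c.copy()                                   # J_N = c
    xs = np.arange(len(c))
    for mu in reversed(policy):                    # k = N-1, ..., 0
        # J_k(x) = g(x, mu(x)) + gamma * sum_x' p(x'|x, mu(x)) * J(x')
        J = g[xs, mu] + gamma * (p[mu, xs] @ J)
    return J
```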

  6. Bellman Operators

     For any stationary policy μ mapping x ∈ X to μ(x) ∈ U(x), define T_μ, which maps a cost-to-go function J ∈ R^{|X|} to another cost-to-go function T_μ J ∈ R^{|X|}, by

        (T_μ J)(x) = E[ g(x, μ(x), w) + γ J(f(x, μ(x), w)) ]

     where (as usual) the expectation is taken over the disturbance w.
     • We call T_μ the Bellman operator corresponding to the policy μ.
     • It is a map from the space of cost-to-go functions to the space of cost-to-go functions.
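In code, T_μ is just a function from vectors to vectors. A sketch under the same assumed tabular encoding, with the policy μ stored as an integer array mu[x]:

```python
# The Bellman operator for a fixed stationary policy mu.
import numpy as np

def T_mu(J, mu, p, g, gamma):
    """(T_mu J)(x) = g(x, mu(x)) + gamma * sum_x' p(x'|x, mu(x)) * J(x')."""
    xs = np.arange(len(J))
    return g[xs, mu] + gamma * (p[mu, xs] @ J)
```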

  7. Bellman Operators

     Define T, which maps a cost-to-go function J ∈ R^{|X|} to another cost-to-go function TJ ∈ R^{|X|}, by

        (TJ)(x) = min_{u ∈ U(x)} E[ g(x, u, w) + γ J(f(x, u, w)) ]

     where (as usual) the expectation is taken over the disturbance w.
     • We call T the Bellman operator.
     • It is a map from the space of cost-to-go functions to the space of cost-to-go functions.
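The optimality operator T differs from T_μ only in taking a minimum over controls. A sketch under the same assumptions:

```python
# The Bellman (optimality) operator.
import numpy as np

def T(J, p, g, gamma):
    """(TJ)(x) = min_u [ g(x, u) + gamma * sum_x' p(x'|x, u) * J(x') ]."""
    Q = g + gamma * np.einsum('uxy,y->xu', p, J)   # Q[x, u]
    return Q.min(axis=1)
```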

  8. Alternate Notation: Transition Probabilities

     Write the expected cost function as g(x, u) = E[ g(x, u, w) ] and the transition probabilities as p(x'|x, u) = P(f(x, u, w) = x'), where both integrate over the distribution of the disturbance w. In this notation

        (T_μ J)(x) = g(x, μ(x)) + γ Σ_{x' ∈ X} p(x'|x, μ(x)) J(x')

     and

        (TJ)(x) = min_{u ∈ U(x)} [ g(x, u) + γ Σ_{x' ∈ X} p(x'|x, u) J(x') ].

  9. The Dynamic Programming Algorithm

     Old notation:
     Set J*_N(x) = c(x) ∀ x ∈ X.
     For k = N-1, N-2, ..., 0, set

        J*_k(x) = min_{u ∈ U(x)} E[ g(x, u, w) + γ J*_{k+1}(f(x, u, w)) ]   ∀ x ∈ X.

     Operator notation:
     Set J*_N = c ∈ R^{|X|}.
     For k = N-1, N-2, ..., 0, set J*_k = T J*_{k+1}.

  10. The Dynamic Programming Algorithm

     Main Proposition from last time: old notation
     For all initial states x ∈ X, the optimal cost-to-go is J*(x) = J*_0(x). This is attained by a policy π* = (μ*_0, ..., μ*_{N-1}) where for all k ∈ {0, ..., N-1} and x ∈ X,

        μ*_k(x) ∈ argmin_{u ∈ U(x)} E[ g(x, u, w) + γ J*_{k+1}(f(x, u, w)) ].

     Main Proposition from last time: operator notation
     For all initial states x ∈ X, the optimal cost-to-go is J*(x) = J*_0(x). This is attained by a policy π* = (μ*_0, ..., μ*_{N-1}) satisfying

        T_{μ*_k} J*_{k+1} = T J*_{k+1}   ∀ k ∈ {0, 1, ..., N-1}.

  11. The DP Algorithm for Policy Evaluation

     How do we find the cost-to-go of an arbitrary policy π = (μ_0, ..., μ_{N-1})? J_π(x) = J_0(x), where J_0 is the output of the following iterative algorithm.

     Old notation:
     Set J_N(x) = c(x) ∀ x ∈ X.
     For k = N-1, N-2, ..., 0, set

        J_k(x) = E[ g(x, μ_k(x), w) + γ J_{k+1}(f(x, μ_k(x), w)) ]   ∀ x ∈ X.

     Operator notation:
     Set J_N = c ∈ R^{|X|}.
     For k = N-1, N-2, ..., 0, set J_k = T_{μ_k} J_{k+1}.

  12. Composition of Bellman Operators

     In the DP algorithm,

        J*_0 = T J*_1 = T(T J*_2) = ··· = T^N c.

     Analogously, for any policy π = (μ_0, μ_1, ..., μ_{N-1}),

        J_π = T_{μ_0} T_{μ_1} ··· T_{μ_{N-1}} c.

     • Applying the Bellman operator T to c iteratively N times gives the optimal cost-to-go in an N-period problem with terminal costs c.
     • Applying the Bellman operators associated with a policy to c iteratively N times gives that policy's cost-to-go in an N-period problem with terminal costs c.
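The first identity can be checked numerically: iterating J ← TJ starting from c, N times, should reproduce the output of backward induction. A sketch, reusing T, dp_backward, and the toy arrays from the earlier snippets:

```python
import numpy as np

J = c.copy()
for _ in range(N):
    J = T(J, p, g, gamma)            # after N applications, J = T^N c

J_star_0, _ = dp_backward(p, g, c, gamma, N)
assert np.allclose(J, J_star_0)      # T^N c coincides with J*_0
```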

  13. Infinite Horizon Discounted MDPs

     The same problem as before, but take N → ∞.
     • Finite state and control spaces.
     • Periods 0, 1, ... with controls u_0, u_1, ....
     • Stationary transition dynamics: f_k(x, u, w) = f(x, u, w) for all k ∈ ℕ.
     • Stationary control spaces: U_k(x) = U(x) for all k ∈ ℕ.
     • Discounted costs: g_k(x, u, w) = γ^k g(x, u, w) for k ∈ ℕ.

     The objective is to minimize

        lim_{N→∞} E[ Σ_{k=0}^{N} γ^k g(x_k, u_k, w_k) ].

  14. Infinite Horizon Discounted MDPs

     • A policy π = (μ_0, μ_1, μ_2, ...) is a sequence of mappings with μ_k(x) ∈ U(x) for each x ∈ X.
     • The expected cumulative "cost-to-go" of a policy π from starting state x is

          J_π(x) = lim_{N→∞} E[ Σ_{k=0}^{N} γ^k g(x_k, μ_k(x_k), w_k) ]

       where x_{k+1} = f(x_k, μ_k(x_k), w_k) and the expectation is over the i.i.d. disturbances w_0, w_1, w_2, ....
     • The optimal expected cost-to-go is J*(x) = inf_{π ∈ Π} J_π(x) ∀ x ∈ X.
     • We say a policy π is optimal if J_π = J*.
     • For a stationary policy π = (μ, μ, μ, ...) we write J_μ instead of J_π.

  15. Infinite Horizon Discounted MDPs: Main Results

     Cost-to-go functions
     J_μ is the unique solution to the equation T_μ J = J, and iterates of the relation J_{k+1} = T_μ J_k converge to J_μ at a geometric rate.

     Optimal cost-to-go functions
     J* is the unique solution to the Bellman equation TJ = J, and iterates of the relation J_{k+1} = T J_k converge to J* at a geometric rate.

     Optimal policies
     There exists an optimal stationary policy. A stationary policy (μ, μ, ...) is optimal if and only if T_μ J* = T J*.

     By computing the optimal cost-to-go function we are solving a fixed-point equation, and one way to solve this equation is by iterating the Bellman operator. Once we have calculated the optimal cost-to-go function, we can find an optimal policy by solving the one-period problem

        min_{u ∈ U(x)} E[ g(x, u, w) + γ J*(f(x, u, w)) ].
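These results license two computations, sketched below under the same assumed toy tabular encoding: value iteration (iterate J ← TJ until the change is tiny, then act greedily), and exact evaluation of a stationary policy by solving the linear system J = T_μ J, i.e. (I − γ P_μ) J = g_μ.

```python
# Value iteration and exact stationary-policy evaluation.
import numpy as np

def value_iteration(p, g, gamma, tol=1e-8):
    """Iterate J <- TJ until the sup-norm change is below tol; also return
    a greedy (hence optimal) stationary policy satisfying T_mu J* = T J*."""
    J = np.zeros(g.shape[0])
    while True:
        Q = g + gamma * np.einsum('uxy,y->xu', p, J)   # Q[x, u]
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=1)
        J = J_new

def evaluate_stationary(p, g, gamma, mu):
    """Solve (I - gamma * P_mu) J = g_mu exactly for J_mu."""
    n = g.shape[0]
    xs = np.arange(n)
    P_mu = p[mu, xs]                  # P_mu[x, x'] = p(x'|x, mu(x))
    return np.linalg.solve(np.eye(n) - gamma * P_mu, g[xs, mu])
```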

  16. Example: Selling an Asset

     An instance of optimal stopping.
     • No deadline to sell.
     • Potential buyers make offers in sequence.
     • The agent chooses to accept or reject each offer.
       – The asset is sold once an offer is accepted.
       – Offers are no longer available once declined.
     • Offers are i.i.d.
     • Profits can be invested at interest rate r > 0 per period.
       – We therefore discount with factor γ = 1/(1 + r).

  17. Example: Selling an Asset

     • Special terminal state t (costless and absorbing).
     • x_k ≠ t is the offer considered at time k.
     • x_0 = 0 is a fictitious null offer.
     • g(x, sell) = x.
     • x_k = w_{k-1} for independent w_0, w_1, ....

     The Bellman equation J* = TJ* becomes

        J*(x) = max{ x, γ E[J*(w)] }.

     The optimal policy is a threshold policy: sell if and only if x_k ≥ α, where α = γ E[J*(w)]. This stationary policy is much simpler than what we saw last time.
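Since J*(w) = max{w, α}, the threshold satisfies the scalar fixed-point equation α = γ E[max(w, α)], and the map on the right is itself a γ-contraction, so simple iteration converges. A sketch with a hypothetical discrete offer distribution (the values and probabilities below are made up for illustration):

```python
# Compute the selling threshold alpha = gamma * E[J*(w)] by fixed-point
# iteration on alpha = gamma * E[max(w, alpha)].
import numpy as np

offers = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical offer values
probs = np.array([0.4, 0.3, 0.2, 0.1])        # hypothetical probabilities
r = 0.05                                      # per-period interest rate
gamma = 1.0 / (1.0 + r)

alpha = 0.0
for _ in range(10_000):
    alpha_new = gamma * probs @ np.maximum(offers, alpha)
    if abs(alpha_new - alpha) < 1e-12:
        break
    alpha = alpha_new

print(f"accept the first offer with x >= alpha = {alpha:.2f}")
```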

