Markov Decision Processes II


  1. The Bellman Equation

     U(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') U(s')

     • The Bellman equation gives the utility of a state.
     • If there are n states, there are n Bellman equations to solve.
     • This is a system of simultaneous equations, but the equations are nonlinear because of the max operator.

     Iterative Solution

     • Define U_1(s) to be the utility if the agent is at state s and lives for 1 time step:

       U_1(s) = R(s)

       Calculate this for all states s.
     • Define U_2(s) to be the utility if the agent is at state s and lives for 2 time steps:

       U_2(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') U_1(s')

       U_1(s') has already been calculated above.

     The Bellman Update

     More generally, we have:

       U_{i+1}(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') U_i(s')

     • This is the maximum possible expected sum of discounted rewards (i.e. the utility) if the agent is at state s and lives for i+1 time steps.
     • This equation is called the Bellman update.
     • As the number of iterations goes to infinity, U_{i+1}(s) converges to an equilibrium value U*(s).
     • The final utility values U*(s) are solutions to the Bellman equations. Even better, they are the unique solutions, and the corresponding policy is optimal.
     • This algorithm is called value iteration.
     • The optimal policy is given by:

       π*(s) = argmax_a ∑_{s'} T(s, a, s') U*(s')
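To make the update concrete, here is a minimal Python sketch of one Bellman update and of the optimal-policy extraction. The dictionary-based MDP representation (R[s] for rewards, T[s][a] mapping successor states to their probabilities) is an assumption made for illustration, not something given in the slides.

```python
# Minimal sketch (assumed representation: R[s] = reward, T[s][a] = {s': probability}).
def bellman_update(U, R, T, gamma):
    """One sweep of U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U_i(s')."""
    return {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
        )
        for s in R
    }

def optimal_policy(U, T):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') U*(s'), computed from converged utilities."""
    return {
        s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
        for s in T
    }
```

Starting from U_1(s) = R(s) and calling bellman_update repeatedly produces the sequence U_1, U_2, U_3, … described above.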

  2. Example

     We will use the following convention when drawing MDPs graphically: each state is drawn as a node labelled with its reward R(s), each available action is drawn leaving the state, and each arc is labelled with its transition probability. The discount factor here is γ = 0.9.

     The example MDP has three states A, B, and C, with R(A) = +12, R(B) = -4, and R(C) = +2:
     • From A, action A1 goes to A with probability 0.5 and to B with probability 0.5; action A2 goes to C with probability 1.0.
     • From B, the only action B1 goes to A with probability 0.25 and to B with probability 0.75.
     • From C, the only action C1 goes to C with probability 0.5 and to B with probability 0.5.

     i = 1:
       U_1(A) = R(A) = 12
       U_1(B) = R(B) = -4
       U_1(C) = R(C) = 2

     i = 2:
       U_2(A) = 12 + (0.9)*max{(0.5)(12) + (0.5)(-4), (1.0)(2)} = 12 + (0.9)*max{4.0, 2.0} = 12 + 3.6 = 15.6
       U_2(B) = -4 + (0.9)*{(0.25)(12) + (0.75)(-4)} = -4 + (0.9)*0 = -4
       U_2(C) = 2 + (0.9)*{(0.5)(2) + (0.5)(-4)} = 2 + (0.9)*(-1) = 2 - 0.9 = 1.1

     i = 3:
       U_3(A) = 12 + (0.9)*max{(0.5)(15.6) + (0.5)(-4), (1.0)(1.1)} = 12 + (0.9)*max{5.8, 1.1} = 12 + 5.22 = 17.22
       U_3(B) = -4 + (0.9)*{(0.25)(15.6) + (0.75)(-4)} = -4 + (0.9)*(3.9 - 3) = -4 + (0.9)(0.9) = -3.19
       U_3(C) = 2 + (0.9)*{(0.5)(1.1) + (0.5)(-4)} = 2 + (0.9)*(0.55 - 2.0) = 2 + (0.9)(-1.45) = 0.695

     The Bellman Update

     • What exactly is going on?
     • Think of each Bellman update as an update of each local state.
     • If we do enough local updates, we end up propagating information throughout the state space.
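The example can be checked mechanically. The short script below encodes the MDP above in the same dictionary representation assumed earlier (the encoding itself is illustrative, but the rewards and transition probabilities are exactly those of the example) and reproduces the hand calculation.

```python
# Sketch: the three-state example above, encoded as plain dicts (representation assumed).
GAMMA = 0.9
R = {"A": 12, "B": -4, "C": 2}
T = {  # T[s][a] = {s': probability}
    "A": {"A1": {"A": 0.5, "B": 0.5}, "A2": {"C": 1.0}},
    "B": {"B1": {"A": 0.25, "B": 0.75}},
    "C": {"C1": {"C": 0.5, "B": 0.5}},
}

U = dict(R)  # U_1(s) = R(s)
for i in (2, 3):
    U = {
        s: R[s] + GAMMA * max(
            sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
        )
        for s in R
    }
    print(f"U_{i}:", {s: round(u, 3) for s, u in U.items()})

# Expected output (matching the hand calculation):
# U_2: {'A': 15.6, 'B': -4.0, 'C': 1.1}
# U_3: {'A': 17.22, 'B': -3.19, 'C': 0.695}
```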

  3. Value Iteration on the Maze

     [Figure: value iteration running on the maze example.]

     Notice that rewards are negative until a path to (4,3) is found, resulting in an increase in U.

     Value-Iteration Termination

     • When do you stop?
     • In an iteration over all the states, keep track of the maximum change in utility of any state (call this δ).
     • When δ is less than some pre-defined threshold, stop.
     • This gives an approximation to the true utilities; we can act greedily based on the approximated state utilities.

     Comments

     • Value iteration is designed around the idea of the utilities of the states.
     • The computational difficulty comes from the max operation in the Bellman equation.
     • Instead of computing the general utility of a state (assuming the agent acts optimally), a much easier quantity to compute is the utility of a state assuming a fixed policy.

     Evaluating a Policy

     • Once we compute the utilities, we can easily improve the current policy by one-step look-ahead:

       π_{i+1}(s) = argmax_a ∑_{s'} T(s, a, s') U_i(s')

     • This suggests a different approach for finding the optimal policy.

     Policy Iteration

     • Start with a randomly chosen initial policy π_0.
     • Iterate until no change in utilities:
       1. Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s using policy π_i.
       2. Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i(s).
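Here is a minimal sketch of the δ-based stopping rule described above, under the same assumed dictionary representation; the threshold value is an arbitrary illustrative choice. (The policy-iteration loop outlined above is sketched in code at the end of the next section.)

```python
# Sketch: value iteration stopped when the largest per-state change (delta)
# falls below a threshold. Assumed representation: R[s], T[s][a] = {s': prob}.
def value_iteration(R, T, gamma, threshold=1e-4):
    U = {s: R[s] for s in R}             # U_1(s) = R(s), as in the slides
    while True:
        delta = 0.0                      # maximum change in utility of any state
        new_U = {}
        for s in R:
            new_U[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
            )
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < threshold:            # approximation is good enough: stop
            return U
```

On the three-state example from Section 2, the first few sweeps of this loop reproduce the hand calculation before the changes become negligible.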

  4. Policy Evaluation

     • Policy improvement is straightforward.
     • Policy evaluation requires a simpler version of the Bellman equation.
     • Compute U_i(s) for every state s using π_i:

       U_i(s) = R(s) + γ ∑_{s'} T(s, π_i(s), s') U_i(s')

     • Notice that there is no max operator, so the above equations are linear!
     • Solving them exactly is O(n³), where n is the number of states.
     • O(n³) is still too expensive for large state spaces.
     • Instead of calculating exact utilities, we could calculate approximate utilities.
     • The simplified Bellman update is:

       U_{i+1}(s) ← R(s) + γ ∑_{s'} T(s, π_i(s), s') U_i(s')

     • Repeat the above k times to get the next utility estimate.
     • This is called modified policy iteration. (A code sketch of the full loop appears at the end of this section.)

     Comparison

     • Which would you prefer, policy or value iteration? It depends:
       – If you have lots of actions in each state: policy iteration.
       – If you have a pretty good policy to start with: policy iteration.
       – If you have few actions in each state: value iteration.

     Limitations

     • Need to represent the utility (and policy) for every state.
     • In real problems, the number of states may be very large, which leads to intractably large tables.
     • Need to find compact ways to represent the states, e.g.:
       – Function approximation
       – Hierarchical representations
       – Memory-based representations

     What you should know

     • How value iteration works
     • How policy iteration works
     • Pros and cons of both
     • What the big problem is with both value and policy iteration
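To tie the pieces together, here is a minimal policy-iteration sketch that uses k sweeps of the max-free update above for policy evaluation (the modified-policy-iteration idea), again under the assumed dictionary representation. The sweep count k and the stopping test (the policy no longer changing) are illustrative choices; exact evaluation would instead solve the linear system directly.

```python
# Sketch: policy iteration with k-sweep (modified) policy evaluation.
# Assumed representation: R[s] = reward, T[s][a] = {s': probability}.
def policy_evaluation(policy, U, R, T, gamma, k=20):
    """Approximate U under a fixed policy with k sweeps of the max-free update.
    (Exact evaluation would solve the linear system, at O(n^3) cost.)"""
    for _ in range(k):
        U = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s][policy[s]].items())
            for s in R
        }
    return U

def policy_improvement(U, T):
    """One-step look-ahead: pick the action with the best expected utility."""
    return {
        s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
        for s in T
    }

def policy_iteration(R, T, gamma):
    policy = {s: next(iter(T[s])) for s in T}   # arbitrary initial policy pi_0
    U = {s: 0.0 for s in R}
    while True:
        U = policy_evaluation(policy, U, R, T, gamma)
        new_policy = policy_improvement(U, T)
        if new_policy == policy:                # no change: policy is stable
            return policy, U
        policy = new_policy
```

It can be run directly on the three-state example from Section 2 by passing the same R, T, and γ = 0.9.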
