CMU-Q 15-381
Lecture 16: Markov Decision Processes I
Teacher: Gianni A. Di Caro
Goal: Define the action decision policy that maximizes a given (utility) function of the rewards, potentially for t → ∞
§ A set S of world states
§ A set A of feasible actions
§ A stochastic transition matrix P, P: S×S×A×{0, 1, …, T} ↦ [0, 1], P(s, s′, a) = p(s′ | s, a)
§ A reward function R: R(s), R(s, a), R(s, a, s′), R: S×A×S×{0, 1, …, T} ⟼ ℝ
§ A start state (or a distribution of initial states), optional
§ Terminal/Absorbing states, optional
§ Deterministic Policy π(s): a mapping from states to actions, π: S → A
§ Stochastic Policy, π(s, a): a mapping from states to a probability distribution over actions
Example from Sutton and Barto
Note: the “state” (robot’s battery status) is a parameter of the agent itself, not a property of the physical environment
§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; go to home base and recharge.
§ Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued.
§ States are battery levels: high, low.
§ Reward = number of cans collected (expected)
§ Starting from s₀, applying the policy π generates a sequence of states s₀, s₁, ⋯, s_T, and of rewards r₀, r₁, ⋯, r_T
§ For the (rational) decision-maker each sequence has a utility based on its preferences
§ Utility is a function of the sequence of rewards: U(reward sequence) → "additive function of the rewards"
§ The expected utility, or value, of a policy π starting in state s₀ is the expected utility over all the state sequences generated by applying π, which depend on the state transition dynamics:

V^π(s₀) = Σ_{seq ∈ {all state sequences starting from s₀}} P^π(seq) U(seq)
§ An optimal policy π* yields the maximal utility = maximal expected utility function of the rewards from following the policy starting from the initial state
✓ Principle of maximum expected utility: a rational agent should choose the action(s) that maximize its expected utility
§ Note: Different optimal policies arise from different reward models, which, in turn, determine different utilities for the same action sequence → Let's look at the grid world…
(Four grid-world policies compared, for R(s) = −2.0, R(s) = −0.4, R(s) = −0.04, R(s) = −0.01)
Balance between risk and reward changes depending on the value of R(s)
R(s) > 0
§ A robot car wants to travel far and quickly; it gets higher rewards for moving fast
§ Three states: Cool, Warm, Overheated (terminal state, ends the process)
§ Two actions: Slow, Fast
§ Going faster gets double reward
§ Green numbers are rewards
(Diagram: states Cool, Warm, Overheated. From Cool: Slow → Cool (1.0, +1); Fast → Cool (0.5, +2) or Warm (0.5, +2). From Warm: Slow → Cool (0.5, +1) or Warm (0.5, +1); Fast → Overheated (1.0).)
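A compact way to hold this MDP in code is a nested transition table. The sketch below is one possible encoding; the arc structure follows the diagram, but the slide shows no reward on the overheating arc, so the −10 penalty used here is an assumption:

```python
# Racing-car MDP as a nested dict:
# transitions[state][action] -> list of (probability, next_state, reward).
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],  # penalty value is an assumption
    },
    "overheated": {},  # terminal/absorbing state: no feasible actions
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
for state, acts in transitions.items():
    for action, outcomes in acts.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```

This representation makes the chance nodes explicit: each `(state, action)` pair maps to the distribution over successors.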
(Search tree: from a state, the actions slow and fast lead to chance nodes, whose outcomes are the successor states.)
§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]?
§ Now or later? [0, 0, 1] or [1, 0, 0]?
§ It’s reasonable to maximize the sum of rewards
§ It’s also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially by a factor γ — worth 1 now, γ one step from now, γ² two steps from now
§ How to discount? Each time we descend a level, we multiply in the discount γ once
§ Why discount? Sooner rewards probably do have higher utility than later rewards; discounting also helps our algorithms converge
§ Example: with a discount of γ = 0.5,
  U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
  U([1, 2, 3]) < U([3, 2, 1])
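The arithmetic above can be checked with a one-line helper (a sketch; `discounted_utility` is a name chosen here):

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

u_123 = discounted_utility([1, 2, 3], 0.5)  # 1*1 + 0.5*2 + 0.25*3 = 2.75
u_321 = discounted_utility([3, 2, 1], 0.5)  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```

With γ = 0.5 the sequence [3, 2, 1] indeed beats [1, 2, 3], since the larger rewards arrive earlier.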
§ Theorem: if we assume stationary preferences between sequences, then there are only two ways to define utilities over sequences of rewards:
§ Additive utility: U([r₀, r₁, r₂, …]) = r₀ + r₁ + r₂ + ⋯
§ Discounted utility: U([r₀, r₁, r₂, …]) = r₀ + γr₁ + γ²r₂ + ⋯
§ MDP:
§ Actions: East, West
§ Terminal states: a and e (the episode ends when one or the other is reached)
§ Transitions: deterministic
§ Reward for reaching a is 10; reward for reaching e is 1; reward for reaching all other states is 0
§ For γ = 1, what is the optimal policy?
§ For γ = 0.1, what is the optimal policy for states b, c and d?
§ For which γ are West and East equally good when in state d?
(Diagram: states a, b, c, d, e in a row; Exit actions at a and e with rewards 10 and 1.)
Answer to the last question: γ = √(1/10)
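Under the convention that the first reward is undiscounted, going East from d earns 1 immediately, while going West earns 10 discounted by two extra steps. A quick numeric check of the balance point (the step-counting convention is an assumption):

```python
import math

def values_from_d(gamma):
    # East: exit at e right away, reward 1 (undiscounted, by assumption).
    # West: pass through c and b, exit at a two steps later: 10 * gamma^2.
    return 10 * gamma ** 2, 1.0

g_balance = math.sqrt(1 / 10)       # gamma claimed to balance the two
west, east = values_from_d(g_balance)

assert abs(west - east) < 1e-9      # equally good at gamma = sqrt(1/10)
assert values_from_d(1.0)[0] > 1.0  # gamma = 1: West (reward 10) wins
assert values_from_d(0.1)[0] < 1.0  # gamma = 0.1: East wins from d
```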
§ Problem: What if the process can last forever? Do we get infinite rewards?
§ Possible solutions:
§ Finite horizon: terminate episodes after a fixed number of steps T (e.g., a lifetime). Gives nonstationary policies (π depends on the time left)
§ Discounting: with 0 < γ < 1,
  U([r₀, ⋯, r_∞]) = Σ_{t=0}^∞ γᵗ r_t ; if r_t = r for all t, then Σ_{t=0}^∞ γᵗ r = r/(1 − γ), so U([r₀, ⋯, r_∞]) ≤ R_max/(1 − γ)
§ Smaller γ means a shorter horizon: the far future will matter less
§ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
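The geometric-series bound above can be verified numerically (a minimal sketch):

```python
def discounted_sum(r, gamma, n_terms):
    """Partial sum of r + gamma*r + gamma^2*r + ..."""
    return sum(r * gamma ** t for t in range(n_terms))

r_max, gamma = 2.0, 0.9
bound = r_max / (1 - gamma)                   # closed form: r/(1 - gamma)
partial = discounted_sum(r_max, gamma, 1000)  # long but finite truncation

# The partial sums stay below the bound and approach it from below.
assert partial < bound
assert abs(bound - partial) < 1e-6
```

So even an infinitely long episode has bounded utility once 0 < γ < 1.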
§ The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally (according to π*)
§ The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally. Action a is not necessarily the optimal action; Q*(s, a) is the best we can get after taking a in s
§ The optimal policy: π*(s) = optimal action from state s
(Diagram: s is a state; (s, a) is a q-state; (s, a, s′) is a transition.)
Functional relation between V*(s) and Q*(s, a)?
§ Markov decision processes (MDPs):
§ Set of states S
§ Start state s₀ (optional)
§ Set of actions A
§ Transitions p(s′ | s, a) or P(s′, s, a)
§ Rewards R(s, a, s′) (and discount γ)
§ Terminal states (optional)
§ Markov / memoryless property
§ Policy π = choice of action for each state
§ Utility / Value = sum of (discounted) rewards
§ Value of a state, V(s), and value of a Q-state, Q(s, a)
§ Optimal policy π* = best choice, the one that maximizes utility
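The recap maps naturally onto a small container type; a sketch (the class and field names are choices made here, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list                 # set of states S
    actions: dict                # state -> feasible actions A(s)
    transitions: dict            # (s, a) -> [(prob, s_next, reward), ...]
    gamma: float = 0.9           # discount
    terminal: set = field(default_factory=set)  # terminal states (optional)

    def is_terminal(self, s):
        return s in self.terminal

# A two-state instance just to exercise the container.
tiny = MDP(
    states=["a", "end"],
    actions={"a": ["go"], "end": []},
    transitions={("a", "go"): [(1.0, "end", 5.0)]},
    terminal={"end"},
)
```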
§ Fundamental operation: compute the value V*(s) of a state
✓ Expected utility under optimal action
✓ Average of sum of (discounted) rewards
§ Recursive definition of value of a state:

V*(s) = max_a Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]

Sub-problem: R(s, a, s′) for the current state + γ·V*(next state)
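The recursive definition can be turned into repeated synchronous sweeps (value iteration); a minimal sketch, with function and parameter names chosen here:

```python
def value_iteration(states, actions, P, gamma, n_sweeps=100):
    """Apply V(s) <- max_a sum_s' p(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V = {
            s: max(
                (sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                 for a in actions[s]),
                default=0.0,  # states with no actions (terminal) keep 0
            )
            for s in states
        }
    return V

# Tiny deterministic check: one move into a terminal state, reward 5.
V = value_iteration(
    states=["a", "end"],
    actions={"a": ["go"], "end": []},
    P={("a", "go"): [(1.0, "end", 5.0)]},
    gamma=0.5,
)
```

Each sweep reads the old value function and writes a fresh one, so every state is backed up from the same iterate.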
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0.
Forget about this for now… It “means” that the optimal policy has been found, which is the one shown with ▲▼◄►
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0.
V(3,3): best action a = Right
V*((3,3)) for Right = 0.8(0 + 0.9·1) + 0.1(0 + 0.9·0.57) + 0.1(0 + 0.9·0.85) ≅ 0.85
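The backup above is just an expectation over the three slip outcomes; redoing the arithmetic (the 0.57 and 0.85 successor values are taken from the slide):

```python
gamma = 0.9
outcomes = [  # (probability, immediate reward, value of successor state)
    (0.8, 0.0, 1.00),  # intended move: into the +1 exit
    (0.1, 0.0, 0.57),  # slip to one side
    (0.1, 0.0, 0.85),  # slip to the other side
]
q_right = sum(p * (r + gamma * v) for p, r, v in outcomes)
# q_right = 0.72 + 0.0513 + 0.0765 = 0.8478, which rounds to 0.85
```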
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1.
§ The value V^π(s) of a state s under the policy π is the expected value of its return, the utility of all state sequences starting in s and applying π (State-Value function):

V^π(s) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s ]

§ The value Q^π(s, a) of taking an action a in state s under policy π is the expected return starting from s, taking action a, and thereafter following π (Action-Value function):

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s, a₀ = a ]
BELLMAN EQUATION FOR VALUE FUNCTION
Expected immediate reward (short-term utility) for taking action π(s) prescribed by π for state s
Expected future discounted reward (long-term utility) we get after taking that action from that state and following π
V^π(s) = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ] = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )  ∀s ∈ S
✓ Under a given policy π, an MDP is equivalent to a Markov reward process (MRP), and the question of interest is the prediction of the expected cumulative reward that results from a state s, which is the same as computing V^π(s)
V^π(s) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s ]
       = E_π[ Σ_{k=0}^∞ γᵏ R(s_{t+k+1}) | s_t = s ]
       = E_π[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + ⋯ | s_t = s ]
       = E_π[ R(s_{t+1}) + γ Σ_{k=0}^∞ γᵏ R(s_{t+k+2}) | s_t = s ]
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ E_π[ Σ_{k=0}^∞ γᵏ R(s_{t+k+2}) | s_{t+1} = s′ ] )
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )
       = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ]
V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S

§ How do we find V^π values for all states? |S| linear equations in |S| unknowns!
  V_π = P_π(R_π + γ V_π)  →  V_π = [I − γ P_π]⁻¹ [P_π R_π]
§ Complexity: O(|S|³) for inverting an |S|×|S| matrix
§ Prediction problem: computing the value of a policy
§ Exact or numeric solution
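In matrix form the |S| equations can be solved directly. The sketch below folds the expected one-step reward into a vector `r_pi`, a common rearrangement of the V = P(R + γV) form shown above; all names are choices made here:

```python
import numpy as np

def evaluate_policy_exact(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V  ->  V = (I - gamma P_pi)^-1 r_pi.

    P_pi[i, j] = p(s_j | s_i, pi(s_i)); r_pi[i] = expected immediate
    reward from s_i under pi. Cost is O(|S|^3), as for the inversion.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Two-state chain: s0 -> s1 with reward 1; s1 absorbs with reward 0.
P = np.array([[0.0, 1.0], [0.0, 1.0]])
r = np.array([1.0, 0.0])
V = evaluate_policy_exact(P, r, gamma=0.5)   # expect V = [1, 0]
```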
Values V^π for the policy π shown, with γ = 1, R(s) = −0.04:

0.812  0.868  0.918  +1
0.762  (wall) 0.660  −1
0.705  0.655  0.611  0.388
(π is also optimal in this example case)
Example from Sutton and Barto
§ Value of a state, V(s = ball location): negative of the number of strokes to the hole from that location → a scalar field for the expected utility
§ Actions: which club to use {putter, driver} (assuming that we know how to swing once the club is decided)
§ Policy for the value function: only use the putter (off the green we cannot reach the hole with a putt, while from anywhere on the green we assume we can make a putt)
✓ The equations suggest an iterative, recursive update approach that exploits the sub-problem structure and the relations among sub-problems
✓ k = updating step for the value of a state
✓ Given an expected value function V_k at iteration k, we can back up the expected value function V_{k+1} at iteration k + 1:

V_{k+1}(s) ← Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V_k(s′) ]  ∀ s ∈ S

§ Expected value function at iteration k
§ Bellman backup operator T^π: V_{k+1}(s) = (T^π V_k)(s)
§ Sweep: apply the backup operator to all states, V_{k+1} = T^π V_k
Input π, the policy to be evaluated
Initialize V(s) ∀s ∈ S (e.g., V(s) = 0)
Repeat
  Δ ← 0
  Foreach s ∈ S
    v ← V(s)
    V(s) ← Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V(s′) ]
    Δ ← max(Δ, |v − V(s)|)
Until Δ < θ (a small positive number)
Output V ≈ V^π
V₀ ↦ V₁ ↦ V₂ ↦ ⋯ ↦ V_k, with V_k → V^π for k → ∞ (in practice, for large finite k)
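The pseudocode above transcribes directly into Python (in-place updates, as in the algorithm; function and parameter names are chosen here):

```python
def iterative_policy_evaluation(states, policy, P, gamma, theta=1e-10):
    """Sweep V(s) <- sum_s' p(s'|s,pi(s)) [R + gamma V(s')] until the
    largest per-sweep change Delta falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(p * (r + gamma * V[s2])
                       for p, s2, r in P.get((s, policy.get(s)), []))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Two-state chain: a -> end (reward 1), end absorbs with no actions.
V = iterative_policy_evaluation(
    states=["a", "end"],
    policy={"a": "go"},
    P={("a", "go"): [(1.0, "end", 1.0)]},
    gamma=0.5,
)
```

States with no entry in `policy` (terminal states) simply keep value 0.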
Backup diagram: state s at the root; under π(s), transitions lead to successors s′, s′′, s′′′ with rewards R(s′), R(s′′), R(s′′′) and values V^π(s′), V^π(s′′), V^π(s′′′), averaged at the chance node.

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S

§ Given an expected value function, we use it to back up the value of a state s
§ Update of a state s value: a sub-problem related to state s, a backup operation
§ Relation between the value of a state and that of its successor states
§ The Bellman equation results from additivity of utility + the Markov property
§ The optimal solution can be decomposed into sub-problems
§ Recursive state equations that need to be mutually consistent
§ Solutions for a sub-problem can be cached and reused
Backup diagram, deterministic vs. stochastic policy:
§ Deterministic policy: V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S
§ Stochastic policy: V^π(s) = Σ_{a∈A} π(a | s) Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]  ∀s ∈ S
V^π is the value of a policy π, but what we are looking for is the value (i.e., the expected utility) from applying the best policy, π*. We need to find / compute the following functions:
§ V*(s) = Optimal state-value function: V*(s) = max_π V^π(s)  ∀s ∈ S
§ Q*(s, a) = Optimal action-value function: Q*(s, a) = max_π Q^π(s, a)  ∀s ∈ S, a ∈ A
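The two optimal functions are linked by V*(s) = max_a Q*(s, a), with Q*(s, a) = Σ_{s′} p(s′|s,a)[R(s,a,s′) + γV*(s′)]. A sketch of that relation on a toy MDP (all names and numbers chosen here):

```python
def q_from_v(V_star, actions, P, gamma):
    """Q*(s,a) = sum_s' p(s'|s,a) [R(s,a,s') + gamma V*(s')]."""
    return {
        (s, a): sum(p * (r + gamma * V_star[s2]) for p, s2, r in P[(s, a)])
        for s in actions for a in actions[s]
    }

# Toy MDP: from "a", fast pays 4 and slow pays 1, both ending in "b".
P = {("a", "fast"): [(1.0, "b", 4.0)], ("a", "slow"): [(1.0, "b", 1.0)]}
V_star = {"a": 4.0, "b": 0.0}   # optimal values for this toy problem
Q = q_from_v(V_star, {"a": ["fast", "slow"]}, P, gamma=0.9)

# Consistency check: V*(s) = max_a Q*(s,a)
assert max(Q[("a", "fast")], Q[("a", "slow")]) == V_star["a"]
```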
§ Optimal action-values for choosing club = driver, and afterward selecting either the driver or the putter, whichever is better based