CMU MDPs 15-381/781 Emma Brunskill (THIS TIME) Ariel Procaccia

• DeepMind

So long certainty…
[Figure: agent-environment loop. The agent receives percepts from the environment through sensors and acts on it through actuators.]
• Until now, result of taking an action in a state was deterministic
Slide adapted from Klein and Abbeel

Reasoning Under Uncertainty
                                      Actions don't change       Actions change
                                      state of the world         state of the world
Learn model of outcomes               Multi-armed bandits        Reinforcement learning
Given model of stochastic outcomes    Decision theory            Markov decision processes

Expectation
• The expected value of a function of a random variable is the average, weighted by the probability distribution over outcomes
• Example: expected time if you take the bus
  o Time: 5 min or 30 min
  o Probability: 0.7 and 0.3, respectively
  o Expected time: 0.7 × 5 min + 0.3 × 30 min = 12.5 min
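The bus example can be checked with a few lines of Python. This is only a sketch; the helper name `expected_value` is an illustrative assumption, not something from the slides.

```python
def expected_value(outcomes):
    """Probability-weighted average of (probability, value) pairs."""
    total_prob = sum(p for p, _ in outcomes)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)

# Bus example from the slide: 5 min with probability 0.7, 30 min with probability 0.3.
bus_time = expected_value([(0.7, 5.0), (0.3, 30.0)])
print(f"expected bus time: {bus_time:.1f} min")
```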

Where Do Probabilities Come from?
• Models
• Data
• For now assume we are given the probabilities for any chance node
[Figure: search tree with a max node above chance nodes]

Reasoning Under Uncertainty
                                      Actions don't change       Actions change
                                      state of the world         state of the world
Learn model of outcomes
Given model of stochastic outcomes    Decision theory            Markov decision processes

(Stochastically) Change the World
[Figure: agent-environment loop with sensors/percepts and actuators/actions]
• Like planning/search, actions impact world
• But exact impact is stochastic: probability distribution over next states

Example: Grid World
§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent's path
§ The agent receives rewards each time step
  § Small "living" reward each step (can be negative)
  § Big rewards come at the end (good or bad)
§ Goal: maximize sum of rewards
§ Noisy movement: actions do not always go as planned
  § 80% of the time, action North takes the agent North (if there is no wall there)
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have gone, agent stays put
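The noisy movement rule can be sketched as a function that returns a distribution over next cells. The grid layout below is hypothetical (the slide's figure is not available here), the grid boundary is treated like a wall, and the slip directions for actions other than North are assumed by symmetry.

```python
# Unit moves as (row delta, col delta); row 0 is the top row.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# Perpendicular slips per intended action (the slide states these only for North).
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(grid, pos, direction):
    """Deterministic move; the agent stays put if blocked by a wall or the boundary."""
    r, c = pos
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
    if not inside or grid[nr][nc] == "#":
        return pos
    return (nr, nc)

def transition(grid, pos, action):
    """Return {next_pos: probability} for the 80/10/10 noisy movement model."""
    dist = {}
    left, right = SLIPS[action]
    for direction, prob in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = step(grid, pos, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

# Hypothetical 3x4 grid: "." is free, "#" is a wall.
GRID = ["....",
        ".#..",
        "...."]
print(transition(GRID, (2, 0), "N"))
```

Returning a dictionary keeps the probability mass of blocked moves on the current cell, which is how the "agent stays put" rule shows up in P(s' | s, a).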

Grid World Actions
[Figure: two panels, Deterministic Grid World and Stochastic Grid World]

Markov Decision Processes
• Set of states s ∈ S
• Set of actions a ∈ A
• Transition function T(s, a, s')
  o Probability that a from s leads to s', i.e., P(s' | s, a)
• Reward function R(s, a, s') / R(s) / R(s, a)
• Start state or states (could be all of S)
• Maybe a terminal state
• Discount factor γ
• MDPs are non-deterministic search problems

Markov Decision Processes

Markov Property
• Called Markov decision process because the outcome of an action depends only on the current state
$$p(s_{t+1} \mid s_1, a_1, s_2, a_2, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$$

Policies
• In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
• In MDPs, instead of plans, we have policies
• A policy π: S → A
  o Specifies what action to take in each state

How Many Policies? • How many non-terminal states? • How many actions? • How many deterministic policies over non-terminal states?
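A deterministic policy picks one action per non-terminal state, so the count is (number of actions) raised to the (number of non-terminal states). The sketch below enumerates the policies of a tiny made-up problem; the state and action names are hypothetical and are not the grid from the slide.

```python
from itertools import product

def deterministic_policies(states, actions):
    """Yield every mapping that assigns one action to each state."""
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))

states = ["s1", "s2", "s3"]   # hypothetical non-terminal states
actions = ["left", "right"]   # hypothetical actions
policies = list(deterministic_policies(states, actions))

# The enumeration agrees with the closed form |A| ** |S|.
print(len(policies), len(actions) ** len(states))
```

Because the count grows exponentially in the number of states, enumerating policies is only practical for toy problems.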

Optimal Policies
• Optimal plan had minimal cost to reach goal
• Utility or value of a policy π starting in state s is the expected sum of future rewards the agent will receive by following π starting in state s
• Optimal policy has maximal expected sum of rewards from following it

Optimal Policies
[Figure: four grid world panels showing the optimal policy for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]

Example: Racing
• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward
[Figure: transition diagram over Cool, Warm, and Overheated, with Slow and Fast edges labeled with probabilities (0.5, 1.0) and rewards (+1, +2, -10)]
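One way to write the racing problem down as data is a nested dictionary of transitions. The edge assignment below is an assumption: the diagram is garbled in this text, so the numbers follow the usual Klein and Abbeel version of the example (Slow is safe, Fast from Warm always overheats). The helper names are illustrative.

```python
# RACING[state][action] -> list of (probability, next_state, reward).
# Assumed edge assignment; only the labels 0.5, 1.0, +1, +2, -10 survive in the slide text.
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}

def expected_reward(mdp, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in mdp[state][action])

def is_valid(mdp):
    """Check that every (state, action) pair has probabilities summing to 1."""
    return all(
        abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
        for actions in mdp.values()
        for outcomes in actions.values()
    )

print(is_valid(RACING), expected_reward(RACING, "cool", "fast"))
```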

Racing Search Tree

Utilities of Sequences

Utilities of Sequences
• What preferences should an agent have over reward sequences?
• More or less? [1, 2, 2] or [2, 3, 4]
• Now or later? [0, 0, 1] or [1, 0, 0]

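Stationary Preferences
• Theorem: if we assume stationary preferences:
$$[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots] \iff [r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]$$
• Then: there are only two ways to define utilities over sequences of rewards
  o Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  o Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$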

What are Discounts?
• It's reasonable to prefer rewards now to rewards later
• Decay rewards exponentially
[Figure: the same reward shown as worth now, worth next step, and worth in two steps]
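Exponential decay means the reward received t steps from now is multiplied by γ^t. A minimal sketch, assuming the first reward in the sequence is undiscounted; the function name and γ = 0.5 are illustrative choices.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * r_t, with the first reward (t = 0) undiscounted."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# "Now or later?" with an illustrative gamma of 0.5.
later = discounted_utility([0, 0, 1], 0.5)  # reward arrives two steps from now
now = discounted_utility([1, 0, 0], 0.5)    # reward arrives immediately
print(later, now)
```

With any γ < 1 the immediate reward is worth more, and with γ = 1 the two sequences tie, which recovers additive utility.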

Discounting
• Given:
  o Actions: East, West
  o Terminal states: a and e (end when reach one or the other)
  o Transitions: deterministic
  o Reward for reaching a is 10 (regardless of initial state & action, e.g. r(s, action, a) = 10), reward for reaching e is 1, and the reward for reaching all other states is 0
• Quiz 1: For γ = 1, what is the optimal policy?
• Quiz 2: For γ = 0.1, what is the optimal policy for states b, c and d?
• Quiz 3: For which γ are West and East equally good when in state d?

Quiz: Discounting
• Given:
  o Actions: East, West
  o Terminal states: a and e (end when reach one or the other)
  o Transitions: deterministic
  o Reward for reaching a is 10 (regardless of initial state & action, e.g. r(s, action, a) = 10), reward for reaching e is 1, and the reward for reaching all other states is 0
• Quiz 1: For γ = 1, what is the optimal policy?
  o In all states, go West (towards a)
• Quiz 2: For γ = 0.1, what is the optimal policy?
  o b = W, c = W, d = E
• Quiz 3: For which γ are West and East equally good when in state d?
  o γ = sqrt(1/10)
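The quiz answers can be reproduced numerically. This sketch assumes the five states sit in a row a, b, c, d, e from West to East (inferred from the answers, since the figure is missing) and that the reward collected on the first transition is undiscounted. Function names are illustrative.

```python
import math

STATES = ["a", "b", "c", "d", "e"]  # assumed West-to-East layout

def value_of_direction(state, direction, gamma):
    """Discounted reward for heading straight West (to a) or East (to e)."""
    i = STATES.index(state)
    if direction == "West":
        steps, reward = i, 10.0                    # transitions needed to reach a
    else:
        steps, reward = len(STATES) - 1 - i, 1.0   # transitions needed to reach e
    return gamma ** (steps - 1) * reward

def best_direction(state, gamma):
    west = value_of_direction(state, "West", gamma)
    east = value_of_direction(state, "East", gamma)
    return "West" if west >= east else "East"

for gamma in (1.0, 0.1):
    print(gamma, {s: best_direction(s, gamma) for s in "bcd"})

# Quiz 3: from d, West is worth 10 * gamma**2 and East is worth 1.
gamma_tie = math.sqrt(1 / 10)
print(value_of_direction("d", "West", gamma_tie),
      value_of_direction("d", "East", gamma_tie))
```

Setting 10γ² = 1 gives the tie point γ = sqrt(1/10), matching the answer to Quiz 3.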

Infinite Utilities?!
§ Problem: What if the game lasts forever? Do we get infinite rewards?
§ Solutions:
  § Finite horizon: (similar to depth-limited search)
    § Terminate episodes after a fixed T steps (e.g. life)
    § Gives nonstationary policies (π depends on time left)
  § Discounting: use 0 < γ < 1
    § Smaller γ means smaller "horizon", i.e. shorter term focus
  § Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
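Why discounting keeps utilities finite can be seen from the geometric series. The bound below is a standard worked step; the symbol R_max (a bound on the magnitude of any single reward) is notation introduced here, not taken from the slide.

```latex
\left| \sum_{t=0}^{\infty} \gamma^{t} r_t \right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1 - \gamma},
  \qquad 0 < \gamma < 1,\quad |r_t| \le R_{\max}
```

For example, with γ = 0.9 and rewards of magnitude at most 1, no sequence can be worth more than 10.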

Recap: Defining MDPs
• Markov decision processes:
  o Set of states S
  o Start state s_0
  o Set of actions A
  o Transitions P(s' | s, a) (or T(s, a, s'))
  o Rewards R(s, a, s') (and discount γ)
• MDP quantities so far:
  o Policy = choice of action for each state
  o Utility/Value = sum of (discounted) rewards

Value of a Policy in Each State
• Expected immediate reward for taking the action prescribed by policy π in that state
• Plus the expected future reward obtained after taking that action from that state and then following π
$$V^{\pi}(s) = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$$
• Future reward depends on the horizon (how many more steps the agent gets to act). For now, assume an infinite horizon
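The equation for V^π can be applied repeatedly as an update until the values stop changing. Below is a minimal sketch of that iterative policy evaluation on the racing example. The transition numbers are an assumption (the usual Klein and Abbeel version, since the diagram is garbled in this text), and γ = 0.9, the sweep count, and all names are illustrative choices.

```python
# RACING[state][action] -> list of (probability, next_state, reward); assumed numbers.
RACING = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal
}

def evaluate_policy(mdp, policy, gamma, sweeps=500):
    """Repeatedly apply V(s) <- sum_s' p(s'|s,pi(s)) [R + gamma V(s')]."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        new_values = {}
        for s in mdp:
            if s not in policy:          # terminal state: no future reward
                new_values[s] = 0.0
                continue
            new_values[s] = sum(p * (r + gamma * values[s2])
                                for p, s2, r in mdp[s][policy[s]])
        values = new_values
    return values

always_slow = {"cool": "slow", "warm": "slow"}
v_slow = evaluate_policy(RACING, always_slow, gamma=0.9)
print(v_slow)
```

Under these assumptions, always driving slowly earns +1 per step forever, so both non-terminal values approach 1 / (1 - 0.9) = 10.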

Q: State-Action Value
• Expected immediate reward for taking action a in state s
• Plus the expected future reward obtained after taking that action from that state and then following π
$$Q^{\pi}(s, a) = \sum_{s' \in S} p(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$
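A worked instance may help show how Q^π differs from V^π. The numbers are assumptions for illustration: the racing example with Slow from Cool staying in Cool with reward +1, Fast from Cool giving reward +2 and landing in Cool or Warm with probability 0.5 each, γ = 0.9, and a policy π that always drives Slow, so that V^π(cool) = V^π(warm) = 1/(1 - 0.9) = 10.

```latex
\begin{aligned}
Q^{\pi}(\text{cool}, \text{slow}) &= 1.0\,[\,1 + 0.9 \cdot 10\,] = 10 \\
Q^{\pi}(\text{cool}, \text{fast}) &= 0.5\,[\,2 + 0.9 \cdot 10\,] + 0.5\,[\,2 + 0.9 \cdot 10\,] = 11
\end{aligned}
```

Under these assumptions, taking Fast once and then following π is worth more than following π from the start, which signals that π is not optimal in Cool.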

Optimal Value V* and π*
• Optimal value: highest possible value for each s
• Satisfies the Bellman equation
$$V^{*}(s_i) = \max_{a} \sum_{s_j \in S} p(s_j \mid s_i, a) \left[ R(s_i, a, s_j) + \gamma V^{*}(s_j) \right]$$
• Optimal policy
$$\pi^{*}(s_i) = \arg\max_{a} Q^{*}(s_i, a) = \arg\max_{a} \sum_{s_j \in S} p(s_j \mid s_i, a) \left[ R(s_i, a, s_j) + \gamma V^{*}(s_j) \right]$$
• Want to find these optimal values!

Value Iteration
• Bellman equation inspires an update rule
$$V^{*}(s_i) = \max_{a} \sum_{s_j \in S} p(s_j \mid s_i, a) \left[ R(s_i, a, s_j) + \gamma V^{*}(s_j) \right]$$
$$V_{k}(s_i) = \max_{a} \sum_{s_j \in S} p(s_j \mid s_i, a) \left[ R(s_i, a, s_j) + \gamma V_{k-1}(s_j) \right]$$
• Form of dynamic programming
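The update rule can be sketched in a few lines of Python on the racing example. The transition numbers are an assumption (the usual Klein and Abbeel version, since the diagram is garbled in this text); γ = 0.9, the stopping tolerance, and all function names are illustrative choices rather than part of the slides.

```python
# RACING[state][action] -> list of (probability, next_state, reward); assumed numbers.
RACING = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal
}

def q_value(mdp, values, state, action, gamma):
    """One-step lookahead: sum_s' p(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[state][action])

def value_iteration(mdp, gamma, tol=1e-10, max_iters=10_000):
    """Apply V_k(s) = max_a Q(s, a; V_{k-1}) until the largest change is below tol."""
    values = {s: 0.0 for s in mdp}  # V_0
    for _ in range(max_iters):
        new_values = {
            s: max((q_value(mdp, values, s, a, gamma) for a in mdp[s]), default=0.0)
            for s in mdp
        }
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            break
    return values

def greedy_policy(mdp, values, gamma):
    """Pick argmax_a Q(s, a) in every state that has actions."""
    return {s: max(mdp[s], key=lambda a: q_value(mdp, values, s, a, gamma))
            for s in mdp if mdp[s]}

v_star = value_iteration(RACING, gamma=0.9)
pi_star = greedy_policy(RACING, v_star, gamma=0.9)
print(v_star, pi_star)
```

Each sweep computes V_k from a frozen copy of V_{k-1}, which mirrors the update rule directly. Under the assumed numbers, solving the two Bellman equations by hand gives V*(cool) = 15.5 and V*(warm) = 14.5, with Fast optimal in Cool and Slow optimal in Warm.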
