
Markov Decision Processes

CSE 415: Introduction to Artificial Intelligence, University of Washington, Spring 2017

Presented by S. Tanimoto, University of Washington, based on material by Dan Klein and Pieter Abbeel, University of California.


Outline

  • Grid World Example
  • MDP definition
  • Optimal Policies
  • Auto Racing Example
  • Utilities of Sequences
  • Bellman Updates
  • Value Iteration


Non-Deterministic Search


Example: Grid World

  • A maze-like problem
  • The agent lives in a grid
  • Walls block the agent’s path
  • Noisy movement: actions do not always go as planned
    – 80% of the time, the action North takes the agent North (if there is no wall there)
    – 10% of the time, North takes the agent West; 10% East
    – If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards each time step
    – Small “living” reward each step (can be negative)
    – Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards


Grid World Actions

[Figure: two panels, Deterministic Grid World vs. Stochastic Grid World]


Markov Decision Processes

  • An MDP is defined by:
    – A set of states s ∈ S
    – A set of actions a ∈ A
    – A transition function T(s, a, s’)
      • Probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics

Example transition-table entries:

    T(s11, E, …
    …
    T(s31, N, s11) = 0
    …
    T(s31, N, s32) = 0.8
    T(s31, N, s21) = 0.1
    T(s31, N, s41) = 0.1
    …

T is a big table! 11 × 4 × 11 = 484 entries. For now, we give this as input to the agent.
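To make this concrete, here is a minimal sketch (mine, not from the slides) of storing T as a nested dictionary, using the slide’s grid-cell names:

```python
# Tabular transition function: (state, action) -> {next_state: probability}.
# The entry shown mirrors the slide's examples; the layout itself is an assumption.
T = {
    ("s31", "N"): {"s32": 0.8, "s21": 0.1, "s41": 0.1},
    # ... one entry per (state, action) pair: 11 states x 4 actions
}

def transition_prob(s, a, s_next):
    """Look up P(s' | s, a); unlisted outcomes have probability 0."""
    return T.get((s, a), {}).get(s_next, 0.0)

# Each row is a probability distribution over next states:
assert abs(sum(T[("s31", "N")].values()) - 1.0) < 1e-9
```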


Markov Decision Processes

  • An MDP is defined by:
    – A set of states s ∈ S
    – A set of actions a ∈ A
    – A transition function T(s, a, s’)
      • Probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics
    – A reward function R(s, a, s’)

Example reward-table entries:

    …
    R(s32, N, s33) = -0.01
    …
    R(s32, N, s42) = -1.01
    R(s33, E, s43) = 0.99
    …

The small -0.01 per step is the “cost of breathing.” R is also a big table! For now, we also give this to the agent.


Markov Decision Processes

  • An MDP is defined by:
    – A set of states s ∈ S
    – A set of actions a ∈ A
    – A transition function T(s, a, s’)
      • Probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics
    – A reward function R(s, a, s’)
      • Sometimes just R(s) or R(s’)

Example entries for the simpler form:

    …
    R(s33) = -0.01
    R(s42) = -1.01
    R(s43) = 0.99


Markov Decision Processes

  • An MDP is defined by:
    – A set of states s ∈ S
    – A set of actions a ∈ A
    – A transition function T(s, a, s’)
      • Probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics
    – A reward function R(s, a, s’)
      • Sometimes just R(s) or R(s’)
    – A start state
    – Maybe a terminal state
  • MDPs are non-deterministic search problems
    – One way to solve them is with expectimax search
    – We’ll have a new tool soon


What is Markov about MDPs?

  • “Markov” generally means that given the present state, the future and the past are independent
  • For Markov decision processes, “Markov” means action outcomes depend only on the current state
  • This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856-1922)
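Stated as a formula (the standard form; the slide states it only in words):

```latex
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, \dots, S_0=s_0)
  = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
```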


Policies

[Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s]

  • In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy π*: S → A
    – A policy π gives an action for each state
    – An optimal policy is one that maximizes expected utility if followed
    – An explicit policy defines a reflex agent
  • Expectimax didn’t compute entire policies
    – It computed the action for a single state only


Optimal Policies

[Figure: four optimal policies, for living rewards R(s) = -2.0, -0.4, -0.03, and -0.01]


Example: Racing


Example: Racing

  • A robot car wants to travel far, quickly
  • Three states: Cool, Warm, Overheated
  • Two actions: Slow, Fast
  • Going faster gets double reward

Transition diagram, reconstructed as a table:

State | Action | Next state | Prob. | Reward
Cool  | Slow   | Cool       | 1.0   | +1
Cool  | Fast   | Cool       | 0.5   | +2
Cool  | Fast   | Warm       | 0.5   | +2
Warm  | Slow   | Cool       | 0.5   | +1
Warm  | Slow   | Warm       | 0.5   | +1
Warm  | Fast   | Overheated | 1.0   | -10
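A sketch encoding of this MDP in Python (my own representation, matching the table above), reused in later code examples:

```python
# Racing MDP: {(state, action): [(prob, next_state, reward), ...]}.
# "overheated" is absorbing/terminal, so it has no entries.
RACING_MDP = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
}
STATES = ["cool", "warm", "overheated"]

def actions(state):
    """Actions available in a state; terminal states have none."""
    return [a for (s, a) in RACING_MDP if s == state]
```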


Racing Search Tree


MDP Search Trees

  • Each MDP state projects an expectimax-like search tree

[Diagram: s is a state; (s, a) is a q-state; (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)]


Utilities of Sequences


Utilities of Sequences

  • What preferences should an agent have over reward sequences?
  • More or less? [1, 2, 2] or [2, 3, 4]?
  • Now or later? [0, 0, 1] or [1, 0, 0]?


Discounting

  • It’s reasonable to maximize the sum of rewards
  • It’s also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially

[Figure: a reward is worth 1 now, γ one step later, γ² two steps later]


Discounting

  • How to discount?
    – Each time we descend a level, we multiply in the discount once
  • Why discount?
    – Sooner rewards probably do have higher utility than later rewards
    – Also helps our algorithms converge
  • Example: discount of 0.5
    – U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3 = 2.75
    – U([3, 2, 1]) = 1·3 + 0.5·2 + 0.25·1 = 4.25, so U([1, 2, 3]) < U([3, 2, 1])
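A one-function sketch of this computation (mine, not from the slides):

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# The slide's example with a discount of 0.5:
assert discounted_utility([1, 2, 3], 0.5) == 2.75
assert discounted_utility([3, 2, 1], 0.5) == 4.25
```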


Stationary Preferences

  • Theorem: if we assume stationary preferences:
  • Then: there are only two ways to define utilities
    – Additive utility:
    – Discounted utility:
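The formulas on this slide are images in the original; these are their standard forms:

```latex
% Stationary preferences:
[r, r_1, r_2, \dots] \succ [r, r_1', r_2', \dots]
  \iff [r_1, r_2, \dots] \succ [r_1', r_2', \dots]

% Additive utility:
U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \cdots

% Discounted utility:
U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
```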


Quiz: Discounting

  • Given:
    – Actions: East, West, and Exit (only available in exit states a, e)
    – Transitions: deterministic
  • Quiz 1: For γ = 1, what is the optimal policy?
  • Quiz 2: For γ = 0.1, what is the optimal policy?
  • Quiz 3: For which γ are West and East equally good when in state d?
    – From d, West reaches the 10-reward exit in three steps and East reaches the 1-reward exit in one step: 10·γ³ = 1·γ, so γ² = 1/10 and γ = 1/√10


Infinite Utilities?!

  • Problem: What if the game lasts forever? Do we get infinite rewards?
  • Solutions:
    – Finite horizon: (similar to depth-limited search)
      • Terminate episodes after a fixed T steps (e.g., a lifetime)
      • Gives nonstationary policies (π depends on the time left)
    – Discounting: use 0 < γ < 1
      • Smaller γ means a smaller “horizon” – shorter-term focus
    – Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
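Why discounting works, as a geometric-series bound (standard argument, not spelled out on the slide): if every reward is at most R_max,

```latex
U([r_0, r_1, \dots]) = \sum_{t=0}^{\infty} \gamma^t r_t
  \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma}
```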


Recap: Defining MDPs

  • Markov decision processes:
    – Set of states S
    – Start state s0
    – Set of actions A
    – Transitions P(s’ | s, a) (or T(s, a, s’))
    – Rewards R(s, a, s’) (and discount γ)
  • MDP quantities so far:
    – Policy = choice of action for each state
    – Utility = sum of (discounted) rewards


Solving MDPs

  • Value Iteration
  • Policy Iteration
  • Reinforcement Learning


Optimal Quantities

  • The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally
  • The value (utility) of a q-state (s, a):
    Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  • The optimal policy:
    π*(s) = optimal action from state s

[Diagram: s is a state; (s, a) is a q-state; (s, a, s’) is a transition]


Snapshot of Demo – Gridworld V Values

Noise = 0.2, Discount = 0.9, Living reward = 0


Snapshot of Demo – Gridworld Q Values

Noise = 0.2, Discount = 0.9, Living reward = 0


Values of States

  • Fundamental operation: compute the (expectimax) value of a state
    – Expected utility under optimal action
    – Average sum of (discounted) rewards
    – This is just what expectimax computed!
  • Recursive definition of value (shown below in standard notation):
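The recursive definition, in standard notation (the slide shows it as an image):

```latex
V^*(s)    = \max_a Q^*(s, a)
Q^*(s, a) = \sum_{s'} T(s, a, s') \,\big[\, R(s, a, s') + \gamma V^*(s') \,\big]
V^*(s)    = \max_a \sum_{s'} T(s, a, s') \,\big[\, R(s, a, s') + \gamma V^*(s') \,\big]
```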


Racing Search Tree


Racing Search Tree

  • We’re doing way too much work with expectimax!
  • Problem: States are repeated
    – Idea: Only compute needed quantities once
  • Problem: Tree goes on forever
    – Idea: Do a depth-limited computation, but with increasing depths until change is small
    – Note: deep parts of the tree eventually don’t matter if γ < 1


Time-Limited Values

  • Key idea: time-limited values
  • Define Vk(s) to be the optimal value of s if the game ends in k more time steps
    – Equivalently, it’s what a depth-k expectimax would give from s


Computing Time-Limited Values


Value Iteration


The Bellman Equations

How to be optimal:

Step 1: Take correct first action
Step 2: Keep being optimal


The Bellman Equations

  • Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values
  • These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over


Value Iteration

  • Bellman equations characterize the optimal values (first formula below)
  • Value iteration computes them (second formula below)
  • Value iteration is just a fixed-point solution method
    – … though the Vk vectors are also interpretable as time-limited values
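Both formulas are images in the original slides; in standard notation:

```latex
% Bellman optimality equation:
V^*(s) = \max_a \sum_{s'} T(s, a, s') \,\big[\, R(s, a, s') + \gamma V^*(s') \,\big]

% Value-iteration update:
V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \,\big[\, R(s, a, s') + \gamma V_k(s') \,\big]
```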


Value Iteration Algorithm

  • Start with V0(s) = 0: no time steps left means an expected reward sum of zero
  • Given a vector of Vk(s) values, do one ply of expectimax from each state (the update shown above)
  • Repeat until convergence
  • Complexity of each iteration: O(S²A)
    – Number of iterations: poly(|S|, |A|, 1/(1−γ))
  • Theorem: will converge to unique optimal values
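A runnable sketch in Python (mine, not from the slides), using the RACING_MDP encoding from the racing example:

```python
def value_iteration(mdp, states, gamma=0.9, tol=1e-6):
    """Repeat Bellman updates until the values stop changing (within tol)."""
    V = {s: 0.0 for s in states}                 # V_0(s) = 0 for all s
    while True:
        V_next = {}
        for s in states:
            acts = [a for (st, a) in mdp if st == s]
            if not acts:                          # terminal state
                V_next[s] = 0.0
                continue
            # One ply of expectimax: max over actions of expected reward-to-go.
            V_next[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                for a in acts
            )
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next

# Converges to approximately {'cool': 15.5, 'warm': 14.5, 'overheated': 0.0}.
V_star = value_iteration(RACING_MDP, STATES)
```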

[Demo: gridworld value-iteration snapshots for k = 0 through 12 and k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]


Convergence*

  • How do we know the Vk vectors will converge?
  • Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
  • Case 2: If the discount is less than 1
    – Sketch: For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
    – The max difference happens if there is a big reward at the (k+1)-th level
    – That last layer is at best all RMAX
    – But everything there is discounted by γ^k
    – So Vk and Vk+1 differ by at most γ^k · max|R|
    – So as k increases, the values converge


Computing Actions from Values

  • Let’s imagine we have the optimal values V*(s)
  • How should we act?
    – It’s not obvious!
  • We need to do a mini-expectimax (one step)
  • This is called policy extraction, since it gets the policy implied by the values
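The one-step lookahead is π*(s) = argmax_a Σ_{s’} T(s, a, s’)[R(s, a, s’) + γ V*(s’)]. A sketch in Python (mine), reusing the earlier encoding:

```python
def extract_policy(mdp, states, V, gamma=0.9):
    """One step of expectimax against converged values V."""
    policy = {}
    for s in states:
        acts = [a for (st, a) in mdp if st == s]
        if not acts:
            continue                              # terminal state: no action
        policy[s] = max(
            acts,
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)]),
        )
    return policy

print(extract_policy(RACING_MDP, STATES, V_star))  # {'cool': 'fast', 'warm': 'slow'}
```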


Computing Actions from Q-Values

  • Let’s imagine we have the optimal q-values Q*(s, a)
  • How should we act?
    – Completely trivial to decide: π*(s) = argmax_a Q*(s, a)
  • Important lesson: actions are easier to select from q-values than from values!


Problems with Value Iteration

  • Value iteration repeats the Bellman updates:
  • Problem 1: It’s slow – O(S²A) per iteration
  • Problem 2: The “max” at each state rarely changes
  • Problem 3: The policy often converges long before the values

VI → Asynchronous VI

  • Is it essential to back up all states in each iteration?
    – No!
  • States may be backed up
    – many times or not at all
    – in any order
  • As long as no state gets starved…
    – convergence properties still hold!!

[Demo: asynchronous value-iteration snapshots for k = 1, 2, 3; Noise = 0.2, Discount = 0.9, Living reward = 0]


Asynch VI: Prioritized Sweeping

  • Why back up a state if the values of its successors are unchanged?
  • Prefer backing up a state
    – whose successors had the most change
  • Maintain a priority queue of (state, expected change in value)
  • Back up states in priority order
  • After backing up a state, update the priority queue
    – for all of its predecessors

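A sketch of the idea in Python (mine; the lecture gives no pseudocode). It keys a max-priority queue on how much a Bellman backup would change each state's value and re-scores predecessors after every backup; duplicate queue entries are tolerated for simplicity:

```python
import heapq

def bellman_backup(mdp, V, s, gamma):
    """One Bellman update at s; terminal states keep value 0."""
    acts = [a for (st, a) in mdp if st == s]
    if not acts:
        return 0.0
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
               for a in acts)

def predecessors(mdp):
    """Map each state to the set of states that can transition into it."""
    preds = {}
    for (s, _a), outcomes in mdp.items():
        for _p, s2, _r in outcomes:
            preds.setdefault(s2, set()).add(s)
    return preds

def prioritized_sweeping_vi(mdp, states, gamma=0.9, tol=1e-6):
    """Back up states in order of expected value change, not in fixed sweeps."""
    V = {s: 0.0 for s in states}
    preds = predecessors(mdp)
    # heapq is a min-heap, so negate priorities to pop the largest change first.
    pq = [(-abs(bellman_backup(mdp, V, s, gamma) - V[s]), s) for s in states]
    heapq.heapify(pq)
    while pq:
        neg_change, s = heapq.heappop(pq)
        if -neg_change < tol:
            break                                  # nothing left worth updating
        V[s] = bellman_backup(mdp, V, s, gamma)
        for p in preds.get(s, ()):                 # s changed, so re-score its predecessors
            change = abs(bellman_backup(mdp, V, p, gamma) - V[p])
            heapq.heappush(pq, (-change, p))
    return V

print(prioritized_sweeping_vi(RACING_MDP, STATES))
```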