Making Complex Decisions (Chapter 17) - PowerPoint PPT Presentation


  1. Making Complex Decisions (Chapter 17)

  2. Outline: sequential decision problems; the value iteration algorithm; the policy iteration algorithm

  3. A simple environment [figure: a 4x3 grid world with start state S, terminal states with rewards +1 and −1, and move outcome probabilities p=0.8 (intended direction) and p=0.1 for each perpendicular direction]

  4. A simple environment (cont'd): The agent has to make a series of decisions (or, alternatively, it has to know what to do in each of the 11 possible states). The move action can fail. Each state has a "reward".

  5. What is different? Uncertainty; rewards for states (not just good/bad states); a series of decisions (not just one).

  6. Issues: How to represent the environment? How to automate the decision-making process? How to make useful simplifying assumptions?

  7. Markov decision process (MDP): a specification of a sequential decision problem for a fully observable environment. It has three components: S_0, the initial state; T(s, a, s′), the transition model; and R(s), the reward function. The rewards are additive.
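To make the components concrete, here is a minimal sketch (not from the slides) of how such an MDP could be represented in Python; the class and field names are illustrative assumptions.

    # Minimal sketch of an MDP container (illustrative names, not the slides' notation).
    # T[s][a] is a list of (probability, next_state) pairs; R[s] is the reward of state s.
    from typing import Dict, Hashable, List, Tuple

    State = Hashable
    Action = Hashable

    class MDP:
        def __init__(self,
                     states: List[State],
                     actions: Dict[State, List[Action]],  # actions available in each state
                     T: Dict[State, Dict[Action, List[Tuple[float, State]]]],
                     R: Dict[State, float],
                     gamma: float = 1.0):                  # discount factor (1.0 = additive rewards)
            self.states = states
            self.actions = actions
            self.T = T
            self.R = R
            self.gamma = gamma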

  8. Transition model: a specification of outcome probabilities for each state and action pair. T(s, a, s′) denotes the probability of ending up in state s′ if action a is applied in state s. The transitions are Markovian: T(s, a, s′) depends only on s, not on the history of earlier states.
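For the grid world of slide 3, the transition model can be written directly. The sketch below assumes the usual reading of the figure (the move succeeds with probability 0.8 and slips to each perpendicular direction with probability 0.1) and an assumed 4x3 layout with a blocked cell at (2, 2); the helper names are hypothetical, and the result could be used to fill the T dictionary of the MDP sketch above.

    # Illustrative transition model for the 4x3 grid world (assumed layout).
    LEFT_OF   = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
    RIGHT_OF  = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}
    STEP      = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    WALLS     = {(2, 2)}
    TERMINALS = {(4, 3), (4, 2)}          # the +1 and -1 states in the figure

    def move(s, direction):
        """Deterministic effect of a move; bumping into a wall or the edge stays put."""
        x, y = s
        dx, dy = STEP[direction]
        s2 = (x + dx, y + dy)
        if s2 in WALLS or not (1 <= s2[0] <= 4 and 1 <= s2[1] <= 3):
            return s
        return s2

    def transitions(s, a):
        """T(s, a, .) as a list of (probability, next_state) pairs."""
        if s in TERMINALS:
            return []                             # no moves out of a terminal state
        return [(0.8, move(s, a)),                # intended direction
                (0.1, move(s, LEFT_OF[a])),       # slips 90 degrees one way
                (0.1, move(s, RIGHT_OF[a]))]      # slips 90 degrees the other way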

  9. Utility function: a specification of the agent's preferences. The utility function depends on a sequence of states (this is a sequential decision problem, even though the transitions are Markovian). There is a finite negative or positive reward for each state, given by R(s).

  10. Policy: a specification of a solution for an MDP. It says what to do in any state the agent might reach; π(s) denotes the action recommended by the policy π for state s. The quality of a policy is measured by the expected utility of the possible environment histories generated by that policy. π* denotes the optimal policy. An agent with a complete policy is a reflex agent.
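Concretely, a policy for the grid world is just a mapping from non-terminal states to actions; a small sketch with hypothetical action choices, using the coordinates and action names from the earlier sketches:

    # A policy as a plain mapping from non-terminal states to actions (illustrative values).
    policy = {
        (1, 1): 'N', (1, 2): 'N', (1, 3): 'E',
        (2, 1): 'W', (2, 3): 'E',
        (3, 1): 'W', (3, 2): 'N', (3, 3): 'E',
        (4, 1): 'W',
        # terminal states (4, 2) and (4, 3) need no action
    }
    print(policy[(1, 1)])   # the action recommended in the start state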

  11. Optimal policy when R(s) = −0.04 [figure: the 4x3 grid with the optimal action shown in each non-terminal state]

  12. Optimal policy when R(s) < −1.6284 [figure: the 4x3 grid with the optimal action shown in each non-terminal state]

  13. Optimal policy when −0.4278 < R(s) < −0.0850 [figure: the 4x3 grid with the optimal action shown in each non-terminal state]

  14. Optimal policy when −0.0221 < R(s) < 0 [figure: the 4x3 grid with the optimal action shown in each non-terminal state]

  15. Optimal policy when R(s) > 0 [figure: the 4x3 grid; states marked with * admit any action, since with positive rewards the agent has no incentive to reach a terminal state]

  16. Finite vs. infinite horizon: A finite horizon means that there is a fixed time N after which nothing matters (the game is over): $U_h([s_0, s_1, \ldots, s_{N+k}]) = U_h([s_0, s_1, \ldots, s_N])$ for all $k \geq 0$. The optimal policy for a finite horizon is nonstationary, i.e., it can change over time. An infinite horizon means that there is no deadline, so there is no reason to behave differently in the same state at different times, i.e., the optimal policy is stationary; this case is easier than the nonstationary one.

  17. Stationary preferences: the agent's preferences between state sequences do not depend on time. If two state sequences $[s_0, s_1, s_2, \ldots]$ and $[s'_0, s'_1, s'_2, \ldots]$ begin with the same state (i.e., $s_0 = s'_0$), then the two sequences should be preference-ordered the same way as the sequences $[s_1, s_2, \ldots]$ and $[s'_1, s'_2, \ldots]$.
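Although the slides do not state it explicitly, stationary preferences leave essentially two ways to assign utilities to state sequences, both of which are used later (γ is the discount factor that appears in the algorithms):

    % Additive rewards (the special case gamma = 1):
    U_h([s_0, s_1, s_2, \ldots]) = R(s_0) + R(s_1) + R(s_2) + \cdots
    % Discounted rewards, with discount factor 0 < \gamma \le 1:
    U_h([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots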

  18. Algorithms to solve MDPs. Value iteration: initialize the value of each state to its immediate reward; iterate to calculate values that account for sequential rewards; for each state, select the action with the maximum expected utility. Policy iteration: get an initial policy; evaluate the policy to find the utility of each state; modify the policy by selecting actions that increase the utility of a state; if any changes occurred, go back to the evaluation step.

  19. Value Iteration Algorithm
      function VALUE-ITERATION(mdp, ε) returns a utility function
        inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
                ε, the maximum error allowed in the utility of a state
        local variables: U, U′, vectors of utilities for states in S, initially zero
                         δ, the maximum change in the utility of any state in an iteration
        repeat
          U ← U′; δ ← 0
          for each state s in S do
            U′[s] ← R[s] + γ max_a Σ_{s′} T(s, a, s′) U[s′]
            if |U′[s] − U[s]| > δ then δ ← |U′[s] − U[s]|
        until δ < ε(1 − γ)/γ
        return U
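A direct Python rendering of this pseudocode, as a rough sketch: it assumes the MDP container from the earlier example (mdp.T[s][a] yields (probability, next-state) pairs) and, like the pseudocode's stopping test, a discount factor strictly between 0 and 1.

    def value_iteration(mdp, epsilon):
        """Sketch of value iteration; assumes 0 < mdp.gamma < 1 for the stopping test."""
        U1 = {s: 0.0 for s in mdp.states}       # U', the utilities being computed
        gamma = mdp.gamma
        while True:
            U = U1.copy()
            delta = 0.0
            for s in mdp.states:
                if mdp.actions[s]:
                    # Bellman update: reward plus discounted best expected utility
                    best = max(sum(p * U[s2] for p, s2 in mdp.T[s][a])
                               for a in mdp.actions[s])
                else:
                    best = 0.0                  # terminal state: no actions available
                U1[s] = mdp.R[s] + gamma * best
                delta = max(delta, abs(U1[s] - U[s]))
            if delta < epsilon * (1 - gamma) / gamma:
                return U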

  20. State utilities with γ = 1 and R(s) = −0.04
      row 3:   0.812   0.868   0.918    +1
      row 2:   0.762   (wall)  0.660    −1
      row 1:   0.705   0.655   0.611   0.388
               col 1   col 2   col 3   col 4

  21. Optimal policy using value iteration: to find the optimal policy, choose the action that maximizes the expected utility of the subsequent state: $\pi^*(s) = \operatorname{argmax}_a \sum_{s'} T(s, a, s')\, U(s')$
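In code, this extraction step might look like the following sketch (same assumed MDP container and utility dictionary as in the value iteration sketch):

    def best_policy(mdp, U):
        """Extract the maximum-expected-utility policy from a utility table U (sketch)."""
        pi = {}
        for s in mdp.states:
            if not mdp.actions[s]:
                continue                        # terminal state: nothing to choose
            pi[s] = max(mdp.actions[s],
                        key=lambda a: sum(p * U[s2] for p, s2 in mdp.T[s][a]))
        return pi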

  22. Properties of value iteration: the value iteration algorithm can be thought of as propagating information through the state space by means of local updates. It converges to the correct utilities. We can bound the error in the utility estimates if we stop after a finite number of iterations, and we can bound the policy loss that results from executing the corresponding MEU policy.
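The error bound behind the algorithm's stopping test comes from the fact that the Bellman update is a contraction by a factor of γ in the max norm; stated here for reference (standard result, not spelled out on the slides):

    \|U_{i+1} - U\|_{\infty} \le \gamma \, \|U_i - U\|_{\infty},
    \quad\text{and if }\ \|U_{i+1} - U_i\|_{\infty} < \epsilon(1-\gamma)/\gamma
    \ \text{ then }\ \|U_{i+1} - U\|_{\infty} < \epsilon .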

  23. More on value iteration: the value iteration algorithm we looked at solves the standard Bellman equations using Bellman updates.
      Bellman equation: $U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$
      Bellman update: $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U_i(s')$

  24. More on value iteration (cont'd): If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium, in which case the final utility values must be solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy is optimal.

  25. Policy iteration: with the Bellman equations, we either need to solve a nonlinear set of equations or use an iterative method. Policy iteration starts with an initial policy and performs iterations of evaluation and improvement on it.

  26. Policy Iteration Algorithm
      function POLICY-ITERATION(mdp) returns a policy
        inputs: mdp, an MDP with states S, transition model T
        local variables: U, a vector of utilities for states in S, initially zero
                         π, a policy vector indexed by state, initially random
        repeat
          U ← POLICY-EVALUATION(π, U, mdp)
          unchanged? ← true
          for each state s in S do
            if max_a Σ_{s′} T(s, a, s′) U[s′] > Σ_{s′} T(s, π(s), s′) U[s′] then
              π(s) ← argmax_a Σ_{s′} T(s, a, s′) U[s′]
              unchanged? ← false
        until unchanged?
        return π
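A Python sketch of the same loop, again under the MDP container assumed earlier; POLICY-EVALUATION is approximated here by a fixed number of simplified Bellman sweeps (the exact linear-algebra version appears after slide 28), and all names are illustrative.

    import random

    def expected_utility(mdp, U, s, a):
        """Expected utility of doing action a in state s, given utilities U."""
        return sum(p * U[s2] for p, s2 in mdp.T[s][a])

    def policy_evaluation(pi, U, mdp, k=20):
        """Approximate evaluation: k sweeps of the simplified Bellman update."""
        for _ in range(k):
            for s in mdp.states:
                cont = expected_utility(mdp, U, s, pi[s]) if s in pi else 0.0
                U[s] = mdp.R[s] + mdp.gamma * cont
        return U

    def policy_iteration(mdp):
        """Sketch of policy iteration (illustrative, not the slides' exact code)."""
        U = {s: 0.0 for s in mdp.states}
        pi = {s: random.choice(mdp.actions[s]) for s in mdp.states if mdp.actions[s]}
        while True:
            U = policy_evaluation(pi, U, mdp)
            unchanged = True
            for s in pi:
                best_a = max(mdp.actions[s], key=lambda a: expected_utility(mdp, U, s, a))
                if expected_utility(mdp, U, s, best_a) > expected_utility(mdp, U, s, pi[s]):
                    pi[s] = best_a
                    unchanged = False
            if unchanged:
                return pi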

  27. Properties of Policy Iteration: implementing the POLICY-EVALUATION routine is simpler than solving the standard Bellman equations because the action in each state is fixed by the policy. The simplified Bellman equation is $U_i(s) = R(s) + \gamma \sum_{s'} T(s, \pi_i(s), s')\, U_i(s')$

  28. Properties of Policy Iteration (cont'd): the simplified set of Bellman equations is linear (n equations with n unknowns can be solved in O(n³) time). If n³ is prohibitive, we can use modified policy iteration, which applies the simplified Bellman update k times: $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} T(s, \pi_i(s), s')\, U_i(s')$
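For the exact O(n³) route, POLICY-EVALUATION amounts to solving the linear system $(I - \gamma T_\pi)U = R$. Below is a minimal sketch using NumPy (an added dependency, not mentioned on the slides), again with the assumed MDP container; it presumes γ < 1 or a policy under which the system is nonsingular.

    import numpy as np

    def exact_policy_evaluation(pi, mdp):
        """Solve U = R + gamma * T_pi U exactly, i.e. (I - gamma * T_pi) U = R (sketch)."""
        states = list(mdp.states)
        index = {s: i for i, s in enumerate(states)}
        n = len(states)
        T_pi = np.zeros((n, n))
        for s in states:
            if s in pi:                                  # terminal states have no action
                for p, s2 in mdp.T[s][pi[s]]:
                    T_pi[index[s], index[s2]] += p
        R = np.array([mdp.R[s] for s in states])
        U = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R)
        return {s: U[index[s]] for s in states}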

  29. Issues revisited (and summary): How to represent the environment? (transition model) How to automate the decision-making process? (policy iteration and value iteration; one can also use asynchronous policy iteration and work on a subset of states) How to make useful simplifying assumptions? (full observability, stationary policy, infinite horizon, etc.)
