Complex decisions: Chapter 17, Sections 1–3 (PowerPoint PPT presentation)


  1. Complex decisions (Chapter 17, Sections 1–3)

  2. Outline ♦ Sequential decision problems ♦ Value iteration ♦ Policy iteration

  3. Sequential decision problems. [Diagram relating problem classes: search and planning use explicit actions and subgoals; adding uncertainty and utility gives Markov decision problems (MDPs) and decision-theoretic planning; adding uncertain sensing (belief states) gives partially observable MDPs (POMDPs).]

  4. Example MDP: the 4×3 grid world, with a +1 terminal state at (4,3), a −1 terminal state at (4,2), and the agent starting at START = (1,1). Each action moves in the intended direction with probability 0.8 and slips perpendicular with probability 0.1 each way. States s ∈ S, actions a ∈ A. Model T(s, a, s′) ≡ P(s′ | s, a) = probability that a in s leads to s′. Reward function R(s) (or R(s, a), R(s, a, s′)): R(s) = −0.04 (small penalty) for nonterminal states and ±1 for terminal states.
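
This grid world can be written down directly as data plus a transition and a reward function. Below is a minimal Python sketch; all names (STATES, ACTIONS, T, R, actions) are mine, not from the slides, and it assumes the standard AIMA 4×3 layout with a blocked square at (2,2).

```python
# Minimal sketch of the 4x3 grid-world MDP described on the slide.
GRID_W, GRID_H = 4, 3
WALL = (2, 2)                                # assumed blocked square
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}     # terminal rewards
STEP_REWARD = -0.04                          # penalty for nonterminal states

STATES = [(x, y) for x in range(1, GRID_W + 1)
                 for y in range(1, GRID_H + 1) if (x, y) != WALL]

MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
# Intended direction with prob. 0.8, perpendicular slips with prob. 0.1 each.
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def move(s, direction):
    # Deterministic effect of moving; bumping into a wall or the edge stays put.
    dx, dy = MOVES[direction]
    s2 = (s[0] + dx, s[1] + dy)
    return s2 if s2 in STATES else s

def actions(s):
    # Terminal states have no further actions.
    return [] if s in TERMINALS else list(MOVES)

def T(s, a, s2):
    # P(s' | s, a): probabilities add when intended and slipped moves coincide.
    if s in TERMINALS:
        return 0.0
    p = 0.8 if move(s, a) == s2 else 0.0
    for side in PERPENDICULAR[a]:
        if move(s, side) == s2:
            p += 0.1
    return p

def R(s):
    return TERMINALS.get(s, STEP_REWARD)
```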

  5. Solving MDPs. In search problems, the aim is to find an optimal sequence. In MDPs, the aim is to find an optimal policy π(s), i.e., the best action for every possible state s (because one can't predict where one will end up). The optimal policy maximizes (say) the expected sum of rewards. [Figure: the optimal policy for the 4×3 grid when the state penalty R(s) is −0.04.]

  6. Risk and reward. [Figure: optimal policies for four ranges of the nonterminal reward r: r ∈ (−∞, −1.6284], r ∈ [−0.4278, −0.0850], r ∈ [−0.0480, −0.0274], and r ∈ [−0.0218, 0.0000].]

  7. Utility of state sequences. Need to understand preferences between sequences of states. Typically consider stationary preferences on reward sequences: [r, r0, r1, r2, …] ≻ [r, r0′, r1′, r2′, …] ⇔ [r0, r1, r2, …] ≻ [r0′, r1′, r2′, …]. Theorem: there are only two ways to combine rewards over time. 1) Additive utility function: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + ···. 2) Discounted utility function: U([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + ···, where γ is the discount factor.
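
As a tiny illustration of the discounted utility function, here is a one-line sketch (the function name is mine); with γ = 1 it reduces to the additive case.

```python
def discounted_utility(rewards, gamma=1.0):
    # U([s0, s1, s2, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. discounted_utility([-0.04, -0.04, 1.0], gamma=0.9)
#      = -0.04 + 0.9*(-0.04) + 0.81*1.0 = 0.734
```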

  8. Utility of states. The utility of a state (a.k.a. its value) is defined to be U(s) = expected (discounted) sum of rewards (until termination) assuming optimal actions. Given the utilities of the states, choosing the best action is just MEU: maximize the expected utility of the immediate successors. [Figure: utilities of the 4×3 grid states (values between 0.388 and 0.912) alongside the corresponding best actions.]
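
A sketch of the one-step MEU choice, assuming a table of utilities U and the T and actions helpers from the grid-world sketch above (the function name best_action is mine):

```python
def best_action(s, U, states, actions, T):
    # MEU: pick the action maximizing the expected utility of the
    # immediate successors, sum_{s'} T(s, a, s') * U[s'].
    return max(actions(s),
               key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
```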

  9. Utilities contd. Problem: infinite lifetimes ⇒ additive utilities are infinite. 1) Finite horizon: termination at a fixed time T ⇒ nonstationary policy: π(s) depends on the time left. 2) Absorbing state(s): with probability 1, the agent eventually "dies" for any π ⇒ the expected utility of every state is finite. 3) Discounting: assuming γ < 1 and R(s) ≤ Rmax, U([s0, …, s∞]) = Σ_{t=0}^{∞} γ^t R(st) ≤ Rmax/(1 − γ). Smaller γ ⇒ shorter horizon. 4) Maximize system gain = average reward per time step. Theorem: the optimal policy has constant gain after an initial transient, e.g., the taxi driver's daily scheme of cruising for passengers.
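
The bound in item 3 is just a geometric series; spelled out:

```latex
U([s_0, s_1, \ldots]) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma},
  \qquad \text{for } \gamma < 1,\; R(s) \le R_{\max}.
```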

  10. Dynamic programming: the Bellman equation. The definition of the utility of states leads to a simple relationship among the utilities of neighboring states: expected sum of rewards = current reward + γ × expected sum of rewards after taking the best action. Bellman equation (1957):
      U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)
  For example, for state (1,1) in the 4×3 world:
      U(1,1) = −0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (up)
                              0.9 U(1,1) + 0.1 U(1,2),                (left)
                              0.9 U(1,1) + 0.1 U(2,1),                (down)
                              0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (right)
  One equation per state = n nonlinear equations in n unknowns.
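
A sketch of the right-hand side of the Bellman equation as a function (names are mine; states, actions, T, R as in the grid-world sketch above):

```python
def bellman_backup(s, U, states, actions, T, R, gamma):
    # R(s) + gamma * max_a sum_{s'} T(s, a, s') * U[s'].
    # Terminal states have no actions, so their utility is just R(s).
    acts = actions(s)
    if not acts:
        return R(s)
    return R(s) + gamma * max(
        sum(T(s, a, s2) * U[s2] for s2 in states) for a in acts)
```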

  11. Value iteration algorithm. Idea: start with arbitrary utility values, update to make them locally consistent with the Bellman equation; everywhere locally consistent ⇒ global optimality. Repeat for every s simultaneously until "no change":
      U(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)   for all s
  [Plot: utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), and (4,2) against the number of iterations (0 to 30).]
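
A minimal value iteration sketch following the update above (names are mine; "no change" is approximated by stopping once the largest change in any utility falls below a small threshold eps):

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    U = {s: 0.0 for s in states}              # arbitrary initial utilities
    while True:
        delta, U_new = 0.0, {}
        for s in states:
            acts = actions(s)
            # Bellman update; terminal states keep U(s) = R(s).
            U_new[s] = R(s) + (gamma * max(sum(T(s, a, s2) * U[s2] for s2 in states)
                                           for a in acts)
                               if acts else 0.0)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                       # "no change" up to eps
            return U

# Hypothetical usage with the grid-world sketch above:
# U = value_iteration(STATES, actions, T, R, gamma=1.0)
```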

  12. Convergence. Define the max-norm ||U|| = max_s |U(s)|, so ||U − V|| = maximum difference between U and V. Let U^t and U^{t+1} be successive approximations to the true utility U. Theorem: for any two approximations U^t and V^t, ||U^{t+1} − V^{t+1}|| ≤ γ ||U^t − V^t||. I.e., any distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution. Theorem: if ||U^{t+1} − U^t|| < ε, then ||U^{t+1} − U|| < 2εγ/(1 − γ). I.e., once the change in U^t becomes small, we are almost done. The MEU policy using U^t may be optimal long before convergence of values.
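
A small sketch of the second theorem's error bound (function names are mine; it requires γ < 1):

```python
def max_norm_dist(U, V):
    # ||U - V|| = max_s |U(s) - V(s)|
    return max(abs(U[s] - V[s]) for s in U)

def error_bound(U_prev, U_curr, gamma):
    # If successive iterates differ by eps = ||U^{t+1} - U^t||, the theorem on
    # the slide bounds the distance to the true U by 2*eps*gamma / (1 - gamma).
    eps = max_norm_dist(U_prev, U_curr)
    return 2 * eps * gamma / (1 - gamma)
```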

  13. Policy iteration (Howard, 1960): search for the optimal policy and utility values simultaneously.
  Algorithm:
      π ← an arbitrary initial policy
      repeat until no change in π:
          compute utilities given π
          update π as if the utilities were correct (i.e., local MEU)
  To compute utilities given a fixed π (value determination):
      U(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) U(s′)   for all s
  i.e., n simultaneous linear equations in n unknowns, solve in O(n³).
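
A sketch of policy iteration with exact value determination via a linear solve (names are mine; numpy is assumed, and states, actions, T, R are as in the grid-world sketch, with terminal states given no action):

```python
import numpy as np

def policy_iteration(states, actions, T, R, gamma):
    states = list(states)
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: (actions(s)[0] if actions(s) else None) for s in states}  # arbitrary start
    while True:
        # Value determination: solve U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')
        # as the linear system (I - gamma * T_pi) U = R. Nonsingular for gamma < 1
        # (or, with gamma = 1, for a policy that reaches a terminal state w.p. 1).
        A = np.eye(len(states))
        b = np.array([R(s) for s in states], dtype=float)
        for s in states:
            if pi[s] is None:
                continue                      # terminal: U(s) = R(s)
            for s2 in states:
                A[idx[s], idx[s2]] -= gamma * T(s, pi[s], s2)
        U = dict(zip(states, np.linalg.solve(A, b)))
        # Policy improvement: local MEU with the computed utilities.
        changed = False
        for s in states:
            if pi[s] is None:
                continue
            best = max(actions(s),
                       key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:
            return pi, U
```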

  14. Modified policy iteration. Policy iteration often converges in few iterations, but each one is expensive. Idea: use a few steps of value iteration (but with π fixed), starting from the value function produced the last time, as an approximate value determination step. This often converges much faster than pure VI or PI, and it leads to much more general algorithms in which Bellman value updates and Howard policy updates can be performed locally in any order. Reinforcement learning algorithms operate by performing such updates based on the observed transitions made in an initially unknown environment. See the sketch below.
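
A sketch of the approximate value determination step: k sweeps of the simplified Bellman update with π held fixed, starting from the utilities produced on the previous iteration (the names and the choice k=20 are mine):

```python
def approximate_policy_evaluation(pi, U, states, T, R, gamma, k=20):
    # k simplified Bellman sweeps with the policy fixed:
    #   U(s) <- R(s) + gamma * sum_{s'} T(s, pi(s), s') * U(s')
    for _ in range(k):
        U = {s: (R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in states))
                if pi[s] is not None else R(s)
             for s in states}
    return U
```

Substituting this for the exact linear solve in the policy iteration sketch above gives a modified policy iteration along the lines the slide describes.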

  15. Partial observability. A POMDP has an observation model O(s, e) defining the probability that the agent obtains evidence e when in state s. The agent does not know which state it is in ⇒ it makes no sense to talk about a policy π(s)! Theorem (Astrom, 1965): the optimal policy in a POMDP is a function π(b), where b is the belief state (a probability distribution over states). A POMDP can be converted into an MDP in belief-state space, where T(b, a, b′) is the probability that the new belief state is b′ given that the current belief state is b and the agent does a; this is essentially a filtering update step.
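
A sketch of that filtering update (names are mine; b is represented as a dict mapping states to probabilities, and O(s, e) is the observation model described above):

```python
def belief_update(b, a, e, states, T, O):
    # b'(s') is proportional to O(s', e) * sum_s T(s, a, s') * b(s).
    unnormalized = {s2: O(s2, e) * sum(T(s, a, s2) * b[s] for s in states)
                    for s2 in states}
    z = sum(unnormalized.values())
    return {s2: (p / z if z > 0 else 0.0) for s2, p in unnormalized.items()}
```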

  16. Partial observability contd. Solutions automatically include information-gathering behavior. If there are n states, b is an n-dimensional real-valued vector ⇒ solving POMDPs is very (actually, PSPACE-) hard! The real world is a POMDP (with initially unknown T and O).
