SLIDE 1 Topics in Computational Sustainability
CS 325
Spring 2016
Making Choices: Sequential Decision Making
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
SLIDE 2 Stochastic programming
[Figure: decision tree with a decision node choosing among actions a, b, c, feeding a probabilistic model with outcomes wet/dry. A stochastic program {a(pa), b(pb), c(pc)} maximizes expected utility / minimizes expected cost.]
SLIDE 3 Problem Setup
Given limited budget, what parcels should I conserve to maximize the expected number of occupied territories in 50 years?
! "# $%# "
Conserved parcels Available parcels Current territories Potential territories
SLIDE 4 Metapopulation = Cascade
- Metapopulation model can be viewed as a cascade in the layered graph representing territories over time
- Target nodes: territories at the final time step
[Figure: patches i, j, k, l, m repeated across the layers of the time-layered graph.]
SLIDE 5 Management Actions
- Conserving parcels adds nodes to the network to create new pathways for the cascade
[Figure: initial network, with candidate Parcel 1 and Parcel 2 that can be added.]
SLIDE 8 Cascade Optimization Problem
Given:
– Initially occupied territories
– Colonization and extinction probabilities
– Already-conserved parcels
– List of available parcels and their costs
Find a set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T.
Can we make our decision adaptively?
SLIDE 9
Sequential decision making
- We have a system that changes state over time
- We can (partially) control the system's state transitions by taking actions
- The problem gives an objective that specifies which states (or state sequences) are more/less preferred
- Problem: at each time step, select an action to optimize the overall (long-term) objective
– Produce the most preferred sequences of "states"
SLIDE 10 Discounted Rewards/Costs
An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + … = Infinity
What's wrong with this argument?
SLIDE 11 Discounted Rewards
“A reward (payment) in the future is not worth quite as much as a reward now.”
– Because of chance of obliteration – Because of inflation
Example:
Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)ⁿ of a payment now, what is the A.P.'s future discounted sum of rewards?
SLIDE 12
Infinite Sum
Assuming a discount rate of 0.9, how much does the assistant professor get in total?
x = 20 + 0.9·20 + 0.9²·20 + 0.9³·20 + …
  = 20 + 0.9 (20 + 0.9·20 + 0.9²·20 + …)
x = 20 + 0.9x
x = 20/0.1 = 200
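A quick numeric check of that closed form (a minimal sketch; only the 20 and the 0.9 come from the slide):

```python
# Sum the discounted reward stream 20 + 0.9*20 + 0.9^2*20 + ...
reward, gamma = 20.0, 0.9

total = 0.0
for n in range(200):          # 200 terms is far past convergence here
    total += (gamma ** n) * reward

print(total)                  # ~= 200.0, i.e. 20 / (1 - 0.9)
```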
SLIDE 13
Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now)
+ γ (reward in 1 time step)
+ γ² (reward in 2 time steps)
+ γ³ (reward in 3 time steps)
+ …
(an infinite sum)
SLIDE 14 Markov System: the Academic Life
Define:
JA = Expected discounted future rewards starting in state A
JB = Expected discounted future rewards starting in state B
JT = Expected discounted future rewards starting in state T
JS = Expected discounted future rewards starting in state S
JD = Expected discounted future rewards starting in state D
How do we compute JA, JB, JT, JS, JD?
[Figure: Markov chain with states A (Assistant Prof, reward 20), B (Assoc. Prof, reward 60), T (Tenured Prof, reward 400), S (On the Street, reward 10), and D (Dead); edges carry transition probabilities 0.2, 0.3, 0.6, and 0.7.]
SLIDE 15 Working Backwards
[Figure: the chain unrolled for working backwards, with rewards Asst. Prof 20, Assoc. Prof 60, Tenured Prof 100, On the Street 10, Dead absorbing with probability 1.0, and discount factor 0.9. Working backwards gives the values 151 (Asst. Prof), 247 (Assoc. Prof), 270 (Tenured Prof), and 27 (Street).]
SLIDE 16 Reincarnation?
[Figure: same chain, except Dead now returns to Asst. Prof with probability 0.5 instead of absorbing (reincarnation); discount factor 0.9.]
SLIDE 17
System of Equations
L(A) = 20 + .9 (.6 L(A) + .2 L(B) + .2 L(S))
L(B) = 60 + .9 (.6 L(B) + .2 L(S) + .2 L(T))
L(S) = 10 + .9 (.7 L(S) + .3 L(D))
L(T) = 100 + .9 (.7 L(T) + .3 L(D))
L(D) = 0 + .9 (.5 L(D) + .5 L(A))
SLIDE 18 Solving a Markov System with Matrix Inversion
- Upside: You get an exact answer
- Downside: If you have 100,000 states, you're solving a 100,000 × 100,000 system
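Concretely, the slide-17 system is J = r + γPJ, i.e. (I − γP)J = r. A minimal sketch of the matrix solution (state ordering and variable names are mine; the rewards and probabilities are the ones in the equations above):

```python
import numpy as np

gamma = 0.9
# State order A, B, S, T, D; rewards and transition rows copied from
# the system of equations on slide 17.
r = np.array([20.0, 60.0, 10.0, 100.0, 0.0])
P = np.array([
    [0.6, 0.2, 0.2, 0.0, 0.0],   # A
    [0.0, 0.6, 0.2, 0.2, 0.0],   # B
    [0.0, 0.0, 0.7, 0.0, 0.3],   # S
    [0.0, 0.0, 0.0, 0.7, 0.3],   # T
    [0.5, 0.0, 0.0, 0.0, 0.5],   # D (reincarnation variant)
])

# (I - gamma * P) J = r
J = np.linalg.solve(np.eye(5) - gamma * P, r)
for name, value in zip("ABSTD", J):
    print(f"L({name}) = {value:.1f}")
```

np.linalg.solve factors the system rather than forming an explicit inverse, but the cost is still roughly O(N³), which is exactly the downside noted above.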
SLIDE 20 Value Iteration: another way to solve a Markov System
Define:
J1(Si) = Expected discounted sum of rewards over the next 1 time step
J2(Si) = Expected discounted sum of rewards during the next 2 steps
J3(Si) = Expected discounted sum of rewards during the next 3 steps
:
Jk(Si) = Expected discounted sum of rewards during the next k steps
Then, with N = number of states:
$J_1(S_i) = r_i$
$J_2(S_i) = r_i + \gamma \sum_{j=1}^{N} p_{ij} J_1(S_j)$
:
$J_{k+1}(S_i) = r_i + \gamma \sum_{j=1}^{N} p_{ij} J_k(S_j)$
SLIDE 21 Let’s do Value Iteration
k | Jk(SUN) | Jk(WIND) | Jk(HAIL)
1 |         |          |
2 |         |          |
3 |         |          |
4 |         |          |
5 |         |          |
[Figure: three-state Markov system. SUN has reward +4, WIND reward 0, HAIL reward −8. SUN stays or moves to WIND with probability 1/2 each; WIND moves to SUN or HAIL with probability 1/2 each; HAIL stays or moves to WIND with probability 1/2 each.]
γ = 0.5
SLIDE 22 Let’s do Value Iteration
k | Jk(SUN) | Jk(WIND) | Jk(HAIL)
1 | 4       | 0        | −8
2 | 5       | −1       | −10
3 | 5       | −1.25    | −10.75
4 | 4.94    | −1.44    | −11
5 | 4.88    | −1.52    | −11.11
[Figure: same SUN/WIND/HAIL system as the previous slide.]
γ = 0.5
SLIDE 23 Value Iteration for solving Markov Systems
- Compute J1(Si) for each i
- Compute J2(Si) for each i
:
- Compute Jk(Si) for each i
As k→∞, Jk(Si) → J*(Si).
When to stop? When $\max_i \left| J_{k+1}(S_i) - J_k(S_i) \right| < \xi$
This is faster than matrix inversion (which is O(N³)) if the transition matrix is sparse.
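A runnable sketch of this loop on the SUN/WIND/HAIL example (the stopping threshold is illustrative; the rewards and transitions are the ones read off the figure):

```python
import numpy as np

gamma = 0.5
# SUN, WIND, HAIL rewards and the 1/2-1/2 transitions from the figure
r = np.array([4.0, 0.0, -8.0])
P = np.array([
    [0.5, 0.5, 0.0],   # SUN:  stay, or drift to WIND
    [0.5, 0.0, 0.5],   # WIND: clear to SUN, or worsen to HAIL
    [0.0, 0.5, 0.5],   # HAIL: ease to WIND, or stay
])

J = r.copy()                                 # J1(Si) = ri
for k in range(2, 1000):
    J_next = r + gamma * P @ J               # Jk+1 = r + gamma * P Jk
    if np.max(np.abs(J_next - J)) < 1e-6:    # the stopping rule above
        break
    J = J_next

print(k, J)   # approaches J* = (4.8, -1.6, -11.2)
```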
What if we have a way to interact with the Markov system?
SLIDE 24 A Markov Decision Process
γ = 0.9
You run a startup company. In every state you must choose between saving money (S) and advertising (A).
[Figure: four-state MDP with states Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), and Rich & Famous (+10); edges are labeled S or A and carry probabilities 1 or 1/2.]
SLIDE 25 Markov Decision Processes
An MDP has…
- A set of states {S1 ··· SN}
- A set of actions {a1 ··· aM}
- A set of rewards {r1 ··· rN} (one for each state)
- A transition probability function: $P^{k}_{ij} = \mathrm{Prob}(\text{Next} = S_j \mid \text{This} = S_i \text{ and I use action } a_k)$
On each step:
- 0. Call the current state Si
- 1. Receive reward ri
- 2. Choose an action from {a1 ··· aM}
- 3. If you choose action ak, you'll move to state Sj with probability $P^{k}_{ij}$
- 4. All future rewards are discounted by γ
What’s a solution to an MDP? A sequence of actions?
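Before answering, it helps to pin the model down as data. A hypothetical encoding of the startup MDP (the transition probabilities are reconstructed so that value iteration reproduces the table on slide 32; the Rich & Famous advertise row in particular is an assumption, as the figure leaves it ambiguous):

```python
# rewards[state] = per-state reward ri
rewards = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}

# transitions[state][action] = list of (probability, next_state)
transitions = {
    "PU": {"S": [(1.0, "PU")],
           "A": [(0.5, "PU"), (0.5, "PF")]},
    "PF": {"S": [(0.5, "PU"), (0.5, "RF")],
           "A": [(1.0, "PF")]},
    "RU": {"S": [(0.5, "PU"), (0.5, "RU")],
           "A": [(0.5, "PU"), (0.5, "PF")]},
    "RF": {"S": [(0.5, "RU"), (0.5, "RF")],
           "A": [(0.5, "PU"), (0.5, "PF")]},   # assumed advertise row
}
gamma = 0.9
```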
SLIDE 26 A Policy
A policy is a mapping from states to actions. Examples:

Policy Number 1:
STATE → ACTION
PU → S
PF → A
RU → S
RF → A

Policy Number 2:
STATE → ACTION
PU → A
PF → A
RU → A
RF → A

- How many possible policies are there in our example?
- Which of the above two policies is best?
- How do you compute the optimal policy?

[Figure: both policies drawn on the four-state MDP (PU, PF, RU +10, RF +10), with the chosen action and transition probabilities on each edge.]
SLIDE 27
Interesting Fact
For every MDP there exists an optimal policy. It's a policy such that for every possible start state there is no better option than to follow the policy.
SLIDE 28
Computing the Optimal Policy
Idea One: Run through all possible policies. Select the best. What's the problem?
SLIDE 29 Optimal Value Function
Define J*(Si) = Expected Discounted Future Rewards, starting from state Si, assuming we use the optimal policy
[Figure: three-state MDP with states S1 (reward +0), S2 (reward +3), S3 (reward +2) and an action B; edge probabilities include 1/2, 1/3, and 1.]
Question: What is an optimal policy for this MDP (assume γ = 0.9)?
What is J*(S1)? What is J*(S2)? What is J*(S3)?
SLIDE 30
Computing the Optimal Value Function with Value Iteration
Define Jk(Si) = Maximum possible expected sum of discounted rewards I can get if I start at state Si and I live for k time steps. Note that J1(Si) = ri
SLIDE 31
Let’s compute Jk(Si) for our example
k | Jk(PU) | Jk(PF) | Jk(RU) | Jk(RF)
1 |        |        |        |
2 |        |        |        |
3 |        |        |        |
4 |        |        |        |
5 |        |        |        |
6 |        |        |        |
SLIDE 32
k | Jk(PU) | Jk(PF) | Jk(RU) | Jk(RF)
1 | 0      | 0      | 10     | 10
2 | 0      | 4.5    | 14.5   | 19
3 | 2.03   | 8.55   | 16.52  | 25.08
SLIDE 33 Bellman’s Equation
$J_{n+1}(S_i) = \max_a \left[ r_i + \gamma \sum_{j=1}^{N} P^{a}_{ij} J_n(S_j) \right]$

Value Iteration for solving MDPs:
- Compute J1(Si) for all i
- Compute J2(Si) for all i
- :
- Compute Jn(Si) for all i
…until converged: $\max_i \left| J_{n+1}(S_i) - J_n(S_i) \right| < \xi$
…Also known as Dynamic Programming
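A runnable sketch of this Bellman backup on the startup MDP (same reconstructed transitions as the earlier encoding, now as one matrix per action; the RF advertise row remains an assumption, which does not change the maxima since saving dominates in that state). Up to rounding, the printed iterates match the table on slide 32:

```python
import numpy as np

gamma = 0.9
states = ["PU", "PF", "RU", "RF"]
r = np.array([0.0, 0.0, 10.0, 10.0])

# One transition matrix per action: rows = current state, cols = next state.
P_save = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.5, 0.0, 0.0, 0.5],
                   [0.5, 0.0, 0.5, 0.0],
                   [0.0, 0.0, 0.5, 0.5]])
P_adv  = np.array([[0.5, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0, 0.0],
                   [0.5, 0.5, 0.0, 0.0]])   # RF row assumed

J = r.copy()                                 # J1(Si) = ri
print(1, J)
for n in range(2, 100):
    # Bellman backup: Jn(Si) = max_a [ ri + gamma * sum_j P^a_ij Jn-1(Sj) ]
    J_next = np.maximum(r + gamma * P_save @ J, r + gamma * P_adv @ J)
    if np.max(np.abs(J_next - J)) < 1e-4:    # converged
        break
    J = J_next
    print(n, np.round(J, 2))
```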