Topics in Computational Sustainability (CS 325, Spring 2016)
SLIDE 1

Topics in Computational Sustainability

CS 325

Spring 2016

Making Choices: Sequential Decision Making

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

SLIDE 2

Stochastic programming

[Figure: a decision node with options a, b, c and outcome probabilities p_a, p_b, p_c under a probabilistic model of the world (wet/dry); the chosen plan {a(p_a), b(p_b), c(p_c)} maximizes expected utility, i.e., minimizes expected cost]

SLIDE 3

Problem Setup

Given a limited budget, which parcels should I conserve to maximize the expected number of occupied territories in 50 years?

[Map legend: conserved parcels, available parcels, current territories, potential territories]

SLIDE 4

Metapopulation = Cascade

[Figure: layered graph with patch nodes i, j, k, l, m replicated at each time step]

  • The metapopulation model can be viewed as a cascade in the layered graph representing territories over time

Target nodes: territories at the final time step

SLIDE 5

Management Actions

  • Conserving parcels adds nodes to the network to create new pathways for the cascade

[Figure: the initial network plus two candidate parcels, Parcel 1 and Parcel 2]

SLIDE 8

Cascade Optimization Problem

Given:

  • Patch network
    – Initially occupied territories
    – Colonization and extinction probabilities
  • Management actions
    – Already-conserved parcels
    – List of available parcels and their costs
  • Time horizon T
  • Budget B

Find the set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T. Can we make our decisions adaptively?
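Although optimizing over parcel sets is hard, the objective for any single candidate set is straightforward to estimate by simulation. Below is a minimal Monte Carlo sketch under simplifying assumptions: the names (simulate_cascade, p_col, p_ext, edges) are hypothetical, territories stand in directly for conserved patches, and colonization/extinction probabilities are uniform rather than the per-edge values the model allows.

```python
import random

def simulate_cascade(conserved, edges, occupied0, p_col, p_ext, T, seed=None):
    """One stochastic rollout of the metapopulation cascade over T steps."""
    rng = random.Random(seed)
    occupied = set(occupied0) & conserved
    for _ in range(T):
        nxt = set()
        for i in occupied:
            # An occupied territory persists unless it goes locally extinct.
            if rng.random() > p_ext:
                nxt.add(i)
            # It may also colonize each reachable, conserved neighbor.
            for j in edges.get(i, ()):
                if j in conserved and rng.random() < p_col:
                    nxt.add(j)
        occupied = nxt
    return len(occupied)

def expected_occupied(conserved, edges, occupied0, p_col, p_ext, T, n=1000):
    """Monte Carlo estimate of E[# occupied territories at time T]."""
    return sum(simulate_cascade(conserved, edges, occupied0, p_col, p_ext, T,
                                seed=s) for s in range(n)) / n
```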

SLIDE 9

Sequential decision making

  • We have a system that changes state over time
  • We can (partially) control the system’s state transitions by taking actions
  • The problem gives an objective that specifies which states (or state sequences) are more/less preferred
  • Problem: at each time step, select an action to optimize the overall (long-term) objective
    – Produce the most preferred sequences of “states”

SLIDE 10

Discounted Rewards/Costs

An assistant professor gets paid, say, $20K per year. How much, in total, will the A.P. earn in their lifetime?

20 + 20 + 20 + 20 + 20 + … = Infinity

What’s wrong with this argument?


SLIDE 11

Discounted Rewards

“A reward (payment) in the future is not worth quite as much as a reward now.”

– Because of the chance of obliteration
– Because of inflation

Example:

Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.

Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.’s future discounted sum of rewards?

SLIDE 12

Infinite Sum

Assuming a discount factor of 0.9, how much does the assistant professor get in total?

x = 20 + (0.9)(20) + (0.9)²(20) + (0.9)³(20) + …
  = 20 + 0.9 [20 + (0.9)(20) + (0.9)²(20) + …]
  = 20 + 0.9x

So 0.1x = 20, giving x = 200.
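The same derivation in closed form: the discounted sum is a geometric series, valid for any per-step reward r and discount factor γ < 1.

```latex
x \;=\; \sum_{n=0}^{\infty} \gamma^{n} r \;=\; \frac{r}{1-\gamma},
\qquad\text{so here } x = \frac{20}{1-0.9} = 200.
```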

SLIDE 13

Discount Factors

People in economics and probabilistic decision-making do this all the time. The “discounted sum of future rewards” using discount factor γ is:

(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ⋯ (infinite sum)
SLIDE 14

Markov System: the Academic Life

Define:
  J_A = expected discounted future rewards starting in state A
  J_B = expected discounted future rewards starting in state B
  J_T, J_S, J_D = likewise for states T, S, D

How do we compute J_A, J_B, J_T, J_S, J_D?

[Figure: Markov chain with states A. Assistant Prof (reward 20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead; edges carry transition probabilities 0.6, 0.7, 0.2, 0.3]

slide-15
SLIDE 15

Working Backwards

  • A. Assistant Prof.: 20
  • B. Associate Prof.: 60
  • T. Tenured Prof.: 100
  • S. Out on the Street: 10
  • D. Dead: 0

Discount factor 0.9

[Figure: the chain with D absorbing (self-loop 1.0) and the same edge probabilities as before]

Working backwards from the absorbing end: J_D = 0, J_T ≈ 270, J_S ≈ 27, J_B ≈ 247, J_A ≈ 151

slide-16
SLIDE 16

Reincarnation?

  • A. Assistant Prof.: 20
  • B. Associate Prof.: 60
  • T. Tenured Prof.: 100
  • S. Out on the Street: 10
  • D. Dead: 0

Discount factor 0.9

[Figure: same chain, except D is no longer absorbing: with probability 0.5 it returns to A (reincarnation) and with probability 0.5 it stays in D]

slide-17
SLIDE 17

System of Equations

L(A) = 20 + 0.9 (0.6 L(A) + 0.2 L(B) + 0.2 L(S))
L(B) = 60 + 0.9 (0.6 L(B) + 0.2 L(S) + 0.2 L(T))
L(S) = 10 + 0.9 (0.7 L(S) + 0.3 L(D))
L(T) = 100 + 0.9 (0.7 L(T) + 0.3 L(D))
L(D) = 0 + 0.9 (0.5 L(D) + 0.5 L(A))

slide-18
SLIDE 18

Solving a Markov System with Matrix Inversion

  • Upside: you get an exact answer
  • Downside: if you have 100,000 states, you’re solving a 100,000 × 100,000 system of equations
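As a concrete sketch, the system of equations on the previous slide (including the reincarnation transition) can be written as J = r + γPJ and solved as J = (I - γP)^{-1} r. A minimal NumPy illustration, with states ordered A, B, S, T, D:

```python
import numpy as np

# States ordered A, B, S, T, D; rewards and transitions copied from the
# "System of Equations" slide (D reincarnates to A with probability 0.5).
r = np.array([20.0, 60.0, 10.0, 100.0, 0.0])
P = np.array([
    [0.6, 0.2, 0.2, 0.0, 0.0],  # A -> A, B, S
    [0.0, 0.6, 0.2, 0.2, 0.0],  # B -> B, S, T
    [0.0, 0.0, 0.7, 0.0, 0.3],  # S -> S, D
    [0.0, 0.0, 0.0, 0.7, 0.3],  # T -> T, D
    [0.5, 0.0, 0.0, 0.0, 0.5],  # D -> A, D
])
gamma = 0.9

# J = r + gamma * P @ J   <=>   (I - gamma * P) @ J = r
J = np.linalg.solve(np.eye(5) - gamma * P, r)
print(dict(zip("ABSTD", J.round(1))))
```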
SLIDE 19

Value Iteration: another way to solve a Markov System

Define:
  J_1(S_i) = expected discounted sum of rewards over the next 1 time step
  J_2(S_i) = expected discounted sum of rewards during the next 2 steps
  J_3(S_i) = expected discounted sum of rewards during the next 3 steps
  ⋮
  J_k(S_i) = expected discounted sum of rewards during the next k steps

J_1(S_i) = (what?)   J_2(S_i) = (what?)   J_{k+1}(S_i) = (what?)

SLIDE 20

Value Iteration: another way to solve a Markov System

Same definitions as the previous slide. The answers, with N = number of states and p_ij = probability of transitioning from S_i to S_j:

$$J_1(S_i) = r_i$$

$$J_2(S_i) = r_i + \gamma \sum_{j=1}^{N} p_{ij}\, J_1(S_j)$$

$$J_{k+1}(S_i) = r_i + \gamma \sum_{j=1}^{N} p_{ij}\, J_k(S_j)$$

SLIDE 21

Let’s do Value Iteration

[Figure: three-state weather Markov system: SUN (reward +4), WIND (reward 0), HAIL (reward −8); each state has two outgoing transitions of probability 1/2: SUN → {SUN, WIND}, WIND → {SUN, HAIL}, HAIL → {WIND, HAIL}]

γ = 0.5

k | J_k(SUN) | J_k(WIND) | J_k(HAIL)
1 |          |           |
2 |          |           |
3 |          |           |
4 |          |           |
5 |          |           |

SLIDE 22

Let’s do Value Iteration

k | J_k(SUN) | J_k(WIND) | J_k(HAIL)
1 | 4        | 0         | −8
2 | 5        | −1        | −10
3 | 5        | −1.25     | −10.75
4 | 4.94     | −1.44     | −11
5 | 4.88     | −1.52     | −11.11

(Same weather system as the previous slide; γ = 0.5.)
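A minimal sketch of value iteration reproducing the table above, with the transition matrix read off the slide’s figure:

```python
import numpy as np

# Weather Markov system: states SUN, WIND, HAIL with rewards +4, 0, -8.
r = np.array([4.0, 0.0, -8.0])
P = np.array([
    [0.5, 0.5, 0.0],  # SUN  -> SUN or WIND
    [0.5, 0.0, 0.5],  # WIND -> SUN or HAIL
    [0.0, 0.5, 0.5],  # HAIL -> WIND or HAIL
])
gamma = 0.5

J = r.copy()               # J_1(S_i) = r_i
print(1, J.round(2))
for k in range(2, 6):
    J = r + gamma * P @ J  # J_{k+1}(S_i) = r_i + gamma * sum_j p_ij J_k(S_j)
    print(k, J.round(2))   # rows 2..5 of the table
```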

SLIDE 23

Value Iteration for solving Markov Systems

  • Compute J_1(S_i) for each i
  • Compute J_2(S_i) for each i
    ⋮
  • Compute J_k(S_i) for each i

As k → ∞, J_k(S_i) → J*(S_i).

When to stop? When

$$\max_i \left| J_{k+1}(S_i) - J_k(S_i) \right| < \xi$$

This is faster than matrix inversion (which costs O(N³)) if the transition matrix is sparse.

What if we have a way to interact with the Markov system?

SLIDE 24

A Markov Decision Process

γ = 0.9

You run a startup company. In every state you must choose between saving money (S) or advertising (A).

[Figure: four-state MDP with states Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10); edges are labeled with the chosen action (S or A) and transition probabilities of 1 or 1/2]

SLIDE 25

Markov Decision Processes

An MDP has…

  • A set of states {S_1, …, S_N}
  • A set of actions {a_1, …, a_M}
  • A set of rewards {r_1, …, r_N} (one for each state)
  • A transition probability function

$$P_{ij}^{k} = \mathrm{Prob}\big(\text{Next} = S_j \,\big|\, \text{This} = S_i \text{ and I use action } a_k\big)$$

On each step:

  0. Call the current state S_i
  1. Receive reward r_i
  2. Choose an action ∈ {a_1, …, a_M}
  3. If you choose action a_k, you’ll move to state S_j with probability P^k_ij
  4. All future rewards are discounted by γ

What’s a solution to an MDP? A sequence of actions?

SLIDE 26

A Policy

A policy is a mapping from states to actions. Examples:

Policy Number 1:
  STATE → ACTION
  PU → S
  PF → A
  RU → S
  RF → A

Policy Number 2:
  STATE → ACTION
  PU → A
  PF → A
  RU → A
  RF → A

  • How many possible policies are there in our example?
  • Which of the above two policies is best?
  • How do you compute the optimal policy?

[Figure: the two policies drawn on the MDP, with rewards +10 at the rich states and action labels S/A with probabilities 1 and 1/2 on the edges]

SLIDE 27

Interesting Fact

For every M.D.P. there exists an optimal policy. It’s a policy such that for every possible start state there is no better option than to follow the policy.

SLIDE 28

Computing the Optimal Policy

Idea One: run through all possible policies and select the best. What’s the problem? (With N states and M actions there are M^N deterministic policies, far too many to enumerate.)

SLIDE 29

Optimal Value Function

Define J*(Si) = Expected Discounted Future Rewards, starting from state Si, assuming we use the optimal policy

[Figure: three-state MDP with rewards S1: +0, S2: +3, S3: +2, an action labeled B, and transition probabilities of 1, 1/2, and 1/3 on the edges]

Question: What is an optimal policy for this MDP (assume γ = 0.9)? What is J*(S1)? What is J*(S2)? What is J*(S3)?

SLIDE 30

Computing the Optimal Value Function with Value Iteration

Define J_k(S_i) = the maximum possible expected sum of discounted rewards I can get if I start at state S_i and live for k time steps. Note that J_1(S_i) = r_i.

SLIDE 31

Let’s compute J_k(S_i) for our example

k | J_k(PU) | J_k(PF) | J_k(RU) | J_k(RF)
1 |         |         |         |
2 |         |         |         |
3 |         |         |         |
4 |         |         |         |
5 |         |         |         |
6 |         |         |         |

SLIDE 32

k | J_k(PU) | J_k(PF) | J_k(RU) | J_k(RF)
1 | 0       | 0       | 10      | 10
2 | 0       | 4.5     | 14.5    | 19
3 | 2.03    | 8.55    | 16.52   | 25.08

SLIDE 33

Bellman’s Equation

$$J_{n+1}(S_i) = \max_{a} \left[ r_i + \gamma \sum_{j=1}^{N} P_{ij}^{a}\, J_n(S_j) \right]$$

Converged when:

$$\max_i \left| J_{n+1}(S_i) - J_n(S_i) \right| < \xi$$

Value Iteration for solving MDPs

  • Compute J_1(S_i) for all i
  • Compute J_2(S_i) for all i
    ⋮
  • Compute J_n(S_i) for all i
  • …until converged

…Also known as Dynamic Programming
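Finally, a minimal sketch of value iteration with this Bellman backup on the startup MDP. The transition matrices below are a reconstruction from the slide diagram and the J_k table (they reproduce rows 2 and 3 of the table exactly), but treat them as an assumption rather than the slides’ exact figure.

```python
import numpy as np

# States PU, PF, RU, RF with rewards 0, 0, 10, 10; gamma = 0.9.
r = np.array([0.0, 0.0, 10.0, 10.0])
gamma = 0.9

# P[a][i, j] = Prob(next = j | this = i, action a).
# Reconstructed from the slide's figure and table; an assumption.
P = {
    "S": np.array([            # Save
        [1.0, 0.0, 0.0, 0.0],  # PU -> PU
        [0.5, 0.0, 0.0, 0.5],  # PF -> PU or RF
        [0.5, 0.0, 0.5, 0.0],  # RU -> PU or RU
        [0.0, 0.0, 0.5, 0.5],  # RF -> RU or RF
    ]),
    "A": np.array([            # Advertise
        [0.5, 0.5, 0.0, 0.0],  # PU -> PU or PF
        [0.0, 1.0, 0.0, 0.0],  # PF -> PF
        [0.0, 1.0, 0.0, 0.0],  # RU -> PF
        [0.0, 1.0, 0.0, 0.0],  # RF -> PF
    ]),
}

J = r.copy()  # J_1(S_i) = r_i
for n in range(2, 4):
    # Bellman backup: J_n(S_i) = max_a [ r_i + gamma * sum_j P^a_ij J_{n-1}(S_j) ]
    J = np.max([r + gamma * Pa @ J for Pa in P.values()], axis=0)
    print(n, J.round(2))  # matches rows 2 and 3 of the table above
```

Taking the argmax over actions at convergence, rather than the max, extracts the optimal policy itself.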