Optimal Control and Dynamic Programming, 4SC000, Q2 2017-2018, Duarte Antunes



SLIDE 1

4SC000 Q2 2017-2018

Optimal Control and Dynamic Programming

Duarte Antunes

SLIDE 2

Part I

Discrete optimization problems

SLIDE 3

Outline

  • Dynamic programming formalism
  • Stochastic dynamic programming
  • Applications
SLIDE 4

Recap


  • Discrete optimization problem specified by a transition diagram.
  • Several applications.

[Figure: transition diagram with stages 0, 1, …, h − 1, h; at each stage k, nodes nk1, nk2, … and transition costs ck on the arcs.]

SLIDE 5

Recap

[Figure: grid example with edge costs; optimal costs 9 and 7.]

The dynamic programming algorithm provides a policy from which an optimal path can be obtained. Policies are crucial to cope with disturbances.

SLIDE 6

Equivalent formulation of discrete optimization problems


  • Dynamic model
  • Cost

[Figure: the same transition diagram as in the Recap slide.]

xk+1 = fk(xk, uk), k ∈ {0, . . . , h − 1}.

Cost: ∑_{k=0}^{h−1} gk(xk, uk) + gh(xh)

State: x0 ∈ {1, …, n0}, x1 ∈ {1, …, n1}, …, xh ∈ {1, …, nh}
Action: u0 ∈ {1, …, m0,x0}, u1 ∈ {1, …, m1,x1}, …
Cost: g0(x0, u0) = c0_{x0,u0}, g1(x1, u1) = c1_{x1,u1}, …, gh(xh) = ch_{xh}

SLIDE 7

Dynamic programming equations

Dynamic programming algorithm in the new formalism

Start with Jh(i) = gh(i) for every i ∈ Xh and, for each decision stage k ∈ {h − 1, h − 2, …, 0}, starting from the last and moving backwards, compute Jk and µk, where µk(i) = j and j is the minimizer in the dynamic programming (DP) equation

Jk(i) = min_{j ∈ Uk(i)} gk(i, j) + Jk+1(fk(i, j)),   Uk(i) := {1, …, mk,i},

i.e., Jk(i) = gk(i, µk(i)) + Jk+1(fk(i, µk(i))). Then {µ0, …, µh−1} is an optimal policy.
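The backward recursion above can be sketched in a few lines of Python. The function name and the dictionary-based encoding of the transition diagram are illustrative choices, not notation from the course.

```python
def dp_backward(h, states, controls, f, g, g_terminal):
    """Backward dynamic programming over a transition diagram.

    states[k]      -- iterable of states at stage k
    controls(k, i) -- admissible actions U_k(i)
    f(k, i, j)     -- next state f_k(i, j)
    g(k, i, j)     -- stage cost g_k(i, j)
    g_terminal(i)  -- terminal cost g_h(i)
    Returns the costs-to-go J[k][i] and the policy mu[k][i].
    """
    J = {h: {i: g_terminal(i) for i in states[h]}}
    mu = {}
    for k in range(h - 1, -1, -1):          # stages h-1, ..., 0
        J[k], mu[k] = {}, {}
        for i in states[k]:
            # DP equation: J_k(i) = min_j g_k(i, j) + J_{k+1}(f_k(i, j))
            best = min(controls(k, i),
                       key=lambda j: g(k, i, j) + J[k + 1][f(k, i, j)])
            mu[k][i] = best
            J[k][i] = g(k, i, best) + J[k + 1][f(k, i, best)]
    return J, mu
```

On a small toy diagram the recursion picks an expensive immediate arc when the future savings outweigh it, which is exactly the balance the DP equation prescribes.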

SLIDE 8

Remarks

  • The DP equation expresses the balance each optimal decision must meet between immediate and future cost.
  • This is just a more formal way of writing what we have already seen.
  • We shall use the same notation for stage-decision problems.
  • There we shall formally prove that the dynamic programming algorithm provides the optimal policy. The proof also applies to discrete optimization problems.

Jk(xk) = min_{uk ∈ Uk(xk)} gk(xk, uk) + Jk+1(fk(xk, uk)),

where gk(xk, uk) is the immediate or stage cost and Jk+1(fk(xk, uk)) is the future cost.

SLIDE 9

Example

Move a robot from an initial stage to a final stage in minimum time.

  • If the robot is not stuck in an obstacle or in a wall, it can go up, straight, or down; otherwise, there is only one option (see figures).
  • It takes 1 time unit to move horizontally from stage to stage and √2 time units to move diagonally; c extra time units are paid every time an obstacle or a wall is hit.

[Figure: grid from initial stage to final stage; from column i the robot moves to column i + 1 going up (cost √2), straight (cost 1), or down (cost √2); at obstacle, upper-wall, and lower-wall nodes the single available move costs 1 + c or √2 + c.]

SLIDE 10

Modeling

[Figure: transition diagram with arc costs 1, √2, 1 + c, and √2 + c obtained from the movement rules.]

This problem can be written in the DP framework for a transition diagram obtained from the rules of the problem.

SLIDE 11

DP equation

Jh(i) = 0, i ∈ {1, …, n}

For k ∈ {h − 1, h − 2, …, 1, 0}:

  • Not an obstacle node, i ∈ {2, …, n − 1}: Jk(i) = min{1 + Jk+1(i), √2 + Jk+1(i + 1), √2 + Jk+1(i − 1)}
  • Obstacle node: Jk(i) = 1 + c + Jk+1(i)
  • Walls: Jk(1) = √2 + c + Jk+1(2), Jk(n) = √2 + c + Jk+1(n − 1)
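These recursions can be sketched directly in Python. The grid size, horizon, and obstacle set passed in are hypothetical inputs, not values from the slides.

```python
import math

SQRT2 = math.sqrt(2)

def robot_costs_to_go(n, h, c, obstacles):
    """Backward DP for the robot shortest-time problem.

    Rows i in {1, ..., n}; rows 1 and n are the walls.
    obstacles -- set of (stage, row) pairs
    Returns the costs-to-go as a dict J[k][i].
    """
    J = {h: {i: 0.0 for i in range(1, n + 1)}}    # J_h(i) = 0
    for k in range(h - 1, -1, -1):
        Jk = {}
        for i in range(1, n + 1):
            if i == 1:                             # upper wall: forced diagonal
                Jk[i] = SQRT2 + c + J[k + 1][2]
            elif i == n:                           # lower wall: forced diagonal
                Jk[i] = SQRT2 + c + J[k + 1][n - 1]
            elif (k, i) in obstacles:              # obstacle: forced straight
                Jk[i] = 1 + c + J[k + 1][i]
            else:                                  # free node: up/straight/down
                Jk[i] = min(1 + J[k + 1][i],
                            SQRT2 + J[k + 1][i + 1],
                            SQRT2 + J[k + 1][i - 1])
        J[k] = Jk
    return J
```

With no obstacles, a middle row simply costs 1 per remaining stage, and wall rows pay the forced √2 + c.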

SLIDE 12

Numerical Example

The DP equations of the previous slide (free nodes, obstacle nodes, and wall nodes, with Jh(i) = 0) are iterated backwards with c = 4.

[Figure: grid of the resulting costs-to-go Jk(i), ranging from 0.00 at the final stage to 17.41 at the wall rows of the initial stage.]

SLIDE 13

Outline

  • Dynamic programming formalism
  • Stochastic dynamic programming
  • Applications
SLIDE 14

Discussion

  • We can use the policy provided by the dynamic programming algorithm assuming no disturbances to cope with disturbances.
  • Is this procedure optimal in any sense? In general, no.
  • In fact, as we show next, in the presence of disturbances it may not even be possible to define optimal decisions, since these would depend on future realizations of the disturbances.


SLIDE 15

Example

Consider that at position A there might be a disturbance making the robot move down one extra position.

[Figure: the cost-to-go grid of the numerical example with position A marked, and the decisions up, straight, down at A together with their possible outcomes (no disturbance / disturbance); e.g., 'straight' costs 1 without the disturbance and √5 with it.]

SLIDE 16

Example

If we knew the future disturbance value, we would pick 'up' if 'disturbance' and 'straight' if 'no disturbance'. Thus, if we assume nothing about the disturbances, there is no optimal decision at position A.

Cost-to-go at position A:

  decision    | no disturbance | disturbance
  'up'        | 11.4 + √2      | 7.83 + 1
  'straight'  | 7.83 + 1       | 12.2 + √2
  'down'      | 12.2 + √2      | 11.8 + √5

SLIDE 17

Assumptions on disturbances

There are two assumptions that make optimal decisions well-defined:

  • Stochastic disturbances. If we have a stochastic characterization of the disturbances, we can define optimal policies as the ones that minimize the expected cost. The dynamic programming framework can be extended to provide this policy.
  • Worst-case disturbances. Optimal control problems with worst-case disturbances can be tackled in the framework of game theory and will not be addressed in the course.


SLIDE 18

Example: stochastic disturbances

For the toy robot problem, consider the stochastic characterization Prob[no disturbance] = 0.8, Prob[disturbance] = 0.2. Cost-to-go at position A:

  decision    | no disturbance | disturbance | expected cost
  'up'        | 11.4 + √2      | 7.83 + 1    | 0.8(11.4 + √2) + 0.2(7.83 + 1) = 12.02
  'straight'  | 7.83 + 1       | 12.2 + √2   | 0.8(7.83 + 1) + 0.2(12.2 + √2) = 9.79
  'down'      | 12.2 + √2      | 11.8 + √5   | 0.8(12.2 + √2) + 0.2(11.8 + √5) = 13.69

The optimal decision is now well-defined: pick 'straight'.
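The expected-cost column can be checked numerically. The individual cost values are taken from the slide; how each decision pairs with its two outcomes is reconstructed from the expected-cost formulas.

```python
import math

SQRT2, SQRT5 = math.sqrt(2), math.sqrt(5)
P_DIST = 0.2   # Prob[disturbance]

# cost-to-go at A per decision: (no disturbance, disturbance)
outcomes = {
    "up":       (11.4 + SQRT2, 7.83 + 1),
    "straight": (7.83 + 1,     12.2 + SQRT2),
    "down":     (12.2 + SQRT2, 11.8 + SQRT5),
}

# expected cost = 0.8 * (no-disturbance cost) + 0.2 * (disturbance cost)
expected = {d: (1 - P_DIST) * a + P_DIST * b for d, (a, b) in outcomes.items()}
best = min(expected, key=expected.get)
```

`best` comes out as 'straight' with expected cost about 9.79, matching the slide.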

SLIDE 19

Example: worst-case disturbances

Cost-to-go at position A:

  decision    | no disturbance | disturbance | worst-case cost
  'up'        | 11.4 + √2      | 7.83 + 1    | 11.4 + √2
  'straight'  | 7.83 + 1       | 12.2 + √2   | 12.2 + √2
  'down'      | 12.2 + √2      | 11.8 + √5   | 11.8 + √5

The optimal decision is now well-defined: pick 'up'. Safe policy (at worst we get 11.4 + √2).

SLIDE 20

Stochastic formulation: Markov decision processes

The dynamic model and the cost now take the form

xk+1 = fk(xk, uk, wk),    ∑_{k=0}^{h−1} gk(xk, uk, wk) + gh(xh),

where the state and input live in the same finite spaces defined before, and the disturbances belong to a finite set wk ∈ Wk(i, j) := {1, …, ωi,j,k} when xk = i, uk = j, characterized by the probabilities

pk,i,j^ℓ := Prob[wk = ℓ | xk = i, uk = j],  ℓ ∈ Wk(i, j).

Note that both the state and the cost are now random variables.

SLIDE 21

Stochastic formulation: Markov decision processes

Given a policy π = {µ0, …, µh−1}, i.e., uk = µk(xk), the closed loop is xk+1 = fk(xk, µk(xk), wk) and the cost

Jπ(x0) = E[ ∑_{k=0}^{h−1} gk(xk, µk(xk), wk) + gh(xh) ]

is a real number for each initial condition. The goal is to find a policy that minimizes the cost Jπ.

SLIDE 22

(Stochastic) Dynamic programming algorithm

Start with Jh(i) = gh(i) for every i ∈ Xh and, for each decision stage k ∈ {h − 1, h − 2, …, 0}, starting from the last and moving backwards, compute Jk and µk as

(DP eq.)  Jk(i) = min_{j ∈ Uk(i)} ∑_{ℓ ∈ Wk(i,j)} pk,i,j^ℓ (gk(i, j, ℓ) + Jk+1(fk(i, j, ℓ)))

and µk(i) = j, where j is the minimizer in the DP equation. Then {µ0, …, µh−1} is an optimal policy.

Each function Jℓ(i) is now the expected cost-to-go when xℓ = i,

Jℓ(i) = E[ ∑_{k=ℓ}^{h−1} gk(xk, µk(xk), wk) + gh(xh) ]

(the notation Jℓ(i) = E[ ∑_{k=ℓ}^{h−1} gk(xk, µk(xk), wk) + gh(xh) | xℓ = i ] is typically also used). In particular, J0(i) is the expected cost for a given initial condition x0 = i.
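The stochastic DP recursion can be sketched generically. Encoding the disturbance model as an `outcomes` function returning (probability, next state, stage cost) triples is an illustrative choice, not notation from the course.

```python
def stochastic_dp(h, states, controls, outcomes, g_terminal):
    """Stochastic backward DP (expected cost-to-go).

    outcomes(k, i, j) -- iterable of (prob, next_state, stage_cost)
                         over the disturbance values l in W_k(i, j)
    Returns the expected costs-to-go J[k][i] and the policy mu[k][i].
    """
    J = {h: {i: g_terminal(i) for i in states[h]}}
    mu = {}
    for k in range(h - 1, -1, -1):
        J[k], mu[k] = {}, {}
        for i in states[k]:
            def expected_cost(j):
                # sum over disturbances of p * (g_k(i,j,l) + J_{k+1}(f_k(i,j,l)))
                return sum(p * (cost + J[k + 1][nxt])
                           for p, nxt, cost in outcomes(k, i, j))
            best = min(controls(k, i), key=expected_cost)
            mu[k][i] = best
            J[k][i] = expected_cost(best)
    return J, mu
```

The only change with respect to the deterministic algorithm is that the minimization is over an expectation rather than a single successor.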

SLIDE 23

  • The nomenclature 'Markov decision processes' comes from the fact that the dynamic model has the so-called Markov property.
  • Intuitively, the Markov property states that the probability that the process moves to a new state depends only on the current state and action; given the current state and action, the next state is conditionally independent of all previous states and actions.
  • Formally, the Markov property states that, for any function f,
    E[f(xk+1) | xk, uk, xk−1, uk−1, …, x0, u0] = E[f(xk+1) | xk, uk].
  • This property follows from the definition of the dynamic model.
  • For convenience, throughout the course we will sometimes refer to a Markov decision process as a discrete optimisation process with stochastic disturbances.

SLIDE 24

Example

Consider the deterministic problem introduced before, but now assume that for regular nodes (no obstacles or walls) the evolution of the state is stochastic, for every stage and state:

  • 'move straight': straight with probability 1 − 2p, up with probability p, down with probability p;
  • 'move up': up with probability 1 − 2p, straight with probability 2p;
  • 'move down': down with probability 1 − 2p, straight with probability 2p.

[Figure: initial state, final stage h, and the three moves with the probabilities above.]

SLIDE 25

Dynamic programming equation

Jh(i) = 0, i ∈ {1, …, n}

For k ∈ {h − 1, h − 2, …, 0}:

  • Walls: Jk(1) = √2 + c + Jk+1(2), Jk(n) = √2 + c + Jk+1(n − 1)
  • Obstacle node: Jk(i) = 1 + c + Jk+1(i)
  • Not an obstacle node, i ∈ {2, …, n − 1} (stochastic):

    Jk(i) = min{ (1 − 2p)(1 + Jk+1(i)) + p(√2 + Jk+1(i + 1)) + p(√2 + Jk+1(i − 1)),
                 2p(1 + Jk+1(i)) + (1 − 2p)(√2 + Jk+1(i + 1)),
                 2p(1 + Jk+1(i)) + (1 − 2p)(√2 + Jk+1(i − 1)) }

    Compare with the deterministic case (computed before):

    Jk(i) = min{1 + Jk+1(i), √2 + Jk+1(i + 1), √2 + Jk+1(i − 1)}
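The free-node minimization can be written directly in Python; `Jn`, a hypothetical container mapping rows to next-stage costs-to-go Jk+1, is an illustrative encoding. With p = 0 the update reduces to the deterministic equation.

```python
import math

SQRT2 = math.sqrt(2)

def stochastic_free_node(Jn, i, p):
    """Cost-to-go at a free node i under slip probability p.

    The three candidates correspond to the decisions
    'straight', 'up', and 'down'.
    """
    straight = ((1 - 2 * p) * (1 + Jn[i])
                + p * (SQRT2 + Jn[i + 1])
                + p * (SQRT2 + Jn[i - 1]))
    up = 2 * p * (1 + Jn[i]) + (1 - 2 * p) * (SQRT2 + Jn[i + 1])
    down = 2 * p * (1 + Jn[i]) + (1 - 2 * p) * (SQRT2 + Jn[i - 1])
    return min(straight, up, down)
```

With p = 0.1 and zero next-stage costs the 'straight' branch gives 0.8 + 0.2√2 ≈ 1.08, consistent with the cell values on the next slide.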

SLIDE 26

Numerical example

The stochastic DP equation of the previous slide is iterated with p = 0.1 and c = 4.

[Figure: grid of the resulting expected costs-to-go, ranging from 0.00 at the final stage to 22.04.]

SLIDE 27

Comparison with deterministic

[Figure: optimal policies computed by stochastic DP and by deterministic DP, side by side.]

The optimal policy coping with uncertainty is intuitive: go around the obstacles!

SLIDE 28

Options to cope with disturbances

  • Open loop: apply the control inputs of the optimal path.
      • May not make sense; leads to poor performance or instability.
  • Feedback policy obtained from the DP algorithm neglecting disturbances.
      • Often followed in practice (especially in engineering contexts) due to the difficulty of obtaining a stochastic characterization of the disturbances.
      • Later we will call this policy certainty equivalent control.
  • Feedback policy obtained from the DP algorithm for stochastic models.
      • Important in operational research problems, games, etc., but of limited use in practical control applications.

SLIDE 29

Open loop

  • 1. Compute optimal decisions for a given initial state assuming no disturbances.
      • Predicted cost: 13.8.
  • 2. Blindly apply these decisions.
      • Cost of the perturbed path in the picture: 26.66.
      • Average cost of 1000 perturbed paths: 22.04.

(Open-loop decisions are applied only at regular nodes; at obstacle and wall nodes the options are unique.)

SLIDE 30

Closed loop with deterministic DP

  • 1. Compute the optimal policy assuming no disturbances.
  • 2. Re-decide online based on the policy.
      • Cost of the perturbed path in the picture: 19.07.
      • Average cost of 1000 perturbed paths: 17.51.
SLIDE 31

Closed loop with stochastic DP

  • 1. Compute the optimal policy coping with disturbances.
  • 2. Re-decide online based on the policy.
      • Cost of the perturbed path in the picture: 15.48.
      • Average cost of 1000 perturbed paths: 15.96.
SLIDE 32

Outline

  • Dynamic programming formalism
  • Stochastic dynamic programming
  • Applications
SLIDE 33

Discussion

  • Stochastic dynamic programming has a large number of applications in management, finance, games, operational research, and other areas requiring decisions in the presence of uncertainty.
  • One interesting class of problems are optimal stopping problems, where at each stage one must decide either to stop or to continue a given process (possibly at a given cost).
  • Another interesting class are games, where a good (optimal) policy can sometimes bias the game in favor of one of the players.
  • Here we discuss a game and a standard problem: car parking.
  • Bertsekas' book discusses stopping time problems in Chapter 4. Continuous-time optimal stopping problems are mathematically much more involved (see, e.g., M. H. Davis, Markov Models and Optimization, Chapman and Hall/CRC, 1993).

SLIDE 34

A simple game

Two players, A and B, repetitively play two fair games. Suppose h = 2 and L = 3, and both players have zero points at the initial stage.

  • Player A tosses a coin of 1 money unit: if the outcome is 'heads' he/she gains 1 point (player B loses 1 point); otherwise he/she loses 1 point (player B gains 1 point).
  • Player B tosses a coin of 2 money units: if the outcome is 'heads' he/she gains 2 points (player A loses 2 points); otherwise he/she loses 2 points (player A gains 2 points).
  • The game ends if one of the players achieves L or L + 1 points, in which case that player receives L money units from the other player.
  • If none of the players reaches L points during the h stages, the game stops and the players do not win or lose anything.

Can player A gain money in expectation by deciding at each stage which coin to toss?

SLIDE 35

DP formulation

Dynamic model: xk+1 = f(xk, uk, wk), k ∈ {0, …, h − 1}, where xk is the number of points of player A, uk ∈ {1, 2} is the decision (which coin to toss), and wk ∈ {−1, 1} (tails/heads), with

f(xk, uk, wk) = xk + 1  if uk = 1 and wk = 1
                xk + 2  if uk = 2 and wk = 1
                xk − 1  if uk = 1 and wk = −1
                xk − 2  if uk = 2 and wk = −1
                xk      if xk ∈ {L, L + 1, −L, −(L + 1)}  (game has ended, but wait for the last stage)

Cost: min ∑_{k=0}^{h−1} g(xk, uk) + gh(xh), with g(xk, uk) = 0 for every xk, uk, and (note that profit = −cost)

gh(xh) = −L  if xh = L or xh = L + 1
          L  if xh = −L or xh = −(L + 1)
          0  otherwise

SLIDE 36

DP algorithm

DP algorithm: Jh(xh) = gh(xh) and, for k ∈ {h − 1, h − 2, …, 0},

Jk(xk) = min_{uk} E[g(xk, uk) + Jk+1(xk+1) | xk].

Using g = 0 and the expression for f(xk, uk, wk) we obtain

Jh(xh) = −L  if xh = L or xh = L + 1
          L  if xh = −L or xh = −(L + 1)
          0  otherwise

Jk(xk) = Jk+1(xk),  if xk ∈ {−(L + 1), −L, L, L + 1}
         min{ (Jk+1(xk + 1) + Jk+1(xk − 1))/2, (Jk+1(xk + 2) + Jk+1(xk − 2))/2 },  otherwise
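The recursion above is easy to iterate numerically. This sketch uses exact rational arithmetic (an implementation choice, not from the slides); for h = 2, L = 3 it gives J0(0) = −3/4 at the zero initial condition.

```python
from fractions import Fraction

def coin_game_cost(h=2, L=3):
    """Costs-to-go for the two-coin game (from player A's point of view).

    State: A's point count x in {-(L+1), ..., L+1}.
    Coin 1 moves x by +/-1, coin 2 by +/-2, each with probability 1/2.
    Terminal cost: -L if A has won, +L if A has lost, 0 otherwise.
    """
    half = Fraction(1, 2)
    absorbing = {L, L + 1, -L, -(L + 1)}
    J = {x: Fraction(-L) if x in {L, L + 1}
            else Fraction(L) if x in {-L, -(L + 1)}
            else Fraction(0)
         for x in range(-(L + 1), L + 2)}
    for _ in range(h):                       # h backward DP iterations
        Jn = {}
        for x in J:
            if x in absorbing:
                Jn[x] = J[x]                 # game already over
            else:
                Jn[x] = min(half * (J[x + 1] + J[x - 1]),   # coin 1
                            half * (J[x + 2] + J[x - 2]))   # coin 2
        J = Jn
    return J
```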

SLIDE 37

Optimal policy and cost

Iterating this equation we get the costs-to-go and optimal decisions indicated in the diagram.

[Figure: tree of states over stages k = 0, 1, 2; circles correspond to states and are labeled with the number of points at a given stage, with the cost-to-go on top and the decision u0, u1 ∈ {1, 2} (or '1 or 2') next to them.]

For the zero initial condition the average cost is −3/4, and so the profit of player A is 3/4.

SLIDE 38

The car parking problem

Juliet is taking her boyfriend Romeo to watch a movie in a crowded local cinema. They are late and need to find a parking space along a long avenue.

  • If they park k places away from the cinema, k ∈ {1, …, h}, they lose ck minutes of the movie.
  • Each parking place is free with probability pk = 1 − α^k, 0 < α < 1.
  • If they reach the cinema without having parked, they have to wait in the parking queue of the cinema, which typically takes about C minutes.

When should they park (stop searching) to minimize the expected delay?

SLIDE 39

Dynamic programming formulation

Stage 1 Stage 0 Stage h − 1 Stage h O F O F O F O F O F E S S S S DP 0 DP P h DP 0 DP P DP 0 DP P DP 0 DP P DP DP P C C h − 1

  • parking places, stage
  • State parking spot free - , occupied - , car is already parked -
  • Terminal state
  • When , park ( )-cost , don’t park ( ) -cost ; when ,
  • If , with probability and with probability
  • If the decision is not to park at the final stage, the cost is

h − # ` ∈ {0, 1, . . . , h − 1}, k = h − ` x` ∈ {F, S, O} ` xh = E x` = F, u ∈ {P, DP} P F O S DP u` = DP x` = O C

34

u` = DP x`+1 = F x`+1 = O α(h−(`+1)) 1 − α(h−(`+1)) c 2c 3c ck

SLIDE 40

Dynamic programming algorithm

Let pk = 1 − α^k and qk = α^k.

Stage h: Jh(E) = 0.

Stage h − 1: Jh−1(O) = C, Jh−1(F) = min{C, c}.

Stage h − 2: Jh−2(O) = p1 Jh−1(F) + q1 Jh−1(O) = p1 min{C, c} + q1 C,
Jh−2(F) = min{p1 Jh−1(F) + q1 Jh−1(O), 2c} = min{p1 min{C, c} + q1 C, 2c}.

…

Stage h − (k + 1): Jh−(k+1)(O) = pk Jh−k(F) + qk Jh−k(O),
Jh−(k+1)(F) = min{pk Jh−k(F) + qk Jh−k(O), c(k + 1)}.
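The stage-by-stage recursion above can be sketched as follows; the dictionary layout is an illustrative choice.

```python
def parking_costs(h, C, c, alpha):
    """Costs-to-go for the car parking problem.

    At stage h - (k+1) the car is k+1 places from the cinema.
    'F' = current place free, 'O' = occupied; the next place
    (k places away) is free with probability p_k = 1 - alpha**k.
    Parking j places away costs c*j; queueing at the cinema costs C.
    Returns J[stage][state].
    """
    J = {h - 1: {"F": min(C, c), "O": C}}       # one place away
    for k in range(1, h):                       # fills stage h-(k+1)
        # expected cost of continuing: p_k J_{h-k}(F) + q_k J_{h-k}(O)
        cont = (1 - alpha**k) * J[h - k]["F"] + alpha**k * J[h - k]["O"]
        J[h - (k + 1)] = {"F": min(cont, c * (k + 1)),  # park here or go on
                          "O": cont}                    # must go on
    return J
```

With the numerical example of the next slide (h = 4, C = 3, c = 1, α = 0.8) this yields Jh−1(F) = 1, Jh−2(F) = 2, and 2.384 farther away, reproducing the rule "park at one or two places from the cinema".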

SLIDE 41

Numerical example

h = 4, C = 3, c = 1, α = 0.8.

[Figure: transition diagram over stages 0 to 4 with the computed costs-to-go: 1 and 3 at stage 3, 2 and 2.6 at stage 2, and 2.384 for all states at stages 0 and 1.]

Simple optimal rule: park if there are two or one parking places away from the cinema; otherwise do not park.

SLIDE 42

Concluding remarks


Summary

  • We introduced a different but equivalent formalism with respect to transition diagrams.
  • Stochastic DP: if one has a stochastic characterization of the uncertainty, one can use it to improve (in terms of expected cost) open-loop policies, and even the closed-loop policy obtained with dynamic programming assuming deterministic behavior.
  • We gave two examples: games and stopping times.

After this lecture, you should be able to:

  • Model problems with uncertainties in the frame of stochastic dynamic programming.
  • Apply the stochastic dynamic programming algorithm.
SLIDE 43

Appendix A

Stochastic dynamic programming for the inventory control example

SLIDE 44

Inventory control example


Applying the dynamic programming algorithm to the inventory control example (see Lecture 1) with uncertain demand at stage 2, Prob[d2 = 0] = 0.5, Prob[d2 = 1] = 0.5, we obtain:

[Figure: transition diagram over stages 0 to 4 with states 1 to 4, decisions u0, …, u3, and the resulting costs-to-go at each node.]

SLIDE 45

Computing the optimal policy


  • Notice that, since there is only one disturbance at stage 2, the costs-to-go at stage 3 match the ones obtained before in the deterministic case.
  • Decisions at stage 2 are the ones that minimize the expected cost-to-go. We have seen that one decision is better than the other in expectation, and it is in fact the optimal one:

    u2 = 2: 0.5(0.4 − 14.4) + 0.5(−9.6 − 9.8) = −16.7
    u2 = 1: 0.5(0.9 − 19) + 0.5(10.9 − 23.6) = −15.4

  • Another example: 0.5(5.9 − 19) + 0.5(4.1 − 14.4) = −15.8
  • For stage 0 and stage 1 there is no uncertainty. To obtain the respective costs-to-go, one can apply the dynamic programming algorithm considering the expected costs-to-go at stage 2.

SLIDE 46

Appendix B

Another stopping time problem: asset selling

SLIDE 47


Asset selling

When to accept an offer for an asset, e.g., a house, in order to maximize the expected terminal revenue?*

Assumptions:

  • Independent offers, denoted by w0, w1, …, wN−1, occur in N periods of time.
  • The offers take values in a finite set, wk ∈ {a1, a2, …, aM}, with Prob[wk = ai] = pi.
  • wk = 0 models the case where there is no offer in period k.
  • If an offer is accepted, the money is invested at a rate r until the Nth period.
  • Offers rejected are not renewed, and the last offer wN−1 must be accepted.

*see Bertsekas' book, Sec. 4.4

SLIDE 48

Asset selling


Let xk, for k ≥ 1, equal the most recent offer wk−1, pick a terminal state xk = T to denote that the house has already been sold, and let x0 := 0 (zero initial offer). Then

xk+1 = T   if xk = T, or if xk ≠ T and uk = 1 (sell),
       wk  otherwise,

where uk = 1 if one decides to accept offer wk−1, and the expected selling price is

E_wk[ gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk, wk) ], where

gN(xN) = xN if xN ≠ T, 0 if xN = T,

gk(xk, uk, wk) = (1 + r)^(N−k) xk if xk ≠ T and uk = 1 (sell), 0 otherwise.

SLIDE 49

Asset selling


Applying the dynamic programming algorithm we conclude that

JN(xN) = xN if xN ≠ T, 0 if xN = T,

Jk(xk) = max{(1 + r)^(N−k) xk, E[Jk+1(wk)]} if xk ≠ T, 0 if xk = T,

which yields a threshold policy: accept the offer xk if xk > αk, reject the offer xk if xk ≤ αk, where

αk = E[Jk+1(wk)] / (1 + r)^(N−k) = ∑_{i=1}^{M} pi Jk+1(ai) / (1 + r)^(N−k).