4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes
Part I: Discrete optimization problems

Outline:
- Dynamic programming formalism
- Stochastic dynamic programming
- Applications
- Recap
[Figure: transition diagram of a discrete optimization problem: a trellis with stages 0 through h; stage k has n_k nodes, the cost of moving from node i at stage k to node j at stage k + 1 is c^k_{ij}, and the terminal cost at node i of stage h is c^h_i. A numerical example with arc costs and the resulting optimal costs is also shown.]
The dynamic programming algorithm provides a policy from which an optimal path can be obtained. Policies are crucial to cope with disturbances.
The problem can be written in the following formalism.

Dynamic model:

    x_{k+1} = f_k(x_k, u_k),  k ∈ {0, …, h − 1}

Cost:

    Σ_{k=0}^{h−1} g_k(x_k, u_k) + g_h(x_h)

State: x_0 ∈ {1, …, n_0}, x_1 ∈ {1, …, n_1}, …, x_h ∈ {1, …, n_h}
Action: u_0 ∈ {1, …, m_{0,x_0}}, u_1 ∈ {1, …, m_{1,x_1}}, …
Cost: g_0(x_0, u_0) = c^0_{x_0,u_0}, g_1(x_1, u_1) = c^1_{x_1,u_1}, …, g_h(x_h) = c^h_{x_h}
Dynamic programming algorithm in the new formalism
Start with J_h(i) = g_h(i) for every i ∈ X_h and, for each decision stage k ∈ {h − 1, h − 2, …, 0}, starting from the last and moving backwards, compute J_k and µ_k as

    J_k(i) = min_{j ∈ U_k(i)}  g_k(i, j) + J_{k+1}(f_k(i, j)),   U_k(i) := {1, …, m_{k,i}}

and µ_k(i) = j, where j is the minimizer in the dynamic programming (DP) equation, so that

    J_k(i) = g_k(i, µ_k(i)) + J_{k+1}(f_k(i, µ_k(i))).

Then {µ_0, …, µ_{h−1}} is an optimal policy.
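The backward recursion above can be sketched in code. A minimal sketch, where the data layout (states and actions as dicts indexed by stage, f and g as Python callables) is an assumption made for illustration:

```python
def dp_backward(h, states, actions, f, g, g_terminal):
    """Return cost-to-go tables J[k][i] and a policy mu[k][i].

    states[k]     : iterable of states at stage k (k = 0..h)
    actions[k]    : dict state -> iterable of admissible actions U_k(i)
    f(k, i, j)    : next state f_k(i, j)
    g(k, i, j)    : stage cost g_k(i, j)
    g_terminal(i) : terminal cost g_h(i)
    """
    J = [dict() for _ in range(h + 1)]
    mu = [dict() for _ in range(h)]
    for i in states[h]:                      # J_h(i) = g_h(i)
        J[h][i] = g_terminal(i)
    for k in range(h - 1, -1, -1):           # k = h-1, ..., 0
        for i in states[k]:
            best_j, best = None, float("inf")
            for j in actions[k][i]:          # DP equation: immediate + future
                cost = g(k, i, j) + J[k + 1][f(k, i, j)]
                if cost < best:
                    best, best_j = cost, j
            J[k][i], mu[k][i] = best, best_j
    return J, mu
```

The returned `mu` is a policy: for every stage and state it records a minimizing action, not just the optimal path from one initial state.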
The DP equation expresses the trade-off between the immediate and the future cost:

    J_k(x_k) = min_{u_k ∈ U_k(x_k)}  g(x_k, u_k) + J_{k+1}(f(x_k, u_k))

where g(x_k, u_k) is the immediate (stage) cost and J_{k+1}(f(x_k, u_k)) is the future cost. As shown before, the algorithm provides the optimal policy; the proof also applies to discrete optimization problems.
Move a robot from an initial stage to a final stage in minimum time.

At a regular node the robot can move up, straight, or down; otherwise, there is only one option (see figures). It takes 1 time unit to move horizontally from stage to stage and √2 time units to move diagonally; c extra time units are paid every time an obstacle or a wall is hit.

[Figure: grid with initial and final stages and upper and lower walls; between stages i and i + 1 the moves up, straight, and down cost √2, 1, and √2 at regular nodes, and √2 + c or 1 + c when a wall or an obstacle is hit.]
This problem can be written in the DP framework using a transition diagram obtained from the rules of the problem.
Number the nodes 1 (lower wall) to n (upper wall) at each stage. Then

    J_h(i) = 0,   i ∈ {1, …, n}

and for k ∈ {h − 1, h − 2, …, 1, 0}:

    J_k(1) = √2 + c + J_{k+1}(2)                    (lower wall)
    J_k(n) = √2 + c + J_{k+1}(n − 1)                (upper wall)
    J_k(i) = 1 + c + J_{k+1}(i)                     (obstacle node, i ∈ {2, …, n − 1})
    J_k(i) = min{1 + J_{k+1}(i), √2 + J_{k+1}(i + 1), √2 + J_{k+1}(i − 1)}
                                                    (not an obstacle node, i ∈ {2, …, n − 1})
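This recursion is easy to implement. A sketch, assuming the obstacle map is encoded as a predicate `obstacle(k, i)` (this encoding is an illustrative choice, not part of the original problem statement):

```python
import math

def robot_cost_to_go(n, h, c, obstacle):
    """Cost-to-go J[k][i] for the robot corridor (lanes 1..n, stages 0..h)."""
    s2 = math.sqrt(2.0)
    J = [[0.0] * (n + 1) for _ in range(h + 1)]   # J[h][i] = 0 for all lanes
    for k in range(h - 1, -1, -1):
        for i in range(1, n + 1):
            if i == 1:                       # lower wall: forced diagonal bounce
                J[k][i] = s2 + c + J[k + 1][2]
            elif i == n:                     # upper wall
                J[k][i] = s2 + c + J[k + 1][n - 1]
            elif obstacle(k, i):             # obstacle node: pay c, go straight
                J[k][i] = 1 + c + J[k + 1][i]
            else:                            # regular node: straight / up / down
                J[k][i] = min(1 + J[k + 1][i],
                              s2 + J[k + 1][i + 1],
                              s2 + J[k + 1][i - 1])
    return J
```

For an obstacle-free corridor, an interior lane simply counts down h − k, while a wall lane pays √2 + c once to rejoin the interior.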
[Figure: grid of cost-to-go values J_k(i) computed with this recursion for c = 4; free rows count down 13.00, 12.00, …, 1.00, 0.00 toward the final stage, while wall and obstacle rows show values such as 17.41, 13.41, 8.41, 5.41 containing the √2 terms.]
The optimal path is computed assuming no disturbances; a policy is needed to cope with disturbances. Without any assumption on the disturbances it would not be possible to define optimal decisions, since these would depend on future realizations of the disturbances.
Consider that at position A there might be a disturbance making the robot move down one extra position.

[Figure: at position A, each decision (up, straight, down) has two possible outcomes, depending on whether the disturbance occurs; the corresponding one-stage move costs are 1, √2, and √5.]
If we knew the future disturbance value, we would pick ‘up’ if ‘disturbance’, ‘straight’ if ‘no disturbance’. Thus, if we assume nothing about the disturbances there is no optimal decision at position A.
Cost-to-go at position A:

    decision      no disturbance   disturbance
    'up'          11.4 + √2        7.83 + 1
    'straight'    7.83 + 1         12.2 + √2
    'down'        12.2 + √2        11.8 + √5
There are two assumptions that make optimal decisions well-defined:
- Stochastic: assume a probabilistic description of the disturbances and define optimal policies as the ones that minimize the expected cost. The dynamic programming framework can be extended to provide this policy.
- Worst-case: assume only that the disturbances belong to a given set and minimize the worst-case cost. This setting is tackled in the framework of game theory and will not be addressed in the course.
For the toy robot problem consider the stochastic characterization Prob[disturbance] = 0.2, Prob[no disturbance] = 0.8. The expected cost-to-go at position A is then:

    decision      no disturbance   disturbance   expected cost
    'up'          11.4 + √2        7.83 + 1      (11.4 + √2)0.8 + (7.83 + 1)0.2 = 12.0174
    'straight'    7.83 + 1         12.2 + √2     (7.83 + 1)0.8 + (12.2 + √2)0.2 = 9.79
    'down'        12.2 + √2        11.8 + √5     (12.2 + √2)0.8 + (11.8 + √5)0.2 = 13.69

The optimal decision is now well-defined: pick 'straight'.
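A quick check of the expected costs in the table (the values 11.4, 7.83, 12.2, 11.8 are the cost-to-go numbers read from the grid):

```python
import math

# Expected one-step-plus-cost-to-go of each decision at position A,
# with Prob[no disturbance] = 0.8 and Prob[disturbance] = 0.2.
s2, s5 = math.sqrt(2), math.sqrt(5)
outcomes = {             # decision: (cost if no disturbance, cost if disturbance)
    "up":       (11.4 + s2, 7.83 + 1),
    "straight": (7.83 + 1, 12.2 + s2),
    "down":     (12.2 + s2, 11.8 + s5),
}
expected = {d: 0.8 * a + 0.2 * b for d, (a, b) in outcomes.items()}
best = min(expected, key=expected.get)
```

Evaluating `expected` reproduces the three numbers in the table (up to rounding), and `best` is 'straight'.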
Alternatively, consider the worst-case cost at position A:

    decision      no disturbance   disturbance   worst-case cost
    'up'          11.4 + √2        7.83 + 1      11.4 + √2
    'straight'    7.83 + 1         12.2 + √2     12.2 + √2
    'down'        12.2 + √2        11.8 + √5     11.8 + √5

The optimal decision is now well-defined: pick 'up'. This is a safe policy (at least we are guaranteed 11.4 + √2).
The dynamic model and the cost now take the form

    x_{k+1} = f_k(x_k, u_k, w_k)

    Σ_{k=0}^{h−1} g_k(x_k, u_k, w_k) + g_h(x_h),

where the state and input live in the same finite spaces defined before (slide 3), and the disturbances belong to a finite set w_k ∈ W_k(i, j) := {1, …, ω_{i,j,k}} when x_k = i, u_k = j, characterized by

    p^ℓ_{k,i,j} := Prob[w_k = ℓ | x_k = i, u_k = j],   ℓ ∈ W_k(i, j).

Note that both the state and the cost are now random variables.
Given a policy π = {µ_0, …, µ_{h−1}}, with u_k = µ_k(x_k), the closed loop satisfies

    x_{k+1} = f_k(x_k, µ_k(x_k), w_k)

and the cost

    J_π(x_0) = E[ Σ_{k=0}^{h−1} g_k(x_k, µ_k(x_k), w_k) + g_h(x_h) ]

is a real number for each initial condition x_0. The goal is to find a policy that minimizes the cost J_π(x_0).
Start with J_h(i) = g_h(i) for every i ∈ X_h and, for each decision stage k ∈ {h − 1, h − 2, …, 0}, starting from the last and moving backwards, compute J_k and µ_k as

    J_k(i) = min_{j ∈ U_k(i)} Σ_{ℓ ∈ W_k(i,j)} p^ℓ_{k,i,j} ( g_k(i, j, ℓ) + J_{k+1}(f_k(i, j, ℓ)) )   (DP eq.)

and µ_k(i) = j, where j is the minimizer in the DP equation. Then {µ_0, …, µ_{h−1}} is an optimal policy.

Each function J_ℓ is now the expected cost-to-go when x_ℓ = i:

    J_ℓ(i) = E[ Σ_{k=ℓ}^{h−1} g_k(x_k, µ_k(x_k), w_k) + g_h(x_h) | x_ℓ = i ]

(the shorthand J_ℓ(i) = E[ Σ_{k=ℓ}^{h−1} g_k(x_k, µ_k(x_k), w_k) + g_h(x_h) ] is typically also used). In particular, J_0(i) is the expected cost for a given initial condition x_0 = i.
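The stochastic DP algorithm differs from the deterministic one only in taking an expectation inside the minimization. A minimal sketch, with the disturbance distribution supplied as a callable `outcomes(k, i, j)` returning (probability, disturbance) pairs (an illustrative encoding, not notation from the slides):

```python
def stochastic_dp(h, states, actions, outcomes, f, g, g_terminal):
    """Backward recursion for a finite stochastic problem.

    outcomes(k, i, j) : iterable of (p, w) pairs, p = Prob[w_k = w | x_k=i, u_k=j]
    f(k, i, j, w)     : next state f_k(i, j, w)
    g(k, i, j, w)     : stage cost g_k(i, j, w)
    """
    J = [dict() for _ in range(h + 1)]
    mu = [dict() for _ in range(h)]
    for i in states[h]:
        J[h][i] = g_terminal(i)
    for k in range(h - 1, -1, -1):
        for i in states[k]:
            best_j, best = None, float("inf")
            for j in actions[k][i]:
                # expected immediate cost plus expected cost-to-go
                exp_cost = sum(p * (g(k, i, j, w) + J[k + 1][f(k, i, j, w)])
                               for p, w in outcomes(k, i, j))
                if exp_cost < best:
                    best, best_j = exp_cost, j
            J[k][i], mu[k][i] = best, best_j
    return J, mu
```

Setting every disturbance set to a single outcome with probability 1 recovers the deterministic algorithm.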
In this setting the dynamic model has the so-called Markov property: the probability that the process moves to a new state depends only on the current state and action; given the current state and action, the next state is conditionally independent of all previous states and actions,

    E[f(x_{k+1}) | x_k, u_k, x_{k−1}, u_{k−1}, …, x_0, u_0] = E[f(x_{k+1}) | x_k, u_k]

for every function f. One can thus view a Markov decision process as a discrete optimization problem with stochastic disturbances.
Consider the deterministic problem introduced before, but now assume that for regular nodes (no obstacles) the evolution of the state is stochastic, for every stage and state: a decision to move straight succeeds with probability 1 − 2p and results in a diagonal move up or down with probability p each; a decision to move up (down) succeeds with probability 1 − 2p and results in a straight move with probability 2p.
Stochastic version:

    J_h(i) = 0,   i ∈ {1, …, n}

and for k ∈ {h − 1, …, 0} the wall and obstacle updates are unchanged:

    J_k(1) = √2 + c + J_{k+1}(2)
    J_k(n) = √2 + c + J_{k+1}(n − 1)
    J_k(i) = 1 + c + J_{k+1}(i)   (obstacle node, i ∈ {2, …, n − 1})

while for a regular node i ∈ {2, …, n − 1}:

    J_k(i) = min{ (1 − 2p)(1 + J_{k+1}(i)) + p(√2 + J_{k+1}(i + 1)) + p(√2 + J_{k+1}(i − 1)),
                  2p(1 + J_{k+1}(i)) + (1 − 2p)(√2 + J_{k+1}(i + 1)),
                  2p(1 + J_{k+1}(i)) + (1 − 2p)(√2 + J_{k+1}(i − 1)) }

Compare with the deterministic recursion computed before:

    J_k(i) = min{ 1 + J_{k+1}(i), √2 + J_{k+1}(i + 1), √2 + J_{k+1}(i − 1) }
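The stochastic recursion for the robot can be sketched along the same lines as the deterministic one. As before, encoding the map as a predicate `obstacle(k, i)` is an illustrative assumption:

```python
import math

def stochastic_robot(n, h, c, p, obstacle):
    """Expected cost-to-go J[k][i]; decisions at regular nodes succeed w.p. 1-2p."""
    s2 = math.sqrt(2.0)
    J = [[0.0] * (n + 1) for _ in range(h + 1)]
    for k in range(h - 1, -1, -1):
        for i in range(1, n + 1):
            if i == 1:                        # walls and obstacles: unchanged
                J[k][i] = s2 + c + J[k + 1][2]
            elif i == n:
                J[k][i] = s2 + c + J[k + 1][n - 1]
            elif obstacle(k, i):
                J[k][i] = 1 + c + J[k + 1][i]
            else:                             # regular node: expected costs
                straight = ((1 - 2 * p) * (1 + J[k + 1][i])
                            + p * (s2 + J[k + 1][i + 1])
                            + p * (s2 + J[k + 1][i - 1]))
                up = 2 * p * (1 + J[k + 1][i]) + (1 - 2 * p) * (s2 + J[k + 1][i + 1])
                down = 2 * p * (1 + J[k + 1][i]) + (1 - 2 * p) * (s2 + J[k + 1][i - 1])
                J[k][i] = min(straight, up, down)
    return J
```

With p = 0 the three expected costs collapse to the deterministic ones, so the function reproduces the earlier table.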
With p = 0.1 and c = 4:

[Figure: grid of expected cost-to-go values for the stochastic robot problem with p = 0.1, c = 4; e.g., the free rows now count down 14.62, 13.51, …, 1.08, 0.00, reflecting the chance of unintended diagonal moves.]
Comparing the stochastic DP solution with the deterministic one, the optimal policy coping with uncertainty is intuitive: go around the obstacles! A drawback of the stochastic approach is the difficulty in obtaining a stochastic characterization of the disturbances, which can limit its use in practical control applications.
One can also compare with an open-loop strategy, in which the sequence of decisions is fixed at the initial state assuming no disturbances (open-loop decisions are applied only at regular nodes; at obstacle and wall nodes the options are unique).
Optimal stopping problems arise in management, finance, games, operational research and other areas requiring decisions in the presence of uncertainty. In these problems, at each stage one must decide either to stop or to continue a given process (possibly at a given cost). In games, the option to stop can sometimes bias the game in favor of one of the players. Continuous-time optimal stopping problems are mathematically much more involved (see, e.g., M. H. A. Davis, Markov Models and Optimization, Chapman and Hall/CRC, 1993).
Suppose two players (A and B) both have zero points at the initial stage and repetitively play two fair games:
- Game 1: a fair coin is tossed; if heads, player A gains 1 point (player B loses 1 point); otherwise A loses 1 point (player B gains 1 point).
- Game 2: a fair coin is tossed; if heads, player A gains 2 points (player B loses 2 points); otherwise A loses 2 points (player B gains 2 points).
If after h stages a player has L or L + 1 points, that player receives L money units from the other player; otherwise the players do not win or lose anything. Can player A gain money in expectation by deciding at each stage which coin to toss? Consider h = 2, L = 3.
Dynamic model: x_{k+1} = f(x_k, u_k, w_k), where x_k is the number of points of player A, u_k ∈ {1, 2} is the chosen coin, and w_k ∈ {−1, 1} (tails/heads), for k ∈ {0, …, h − 1}:

    f(x_k, u_k, w_k) = x_k + 1  if u_k = 1 and w_k = 1
                       x_k + 2  if u_k = 2 and w_k = 1
                       x_k − 1  if u_k = 1 and w_k = −1
                       x_k − 2  if u_k = 2 and w_k = −1
                       x_k      if x_k ∈ {L, L + 1, −L, −(L + 1)}

(in the last case the game has ended, but we wait for the last stage).

Cost: g(x_k, u_k) = 0 for every x_k, u_k, and

    min Σ_{k=0}^{h−1} g(x_k, u_k) + g_h(x_h)   (note that profit = −cost)

with

    g_h(x_h) = −L  if x_h = L or x_h = L + 1
                L  if x_h = −L or x_h = −(L + 1)
                0  otherwise
DP algorithm: start with J_h(x_h) = g_h(x_h), i.e.,

    J_h(x_h) = −L  if x_h = L or x_h = L + 1
                L  if x_h = −L or x_h = −(L + 1)
                0  otherwise

and for k ∈ {h − 1, h − 2, …, 0},

    J_k(x_k) = min_{u_k} E[ g(x_k, u_k) + J_{k+1}(x_{k+1}) | x_k ].

Using the expression for f(x_k, u_k, w_k) and g = 0, we obtain

    J_k(x_k) = J_{k+1}(x_k),  if x_k ∈ {−(L + 1), −L, L, L + 1}
    J_k(x_k) = min{ (J_{k+1}(x_k + 1) + J_{k+1}(x_k − 1))/2,
                    (J_{k+1}(x_k + 2) + J_{k+1}(x_k − 2))/2 },  otherwise.
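The recursion above is simple enough to iterate directly. A sketch with exact rational arithmetic (the function name and the use of `Fraction` are implementation choices for illustration); it yields J_0(0) = −3/4, i.e., an expected profit of 3/4 for player A:

```python
from fractions import Fraction

def coin_game_cost(h=2, L=3):
    """Cost-to-go J_0 for the coin game (default h = 2, L = 3 as in the example)."""
    def g_h(x):                          # terminal cost (profit = -cost for A)
        if x in (L, L + 1):
            return -L
        if x in (-L, -(L + 1)):
            return L
        return 0
    span = range(-(L + 1), L + 2)
    J = {x: Fraction(g_h(x)) for x in span}      # J_h
    half = Fraction(1, 2)
    for _ in range(h):                           # k = h-1, ..., 0
        Jn = {}
        for x in span:
            if abs(x) >= L:                      # game already decided: absorb
                Jn[x] = J[x]
            else:                                # choose the better coin
                coin1 = half * (J[x + 1] + J[x - 1])
                coin2 = half * (J[x + 2] + J[x - 2])
                Jn[x] = min(coin1, coin2)
        J = Jn
    return J
```

Using `Fraction` keeps the halves exact, so the optimal value comes out as −3/4 rather than a float approximation.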
Iterating this equation we get the costs-to-go and optimal decisions indicated in the diagram. For the zero initial condition the average cost is −3/4 and so the profit is 3/4.

[Figure: decision tree over stages k = 0, 1, 2 with decisions u_0, u_1; circles correspond to states and are labeled with the number of points at a given stage, with the cost-to-go on top and the chosen coin (1 or 2) next to them.]
Juliet is taking her boyfriend Romeo to watch a movie in a crowded local cinema. They are late and need to find a parking space along a long avenue.
They can park k places away from the cinema, for k ∈ {1, …, h}; space k is free with probability p_k = 1 − α^k, for some 0 < α < 1, and walking from space k to the cinema takes ck minutes, for a given c. If they reach the end of the avenue without parking, they must use the garage at the cinema, which typically takes C minutes. When should they park (stop searching) to minimize the expected delay?

[Figure: stages 0 through h; at each stage the state is F (the current space is free), O (occupied), or S (already parked), with terminal state E; the decisions are P (park) and DP (don't park). Parking k places away costs ck (c, 2c, 3c, …) and the garage costs C.]
With p_k = 1 − α^k and q_k = α^k:

Stage h: J_h(E) = 0.

Stage h − 1: J_{h−1}(O) = C,  J_{h−1}(F) = min{C, c}.

Stage h − 2:

    J_{h−2}(O) = p_1 J_{h−1}(F) + q_1 J_{h−1}(O) = p_1 min{C, c} + q_1 C
    J_{h−2}(F) = min{p_1 J_{h−1}(F) + q_1 J_{h−1}(O), 2c} = min{p_1 min{C, c} + q_1 C, 2c}

…

Stage h − (k + 1):

    J_{h−(k+1)}(O) = p_k J_{h−k}(F) + q_k J_{h−k}(O)
    J_{h−(k+1)}(F) = min{p_k J_{h−k}(F) + q_k J_{h−k}(O), c(k + 1)}
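The stage-by-stage recursion above can be iterated in a few lines. A sketch (the return layout, a dict keyed by the distance k from the cinema, is an illustrative choice):

```python
def parking(h, C, c, alpha):
    """Return {k: (J_F, J_O, park_if_free)} for k places from the cinema."""
    res = {}
    JF, JO = min(C, c), C                      # stage h-1: one place away
    res[1] = (JF, JO, c <= C)
    for k in range(1, h):                      # build stage h-(k+1): k+1 places away
        p = 1 - alpha ** k                     # next space free w.p. p_k
        cont = p * JF + (1 - p) * JO           # expected cost of driving on
        JO = cont
        JF = min(cont, c * (k + 1))            # if free: park now or continue
        res[k + 1] = (JF, JO, c * (k + 1) <= cont)
    return res
```

Running it with the slide's numbers (h = 4, C = 3, c = 1, α = 0.8) reproduces the values 1, 3, 2, 2.6 and 2.384, and the rule "park when one or two places away".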
For h = 4, C = 3, c = 1, α = 0.8, iterating the recursion gives J(F) = 1, J(O) = 3 one place away; J(F) = 2, J(O) = 2.6 two places away; and J(F) = J(O) = 2.384 three and four places away. Simple optimal rule: park at the first free space when one or two parking places away from the cinema.

[Figure: state diagram over stages 0 through 4 with the cost-to-go values and the optimal decisions (P / DP) indicated.]
Summary

Closed-loop policies can improve (in terms of expected cost) on open-loop policies, and even on the closed-loop policy obtained from the deterministic problem. After this lecture, you should be able to formulate discrete optimization problems with stochastic disturbances and solve them with the stochastic dynamic programming algorithm.
Appendix A: Stochastic dynamic programming for the inventory control example

Applying the dynamic programming algorithm to the inventory control example (see Lecture 1) with uncertain demand at stage 2, Prob[d_2 = 0] = 0.5, Prob[d_2 = 1] = 0.5, we obtain the costs-to-go and decisions shown in the diagram.

[Figure: trellis over stages 0 through 4 with states 1-4, the decisions u_0, …, u_3, and costs-to-go such as −45.45, −35.9, −25.5 at stage 0.]

The costs-to-go and decisions at the later stages match the ones obtained before in the deterministic case. At stage 2 we apply the dynamic programming algorithm considering the expected costs-to-go:

    0.5(0.4 − 14.4) + 0.5(−9.6 − 9.8) = −16.7
    0.5(0.9 − 19) + 0.5(10.9 − 23.6) = −15.4
    0.5(5.9 − 19) + 0.5(4.1 − 14.4) = −15.8

This shows that for state 0, u_2 = 2 (expected cost-to-go −16.7) is better than u_2 = 1, and it is actually the optimal decision.
Another stopping time problem: asset selling

When should one accept an offer for an asset, e.g., a house, in order to maximize the terminal expected revenue?*

Assumptions: offers w_0, w_1, …, w_{N−1} arrive at stages 0 through N − 1, with w_k ∈ {a_1, a_2, …, a_M} and Prob[w_k = a_i] = p_i; accepted money can be invested at interest rate r; the asset must be sold by stage N, so the last offer w_{N−1} is accepted if the asset is still unsold.

*see Bertsekas' book, sec. 4.4

Let x_k for k ≥ 1 equal the most recent offer w_{k−1}, and pick a terminal state x_k = T to denote that the house has already been sold. Let also u_k = 1 if one decides to accept offer w_{k−1}, k ≥ 1, and x_0 := 0 (zero initial offer). Then:

    x_{k+1} = T    if x_k = T, or if x_k ≠ T and u_k = 1 (sell)
              w_k  otherwise

and the expected selling price is

    E_{w_k}[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) ]

where

    g_N(x_N) = x_N if x_N ≠ T,  0 if x_N = T
    g_k(x_k, u_k, w_k) = (1 + r)^{N−k} x_k  if x_k ≠ T and u_k = 1 (sell),  0 otherwise.
Applying the dynamic programming algorithm we conclude that

    J_N(x_N) = x_N if x_N ≠ T,  0 if x_N = T

    J_k(x_k) = max[ (1 + r)^{N−k} x_k, E[J_{k+1}(w_k)] ]  if x_k ≠ T
               0                                          if x_k = T

Policy: accept the offer if x_k > α_k and reject it if x_k ≤ α_k, where

    α_k = E[J_{k+1}(w_k)] / (1 + r)^{N−k} = Σ_{i=1}^{M} p_i J_{k+1}(a_i) / (1 + r)^{N−k}.
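The thresholds α_k can be computed backwards from the offer distribution alone, since J_{k+1} only needs to be evaluated at the possible offers a_i. A sketch (the function name and argument layout are illustrative):

```python
def asset_thresholds(offers, probs, r, N):
    """Thresholds alpha_k: accept offer x_k iff x_k > alpha_k."""
    # E[J_N(w_{N-1})]: at the last stage every offer is worth its face value
    EJ = sum(p * a for a, p in zip(offers, probs))
    alphas = {}
    for k in range(N - 1, -1, -1):
        alphas[k] = EJ / (1 + r) ** (N - k)
        # E[J_k(w_{k-1})]: for each possible offer a, J_k(a) = max[(1+r)^{N-k} a, EJ]
        EJ = sum(p * max((1 + r) ** (N - k) * a, EJ)
                 for a, p in zip(offers, probs))
    return alphas
```

For r = 0 the thresholds are nonincreasing in k: with more stages remaining one can afford to be pickier, and as the deadline approaches the acceptable offer drops toward the mean.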