

SLIDE 1

4SC000 Q2 2017-2018

Optimal Control and Dynamic Programming

Duarte Antunes

SLIDE 2

Introduction

  • In several control problems only an output (a subset of the states) is available for feedback.
  • For example, when controlling the position of a mass on a surface using only position sensors, the state (which includes the velocity) is not fully known.
  • More generally, the output might not provide full information about the state, only partial information.
  • In this lecture we discuss Partially Observable Markov Decision Problems (POMDPs).

SLIDE 3

Outline

  • Formulation of POMDP
  • Bayes filter
  • Solving POMDP
SLIDE 4

Problem formulation

Dynamic model: x_{k+1} = f_k(x_k, u_k, w_k), with stochastic disturbances w_k and a stochastic initial state with known distribution Prob[x_0 = i] = p_{0,i}.

Output: y_k = h_k(x_k, n_k), with measurement noise n_k.

Information set: I_0 = (y_0), I_k = (y_0, y_1, ..., y_k, u_0, u_1, ..., u_{k-1}) for k ≥ 1, and policies of the form u_k = μ_k(I_k), π = {μ_0, ..., μ_{h−1}}.

Cost: J_π = E[ Σ_{k=0}^{h−1} g_k(x_k, μ_k(I_k), w_k) + g_h(x_h) ].

Problem: find a policy π that minimizes J_π.

The state, input and disturbances live in the sets defined in Lec 2 (slides 3, 16); the output and noise live in finite spaces, in general dependent on the state: if x_k = i, then y_k ∈ Y_k := {1, ..., q_{k,i}} and n_k ∈ N_k := {1, ..., ν_{k,i}}.
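As a minimal illustration of this formulation, the sketch below simulates one trajectory of a hypothetical 2-state POMDP. The maps f, h, g and the policy here are placeholders (not from the slides); the point is only that the policy acts on the information set I_k = (y_0, ..., y_k, u_0, ..., u_{k−1}), never on the state x_k directly.

```python
import random

def f(x, u, w):          # dynamics x_{k+1} = f_k(x_k, u_k, w_k) (hypothetical)
    return (x + u + w) % 2

def h(x, n):             # output y_k = h_k(x_k, n_k) (hypothetical)
    return x if n == 0 else 1 - x

def g(x, u):             # stage cost g_k(x_k, u_k) (hypothetical)
    return 1 if x == 1 else 0

def rollout(policy, horizon, seed=0):
    """Simulate one trajectory; the policy only ever sees the information set."""
    rng = random.Random(seed)
    x = rng.choice([0, 1])                 # x_0 drawn from a known distribution
    ys, us, cost = [], [], 0
    for k in range(horizon):
        y = h(x, rng.choice([0, 0, 1]))    # noisy measurement, n_k biased to 0
        ys.append(y)
        u = policy(ys, us)                 # u_k = mu_k(I_k), I_k = (ys, us)
        cost += g(x, u)
        x = f(x, u, rng.choice([0, 1]))    # stochastic disturbance w_k
        us.append(u)
    return cost

total = rollout(lambda ys, us: ys[-1], horizon=5)
```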

SLIDE 5

First approach

We can reformulate a partial-information optimal control problem as a standard full-information optimal control problem by taking the state to be the information set: I_0 = (y_0), I_1 = (y_0, y_1, u_0), I_2 = (y_0, y_1, y_2, u_0, u_1), and so on. The problem is that the dimension of this state space grows exponentially with the number of stages. [Figure: tree of information sets over stages 0–2 and the corresponding measurement spaces omitted.]

SLIDE 6

Second approach

It is possible to show that knowledge of the probability distribution of the state given the information obtained so far, denoted by P_{x_k|I_k}, is sufficient to determine optimal decisions (Bertsekas' book, Section 5.4): u_k = μ_k(P_{x_k|I_k}). [Figure: tree over stages 0–2 in probability space, starting from the initial state distribution P_{x_0|I_0} and branching on each decision and observation into the possible P_{x_1|I_1} and P_{x_2|I_2}, omitted.]

SLIDE 7

Decomposition of the optimal policy

The structure of the optimal decision-maker is then (see Bertsekas' book, Section 5.4): the dynamics x_{k+1} = f_k(x_k, u_k, w_k) produce the output y_k = h_k(x_k, n_k); a state estimator computes P_{x_k|I_k} from y_k and the delayed input u_{k−1}; and the optimal policy μ_k maps P_{x_k|I_k} to the action u_k applied by the actuator. That is, the optimal policy can be decomposed into an estimator of P_{x_k|I_k} and a map from P_{x_k|I_k} to actions.

SLIDE 8

Discussion

  • The first approach is typically impractical for applications.
  • The second approach is often used in robotics.
  • A crucial step in computing P_{x_k|I_k} is Bayes' rule, and the resulting state estimator is the Bayes filter.
  • The Bayes filter is important per se, and we will start by studying it.

SLIDE 9

Outline

  • Formulation of POMDP
  • Bayes filter
  • Solving POMDP
SLIDE 10

Bayes’ rule

The Bayes filter relies on the basic Bayes' rule. Suppose x ∈ {1, ..., n} and y ∈ {1, ..., m}. Then

Prob[x = i|y = j] = Prob[y = j|x = i] Prob[x = i] / Prob[y = j]
                  = Prob[y = j|x = i] Prob[x = i] / Σ_{ℓ=1}^{n} Prob[y = j|x = ℓ] Prob[x = ℓ]

  • Bayes' rule allows us to infer something about the state a posteriori of the sensor measurement, Prob[x = i|y = j], from the a priori information Prob[y = j|x = i] and Prob[x = i].
  • Think of y as a sensor measurement and of x as the state (see figure). Example: if y = 1, what can you tell about the state? [Figure: probability space over states x = 1, 2, 3 and measurements y = 1, 2 omitted.]

SLIDE 11

Example

An international student is deciding whether it is worthwhile to install a hobby alarm (using an ultrasound sensor close to the door), like her dorm mates, to detect burglars when she goes to her home country to spend Christmas. The alarm is faulty and characterised by

Prob[A|B] = 0.99    Prob[¬A|B] = 0.01    Prob[A|¬B] = 0.1    Prob[¬A|¬B] = 0.9

where B denotes "burglar breaks in" and A denotes "alarm goes off". Historical data reveals that 2 out of 100 rooms are robbed each Christmas, i.e., Prob[B] = 0.02, Prob[¬B] = 0.98. Her total belongings in the room amount to 2000 euros, and asking a security agency to check her room whenever she calls (after the alarm indicates a break-in) costs 300 euros; therefore she will only call if Prob[B|A] × 2000 > 300, or equivalently Prob[B|A] > 0.15. Shall she buy the alarm?

SLIDE 12

Computing Prob[B|A]

We can simply use the a priori probabilities:

Prob[B|A] = Prob[A|B] Prob[B] / Prob[A] = 0.99 × 0.02 / Prob[A]
Prob[¬B|A] = Prob[A|¬B] Prob[¬B] / Prob[A] = 0.1 × 0.98 / Prob[A]

and compute Prob[A] from the fact that Prob[B|A] + Prob[¬B|A] = 1:

Prob[A] = 0.99 × 0.02 + 0.1 × 0.98 = 0.1178

Thus Prob[B|A] = 0.0198/0.1178 ≈ 0.1681 > 0.15 (yes, buy!). Equivalently, we can directly apply the formula

Prob[B|A] = Prob[A|B] Prob[B] / (Prob[A|B] Prob[B] + Prob[A|¬B] Prob[¬B])
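The calculation above can be checked in a few lines; this is a direct transcription of the numbers in the example, with no assumptions beyond them.

```python
# Bayes' rule for the burglar-alarm example.
p_B, p_notB = 0.02, 0.98                 # prior: Prob[B], Prob[not B]
p_A_given_B, p_A_given_notB = 0.99, 0.1  # alarm characteristics

# Total probability: Prob[A] = Prob[A|B]Prob[B] + Prob[A|not B]Prob[not B]
p_A = p_A_given_B * p_B + p_A_given_notB * p_notB

# Bayes' rule: Prob[B|A] = Prob[A|B]Prob[B] / Prob[A]
p_B_given_A = p_A_given_B * p_B / p_A

# She calls only if Prob[B|A] * 2000 > 300, i.e. Prob[B|A] > 0.15 -> buy
```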

SLIDE 13

Historical note

Thomas Bayes (1701–1761) was an English statistician, philosopher and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem. Bayes never published what would eventually become his most famous accomplishment; his notes were edited and published after his death (source: Wikipedia).

SLIDE 14

Problem formulation

How do we find a state estimator that computes P_{x_k|I_k} from the measurements y_k and the (delayed) inputs u_{k−1}, for the model x_{k+1} = f_k(x_k, u_k, w_k), y_k = h_k(x_k, n_k)?

Note that P_{x_k|I_k} can be used to compute any quantity of interest about the state (e.g. mean, median, variance, etc.). In particular, a typical state estimate is the mean, obtained by making a equal to the identity in

E[a(x_k)|I_k] = Σ_{i=1}^{n} a(i) P_{x_k = i|I_k}

SLIDE 15

Preliminaries

Let us start by defining a different representation of the dynamic and output maps in terms of the following matrices, for each j ∈ {1, ..., m_k} and k ∈ {0, ..., h−1}:

  • state transition matrices P_k(j), of dimension n̄_k × n̄_k, with components [P_k(j)]_{ri} = Prob[x_{k+1} = r|x_k = i, u_k = j];
  • output matrices R_k, of dimension q_k × n̄_k, with components [R_k]_{ℓi} = Prob[y_k = ℓ|x_k = i].

For simplicity we will assume that the input, disturbance, output, and noise spaces do not change with the state (but can still depend on time): x_k ∈ {1, ..., n̄_k}, u_k ∈ {1, ..., m_k}, y_k ∈ {1, ..., q_k}, w_k ∈ {1, ..., ω_k}, n_k ∈ {1, ..., ν_k}.

SLIDE 16

Finding state transition matrices

Use:

[P_k(j)]_{ri} = Prob[x_{k+1} = r|x_k = i, u_k = j] = Σ_{ι : f_k(i,j,ι) = r} Prob[w_k = ι|x_k = i, u_k = j]

Example (f_k(x_k, u_k, w_k) with n̄_k = m_k = ω_k = 2, fixed k), with Prob[w_k = 1|x_k = i, u_k = j] = 0.8 for all i, j (and Prob[w_k = 2|x_k = i, u_k = j] = 0.2):

[P_k(1)]_{11} = Prob[w_k = 1|x_k = 1, u_k = 1] = 0.8
[P_k(2)]_{11} = Prob[w_k = 1|x_k = 1, u_k = 2] + Prob[w_k = 2|x_k = 1, u_k = 2] = 1

P_k(1) = [0.8 0.2; 0.2 0.8]    P_k(2) = [1 1; 0 0]

(the second row of P_k(2) follows since each column must sum to one). [Figure: transition diagram between states x_k = 1 and x_k = 2 for each input and disturbance value omitted.]
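The summation formula above translates directly into code. The sketch below is a hypothetical reconstruction of this example: the dynamics f are chosen to be consistent with the matrices shown (under u = 1, disturbance w = 1 keeps the state and w = 2 flips it; u = 2 always drives the system to state 1).

```python
def f(i, j, iota):
    """Hypothetical dynamics f_k(x, u, w) consistent with the example matrices."""
    if j == 1:
        return i if iota == 1 else 3 - i   # w = 2 flips between states 1 and 2
    return 1                               # u = 2 always goes to state 1

w_prob = {1: 0.8, 2: 0.2}                  # Prob[w_k = iota | x_k, u_k], state/input independent
n_states = 2

def transition_matrix(j):
    # [P(j)]_{ri} = sum of Prob[w = iota] over all iota with f(i, j, iota) = r
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(1, n_states + 1):
        for iota, prob in w_prob.items():
            r = f(i, j, iota)
            P[r - 1][i - 1] += prob
    return P

P1 = transition_matrix(1)   # [[0.8, 0.2], [0.2, 0.8]]
P2 = transition_matrix(2)   # [[1.0, 1.0], [0.0, 0.0]]
```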

SLIDE 17

Finding output matrices

Use:

[R_k]_{ℓi} = Prob[y_k = ℓ|x_k = i] = Σ_{τ : h_k(i,τ) = ℓ} Prob[n_k = τ|x_k = i]

Example (h_k(x_k, n_k) with n̄_k = q_k = ν_k = 2, fixed k), with Prob[n_k = 1|x_k = i] = 0.6 and Prob[n_k = 2|x_k = i] = 0.4 for all i:

[R_k]_{11} = Prob[n_k = 1|x_k = 1] = 0.6
[R_k]_{12} = Prob[n_k = 1|x_k = 2] + Prob[n_k = 2|x_k = 2] = 1

R_k = [0.6 1; 0.4 0]

[Figure: output diagram for states x_k = 1 and x_k = 2 omitted.]

SLIDE 18

Bayes’ filter

P_{x_k|I_k} is represented by the vector p_k = [p_{k,1} ... p_{k,n̄_k}]^T with p_{k,i} = Prob[x_k = i|I_k]. The state estimator processes u_{k−1} and y_k in two steps:

Prediction:    p̄_{k+1} = P_k(u_k) p_k
Correction (update):    q_{k+1} = D(y_{k+1}) p̄_{k+1},    p_{k+1} = q_{k+1} / (1^T q_{k+1})

where D(y_k) = diag([R_k]_{y_k 1}, ..., [R_k]_{y_k n}) is the diagonal matrix with the y_k-th row of R_k on the diagonal.

Initial condition: q_0 = D(y_0) p̃_0, p_0 = q_0 / (1^T q_0), where p̃_{0,i} = Prob[x_0 = i].
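A minimal sketch of this recursion in pure Python, with 0-indexed states and outputs; the transition matrix reuses P_k(1) from the transition-matrix example and R reuses the output-matrix example.

```python
def predict(p, P_u):
    # pbar_{k+1} = P_k(u_k) p_k
    n = len(p)
    return [sum(P_u[r][i] * p[i] for i in range(n)) for r in range(n)]

def correct(pbar, R, y):
    # q = D(y) pbar (elementwise product with row y of R), then normalise
    q = [R[y][i] * pbar[i] for i in range(len(pbar))]
    s = sum(q)
    return [qi / s for qi in q]

P1 = [[0.8, 0.2], [0.2, 0.8]]   # transition matrix for the applied input
R = [[0.6, 1.0], [0.4, 0.0]]    # [R]_{yi} = Prob[y | x = i], rows are outputs

p = [0.5, 0.5]                  # current belief p_k
pbar = predict(p, P1)           # prediction step
p_new = correct(pbar, R, 0)     # correction step after observing y = 0
```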

SLIDE 19

Derivation of the Bayes’ filter (I)

Define p̄_{k,i} = Prob[x_k = i|I_{k−1}, u_{k−1}] and p̄_k = [p̄_{k,1} ... p̄_{k,n̄_k}]^T.

Prediction step. Then

p̄_{k+1,i} = Prob[x_{k+1} = i|I_k, u_k]
= Σ_{τ=1}^{n} Prob[x_{k+1} = i|x_k = τ, I_k, u_k] Prob[x_k = τ|I_k, u_k]    (condition on x_k)
= Σ_{τ=1}^{n} Prob[x_{k+1} = i|x_k = τ, I_k, u_k] Prob[x_k = τ|I_k]    (x_k independent of u_k)
= Σ_{τ=1}^{n} Prob[x_{k+1} = i|x_k = τ, u_k] Prob[x_k = τ|I_k]    (Markov property)
= Σ_{τ=1}^{n} [P_k(u_k)]_{iτ} p_{k,τ}

In matrix form: p̄_{k+1} = P_k(u_k) p_k.

SLIDE 20

Derivation of the Bayes' filter (II)

Correction step. Suppose there is a measurement y_{k+1} = θ. Then

p_{k+1,i} = Prob[x_{k+1} = i|I_k, y_{k+1} = θ, u_k]
= α Prob[y_{k+1} = θ|I_k, x_{k+1} = i, u_k] Prob[x_{k+1} = i|I_k, u_k]    (Bayes' rule*)
= α Prob[y_{k+1} = θ|x_{k+1} = i] Prob[x_{k+1} = i|I_k, u_k]    (Markov property)

where α = 1/Prob[y_{k+1} = θ|I_k, u_k]. Define q_{k+1,i} = Prob[y_{k+1} = θ|x_{k+1} = i] Prob[x_{k+1} = i|I_k, u_k], so that in matrix form p_{k+1} = α q_{k+1} with q_{k+1} = D(y_{k+1}) p̄_{k+1}. Since the probability vector must add up to one, α = 1/(1^T q_{k+1}) and

p_{k+1} = q_{k+1} / (1^T q_{k+1})

*Bayes' rule holds when conditioning on a third variable:
Prob[x = i|y = j, z = r] = Prob[y = j|x = i, z = r] Prob[x = i|z = r] / Prob[y = j|z = r]

SLIDE 21

Example

We wish to estimate the position of a robot in a given environment. We assign a label i ∈ {1, ..., n} to each cell in the environment, and x_k = i indicates that the robot's position coincides with the centre of cell i. The robot moves according to a given control policy (autonomous system† x_{k+1} = f(x_k, w_k), with disturbances w_k). The derivations of the Bayes filter can be easily specialised to this case (no control input). [Figure: grid environment with labelled cells omitted.]

SLIDE 22

Vector field


Vector field with no disturbances

SLIDE 23

Disturbance model

At each step k, for a given state x_k, the new state x_{k+1} = f(x_k, w_k) is uncertain and can take values in a neighbourhood of the state obtained with the deterministic model x_{k+1} = f(x_k, 0) (no disturbances), according to the probability values given in a grid centred at the deterministic next cell: probability p_1 = 0.65 for the deterministic next cell itself, and probabilities p_2 = 0.2, p_3 = 0.1 and p_4 = 0.05 split equally among the 8, 16 and 24 cells of the first, second and third surrounding rings, respectively. [Grid figure omitted.]

SLIDE 24

Sensor model I

At each step k, for a given state x_k, the measurement y_k = h(x_k, n_k) is uncertain and can take values in a neighbourhood of the state x_k, according to the probability values given in a grid centred at the true cell: probability q_1 = 0.5 for the true cell itself, and probabilities q_2 = 0.2, q_3 = 0.2 and q_4 = 0.1 split equally among the 8, 16 and 24 cells of the first, second and third surrounding rings, respectively. [Grid figure omitted.]

SLIDE 25

Sensor model II

At each step k, for a given state x_k, the measurement y_k = h(x_k) is deterministic and indicates whether there are objects/walls (represented in yellow) within a Euclidean distance, a multiple M = 5 of the length of each cell, of the robot. If so, it indicates whether the closest object/wall is to the left (L), right (R), up (U) or down (D); otherwise it indicates no object close (NO). [Figure: example positions/states and the corresponding measurements omitted.]

SLIDE 26

Results

Video: LEC4nodisturbances.mp4

SLIDE 27

Results

Video: LEC4sensor1.mp4

SLIDE 28

Results

Video: LEC4sensor2.mp4

SLIDE 29

Results

Video: LEC4sensor2_2.mp4

SLIDE 30

Outline

  • Formulation of POMDP
  • Bayes filter
  • Solving POMDP
SLIDE 31

Solution of a POMDP

  • The POMDP problem we introduced can always be solved exactly.
  • We will first study two examples (show host, and machine repair) and then establish this general fact.
  • In fact, as we shall see, the costs-to-go take the general form of piecewise affine functions of the probability distribution of the state.
  • However, the complexity of these functions typically grows exponentially with the number of stages, and therefore this result is mostly of theoretical interest. In practice, heuristics are used to solve POMDPs.

SLIDE 32

Monty Hall problem

  • The first approach is typically impractical for applications.
  • The second approach is often used in robotics (google POMDP) and will also be used to derive LQG control.
  • A nice example of these two approaches is given in Bertsekas' book, Chapter 5: machine repair, Sec. 5.4, and Example 5.4.2.
  • For illustration purposes we solve the Monty Hall problem using the second approach (the first approach would actually be similar).

SLIDE 33

Example: Monty Hall problem

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to switch your choice?" — Whitaker, Craig F. (9 September 1990), Parade Magazine: 16 (source: Wikipedia).

[Figure: the three doors over stages 0–2, with decision 1 and decision 2, omitted.]

SLIDE 34

Formulation

Dynamic model: there are two decision stages, with decisions u_0 ∈ {1, 2, 3} and u_1 ∈ {1, 2, 3} (the doors selected, acting as control input), and k ∈ {0, 1, 2}. The state has two "components", x_k = (x̄_k, x̃_k):

  • x̄_k ∈ {1, 2, 3} is the unknown door where the car is: no disturbances, x̄_{k+1} = x̄_k, with Prob[x̄_0 = 1] = Prob[x̄_0 = 2] = Prob[x̄_0 = 3] = 1/3;
  • x̃_k ∈ {1, 2, 3} keeps track of the previous decision: x̃_{k+1} = u_k, with x̃_0 = 1 (not relevant).

Formally the overall state x_k ∈ {1, 2, ..., 9} corresponds to the nine possible combinations of (x̄_k, x̃_k); note that from x_k we can extract x̄_k, x̃_k and vice-versa.

SLIDE 35

Formulation

Cost:

g_0(x_0, u_0) = 0,    g_1(x_1, u_1) = −1 if x̄_1 = u_1 and 0 otherwise,    g_2(x_2) = 0.

Output and information sets: y_0 = ∅, I_0 = (∅), I_1 = (y_1, u_0), and

y_1 = h_1(x_1, n_1) = {1, 2, 3} \ {x̄_1, x̃_1} if x̄_1 ≠ x̃_1;    {1, 2, 3} \ {x̄_1, n_1} if x̄_1 = x̃_1,

where n_1 is a random variable taking one of the two values in the set {1, 2, 3} \ {x̄_1} with equal probability (1/2).

SLIDE 36

Conditional probability

  • To apply DP, start at stage 1 and compute P_{x_1|I_1} for every possible value of I_1 = (y_1, u_0).
  • Given I_1 it is trivial to compute x̃_1 = u_0, and therefore it suffices to compute P_{x̄_1|I_1}.
  • Let us do this for (y_1, u_0) = (3, 1). It is obvious that Prob[x̄_1 = 3|(y_1, u_0) = (3, 1)] = 0. How to compute Prob[x̄_1 = 1|(y_1, u_0) = (3, 1)] and Prob[x̄_1 = 2|(y_1, u_0) = (3, 1)]? Bayes' rule!

SLIDE 37

Computing P_{x̄_1|I_1}

Prob[x̄_1 = 1|y_1 = 3, u_0 = 1] = α Prob[y_1 = 3|x̄_1 = 1, u_0 = 1] Prob[x̄_1 = 1|u_0 = 1] = α × 1/2 × 1/3 = 1/3
Prob[x̄_1 = 2|y_1 = 3, u_0 = 1] = α Prob[y_1 = 3|x̄_1 = 2, u_0 = 1] Prob[x̄_1 = 2|u_0 = 1] = α × 1 × 1/3 = 2/3

where

α = 1/Prob[y_1 = 3|u_0 = 1] = 1/(1/2 × 1/3 + 1 × 1/3) = 2.

Hence P_{x̄_1|I_1 = (3,1)} = (1/3, 2/3, 0).

SLIDE 38

Optimal decision

For (y_1, u_0) = (3, 1) we have P_{x̄_1|I_1} = (1/3, 2/3, 0). The optimal decision is the one that minimizes E[g_1(x_1, u_1)], where g_1(x_1, u_1) = −1 if x̄_1 = u_1 and 0 otherwise:

E[g_1(x̄_1, u_1)] = −1 × Prob[x̄_1 = u_1] + 0 × Prob[x̄_1 ≠ u_1] = −Prob[x̄_1 = u_1].

The minimum is achieved for u_1 = 2 (switch) and is given by −Prob[x̄_1 = 2] = −2/3.

SLIDE 39

Optimal policy

For every information set I_1 = (y_1, u_0) the same computation gives a posterior P_{x̄_1|I_1} that assigns probability 2/3 to the remaining unopened, unpicked door, 1/3 to the picked door, and 0 to the opened door, so the cost-to-go equals −2/3 in every case. The optimal policy is therefore to pick any door at stage 0 and always switch at stage 1, and the probability of winning is 2/3. [Figure: decision tree over stages 0–1 showing the six information sets I_1 = (y_1, u_0) and their posteriors omitted.]
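The stage-1 posterior can be checked by enumeration. The sketch below encodes the host's behaviour directly (he never opens the picked door or the car door, and chooses between the two goat doors with probability 1/2 when the pick is correct, matching h_1 above) and applies Bayes' rule:

```python
def host_prob(y, car, pick):
    """Prob[y_1 = y | xbar_1 = car, u_0 = pick]: the host opens a goat door,
    never the picked door; a fair coin decides between two goats if pick == car."""
    if y == pick or y == car:
        return 0.0
    return 0.5 if pick == car else 1.0

def posterior(y, pick):
    # Bayes' rule with the uniform prior Prob[xbar_1 = car] = 1/3
    prior = {1: 1/3, 2: 1/3, 3: 1/3}
    joint = {car: host_prob(y, car, pick) * prior[car] for car in (1, 2, 3)}
    total = sum(joint.values())
    return {car: p / total for car, p in joint.items()}

post = posterior(y=3, pick=1)   # P_{xbar_1 | I_1 = (3, 1)}
```

Switching to door 2 wins with probability post[2] = 2/3, matching the slide.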

SLIDE 40

Example: Machine repair*

  • A machine can be in one of two states: P (proper condition) or P̄ (improper condition).
  • In each time period, if it starts at P̄ it stays in P̄; if it starts at P it stays at P with probability 2/3 and moves to P̄ with probability 1/3.
  • It operates over 3 periods and initially it is in state P.
  • At the end of the first and second periods there is an inspection, with possible outcomes positive (G) and negative (B), with Prob[G|P] = Prob[B|P̄] = 3/4 and Prob[B|P] = Prob[G|P̄] = 1/4.
  • There are two possible actions after each inspection: continue operation (C) or stop and repair (S); continue does not change the state and repair changes the state to P if in P̄.
  • Action C has no direct cost and action S has a cost 1; each period that the state is P̄ and operation continues incurs a cost 2.

[Figure: decision tree over the two inspections and the repair/do-not-repair decisions omitted.]

*[Bertsekas' book, Section 5.1]

SLIDE 41

Dynamic programming formulation

x_k ∈ {P, P̄}, u_k ∈ {C, S}, k ∈ {0, 1}.

Dynamics: x_{k+1} = w_k, with

Prob[w_k = P|x_k = P, u_k = C] = 2/3    Prob[w_k = P|x_k = P̄, u_k = C] = 0
Prob[w_k = P|x_k = P, u_k = S] = 2/3    Prob[w_k = P|x_k = P̄, u_k = S] = 2/3
Prob[w_k = P̄| ... ] = 1 − Prob[w_k = P| ... ]

Initial distribution: Prob[x_0 = P] = 2/3, Prob[x_0 = P̄] = 1/3.

Measurements: y_k = v_k, with

Prob[v_k = G|x_k = P] = 3/4    Prob[v_k = G|x_k = P̄] = 1/4    Prob[v_k = B| ... ] = 1 − Prob[v_k = G| ... ]

Cost: g(x_0, u_0) + g(x_1, u_1), with g(P, C) = 0, g(P, S) = 1, g(P̄, S) = 1, g(P̄, C) = 2.

Information sets: I_0 = (y_0), I_1 = (y_0, y_1, u_0).

Goal: find μ_0(I_0), μ_1(I_1) to minimize E[g(x_0, μ_0(I_0)) + g(x_1, μ_1(I_1))].

SLIDE 42

Dynamic programming algorithm

At the last decision stage k = 1, assume that we know p_1 = Prob[x_1 = P̄|I_1] (then Prob[x_1 = P|I_1] = 1 − p_1). For a given "state" p_1 the cost-to-go is

J_1 = min_{u_1} E[g(x_1, u_1)|I_1]

  • For u_1 = S the cost-to-go is 1, since g(P, S) = g(P̄, S) = 1.
  • For u_1 = C the cost-to-go is 2p_1, since E[g(x_1, C)|I_1] = g(P̄, C) Prob[x_1 = P̄|I_1] + g(P, C) Prob[x_1 = P|I_1] = 2p_1 + 0(1 − p_1) = 2p_1.

Therefore

J_1(p_1) = min(2p_1, 1)    μ_1(p_1) = S if p_1 > 1/2, C if p_1 ≤ 1/2.

SLIDE 43

Dynamic programming algorithm

At the first decision stage k = 0, assume that we know p_0 = Prob[x_0 = P̄|I_0]. To determine how the decision u_0 (S or C, taken after the 1st inspection) influences the cost-to-go J_1(p_1) = min(2p_1, 1), we need to understand how the state probability distribution at k = 1, parameterised by p_1, depends on u_0, taking into account that there will be a measurement y_1.

SLIDE 44

Dynamic programming algorithm

37

There are then four options, each leading to a probability distribution characterised by

p1 = Φ0(p0, u0, y1) =                        1 7 if u0 = S, y1 = G 3 5 if u0 = S, y1 = B 1 + 2p0 7 − 4p0 if u0 = C, y1 = G 3 + 6p0 5 + 4p0 if u0 = C, y1 = B

For example for , is given by u0 = S

1/4 3/4 1/3 2/3 p1 = Prob[x1 = ¯ P|u0 = S, y1 = G] Prob[y1 = G|u0 = S, x1 = ¯ P]Prob[x1 = ¯ P|u0 = S] Prob[y1 = G|u0 = S, x1 = ¯ P]Prob[x1 = ¯ P|u0 = S] + Prob[y1 = G|u0 = S, x1 = P]Prob[x1 = P|u0 = S]

y1 = G =

= 1/7 p1

|{z} |{z} |{z} |{z}

Other example

1/4 = Prob[y1 = B|u0 = C, x1 = ¯ P]Prob[x1 = ¯ P|u0 = C] Prob[y1 = B|u0 = C, x1 = ¯ P]Prob[x1 = ¯ P|u0 = C] + Prob[y1 = B|u0 = C, x1 = P]Prob[x1 = P|u0 = C] 3/4 p0 + (1 − p0)/3 (1 − p0)2/3 = 3 + 6p0 5 + 4p0 p1

|{z} |{z} |{z} |{z}

u0 = C y1 = B

SLIDE 45

Dynamic programming algorithm

Therefore

Prob[y_1 = G|p_0, u_0 = S] = 7/12    Prob[y_1 = B|p_0, u_0 = S] = 5/12
Prob[y_1 = G|p_0, u_0 = C] = (7 − 4p_0)/12    Prob[y_1 = B|p_0, u_0 = C] = (5 + 4p_0)/12

and

J_0(p_0) = min[ 2p_0 + Prob[y_1 = G|p_0, u_0 = C] J_1(Φ_0(p_0, C, G)) + Prob[y_1 = B|p_0, u_0 = C] J_1(Φ_0(p_0, C, B)),
                1 + Prob[y_1 = G|p_0, u_0 = S] J_1(Φ_0(p_0, S, G)) + Prob[y_1 = B|p_0, u_0 = S] J_1(Φ_0(p_0, S, B)) ]
         = min[ 2p_0 + ((7 − 4p_0)/12) J_1((1 + 2p_0)/(7 − 4p_0)) + ((5 + 4p_0)/12) J_1((3 + 6p_0)/(5 + 4p_0)),
                1 + (7/12) J_1(1/7) + (5/12) J_1(3/5) ]

which can be simplified to

J_0(p_0) = (7 + 32p_0)/12 if 0 ≤ p_0 ≤ 3/8,    19/12 if 3/8 ≤ p_0 ≤ 1

Optimal policy: μ_0(p_0) = C if p_0 ≤ 3/8, S if p_0 > 3/8.
SLIDE 46

General case

Dynamic model x_{k+1} = f_k(x_k, u_k, w_k), output y_k = h_k(x_k, n_k), information sets I_0 = (y_0), I_k = (y_0, y_1, ..., y_k, u_0, u_1, ..., u_{k−1}) for k ≥ 1, and cost

J_π = E[ Σ_{k=0}^{h−1} g_k(x_k, μ_k(I_k), w_k) + g_h(x_h) ].

Then there exist n-dimensional (row) vectors α_k^1, α_k^2, ..., α_k^{a_k} for each k such that

J_h(p_h) = α_h^1 p_h
J_k(p_k) = min{α_k^1 p_k, α_k^2 p_k, ..., α_k^{a_k} p_k}
μ_k(p_k) = arg min_{u_k} E[g_k(x_k, u_k) + J_{k+1}(p_{k+1})]

SLIDE 47

Justification

For simplicity assume that the running cost is g(x_k, u_k), that it does not depend on time and disturbances, and that the state, input and output live in fixed sets over time: x_k ∈ {1, ..., n}, u_k ∈ {1, ..., m}, y_k ∈ {1, ..., q}.

The statement is true for the last stage since, defining α_h^1 = [g_h(1) g_h(2) ... g_h(n)],

J_h(p_h) = E[g_h(x_h)|I_h] = Σ_{i=1}^{n} g_h(i) Prob[x_h = i|I_h] = α_h^1 p_h

Assuming it is true for k + 1, note that

J_k(p_k) = min_{u_k} E[g(x_k, u_k) + J_{k+1}(p_{k+1})|I_k]
         = min_{u_k} ᾱ(u_k) p_k + E[J_{k+1}(p_{k+1})|I_k]    (where ᾱ(u_k) = [g(1, u_k) ... g(n, u_k)])
         = min_{u_k} ᾱ(u_k) p_k + Σ_{ℓ=1}^{q} E[J_{k+1}(p_{k+1})|I_k, y_{k+1} = ℓ] Prob[y_{k+1} = ℓ|I_k]    (condition on y_{k+1})

From the Bayes filter expressions (Slide 18), p_{k+1} = q_{k+1}/Prob[y_{k+1} = ℓ|I_k], where q_{k+1} = D(y_{k+1}) P_k(u_k) p_k.

SLIDE 48

Justification

Replacing this expression, and using the induction hypothesis

J_{k+1}(p_{k+1}) = min{α_{k+1}^1 p_{k+1}, α_{k+1}^2 p_{k+1}, ..., α_{k+1}^{a_{k+1}} p_{k+1}},

we obtain (the normalisation factor Prob[y_{k+1} = ℓ|I_k] cancels)

J_k(p_k) = min_{u_k ∈ {1,...,m}} ᾱ(u_k) p_k + Σ_{ℓ=1}^{q} min{α_{k+1}^1 D(ℓ) P_k(u_k) p_k, α_{k+1}^2 D(ℓ) P_k(u_k) p_k, ..., α_{k+1}^{a_{k+1}} D(ℓ) P_k(u_k) p_k}

which can be written as J_k(p_k) = min{α_k^1 p_k, α_k^2 p_k, ..., α_k^{a_k} p_k} for some α_k^1, ..., α_k^{a_k}.

SLIDE 49

Concluding remarks

Summary

  • The optimal control structure for a POMDP can be divided into a state estimator, computing the probability distribution of the state, and a decision maker.
  • The state estimator relies on the Bayes filter, which is interesting per se in several contexts.
  • POMDPs can always be solved exactly, although the complexity of the value functions typically increases exponentially with the time horizon.

After this lecture, you should be able to:

  • Apply the Bayes filter.
  • Explicitly solve POMDPs with a small horizon.