SLIDE 1

Module 8 Linear Programming

CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

SLIDE 2

CS886 (c) 2013 Pascal Poupart

Policy Optimization

  • Value and policy iteration

– Iterative algorithms that implicitly solve an optimization problem

  • Can we explicitly write down this optimization problem?

– Yes, it can be formulated as a linear program

SLIDE 3

Primal Linear Program

  • Variables: $V(s)\ \forall s$
  • Objective: $\min_V \sum_s w(s)\,V(s)$, where $w(s)$ is a weight assigned to state $s$
  • Constraints:

$$V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s') \quad \forall s,a$$

primalLP(MDP)

$$\min_V \sum_s w(s)\,V(s)$$

subject to

$$V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s') \quad \forall s,a$$

return $V$
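As a minimal sketch of primalLP on a concrete problem, the Python below builds the constraints for a made-up two-state, two-action MDP and solves the resulting LP by brute force. The rewards `R`, transitions `P`, weights `w`, `gamma = 0.9` and all function names are illustrative assumptions, not part of the course material. The solver intersects every pair of constraints and keeps the feasible corner with the lowest objective, which only works because there are two variables; a real LP package would be used in practice.

```python
from itertools import combinations

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
gamma = 0.9
R = [[0.0, 1.0],                  # R[s][a]
     [2.0, 0.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],    # P[s][a][s'] = Pr(s' | s, a)
     [[0.0, 1.0], [1.0, 0.0]]]
w = [0.5, 0.5]                    # positive weight on every state


def primal_constraints(R, P, gamma):
    """One row per (s, a): sum_s' (1[s'=s] - gamma*P[s][a][s']) * V(s') >= R[s][a]."""
    n = len(R)
    rows = []
    for s in range(n):
        for a in range(len(R[s])):
            coeffs = [(1.0 if s2 == s else 0.0) - gamma * P[s][a][s2]
                      for s2 in range(n)]
            rows.append((coeffs, R[s][a]))
    return rows


def solve_two_state_lp(rows, w, tol=1e-9):
    """Brute-force corner search for min w.V subject to rows (2 variables only)."""
    best = None
    for (c1, b1), (c2, b2) in combinations(rows, 2):
        det = c1[0] * c2[1] - c1[1] * c2[0]
        if abs(det) < tol:
            continue  # parallel constraints: no unique intersection point
        # Cramer's rule for the 2x2 system c1.V = b1, c2.V = b2
        V = [(b1 * c2[1] - c1[1] * b2) / det,
             (c1[0] * b2 - b1 * c2[0]) / det]
        feasible = all(c[0] * V[0] + c[1] * V[1] >= b - tol for c, b in rows)
        if feasible:
            value = w[0] * V[0] + w[1] * V[1]
            if best is None or value < best[0]:
                best = (value, V)
    return best


objective, V = solve_two_state_lp(primal_constraints(R, P, gamma), w)
print(V, objective)
```

For this toy problem the minimizing corner is approximately V = (19, 20), with objective about 19.5.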

SLIDE 4

Objective

  • Why do we minimize a weighted combination of the values? Shouldn't we maximize value?
  • Value functions $V$ that satisfy the constraints are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s)\ \forall s$
  • Minimizing value ensures that we choose the lowest upper bound:

$$\min_V V(s) = V^*(s) \quad \forall s$$

SLIDE 5

Upper bound

  • Theorem: Value functions $V$ that satisfy

$$V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s') \quad \forall s,a$$

are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s)\ \forall s$

  • Proof:

– Since $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s')\ \forall s,a$

– Then $V(s) \ge \max_a \left[ R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s') \right] = H^*(V)(s)\ \forall s$

– Furthermore, because $H^*$ is monotone, $V \ge H^*(V) \ge H^*(H^*(V)) \ge \dots \ge (H^*)^{\infty}(V) = V^*$

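The chain of inequalities in the upper-bound proof can be checked numerically. The sketch below uses a made-up two-state, two-action MDP (all numbers and names are assumptions for illustration): it starts from a value function that satisfies the LP constraints and applies the Bellman operator $H^*$ repeatedly, recording every iterate.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
R = [[0.0, 1.0], [2.0, 0.0]]                     # R[s][a]
P = [[[1.0, 0.0], [0.0, 1.0]],                   # P[s][a][s'] = Pr(s' | s, a)
     [[0.0, 1.0], [1.0, 0.0]]]


def backup(V):
    """Bellman optimality operator: (H* V)(s) = max_a R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')."""
    n = len(V)
    return [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(len(R[s])))
            for s in range(n)]


V = [30.0, 30.0]            # satisfies V >= H*(V), so it is feasible for the LP
iterates = [V]
for _ in range(500):
    V = backup(V)
    iterates.append(V)

# Each backup can only lower a feasible V (up to rounding), and the limit is V*.
monotone = all(new[s] <= old[s] + 1e-12
               for old, new in zip(iterates, iterates[1:])
               for s in range(2))
print(monotone, iterates[-1])
```

Every iterate stays at or above the limit, which for this toy problem is approximately (19, 20).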
SLIDE 6

Weight function (initial state)

  • How do we choose the weight function?
  • If the policy always starts in the same initial state $s_0$, then set

$$w(s) = \begin{cases} 1 & s = s_0 \\ 0 & \text{otherwise} \end{cases}$$

  • This ensures that the optimal objective value is $\sum_s w(s)\,V(s) = V^*(s_0)$

SLIDE 7

Weight function (any state)

  • If the policy may start in any state, then assign a positive weight to each state, i.e. $w(s) > 0\ \forall s$
  • This ensures that $V$ is minimized at each $s$ and therefore $V(s) = V^*(s)\ \forall s$
  • The magnitude of the weight doesn't matter when the LP is solved exactly. We will revisit the choice of $w(s)$ when we discuss approximate linear programming.
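One way to see why the magnitudes do not matter is sketched below. It uses only the upper-bound theorem and the fact that $V^*$ itself satisfies the constraints, since it satisfies the Bellman optimality equation.

```latex
% For any feasible V we have V(s) >= V*(s) for all s, and V* is itself feasible.
\sum_s w(s)\,V(s) \;-\; \sum_s w(s)\,V^*(s)
  \;=\; \sum_s \underbrace{w(s)}_{>\,0}\,\underbrace{\bigl(V(s) - V^*(s)\bigr)}_{\ge\, 0}
  \;\ge\; 0
% Equality holds only if V(s) = V*(s) for every s, whatever positive weights are used.
```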

SLIDE 8

Optimal Policy

  • Linear program finds $V^*$
  • We can extract $\pi^*$ from $V^*$ as usual:

$$\pi^*(s) \leftarrow \operatorname{argmax}_a \; R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V^*(s')$$

  • Or check the active constraints

– For each $s$, check which $a^*$ leads to equality, $V(s) = R(s,a^*) + \gamma \sum_{s'} \Pr(s' \mid s,a^*)\,V(s')$, among the constraints $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\,V(s')\ \forall a$

– Set $\pi^*(s) \leftarrow a^*$

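Both ways of extracting the policy can be sketched in a few lines of Python. The two-state MDP, its optimal values `V_star` and the function names are assumptions made up for illustration; on this example the two routes agree.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
R = [[0.0, 1.0], [2.0, 0.0]]                     # R[s][a]
P = [[[1.0, 0.0], [0.0, 1.0]],                   # P[s][a][s'] = Pr(s' | s, a)
     [[0.0, 1.0], [1.0, 0.0]]]
V_star = [19.0, 20.0]    # optimal values of this toy MDP (assumed already computed)


def q_value(s, a, V):
    """Right-hand side of the LP constraint for the pair (s, a)."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))


def greedy_policy(V):
    """pi*(s) = argmax_a R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')."""
    return [max(range(len(R[s])), key=lambda a: q_value(s, a, V))
            for s in range(len(V))]


def active_constraint_policy(V, tol=1e-9):
    """For each s, pick an action whose constraint holds with equality."""
    policy = []
    for s in range(len(V)):
        tight = [a for a in range(len(R[s])) if abs(V[s] - q_value(s, a, V)) <= tol]
        if not tight:
            raise ValueError(f"no active constraint at state {s}; V is not optimal")
        policy.append(tight[0])
    return policy


print(greedy_policy(V_star), active_constraint_policy(V_star))
```

The tolerance `tol` is needed because an LP solver returns values that satisfy the active constraints only up to numerical precision.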
SLIDE 9

Direct Policy Optimization

  • The optimal solution to the primal linear program is $V^*$, but we still have to extract $\pi^*$
  • Could we directly optimize $\pi$?

– Yes, by considering the dual linear program

SLIDE 10

Dual Linear Program

  • Variables: $y(s,a)\ \forall s,a$

– frequency of each $(s,a)$-pair (proportional to $\pi$)

  • Objective: $\max_y \sum_{s,a} y(s,a)\,R(s,a)$
  • Constraints:

$$\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\,y(s,a) \quad \forall s'$$

dualLP(MDP)

$$\max_y \sum_{s,a} y(s,a)\,R(s,a)$$

subject to

$$\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\,y(s,a) \quad \forall s'$$

$$y(s,a) \ge 0 \quad \forall s,a$$

Let $\pi(a \mid s) = \Pr(a \mid s) = y(s,a) / \sum_{a'} y(s,a')$

return $\pi$

SLIDE 11

Duality

  • For every primal linear program in the form $\min_x c^T x$ s.t. $Ax \ge b$
  • There is an equivalent dual linear program in the form $\max_y b^T y$ s.t. $A^T y = c$ and $y \ge 0$
  • Where $\min_x c^T x = \max_y b^T y$

Interpretation: $c = w$, $x = V$, $y \propto \pi$, $A = [I - \gamma T^a]\ \forall a$, $b = [R^a]\ \forall a$
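To make the interpretation concrete, here is a sketch of the stacked matrices for the case of two actions $a_1, a_2$. The two-action case, the block notation $y_a$ for the vector of $y(s,a)$ over states, and the convention that $T^a$ has entries $T^a_{ss'} = \Pr(s' \mid s,a)$ are assumptions made for illustration.

```latex
A = \begin{bmatrix} I - \gamma T^{a_1} \\ I - \gamma T^{a_2} \end{bmatrix},
\qquad
b = \begin{bmatrix} R^{a_1} \\ R^{a_2} \end{bmatrix},
\qquad
y = \begin{bmatrix} y_{a_1} \\ y_{a_2} \end{bmatrix}

% Primal constraint Ax >= b, one block row per action:
(I - \gamma T^{a})\, V \ge R^{a} \quad \forall a

% Dual constraint A^T y = c, written out entry by entry:
\sum_{a} (I - \gamma T^{a})^{T} y_{a} = w
\;\;\Longleftrightarrow\;\;
\sum_{a'} y(s', a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s, a)\, y(s, a) \quad \forall s'
```

The primal variables $V$ are unrestricted in sign, which is why the dual constraints are equalities.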

SLIDE 12

State Frequency

  • Let $f(s)$ be the frequency of $s$ under policy $\pi$.

0 steps: $f_0(s) = w(s)$

1 step: $f_1(s') = w(s') + \gamma \sum_s \Pr(s' \mid s,\pi(s))\,w(s)$

2 steps: $f_2(s'') = w(s'') + \gamma \sum_{s'} \Pr(s'' \mid s',\pi(s'))\,w(s') + \gamma^2 \sum_{s,s'} \Pr(s'' \mid s',\pi(s'))\,\Pr(s' \mid s,\pi(s))\,w(s)$

…

$n$ steps: $f_n(s_n) = w(s_n) + \gamma \sum_{s_{n-1}} \Pr(s_n \mid s_{n-1},\pi(s_{n-1}))\,f_{n-1}(s_{n-1})$

$\infty$ steps: $f(s') = w(s') + \gamma \sum_s \Pr(s' \mid s,\pi(s))\,f(s)$

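The state-frequency recursion can be run directly. The sketch below iterates $f_n$ from $f_0 = w$ for a made-up two-state MDP and a fixed deterministic policy (the numbers and names are assumptions for illustration).

```python
# Hypothetical 2-state, 2-action MDP and policy (illustrative numbers only).
gamma = 0.9
w = [0.5, 0.5]                                   # start-state weights
P = [[[1.0, 0.0], [0.0, 1.0]],                   # P[s][a][s'] = Pr(s' | s, a)
     [[0.0, 1.0], [1.0, 0.0]]]
policy = [1, 0]                                  # deterministic policy pi(s)


def next_frequency(f):
    """f_n(s') = w(s') + gamma * sum_s Pr(s' | s, pi(s)) * f_{n-1}(s)."""
    n = len(f)
    return [w[s2] + gamma * sum(P[s][policy[s]][s2] * f[s] for s in range(n))
            for s2 in range(n)]


f = list(w)                                      # f_0 = w
for _ in range(500):
    f = next_frequency(f)
print(f, sum(f))
```

Here the frequencies converge to about (0.5, 9.5). Their sum approaches $1/(1-\gamma) = 10$ because the weights sum to 1 and each step adds a further discounted copy of the total mass.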
SLIDE 13

State-Action Frequency

  • Let $y(s,a)$ be the state-action frequency

$$y(s,a) = \pi(a \mid s)\,f(s)$$

where $\pi(a \mid s) = \Pr(a \mid s)$ is a stochastic policy

  • Then the following equations are equivalent

$$f(s') = w(s') + \gamma \sum_s \Pr(s' \mid s,\pi(s))\,f(s)$$

$$\Leftrightarrow \sum_{a'} \pi(a' \mid s')\,f(s') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\,\pi(a \mid s)\,f(s)$$

$$\Leftrightarrow \sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\,y(s,a) \quad \text{(constraint of dual LP)}$$

SLIDE 14

Policy

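  • We can recover $\pi$ from $y$.

$y(s,a) = \pi(a \mid s)\,f(s)$ (by definition)

$\pi(a \mid s) = \dfrac{y(s,a)}{f(s)}$ (isolate $\pi$)

$\pi(a \mid s) = \dfrac{y(s,a)}{\sum_{a'} y(s,a')}$ (since $\sum_{a'} \pi(a' \mid s) = 1$, so $f(s) = \sum_{a'} y(s,a')$)

  • $\pi$ may be stochastic
  • Actions with non-zero probability are necessarily optimal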
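The normalization step is short enough to sketch in Python. The function name, the table `y` and the uniform fallback for states with zero total frequency are assumptions for illustration; the slides do not say what to do with unvisited states.

```python
def policy_from_frequencies(y):
    """pi(a|s) = y(s,a) / sum_a' y(s,a'), where y[s][a] holds state-action frequencies."""
    policy = []
    for row in y:
        total = sum(row)
        if total <= 0.0:
            # State never visited: pi is undefined there; fall back to uniform (an assumption).
            policy.append([1.0 / len(row)] * len(row))
        else:
            policy.append([value / total for value in row])
    return policy


# Illustrative frequencies for a 2-state, 2-action problem (made-up numbers).
y = [[0.0, 0.5],
     [9.5, 0.0]]
pi = policy_from_frequencies(y)
print(pi)
```

With these frequencies each state puts all of its probability on a single action, so the recovered policy is deterministic.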
SLIDE 15

Objective

  • Duality theory guarantees that the objectives of the primal and dual LPs are equal

$$\max_y \sum_{s,a} y(s,a)\,R(s,a) = \min_V \sum_s w(s)\,V(s)$$

  • This means that $\sum_{s,a} y(s,a)\,R(s,a)$ implicitly measures the value of the optimal policy.
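The equality of the two objectives can be checked numerically without an LP solver. The sketch below uses a made-up two-state MDP (numbers and names are assumptions): value iteration stands in for the primal solution, and the frequencies of the greedy policy stand in for the dual solution.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
w = [0.5, 0.5]
R = [[0.0, 1.0], [2.0, 0.0]]                     # R[s][a]
P = [[[1.0, 0.0], [0.0, 1.0]],                   # P[s][a][s'] = Pr(s' | s, a)
     [[0.0, 1.0], [1.0, 0.0]]]
n_states, n_actions = 2, 2


def q(s, a, V):
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))


# Primal side: value iteration converges to V*, the minimizer of the primal LP.
V = [0.0] * n_states
for _ in range(1000):
    V = [max(q(s, a, V) for a in range(n_actions)) for s in range(n_states)]
primal = sum(w[s] * V[s] for s in range(n_states))

# Dual side: frequencies f of the greedy policy; y(s,a) = f(s) when a = pi(s), else 0.
pi = [max(range(n_actions), key=lambda a: q(s, a, V)) for s in range(n_states)]
f = list(w)
for _ in range(1000):
    f = [w[s2] + gamma * sum(P[s][pi[s]][s2] * f[s] for s in range(n_states))
         for s2 in range(n_states)]
dual = sum(f[s] * R[s][pi[s]] for s in range(n_states))

print(primal, dual)
```

Both quantities come out at about 19.5 for this example, matching $\sum_s w(s)\,V^*(s)$.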

SLIDE 16

Solution Algorithms

  • Two broad classes of algorithms:

– Simplex (corner search)

– Interior point methods (interior iterative methods)

  • Polynomial complexity: linear programs can be solved in polynomial time (e.g. by interior point methods), so solving an MDP is in P
  • Many packages for linear programming

– CPLEX (robust, efficient and free for academia)
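As a hedged illustration with a different, freely available package, the sketch below passes the primal LP of a made-up two-state MDP to SciPy's `linprog` with the HiGHS backend. SciPy is chosen here only for illustration and is not mentioned in the slides; the MDP numbers are assumptions, and NumPy and SciPy need to be installed.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
R = np.array([[0.0, 1.0], [2.0, 0.0]])                  # R[s, a]
P = np.array([[[1.0, 0.0], [0.0, 1.0]],                 # P[s, a, s'] = Pr(s' | s, a)
              [[0.0, 1.0], [1.0, 0.0]]])
w = np.array([0.5, 0.5])
n_states, n_actions = R.shape

# Constraint for each (s, a): V(s) - gamma * sum_s' P(s'|s,a) V(s') >= R(s,a).
# linprog expects A_ub @ x <= b_ub, so both sides are negated.
A_ub = np.zeros((n_states * n_actions, n_states))
b_ub = np.zeros(n_states * n_actions)
for s in range(n_states):
    for a in range(n_actions):
        row = s * n_actions + a
        A_ub[row] = gamma * P[s, a]
        A_ub[row, s] -= 1.0
        b_ub[row] = -R[s, a]

# V is unrestricted in sign, so linprog's default bounds (x >= 0) are overridden.
result = linprog(c=w, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(None, None)] * n_states, method="highs")
V = result.x
print(result.status, V)
```

A status of 0 indicates that the solver reports an optimal solution; for this toy problem the values are approximately (19, 20).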