Module 8: Linear Programming
CS 886 Sequential Decision Making and Reinforcement Learning
University of Waterloo
CS886 (c) 2013 Pascal Poupart
Policy Optimization
- Value and policy iteration
  - Iterative algorithms that implicitly solve an optimization problem
- Can we explicitly write down this optimization problem?
  - Yes, it can be formulated as a linear program
Primal Linear Program
- Variables: V(s) ∀s
- Objective: min_V Σ_s w(s) V(s)
  where w(s) is a weight assigned to state s
- Constraints: V(s) ≥ R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s') ∀s,a

primalLP(MDP)
  min_V Σ_s w(s) V(s)
  subject to V(s) ≥ R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s') ∀s,a
  return V
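As an illustration, the primal LP can be assembled and solved with an off-the-shelf LP solver. The sketch below assumes SciPy's `scipy.optimize.linprog` and a made-up two-state, two-action MDP (action 0 stays in the current state, action 1 switches states); neither the solver choice nor the MDP comes from the slides.

```python
from scipy.optimize import linprog  # assumes SciPy is available

S, A, gamma = [0, 1], [0, 1], 0.9   # hypothetical 2-state, 2-action MDP
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
def P(s2, s, a):                    # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

w = [1.0, 1.0]                      # positive weight for every state
A_ub, b_ub = [], []
for s in S:
    for a in A:
        # V(s) >= R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')
        # rewritten in linprog's form: gamma*Pr(.)*V - V(s) <= -R(s,a)
        A_ub.append([gamma * P(s2, s, a) - (1.0 if s2 == s else 0.0) for s2 in S])
        b_ub.append(-R[(s, a)])
res = linprog(c=w, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(S))
V = res.x                           # V* = (18, 20) for this toy MDP
```

Minimizing w·V subject to the Bellman inequalities drives each V(s) down to the tightest upper bound, so the LP optimum is exactly V*.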
Objective
- Why do we minimize a weighted combination of the values? Shouldn't we maximize value?
- Value functions V that satisfy the constraints are upper bounds on the optimal value function V*: V(s) ≥ V*(s) ∀s
- Minimizing value ensures that we choose the lowest upper bound:
  min_V Σ_s w(s) V(s) yields V(s) = V*(s) ∀s
Upper bound
- Theorem: Value functions V that satisfy
  V(s) ≥ R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s') ∀s,a
  are upper bounds on the optimal value function V*: V(s) ≥ V*(s) ∀s
- Proof:
  - Since V(s) ≥ R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s') ∀s,a
  - Then V(s) ≥ max_a [R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s')] = H*(V)(s) ∀s
  - Furthermore, since the Bellman operator H* is monotonic,
    V ≥ H*(V) ≥ H*(H*(V)) ≥ … ≥ H*^∞(V) = V*
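The theorem can be checked numerically. The sketch below (plain Python, on a hypothetical two-state MDP not taken from the slides) starts from a V that satisfies all the constraints and applies the Bellman operator H* repeatedly; the iterates only decrease, converging to V* from above.

```python
S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
def P(s2, s, a):            # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

def H(V):                   # Bellman operator: H*(V)(s) = max_a R(s,a) + gamma*sum P*V
    return [max(R[(s, a)] + gamma * sum(P(s2, s, a) * V[s2] for s2 in S)
                for a in A) for s in S]

V = [30.0, 30.0]            # satisfies V(s) >= R(s,a) + gamma*sum P*V(s') for all s,a
feasible = all(V[s] >= H(V)[s] for s in S)
for _ in range(500):        # V >= H*(V) >= H*(H*(V)) >= ... -> V*
    V = H(V)
# V has converged to V* = (18, 20), which the initial feasible V upper-bounded
```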
Weight function (initial state)
- How do we choose the weight function?
- If the policy always starts in the same initial state s₀, then set
  w(s) = 1 if s = s₀, 0 otherwise
- This ensures that Σ_s w(s) V(s) = V*(s₀)
Weight function (any state)
- If the policy may start in any state, then assign a positive weight to each state, i.e. w(s) > 0 ∀s
- This ensures that V is minimized at each s and therefore V(s) = V*(s) ∀s
- The magnitude of the weights doesn't matter when the LP is solved exactly. We will revisit the choice of w(s) when we discuss approximate linear programming.
Optimal Policy
- The linear program finds V*
- We can extract π* from V* as usual:
  π*(s) ← argmax_a R(s,a) + γ Σ_{s'} Pr(s'|s,a) V*(s')
- Or check the active constraints:
  - For each s, check which a* leads to equality:
    V(s) = R(s,a*) + γ Σ_{s'} Pr(s'|s,a*) V(s')
    V(s) ≥ R(s,a) + γ Σ_{s'} Pr(s'|s,a) V(s') ∀a
  - Set π*(s) ← a*
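Both extraction rules can be sketched in plain Python. The two-state MDP and its V* below are hypothetical illustration values, not from the slides: argmax over Q-values, or equivalently detecting which constraint is active (holds with equality).

```python
S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
def P(s2, s, a):            # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

V_star = [18.0, 20.0]       # optimal value function of this toy MDP

def Q(s, a):                # Q(s,a) = R(s,a) + gamma * sum_s' Pr(s'|s,a) V*(s')
    return R[(s, a)] + gamma * sum(P(s2, s, a) * V_star[s2] for s2 in S)

# argmax extraction of the optimal policy
pi = [max(A, key=lambda a: Q(s, a)) for s in S]
# equivalently: the chosen action's constraint is active, Q(s, a*) = V*(s)
active = [abs(Q(s, pi[s]) - V_star[s]) < 1e-9 for s in S]
```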
Direct Policy Optimization
- The optimal solution to the primal linear program is V*, but we still have to extract π*
- Could we directly optimize π?
  - Yes, by considering the dual linear program
Dual Linear Program
- Variables: y(s,a) ∀s,a
  - frequency of each (s,a)-pair (proportional to π)
- Objective: max_y Σ_{s,a} y(s,a) R(s,a)
- Constraints:
  Σ_{a'} y(s',a') = w(s') + γ Σ_{s,a} Pr(s'|s,a) y(s,a) ∀s'
  y(s,a) ≥ 0 ∀s,a

dualLP(MDP)
  max_y Σ_{s,a} y(s,a) R(s,a)
  subject to Σ_{a'} y(s',a') = w(s') + γ Σ_{s,a} Pr(s'|s,a) y(s,a) ∀s'
             y(s,a) ≥ 0 ∀s,a
  let π(a|s) = Pr(a|s) = y(s,a) / Σ_{a'} y(s,a')
  return π
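A sketch of the dual LP, again assuming SciPy's `linprog` and a made-up two-state MDP (action 0 stays, action 1 switches). `linprog` minimizes, so the objective is negated, and the frequency constraints enter as equalities with y ≥ 0.

```python
from scipy.optimize import linprog  # assumes SciPy is available

S, A, gamma = [0, 1], [0, 1], 0.9   # hypothetical 2-state, 2-action MDP
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
def P(s2, s, a):                    # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

w = [1.0, 1.0]
pairs = [(s, a) for s in S for a in A]   # ordering of the y(s,a) variables
c = [-R[p] for p in pairs]               # linprog minimizes, so negate R
# one equality constraint per s':
#   sum_a' y(s',a') - gamma * sum_{s,a} Pr(s'|s,a) y(s,a) = w(s')
A_eq = [[(1.0 if s == s2 else 0.0) - gamma * P(s2, s, a) for (s, a) in pairs]
        for s2 in S]
res = linprog(c=c, A_eq=A_eq, b_eq=w, bounds=[(0, None)] * len(pairs))
y = dict(zip(pairs, res.x))
# recover a policy directly: pick the action carrying the frequency mass
pi = {s: max(A, key=lambda a: y[(s, a)]) for s in S}
```

Only optimal actions receive positive frequency, so the policy falls out of y without a separate value-function step.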
Duality
- For every primal linear program of the form
  min_y cᵀy  s.t.  By ≥ b
- there is an equivalent dual linear program of the form
  max_z bᵀz  s.t.  Bᵀz = c and z ≥ 0
- where min_y cᵀy = max_z bᵀz
- Interpretation here: c = w, y = V, z → y(s,a), B = [I − γT_a] ∀a, b = [R_a] ∀a
  (T_a is the transition matrix for action a)
State Frequency
- Let f(s) be the (discounted) frequency of s under policy π:
  0 steps: f_0(s) = w(s)
  1 step:  f_1(s') = w(s') + γ Σ_s Pr(s'|s, π(s)) w(s)
  2 steps: f_2(s'') = w(s'') + γ Σ_{s'} Pr(s''|s', π(s')) w(s')
                      + γ² Σ_{s,s'} Pr(s''|s', π(s')) Pr(s'|s, π(s)) w(s)
  …
  n steps: f_n(s_n) = w(s_n) + γ Σ_{s_{n−1}} Pr(s_n|s_{n−1}, π(s_{n−1})) f_{n−1}(s_{n−1})
  ∞ steps: f(s') = w(s') + γ Σ_s Pr(s'|s, π(s)) f(s)
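The ∞-step equation is a fixed point that can be reached by simple iteration. The sketch below (plain Python, a hypothetical two-state MDP with w = (1, 1) and a fixed deterministic policy, none of which comes from the slides) iterates f_n from f_{n−1}.

```python
S, gamma = [0, 1], 0.9
def P(s2, s, a):            # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

pi = [1, 0]                 # fixed deterministic policy: pi(0)=1, pi(1)=0
w = [1.0, 1.0]
f = list(w)                 # 0 steps: f_0(s) = w(s)
for _ in range(500):        # f_n(s') = w(s') + gamma * sum_s Pr(s'|s,pi(s)) f_{n-1}(s)
    f = [w[s2] + gamma * sum(P(s2, s, pi[s]) * f[s] for s in S) for s2 in S]
# f has converged to the discounted state frequencies: f = (1, 19) here
```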
State-Action Frequency
- Let y(s,a) be the state-action frequency:
  y(s,a) = π(a|s) f(s), where π(a|s) = Pr(a|s) is a stochastic policy
- Then the following equations are equivalent:
  f(s') = w(s') + γ Σ_s Pr(s'|s, π(s)) f(s)
  Σ_{a'} π(a'|s') f(s') = w(s') + γ Σ_{s,a} Pr(s'|s,a) π(a|s) f(s)
  Σ_{a'} y(s',a') = w(s') + γ Σ_{s,a} Pr(s'|s,a) y(s,a)
  The last line is exactly the constraint of the dual LP.
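The equivalence can be verified numerically for any stochastic policy. The sketch below (plain Python, hypothetical two-state MDP not from the slides) computes f under a uniform random policy, builds y(s,a) = π(a|s) f(s), and checks the dual-LP constraint.

```python
S, A, gamma = [0, 1], [0, 1], 0.9
def P(s2, s, a):            # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

w = [1.0, 1.0]
pi = {(s, a): 0.5 for s in S for a in A}   # any stochastic policy works here
# discounted state frequencies under pi, by fixed-point iteration
f = list(w)
for _ in range(500):
    f = [w[s2] + gamma * sum(P(s2, s, a) * pi[(s, a)] * f[s]
                             for s in S for a in A) for s2 in S]
y = {(s, a): pi[(s, a)] * f[s] for s in S for a in A}
# dual-LP constraint: sum_a' y(s',a') = w(s') + gamma * sum_{s,a} Pr(s'|s,a) y(s,a)
gap = [sum(y[(s2, a2)] for a2 in A)
       - (w[s2] + gamma * sum(P(s2, s, a) * y[(s, a)] for s in S for a in A))
       for s2 in S]
```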
Policy
- We can recover π from y:
  y(s,a) = π(a|s) f(s)   (by definition)
  π(a|s) = y(s,a) / f(s)   (isolate π)
  π(a|s) = y(s,a) / Σ_{a'} y(s,a')   (since f(s) = Σ_{a'} y(s,a') by definition)
- π may be stochastic
- Actions with non-zero probability are necessarily optimal
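Recovering π from y is one line of arithmetic. The y values below are hypothetical (they correspond to a small deterministic toy MDP, not anything in the slides):

```python
S, A = [0, 1], [0, 1]
# hypothetical dual solution: only optimal actions receive positive frequency
y = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 19.0, (1, 1): 0.0}
# pi(a|s) = y(s,a) / sum_a' y(s,a')
pi = {(s, a): y[(s, a)] / sum(y[(s, a2)] for a2 in A) for s in S for a in A}
# deterministic here: pi(1|0) = 1 and pi(0|1) = 1
```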
Objective
- Duality theory guarantees that the objectives of the primal and dual LPs are equal:
  max_y Σ_{s,a} y(s,a) R(s,a) = min_V Σ_s w(s) V(s)
- This means that Σ_{s,a} y(s,a) R(s,a) implicitly measures the value of the optimal policy.
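The equality can be demonstrated end to end on a toy MDP (hypothetical, plain Python): value iteration gives the primal optimum Σ_s w(s) V*(s), frequency iteration under the greedy policy gives the dual objective Σ_{s,a} y(s,a) R(s,a), and the two match.

```python
S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
def P(s2, s, a):            # Pr(s'|s,a): action 0 stays, action 1 switches
    return 1.0 if s2 == (s if a == 0 else 1 - s) else 0.0

w = [1.0, 1.0]
# primal side: V* by value iteration, then the greedy policy
V = [0.0, 0.0]
for _ in range(500):
    V = [max(R[(s, a)] + gamma * sum(P(s2, s, a) * V[s2] for s2 in S)
             for a in A) for s in S]
pi = [max(A, key=lambda a: R[(s, a)] +
          gamma * sum(P(s2, s, a) * V[s2] for s2 in S)) for s in S]
# dual side: discounted state frequencies of the greedy policy
f = list(w)
for _ in range(500):
    f = [w[s2] + gamma * sum(P(s2, s, pi[s]) * f[s] for s in S) for s2 in S]
dual_obj = sum(f[s] * R[(s, pi[s])] for s in S)     # sum_{s,a} y(s,a) R(s,a)
primal_obj = sum(w[s] * V[s] for s in S)            # sum_s w(s) V*(s)
# both equal 38 for this toy MDP
```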
Solution Algorithms
- Two broad classes of algorithms:
  - Simplex (corner search)
  - Interior point methods (iterative methods through the interior)
- Polynomial time complexity (so MDP planning is in P, not NP-hard)
- Many packages for linear programming