SLIDE 1

Linear Programming for Large-Scale Markov Decision Problems

Yasin Abbasi-Yadkori¹   Peter Bartlett¹ ²   Alan Malek²

¹ Queensland University of Technology, Brisbane, QLD, Australia

² University of California, Berkeley, Berkeley, CA

June 24th, 2014

Y. Abbasi-Yadkori, P. Bartlett, A. Malek · Linear Programming for Large-Scale Markov Decision Problems · 1 / 22

SLIDE 2

Outline

1. Introduce MDPs and the Linear Program formulation
2. Algorithm
3. Oracle inequality
4. Experiments

SLIDE 3

Markov Decision Processes

A Markov Decision Process is specified by:

• State space X = {1, . . . , X}
• Action space A = {1, . . . , A}
• Transition kernel P : X × A → Δ_X
• Loss function ℓ : X × A → [0, 1]

Let P_π be the state transition kernel under policy π : X → Δ_A. Our goal is to choose π to minimize the average loss when X and A are very large. Aim for optimality within a restricted family of policies.
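These objects are easy to write down concretely. Below is a minimal sketch for a toy instance (the sizes, random data, and variable names are illustrative, not from the talk), computing the average loss of a fixed policy:

```python
import numpy as np

# Toy MDP with X = 2 states and A = 2 actions (all numbers illustrative).
X, A = 2, 2
rng = np.random.default_rng(0)

# Transition kernel P : X x A -> Delta_X, stored as P[x, a, x'].
P = rng.random((X, A, X))
P /= P.sum(axis=2, keepdims=True)

# Loss function ell : X x A -> [0, 1].
ell = rng.random((X, A))

# A policy pi : X -> Delta_A (here: uniform over actions).
pi = np.full((X, A), 1.0 / A)

# State transition kernel under pi: P_pi[x, y] = sum_a pi(a|x) P(y|x, a).
P_pi = np.einsum('xa,xay->xy', pi, P)

# Stationary state distribution of P_pi, found by power iteration.
d = np.full(X, 1.0 / X)
for _ in range(1000):
    d = d @ P_pi

# Average loss of pi: sum_{x,a} d(x) pi(a|x) ell(x, a).
avg_loss = np.einsum('x,xa,xa->', d, pi, ell)
```

Minimizing this average loss over π is the objective of the talk; the LP formulation below does so without enumerating policies.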

SLIDE 4

Linear Program Formulation

LP formulation (Manne 1960):

max_{λ,h} λ ,    (1)
s.t. B⊤(λ1 + h) ≤ ℓ + P⊤h ,

where B ∈ {0, 1}^{X×XA} is the marginalization matrix. Primal variables: h is the cost-to-go, λ is the average cost.

Dual:

min_{μ∈R^{XA}} ℓ⊤μ ,    (2)
s.t. 1⊤μ = 1, μ ≥ 0, (P − B)μ = 0 .

Define the policy via π(a|x) ∝ μ(x, a). Dual variables: μ is a stationary distribution over X × A. Still a problem when X, A very large.

SLIDE 5

The Dual ALP

Feature matrix Φ ∈ R^{XA×d}; constrain μ = Φθ:

min_θ ℓ⊤Φθ ,    (3)
s.t. 1⊤Φθ = 1, Φθ ≥ 0, (P − B)⊤Φθ = 0 .

[·]_+ is the positive part. Define the policy via π_θ(a|x) ∝ [(Φθ)(x, a)]_+; μ_θ is the stationary distribution of P_{π_θ}, and μ_θ ≈ Φθ. ℓ⊤μ_θ is the average loss of policy π_θ. Want to compete with min_θ ℓ⊤μ_θ.

SLIDE 6

Reducing Constraints

Still intractable: d-dimensional problem but O(XA) constraints.

Form the convex cost function (with penalty constant H):

c(θ) = ℓ⊤Φθ + H ‖[Φθ]_−‖₁ + H ‖(P − B)⊤Φθ‖₁
     = ℓ⊤Φθ + H Σ_{(x,a)} [Φ_{(x,a),:} θ]_− + H Σ_{x′} |(Φθ)⊤(P − B)_{:,x′}|

Sample (x_t, a_t) ∼ q1 and y_t ∼ q2. Unbiased subgradient estimate:

g_t(θ) = ℓ⊤Φ − H (Φ_{(x_t,a_t),:} / q1(x_t, a_t)) I{Φ_{(x_t,a_t),:} θ < 0}    (4)
       + H ((Φ⊤(P − B)_{:,y_t})⊤ / q2(y_t)) sgn((Φθ)⊤(P − B)_{:,y_t})
SLIDE 7

The Stochastic Subgradient Method for MDPs

Input: constants S, H > 0, number of rounds T.
Let Π_Θ be the Euclidean projection onto the S-radius 2-norm ball.
Initialize θ_1 ∝ 1.
for t := 1, 2, . . . , T do
    Sample (x_t, a_t) ∼ q1 and x′_t ∼ q2.
    Compute the subgradient estimate g_t.
    Update θ_{t+1} = Π_Θ(θ_t − η_t g_t).
end for
θ̄_T = (1/T) Σ_{t=1}^T θ_t.
Return policy π_{θ̄_T}.
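The loop above can be sketched end-to-end on a toy MDP. Everything below (the toy problem, the uniform choices of q1 and q2, the fixed step size) is an illustrative choice, not the authors' experimental setup:

```python
import numpy as np

X, A, d = 3, 2, 4
S, H, T = 10.0, 10.0, 2000
rng = np.random.default_rng(0)

Pk = rng.random((X, A, X)); Pk /= Pk.sum(axis=2, keepdims=True)
Pmat = Pk.reshape(X * A, X)                  # P in R^{XA x X}
Bmat = np.kron(np.eye(X), np.ones((A, 1)))   # marginalization matrix
Phi = rng.random((X * A, d))                 # feature matrix
ell = rng.random(X * A)                      # losses in [0, 1]
q1 = np.full(X * A, 1.0 / (X * A))           # uniform sampling distributions
q2 = np.full(X, 1.0 / X)
Dm = (Pmat - Bmat).T @ Phi                   # Dm[y] = (P - B)_{:,y}^T Phi

theta = np.ones(d) / d                       # theta_1 proportional to 1
avg = np.zeros(d)
eta = S / np.sqrt(T)                         # fixed step size
for t in range(T):
    i = rng.choice(X * A, p=q1)              # (x_t, a_t) ~ q1
    y = rng.choice(X, p=q2)                  # x'_t ~ q2
    g = ell @ Phi                            # gradient of the linear term
    if Phi[i] @ theta < 0:                   # negative-part penalty active
        g = g - H * Phi[i] / q1[i]
    g = g + H * np.sign(Dm[y] @ theta) * Dm[y] / q2[y]
    theta = theta - eta * g
    nrm = np.linalg.norm(theta)              # project onto the S-radius ball
    if nrm > S:
        theta = theta * (S / nrm)
    avg = avg + theta
theta_bar = avg / T                          # averaged iterate theta_bar_T
```

The returned policy is then obtained from `theta_bar` via the positive-part normalization of the previous slide.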

SLIDE 8

Theorem

Given some ε > 0, the θ̄_T produced by the stochastic subgradient method after T = 1/ε⁴ steps satisfies

ℓ⊤μ_{θ̄_T} ≤ min_{θ∈Θ} ( ℓ⊤μ_θ + V(θ)/ε ) + O(ε)

with probability at least 1 − δ, where V = O(V1 + V2) is a violation function defined by

V1(θ) = ‖[Φθ]_−‖₁ ,   V2(θ) = ‖(P − B)⊤Φθ‖₁ .

The big-O notation hides polynomials in S, d, C1, C2, and log(1/δ).

SLIDE 9

Comparison with previous techniques

• We bound the performance of the found policy directly (not through J)
• Previous bounds were of the form inf_θ ‖J∗ − Ψθ‖
• Our bounds: performance w.r.t. the best in class, without assuming near-optimality of the class
• No knowledge of the optimal policy assumed
• First method to make approximations in the dual

SLIDE 10

Discussion

Can remove the awkward V(θ)/ε + O(ε) by taking a grid of ε.

Recall

C1 = max_{(x,a)∈X×A} ‖Φ_{(x,a),:}‖ / q1(x, a) ,   C2 = max_{x∈X} ‖(P − B)⊤_{:,x} Φ‖ / q2(x) .

We also pick Φ and q1, so we can make C1 small. Making C2 small may require knowledge of P (such as sparsity or some stability assumption). Natural selection: state aggregation.
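One way to read the "state aggregation" suggestion is the following illustrative construction (the partition is arbitrary, chosen only for the example): each feature indicates one (cluster, action) pair, so every row of Φ is a single unit indicator.

```python
import numpy as np

# Illustrative "state aggregation" features: partition the states into
# clusters and let feature j indicate one (cluster, action) pair.
X, A = 6, 2
cluster = np.array([0, 0, 1, 1, 2, 2])        # arbitrary partition of states
n_clusters = cluster.max() + 1

d = n_clusters * A
Phi = np.zeros((X * A, d))
for x in range(X):
    for a in range(A):
        Phi[x * A + a, cluster[x] * A + a] = 1.0

# Every row of Phi is a single indicator, so mu = Phi theta assigns the
# same value to all (x, a) pairs sharing a cluster and an action.
```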

SLIDE 11

Comparison with Constraint Sampling

• Use the constraint sampling of (de Farias and Van Roy, 2004)
• Must assume feasibility
• Need a vector v(x) ≥ |(P − B)⊤Φθ| as an envelope to the constraint violations
• Bound includes ‖v(x)‖₁, which could be very large
• Requires specific knowledge about the problem

SLIDE 12

Analysis

Assume fast mixing: for every policy π, there exists τ(π) > 0 such that for all d, d′ ∈ Δ_X,

‖dP_π − d′P_π‖₁ ≤ e^{−1/τ(π)} ‖d − d′‖₁ .

Define

C1 = max_{(x,a)∈X×A} ‖Φ_{(x,a),:}‖ / q1(x, a) ,   C2 = max_{x∈X} ‖(P − B)⊤_{:,x} Φ‖ / q2(x) .

The proof has three main parts:

1. V1(θ) ≤ ε1 and V2(θ) ≤ ε2 ⇒ ‖μ_θ − Φθ‖₁ ≤ O(ε1 + ε2)
2. Bounding the gradient of c(θ); checking it is unbiased
3. Applying the stochastic gradient descent theorem: ℓ⊤Φθ̄ ≤ min_{θ∈Θ} c(θ) + O(ε)

SLIDE 13

Proof part 1

Lemma

Let u ∈ R^{XA} be a vector with 1⊤u = 1, ‖[u]_−‖₁ ≤ ε1, and ‖u⊤(P − B)‖₁ ≤ ε2. Then, for the stationary distribution μ_u of the policy u_+ = [u]_+/‖[u]_+‖₁, with ε′ := 2ε1 + ε2,

‖μ_u − u‖₁ ≤ τ(u_+) log(1/ε′)(2ε1 + ε2) + 3ε′ .

Proof: Two bounds give ‖(P − B)⊤u_+‖₁ ≤ 2ε1 + ε2 = ε′. Also, ‖u_+ − u‖₁ ≤ 2ε1. Define M_{u_+} ∈ R^{X×XA} as the matrix that encodes the policy u_+, e.g. M_{u_+}P = P_{u_+}.

SLIDE 14

Proof (continued): Let μ_0 = u_+, μ⊤_t = μ⊤_{t−1} P M_{u_+}, and v_t = μ⊤_t (P − B) = v_{t−1} M_{u_+} P.

μ_t is the state-action distribution after running the policy for t steps. By the previous bound, ‖v_0‖₁ ≤ ε′ ⇒ ‖v_t‖₁ ≤ ε′ for all t.

μ⊤_t = μ⊤_{t−1} P M_{u_+} = (μ⊤_{t−1} B + v_{t−1}) M_{u_+} = μ⊤_{t−1} + v_{t−1} M_{u_+}

Telescoping: μ⊤_k = μ⊤_0 + Σ_{t=0}^{k−1} v_t M_{u_+}.

Thus, ‖μ_k − u_+‖₁ ≤ kε′. By the mixing assumption, ‖μ_k − μ_u‖₁ ≤ e^{−k/τ(u_+)}. Take k = τ(u_+) log(1/ε′) and use the triangle inequality. ∎

SLIDE 15

Applying SGD theorem

Theorem (Lemma 3.1 of (Flaxman et al., 2005))

Assume we have:

• a convex set Z ⊆ B₂(Z, 0) and convex functions (f_t)_{t=1,...,T} on Z,
• gradient estimates f′_t with E[f′_t | z_t] = ∇f_t(z_t) and bound ‖f′_t‖₂ ≤ F,
• the sample path z_1 = 0 and z_{t+1} = Π_Z(z_t − η f′_t), where Π_Z is the Euclidean projection onto Z.

Then, for η = Z/(F√T) and any δ ∈ (0, 1), the following holds with probability at least 1 − δ:

Σ_{t=1}^T f_t(z_t) − min_{z∈Z} Σ_{t=1}^T f_t(z) ≤ ZF√T + sqrt( (1 + 4Z²T)(2 log(1/δ) + d log(1 + Z²T/d)) ) .    (5)
SLIDE 16

Checking the conditions of the theorem

Recall the gradient: for (x_t, a_t) ∼ q1 and y_t ∼ q2,

g_t(θ) = ℓ⊤Φ − H (Φ_{(x_t,a_t),:} / q1(x_t, a_t)) I{Φ_{(x_t,a_t),:} θ < 0}
       + H ((P − B)⊤_{:,y_t} Φ / q2(y_t)) sgn((Φθ)⊤(P − B)_{:,y_t}) .

We can bound

‖g_t(θ)‖₂ ≤ ‖ℓ⊤Φ‖₂ + H ‖Φ_{(x_t,a_t),:}‖₂ / q1(x_t, a_t) + H ‖(P − B)⊤_{:,y_t} Φ‖₂ / q2(y_t) ≤ √d + H(C1 + C2) =: F,

and E[g_t(θ)] = ∇c(θ).

SLIDE 17

Proof conclusion

The SGD theorem gives us:

ℓ⊤Φθ̄_T + H(V1(θ̄_T) + V2(θ̄_T)) ≤ ℓ⊤Φθ∗ + H(V1(θ∗) + V2(θ∗)) + b_T ,

where b_T is the (time-averaged) regret bound from the theorem:

b_T = SF/√T + sqrt( ((1 + 4S²T)/T²)(2 log(1/δ) + d log(1 + S²T/d)) ) .

We take V1(θ̄_T), V2(θ̄_T) ≤ (1/H)(2(1 + S) + H V1(θ∗) + H V2(θ∗) + b_T) =: ε′ .

SLIDE 18

Applying the lemma twice:

ℓ⊤μ_{θ̄_T} − ℓ⊤μ_{θ∗} ≤ H V1(θ∗) + H V2(θ∗) + b_T + τ(μ_{θ̄_T}) log(1/ε′) 3ε′ + 3ε′
                       + τ(μ_{θ∗}) log(1/V(θ∗))(2V1(θ∗) + V2(θ∗)) + 3V1(θ∗)

Since b_T = O(H/√T), taking H = 1/ε and T = 1/ε⁴ yields:

ℓ⊤μ_{θ̄_T} − ℓ⊤μ_{θ∗} ≤ (1/ε)(V1(θ∗) + V2(θ∗)) + O(ε) .

SLIDE 19

Queueing network example (Rybko-Stolyar)

[Figure: the four-queue Rybko-Stolyar network, with queues μ1–μ4, arrival rates a1 and a3, departure rates d1–d4, and two servers]

• Customers arrive at μ1/μ3, then move to μ2/μ4
• Server 1 processes μ1 or μ4; server 2 processes μ2 or μ3
• Features: indicators of sub-blocks in state-action space, stationary distributions of the LONGER and LBFS heuristics
• Loss is the total queue size
• a1 = a3 = .08, d1 = d2 = .12, d3 = d4 = .28, X = 902500
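A discrete-time simulation of this network under the LONGER heuristic (each server works on its longer queue) can look like the sketch below. The dynamics are one plausible reading of the slide, with a single service attempt per server per step, not the authors' exact experimental setup:

```python
import numpy as np

# Illustrative simulation of the four-queue Rybko-Stolyar network.
a1 = a3 = 0.08                       # arrival probabilities into queues 1, 3
d = [0.12, 0.12, 0.28, 0.28]         # service-completion probabilities d1..d4
rng = np.random.default_rng(0)

q = np.zeros(4, dtype=int)           # queue lengths for mu1..mu4
total = 0.0
T = 100_000
for _ in range(T):
    # Arrivals into queues 1 and 3.
    q[0] += rng.random() < a1
    q[2] += rng.random() < a3
    # LONGER: server 1 serves queue 1 or 4, server 2 serves queue 2 or 3.
    s1 = 0 if q[0] >= q[3] else 3
    s2 = 1 if q[1] >= q[2] else 2
    for i in (s1, s2):
        if q[i] > 0 and rng.random() < d[i]:
            q[i] -= 1
            if i == 0: q[1] += 1     # mu1 -> mu2
            if i == 2: q[3] += 1     # mu3 -> mu4
    total += q.sum()                 # loss: total queue size

avg_loss = total / T                 # empirical average loss
```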

SLIDE 20

Results

[Figure: three plots over 2000–8000 iterations]

The left plot: linear objective of the running average, i.e. ℓ⊤Φθ̄_t. The center plot: sum of the two constraint violations of θ̄_t (log scale). The right plot: ℓ⊤μ_{θ̄_t}; the two horizontal lines correspond to the loss of the two heuristics, LONGER and LBFS.

SLIDE 21

Conclusion

Presented an algorithm to solve average-cost, large-scale MDPs:

◮ Restricted the dual LP to a subspace to reduce dimension
◮ Used stochastic gradient descent to sample constraints

Presented an oracle inequality guaranteeing we perform well w.r.t. the best policy in the subspace. Demonstrated the algorithm on a queueing network. Visit us at poster T75.

SLIDE 22

Bibliography

D. P. de Farias and B. Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29, 2004.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, 2005.