SLIDE 1

Stochastic approximation for speeding up LSTD/LSPI (and least squares regression/LinUCB)

Prashanth L A†

Joint work with Nathaniel Korda♯ and Rémi Munos†

†INRIA Lille - Team SequeL ♯MLRG - Oxford University

November 24, 2014

Prashanth L A Fast LSTD using SA November 24, 2014 1 / 39

SLIDE 2

Fast LSTD using SA

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 3

Fast LSTD using SA

Background

MDP: set of states $\mathcal{X}$, set of actions $\mathcal{A}$, rewards $r(x, a)$

Value function: $V^\pi(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right]$

Bellman operator: $T^\pi(V)(s) := r(s, \pi(s)) + \beta \sum_{s'} p(s, \pi(s), s') V(s')$

SLIDE 4

Fast LSTD using SA

TD with Function Approximation

Linear function approximation: $V^\pi(s) \approx \theta^T \phi(s)$

Parameter $\theta \in \mathbb{R}^d$, feature $\phi(s) \in \mathbb{R}^d$

TD fixed point: $\Phi\theta = \Pi T^\pi(\Phi\theta)$, where $\Phi$ is the feature matrix with rows $\phi(s)^T$, $\forall s \in S$, and $\Pi$ is the orthogonal projection onto $B = \{\Phi\theta \mid \theta \in \mathbb{R}^d\}$


SLIDE 6

Fast LSTD using SA

LSTD - A Batch Algorithm

Given dataset $\mathcal{D} := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$

LSTD approximates the TD fixed point by $\hat{\theta}_T = \bar{A}_T^{-1} \bar{b}_T$ (an $O(d^2 T)$ computation), where

$\bar{A}_T = \frac{1}{T} \sum_{i=1}^{T} \phi(s_i)\left( \phi(s_i) - \beta\phi(s'_i) \right)^T, \qquad \bar{b}_T = \frac{1}{T} \sum_{i=1}^{T} r_i\, \phi(s_i).$
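As a concrete illustration, the batch LSTD solution above can be computed in a few NumPy lines (a minimal sketch; the array layout and names are my assumptions, not from the slides):

```python
import numpy as np

def lstd(phi, phi_next, rewards, beta):
    """Batch LSTD: solve A_bar theta = b_bar from T transitions.

    phi, phi_next: (T, d) feature matrices for s_i and s'_i;
    rewards: (T,) reward vector; beta: discount factor.
    """
    T = phi.shape[0]
    A_bar = phi.T @ (phi - beta * phi_next) / T   # (d, d)
    b_bar = phi.T @ rewards / T                   # (d,)
    return np.linalg.solve(A_bar, b_bar)          # theta_hat_T
```

Forming $\bar{A}_T$ costs $O(d^2 T)$, which is exactly the bottleneck the SA variant below avoids.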


SLIDE 8

Fast LSTD using SA

Complexity of LSTD [1]

Figure: LSPI, a batch-mode RL algorithm for control (policy evaluation computes the Q-value $Q^\pi$; policy improvement returns a policy $\pi$)

LSTD Complexity

$O(d^2 T)$ using the Sherman-Morrison lemma, or $O(d^{2.807})$ using the Strassen algorithm, or $O(d^{2.375})$ using the Coppersmith-Winograd algorithm


SLIDE 10

Fast LSTD using SA

Complexity of LSTD [2]

Problem: practical applications involve high-dimensional features (e.g. Computer Go: $d \sim 10^6$) ⇒ solving LSTD is computationally intensive. Related works: GTD¹, GTD2², iLSTD³

Solution: use stochastic approximation (SA)

Complexity: $O(dT)$ ⇒ a factor-$d$ reduction in complexity

Theory: the SA variant of LSTD does not impact the overall rate of convergence

Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!

¹ Sutton et al. (2009) A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS
² Sutton et al. (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML
³ Geramifard et al. (2007) iLSTD: Eligibility traces and convergence analysis. In: NIPS


SLIDE 12

Fast LSTD using SA

Fast LSTD using Stochastic Approximation

Pick $i_n$ uniformly in $\{1, \ldots, T\}$ (random sampling), then update $\theta_n$ using $(s_{i_n}, r_{i_n}, s'_{i_n})$ (SA update)

Update rule: $\theta_n = \theta_{n-1} + \gamma_n \left( r_{i_n} + \beta\theta_{n-1}^T \phi(s'_{i_n}) - \theta_{n-1}^T \phi(s_{i_n}) \right) \phi(s_{i_n})$

A fixed-point iteration with step-sizes $\gamma_n$. Complexity: $O(d)$ per iteration
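The update rule above can be sketched in a few lines (a hedged NumPy illustration; the function name and the concrete step-size constant are my assumptions):

```python
import numpy as np

def flstd_sa(phi, phi_next, rewards, beta, n_iters, c=1.0, seed=0):
    """fLSTD-SA sketch: randomized O(d)-per-step fixed-point iteration.

    phi, phi_next: (T, d) features for s_i and s'_i; rewards: (T,).
    """
    rng = np.random.default_rng(seed)
    T, d = phi.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(T)                        # i_n ~ U({1,...,T})
        gamma = (1.0 - beta) * c / (2.0 * (c + n)) # step-size from the slides
        td_err = rewards[i] + beta * theta @ phi_next[i] - theta @ phi[i]
        theta = theta + gamma * td_err * phi[i]    # O(d) update
    return theta
```

Each iteration touches a single transition, so the total cost over $n$ iterations is $O(dn)$ rather than the $O(d^2 T)$ of batch LSTD.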


SLIDE 14

Fast LSTD using SA

Assumptions

Setting: Given dataset $\mathcal{D} := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$

(A1) $\|\phi(s_i)\|_2 \le 1$ (bounded features)

(A2) $|r_i| \le R_{\max} < \infty$ (bounded rewards)

(A3) $\lambda_{\min}\left( \frac{1}{T} \sum_{i=1}^{T} \phi(s_i)\phi(s_i)^T \right) \ge \mu$ (the covariance matrix has a minimum eigenvalue $\mu$)
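The assumptions (A1)-(A3) can be checked numerically on a given dataset before running fLSTD-SA; a small sketch (the function name and tolerance are mine):

```python
import numpy as np

def check_lstd_assumptions(phi, rewards, r_max):
    """Verify (A1)-(A3): bounded feature norms, bounded rewards, and a
    positive minimum eigenvalue mu of the empirical covariance matrix."""
    a1 = bool(np.all(np.linalg.norm(phi, axis=1) <= 1.0 + 1e-12))
    a2 = bool(np.all(np.abs(rewards) <= r_max))
    cov = phi.T @ phi / phi.shape[0]
    mu = float(np.linalg.eigvalsh(cov)[0])   # smallest eigenvalue
    return a1, a2, mu
```

The returned `mu` is the quantity that enters the step-size condition and the convergence-rate constants on the following slides.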


SLIDE 18

Fast LSTD using SA

Convergence Rate

Step-size choice: $\gamma_n = \frac{(1-\beta)c}{2(c+n)}$, with $(1-\beta)^2 \mu c \in (1.33, 2)$

Bound in expectation: $\mathbb{E}\left\| \theta_n - \hat{\theta}_T \right\|_2 \le \frac{K_1(n)}{\sqrt{n+c}}$

High-probability bound: $P\left( \left\| \theta_n - \hat{\theta}_T \right\|_2 \le \frac{K_2(n)}{\sqrt{n+c}} \right) \ge 1 - \delta$

By iterate averaging, the dependency of $c$ on $\mu$ can be removed


SLIDE 22

Fast LSTD using SA

The constants

$K_1(n) = \frac{\sqrt{c}\, \left\| \theta_0 - \hat{\theta}_T \right\|_2}{n^{((1-\beta)^2\mu c - 1)/2}} + \frac{(1-\beta)\, c\, h^2(n)}{2},$

$K_2(n) = (1-\beta)\, c\, \sqrt{\frac{\log \delta^{-1}}{2}} \sqrt{\frac{4}{3(1-\beta)^2\mu c - 1}} + K_1(n),$

where $h(k) := (1 + R_{\max} + \beta)^2 \max\left\{ \frac{\left( \left\| \theta_0 - \hat{\theta}_T \right\|_2 + \ln n + \left\| \hat{\theta}_T \right\|_2 \right)^2}{4},\ 1 \right\}$

Both $K_1(n)$ and $K_2(n)$ are $O(1)$

SLIDE 23

Fast LSTD using SA

Iterate Averaging

Bigger step-size + averaging: $\gamma_n := \frac{(1-\beta)}{2} \left( \frac{c}{c+n} \right)^{\alpha}$, with $\bar{\theta}_{n+1} := (\theta_1 + \cdots + \theta_n)/n$

Bound in expectation: $\mathbb{E}\left\| \bar{\theta}_n - \hat{\theta}_T \right\|_2 \le \frac{K_1^{IA}(n)}{(n+c)^{\alpha/2}}$

High-probability bound: $P\left( \left\| \bar{\theta}_n - \hat{\theta}_T \right\|_2 \le \frac{K_2^{IA}(n)}{(n+c)^{\alpha/2}} \right) \ge 1 - \delta$

The dependency of $c$ on $\mu$ is removed, at the cost of $(1-\alpha)/2$ in the rate.


SLIDE 27

Fast LSTD using SA

The constants

$K_1^{IA}(n) := \frac{C\, \left\| \theta_0 - \hat{\theta}_T \right\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{h(n)\, c^\alpha (1-\beta)}{\left( \mu c^\alpha (1-\beta)^2 \right)^{\frac{1+2\alpha}{2(1-\alpha)}}}, \quad \text{and}$

$K_2^{IA}(n) := \frac{\sqrt{\log \delta^{-1}}}{\mu(1-\beta)} \left( 3\alpha + \frac{\mu c^\alpha (1-\beta)^2 + 2\alpha}{\alpha} \right)^{\frac{1}{2}} \frac{1}{(n+c)^{(1-\alpha)/2}} + K_1^{IA}(n).$

As before, both $K_1^{IA}(n)$ and $K_2^{IA}(n)$ are $O(1)$

SLIDE 28

Fast LSTD using SA

Performance bounds

True value function $v$; approximate value function $\tilde{v}_n := \Phi\theta_n$

$\left\| v - \tilde{v}_n \right\|_T \le \underbrace{\frac{\left\| v - \Pi v \right\|_T}{\sqrt{1 - \beta^2}}}_{\text{approximation error}} + \underbrace{O\left( \sqrt{\frac{d}{(1-\beta)^2 \mu T}} \right)}_{\text{estimation error}} + \underbrace{O\left( \sqrt{\frac{1}{(1-\beta)^2 \mu^2 n} \ln\frac{1}{\delta}} \right)}_{\text{computational error}}$

¹ $\|f\|_T^2 := T^{-1} \sum_{i=1}^{T} f(s_i)^2$, for any function $f$.
² Lazaric, A., Ghavamzadeh, M., Munos, R. (2012) Finite-sample analysis of least-squares policy iteration. In: JMLR

SLIDE 29

Fast LSTD using SA

Performance bounds

The approximation and estimation errors are artifacts of function approximation and least squares methods; the computational error is a consequence of using SA for LSTD.

Setting $n = \ln(1/\delta)\, T/(d\mu)$, the convergence rate is unaffected!


SLIDE 32

Fast LSPI using SA

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 33

Fast LSPI using SA

LSPI - A Quick Recap

Policy evaluation computes the Q-value $Q^\pi$; policy improvement returns a policy $\pi$:

$Q^\pi(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, a_0 = a \right], \qquad \pi'(s) = \arg\max_{a \in A} \theta^T \phi(s, a)$


SLIDE 35

Fast LSPI using SA

Policy Evaluation: LSTDQ and its SA variant

Given a set of samples $\mathcal{D} := \{(s_i, a_i, r_i, s'_i),\ i = 1, \ldots, T\}$

LSTDQ approximates $Q^\pi$ by $\hat{\theta}_T = \bar{A}_T^{-1} \bar{b}_T$, where

$\bar{A}_T = \frac{1}{T} \sum_{i=1}^{T} \phi(s_i, a_i)\left( \phi(s_i, a_i) - \beta\phi(s'_i, \pi(s'_i)) \right)^T, \quad \text{and} \quad \bar{b}_T = \frac{1}{T} \sum_{i=1}^{T} r_i\, \phi(s_i, a_i).$

Fast LSTDQ using SA: $\theta_k = \theta_{k-1} + \gamma_k \left( r_{i_k} + \beta\theta_{k-1}^T \phi(s'_{i_k}, \pi(s'_{i_k})) - \theta_{k-1}^T \phi(s_{i_k}, a_{i_k}) \right) \phi(s_{i_k}, a_{i_k})$


SLIDE 37

Fast LSPI using SA

Fast LSPI using SA (fLSPI-SA)

Input: sample set $\mathcal{D} := \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{T}$

repeat
  Policy evaluation: for $k = 1$ to $\tau$:
    • Get random sample index: $i_k \sim U(\{1, \ldots, T\})$
    • Update fLSTD-SA iterate $\theta_k$
  $\theta' \leftarrow \theta_\tau$, $\Delta = \|\theta - \theta'\|_2$
  Policy improvement: obtain a greedy policy $\pi'(s) = \arg\max_{a \in A} \theta'^T \phi(s, a)$
  $\theta \leftarrow \theta'$, $\pi \leftarrow \pi'$
until $\Delta < \epsilon$
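The loop above can be sketched as follows (a hedged Python illustration; the feature function `phi`, the concrete step-size, and the iteration caps are my assumptions, not from the slides):

```python
import numpy as np

def flspi_sa(D, phi, n_actions, beta, tau, eps, max_iters=50, seed=0):
    """fLSPI-SA sketch: alternate randomized policy evaluation (fLSTD-SA
    on Q-features) with greedy improvement until the change falls below eps.

    D: list of (s, a, r, s_next) tuples; phi(s, a) -> (d,) feature vector.
    """
    rng = np.random.default_rng(seed)
    d = phi(*D[0][:2]).shape[0]
    theta = np.zeros(d)
    for _ in range(max_iters):
        frozen = theta.copy()   # greedy policy w.r.t. the current parameter
        def pi(s):
            return int(np.argmax([frozen @ phi(s, a) for a in range(n_actions)]))
        th = theta.copy()
        for k in range(1, tau + 1):                  # policy evaluation
            s, a, r, s2 = D[rng.integers(len(D))]    # i_k ~ U({1,...,T})
            gamma = (1.0 - beta) / (2.0 + k)         # an assumed step-size
            td = r + beta * th @ phi(s2, pi(s2)) - th @ phi(s, a)
            th = th + gamma * td * phi(s, a)         # O(d) update
        delta = float(np.linalg.norm(theta - th))    # Delta = ||theta - theta'||_2
        theta = th                                   # policy improvement
        if delta < eps:
            break
    return theta
```

Freezing the policy parameter for each evaluation round mirrors the repeat/until structure of the pseudocode: each round evaluates one fixed greedy policy before improving it.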


SLIDE 39

Experiments - Traffic Signal Control

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 40

Experiments - Traffic Signal Control

The traffic control problem

SLIDE 41

Experiments - Traffic Signal Control

Simulation Results on 7x9-grid network

[Figure: tracking error $\|\theta_k - \hat{\theta}_T\|_2$ vs. step $k$ of fLSTD-SA]

[Figure: throughput (TAR) vs. time steps, for LSPI and fLSPI-SA]

SLIDE 42

Experiments - Traffic Signal Control

Runtime Performance on three road networks

Network                  LSPI runtime (ms)   fLSPI-SA runtime (ms)
7x9-Grid (d = 504)       4,917               66
14x9-Grid (d = 1008)     30,144              159
14x18-Grid (d = 2016)    1.91 · 10^5         287

SLIDE 43

Extension to Least Squares Regression

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 44

Extension to Least Squares Regression

Complexity of Ordinary Least Squares (OLS)

Figure: a typical ML algorithm using regression (choose $x_n$, observe $y_n$, estimate $\hat{\theta}_n$)

OLS Complexity

$O(d^2)$ using the Sherman-Morrison lemma, or $O(d^{2.807})$ using the Strassen algorithm, or $O(d^{2.375})$ using the Coppersmith-Winograd algorithm

Problem: a news feed platform has high-dimensional features ($d \sim 10^5$) ⇒ solving OLS is computationally costly
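The $O(d^2)$ figure refers to maintaining the inverse incrementally: the Sherman-Morrison lemma updates $A_n^{-1}$ after each rank-one addition $x x^T$ without re-inverting. A minimal sketch (the function name is mine):

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return (A + x x^T)^{-1} given A^{-1}, in O(d^2) operations."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
```

Even with this trick, each new sample still costs $O(d^2)$, which motivates the $O(d)$ gradient-descent alternative on the next slide.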

SLIDE 45

Extension to Least Squares Regression

Fast GD for OLS

Pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling), then update $\theta_n$ using $(x_{i_n}, y_{i_n})$ (GD update)

Solution: use fast (online) gradient descent (GD)

Efficient, with a complexity of only $O(d)$ (well known)

High-probability bounds with explicit constants can be derived (not fully known)

SLIDE 46

Extension to Least Squares Regression

A linear bandit algorithm

Choose $x_n := \arg\max_{x \in \mathcal{D}} \mathrm{UCB}(x)$, observe a reward $y_n$ s.t. $\mathbb{E}[y_n \mid x_n] = x_n^T \theta^*$, then estimate the UCBs

OLS is used to compute $\mathrm{UCB}(x) := x^T \hat{\theta}_n + \alpha \sqrt{x^T A_n^{-1} x}$
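The UCB computation can be sketched as follows (an illustrative NumPy snippet; `candidates` holds the arms' feature vectors as rows, and all names are my assumptions):

```python
import numpy as np

def ucb_scores(candidates, theta_hat, A_inv, alpha):
    """UCB(x) = x^T theta_hat + alpha * sqrt(x^T A_n^{-1} x), per candidate row."""
    mean = candidates @ theta_hat
    # x_i^T A_inv x_i for every row i, without forming the full product
    width = np.sqrt(np.einsum('ij,jk,ik->i', candidates, A_inv, candidates))
    return mean + alpha * width
```

The chosen arm is then `x_n = candidates[np.argmax(ucb_scores(...))]`, matching the argmax on the slide.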


SLIDE 50

Extension to Least Squares Regression

Fast GD

Pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling), then update $\theta_n$ using $(x_{i_n}, y_{i_n})$ (GD update)

$\theta_n = \theta_{n-1} + \gamma_n \left( y_{i_n} - \theta_{n-1}^T x_{i_n} \right) x_{i_n}$

with step-sizes $\gamma_n$ and the sample gradient $\left( y_{i_n} - \theta_{n-1}^T x_{i_n} \right) x_{i_n}$
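The update above is stochastic gradient descent on the least-squares objective; a minimal sketch (the step-size follows the $c/(2(c+n))$ choice from the error-bound slide, the other names are mine):

```python
import numpy as np

def fast_gd_ols(X, y, n_iters, c=1.0, seed=0):
    """Fast (online) GD for least squares: randomized O(d)-per-step updates."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(n_samples)      # i_n ~ U({1,...,n})
        gamma = c / (2.0 * (c + n))      # step-size from the error-bound slide
        theta = theta + gamma * (y[i] - theta @ X[i]) * X[i]  # sample gradient
    return theta
```

On noiseless data with well-conditioned features, the iterate approaches the OLS solution at the $O(n^{-1/2})$ rate quoted on the error-bound slide.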


SLIDE 53

Extension to Least Squares Regression

Assumptions

Setting: $y_n = x_n^T \theta^* + \xi_n$, where $\xi_n$ is i.i.d. zero-mean noise

(A1) $\sup_n \|x_n\|_2 \le 1$ (bounded features)

(A2) $|\xi_n| \le 1, \forall n$ (bounded noise)

(A3) $\lambda_{\min}\left( \frac{1}{T} \sum_{i=1}^{T} x_i x_i^T \right) \ge \mu$ (strongly convex objective via the covariance matrix)


SLIDE 57

Extension to Least Squares Regression

Error bound

With $\gamma_n = \frac{c}{2(c+n)}$ and $\mu c \in (1.33, 2)$, we have:

Bound in expectation: $\mathbb{E}\left\| \theta_n - \hat{\theta}_n \right\|_2 \le \frac{K_1^{LS}}{\sqrt{n+c}}$

High-probability bound: for any $\delta > 0$, $P\left( \left\| \theta_n - \hat{\theta}_n \right\|_2 \le \frac{K_2^{LS}}{\sqrt{n+c}} \right) \ge 1 - \delta$, the optimal rate $O(n^{-1/2})$

¹ By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

SLIDE 58

Experiments - News Recommendation

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 59

Experiments - News Recommendation

Dilbert’s boss on news recommendation (and ML)

SLIDE 60

Experiments - News Recommendation

Application to Bandits¹

Fast LinUCB. LinUCB is a well-known contextual bandit algorithm that employs OLS in each iteration; fast GD provides a good approximation to OLS (at low computational cost) in each iteration of LinUCB.

Experiments: LinUCB + fast GD on the Yahoo! news recommendation dataset²

¹ Thanks to Jérémie Mary and Olivier Nicol for help with the framework (ICML 2012 challenge)
² Yahoo! Webscope dataset (2011)


SLIDE 64

Experiments - News Recommendation

Simulation Results

[Figure: tracking error $\|\theta_k - \hat{\theta}_T\|_2$ vs. step $k$ of fLS-SA]

Runtimes (ms) over four days of the dataset:

Day   LinUCB        fLinUCB-SA
2     1.32 · 10^6   32,444
3     1.49 · 10^6   35,325
4     1.11 · 10^6   26,335
5     6.03 · 10^5   14,264

SLIDE 65

Proof outline

Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline

SLIDE 66

Proof outline

Proof Outline

Let $z_n = \theta_n - \hat{\theta}_T$. Then, first bound the deviation of this error from its mean:

$P\left( \|z_n\|_2 - \mathbb{E}\|z_n\|_2 \ge \epsilon \right) \le \exp\left( -\frac{\epsilon^2}{2 \sum_{i=1}^{n} L_i^2} \right), \quad \forall \epsilon > 0,$

and bound the size of the mean itself:

$\mathbb{E}\|z_n\|_2 \le \underbrace{\exp\left( -(1-\beta)\mu\Gamma_n \right) \|z_0\|_2}_{\text{initial error}} + \underbrace{\left[ \sum_{k=1}^{n-1} h(k)\, \gamma_{k+1}^2 \exp\left( -2(1-\beta)\mu(\Gamma_n - \Gamma_{k+1}) \right) \right]^{\frac{1}{2}}}_{\text{sampling error}},$


SLIDE 68

Proof outline

Proof Outline: High Probability Bound

Step 1 (Error decomposition):

$\|z_n\|_2 - \mathbb{E}\|z_n\|_2 = \sum_{i=1}^{n} \left( g_i - \mathbb{E}[g_i \mid \mathcal{F}_{i-1}] \right) = \sum_{i=1}^{n} D_i,$

where $D_i := g_i - \mathbb{E}[g_i \mid \mathcal{F}_{i-1}]$, $g_i := \mathbb{E}[\|z_n\|_2 \mid \theta_i]$, and $\mathcal{F}_i = \sigma(\theta_1, \ldots, \theta_i)$.

Step 2 (Lipschitz continuity): the functions $g_i$ are Lipschitz continuous with Lipschitz constants $L_i$.

Step 3 (Concentration inequality):

$P\left( \|z_n\|_2 - \mathbb{E}\|z_n\|_2 \ge \epsilon \right) = P\left( \sum_{i=1}^{n} D_i \ge \epsilon \right) \le \exp(-\lambda\epsilon) \exp\left( \frac{\alpha\lambda^2}{2} \sum_{i=1}^{n} L_i^2 \right).$


SLIDE 71

Proof outline

Proof Outline: Bound in Expectation

Let $f_n(\theta) := \left( \theta^T \phi(s_{i_n}) - (r_{i_n} + \beta\theta^T \phi(s'_{i_n})) \right) \phi(s_{i_n})$ and $F(\theta) := \mathbb{E}_{i_n}(f_n(\theta))$. Then

$z_n = \theta_n - \hat{\theta}_T = \theta_{n-1} - \hat{\theta}_T - \gamma_n \left( F(\theta_{n-1}) - \Delta M_n \right).$

Unrolling the above, noting $F(\hat{\theta}_T) = 0$ and taking expectations, we obtain:

$\mathbb{E}\|z_n\|_2 \le \left( \mathbb{E}\langle z_n, z_n \rangle \right)^{\frac{1}{2}} = \left[ \mathbb{E}\left\| \Pi_n z_0 \right\|_2^2 + \sum_{k=1}^{n} \gamma_k^2\, \mathbb{E}\left\| \Pi_n \Pi_k^{-1} \Delta M_k \right\|_2^2 \right]^{\frac{1}{2}},$

where $\bar{A}_n = \frac{1}{n} \sum_{i=1}^{n} \phi(s_i)\left( \phi(s_i) - \beta\phi(s'_i) \right)^T$ and $\Pi_n := \prod_{k=1}^{n} \left( I - \gamma_k \bar{A}_k \right)$.

The rest of the proof amounts to bounding each of the terms on the RHS above.


SLIDE 73

For Further Reading

References I

Prashanth L.A., Nathaniel Korda and Rémi Munos, Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML, 2014.

Nathaniel Korda, Prashanth L.A. and Rémi Munos, Fast gradient descent for drifting least squares regression, with application to bandits. AAAI, 2015.