

slide-1
SLIDE 1

"It is hard to predict, especially about the future." (Niels Bohr)

"You are what you pretend to be, so be careful what you pretend to be." (Kurt Vonnegut)

Prashanth L A Convergence rate of TD(0) March 27, 2015 1 / 84

slide-2
SLIDE 2

Convergence rate of TD(0) with function approximation

Prashanth L A†

Joint work with Nathaniel Korda♯ and Rémi Munos∗

†Indian Institute of Science, ♯MLRG, Oxford University, ∗Google DeepMind

March 27, 2015


slide-3
SLIDE 3

Background


slide-4
SLIDE 4

Background

Markov Decision Processes (MDPs)

MDP: set of states S, set of actions A, rewards r(s, a).

Transition probability: p(s, a, s′) = Pr{s_{t+1} = s′ | s_t = s, a_t = a}.

Figure: a timeline of states, actions and rewards (s_t, a_t, r_{t+1}, s_{t+1}).

slide-5
SLIDE 5

Background

The Controlled Markov Property

Controlled Markov property: ∀ i₀, i₁, . . . , s, s′ and b₀, b₁, . . . , a,

P(s_{t+1} = s′ | s_t = s, a_t = a, . . . , s₀ = i₀, a₀ = b₀) = p(s, a, s′)

Figure: The Controlled Markov Behaviour

slide-6
SLIDE 6

Background

Value function

Vπ(s) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s₀ = s, π ]

(the expected discounted reward under policy π, starting from state s)

Vπ is the fixed point of the Bellman operator Tπ:

Tπ(V)(s) := r(s, π(s)) + β Σ_{s′} p(s, π(s), s′) V(s′)
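The fixed-point property can be sketched numerically; the 2-state chain below (transition matrix, rewards, discount) is a made-up toy, not from the talk:

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy pi (all numbers illustrative).
P = np.array([[0.9, 0.1],    # p(s, pi(s), s') for s = 0
              [0.2, 0.8]])   # and for s = 1
r = np.array([1.0, 0.0])     # r(s, pi(s))
beta = 0.9                   # discount factor

def bellman(V):
    # T_pi(V)(s) = r(s, pi(s)) + beta * sum_{s'} p(s, pi(s), s') V(s')
    return r + beta * P @ V

# Iterating T_pi converges to its fixed point V_pi (T_pi is a beta-contraction).
V = np.zeros(2)
for _ in range(1000):
    V = bellman(V)

# Closed form for comparison: V_pi = (I - beta P)^{-1} r.
V_exact = np.linalg.solve(np.eye(2) - beta * P, r)
assert np.allclose(V, V_exact, atol=1e-6)
```

Since Tπ is a β-contraction, the iteration converges geometrically at rate β.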


slide-11
SLIDE 11

Background

Policy evaluation using TD

Temporal difference learning
Problem: estimate the value function of a given policy π.
Solution: use TD(0):

V_{t+1}(s_t) = V_t(s_t) + α_t ( r_{t+1} + β V_t(s_{t+1}) − V_t(s_t) )

Why TD(0)?
  • Simulation-based, like Monte Carlo (no model necessary!)
  • Updates a guess based on another guess (like DP)
  • Guaranteed convergence to the value function Vπ(s) under standard assumptions
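The tabular TD(0) update can be sketched on the same kind of toy chain (transition probabilities, rewards, and the step-size schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # p(s, pi(s), s'), illustrative
r = np.array([1.0, 0.0])                # r(s, pi(s))
beta = 0.9                              # discount

V = np.zeros(2)
s = 0
for t in range(200_000):
    alpha_t = 1000.0 / (1000.0 + t)     # diminishing (Robbins-Monro) step-size
    s_next = rng.choice(2, p=P[s])
    # TD(0): move the guess V(s_t) toward another guess r_{t+1} + beta V(s_{t+1}).
    V[s] += alpha_t * (r[s] + beta * V[s_next] - V[s])
    s = s_next

V_exact = np.linalg.solve(np.eye(2) - beta * P, r)
assert np.linalg.norm(V - V_exact) < 1.0  # stochastic estimate, loose tolerance
```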

slide-12
SLIDE 12

Background

Policy evaluation using TD

Temporal difference learning Problem: estimate the value function for a given policy π Solution: Use TD(0) Vt+1(st) = Vt(st) + αt (rt+1 + γVt(st+1) − Vt(st)) . Why TD(0)? Simulation based algorithms like Monte-Carlo (no model necessary!) Update a guess based on another guess (like DP) Guaranteed convergence to value function Vπ(s) under standard assumptions

Prashanth L A Convergence rate of TD(0) March 27, 2015 7 / 84

slide-13
SLIDE 13

Background

TD with Function Approximation

Linear function approximation: Vπ(s) ≈ θᵀφ(s)

  • Parameter θ ∈ R^d, feature φ(s) ∈ R^d; note d ≪ |S|
  • Feature matrix Φ, with rows φ(s)ᵀ, ∀s ∈ S

TD fixed point: Φθ∗ = Π Tπ(Φθ∗), where Π is the orthogonal projection onto B = {Φθ | θ ∈ R^d}
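The projection Π in the fixed-point equation can be sketched as follows (the feature matrix and state weights below are hypothetical, chosen only to illustrate Π):

```python
import numpy as np

# Hypothetical feature matrix Phi (rows phi(s)^T), with d = 2 << |S| = 5.
Phi = np.array([[1.0, 0.0],
                [0.8, 0.2],
                [0.5, 0.5],
                [0.2, 0.8],
                [0.0, 1.0]])
psi = np.full(5, 0.2)        # stationary weights (uniform here, illustrative)
D = np.diag(psi)

# Orthogonal projection (w.r.t. the Psi-weighted norm) onto B = {Phi theta}.
Proj = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V = np.array([1.0, 2.0, 0.5, -1.0, 3.0])  # an arbitrary value vector
assert np.allclose(Proj @ (Proj @ V), Proj @ V)        # Pi is idempotent
theta = np.array([3.0, -1.0])
assert np.allclose(Proj @ (Phi @ theta), Phi @ theta)  # Pi fixes elements of B
```

The solve uses ΦᵀDΦ, which is invertible exactly when Φ has full column rank, the linear-independence assumption stated later in the deck.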


slide-15
SLIDE 15

Background

TD(0) with function approximation

θ_{n+1} = θ_n + γ_n ( r(s_n, π(s_n)) + β θ_nᵀφ(s_{n+1}) − θ_nᵀφ(s_n) ) φ(s_n)

A fixed-point iteration with step-sizes γ_n.

  • Tsitsiklis and Van Roy (1997)¹ show that θ_n → θ∗ a.s., where Aθ∗ = b, with A = ΦᵀΨ(I − βP)Φ and b = ΦᵀΨr.

¹ J. N. Tsitsiklis and B. Van Roy (1997). "An analysis of temporal-difference learning with function approximation." In: IEEE Transactions on Automatic Control.
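A sketch of this update on a toy chain with hypothetical features (all constants are illustrative; the comparison point θ∗ solves Aθ∗ = b as on the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])    # illustrative chain
r = np.array([1.0, 0.0])
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])  # hypothetical features, full column rank

theta = np.zeros(2)
s = 0
for n in range(200_000):
    gamma_n = 0.5 * 1000.0 / (1000.0 + n)   # illustrative diminishing step-size
    s_next = rng.choice(2, p=P[s])
    td = r[s] + beta * Phi[s_next] @ theta - Phi[s] @ theta
    theta += gamma_n * td * Phi[s]
    s = s_next

# Tsitsiklis-Van Roy limit: A theta* = b, A = Phi^T Psi (I - beta P) Phi, b = Phi^T Psi r.
psi = np.array([2 / 3, 1 / 3])              # stationary distribution of P
A = Phi.T @ np.diag(psi) @ (np.eye(2) - beta * P) @ Phi
b = Phi.T @ np.diag(psi) @ r
theta_star = np.linalg.solve(A, b)
assert np.linalg.norm(theta - theta_star) < 1.0
```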

slide-16
SLIDE 16

Background

Assumptions

  • Ergodicity: the Markov chain induced by the policy π is irreducible and aperiodic. Moreover, there exists a stationary distribution Ψ (= Ψπ) for this Markov chain.
  • Linear independence: the feature matrix Φ has full column rank ⇒ λ_min(ΦᵀΨΦ) ≥ µ > 0.
  • Bounded rewards: |r(s, π(s))| ≤ 1, for all s ∈ S.
  • Bounded features: ‖φ(s)‖₂ ≤ 1, for all s ∈ S.

slide-17
SLIDE 17

Background

Assumptions (contd.)

  • Step-sizes satisfy Σ_n γ_n = ∞ and Σ_n γ_n² < ∞.
  • Bounded mixing time: ∃ a non-negative function B(·) such that, ∀s₀ ∈ S and m ≥ 0,

Σ_{τ=0}^∞ ‖E(φ(s_τ) | s₀) − E_Ψ(φ(s_τ))‖ ≤ B(s₀),
Σ_{τ=0}^∞ ‖E[φ(s_τ)φ(s_{τ+m})ᵀ | s₀] − E_Ψ[φ(s_τ)φ(s_{τ+m})ᵀ]‖ ≤ B(s₀),

where B(·) satisfies: for any q > 1, there exists a K_q < ∞ such that E[B^q(s) | s₀] ≤ K_q B^q(s₀).

slide-18
SLIDE 18

Concentration bounds: Non-averaged case

"In the long run we are all dead." (John Maynard Keynes)

Question: what happens in a short run of TD(0) with function approximation?

slide-19
SLIDE 19

Concentration bounds: Non-averaged case

Concentration Bounds: Non-averaged TD(0)


slide-20
SLIDE 20

Concentration bounds: Non-averaged case

Non-averaged case: Bound in expectation

Step-size choice: γ_n = c / (2(c + n)), with (1 − β)²µc > 1/2.

Bound in expectation:

E‖θ_n − θ∗‖₂ ≤ K₁(n) / √(n + c), where

K₁(n) = 2√c ‖θ₀ − θ∗‖₂ / (n + c)^{2(1−β)²µc − 1/2} + c(1 − β)(3 + 6H) B(s₀) / ( 2(1 − β)²µc − 1 )

H is an upper bound on ‖θ_n‖₂, for all n.


slide-22
SLIDE 22

Concentration bounds: Non-averaged case

Non-averaged case: High-probability bound

Step-size choice: γ_n = c / (2(c + n)), with ( µ(1 − β)/2 + 3B(s₀) ) c > 1.

High-probability bound:

P( ‖θ_n − θ∗‖₂ ≤ K₂(n) / √(n + c) ) ≥ 1 − δ, where

K₂(n) := (1 − β)c √(ln(1/δ)) (1 + 9B(s₀)²) / ( (µ(1 − β)/2 + 3B(s₀)²)c − 1 ) + K₁(n)

K₁(n) and K₂(n) above are O(1).


slide-25
SLIDE 25

Concentration bounds: Non-averaged case

Why are these bounds problematic?

Obtaining the optimal rate O(1/√n) with a step-size γ_n = c/(c + n):

  • In expectation: requires c chosen such that (1 − β)²µc ∈ (1/2, ∞).
  • In high probability: c should satisfy ( µ(1 − β)/2 + 3B(s₀) ) c > 1.

The optimal rate therefore requires knowledge of the mixing bound B(s₀). Even in finite state-space settings, B(s₀) is a constant, albeit one that depends on the transition dynamics!

Solution: iterate averaging.


slide-27
SLIDE 27

Concentration bounds: Non-averaged case

Proof Outline

Let z_n = θ_n − θ∗ and Γ_n := Σ_{k=1}^n γ_k. We first bound the deviation of this error from its mean:

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) ≤ exp( −ε² / ( 2 Σ_{i=1}^n L_i² ) ), ∀ε > 0,

and then bound the size of the mean itself:

E‖z_n‖₂ ≤ [ 2 exp(−(1 − β)µΓ_n) ‖z₀‖₂²   (initial error)
  + Σ_{k=1}^{n−1} (3 + 6H)² B(s₀)² γ_{k+1}² exp(−2(1 − β)µ(Γ_n − Γ_{k+1}))   (sampling and mixing error) ]^{1/2}

Note that L_i := γ_i [ Π_{j=i+1}^n ( 1 − 2γ_j ( µ(1 − β) − γ_j/2 ) ) + (1 + β(3 − β)) B(s₀) ]^{1/2}.


slide-29
SLIDE 29

Concentration bounds: Non-averaged case

Proof Outline: Bound in Expectation

Let f_{X_n}(θ) := [ r(s_n, π(s_n)) + β θᵀφ(s_{n+1}) − θᵀφ(s_n) ] φ(s_n). Then the TD update is equivalent to

θ_{n+1} = θ_n + γ_n [ E_Ψ(f_{X_n}(θ_n)) + ε_n + ΔM_n ]   (1)

  • Mixing error: ε_n := E(f_{X_n}(θ_n) | s₀) − E_Ψ(f_{X_n}(θ_n))
  • Martingale sequence: ΔM_n := f_{X_n}(θ_n) − E(f_{X_n}(θ_n) | s₀)

Unrolling (1), we obtain:

z_{n+1} = (I − γ_n A) z_n + γ_n (ε_n + ΔM_n) = Π_n z₀ + Σ_{k=1}^n γ_k Π_n Π_k^{−1} (ε_k + ΔM_k)

Here A := ΦᵀΨ(I − βP)Φ and Π_n := Π_{k=1}^n (I − γ_k A).


slide-31
SLIDE 31

Concentration bounds: Non-averaged case

Proof Outline: Bound in Expectation (contd.)

z_{n+1} = (I − γ_n A) z_n + γ_n (ε_n + ΔM_n) = Π_n z₀ + Σ_{k=1}^n γ_k Π_n Π_k^{−1} (ε_k + ΔM_k)

By Jensen's inequality, we obtain

E(‖z_n‖₂ | s₀) ≤ ( E(⟨z_n, z_n⟩ | s₀) )^{1/2}
  ≤ [ 2 ‖Π_n z₀‖₂² + 3 Σ_{k=1}^n γ_k² ‖Π_n Π_k^{−1}‖₂² E( ‖ε_k‖₂² | s₀ ) + 2 Σ_{k=1}^n γ_k² ‖Π_n Π_k^{−1}‖₂² E( ‖ΔM_k‖₂² | s₀ ) ]^{1/2}

The rest of the proof amounts to bounding each of the terms on the RHS above.

slide-32
SLIDE 32

Concentration bounds: Non-averaged case

Proof Outline: High-Probability Bound

Recall z_n = θ_n − θ∗.

Step 1: (Error decomposition)

‖z_n‖₂ − E‖z_n‖₂ = Σ_{i=1}^n ( g_i − E[g_i | F_{i−1}] ) = Σ_{i=1}^n D_i,

where D_i := g_i − E[g_i | F_{i−1}], g_i := E[‖z_n‖₂ | θ_i], and F_i = σ(θ₁, . . . , θ_i).

Step 2: (Lipschitz continuity)

The functions g_i are Lipschitz continuous with Lipschitz constants L_i.

Step 3: (Concentration inequality)

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) = P( Σ_{i=1}^n D_i ≥ ε ) ≤ exp(−λε) exp( (αλ²/2) Σ_{i=1}^n L_i² ).


slide-35
SLIDE 35

Concentration bounds: Iterate Averaging

Concentration Bounds: Iterate Averaged TD(0)


slide-36
SLIDE 36

Concentration bounds: Iterate Averaging

Polyak-Ruppert averaging: Bound in expectation

Bigger step-size + averaging:

γ_n := ((1 − β)/2) (c/(c + n))^α,  θ̄_{n+1} := (θ₁ + . . . + θ_n)/n,

with α ∈ (1/2, 1) and c > 0.

Bound in expectation:

E‖θ̄_n − θ∗‖₂ ≤ K₁^IA(n) / (n + c)^{α/2}, where

K₁^IA(n) := (1 + 9B(s₀)²) ‖θ₀ − θ∗‖₂ / (n + c)^{(1−α)/2} + 2β(1 − β)c^α H B(s₀) / ( µ c^α (1 − β)² )^{(1+2α)/(2(1−α))}
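The averaging scheme can be sketched on a generic linear stochastic-approximation problem standing in for the TD iterates (the matrix A, the noise level, and the constants c, α are illustrative assumptions, not the talk's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.2], [0.0, 0.5]])   # positive-stable system matrix (illustrative)
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)       # target of the iteration

alpha, c = 0.75, 100.0                   # bigger step-size: gamma_n = (c/(c+n))**alpha
theta = np.zeros(2)
theta_bar = np.zeros(2)                  # Polyak-Ruppert average of the iterates
for n in range(1, 100_001):
    gamma_n = (c / (c + n)) ** alpha
    noise = rng.normal(scale=0.1, size=2)
    theta = theta + gamma_n * (b - A @ theta + noise)  # noisy fixed-point step
    theta_bar += (theta - theta_bar) / n               # running average

# Despite the slowly decaying step-size, the average is close to theta_star.
assert np.linalg.norm(theta_bar - theta_star) < 0.1
```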


slide-38
SLIDE 38

Concentration bounds: Iterate Averaging

Iterate averaging: High-probability bound

Bigger step-size + averaging: γ_n := ((1 − β)/2)(c/(c + n))^α, θ̄_{n+1} := (θ₁ + . . . + θ_n)/n.

High-probability bound:

P( ‖θ̄_n − θ∗‖₂ ≤ K₂^IA(n) / (n + c)^{α/2} ) ≥ 1 − δ, where

K₂^IA(n) := (1 + 9B(s₀)²) [ ( µ(1 − β)/2 + B(s₀) ) c^α + 2(3α)^α ] / ( µ (1/2 + B(s₀)) (1 − β) n^{(1−α)/2} ) + K₁(n)


slide-40
SLIDE 40

Concentration bounds: Iterate Averaging

Iterate averaging: High-probability bound (contd.)

With the same bigger step-size γ_n := ((1 − β)/2)(c/(c + n))^α and averaging θ̄_{n+1} := (θ₁ + . . . + θ_n)/n:

α can be chosen arbitrarily close to 1, resulting in a rate O(1/√n).

slide-41
SLIDE 41

Concentration bounds: Iterate Averaging

Proof Outline

Let θ̄_{n+1} := (θ₁ + . . . + θ_n)/n and z_n = θ̄_{n+1} − θ∗. Then

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) ≤ exp( −ε² / ( 2 Σ_{i=1}^n L_i² ) ), ∀ε > 0,

where L_i := (γ_i/n) [ 1 + Σ_{l=i+1}^{n−1} Π_{j=i}^{l} ( 1 − 2γ_j ( µ(1 − β) − γ_j/2 ) ) + (1 + β(3 − β)) B(s₀) ].

With γ_n = (1 − β)(c/(c + n))^α, we obtain

Σ_{i=1}^n L_i² ≤ ( 2α / ( µ((1 − β)/2 + B(s₀)) ) + 5α )^α · 2 / ( µ² (1/2 + B(s₀))² (1 − β)² ) × 1/n,

so that, in particular, Σ_{i=1}^n L_i² = O(1/n).


slide-43
SLIDE 43

Concentration bounds: Iterate Averaging

Proof outline: Bound in expectation

To bound the expected error we directly average the errors of the non-averaged iterates:

E‖θ̄_{n+1} − θ∗‖₂ ≤ (1/n) Σ_{k=1}^n E‖θ_k − θ∗‖₂,

and then specialise to the choice of step-size γ_n = (1 − β)(c/(c + n))^α:

E‖θ̄_{n+1} − θ∗‖₂ ≤ (1 + 9B(s₀)) [ (1/n) Σ_{k=1}^∞ exp(−µc(k + c)^{1−α}) ‖θ₀ − θ∗‖₂ + 2βHc^α(1 − β) ( µc^α(1 − β) )^{−(1+2α)/(2(1−α))} (n + c)^{−α/2} ]

slide-44
SLIDE 44

Centered TD(0)

Centered TD (CTD)


slide-45
SLIDE 45

Centered TD(0)

The Variance Problem

Why does iterate averaging work?

  • In TD(0), each iterate introduces high variance, which must be controlled by the step-size choice.
  • Averaging the iterates reduces the variance of the final estimator.
  • The reduced variance allows for more exploration within the iterates through larger step-sizes.

slide-46
SLIDE 46

Centered TD(0)

A Control Variate Solution

Centering: another approach to variance reduction

  • Instead of averaging the iterates, one can use an average to guide the iterates.
  • Now all iterates are informed by their history.
  • Constructing this average in epochs allows a constant step-size choice.

slide-47
SLIDE 47

Centered TD(0)

Centering: The Idea

Recall that for TD(0),

θ_{n+1} = θ_n + γ_n ( r(s_n, π(s_n)) + β θ_nᵀφ(s_{n+1}) − θ_nᵀφ(s_n) ) φ(s_n) =: θ_n + γ_n f_n(θ_n),

and that θ_n → θ∗, the solution of F(θ) := Π Tπ(Φθ) − Φθ = 0.

Centering each iterate:

θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]   (*)

slide-48
SLIDE 48

Centered TD(0)

Centering: The Idea (contd.)

θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]   (*)

Why centering helps:

  • No updates after hitting θ∗.
  • An average guides the updates, resulting in low variance of the bracketed term (*).
  • Allows using a (large) constant step-size.
  • O(d) update, the same as TD(0).
  • Working with epochs ⇒ need to store only the averaged iterate θ̄_n and an estimate F̂(θ̄_n).

slide-49
SLIDE 49

Centered TD(0)

Centering: The Idea (contd.)

Centered update: θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]

Challenges compared to gradient descent with an accessible cost function:

  • F is unknown and inaccessible in our setting.
  • To prove convergence bounds, one has to cope with the error due to incomplete mixing.


slide-51
SLIDE 51

Centered TD(0)

Figure: the CTD loop. Simulation (take action π(s_n)), fixed-point update of θ_n via (2), and epoch-wise centering producing θ̄^(m), F̂^(m)(θ̄^(m)), then θ̄^(m+1), F̂^(m+1)(θ̄^(m+1)).

At the beginning of each epoch, an iterate θ̄^(m) is chosen uniformly at random from the previous epoch.

Epoch run: set θ_{mM} := θ̄^(m), and, for n = mM, . . . , (m + 1)M − 1,

θ_{n+1} = θ_n + γ [ f_{X_{i_n}}(θ_n) − f_{X_{i_n}}(θ̄^(m)) + F̂^(m)(θ̄^(m)) ],  where F̂^(m)(θ) := (1/M) Σ_{i=(m−1)M}^{mM} f_{X_i}(θ)   (2)
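A minimal sketch of the centering idea in a batch setting. This is an SVRG-style simplification: F̂ is computed exactly as the empirical average over a fixed dataset rather than estimated from the previous epoch's stream, and the last epoch iterate is kept instead of a uniformly sampled one; the chain, features, and constants are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 0.5
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])        # hypothetical features
P = np.array([[0.5, 0.5], [0.5, 0.5]])          # fast-mixing illustrative chain
r_vec = np.array([1.0, 0.0])

# Fixed dataset of transitions (states drawn uniformly).
N = 5000
S = rng.integers(2, size=N)
S_next = (rng.random(N) < P[S, 1]).astype(int)

def f(theta, s, s_next):
    # f_X(theta) = (r + beta theta^T phi(s') - theta^T phi(s)) phi(s)
    return (r_vec[s] + beta * Phi[s_next] @ theta - Phi[s] @ theta) * Phi[s]

def F_full(theta):
    # Full empirical average of f over the dataset (simplification of F_hat^(m)).
    return np.mean([f(theta, s, sn) for s, sn in zip(S, S_next)], axis=0)

M, gamma = 500, 0.2                             # epoch length, constant step-size
theta_bar = np.zeros(2)
for m in range(10):
    F_bar = F_full(theta_bar)
    theta = theta_bar.copy()
    for _ in range(M):
        i = rng.integers(N)
        # Centered update: the bracketed term has low variance when theta ~ theta_bar.
        theta += gamma * (f(theta, S[i], S_next[i]) - f(theta_bar, S[i], S_next[i]) + F_bar)
    theta_bar = theta

# Compare with the LSTD solution on the same dataset.
A_hat = Phi[S].T @ (Phi[S] - beta * Phi[S_next]) / N
b_hat = Phi[S].T @ r_vec[S] / N
assert np.linalg.norm(theta_bar - np.linalg.solve(A_hat, b_hat)) < 0.5
```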


slide-53
SLIDE 53

Centered TD(0)

Centering: Results

Epoch length and step-size choice: choose M and γ such that C₁ < 1, where

C₁ := 1 / ( 2µγM((1 − β) − d²γ) ) + γd² / ( 2((1 − β) − d²γ) )

Error bound:

‖Φ(θ̄^(m) − θ∗)‖²_Ψ ≤ C₁^m ‖Φ(θ̄^(0) − θ∗)‖²_Ψ + C₂H(5γ + 4) Σ_{k=1}^{m−1} C₁^{(m−2)−k} B^{kM}_{(k−1)M}(s₀),

where C₂ = γ / ( 2M((1 − β) − d²γ) ) and B^{kM}_{(k−1)M} is an upper bound on the partial sums

Σ_{i=(k−1)M}^{kM} ( E(φ(s_i) | s₀) − E_Ψ(φ(s_i)) ) and Σ_{i=(k−1)M}^{kM} ( E(φ(s_i)φ(s_{i+l})ᵀ | s₀) − E_Ψ(φ(s_i)φ(s_{i+l})ᵀ) ), for l = 0, 1.


slide-55
SLIDE 55

Centered TD(0)

Centering: Results (contd.)

The effect of mixing error: if the Markov chain underlying policy π satisfies

|P(s_t = s | s₀) − ψ(s)| ≤ Cρ^{t/M},

then

‖Φ(θ̄^(m) − θ∗)‖²_Ψ ≤ C₁^m ‖Φ(θ̄^(0) − θ∗)‖²_Ψ + CMC₂H(5γ + 4) max{C₁, ρ^M}^{m−1}

  • When the MDP mixes exponentially fast (e.g. finite state-space MDPs), we get the exponential convergence rate (in the first term).
  • Otherwise the decay of the error is dominated by the mixing rate.


slide-58
SLIDE 58

Centered TD(0)

Proof Outline

Let f̄_{X_{i_n}}(θ_n) := f_{X_{i_n}}(θ_n) − f_{X_{i_n}}(θ̄^(m)) + E_Ψ( f_{X_{i_n}}(θ̄^(m)) ).

Step 1: (Rewriting the CTD update)

θ_{n+1} = θ_n + γ ( f̄_{X_{i_n}}(θ_n) + ε_n ), where ε_n := E( f_{X_{i_n}}(θ̄^(m)) | F_{mM} ) − E_Ψ( f_{X_{i_n}}(θ̄^(m)) )

Step 2: (Bounding the variance of centered updates)

‖ f̄_{X_{i_n}}(θ_n) ‖₂² ≤ d² ( ‖Φ(θ_n − θ∗)‖²_Ψ + ‖Φ(θ̄^(m) − θ∗)‖²_Ψ )


slide-60
SLIDE 60

Centered TD(0)

Proof Outline (contd.)

Step 3: (Analysis for a particular epoch)

E_{θ_n}‖θ_{n+1} − θ∗‖₂² ≤ ‖θ_n − θ∗‖₂² + γ² E_{θ_n}‖ε_n‖₂² + 2γ (θ_n − θ∗)ᵀ E_{θ_n}( f̄_{X_{i_n}}(θ_n) ) + γ² E_{θ_n}‖ f̄_{X_{i_n}}(θ_n) ‖₂²
  ≤ ‖θ_n − θ∗‖₂² − 2γ((1 − β) − d²γ) ‖Φ(θ_n − θ∗)‖²_Ψ + γ²d² ‖Φ(θ̄^(m) − θ∗)‖²_Ψ + γ² E_{θ_n}‖ε_n‖₂²

Summing the above inequality over an epoch, noting that E_{Ψ,θ_n}‖θ_{n+1} − θ∗‖₂² ≥ 0 and

(θ̄^(m) − θ∗)ᵀ I (θ̄^(m) − θ∗) ≤ (1/µ) (θ̄^(m) − θ∗)ᵀ ΦᵀΨΦ (θ̄^(m) − θ∗),

we obtain the following by setting θ₀ = θ̄^(m):

2γM((1 − β) − d²γ) ‖Φ(θ̄^(m+1) − θ∗)‖²_Ψ ≤ ( 1/µ + γ²Md² ) ‖Φ(θ̄^(m) − θ∗)‖²_Ψ + γ² Σ_{i=(m−1)M}^{mM} E_{θ_i}‖ε_i‖₂²

The final step is to unroll (across epochs) the final recursion above to obtain the rate for CTD.

slide-61
SLIDE 61

Centered TD(0)

TD(0) on a batch


slide-62
SLIDE 62

Centered TD(0)

Dilbert’s boss on big data!


slide-63
SLIDE 63

fast LSTD

LSTD - A Batch Algorithm

Given a dataset D := {(s_i, r_i, s′_i), i = 1, . . . , T},

LSTD approximates the TD fixed point by θ̂_T = Ā_T^{−1} b̄_T (O(d²T) complexity), where

Ā_T = (1/T) Σ_{i=1}^T φ(s_i)(φ(s_i) − βφ(s′_i))ᵀ,  b̄_T = (1/T) Σ_{i=1}^T r_i φ(s_i).
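A sketch of the LSTD computation (the dataset is generated from an illustrative chain with hypothetical features; with states drawn uniformly, θ̂_T should approach the corresponding population solution):

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])   # hypothetical features
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # illustrative chain
r_vec = np.array([1.0, 0.0])

# Dataset D = {(s_i, r_i, s'_i)}, states drawn uniformly for simplicity.
T = 200_000
s = rng.integers(2, size=T)
s_next = (rng.random(T) < P[s, 1]).astype(int)
rewards = r_vec[s]

# LSTD: theta_hat = A_bar^{-1} b_bar, an O(d^2 T) computation.
A_bar = Phi[s].T @ (Phi[s] - beta * Phi[s_next]) / T
b_bar = Phi[s].T @ rewards / T
theta_hat = np.linalg.solve(A_bar, b_bar)

# Population counterpart under uniform state weights Psi:
# Phi^T Psi (I - beta P) Phi theta = Phi^T Psi r.
D = 0.5 * np.eye(2)
theta_pop = np.linalg.solve(Phi.T @ D @ (np.eye(2) - beta * P) @ Phi, Phi.T @ D @ r_vec)
assert np.linalg.norm(theta_hat - theta_pop) < 1.0
```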


slide-65
SLIDE 65

fast LSTD

Complexity of LSTD [1]

Figure: LSPI, a batch-mode RL algorithm for control (policy evaluation produces the Q-value Qπ; policy improvement produces the policy π).

LSTD complexity: O(d²T) using the Sherman-Morrison lemma, or O(d^2.807) using Strassen's algorithm, or O(d^2.375) using the Coppersmith-Winograd algorithm.


slide-67
SLIDE 67

fast LSTD

Complexity of LSTD [2]

Problem: practical applications involve high-dimensional features (e.g. computer Go: d ∼ 10⁶) ⇒ solving LSTD is computationally intensive. Related works: GTD¹, GTD2², iLSTD³.

Solution: use stochastic approximation (SA).

  • Complexity: O(dT), a factor-d reduction in complexity.
  • Theory: the SA variant of LSTD does not impact the overall rate of convergence.
  • Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!

¹ Sutton et al. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS.
² Sutton et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML.
³ Geramifard et al. (2007). iLSTD: Eligibility traces and convergence analysis. In: NIPS.


slide-69
SLIDE 69

fast LSTD

Fast LSTD using Stochastic Approximation

Pick i_n uniformly at random in {1, . . . , T} (random sampling), then update θ_n using the transition (s_{i_n}, r_{i_n}, s′_{i_n}):

θ_n = θ_{n−1} + γ_n ( r_{i_n} + β θ_{n−1}ᵀφ(s′_{i_n}) − θ_{n−1}ᵀφ(s_{i_n}) ) φ(s_{i_n})

A fixed-point iteration with step-sizes γ_n; complexity O(d) per iteration.
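A sketch of the SA iteration next to the exact LSTD solution on the same dataset (the data and constants are illustrative; the step-size follows the c/(c + n) shape from the slides with an assumed c):

```python
import numpy as np

rng = np.random.default_rng(5)
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])   # hypothetical features
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # illustrative chain
r_vec = np.array([1.0, 0.0])

# Dataset (states drawn uniformly for simplicity).
T = 20_000
s = rng.integers(2, size=T)
s_next = (rng.random(T) < P[s, 1]).astype(int)
rewards = r_vec[s]

# Exact LSTD solution on this dataset: the target of the SA iterates.
A_bar = Phi[s].T @ (Phi[s] - beta * Phi[s_next]) / T
b_bar = Phi[s].T @ rewards / T
theta_hat = np.linalg.solve(A_bar, b_bar)

# Fast LSTD: O(d) updates on uniformly sampled transitions.
theta = np.zeros(2)
c = 500.0
for n in range(1, 200_001):
    gamma_n = (1 - beta) * c / (2 * (c + n))
    i = rng.integers(T)
    td = rewards[i] + beta * Phi[s_next[i]] @ theta - Phi[s[i]] @ theta
    theta += gamma_n * td * Phi[s[i]]

assert np.linalg.norm(theta - theta_hat) < 1.0
```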


slide-71
SLIDE 71

fast LSTD

Assumptions

Setting: given a dataset D := {(s_i, r_i, s′_i), i = 1, . . . , T}.

(A1) ‖φ(s_i)‖₂ ≤ 1 (bounded features)
(A2) |r_i| ≤ R_max < ∞ (bounded rewards)
(A3) λ_min( (1/T) Σ_{i=1}^T φ(s_i)φ(s_i)ᵀ ) ≥ µ (the covariance matrix has a positive minimum eigenvalue)

slide-72
SLIDE 72

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84

slide-73
SLIDE 73

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84

slide-74
SLIDE 74

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84
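Assumption (A3) can be checked numerically on a given dataset. A small sketch, assuming features are stored row-wise in a (T, d) array (the helper name is illustrative):

```python
import numpy as np

def min_cov_eigenvalue(phi):
    """Smallest eigenvalue of the empirical feature covariance
    (1/T) * sum_i phi(s_i) phi(s_i)^T, i.e. the mu of (A3)."""
    T = phi.shape[0]
    cov = phi.T @ phi / T               # d x d empirical covariance
    return np.linalg.eigvalsh(cov)[0]   # eigvalsh returns eigenvalues in ascending order
```

If this returns a value close to zero, the step-size constants below degrade, which is one motivation for the iterate-averaging variant whose constants do not depend on µ.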

slide-75
SLIDE 75

fast LSTD

Convergence Rate

Step-size choice: γ_n = (1 − β)c / (2(c + n)), with (1 − β)^2 µc ∈ (1.33, 2)

Bound in expectation: E ‖θ_n − ˆθ_T‖_2 ≤ K_1(n) / √(n + c)

High-probability bound: P( ‖θ_n − ˆθ_T‖_2 ≤ K_2(n) / √(n + c) ) ≥ 1 − δ

By iterate averaging, the dependency of c on µ can be removed.

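The admissible step-size choice above is easy to operationalize: pick c so that (1 − β)^2 µc lands inside (1.33, 2), then generate the schedule. A sketch, assuming β and µ are known; `target` is an illustrative knob standing in for any admissible value of (1 − β)^2 µc:

```python
def td0_step_sizes(beta, mu, n_steps, target=1.5):
    """Choose c so that (1 - beta)^2 * mu * c == target (any value in
    (1.33, 2) is admissible per the slides), then return the schedule
    gamma_n = (1 - beta) * c / (2 * (c + n)) for n = 1..n_steps."""
    c = target / ((1.0 - beta) ** 2 * mu)
    gammas = [(1.0 - beta) * c / (2.0 * (c + n)) for n in range(1, n_steps + 1)]
    return c, gammas
```

Note how a small µ or a β close to 1 forces a large c, i.e., an aggressive early step size; removing that dependence is what the iterate-averaging slide addresses.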

slide-79
SLIDE 79

fast LSTD

The constants

K_1(n) = √c ‖θ_0 − ˆθ_T‖_2 / n^{((1−β)^2 µc − 1)/2} + (1 − β) c h^2(n) / 2,

K_2(n) = (1 − β) c √(log δ^{−1} / 2) · ( 4 / (3(1 − β)^2 µc − 1) ) + K_1(n),

where h(n) := (1 + R_max + β)^2 max{ (‖θ_0 − ˆθ_T‖_2 + ln n + ‖ˆθ_T‖_2) / 4, 1 }.

Both K_1(n) and K_2(n) are O(1).


slide-80
SLIDE 80

fast LSTD

Iterate Averaging

Bigger step-size + averaging:

γ_n := ((1 − β)/2) (c / (c + n))^α,   ¯θ_{n+1} := (θ_1 + . . . + θ_n)/n

Bound in expectation: E ‖¯θ_n − ˆθ_T‖_2 ≤ K_1^{IA}(n) / (n + c)^{α/2}

High-probability bound: P( ‖¯θ_n − ˆθ_T‖_2 ≤ K_2^{IA}(n) / (n + c)^{α/2} ) ≥ 1 − δ

The dependency of c on µ is removed, at the cost of (1 − α)/2 in the rate.

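The iterate-averaging variant above amounts to running the same fLSTD-SA loop with the larger step size and keeping a running Polyak-Ruppert average. A minimal sketch under the same illustrative data layout as before:

```python
import numpy as np

def flstd_sa_averaged(phi, rewards, phi_next, beta, n_iters,
                      c=1.0, alpha=0.6, seed=0):
    """fLSTD-SA with the bigger step size
    gamma_n = ((1 - beta)/2) * (c / (c + n))^alpha and
    Polyak-Ruppert averaging of the iterates theta_1..theta_n."""
    rng = np.random.default_rng(seed)
    T, d = phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(T)
        gamma = 0.5 * (1.0 - beta) * (c / (c + n)) ** alpha
        td_err = rewards[i] + beta * theta @ phi_next[i] - theta @ phi[i]
        theta = theta + gamma * td_err * phi[i]
        theta_bar += (theta - theta_bar) / n   # running average of theta_1..theta_n
    return theta_bar
```

The step size here does not involve µ, matching the slide's point that averaging removes the dependency of c on µ.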

slide-84
SLIDE 84

fast LSTD

The constants

K_1^{IA}(n) := C ‖θ_0 − ˆθ_T‖_2 / (n + c)^{(1−α)/2} + ( h(n) c^α (1 − β) / (µ c^α (1 − β)^2)^α )^{(1+2α)/(2(1−α))},

and K_2^{IA}(n) := ( √(log δ^{−1}) / (µ(1 − β)) ) ( 3α + (µ c^α (1 − β)^2 + 2^α α)/2 ) · 1/(n + c)^{(1−α)/2} + K_1^{IA}(n).

As before, both K_1^{IA}(n) and K_2^{IA}(n) are O(1).


slide-85
SLIDE 85

fast LSTD

Performance bounds

True value function v; approximate value function ˜v_n := Φθ_n.

‖v − ˜v_n‖_T ≤ ‖v − Πv‖_T / √(1 − β^2)   [approximation error]
  + O( √( d / ((1 − β)^2 µT) ) )   [estimation error]
  + O( √( (1 / ((1 − β)^2 µ^2 n)) ln(1/δ) ) )   [computational error]

1 ‖f‖_T^2 := T^{−1} Σ_{i=1}^T f(s_i)^2, for any function f.
2 Lazaric, A., Ghavamzadeh, M., Munos, R. (2012) Finite-sample analysis of least-squares policy iteration. In: JMLR

slide-86
SLIDE 86

fast LSTD

Performance bounds

‖v − ˜v_n‖_T ≤ ‖v − Πv‖_T / √(1 − β^2)   [approximation error]
  + O( √( d / ((1 − β)^2 µT) ) )   [estimation error]
  + O( √( (1 / ((1 − β)^2 µ^2 n)) ln(1/δ) ) )   [computational error]

The approximation and estimation errors are artifacts of function approximation and least-squares methods; the computational error is a consequence of using SA for LSTD. Setting n = ln(1/δ) T / (dµ), the convergence rate is unaffected!

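The budget n = ln(1/δ) T / (dµ) that keeps the rate unaffected is a one-line computation; a sketch with an illustrative helper name:

```python
import math

def sa_iteration_budget(T, d, mu, delta):
    """Number of fLSTD-SA iterations n = ln(1/delta) * T / (d * mu)
    for which the computational error is dominated by the estimation
    error, per the performance bound on the slide."""
    return math.ceil(math.log(1.0 / delta) * T / (d * mu))
```

For example, with a large sample set but high-dimensional features, the required number of O(d) SA iterations can be far cheaper than one O(d^2) or O(d^3) batch LSTD solve.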

slide-89
SLIDE 89

Fast LSPI using SA

LSPI - A Quick Recap

Alternate policy evaluation (compute the Q-value Q^π) and policy improvement (compute a new policy π′):

Q^π(s, a) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s_0 = s, a_0 = a ]

π′(s) = arg max_{a∈A} θ^T φ(s, a)


slide-91
SLIDE 91

Fast LSPI using SA

Policy Evaluation: LSTDQ and its SA variant

Given a set of samples D := {(s_i, a_i, r_i, s′_i), i = 1, . . . , T}

LSTDQ approximates Q^π by ˆθ_T = ¯A_T^{−1} ¯b_T, where

¯A_T = (1/T) Σ_{i=1}^T φ(s_i, a_i) (φ(s_i, a_i) − β φ(s′_i, π(s′_i)))^T, and ¯b_T = T^{−1} Σ_{i=1}^T r_i φ(s_i, a_i).

Fast LSTDQ using SA:

θ_k = θ_{k−1} + γ_k ( r_{i_k} + β θ_{k−1}^T φ(s′_{i_k}, π(s′_{i_k})) − θ_{k−1}^T φ(s_{i_k}, a_{i_k}) ) φ(s_{i_k}, a_{i_k})

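For contrast with the SA variant, the batch LSTDQ solve above is a direct linear-algebra computation. A minimal sketch, assuming the state-action features φ(s_i, a_i) and the next-state policy features φ(s′_i, π(s′_i)) are precomputed as row-wise arrays (names illustrative):

```python
import numpy as np

def lstdq(phi_sa, rewards, phi_next_pi, beta):
    """Batch LSTDQ: theta_hat = A_T^{-1} b_T with
    A_T = (1/T) sum_i phi_i (phi_i - beta * phi'_i)^T and
    b_T = (1/T) sum_i r_i phi_i, where phi_i = phi(s_i, a_i)
    and phi'_i = phi(s'_i, pi(s'_i))."""
    T = phi_sa.shape[0]
    A = phi_sa.T @ (phi_sa - beta * phi_next_pi) / T   # d x d
    b = phi_sa.T @ rewards / T                          # d
    return np.linalg.solve(A, b)                        # O(d^3) solve
```

The O(d^3) solve (or O(d^2) per-sample Sherman-Morrison update) is exactly the cost that fast LSTDQ-SA avoids.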

slide-93
SLIDE 93

Fast LSPI using SA

Fast LSPI using SA (fLSPI-SA)

Input: sample set D := {(s_i, a_i, r_i, s′_i)}_{i=1}^T

repeat
  Policy evaluation: for k = 1 to τ
    • draw a random sample index i_k ∼ U({1, . . . , T})
    • update the fLSTD-SA iterate θ_k
  θ′ ← θ_τ, ∆ = ‖θ − θ′‖_2
  Policy improvement: obtain the greedy policy π′(s) = arg max_{a∈A} θ′^T φ(s, a)
  θ ← θ′, π ← π′
until ∆ < ε

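The fLSPI-SA loop above can be sketched end to end. This is an illustrative skeleton, not the authors' implementation: `feat(s, a)` and the transition tuples are assumed inputs, and a `max_rounds` cap is added so the sketch always terminates.

```python
import numpy as np

def flspi_sa(feat, data, n_actions, beta, tau,
             eps=1e-3, c=1.0, seed=0, max_rounds=100):
    """fLSPI-SA: alternate fLSTDQ-SA policy evaluation with greedy
    policy improvement until the parameter change drops below eps.
    feat(s, a) -> feature vector; data = list of (s, a, r, s_next)."""
    rng = np.random.default_rng(seed)
    d = feat(*data[0][:2]).shape[0]
    theta = np.zeros(d)

    def greedy(s, th):
        # policy improvement step: pi(s) = argmax_a th^T feat(s, a)
        return max(range(n_actions), key=lambda a: th @ feat(s, a))

    for _ in range(max_rounds):
        theta_k = theta.copy()
        for k in range(1, tau + 1):                    # policy evaluation
            s, a, r, s_next = data[rng.integers(len(data))]
            gamma = (1.0 - beta) * c / (2.0 * (c + k))
            a_next = greedy(s_next, theta)             # pi(s') under the current policy
            td_err = (r + beta * theta_k @ feat(s_next, a_next)
                      - theta_k @ feat(s, a))
            theta_k = theta_k + gamma * td_err * feat(s, a)
        delta = np.linalg.norm(theta - theta_k)
        theta = theta_k
        if delta < eps:                                # until Delta < eps
            break
    return theta
```

Each outer round costs O(τ d) plus the greedy maximizations, versus a fresh O(d^2)-per-sample LSTDQ solve in standard LSPI.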

slide-95
SLIDE 95

Experiments - Traffic Signal Control

The traffic control problem


slide-96
SLIDE 96

Experiments - Traffic Signal Control

Simulation Results on 7x9-grid network

Tracking error: ‖θ_k − ˆθ_T‖_2 versus step k of fLSTD-SA (figure omitted).

Throughput (TAR): total arrived road users over time steps, LSPI vs. fLSPI-SA (figure omitted).


slide-97
SLIDE 97

Experiments - Traffic Signal Control

Runtime Performance on three road networks

Network (d)          LSPI runtime (ms)   fLSPI-SA runtime (ms)
7x9-Grid (d = 504)   4,917               66
14x9-Grid (d = 1008) 30,144              159
14x18-Grid (d = 2016) 1.91 · 10^5        287


slide-98
SLIDE 98

Experiments - Traffic Signal Control

SGD in Linear Bandits


slide-99
SLIDE 99

Experiments - Traffic Signal Control

Complacs News Recommendation Platform

NOAM database: 17 million articles from 2010. Task: find the best among 2000 news feeds. Reward: relevancy score of the article. Feature dimension: approximately 80,000.

1 In collaboration with Nello Cristianini and Tom Welfare at the University of Bristol


slide-103
SLIDE 103

Experiments - Traffic Signal Control

More on relevancy score

Problem: Find the best news feed for Crime stories Sample scores:

• Five dead in Finnish mall shooting (Score: 1.93)
• Holidays provide more opportunities to drink (Score: −0.48)
• Russia raises price of vodka (Score: 2.67)
• Why Obama Care Must Be Defeated (Score: 0.43)
• University closure due to weather (Score: −1.06)


slide-108
SLIDE 108

Experiments - Traffic Signal Control

A linear bandit algorithm

Loop: choose x_n → observe y_n → estimate the UCBs.

x_n := arg max_{x∈D} UCB(x), with rewards y_n such that E[y_n | x_n] = x_n^T θ∗

Regression is used to compute UCB(x) := x^T ˆθ_n + α √( x^T A_n^{−1} x )
Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-109
SLIDE 109

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-110
SLIDE 110

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-111
SLIDE 111

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84
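The choose/observe/estimate loop above can be sketched as a small LinUCB-style routine. This is a textbook sketch, not the Complacs implementation: the exploration weight `alpha`, ridge term `reg`, and the explicit matrix inverse (exactly the cost that fast GD later avoids) are illustrative choices.

```python
import numpy as np

def linucb(arms, reward_fn, T, alpha=1.0, reg=1.0):
    """LinUCB sketch: maintain A_n = reg*I + sum x x^T and
    b_n = sum y x; play the arm maximizing
    x^T theta_hat + alpha * sqrt(x^T A_n^{-1} x)."""
    d = arms.shape[1]
    A = reg * np.eye(d)
    b = np.zeros(d)
    total = 0.0
    for _ in range(T):
        A_inv = np.linalg.inv(A)          # O(d^3): the bottleneck at d ~ 10^5
        theta_hat = A_inv @ b             # ridge-regression estimate
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + alpha * widths)]
        y = reward_fn(x)                  # observe the reward
        A += np.outer(x, x)
        b += y * x
        total += y
    return theta_hat, total
```

With d around 10^5 as on the news-feed platform, the per-round inverse is prohibitive, which motivates replacing the regression step with an O(d) GD update.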

slide-112
SLIDE 112

Experiments - Traffic Signal Control

UCB values

UCB(x) = ˆµ(x) + α ˆσ(x): a mean-reward estimate plus a confidence width. Analogy: at each round t, select a tap, and optimize the quality of the n selected beers.


slide-115
SLIDE 115

Experiments - Traffic Signal Control

UCB values

Linearity ⇒ no need to estimate the mean reward of every arm; estimating θ∗ is enough.

Regression: ˆθ_n = A_n^{−1} b_n

UCB(x) = ˆµ(x) + α ˆσ(x), where ˆσ(x) = √( x^T A_n^{−1} x ) is the Mahalanobis distance of x with respect to A_n.

Optimize the beer you drink, before you get drunk.


slide-118
SLIDE 118

Experiments - Traffic Signal Control

Performance measure

Best arm: x∗ = arg min_{x∈D} {x^T θ∗}. Regret: R_T = Σ_{i=1}^T (x_i − x∗)^T θ∗

Goal: ensure that R_T grows sub-linearly with T. Linear bandit algorithms ensure sub-linear regret!


slide-120
SLIDE 120

Experiments - Traffic Signal Control

Complexity of Least Squares Regression

Loop: choose x_n → observe y_n → estimate ˆθ_n (a typical ML algorithm using regression).

Regression complexity: O(d^2) per update using the Sherman-Morrison lemma, or O(d^2.807) using Strassen's algorithm, or O(d^2.376) using the Coppersmith-Winograd algorithm.

Problem: the Complacs news feed platform has high-dimensional features (d ∼ 10^5) ⇒ solving OLS is computationally costly.


slide-122
SLIDE 122

Experiments - Traffic Signal Control

Fast GD for Regression

Random sampling: pick i_n uniformly in {1, . . . , n} and update θ_n using (x_{i_n}, y_{i_n}) (a GD update yielding θ_{n+1}).

Solution: use fast (online) gradient descent (GD)
• Efficient, with a complexity of only O(d) per iteration (well known)
• High-probability bounds with explicit constants can be derived (not fully known previously)

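The fast GD regression step above is a one-line update per round. A minimal sketch, using the step-size shape γ_n = c/(4(c + n)) from the error-bound slides that follow; the data layout is illustrative:

```python
import numpy as np

def fast_gd_regression(X, y, n_iters, c=1.0, seed=0):
    """Fast (online) GD for least squares: sample a data point uniformly
    and take the O(d) gradient step
    theta <- theta + gamma_n * (y_i - theta^T x_i) * x_i."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(N)                 # pick i_n uniformly in {1, ..., n_data}
        gamma = c / (4.0 * (c + n))         # step size as in the later error bound
        theta = theta + gamma * (y[i] - theta @ X[i]) * X[i]
    return theta
```

Each step touches one row of X, so the per-round cost is O(d) instead of the O(d^2)-or-worse regression updates listed on the previous slide.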

slide-123
SLIDE 123

Experiments - Traffic Signal Control

Bandits+GD for News Recommendation

LinUCB: a well-known contextual bandit algorithm that employs regression in each iteration. Fast GD provides a good approximation to regression, at low computational cost.

Strongly convex bandits: no loss in regret except log factors. Proved! Non-strongly-convex bandits: encouraging empirical results for LinUCB + fast GD on two news feed platforms.


slide-125
SLIDE 125

Strongly convex bandits

fast GD

Random sampling: pick i_n uniformly in {1, . . . , n} and update θ_n using (x_{i_n}, y_{i_n}).

GD update (with step-sizes γ_n, using the sample gradient):

θ_n = θ_{n−1} + γ_n ( y_{i_n} − θ_{n−1}^T x_{i_n} ) x_{i_n}


slide-128
SLIDE 128

Strongly convex bandits

Assumptions

Setting: y_n = x_n^T θ∗ + ξ_n, where the ξ_n are i.i.d. zero-mean.

(A1) sup_n ‖x_n‖_2 ≤ 1 (bounded features)
(A2) |ξ_n| ≤ 1, ∀n (bounded noise)
(A3) λ_min( (1/n) Σ_{i=1}^{n−1} x_i x_i^T ) ≥ µ (strongly convex covariance matrix, for each n!)


slide-132
SLIDE 132

Strongly convex bandits

Why is deriving error bounds difficult?

θ_n − ˆθ_n = θ_n − ˆθ_{n−1} + ˆθ_{n−1} − ˆθ_n
          = θ_{n−1} − ˆθ_{n−1} + ˆθ_{n−1} − ˆθ_n + γ_n ( y_{i_n} − θ_{n−1}^T x_{i_n} ) x_{i_n}
          = Π_n (θ_0 − θ∗)   [initial error]
            + Σ_{k=1}^n γ_k Π_n Π_k^{−1} ∆˜M_k   [sampling error]
            − Σ_{k=1}^n Π_n Π_k^{−1} ( ˆθ_k − ˆθ_{k−1} )   [drift error]

The initial and sampling errors are present in earlier SGD works and can be handled easily; the drift error is a consequence of the changing target and is hard to control!

Note: ¯A_n = (1/n) Σ_{i=1}^n x_i x_i^T, Π_n := Π_{k=1}^n ( I − γ_k ¯A_k ), and ∆˜M_k is a martingale difference.


slide-135
SLIDE 135

Strongly convex bandits

Handling Drift Error

Note F_n(θ) := (1/2) Σ_{i=1}^n ( y_i − θ^T x_i )^2 and ¯A_n = (1/n) Σ_{i=1}^n x_i x_i^T. Also, E[y_n | x_n] = x_n^T θ∗.

To control the drift error, we observe that

∇F_n(ˆθ_n) = 0 = ∇F_{n−1}(ˆθ_{n−1})  ⇒  ˆθ_{n−1} − ˆθ_n = ξ_n A_{n−1}^{−1} x_n − ( x_n^T ( ˆθ_n − θ∗ ) ) A_{n−1}^{−1} x_n.

Thus, the drift is controlled by the convergence of ˆθ_n to θ∗. Key: the confidence-ball result 1.

1 Dani, V., Hayes, T.P., and Kakade, S.M. (2008) Stochastic Linear Optimization under Bandit Feedback. In: COLT


slide-138
SLIDE 138

Strongly convex bandits

Error bound

With γ_n = c/(4(c + n)) and µc/4 ∈ (2/3, 1), we have:

High-probability bound: for any δ > 0,

P( ‖θ_n − ˆθ_n‖_2 ≤ √( (K_{µ,c}/n) log(1/δ) ) + h_1(n)/√n ) ≥ 1 − δ.   Optimal rate O(n^{−1/2})

Bound in expectation:

E ‖θ_n − ˆθ_n‖_2 ≤ ‖θ_0 − ˆθ_n‖_2 / n^{µc}   [initial error]   + h_2(n)/√n   [sampling error]

1 K_{µ,c} is a constant depending on µ and c, and h_1(n), h_2(n) hide log factors. 2 By iterate averaging, the dependency of c on µ can be removed.


slide-141
SLIDE 141

Strongly convex bandits

PEGE Algorithm1

Input: a basis {b_1, . . . , b_d} ⊂ D for R^d.

For each cycle m = 1, 2, . . . do
  Exploration phase: for i = 1 to d
    • choose arm b_i
    • observe y_i(m)
  Pull each of the d basis arms once and, using the losses, compute the OLS estimate:

  ˆθ_{md} = (1/m) ( Σ_{i=1}^d b_i b_i^T )^{−1} Σ_{j=1}^m Σ_{i=1}^d b_i y_i(j)

  Exploitation phase: use the OLS estimate to compute a greedy decision, x = arg min_{x∈D} { ˆθ_{md}^T x }, and choose arm x m times consecutively.

1 P. Rusmevichientong and J.N. Tsitsiklis (2010) Linearly Parameterized Bandits. In: Math. Oper. Res.

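The explore/exploit cycle of PEGE can be sketched compactly. An illustrative sketch only: it uses the standard basis as the exploration arms and a noiseless-or-Gaussian reward model, neither of which is specified by the slide.

```python
import numpy as np

def pege(arms, theta_star, n_cycles, noise=0.0, seed=0):
    """PEGE sketch: in cycle m, pull each basis arm once, recompute the
    OLS estimate over all m cycles, then pull the greedy (minimum-loss)
    arm m times consecutively. Returns the final estimate and the
    sequence of observed losses."""
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    basis = np.eye(d)                 # assumed exploration basis {b_1, ..., b_d}
    B = basis.T @ basis               # sum_i b_i b_i^T (identity here)
    acc = np.zeros(d)                 # accumulates sum_j sum_i b_i y_i(j)
    losses = []
    for m in range(1, n_cycles + 1):
        for i in range(d):                            # exploration phase
            y = basis[i] @ theta_star + noise * rng.normal()
            acc += basis[i] * y
            losses.append(y)
        theta_hat = np.linalg.solve(B, acc) / m       # OLS over m cycles
        x = arms[np.argmin(arms @ theta_hat)]         # exploitation phase
        for _ in range(m):
            losses.append(x @ theta_star + noise * rng.normal())
    return theta_hat, losses
```

The growing exploitation phase is what drives the √T regret: exploration rounds grow only linearly in the cycle index while exploitation rounds grow quadratically in total.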

slide-145
SLIDE 145

Strongly convex bandits

PEGE Algorithm with fast GD

Input: a basis {b_1, . . . , b_d} ⊂ D for R^d.

For each cycle m = 1, 2, . . . do
  Exploration phase: for i = 1 to d
    • choose arm b_i
    • observe y_i(m)
  Pull each of the d basis arms once and, using the losses, update the fast GD iterate θ_{md}.
  Exploitation phase: use the fast GD iterate to compute a greedy decision, x = arg min_{x∈D} { θ_{md}^T x }, and choose arm x m times consecutively.


slide-149
SLIDE 149

Strongly convex bandits

Regret bound for PEGE+fast GD

(Strongly Convex Arms) (A3): The function G : θ ↦ arg min_{x∈D} {θᵀx} is J-Lipschitz.

Theorem

Under (A1)–(A3), the regret R_T := Σ_{i=1}^{T} ( x_iᵀθ∗ − min_{x∈D} xᵀθ∗ ) satisfies

  R_T ≤ C K₁(n) 2^{d−1} ( ‖θ∗‖₂ + ‖θ∗‖₂⁻¹ ) √T.

The bound is worse than that for PEGE by only a factor of O(log⁴(n)).

Prashanth L A Convergence rate of TD(0) March 27, 2015 74 / 84

slide-150
SLIDE 150

Non-strongly convex bandits

Fast linUCB

In each round n: choose arm x_n and observe reward y_n, where E[y_n | x_n] = x_nᵀθ∗.

Fast GD maintains an iterate θ_n that is used in place of the exact least-squares estimate θ̂_n when computing the index:

  UCB(x) := xᵀθ_n + α √( xᵀ φ_n(x) ),   x_n := arg max_{x∈D} UCB(x).

Prashanth L A Convergence rate of TD(0) March 27, 2015 75 / 84
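The fast LinUCB loop can be sketched as follows. Note the assumptions: the confidence width α √(xᵀ A_n⁻¹ x), with A_n the λ-regularized design matrix, is an assumed stand-in for the talk's φ_n(x), and the noise level and constant step size are illustrative choices.

```python
import numpy as np

def fast_linucb(D, theta_star, horizon=500, alpha=1.0, lam=1.0, step=0.2, seed=0):
    """Sketch of LinUCB with a GD-tracked estimate in place of an exact solve.

    Assumed width: alpha * sqrt(x^T A_n^{-1} x) with A_n the regularized
    design matrix; the talk's phi_n(x) may differ.
    """
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    theta = np.zeros(d)        # GD iterate tracking the LS estimate
    A = lam * np.eye(d)        # regularized design matrix
    for n in range(1, horizon + 1):
        A_inv = np.linalg.inv(A)
        widths = np.sqrt(np.einsum('kd,de,ke->k', D, A_inv, D))
        x = D[np.argmax(D @ theta + alpha * widths)]      # x_n := arg max UCB(x)
        y = x @ theta_star + 0.1 * rng.standard_normal()  # E[y_n | x_n] = x_n^T theta*
        A += np.outer(x, x)
        # one GD step on the squared error, instead of recomputing theta_hat
        theta += step * (y - theta @ x) * x
    return theta
```

The single O(d) GD step per round is what replaces the per-round least-squares solve of vanilla LinUCB.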

slide-153
SLIDE 153

Non-strongly convex bandits

Adaptive regularization

Problem: In many settings, λ_min( (1/n) Σ_{i=1}^{n−1} x_i x_iᵀ ) ≥ µ may not hold.

Solution: Adaptively regularize with λ_n:

  θ̃_n := arg min_θ { (1/2n) Σ_{i=1}^{n} (y_i − θᵀx_i)² + λ_n ‖θ‖² }.

To track θ̃_n, combine random sampling with a GD step: pick i_n uniformly at random in {1, . . . , n}, then update θ_n using the sample (x_{i_n}, y_{i_n}).

GD update: θ_n = θ_{n−1} + γ_n ( (y_{i_n} − θ_{n−1}ᵀ x_{i_n}) x_{i_n} − λ_n θ_{n−1} ).

Prashanth L A Convergence rate of TD(0) March 27, 2015 76 / 84
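The randomly-sampled GD update above can be sketched directly. Assumptions: the schedules γ_n = n^{−α} with λ_n = n^{−(1−α)} and α = 0.5 are illustrative choices, and the random index i_n is drawn from a fixed batch rather than the growing stream of the bandit setting.

```python
import numpy as np

def adapt_reg_gd(X, Y, n_steps=5000, alpha=0.5, seed=0):
    """Sketch of the randomly-sampled GD update with adaptive regularization.

    Assumed schedules: gamma_n = n^{-alpha}, lambda_n = n^{-(1-alpha)}.
    i_n is sampled from a fixed batch here, for simplicity.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for n in range(1, n_steps + 1):
        gamma_n = n ** (-alpha)
        lam_n = n ** (-(1.0 - alpha))
        i = rng.integers(N)               # pick i_n uniformly at random
        x, y = X[i], Y[i]
        # theta_n = theta_{n-1} + gamma_n((y - theta^T x) x - lambda_n theta)
        theta = theta + gamma_n * ((y - theta @ x) * x - lam_n * theta)
    return theta
```

Since λ_n → 0, the regularization bias vanishes and the iterate approaches the least-squares solution.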

slide-157
SLIDE 157

Non-strongly convex bandits

Why is deriving error bounds “really” difficult here?

θ_n − θ̃_n = Π_n(θ_0 − θ∗)   [Initial Error]
      − Σ_{k=1}^{n} Π_n Π_k⁻¹ (θ̃_k − θ̃_{k−1})   [Drift Error]
      + Σ_{k=1}^{n} γ_k Π_n Π_k⁻¹ ΔM̃_k   [Sampling Error].   (3)

Need Σ_{k=1}^{n} γ_k λ_k → ∞ to bound the initial error, so set γ_n = O(n^{−α}) (forcing λ_n = Ω(n^{−(1−α)})).

Bad news: this choice, when plugged into (3), results in only a constant error bound!

Note: Π_n := ∏_{k=1}^{n} ( I − γ_k(Ā_k + λ_k I) ), and θ̃_{n−1} − θ̃_n = Ω(n⁻¹) whenever α ∈ (0, 1).

Prashanth L A Convergence rate of TD(0) March 27, 2015 77 / 84
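The step-size tension can be checked numerically: with γ_k = k^{−α} and λ_k = k^{−(1−α)}, the product γ_k λ_k is exactly 1/k, so the required sum is the harmonic sum, which diverges (but only logarithmically). The choice α = 0.6 below is an arbitrary illustration.

```python
import numpy as np

# With gamma_k = k^{-alpha} and lambda_k = k^{-(1-alpha)}, the product
# gamma_k * lambda_k equals 1/k, so sum_{k<=n} gamma_k lambda_k is the
# harmonic sum: it grows like log(n) and hence diverges, as required for
# bounding the initial error.
def step_sum(n, alpha=0.6):
    k = np.arange(1, n + 1)
    return float(np.sum(k ** (-alpha) * k ** (-(1.0 - alpha))))

print(step_sum(10**3))  # harmonic number H_1000, approx 7.49
print(step_sum(10**6))  # approx 14.39: still growing, without bound
```

The slow (logarithmic) divergence is precisely why the initial error is hard to kill while keeping the drift and sampling errors small.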

slide-161
SLIDE 161

News recommendation application

Dilbert’s boss on news recommendation (and ML)

Prashanth L A Convergence rate of TD(0) March 27, 2015 78 / 84

slide-162
SLIDE 162

News recommendation application

Preliminary Results on Complacs News Feed Platform

Figure: Cumulative reward vs. iteration on the Complacs news feed platform, comparing LinUCB and LinUCB-GD.

Prashanth L A Convergence rate of TD(0) March 27, 2015 79 / 84

slide-163
SLIDE 163

News recommendation application

Experiments on Yahoo! Dataset 1

Figure: The Featured tab in Yahoo! Today module

1Yahoo User-Click Log Dataset given under the Webscope program (2011) Prashanth L A Convergence rate of TD(0) March 27, 2015 80 / 84

slide-164
SLIDE 164

News recommendation application

Tracking Error

Figure: Tracking error ‖θ_n − θ̃_n‖ vs. iteration n, for flinUCB-GD (SGD), flinUCB-SVRG (SVRG¹) and flinUCB-SAG (SAG²).

  1 Johnson, R. and Zhang, T. (2013), “Accelerating stochastic gradient descent using predictive variance reduction”. NIPS.
  2 Le Roux, N., Schmidt, M. and Bach, F. (2012), “A stochastic gradient method with an exponential convergence rate for finite training sets”. arXiv:1202.6258.

Prashanth L A Convergence rate of TD(0) March 27, 2015 81 / 84

slide-165
SLIDE 165

News recommendation application

Runtime Performance on two days of the Yahoo! dataset

Runtime (ms):

  Algorithm      Day-2        Day-4
  LinUCB         1.37 · 10⁶   1.72 · 10⁶
  fLinUCB-GD     4,933        6,474
  fLinUCB-SVRG   81,818       1.07 · 10⁵
  fLinUCB-SAG    44,504       55,630

Prashanth L A Convergence rate of TD(0) March 27, 2015 82 / 84

slide-166
SLIDE 166

For Further Reading

For Further Reading I

  • Nathaniel Korda and Prashanth L.A., On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. arXiv:1411.3224, 2014.
  • Prashanth L.A., Nathaniel Korda and Rémi Munos, Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML, 2014.
  • Nathaniel Korda, Prashanth L.A. and Rémi Munos, Fast gradient descent for least squares regression: Non-asymptotic bounds and application to bandits. AAAI, 2015.

Prashanth L A Convergence rate of TD(0) March 27, 2015 83 / 84

slide-167
SLIDE 167

For Further Reading

Dilbert’s boss (again) on big data!

Prashanth L A Convergence rate of TD(0) March 27, 2015 84 / 84