Matrix-Free Preconditioning in Online Learning, by Ashok Cutkosky and Tamas Sarlos (PowerPoint presentation)
SLIDE 1

Matrix-Free Preconditioning in Online Learning

Ashok Cutkosky, Tamas Sarlos Google Research


SLIDE 3

Online Optimization

For t = 1 … T, repeat:
1: Learner chooses a point $w_t$.
2: Environment presents learner with a gradient $g_t$ (think $\mathbb{E}[g_t] = \nabla F(w_t)$).
3: Learner suffers loss $\langle g_t, w_t \rangle$.
The objective is to minimize regret:

$$R_T(w_\star) = \sum_{t=1}^T \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w_\star \rangle}_{\text{benchmark loss}}$$

Running an online algorithm on a stochastic optimization problem guarantees
$$F(w_T) - F(w_\star) \le \frac{R_T(w_\star)}{T}.$$

Ashok Cutkosky, Tamas Sarlos Matrix-Free Preconditioning in Online Learning 1 of 20
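The protocol above can be sketched directly in code (a minimal illustration with a made-up two-round example; the helper names are ours, not from the talk):

```python
def dot(a, b):
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))

def regret(ws, gs, w_star):
    """R_T(w*) = sum_t (<g_t, w_t> - <g_t, w*>): loss suffered minus benchmark loss."""
    return sum(dot(g, w) - dot(g, w_star) for w, g in zip(ws, gs))

# Two rounds in R^2: the learner's points, the revealed gradients, and a fixed comparator.
ws = [(0.0, 0.0), (1.0, 0.0)]
gs = [(1.0, 1.0), (-1.0, 0.0)]
print(regret(ws, gs, (1.0, 0.0)))  # -1.0: the learner beat the comparator in this toy run
```

Negative regret is possible on a fixed sequence; the bounds in the following slides control the worst case over sequences.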


SLIDE 5

The Classic Algorithm: Gradient Descent

$$w_{t+1} = w_t - \eta_t g_t$$

Gradient descent obtains regret:
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$
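A runnable sketch of the update above (the decaying schedule $\eta_t = \eta/\sqrt{t}$ is a standard choice we assume for illustration; the slide does not specify one):

```python
import math

def ogd(gradients, eta=1.0):
    """Online gradient descent: w_{t+1} = w_t - eta_t * g_t, with eta_t = eta / sqrt(t)."""
    d = len(gradients[0])
    w = [0.0] * d
    iterates = []
    for t, g in enumerate(gradients, start=1):
        iterates.append(list(w))          # w_t is played before g_t is revealed
        step = eta / math.sqrt(t)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return iterates

# The same constant gradient every round pushes the iterate steadily in one direction.
its = ogd([(1.0, 0.0)] * 3)
print(its)
```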

SLIDE 6

Gradient Descent



SLIDE 8

Preconditioning (Deterministic)

  • The gradient ∇F(w) may not point towards the minimum w⋆

Key idea: “Preconditioning” means ignoring irrelevant directions.


SLIDE 9

Preconditioning (Stochastic)

  • Noise can also make gt not point towards the minimum.


SLIDE 10

Regret Bounds

  • Regret of un-preconditioned stochastic gradient descent (with the appropriate learning rate) is
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} = O(\sqrt{T})$$
  • An ideal preconditioned algorithm should obtain regret
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2} = O(\sqrt{T})$$

SLIDE 11

Regret Bound Picture


SLIDE 12

Goals

  • Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).

SLIDE 13

Using the Covariance Matrix

The typical approach to preconditioning maintains the matrix
$$G = \sum_{t=1}^T g_t g_t^\top$$
and computes various inverses and square roots of G. This can obtain the guarantee [CO18; KL17]:
$$R_T(w_\star) \le \sqrt{d \sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$
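The cost of this approach is easy to see in code (a schematic sketch, not the cited algorithms: we show only the covariance update itself; the inverse and square-root computations on G are more expensive still):

```python
def update_covariance(G, g):
    """Rank-one update G += g g^T; touching every entry costs O(d^2) time per gradient."""
    d = len(g)
    for i in range(d):
        for j in range(d):
            G[i][j] += g[i] * g[j]
    return G

# Accumulate G = sum_t g_t g_t^T for two gradients in R^2.
G = [[0.0, 0.0], [0.0, 0.0]]
for g in [(1.0, 0.0), (1.0, 1.0)]:
    update_covariance(G, g)
print(G)  # [[2.0, 1.0], [1.0, 1.0]]
```

Even storing G takes O(d^2) memory, which is why the next slide calls this too slow in high dimensions.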


SLIDE 15

Issues with Using Covariance Matrix

  • $d^2$ time is too slow; there is a lot of work on compressing the matrix to try to make some tradeoff [Luo+16; GKS18; Aga+18].
  • The regret bound might not even be better!
$$\sqrt{d \sum_{t=1}^T \langle w_\star, g_t \rangle^2} \;\overset{?}{\le}\; \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$


SLIDE 17

Goals

1: Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).
2: Want an efficient algorithm (O(d) time per update in d dimensions).
3: Want to never do worse than non-preconditioned algorithms.

  • We will achieve 2 and 3, and sometimes 1.

SLIDE 18

Our Contribution

We provide an online learning algorithm that:

  • Runs in $O(d)$ time per update.
  • Always achieves regret:
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$
  • When $-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$, achieves:
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$


SLIDE 20

Unpacking the Condition

  • We need $-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$ for preconditioned regret.
  • If the $g_t$ are mean-zero independent random variables, then standard concentration results say:
$$-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| = \Theta\!\left(\sqrt{\sum_{t=1}^T \|g_t\|^2}\right)$$
  • We achieve preconditioning whenever there is any “signal” in the gradients.
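A quick numeric illustration of the condition (a made-up example, not from the talk): with pure-noise gradients the left-hand side hovers around the $\sqrt{\sum_t \|g_t\|^2}$ scale, while gradients with a consistent signal component exceed it by a wide margin.

```python
import math
import random

def condition_ratio(gs, w_star):
    """LHS/RHS of the condition: (-sum_t <g_t, w*> / ||w*||) / sqrt(sum_t ||g_t||^2).
    A value >= 1 means the preconditioned-regret condition holds."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    lhs = -sum(dot(g, w_star) for g in gs) / norm(w_star)
    rhs = math.sqrt(sum(norm(g) ** 2 for g in gs))
    return lhs / rhs

random.seed(0)
T, w_star = 1000, (1.0, 0.0)
noise = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(T)]   # pure noise
signal = [(x - 1.0, y) for x, y in noise]  # same noise plus a consistent descent direction
print(condition_ratio(noise, w_star), condition_ratio(signal, w_star))
```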



SLIDE 23

Coin Betting [OP16]

  • Define wealth:
$$\text{Wealth}_T = 1 - \sum_{t=1}^T \langle g_t, w_t \rangle$$
  • High wealth implies low regret:
$$R_T(w_\star) = \underbrace{1 - \sum_{t=1}^T \langle g_t, w_\star \rangle}_{\text{out of our control}} - \text{Wealth}_T$$
  • At every iteration, choose a betting fraction $v_t \in \mathbb{R}^d$ and use
$$w_t = v_t \, \text{Wealth}_{t-1}$$
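A minimal coin-betting sketch under the definitions above (one-dimensional, with a fixed betting fraction v; the fixed fraction stands in for the $v_t$ chosen adaptively later in the talk):

```python
def coin_betting(gradients, v):
    """1-d coin betting: start with Wealth_0 = 1 and bet w_t = v * Wealth_{t-1}."""
    wealth = 1.0
    bets = []
    for g in gradients:
        w = v * wealth
        bets.append(w)
        wealth -= g * w          # Wealth_t = Wealth_{t-1} - g_t * w_t
    return bets, wealth

# Gradients consistently equal to -1 ("heads on every flip"): wealth compounds as (1 + v)^T.
bets, wealth = coin_betting([-1.0] * 10, v=0.5)
print(wealth)  # 1.5 ** 10 = 57.665...
```

Exponentially growing wealth is exactly what the regret identity above converts into low regret.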

SLIDE 24

Oracle value for v yields a good algorithm

Set
$$v_t = v_\star \approx \frac{w_\star}{\|w_\star\| \sqrt{\sum_{t=1}^T \langle g_t, w_\star \rangle^2}}.$$
Then
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$

  • There are no matrices here!
  • But we don’t know this magic value for v.


SLIDE 26

Online Learning Inside Online Learning [CO18]

  • Define $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$. Then:
$$R^v_T(v_\star) := \sum_{t=1}^T \ell_t(v_t) - \ell_t(v_\star)$$
  • If $R^v_T(v_\star) = O(\log(T))$, then the final regret $R_T(w_\star)$ is the same as if we’d used the constant $v_t = v_\star$.
  • We can use online learning to choose the $v_t$!
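To make the inner loop concrete, here is a sketch that picks the $v_t$ by plain gradient descent on $\ell_t$ (the fixed step size and the use of vanilla gradient descent are our simplifications for illustration; the actual inner algorithm in the talk differs):

```python
def inner_ogd(gradients, eta=0.1):
    """Choose betting fractions v_t by gradient descent on l_t(v) = -log(1 - <g_t, v>).
    The gradient of l_t at v is g_t / (1 - <g_t, v>)."""
    d = len(gradients[0])
    v = [0.0] * d
    fractions = []
    for g in gradients:
        fractions.append(list(v))          # v_t is chosen before g_t arrives
        dot = sum(gi * vi for gi, vi in zip(g, v))
        scale = 1.0 / (1.0 - dot)          # assumes <g_t, v> < 1 so the loss is defined
        v = [vi - eta * scale * gi for vi, gi in zip(v, g)]
    return fractions

# Repeated gradient -e_1 drives the betting fraction toward that direction.
fr = inner_ogd([(-1.0, 0.0)] * 3)
print(fr)
```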

SLIDE 27

Overview of Algorithm Strategy

  • There exists an unknown $v_\star$ that would give preconditioned regret.
  • We can choose $v_t$ using online convex optimization on the losses $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$.
  • If we get $R^v_T(v_\star) = \sum_{t=1}^T \ell_t(v_t) - \ell_t(v_\star) = O(\log(T))$, then we are as good as picking $v_\star$ from the beginning.
  • So how can we obtain logarithmic regret?


SLIDE 29

How to obtain logarithmic regret?

  • Strategy: Remember that the constant $v_\star$ we need to compete with is
$$v_\star = \frac{w_\star}{\|w_\star\| \sqrt{\sum_{t=1}^T \langle g_t, w_\star \rangle^2}},$$
so $\|v_\star\| = O(1/\sqrt{T})$ usually.
  • This means that we can use a non-preconditioned online learning algorithm to obtain logarithmic regret:
$$R^v_T(v_\star) \le \|v_\star\| \sqrt{T} = O(1)$$
  • Sometimes the best v is not small; this is why we do not always obtain preconditioned regret.

SLIDE 30

Experiments

[Figure: test accuracy (roughly 0.175 to 0.35) versus training steps (up to 300,000) on the LM1B dataset with a Transformer model, comparing Adam, Adagrad, Recursive, and Recursive+Momentum.]


SLIDE 32

Summary

  • When the gradients are “obviously non-random”, we obtain preconditioned regret bounds without any bad $\sqrt{d}$ constant factors.
  • Otherwise, we decay to the ordinary non-preconditioned regret bounds (actually, we improve log factors).
  • The algorithm runs in the same time complexity as ordinary gradient descent.
  • The empirical performance is promising.

Thank you!