Matrix-Free Preconditioning in Online Learning
Ashok Cutkosky, Tamas Sarlos
Google Research
Ashok Cutkosky, Tamas Sarlos Matrix-Free Preconditioning in Online Learning 1 of 20
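Online Optimization
For $t = 1, \dots, T$, repeat:
1: Learner chooses a point $w_t$.
2: Environment presents learner with a gradient $g_t$ (think $E[g_t] = \nabla F(w_t)$).
3: Learner suffers loss $\langle g_t, w_t\rangle$.
The objective is to minimize regret:
$R_T(w^\star) = \sum_{t=1}^T \left( \underbrace{\langle g_t, w_t\rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w^\star\rangle}_{\text{benchmark loss}} \right)$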
Running an online algorithm on a stochastic optimization problem guarantees $F(w_T) - F(w^\star) \le R_T(w^\star)/T$ (in expectation, taking $w_T$ to be the average of the iterates).
The Classic Algorithm: Gradient Descent
$w_{t+1} = w_t - \eta_t g_t$
Gradient descent obtains regret:
$R_T(w^\star) \le \sqrt{\sum_{t=1}^T \|w^\star\|^2 \|g_t\|^2}$
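The gradient descent update and its regret can be checked numerically. What follows is a minimal sketch, assuming linear losses $\langle g_t, w\rangle$, a start at $w_1 = 0$, and a constant learning rate $\eta$; the helper names (`dot`, `run_gradient_descent`, `regret`) and the random gradients are illustrative and not from the talk. Under these assumptions the standard analysis gives $R_T(w^\star) \le \|w^\star\|^2/(2\eta) + (\eta/2)\sum_t \|g_t\|^2$, which equals $\|w^\star\|\sqrt{\sum_t \|g_t\|^2}$ at the hindsight choice $\eta = \|w^\star\|/\sqrt{\sum_t \|g_t\|^2}$.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def run_gradient_descent(grads, eta):
    """Play w_{t+1} = w_t - eta * g_t from w_1 = 0; return the points played."""
    w = [0.0] * len(grads[0])
    played = []
    for g in grads:
        played.append(list(w))
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return played


def regret(grads, played, w_star):
    """R_T(w_star) = sum_t <g_t, w_t> - sum_t <g_t, w_star>."""
    return sum(dot(g, w) - dot(g, w_star) for g, w in zip(grads, played))


rng = random.Random(0)
d, T = 5, 200
w_star = [1.0, -2.0, 0.5, 0.0, 3.0]  # illustrative comparator
grads = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(T)]

sum_sq = sum(dot(g, g) for g in grads)
norm_w_star = math.sqrt(dot(w_star, w_star))
eta = norm_w_star / math.sqrt(sum_sq)  # oracle tuning: needs hindsight

played = run_gradient_descent(grads, eta)
observed = regret(grads, played, w_star)
bound = norm_w_star * math.sqrt(sum_sq)
print(f"regret = {observed:.3f}, bound = {bound:.3f}")
```

The oracle learning rate depends on quantities known only after the fact, so this checks the bound rather than describing a practical tuning rule.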
Gradient Descent
[Figure]
Preconditioning (Deterministic)
- The gradient ∇F(w) may not point towards the minimum w⋆
Key idea: “Preconditioning” means ignoring irrelevant directions.
Preconditioning (Stochastic)
- Noise can also make $g_t$ not point towards the minimum.
Regret Bounds
- Regret of un-preconditioned stochastic gradient descent (with the appropriate learning rate) is
  $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \|w^\star\|^2 \|g_t\|^2} = O(\sqrt{T})$
- An ideal preconditioned algorithm should obtain regret
  $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t\rangle^2} = O(\sqrt{T})$
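The two bounds differ only by the Cauchy-Schwarz inequality $\langle w^\star, g_t\rangle^2 \le \|w^\star\|^2 \|g_t\|^2$, so the preconditioned bound is never larger. The sketch below compares them on made-up data, assuming Gaussian gradients in 50 dimensions and a comparator with a single nonzero coordinate (so the other 49 coordinates are the "irrelevant directions"); the function names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unpreconditioned_bound(grads, w_star):
    """sqrt(sum_t ||w_star||^2 ||g_t||^2)."""
    return math.sqrt(sum(dot(w_star, w_star) * dot(g, g) for g in grads))


def preconditioned_bound(grads, w_star):
    """sqrt(sum_t <w_star, g_t>^2)."""
    return math.sqrt(sum(dot(w_star, g) ** 2 for g in grads))


rng = random.Random(1)
d, T = 50, 1000
w_star = [1.0] + [0.0] * (d - 1)  # only coordinate 0 matters
grads = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]

plain = unpreconditioned_bound(grads, w_star)
ideal = preconditioned_bound(grads, w_star)
print(f"unpreconditioned: {plain:.1f}, preconditioned: {ideal:.1f}")
```

With isotropic noise like this, the ratio between the two bounds is roughly $\sqrt{d}$; both still grow like $\sqrt{T}$, which is why the improvement does not show up in the $O(\sqrt{T})$ rate itself.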
Regret Bound Picture
[Figure]
Goals
- Want regret bound as good as if we had ignored irrelevant directions
(up to constants/logs)
Using the Covariance Matrix
The typical approach to preconditioning maintains the matrix
$G = \sum_{t=1}^T g_t g_t^\top$
and computes various inverses and square roots of $G$. This can obtain the guarantee [CO18; KL17]
$R_T(w^\star) \le \sqrt{d \sum_{t=1}^T \langle w^\star, g_t\rangle^2}$
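To make the cost of this approach concrete, here is a small sketch of the bookkeeping alone, assuming $G$ is stored as a dense $d \times d$ array. It does not compute the inverses or square roots that the algorithms in [CO18; KL17] need; it only shows that each rank-one update touches $d^2$ entries, and that $w^{\star\top} G\, w^\star = \sum_t \langle w^\star, g_t\rangle^2$, the quantity in the bound. All names and data are illustrative.

```python
import random


def outer_update(G, g):
    """In-place G += g g^T; touches all d*d entries."""
    for i, gi in enumerate(g):
        row = G[i]
        for j, gj in enumerate(g):
            row[j] += gi * gj


def quadratic_form(G, w):
    """Compute w^T G w."""
    n = len(w)
    return sum(w[i] * sum(G[i][j] * w[j] for j in range(n)) for i in range(n))


rng = random.Random(2)
d, T = 4, 30
grads = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(T)]
w_star = [0.5, -1.0, 2.0, 0.0]  # illustrative comparator

G = [[0.0] * d for _ in range(d)]
for g in grads:
    outer_update(G, g)

direct = sum(sum(wi * gi for wi, gi in zip(w_star, g)) ** 2 for g in grads)
via_G = quadratic_form(G, w_star)
print(direct, via_G)
```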
Issues with Using Covariance Matrix
- $d^2$ time is too slow. There's a lot of work on compressing the matrix to try to make some tradeoff [Luo+16; GKS18; Aga+18].
- The regret bound might not even be better!
  $\sqrt{d \sum_{t=1}^T \langle w^\star, g_t\rangle^2} \overset{?}{\le} \sqrt{\|w^\star\|^2 \sum_{t=1}^T \|g_t\|^2}$
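One case where the inequality above fails is easy to work out by hand. Suppose (as an illustrative assumption, not an example from the talk) that every gradient is parallel to the comparator, $g_t = c_t\, w^\star/\|w^\star\|$ for scalars $c_t$. Then Cauchy-Schwarz holds with equality and the factor of $d$ is pure loss:

```latex
\langle w^\star, g_t\rangle^2 = c_t^2 \|w^\star\|^2 = \|w^\star\|^2 \|g_t\|^2
\quad\Longrightarrow\quad
\sqrt{d \sum_{t=1}^T \langle w^\star, g_t\rangle^2}
  = \sqrt{d}\,\sqrt{\|w^\star\|^2 \sum_{t=1}^T \|g_t\|^2}.
```

In this case the full-matrix bound is larger than the un-preconditioned one by exactly $\sqrt{d}$.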
Goals
1: Want regret bound as good as if we had ignored irrelevant directions (up to constants/logs).
2: Want an efficient algorithm ($O(d)$ time per update in $d$ dimensions).
3: Want to never do worse than non-preconditioned algorithms.
- We will achieve 2 and 3, and sometimes 1.
Our Contribution
We provide an online learning algorithm that:
- Runs in $O(d)$ time per update.
- Always achieves regret:
  $R_T(w^\star) \le \|w^\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$
- When $-\sum_{t=1}^T \langle g_t, w^\star\rangle / \|w^\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$, achieves:
  $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t\rangle^2}$
Unpacking the Condition
- We need $-\sum_{t=1}^T \langle g_t, w^\star\rangle / \|w^\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$ for preconditioned regret.
- If $g_t$ are mean-zero independent random variables, then standard concentration results say:
  $-\sum_{t=1}^T \langle g_t, w^\star\rangle / \|w^\star\| \le \left\| \sum_{t=1}^T g_t \right\| = \Theta\left( \sqrt{\sum_{t=1}^T \|g_t\|^2} \right)$
We achieve preconditioning whenever there is any “signal” in the gradients.
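The condition can be evaluated directly from a gradient sequence. The sketch below is illustrative only: `signal_condition` is a hypothetical helper, and the two sequences are hand-built extremes, one where every gradient points directly away from $w^\star$ (pure signal) and one where the gradients alternate in sign and cancel (no signal).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def signal_condition(grads, w_star):
    """Check -sum_t <g_t, w_star>/||w_star|| >= sqrt(sum_t ||g_t||^2).

    Returns (condition holds, left-hand side, right-hand side).
    """
    norm_w = math.sqrt(dot(w_star, w_star))
    lhs = -sum(dot(g, w_star) for g in grads) / norm_w
    rhs = math.sqrt(sum(dot(g, g) for g in grads))
    return lhs >= rhs, lhs, rhs


w_star = [3.0, 4.0]
u = [0.6, 0.8]  # unit vector along w_star
T = 100

# Consistent signal: every gradient is -u, so the sum grows like T.
drift = [[-x for x in u] for _ in range(T)]
# No signal: gradients alternate between +u and -u and cancel.
cancel = [[x if t % 2 == 0 else -x for x in u] for t in range(T)]

drift_ok, drift_lhs, drift_rhs = signal_condition(drift, w_star)
cancel_ok, cancel_lhs, cancel_rhs = signal_condition(cancel, w_star)
print(drift_ok, drift_lhs, drift_rhs)
print(cancel_ok, cancel_lhs, cancel_rhs)
```

In the drift case the left-hand side scales like $T$ while the right-hand side scales like $\sqrt{T}$, so a persistent bias in the gradients eventually satisfies the condition.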
Coin Betting [OP16]
- Define wealth:
  $\text{Wealth}_T = 1 - \sum_{t=1}^T \langle g_t, w_t\rangle$
- High wealth implies low regret:
  $R_T(w^\star) = \underbrace{1 - \sum_{t=1}^T \langle g_t, w^\star\rangle}_{\text{out of our control}} - \text{Wealth}_T$
- At every iteration, choose a betting fraction $v_t \in \mathbb{R}^d$ and use
  $w_t = v_t \text{Wealth}_{t-1}$
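The wealth bookkeeping can be sketched in a few lines. This example assumes a fixed betting fraction $v_t = v$ for every round, which is a simplification: the betting fraction, the gradients, and the comparator below are all made-up values. With a fixed $v$, substituting $w_t = v\,\text{Wealth}_{t-1}$ into the wealth definition gives $\text{Wealth}_T = \prod_{t=1}^T (1 - \langle g_t, v\rangle)$, and the code checks both this product form and the regret identity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def coin_betting(grads, v):
    """Bet a fixed fraction v: w_t = v * Wealth_{t-1}.

    Returns (points played, final wealth), starting from Wealth_0 = 1.
    """
    wealth = 1.0
    played = []
    for g in grads:
        w = [vi * wealth for vi in v]
        played.append(w)
        wealth -= dot(g, w)  # Wealth_t = Wealth_{t-1} - <g_t, w_t>
    return played, wealth


grads = [[-0.5, 0.1], [-0.3, -0.2], [0.2, 0.4], [-0.6, 0.0]]
v = [0.4, 0.0]       # hypothetical betting fraction
w_star = [2.0, 0.0]  # hypothetical comparator

played, wealth = coin_betting(grads, v)
product_form = math.prod(1.0 - dot(g, v) for g in grads)

benchmark = sum(dot(g, w_star) for g in grads)
regret = sum(dot(g, w) for g, w in zip(grads, played)) - benchmark
regret_from_wealth = 1.0 - benchmark - wealth
print(wealth, product_form, regret, regret_from_wealth)
```

Wealth stays positive as long as $\langle g_t, v\rangle < 1$ in every round, which is why betting fractions are kept small relative to the gradient scale.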
Oracle value for v yields good algorithm
Set $v_t = v^\star \approx \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^T \langle g_t, w^\star\rangle^2}}$. Then
$R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t\rangle^2}$
- There are no matrices here!
- But we don’t know this magic value for $v$.
Online Learning Inside Online Learning [CO18]
- Define $\ell_t(v) = -\log(1 - \langle g_t, v\rangle)$. Then:
  $R^v_T(v^\star) := \sum_{t=1}^T \left( \ell_t(v_t) - \ell_t(v^\star) \right)$
- If $R^v_T(v^\star) = O(\log(T))$, then the final regret $R_T(w^\star)$ is the same as if we’d used the constant $v_t = v^\star$.
- We can use online learning to choose the $v_t$!
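The reason the losses $\ell_t$ are the right ones is that $\log \text{Wealth}_T = -\sum_t \ell_t(v_t)$, so the wealth under learned $v_t$ equals the wealth under constant $v^\star$ times $\exp(-R^v_T(v^\star))$. The sketch below illustrates this identity using plain projected online gradient descent as the inner learner. That inner learner is a stand-in chosen for brevity, not the algorithm from the talk, and the step size, the ball radius of 1/2, the assumption $\|g_t\| \le 1$, and the synthetic gradients are all assumptions of this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_ball(v, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(dot(v, v))
    if norm <= radius:
        return v
    return [radius * vi / norm for vi in v]


def learn_betting_fractions(grads, step=0.1, radius=0.5):
    """Choose v_t by projected gradient descent on l_t(v) = -log(1 - <g_t, v>).

    Assumes ||g_t|| <= 1, so 1 - <g_t, v> >= 1/2 inside the ball.
    Returns (final wealth, sum_t l_t(v_t)).
    """
    v = [0.0] * len(grads[0])
    wealth = 1.0
    total_loss = 0.0
    for g in grads:
        margin = 1.0 - dot(g, v)
        w = [vi * wealth for vi in v]       # w_t = v_t * Wealth_{t-1}
        wealth -= dot(g, w)                 # new wealth = old wealth * margin
        total_loss += -math.log(margin)     # l_t(v_t)
        grad_v = [gi / margin for gi in g]  # gradient of l_t at v_t
        stepped = [vi - step * gv for vi, gv in zip(v, grad_v)]
        v = project_to_ball(stepped, radius)
    return wealth, total_loss


rng = random.Random(3)
T = 500
# Synthetic gradients: bounded noise plus a drift in the first coordinate.
grads = [[-0.3 + rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
         for _ in range(T)]

wealth, total_loss = learn_betting_fractions(grads)
print(math.log(wealth), -total_loss)
```

Because the gradients here carry a consistent drift, the learned betting fractions move toward it and the wealth grows; with pure noise the same code would keep $v_t$ near zero.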
Overview of Algorithm Strategy
- There exists an unknown v⋆ that would give preconditioned regret.
- We can choose $v_t$ using online convex optimization on losses $\ell_t(v) = -\log(1 - \langle g_t, v\rangle)$.
- If we get $R^v_T(v^\star) = \sum_{t=1}^T \left( \ell_t(v_t) - \ell_t(v^\star) \right) = O(\log(T))$, then we are as good as picking $v^\star$ from the beginning.
- So how can we obtain logarithmic regret?
How to obtain logarithmic regret?
- Strategy: Remember that the constant $v^\star$ we need to compete with is $v^\star = \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^T \langle g_t, w^\star\rangle^2}}$, so $\|v^\star\| = O(1/\sqrt{T})$ usually.
- This means that we can use a non-preconditioned online learning algorithm to obtain logarithmic regret: $R^v_T(v^\star) \le \|v^\star\| \sqrt{T} = O(1)$
- Sometimes the best $v$ is not small; this is why we do not always obtain preconditioned regret.
Experiments
[Figure: test accuracy (y-axis, 0.175 to 0.350) versus training steps (x-axis, 50,000 to 300,000) for Adam, Adagrad, Recursive, and Recursive+Momentum.]
Test accuracy on LM1B dataset with Transformer model
Summary
- When the gradients are “obviously non-random”, we obtain
preconditioned regret bounds without any bad $\sqrt{d}$ constant factors.
- Otherwise, we decay to the ordinary non-preconditioned regret
bounds (actually, we improve log factors).
- The algorithm runs in the same time complexity as ordinary gradient
descent.
- The empirical performance is promising.
Thank you!