Incremental Methods for Additive Convex Cost Minimization: Deterministic vs Randomized Variants
Mert Gurbuzbalaban (Rutgers) joint work with
- A. Ozdaglar (MIT), P. Parrilo (MIT), D. Vanli (MIT)
DIMACS Workshop, August 2017
Introduction

Problem: minimize an additive convex cost over x ∈ R^n,

    min_{x ∈ R^n} f(x) = ∑_{i=1}^m f_i(x),

where each component function f_i is convex. A canonical instance is regularized empirical risk minimization over data pairs (x_i, y_i), i = 1, ..., m, where x_i ∈ R^n is a feature vector and y_i its label:

    min_θ (1/m) ∑_{i=1}^m L(y_i, x_i, θ) + pen(θ),

with loss function L and regularization term pen(θ).

[Figure: component functions f_1(x_1, ..., x_n), f_2(x_1, ..., x_n), ..., f_m(x_1, ..., x_n) sharing the decision variables x_1, ..., x_n.]
When the number of components m is large, computing the full gradient ∇f(x) = ∑_{i=1}^m ∇f_i(x) at every iteration is very costly. Incremental methods instead process one component per inner step:

    x^k_{i+1} = x^k_i - α_k ∇f_i(x^k_i),    i = 1, ..., m,

where k indexes epochs (full passes over the data), i indexes the inner iterations, and α_k is a stepsize. In the deterministic variant, the components are visited in the fixed cyclic order 1, 2, 3, ..., m in every epoch.
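The update above can be sketched in a few lines. This is a minimal illustration, not the speakers' code; the quadratic components, stepsize rule, and epoch count below are made up for the example.

```python
import numpy as np

def incremental_gradient(grads, x0, step, epochs):
    """Cycle through the component gradients in the fixed order 1, ..., m.

    grads : list of callables, grads[i](x) returns grad f_i(x)
    step  : callable mapping epoch index k to the stepsize alpha_k
    """
    x = np.array(x0, dtype=float)
    for k in range(1, epochs + 1):
        alpha = step(k)
        for g in grads:               # inner iterations i = 1, ..., m
            x = x - alpha * g(x)      # x^k_{i+1} = x^k_i - alpha_k grad f_i(x^k_i)
    return x

# Toy cost: f_i(x) = 0.5 * (x - c_i)^2, so the minimizer of sum_i f_i is mean(c).
c = np.array([1.0, 2.0, 3.0, 6.0])
grads = [lambda x, ci=ci: x - ci for ci in c]   # default arg freezes each c_i
x = incremental_gradient(grads, 0.0, step=lambda k: 1.0 / (k + 1), epochs=2000)
print(float(x))   # approaches mean(c) = 3.0
```

The `ci=ci` default argument is the standard idiom to avoid all lambdas capturing the last loop value.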
Incremental Gradient Method

Example: solving a system of linear equations a_i^T x = b_i, i = 1, ..., m, by minimizing f(x) = ∑_{i=1}^m (a_i^T x - b_i)^2 incrementally, one equation at a time (a Kaczmarz-type method).
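For f_i(x) = ½(a_i^T x - b_i)^2 the incremental step becomes the classical Kaczmarz projection when the stepsize is 1/||a_i||^2. The sketch below uses randomly generated, illustrative problem data, not an example from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                    # consistent system: a_i^T x = b_i for every i

x = np.zeros(n)
for k in range(100):              # epochs: cyclic passes over the equations
    for i in range(m):            # inner step on f_i(x) = 0.5 * (a_i^T x - b_i)^2
        a = A[i]
        # stepsize 1 / ||a_i||^2 makes this an exact projection onto the
        # hyperplane a_i^T x = b_i (the classical Kaczmarz update)
        x -= (a @ x - b[i]) / (a @ a) * a

print(np.linalg.norm(x - x_true))   # residual shrinks toward zero
```

For a consistent system the cyclic Kaczmarz sweep converges linearly to the solution.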
Assumptions:

1. Each component f_i is convex and continuously differentiable.
2. Each gradient ∇f_i is Lipschitz continuous with constant L_i; let L = ∑_i L_i.
3. The inner update is x^k_{i+1} = x^k_i - α_k ∇f_i(x^k_i) with a diminishing stepsize α_k.
The deterministic order matters: there are instances on which the distance to the optimum satisfies

    dist_k = ||x^k_1 - x^*|| > Ω(1/k^{1/5}),

so a poorly chosen cyclic order can be slow.

For quadratic components f_i(x) = (1/2) x^T P_i x - q_i^T x + r_i, the effect of the processing order σ enters through sums of the form

    ∑_{j=1}^m j L_σ(j) G ≤ L m G,

where G is a bound on the component gradients along the trajectory.
Random Orders

Instead of the fixed cyclic order, the components can be processed in a random order: sampling with replacement (stochastic gradient descent) or drawing a fresh random permutation of the components in each epoch (random reshuffling).
Random Reshuffling

With stepsize α_k = Θ(1/k^s) for s ∈ (1/2, 1), the asymptotic behavior of random reshuffling is governed by the quantity ∑_{i=1}^m ∇^2 f_i(x^*) ∇f_i(x^*).

Toy example (m = 2): f_1(x) = (1/2)(x + 1)^2, f_2(x) = (1/2)(x - 1)^2. Here θ^* = 0.
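The two-component toy problem can be simulated directly. A minimal sketch, assuming the stepsize α_k = 1/k^{3/4} (one admissible choice of s ∈ (1/2, 1)); the epoch counts and seeds are illustrative.

```python
import numpy as np

grads = [lambda x: x + 1.0, lambda x: x - 1.0]   # f1(x)=0.5(x+1)^2, f2(x)=0.5(x-1)^2
step = lambda k: 1.0 / k ** 0.75                  # alpha_k = 1/k^s with s = 3/4

def run(shuffle, epochs, seed=0):
    rng = np.random.default_rng(seed)
    x = 1.0
    for k in range(1, epochs + 1):
        # fixed cyclic order (1, 2), or a fresh permutation per epoch
        order = rng.permutation(2) if shuffle else [0, 1]
        for i in order:
            x -= step(k) * grads[i](x)
    return x

cyc = abs(run(False, 10000))
rr = np.mean([abs(run(True, 10000, seed=s)) for s in range(20)])
print(cyc, rr)   # reshuffled runs end closer to theta* = 0
```

The fixed order carries a systematic O(α_k) bias toward one side, while reshuffling randomizes the sign of the per-epoch error so it averages out.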
One epoch of the two-component example can be written as a full-gradient step plus a gradient error e_k:

    x^{k+1}_1 = x^k_1 - α_k (∇f_1(x^k_1) + ∇f_2(x^k_1) + e_k),

where, depending on which component is processed first,

    e_k = ∇f_2(x^k_2) - ∇f_2(x^k_1)    or    e_k = ∇f_1(x^k_2) - ∇f_1(x^k_1).

Since the inner iterate x^k_2 is only O(α_k) away from x^k_1, the cyclic analysis gives e_k = O(α_k).
Telescoping the epoch recursion,

    x^j_1 - x^{j+1}_1 = α_j (∇f(x^j_1) + e_j),

and linearizing ∇f(x^j_1) = H^*(x^j_1 - x^*) (with H^* = ∇^2 f(x^*)) around the optimum,

    x^0_1 - x^{k+1}_1 = ∑_{j=0}^{k} (x^j_1 - x^{j+1}_1) ≈ ∑_{j=0}^{k} α_j (H^*(x^j_1 - x^*) + e_j).

Dividing by k ᾱ_k, where ᾱ_k = ∑_j α_j / k is the averaged stepsize, shows that a stepsize-weighted average of the iterates converges to θ^* a.s.
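The e_k = O(α_k) step can be checked numerically on the same two-component example. A small sketch; the starting point and the probed stepsizes are arbitrary choices for illustration.

```python
g1 = lambda x: x + 1.0          # grad f1, with f1(x) = 0.5 * (x + 1)^2
g2 = lambda x: x - 1.0          # grad f2, with f2(x) = 0.5 * (x - 1)^2

def epoch_error(x, alpha):
    """Gradient error e_k of one cyclic epoch (order f1 then f2) started at x."""
    x1 = x - alpha * g1(x)                 # inner step on f1
    # one epoch = full-gradient step plus error:
    #   x_next = x - alpha * (g1(x) + g2(x) + e),  so  e = g2(x1) - g2(x)
    return g2(x1) - g2(x)

x = 0.3
for alpha in [0.1, 0.01, 0.001]:
    print(alpha, epoch_error(x, alpha) / alpha)   # ratio e_k / alpha_k stays bounded
```

For these quadratics the ratio is exactly -(x + 1), independent of α, confirming that the error scales linearly with the stepsize.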
The limiting behavior is therefore driven by the accumulated weighted error ∑_j α_j e_j: under random reshuffling the gradient errors e_j partially average out across epochs, which is the mechanism behind its faster asymptotic convergence to θ^*.
The same cyclic-versus-random question arises for coordinate descent on min_{x ∈ R^n} f(x). Consider the quadratic problem

    min_{x ∈ R^n} f(x) = (1/2) x^T A x - b^T x,

where A is symmetric positive definite with unit diagonal (this is not restrictive, as we can always put A into this form by scaling x).
Cyclic coordinate descent (CCD) minimizes f over the coordinates in the fixed order 1, ..., n; randomized coordinate descent (RCD) samples the coordinate to update at random. To compare them, define the asymptotic worst-case rates over epochs ℓ:

    Rate(CCD) = lim_{ℓ→∞} -(1/ℓ) log sup_{x^0 ∈ R^n} ( ||x^ℓ_CCD - x^*|| / ||x^0 - x^*|| ),
    Rate(RCD) = lim_{ℓ→∞} -(1/ℓ) log sup_{x^0 ∈ R^n} ( E ||x^ℓ_RCD - x^*|| / ||x^0 - x^*|| ),

so larger values mean faster convergence.
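The two orderings can be compared on a small instance. A sketch under stated assumptions: the test matrix below (unit diagonal, constant off-diagonal entries c, chosen diagonally dominant) and the epoch budget are illustrative choices, not the instances from the talk.

```python
import numpy as np

n, c = 20, 0.05
A = np.full((n, n), c)
np.fill_diagonal(A, 1.0)          # unit diagonal, constant off-diagonal entries
b = np.ones(n)
x_star = np.linalg.solve(A, b)    # minimizer of 0.5 x^T A x - b^T x

def coord_descent(order_fn, epochs, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(epochs):
        for i in order_fn(rng):
            x[i] -= A[i] @ x - b[i]   # exact minimization over x_i (A[i, i] == 1)
    return np.linalg.norm(x - x_star)

ccd = coord_descent(lambda rng: range(n), 200)                # cyclic order 1, ..., n
rcd = coord_descent(lambda rng: rng.integers(0, n, n), 200)   # n random draws per epoch
print(ccd, rcd)
```

Both variants converge on this instance; the theoretical comparison of their asymptotic rates is the subject of the results that follow.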
Write A = I + L + L^T, where L is the strictly lower triangular part of A, and consider the scaled family A_α = I + αL + αL^T for α ∈ (0, 1]. When the relevant spectral quantities of A_α are independent of α, A is said to belong to a structured class for which the CCD/RCD comparison can be carried out exactly.

The analysis tracks an eigenvalue λ of the iteration matrix as a function of α ∈ (0, 1], using two properties: (1) on part of the range it is a constant, and (2) it is strictly decreasing on λ ∈ (0, 1]. The spectral radius is extracted through a limit of the form [·]^{1/t}, by expanding the diagonal entries [(A_α)^t]_{i,i} as a sum over all possible walks of length t from i to itself.
Theorem (informal): for this matrix class, with an auxiliary constant ν_n ∈ [1, 3/2),

    Rate(CCD) / Rate(RCD) = 2    (asymptotically in n),

so on these instances the deterministic cyclic order converges twice as fast as randomized coordinate sampling.
38
Random Reshuffling
39
Random Reshuffling
2 × n 2
n
1 200
1 2 3 4 5 6 7 8 9 10 −14 −12 −10 −8 −6 −4 −2 Number of Epochs ℓ log
CCD RCD Expected RCD 1 2 3 4 5 6 7 8 9 10 −14 −12 −10 −8 −6 −4 −2 Number of Epochs ℓ log
CCD RCD Expected RCD
40
Random Reshuffling
41
Conclusions

Deterministic orders can beat randomized ones: random reshuffling outperforms sampling with replacement, and cyclic coordinate descent can converge faster than randomized coordinate descent on structured quadratics.
Appendix (cyclic analysis). Write one epoch as

    x^{k+1}_1 = x^k_1 - α_k (∇f(x^k_1) - e_k),    e_k = ∑_{i=1}^m ( ∇f_i(x^k_1) - ∇f_i(x^k_i) ).

By the fundamental theorem of calculus, ∇f(x^k_1) = A_k (x^k_1 - x^*), where

    A_k = ∫_0^1 ∇^2 f(x^* + τ(x^k_1 - x^*)) dτ,

and writing dist_k = ||x^k_1 - x^*||, with stepsize α_k = R/k we have for k ≥ RL

    dist_{k+1} ≤ ||I - α_k A_k|| dist_k + α_k ||e_k||,

where the first term is the transient term and the second is the accumulated error.
Appendix (incremental subgradient method). For nondifferentiable components, gradients are replaced by subgradients:

    x^k_{i+1} = x^k_i - α_k g^k_i,    i = 1, ..., m,

where g^k_i ∈ ∂f_i(x^k_i) is a subgradient of f_i at x^k_i, and α_k is a stepsize. Over one epoch,

    x^k_{m+1} = x^k_1 - α_k ∑_{i=1}^m g^k_i,

so one full pass through 1, 2, 3, ..., m resembles an approximate subgradient step on f, and the next epoch starts from x^{k+1}_1 = x^k_{m+1}.
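The subgradient variant can be sketched on a simple nonsmooth cost. A minimal illustration, assuming components f_i(x) = |x - c_i| (whose sum is minimized at the median of the c_i); the data, stepsize rule, and epoch count are made up for the example.

```python
import numpy as np

c = np.array([-2.0, 0.0, 1.0, 3.0, 5.0])   # f_i(x) = |x - c_i|; minimizer = median(c)
m = len(c)

def subgrad(x, ci):
    # an element of the subdifferential of |x - c_i| (0 is valid at the kink)
    return float(np.sign(x - ci))

x = 10.0
for k in range(1, 5001):
    alpha = 1.0 / k                         # diminishing stepsize alpha_k
    for i in range(m):                      # inner iterations i = 1, ..., m
        x -= alpha * subgrad(x, c[i])       # x^k_{i+1} = x^k_i - alpha_k g^k_i
print(x)   # approaches median(c) = 1.0
```

With a diminishing stepsize the iterates oscillate around the minimizer with shrinking amplitude rather than converging monotonically, which is typical for subgradient methods.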