Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
Blake Woodworth (TTIC)
Jialei Wang (UChicago → Two Sigma Investments)
Adam Smith (Boston University)
H. Brendan McMahan (Google)
Nati Srebro (TTIC)
Many parallelization scenarios:
…
$$\min_x F(x) := \mathbb{E}_{z \sim \mathcal{D}}\left[f(x; z)\right]$$
Ancestors($u_{10}$)
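The ancestor set of a node in a dependency graph can be computed with a simple traversal. The following is a minimal sketch, assuming the graph is given as a hypothetical `parents` dictionary; the node names and edges are illustrative and are not taken from the slide's figure.

```python
def ancestors(parents, node):
    """Return the set of all ancestors of `node` in a DAG.

    `parents` maps each node to the list of its direct predecessors.
    """
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(parents.get(u, []))
    return seen


# Illustrative graph (an assumption): two chains of two steps merging at "u5".
parents = {
    "u1": [], "u2": ["u1"],
    "u3": [], "u4": ["u3"],
    "u5": ["u2", "u4"],
}
```

Here `ancestors(parents, "u5")` collects every node on either chain, while `ancestors(parents, "u2")` contains only `"u1"`.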
Prox oracle:
$$(x, \beta, z) \mapsto \arg\min_y \; f(y; z) + \frac{\beta}{2}\|y - x\|^2$$
i.e. exactly optimize a subproblem in each node (ADMM, DANE, etc.)
f(x; z)
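For some component functions the prox oracle has a closed form. The sketch below assumes a hypothetical least-squares component f(y; z) = 0.5 (⟨a, y⟩ − b)² with z = (a, b); this choice of f is an illustration only and is not specified in the talk. Setting the gradient of the prox objective to zero gives y = x + c·a with c = (b − ⟨a, x⟩) / (‖a‖² + β).

```python
def prox_least_squares(x, beta, z):
    """Prox oracle for f(y; z) = 0.5 * (<a, y> - b)**2 with z = (a, b).

    Returns argmin_y f(y; z) + (beta / 2) * ||y - x||^2 in closed form.
    """
    a, b = z
    residual = b - sum(ai * xi for ai, xi in zip(a, x))
    step = residual / (sum(ai * ai for ai in a) + beta)
    return [xi + step * ai for xi, ai in zip(x, a)]


# Example query: x = (0, 0), beta = 1, a = (1, 0), b = 2.
y = prox_least_squares([0.0, 0.0], 1.0, ([1.0, 0.0], 2.0))
```

Larger values of `beta` keep the answer closer to the query point `x`; smaller values move it closer to a minimizer of f(·; z).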
is optimal
algorithms and lower bound
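Upper bounds:
$$O\left(\frac{H}{T^2} + \frac{L}{\sqrt{TKM}}\right) \qquad O\left(\frac{L}{\sqrt{TK}}\right) \qquad \tilde{O}\left(\frac{H}{TK} + \frac{L}{\sqrt{TKM}}\right)$$
$$\tilde{O}\left(\min\left\{\frac{L}{\sqrt{TK}}, \frac{H}{TK}, \frac{H}{T^2}\right\} + \frac{L}{\sqrt{TKM}}\right)$$
Lower bound:
$$\Omega\left(\min\left\{\frac{L}{\sqrt{TK}}, \frac{H}{T^2K^2}\right\} + \frac{L}{\sqrt{TKM}}\right)$$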
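To get a feel for the gap between the upper and lower bounds, the rates can be evaluated numerically. This is a rough sketch that sets every hidden constant and logarithmic factor to 1, so only the relative scaling is meaningful; the function name `rates` and the parameter values are assumptions chosen for illustration.

```python
import math


def rates(H, L, T, K, M):
    """Evaluate the rates with all constants and log factors set to 1."""
    stat = L / math.sqrt(T * K * M)  # shared statistical term L / sqrt(TKM)
    upper = {
        "H/T^2 + L/sqrt(TKM)": H / T**2 + stat,
        "L/sqrt(TK)": L / math.sqrt(T * K),
        "H/(TK) + L/sqrt(TKM)": H / (T * K) + stat,
    }
    combined = min(L / math.sqrt(T * K), H / (T * K), H / T**2) + stat
    lower = min(L / math.sqrt(T * K), H / (T**2 * K**2)) + stat
    return upper, combined, lower


upper, combined, lower = rates(H=1.0, L=1.0, T=10, K=100, M=10)
```

With these illustrative values the combined upper rate is about 0.011 and the lower bound about 0.010001. The two differ only in the optimization term, H/(TK) versus H/(T²K²).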
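[Figure: computation graph labeled with M (parallel machines), T (communication rounds), and K (sequential steps per round)]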
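Minibatch #1: $g_1 = \frac{1}{MK} \sum_{i=1}^{MK} \nabla f(x_1; z_i)$, then calculate $x_2$
Minibatch #2: $g_2 = \frac{1}{MK} \sum_{i=1}^{MK} \nabla f(x_2; z_i)$, then calculate $x_3$
Minibatch #3: $g_3 = \frac{1}{MK} \sum_{i=1}^{MK} \nabla f(x_3; z_i)$, then calculate $x_4$
Minibatch #4: $g_4 = \frac{1}{MK} \sum_{i=1}^{MK} \nabla f(x_4; z_i)$, then calculate $x_5$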
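The minibatch scheme can be sketched as a loop with one update per round, where each round averages MK stochastic gradients taken at the same point. This is plain (unaccelerated) minibatch SGD on a hypothetical one-dimensional toy problem; the step size, the toy objective, and the name `minibatch_sgd` are assumptions made for illustration.

```python
import random


def minibatch_sgd(grad, sample, x1, T, K, M, lr):
    """Plain minibatch SGD: T rounds, one update per round."""
    x = x1
    for _ in range(T):
        # All M machines evaluate K stochastic gradients at the same point x.
        batch = [sample() for _ in range(M * K)]
        g = sum(grad(x, z) for z in batch) / (M * K)
        # Calculate the next iterate from the averaged gradient.
        x = x - lr * g
    return x


# Toy problem (an assumption): f(x; z) = 0.5 * (x - z)**2 with z ~ N(3, 1),
# so F(x) is minimized at x = 3.
rng = random.Random(0)
x_hat = minibatch_sgd(
    grad=lambda x, z: x - z,
    sample=lambda: rng.gauss(3.0, 1.0),
    x1=0.0, T=20, K=25, M=4, lr=0.5,
)
```

Only T updates are made in total, so the extra MK samples per round reduce the variance of each gradient but do not add optimization steps.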
Sequential SGD steps
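A sequential baseline can be sketched as TK SGD steps on a single chain, using one sample per step and none of the other machines. The toy objective, step size, iterate averaging, and the name `sequential_sgd` are assumptions for illustration, not details taken from the talk.

```python
import random


def sequential_sgd(grad, sample, x1, T, K, lr):
    """Run T*K sequential SGD steps on one chain; return the average iterate."""
    x = x1
    total = 0.0
    for _ in range(T * K):
        x = x - lr * grad(x, sample())  # one fresh sample per step
        total += x
    return total / (T * K)


# Toy problem (an assumption): f(x; z) = 0.5 * (x - z)**2 with z ~ N(3, 1).
rng = random.Random(0)
x_bar = sequential_sgd(
    grad=lambda x, z: x - z,
    sample=lambda: rng.gauss(3.0, 1.0),
    x1=0.0, T=20, K=25, lr=0.05,
)
```

This uses all TK sequential steps but only a 1/M fraction of the available samples, which matches a rate that depends on TK and not on M.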
Calculate full gradient in parallel → Aggregate full gradient → Sequential variance-reduced updates
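These three stages can be sketched as a generic SVRG-style loop in which the machines are simulated by shards of a fixed data set. This is an illustration under assumptions (toy objective, step size, shard layout, the name `parallel_svrg`) and is not necessarily the exact variant analyzed in the talk.

```python
import random


def parallel_svrg(grad, data, x0, epochs, inner_steps, M, lr, rng):
    """SVRG-style loop on the empirical objective over `data`."""
    shards = [data[m::M] for m in range(M)]  # one shard per simulated machine
    anchor = x0
    for _ in range(epochs):
        # Calculate the full gradient in parallel: each machine sums its shard.
        partial = [sum(grad(anchor, z) for z in shard) for shard in shards]
        # Aggregate the full gradient.
        full_grad = sum(partial) / len(data)
        # Sequential variance-reduced updates.
        x = anchor
        for _ in range(inner_steps):
            z = rng.choice(data)
            x = x - lr * (grad(x, z) - grad(anchor, z) + full_grad)
        anchor = x
    return anchor


# Toy problem (an assumption): f(x; z) = 0.5 * (x - z)**2 on 100 samples.
rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(100)]
x_svrg = parallel_svrg(lambda x, z: x - z, data, 0.0, 3, 10, 4, 0.5, rng)
empirical_mean = sum(data) / len(data)
```

For this quadratic toy problem the correction term cancels the sampling noise exactly, so the iterates contract toward the empirical mean of `data` at a geometric rate.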