SLIDE 1

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Blake Woodworth (TTIC), Jialei Wang (UChicago → Two Sigma Investments), Adam Smith (Boston University), H. Brendan McMahan (Google), Nati Srebro (TTIC)

SLIDE 2

Parallel Stochastic Optimization/Learning

Many parallelization scenarios:

  • Synchronous parallelism
  • Asynchronous parallelism
  • Delayed updates
  • Few/many workers
  • Infrequent communication
  • Federated learning

min_x F(x) := E_{z∼D}[ f(x; z) ]
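To ground the notation, here is a minimal sketch (ours, not from the slides) of solving min_x F(x) = E_{z∼D}[f(x; z)] by SGD, with a toy least-squares instance f(x; (a, b)) = ½(a·x − b)²:

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.normal(size=5)              # minimizer of the population risk (toy setup)

def sample_z():
    """Draw z = (a, b) ~ D: Gaussian features, noisy linear label."""
    a = rng.normal(size=5)
    return a, a @ x_star + 0.1 * rng.normal()

def stochastic_grad(x, z):
    """Stochastic gradient oracle: grad_x f(x; z) for f(x; (a, b)) = 0.5 (a@x - b)^2."""
    a, b = z
    return (a @ x - b) * a

x = np.zeros(5)
for t in range(1, 2001):                 # plain sequential SGD with a 1/sqrt(t) step size
    x -= (0.1 / np.sqrt(t)) * stochastic_grad(x, sample_z())
print("distance to x*:", np.linalg.norm(x - x_star))
```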

SLIDE 3

What is the best we can hope for in a given parallelism scenario?

SLIDE 4
What is the best we can hope for in a given parallelism scenario?

  • We formalize the parallelism in terms of a dependency graph:

[Figure: example dependency graph on nodes u1, …, u10; the ancestors of one node are highlighted]

  • At each node u, make a query based only on knowledge of ancestors’ oracle interactions (plus shared randomness)
  • Graph defines a class of optimization algorithms A(G) (a toy code sketch of the ancestor rule follows)
  • Come to our poster for details
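For intuition, a toy sketch of the ancestor rule (the identifiers `parents`, `ancestors`, and `run` are ours, not the paper's): each node's query may depend only on oracle answers produced at its ancestors.

```python
def ancestors(parents, u):
    """All nodes with a directed path into u, given parents[v] = list of v's parents."""
    seen, stack = set(), list(parents[u])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def run(parents, order, make_query, oracle):
    """Process nodes in topological order; each query sees only ancestral answers."""
    answers = {}
    for u in order:
        visible = {v: answers[v] for v in ancestors(parents, u)}
        answers[u] = oracle(make_query(u, visible))
    return answers

# Tiny DAG: u1 -> {u2, u3} -> u4, so u4's query may use answers from u1, u2, u3.
parents = {"u1": [], "u2": ["u1"], "u3": ["u1"], "u4": ["u2", "u3"]}
print(ancestors(parents, "u4"))          # {'u1', 'u2', 'u3'}
```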
SLIDE 5
  • Sequential
  • Layer
  • Delays
  • Intermittent communication

[Figures: the dependency graph for each scenario; a code sketch of these shapes follows]
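Here is one way (our construction, following this taxonomy) to write the four graph shapes as parent lists; `tau` plays the role of the delay and (T, K, M) the rounds, steps per round, and machines:

```python
def sequential(n):
    """Chain: node t depends on node t - 1."""
    return {t: ([t - 1] if t > 0 else []) for t in range(n)}

def layered(depth, width):
    """Each node depends on all nodes of the previous layer (minibatch-style)."""
    node = lambda l, w: l * width + w
    return {node(l, w): ([] if l == 0 else [node(l - 1, v) for v in range(width)])
            for l in range(depth) for w in range(width)}

def delayed(n, tau):
    """Chain with stale information: node t only sees node t - tau."""
    return {t: ([t - tau] if t >= tau else []) for t in range(n)}

def intermittent(T, K, M):
    """M machines, K sequential steps each per round, synchronizing between rounds."""
    node = lambda t, m, k: (t * M + m) * K + k
    g = {}
    for t in range(T):
        for m in range(M):
            for k in range(K):
                if k > 0:                  # previous step on the same machine
                    g[node(t, m, k)] = [node(t, m, k - 1)]
                elif t > 0:                # round boundary: see all machines' last steps
                    g[node(t, m, k)] = [node(t - 1, mm, K - 1) for mm in range(M)]
                else:
                    g[node(t, m, k)] = []
    return g

print(delayed(6, 2))                       # {0: [], 1: [], 2: [0], 3: [1], 4: [2], 5: [3]}
```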

SLIDE 6

Generic Lower Bounds

Theorem: For any dependency graph G with N nodes and depth D, no algorithm for optimizing a convex, L-Lipschitz, H-smooth f(x; z) on a bounded domain in high dimensions can guarantee error less than:

With stochastic gradient oracle:  Ω( min{ L/√D, H/D² } + L/√N )

With stochastic prox oracle:  Ω( min{ L/D, H/D² } + L/√N )

Prox oracle:  (x, β, z) ↦ argmin_y f(y; z) + (β/2)‖y − x‖²
i.e. exactly optimize a subproblem in each node (ADMM, DANE, etc.)
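For concreteness, a hedged implementation (ours) of the prox oracle for the toy loss f(y; (a, b)) = ½(a·y − b)²; the argmin has a closed form via the Sherman–Morrison identity:

```python
import numpy as np

def prox_oracle(x, beta, z):
    """Exact prox step for f(y; (a, b)) = 0.5 * (a @ y - b)^2:
    argmin_y f(y; z) + (beta / 2) * ||y - x||^2, via Sherman-Morrison."""
    a, b = z
    rhs = beta * x + b * a                      # optimality: (a a^T + beta I) y = rhs
    return (rhs - a * (a @ rhs) / (beta + a @ a)) / beta

# Sanity check: the gradient of the subproblem should vanish at the returned point.
rng = np.random.default_rng(1)
x, a, b, beta = rng.normal(size=3), rng.normal(size=3), 0.7, 2.0
y = prox_oracle(x, beta, (a, b))
grad = (a @ y - b) * a + beta * (y - x)
print(np.linalg.norm(grad))                     # ~0
```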


SLIDE 8
  • Sequential: SGD is optimal
  • Layers: Accelerated minibatch SGD is optimal
  • Delays: Delayed-update SGD is not optimal; “Wait-and-Collect” minibatch is optimal (sketched below)
  • Intermittent communication: gaps remain between existing algorithms and the lower bound
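Our reading of the delays row, as a sketch: delayed-update SGD applies each gradient τ steps stale, while “wait-and-collect” gathers the τ gradients evaluated at one iterate into a single minibatch update. All names and constants below are illustrative.

```python
import numpy as np

def delayed_sgd(grad, x0, eta, tau, steps, sample):
    """Delayed-update SGD: each gradient is applied tau steps after it was computed."""
    x, history = x0.copy(), []
    for t in range(steps):
        history.append((x.copy(), sample()))
        if t >= tau:
            x_old, z = history[t - tau]
            x -= eta * grad(x_old, z)           # stale gradient
    return x

def wait_and_collect(grad, x0, eta, tau, steps, sample):
    """Wait-and-collect: gather tau gradients at the SAME point, update once."""
    x = x0.copy()
    for _ in range(steps // tau):
        g = np.mean([grad(x, sample()) for _ in range(tau)], axis=0)
        x -= eta * g                            # minibatch update, no staleness
    return x

rng = np.random.default_rng(2)
grad = lambda x, z: x - z                       # f(x; z) = 0.5 ||x - z||^2, E[z] = 0
sample = lambda: rng.normal(size=4)
print(np.linalg.norm(delayed_sgd(grad, np.ones(4), 0.1, 5, 500, sample)),
      np.linalg.norm(wait_and_collect(grad, np.ones(4), 0.1, 5, 500, sample)))
```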

SLIDE 9
Intermittent communication: M machines, K sequential steps per machine between communication rounds, T rounds in total. [Figure: intermittent-communication dependency graph labeled with M, K, T]

  • Lower bound: Ω( min{ L/√(TK), H/(T²K²) } + L/√(TKM) )
  • Option 1: Accelerated Minibatch SGD: O( H/T² + L/√(TKM) )
  • Option 2: Sequential SGD: O( L/√(TK) )
  • Option 3: SVRG on the empirical objective: Õ( H/(TK) + L/√(TKM) )
  • Best of A-MB-SGD, SGD, SVRG: Õ( min{ L/√(TK), H/(TK), H/T² } + L/√(TKM) )

SLIDE 10

Option 1: Accelerated Minibatch SGD (rates as on Slide 9):

Minibatch rounds #1–#4: in round t, the MK samples from all machines are averaged into a single gradient at the current iterate,

g_t = (1/MK) Σ_{i=1}^{MK} ∇f(x_t; z_i),

which is then used to calculate x_{t+1} (g_1 → x_2, g_2 → x_3, g_3 → x_4, g_4 → x_5).
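A toy rendering (ours) of this loop, with plain rather than accelerated minibatch SGD and an illustrative least-squares f:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, d = 4, 8, 5                                # machines, steps per round, dimension
x_star = rng.normal(size=d)

def draw():
    a = rng.normal(size=d)
    return a, a @ x_star + 0.1 * rng.normal()

def grad(x, z):
    a, b = z
    return (a @ x - b) * a                       # f(x; (a, b)) = 0.5 (a@x - b)^2

x = np.zeros(d)
for t in range(1, 5):                            # rounds t = 1..4, as in the figure
    # g_t = (1/MK) sum_i grad f(x_t; z_i): every machine contributes K samples
    g = np.mean([grad(x, draw()) for _ in range(M * K)], axis=0)
    x = x - 0.5 * g                              # calculate x_{t+1}
print(np.linalg.norm(x - x_star))
```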


SLIDE 12

Option 2: Sequential SGD (rates as on Slide 9).

[Figure: TK sequential SGD steps]
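A matching toy sketch (ours) of Option 2: TK plain SGD steps on a single machine, with the averaged iterate that standard analyses use to get the O(L/√(TK)) rate:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, d = 50, 8, 5
x_star = rng.normal(size=d)

def draw():
    a = rng.normal(size=d)
    return a, a @ x_star + 0.1 * rng.normal()

x, avg = np.zeros(d), np.zeros(d)
for t in range(1, T * K + 1):                    # TK steps along one sequential chain
    a, b = draw()
    x -= (0.1 / np.sqrt(t)) * (a @ x - b) * a
    avg += (x - avg) / t                         # running average of the iterates
print(np.linalg.norm(avg - x_star))
```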


SLIDE 14

Option 3: SVRG on the empirical objective (rates as on Slide 9).

[Figure: calculate the full gradient in parallel, aggregate the full gradient, then run sequential variance-reduced updates]
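A hedged sketch (ours; step size and epoch counts are illustrative) of this SVRG structure on a finite sample, where the full-gradient pass is the part that parallelizes across machines:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # per-sample gradient
full_grad = lambda x: A.T @ (A @ x - b) / n      # computed in parallel, then aggregated

x_tilde = np.zeros(d)
for epoch in range(10):
    mu = full_grad(x_tilde)                      # 1) full gradient at the anchor point
    x = x_tilde.copy()
    for _ in range(2 * n):                       # 2) sequential variance-reduced updates
        i = rng.integers(n)
        x -= 0.05 * (grad_i(x, i) - grad_i(x_tilde, i) + mu)
    x_tilde = x
print(np.linalg.norm(full_grad(x_tilde)))        # near-stationary on the empirical objective
```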


SLIDE 16

Combining Options 1–3 (best of A-MB-SGD, SGD, SVRG):

Õ( min{ L/√(TK), H/(TK), H/T² } + L/√(TKM) )

Lower bound: Ω( min{ L/√(TK), H/(T²K²) } + L/√(TKM) )
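The combined bound just selects whichever option is smallest for the given (L, H, T, K, M); a small calculator (ours, log factors in Õ suppressed):

```python
import math

def option_rates(L, H, T, K, M):
    """Per-option upper bounds from the slides (log factors suppressed)."""
    return {
        "Sequential SGD": L / math.sqrt(T * K),
        "SVRG":           H / (T * K) + L / math.sqrt(T * K * M),
        "A-MB-SGD":       H / T ** 2 + L / math.sqrt(T * K * M),
    }

def lower_bound(L, H, T, K, M):
    return min(L / math.sqrt(T * K), H / (T * K) ** 2) + L / math.sqrt(T * K * M)

for L, H in [(1, 100), (1, 1), (1, 0.01)]:       # vary the smoothness regime
    T, K, M = 100, 10, 10
    rates = option_rates(L, H, T, K, M)
    best = min(rates, key=rates.get)
    print(f"H={H}: best = {best} at {rates[best]:.4g}, "
          f"lower bound = {lower_bound(L, H, T, K, M):.4g}")
```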


SLIDE 18

  • Lower bound: Ω( min{ L/√(TK), H/(T²K²) } + L/√(TKM) )
  • Combining Options 1–3: Õ( min{ L/√(TK), H/(TK), H/T² } + L/√(TKM) )
  • Option 4: Parallel SGD: ???
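“Parallel SGD” here refers to local SGD: every machine runs K local steps and the iterates are averaged each round; whether it improves on the combined bound was the open question. A toy sketch of the update pattern (ours):

```python
import numpy as np

rng = np.random.default_rng(6)
M, K, T, d = 4, 8, 25, 5
x_star = rng.normal(size=d)

def sgd_steps(x, steps):
    """Run `steps` local SGD steps on fresh samples, starting from x."""
    x = x.copy()
    for _ in range(steps):
        a = rng.normal(size=d)
        b = a @ x_star + 0.1 * rng.normal()
        x -= 0.05 * (a @ x - b) * a
    return x

x = np.zeros(d)
for t in range(T):
    local_iterates = [sgd_steps(x, K) for _ in range(M)]  # K local steps per machine
    x = np.mean(local_iterates, axis=0)                   # communicate: average iterates
print(np.linalg.norm(x - x_star))
```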

SLIDE 19

Come to our poster tonight from 5-7pm