

SLIDE 1

Farzin Haddadpour

Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

SLIDE 2

Joint work with

Viveck Cadambe, Mehrdad Mahdavi, and Mohammad Mahdi Kamani

SLIDE 3

Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$

SLIDE 4

Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \nabla f\big(x^{(t)}; \xi^{(t)}\big)$$
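As a concrete illustration of the update above, here is a minimal NumPy sketch of one mini-batch SGD step. The `grad_f` callback, the uniform sampling scheme, and all names are illustrative assumptions, not the exact setup from the talk.

```python
import numpy as np

def sgd_step(x, grad_f, data, batch_size, eta, rng):
    """One mini-batch SGD step: x <- x - eta * (mean gradient over the batch)."""
    # Sample a mini-batch xi^(t) of example indices uniformly without replacement.
    idx = rng.choice(len(data), size=batch_size, replace=False)
    # (1/|xi^(t)|) * sum of per-example gradients = mean gradient over the batch.
    g = np.mean([grad_f(x, data[i]) for i in idx], axis=0)
    return x - eta * g
```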

SLIDE 5

Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \nabla f\big(x^{(t)}; \xi^{(t)}\big)$$

Parallelization due to computational cost.

Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \nabla f\big(x^{(t)}; \xi_j^{(t)}\big)$$
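A sketch of the distributed variant, again with hypothetical helper names: each of the p workers computes a mini-batch gradient on its own data shard, and the p gradients are averaged once per iteration. That per-iteration averaging is exactly the communication pattern the next slide identifies as the bottleneck.

```python
import numpy as np

def distributed_sgd_step(x, grad_f, shards, batch_size, eta, rng):
    """One fully synchronous distributed SGD step with p = len(shards) workers."""
    worker_grads = []
    for shard in shards:  # conceptually these loop iterations run in parallel
        # Each worker j samples its own mini-batch xi_j^(t) from its shard.
        idx = rng.choice(len(shard), size=batch_size, replace=False)
        worker_grads.append(np.mean([grad_f(x, shard[i]) for i in idx], axis=0))
    # Averaging the p gradients requires one communication round per iteration.
    return x - eta * np.mean(worker_grads, axis=0)
```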

SLIDE 6

Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \nabla f\big(x^{(t)}; \xi^{(t)}\big)$$

Parallelization due to computational cost.

Communication is the bottleneck.

Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \nabla f\big(x^{(t)}; \xi_j^{(t)}\big)$$

SLIDE 7

Communication

Number of bits per iteration: gradient-compression-based techniques

SLIDE 8

Communication

Number of bits per iteration: gradient-compression-based techniques
Number of rounds: local SGD with periodic averaging

SLIDE 9

Local SGD with periodic averaging

$$x_j^{(t+1)} = \begin{cases} \dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step} \\[6pt] x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update} \end{cases}$$
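The following sketch shows how the two cases of this update could be implemented, assuming a caller-supplied stochastic-gradient callback `stoch_grad(x, j, rng)` (a hypothetical name, not from the paper). Averaging only every `tau` steps is what cuts the number of communication rounds from T down to T/τ.

```python
import numpy as np

def local_sgd(x0, stoch_grad, p, T, tau, eta, rng):
    """Local SGD with periodic averaging (sketch)."""
    xs = [x0.copy() for _ in range(p)]
    for t in range(1, T + 1):
        # (b) Local update on every worker with its stochastic gradient g~_j^(t).
        xs = [x - eta * stoch_grad(x, j, rng) for j, x in enumerate(xs)]
        # (a) Averaging step: one communication round whenever tau divides t.
        if t % tau == 0:
            avg = np.mean(xs, axis=0)
            xs = [avg.copy() for _ in range(p)]
    return np.mean(xs, axis=0)
```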

SLIDE 10

Local SGD with periodic averaging

$$x_j^{(t+1)} = \begin{cases} \dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step} \\[6pt] x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update} \end{cases}$$

(Diagram: p = 3 workers W1, W2, W3 with τ = 1, i.e., an averaging step (a) after every iteration.)

SLIDE 11

Local SGD with periodic averaging

$$x_j^{(t+1)} = \begin{cases} \dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step} \\[6pt] x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update} \end{cases}$$

(Diagrams: p = 3 workers W1, W2, W3. With τ = 1, an averaging step (a) follows every iteration; with τ = 3, each worker takes local updates (b) and the models are averaged (a) only every 3 iterations.)

SLIDE 12

Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

Strategy        | Convergence error                     | Assumptions        | Com-rounds (T/τ)
SGD             | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $T$
[Yu et al.]     | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $O(p^{3/4} T^{1/4})$
[Wang & Joshi]  | $O(1/\sqrt{pT})$                      | i.i.d.             | $O(p^{3/2} T^{1/2})$
RI-SGD (τ, q)   | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$  | non-i.i.d. & b.d.  | $O(p^{3/2} T^{1/2})$

b.g.: bounded gradient, $\|g_i\|_2^2 \le G$

Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$
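To make the round counts concrete (illustrative numbers, ignoring the constants hidden in the $O(\cdot)$ notation): with p = 16 workers and T = 10^6 iterations, fully synchronous SGD communicates T = 1,000,000 times, while a scheme with $O(p^{3/2} T^{1/2})$ rounds needs on the order of $16^{3/2}\cdot\sqrt{10^6} = 64 \cdot 1000 = 64{,}000$ rounds, roughly 15x fewer.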

SLIDE 13

Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

Strategy        | Convergence error                     | Assumptions        | Com-rounds (T/τ)
SGD             | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $T$
[Yu et al.]     | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $O(p^{3/4} T^{1/4})$
[Wang & Joshi]  | $O(1/\sqrt{pT})$                      | i.i.d.             | $O(p^{3/2} T^{1/2})$
RI-SGD (τ, q)   | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$  | non-i.i.d. & b.d.  | $O(p^{3/2} T^{1/2})$

b.g.: bounded gradient, $\|g_i\|_2^2 \le G$

Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$

  • A. Residual error is observed in practice, but a theoretical understanding is missing.
  • B. How can we capture this in the convergence analysis?
  • C. Is there any solution to improve it?

SLIDE 14

Insufficiency of convergence analysis

  • A. Residual error is observed in practice, but a theoretical understanding is missing: unbiased gradient estimation does not hold.

SLIDE 15

Insufficiency of convergence analysis

  • A. Residual error is observed in practice, but a theoretical understanding is missing: unbiased gradient estimation does not hold.
  • B. How can we capture this in the convergence analysis? Our work: analysis based on biased gradients.

SLIDE 16

Insufficiency of convergence analysis

  • A. Residual error is observed in practice, but a theoretical understanding is missing: unbiased gradient estimation does not hold.
  • B. How can we capture this in the convergence analysis? Our work: analysis based on biased gradients.
  • C. Is there any solution to improve it? Our work: redundancy.

SLIDE 17

Redundancy-infused local SGD (RI-SGD): D = D1 ∪ D2 ∪ D3

(Diagram: local SGD with p = 3, τ = 3, where each worker Wj holds only its own chunk Dj.)

SLIDE 18

Redundancy-infused local SGD (RI-SGD): D = D1 ∪ D2 ∪ D3

(Diagrams: local SGD with p = 3, τ = 3, each worker Wj holding only its own chunk Dj, versus RI-SGD with q = 2, p = 3, τ = 3, where each worker holds q = 2 of the 3 chunks, e.g. W1 holds D1 and D2.)

Explicit redundancy
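The diagram suggests a cyclic placement of chunks across workers; below is a minimal sketch under that assumption (the function name is hypothetical, not from the paper).

```python
def assign_chunks(chunks, q):
    """Cyclic redundant chunk placement for RI-SGD (sketch).

    Worker j holds chunks D_j, D_{j+1}, ..., D_{j+q-1} (indices mod p),
    so each chunk is replicated on q workers.  q = 1 recovers plain
    local SGD; larger q shrinks the residual term O((1 - q/p) * beta)
    from Table 1 at the cost of more computation per worker.
    """
    p = len(chunks)
    return [[chunks[(j + k) % p] for k in range(q)] for j in range(p)]

# Example with p = 3, q = 2 (1-indexed names as on the slide):
# worker 0 -> [D1, D2], worker 1 -> [D2, D3], worker 2 -> [D3, D1].
placement = assign_chunks(["D1", "D2", "D3"], q=2)
```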

SLIDE 19

Comparing RI-SGD with other schemes

b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \beta$
q: number of data chunks at each worker node

New ingredients: redundancy (assumption) and analysis with biased gradients.

SLIDE 20

Comparing RI-SGD with other schemes

Table 1: Comparison of different SGD-based algorithms.

Strategy        | Convergence error                     | Assumptions        | Com-rounds (T/τ)
SGD             | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $T$
[Yu et al.]     | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $O(p^{3/4} T^{1/4})$
[Wang & Joshi]  | $O(1/\sqrt{pT})$                      | i.i.d.             | $O(p^{3/2} T^{1/2})$
RI-SGD (τ, q)   | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$  | non-i.i.d. & b.d.  | $O(p^{3/2} T^{1/2})$

b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \beta$
q: number of data chunks at each worker node

New ingredients: redundancy (assumption) and analysis with biased gradients.

SLIDE 21

Advantages of RI-SGD:

  • 1. Speed-up not only due to a larger effective mini-batch size, but also due to increased intra-gradient diversity.
  • 2. Fault tolerance.
  • 3. Extension to heterogeneous mini-batch sizes and a possible application to federated optimization.

SLIDE 22

Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures)

SLIDE 23

Increasing intra-gradient diversity: experiments on CIFAR-10

SLIDE 24

Fault tolerance: experiments on CIFAR-10

SLIDE 25

For more details, please come to my poster session: Wed, Jun 12th, 06:30 -- 09:00 PM @ Pacific Ballroom #185

Thanks for your attention!