SLIDE 1

Manifold Identification for Ultimately Communication-Efficient Distributed Optimization

Yu-Sheng Li Joint work with Wei-Lin Chiang (NTU) and Ching-pei Lee (NUS)

SLIDE 2

Outline

Overview
Empirical Risk Minimization
The Proposed Algorithm
Experiments

SLIDE 3-4

Distributed Machine Learning

Latency Numbers Every Programmer Should Know¹

Read 1 MB sequentially from memory        3 µs
Read 1 MB sequentially from network      22 µs
Read 1 MB sequentially from disk (SSD)   49 µs
Round trip in the same datacenter       500 µs

◮ Inter-machine communication may be more time-consuming than local computations within a machine
  • Comm. cost = (# Comm. rounds) × (Bytes communicated per round)

¹ Originally by Jeff Dean in 2010, updated by Colin Scott at https://colin-scott.github.io/personal_website/research/interactive_latency.html

SLIDE 5-7

Sparsity-inducing Regularization

◮ To avoid overfitting and to force some desired structure on the solution, a sparsity-inducing regularizer is usually introduced
◮ Example: ℓ2- vs. ℓ1-regularized logistic regression on news20

                  Relative reg. strength   Sparsity of solution   Test accuracy
  ℓ2-regularized  2⁰                       1,355,191 (100%)       99.7449%
                  2¹⁰                      1,355,191 (100%)       97.0044%
  ℓ1-regularized  2⁰                          67,071 (4.95%)      99.7499%
                  2²                          42,020 (3.10%)      99.7499%
                  2⁴                          14,524 (1.07%)      99.7449%
                  2⁶                           5,432 (0.40%)      99.6749%
                  2⁸                           1,472 (0.11%)      97.3495%
                  2¹⁰                            546 (0.04%)      92.8936%

SLIDE 8-10

Our contributions

Recall:
  • Comm. cost = (# Comm. rounds) × (Bytes communicated per round)

◮ Focusing on the small subproblem ⇒ fewer bytes to communicate
◮ Acceleration by smooth optimization in the correct manifold ⇒ fewer rounds of communication

SLIDE 11

Results (ours: MADPQN)

y-axis: relative distance to the optimal value (log-scaled)
x-axis: communication cost in d bytes (upper row) and training time in seconds (lower row)
Data sets: news20, epsilon, webspam
Methods: OWLQN, L-COMM, DPLBFGS, MADPQN
(Convergence plots not reproduced here.)

SLIDE 12-13

Outline

Overview
Empirical Risk Minimization
The Proposed Algorithm
Experiments

SLIDE 14-15

Distributed Empirical Risk Minimization (ERM)

◮ Train a model by minimizing a function that measures the performance on training data

    arg min_{w ∈ ℝᵈ}  f(w) := Σ_{k=1}^K f_k(w)

◮ There are K machines, and f_k is exclusively available on machine k
◮ Synchronize w or ∇f(w) by communication: the communication cost per iteration is O(d)
◮ How to reduce the O(d) cost?
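
A minimal sketch of this gradient synchronization step, not taken from the slides: it assumes mpi4py, uses synthetic local data, and takes f_k to be the logistic loss over the shard held by machine k; local_gradient and full_gradient are illustrative names.

```python
# Sketch: grad f(w) = sum_k grad f_k(w), combined in one allreduce round.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
d = 1000
X_local = rng.standard_normal((50, d))      # this machine's instances
y_local = rng.choice([-1.0, 1.0], size=50)  # and their labels

def local_gradient(w):
    """Gradient of the local logistic loss f_k at w."""
    z = y_local * (X_local @ w)
    return X_local.T @ (-y_local / (1.0 + np.exp(z)))

def full_gradient(w):
    """One allreduce = one communication round of O(d) bytes per machine."""
    g = np.empty(d)
    comm.Allreduce(local_gradient(w), g, op=MPI.SUM)
    return g

print(np.linalg.norm(full_gradient(np.zeros(d))))
```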

SLIDE 16-18

Sparsity-inducing Regularizer

◮ If w is sparse throughout the training process, we only need to synchronize a shorter vector
◮ Regularized ERM:

    min_w  f(w) + R(w)

◮ An ideal regularization term for forcing sparsity is the ℓ0 norm: ‖w‖₀ = number of nonzeros in w
◮ But this norm is not continuous and hence hard to optimize
◮ A good surrogate is the ℓ1 norm: ‖w‖₁ = Σ_{i=1}^d |w_i|
◮ Our algorithm works for other partly smooth R, e.g. group-LASSO
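
A tiny illustration of the two quantities (not from the slides; the numbers are arbitrary):

```python
import numpy as np

w = np.array([0.0, 1.5, 0.0, -0.2, 3.0])
l0 = np.count_nonzero(w)  # ||w||_0 = 3 nonzero entries
l1 = np.abs(w).sum()      # ||w||_1 = 1.5 + 0.2 + 3.0 = 4.7
print(l0, l1)
```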

SLIDE 19

The Regularized Problem

◮ Now the problem becomes

    min_w  f(w) + ‖w‖₁,

  which is harder to minimize than f(w) alone since ‖w‖₁ is not differentiable
◮ As the gradient may not even exist, gradient descent or Newton method cannot be directly applied

SLIDE 20-21

Proximal Quasi-Newton

◮ Proximal gradient is a simple algorithm that solves

    min_{w′}  ∇f(w)⊤(w′ − w) + (1/2α)‖w′ − w‖₂² + ‖w′‖₁,

  where α is the step size for the current iteration
◮ Each calculation of ∇f requires one round of communication
◮ To reduce the amount of communication, we include some second-order information: reducing iterations ⇒ reducing rounds of communication
◮ Replace the term ‖w′ − w‖₂²/(2α) with (w′ − w)⊤H(w′ − w)/2 for some H ≈ ∇²f(w)
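
For the ℓ1 regularizer, the proximal gradient subproblem above has a closed-form solution, componentwise soft-thresholding. A minimal sketch (illustrative names, not the paper's code):

```python
import numpy as np

def soft_threshold(z, tau):
    """argmin_x (x - z)**2 / 2 + tau * |x|, applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_step(w, grad_w, alpha):
    """Minimizer of  grad^T (w'-w) + ||w'-w||^2 / (2*alpha) + ||w'||_1."""
    return soft_threshold(w - alpha * grad_w, alpha)
```

When the quadratic term uses H ≈ ∇²f(w) instead of I/α, the subproblem generally has no closed form and must be solved approximately.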

SLIDE 22

Outline

Overview
Empirical Risk Minimization
The Proposed Algorithm
Experiments

SLIDE 23-25

Utilizing Sparsity

◮ Even if we only update the nonzero entries of w, if we still compute the whole gradient ∇f(w), then the communication cost remains O(d)
◮ Guess: if w_i = 0 at some iteration and it is likely to stay 0 at the next iteration, it remains 0 at the final solution
◮ Then we only solve the subproblem with respect to the coordinates that are likely to be nonzero (one possible selection rule is sketched after this slide)
◮ A progressive shrinking approach: once we guess w_i = 0, we remove those coordinates from our problem in future iterations
◮ So the number of nonzeros in w (i.e. ‖w‖₀) gradually decreases
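
A minimal sketch of one plausible selection rule, not necessarily the exact rule used by MADPQN: keep a coordinate if it is currently nonzero, or if its gradient entry violates the ℓ1 optimality condition |∇_i f(w)| ≤ 1 and is therefore likely to move away from zero.

```python
import numpy as np

def working_set(w, grad_w):
    """Coordinates kept in the next subproblem; all others stay at zero."""
    keep = (w != 0) | (np.abs(grad_w) > 1.0)
    return np.flatnonzero(keep)
```

Only the entries of w and ∇f(w) indexed by this set then need to be synchronized, which is where the per-round communication drops below O(d).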

SLIDE 26-27

Convergence Issue

◮ What if our guess was wrong at some iteration?
◮ Need to double-check: when some stopping criterion is met, we restart with all coordinates
◮ Training is terminated only when our model can hardly be improved using all coordinates (a sketch of this outer loop follows)
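
A sketch of the resulting outer loop, with purely illustrative names: inner_solver runs the shrinking iterations from the previous slides, and optimality measures how much the objective can still be improved when only the given coordinates may move.

```python
def train(w, d, inner_solver, optimality, eps=1e-6):
    coords = list(range(d))                  # start with all coordinates
    while True:
        w, coords = inner_solver(w, coords)  # progressively shrinks coords
        if optimality(w, list(range(d))) < eps:
            return w                         # can hardly be improved: stop
        coords = list(range(d))              # guess may be wrong: restart
```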

SLIDE 28-38

More Acceleration by Smooth Optimization

◮ |w_i| becomes twice-differentiable when w_i ≠ 0
◮ If the coordinates where w_i = 0 are fixed, the proximal approach is not needed anymore
◮ The problem can then be transformed into a smooth one for faster convergence (a sketch of this smooth phase follows)
◮ When the nonzero pattern (manifold) does not change for some iterations, it is likely to be the final pattern
◮ Example with d = 5:

    {1, 2, 3, 4, 5} → {2, 3, 5} → {2, 5} → {2, 5} → {2, 5} →(accelerate) · · ·
    →(restart) {1, 2, 3, 4, 5} → · · ·
    →(restart) {1, 2, 3, 4, 5} → · · ·
    →(restart) {1, 2, 3, 4, 5} → terminated
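
A minimal sketch of the smooth phase once the manifold is believed fixed: on the support, the signs s_i = sign(w_i) stay constant, so |w_i| = s_i·w_i and the restricted problem is smooth. scipy's bound-constrained L-BFGS is used here purely as a stand-in smooth solver; it is not the paper's actual inner algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_phase(w, f, grad_f):
    """Minimize f(w) + ||w||_1 restricted to the current nonzero pattern."""
    support = np.flatnonzero(w)
    s = np.sign(w[support])            # signs are fixed on the manifold

    def embed(v):
        full = np.zeros_like(w)
        full[support] = v
        return full

    def obj(v):
        return f(embed(v)) + s @ v     # |w_i| = s_i * w_i on the support

    def grad(v):
        return grad_f(embed(v))[support] + s

    # Bounds keep every coordinate on its side of zero, so s stays valid.
    bounds = [(0.0, None) if si > 0 else (None, 0.0) for si in s]
    res = minimize(obj, w[support], jac=grad, method="L-BFGS-B", bounds=bounds)
    return embed(res.x)
```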

SLIDE 39

Theoretical Guarantees

Theorem. If a cluster point w∗ of {w after each restart} satisfies

    0 ∈ relint (∇f(w∗) + ∂R(w∗)),

then the manifold of w∗ will be identified within finitely many restarts.

SLIDE 40

Outline

Overview
Empirical Risk Minimization
The Proposed Algorithm
Experiments

SLIDE 41-42

Settings

◮ We show the effectiveness of the proposed approach on ℓ1-regularized logistic regression

    min_w  Σ_{i=1}^n log(1 + exp(−y_i x_i⊤ w)) + ‖w‖₁,

  where there are n instances with features x_i ∈ ℝᵈ and labels y_i ∈ {−1, 1}
◮ The instances are evenly split across K = 10 machines, connected by Intel MPI in a 1 Gbps network environment
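
A small self-contained check of evaluating this objective on synthetic data (not the experimental setup above); np.logaddexp keeps log(1 + exp(·)) numerically stable:

```python
import numpy as np

def objective(w, X, y):
    """l1-regularized logistic regression objective."""
    margins = y * (X @ w)
    return np.logaddexp(0.0, -margins).sum() + np.abs(w).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.choice([-1.0, 1.0], size=100)
print(objective(np.zeros(20), X, y))  # = 100 * log(2) at w = 0
```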

SLIDE 43

Data Statistics

Data set      Instances (n)   Features (d)   Nonzeros in optimal w∗
news20             19,996       1,355,191          506
epsilon           400,000           2,000        1,463
webspam           350,000      16,609,143          793
url             2,396,130       3,231,961       25,399
avazu-site     25,832,830         999,962       11,858
KDD2010-b      19,264,097      29,890,096    2,005,632

SLIDE 44-45

Results

x-axis: iteration (upper row) and communication time (lower row)
y-axis: number of nonzeros in w (×10³ for news20, ×10⁴ for webspam)
Data sets: news20, webspam; curves: ‖w‖₀ along the iterations of DPLBFGS and MADPQN
(Plots not reproduced here.)

◮ DPLBFGS: a distributed proximal quasi-Newton method (Lee et al. 2019)
◮ Manifold-Aware Distributed Proximal Quasi-Newton (MADPQN): DPLBFGS + manifold selection + further acceleration

SLIDE 46

Comparison with state of the art

◮ OWLQN (Andrew and Gao 2007): an extension of a quasi-Newton method, LBFGS, which is the most commonly used distributed method
◮ L-COMM (Chiang et al. 2018): an extension of the common directions method (Wang et al. 2016)
◮ DPLBFGS (Lee et al. 2019): a distributed proximal LBFGS method
◮ MADPQN: our proposed Manifold-Aware Distributed Proximal Quasi-Newton method

SLIDE 47

Results

y-axis: relative distance to the optimal value (log-scaled)
x-axis: communication cost in d bytes (upper row) and training time in seconds (lower row)
Data sets: news20, epsilon, webspam
Methods: OWLQN, L-COMM, DPLBFGS, MADPQN
(Convergence plots not reproduced here.)

SLIDE 48

Results

y-axis: relative distance to the optimal value (log-scaled)
x-axis: communication cost in d bytes (upper row) and training time in seconds (lower row)
Data sets: url, avazu-site, KDD2010-b
Methods: OWLQN, L-COMM, DPLBFGS, MADPQN (only L-COMM, DPLBFGS, and MADPQN appear on KDD2010-b)
(Convergence plots not reproduced here.)

SLIDE 49

Conclusions

◮ Communication may be the bottleneck in distributed machine learning
◮ Communication cost can be reduced by utilizing the sparsity pattern throughout training
◮ Second-order information further improves convergence in the later stage
◮ Theoretical support for manifold identification and superlinear convergence
◮ Source code to be released soon