slide-1
SLIDE 1

Coordinate Update for Large Scale Optimization (via Asynchronous Parallel Computing)

Wotao Yin (UCLA) Joint with: Y.T.Chow, B.Edmunds, R.Hannah, Z.Peng, T.Wu (UCLA), Y.Xu (Alabama), M.Yan (Michigan State) VALSE – September 2016

1 / 58

slide-2
SLIDE 2

How much do we need parallel computing?

slide-3
SLIDE 3

Back in 1993

2 / 58

slide-4
SLIDE 4

2006

3 / 58

slide-5
SLIDE 5

2014–2016

4 / 58

slide-6
SLIDE 6

35 Years of CPU Trend

[Chart, 1995–2015: number of CPUs, performance per core, and cores per CPU]

  • D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.

5 / 58

slide-7
SLIDE 7

Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total)

6 / 58

slide-8
SLIDE 8

Today: Tesla K80 GPU (2496 cores)

7 / 58

slide-9
SLIDE 9

Today: Octa-Core Handsets

8 / 58

slide-10
SLIDE 10

Free lunch is over!

9 / 58

slide-11
SLIDE 11

How to use all the cores available?

slide-12
SLIDE 12

Parallel computing

[Diagram: agents t1, t2, . . . , tN working in parallel on one problem]

10 / 58

slide-13
SLIDE 13

Parallel speedup

  • definition:

speedup = (serial time) / (parallel time)

  • Amdahl’s Law: N agents, a fraction ρ of the computation is parallel

ideal speedup = 1 / (ρ/N + (1 − ρ))

[Plot: ideal speedup vs. number of processors (5–20) for ρ = 25%, 50%, 90%, 95%]

11 / 58

slide-14
SLIDE 14

Parallel speedup

  • ε := parallel overhead (e.g., startup, synchronization, collection)
  • real world:

speedup = 1 / (ρ/N + (1 − ρ) + ε)

[Plots: real-world speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%; left: when ε = N, right: when ε = log(N)]

12 / 58

slide-15
SLIDE 15

Sync versus Async

[Timeline diagram: Agents 1–3, with idle gaps] Synchronous (wait for the slowest)

[Timeline diagram: Agents 1–3, no gaps] Asynchronous (non-stop, no wait)

13 / 58

slide-16
SLIDE 16

Sync versus Async

[Table comparing Sync (Good) vs. Async (Better) on: latency, bus contention, memory contention, theory, scalability]

14 / 58

slide-17
SLIDE 17

Compute more, communicate less

CPU speed ≫ streaming speed = O(bandwidth) ≫ response speed = 1/latency

15 / 58

slide-18
SLIDE 18

Decompose-to-parallelize optimization models

slide-19
SLIDE 19

Large-sum decomposition

minimize_{x ∈ R^m}  r(x) + (1/N) ∑_{i=1}^{N} f_i(x)

  • interested in large N
  • nice structures: the f_i’s are smooth and r is proximable
  • stochastic approximation methods: SG, SAG, SAGA, SVRG, Finito

pro: faster than batch methods
con: updates the entire x ∈ R^m; the model is restricted

16 / 58

slide-20
SLIDE 20

Coordinate descent (CD) decomposition

minimize_{x ∈ R^m}  f(x_1, . . . , x_m) + (1/m) ∑_{i=1}^{m} r_i(x_i)

  • f is smooth, the r_i can be nonsmooth
  • update variables in a (shuffled) cyclic, random, greedy, or parallel fashion

pro: faster than the full-update method
con: nonsmooth functions need to be separable

17 / 58

slide-21
SLIDE 21

Solution overview

  • 1. Re-formulate a problem into

x = T(x) (use dual variables, operator splitting)

  • 2. Apply coordinate update (CU): at iteration k, select i_k, do

x_i^{k+1} = (T(x^k))_i  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • 3. Parallelize CU without sync or locking

18 / 58

slide-22
SLIDE 22

Async-parallel coordinate update

slide-23
SLIDE 23

Brief history of async-parallel fixed-point algorithms

  • 1969 – a linear equation solver by Chazan and Miranker;
  • 1978 – extended to the fixed-point problem by Baudet under the absolute-contraction¹ type of assumption.
  • For 20–30 years, used mainly to solve linear, nonlinear, and differential equations, by many people.
  • 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. 2000 – review by Frommer and Szyld.
  • 1991 – gradient-projection iteration assuming a local linear-error bound, by Tseng
  • 2001 – domain decomposition assuming strong convexity, by Tai & Tseng

¹An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

19 / 58

slide-24
SLIDE 24

Simple demo

x2 update is delayed; distance to solution increases!

20 / 58

slide-25
SLIDE 25

Simple demo

x2 update is delayed; distance to solution increases!

21 / 58

slide-26
SLIDE 26

Simple demo

x2 update is delayed; distance to solution increases!

22 / 58

slide-27
SLIDE 27

Simple demo

If x1 is updated much more frequently than x2, then divergence is likely.

23 / 58

slide-28
SLIDE 28

Previous theory: absolute-contractive operator

  • Absolute-contractive operator T : R^n → R^n:

|T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

  • Interpretation: the full update x^{k+1} = Tx^k must produce x^1, x^2, . . . that lie in nested shrinking boxes
  • Result: stable to async-parallelize x^{k+1} = Tx^k; some, but few, applications

24 / 58

slide-29
SLIDE 29

Randomized coordinate selection

  • select xi to update with probability pi, where mini pi > 0
  • benefits:
  • many more applications
  • automatic load balance
  • drawback:
  • require either global memory or communication
  • pseudo-random number generation takes time
  • practice:
  • despite theory, full randomization is unnecessary
  • enough to shuffle coordinates once

25 / 58

slide-30
SLIDE 30

Proposed method and theory

slide-31
SLIDE 31

ARock2: Async-parallel coordinate update

  • problem: x = T(x)
  • x = (x1, . . . , xm) ∈ H1 × · · · × Hm
  • sub-operator Si(x) := xi − (T(x))i
  • algorithm: each agent randomly picks i_k ∈ {1, . . . , m} and updates

x_i^{k+1} = x_i^k − η_k S_i(x^{k−d_k})  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • assumptions: nonexpansive T, no locking (d_k is a vector), atomic update
  • guarantee: almost sure weak convergence under proper η_k

²Peng-Xu-Yan-Y., SISC’16

26 / 58

slide-32
SLIDE 32

Nonexpansive operator T

Problem: find x such that x = T(x), or 0 = S(x), where T = I − S.

Assumption (nonexpansiveness)

The operator T : H → H is nonexpansive, i.e., ‖T(x) − T(y)‖ ≤ ‖x − y‖ ∀x, y ∈ H.

Assumption (existence of solution)

FixT := {x ∈ H : x = T(x)} is nonempty.

27 / 58

slide-33
SLIDE 33

Krasnosel’skiĭ–Mann (KM) iteration

  • fixed-point problem: find x such that

x = T(x)

  • KM iteration: T is nonexpansive, pick η ∈ [ε, 1 − ε], iterate:

x^{k+1} = x^k − η (I − T)(x^k),  where S = I − T

  • why important: generalizes gradient descent, the proximal-point algorithm, prox-gradient, and operator-splitting algorithms such as alternating projection, Douglas–Rachford and ADMM, parallel coordinate descent, . . .

28 / 58

slide-34
SLIDE 34
  • weak case: if T has a fixed point and is nonexpansive, then x^k converges weakly to a fixed point, with ‖Sx^k‖² = o(1/k)
  • strong case: if T is contractive, then it has a unique fixed point and the convergence is linear

29 / 58

slide-35
SLIDE 35

ARock convergence

notation:

  • m = # coordinates
  • τ = max async delay
  • for simplicity, uniform selection p_i ≡ 1/m

Theorem (Known max delay)

Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2τ/√m + 1)), ∀k. Then, with probability one, x^k ⇀ x* ∈ Fix T.

Consequence:

  • O(1) step size if τ ∼ √m
  • no need to sync until at least p > O(√m)

30 / 58

slide-36
SLIDE 36

Stochastic (unbounded) delays

  • j_{k,i}: delay of x_i at iteration k
  • P_ℓ := Pr[max_i{j_{k,i}} ≥ ℓ]: iteration-independent distribution of the max delay
  • ∃B such that ∀k, |j_{k,i} − j_{k,i′}| < B: the x_i’s delays are evenly old at each iteration

Theorem

Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1). Use the fixed step size η = cH for either of the following cases:

  • 1. if ∑_ℓ (ℓ P_ℓ)^{1/2} < ∞, set H = (1 + (1/√m) ∑_ℓ P_ℓ^{1/2} (ℓ^{1/2} + ℓ^{−1/2}))^{−1}
  • 2. if ∑_ℓ ℓ P_ℓ^{1/2} < ∞, set H = (1 + (2/√m) ∑_ℓ P_ℓ^{1/2})^{−1}

Then, with probability one, x^k ⇀ x* ∈ Fix T.

31 / 58

slide-37
SLIDE 37

Arbitrary unbounded delays

  • j_{k,i}: async delay of x_i at iteration k
  • j_k = max_i{j_{k,i}}: max delay at iteration k
  • lim inf j_k < ∞: all but finitely many iterations have a bounded delay

Theorem

Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1) and R > 1. Use step sizes η_k = c (1 + R^{j_k − 1/2} / (√m (R − 1)))^{−1}. Then, with probability one, x^k ⇀ x* ∈ Fix T.

  • Optionally optimize R based on {j_k}.

32 / 58

slide-38
SLIDE 38

Numerical results

slide-39
SLIDE 39

TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods

  • C++11 multi-threading (no shared-memory parallelism in Matlab)
  • Plug in your operators, get free coordinate-update and async-parallelism
  • github.com/uclaopt/tmac
  • committers: Brent Edmunds, Zhimin Peng
  • contributors: Yerong Li, Yezheng Li, Tianyu Wu
  • supports: Windows, Mac, Linux

33 / 58

slide-40
SLIDE 40

ℓ1 logistic regression

  • model:

minimize_{x ∈ R^n}  λ‖x‖₁ + (1/N) ∑_{i=1}^{N} log(1 + exp(−b_i · a_i^T x)),   (1)

  • sparse numerical linear algebra is used for the datasets: news20, url

[Plots: speedup vs. threads (up to 20) for sync, async, and ideal; left: dataset “news20”, right: dataset “url”]

34 / 58

slide-41
SLIDE 41

Nonnegative matrix factorization

  • model:

minimize_{X,Y ≥ 0}  ‖A − X^T Y‖²_F ,   (2)

  • despite nonconvexity, amenable to parallel coordinate descent

[Plots: objective vs. time for 1, 2, 4, 8, 16 cores; speedup vs. threads for async and ideal]

35 / 58

slide-42
SLIDE 42

Practical issue: coordinate friendly structure

slide-43
SLIDE 43

Coordinate friendly (CF) structure

  • CU is fast only if the subproblems are simple
  • require: cost[CU_i(z)] = O((1/m) · cost[z → Tz]) for all z, i
  • CF structures are found in most first-order methods for structured problems³
  • may require caching and maintaining extra variables
  • may require switching orders in operator splitting

³Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin, Coordinate Friendly Structures, Algorithms and Applications, Annals of Math. Sci. and App., 2016.

36 / 58

slide-44
SLIDE 44

Single-operator examples

  • matrix-vector multiplication
  • separable functions
  • box-type constraints
  • sum of sparsely supported functions (arising in graphical models)
  • many loss functions of the form f(Ax), where f is (nearly) separable

37 / 58

slide-45
SLIDE 45

Composed operators

  • there are 9 rules for CF T1 ◦ T2, which cover a wide range of examples
  • general principles:
  • T1 ◦ T2 inherits the weaker separability property of the two
  • CF T1 ◦ T2 generally requires T1 to be CF and T2 to be either cheap, easy-to-maintain, or directly CF.
  • If T1 is separable or cheap, the requirement on T2 is weakened.

38 / 58

slide-46
SLIDE 46

Examples

  • Operator-splitting algorithms for the following problems are CF
  • total variation image processing
  • portfolio optimization
  • most sparse optimization problems
  • linear and second-order cone programs
  • most ERM and SVMs

39 / 58

slide-47
SLIDE 47

Practical issue: nonseparable nonsmoothness

slide-48
SLIDE 48

Coordinate descent (CD) illustration

40 / 58

slide-49
SLIDE 49

CD fails on nonseparable nonsmooth functions

[Contour plots of a rotated weighted ℓ1-norm]

Rotated weighted ℓ1-norm. CD fixed-point is not stationary!

41 / 58

slide-50
SLIDE 50
  • smooth f(x1, x2):

p1 = ∇1f(x1, x2), p2 = ∇2f(x1, x2)  ⟹  (p1, p2) = ∇f(x1, x2)

  • nonsmooth r(x1, x2):

q1 ∈ ∂1r(x1, x2), q2 ∈ ∂2r(x1, x2)  ⇏  (q1, q2) ∈ ∂r(x1, x2)

(however, “⟹” holds if r is separable.)

42 / 58

slide-51
SLIDE 51

What do we want to solve?

  • total variation minimization:

minimize_x  TV(x) + (1/2)‖Ax − b‖²

  • optimization over a graph G = (N, E):

minimize_x  ∑_{(i,j)∈E} f(x_i − x_j) + ∑_{i∈N} g_i(x_i)

  • consensus optimization (decentralized computing):

minimize_{x_1,...,x_m}  f_1(x_1) + f_2(x_2) + · · · + f_m(x_m)  subject to  x_1 = x_2 = · · · = x_m

43 / 58

slide-52
SLIDE 52
  • second-order cone program:

minimize_{x_1,...,x_m}  c^T x + (1/2) x^T B x
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b,  x_1 ∈ Q_1, x_2 ∈ Q_2, . . . , x_m ∈ Q_m

  • extended monotropic program:

minimize_{x_1,...,x_m}  g_1(x_1) + g_2(x_2) + · · · + g_m(x_m) + f(x)
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b

where the g_i are nonsmooth and extended-valued

44 / 58

slide-53
SLIDE 53

Solution overview

  • 1. Re-formulate a problem into

x = T(x) (use dual variables, operator splitting)

  • 2. Apply coordinate update (CU): at iteration k, select i_k, do

x_i^{k+1} = (T(x^k))_i  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • 3. Parallelize CU without sync or locking

45 / 58

slide-54
SLIDE 54

Primal-dual splitting for extended monotropic program

minimize_{x_1,...,x_m}  g_1(x_1) + g_2(x_2) + · · · + g_m(x_m) + f(x)
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b

  • the function f is smooth and nonseparable
  • primal-dual splitting with a special metric gives a new parallel algorithm:

s^{k+1} = s^k + γ(Ax^k − b)
x_i^{k+1} = prox_{ηg_i}( x_i^k − η(∇_i f(x^k) + A_i^T s^k + 2γ A_i^T A x^k − 2γ A_i^T b) ),  i = 1, . . . , m

  • Jacobi style
  • gives async-parallel algorithms for: LP, QP, SOCP, multi-block ADMM-able problems (no matrix inversions), . . .

46 / 58

slide-55
SLIDE 55

CT recovery: full vs random coordinate update

  • 284 × 284 image, 90 beam projections, 362 measurements per beam
  • TV minimization partitioned into 284 columns (blocks); run for 100 epochs

47 / 58

slide-56
SLIDE 56

Portfolio optimization simulation

minimize_x  (1/2) x^⊤ Q x  (risk)
subject to  x ≥ 0  (buy),  ∑_{i=1}^{m} x_i ≤ 1  (limit),  ∑_{i=1}^{m} ξ_i x_i ≥ c  (return)

  • reformulation:

minimize_x  risk + ι_{D1}(x) + ι_{D2}(x),  where D1 = {buy} and D2 = {limit, return}

  • full update based on the 3-operator splitting (Davis-Y.’15):

z^{k+1} = T_{3S}(z^k)

48 / 58

slide-57
SLIDE 57

Full vs random coordinate update

49 / 58

slide-58
SLIDE 58

More applications

slide-59
SLIDE 59

Linear equations (asynchronous Jacobi)

  • require: an invertible square matrix A with nonzero diagonal entries
  • let D be the diagonal part of A; then

Ax = b  ⟺  (I − D^{−1}A)x + D^{−1}b = x,  where T(x) := (I − D^{−1}A)x + D^{−1}b

  • T is nonexpansive if ‖I − D^{−1}A‖₂ ≤ 1, i.e., A is diagonally dominant
  • x^{k+1} = Tx^k recovers the Jacobi algorithm

50 / 58

slide-60
SLIDE 60

Algorithm 1: ARock for linear equations
Input: shared variable x ∈ R^n, K > 0;
set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
  sample i ∈ {1, . . . , m} uniformly at random;
  add −(η_k / a_ii) (∑_j a_ij x̂_j^k − b_i) to the shared variable x_i;
  update the global counter k ← k + 1;

51 / 58

slide-61
SLIDE 61

Sample code

loadData(A, data_file_name);
loadData(b, label_file_name);
#pragma omp parallel num_threads(p) shared(A, b, x, para)
{
  // A, b, x, and para are passed by reference
  call Jacobi(A, b, x, para)
    or ARock(A, b, x, para);
}

  • p: the number of threads
  • A, b, x: shared variables
  • para: other parameters

52 / 58

slide-62
SLIDE 62

Jacobi worker function

for (int itr = 0; itr < max_itr; itr++) {
  // compute the update for the assigned x[i]
  // ...
  #pragma omp barrier
  {
    // write x[i] in global memory
  }
  #pragma omp barrier
}

Jacobi needs the barrier directive for synchronization

53 / 58

slide-63
SLIDE 63

ARock worker function

for (int itr = 0; itr < max_itr; itr++) {
  // pick i at random
  // compute the update for x[i]
  // ...
  // write x[i] in global memory
}

ARock has no synchronization barrier directive

54 / 58

slide-64
SLIDE 64

Minimizing proximable+proximable functions

  • problem and operator:

minimize_x  f(x) + g(x)  ⟺  z* = refl_{γf} ∘ refl_{γg} (z*) =: T_PRS(z*);  recover x* = prox_{γg}(z*)

  • T_PRS: Peaceman-Rachford splitting⁴
  • ARock gives async-parallel algorithms for
  • linear programs, second-order cone programs (inverting matrices)
  • ADMM-able problems

⁴reflective proximal map: refl_{γf} := 2 prox_{γf} − I. The maps refl_{γf}, refl_{γg}, and thus refl_{γf} ∘ refl_{γg}, are nonexpansive.

55 / 58

slide-65
SLIDE 65

Parallel/distributed ADMM

  • structure: m convex functions f_i; one convex function g
  • consensus problem:

minimize_x  ∑_{i=1}^{m} f_i(A_{(i)} x) + g(x)

  • assign g to the master; assign f_i and A_{(i)} to worker i

[Diagram: one master connected to many workers]

  • async: the master does not need to wait for all workers to communicate

56 / 58

slide-66
SLIDE 66

Async-parallel decentralized computing

  • a graph of connected agents: G = (V, E).
  • decentralized consensus optimization problem:

minimize_{x_i ∈ R^d, i ∈ V}  f(x) := ∑_{i∈V} f_i(x_i)
subject to  x_i = x_j, ∀(i, j) ∈ E

  • each agent keeps f_i private and only talks to its neighbors
  • resulting async algorithm: no global clock, no coordinator

57 / 58

slide-67
SLIDE 67

Summary of async-parallel coordinate update:

  • applicable to many problems
  • faster than full update, and scalable

Papers: UCLA CAM reports

  • 15-37 – theory
  • 16-13 – problem structure, applications
  • 16-37 – C++11 solver: github.com/uclaopt/tmac
  • 16-64 – unbounded delay

Thank you!

58 / 58