Coordinate Update for Large Scale Optimization (via Asynchronous Parallel Computing)


  1. Coordinate Update for Large Scale Optimization (via Asynchronous Parallel Computing) Wotao Yin (UCLA) Joint with: Y.T. Chow, B. Edmunds, R. Hannah, Z. Peng, T. Wu (UCLA), Y. Xu (Alabama), M. Yan (Michigan State) VALSE – September 2016

  2. How much do we need parallel computing?

  3. Back in 1993

  4. 2006

  5. 2014–2016

  6. 35 Years of CPU Trend (figure: number of CPUs, performance per core, and cores per CPU, 1995–2015). Source: D. Henty, Emerging Architectures and Programming Models for Parallel Computing, 2012.

  7. Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total)

  8. Today: Tesla K80 GPU (2496 cores)

  9. Today: octa-core handsets

  10. Free lunch is over!

  11. How to use all the cores available?

  12. Parallel computing (diagram: one problem split across agents 1, 2, ..., N, which finish in times t_1, t_2, ..., t_N)

  13. Parallel speedup
      • definition: speedup = serial time / parallel time
      • Amdahl's Law: with N agents and a fraction ρ of the computation running in parallel,
        ideal speedup = 1 / (ρ/N + (1 − ρ))
      (figure: ideal speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%)

  14. Parallel speedup
      • ε := parallel overhead (e.g., startup, synchronization, collection)
      • real world: speedup = 1 / (ρ/N + (1 − ρ) + ε)
      (figures: speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%, when ε grows like N and when ε grows like log(N))
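To make the two formulas concrete, here is a tiny Python sketch; the 0.001·log(N) overhead constant is an illustrative assumption, not a number from the talk:

```python
import math

def speedup(rho, N, eps=0.0):
    """Amdahl-style speedup: parallel fraction rho, N agents, overhead eps."""
    return 1.0 / (rho / N + (1.0 - rho) + eps)

# 95% parallel code on 64 cores: ideal vs. with a log(N)-type overhead
print(speedup(0.95, 64))                            # ideal: ~15.4x
print(speedup(0.95, 64, eps=0.001 * math.log(64)))  # overhead trims the gain
```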

  15. Sync versus Async (diagram: synchronous agents 1–3 idle while waiting for the slowest; asynchronous agents run non-stop with no waiting)

  16. Sync versus Async

                           Sync    Async
      Sync wait             ✓
      Latency               ✓
      Bus contention        ✓
      Memory contention     ✓
      Theory                ✓
      Scalability           Good    Better

  17. Compute more, communicate less
      CPU speed ≫ streaming speed = O(bandwidth) ≫ response speed = 1/latency

  18. Decompose-to-parallelize optimization models

  19. Large-sum decomposition

        minimize_{x ∈ R^m}  r(x) + (1/N) Σ_{i=1}^N f_i(x)

      • interested in large N
      • nice structures: the f_i's are smooth and r is proximable
      • stochastic approximation methods: SG, SAG, SAGA, SVRG, Finito
        pro: faster than batch methods
        con: they update the entire x ∈ R^m; the model is restricted
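As a concrete instance of this model, here is a minimal proximal-SGD sketch; the least-squares losses f_i(x) = ½(a_i^T x − b_i)² and the regularizer r(x) = λ‖x‖₁ are illustrative choices, not from the slides:

```python
import numpy as np

def prox_l1(x, t):
    # soft-thresholding: the proximal operator of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_sgd(A, b, lam, eta=1e-3, steps=50000):
    """Proximal SGD for  min_x  lam*||x||_1 + (1/N) sum_i 0.5*(a_i^T x - b_i)^2."""
    N, m = A.shape
    rng = np.random.default_rng(0)
    x = np.zeros(m)
    for _ in range(steps):
        i = rng.integers(N)                  # sample a single term f_i
        g = (A[i] @ x - b[i]) * A[i]         # stochastic gradient of f_i
        x = prox_l1(x - eta * g, eta * lam)  # gradient step, then prox of r
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
print(np.count_nonzero(prox_sgd(A, b, lam=0.5)), "nonzeros")
```

Note that every step touches all m entries of x, which is exactly the "update entire x" limitation the slide points out.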

  20. Coordinate descent (CD) decomposition

        minimize_{x ∈ R^m}  f(x_1, ..., x_m) + (1/m) Σ_{i=1}^m r_i(x_i)

      • f is smooth; the r_i can be nonsmooth
      • update variables in a (shuffled) cyclic, random, greedy, or parallel fashion
        pro: faster than the full-update method
        con: the nonsmooth functions need to be separable
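A minimal sketch of one such scheme, random coordinate prox-gradient; the quadratic f(x) = ½‖Ax − b‖² and r_i = λ|·| are illustrative stand-ins, not from the slides:

```python
import numpy as np

# Random coordinate descent for  min_x 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
lam, eta = 0.1, 0.01

x = np.zeros(20)
for k in range(20000):
    i = rng.integers(20)                              # pick one coordinate
    gi = A[:, i] @ (A @ x - b)                        # partial derivative df/dx_i
    z = x[i] - eta * gi                               # gradient step on x_i only
    x[i] = np.sign(z) * max(abs(z) - eta * lam, 0.0)  # prox of eta*lam*|.|
print(np.count_nonzero(x), "nonzeros")
```

Only one coordinate moves per iteration, and the nonsmooth term is handled coordinate-wise, which is why separability of the r_i matters.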

  21. Solution overview
      1. Reformulate the problem into x = T(x) (use dual variables, operator splitting)
      2. Apply coordinate update (CU): at iteration k, select i_k and set
         x_i^{k+1} = (T(x^k))_i  if i = i_k,   and   x_i^{k+1} = x_i^k  otherwise.
      3. Parallelize CU without sync or locking
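A sketch of step 2 in Python, using an affine contraction T(x) = Mx + c as a stand-in operator (illustrative; any nonexpansive T with a fixed point fits the framework, and real implementations compute (T(x))_i directly rather than all of T(x)):

```python
import numpy as np

# Coordinate update for the fixed-point problem x = T(x),
# with T(x) = M x + c and ||M|| = 0.5 < 1 (an illustrative choice).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
M, c = 0.5 * Q, rng.standard_normal(5)
T = lambda x: M @ x + c

x = np.zeros(5)
for k in range(2000):
    i = rng.integers(5)   # select i_k
    x[i] = T(x)[i]        # x_i^{k+1} = (T(x^k))_i; other coordinates keep x^k
print(np.linalg.norm(x - np.linalg.solve(np.eye(5) - M, c)))  # distance to x*, ~0
```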

  22. Async-parallel coordinate update

  23. Brief history of async-parallel fixed-point algorithms
      • 1969 – a linear-equation solver by Chazan and Miranker;
      • 1978 – extended to fixed-point problems by Baudet under an absolute-contraction¹ type of assumption;
      • for 20–30 years, used mainly to solve linear, nonlinear, and differential equations, by many people;
      • 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis; 2000 – review by Frommer and Szyld;
      • 1991 – gradient-projection iteration assuming a local linear error bound, by Tseng;
      • 2001 – domain decomposition assuming strong convexity, by Tai and Tseng.
      ¹ An operator T: R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, ..., n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

  24–26. Simple demo (three animation frames): the x_2 update is delayed; the distance to the solution increases!

  27. Simple demo: if x_1 is updated much more frequently than x_2, then divergence is likely.

  28. Previous theory: absolute-contractive operator
      • Absolute-contractive operator T: R^n → R^n: |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, ..., n, and P ∈ R^{n×n}_+ with ρ(P) < 1.
      • Interpretation: the full update x^{k+1} = Tx^k must produce x^1, x^2, ... that lie in nested, shrinking boxes.
      • Result: stable to async-parallelize x^{k+1} = Tx^k; some, but few, applications.

  29. Randomized coordinate selection
      • select x_i to update with probability p_i, where min_i p_i > 0
      • benefits:
        • many more applications
        • automatic load balancing
      • drawbacks:
        • requires either global memory or communication
        • pseudo-random number generation takes time
      • practice:
        • despite the theory, full randomization is unnecessary
        • it is enough to shuffle the coordinates once

  30. Proposed method and theory

  31. ARock²: async-parallel coordinate update
      • problem: x = T(x)
      • x = (x_1, ..., x_m) ∈ H_1 × · · · × H_m
      • sub-operator: S_i(x) := x_i − (T(x))_i
      • algorithm: each agent randomly picks i_k ∈ {1, ..., m} and sets
        x_i^{k+1} ← x_i^k − η_k S_i(x^{k−d_k})  if i = i_k,   x_i^{k+1} ← x_i^k  otherwise,
        where x^{k−d_k} is a possibly stale copy of x read without locking.
      • assumptions: nonexpansive T; no locking (the delay d_k is a vector); atomic updates
      • guarantee: almost-sure weak convergence under proper η_k
      ² Peng-Xu-Yan-Yin, SISC'16
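A minimal threaded sketch of this update pattern — an illustration only: CPython's GIL serializes these threads, the authors' actual implementation (TMAC) uses C++11 threads, and the affine T below is a stand-in operator:

```python
import threading
import numpy as np

# Shared-memory async coordinate update on S = I - T, T(x) = M x + c.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
M, c = 0.5 * Q, rng.standard_normal(8)
S = lambda x: x - (M @ x + c)   # S(x) = x - T(x)

x = np.zeros(8)                 # shared iterate: no locks, no barriers
eta = 0.5

def worker(seed, iters=5000):
    r = np.random.default_rng(seed)
    for _ in range(iters):
        i = r.integers(8)       # pick i_k at random
        x[i] -= eta * S(x)[i]   # reads the shared x without synchronization

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(S(x)))     # ~0 at a fixed point
```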

  32. Nonexpansive operator T
      Problem: find x such that x = T(x), or 0 = S(x) where T = I − S.
      Assumption (nonexpansiveness): the operator T: H → H is nonexpansive, i.e., ‖T(x) − T(y)‖ ≤ ‖x − y‖ for all x, y ∈ H.
      Assumption (existence of a solution): Fix T := {x ∈ H : x = T(x)} is nonempty.

  33. Krasnosel'skiĭ–Mann (KM) iteration
      • fixed-point problem: find x such that x = T(x)
      • KM iteration: T is nonexpansive; pick η ∈ [ε, 1 − ε] and iterate
        x^{k+1} = x^k − η (I − T)(x^k),   with S := I − T
      • why important: generalizes gradient descent, the proximal-point algorithm, prox-gradient, operator-splitting algorithms such as alternating projection, Douglas-Rachford, and ADMM, parallel coordinate descent, ...
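A small runnable illustration (not from the talk): KM iteration on T = P_C ∘ P_D, the composition of two projections (nonexpansive), to find a point in the intersection of the unit ball C and a hyperplane D:

```python
import numpy as np

a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)   # D = {x : a^T x = 1}

def proj_ball(x):                 # P_C: project onto the unit ball
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def proj_plane(x):                # P_D: project onto the hyperplane
    return x + (1 - a @ x) * a

T = lambda x: proj_ball(proj_plane(x))

x, eta = np.array([5.0, -2.0]), 0.5   # eta in [eps, 1 - eps]
for k in range(200):
    x = x - eta * (x - T(x))          # x^{k+1} = x^k - eta*(I - T)(x^k)
print(x, a @ x)                       # ~[0.6, 0.8], with a^T x ~ 1
```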

  34. • weak case: if T is nonexpansive and has a fixed point, then x^k converges weakly to a fixed point, with ‖Sx^k‖² = o(1/k)
      • strong case: if T is contractive, then it has a unique fixed point and the convergence is linear

  35. ARock convergence
      notation:
      • m = # coordinates
      • τ = max async delay
      • for simplicity, uniform selection p_i ≡ 1/m
      Theorem (known max delay). Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2 m^{−1/2} τ + 1)] for all k. Then, with probability one, x^k ⇀ x* ∈ Fix T.
      Consequences:
      • O(1) step size if τ ∼ √m; e.g., with m = 10^6 and τ = 10^3 = √m, the bound allows η_k up to 1/3
      • no need to sync until at least p > O(√m)

  36. Stochastic (unbounded) delays
      • j_{k,i}: delay of x_i at iteration k
      • P_ℓ := Pr[max_i {j_{k,i}} ≥ ℓ]: iteration-independent distribution of the max delay
      • ∃B such that ∀k, i, i': |j_{k,i} − j_{k,i'}| < B: the coordinates' delays are evenly old at each iteration
      Theorem. Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1). Use the fixed step size η = cH in either of the following cases:
      1. if Σ_ℓ (ℓ P_ℓ)^{1/2} < ∞, set H = H_1 = [1 + (1/√m) Σ_ℓ (ℓ^{1/2} + ℓ^{−1/2}) P_ℓ^{1/2}]^{−1};
      2. if Σ_ℓ ℓ P_ℓ^{1/2} < ∞, set H = H_2 = [1 + (1/√m) Σ_ℓ ℓ P_ℓ^{1/2}]^{−1}.
      Then, with probability one, x^k ⇀ x* ∈ Fix T.

  37. Arbitrary unbounded delays
      • j_{k,i}: async delay of x_i at iteration k
      • j_k = max_i {j_{k,i}}: max delay at iteration k
      • lim inf_k j_k < ∞: all but finitely many iterations have a bounded delay
      Theorem. Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1) and R > 1. Use step sizes
        η_k = c [1 + R^{j_k − 1/2} / (√m (R − 1))]^{−1}.
      Then, with probability one, x^k ⇀ x* ∈ Fix T.
      • Optionally, optimize R based on {j_k}.

  38. Numerical results

  39. TMAC: a toolbox of async-parallel, coordinate, splitting, and stochastic methods
      • C++11 multi-threading (Matlab offers no shared-memory parallelism)
      • plug in your operators; coordinate updates and async-parallelism come for free
      • github.com/uclaopt/tmac
      • committers: Brent Edmunds, Zhimin Peng
      • contributors: Yerong Li, Yezheng Li, Tianyu Wu
      • supports Windows, Mac, Linux

  40. ℓ1 logistic regression
      • model:
        minimize_{x ∈ R^n}  λ ‖x‖_1 + (1/N) Σ_{i=1}^N log(1 + exp(−b_i · a_i^T x))   (1)
      • sparse numerical linear algebra is used for the datasets news20 and url
      (figures: speedup vs. number of threads for sync, async, and ideal, on datasets "news20" and "url")
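A serial proximal-gradient sketch of model (1) on random stand-in data; the talk's solver applies this kind of update asynchronously and per coordinate, and nothing below reproduces TMAC's code:

```python
import numpy as np

# Prox-gradient for  min_x  lam*||x||_1 + (1/N) sum_i log(1 + exp(-b_i * a_i^T x))
rng = np.random.default_rng(0)
N, n, lam = 200, 50, 0.01
A = rng.standard_normal((N, n))
b = np.sign(A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(N))

x = np.zeros(n)
eta = 4.0 * N / np.linalg.norm(A, 2) ** 2             # step 1/L, L = ||A||^2/(4N)
for _ in range(500):
    sig = 1.0 / (1.0 + np.exp(b * (A @ x)))           # sigmoid(-b_i * a_i^T x)
    g = A.T @ (-b * sig) / N                          # gradient of the smooth part
    z = x - eta * g
    x = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # prox of eta*lam*||.||_1
print(np.count_nonzero(x), "nonzeros")
```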
