slide-1
SLIDE 1

Coordinate Update for Large Scale Optimization (via Asynchronous Parallel Computing)

Wotao Yin (UCLA) Joint with: Y.T.Chow, B.Edmunds, R.Hannah, Z.Peng, T.Wu (UCLA), Y.Xu (Alabama), M.Yan (Michigan State) VALSE – September 2016

1 / 58

slide-2
SLIDE 2

How much do we need parallel computing?

slide-3
SLIDE 3

Back in 1993

2 / 58

slide-4
SLIDE 4

2006

3 / 58

slide-5
SLIDE 5

2014–2016

4 / 58

slide-6
SLIDE 6

35 Years of CPU Trend

[Chart, 1995–2015: number of CPUs, performance per core, and cores per CPU]

  • D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.

5 / 58

slide-7
SLIDE 7

Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total)

6 / 58

slide-8
SLIDE 8

Today: Tesla K80 GPU (2496 cores)

7 / 58

slide-9
SLIDE 9

Today: Octa-Core Handsets

8 / 58

slide-10
SLIDE 10

Free lunch is over!

9 / 58

slide-11
SLIDE 11

How to use all the cores available?

slide-12
SLIDE 12

Parallel computing

[Diagram: agents t1, t2, . . . , tN working in parallel on one problem]

10 / 58

slide-13
SLIDE 13

Parallel speedup

  • definition:

speedup = (serial time) / (parallel time)

  • Amdahl’s Law: N agents, a fraction ρ of the computation is parallel

ideal speedup = 1 / (ρ/N + (1 − ρ))

[Plot: ideal speedup vs. number of processors (5–20) for ρ = 25%, 50%, 90%, 95%]

11 / 58

slide-14
SLIDE 14

Parallel speedup

  • ε := parallel overhead (e.g., startup, synchronization, collection)
  • real world:

speedup = 1 / (ρ/N + (1 − ρ) + ε)

[Plots: real-world speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%; left: when ε = N, right: when ε = log(N)]

12 / 58

slide-15
SLIDE 15

Sync versus Async

[Timeline diagram: Agents 1–3, with idle gaps] Synchronous (wait for the slowest)

[Timeline diagram: Agents 1–3, no gaps] Asynchronous (non-stop, no wait)

13 / 58

slide-16
SLIDE 16

Sync versus Async

[Table comparing Sync (Good) vs. Async (Better) on: latency, bus contention, memory contention, theory, scalability]

14 / 58

slide-17
SLIDE 17

Compute more, communicate less

CPU speed ≫ streaming speed = O(bandwidth) ≫ response speed = 1/latency

15 / 58

slide-18
SLIDE 18

Decompose-to-parallelize optimization models

slide-19
SLIDE 19

Large-sum decomposition

minimize_{x ∈ R^m}  r(x) + (1/N) ∑_{i=1}^{N} f_i(x)

  • interested in large N
  • nice structures: the f_i’s are smooth and r is proximable
  • stochastic approximation methods: SG, SAG, SAGA, SVRG, Finito

pro: faster than batch methods
con: updates the entire x ∈ R^m; the model is restricted

16 / 58

slide-20
SLIDE 20

Coordinate descent (CD) decomposition

minimize_{x ∈ R^m}  f(x_1, . . . , x_m) + (1/m) ∑_{i=1}^{m} r_i(x_i)

  • f is smooth, the r_i can be nonsmooth
  • update variables in a (shuffled) cyclic, random, greedy, or parallel fashion

pro: faster than the full-update method
con: nonsmooth functions need to be separable

17 / 58

slide-21
SLIDE 21

Solution overview

  • 1. Re-formulate a problem into

x = T(x) (use dual variables, operator splitting)

  • 2. Apply coordinate update (CU): at iteration k, select i_k, do

x_i^{k+1} = (T(x^k))_i  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • 3. Parallelize CU without sync or locking

18 / 58

slide-22
SLIDE 22

Async-parallel coordinate update

slide-23
SLIDE 23

Brief history of async-parallel fixed-point algorithms

  • 1969 – a linear equation solver by Chazan and Miranker;
  • 1978 – extended to the fixed-point problem by Baudet under the absolute-contraction¹ type of assumption.
  • For 20–30 years, used mainly to solve linear, nonlinear, and differential equations, by many people.
  • 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. 2000 – review by Frommer and Szyld.
  • 1991 – gradient-projection iteration assuming a local linear-error bound, by Tseng
  • 2001 – domain decomposition assuming strong convexity, by Tai & Tseng

¹An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

19 / 58

slide-24
SLIDE 24

Simple demo

x2 update is delayed; distance to solution increases!

20 / 58

slide-25
SLIDE 25

Simple demo

x2 update is delayed; distance to solution increases!

21 / 58

slide-26
SLIDE 26

Simple demo

x2 update is delayed; distance to solution increases!

22 / 58

slide-27
SLIDE 27

Simple demo

If x1 is updated much more frequently than x2, then divergence is likely.

23 / 58

slide-28
SLIDE 28

Previous theory: absolute-contractive operator

  • Absolute-contractive operator T : R^n → R^n:

|T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

  • Interpretation: the full update x^{k+1} = Tx^k must produce x^1, x^2, . . . that lie in nested shrinking boxes
  • Result: stable to async-parallelize x^{k+1} = Tx^k; some, but few, applications

24 / 58

slide-29
SLIDE 29

Randomized coordinate selection

  • select xi to update with probability pi, where mini pi > 0
  • benefits:
  • many more applications
  • automatic load balance
  • drawback:
  • require either global memory or communication
  • pseudo-random number generation takes time
  • practice:
  • despite theory, full randomization is unnecessary
  • enough to shuffle coordinates once

25 / 58

slide-30
SLIDE 30

Proposed method and theory

slide-31
SLIDE 31

ARock2: Async-parallel coordinate update

  • problem: x = T(x)
  • x = (x1, . . . , xm) ∈ H1 × · · · × Hm
  • sub-operator Si(x) := xi − (T(x))i
  • algorithm: each agent randomly picks i_k ∈ {1, . . . , m} and updates

x_i^{k+1} = x_i^k − η_k S_i(x^{k−d_k})  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • assumptions: nonexpansive T, no locking (d_k is a vector), atomic update
  • guarantee: almost sure weak convergence under proper η_k

²Peng-Xu-Yan-Y., SISC’16

26 / 58

slide-32
SLIDE 32

Nonexpansive operator T

Problem: find x such that x = T(x), or 0 = S(x), where T = I − S.

Assumption (nonexpansiveness)

The operator T : H → H is nonexpansive, i.e., ‖T(x) − T(y)‖ ≤ ‖x − y‖ ∀x, y ∈ H.

Assumption (existence of solution)

FixT := {x ∈ H : x = T(x)} is nonempty.

27 / 58

slide-33
SLIDE 33

Krasnosel’skiĭ–Mann (KM) iteration

  • fixed-point problem: find x such that

x = T(x)

  • KM iteration: T is nonexpansive, pick η ∈ [ε, 1 − ε], iterate:

x^{k+1} = x^k − η (I − T)(x^k),  where S = I − T

  • why important: generalizes gradient descent, the proximal-point algorithm, prox-gradient, and operator-splitting algorithms such as alternating projection, Douglas–Rachford and ADMM, parallel coordinate descent, . . .

28 / 58

slide-34
SLIDE 34
  • weak case: if T has a fixed point and is nonexpansive, then x^k converges weakly to a fixed point, with ‖Sx^k‖² = o(1/k)
  • strong case: if T is contractive, then it has a unique fixed point and the convergence is linear

29 / 58

slide-35
SLIDE 35

ARock convergence

notation:

  • m = # coordinates
  • τ = max async delay
  • for simplicity, uniform selection p_i ≡ 1/m

Theorem (Known max delay)

Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2τ/√m + 1)), ∀k. Then, with probability one, x^k ⇀ x* ∈ Fix T.

Consequence:

  • O(1) step size if τ ∼ √m
  • no need to sync until at least p > O(√m)

30 / 58

slide-36
SLIDE 36

Stochastic (unbounded) delays

  • j_{k,i}: delay of x_i at iteration k
  • P_ℓ := Pr[max_i{j_{k,i}} ≥ ℓ]: iteration-independent distribution of the max delay
  • ∃B such that ∀k, |j_{k,i} − j_{k,i′}| < B: the x_i’s delays are evenly old at each iteration

Theorem

Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1). Use the fixed step size η = cH for either of the following cases:

  • 1. if ∑_ℓ (ℓ P_ℓ)^{1/2} < ∞, set H = (1 + (1/√m) ∑_ℓ P_ℓ^{1/2} (ℓ^{1/2} + ℓ^{−1/2}))^{−1}
  • 2. if ∑_ℓ ℓ P_ℓ^{1/2} < ∞, set H = (1 + (2/√m) ∑_ℓ P_ℓ^{1/2})^{−1}

Then, with probability one, x^k ⇀ x* ∈ Fix T.

31 / 58

slide-37
SLIDE 37

Arbitrary unbounded delays

  • j_{k,i}: async delay of x_i at iteration k
  • j_k = max_i{j_{k,i}}: max delay at iteration k
  • lim inf j_k < ∞: all but finitely many iterations have a bounded delay

Theorem

Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1) and R > 1. Use step sizes η_k = c (1 + R^{j_k − 1/2} / (√m (R − 1)))^{−1}. Then, with probability one, x^k ⇀ x* ∈ Fix T.

  • Optionally optimize R based on {j_k}.

32 / 58

slide-38
SLIDE 38

Numerical results

slide-39
SLIDE 39

TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods

  • C++11 multi-threading (no shared-memory parallelism in Matlab)
  • Plug in your operators, get free coordinate-update and async-parallelism
  • github.com/uclaopt/tmac
  • committers: Brent Edmunds, Zhimin Peng
  • contributors: Yerong Li, Yezheng Li, Tianyu Wu
  • supports: Windows, Mac, Linux

33 / 58

slide-40
SLIDE 40

ℓ1 logistic regression

  • model:

minimize_{x ∈ R^n}  λ‖x‖₁ + (1/N) ∑_{i=1}^{N} log(1 + exp(−b_i · a_i^T x)),   (1)

  • sparse numerical linear algebra is used for the datasets: news20, url

[Plots: speedup vs. threads (up to 20) for sync, async, and ideal; left: dataset “news20”, right: dataset “url”]

34 / 58

slide-41
SLIDE 41

Nonnegative matrix factorization

  • model:

minimize_{X,Y ≥ 0}  ‖A − X^T Y‖²_F ,   (2)

  • despite nonconvexity, amenable to parallel coordinate descent

[Plots: objective vs. time for 1, 2, 4, 8, 16 cores; speedup vs. threads for async and ideal]

35 / 58

slide-42
SLIDE 42

Practical issue: coordinate friendly structure

slide-43
SLIDE 43

Coordinate friendly (CF) structure

  • CU is fast only if the subproblems are simple
  • require: cost[CU_i(z)] = O((1/m) · cost[z → Tz]) for all z, i
  • CF structures are found in most first-order methods for structured problems³
  • may require caching and maintaining extra variables
  • may require switching orders in operator splitting

³Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin, Coordinate Friendly Structures, Algorithms and Applications, Annals of Math. Sci. and App., 2016.

36 / 58

slide-44
SLIDE 44

Single-operator examples

  • matrix-vector multiplication
  • separable functions
  • box-type constraints
  • sum of sparsely supported functions (arising in graphical models)
  • many loss functions of the form f(Ax), where f is (nearly) separable

37 / 58

slide-45
SLIDE 45

Composed operators

  • there are 9 rules for CF T1 ◦ T2, which cover a wide range of examples
  • general principles:
  • T1 ◦ T2 inherits the weaker separability property of the two
  • CF T1 ◦ T2 generally requires T1 to be CF and T2 to be either cheap, easy-to-maintain, or directly CF.
  • If T1 is separable or cheap, the requirement on T2 is weakened.

38 / 58

slide-46
SLIDE 46

Examples

  • Operator-splitting algorithms for the following problems are CF
  • total variation image processing
  • portfolio optimization
  • most sparse optimization problems
  • linear and second-order cone programs
  • most ERM and SVMs

39 / 58

slide-47
SLIDE 47

Practical issue: nonseparable nonsmoothness

slide-48
SLIDE 48

Coordinate descent (CD) illustration

40 / 58

slide-49
SLIDE 49

CD fails on nonseparable nonsmooth functions

[Contour plots of a rotated weighted ℓ1-norm]

Rotated weighted ℓ1-norm. CD fixed-point is not stationary!

41 / 58

slide-50
SLIDE 50
  • smooth f(x1, x2):

p1 = ∇1f(x1, x2), p2 = ∇2f(x1, x2)  ⟹  (p1, p2) = ∇f(x1, x2)

  • nonsmooth r(x1, x2):

q1 ∈ ∂1r(x1, x2), q2 ∈ ∂2r(x1, x2)  ⇏  (q1, q2) ∈ ∂r(x1, x2)

(however, “⟹” holds if r is separable.)

42 / 58

slide-51
SLIDE 51

What do we want to solve?

  • total variation minimization:

minimize_x  TV(x) + (1/2)‖Ax − b‖²

  • optimization over a graph G = (N, E):

minimize_x  ∑_{(i,j)∈E} f(x_i − x_j) + ∑_{i∈N} g_i(x_i)

  • consensus optimization (decentralized computing):

minimize_{x_1,...,x_m}  f_1(x_1) + f_2(x_2) + · · · + f_m(x_m)  subject to  x_1 = x_2 = · · · = x_m

43 / 58

slide-52
SLIDE 52
  • second-order cone program:

minimize_{x_1,...,x_m}  c^T x + (1/2) x^T B x
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b,  x_1 ∈ Q_1, x_2 ∈ Q_2, . . . , x_m ∈ Q_m

  • extended monotropic program:

minimize_{x_1,...,x_m}  g_1(x_1) + g_2(x_2) + · · · + g_m(x_m) + f(x)
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b

where the g_i are nonsmooth and extended-valued

44 / 58

slide-53
SLIDE 53

Solution overview

  • 1. Re-formulate a problem into

x = T(x) (use dual variables, operator splitting)

  • 2. Apply coordinate update (CU): at iteration k, select i_k, do

x_i^{k+1} = (T(x^k))_i  if i = i_k,  and  x_i^{k+1} = x_i^k  otherwise.

  • 3. Parallelize CU without sync or locking

45 / 58

slide-54
SLIDE 54

Primal-dual splitting for extended monotropic program

minimize_{x_1,...,x_m}  g_1(x_1) + g_2(x_2) + · · · + g_m(x_m) + f(x)
subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b

  • the function f is smooth and nonseparable
  • primal-dual splitting with a special metric gives a new parallel algorithm:

s^{k+1} = s^k + γ(Ax^k − b)
x_i^{k+1} = prox_{ηg_i}( x_i^k − η(∇_i f(x^k) + A_i^T s^k + 2γ A_i^T A x^k − 2γ A_i^T b) ),  i = 1, . . . , m

  • Jacobi style
  • gives async-parallel algorithms for: LP, QP, SOCP, multi-block ADMM-able problems (no matrix inversions), . . .

46 / 58

slide-55
SLIDE 55

CT recovery: full vs random coordinate update

  • 284 × 284 image, 90 beam projections, 362 measurements per beam
  • TV minimization partitioned into 284 columns (blocks); run for 100 epochs

47 / 58

slide-56
SLIDE 56

Portfolio optimization simulation

minimize_x  (1/2) x^⊤ Q x  (risk)
subject to  x ≥ 0  (buy),  ∑_{i=1}^{m} x_i ≤ 1  (limit),  ∑_{i=1}^{m} ξ_i x_i ≥ c  (return)

  • reformulation:

minimize_x  risk + ι_{D1}(x) + ι_{D2}(x),  where D1 = {buy} and D2 = {limit, return}

  • full update based on the 3-operator splitting (Davis-Y.’15):

z^{k+1} = T_{3S}(z^k)

48 / 58

slide-57
SLIDE 57

Full vs random coordinate update

49 / 58

slide-58
SLIDE 58

More applications

slide-59
SLIDE 59

Linear equations (asynchronous Jacobi)

  • require: an invertible square matrix A with nonzero diagonal entries
  • let D be the diagonal part of A; then

Ax = b  ⟺  (I − D^{−1}A)x + D^{−1}b = x,  where T(x) := (I − D^{−1}A)x + D^{−1}b

  • T is nonexpansive if ‖I − D^{−1}A‖₂ ≤ 1, i.e., A is diagonally dominant
  • x^{k+1} = Tx^k recovers the Jacobi algorithm

50 / 58

slide-60
SLIDE 60

Algorithm 1: ARock for linear equations
Input: shared variable x ∈ R^n, K > 0;
set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
  sample i ∈ {1, . . . , m} uniformly at random;
  add −(η_k / a_ii) (∑_j a_ij x̂_j^k − b_i) to the shared variable x_i;
  update the global counter k ← k + 1;

51 / 58

slide-61
SLIDE 61

Sample code

loadData(A, data_file_name);
loadData(b, label_file_name);
#pragma omp parallel num_threads(p) shared(A, b, x, para)
{
  // A, b, x, and para are passed by reference
  call Jacobi(A, b, x, para)
    or ARock(A, b, x, para);
}

  • p: the number of threads
  • A, b, x: shared variables
  • para: other parameters

52 / 58

slide-62
SLIDE 62

Jacobi worker function

for (int itr = 0; itr < max_itr; itr++) {
  // compute the update for the assigned x[i]
  // ...
  #pragma omp barrier
  {
    // write x[i] in global memory
  }
  #pragma omp barrier
}

Jacobi needs the barrier directive for synchronization

53 / 58

slide-63
SLIDE 63

ARock worker function

for (int itr = 0; itr < max_itr; itr++) {
  // pick i at random
  // compute the update for x[i]
  // ...
  // write x[i] in global memory
}

ARock has no synchronization barrier directive

54 / 58

slide-64
SLIDE 64

Minimizing proximable+proximable functions

  • problem and operator:

minimize_x  f(x) + g(x)  ⟺  z* = refl_{γf} ∘ refl_{γg} (z*) =: T_PRS(z*);  recover x* = prox_{γg}(z*)

  • T_PRS: Peaceman-Rachford splitting⁴
  • ARock gives async-parallel algorithms for
  • linear programs, second-order cone programs (inverting matrices)
  • ADMM-able problems

⁴reflective proximal map: refl_{γf} := 2 prox_{γf} − I. The maps refl_{γf}, refl_{γg}, and thus refl_{γf} ∘ refl_{γg}, are nonexpansive.

55 / 58

slide-65
SLIDE 65

Parallel/distributed ADMM

  • structure: m convex functions f_i; one convex function g
  • consensus problem:

minimize_x  ∑_{i=1}^{m} f_i(A_{(i)} x) + g(x)

  • assign g to the master; assign f_i and A_{(i)} to worker i

[Diagram: one master connected to many workers]

  • async: the master does not need to wait for all workers to communicate

56 / 58

slide-66
SLIDE 66

Async-parallel decentralized computing

  • a graph of connected agents: G = (V, E).
  • decentralized consensus optimization problem:

minimize_{x_i ∈ R^d, i ∈ V}  f(x) := ∑_{i∈V} f_i(x_i)
subject to  x_i = x_j, ∀(i, j) ∈ E

  • each agent keeps f_i private and only talks to its neighbors
  • resulting async algorithm: no global clock, no coordinator

57 / 58

slide-67
SLIDE 67

Summary of async-parallel coordinate update:

  • applicable to many problems
  • faster than full update, and scalable

Papers: UCLA CAM reports

  • 15-37 – theory
  • 16-13 – problem structure, applications
  • 16-37 – C++11 solver: github.com/uclaopt/tmac
  • 16-64 – unbounded delay

Thank you!

58 / 58