Asynchronous Parallel Methods for Optimization and Linear Algebra



1. Asynchronous Parallel Methods for Optimization and Linear Algebra. Stephen Wright, University of Wisconsin-Madison. Workshop on Optimization for Modern Computation, Beijing, September 2014.

2. Collaborators: Ji Liu (UW-Madison → U. Rochester), Victor Bittorf (UW-Madison → Cloudera), Chris Ré (UW-Madison → Stanford), Krishna Sridhar (UW-Madison → GraphLab).

3. Outline: 1. Asynchronous Random Kaczmarz; 2. Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm with Inconsistent Read (AsySPCD).

4. Motivation. Why study old, slow, simple algorithms? They are often suitable for machine learning and big-data applications: low accuracy is required, and the data access patterns are favorable. Parallel asynchronous versions are a good fit for modern computers (multicore, NUMA, clusters), and they are (fairly) easy to implement. They also admit interesting new analysis, tied to plausible models of parallel computation and data access.

5. Asynchronous Parallel Optimization. Figure: Asynchronous parallel setup used in Hogwild! [Niu, Recht, Ré, and Wright, 2011]: multiple cores, each with its own cache, read "X" from shared RAM, compute a gradient at "X", and update "X" in RAM. All cores share the same memory, containing the variable x; all cores run the same optimization algorithm independently; all cores update the coordinates of x concurrently, without any software locking. We use the same model of computation in this talk.

6. Outline: 1. Asynchronous Random Kaczmarz; 2. Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm with Inconsistent Read (AsySPCD).

7. 1. Kaczmarz for Ax = b. Consider linear equations Ax = b, where the equations are consistent and the matrix A is m × n, not necessarily square or full rank. Write

A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}, \qquad \|a_i\|_2 = 1 \;\; \forall i \quad \text{(normalized rows)}.

Iteration j of Randomized Kaczmarz: select row index i(j) ∈ {1, 2, ..., m} randomly with equal probability, then set

x_{j+1} \leftarrow x_j - (a_{i(j)}^T x_j - b_{i(j)})\, a_{i(j)}.

This projects x onto the plane of equation i(j).
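For concreteness, here is a minimal sketch of the serial Randomized Kaczmarz iteration just described, in Python/NumPy. It is illustrative only (not code from the talk): it assumes a consistent system with nonzero, unit-norm rows, and the names randomized_kaczmarz, num_iters, and seed are made up here.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=10000, seed=0):
    """Serial Randomized Kaczmarz for consistent Ax = b with unit-norm rows."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.integers(m)              # pick a row uniformly at random
        a_i = A[i]
        # project x onto the hyperplane a_i^T x = b_i (rows assumed unit norm)
        x -= (a_i @ x - b[i]) * a_i
    return x
```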

8. Relationship to Stochastic Gradient. Randomized Kaczmarz ≡ Stochastic Gradient (SG) applied to

f(x) := \frac{1}{2m}\sum_{i=1}^{m}(a_i^T x - b_i)^2 = \frac{1}{2m}\|Ax - b\|^2 = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \qquad f_i(x) := \tfrac{1}{2}(a_i^T x - b_i)^2,

with steplength α_k ≡ 1. It is a special case of SG, however, since the individual gradient estimates ∇f_i(x) = a_i(a_i^T x − b_i) approach zero as x → x*. (The "variance" in the gradient estimate shrinks to zero.)
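Written out, the SG step on a single component function with unit steplength is exactly the Kaczmarz projection; the display below only restates the slide's formulas (using ‖a_{i(j)}‖ = 1), it is not an additional result.

```latex
% SG step on the single component f_{i(j)}, steplength 1:
\nabla f_{i(j)}(x_j) = a_{i(j)}\bigl(a_{i(j)}^T x_j - b_{i(j)}\bigr),
\qquad
x_{j+1} = x_j - \nabla f_{i(j)}(x_j)
        = x_j - \bigl(a_{i(j)}^T x_j - b_{i(j)}\bigr)\, a_{i(j)} .
```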

9. Randomized Kaczmarz Convergence: Linear Rate. Recall that A is scaled: ‖a_i‖ = 1 for all i. λ_min,nz denotes the minimum nonzero eigenvalue of A^T A, and P(·) is projection onto the solution set.

\tfrac12\|x_{j+1} - P(x_{j+1})\|^2 \le \tfrac12\|x_j - a_{i(j)}(a_{i(j)}^T x_j - b_{i(j)}) - P(x_j)\|^2 = \tfrac12\|x_j - P(x_j)\|^2 - \tfrac12 (a_{i(j)}^T x_j - b_{i(j)})^2.

Taking expectations:

E\bigl[\tfrac12\|x_{j+1} - P(x_{j+1})\|^2 \,\big|\, x_j\bigr] \le \tfrac12\|x_j - P(x_j)\|^2 - \tfrac12 E\bigl[(a_{i(j)}^T x_j - b_{i(j)})^2\bigr]
= \tfrac12\|x_j - P(x_j)\|^2 - \tfrac{1}{2m}\|Ax_j - b\|^2
\le \Bigl(1 - \frac{\lambda_{\min,\mathrm{nz}}}{m}\Bigr)\tfrac12\|x_j - P(x_j)\|^2.

(Strohmer and Vershynin, 2009)
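Iterating this one-step bound (via the tower property of expectation) gives the linear rate in its usual form; this is just the displayed inequality applied j times, not a new result from the talk:

```latex
E\,\|x_j - P(x_j)\|^2 \;\le\; \Bigl(1 - \frac{\lambda_{\min,\mathrm{nz}}}{m}\Bigr)^{\!j}\, \|x_0 - P(x_0)\|^2 .
```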

10. Asynchronous Random Kaczmarz (Liu, Wright, 2014). Assumes that x is stored in shared memory, accessible to all cores. Each core runs a simple process, repeating indefinitely:
Choose index i ∈ {1, 2, ..., m} uniformly at random;
Choose component t ∈ supp(a_i) uniformly at random;
Read the supp(a_i)-components of x (from shared memory), needed to evaluate a_i^T x;
Update the t-th component of x:

(x)_t \leftarrow (x)_t - \gamma\,\|a_i\|_0\,(a_i)_t\,(a_i^T x - b_i)

for some step size γ (a unitary operation).
Note that x can be updated by other cores between the time it is read and the time that the update is performed. Differs from Randomized Kaczmarz in that each update uses possibly outdated information, and we update just a single component of x (in theory).
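The per-core loop above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: Python threads share the NumPy vector x without locks, which mimics the access pattern, but Python's GIL prevents genuine parallel speedup (the talk's model corresponds to lock-free updates in, say, C/C++). The names asyrk, n_threads, steps_per_thread, and gamma are made up here, and rows of A are assumed nonzero and unit norm.

```python
import threading
import numpy as np

def asyrk(A, b, gamma=0.05, n_threads=4, steps_per_thread=20000, seed=0):
    """Asynchronous randomized Kaczmarz sketch: shared x, no locking."""
    m, n = A.shape
    x = np.zeros(n)                                   # shared iterate
    supports = [np.flatnonzero(A[i]) for i in range(m)]

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(steps_per_thread):
            i = rng.integers(m)                       # random equation
            supp = supports[i]
            t = supp[rng.integers(supp.size)]         # random coordinate in supp(a_i)
            r = A[i, supp] @ x[supp] - b[i]           # read may see a stale/partially updated x
            x[t] -= gamma * supp.size * A[i, t] * r   # single-component update

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return x
```

Each thread's read of x[supp] can interleave with other threads' writes, which is exactly the stale-read behavior analyzed on the following slides.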

11. AsyRK: Global View. From a "central" viewpoint, aggregating the actions of the individual cores, we have the following at each iteration j:
Select i(j) from {1, 2, ..., m} with equal probability;
Select t(j) from the support of a_{i(j)} with equal probability;
Update component t(j):

x_{j+1} = x_j - \gamma\,\|a_{i(j)}\|_0\,(a_{i(j)}^T x_{k(j)} - b_{i(j)})\,E_{t(j)}\,a_{i(j)},

where k(j) is some iterate prior to j but no more than τ cycles old (j − k(j) ≤ τ), and E_t is the n × n matrix of all zeros, except for a 1 in the (t, t) location. If all computational cores run at roughly the same speed, we can think of the delay τ as being similar to the number of cores.

12. Consistent Reading. Assumes consistent reading, that is, the x_{k(j)} used to evaluate the residual is an x that actually existed at some point in the shared memory. (This condition may be violated if two or more updates happen to the supp(a_{i(j)})-components of x while they are being read.) When the vectors a_i are sparse, inconsistency is not too frequent. More on this later!

13. AsyRK Analysis: A Key Element. Key parameters:
μ := max_{i=1,...,m} ‖a_i‖_0 (maximum nonzero row count);
α := max_{i,t} ‖a_i‖_0 ‖A E_t a_i‖ ≤ μ ‖A‖;
λ_max := maximum eigenvalue of A^T A.
Idea of analysis: choose some ρ > 1 and choose the steplength γ small enough that

\rho^{-1}\,E(\|Ax_j - b\|^2) \le E(\|Ax_{j+1} - b\|^2) \le \rho\,E(\|Ax_j - b\|^2).

Not too much change to the residual at each iteration; hence we don't pay too much of a price for using outdated information. But we don't want γ to be too tiny, otherwise overall progress is too slow. Strike a balance!
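These analysis parameters can be computed directly from A. Below is a small illustrative helper (dense A, names not from the talk) that also returns λ_min,nz from the earlier convergence slide; the 1e-12 tolerance used to decide which eigenvalues count as nonzero is an assumption of this sketch.

```python
import numpy as np

def key_params(A):
    """Compute mu, alpha, lambda_max, lambda_min_nz for a matrix with normalized rows."""
    m, n = A.shape
    row_nnz = (A != 0).sum(axis=1)
    mu = row_nnz.max()                                 # mu = max_i ||a_i||_0
    eigs = np.linalg.eigvalsh(A.T @ A)                 # ascending eigenvalues of A^T A
    lam_max = eigs[-1]
    lam_min_nz = eigs[eigs > 1e-12].min()              # smallest nonzero eigenvalue
    # alpha = max_{i,t} ||a_i||_0 * ||A E_t a_i||; note A E_t a_i = (a_i)_t * A[:, t]
    alpha = 0.0
    for i in range(m):
        for t in np.flatnonzero(A[i]):
            alpha = max(alpha, row_nnz[i] * np.linalg.norm(A[:, t] * A[i, t]))
    return mu, alpha, lam_max, lam_min_nz
```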

14. Main Theorem.
Theorem. Choose any ρ > 1 and define γ via the following:

\psi = \mu + \frac{2\lambda_{\max}\,\tau\,\rho^{\tau}}{m}, \qquad
\gamma \le \min\left\{ \frac{1}{\psi},\; \frac{m(\rho-1)}{2\lambda_{\max}\,\rho^{\tau+1}},\; \frac{m(\rho-1)}{\rho^{\tau}\bigl(m\alpha^2 + \lambda_{\max}^2\,\tau\,\rho^{\tau}\bigr)} \right\}.

Then we have

\rho^{-1}\,E(\|Ax_j - b\|^2) \le E(\|Ax_{j+1} - b\|^2) \le \rho\,E(\|Ax_j - b\|^2),

E(\|x_{j+1} - P(x_{j+1})\|^2) \le \left(1 - \frac{\lambda_{\min,\mathrm{nz}}\,\gamma\,(2 - \gamma\psi)}{m\,\mu}\right) E(\|x_j - P(x_j)\|^2).

A particular choice of ρ leads to simplified results, in a reasonable regime.

15. A Particular Choice.
Corollary. Assume 2e λ_max (τ + 1) ≤ m and set ρ = 1 + 2e λ_max / m. Can show that γ = 1/ψ for this case, so expected convergence is

E(\|x_{j+1} - P(x_{j+1})\|^2) \le \left(1 - \frac{\lambda_{\min,\mathrm{nz}}}{m(\mu+1)}\right) E(\|x_j - P(x_j)\|^2).

In the regime 2e λ_max (τ + 1) ≤ m considered here, the delay τ doesn't really interfere with the convergence rate. In this regime, speedup is linear in the number of cores!
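A small sketch of the corollary's constants, under its stated assumptions. The helper name is illustrative; μ, λ_max, and λ_min,nz could come from the key_params sketch above, and τ is (roughly) the number of cores.

```python
import numpy as np

def corollary_constants(m, mu, lam_max, lam_min_nz, tau):
    """Check the corollary's regime and compute its step and contraction factor."""
    assert 2 * np.e * lam_max * (tau + 1) <= m, "outside the corollary's regime"
    rho = 1 + 2 * np.e * lam_max / m
    psi = mu + 2 * lam_max * tau * rho**tau / m       # from the theorem's definition
    gamma = 1 / psi                                   # the corollary's steplength
    rate = 1 - lam_min_nz / (m * (mu + 1))            # per-iteration contraction factor
    return rho, psi, gamma, rate
```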

16. Discussion. The rate is consistent with serial randomized Kaczmarz: the extra factor of 1/(μ + 1) arises because we update just one component of x, not all the components in supp(a_{i(j)}). For random matrices A with unit rows, we have λ_max ≈ 1 + O(m/n) with high probability, so that τ can be O(m) without compromising linear speedup. Conditions on τ are less strict than for asynchronous random algorithms for optimization problems (typically τ = O(n^{1/4}) or τ = O(n^{1/2}) for coordinate descent methods). See below.

17. AsyRK: Near-Linear Speedup. Run on an Intel Xeon 40-core machine (used one socket: 10 cores). Diverges a bit from the analysis: we update all components of x in supp(a_{i(j)}) (not just one), and we use sampling without replacement to work through the rows of A, reordering after each "epoch". Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000, sparsity δ = 0.001. We see near-linear speedup. [Plots: residual versus epoch count for 1, 2, 4, 8, and 10 threads, and speedup versus number of threads compared with the ideal linear speedup.]
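For reference, the implementation variant described on this slide (full-support updates, sampling without replacement per epoch) looks roughly like the serial sketch below; in the experiments this inner loop runs concurrently on every core over the shared x. The names, the number of epochs, and the choice γ = 1 (which recovers the full Kaczmarz projection for unit-norm rows) are illustrative assumptions, not details from the talk.

```python
import numpy as np

def asyrk_epoch_variant(A, b, gamma=1.0, n_epochs=50, seed=0):
    """Epoch-based variant: shuffle rows each epoch, update all of supp(a_i)."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                  # without-replacement sweep
            a_i = A[i]
            supp = np.flatnonzero(a_i)
            r = a_i[supp] @ x[supp] - b[i]
            x[supp] -= gamma * a_i[supp] * r          # update every component in supp(a_i)
    return x
```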
