
Parallel Numerical Algorithms, Chapter 6: Structured and Low Rank Matrices, Section 6.3: Numerical Optimization (Presentation)



1. Parallel Numerical Algorithms
   Chapter 6 – Structured and Low Rank Matrices
   Section 6.3 – Numerical Optimization
   Michael T. Heath and Edgar Solomonik
   Department of Computer Science, University of Illinois at Urbana-Champaign
   CS 554 / CSE 512

2. Outline

   1. Alternating Least Squares: Quadratic Optimization, Parallel ALS
   2. Coordinate Descent: Coordinate Descent, Cyclic Coordinate Descent
   3. Gradient Descent: Gradient Descent, Stochastic Gradient Descent, Parallel SGD
   4. Nonlinear Optimization: Nonlinear Equations, Optimization

3. Quadratic Optimization: Matrix Completion

   Given a subset Ω ⊆ {1, ..., m} × {1, ..., n} of the entries of a matrix A ∈ ℝ^{m×n}, seek the rank-k approximation

   $$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \;\sum_{(i,j) \in \Omega} \Big( \underbrace{a_{ij} - \sum_l w_{il} h_{jl}}_{(A - WH^T)_{ij}} \Big)^2 + \lambda \big( \|W\|_F + \|H\|_F \big)$$

   - Problems of this type are studied in sparse approximation
   - Ω may be a randomly selected sample subset
   - Methods for this problem are typical of numerical optimization and machine learning
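To make the objective concrete, here is a minimal NumPy sketch that evaluates it over the observed entries; the function name completion_objective and the representation of Ω as a list of (i, j) index pairs are illustrative assumptions, not part of the slides.

```python
import numpy as np

def completion_objective(A, W, H, omega, lam):
    """Regularized squared error over the observed entries omega (a sketch).

    A     : m x n data matrix (only entries in omega are used)
    W, H  : m x k and n x k factor matrices
    omega : iterable of (i, j) index pairs of observed entries
    lam   : regularization parameter lambda
    """
    err = 0.0
    for i, j in omega:
        # residual of the rank-k model at observed entry (i, j)
        err += (A[i, j] - W[i, :] @ H[j, :]) ** 2
    # Frobenius-norm regularization of both factors, as in the objective above
    return err + lam * (np.linalg.norm(W, 'fro') + np.linalg.norm(H, 'fro'))
```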

4. Alternating Least Squares

   - Alternating least squares (ALS) fixes W and solves for H, then vice versa, until convergence
   - Each step improves the approximation; convergence to a minimum is expected given a satisfactory starting guess
   - We have a quadratic optimization problem

     $$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k}} \;\sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_l w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F$$

   - The optimization problem is independent for the rows of W
   - Letting w_i = w_{i⋆}, h_i = h_{i⋆}, and Ω_i = {j : (i, j) ∈ Ω}, seek

     $$\operatorname*{argmin}_{w_i \in \mathbb{R}^{k}} \;\sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2$$

5. ALS: Quadratic Optimization

   Seek the minimizer w_i of the quadratic objective

   $$f(w_i) = \sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2$$

   Differentiating with respect to w_i gives

   $$\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} h_j^T \big( w_i h_j^T - a_{ij} \big) + 2 \lambda w_i = 0$$

   Rotating w_i h_j^T = h_j w_i^T and defining G^{(i)} = Σ_{j∈Ω_i} h_j^T h_j,

   $$\big( G^{(i)} + \lambda I \big) w_i^T = \sum_{j \in \Omega_i} h_j^T a_{ij}$$

   which is a k × k symmetric linear system of equations
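A minimal NumPy sketch of this per-row solve, assuming H is stored densely and Ω_i is given as an integer index array; the name als_update_row is an illustrative choice.

```python
import numpy as np

def als_update_row(a_i, H, omega_i, lam):
    """One ALS update for a single row w_i via the k x k solve on the slide.

    a_i     : length-n row of A (only entries indexed by omega_i are used)
    H       : n x k factor matrix, held fixed
    omega_i : array of column indices j with (i, j) observed
    lam     : regularization parameter lambda
    """
    Hi = H[omega_i, :]                 # |Omega_i| x k submatrix of H
    k = H.shape[1]
    G = Hi.T @ Hi                      # G^(i) = sum_j h_j^T h_j
    rhs = Hi.T @ a_i[omega_i]          # sum_j h_j^T a_ij
    # symmetric k x k system (G^(i) + lambda I) w_i^T = rhs
    return np.linalg.solve(G + lam * np.eye(k), rhs)
```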

6. ALS: Iteration Cost

   For updating each w_i, ALS is dominated in cost by two steps:

   1. Compute G^{(i)} = Σ_{j∈Ω_i} h_j^T h_j
      - dense matrix-matrix product
      - O(|Ω_i| k^2) work
      - logarithmic depth
   2. Solve the linear system with G^{(i)} + λI
      - dense symmetric k × k linear solve
      - O(k^3) work
      - typically O(k) depth

   Can do these for all m rows of W independently (see the sketch below)
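Building on the als_update_row sketch above, one ALS half-sweep over all rows of W might look as follows; each loop iteration is independent of the others, so each could be assigned to a separate task.

```python
def als_sweep_W(A, W, H, omega_rows, lam):
    """Update every row of W with H held fixed (a sketch).

    omega_rows[i] lists the observed column indices Omega_i for row i.
    The iterations are independent, so the loop could run as m parallel tasks.
    """
    for i in range(W.shape[0]):
        W[i, :] = als_update_row(A[i, :], H, omega_rows[i], lam)
```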

7. Parallel ALS

   - Let each task optimize a row w_i of W
   - Need to compute G^{(i)} for each task
   - A specific subset of rows of H is needed for each G^{(i)}
   - Task execution is embarrassingly parallel if all of H is stored on each processor

8. Memory-Constrained Parallel ALS

   - May not have enough memory to replicate H on all processors
   - Communication is required and the pattern is data-dependent
   - Could rotate rows of H along a ring of processors
   - Each processor computes contributions to the G^{(i)} it owns
   - Requires Θ(p) latency cost for each iteration of ALS

9. Updating a Single Variable

   Rather than solving for whole rows w_i, solve for individual elements of W; recall

   $$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k}} \;\sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_l w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F$$

   Coordinate descent finds the best replacement µ for w_{it}:

   $$\mu = \operatorname*{argmin}_{\mu} \;\sum_{j \in \Omega_i} \Big( a_{ij} - \mu h_{jt} - \sum_{l \neq t} w_{il} h_{jl} \Big)^2 + \lambda \mu^2$$

   The solution is given by

   $$\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \big( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$

10. Coordinate Descent

    For all (i, j) ∈ Ω, compute the elements r_{ij} of R = A − W H^T so that we can optimize via

    $$\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \big( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2} = \frac{\sum_{j \in \Omega_i} h_{jt} \big( r_{ij} + w_{it} h_{jt} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$

    after which we can update R via

    $$r_{ij} \leftarrow r_{ij} - (\mu - w_{it}) h_{jt} \quad \forall j \in \Omega_i$$

    both using O(|Ω_i|) operations (see the sketch below)
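A minimal sketch of this single-coordinate update with residual maintenance, assuming the residuals r_ij are stored in a dictionary keyed by (i, j) and omega_rows[i] holds Ω_i; the names are illustrative.

```python
def ccd_update_w(W, H, R, omega_rows, i, t, lam):
    """Coordinate-descent update of the single entry w_it, keeping R = A - W H^T.

    W, H       : m x k and n x k factors (W is modified in place)
    R          : dict mapping observed (i, j) to the current residual r_ij
    omega_rows : omega_rows[i] lists the observed column indices Omega_i
    """
    w_it = W[i, t]
    num, den = 0.0, lam
    for j in omega_rows[i]:
        h_jt = H[j, t]
        num += h_jt * (R[i, j] + w_it * h_jt)   # numerator uses r_ij + w_it h_jt
        den += h_jt ** 2
    mu = num / den
    # update the stored residuals affected by the changed coordinate
    for j in omega_rows[i]:
        R[i, j] -= (mu - w_it) * H[j, t]
    W[i, t] = mu
```

Both loops touch only the |Ω_i| observed entries of row i, matching the O(|Ω_i|) cost stated above.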

11. Cyclic Coordinate Descent (CCD)

    - Updating w_i costs O(|Ω_i| k) operations with coordinate descent rather than O(|Ω_i| k^2 + k^3) operations with ALS
    - By solving for all of w_i at once, ALS obtains a more accurate solution than coordinate descent
    - Coordinate descent with different update orderings:
      - Cyclic coordinate descent (CCD) updates all columns of W, then all columns of H (ALS-like ordering)
      - CCD++ alternates between columns of W and H (see the sketch below)
    - All entries within a column can be updated concurrently
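Continuing the sketch above and reusing ccd_update_w, a CCD++ sweep could alternate between column t of W and column t of H; the symmetric helper ccd_update_h and the column index lists omega_cols are assumptions for illustration.

```python
def ccd_update_h(W, H, R, omega_cols, j, t, lam):
    """Symmetric analogue of ccd_update_w: update the single entry h_jt."""
    h_jt = H[j, t]
    num, den = 0.0, lam
    for i in omega_cols[j]:
        w_it = W[i, t]
        num += w_it * (R[i, j] + w_it * h_jt)
        den += w_it ** 2
    mu = num / den
    for i in omega_cols[j]:
        R[i, j] -= (mu - h_jt) * W[i, t]
    H[j, t] = mu

def ccd_plus_plus_sweep(W, H, R, omega_rows, omega_cols, lam):
    """One CCD++ sweep: for each t, update column t of W, then column t of H."""
    m, k = W.shape
    n = H.shape[0]
    for t in range(k):
        for i in range(m):     # entries within a column are independent
            ccd_update_w(W, H, R, omega_rows, i, t, lam)
        for j in range(n):
            ccd_update_h(W, H, R, omega_cols, j, t, lam)
```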

12. Parallel CCD++

    - Yu, Hsieh, Si, and Dhillon (2013) propose using a row-blocked layout of H and W
    - They keep track of the corresponding m/p rows and n/p columns of A and R on each processor (using twice the minimal amount of memory)
    - Every column update in CCD++ is then fully parallelized, but an allgather of each column is required to update R
    - The complexity of updating all of W and all of H is then

      $$T_p(m, n, k) = \Theta\big( k\, T^{\text{allgather}}_p(m + n) + \gamma\, Q_1(m, n, k)/p \big) = O\big( \alpha k \log p + \beta (m + n) k + \gamma |\Omega| k / p \big)$$

13. Gradient-Based Update

    - ALS minimizes over w_i; gradient descent methods only improve it
    - Recall that we seek to minimize

      $$f(w_i) = \sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2$$

      and use the partial derivative

      $$\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} h_j^T \big( w_i h_j^T - a_{ij} \big) + 2 \lambda w_i = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big)$$

    - The gradient descent method updates

      $$w_i = w_i - \eta \, \frac{\partial f(w_i)}{\partial w_i}$$

      where the parameter η is the step size
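A minimal sketch of one gradient step on a row w_i, assuming the residuals r_ij are kept in a dictionary R as in the coordinate-descent sketches; the name gradient_step_row is illustrative.

```python
import numpy as np

def gradient_step_row(W, H, R, omega_rows, i, lam, eta):
    """One gradient-descent step on row w_i with step size eta.

    Uses the gradient 2*(lambda*w_i - sum_{j in Omega_i} r_ij * h_j) from the slide;
    R maps observed (i, j) to r_ij = a_ij - w_i h_j^T.
    """
    grad = 2.0 * lam * W[i, :].copy()
    for j in omega_rows[i]:
        grad -= 2.0 * R[i, j] * H[j, :]
    W[i, :] -= eta * grad
    # note: the stored residuals r_ij for j in Omega_i would then need refreshing
```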

14. Stochastic Gradient Descent (SGD)

    - Stochastic gradient descent (SGD) performs fine-grained updates based on a component of the gradient
    - Again, the full gradient is

      $$\frac{\partial f(w_i)}{\partial w_i} = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big) = 2 \sum_{j \in \Omega_i} \Big( \lambda w_i / |\Omega_i| - r_{ij} h_j \Big)$$

    - SGD selects a random (i, j) ∈ Ω and updates w_i using h_j:

      $$w_i \leftarrow w_i - \eta \big( \lambda w_i / |\Omega_i| - r_{ij} h_j \big)$$

    - SGD then updates r_{ij} = a_{ij} − w_i h_j^T
    - Each update costs O(k) operations
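A minimal sketch of one SGD update, assuming Ω is given as a list of observed (i, j) pairs and the residuals are stored in a dictionary R; only the w_i update from the slide is shown, with the symmetric h_j update omitted.

```python
import numpy as np

def sgd_step(W, H, R, A, omega, omega_rows, lam, eta, rng):
    """One stochastic gradient step on a single randomly chosen observed entry.

    omega is a list of observed (i, j) pairs and R stores r_ij = a_ij - w_i h_j^T.
    Only w_i is updated here; the analogous h_j update is the usual companion step.
    """
    i, j = omega[rng.integers(len(omega))]
    # single-sample gradient component from the slide (constant factors folded into eta)
    W[i, :] -= eta * (lam * W[i, :] / len(omega_rows[i]) - R[i, j] * H[j, :])
    # refresh the stored residual for the entry that was touched
    R[i, j] = A[i, j] - W[i, :] @ H[j, :]

# usage: rng = np.random.default_rng(0); call sgd_step(...) repeatedly over epochs
```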

15. Asynchronous SGD

    - Parallelizing SGD is easy aside from ensuring concurrent updates do not conflict
    - Asynchronous shared-memory implementations of SGD are popular and achieve high performance
    - For sufficiently small step size, inconsistencies among updates (e.g. duplication) are not problematic statistically
    - Asynchronicity can slow down convergence
