communication-optimal QR factorizations: performance and scalability - - PowerPoint PPT Presentation



SLIDE 1

communication-optimal QR factorizations: performance and scalability on varying architectures

Edward Hutter and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Blue Waters Symposium 2019

Edward Hutter and Edgar Solomonik 1/28

SLIDE 2

Motivation for reducing algorithmic communication costs

Communication and synchronization increasingly dominate algorithm performance on modern architectures.

α-β-γ cost model:
  • α - cost to send a zero-byte message
  • β - cost to inject a byte of data into the network
  • γ - cost to perform a flop on register-resident data

Architectural trend: α ≫ β ≫ γ

Communication-avoiding algorithms exist for most dense matrix factorizations present in numerical libraries.

Goal: a QR factorization algorithm that prioritizes minimizing synchronization and communication cost.

Our team uses Blue Waters to assess the scalability of new algorithms for numerical tensor algebra at massive scale.

Edward Hutter and Edgar Solomonik 2/28

SLIDE 8

Architecture trends: machine balance decreasing

machine                  launch year   peak node perf   peak injection bandwidth   machine balance
                                       (Gflops/s)       (Gwords/s)                 (words/flop)
ASCI Red                 1997          0.666            0.4                        1/1.665
ANL BG/P                 2007          13.6             1                          1/13.6
ORNL Jaguar              2009          124.8            2.2                        1/56
ANL BG/Q                 2012          205              2                          1/102.5
NCSA Blue Waters (XE)    2012          313.6            9.6                        1/32
NCSA Blue Waters (XK)    2012          1320             9.6                        1/137.5
ORNL Titan               2013          1320             8                          1/165
ANL Theta                2017          3000+            10.2                       1/294
TACC Stampede2           2017          3000+            12.5                       1/240
LLNL Sierra              2018          28000            12.5                       1/2240
ORNL Summit              2018          44000            12.5                       1/3520

Higher arithmetic intensity → higher performance on new architectures.

Blue Waters is not a favorable machine for communication-avoiding algorithms.

Edward Hutter and Edgar Solomonik 3/28
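The balance column follows directly from the two peak columns. A small sketch (values copied from the table above; the "3000+" lower-bound entries are omitted) recomputes it:

```python
# Machine balance (words/flop) = peak injection bandwidth / peak node performance.
# Values copied from the table above: (Gflops/s, Gwords/s).
machines = {
    "ASCI Red":              (0.666, 0.4),
    "ANL BG/Q":              (205.0, 2.0),
    "NCSA Blue Waters (XE)": (313.6, 9.6),
    "ORNL Summit":           (44000.0, 12.5),
}

def balance(peak_gflops, peak_gwords):
    """Words that can be injected into the network per flop, at peak rates."""
    return peak_gwords / peak_gflops

for name, (gflops, gwords) in machines.items():
    # Report in the table's 1/x form: x = flops performed per injected word.
    print(f"{name}: 1/{gflops / gwords:g} words/flop")
```

Summit's 1/3520 means one word of injection bandwidth per 3520 flops, which is why only high-arithmetic-intensity algorithms run near peak on recent machines.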

SLIDE 11

Communication-avoiding Cholesky-QR2 (CA-CQR2)

3D algorithms utilize available extra memory to reduce communication asymptotically.

We introduce CA-CQR2, a novel practical 3D QR factorization algorithm:
  • extends the CholeskyQR2 algorithm to arbitrary m × n matrices across P processes
  • requires O((Pm²/n²)^(1/6)) less communication than known 2D QR algorithms
  • incurs a number of (increasingly profitable) tradeoffs:
      - 2-4x more flops than Householder QR
      - matrix must be sufficiently well-conditioned
      - requires O((Pm/n)^(1/3)) more memory than known 2D QR algorithms

All algorithms are measured along the critical path instead of by a volume measure.

Figure: Horizontal (internode network) communication along critical path

Edward Hutter and Edgar Solomonik 4/28

SLIDE 18

QR Strong scaling performance

Figure: Strong scaling for m × n matrices on Stampede2 and Blue Waters, m/n = 4096 (y-axis: Gigaflops/s/node; x-axis: processes, 512-65536; series: ST2 ScaLAPACK, ST2 CA-CQR2, BW ScaLAPACK, BW CA-CQR2)

Edward Hutter and Edgar Solomonik 5/28

SLIDE 19

QR Strong scaling performance

Figure: Strong scaling for m × n matrices on Stampede2 and Blue Waters, m/n = 512 (y-axis: Gigaflops/s/node; x-axis: processes, 512-65536; series: ST2 ScaLAPACK, ST2 CA-CQR2, BW ScaLAPACK, BW CA-CQR2)

Edward Hutter and Edgar Solomonik 6/28

SLIDE 20

QR Strong scaling performance

Figure: Strong scaling for m × n matrices on Stampede2 and Blue Waters, m/n = 64 (y-axis: Gigaflops/s/node; x-axis: processes, 512-65536; series: ST2 ScaLAPACK, ST2 CA-CQR2, BW ScaLAPACK, BW CA-CQR2)

Edward Hutter and Edgar Solomonik 7/28

SLIDE 21

QR Strong scaling performance

Figure: Strong scaling for m × n matrices on Stampede2 and Blue Waters, m/n = 8 (y-axis: Gigaflops/s/node; x-axis: processes, 512-65536; series: ST2 ScaLAPACK, ST2 CA-CQR2, BW ScaLAPACK, BW CA-CQR2)

Edward Hutter and Edgar Solomonik 8/28

SLIDE 22

QR Strong scaling performance

Figure: Strong scaling for m × n matrices on Stampede2 and Blue Waters, m/n = 1 (y-axis: Gigaflops/s/node; x-axis: processes, 512-65536; series: ST2 ScaLAPACK, ST2 CA-CQR2, BW ScaLAPACK, BW CA-CQR2)

Edward Hutter and Edgar Solomonik 9/28

SLIDE 23

Competing costs of parallel QR factorization of A (m × n)

ScaLAPACK's PGEQRF is communication-optimal assuming minimal memory (2D):
  T_PGEQRF = O(n log P · α + (mn/√P) · β),   M_PGEQRF = O(mn/P)

CAQR factors panels using TSQR to reduce synchronization¹ (2D):
  T_CAQR = O(√P log²P · α + (mn/√P) · β),   M_CAQR = O(mn/P)

CA-CQR2 leverages extra memory to reduce communication (3D):
  T_CA-CQR2 = O((Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β),   M_CA-CQR2 = O((n²m/P)^(2/3))

3D algorithms exist in theory²³⁴, but CA-CQR2 is the first practical approach⁵

  • 1. J. Demmel et al., "Communication-optimal Parallel and Sequential QR and LU Factorizations", SISC 2012
  • 2. A. Tiskin, "Communication-efficient generic pairwise elimination", Future Generation Computer Systems 2007
  • 3. E. Solomonik et al., "A communication-avoiding parallel algorithm for the symmetric eigenvalue problem", SPAA 2017
  • 4. G. Ballard et al., "A 3D Parallel Algorithm for QR Decomposition", SPAA 2018
  • 5. E. Hutter et al., "Communication-avoiding CholeskyQR2 for rectangular matrices", IPDPS 2019

Edward Hutter and Edgar Solomonik 10/28
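The gap between the 2D and 3D bandwidth terms can be checked numerically: the ratio (mn/√P) / (n²m/P)^(2/3) simplifies algebraically to (Pm²/n²)^(1/6), the communication-reduction factor claimed for CA-CQR2. A sketch (the problem sizes below are illustrative, not taken from the experiments):

```python
def words_2d(m, n, P):
    # Leading-order bandwidth cost of 2D QR (PGEQRF / CAQR): O(mn / sqrt(P))
    return m * n / P**0.5

def words_3d(m, n, P):
    # Leading-order bandwidth cost of 3D CA-CQR2: O((n^2 m / P)^(2/3))
    return (n * n * m / P) ** (2 / 3)

def predicted_reduction(m, n, P):
    # (P m^2 / n^2)^(1/6): the stated communication-reduction factor
    return (P * m * m / (n * n)) ** (1 / 6)

m, n, P = 2**20, 2**10, 4096   # illustrative tall-and-skinny problem
print(words_2d(m, n, P) / words_3d(m, n, P))
print(predicted_reduction(m, n, P))  # the two agree
```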

SLIDE 27

Instability of Cholesky-QR

QR factorization algorithms used in practice stem from processes of orthogonal triangularization for their superior numerical stability:
  Q_n Q_{n−1} · · · Q_1 A = R

The Cholesky-QR algorithm is a simple algorithm that follows a numerically unstable process of triangular orthogonalization:
  A R_1^{−1} R_2^{−1} · · · R_n^{−1} = Q

[Q, R] ← Cholesky-QR(A)
  B ← AᵀA      ⊲ B may be indefinite!
  RᵀR ← B      ⊲ possible failure in Cholesky factorization!
  Q ← AR⁻¹     ⊲ R may have lost all accuracy! Q may lose orthogonality!

CholeskyQR2 leverages the near-perfect conditioning of Q in a second iteration¹

  • 1. Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal. 2015

Edward Hutter and Edgar Solomonik 11/28
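The two-pass scheme is compact enough to sketch serially. A NumPy version (shared-memory only; the distributed CA-CQR2 instead maps each of these steps onto 3D-grid kernels):

```python
import numpy as np

def cholesky_qr(A):
    """One pass of (unstable) Cholesky-QR: triangular orthogonalization A R^{-1} = Q.

    The Gram matrix B = A^T A squares the condition number of A, so the
    Cholesky factorization can fail or lose accuracy for ill-conditioned A.
    """
    B = A.T @ A                      # B may be numerically indefinite!
    R = np.linalg.cholesky(B).T      # B = R^T R with upper-triangular R
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1} without forming the inverse
    return Q, R

def cholesky_qr2(A):
    """CholeskyQR2: a second pass restores orthogonality (Yamamoto et al. 2015)."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)          # Q1 is near-perfectly conditioned
    return Q, R2 @ R1                # A = Q (R2 R1)

# Tall-and-skinny example
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
Q, R = cholesky_qr2(A)
print(np.linalg.norm(Q.T @ Q - np.eye(20)))           # near machine precision
print(np.linalg.norm(A - Q @ R) / np.linalg.norm(A))  # small relative residual
```

Note the design choice mirrored from the slide: the only reductions needed are for AᵀA, which is why the parallel algorithm synchronizes so rarely.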

SLIDE 31

Scalability of Cholesky-QR2

Cholesky-QR2 (CQR2) can achieve superior performance on tall-and-skinny matrices¹

Flop counts: Householder QR: 2mn² − 2n³/3; Cholesky-QR2: 4mn² + 5n³/3

CQR2 attains minimal communication cost (within O(log P)), yet admits a simple implementation:
  T_CholeskyQR2(m, n, P) = O(log P · α + n² · β + (n²m/P + n³) · γ)

CA-CQR2 parallelizes Cholesky-QR2 over a 3D processor grid, efficiently factoring any rectangular matrix

  • 1. T. Fukaya et al., "CholeskyQR2: A communication-avoiding algorithm", ScalA 2014

Edward Hutter and Edgar Solomonik 12/28
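The two flop counts make the "2-4x more flops" tradeoff concrete: the overhead approaches 2x in the tall-and-skinny limit and reaches (4 + 5/3)/(2 − 2/3) = 4.25x for square matrices. A quick sketch:

```python
def householder_flops(m, n):
    # Leading-order flop count for Householder QR of an m x n matrix
    return 2 * m * n**2 - 2 * n**3 / 3

def cholesky_qr2_flops(m, n):
    # Leading-order flop count for CholeskyQR2 of an m x n matrix
    return 4 * m * n**2 + 5 * n**3 / 3

def overhead(m, n):
    """CQR2's extra-computation factor relative to Householder QR."""
    return cholesky_qr2_flops(m, n) / householder_flops(m, n)

print(overhead(2**22, 2**10))  # tall-and-skinny: close to 2x
print(overhead(4096, 4096))    # square (m = n): 4.25x
```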

SLIDE 36

CA-CQR2's communication-optimal parallelization

CA-CQR2 leverages known 3D algorithms for matrix multiplication¹ and Cholesky factorization²

A tunable 3D processor grid of dimensions c × d × c determines the replication factor (c), the communication reduction (√c), and the number of simultaneous instances of 3D algorithms (d/c)

Figure: Computation of the Gram matrix AᵀA
Cost: O((log c + log(d/c)) · α + (mn/(dc) + n²/c²) · β + (mn²/(dc²) + n²/c²) · γ)

  • 1. J. Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"; Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"; Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"
  • 2. A. Tiskin, "Communication-efficient generic pairwise elimination", Future Generation Computer Systems 2007

Edward Hutter and Edgar Solomonik 13/28

SLIDE 39

CA-CQR2's communication-optimal parallelization

Figure: d/c simultaneous 3D Cholesky factorizations on cubes of dimension c
Cost: O(c² log(c³) · α + (n²/c²) · β + (n³/c³) · γ)

Edward Hutter and Edgar Solomonik 14/28

SLIDE 40

CA-CQR2's communication-optimal parallelization

Figure: d/c simultaneous 3D MatMul / TRSM on cubes of dimension c
Cost: O(log(c³) · α + (n²/c²) · β + (n³/c³) · γ)

Edward Hutter and Edgar Solomonik 15/28

SLIDE 41

Algorithmic cost analysis: CA-CQR2 vs. competition

CA-CQR2's cost expression exposes tunable tradeoffs:
  T_CA-CQR2(m, n, c, d) = O(c² log(d/c) · α + (mn/(dc) + n²/c²) · β + (mn²/(c²d) + n³/c³) · γ)

Requiring each processor to own a square submatrix (m/d = n/c) and enforcing P = c²d, CA-CQR2 finds an optimal processor grid that supports minimal communication.

            1D Cholesky-QR2    2D ScaLAPACK   2D CAQR         3D CA-CQR2
messages    O(log P)           O(n log P)     O(√P log²P)     O((Pn/m)^(2/3) log P)
words       O(n²)              O(mn/√P)       O(mn/√P)        O((n²m/P)^(2/3))
flops       O(n²m/P + n³)      O(mn²/P)       O(mn²/P)        O(n²m/P)
memory      O(mn/P + n²)       O(mn/P)        O(mn/P)         O((n²m/P)^(2/3))

Minimal communication cost in a QR factorization is reflected by the surface area of the cubic volume of O(mn²/P) computation.

Edward Hutter and Edgar Solomonik 16/28
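The square-submatrix constraint (m/d = n/c) together with P = c²d pins down the grid in closed form. A sketch of that algebra (real runs would round c and d to nearby integer divisors of P; the sizes below are illustrative):

```python
def optimal_grid(P, m, n):
    """Solve m/d = n/c and P = c^2 d for the c x d x c processor grid.

    Substituting d = m c / n into P = c^2 d gives c = (P n / m)^(1/3)
    and hence d = (P m^2 / n^2)^(1/3). Returned as reals; an actual
    launcher would round to integer divisors of P.
    """
    c = (P * n / m) ** (1 / 3)
    d = (P * m * m / (n * n)) ** (1 / 3)
    return c, d

P, m, n = 8192, 2**22, 2**13    # illustrative process count and matrix shape
c, d = optimal_grid(P, m, n)
print(c, d)                      # grid dimensions
print(c * c * d, m / d, n / c)   # c^2 d recovers P; per-process blocks are square
```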

SLIDE 48

Implementation and Experiment setup

We factor m × n matrices with m ≫ n to highlight the effect CA-CQR2's communication reduction and algorithmic tradeoffs have on performance.

Scaling studies highlight the interplay between CA-CQR2's increased arithmetic intensity and an architecture's machine balance: the ratio of peak flops to network bandwidth is 8x higher on Stampede2¹ than on Blue Waters².

We show only the most performant variants of CA-CQR2 and ScaLAPACK's PGEQRF at each node count:
  • ScaLAPACK tuned over 2D processor grid dimensions and block sizes
  • CA-CQR2 tuned over processor grid dimensions d and c
  • each tested/tuned over a number of resource configurations
  • both algorithms use Householder QR's flop count in determining performance

  • 1. Intel Knights Landing (KNL) cluster at TACC
  • 2. Cray XE/XK hybrid machine at NCSA

Edward Hutter and Edgar Solomonik 17/28

SLIDE 52

Deeper analysis into Strong Scaling results

Table: Strong scaling: CA-CQR2 performance relative to ScaLAPACK

machine       m/n    extra         512    1024   2048   4096   8192   16384  32768  65536
                     computation   PEs    PEs    PEs    PEs    PEs    PEs    PEs    PEs
Blue Waters   4096   2.00x         1.01x  0.88x  0.70x  0.62x  0.62x  0.73x  1.00x  -
Blue Waters   512    2.00x         0.51x  0.48x  0.51x  0.56x  0.66x  0.86x  1.36x  -
Blue Waters   64     2.02x         0.51x  0.53x  0.53x  0.61x  0.73x  0.91x  0.92x  -
Blue Waters   8      2.20x         0.53x  0.54x  0.55x  0.72x  0.75x  0.67x  0.47x  -
Blue Waters   1      4.25x         0.26x  0.21x  0.18x  0.27x  0.21x  0.13x  0.13x  -
Stampede2     4096   2.00x         -      -      -      0.70x  1.02x  1.27x  1.72x  3.13x
Stampede2     512    2.00x         -      -      -      0.52x  0.99x  1.47x  2.01x  3.34x
Stampede2     64     2.02x         -      -      -      0.77x  1.19x  1.59x  1.82x  2.61x
Stampede2     8      2.20x         -      -      -      0.77x  1.00x  1.21x  1.36x  1.60x
Stampede2     1      4.25x         -      -      -      0.48x  0.55x  0.66x  1.41x  1.02x

Edward Hutter and Edgar Solomonik 18/28

slide-57
SLIDE 57

QR Strong scaling critical path analysis

Figure: 524288 × 2048 matrix: Stampede2 (S2) vs. Blue Waters (BW). Critical-path breakdown of time (s) into computation, communication, and overlap at 512, 1024, 2048, and 4096 PEs.

Edward Hutter and Edgar Solomonik 23/28

slide-58
SLIDE 58

QR Strong scaling critical path analysis

Figure: 131072 × 4096 matrix: Stampede2 (S2) vs. Blue Waters (BW). Critical-path breakdown of time (s) into computation, communication, and overlap at 512, 1024, 2048, and 4096 PEs.

Edward Hutter and Edgar Solomonik 24/28

slide-59
SLIDE 59

QR Strong scaling critical path analysis

Figure: 32768 × 8192 matrix: Stampede2 (S2) vs. Blue Waters (BW). Critical-path breakdown of time (s) into computation, communication, and overlap at 512, 1024, 2048, and 4096 PEs.

Edward Hutter and Edgar Solomonik 25/28

slide-60
SLIDE 60

Analysis and Future Work

CA-CQR2’s performance improvements over ScaLAPACK on Stampede2 range from 1.1x to 3.3x at 1024 nodes

1Our preprint detailing CA-CQR2 can be found at https://arxiv.org/abs/1710.08471
2Our C++ implementation can be found at https://github.com/huttered40/CA-CQR2

Edward Hutter and Edgar Solomonik 26/28

slide-61
SLIDE 61

Analysis and Future Work

CA-CQR2’s performance improvements over ScaLAPACK on Stampede2 range from 1.1x to 3.3x at 1024 nodes

CA-CQR2 leverages current and future architectural trends
  • machines with the highest ratio of peak node performance to peak injection bandwidth benefit most
  • the asymptotic communication reduction becomes increasingly evident as we scale, despite overheads in synchronization and computation

1Our preprint detailing CA-CQR2 can be found at https://arxiv.org/abs/1710.08471
2Our C++ implementation can be found at https://github.com/huttered40/CA-CQR2

Edward Hutter and Edgar Solomonik 26/28

slide-62
SLIDE 62

Analysis and Future Work

CA-CQR2’s performance improvements over ScaLAPACK on Stampede2 range from 1.1x to 3.3x at 1024 nodes

CA-CQR2 leverages current and future architectural trends
  • machines with the highest ratio of peak node performance to peak injection bandwidth benefit most
  • the asymptotic communication reduction becomes increasingly evident as we scale, despite overheads in synchronization and computation

These results motivate targeting increasingly wide overdetermined systems, a critical use case in solving linear least squares and eigenvalue problems

1Our preprint detailing CA-CQR2 can be found at https://arxiv.org/abs/1710.08471
2Our C++ implementation can be found at https://github.com/huttered40/CA-CQR2

Edward Hutter and Edgar Solomonik 26/28

slide-63
SLIDE 63

Analysis and Future Work

CA-CQR2’s performance improvements over ScaLAPACK on Stampede2 range from 1.1x to 3.3x at 1024 nodes

CA-CQR2 leverages current and future architectural trends
  • machines with the highest ratio of peak node performance to peak injection bandwidth benefit most
  • the asymptotic communication reduction becomes increasingly evident as we scale, despite overheads in synchronization and computation

These results motivate targeting increasingly wide overdetermined systems, a critical use case in solving linear least squares and eigenvalue problems

Offloading computation to GPUs on XK nodes is a work in progress

1Our preprint detailing CA-CQR2 can be found at https://arxiv.org/abs/1710.08471
2Our C++ implementation can be found at https://github.com/huttered40/CA-CQR2

Edward Hutter and Edgar Solomonik 26/28

slide-64
SLIDE 64

Analysis and Future Work

CA-CQR2’s performance improvements over ScaLAPACK on Stampede2 range from 1.1x to 3.3x at 1024 nodes

CA-CQR2 leverages current and future architectural trends
  • machines with the highest ratio of peak node performance to peak injection bandwidth benefit most
  • the asymptotic communication reduction becomes increasingly evident as we scale, despite overheads in synchronization and computation

These results motivate targeting increasingly wide overdetermined systems, a critical use case in solving linear least squares and eigenvalue problems

Offloading computation to GPUs on XK nodes is a work in progress

Our study shows that communication-optimal parallel QR factorizations can achieve superior performance and scaling up to thousands of nodes1 2

1Our preprint detailing CA-CQR2 can be found at https://arxiv.org/abs/1710.08471
2Our C++ implementation can be found at https://github.com/huttered40/CA-CQR2

Edward Hutter and Edgar Solomonik 26/28

slide-65
SLIDE 65

Cyclops Tensor Framework (CTF)

https://github.com/cyclops-community/ctf MPI sparse/dense tensors + OpenMP and CUDA acceleration

Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
Tensor<float> T(order, is_sparse, dims, syms, ring, world);
T.read(...); T.write(...); T.slice(...); T.permute(...);

Edward Hutter and Edgar Solomonik 27/28

slide-66
SLIDE 66

Cyclops Tensor Framework (CTF)

https://github.com/cyclops-community/ctf MPI sparse/dense tensors + OpenMP and CUDA acceleration

Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
Tensor<float> T(order, is_sparse, dims, syms, ring, world);
T.read(...); T.write(...); T.slice(...); T.permute(...);

parallel contraction/summation/transformation of tensors

Z["abij"] += V["ijab"];                                      // C++
W["mnij"] += 0.5*W["mnef"]*T["efij"];                        // C++
M["ij"] += Function<>([](double x){ return 1/x; })(v["j"]);  // C++
W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")                   // Python
[Z,SC,C] = Z.i("abk").svd("abc","kc",rank)                   // Python
einsum("mnef,efij->mnij",W,T)                                // numpy-style Python
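For readers without CTF installed, the second contraction above maps directly onto NumPy's einsum. The sketch below is an illustrative serial equivalent with arbitrary (assumed) mode sizes, not CTF itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary small mode sizes, for illustration only
M, N, E, F, I, J = 3, 3, 4, 4, 2, 2

W   = rng.random((M, N, I, J))   # output tensor, indices "mnij"
Wmn = rng.random((M, N, E, F))   # input tensor,  indices "mnef"
T   = rng.random((E, F, I, J))   # input tensor,  indices "efij"

W0 = W.copy()
# CTF contraction W["mnij"] += 0.5*W["mnef"]*T["efij"], as a numpy contraction:
W += 0.5 * np.einsum("mnef,efij->mnij", Wmn, T)
```

CTF distributes exactly this kind of contraction over MPI ranks; its index-string notation is deliberately einsum-like.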

Edward Hutter and Edgar Solomonik 27/28

slide-67
SLIDE 67

Cyclops Tensor Framework (CTF)

https://github.com/cyclops-community/ctf MPI sparse/dense tensors + OpenMP and CUDA acceleration

Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
Tensor<float> T(order, is_sparse, dims, syms, ring, world);
T.read(...); T.write(...); T.slice(...); T.permute(...);

parallel contraction/summation/transformation of tensors

Z["abij"] += V["ijab"];                                      // C++
W["mnij"] += 0.5*W["mnef"]*T["efij"];                        // C++
M["ij"] += Function<>([](double x){ return 1/x; })(v["j"]);  // C++
W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")                   // Python
[Z,SC,C] = Z.i("abk").svd("abc","kc",rank)                   // Python
einsum("mnef,efij->mnij",W,T)                                // numpy-style Python

Cyclops applications (some using Blue Waters): tensor decomposition, tensor completion, tensor networks (DMRG), quantum chemistry, quantum circuit simulation, graph algorithms, bioinformatics

Edward Hutter and Edgar Solomonik 27/28

slide-68
SLIDE 68

Acknowledgements

I’d like to acknowledge the Department of Energy and Krell Institute for supporting this research through a DOE Computational Science Graduate Fellowship1

We’d also like to acknowledge NCSA and TACC for providing benchmarking resources
  • Texas Advanced Computing Center (TACC) via Stampede22
  • National Center for Supercomputing Applications (NCSA) via Blue Waters3

1Grant number DE-SC0019323
2Allocation TG-CCR180006
3Awards OCI-0725070 and ACI-1238993

Edward Hutter and Edgar Solomonik 28/28

slide-69
SLIDE 69

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-70
SLIDE 70

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

[Q, R] ← Cholesky-QR2 (A)

Z, R1 ← CQR(A)
Q, R2 ← CQR(Z)
R ← R2R1

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-71
SLIDE 71

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

[Q, R] ← Cholesky-QR2 (A)

Z, R1 ← CQR(A)
Q, R2 ← CQR(Z)
R ← R2R1

  • leverages near-perfect conditioning of Z in a second iteration1

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-72
SLIDE 72

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

[Q, R] ← Cholesky-QR2 (A)

Z, R1 ← CQR(A)
Q, R2 ← CQR(Z)
R ← R2R1

  • leverages near-perfect conditioning of Z in a second iteration1
  • A = ZR1 = QR2R1; from A^T A = R1^T Z^T Z R1 = R1^T R2^T Q^T Q R2 R1, where R2 corrects the initial R1

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-73
SLIDE 73

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

[Q, R] ← Cholesky-QR2 (A)

Z, R1 ← CQR(A)
Q, R2 ← CQR(Z)
R ← R2R1

  • leverages near-perfect conditioning of Z in a second iteration1
  • A = ZR1 = QR2R1; from A^T A = R1^T Z^T Z R1 = R1^T R2^T Q^T Q R2 R1, where R2 corrects the initial R1
  • numerical breakdown is still possible if the first iteration loses positive definiteness of A^T A, restricting stability to κ(A) ≤ 1/√ε

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-74
SLIDE 74

Conditional stability of Cholesky-QR2

The Cholesky-QR2 algorithm can achieve stability through iterative refinement1

[Q, R] ← Cholesky-QR2 (A)

Z, R1 ← CQR(A)
Q, R2 ← CQR(Z)
R ← R2R1

  • leverages near-perfect conditioning of Z in a second iteration1
  • A = ZR1 = QR2R1; from A^T A = R1^T Z^T Z R1 = R1^T R2^T Q^T Q R2 R1, where R2 corrects the initial R1
  • numerical breakdown is still possible if the first iteration loses positive definiteness of A^T A, restricting stability to κ(A) ≤ 1/√ε
  • Shifted Cholesky-QR2 can attain a stable factorization for any matrix with κ(A) ≤ 1/ε2
    • the eigenvalues of A^T A are shifted to prevent loss of positive definiteness
    • three Cholesky-QR iterations are required, essentially 3-6x more flops than Householder approaches
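The two-iteration scheme above can be sketched serially with NumPy. This is a minimal illustration of the numerics (not the parallel CA-CQR2 implementation), assuming a well-conditioned input so that the Cholesky factorizations succeed:

```python
import numpy as np

def cqr(A):
    """One Cholesky-QR iteration: A = Q R with R from chol(A^T A)."""
    G = A.T @ A                      # Gram matrix
    R = np.linalg.cholesky(G).T      # upper-triangular factor, G = R^T R
    Q = A @ np.linalg.inv(R)         # Q = A R^{-1}
    return Q, R

def cholesky_qr2(A):
    """Cholesky-QR2: a second iteration on the near-orthogonal Z."""
    Z, R1 = cqr(A)
    Q, R2 = cqr(Z)
    return Q, R2 @ R1                # R2 corrects the initial R1

rng = np.random.default_rng(1)
A = rng.random((500, 40))            # tall-skinny, modestly conditioned
Q, R = cholesky_qr2(A)
```

After the second iteration Q is orthogonal to near machine precision, which is the point of the refinement.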

  • 1Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 Algorithm", Electron. Trans. Numer. Anal., 2015
  • 2T. Fukaya et al., "Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices", arXiv, 2018

Edward Hutter and Edgar Solomonik 1/7

slide-75
SLIDE 75

CA-CQR2 building block #1 – 3D Matrix Multiplication

Figure: 3D algorithm for square matrix multiplication1 2 3

T_3D-MM(n, P) = O( log P · α + n²/P^(2/3) · β + n³/P · γ )

  • 1Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
  • 2Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
  • 3Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

Edward Hutter and Edgar Solomonik 2/7
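The cost expression is easy to evaluate term by term. The sketch below plugs in illustrative (assumed, not measured) values of α, β, γ to compare the latency, bandwidth, and compute terms:

```python
import math

def t_3d_mm(n, P, alpha=1e-6, beta=1e-9, gamma=1e-12):
    """Model cost of 3D matrix multiplication (leading terms only,
    O(.) constants omitted). alpha/beta/gamma are illustrative
    per-message, per-word, and per-flop costs, not measured values."""
    latency   = math.log2(P) * alpha          # log P messages
    bandwidth = (n**2 / P**(2.0 / 3.0)) * beta  # n^2 / P^(2/3) words
    compute   = (n**3 / P) * gamma            # n^3 / P flops
    return latency, bandwidth, compute

lat, bw, comp = t_3d_mm(n=4096, P=512)
```

Sweeping P in a model like this shows the α term shrinking only logarithmically while the β and γ terms fall polynomially, which is why latency eventually dominates at scale.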

slide-76
SLIDE 76

CA-CQR2 building block #2 – 3D CholeskyInverse

We can embed the recursive definitions of Cholesky factorization and triangular inverse to find matrices R, R−1 Tuning the recursion tree yields a tradeoff in horizontal bandwidth and synchronization1

[L, L^-1] ← CholeskyInverse(A)

  [L11, L11^-1] ← CholeskyInverse(A11)
  L21 ← A21 L11^-T
  [L22, L22^-1] ← CholeskyInverse(A22 − L21 L21^T)
  L21^-1 ← −L22^-1 L21 L11^-1

T_CholeskyInverse3D(n, P) = O( P^(2/3) log P · α + n²/P^(2/3) · β + n³/P · γ )

T_ScaLAPACK(n, P) = O( P log P · α + n²/√P · β + n³/P · γ )
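The 2 × 2 block recursion above can be sketched serially in NumPy (dense, no 3D distribution; purely an illustration of the recursive structure):

```python
import numpy as np

def cholesky_inverse(A):
    """Return (L, L^{-1}) for SPD A via the 2x2 block recursion."""
    n = A.shape[0]
    if n == 1:
        l = np.sqrt(A[0, 0])
        return np.array([[l]]), np.array([[1.0 / l]])
    k = n // 2
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11, L11i = cholesky_inverse(A11)
    L21 = A21 @ L11i.T                        # L21 = A21 L11^{-T}
    L22, L22i = cholesky_inverse(A22 - L21 @ L21.T)  # Schur complement
    L21i = -L22i @ L21 @ L11i                 # (L^{-1})_{21}
    L  = np.block([[L11,  np.zeros((k, n - k))], [L21,  L22]])
    Li = np.block([[L11i, np.zeros((k, n - k))], [L21i, L22i]])
    return L, Li

rng = np.random.default_rng(2)
X = rng.random((8, 8))
A = X @ X.T + 8 * np.eye(8)                   # well-conditioned SPD matrix
L, Li = cholesky_inverse(A)
```

In CA-CQR2 the same recursion runs on the n × n Gram matrix, and tuning where the recursion switches to the base case trades bandwidth against synchronization.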

  • 1A. Tiskin 2007, "Communication-efficient generic pairwise elimination"

Edward Hutter and Edgar Solomonik 3/7

slide-77
SLIDE 77

CA-CQR2 – Computation of Gram matrix

Figure: Start with a tunable c × d × c processor grid

Edward Hutter and Edgar Solomonik 4/7

slide-78
SLIDE 78

CA-CQR2 – Computation of Gram matrix

Figure: Broadcast columns of A

Cost: 2 log2 c · α + 2mn/(dc) · β

Edward Hutter and Edgar Solomonik 4/7

slide-79
SLIDE 79

CA-CQR2 – Computation of Gram matrix

Figure: Reduce contiguous groups of size c

Cost: 2 log2 c · α + 2n²/c² · β + n²/c² · γ

Edward Hutter and Edgar Solomonik 4/7

slide-80
SLIDE 80

CA-CQR2 – Computation of Gram matrix

Figure: Allreduce alternating groups of size d/c

Cost: 2 log2(d/c) · α + 2n²/c² · β + n²/c² · γ

Edward Hutter and Edgar Solomonik 4/7

slide-81
SLIDE 81

CA-CQR2 – Computation of Gram matrix

Figure: Broadcast missing pieces of B along depth

Cost: 2 log2 c · α + 2n²/c² · β
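Numerically, the steps above form the Gram matrix B = A^T A by summing local contributions A_i^T A_i over row blocks of A. A serial sketch of that reduction pattern (the c × d × c grid communication itself is elided):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 120, 8, 6              # p "processors", each owning m/p rows
A = rng.random((m, n))

# Each processor computes the Gram matrix of its local row block;
# an (all)reduce then sums the partial n x n results.
row_blocks = np.split(A, p, axis=0)
B = sum(Ai.T @ Ai for Ai in row_blocks)
```

Because each partial product is only n × n, the reduction moves O(n²) words per processor regardless of m, which is what makes the Gram-matrix route attractive for tall-skinny A.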

Edward Hutter and Edgar Solomonik 4/7

slide-82
SLIDE 82

CA-CQR2 – Computation of CholeskyInverse

Figure: d/c simultaneous 3D CholeskyInverse on cubes of dimension c

Cost: O( c² log c³ · α + n²/c² · β + n³/c³ · γ )

Edward Hutter and Edgar Solomonik 5/7

slide-83
SLIDE 83

CA-CQR2 – Computation of triangular solve

Figure: d/c simultaneous 3D matrix multiplication or TRSM on cubes of dimension c

Cost: O( log2 c³ · α + (mn/(dc) + (n² + nc)/c²) · β + n²m/(c²d) · γ )
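The TRSM step forms Q = A R^(-1) via a triangular solve rather than an explicit inverse. A serial NumPy sketch of one such solve (illustrative sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 16
A = rng.random((m, n))
R = np.linalg.cholesky(A.T @ A).T    # upper-triangular R from the Gram matrix

# TRSM: Q = A R^{-1}, computed as a triangular solve (R^T Q^T = A^T)
# instead of forming R^{-1} explicitly.
Q = np.linalg.solve(R.T, A.T).T
```

Each cube of the grid performs the equivalent solve on its local panel, so the γ cost scales with the panel size mn/(dc) times n.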

Edward Hutter and Edgar Solomonik 6/7

slide-84
SLIDE 84

Optimum cost of Cholesky-QR2 Tunable

The advantage of using a tunable grid lies in the ability to fit the shape of the grid to the shape of the rectangular m × n matrix A. Optimal communication is attained when the grid perfectly fits, or is proportional to, the dimensions of the matrix. We derive the cost for the optimal ratio m/d = n/c below.

Using the equation P = c²d and m/d = n/c, we solve for c and d in terms of m, n, P: c = (Pn/m)^(1/3), d = (Pm²/n²)^(1/3). Plugging these values into the cost of Cholesky-QR2 Tunable yields the optimal cost:

T_Cholesky-QR2-Tunable( m, n, (Pn/m)^(1/3), (Pm²/n²)^(1/3) )
  = O( (Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β + (n²m/P) · γ )    (1)

Grid shape   Metric             Cost
Optimal      # of messages      O( (Pn/m)^(2/3) log P )
             # of words         O( (n²m/P)^(2/3) )
             # of flops         O( n²m/P )
             Memory footprint   O( (n²m/P)^(2/3) )

Edward Hutter and Edgar Solomonik 7/7
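The derived grid dimensions can be sanity-checked numerically. The sketch below computes c and d for sample (assumed) values of m, n, P and verifies both defining relations; a real grid would round these to integer factors of P:

```python
# Optimal tunable-grid dimensions for an m x n matrix on P processors:
#   c = (P*n/m)^(1/3),  d = (P*m^2/n^2)^(1/3)
# which satisfy P = c^2 * d and the aspect-ratio condition m/d = n/c.

def optimal_grid(m, n, P):
    c = (P * n / m) ** (1.0 / 3.0)
    d = (P * m * m / (n * n)) ** (1.0 / 3.0)
    return c, d

# Sample (illustrative) dimensions matching the strong-scaling experiments
c, d = optimal_grid(m=2**19, n=2**11, P=2**12)
```

For m = 524288, n = 2048, P = 4096 this gives c ≈ 2.5 and d ≈ 645, i.e. a grid much deeper than it is wide, as expected for a tall-skinny matrix.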