
Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools
Qinglei Cao, Yu Pei, Thomas Herault, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra
ProTools 19

PaRSEC, task-based programming
• Focus on data dependencies, data flows, and tasks
• Don’t develop for an architecture but for a portability layer
• Let the runtime deal with the hardware characteristics
• But provide as much user control as possible
• StarSS, StarPU, Swift, ParalleX, Quark, Kaapi, DuctTeip, ..., and PaRSEC
[Figure: an App layered on a Runtime made of Data Distrib., Sched., Comm, Memory Manager, and Heterogeneity Manager]

PaRSEC
PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures.

PaRSEC
[Figure labels: Performance, Programming, Debug, Easy to Use]

PaRSEC Profiling Tools
Trace Collection Framework
• Sits at the core of the performance profiling system
• Events as identifiable entities
• Scalable for many-thread environments
  o One profiling stream for each thread
  o Additional helping threads in charge of I/O, memory allocators, compactors, …
  o Additional buffers allocated in advance
• The Trace Collection Framework is used within the PaRSEC runtime through the PaRSEC INStrumentation (PINS) interface
PINS: PaRSEC INStrumentation
• PINS registers callbacks for all the important steps of a task or communication life cycle
• Dynamically configurable to generate only the events pertinent to the run

PaRSEC Profiling Tools
Dependency Analysis
• It is necessary to connect information with the actual DAG of tasks
• Automatically generates DOT files, each with a partial view of the DAG
  o Collection of the DAG can be done offline
  o One DOT file per process, with tools to concatenate the different DOT files
Trace Conversion Tools
• The binary format of trace files is not exposed to the user, so a portable and exploitable file format is needed
• Hierarchical Data Format (HDF5), following the structure required by the popular pandas library
• Tools to take the generated trace and convert it into a Gantt chart
• Provides a library to read the generated DOT files into a NetworkX [29] representation

PaRSEC Profiling Tools
$> python
>>> import pandas as pd
>>> t = pd.HDFStore('dpotrf.h5')
>>> t.event_types
ACTIVATE_CB           6
Device delegate       1
MPI_ACTIVATE          2
MPI_DATA_CTL          3
MPI_DATA_PLD_RCV      5
MPI_DATA_PLD_SND      4
PUT_CB                7
TASK_MEMORY           0
potrf_dgemm           8
potrf_dpotrf         11
potrf_dsyrk           9
potrf_dtrsm          10
dtype: int64
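Once the event types are known, a natural next step is to total the time spent in each kernel. The following is a minimal sketch of that aggregation, not the deck's own script: the `events` table and its `type`, `begin`, and `end` columns are assumptions made for illustration, and the data is built by hand rather than read from `dpotrf.h5`. Only the four kernel ids are taken from the listing above.

```python
import pandas as pd

# Kernel ids copied from the event_types listing above.
event_types = pd.Series({"potrf_dgemm": 8, "potrf_dsyrk": 9,
                         "potrf_dtrsm": 10, "potrf_dpotrf": 11})

# Hypothetical stand-in for the converted trace; column names are assumed.
events = pd.DataFrame({
    "type":  [11, 10, 10, 9, 8, 11],
    "begin": [0.0, 1.0, 1.0, 2.5, 2.5, 4.0],   # assumed unit: seconds
    "end":   [1.0, 2.5, 2.0, 3.0, 4.0, 5.0],
})

# Map numeric ids back to kernel names, then aggregate durations per kernel.
id_to_name = {v: k for k, v in event_types.items()}
events["kernel"] = events["type"].map(id_to_name)
events["duration"] = events["end"] - events["begin"]
summary = events.groupby("kernel")["duration"].agg(["count", "sum", "mean"])
print(summary)
```

With a real trace, the hand-built frame would be replaced by whatever table the converted HDF5 file actually exposes.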

TLR Cholesky Factorization
The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form A = L L^T, where L is an N × N real lower triangular matrix with positive diagonal elements.
• Apparently dense matrices arise in scientific applications, such as climate/weather forecasting in computational statistics, seismic imaging in earth science, and structural and vibrational analysis in material science.
• Common properties:
  o Symmetric, positive-definite matrix
  o (Apparently) dense matrices
  o Often data-sparse: decay of parameter correlations with distance
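The definition can be checked numerically on a small example. The sketch below builds a squared-exponential correlation matrix on points of a line (the length scale 0.1 and the 1e-4 diagonal shift are arbitrary choices made here to keep the matrix safely positive definite), factorizes it with NumPy, and verifies the stated properties of L.

```python
import numpy as np

# Points on a line; correlation decays with squared distance.
n = 200
x = np.linspace(0.0, 1.0, n)
dist2 = (x[:, None] - x[None, :]) ** 2
# Assumed length scale 0.1 and a small diagonal shift for numerical safety.
A = np.exp(-dist2 / (2 * 0.1 ** 2)) + 1e-4 * np.eye(n)

L = np.linalg.cholesky(A)

is_lower = np.allclose(L, np.tril(L))             # L is lower triangular
has_positive_diagonal = bool(np.all(np.diag(L) > 0))
reconstructs = np.allclose(L @ L.T, A)            # A = L L^T

# Correlation between distant points relative to neighbouring points.
decay = A[0, n // 2] / A[0, 1]
print(is_lower, has_positive_diagonal, reconstructs, decay)
```

The small `decay` value is the "decay of correlations with distance" that makes such matrices data-sparse even though every entry is nonzero.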

TLR Cholesky Factorization
• Dense matrices might be compressed:
  o Cholesky factorization (for distributed-memory architectures)
  o Tile low rank (TLR) matrix format
  o Significantly less memory
  o Preserving the accuracy requirements of the scientific application
  o Huge performance improvement via cutting down flops
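To make the memory claim concrete, here is a minimal sketch of compressing one off-diagonal tile into a product U V^T. It uses a truncated SVD with a threshold relative to the largest singular value; that is a simple stand-in chosen for illustration, not necessarily the compression method used in the work, and `compress_tile` is a hypothetical helper name.

```python
import numpy as np

def compress_tile(tile, tol=1e-8):
    """Return (U, V) with tile ~= U @ V.T, dropping singular values <= tol * largest."""
    W, s, Vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))
    U = W[:, :k] * s[:k]      # fold the singular values into U
    V = Vt[:k, :].T
    return U, V

# One off-diagonal tile of a squared-exponential kernel matrix (assumed test problem).
nb = 128
x = np.linspace(0.0, 1.0, 2 * nb)
kernel = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))
tile = kernel[nb:, :nb]

U, V = compress_tile(tile)
rank = U.shape[1]
rel_err = np.linalg.norm(tile - U @ V.T) / np.linalg.norm(tile)
dense_words = tile.size          # nb * nb stored values
tlr_words = U.size + V.size      # 2 * nb * rank stored values
print(rank, rel_err, dense_words, tlr_words)
```

Storage drops from nb^2 to 2 * nb * rank values per tile, so the saving is real whenever the rank is below nb / 2.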

TLR Cholesky Factorization
for p = 1 to NT do
    POTRF(D(p,p))
    for i = p+1 to NT do
        TRSM(V(i,p), D(p,p))
    for j = p+1 to NT do
        LR_SYRK(D(j,j), U(j,p), V(j,p))
        for i = j+1 to NT do
            LR_GEMM(U(i,p), V(i,p), U(j,p), V(j,p), U(i,j), V(i,j), acc)

A serial and incompressible critical path of TLR Cholesky: (NT - 1) * (POTRF + TRSM + SYRK) + POTRF.

Kernel          Dense Cholesky    TLR Cholesky
POTRF           1/3 * nb^3        1/3 * nb^3
TRSM            nb^3              nb^2 * rank
SYRK/LR_SYRK    nb^3              2 * nb^2 * rank + 4 * nb * rank^2
GEMM/LR_GEMM    2 * nb^3          36 * nb * rank^2
Total           O(N^3)            O(N^2 * rank)
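The loop nest and the per-kernel counts in the table can be combined into a small operation-count model. The sketch below walks the same loops and adds up the tabulated costs; `cholesky_flops` is a hypothetical helper, and using one uniform rank for every off-diagonal tile is a simplifying assumption.

```python
def cholesky_flops(NT, nb, rank=None):
    """Sum per-kernel operation counts over the tile Cholesky loop nest.

    rank=None uses the dense counts from the table; otherwise the TLR counts
    with the same (assumed uniform) rank for every off-diagonal tile.
    Returns (total operations, operations on the critical path).
    """
    potrf = nb ** 3 / 3
    if rank is None:
        trsm, syrk, gemm = nb ** 3, nb ** 3, 2 * nb ** 3
    else:
        trsm = nb ** 2 * rank
        syrk = 2 * nb ** 2 * rank + 4 * nb * rank ** 2
        gemm = 36 * nb * rank ** 2

    total = 0.0
    for p in range(1, NT + 1):
        total += potrf
        for i in range(p + 1, NT + 1):
            total += trsm
        for j in range(p + 1, NT + 1):
            total += syrk
            for i in range(j + 1, NT + 1):
                total += gemm

    critical = (NT - 1) * (potrf + trsm + syrk) + potrf
    return total, critical

dense_total, dense_cp = cholesky_flops(10, 100)
tlr_total, tlr_cp = cholesky_flops(10, 100, rank=10)
print(dense_total, tlr_total, dense_cp, tlr_cp)
```

For the dense counts the sum collapses to N^3 / 3 with N = NT * nb, which matches the O(N^3) entry in the table and gives a quick sanity check on the model.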

State-of-the-art
[Figures: st-2D-sqexp; Syn-2D]
• Shaheen II, a Cray XC40 system with 6,174 compute nodes
• An accuracy threshold of 10^-8, which ultimately yields an absolute numerical error of order 10^-9

Experiments
PaRSEC and its instrumentation tools:
• Optimal tile size
• Band distribution
• Hierarchical POTRF
• Communication volume reduction
• Novel lookahead

Optimization 1: Optimal Tile Size
• Tile size plays a significant role in TLR Cholesky
• Operation balance between tiles on and off the critical path
• Assume N is the matrix size, node is the number of nodes, and k is the average rank of the off-diagonal tiles; then the best tile size nb can be approximated: [equation not preserved]
[Figure: Time (s) versus tile size; Syn-2D, 16 nodes, 2M]
[Figure: Approximated Optimal Tile Size and Experimental Optimal Tile Size versus Matrix Size (10^6); st-2D-sqexp, 256 nodes]
• How the profiling tools in PaRSEC come in:
  o Kernel execution time varies for each task, in terms of the number of operations
  o A special event, “ops_count”, gathers the operation counts for tasks on the critical path and tasks off the critical path
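The balance between on-path and off-path operations can be explored with a rough numerical sweep. The sketch below is not the approximation formula from the slide: it reuses the TLR per-kernel counts from the kernel table, assumes a constant rank k regardless of tile size, and scores each tile size by the larger of the critical-path operations and the off-path operations divided by an assumed core count. The function names and all numeric inputs are illustrative.

```python
def op_counts(N, nb, k):
    """Operations on and off the critical path for TLR Cholesky (uniform rank k)."""
    NT = N // nb
    potrf = nb ** 3 / 3
    trsm = nb ** 2 * k
    syrk = 2 * nb ** 2 * k + 4 * nb * k ** 2
    gemm = 36 * nb * k ** 2
    n_pairs = NT * (NT - 1) // 2                 # TRSM and LR_SYRK task count
    n_triples = NT * (NT - 1) * (NT - 2) // 6    # LR_GEMM task count
    total = NT * potrf + n_pairs * (trsm + syrk) + n_triples * gemm
    on_path = (NT - 1) * (potrf + trsm + syrk) + potrf
    return on_path, total - on_path

def best_tile_size(N, cores, k, candidates):
    """Pick the candidate minimizing max(serial path, perfectly shared off-path work)."""
    def cost(nb):
        on_path, off_path = op_counts(N, nb, k)
        return max(on_path, off_path / cores)
    return min(candidates, key=cost)

# Hypothetical inputs: matrix size, total cores, average rank, tile sizes to try.
candidates = [500, 1000, 2000, 3000, 4000, 8000]
best = best_tile_size(N=1_080_000, cores=512, k=50, candidates=candidates)
print(best)
```

Small tiles inflate the off-path LR_GEMM work while large tiles inflate the dense POTRF on the critical path, so the score has an interior minimum; in practice the rank also changes with the tile size, which this sketch ignores.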

Evaluation: Hybrid Data Distributions
• Imbalance: memory and computation
• PaRSEC’s profiling system can provide the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.
[Figure panels: 2DBCDD; Band distribution, band_size = 1; Band distribution, band_size = 2]

Evaluation: Hybrid Data Distributions

No. of Nodes    Matrix Size    Memory Reduced (GB)
16              1080000        4.374
16              2160000        8.748
16              4320000        17.496
64              2160000        5.103
64              4320000        10.206
64              6480000        20.412

• We use the memory event in the profiling system to detail the memory usage of both static matrix allocation and dynamic temporary buffers.
• PaRSEC’s profiling system also provides the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.
[Figure panels: Without the hybrid data distributions; With the hybrid data distributions]
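Extracting the per-process workload from task timings amounts to a grouped sum. The sketch below shows one way this could look in pandas; the column names `node_id`, `stream_id`, `begin`, and `end` and all values are assumptions for illustration, and the max-over-mean ratio is one common imbalance metric rather than the one used in the slides.

```python
import pandas as pd

# Hypothetical task records: one row per executed task (times in seconds).
events = pd.DataFrame({
    "node_id":   [0, 0, 0, 1, 1, 2, 2, 2, 3],
    "stream_id": [0, 1, 1, 0, 1, 0, 0, 1, 0],
    "begin":     [0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0],
    "end":       [4.0, 2.0, 5.0, 1.0, 2.0, 3.0, 6.0, 6.0, 2.0],
})
events["busy"] = events["end"] - events["begin"]

# Workload per thread and per process.
per_thread = events.groupby(["node_id", "stream_id"])["busy"].sum()
per_process = events.groupby("node_id")["busy"].sum()

# 1.0 means perfectly balanced; larger values mean one process dominates.
imbalance = per_process.max() / per_process.mean()
print(per_process)
print(imbalance)
```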

Evaluation: Reduce Communication Volume
• We used the PaRSEC tracing framework API to register a new, application-specific type of event, and at the execution of each task, we logged the rank of the tile on which the task was working.
• Once the trace was converted, we then wrote application-specific scripts to analyze the HDF5 file and produce the figures.
[Figure: Initial rank distributions (i.e., before factorization) on the left and the difference between initial and final ranks (i.e., after factorization) on the right; the matrix size is 1080K × 1080K, and the tile size is 2,700; top for the 2D problem, bottom for the 3D problem]
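An analysis script of this kind could reduce the per-task rank log to an initial-rank map and a rank-growth map. The sketch below assumes a hypothetical log layout (tile coordinates `m` and `n`, a timestamp, and the logged rank) with made-up values; the real event payload may be organized differently.

```python
import numpy as np
import pandas as pd

# Hypothetical rank log: one row each time a task touched tile (m, n).
log = pd.DataFrame({
    "m":    [1, 2, 2, 2, 1, 2],
    "n":    [0, 0, 1, 1, 0, 0],
    "time": [0.1, 0.2, 0.3, 0.9, 0.5, 0.7],
    "rank": [12, 6, 12, 15, 12, 8],
})

# First logged rank = initial rank, last logged rank = final rank.
per_tile = (log.sort_values("time")
               .groupby(["m", "n"])["rank"]
               .agg(["first", "last"]))
per_tile["growth"] = per_tile["last"] - per_tile["first"]

# Scatter the per-tile values into NT x NT grids (NaN where nothing was logged).
NT = 3
rows = per_tile.index.get_level_values("m").to_numpy()
cols = per_tile.index.get_level_values("n").to_numpy()
initial_map = np.full((NT, NT), np.nan)
growth_map = np.full((NT, NT), np.nan)
initial_map[rows, cols] = per_tile["first"].to_numpy()
growth_map[rows, cols] = per_tile["growth"].to_numpy()
print(initial_map)
print(growth_map)
```

The two grids correspond to the left and right panels described in the caption and can be passed directly to a heat-map plotting routine.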

Evaluation: Novel Lookahead
o We profiled the execution to ensure the critical path is respected, i.e., as soon as the data is ready, PaRSEC enables the critical tasks first.
o To be able to compute the average time it takes for data to be produced on one node and consumed on another, we need to connect the task termination, network activation, payload emission, and remote task execution events.
o This is provided by the PaRSEC profiling system through a combination of the trace information and the DOT file.
[Figure: Time between data being ready and TRSM starting for st-2D-sqexp. Left, without lookahead; right, with lookahead of 5; each point represents one TRSM; the matrix has 100 × 100 tiles]
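Connecting producer and consumer events comes down to joining task timings along dependency edges. The sketch below is a simplified illustration: it only links task termination to the start of the dependent task and skips the intermediate network events, and both tables (their column names, task labels, and timings) are made up. In practice the edges would come from the DOT files and the timings from the trace.

```python
import pandas as pd

# Hypothetical task timings taken from the trace (seconds), indexed by task label.
tasks = pd.DataFrame({
    "task":  ["POTRF(0)", "TRSM(1,0)", "TRSM(2,0)", "TRSM(3,0)"],
    "node":  [0, 1, 2, 0],
    "begin": [0.0, 1.4, 2.1, 1.0],
    "end":   [1.0, 2.0, 3.0, 1.5],
}).set_index("task")

# Hypothetical dependency edges, as they could be read from the DAG description.
edges = pd.DataFrame({
    "src": ["POTRF(0)", "POTRF(0)", "POTRF(0)"],
    "dst": ["TRSM(1,0)", "TRSM(2,0)", "TRSM(3,0)"],
})

# Data is ready when the producer ends; the consumer starts at its begin time.
edges["ready"] = edges["src"].map(tasks["end"])
edges["start"] = edges["dst"].map(tasks["begin"])
edges["remote"] = edges["src"].map(tasks["node"]) != edges["dst"].map(tasks["node"])
edges["delay"] = edges["start"] - edges["ready"]

mean_remote_delay = edges.loc[edges["remote"], "delay"].mean()
print(edges[["dst", "remote", "delay"]])
print(mean_remote_delay)
```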

Evaluation: Hierarchical POTRF
o We exploited the basic timing information produced by the tracing system, plus the statistical packages provided by pandas and NumPy, to compute our metrics: we compute the occupancy of the computational resources during the original run and then during the hierarchical POTRF run.
[Figure: Impact of hierarchical POTRF. Top: execution time, Time (s) versus Tile Size, on a single node, for h-POTRF and POTRF. Bottom: Occupancy versus Panel Range (1-100, 100-130, 131-160, 161-200), for h-POTRF and POTRF, of a 540K × 540K matrix on a 3 × 3 process grid with a tile size of 2,700]
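Resource occupancy can be derived from the same begin and end timestamps. The sketch below defines occupancy as total busy time divided by the available thread time over the makespan, which is one reasonable definition assumed here; the `occupancy` helper, the column names, and the two toy traces are illustrative and not measurements from the runs.

```python
import pandas as pd

def occupancy(events, n_threads):
    """Percentage of available thread time spent inside tasks over the makespan."""
    busy = (events["end"] - events["begin"]).sum()
    makespan = events["end"].max() - events["begin"].min()
    return 100.0 * busy / (n_threads * makespan)

# Toy traces on two threads (seconds): a long serial task versus a split one.
original = pd.DataFrame({"begin": [0.0, 0.0, 4.0],
                         "end":   [4.0, 1.0, 5.0]})
hierarchical = pd.DataFrame({"begin": [0.0, 0.0, 2.0, 2.0],
                             "end":   [2.0, 2.0, 3.0, 3.0]})

occ_original = occupancy(original, n_threads=2)
occ_hierarchical = occupancy(hierarchical, n_threads=2)
print(occ_original, occ_hierarchical)
```

Restricting `events` to the tasks of a given panel range before calling the helper would give per-range occupancy values like those in the bottom panel of the figure.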
