
Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools. Qinglei Cao, Yu Pei, Thomas Herault, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra. ProTools'19.


  1. Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools. Qinglei Cao, Yu Pei, Thomas Herault, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra. ProTools'19.

  2. PaRSEC, task-based programming
     • Focus on data dependencies, data flows, and tasks
     • Don't develop for an architecture but for a portability layer
     • Let the runtime deal with the hardware characteristics (scheduling, communication, data and memory management, distributed memory, heterogeneity)
     • But provide as much user control as possible
     • StarSS, StarPU, Swift, ParalleX, QUARK, Kaapi, DuctTeip, ..., and PaRSEC

  3. PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures.

  4. PaRSEC: Performance, Programming, Debug, Easy to Use.

  5. PaRSEC Profiling Tools
     • Trace Collection Framework
       o Sits at the core of the performance profiling system
       o Events as identifiable entities
       o Scalable for many-thread environments: one profiling stream for each thread; additional helping threads in charge of I/O, memory allocators, compactors, ...; additional buffers allocated in advance
     • PINS: PaRSEC INStrumentation
       o The Trace Collection Framework is used within the PaRSEC runtime through the PaRSEC INStrumentation (PINS) interface
       o PINS registers callbacks for all the important steps of a task or communication life cycle
       o Dynamically configurable to generate only the events pertinent to the run

  6. PaRSEC Profiling Tools
     • Dependency Analysis
       o It is necessary to connect the profiling information with the actual DAG of tasks
       o Automatically generate DOT files, each with a partial view of the DAG
       o Collection of the DAG can be done offline
       o One DOT file per process, with tools to concatenate the different DOT files
     • Trace Conversion
       o The binary format of trace files is not exposed to the user; it needs to be a portable and exploitable file format
       o Hierarchical Data Format (HDF5) following the structure required by the popular pandas library
     • Tools
       o Tools to take the generated trace and convert it into a Gantt chart
       o Provides a library to read the generated DOT files into a NetworkX [29] representation
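
As a generic illustration only (this is not the PaRSEC-provided reader), one of the generated DOT files could be loaded with NetworkX's own DOT reader and queried directly; the file name is hypothetical and the optional pydot package is required.

    import networkx as nx
    from networkx.drawing.nx_pydot import read_dot   # needs the optional pydot package

    # Load one per-process DOT file (hypothetical name) as a directed task graph.
    dag = nx.DiGraph(read_dot('potrf-rank0.dot'))
    print(dag.number_of_nodes(), 'tasks,', dag.number_of_edges(), 'dependencies')
    # Length of the longest dependency chain, a lower bound on the critical path.
    print('longest dependency chain:', nx.dag_longest_path_length(dag))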

  7. PaRSEC Profiling Tools
     $> python
     >>> import pandas as pd
     >>> t = pd.HDFStore('dpotrf.h5')
     >>> t.event_types
     ACTIVATE_CB          6
     Device delegate      1
     MPI_ACTIVATE         2
     MPI_DATA_CTL         3
     MPI_DATA_PLD_RCV     5
     MPI_DATA_PLD_SND     4
     PUT_CB               7
     TASK_MEMORY          0
     potrf_dgemm          8
     potrf_dpotrf        11
     potrf_dsyrk          9
     potrf_dtrsm         10
     dtype: int64
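
Building on the session above, a minimal sketch of how the same HDF5 store could be mined with pandas; it assumes the store also holds an events table with type, begin, and end columns, which may differ from the exact layout of a given PaRSEC trace.

    import pandas as pd

    with pd.HDFStore('dpotrf.h5') as t:
        types = t.event_types                  # Series: event name -> numeric type id
        events = t.events                      # assumed table of individual events

    # Map numeric type ids back to readable kernel/event names.
    id_to_name = {v: k for k, v in types.items()}
    events['name'] = events['type'].map(id_to_name)

    # Total time spent in each event type (begin/end assumed to share one time unit).
    per_type = (events['end'] - events['begin']).groupby(events['name']).sum()
    print(per_type.sort_values(ascending=False))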

  8. TLR Cholesky Factorization
     • The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form A = LL^T, where L is an N × N real lower triangular matrix with positive diagonal elements.
     • Apparently dense matrices arise in scientific applications, such as climate/weather forecasting in computational statistics, seismic imaging in earth science, and structural and vibrational analysis in material science.
     • Common properties:
       o Symmetric, positive-definite matrix
       o (Apparently) dense matrices
       o Often data-sparse: decay of parameter correlations with distance
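
A small numerical illustration (not from the slides) of the definition, using NumPy's dense Cholesky on a random symmetric positive-definite matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((6, 6))
    A = B @ B.T + 6 * np.eye(6)        # symmetric positive-definite by construction

    L = np.linalg.cholesky(A)          # lower triangular with positive diagonal
    print(np.allclose(A, L @ L.T))     # True: A = L L^T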

  9. TLR Cholesky Factorization
     • Dense matrices might be compressed:
       o Cholesky factorization (for distributed-memory architectures)
       o Tile low-rank (TLR) matrix format
       o Significantly less memory
       o Preserves the accuracy requirements of the scientific application
       o Huge performance improvement by cutting down flops
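
A minimal sketch (not the authors' compression kernel) of what the TLR format does to an off-diagonal tile: replace the nb x nb block by U V^T obtained from a truncated SVD, keeping only singular values above an accuracy threshold. The example tile and the threshold are illustrative.

    import numpy as np

    def compress_tile(tile, tol=1e-8):
        """Return U (nb x k) and V (nb x k) with tile ~= U @ V.T up to ~tol."""
        u, s, vt = np.linalg.svd(tile, full_matrices=False)
        k = max(1, int(np.sum(s > tol * s[0])))     # numerical rank after truncation
        U = u[:, :k] * s[:k]                        # absorb singular values into U
        V = vt[:k, :].T
        return U, V

    # A smooth, well-separated interaction compresses to a small rank.
    nb = 256
    x = np.linspace(0.0, 1.0, nb)
    tile = np.exp(-np.abs(x[:, None] - (x[None, :] + 2.0)))
    U, V = compress_tile(tile)
    rank = U.shape[1]
    print('rank:', rank)
    print('relative error:', np.linalg.norm(tile - U @ V.T) / np.linalg.norm(tile))
    print('storage vs dense:', 2 * nb * rank / nb**2)   # fraction of nb^2 doubles kept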

  10. TLR Cholesky Factorization

     for p = 1 to NT do
         POTRF(D(p,p))
         for i = p+1 to NT do
             TRSM(V(i,p), D(p,p))
         for j = p+1 to NT do
             LR_SYRK(D(j,j), U(j,p), V(j,p))
             for i = j+1 to NT do
                 LR_GEMM(U(i,p), V(i,p), U(j,p), V(j,p), U(i,j), V(i,j), acc)

     A serial and incompressible critical path of TLR Cholesky: (NT - 1) * (POTRF + TRSM + SYRK) + POTRF.

     Kernel            Dense Cholesky    TLR Cholesky
     POTRF             1/3 * nb^3        1/3 * nb^3
     TRSM              nb^3              nb^2 * rank
     SYRK / LR_SYRK    nb^3              2 * nb^2 * rank + 4 * nb * rank^2
     GEMM / LR_GEMM    2 * nb^3          36 * nb * rank^2
     Total             O(N^3)            O(N^2 * rank)
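
To make the loop nest above concrete, here is a runnable sketch of the same structure on dense tiles with NumPy; the real algorithm replaces the SYRK and GEMM updates with the low-rank LR_SYRK/LR_GEMM kernels acting on the U/V factors, and tiled_cholesky is an illustrative name, not the authors' code.

    import numpy as np

    def tiled_cholesky(A, nb):
        """In-place, right-looking tiled Cholesky of a symmetric positive-definite A."""
        NT = A.shape[0] // nb
        T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]   # view of tile (i, j)
        for p in range(NT):
            T(p, p)[:] = np.linalg.cholesky(T(p, p))                   # POTRF
            for i in range(p + 1, NT):
                T(i, p)[:] = np.linalg.solve(T(p, p), T(i, p).T).T     # TRSM
            for j in range(p + 1, NT):
                T(j, j)[:] -= T(j, p) @ T(j, p).T                      # SYRK
                for i in range(j + 1, NT):
                    T(i, j)[:] -= T(i, p) @ T(j, p).T                  # GEMM
        return np.tril(A)

    rng = np.random.default_rng(1)
    B = rng.standard_normal((8, 8))
    A = B @ B.T + 8 * np.eye(8)
    L = tiled_cholesky(A.copy(), nb=2)
    print(np.allclose(L @ L.T, A))     # True: same result as a dense factorization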

  11. State-of-the-art
     • Shaheen II, a Cray XC40 system with 6,174 compute nodes;
     • An accuracy threshold of 10^-8, which ultimately yields an absolute numerical error of order 10^-9.
     [Figure: results for the st-2D-sqexp and Syn-2D test problems]

  12. Experiments: PaRSEC and its instrumentation tools guide the following optimizations: optimal tile size, band distribution, hierarchical POTRF, communication volume reduction, and novel lookahead.

  13. Optimization 1: Optimal Tile Size
     • Tile size plays a significant role in TLR Cholesky: it governs the operation balance between tiles on and off the critical path.
     • Assume N is the matrix size, node is the number of nodes, and k is the average rank of the off-diagonal tiles; then the best tile size nb can be approximated from these parameters.
     • Where the profiling tools in PaRSEC get in:
       o kernel execution time varies for each task, in terms of the number of operations
       o a special event "ops_count" gathers the operation counts for tasks on the critical path and tasks off the critical path
     [Figures: time vs. tile size for Syn-2D on 16 nodes with a 2M matrix; approximated vs. experimental optimal tile size for st-2D-sqexp on 256 nodes as the matrix size grows from 1.08M to 15.12M]
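
As an illustration of the balance idea only (this is not the formula from the paper), the per-kernel flop counts from the table on slide 10 can be used to scan candidate tile sizes and pick the one where the serial critical-path work is closest to the per-core share of the off-critical-path LR_GEMM work; the function name, candidate range, and example inputs are assumptions.

    def approx_best_tile_size(N, cores, rank, candidates=range(500, 10001, 100)):
        """Pick the candidate nb that best balances on- and off-critical-path work."""
        def imbalance(nb):
            NT = max(N // nb, 1)
            # serial critical path: roughly NT * (POTRF + TRSM + LR_SYRK) flops
            critical = NT * (nb**3 / 3 + nb**2 * rank
                             + 2 * nb**2 * rank + 4 * nb * rank**2)
            # all LR_GEMM updates (~NT^3/6 of them), spread over every core
            off_critical = (NT**3 / 6) * (36 * nb * rank**2) / cores
            return abs(critical - off_critical)
        return min(candidates, key=imbalance)

    # Example inputs only: a 2.16M matrix on 16 nodes x 32 cores, average rank ~60.
    print(approx_best_tile_size(N=2_160_000, cores=16 * 32, rank=60))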

  14. Evaluation: Hybrid Data Distributions
     • Imbalance: memory and computation.
     • PaRSEC's profiling system provides the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.
     [Figures: 2DBCDD; band distribution with band_size = 1; band distribution with band_size = 2]
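
A minimal sketch of the load-balance extraction described here (the events table and its node_id/begin/end columns are assumptions about the converted trace): sum the busy time of all task events per process.

    import pandas as pd

    with pd.HDFStore('dpotrf.h5') as t:
        events = t.events                      # assumed per-event table

    busy = (events['end'] - events['begin']).groupby(events['node_id']).sum()
    print(busy)                                # busy time per process
    print('imbalance factor:', busy.max() / busy.mean())   # 1.0 = perfectly balanced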

  15. Evaluation: Hybrid Data Distributions
     • We use the memory event in the profiling system to detail the memory usage of both the static matrix allocation and the dynamic temporary buffers.
     • PaRSEC's profiling system also provides the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.

     No. of Nodes    Matrix Size    Memory Reduced (GB)
     16              1,080,000       4.374
     16              2,160,000       8.748
     16              4,320,000      17.496
     64              2,160,000       5.103
     64              4,320,000      10.206
     64              6,480,000      20.412

     [Figures: workload per process without and with the hybrid data distributions]
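
A sketch of how the memory event could be turned into a peak-usage figure, assuming each TASK_MEMORY event carries a signed size field (positive for an allocation, negative for a release); the column names are assumptions.

    import pandas as pd

    with pd.HDFStore('dpotrf.h5') as t:
        types = t.event_types
        events = t.events

    mem = events[events['type'] == types['TASK_MEMORY']].sort_values('begin')
    # Running sum of signed allocation sizes -> live footprint; its max is the peak.
    peak_per_node = mem.groupby('node_id')['size'].apply(lambda s: s.cumsum().max())
    print(peak_per_node / 1e9)                 # peak dynamic memory per node, in GB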

  16. Evaluation: Reduce Communication Volume
     • We used the PaRSEC tracing framework API to register a new, application-specific type of event, and at the execution of each task we logged the rank of the tile on which the task was working.
     • Once the trace was converted, we wrote application-specific scripts to analyze the HDF5 file and produce the figures.
     [Figures: initial rank distributions (i.e., before factorization) on the left and the difference between initial and final ranks (i.e., after factorization) on the right; the matrix size is 1080K × 1080K and the tile size is 2,700; top: 2D problem, bottom: 3D problem]
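
A sketch of the kind of application-specific post-processing described here; it assumes the converted events table exposes the logged rank together with tile coordinates (tile_i, tile_j), which are assumptions about how the custom event was stored.

    import pandas as pd

    with pd.HDFStore('dpotrf.h5') as t:
        events = t.events.sort_values('begin')

    ranks = events.dropna(subset=['rank'])              # only events that logged a rank
    per_tile = ranks.groupby(['tile_i', 'tile_j'])['rank']
    growth = per_tile.last() - per_tile.first()         # final minus initial rank
    print(growth.describe())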

  17. Evaluation: Novel Lookahead
     • We profiled the execution to ensure the critical path is respected, i.e., as soon as the data is ready, PaRSEC enables the critical tasks first.
     • To compute the average time it takes for data to be produced on one node and consumed on another, we need to connect the task termination, network activation, payload emission, and remote task execution events.
     • This is provided by the PaRSEC profiling system through a combination of the trace information and the DOT file.
     [Figure: time between data becoming ready and the TRSM starting, for st-2D-sqexp; left, without lookahead; right, with a lookahead of 5; each point represents one TRSM; the matrix has 100 × 100 tiles]
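
A sketch of the metric behind the figure, assuming the trace and the DOT dependency information have already been joined into one table that records, for every TRSM, when its input data became available (data_ready) and when the task started (begin); both column names are assumptions.

    import pandas as pd

    def trsm_wait_stats(trsm: pd.DataFrame) -> pd.Series:
        """Distribution of the time each TRSM waits after its input data is ready."""
        wait = trsm['begin'] - trsm['data_ready']
        return wait.describe()   # small values mean the lookahead keeps the path hot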

  18. Evaluation: Hierarchical POTRF
     • We exploited the basic timing information produced by the tracing system, plus the statistical packages provided by pandas and NumPy, to compute our metrics: the occupancy of the computational resources during the original run and then during the hierarchical POTRF run.
     [Figures: impact of hierarchical POTRF; top, execution time vs. tile size on a single node (h-POTRF vs. POTRF); bottom, occupancy per panel range for a 540K × 540K matrix on a 3 × 3 process grid with a tile size of 2,700]
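
A sketch of the occupancy metric described here (column names and the file name are assumptions): the fraction of available worker-thread time actually spent inside tasks.

    import pandas as pd

    with pd.HDFStore('hpotrf.h5') as t:            # hypothetical converted trace
        events = t.events

    busy = (events['end'] - events['begin']).sum()
    workers = events['stream_id'].nunique()        # worker streams seen in the trace (single-node run assumed)
    makespan = events['end'].max() - events['begin'].min()
    print('occupancy: %.1f%%' % (100.0 * busy / (workers * makespan)))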
