Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools




slide-1
SLIDE 1

Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools

Qinglei Cao, Yu Pei, Thomas Herault, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra

Protools19

slide-2
SLIDE 2

PaRSEC, task-based programming

  • Focus on data dependencies, data flows, and tasks
  • Don’t develop for an architecture but for a portability layer
  • Let the runtime deal with the hardware characteristics
  • But provide as much user control as possible
  • StarSS, StarPU, Swift, Parallex, Quark, Kaapi, DuctTeip, ..., and PaRSEC

[Diagram: App; Runtime (Data Distrib., Sched., Comm, Memory Manager, Heterogeneity Manager)]

slide-3
SLIDE 3

PaRSEC

PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures.

slide-4
SLIDE 4

PaRSEC

Easy to Use

Programming Performance Debug


slide-5
SLIDE 5

PaRSEC Profiling Tools

Trace Collection Framework

  • Sits at the core of the performance profiling system
  • Events as identifiable entities
  • Scalable for many-thread environments
  • One profiling stream for each thread
  • Additional helping threads in charge of I/O, memory allocators, compactors, …
  • Additional buffers allocated in advance

PINS: PaRSEC INStrumentation

  • The Trace Collection Framework is used within the PaRSEC runtime through the PaRSEC INStrumentation (PINS) interface
  • PINS registers callbacks for all the important steps of a task or communication life cycle
  • Dynamically configurable to generate only the events pertinent to the run

slide-6
SLIDE 6

PaRSEC Profiling Tools

Dependency Analysis

  • It is necessary to connect information with the actual DAG of tasks
  • Automatically generates DOT files, each with a partial view of the DAG
  • Collection of the DAG can be done offline
  • One DOT file per process, with tools to concatenate the different DOT files

Trace Conversion Tools

  • The binary format of the trace files is not exposed to the user; traces need to be converted to a portable and exploitable file format
  • Hierarchical Data Format (HDF5), following the structure required by the popular pandas library
  • Tools to take the generated trace and convert it into a Gantt chart
  • Provides a library to read the generated DOT files into a NetworkX [29] representation
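The idea of concatenating per-process DOT files and then querying the resulting DAG can be sketched in plain Python. This is a toy illustration, not the PaRSEC tooling: the two DOT fragments, the task names, and the helper functions `merge_dot` and `longest_chain` are all hypothetical, and the standard-library `graphlib` stands in for NetworkX.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical per-process DOT fragments. Real PaRSEC DOT files carry
# node attributes that this toy parser simply ignores.
DOT_RANK0 = 'digraph G { "POTRF(0)" -> "TRSM(1,0)"; "TRSM(1,0)" -> "SYRK(1,0)"; }'
DOT_RANK1 = 'digraph G { "POTRF(0)" -> "TRSM(2,0)"; "SYRK(1,0)" -> "POTRF(1)"; }'

# Matches one quoted edge of the form "src" -> "dst".
EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def merge_dot(fragments):
    """Merge the edges of all fragments into {task: set(predecessors)}."""
    preds = {}
    for text in fragments:
        for src, dst in EDGE.findall(text):
            preds.setdefault(src, set())
            preds.setdefault(dst, set()).add(src)
    return preds


def longest_chain(preds):
    """Number of tasks on the longest dependency chain of the DAG."""
    depth = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(preds).static_order():
        depth[task] = 1 + max((depth[p] for p in preds[task]), default=0)
    return max(depth.values())


dag = merge_dot([DOT_RANK0, DOT_RANK1])
print(len(dag), "tasks, longest chain:", longest_chain(dag))
```

Each fragment holds only a partial view of the DAG, so the chain through `POTRF(1)` becomes visible only after the fragments are merged.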


slide-7
SLIDE 7

PaRSEC Profiling Tools

$> python
>>> import pandas as pd
>>> t = pd.HDFStore('dpotrf.h5')
>>> t.event_types
ACTIVATE_CB          6
Device delegate      1
MPI_ACTIVATE         2
MPI_DATA_CTL         3
MPI_DATA_PLD_RCV     5
MPI_DATA_PLD_SND     4
PUT_CB               7
TASK_MEMORY          0
potrf_dgemm          8
potrf_dpotrf        11
potrf_dsyrk          9
potrf_dtrsm         10
dtype: int64
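A first step when analyzing such a trace is often to separate the application kernels from the runtime and communication events. The sketch below copies the name-to-id mapping from the session above into a plain dictionary; the helper `split_events` and the assumption that every kernel event of this run carries the `potrf_` prefix are illustrative, not part of the PaRSEC tools.

```python
# Event-name -> event-id mapping copied from the session above.
EVENT_TYPES = {
    "ACTIVATE_CB": 6, "Device delegate": 1, "MPI_ACTIVATE": 2,
    "MPI_DATA_CTL": 3, "MPI_DATA_PLD_RCV": 5, "MPI_DATA_PLD_SND": 4,
    "PUT_CB": 7, "TASK_MEMORY": 0, "potrf_dgemm": 8, "potrf_dpotrf": 11,
    "potrf_dsyrk": 9, "potrf_dtrsm": 10,
}


def split_events(event_types, kernel_prefix="potrf_"):
    """Split event names into (kernel, other) lists.

    ASSUMPTION: kernel events are recognized by their name prefix.
    """
    kernels = sorted(n for n in event_types if n.startswith(kernel_prefix))
    others = sorted(n for n in event_types if not n.startswith(kernel_prefix))
    return kernels, others


kernels, others = split_events(EVENT_TYPES)
print("kernel events:", kernels)
print("runtime/communication events:", others)
```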

slide-8
SLIDE 8

TLR Cholesky Factorization

The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form A = LL^T, where L is an N × N real lower triangular matrix with positive diagonal elements.
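As a small worked instance of A = LL^T, the following sketch factors a 3 × 3 symmetric positive-definite matrix with a textbook, unblocked Cholesky routine in plain Python. The function name `cholesky` and the test matrix are chosen for illustration; this is the dense scalar algorithm, not the tiled or low-rank variant discussed in these slides.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L * L^T (a given as a list of rows)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what is left of a[j][j] after earlier columns.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

For this matrix the factor is L = [[2, 0, 0], [6, 1, 0], [-8, 5, 3]], with a positive diagonal as the definition requires.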

  • Apparently dense matrices arise in scientific applications, such as climate/weather forecasting in computational statistics, seismic imaging in earth science, and structural and vibrational analysis in material science.
  • Common properties:
      • Symmetric, positive-definite matrix
      • (Apparently) dense matrices
      • Often data-sparse: parameter correlations decay with distance

slide-9
SLIDE 9

TLR Cholesky Factorization

  • Dense matrices might be compressed:
      • Cholesky factorization (for distributed-memory architectures)
      • Tile low rank (TLR) matrix format
      • Significantly less memory
      • Preserving the accuracy requirements of the scientific application
      • Huge performance improvement via cutting down flops
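The memory saving can be estimated with a back-of-the-envelope sketch. It assumes that each off-diagonal tile is stored as two nb × rank factors U and V, that diagonal tiles stay dense, that only the lower triangle is stored, and that every off-diagonal tile has the same rank. The rank of 50 is a made-up value; nt = 400 and nb = 2,700 correspond to a 1080K × 1080K matrix.

```python
def dense_lower_bytes(nt, nb, word=8):
    """Bytes for the lower triangle of an nt x nt grid of dense nb x nb tiles."""
    return nt * (nt + 1) // 2 * nb * nb * word


def tlr_lower_bytes(nt, nb, rank, word=8):
    """Bytes when diagonal tiles are dense and off-diagonal tiles are U, V factors.

    ASSUMPTION: every off-diagonal tile has the same rank.
    """
    diag = nt * nb * nb
    offdiag = nt * (nt - 1) // 2 * 2 * nb * rank
    return (diag + offdiag) * word


nt, nb, rank = 400, 2700, 50  # rank is a hypothetical value
dense = dense_lower_bytes(nt, nb)
tlr = tlr_lower_bytes(nt, nb, rank)
print(f"dense: {dense / 1e9:.1f} GB, TLR: {tlr / 1e9:.1f} GB, ratio: {dense / tlr:.1f}x")
```

The ratio grows as the rank shrinks relative to nb, which is why data-sparse matrices benefit most.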

slide-10
SLIDE 10

TLR Cholesky Factorization

Kernel          Dense Cholesky   TLR Cholesky
POTRF           1/3 * nb^3       1/3 * nb^3
TRSM            nb^3             nb^2 * rank
SYRK/LR_SYRK    nb^3             2 * nb^2 * rank + 4 * nb * rank^2
GEMM/LR_GEMM    2 * nb^3         36 * nb * rank^2
Total           O(N^3)           O(N^2 * rank)

for p = 1 to NT do
    POTRF(D(p,p))
    for i = p+1 to NT do
        TRSM(V(i,p), D(p,p))
    for j = p+1 to NT do
        LR_SYRK(D(j,j), U(j,p), V(j,p))
        for i = j+1 to NT do
            LR_GEMM(U(i,p), V(i,p), U(j,p), V(j,p), U(i,j), V(i,j), acc)

A serial and incompressible critical path of TLR Cholesky: (NT - 1) * (POTRF + TRSM + SYRK) + POTRF.
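The kernel costs and the critical-path expression above can be put into numbers with a short sketch. The per-kernel operation counts are taken from the TLR column of the table and the task counts come from walking the loop nest; the function names and the sample values of nt, nb, and rank are illustrative assumptions, and a single uniform rank is assumed for all tiles.

```python
def tlr_flops(nb, rank):
    """Operation count of each TLR kernel, from the table above."""
    return {
        "POTRF": nb ** 3 / 3,
        "TRSM": nb ** 2 * rank,
        "LR_SYRK": 2 * nb ** 2 * rank + 4 * nb * rank ** 2,
        "LR_GEMM": 36 * nb * rank ** 2,
    }


def task_counts(nt):
    """Count the tasks generated by the loop nest above."""
    counts = {"POTRF": 0, "TRSM": 0, "LR_SYRK": 0, "LR_GEMM": 0}
    for p in range(1, nt + 1):
        counts["POTRF"] += 1
        counts["TRSM"] += nt - p              # i = p+1 .. NT
        for j in range(p + 1, nt + 1):
            counts["LR_SYRK"] += 1
            counts["LR_GEMM"] += nt - j       # i = j+1 .. NT
    return counts


def critical_path_flops(nt, nb, rank):
    """(NT - 1) * (POTRF + TRSM + SYRK) + POTRF."""
    f = tlr_flops(nb, rank)
    return (nt - 1) * (f["POTRF"] + f["TRSM"] + f["LR_SYRK"]) + f["POTRF"]


nt, nb, rank = 400, 2700, 50  # hypothetical sample values
flops, counts = tlr_flops(nb, rank), task_counts(nt)
total = sum(flops[k] * counts[k] for k in flops)
share = critical_path_flops(nt, nb, rank) / total
print(f"critical path holds {share:.1%} of all operations")
```

Because the dense POTRF cost grows as nb^3 while the low-rank kernels grow more slowly, the share of work on the critical path depends strongly on the tile size.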

slide-11
SLIDE 11

State-of-the-art

  • Shaheen II, a Cray XC40 system, which has 6,174 compute nodes;
  • The accuracy threshold of 10^-8, which ultimately yields absolute numerical error of order 10^-9;

[Figures: st-2D-sqexp and Syn-2D]

slide-12
SLIDE 12

Experiments

PaRSEC and its instrumentation tools

  • Optimal tile size
  • Band distribution
  • Communication volume reduction
  • Novel lookahead
  • Hierarchical POTRF

slide-13
SLIDE 13

Optimization 1: Optimal Tile Size

[Figure: time (s) versus tile size; Syn-2D, 16 nodes, 2M]

[Figure: approximated optimal and experimental optimal tile size versus matrix size (10^6); st-2D-sqexp, 256 nodes]

  • Tile size plays a significant role in TLR Cholesky
  • The profiling tools in PaRSEC show that:
      • Kernel execution time varies for each task, in terms of the number of operations
      • A special event “ops_count” is set to gather the operation counts for tasks on the critical path and tasks off the critical path
  • Operation balance between tiles on and off the critical path
  • Assume N is the matrix size, node is the number of nodes, and k is the average rank of the off-diagonal tiles; then the best tile size nb can be approximated: [formula not preserved in this transcript]

slide-14
SLIDE 14

Evaluation: Hybrid Data Distributions

  • Imbalance: memory and computation
  • PaRSEC’s profiling system could provide the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.

[Figure panels: 2DBCDD; band distribution, band_size = 1; band distribution, band_size = 2]
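Why a plain 2D block-cyclic layout can be imbalanced for TLR Cholesky shows up in a small ownership sketch: on a square p × q process grid, the dense diagonal tiles land on only a few ranks. The 2D block-cyclic mapping below is the standard one. The band mapping is an assumption made for illustration only (tiles within `band_size` of the diagonal are dealt round-robin over all ranks); the slides do not spell out the actual band distribution.

```python
from collections import Counter


def owner_2dbc(i, j, p, q):
    """Rank owning tile (i, j) under a 2D block-cyclic layout on a p x q grid."""
    return (i % p) * q + (j % q)


def owner_band(i, j, p, q, band_size=1):
    """ASSUMED band layout: tiles near the diagonal go round-robin over all ranks."""
    if abs(i - j) < band_size:
        return j % (p * q)
    return owner_2dbc(i, j, p, q)


def diagonal_load(nt, owner, **grid):
    """Number of diagonal (dense) tiles owned by each rank."""
    return Counter(owner(k, k, **grid) for k in range(nt))


load_2dbc = diagonal_load(18, owner_2dbc, p=3, q=3)
load_band = diagonal_load(18, owner_band, p=3, q=3)
print("2D block-cyclic:", dict(sorted(load_2dbc.items())))
print("band (assumed): ", dict(sorted(load_band.items())))
```

With 18 diagonal tiles on a 3 × 3 grid, the block-cyclic layout gives six dense tiles each to ranks 0, 4, and 8 and none to the others, while the assumed band layout gives two to every rank.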

slide-15
SLIDE 15

Evaluation: Hybrid Data Distributions

No. of Nodes   Matrix Size   Memory Reduced (GB)
16             1080000       4.374
16             2160000       8.748
16             4320000       17.496
64             2160000       5.103
64             4320000       10.206
64             6480000       20.412

[Figure panels: without the hybrid data distributions; with the hybrid data distributions]

  • We use the event memory in the profiling system to detail memory usage of both static matrix allocation and dynamic temporary buffers.
  • PaRSEC’s profiling system also provides the execution time for each process, as well as each thread, from which we extract the workload for each process to show load balancing.

slide-16
SLIDE 16

Evaluation: Reduce Communication Volume

[Figure: initial rank distributions (i.e., before factorization) on the left and the difference between initial and final ranks (i.e., after factorization) on the right; the matrix size is 1080K × 1080K, and the tile size is 2,700; top for the 2D problem, bottom for the 3D problem]

  • We used the PaRSEC tracing framework API to register a new, application-specific type of event, and at the execution of each task, we logged the rank of the tile on which the task was working.
  • Once the trace was converted, we then wrote application-specific scripts to analyze the HDF5 file and produce the figures.
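An analysis script of this kind might look like the following sketch, which computes the difference between the first and last rank logged for each tile. The record layout `(tile_row, tile_col, rank)` in execution order, the sample values, and the function `rank_change` are hypothetical; the real scripts read the converted HDF5 trace, which is not reproduced here.

```python
def rank_change(records):
    """Map each tile to (last logged rank - first logged rank).

    ASSUMPTION: records are (tile_row, tile_col, rank) in execution order,
    so the first record of a tile holds its initial rank.
    """
    first, last = {}, {}
    for tile_row, tile_col, rank in records:
        key = (tile_row, tile_col)
        first.setdefault(key, rank)   # keep only the earliest rank
        last[key] = rank              # overwrite with the latest rank
    return {key: last[key] - first[key] for key in first}


# Hypothetical log: tile (2, 1) is touched twice and its rank grows.
records = [(1, 0, 12), (2, 0, 9), (2, 1, 10), (2, 1, 14)]
delta = rank_change(records)
print(delta)
```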

slide-17
SLIDE 17

Evaluation: Novel Lookahead

[Figure: time between the data being ready and the start of TRSM for st-2D-sqexp. Left, without lookahead; right, with lookahead of 5; each point represents one TRSM; the matrix has 100 × 100 tiles]

  • We profiled the execution to ensure the critical path is respected, i.e., as soon as the data is ready PaRSEC enables the critical tasks first.
  • To be able to compute the average time it takes for data to be produced on one node and consumed on another, we need to connect the task termination, network activation, payload emission, and remote task execution events.
  • This is provided by the PaRSEC profiling system through a combination of the trace information and the DOT file.
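Once the four events of one producer-to-consumer transfer have been linked, the average latency is a simple reduction. In the sketch below each transfer is a dictionary of timestamps in seconds; the key names, the sample values, and the helper `stage_means` are hypothetical, and linking the events (which in practice needs the trace together with the DOT file) is assumed to have been done already.

```python
from statistics import mean

# Order of the events along one transfer (hypothetical key names).
STAGES = ["task_end", "activate", "payload_sent", "remote_start"]


def stage_means(transfers):
    """Mean duration of each consecutive stage and of the whole transfer."""
    result = {}
    for a, b in zip(STAGES, STAGES[1:]):
        result[f"{a}->{b}"] = mean(t[b] - t[a] for t in transfers)
    result["total"] = mean(t[STAGES[-1]] - t[STAGES[0]] for t in transfers)
    return result


# Two made-up transfers with timestamps in seconds.
transfers = [
    {"task_end": 1.0, "activate": 1.25, "payload_sent": 1.5, "remote_start": 2.0},
    {"task_end": 3.0, "activate": 3.25, "payload_sent": 4.0, "remote_start": 4.5},
]
means = stage_means(transfers)
print(means)
```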

slide-18
SLIDE 18

Evaluation: Hierarchical POTRF

[Figure: Impact of Hierarchical POTRF: top, execution time (s) versus tile size on a single node; bottom, resource occupancy versus panel range for a 540K × 540K matrix on a 3 × 3 process grid with a tile size of 2,700; series: h-POTRF and POTRF]

  • We exploited the basic timing information produced by the tracing system
  • Plus statistical packages provided by pandas and NumPy to compute our metrics: we compute the occupancy of the computational resources during the original run and then during the hierarchical POTRF run.
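One common way to define occupancy from basic timing information is busy time divided by available thread time. The sketch below uses that definition on made-up `(thread, begin, end)` records; the exact metric used for the figure is not given on the slide, and plain Python is used here in place of pandas and NumPy.

```python
def occupancy(tasks, n_threads):
    """Fraction of available thread time spent executing tasks.

    ASSUMPTION: tasks are (thread_id, begin, end) tuples in seconds and the
    run lasts from the earliest begin to the latest end.
    """
    start = min(begin for _, begin, _ in tasks)
    stop = max(end for _, _, end in tasks)
    busy = sum(end - begin for _, begin, end in tasks)
    return busy / (n_threads * (stop - start))


# Thread 0 is busy for the whole run, thread 1 only half of the time.
tasks = [(0, 0.0, 4.0), (1, 0.0, 1.0), (1, 2.0, 3.0)]
value = occupancy(tasks, n_threads=2)
print(f"occupancy: {value:.0%}")
```

Comparing this value between two runs, for example restricted to the tasks of a given panel range, shows whether idle time has been reduced.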

slide-19
SLIDE 19

Incremental Effects

[Figure: incremental effects for st-3D-sqexp on 16 to 256 nodes; axes: Time (s), %, and Matrix Size (10^6). Panels: (a) Impact of Load Balancing (NONE versus B); (b) Impact of Reducing Communication Volume (B versus BS); (c) Impact of Lookahead (BS versus BSL); (d) Impact of Hierarchical POTRF (BSL versus BSLH)]

slide-20
SLIDE 20

Comparison with State-of-the-art

[Figure: time (s) versus matrix size (10^6) for HiCMA and Lorapo on 16 to 512 nodes; panels: Syn-2D and st-2D-sqexp]

slide-21
SLIDE 21

3D Application and extreme-scale runs

[Figure: time (s) versus matrix size (10^6) on 16 to 4096 nodes; series: st-3D-sqexp and st-2D-sqexp]

The largest matrices that fit in memory, up to 4096 nodes for st-3D-sqexp and 1024 nodes for st-2D-sqexp

slide-22
SLIDE 22

Conclusion

  • Present the profiling system of PaRSEC: trace collection framework, PINS, Dependency Analysis and Trace;
  • Demonstrate the performance analysis using the profiling system in PaRSEC to show optimization footprints of TLR Cholesky factorization from data distribution, communication-reducing and synchronization-reducing perspectives;
  • Thanks to the optimizations hinted at by the profiling system, the new TLR Cholesky factorization achieves an 8X performance speedup over existing state-of-the-art implementations on massively parallel supercomputers, and solves 3D problems in climate and weather prediction applications with up to 42M geospatial locations on 130,000 cores.

slide-23
SLIDE 23


Questions?