Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU - PowerPoint PPT Presentation

libflame → Creating your own algorithm → FLAME notation and algorithms HPC & A Algorithm loop: repartition+operation+merging A TL A 00 α 11 T a 10 → → A BL A BR A 20 A 22 a 21 A 00 A TL T √ α 11 a 10 → a 21 A 20 A 22 – A BL A BR / T a 21 a 21 α 11 28 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → FLAME notation and algorithms HPC & A Algorithm loop: repartition A TL → A BL A BR A 00 T α 11 a 10 a 21 A 20 A 22 Indexing operations 29 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → FLAME notation and algorithms HPC & A Algorithm loop: operation A 00 α 11 T a 10 → A 20 a 21 A 00 T √ α 11 a 10 a 21 A 20 A 22 – / T a 21 a 21 α 11 Real computation 30 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → FLAME notation and algorithms HPC & A Algorithm loop: merging A 00 √ α 11 T a 10 → a 21 A 20 A 22 – / / T T a a 21 a 21 a α 11 A TL A BL A BR Indexing operation 31 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → FLAME notation and algorithms HPC & A Algorithm � Automatic development from math. specification A = L * L T Mechanical procedure 32 32 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Spark: from algorithm to code HPC & A APIs Spark+APIs C, F77, Matlab, LabView, LaTeX 33 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Spark: from algorithm to code HPC & A Spark website http://www.cs.utexas.edu/users/flame/Spark/ 34 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Spark: from algorithm to code HPC & A Example: FLAME@lab [ ATL, ATR,... ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' ); while ( size( ATL, 1 ) < size( A, 1 ) ) [ A00, a01, A02,... a10t, alpha11, a12t,... A20, a21, A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR,... ABL, ABR,... 1, 1, 'FLA_BR' ); %----------------------------------------% % : %----------------------------------------% [ ATL, ATR,... ABL, ABR ] = ... FLA_Cont_with_3x3_to_2x2( A00, a01, A02,... a10t, alpha11, a12t,... A20, a21, A22,... 'FLA_TL' ); end Indexing operations 35 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Spark: from algorithm to code HPC & A Example: FLAME@lab � Manually fill-in operations […] = FLA_Part_2x2(…); while ( size( ATL, 1 ) < size( A, 1 ) ) while ( size( ATL, 1 ) < size( A, 1 ) ) […] = FLA_Repart_2x2_to_3x3(…); %----------------------------------------% alpha11 = sqrt( alpha11 ); a21 = a21 / alpha11; A22 = A22 – tril( a21*a21’ ); %----------------------------------------% […] = FLA_Cont_with_3x3_to_2x2(…); end Real computation 36 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Running on multicore HPC & A Example: FLAMEC FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR, 0, 0, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){ b = min( FLA_Obj_length( ABR ), nb_alg ); FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A00, /**/ &a01, &A02, /* ************* */ /* ************************** */ &a10t,/**/ &alpha11, &a12t, ABL, /**/ ABR, &A20, /**/ &a21, &A22, 1, 1, FLA_BR ); /*--------------------------------------*/ /* : */ /*--------------------------------------*/ FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A00, a01, /**/ A02, a10t, alpha11, /**/ a12t, /* ************** */ /* ************************/ &ABL, /**/ &ABR, A20, a21, /**/ A22, FLA_TL ); } 37 INRIA-Sophia Antipolis, June 2011

libflame → Creating your own algorithm → Running on multicore HPC & A Example: FLAMEC � libflame employs external BLAS: GotoBLAS, MKL, ACML, ATLAS, netlib FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLA_Sqrt( alpha11 ); FLA_Inv_scal( alpha11, a21 ); FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, a21, A22 ); /*--------------------------------------*/ FLA_Cont_with_3x3_to_2x2(…); } 38 INRIA-Sophia Antipolis, June 2011

Index HPC & A � The libflame library 1. A user’s view 2. Creating your own algorithm 1. Task parallelism 3. FLAME runtime 2. 2. SuperMatrix SuperMatrix 4. 4. Cluster of GPUs Cluster of GPUs 3. GPU support � The SMPs/GPUSs framework 39 INRIA-Sophia Antipolis, June 2011

Data-flow parallelism? Dynamic scheduling? HPC & A Run-time? � Surely not a new idea… � Cilk � StarSs (GridSs) � StarPU � … � “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, R. M. Tomasulo, IBM J. of R&D, Volume 11, Number 1, Page 25 (1967) The basis for exploitation of ILP on current superscalar processors! 40 INRIA-Sophia Antipolis, June 2011

The TEXT project HPC & A � Towards Exaflop applicaTions � Demonstrate that Hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way. � � Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC � Port Applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms. � Develop additional environment capabilities � tools (debug, performance) � improvements in runtime systems (load balance and GPUSs) � Support other users � Identify users of TEXT applications � Identify and support interested application developers � Contribute to Standards (OpenMP ARB, PERI-XML) 41 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → Task parallelism HPC & A Blocked algorithms � Cholesky factorization T A 11 = L 11 L 11 -T A 21 := L 21 = A 21 L 11 T A 22 := A 22 – L 21 L 21 � �� 42 42 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → Task parallelism HPC & A Blocked algorithms � Cholesky factorization A = L * L T FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ FLA_Repart_2x2_to_3x3(…); FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLA_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLA_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); APIs + Tools /*--------------------------------------*/ FLA_Cont_with_3x3_to_2x2(…); } 43 43 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → Task parallelism HPC & A Blocked algorithms � Simple parallelization: T A 11 = L 11 L 11 -T A 21 := L 21 = A 21 L 11 link with MT BLAS T A 22 := A 22 – L 21 L 21 FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ FLA_Repart_2x2_to_3x3(…); FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLA_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLA_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); /*--------------------------------------*/ FLA_Cont_with_3x3_to_2x2(…); } 44 44 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → Task parallelism HPC & A Blocked algorithms � There is more parallelism! Inside the same iteration In different iterations �� 45 45 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → SuperMatrix HPC & A Exploiting task-level parallelism � SuperMatrix: automatic identification of tasks/dependencies Super � Matrix 1 2 4 7 5 3 6 8 9 10 46 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → SuperMatrix HPC & A Exploiting task-level parallelism � SuperMatrix: automatic identification of tasks/dependencies HOW? � Input/output/input-output Super operands and order of Matrix operations in code determine dependencies � Direction of operands is defined /*--------------------------------------*/ FLA_Chol( FLA_LOWER_TRIANGULAR, A11 ); as part of BLAS specification FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLA_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); /*--------------------------------------*/ 47 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → SuperMatrix HPC & A Exploiting task-level parallelism � SuperMatrix: scheduling of tasks to � cores 1 2 4 7 Super 5 3 6 8 9 10 Matrix 48 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → SuperMatrix HPC & A Exploiting task-level parallelism � SuperMatrix: scheduling of tasks to � cores HOW? 1 2 4 7 � List of ready tasks Super � One thread per core 5 3 6 8 9 10 Matrix 1. Centralized list 2. One list per-thread 3. One list per-thread and work- stealing 49 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU � SuperMatrix: Dealing with data transfers between host (CPU)/device � (GPU) memory spaces 1 2 4 7 Super 5 3 6 8 9 10 Matrix 50 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: a user’s view FLA_Obj A; // Initialize conventional matrix: buffer, m, rs, cs // Obtain storage blocksize, # of threads: b, n_threads FLA_Init(); FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A ); FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A ); FLASH_Queue_set_num_threads( n_threads ); FLASH_Queue_enable_gpu(); FLASH_Chol( FLA_LOWER_TRIANGULAR, A ); FLASH_Obj_free( &A ); FLA_Finalize(); 51 51 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Indexing operations (with FLA_Repart_2x2_to_3x3(…); addresses in device /*--------------------------------------*/ memory) FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); /*--------------------------------------*/ FLA_Cont_with_3x3_to_2x2(…); } 52 52 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); Real computation: FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); Runtime keeps track of data in /*--------------------------------------*/ host/device memory and FLA_Cont_with_3x3_to_2x2(…); performs the necessary transfers, reducing #copies } 53 53 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 1. Copy matrix to GPU /*--------------------------------------*/ memory before FLA_Cont_with_3x3_to_2x2(…); algorithm commences } 54 54 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, 2. Copy block A11 from FLA_MINUS_ONE, A21, FLA_ONE, A22 ); /*--------------------------------------*/ device to host before its FLA_Cont_with_3x3_to_2x2(…); factorization } 55 55 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Single GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, 3. Copy block A11 from FLA_MINUS_ONE, A21, FLA_ONE, A22 ); /*--------------------------------------*/ host to device before FLA_Cont_with_3x3_to_2x2(…); using it in subsequent } computations 56 56 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU � SuperMatrix: Dealing with data transfers between host (CPU)/device � (GPU) memory spaces 1 2 4 7 Super 5 3 6 8 9 10 Matrix 57 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU � How do we program these? GPU #0 GPU #1 PCI-e CPU(s) bus bus GPU #2 GPU #3 Inter- connect 58 58 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: a user’s view FLA_Obj A; // Initialize conventional matrix: buffer, m, rs, cs // Obtain storage blocksize, # of threads: b, n_threads FLA_Init(); FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A ); FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A ); FLASH_Queue_set_num_threads( n_threads ); FLASH_Queue_enable_gpu(); FLASH_Chol( FLA_LOWER_TRIANGULAR, A ); FLASH_Obj_free( &A ); FLA_Finalize(); 59 59 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover � Naïve approach: �� GPU #0 �� GPU #1 � PCI-e CPU(s) �� bus bus GPU #2 GPU #3 → �� Inter- connect 60 60 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover � How do we program these? GPU #0 GPU #1 PCI-e CPU(s) bus bus GPU #2 GPU #3 Inter- connect View as a… � Shared-memory multiprocessor + DSM 61 61 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover � View system as a shared- memory multiprocessors (multi-core processor with GPU #0 GPU #1 hw. coherence) PCI-e CPU(s) bus bus GPU #2 MP P 0 +C 0 GPU #3 Inter- connect P 1 +C 1 P 2 +C 2 P 3 +C 3 62 62 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover � Software Distributed-Shared Memory (DSM) � Software: flexibility vs. efficiency � Underlying distributed memory hidden from the users � Reduce memory transfers using write-back, write- invalidate,… � Well-known approach, not too efficient as a middleware for general apps. � Regularity of dense linear algebra operations makes a difference! 63 63 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover � Reduce #data transfers: �� Super �� ! Matrix "�� #�� → �� #�� → �� $�� GPU #0 $�� GPU #1 PCI-e CPU(s) bus GPU #2 GPU #3 Inter- connect 64 64 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 1. Distribute matrix among /*--------------------------------------*/ GPU memories (2D workload distribution) FLA_Cont_with_3x3_to_2x2(…); before algorithm } commences 65 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover GPU #0 GPU #1 Super Matrix GPU #2 GPU #3 GPU #3 1. Distribute matrix among GPU memories (2D workload distribution): owner-computes rule 66 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 2. Copy block A11 from /*--------------------------------------*/ corresponding device to FLA_Cont_with_3x3_to_2x2(…); host before its factorization } 67 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 3. Broadcast block A11 from /*--------------------------------------*/ host to appropriate FLA_Cont_with_3x3_to_2x2(…); devices before using it in } subsequent computations (write-update) 68 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 4. Keep A11 in receiving /*--------------------------------------*/ device(s) in case needed FLA_Cont_with_3x3_to_2x2(…); in subsequent } computations (cache) 69 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Multi-GPU: under the cover FLA_Part_2x2(…); while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){ Super Matrix FLA_Repart_2x2_to_3x3(…); /*--------------------------------------*/ FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ); 5. Keep updated A21 in /*--------------------------------------*/ device till replaced (write- FLA_Cont_with_3x3_to_2x2(…); back) } 70 INRIA-Sophia Antipolis, June 2011

libflame → FLAME runtime → GPU support HPC & A Performance 71 INRIA-Sophia Antipolis, June 2011

Index HPC & A � The libflame library 1. A user’s view 2. Creating your own algorithm 3. FLAME runtime 4. 4. Clusters of GPUs Clusters of GPUs 1. DLA for clusters 2. Host-centric view 3. Device-centric view � The StarSs framework 74 INRIA-Sophia Antipolis, June 2011

libflame → Clusters of GPUs → DLA for clusters HPC & A libflame-like libraries � PLAPACK (UT@Austin) � Use of objects ( PLA_Obj ), vectors, matrices, projected vectors, etc., with layout embedded � PMB distribution � Layered and modular design: all communication is � Layered and modular design: all communication is done via copies ( PLA_Copy ) and reductions ( PLA_Reduce ) from one object type to another � Elemental (Jack Poulson) � Based on PLAPACK, but C++ � Element-wise cyclic data layout 75 75 INRIA-Sophia Antipolis, June 2011

libflame → Clusters of GPUs → Host-centric view HPC & A Data in host memory � Before executing a kernel, copy input data to GPU memory � After execution, retrieve results back to node main results back to node main memory GPU #0 � Easy to program (wrappers GPU #1 PCI-e to kernels) CPU(s) bus � Copies linked to kernel GPU #2 execution: O(n 3 ) transfers GPU #3 Inter- between CPU and GPU connect 76 76 INRIA-Sophia Antipolis, June 2011

libflame → Clusters of GPUs → Device-centric view HPC & A Data in GPU memory � Before sending a piece of data, retrieve it back to node main memory (compact on the fly) � After reception, copy � After reception, copy contents to GPU memory GPU #0 � Easy to program (wrappers GPU #1 PCI-e to MPI calls) CPU(s) bus � Copies linked to GPU #2 communication, not kernel GPU #3 Inter- execution: O(n 2 ) transfers connect between CPU and GPU 77 77 INRIA-Sophia Antipolis, June 2011

libflame → Clusters of GPUs HPC & A Performance 22x 10x 5x 78 78 INRIA-Sophia Antipolis, June 2011

libflame → Clusters of GPUs HPC & A Performance 79 79 INRIA-Sophia Antipolis, June 2011

Acknowledgements HPC & A � Funding sources 80 80 INRIA-Sophia Antipolis, June 2011

Further information HPC & A � Contact: � field@cs.utexas.edu � FLAME project website: � www.cs.utexas.edu/users/flame/ � www.cs.utexas.edu/users/flame/ � libflame: The Complete Reference � www.cs.utexas.edu/users/field/docs/ � Updated nightly � www.lulu.com/content/5915632 � Updated occasionally 81 81 INRIA-Sophia Antipolis, June 2011

Index HPC & A � The libflame library GPU support � The StarSs framework 82 INRIA-Sophia Antipolis, June 2011

The TEXT project HPC & A � Towards Exaflop applicaTions � Demonstrate that Hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way. � � Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC � Port Applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms. � Develop additional environment capabilities � tools (debug, performance) � improvements in runtime systems (load balance and GPUSs) � Support other users � Identify users of TEXT applications � Identify and support interested application developers � Contribute to Standards (OpenMP ARB, PERI-XML) 83 INRIA-Sophia Antipolis, June 2011

Index HPC & A � The libflame library � The StarSs framework 1. StarSs overview 2. OmpSs Slides from Rosa M. Badia Barcelona Supercomputing Center Thanks! 84 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview HPC & A Programming model ParallelResources (multicore, SMP, cluster, c loud, grid) Synchronization, Task selection + results transfer Resource 1 Sequential Application parameters direction ... Resource 2 (input, output, inout) for (i=0; i<N; i++){ T1 ( data1, data2); Resource 3 T2 ( data4, data5); T3 ( data2, data5, data6); T4 ( data7, data8); . . T5 ( data6, data8, data9); T5 ( data6, data8, data9); . } ... . T1 0 T2 0 Resource N T4 0 T3 0 Scheduling, T5 0 Task graph creation T1 1 T2 1 data transfer, based on data T4 1 T3 1 task execution precedence T5 1 T1 2 … 85 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview HPC & A Programming model StarSs GridSs CellSs SMPSs ClusterSs GPUSs ClusterSs ClearSpeedSs OmpSs COMPSs @ SMP @ GPU @ Cluster � Programmability/Portability � Incremental parallelization/restructure � StarSs � Separate algorithm from resources � � Disciplined programming A “node” level programming model � � Sequential C/Fortran/Java + annotations “Same” source code runs on “any” machine � Optimized task implementations will result in better performance. � Task based. Asynchrony, data-flow. � Performance � “Simple” linear address space � Intelligent Runtime � Directionality annotations on tasks arguments � Automatically extracts and exploits parallelism � Dataflow, workflow � Nicely integrates in hybrid MPI/StarSs � Matches computations to specific resources on each type of � Natural support for heterogeneity target platform � Asynchronous (data-flow) execution and locality awareness 86 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview A sequential program… HPC & A void vadd3 (float A[BS], float B[BS], float C[BS]); void scale_add (float sum, float A[BS], float B[BS]); void accum (float A[BS], float *sum); for (i=0; i<N; i+=BS) // C=A+B vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) // sum(C[i]) accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); ... for (i=0; i<N; i+=BS) // E=G+F vadd3 (&G[i], &F[i], &E[i]); 87 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview A sequential program… taskified… HPC & A #pragma css task input(A, B) output(C) void vadd3 (float A[BS], float B[BS], float C[BS]); Compute dependences @ task instantiation time #pragma css task input(sum, A) inout(B) void scale_add (float sum, float A[BS], float B[BS]); 1 2 3 4 #pragma css task input(A) inout(sum) void accum (float A[BS], float *sum); for (i=0; i<N; i+=BS) // C=A+B 5 7 8 6 vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) // sum(C[i]) 9 10 11 12 accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A 13 14 15 16 scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); 17 18 19 20 ... for (i=0; i<N; i+=BS) // E=G+F Color/number: order of task instantiation vadd3 (&G[i], &F[i], &E[i]); Some antidependences covered by flow dependences not drawn 88 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview A sequential program… taskified… with data-flow HPC & A execution Decouple how we write from #pragma css task input(A, B) output(C) Write how it is executed void vadd3 (float A[BS], float B[BS], float C[BS]); Execute #pragma css task input(sum, A) inout(B) void scale_add (float sum, float A[BS], float B[BS]); 1 2 3 4 #pragma css task input(A) inout(sum) void accum (float A[BS], float *sum); for (i=0; i<N; i+=BS) // C=A+B 5 7 8 6 vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) // sum(C[i]) 9 10 11 12 accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A 13 14 15 16 scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); 17 18 19 20 ... for (i=0; i<N; i+=BS) // E=G+F Color/number: a possible order of task execution vadd3 (&G[i], &F[i], &E[i]); 89 INRIA-Sophia Antipolis, June 2011

StarSs → StarSs overview HPC & A The potential of data access information � Flexibility to dynamically traverse dataflow graph “optimizing” � Concurrency. Critical path � Memory access: data transfers performed by run time � Opportunities for � Prefetch � Reuse � Eliminate antidependences (rename) � Replication management � Coherency/consistency handled by the runtime 90 INRIA-Sophia Antipolis, June 2011

Index HPC & A � The libflame library � The StarSs framework 1. StarSs overview 2. OmpSs 1. Overview & syntax 2. Compiler 3. Runtime 4. Examples 91 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A OmpSs = OpenMP + StarSs extensions � OmpSs is based on OpenMP with some differences: � Different execution model � Extended memory model � Extensions for point-to-point inter-task synchronizations � data dependencies � Extensions for heterogeneity � Other minor extensions 92 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Execution model � Thread- pool model � OpenMP parallel “ignored” � All threads created on startup � One of them starts executing main � All get work from a task pool � And can generate new work 93 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Memory model � Two “modes“ are allowed: � pure SMP: � Single address space � OpenMP standard memory model is used � � non-SMP (cluster, GPUs, ...): � Multiple address spaces exists � Same data may exists in multiple of these � Data consistency ensured by the implementation 94 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Main element: Task � Task: unit of computation � Task definition � Pragmas in lined � Pragmas attached to function definition � Pragmas attached to function definition #pragma omp task void foo (int Y[size], int size) { int j; for (j=0; j<size; j++) Y[j]= j; } int main() { int X[100] foo (X, 100); } 95 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Defining dependences � Clauses that express data direction: � input � output � inout � Dependences computed at runtime taking into account � Dependences computed at runtime taking into account these clauses #pragma omp task output( x ) 1 x = 5; #pragma omp task input( x ) 3 printf("%d\n" , x ) ; #pragma omp task inout( x ) 2 x++; #pragma omp task input( x ) printf ("%d\n" , x ) ; 4 96 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Heterogeneity: the target directive � Directive to specify device specific information: #pragma omp target [ clauses ] � Clauses: � device: which device (smp, gpu) � copy_in, copy_out, copy_inout: data to be moved in and out � implements: specifies alternate implementations #pragma target device (smp) #pragma omp task input (Y) void foo (int Y[size], int size) { int j; for (j=0; j<size; j++) Y[j]= j; } int main() { int X[100] foo (X, 100) ; } 97 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Synchronization #pragma omp task wait � Suspends the current task until all children tasks are completed � Just direct children, not descendants void traverse_list ( List l ) { Element e ; Element e ; for ( e = l-> first; e ; e = e->next ) #pragma omp task process ( e ) ; #pragma omp taskwait } 2 ... 1 3 4 98 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Overview & syntax HPC & A Hierarchical task graph � Nesting #pragma omp task input([BS][BS]A, [BS][BS]B)\ inout([BS][BS]C) void small_dgemm(float *C, float *A, float *B); #pragma omp task input([N][N]A, [N][N] B)\ inout([N][N]C) inout([N][N]C) void block_dgemm(float *C, float *A, float *B){ int i, j, k; for (i=0; i< N; i+=BS) for (j=0; j< N; j+=BS) for (k=0; k< N; k+=BS) small_dgemm(&C[i][j], &A[i][k], &B[k][j]) } main() { ... block_dgemm(A,B,C); block_dgemm(D,E,F); #pragma omp task wait } 99 INRIA-Sophia Antipolis, June 2011

StarSs → OmpSs → Compiler HPC & A Mercurium � Minor role � Recognizes constructs and transforms them to calls to the runtime � Manages code restructuring for different target devices � Device-specific handlers � May generate code in a separate file � Invokes different back-end compilers → nvcc for NVIDIA 100 INRIA-Sophia Antipolis, June 2011

Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU - PowerPoint PPT Presentation

HPC & A Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms Enrique S. Quintana-Ort 1 INRIA-Sophia Antipolis, June 2011 Index HPC & A The libflame library GPU support The StarSs framework 2

Easy-to-Use Easy-to-Install Easy on the Budget orecx.com Easy-to-Use

Chapter 1 What is Linear Algebra? Chapter 1 What is Linear Algebra? The study of linear

Hybrid Construction Hybrid Construction Hybrid Construction Hybrid Construction 1 VP

Graphics 2014 Linear Algebra II Linear Maps & Matrices Linear Maps & Matrices CORE

Lecture 14: Dense Linear Algebra David Bindel 18 Oct 2010 Where we are This week: dense

Linear Algebra Linear algebra has become as basic and as applicable as calculus, and

Linear algebra explained in four pages Excerpt from the N O BULLSHIT GUIDE TO LINEAR ALGEBRA by

PV Math Department MCL Vision Credit Options Credit General General/Post- College Honors

Linear Programming Linear Programming In a linear programming problem, there is a set of

Matrices Basic Linear Algebra Overview Lecture will cover why matrices and linear algebra

MATRICES AND LINEAR ALGEBRA Linear Algebra Matrix manipulation is the original essence of

Expressive Linear Algebra in Haskell Henning Thielemann 2019-08-21 Expressive Linear Algebra in

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Web/CD Hybrid Model Web/CD Hybrid Model Web/CD Hybrid Model Web/CD Hybrid Model for t he Dist

Hybrid Automobiles Hybrid Automobiles It switches easily between fuel, batteries, or both It

5.1 Basic Operations Chapter 5: Algebra 2 Chapter 5: Algebra

Flame Graphs on FreeBSD Brendan Gregg Senior Performance

Intro to Continuous Delivery and DevOps From a testing perspective Lisa Crispin With material

Today Digital Signal Processors Digital signal processors Microcontrollers are optimized

How to exploit CI as a means of deployment? Breakout session 26-27th Feb 2018 58th CREST Open

SELF SE SELF SE SE SELF SE SELF LF-INJECTION LF LF LF-INJECTION INJECTION INJECTION

Profiling Go Code FLAME GRAPHS Ram Nadella Online Marketplace CPU Flame Graph of Linux Kernel

Numerical Study of Cyclic Variation in a Large Bore 2- Stroke Natural Gas Engine Timothy

The Apollo 1 Fire INST 154 Apollo at 50 Lee Atwood Interview Apollo 1 Audio Apollo 204 Review

Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU - PowerPoint PPT Presentation

HPC & A Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms Enrique S. Quintana-Ort 1 INRIA-Sophia Antipolis, June 2011 Index HPC & A The libflame library GPU support The StarSs framework 2

Easy-to-Use Easy-to-Install Easy on the Budget orecx.com Easy-to-Use

Chapter 1 What is Linear Algebra? Chapter 1 What is Linear Algebra? The study of linear

Hybrid Construction Hybrid Construction Hybrid Construction Hybrid Construction 1 VP

Graphics 2014 Linear Algebra II Linear Maps &amp; Matrices Linear Maps &amp; Matrices CORE

Lecture 14: Dense Linear Algebra David Bindel 18 Oct 2010 Where we are This week: dense

Linear Algebra Linear algebra has become as basic and as applicable as calculus, and

Linear algebra explained in four pages Excerpt from the N O BULLSHIT GUIDE TO LINEAR ALGEBRA by

PV Math Department MCL Vision Credit Options Credit General General/Post- College Honors

Linear Programming Linear Programming In a linear programming problem, there is a set of

Matrices Basic Linear Algebra Overview Lecture will cover why matrices and linear algebra

MATRICES AND LINEAR ALGEBRA Linear Algebra Matrix manipulation is the original essence of

Expressive Linear Algebra in Haskell Henning Thielemann 2019-08-21 Expressive Linear Algebra in

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Web/CD Hybrid Model Web/CD Hybrid Model Web/CD Hybrid Model Web/CD Hybrid Model for t he Dist

Hybrid Automobiles Hybrid Automobiles It switches easily between fuel, batteries, or both It

5.1 Basic Operations Chapter 5: Algebra 2 Chapter 5: Algebra

Flame Graphs on FreeBSD Brendan Gregg Senior Performance

Intro to Continuous Delivery and DevOps From a testing perspective Lisa Crispin With material

Today Digital Signal Processors Digital signal processors Microcontrollers are optimized

How to exploit CI as a means of deployment? Breakout session 26-27th Feb 2018 58th CREST Open

SELF SE SELF SE SE SELF SE SELF LF-INJECTION LF LF LF-INJECTION INJECTION INJECTION

Profiling Go Code FLAME GRAPHS Ram Nadella Online Marketplace CPU Flame Graph of Linux Kernel

Numerical Study of Cyclic Variation in a Large Bore 2- Stroke Natural Gas Engine Timothy

The Apollo 1 Fire INST 154 Apollo at 50 Lee Atwood Interview Apollo 1 Audio Apollo 204 Review

Graphics 2014 Linear Algebra II Linear Maps & Matrices Linear Maps & Matrices CORE