
Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms (PowerPoint presentation transcript)

HPC & A
Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms
Enrique S. Quintana-Ortí
INRIA-Sophia Antipolis, June 2011

Index
- The libflame library
- GPU support
- The StarSs framework


  1. libflame → Creating your own algorithm → FLAME notation and algorithms. Algorithm loop: repartition + operation + merging. Each iteration repartitions the 2x2 view into a 3x3 view, performs the computation on the exposed blocks, and merges the result back:

      $$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \qquad \alpha_{11} := \sqrt{\alpha_{11}}, \quad a_{21} := a_{21}/\alpha_{11}, \quad A_{22} := A_{22} - a_{21} a_{21}^T, \qquad \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$$

  2. libflame → Creating your own algorithm → FLAME notation and algorithms. Algorithm loop: repartition (indexing operations only):

      $$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$

  3. libflame → Creating your own algorithm → FLAME notation and algorithms. Algorithm loop: operation (the real computation):

      $$\alpha_{11} := \sqrt{\alpha_{11}}, \qquad a_{21} := a_{21}/\alpha_{11}, \qquad A_{22} := A_{22} - \mathrm{tril}\!\left(a_{21} a_{21}^T\right)$$
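      For reference, a minimal plain-C sketch of this unblocked, right-looking Cholesky loop outside the FLAME API; the column-major layout, the leading dimension lda, and the function name chol_unb are assumptions for illustration, not part of the slides.

      #include <math.h>

      /* Overwrite the lower triangle of the n x n column-major matrix A
         (leading dimension lda) with L such that A = L*L^T.
         Returns 0 on success, k+1 if the k-th pivot is not positive.      */
      static int chol_unb(int n, double *A, int lda)
      {
          for (int k = 0; k < n; k++) {
              double akk = A[k + k*lda];
              if (akk <= 0.0) return k + 1;
              akk = sqrt(akk);                  /* alpha11 := sqrt(alpha11)     */
              A[k + k*lda] = akk;
              for (int i = k + 1; i < n; i++)   /* a21 := a21 / alpha11         */
                  A[i + k*lda] /= akk;
              for (int j = k + 1; j < n; j++)   /* A22 := A22 - tril(a21*a21^T) */
                  for (int i = j; i < n; i++)
                      A[i + j*lda] -= A[i + k*lda] * A[j + k*lda];
          }
          return 0;
      }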

  4. libflame → Creating your own algorithm → FLAME notation and algorithms. Algorithm loop: merging (indexing operation):

      $$\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$$

  5. libflame → Creating your own algorithm → FLAME notation and algorithms. Algorithm: automatic development from the mathematical specification A = L * L^T; the derivation is a mechanical procedure.
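      As a pointer to why the derivation is mechanical, one step of the standard unblocked argument can be written out explicitly (this is only the basic identity, not the full FLAME worksheet methodology). Partitioning the factor conformally with the matrix,

      $$\begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix} \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix}^T = \begin{pmatrix} \lambda_{11}^2 & \star \\ \lambda_{11} l_{21} & l_{21} l_{21}^T + L_{22} L_{22}^T \end{pmatrix},$$

      so matching blocks yields $\lambda_{11} = \sqrt{\alpha_{11}}$, $l_{21} = a_{21}/\lambda_{11}$, and $L_{22} L_{22}^T = A_{22} - l_{21} l_{21}^T$: exactly the three updates of the loop, followed by the same factorization applied to the trailing submatrix.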

  6. libflame → Creating your own algorithm → Spark: from algorithm to code. APIs: Spark + APIs for C, F77, Matlab, LabView, and LaTeX.

  7. libflame → Creating your own algorithm → Spark: from algorithm to code. Spark website: http://www.cs.utexas.edu/users/flame/Spark/

  8. libflame → Creating your own algorithm → Spark: from algorithm to code. Example: FLAME@lab. Spark generates the indexing skeleton; the update is left to be filled in (indexing operations only):

      [ ATL, ATR,...
        ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' );
      while ( size( ATL, 1 ) < size( A, 1 ) )
        [ A00,  a01,     A02,...
          a10t, alpha11, a12t,...
          A20,  a21,     A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR,...
                                                        ABL, ABR,...
                                                        1, 1, 'FLA_BR' );
        %----------------------------------------%
        %                   :                    %
        %----------------------------------------%
        [ ATL, ATR,...
          ABL, ABR ] = FLA_Cont_with_3x3_to_2x2( A00,  a01,     A02,...
                                                 a10t, alpha11, a12t,...
                                                 A20,  a21,     A22,...
                                                 'FLA_TL' );
      end

  9. libflame → Creating your own algorithm → Spark: from algorithm to code. Example: FLAME@lab. Manually fill in the operations (the real computation):

      […] = FLA_Part_2x2(…);
      while ( size( ATL, 1 ) < size( A, 1 ) )
        […] = FLA_Repart_2x2_to_3x3(…);
        %----------------------------------------%
        alpha11 = sqrt( alpha11 );
        a21     = a21 / alpha11;
        A22     = A22 - tril( a21*a21' );
        %----------------------------------------%
        […] = FLA_Cont_with_3x3_to_2x2(…);
      end

  10. libflame → Creating your own algorithm → Running on multicore. Example: FLAMEC (the C skeleton):

      FLA_Part_2x2( A,    &ATL, &ATR,
                          &ABL, &ABR,     0, 0, FLA_TL );
      while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
        b = min( FLA_Obj_length( ABR ), nb_alg );
        FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
                            /* ************* */ /* ************************** */
                                                   &a10t, /**/ &alpha11, &a12t,
                               ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
                               1, 1, FLA_BR );
        /*--------------------------------------*/
        /*                  :                   */
        /*--------------------------------------*/
        FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00,  a01,     /**/ A02,
                                                      a10t, alpha11, /**/ a12t,
                               /* ************** */ /* ************************/
                                  &ABL, /**/ &ABR,    A20,  a21,     /**/ A22,
                                  FLA_TL );
      }

  11. libflame → Creating your own algorithm → Running on multicore. Example: FLAMEC. libflame employs an external BLAS: GotoBLAS, MKL, ACML, ATLAS, or the netlib reference implementation.

      FLA_Part_2x2(…);
      while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
        FLA_Repart_2x2_to_3x3(…);
        /*--------------------------------------*/
        FLA_Sqrt( alpha11 );
        FLA_Inv_scal( alpha11, a21 );
        FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                 FLA_MINUS_ONE, a21, A22 );
        /*--------------------------------------*/
        FLA_Cont_with_3x3_to_2x2(…);
      }

  12. Index
      - The libflame library
        1. A user's view
        2. Creating your own algorithm
        3. FLAME runtime
           1. Task parallelism
           2. SuperMatrix
           3. GPU support
        4. Clusters of GPUs
      - The SMPSs/GPUSs framework

  13. Data-flow parallelism? Dynamic scheduling? Run-time? Surely not a new idea:
      - Cilk
      - StarSs (GridSs)
      - StarPU
      - ...
      - "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", R. M. Tomasulo, IBM J. of R&D, Volume 11, Number 1, Page 25 (1967): the basis for the exploitation of ILP in current superscalar processors!

  14. The TEXT project: Towards Exaflop applicaTions.
      - Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way.
      - Deploy at supercomputing centers: Jülich, EPCC, HLRS, BSC.
      - Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms.
      - Develop additional environment capabilities:
        - tools (debugging, performance)
        - improvements in the runtime systems (load balance and GPUSs)
      - Support other users:
        - identify users of the TEXT applications
        - identify and support interested application developers
      - Contribute to standards (OpenMP ARB, PERI-XML).

  15. libflame → FLAME runtime → Task parallelism. Blocked algorithms: Cholesky factorization. In each iteration:

      $$A_{11} = L_{11} L_{11}^T, \qquad A_{21} := L_{21} = A_{21} L_{11}^{-T}, \qquad A_{22} := A_{22} - L_{21} L_{21}^T$$

      (a factorization of the diagonal block, a triangular system solve, and a symmetric rank-k update).
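      Since the slide only lists the three block updates, here is a minimal plain-C sketch of the blocked right-looking loop expressed with Fortran-style LAPACK/BLAS calls (dpotrf_, dtrsm_, dsyrk_); this is not libflame code, and the column-major storage, the underscore-suffixed bindings, and the name chol_blk are assumptions for illustration.

      /* Blocked right-looking Cholesky of the lower triangle of A (n x n,
         column-major, leading dimension lda), block size b.               */
      extern void dpotrf_(const char *uplo, const int *n, double *a,
                          const int *lda, int *info);
      extern void dtrsm_(const char *side, const char *uplo, const char *transa,
                         const char *diag, const int *m, const int *n,
                         const double *alpha, const double *a, const int *lda,
                         double *b, const int *ldb);
      extern void dsyrk_(const char *uplo, const char *trans, const int *n,
                         const int *k, const double *alpha, const double *a,
                         const int *lda, const double *beta, double *c,
                         const int *ldc);

      static int chol_blk(int n, double *A, int lda, int b)
      {
          const double one = 1.0, minus_one = -1.0;
          for (int k = 0; k < n; k += b) {
              int kb = (n - k < b) ? n - k : b;   /* current block size         */
              int m2 = n - k - kb;                /* order of the trailing part */
              int info;
              dpotrf_("L", &kb, &A[k + k*lda], &lda, &info);   /* A11 = L11*L11^T   */
              if (info != 0) return k + info;
              if (m2 > 0) {
                  dtrsm_("R", "L", "T", "N", &m2, &kb, &one,   /* A21 := A21*L11^-T */
                         &A[k + k*lda], &lda, &A[(k + kb) + k*lda], &lda);
                  dsyrk_("L", "N", &m2, &kb, &minus_one,       /* A22 -= A21*A21^T  */
                         &A[(k + kb) + k*lda], &lda, &one,
                         &A[(k + kb) + (k + kb)*lda], &lda);
              }
          }
          return 0;
      }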

  16. libflame → FLAME runtime → Task parallelism. Blocked algorithms: Cholesky factorization, A = L * L^T (APIs + tools):

      FLA_Part_2x2(…);
      while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
        FLA_Repart_2x2_to_3x3(…);
        /*--------------------------------------*/
        FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
        FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
        FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
        /*--------------------------------------*/
        FLA_Cont_with_3x3_to_2x2(…);
      }

  17. libflame → FLAME runtime → Task parallelism. Blocked algorithms: simple parallelization. Link with a multithreaded (MT) BLAS, so that each of the three updates

      $$A_{11} = L_{11} L_{11}^T, \qquad A_{21} := L_{21} = A_{21} L_{11}^{-T}, \qquad A_{22} := A_{22} - L_{21} L_{21}^T$$

      is executed in parallel internally; the loop itself is unchanged:

      FLA_Part_2x2(…);
      while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
        FLA_Repart_2x2_to_3x3(…);
        /*--------------------------------------*/
        FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
        FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
        FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
        /*--------------------------------------*/
        FLA_Cont_with_3x3_to_2x2(…);
      }

  18. libflame → FLAME runtime → Task parallelism. Blocked algorithms: there is more parallelism! Independent tasks exist both inside the same iteration and across different iterations.

  19. libflame → FLAME runtime → SuperMatrix. Exploiting task-level parallelism. SuperMatrix: automatic identification of the tasks and their dependencies.
      [Figure: the task dependency graph (DAG) of the blocked Cholesky factorization, tasks numbered 1 to 10]

  20. libflame → FLAME runtime → SuperMatrix. Exploiting task-level parallelism. SuperMatrix: automatic identification of tasks/dependencies. How?
      - The input/output/input-output operands and the order of the operations in the code determine the dependencies.
      - The direction of the operands is defined as part of the BLAS specification.

      /*--------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------*/
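      To make the analysis concrete, here is a minimal sketch of how a runtime could derive DAG edges from operand directions; the Task/Access/WriterEntry types and the per-block "last writer" bookkeeping are illustrative assumptions, not the SuperMatrix internals (which also track readers to handle anti-dependencies).

      #include <stddef.h>

      typedef enum { IN, OUT, INOUT } Dir;

      typedef struct {                 /* one operand of a submitted task   */
          void *block;                 /* base address identifies a block   */
          Dir   dir;
      } Access;

      typedef struct Task {
          const char  *name;
          Access       acc[4];
          int          nacc;
          struct Task *dep_on[4];      /* predecessor per operand (or NULL) */
      } Task;

      typedef struct { void *block; struct Task *writer; } WriterEntry;

      static struct Task *find_writer(WriterEntry *tbl, int n, void *block)
      {
          for (int i = 0; i < n; i++)
              if (tbl[i].block == block) return tbl[i].writer;
          return NULL;
      }

      /* Call in program order, i.e., in the order the tasks are queued.    */
      static void analyze(Task *t, WriterEntry *tbl, int *n)
      {
          for (int i = 0; i < t->nacc; i++) {
              void *b = t->acc[i].block;
              t->dep_on[i] = find_writer(tbl, *n, b);  /* RAW/WAW edge, or NULL    */
              if (t->acc[i].dir != IN) {               /* task becomes last writer */
                  int j = 0;
                  while (j < *n && tbl[j].block != b) j++;
                  tbl[j].block  = b;
                  tbl[j].writer = t;
                  if (j == *n) (*n)++;
              }
          }
      }

      Fed the three calls above in program order (A11: inout for FLA_Chol, input for FLA_Trsm; A21: inout for FLA_Trsm, input for FLA_Syrk), this records the edges Chol → Trsm and Trsm → Syrk, the same chain that appears in the DAG of the previous slides.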

  21. libflame → FLAME runtime → SuperMatrix. Exploiting task-level parallelism. SuperMatrix: scheduling of the tasks to the cores.
      [Figure: the tasks of the DAG (1 to 10) dispatched by the SuperMatrix runtime to the cores]

  22. libflame → FLAME runtime → SuperMatrix. Exploiting task-level parallelism. SuperMatrix: scheduling of the tasks to the cores. How?
      - A list of ready tasks (tasks whose dependencies are all satisfied).
      - One thread per core.
      Variants:
      1. A centralized list.
      2. One list per thread.
      3. One list per thread, with work-stealing.
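      A minimal sketch of variant 1 (one worker thread per core serving a centralized ready list), using POSIX threads; the Task structure, the fixed-size array, and the global counters are illustrative assumptions rather than the SuperMatrix implementation. Before the workers are spawned, remaining must be set to the total number of tasks and the initially ready tasks pushed with push_ready().

      #include <pthread.h>

      typedef struct Task {
          void (*run)(void *arg);      /* the kernel to execute             */
          void  *arg;
          int    unmet;                /* unresolved input dependencies     */
          int    nsucc;
          struct Task **succ;          /* tasks that depend on this one     */
      } Task;

      static Task *ready[1024];        /* centralized ready list            */
      static int   nready = 0, remaining = 0;
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

      static void push_ready(Task *t)
      {
          pthread_mutex_lock(&lock);
          ready[nready++] = t;
          pthread_cond_signal(&cv);
          pthread_mutex_unlock(&lock);
      }

      /* Each worker pops a ready task, executes it, and releases the
         successors whose dependency counters drop to zero.                 */
      static void *worker(void *unused)
      {
          (void)unused;
          for (;;) {
              pthread_mutex_lock(&lock);
              while (nready == 0 && remaining > 0)
                  pthread_cond_wait(&cv, &lock);
              if (remaining == 0) {            /* everything executed       */
                  pthread_cond_broadcast(&cv);
                  pthread_mutex_unlock(&lock);
                  return NULL;
              }
              Task *t = ready[--nready];
              pthread_mutex_unlock(&lock);

              t->run(t->arg);                  /* run the kernel            */

              pthread_mutex_lock(&lock);
              remaining--;
              for (int i = 0; i < t->nsucc; i++)
                  if (--t->succ[i]->unmet == 0)
                      ready[nready++] = t->succ[i];
              pthread_cond_broadcast(&cv);
              pthread_mutex_unlock(&lock);
          }
      }

      Variants 2 and 3 replace the single array with one list per worker thread, with idle workers stealing tasks from the lists of the others in the third variant.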

  23. libflame → FLAME runtime → GPU support. Single GPU. SuperMatrix: dealing with the data transfers between the host (CPU) and device (GPU) memory spaces.
      [Figure: the Cholesky task DAG (tasks 1 to 10) dispatched by SuperMatrix to the GPU]

  24. libflame → FLAME runtime → GPU support. Single GPU: a user's view.

      FLA_Obj A;
      // Initialize conventional matrix: buffer, m, rs, cs
      // Obtain storage blocksize, # of threads: b, n_threads
      FLA_Init();
      FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
      FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
      FLASH_Queue_set_num_threads( n_threads );
      FLASH_Queue_enable_gpu();
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
      FLASH_Obj_free( &A );
      FLA_Finalize();

  25. libflame → FLAME runtime → GPU support. Single GPU: under the cover. Indexing operations (now with addresses in device memory):

      FLA_Part_2x2(…);
      while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
        FLA_Repart_2x2_to_3x3(…);
        /*--------------------------------------*/
        FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
        FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                    FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                    FLA_ONE, A11, A21 );
        FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                    FLA_MINUS_ONE, A21, FLA_ONE, A22 );
        /*--------------------------------------*/
        FLA_Cont_with_3x3_to_2x2(…);
      }

  26. libflame → FLAME runtime → GPU support. Single GPU: under the cover (same SuperMatrix loop as above). Real computation: the runtime keeps track of which data reside in host and device memory and performs only the necessary transfers, reducing the number of copies.

  27. libflame → FLAME runtime → GPU support. Single GPU: under the cover (same loop). Step 1: copy the matrix to GPU memory before the algorithm commences.

  28. libflame → FLAME runtime → GPU support. Single GPU: under the cover (same loop). Step 2: copy block A11 from the device to the host before its factorization (the small diagonal factorization runs on the CPU).

  29. libflame → FLAME runtime → GPU support. Single GPU: under the cover (same loop). Step 3: copy block A11 from the host back to the device before using it in subsequent computations.
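      A minimal sketch of the per-block bookkeeping that can implement steps 1 to 3, written against the CUDA runtime API; the Block structure, the valid flags, and the helper names are illustrative assumptions, not the actual FLASH runtime.

      #include <cuda_runtime.h>
      #include <stdbool.h>
      #include <stddef.h>

      typedef struct {
          double *host;          /* block storage in host memory            */
          double *dev;           /* block storage in device memory          */
          size_t  bytes;
          bool    valid_dev;     /* device copy is up to date               */
          bool    valid_host;    /* host copy is up to date                 */
      } Block;

      /* Make the device copy current before a GPU task touches the block.  */
      static void acquire_on_device(Block *b, bool will_write)
      {
          if (!b->valid_dev) {
              cudaMemcpy(b->dev, b->host, b->bytes, cudaMemcpyHostToDevice);
              b->valid_dev = true;
          }
          if (will_write)
              b->valid_host = false;   /* host copy becomes stale           */
      }

      /* Make the host copy current before a CPU task (e.g., the A11
         factorization) touches the block.                                  */
      static void acquire_on_host(Block *b, bool will_write)
      {
          if (!b->valid_host) {
              cudaMemcpy(b->host, b->dev, b->bytes, cudaMemcpyDeviceToHost);
              b->valid_host = true;
          }
          if (will_write)
              b->valid_dev = false;    /* device copy becomes stale         */
      }

      With this bookkeeping the initial host-to-device copy happens the first time a block is used on the GPU, A11 travels to the host only for its factorization, and it returns to the device only when a later GPU task actually reads it.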

  30. libflame → FLAME runtime → GPU support. Multi-GPU. SuperMatrix: dealing with the data transfers between the host (CPU) and device (GPU) memory spaces.
      [Figure: the Cholesky task DAG (tasks 1 to 10) dispatched by SuperMatrix to several GPUs]

  31. libflame → FLAME runtime → GPU support. Multi-GPU: how do we program these?
      [Figure: a node with the CPU(s) connected through the PCI-e bus to GPU #0, GPU #1, GPU #2, and GPU #3, plus the node interconnect]

  32. libflame → FLAME runtime → GPU support. Multi-GPU: a user's view. The code is the same as in the single-GPU case:

      FLA_Obj A;
      // Initialize conventional matrix: buffer, m, rs, cs
      // Obtain storage blocksize, # of threads: b, n_threads
      FLA_Init();
      FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
      FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
      FLASH_Queue_set_num_threads( n_threads );
      FLASH_Queue_enable_gpu();
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
      FLASH_Obj_free( &A );
      FLA_Finalize();

  33. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. Naïve approach: keep the data in host memory; before each task executes, copy its input blocks to the chosen GPU and, once it completes, copy the results back, which produces an excessive number of transfers across the PCI-e bus.
      [Figure: the CPU(s) and GPUs #0 to #3 connected through the PCI-e bus and the node interconnect]

  34. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. How do we program these? View the node as a shared-memory multiprocessor plus a (software) distributed shared memory, DSM.
      [Figure: the CPU(s) and GPUs #0 to #3 connected through the PCI-e bus and the node interconnect]

  35. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence), with each processor-plus-cache pair P_i + C_i playing the role of a GPU and its memory.
      [Figure: the node (CPU(s), PCI-e bus, GPU #0 to GPU #3) viewed as a multiprocessor MP with P0+C0, P1+C1, P2+C2, P3+C3]

  36. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. Software Distributed Shared Memory (DSM):
      - Software: flexibility vs. efficiency.
      - The underlying distributed memory is hidden from the users.
      - Memory transfers are reduced using write-back, write-invalidate, etc.
      - A well-known approach, but not too efficient as a middleware for general applications.
      - The regularity of dense linear algebra operations makes a difference!

  37. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. Reduce the number of data transfers; the techniques (a static distribution of the blocks among the GPU memories, a software cache of blocks in each GPU, and write-update/write-back policies) are spelled out on the following slides.
      [Figure: SuperMatrix driving the CPU(s) and GPUs #0 to #3 over the PCI-e bus and the node interconnect]

  38. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover (same SuperMatrix loop with FLASH_Chol, FLASH_Trsm, FLASH_Syrk). Step 1: distribute the matrix among the GPU memories (2D workload distribution) before the algorithm commences.

  39. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover. Step 1 (continued): the matrix blocks are distributed among the memories of GPU #0 to GPU #3 following a 2D workload distribution, and the tasks are assigned according to the owner-computes rule.
      [Figure: 2D block layout of the matrix across GPU #0, GPU #1, GPU #2, and GPU #3]
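      A minimal sketch of a 2D block-cyclic owner map that is consistent with an owner-computes rule; the 2 x 2 grid of four GPUs and the helper name owner_of are illustrative assumptions, not necessarily the exact distribution used by the runtime.

      /* Map block (i, j) of the matrix onto one of GR x GC GPUs with a
         2D block-cyclic distribution. Under owner-computes, the task that
         overwrites block (i, j) is scheduled on GPU owner_of(i, j), while
         the blocks it only reads are replicated there on demand by the
         software cache described on the following slides.                  */
      enum { GR = 2, GC = 2 };             /* assumed 2 x 2 grid of 4 GPUs   */

      static int owner_of(int i, int j)
      {
          return (i % GR) * GC + (j % GC); /* GPU id in 0 .. GR*GC - 1       */
      }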

  40. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover (same loop). Step 2: copy block A11 from the device that owns it to the host before its factorization.

  41. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover (same loop). Step 3: broadcast block A11 from the host to the appropriate devices before it is used in subsequent computations (write-update).

  42. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover (same loop). Step 4: keep A11 in the receiving device(s) in case it is needed in subsequent computations (cache).

  43. libflame → FLAME runtime → GPU support. Multi-GPU: under the cover (same loop). Step 5: keep the updated A21 in its device until it is replaced (write-back).
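      Steps 2 to 5 can be summarized with a small amount of per-block state. The sketch below is an illustration using the CUDA runtime API; the Block structure, the cached flags, and the helper names are assumptions, not the runtime's actual data structures.

      #include <cuda_runtime.h>
      #include <stdbool.h>
      #include <stddef.h>

      enum { NGPU = 4 };

      typedef struct {
          double *host;            /* staging copy in host memory            */
          double *dev[NGPU];       /* per-GPU copy of the block              */
          size_t  bytes;
          int     owner;           /* GPU that owns the block                */
          bool    cached[NGPU];    /* device copy g is up to date            */
      } Block;

      /* Step 2: bring the owner's copy to the host (e.g., before the CPU
         factorizes A11).                                                    */
      static void fetch_to_host(Block *b)
      {
          cudaSetDevice(b->owner);
          cudaMemcpy(b->host, b->dev[b->owner], b->bytes,
                     cudaMemcpyDeviceToHost);
      }

      /* Steps 3 and 4: write-update broadcast; push the host copy to every
         GPU that will read the block and remember those copies (cache).     */
      static void broadcast_to_devices(Block *b, const bool reader[NGPU])
      {
          for (int g = 0; g < NGPU; g++)
              if (reader[g] && !b->cached[g]) {
                  cudaSetDevice(g);
                  cudaMemcpy(b->dev[g], b->host, b->bytes,
                             cudaMemcpyHostToDevice);
                  b->cached[g] = true;
              }
      }

      /* Step 5: write-back; after the owner updates a block (e.g., A21),
         only the owner's copy remains valid until the block is replaced.    */
      static void mark_updated_by_owner(Block *b)
      {
          for (int g = 0; g < NGPU; g++)
              b->cached[g] = (g == b->owner);
      }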

  44. libflame → FLAME runtime → GPU support. Performance. [Figure: performance results]

  45. libflame → FLAME runtime → GPU support. Performance. [Figure: performance results]

  46. libflame → FLAME runtime → GPU support. Performance. [Figure: performance results]

  47. Index
      - The libflame library
        1. A user's view
        2. Creating your own algorithm
        3. FLAME runtime
        4. Clusters of GPUs
           1. DLA for clusters
           2. Host-centric view
           3. Device-centric view
      - The StarSs framework

  48. libflame → Clusters of GPUs → DLA for clusters. libflame-like libraries:
      - PLAPACK (UT@Austin):
        - Use of objects (PLA_Obj): vectors, matrices, projected vectors, etc., with the layout embedded.
        - PMB distribution.
        - Layered and modular design: all communication is done via copies (PLA_Copy) and reductions (PLA_Reduce) from one object type to another.
      - Elemental (Jack Poulson):
        - Based on PLAPACK, but C++.
        - Element-wise cyclic data layout.

  49. libflame → Clusters of GPUs → Host-centric view. Data in host memory:
      - Before executing a kernel, copy its input data to GPU memory.
      - After the execution, retrieve the results back to the node's main memory.
      - Easy to program (wrappers to the kernels).
      - Copies are linked to kernel execution: O(n^3) transfers between CPU and GPU.
      [Figure: a node with the CPU(s), PCI-e bus, GPUs #0 to #3, and the interconnect]

  50. libflame → Clusters of GPUs → Device-centric view. Data in GPU memory:
      - Before sending a piece of data, retrieve it back to the node's main memory (compacting it on the fly).
      - After a reception, copy its contents to GPU memory.
      - Easy to program (wrappers to the MPI calls).
      - Copies are linked to communication, not to kernel execution: O(n^2) transfers between CPU and GPU.
      [Figure: a node with the CPU(s), PCI-e bus, GPUs #0 to #3, and the interconnect]
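      A minimal sketch of the kind of MPI wrappers suggested by the device-centric view, combining the CUDA runtime API with MPI; the helper names, the caller-provided staging buffer, and the plain blocking calls are illustrative assumptions rather than the library's actual interface.

      #include <cuda_runtime.h>
      #include <mpi.h>

      /* Device-centric send: stage the block from GPU memory into a host
         buffer only because it must travel over the network, then send it.  */
      static void send_block(const double *dev_blk, double *host_buf,
                             int nelems, int dest, int tag, MPI_Comm comm)
      {
          cudaMemcpy(host_buf, dev_blk, nelems * sizeof(double),
                     cudaMemcpyDeviceToHost);
          MPI_Send(host_buf, nelems, MPI_DOUBLE, dest, tag, comm);
      }

      /* Device-centric receive: receive into the host buffer and push the
         block straight back into GPU memory, where the kernels operate.     */
      static void recv_block(double *dev_blk, double *host_buf,
                             int nelems, int src, int tag, MPI_Comm comm)
      {
          MPI_Recv(host_buf, nelems, MPI_DOUBLE, src, tag, comm,
                   MPI_STATUS_IGNORE);
          cudaMemcpy(dev_blk, host_buf, nelems * sizeof(double),
                     cudaMemcpyHostToDevice);
      }

      Because blocks cross the PCI-e bus only around communication events, the number of host-device copies scales with the O(n^2) volume of communicated data rather than with the O(n^3) kernel invocations of the host-centric view.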

  51. libflame → Clusters of GPUs. Performance. [Figure: performance results; speedups of roughly 22x, 10x, and 5x are highlighted]

  52. libflame → Clusters of GPUs. Performance. [Figure: performance results]

  53. Acknowledgements: funding sources.

  54. Further information
      - Contact: field@cs.utexas.edu
      - FLAME project website: www.cs.utexas.edu/users/flame/
      - libflame: The Complete Reference
        - www.cs.utexas.edu/users/field/docs/ (updated nightly)
        - www.lulu.com/content/5915632 (updated occasionally)

  55. Index
      - The libflame library: GPU support
      - The StarSs framework

  56. The TEXT project: Towards Exaflop applicaTions. Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way (the full list of project goals is given in item 14 above).

  57. Index
      - The libflame library
      - The StarSs framework
        1. StarSs overview
        2. OmpSs
      Slides from Rosa M. Badia, Barcelona Supercomputing Center. Thanks!

  58. StarSs → StarSs overview. Programming model. A sequential application annotated with the direction (input, output, inout) of the task parameters:

      for (i=0; i<N; i++){
        T1 ( data1, data2 );
        T2 ( data4, data5 );
        T3 ( data2, data5, data6 );
        T4 ( data7, data8 );
        T5 ( data6, data8, data9 );
      }

      The runtime creates the task graph based on the data precedences, and then performs task selection, scheduling, data transfers, and task execution on the parallel resources (multicore, SMP, cluster, cloud, grid), together with the synchronization and the transfer of the results.
      [Figure: the sequential application on the left, the generated task graph (instances T1_0, T2_0, T3_0, T4_0, T5_0, T1_1, ...) in the middle, and Resource 1 to Resource N on the right]

  59. StarSs → StarSs overview. Programming model. The StarSs family: GridSs, CellSs, SMPSs, ClusterSs, GPUSs, ClearSpeedSs, OmpSs, COMPSs (instantiated @ SMP, @ GPU, @ Cluster).
      - StarSs:
        - A "node"-level programming model.
        - The "same" source code runs on "any" machine; optimized task implementations will result in better performance.
        - Task based: asynchrony, data-flow.
      - Programmability/Portability:
        - Incremental parallelization/restructure.
        - Separate the algorithm from the resources; disciplined programming.
        - Sequential C/Fortran/Java + annotations.
        - "Simple" linear address space; directionality annotations on the task arguments.
        - Dataflow, workflow; natural support for heterogeneity.
      - Performance:
        - Intelligent runtime: automatically extracts and exploits parallelism.
        - Nicely integrates in hybrid MPI/StarSs.
        - Matches the computations to specific resources on each type of target platform.
        - Asynchronous (data-flow) execution and locality awareness.

  60. StarSs → StarSs overview. A sequential program…

      void vadd3 (float A[BS], float B[BS], float C[BS]);
      void scale_add (float sum, float A[BS], float B[BS]);
      void accum (float A[BS], float *sum);

      for (i=0; i<N; i+=BS)     // C = A + B
        vadd3 ( &A[i], &B[i], &C[i]);
      ...
      for (i=0; i<N; i+=BS)     // sum(C[i])
        accum (&C[i], &sum);
      ...
      for (i=0; i<N; i+=BS)     // B = sum*A
        scale_add (sum, &E[i], &B[i]);
      ...
      for (i=0; i<N; i+=BS)     // A = C + D
        vadd3 (&C[i], &D[i], &A[i]);
      ...
      for (i=0; i<N; i+=BS)     // E = G + F
        vadd3 (&G[i], &F[i], &E[i]);

  61. StarSs → StarSs overview. A sequential program… taskified… The dependences are computed at task-instantiation time:

      #pragma css task input(A, B) output(C)
      void vadd3 (float A[BS], float B[BS], float C[BS]);

      #pragma css task input(sum, A) inout(B)
      void scale_add (float sum, float A[BS], float B[BS]);

      #pragma css task input(A) inout(sum)
      void accum (float A[BS], float *sum);

      for (i=0; i<N; i+=BS)     // C = A + B
        vadd3 ( &A[i], &B[i], &C[i]);
      ...
      for (i=0; i<N; i+=BS)     // sum(C[i])
        accum (&C[i], &sum);
      ...
      for (i=0; i<N; i+=BS)     // B = sum*A
        scale_add (sum, &E[i], &B[i]);
      ...
      for (i=0; i<N; i+=BS)     // A = C + D
        vadd3 (&C[i], &D[i], &A[i]);
      ...
      for (i=0; i<N; i+=BS)     // E = G + F
        vadd3 (&G[i], &F[i], &E[i]);

      [Figure: the resulting task graph; the tasks are numbered 1 to 20, with the color/number indicating the order of task instantiation. Some anti-dependences covered by flow dependences are not drawn.]

  62. StarSs → StarSs overview. A sequential program… taskified… with data-flow execution. Decouple how we write the code from how it is executed: the taskified code of the previous slide is now executed out of program order, driven by the data dependences.
      [Figure: the same task graph; the color/number now indicates a possible order of task execution.]

  63. StarSs → StarSs overview. The potential of data-access information:
      - Flexibility to dynamically traverse the dataflow graph while "optimizing":
        - concurrency, critical path;
        - memory accesses: the data transfers are performed by the runtime.
      - Opportunities for:
        - prefetch;
        - reuse;
        - elimination of anti-dependences (renaming);
        - replication management.
      - Coherency/consistency handled by the runtime.

  64. Index
      - The libflame library
      - The StarSs framework
        1. StarSs overview
        2. OmpSs
           1. Overview & syntax
           2. Compiler
           3. Runtime
           4. Examples

  65. StarSs → OmpSs → Overview & syntax. OmpSs = OpenMP + StarSs extensions. OmpSs is based on OpenMP, with some differences:
      - a different execution model;
      - an extended memory model;
      - extensions for point-to-point inter-task synchronizations: data dependencies;
      - extensions for heterogeneity;
      - other minor extensions.

  66. StarSs → OmpSs → Overview & syntax. Execution model: thread-pool model.
      - The OpenMP parallel construct is "ignored".
      - All the threads are created on startup.
      - One of them starts executing main.
      - All of them get work from a task pool, and can generate new work.

  67. StarSs → OmpSs → Overview & syntax. Memory model: two "modes" are allowed:
      - Pure SMP:
        - a single address space;
        - the standard OpenMP memory model is used.
      - Non-SMP (cluster, GPUs, ...):
        - multiple address spaces exist;
        - the same data may exist in several of them;
        - data consistency is ensured by the implementation.

  68. StarSs → OmpSs → Overview & syntax. Main element: the task, the unit of computation. A task is defined either by pragmas inlined in the code or by pragmas attached to a function definition:

      #pragma omp task
      void foo (int Y[size], int size)
      {
        int j;
        for (j=0; j<size; j++) Y[j] = j;
      }

      int main()
      {
        int X[100];
        foo (X, 100);
      }

  69. StarSs → OmpSs → Overview & syntax. Defining dependences. Clauses that express the data direction: input, output, inout. The dependences are computed at runtime, taking these clauses into account:

      #pragma omp task output( x )
      x = 5;

      #pragma omp task input( x )
      printf("%d\n", x);

      #pragma omp task inout( x )
      x++;

      #pragma omp task input( x )
      printf("%d\n", x);

      [Figure: the task graph generated for these four tasks]
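      As an aside that is not on the slides: the point-to-point dependences OmpSs expresses with input/output/inout were later adopted by standard OpenMP (4.0 and newer) through the depend clause, so an equivalent of this example can be compiled today with any recent gcc or clang using -fopenmp. The sketch below is such a rewrite, offered only for comparison; it is standard OpenMP, not OmpSs syntax.

      #include <stdio.h>

      int main(void)
      {
          int x = 0;
          #pragma omp parallel
          #pragma omp single
          {
              #pragma omp task depend(out: x)    /* like output(x) */
              x = 5;

              #pragma omp task depend(in: x)     /* like input(x)  */
              printf("%d\n", x);

              #pragma omp task depend(inout: x)  /* like inout(x)  */
              x++;

              #pragma omp task depend(in: x)     /* like input(x)  */
              printf("%d\n", x);

              #pragma omp taskwait
          }
          return 0;
      }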

  70. StarSs → OmpSs → Overview & syntax. Heterogeneity: the target directive, used to specify device-specific information:

      #pragma omp target [ clauses ]

      Clauses:
      - device: which device (smp, gpu);
      - copy_in, copy_out, copy_inout: the data to be moved in and out;
      - implements: specifies alternate implementations.

      #pragma omp target device (smp)
      #pragma omp task input (Y)
      void foo (int Y[size], int size)
      {
        int j;
        for (j=0; j<size; j++) Y[j] = j;
      }

      int main()
      {
        int X[100];
        foo (X, 100);
      }

  71. StarSs → OmpSs → Overview & syntax. Synchronization: #pragma omp taskwait
      - Suspends the current task until all of its children tasks have completed.
      - Only the direct children, not all the descendants.

      void traverse_list ( List l )
      {
        Element e;
        for ( e = l->first; e; e = e->next )
          #pragma omp task
          process ( e );
        #pragma omp taskwait
      }

      [Figure: the generated tasks (1, 2, 3, 4, ...) all joining at the taskwait]

  72. StarSs → OmpSs → Overview & syntax. Hierarchical task graph: nesting.

      #pragma omp task input([BS][BS]A, [BS][BS]B) \
                       inout([BS][BS]C)
      void small_dgemm(float *C, float *A, float *B);

      #pragma omp task input([N][N]A, [N][N]B) \
                       inout([N][N]C)
      void block_dgemm(float *C, float *A, float *B){
        int i, j, k;
        for (i=0; i< N; i+=BS)
          for (j=0; j< N; j+=BS)
            for (k=0; k< N; k+=BS)
              small_dgemm(&C[i][j], &A[i][k], &B[k][j]);
      }

      main() {
        ...
        block_dgemm(A,B,C);
        block_dgemm(D,E,F);
        #pragma omp taskwait
      }

  73. StarSs → OmpSs → Compiler. Mercurium:
      - Plays a minor role: it recognizes the constructs and transforms them into calls to the runtime.
      - Manages the code restructuring for the different target devices:
        - device-specific handlers;
        - may generate code in a separate file;
        - invokes different back-end compilers (e.g., nvcc for NVIDIA).
