Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms - PowerPoint PPT Presentation



SLIDE 1

Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms
Enrique S. Quintana-Ortí
INRIA-Sophia Antipolis, June 2011

SLIDE 2

Index
  • The libflame library
  • GPU support
  • The StarSs framework

SLIDE 3

Disclaimer

Not a course on low-level programming of dense linear algebra kernels on GPUs!
  • V. Volkov, J. Demmel, "Benchmarking GPUs to tune dense linear algebra", SC08
  • L.-S. Chien, "Hand-tuned SGEMM on GT200 GPU", TR, Tsing Hua Univ., Taiwan

Sorry if the performance numbers of some "products" do not look so good…

SLIDE 4

Initial considerations
  • No low-level programming: assume there is a tuned BLAS for GPU/general-purpose cores
  • High-level approach: minimal changes to existing codes/solutions
  • Abstract the programmer from data transfers, without losing performance
  • Runtime-based solution for multi-GPU platforms

SLIDE 5

Index
  • The libflame library
      1. A user's view
      2. Creating your own algorithm
      3. FLAME runtime
      4. Clusters of GPUs
  • The StarSs framework

SLIDE 6

Index
  • The libflame library
      1. A user's view
         1. Introduction
         2. Configuration
         3. Operation status
         4. Examples
      2. Creating your own algorithm
      3. FLAME runtime
      4. Clusters of GPUs
  • The StarSs framework

SLIDE 7

libflame → A user's view → Introduction

What is libflame? A library? A framework? A repository of algorithmic variants?

SLIDE 8

libflame → A user's view → Introduction

Who are the targets?
  Newcomers
  Current users of BLAS, LAPACK
  Some PLAPACK/ScaLAPACK users whose problems are too small

SLIDE 9

libflame → A user's view → Introduction

Current coverage
  Level-1, -2, and -3 BLAS (no banded support)
  Basic LAPACK operations (no SVD/EVD support yet)
  Operations not implemented by LAPACK (example: up-and-downdating)
  Supporting utility functions

SLIDE 10

libflame → A user's view → Introduction

Key features
  Formally derived algorithms
  Object-based API (abstraction!)
  High performance
  Storage: column-major, row-major, AND general (or a mixture!)
  Dependency-aware multithreading
  Build system(s): GNU and Windows

SLIDE 11

libflame → A user's view → Introduction

Where is libflame available?
  FLAME website: www.cs.utexas.edu/users/flame/libflame/
  Milestone releases and nightly snapshots
  Available as free software under the LGPL

SLIDE 12

libflame → A user's view → Configuration

libflame for GNU/Linux
  Software requirements: C compiler, GNU bash (2.0 or later), GNU make, BLAS library, OpenMP-aware C compiler (optional)
  Build system: ./configure; make; make install

SLIDE 13

libflame → A user's view → Configuration

libflame for Windows
  Software requirements: C compiler, Python (2.6 or later), Microsoft nmake, BLAS library, OpenMP-aware C compiler (optional)
  Build system: .\configure.cmd; nmake; nmake install

SLIDE 14

libflame → A user's view → Configuration

Options
  Static/dynamic library generation
  lapack2flame compatibility layer
  Multithreading: POSIX threads or OpenMP
  GPU support
  Memory alignment
  Internal error checking

SLIDE 15

libflame → A user's view → Operation status

A sampling of functionality

  Operation                       Classic FLAME   FLASH/SM   lapack2flame
  Level-3 BLAS                    y               y          n/a
  Cholesky                        y               y          y
  LU with partial pivoting        y               y          y
  LU with incremental pivoting    y               y          *
  QR (UT)                         y               y          y
  LQ (UT)                         y               y          y
  SPD/HPD inversion               y               y          y
  Triangular inversion            y               y          y
  Triangular Sylvester            y               y          y
  Lyapunov                        y               y          y
  Up-and-downdate (UT)            y               y          *
  SVD                             planned
  EVD                             planned

  * Not present in LAPACK

SLIDE 16

libflame → A user's view → Examples

Types of interfaces
  Classic: flat matrix type; no task-level parallelism; suitable for multithreaded BLAS
  FLASH: hierarchical matrix type; task-level parallelism via SuperMatrix (sequential execution is optional)

SLIDE 17

libflame → A user's view → Examples

External buffers

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  FLA_Init();
  FLA_Obj_create_without_buffer( FLA_DOUBLE, m, m, &A );
  FLA_Obj_attach_buffer( buffer, rs, cs, &A );
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );
  FLA_Obj_free_without_buffer( &A );
  FLA_Finalize();

SLIDE 18

libflame → A user's view → Examples

Native objects

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  FLA_Init();
  FLA_Obj_create( FLA_DOUBLE, m, m, 0, 0, &A );
  FLA_Copy_buffer_to_object( FLA_NO_TRANSPOSE, m, m, buffer, rs, cs, 0, 0, A );
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );
  FLA_Obj_free( &A );
  FLA_Finalize();

SLIDE 19

libflame → A user's view → Examples

Storage-by-blocks (sequential)

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  // Obtain storage blocksize b
  FLA_Init();
  FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
  FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
  FLASH_Queue_disable();
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
  FLASH_Obj_free( &A );
  FLA_Finalize();

SLIDE 20

libflame → A user's view → Examples

Storage-by-blocks (multithreaded)

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  // Obtain storage blocksize and # of threads: b, n_threads
  FLA_Init();
  FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
  FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
  FLASH_Obj_free( &A );
  FLA_Finalize();

SLIDE 21

libflame → A user's view → Examples

Solving a system via LU

  FLA_Obj A, p, b, x;
  FLA_Obj_create( FLA_DOUBLE, m, m, 0, 0, &A );
  FLA_Obj_create( FLA_DOUBLE, m, 1, 0, 0, &b );
  FLA_Obj_create( FLA_DOUBLE, m, 1, 0, 0, &x );
  FLA_Obj_create( FLA_INT,    m, 1, 0, 0, &p );
  // Initialize A, b, x
  FLA_LU_piv( A, p );
  FLA_LU_piv_solve( A, p, b, x );
  FLA_Obj_free( &A ); FLA_Obj_free( &b );
  FLA_Obj_free( &x ); FLA_Obj_free( &p );

SLIDE 22

libflame → A user's view → Examples

Solving a system via QR

  FLA_Obj A, T, b, x;
  FLA_Obj_create( FLA_DOUBLE, m, n, 0, 0, &A );
  FLA_Obj_create( FLA_DOUBLE, m, 1, 0, 0, &b );
  FLA_Obj_create( FLA_DOUBLE, n, 1, 0, 0, &x );
  // Initialize A, b, x
  FLA_QR_UT_create_T( A, &T );
  FLA_QR_UT( A, T );
  FLA_QR_UT_solve( A, T, b, x );
  FLA_Obj_free( &A ); FLA_Obj_free( &b ); FLA_Obj_free( &x );

SLIDE 23

libflame → A user's view → Performance

Up-and-downdate [performance figure]

SLIDE 24

Index
  • The libflame library
      1. A user's view
      2. Creating your own algorithm
         1. FLAME notation and algorithms
         2. Spark: from algorithm to code
         3. Running on multicore
      3. FLAME runtime
      4. Clusters of GPUs
  • The StarSs framework

SLIDE 25

libflame → Creating your own algorithm

The Cholesky factorization

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  FLA_Init();
  FLA_Obj_create_without_buffer( FLA_DOUBLE, m, m, &A );
  FLA_Obj_attach_buffer( buffer, rs, cs, &A );
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );
  FLA_Obj_free_without_buffer( &A );
  FLA_Finalize();

SLIDE 26

libflame → Creating your own algorithm

The Cholesky factorization, lower triangular case:

  A = L * L^T

Key in the solution of s.p.d. linear systems:

  A x = b  ≡  (L L^T) x = b
  L y = b   ⇒  y
  L^T x = y ⇒  x

SLIDE 27

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm: partition the matrix and its factor as

  A = | α11   *   |        L = | λ11   0   |
      | a21   A22 |            | l21   L22 |

Then A = L * L^T is equivalent to

  α11 = λ11^2                    →  λ11 = sqrt(α11)
  a21 = l21 λ11                  →  l21 = a21 / λ11
  A22 = L22 L22^T + l21 l21^T    →  (A22 - l21 l21^T) = L22 L22^T

SLIDE 28

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm loop: repartition + operation + merging

  Repartition
    | ATL   *   |   →   | A00    *    *   |
    | ABL   ABR |       | a10^T  α11  *   |
                        | A20    a21  A22 |
  Operation
    α11 := sqrt(α11)
    a21 := a21 / α11
    A22 := A22 - a21 a21^T
  Merging: fold the 3x3 blocks back into the 2x2 partitioning

SLIDE 29

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm loop: repartition

  | ATL   *   |   →   | A00    *    *   |
  | ABL   ABR |       | a10^T  α11  *   |
                      | A20    a21  A22 |

  Indexing operations

SLIDE 30

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm loop: operation

  α11 := sqrt(α11)
  a21 := a21 / α11
  A22 := A22 - a21 a21^T

  Real computation

SLIDE 31

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm loop: merging

  | A00    *    *   |         | ATL   *   |
  | a10^T  α11  *   |   →     | ABL   ABR |
  | A20    a21  A22 |

  Indexing operation
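Tying the loop above together, here is a minimal plain-C sketch of the unblocked lower-triangular Cholesky just derived; the column-major layout, leading dimension lda and function name are illustrative assumptions, not libflame API:

  #include <math.h>

  /* Unblocked Cholesky, lower triangular, in place.
     A is n x n, column-major, leading dimension lda.              */
  void chol_unb_lower( int n, double *A, int lda )
  {
      for ( int k = 0; k < n; k++ ) {
          A[ k + k*lda ] = sqrt( A[ k + k*lda ] );       /* alpha11 := sqrt(alpha11) */
          for ( int i = k+1; i < n; i++ )
              A[ i + k*lda ] /= A[ k + k*lda ];          /* a21 := a21 / alpha11     */
          for ( int j = k+1; j < n; j++ )                /* A22 := A22 - a21 a21^T   */
              for ( int i = j; i < n; i++ )              /* (lower triangle only)    */
                  A[ i + j*lda ] -= A[ i + k*lda ] * A[ j + k*lda ];
      }
  }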

SLIDE 32

libflame → Creating your own algorithm → FLAME notation and algorithms

Algorithm: automatic development from the mathematical specification A = L * L^T; a mechanical procedure.

SLIDE 33

libflame → Creating your own algorithm → Spark: from algorithm to code

APIs: Spark + APIs for C, F77, Matlab, LabView, LaTeX

SLIDE 34

libflame → Creating your own algorithm → Spark: from algorithm to code

Spark website: http://www.cs.utexas.edu/users/flame/Spark/

SLIDE 35

libflame → Creating your own algorithm → Spark: from algorithm to code

Example: FLAME@lab (indexing operations generated by Spark)

  [ ATL, ATR,...
    ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' );
  while ( size( ATL, 1 ) < size( A, 1 ) )
    [ A00,  a01,     A02,...
      a10t, alpha11, a12t,...
      A20,  a21,     A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR,...
                                                    ABL, ABR,...
                                                    1, 1, 'FLA_BR' );
    %----------------------------------------%
    %  :                                     %
    %----------------------------------------%
    [ ATL, ATR,...
      ABL, ABR ] = FLA_Cont_with_3x3_to_2x2( A00,  a01,     A02,...
                                             a10t, alpha11, a12t,...
                                             A20,  a21,     A22,...
                                             'FLA_TL' );
  end

  Indexing operations

SLIDE 36

libflame → Creating your own algorithm → Spark: from algorithm to code

Example: FLAME@lab (manually fill in the operations)

  [...] = FLA_Part_2x2(...);
  while ( size( ATL, 1 ) < size( A, 1 ) )
    [...] = FLA_Repart_2x2_to_3x3(...);
    %----------------------------------------%
    alpha11 = sqrt( alpha11 );
    a21     = a21 / alpha11;
    A22     = A22 - tril( a21*a21' );
    %----------------------------------------%
    [...] = FLA_Cont_with_3x3_to_2x2(...);
  end

  Real computation

SLIDE 37

libflame → Creating your own algorithm → Running on multicore

Example: FLAMEC (skeleton)

  FLA_Part_2x2( A,   &ATL, &ATR,
                     &ABL, &ABR,    0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00,  /**/ &a01,     &A02,
                        /* ************* */ /* ************************** */
                                              &a10t, /**/ &alpha11, &a12t,
                           ABL, /**/ ABR,     &A20,  /**/ &a21,     &A22,
                           1, 1, FLA_BR );
    /*--------------------------------------*/
    /*  :                                   */
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00,  a01,     /**/ A02,
                                                  a10t, alpha11, /**/ a12t,
                           /* ************** */ /* ************************/
                              &ABL, /**/ &ABR,    A20,  a21,     /**/ A22,
                              FLA_TL );
  }

SLIDE 38

libflame → Creating your own algorithm → Running on multicore

Example: FLAMEC (real computation filled in; libflame employs an external BLAS: GotoBLAS, MKL, ACML, ATLAS, netlib)

  FLA_Part_2x2(...);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(...);
    /*--------------------------------------*/
    FLA_Sqrt( alpha11 );
    FLA_Inv_scal( alpha11, a21 );
    FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
             FLA_MINUS_ONE, a21, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(...);
  }

SLIDE 39

Index
  • The libflame library
      1. A user's view
      2. Creating your own algorithm
      3. FLAME runtime
         1. Task parallelism
         2. SuperMatrix
         3. GPU support
      4. Clusters of GPUs
  • The SMPSs/GPUSs framework

SLIDE 40

Data-flow parallelism? Dynamic scheduling? Run-time?

Surely not a new idea…
  Cilk, StarSs (GridSs), StarPU, …

"An Efficient Algorithm for Exploiting Multiple Arithmetic Units", R. M. Tomasulo, IBM J. of R&D, Volume 11, Number 1, Page 25 (1967)
  The basis for the exploitation of ILP in current superscalar processors!

SLIDE 41

The TEXT project
  • Towards Exaflop applicaTions
  • Demonstrate that hybrid MPI/SMPSs addresses the exascale challenges in a productive and efficient way
  • Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC
  • Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms
  • Develop additional environment capabilities
      • tools (debug, performance)
      • improvements in runtime systems (load balance and GPUSs)
  • Support other users
      • identify users of TEXT applications
      • identify and support interested application developers
  • Contribute to standards (OpenMP ARB, PERI-XML)

SLIDE 42

libflame → FLAME runtime → Task parallelism

Blocked algorithms: Cholesky factorization

  A11  =  L11 * L11^T
  A21 :=  L21  =  A21 * L11^-T
  A22 :=  A22 - L21 * L21^T
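For reference, the same three block updates expressed with standard CBLAS/LAPACKE calls; a minimal sketch assuming column-major storage, a square matrix of order n, block size nb, and no error checking (this is not libflame code; the libflame version appears on the next slide):

  #include <cblas.h>
  #include <lapacke.h>

  /* Blocked Cholesky, lower triangular, in place (column-major, leading dimension lda). */
  void chol_blk_lower( int n, int nb, double *A, int lda )
  {
      for ( int k = 0; k < n; k += nb ) {
          int b = ( n - k < nb ) ? n - k : nb;
          double *A11 = &A[ k     +  k   *lda ];
          double *A21 = &A[ k + b +  k   *lda ];
          double *A22 = &A[ k + b + (k+b)*lda ];
          /* A11 = L11 * L11^T */
          LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'L', b, A11, lda );
          if ( k + b < n ) {
              int m = n - k - b;
              /* A21 := A21 * L11^-T */
              cblas_dtrsm( CblasColMajor, CblasRight, CblasLower, CblasTrans,
                           CblasNonUnit, m, b, 1.0, A11, lda, A21, lda );
              /* A22 := A22 - L21 * L21^T */
              cblas_dsyrk( CblasColMajor, CblasLower, CblasNoTrans,
                           m, b, -1.0, A21, lda, 1.0, A22, lda );
          }
      }
  }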

SLIDE 43

libflame → FLAME runtime → Task parallelism

Blocked algorithms: Cholesky factorization, A = L * L^T (APIs + tools)

  FLA_Part_2x2(...);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(...);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(...);
  }

SLIDE 44

libflame → FLAME runtime → Task parallelism

Blocked algorithms: simple parallelization: link with a multithreaded BLAS

  A11  =  L11 * L11^T
  A21 :=  L21  =  A21 * L11^-T
  A22 :=  A22 - L21 * L21^T

  FLA_Part_2x2(...);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(...);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(...);
  }

SLIDE 45

libflame → FLAME runtime → Task parallelism

Blocked algorithms: there is more parallelism!
  Inside the same iteration
  In different iterations

SLIDE 46

libflame → FLAME runtime → SuperMatrix

Exploiting task-level parallelism
  • SuperMatrix: automatic identification of tasks/dependencies

  [Figure: SuperMatrix task dependency graph, tasks 1-10]

SLIDE 47

libflame → FLAME runtime → SuperMatrix

Exploiting task-level parallelism
  • SuperMatrix: automatic identification of tasks/dependencies. HOW?
  • Input/output/input-output operands and the order of the operations in the code determine the dependencies
  • The direction of the operands is defined as part of the BLAS specification

  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
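A toy sketch of the idea (not the actual SuperMatrix implementation, and with made-up names): each queued task lists its operands and their direction, and a "last writer" record per block yields the flow dependencies.

  #include <stdio.h>

  #define NBLOCKS 4

  /* Hypothetical illustration of dependence detection from operand direction.
     inout operands are omitted for brevity.                                   */
  typedef struct { int id; int in[2]; int n_in; int out; } task_t;

  int main( void )
  {
      /* One Cholesky iteration on blocks: 0 = A11, 1 = A21, 2 = A22          */
      task_t tasks[] = {
          { 0, { -1 }, 0, 0 },   /* Chol(A11):      writes A11                */
          { 1, {  0 }, 1, 1 },   /* Trsm(A11, A21): reads A11, writes A21     */
          { 2, {  1 }, 1, 2 },   /* Syrk(A21, A22): reads A21, writes A22     */
      };
      int last_writer[NBLOCKS] = { -1, -1, -1, -1 };

      for ( int t = 0; t < 3; t++ ) {
          for ( int i = 0; i < tasks[t].n_in; i++ ) {
              int w = last_writer[ tasks[t].in[i] ];
              if ( w >= 0 )
                  printf( "task %d depends on task %d\n", tasks[t].id, w );
          }
          last_writer[ tasks[t].out ] = tasks[t].id;   /* record the new writer */
      }
      return 0;
  }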

SLIDE 48

libflame → FLAME runtime → SuperMatrix

Exploiting task-level parallelism
  • SuperMatrix: scheduling of tasks to cores

  [Figure: SuperMatrix task dependency graph, tasks 1-10]

SLIDE 49

libflame → FLAME runtime → SuperMatrix

Exploiting task-level parallelism
  • SuperMatrix: scheduling of tasks to cores. HOW?
  • List of ready tasks; one thread per core
      1. Centralized list
      2. One list per thread
      3. One list per thread and work-stealing

  [Figure: SuperMatrix task dependency graph, tasks 1-10]

SLIDE 50

libflame → FLAME runtime → GPU support

Single GPU
  • SuperMatrix: dealing with data transfers between host (CPU) and device (GPU) memory spaces

  [Figure: SuperMatrix task dependency graph, tasks 1-10]

SLIDE 51

libflame → FLAME runtime → GPU support

Single GPU: a user's view

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  // Obtain storage blocksize, # of threads: b, n_threads
  FLA_Init();
  FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
  FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Queue_enable_gpu();
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
  FLASH_Obj_free( &A );
  FLA_Finalize();

SLIDE 52

libflame → FLAME runtime → GPU support

Single GPU: under the cover (indexing operations, with addresses in device memory)

  FLA_Part_2x2(...);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(...);
    /*--------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(...);
  }

SLIDE 53

libflame → FLAME runtime → GPU support

Single GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

Real computation: the runtime keeps track of data in host/device memory and performs the necessary transfers, reducing the number of copies.

SLIDE 54

libflame → FLAME runtime → GPU support

Single GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  1. Copy the matrix to GPU memory before the algorithm commences

SLIDE 55

libflame → FLAME runtime → GPU support

Single GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  2. Copy block A11 from device to host before its factorization

SLIDE 56

libflame → FLAME runtime → GPU support

Single GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  3. Copy block A11 from host to device before using it in subsequent computations
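A minimal CUDA-style sketch of the traffic pattern in steps 1-3 for one iteration; the kernel calls are only indicated by comments, and all names are illustrative rather than libflame internals:

  #include <cuda_runtime.h>

  /* One iteration of the hybrid blocked Cholesky, transfers only.
     d_A11 lives in device memory (step 1 copied the whole matrix there);
     h_A11 is a host staging buffer for the b x b diagonal block.          */
  void chol_iteration_transfers( double *d_A11, double *h_A11, size_t block_bytes )
  {
      /* Step 2: diagonal block to the host, to be factorized on the CPU.  */
      cudaMemcpy( h_A11, d_A11, block_bytes, cudaMemcpyDeviceToHost );
      /* ... CPU Cholesky of h_A11 (e.g., LAPACK dpotrf) ...               */

      /* Step 3: factored block back to the device, where the TRSM and     */
      /* SYRK updates run as GPU kernels (e.g., CUBLAS).                   */
      cudaMemcpy( d_A11, h_A11, block_bytes, cudaMemcpyHostToDevice );
      /* ... cublasDtrsm on A21, cublasDsyrk on A22 ...                    */
  }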

SLIDE 57

libflame → FLAME runtime → GPU support

Multi-GPU
  • SuperMatrix: dealing with data transfers between host (CPU) and device (GPU) memory spaces

  [Figure: SuperMatrix task dependency graph, tasks 1-10]

SLIDE 58

libflame → FLAME runtime → GPU support

Multi-GPU: how do we program these?

  [Figure: CPU(s) connected via PCI-e buses to GPU #0-#3, plus interconnect]

SLIDE 59

libflame → FLAME runtime → GPU support

Multi-GPU: a user's view

  [The code is identical to the single-GPU user's view on SLIDE 51]

SLIDE 60

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover. Naïve approach:

  [Figure: CPU(s), PCI-e buses, GPU #0-#3, interconnect]

SLIDE 61

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover. How do we program these? View the system as a shared-memory multiprocessor + DSM.

  [Figure: CPU(s), PCI-e buses, GPU #0-#3, interconnect]

SLIDE 62

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover. View the system as a shared-memory multiprocessor (a multi-core processor, but without hw. cache coherence).

  [Figure: MP with P0+C0, P1+C1, P2+C2, P3+C3 mapped onto CPU(s), PCI-e buses, GPU #0-#3]

SLIDE 63

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover. Software Distributed-Shared Memory (DSM)
  Software: flexibility vs. efficiency
  Underlying distributed memory hidden from the users
  Reduce memory transfers using write-back, write-invalidate, …
  Well-known approach, not too efficient as middleware for general applications
  The regularity of dense linear algebra operations makes a difference!
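To make the write-back / write-invalidate vocabulary concrete, here is a toy software-coherence state machine for one block replicated across device memories; it is purely illustrative (an MSI-style protocol), not the libflame runtime:

  typedef enum { INVALID, SHARED, MODIFIED } state_t;

  typedef struct {
      state_t state[4];        /* one state per GPU copy of the block      */
  } block_t;

  /* GPU g wants to read the block: any MODIFIED copy elsewhere is written
     back to the host first, then g gets a SHARED copy.                    */
  void acquire_read( block_t *b, int g, int ngpus )
  {
      for ( int i = 0; i < ngpus; i++ )
          if ( i != g && b->state[i] == MODIFIED ) {
              /* ... write back device i -> host ... */
              b->state[i] = SHARED;
          }
      if ( b->state[g] == INVALID ) {
          /* ... copy host -> device g ... */
          b->state[g] = SHARED;
      }
  }

  /* GPU g wants to write the block: all other copies are invalidated
     (write-invalidate); the new value stays on g until needed elsewhere
     (write-back).                                                         */
  void acquire_write( block_t *b, int g, int ngpus )
  {
      for ( int i = 0; i < ngpus; i++ )
          if ( i != g ) b->state[i] = INVALID;
      b->state[g] = MODIFIED;
  }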

SLIDE 64

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover. Reduce the number of data transfers.

  [Figure: CPU(s), PCI-e buses, GPU #0-#3, interconnect, with transfer annotations]

SLIDE 65

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  1. Distribute the matrix among the GPU memories (2-D workload distribution) before the algorithm commences

SLIDE 66

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  1. Distribute the matrix among the GPU memories (2-D workload distribution): owner-computes rule

  [Figure: matrix blocks mapped cyclically to GPU #0, #1, #2, #3]
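A small sketch of a 2-D cyclic block-to-GPU mapping of the kind the owner-computes rule relies on; the 2 x 2 GPU grid and the mapping function are illustrative assumptions, not the actual libflame policy:

  /* Map block (i, j) of the matrix to one of grid_rows x grid_cols GPUs,
     2-D cyclically; the GPU that owns a block performs the updates on it. */
  int block_owner( int i, int j, int grid_rows, int grid_cols )
  {
      return ( i % grid_rows ) * grid_cols + ( j % grid_cols );
  }

  /* Example: with a 2 x 2 grid of GPUs, block (2, 3) lives on GPU
     (2 % 2) * 2 + (3 % 2) = 1.                                            */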

SLIDE 67

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  2. Copy block A11 from the corresponding device to the host before its factorization

SLIDE 68

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  3. Broadcast block A11 from the host to the appropriate devices before using it in subsequent computations (write-update)

SLIDE 69

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  4. Keep A11 in the receiving device(s) in case it is needed in subsequent computations (cache)

SLIDE 70

libflame → FLAME runtime → GPU support

Multi-GPU: under the cover (SuperMatrix)

  [Same blocked FLASH code as on SLIDE 52]

  5. Keep the updated A21 in the device until it is replaced (write-back)

SLIDE 71

libflame → FLAME runtime → GPU support

Performance [figure]

SLIDE 72

libflame → FLAME runtime → GPU support

Performance [figure]

SLIDE 73

libflame → FLAME runtime → GPU support

Performance [figure]

SLIDE 74

Index
  • The libflame library
      1. A user's view
      2. Creating your own algorithm
      3. FLAME runtime
      4. Clusters of GPUs
         1. DLA for clusters
         2. Host-centric view
         3. Device-centric view
  • The StarSs framework

SLIDE 75

libflame → Clusters of GPUs → DLA for clusters

libflame-like libraries

  PLAPACK (UT Austin)
    Use of objects (PLA_Obj): vectors, matrices, projected vectors, etc., with the layout embedded
    PMB distribution
    Layered and modular design: all communication is done via copies (PLA_Copy) and reductions (PLA_Reduce) from one object type to another

  Elemental (Jack Poulson)
    Based on PLAPACK, but C++
    Element-wise cyclic data layout

SLIDE 76

libflame → Clusters of GPUs → Host-centric view

Data in host memory
  Before executing a kernel, copy the input data to GPU memory
  After execution, retrieve the results back to the node's main memory
  Easy to program (wrappers to kernels)
  Copies linked to kernel execution: O(n^3) transfers between CPU and GPU

  [Figure: CPU(s), PCI-e buses, GPU #0-#3, interconnect]

SLIDE 77

libflame → Clusters of GPUs → Device-centric view

Data in GPU memory
  Before sending a piece of data, retrieve it back to the node's main memory (compacting on the fly)
  After reception, copy the contents to GPU memory
  Easy to program (wrappers to MPI calls)
  Copies linked to communication, not kernel execution: O(n^2) transfers between CPU and GPU

  [Figure: CPU(s), PCI-e buses, GPU #0-#3, interconnect]
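A hedged sketch of the two views for a single block; the helper names are invented for illustration. In the host-centric view every kernel invocation is bracketed by copies, while in the device-centric view only the MPI communication is:

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Host-centric: data lives on the host; copies surround each kernel.       */
  void hostcentric_update( double *h_blk, double *d_blk, size_t bytes )
  {
      cudaMemcpy( d_blk, h_blk, bytes, cudaMemcpyHostToDevice );
      /* ... launch the GPU kernel (e.g., a CUBLAS update) on d_blk ...        */
      cudaMemcpy( h_blk, d_blk, bytes, cudaMemcpyDeviceToHost );
  }

  /* Device-centric: data lives on the GPU; copies surround each MPI send.     */
  void devicecentric_send( double *d_blk, double *h_stage, size_t bytes,
                           int count, int dest, MPI_Comm comm )
  {
      cudaMemcpy( h_stage, d_blk, bytes, cudaMemcpyDeviceToHost );
      MPI_Send( h_stage, count, MPI_DOUBLE, dest, /* tag */ 0, comm );
      /* all kernels operate on d_blk in place, with no further copies         */
  }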

SLIDE 78

libflame → Clusters of GPUs

Performance [figure, with speedups of 5x, 10x, and 22x highlighted]

SLIDE 79

libflame → Clusters of GPUs

Performance [figure]

SLIDE 80

Acknowledgements: funding sources [logos]

SLIDE 81

Further information

  Contact: field@cs.utexas.edu
  FLAME project website: www.cs.utexas.edu/users/flame/

  libflame: The Complete Reference
    www.cs.utexas.edu/users/field/docs/  (updated nightly)
    www.lulu.com/content/5915632         (updated occasionally)

SLIDE 82

Index
  • The libflame library
  • GPU support
  • The StarSs framework

SLIDE 83

The TEXT project
  • Towards Exaflop applicaTions
  • Demonstrate that hybrid MPI/SMPSs addresses the exascale challenges in a productive and efficient way
  • Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC
  • Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms
  • Develop additional environment capabilities
      • tools (debug, performance)
      • improvements in runtime systems (load balance and GPUSs)
  • Support other users
      • identify users of TEXT applications
      • identify and support interested application developers
  • Contribute to standards (OpenMP ARB, PERI-XML)

SLIDE 84

Index
  • The libflame library
  • The StarSs framework
      1. StarSs overview
      2. OmpSs

Slides from Rosa M. Badia, Barcelona Supercomputing Center. Thanks!

SLIDE 85

StarSs → StarSs overview

Programming model

  Sequential application:
    ...
    for (i=0; i<N; i++){
      T1 (data1, data2);
      T2 (data4, data5);
      T3 (data2, data5, data6);
      T4 (data7, data8);
      T5 (data6, data8, data9);
    }
    ...

  Task selection + parameter direction (input, output, inout)
  Task graph creation based on data precedence
  Scheduling, data transfer, task execution on parallel resources (multicore, SMP, cluster, cloud, grid)
  Synchronization, results transfer

  [Figure: task graph of T1..T5 instances mapped onto Resource 1 .. Resource N]

SLIDE 86

StarSs → StarSs overview

Programming model

  StarSs instantiations: CellSs, SMPSs, GPUSs, GridSs, ClearSpeedSs, ClusterSs, OmpSs, COMPSs

StarSs
  • A "node"-level programming model
  • Sequential C/Fortran/Java + annotations
  • Task based; asynchrony, data-flow
  • "Simple" linear address space
  • Directionality annotations on task arguments
  • Nicely integrates in hybrid MPI/StarSs
  • Natural support for heterogeneity: @ SMP, @ GPU, @ Cluster

Programmability/Portability
  • Incremental parallelization/restructuring
  • Separate the algorithm from the resources
  • Disciplined programming
  • "Same" source code runs on "any" machine; optimized task implementations will result in better performance

Performance
  • Intelligent runtime: automatically extracts and exploits parallelism (dataflow, workflow); matches computations to specific resources on each type of target platform
  • Asynchronous (data-flow) execution and locality awareness

SLIDE 87

StarSs → StarSs overview

A sequential program…

  void vadd3     (float A[BS], float B[BS], float C[BS]);
  void scale_add (float sum, float A[BS], float B[BS]);
  void accum     (float A[BS], float *sum);

  for (i=0; i<N; i+=BS)   // C = A+B
    vadd3 ( &A[i], &B[i], &C[i]);
  ...
  for (i=0; i<N; i+=BS)   // sum(C[i])
    accum (&C[i], &sum);
  ...
  for (i=0; i<N; i+=BS)   // B = sum*A
    scale_add (sum, &E[i], &B[i]);
  ...
  for (i=0; i<N; i+=BS)   // A = C+D
    vadd3 (&C[i], &D[i], &A[i]);
  ...
  for (i=0; i<N; i+=BS)   // E = G+F
    vadd3 (&G[i], &F[i], &E[i]);

SLIDE 88

StarSs → StarSs overview

A sequential program… taskified…

  #pragma css task input(A, B) output(C)
  void vadd3     (float A[BS], float B[BS], float C[BS]);
  #pragma css task input(sum, A) inout(B)
  void scale_add (float sum, float A[BS], float B[BS]);
  #pragma css task input(A) inout(sum)
  void accum     (float A[BS], float *sum);

  [Same loops as on SLIDE 87]

  Dependences are computed at task instantiation time
  [Figure: task graph; color/number give the order of task instantiation; some antidependences covered by flow dependences are not drawn]

SLIDE 89

StarSs → StarSs overview

A sequential program… taskified… with data-flow execution

  [Same pragmas and loops as on SLIDES 87-88]

  Write vs. execute: decouple how we write the program from how it is executed
  [Figure: task graph; color/number give a possible order of task execution]

SLIDE 90

StarSs → StarSs overview

The potential of data-access information

  Flexibility to dynamically traverse the dataflow graph, "optimizing"
    Concurrency, critical path
  Memory accesses: data transfers performed by the runtime
  Opportunities for
    Prefetch
    Reuse
    Eliminating antidependences (renaming)
    Replication management
  Coherency/consistency handled by the runtime
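A small illustration of the renaming opportunity, with made-up task names: the second task only has a write-after-read (anti) dependence on buffer A, so a runtime that gives it a fresh, renamed copy can run both tasks concurrently.

  #include <stdlib.h>

  /* Toy illustration (names invented): "task 1" reads A and writes B,
     "task 2" overwrites A.  Without renaming, task 2 must wait for
     task 1.  With renaming, task 2 writes into a fresh buffer, and the
     new version becomes "A" for later tasks.                            */
  void tasks_with_renaming( const float *A, float *B, float **A_next, int n )
  {
      float *A_new = malloc( n * sizeof(float) );          /* renamed instance */
      for ( int i = 0; i < n; i++ ) B[i] = 2.0f * A[i];    /* task 1: reads old A  */
      for ( int i = 0; i < n; i++ ) A_new[i] = 0.0f;       /* task 2: writes new A */
      *A_next = A_new;          /* the renamed buffer is the live version of A now */
  }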

SLIDE 91

Index
  • The libflame library
  • The StarSs framework
      1. StarSs overview
      2. OmpSs
         1. Overview & syntax
         2. Compiler
         3. Runtime
         4. Examples

SLIDE 92

StarSs → OmpSs → Overview & syntax

OmpSs = OpenMP + StarSs extensions. OmpSs is based on OpenMP with some differences:
  Different execution model
  Extended memory model
  Extensions for point-to-point inter-task synchronizations: data dependencies
  Extensions for heterogeneity
  Other minor extensions

SLIDE 93

StarSs → OmpSs → Overview & syntax

Execution model: thread-pool model
  OpenMP parallel "ignored"
  All threads created on startup
  One of them starts executing main
  All get work from a task pool
  And can generate new work
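A minimal pthreads sketch of the thread-pool idea described above (a fixed set of workers pulling tasks from a shared pool); this is an illustration of the model under simple assumptions, not the NANOS++ runtime:

  #include <pthread.h>
  #include <stdio.h>

  #define NTASKS   8
  #define NTHREADS 4

  typedef struct { void (*fn)(int); int arg; } task_t;

  static task_t pool[NTASKS];
  static int next_task = 0;                       /* index of the next task   */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void work( int i ) { printf( "task %d\n", i ); }

  static void *worker( void *unused )
  {
      (void) unused;
      for (;;) {
          pthread_mutex_lock( &lock );
          int t = ( next_task < NTASKS ) ? next_task++ : -1;
          pthread_mutex_unlock( &lock );
          if ( t < 0 ) return NULL;               /* pool drained              */
          pool[t].fn( pool[t].arg );              /* execute the task          */
      }
  }

  int main( void )
  {
      pthread_t thr[NTHREADS];
      for ( int i = 0; i < NTASKS; i++ )   pool[i] = (task_t){ work, i };
      for ( int i = 0; i < NTHREADS; i++ ) pthread_create( &thr[i], NULL, worker, NULL );
      for ( int i = 0; i < NTHREADS; i++ ) pthread_join( thr[i], NULL );
      return 0;
  }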

SLIDE 94

StarSs → OmpSs → Overview & syntax

Memory model: two "modes" are allowed
  Pure SMP: a single address space; the standard OpenMP memory model is used
  Non-SMP (cluster, GPUs, ...): multiple address spaces exist; the same data may exist in several of them; data consistency is ensured by the implementation

SLIDE 95

StarSs → OmpSs → Overview & syntax

Main element: the task
  Task: unit of computation
  Task definition: pragmas inlined, or pragmas attached to a function definition

  #pragma omp task
  void foo (int Y[size], int size)
  {
    int j;
    for (j=0; j<size; j++) Y[j] = j;
  }

  int main()
  {
    int X[100];
    foo (X, 100);
  }

SLIDE 96

StarSs → OmpSs → Overview & syntax

Defining dependences
  Clauses that express the data direction: input, output, inout
  Dependences are computed at runtime taking these clauses into account

  #pragma omp task output( x )
  x = 5;                       // task 1
  #pragma omp task input( x )
  printf("%d\n", x);           // task 2
  #pragma omp task inout( x )
  x++;                         // task 3
  #pragma omp task input( x )
  printf("%d\n", x);           // task 4

SLIDE 97

StarSs → OmpSs → Overview & syntax

Heterogeneity: the target directive
  • Directive to specify device-specific information: #pragma omp target [ clauses ]
  • Clauses:
      device: which device (smp, gpu)
      copy_in, copy_out, copy_inout: data to be moved in and out
      implements: specifies alternate implementations

  #pragma omp target device (smp)
  #pragma omp task input (Y)
  void foo (int Y[size], int size)
  {
    int j;
    for (j=0; j<size; j++) Y[j] = j;
  }

  int main()
  {
    int X[100];
    foo (X, 100);
  }

SLIDE 98

StarSs → OmpSs → Overview & syntax

Synchronization
  #pragma omp taskwait
  Suspends the current task until all children tasks are completed
  Just direct children, not descendants

  void traverse_list ( List l )
  {
    Element e;
    for ( e = l->first; e; e = e->next )
      #pragma omp task
      process ( e );
    #pragma omp taskwait
  }

SLIDE 99

StarSs → OmpSs → Overview & syntax

Hierarchical task graph: nesting

  #pragma omp task input([BS][BS]A, [BS][BS]B) \
                   inout([BS][BS]C)
  void small_dgemm(float *C, float *A, float *B);

  #pragma omp task input([N][N]A, [N][N]B) \
                   inout([N][N]C)
  void block_dgemm(float *C, float *A, float *B)
  {
    int i, j, k;
    for (i=0; i< N; i+=BS)
      for (j=0; j< N; j+=BS)
        for (k=0; k< N; k+=BS)
          small_dgemm(&C[i][j], &A[i][k], &B[k][j]);
  }

  main() {
    ...
    block_dgemm(A,B,C);
    block_dgemm(D,E,F);
    #pragma omp taskwait
  }

SLIDE 100

StarSs → OmpSs → Compiler

Mercurium
  • Minor role
  • Recognizes constructs and transforms them into calls to the runtime
  • Manages code restructuring for different target devices
  • Device-specific handlers
  • May generate code in a separate file
  • Invokes different back-end compilers → nvcc for NVIDIA

SLIDE 101

StarSs → OmpSs → Runtime

Structure
  • Support for different programming models: OpenMP (OmpSs), StarSs, Chapel
  • Independent components for thread, task, and dependence handling, task scheduling, ...
  • Most of the runtime is independent of the target architecture: SMP, GPU, simulator, cluster
  • Support for heterogeneous targets: i.e., threads running tasks on regular cores and GPUs
  • Instrumentation: generation of execution traces

  [Figure: NANOS API layering over the application (StarSs, OmpSs, ...): task management, dependence management, task scheduling (policies: dep. aware, bf, local., ...), thread management, data coherence & movement, instrumentation (traces for Paraver, SimTrace), architecture interface (SMP, GPU, cluster, ...)]

SLIDE 102

StarSs → OmpSs → Runtime

Structure behaviour: task handling
  Task generation
  Data-dependence analysis
  Task scheduling

SLIDE 103

StarSs → OmpSs → Runtime

Structure behaviour: coherence support
  • Different address spaces managed with:
      A hierarchical directory
      A software cache per cluster node and per GPU
  • Data transfers between different memory spaces only when needed
  • Write-through / write-back policies

SLIDE 104

StarSs → OmpSs → Runtime

Structure behaviour: clusters
  • One runtime instance per node
  • One master image, N-1 worker images
  • Low-level communication through active messages
  • Tasks generated by the master
  • Tasks executed by worker threads in the master
  • Tasks delegated to slave nodes through the communication thread
  • Remote task execution:
      Data transfer (if necessary)
      Overlap of computation with communication
      Task execution
      Local scheduler

SLIDE 105

StarSs → OmpSs → Runtime

Structure behaviour: GPUs
  • Automatic handling of multi-GPU execution
  • Transparent data management on the GPU side (allocation, transfers, ...) and synchronization
  • One manager thread in the host per GPU, responsible for:
      Transferring data from/to the GPU
      Executing GPU tasks
      Synchronization
  • Overlap of computation and communication
  • Data pre-fetch

SLIDE 106

StarSs → OmpSs → Runtime

Structure behaviour: clusters of GPUs
  Composes the previous approaches
  Support for heterogeneity and hierarchy:
    Applications with homogeneous tasks: SMP or GPU
    Applications with heterogeneous tasks: SMP and GPU
    Applications with hierarchical and heterogeneous tasks: i.e., coarser-grain SMP tasks internally generating GPU tasks

SLIDE 107

StarSs → OmpSs → Examples

MxM on a matrix stored by blocks (NB x NB blocks of size BS x BS)

  int main (int argc, char **argv)
  {
    int i, j, k;
    ...
    initialize(A, B, C);
    for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
          mm_tile( C[i][j], A[i][k], B[k][j]);
  }

  #pragma omp task input([BS][BS]A, [BS][BS]B) \
                   inout([BS][BS]C)
  static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS])
  {
    int i, j, k;
    for (i=0; i< BS; i++)
      for (j=0; j< BS; j++)
        for (k=0; k< BS; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

  Will work on matrices of any size
  Will work on any number of cores/devices

SLIDE 108

StarSs → OmpSs → Examples

MxM on GPUs using a CUBLAS kernel

  int main (int argc, char **argv)
  {
    int i, j, k;
    ...
    initialize(A, B, C);
    for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
          mm_tile( C[i][j], A[i][k], B[k][j]);
  }

  #pragma omp target device (cuda) copy_deps
  #pragma omp task input([BS][BS]A, [BS][BS]B, BS) \
                   inout([BS][BS]C)
  void mm_tile(float *C, float *A, float *B, int BS)
  {
    char nt = 'N';
    float sone = 1.0;
    cublasSgemm(nt, nt, BS, BS, BS, sone, A, BS, B, BS, sone, C, BS);
  }

SLIDE 109

StarSs → OmpSs → Examples

MxM performance on multi-GPU [figure]
  Two Intel Xeon E5440, 4 cores
  4 Tesla S2050 GPUs

SLIDE 110

StarSs → OmpSs → Examples

MxM performance on a cluster of GPUs [figure]
  8 nodes of DAS-4:
    Two Intel Xeon E5620, 4 cores
    1 GTX 480 GPU
    QDR InfiniBand

SLIDE 111

StarSs → OmpSs → Examples

Cholesky: block matrix storage on the GPU
  • Source code independent of the # of devices
  • n = 8192; bs = 1024

  void blocked_cholesky( int NB, float *A )
  {
    int i, j, k;
    for (k=0; k<NB; k++) {
      spotrf (A[k*NB+k]);
      for (i=k+1; i<NB; i++)
        strsm (A[k*NB+k], A[k*NB+i]);
      for (i=k+1; i<NB; i++) {
        for (j=k+1; j<i; j++)
          sgemm( A[k*NB+i], A[k*NB+j], A[j*NB+i]);
        ssyrk (A[k*NB+i], A[i*NB+i]);
      }
    }
    #pragma omp taskwait
  }

  #pragma omp target device (cuda) copy_deps
  #pragma omp task inout([BS][BS]A)
  void spotrf (float *A);
  #pragma omp target device (cuda) copy_deps
  #pragma omp task input ([BS][BS]A) inout([BS][BS]C)
  void ssyrk (float *A, float *C);
  #pragma omp target device (cuda) copy_deps
  #pragma omp task input ([BS][BS]A, [BS][BS]B) inout([BS][BS]C)
  void sgemm (float *A, float *B, float *C);
  #pragma omp target device (cuda) copy_deps
  #pragma omp task input ([BS][BS]T) inout([BS][BS]B)
  void strsm (float *T, float *B);

  spotrf: slow task @ GPU, in the critical path (scheduling problem)

SLIDE 112

StarSs → OmpSs → Examples

Cholesky: heterogeneous execution
  • spotrf is more efficient on the CPU
  • Overlap between CPU and GPU
  • n = 8192; bs = 1024

  #pragma omp target device (smp) copy_deps
  #pragma omp task inout([BS][BS]A)
  void spotrf_tile(float *A, int BS)
  {
    long INFO;
    char L = 'L';
    spotrf_( &L, &BS, A, &BS, &INFO );
  }

  [Trace annotations: late start, lack of overlap]

SLIDE 113

StarSs → OmpSs → Examples

Cholesky performance [figure: blocked vs. linear storage]
  • Matrix size: 16K x 16K
  • Block size: 2K x 2K
  • Storage: blocked / contiguous
  • Tasks:
      spotrf: MAGMA
      trsm, syrk, gemm: CUBLAS

SLIDE 114

Further information

  Contact: rosa.m.badia@bsc.es
  Barcelona Supercomputing Center: www.bsc.es

SLIDE 115

Farewell

Thanks for your attention!*

*Hope you enjoyed this as much as Antibes-Sophia's beach