


A Parallel Numerical Library for UPC

Jorge González-Domínguez¹*, María J. Martín¹, Guillermo L. Taboada¹, Juan Touriño¹, Ramón Doallo¹, Andrés Gómez²

¹ Computer Architecture Group, University of A Coruña (Spain)
  {jgonzalezd,mariam,taboada,juan,doallo}@udc.es

² Galicia Supercomputing Center (CESGA), Santiago de Compostela (Spain)
  agomez@cesga.es

15th International European Conference on Parallel and Distributed Computing (Euro-Par 2009), Delft University of Technology, Delft, The Netherlands

1/32


Outline

1. Introduction
     Unified Parallel C for High-Performance Computing
     Parallel Numerical Computing in UPC
2. Design of the library
     Private routines
     Shared routines
3. Implementation of the library
4. Experimental evaluation
5. Conclusions

2/32

Section 1: Introduction
  - Unified Parallel C for High-Performance Computing
  - Parallel Numerical Computing in UPC

3/32


UPC: a Suitable Alternative for HPC in the Multi-core Era

Programming models:
  - Traditionally: shared- and distributed-memory programming models
  - Challenge: hybrid memory architectures -> PGAS (Partitioned Global Address Space)

PGAS languages:
  - UPC -> C
  - Titanium -> Java
  - Co-Array Fortran -> Fortran

UPC compilers:
  - Berkeley UPC
  - GCC (Intrepid)
  - Michigan TU
  - HP, Cray and IBM UPC compilers

4/32



Important identifiers

  THREADS  -> total number of threads in execution
  MYTHREAD -> rank of the current thread

#include <stdio.h>
#include <upc.h>

int main() {
    printf("Thread %d of %d: Hello world\n", MYTHREAD, THREADS);
}

$ upcc -o helloworld helloworld.upc
$ upcrun -n 3 helloworld
Thread 0 of 3: Hello world
Thread 2 of 3: Hello world
Thread 1 of 3: Hello world

5/32



Shared array declaration

    shared [block_factor] A[size]

  - size: total number of elements
  - block_factor: number of consecutive elements with affinity to the same thread (the size of the chunks)
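As an illustration (a minimal sketch, not taken from the slides; the array size and block factor are arbitrary), a blocked shared array can be initialized with upc_forall so that each thread writes only the chunks it has affinity to:

#include <upc.h>

#define N 16
shared [4] int A[N];   /* block_factor = 4: chunks of 4 consecutive elements per thread */

int main() {
    /* The affinity expression &A[i] runs iteration i on the thread
       that owns A[i], so every write below is a local access. */
    upc_forall(int i = 0; i < N; i++; &A[i]) {
        A[i] = i;
    }
    upc_barrier;
    return 0;
}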

6/32


BLAS libraries
  - Basic Linear Algebra Subprograms
  - Specification of a set of numerical functions
  - Widely used by scientists and engineers
  - SparseBLAS and PBLAS (Parallel BLAS)

BLAS implementations
  - Generic and open source: GSL -> GNU
  - Optimized for specific architectures: MKL -> Intel, ACML -> AMD, CXML -> Compaq, MLIB -> HP

7/32



BLAS level | Tblasname | Action
-----------|-----------|--------------------------------------------------------
BLAS1      | Tcopy     | Copies a vector
BLAS1      | Tswap     | Swaps the elements of two vectors
BLAS1      | Tscal     | Scales a vector by a scalar
BLAS1      | Taxpy     | Updates a vector using another one: y = α·x + y
BLAS1      | Tdot      | Dot product
BLAS1      | Tnrm2     | Euclidean norm
BLAS1      | Tasum     | Sums the absolute values of the elements of a vector
BLAS1      | iTamax    | Finds the index with the maximum value
BLAS1      | iTamin    | Finds the index with the minimum value
BLAS2      | Tgemv     | Matrix-vector product
BLAS2      | Ttrsv     | Solves a triangular system of equations
BLAS2      | Tger      | Outer product
BLAS3      | Tgemm     | Matrix-matrix product
BLAS3      | Ttrsm     | Solves a block of triangular systems of equations

8/32


Numerical computing in UPC

  - No numerical libraries for PGAS languages
  - Alternatives for the programmers:
      - Develop the routines by themselves -> more effort, worse performance
      - Use different programming models with parallel numerical libraries:
          distributed memory -> MPI; shared memory -> OpenMP
  - Consequence: a barrier to the productivity of PGAS languages

9/32


Section 2: Design of the library
  - Private routines
  - Shared routines

10/32


Analysis of related works

Distributed memory approach (parallel -MPI- BLAS):
  - Message-passing paradigm; only private memory
  - New structures to represent distributed vectors or matrices
      -> difficult to understand and work with
      -> helper functions needed for their creation, storage of data and deletion

New approach: usage of UPC shared arrays

11/32



Two functions for each BLAS routine

Private functions:
  - Input and output data in private memory
  - Pointers from private to private
  - Data distribution internal to the function -> neither chosen nor known by the user

Shared functions:
  - Input and output data in shared memory
  - Pointers from private to shared
  - Data distribution chosen by the user through a parameter

12/32



Naming: upc_blas_[p]Tblasname

p value:
  _ (absent) -> shared version
  p          -> private version

T value:
  i -> integer
  l -> long
  f -> float
  d -> double

2 versions × 4 datatypes × 14 routines = 112 functions

13/32


Example: y = a·x + y (daxpy)

  private version -> upc_blas_pdaxpy
  shared version  -> upc_blas_daxpy

14/32


int upc_blas_pdaxpy(const int size, const double a,
                    const int thread_src, const double *x,
                    const int thread_dst, double *y);

Parameters:
  - size: length of the vectors
  - a: scale factor
  - x, y: private pointers to the positions of private memory where the vector elements are stored
  - thread_src: in [0, THREADS]; rank of the thread that holds the input x and y data in its private memory.
      If THREADS -> vectors replicated in all private spaces
  - thread_dst: in [0, THREADS]; rank of the thread in whose private memory the output will be written.
      If THREADS -> output replicated in all private spaces -> broadcast
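A minimal usage sketch (the header name upc_blas.h is an assumption, as the slides do not give it): thread 0 provides the input vectors and, with thread_dst = THREADS, every thread receives a private copy of the result:

#include <upc.h>
#include "upc_blas.h"   /* assumed header name */

#define N 1000

int main() {
    double x[N], y[N];          /* private arrays on every thread */
    if (MYTHREAD == 0) {        /* only thread_src needs valid input data */
        for (int i = 0; i < N; i++) { x[i] = (double)i; y[i] = 1.0; }
    }
    /* y = 2.0*x + y: input read from thread 0, output broadcast
       to the private memory of all threads (thread_dst = THREADS) */
    upc_blas_pdaxpy(N, 2.0, 0, x, THREADS, y);
    return 0;
}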

15/32



int upc_blas_daxpy(const int block_size, const int size,
                   const double a, shared const double *x,
                   shared double *y);

Parameters:
  - size: length of the vectors -> the same as in the private version
  - a: scale factor -> the same as in the private version
  - x, y: private pointers to the positions of shared memory where the vector elements are stored
  - block_size: meaning described on the following slides
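A corresponding sketch for the shared version (again assuming the upc_blas.h header name): both vectors are declared with the same block factor that is passed as block_size, so the work distribution follows the data distribution:

#include <upc.h>
#include "upc_blas.h"   /* assumed header name */

#define N  1000
#define BS 100          /* block_factor of the arrays = block_size argument */

shared [BS] double x[N];
shared [BS] double y[N];

int main() {
    /* x and y have the same layout, so each thread initializes
       only the elements it has affinity to */
    upc_forall(int i = 0; i < N; i++; &x[i]) { x[i] = (double)i; y[i] = 1.0; }
    upc_barrier;

    /* y = 2.0*x + y on the shared, block-distributed vectors */
    upc_blas_daxpy(BS, N, 2.0, x, y);
    return 0;
}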

16/32



Meaning of block_size for vectors
  - In the range [1, size]
  - Number of consecutive elements with affinity to the same thread
  - For performance, block_size should equal the block_factor of the shared arrays
  - Determines the distribution of the work

    shared [block_size] y[size]

17/32


Meaning of block_size for matrices distributed by rows
  - Additional parameter: dist_dimm = row_dist
  - In the range [1, rows]
  - Number of consecutive rows with affinity to the same thread
  - Determines the distribution of the work

    shared [block_size*cols] y[rows*cols]

18/32


Meaning of block_size for matrices distributed by columns
  - Additional parameter: dist_dimm = col_dist
  - In the range [1, cols]
  - Number of consecutive columns with affinity to the same thread
  - Determines the distribution of the work

    shared [block_size] y[rows*cols]

19/32

Section 3: Implementation of the library

20/32


General steps to achieve good efficiency

UPC optimization techniques:
  - Privatization of the accesses to shared memory

21/32



Private pointers to private memory
  - The standard C pointers
  - Stored in private memory
  - Able to access: private memory, and the part of the shared memory with affinity to the thread
  - Very fast to dereference

Private pointers to shared memory
  - Stored in private memory
  - Able to access any position in all the shared memory
  - Heavier than standard C pointers -> slower accesses
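For example (a minimal sketch, not from the slides; the block size is arbitrary), privatization casts a pointer-to-shared to a standard C pointer, which is legal when the referenced data has affinity to the calling thread and turns slow shared dereferences into fast private ones:

#include <upc.h>

#define BS 256
shared [BS] double v[BS * THREADS];   /* one block of BS elements per thread */

int main() {
    /* &v[MYTHREAD * BS] has affinity to this thread, so the cast
       to a private pointer is valid; the loop then uses fast
       standard C pointer arithmetic instead of shared accesses. */
    double *local = (double *)&v[MYTHREAD * BS];
    for (int i = 0; i < BS; i++)
        local[i] = 0.0;
    upc_barrier;
    return 0;
}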

22/32


General steps to achieve good efficiency
  - UPC optimization techniques:
      - Privatization of the accesses to shared memory
      - Aggregation of remote shared memory accesses (upc_memget, upc_memput, upc_memcpy) -> see the sketch after this list
      - Overlapping remote accesses with computation
  - Correct distribution of the workload and data among threads -> private case
  - Calls to the most efficient underlying numerical libraries
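As an example of the aggregation technique (a sketch under the same illustrative assumptions as before), one bulk upc_memget replaces many fine-grained remote reads:

#include <stdio.h>
#include <upc.h>

#define BS 256
shared [BS] double v[BS * THREADS];   /* one block of BS elements per thread */

int main() {
    double buf[BS];
    int next = (MYTHREAD + 1) % THREADS;

    /* Fetch the whole block owned by the next thread in a single
       bulk transfer instead of BS individual shared reads. */
    upc_memget(buf, &v[next * BS], BS * sizeof(double));

    double sum = 0.0;
    for (int i = 0; i < BS; i++)
        sum += buf[i];                /* compute on the private copy */
    printf("Thread %d: sum of remote block = %f\n", MYTHREAD, sum);
    return 0;
}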

23/32


Section 4: Experimental evaluation

24/32


Finis Terrae (CESGA)
  - 142 HP Integrity rx7640 nodes, each with:
      - 16 Montvale Itanium2 (IA64) cores at 1.6 GHz
      - 2 cells, each with 4 dual-core processors and 1 shared memory module
      - 128 GB RAM
  - Mellanox InfiniBand HCA (16 Gbps bandwidth)

SW configuration:
  - Berkeley UPC (BUPC) 2.6
  - Intel Math Kernel Library (MKL) 9.1
      - All sequential routines: BLAS1, BLAS2 and BLAS3
      - Other routines: SparseBLAS, LAPACK, ScaLAPACK...

25/32


Configuration of benchmarks
  - Hybrid memory configuration:
      - Locality exploitation of threads in the same node -> shared memory
      - Enhancing scalability -> distributed memory
  - 4 threads per node, 2 per cell
  - Private version: thread_src = THREADS, thread_dst = 0

26/32


[Figure: speedup and efficiency curves for the dot product (pddot), vector sizes 50M, 100M and 150M, on 2 to 128 threads]

27/32


[Figure: speedup and efficiency curves for the matrix-vector product (pdgemv), matrix sizes 10000, 20000 and 30000, on 2 to 128 threads]

28/32


[Figure: speedup and efficiency curves for the matrix-matrix product (pdgemm), matrix sizes 6000, 8000 and 10000, on 2 to 128 threads]

29/32

Section 5: Conclusions

30/32


Summary
  - First parallel numerical library developed for UPC -> novelty
  - Allows input and output data to be stored in private or shared memory -> flexibility
  - Uses the standard sequential BLAS functions -> portability
  - Scalability demonstrated by experimental tests -> efficiency

Future work
  - Develop a sparse counterpart of the library for UPC

31/32



Questions?

Contact: Jorge González-Domínguez, jgonzalezd@udc.es
Computer Architecture Group, Dept. of Electronics and Systems
University of A Coruña, Spain

32/32