Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi


SLIDE 1

Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi

WSPPD 2011, August 19, 2011 Federal University of Rio Grande do Sul (UFRGS)

João V. F. Lima

PhD Student, joao.lima@inf.ufrgs.br

Nicolas Maillard (UFRGS), Vincent Danjean and Thierry Gautier (MOAIS-LIG)

Advisors

SLIDE 2

Contents

  • Introduction
  • Parallel LA Algorithms
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 3

Contents

  • Introduction
  • Parallel LA Algorithms
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 4

Introduction

  • Solving linear algebra (LA) systems is a fundamental problem in scientific computing
  • Enabling LA on hybrid architectures is strategic
  • Many hybrid processing units (PUs)
  • The problem is to reduce the gap between theoretical and achieved performance

SLIDE 5

Introduction

[Diagram: multicore programmed with Pthreads, TBB, Cilk++, OpenMP, UPC, KAAPI, etc.; GPU programmed with CUDA, OpenCL, etc.; hybrid multicore + GPU: ????]

SLIDE 6

Introduction

  • Efforts include optimized BLAS and LAPACK libraries for these architectures
  • Some examples:
  • PLASMA (library/multicore)
  • MAGMA (library/hybrid GPU-based)
  • StarPU (runtime system/hybrid architectures)
  • XKaapi (work in progress)
SLIDE 7

Contents

  • Introduction
  • Parallel LA Algorithms
  • LA for Multicore Processors
  • LA for Hybrid Systems
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 8

LA for Multicore Processors

  • LAPACK/ScaLAPACK are the “de facto” standard
  • Both exploit parallelism at the BLAS level
  • GotoBLAS, ATLAS, etc.
  • Their algorithms can be described as the repetition of
  • Panel factorization - accumulate (Level-2 BLAS)
  • Trailing submatrix update - apply to the rest of the matrix (Level-3 BLAS)

SLIDE 9

LA for Multicore Processors

  • Rich parallelism at Level-3 BLAS as panel size is small
  • Level-2 BLAS cannot be efficiently parallelized on shared memory
  • It introduces a fork-join execution pattern with limitations
  • Scalability – high cost at panel factorization (sequential)
  • Asynchronicity – multiple threads have to wait for the previous step
  • BLAS-level parallelism is up to 2-3x slower
  • Solution – exploit parallelism at a higher level
SLIDE 10

PLASMA

  • Parallel Linear Algebra Software for Multicore Architectures

  • Designed to be efficient on
  • Homogeneous multicore processors
  • Multi-socket systems of multicore processors
  • Three crucial elements
  • Tile algorithms (square tiles)
  • Tile data layout (cache size)
  • Dynamic scheduling (WS with pthreads/QUARK)
  • http://icl.cs.utk.edu/plasma
SLIDE 11

PLASMA

SLIDE 12

PLASMA Tiled Cholesky

  • Assuming a symmetric positive definite matrix A of size p·b × p·b

        ⎛ A11  A12  ⋯  A1p ⎞
    A = ⎜ A21  A22  ⋯  A2p ⎟
        ⎜  ⋮    ⋮   ⋱   ⋮  ⎟
        ⎝ Ap1  Ap2  ⋯  App ⎠

  • Where b is the block size
  • Each Aij is of size b × b
SLIDE 13

PLASMA Tiled Cholesky

for k = 1,...,p do
  DPOTRF2(Akk, Lkk)
  for i = k+1,...,p do
    DTRSM(Lkk, Aik, Lik)
  endfor
  for i = k+1,...,p do
    for j = k+1,...,i do
      DGSMM(Lik, Ljk, Aij)
    endfor
  endfor
endfor

SLIDE 14

PLASMA Tiled Cholesky

  • Tiled Cholesky factorization 4x4 [Buttari et al, 2009]

[DAG figure: POTRF, TRSM, SYRK and GEMM tasks and their dependencies for a 4x4 tiled Cholesky; from [Dongarra, SC2010]]

SLIDE 15

Contents

  • Introduction
  • Parallel LA Algorithms
  • LA for Multicore Processors
  • LA for Hybrid Systems
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 16

LA for Hybrid Systems

  • How to code LA for hybrid architectures?
  • Use hybrid algorithms
  • Considerations about the hybridization
  • Split LA algorithms as BLAS-based tasks

– Task parallelism

  • Choose granularity with auto-tuning
  • Define dependencies among them

– Algorithms as DAGs

  • Schedule the tasks over the multicore and the GPU
SLIDE 17

LA for Hybrid Systems

  • Scheduling is of crucial importance
  • Schedule small and non-parallelizable tasks on CPU

– Level-1 and Level-2 BLAS tasks

  • Schedule large and highly data-parallel tasks on GPU

– Level-3 BLAS tasks

  • Scheduling approaches on hybrid systems
  • Static – highly efficient for a specific architecture
  • Dynamic – load balancing; tuning depends on BLAS-level granularity

SLIDE 18

Hybrid Tiled Cholesky

for k = 1,...,p do
  Akk = POTRF(Akk)
  for i = k+1,...,p do
    Aik = TRSM(Akk, Aik)
  endfor
  for i = k+1,...,p do
    for j = k+1,...,i-1 do
      Aij = GEMM(Aik, Ajk, Aij)
    endfor
    Aii = SYRK(Aik, Aii)
  endfor
endfor

SLIDE 19

Hybrid Tiled Cholesky

for( j = 0; j < *n; j += nb ) {
    jb = min( nb, *n - j );
    cublasSsyrk( da(j,0), da(j,j) );
    cudaMemcpy2DAsync( work, da(j,j), DtoH, s[1] );
    if( j + jb < *n )
        cublasSgemm( da(j+jb,0), da(j+jb,j) );
    cudaStreamSynchronize( stream[1] );
    spotrf( "Lower", &jb, work, &jb, info );
    if( *info != 0 ) { *info = *info + j; break; }
    cudaMemcpy2DAsync( da(j,j), work, HtoD, s[0] );
    if( j + jb < *n )
        cublasStrsm( da(j,j), da(j+jb,j) );
}

Algorithm for 1 CPU + 1 GPU

SLIDE 20

MAGMA

  • Matrix Algebra on GPU and Multicore Architectures
  • A subset of LAPACK and BLAS routines
  • Each routine has two versions
  • A highly efficient version for CPU
  • A highly efficient version for GPU (MAGMA BLAS)
  • Interface very similar to LAPACK
SLIDE 21

StarPU

  • Tasking API for numerical kernel designers
  • Supports heterogeneous PUs (Cell BE, GPUs)
  • Composed of
  • data-management facility
  • task execution engine
  • http://runtime.bordeaux.inria.fr/StarPU
SLIDE 22

StarPU Data Management

  • Each device has a buffer
  • MSI caching protocol
  • (M) modified, (S) shared, (I) invalid
  • Data transfers are transparent

starpu_data_handle vector_handle;
starpu_vector_data_register( &vector_handle, … );
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;

Data registration

SLIDE 23

StarPU Task Concept

  • Concept of codelets: abstraction of a task

static starpu_codelet cl = {
    .where       = STARPU_CPU | STARPU_CUDA,
    .cpu_func    = scal_cpu_func,   /* CPU */
    .cuda_func   = scal_cuda_func,  /* GPU */
    .nbuffers    = 1,               /* n of parameters */
    .model       = &vector_scal_model,
    .power_model = &vector_scal_power_model
};

Codelet

SLIDE 24

Contents

  • Introduction
  • Parallel LA Algorithms
  • LA for Multicore Processors
  • LA for Hybrid Systems
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 25

XKaapi programming model

  • XKaapi is a C/C++ library
  • Targets multicore + GPU + cluster architectures
  • Goals
  • Simplify the development of parallel applications

– Platform abstraction (programming model)

  • Automatic dynamic load balancing

– Theoretical and practical performance
– Work stealing based algorithms

SLIDE 26

XKaapi programming model

  • We will focus on the C++ API Kaapi++
  • Three main concepts
  • Task signature

– Nº of parameters, types, and access mode

  • Task implementation

– The implementations available to each PU (CPU, GPU, etc.)

  • Data pointer

– When data is shared between tasks
SLIDE 27

XKaapi programming model

struct TaskHello : public ka::Task<1>::Signature<int> {};

template<> struct TaskBodyCPU<TaskHello> {
    void operator()( int n ) { /* CPU implementation … */ }
};

template<> struct TaskBodyGPU<TaskHello> {
    void operator()( int n ) { /* GPU implementation … */ }
};

SLIDE 28

Contents

  • Introduction
  • Parallel LA Algorithms
  • LA for Multicore Processors
  • LA for Hybrid Systems
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 29

XKaapi Tiled Cholesky

  • Definition of four BLAS-based tasks
  • TaskDPOTRF, TaskDTRSM, TaskDSYRK, TaskDGEMM
  • Abstraction of the target PU and scheduling
  • The runtime decides which PU will execute
  • The code does not contain any reference to scheduling details
  • Asynchronous execution of tasks
  • The order only depends on the data production of previous tasks

SLIDE 30

XKaapi Tiled Cholesky

for( k = 0; k < N; k += blocsize ) {
  ka::Spawn<TaskDPOTRF>()( A(rk,rk) );

  for( m = k+blocsize; m < N; m += blocsize )
    ka::Spawn<TaskDTRSM>()( A(rk,rk), A(rm,rk) /* B */ );

  for( m = k+blocsize; m < N; m += blocsize ) {
    ka::Spawn<TaskDSYRK>()( A(rm,rk), A(rm,rm) /* C */ );
    for( n = k+blocsize; n < m; n += blocsize )
      ka::Spawn<TaskDGEMM>()( A(rm,rk), /* A */
                              A(rn,rk), /* B */
                              A(rm,rn)  /* C */ );
  }
}

SLIDE 31

Contents

  • Introduction
  • Parallel LA Algorithms
  • LA for Multicore Processors
  • LA for Hybrid Systems
  • XKaapi programming model
  • Linear Algebra with XKaapi
  • Conclusion
SLIDE 32

Conclusion

  • The XKaapi interface meets the hybridization requirements
  • Kaapi++
  • Performance on hybrid architectures depends on
  • PU affinity
  • Task performance (CPU-bound, GPU-bound, etc.)
  • Data management
SLIDE 33

Future Work

  • XKaapi on N CPU(s) + 1 GPU
  • Distributed shared memory (DSM) concepts
  • Optimized many-task execution on GPU
  • XKaapi on N CPU(s) + N GPU(s) + ???
  • DSM-like data protocol
  • Problem of CPU-bound or GPU-bound tasks
  • PU affinity
  • Scheduler based on work stealing
SLIDE 34

References

  • [Asanovic et al, 2009] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
  • [Buttari et al, 2009] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35(1):38–53, 2009.
  • [El-Ghazawi et al, 2009] T. El-Ghazawi, B. Chamberlain, D. P. Grove. Programming Using the Partitioned Global Address Space (PGAS) Model. Supercomputing 2009 Tutorial, 2009.
  • [Nickolls et al, 2010] J. Nickolls and W. Dally. The GPU Computing Era. IEEE Micro, 30(2):56–69, Mar. 2010.
  • [Dongarra et al, 2011] J. Dongarra et al. The International Exascale Software Project roadmap. International Journal of High Performance Computing Applications, 25(1):3–60, 2011.

SLIDE 35

Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi

WSPPD 2011, August 19, 2011 Federal University of Rio Grande do Sul (UFRGS)

João V. F. Lima

PhD Student, joao.lima@inf.ufrgs.br

Nicolas Maillard (UFRGS), Vincent Danjean and Thierry Gautier (MOAIS-LIG)

Advisors

SLIDE 36

Software Stack

  • BLAS (Basic Linear Algebra Subprograms) - vector and matrix operations. Ex.: GotoBLAS, ATLAS, MKL, ACML
  • Level-1 - vector operations (y ← α x + y)
  • Level-2 - matrix-vector operations (y ← α A x + β y)
  • Level-3 - matrix-matrix operations (C ← α A B + β C)
  • LAPACK (Linear Algebra PACKage) - numerical linear algebra
  • LU, Cholesky and QR
  • ATLAS (Automatically Tuned Linear Algebra Software) - BLAS and some LAPACK routines with auto-tuning

SLIDE 37

StarPU Execution

  • Each resource is a worker
  • Asynchronous or user-defined (synchronized by the user)
  • Associated with a list of codelets
  • Workers can push/pop tasks
  • The scheduling policy can be defined by
  • Creating queues for workers (FIFOs or stacks)
  • Topologies (central queue or per-worker queues)
  • Some pre-defined policies (history-based)