  1. 1 StarPU: Exploiting heterogeneous architectures through task-based programming. Cédric Augonnet, Nathalie Furmento, Raymond Namyst, Samuel Thibault. INRIA Bordeaux, LaBRI, University of Bordeaux. ComplexHPC spring school, May 13th, 2011

  2. 2 The RUNTIME Team

  3. 3 The RUNTIME Team

  4. 4 The RUNTIME Team: Doing parallelism for centuries!

  5. 5 The RUNTIME Team: Research directions
      • High-performance runtime systems for parallel architectures
        “Runtime systems perform dynamically what cannot be done statically”
      • Main research directions:
        – Exploiting shared-memory machines: thread scheduling over hierarchical multicore architectures, task scheduling over accelerator-based machines
        – Communication over high-speed networks: multicore-aware communication engines, multithreaded MPI implementations
        – Integration of multithreading and communication: runtime support for hybrid programming
      • See http://runtime.bordeaux.inria.fr/ for more information

  6. 6 Introduction: Toward heterogeneous multi-core architectures
      • Multicore is here: hierarchical architectures
      • Manycore is coming, and power is a major concern
      • Architecture specialization:
        – Now: accelerators (GPGPUs, FPGAs), coprocessors (Cell's SPUs)
        – In the (near?) future: many simple cores plus a few full-featured cores (mixed large and small cores)

  7. 7 Introduction: How to program these architectures? Multicore
      • Multicore programming: pthreads, OpenMP, TBB, Cilk, MPI, ...
      [diagram: CPUs sharing a common memory]

  8. 8 Introduction: How to program these architectures? Accelerators
      • Multicore programming: pthreads, OpenMP, TBB, ...
      • Accelerator programming: CUDA, OpenCL, libspe, ATI Stream
        – Consensus on OpenCL?
        – (Often) a pure offloading model
      [diagram: CPUs plus accelerators (*PUs), each with its own memory]

  9. 9 Introduction: How to program these architectures? Hybrid?
      • Multicore programming: pthreads, OpenMP, TBB, Cilk, MPI, ...
      • Accelerator programming: CUDA, OpenCL, libspe, ATI Stream
      • Hybrid models?
        – Take advantage of all resources ☺
        – Complex interactions ☹

  10. 10 Introduction: Challenging issues at all stages
      • Applications: programming paradigm; BLAS kernels, FFT, ...
      • Compilers: languages; code generation/optimization
      • Runtime systems: resource management; task scheduling
      • Architecture: memory interconnect
      [diagram: software stack of HPC applications, compiling environment and specific libraries, runtime system, operating system, hardware]

  11. 11 Introduction: Challenging issues at all stages
      • Same stack, with an expressive interface carrying information down through the layers and execution feedback flowing back up

  12. 12 Outline • Overview of StarPU • Programming interface • Task & data management • Task scheduling • MAGMA+PLASMA example • Experimental features • Conclusion

  13. 13 Overview of StarPU

  14. 14 Overview of StarPU: Rationale
      • Dynamically schedule tasks on all processing units
        – See a pool of heterogeneous processing units
      • Avoid unnecessary data transfers between accelerators
        – Software VSM for heterogeneous machines
      [diagram: A = A+B scheduled across CPUs and GPUs, with copies of A and B replicated in main and GPU memories]

  15. 15 The StarPU runtime system: Mastering CPUs, GPUs, SPUs, ... *PUs
      [diagram: HPC applications on top of StarPU (execution model, high-level data management library, scheduling engine, specific drivers) driving CPUs, GPUs, SPUs, ...]

  16. 16 The StarPU runtime system: The need for runtime systems
      • “Do dynamically what can’t be done statically anymore”
      • StarPU provides task scheduling and memory management
      • Compilers and libraries generate (graphs of) parallel tasks
      • Additional information is welcome!
      [diagram: HPC applications → parallel compilers / parallel libraries → StarPU → drivers (CUDA, OpenCL) → CPU, GPU, ...]

  17. 17 Data management
      • StarPU provides a virtual shared memory (VSM) subsystem:
        – Weak consistency
        – Replication
        – Single writer
      • High-level API: partitioning filters
      • Input & output of tasks = references to VSM data

  18. 18 The StarPU runtime system: Task scheduling
      • Tasks =
        – Data input & output (references to VSM data)
        – Multiple implementations (e.g. CUDA + CPU implementations)
        – Dependencies with other tasks
        – Scheduling hints
      • StarPU provides an open scheduling platform
        – Scheduling algorithm = plug-ins
      [diagram: a task f_cpu/gpu/spu(A^RW, B^R, C^R) dispatched to CPU or GPU]
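The “multiple implementations” point is expressed in StarPU through a codelet. The fragment below is a hedged sketch, not code from the slides: it uses StarPU-1.x-style field names (.cpu_funcs, .cuda_funcs, .modes), whereas releases contemporary with this talk instead used a .where mask with .cpu_func/.cuda_func. scal_cpu_func and scal_cuda_func are the kernels defined on slides 34-35.

```c
/* Hedged sketch (StarPU 1.x-style fields; requires <starpu.h>):
 * one codelet bundling a CPU and a CUDA implementation of the
 * same vector-scaling operation. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu_func, NULL },   /* CPU implementation  */
    .cuda_funcs = { scal_cuda_func, NULL },  /* CUDA implementation */
    .nbuffers   = 1,                         /* one data handle     */
    .modes      = { STARPU_RW },             /* read-write access   */
};
```

At runtime the scheduler picks whichever implementation matches the unit the task is mapped to.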

  19. 19 The StarPU runtime system: Who generates the code?
      • A StarPU task = ~function pointers; StarPU doesn't generate code
      • Libraries era: PLASMA + MAGMA, FFTW + CUFFT, ...
      • Rely on compilers: PGI accelerators, CAPS HMPP, ...

  20. 20 The StarPU runtime system: Execution model
      [diagram: the application sits on top of StarPU (scheduling engine plus memory management / DSM), which drives a CPU driver and GPU drivers; A and B initially reside in RAM]
      • Slides 20-27 animate the life of one task:
        1. The application submits the task « A += B »
        2. The scheduling engine schedules the task onto a GPU
        3. The data A and B are fetched into the GPU's memory
        4. The computation A += B is offloaded to the GPU
        5. The driver notifies termination

  28. 28 The StarPU runtime system: Development context
      • History
        – Started about 3 years ago
        – StarPU main core: ~20k lines of C code
        – 3 core developers: Cédric Augonnet, Samuel Thibault, Nathalie Furmento
      • Open source
        – Released under the LGPL
        – Sources freely available (svn repository and nightly tarballs); see http://runtime.bordeaux.inria.fr/StarPU/
        – Open to external contributors

  29. 29 The StarPU runtime system: Supported platforms
      • Supported architectures
        – Multicore CPUs (x86, PPC, ...)
        – NVIDIA GPUs
        – OpenCL devices (e.g. AMD cards)
        – Cell processors (experimental)
      • Supported operating systems
        – Linux
        – Mac OS
        – Windows

  30. 30 Performance teaser
      • QR decomposition
      • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

  31. 31 Programming interface

  32. 32 Scaling a vector: Launching StarPU
      • Makefile flags:
            CFLAGS  += $(shell pkg-config --cflags libstarpu)
            LDFLAGS += $(shell pkg-config --libs libstarpu)
      • Headers:
            #include <starpu.h>
      • (De)initialize StarPU:
            starpu_init(NULL);
            starpu_shutdown();

  33. 33 Scaling a vector: Data registration
      • Register a piece of data to StarPU:
            float array[NX];
            for (unsigned i = 0; i < NX; i++)
                array[i] = 1.0f;

            starpu_data_handle vector_handle;
            starpu_vector_data_register(&vector_handle, 0,
                                        (uintptr_t)array, NX,
                                        sizeof(array[0]));
      • Unregister data:
            starpu_data_unregister(vector_handle);

  34. 34 Scaling a vector: Defining a codelet
      • CPU kernel:
            void scal_cpu_func(void *buffers[], void *cl_arg)
            {
                struct starpu_vector_interface_s *vector = buffers[0];
                unsigned n = STARPU_VECTOR_GET_NX(vector);
                float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
                float *factor = cl_arg;

                for (unsigned i = 0; i < n; i++)
                    val[i] *= *factor;
            }

  35. 35 Scaling a vector: Defining a codelet (2)
      • CUDA kernel (compiled with nvcc, in a separate .cu file):
            __global__ void vector_mult_cuda(float *val, unsigned n,
                                             float factor)
            {
                for (unsigned i = 0; i < n; i++)
                    val[i] *= factor;
            }

            extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
            {
                struct starpu_vector_interface_s *vector = buffers[0];
                unsigned n = STARPU_VECTOR_GET_NX(vector);
                float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
                float *factor = (float *)cl_arg;

                vector_mult_cuda<<<1, 1>>>(val, n, *factor);
                cudaThreadSynchronize();
            }
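The transcript ends before the task is actually submitted. To round off the vector-scaling example, here is a hedged sketch of task creation and submission. It assumes a codelet named scal_cl grouping the two kernels above (a name introduced here, not from the slides), uses StarPU-1.x-style fields (task->handles[0]; the 0.9-era API used task->buffers[0].handle), and needs StarPU installed to compile:

```c
/* Hedged sketch: submit one vector-scaling task and wait for it. */
float factor = 3.14f;   /* must stay alive until the task completes */

struct starpu_task *task = starpu_task_create();
task->cl = &scal_cl;                /* which kernels may run        */
task->handles[0] = vector_handle;   /* the registered vector        */
task->cl_arg = &factor;             /* untyped argument for kernels */
task->cl_arg_size = sizeof(factor);

starpu_task_submit(task);           /* asynchronous submission      */
starpu_task_wait_for_all();         /* wait before reading results  */
```

Submission is asynchronous, which is what lets the scheduling engine reorder and map tasks across the heterogeneous units described earlier.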
