StarPU: Exploiting heterogeneous architectures through task-based programming


SLIDE 1

StarPU : Exploiting heterogeneous architectures through task-based programming

ComplexHPC spring school – May 13th, 2011

Cédric Augonnet, Nathalie Furmento, Raymond Namyst, Samuel Thibault

INRIA Bordeaux, LaBRI, University of Bordeaux

SLIDE 2

The RUNTIME Team


SLIDE 3

The RUNTIME Team


SLIDE 4

The RUNTIME Team

Doing parallelism for centuries!


SLIDE 5

The RUNTIME Team

  • High Performance Runtime Systems for Parallel Architectures
  • “Runtime systems perform dynamically what cannot be done statically”
  • Main research directions
  • Exploiting shared memory machines

– Thread scheduling over hierarchical multicore architectures
– Task scheduling over accelerator-based machines

  • Communication over high speed networks

– Multicore-aware communication engines
– Multithreaded MPI implementations

  • Integration of multithreading and communication

– Runtime support for hybrid programming

  • See http://runtime.bordeaux.inria.fr/ for more information

Research directions

SLIDE 6

Introduction

Toward heterogeneous multi-core architectures

  • Multicore is here
  • Hierarchical architectures
  • Manycore is coming
  • Power is a major concern
  • Architecture specialization
  • Now

– Accelerators (GPGPUs, FPGAs)
– Coprocessors (Cell's SPUs)

  • In the (near?) Future

– Many simple cores
– A few full-featured cores
– Mixed large and small cores

SLIDE 7

Introduction

How to program these architectures?

  • Multicore programming
  • pthreads, OpenMP, TBB, ...

[Diagram: a multicore machine (memory + four CPUs) with its programming models: OpenMP, TBB, MPI, Cilk]

SLIDE 8

Introduction

How to program these architectures?

  • Multicore programming
  • pthreads, OpenMP, TBB, ...
  • Accelerator programming
  • Consensus on OpenCL?
  • (Often) Pure offloading model

[Diagram: the same machine extended with accelerators (*PUs with their own memory) and their programming models: OpenCL, CUDA, libspe, ATI Stream]

SLIDE 9

Introduction

How to program these architectures?

  • Multicore programming
  • pthreads, OpenMP, TBB, ...
  • Accelerator programming
  • Consensus on OpenCL?
  • (Often) Pure offloading model
  • Hybrid models?
  • Take advantage of all resources ☺
  • Complex interactions ☹

[Diagram: multicore (OpenMP, TBB, MPI, Cilk) and accelerators (OpenCL, CUDA, libspe, ATI Stream) side by side, with question marks on how to combine them]

SLIDE 10

Introduction

Challenging issues at all stages

  • Applications
  • Programming paradigm
  • BLAS kernels, FFT, …
  • Compilers
  • Languages
  • Code generation/optimization
  • Runtime systems
  • Resources management
  • Task scheduling
  • Architecture
  • Memory interconnect

[Diagram: software stack — HPC applications, compiling environment and specific libraries, runtime system, operating system, hardware]

SLIDE 11

Introduction

Challenging issues at all stages

  • Applications
  • Programming paradigm
  • BLAS kernels, FFT, …
  • Compilers
  • Languages
  • Code generation/optimization
  • Runtime systems
  • Resources management
  • Task scheduling
  • Architecture
  • Memory interconnect

[Diagram: the same software stack, highlighting the runtime system's expressive interface (downward) and execution feedback (upward)]

SLIDE 12

  • Overview of StarPU
  • Programming interface
  • Task & data management
  • Task scheduling
  • MAGMA+PLASMA example
  • Experimental features
  • Conclusion

Outline

SLIDE 13

Overview of StarPU

SLIDE 14

Overview of StarPU

Rationale

Dynamically schedule tasks

  • On all processing units
  • See a pool of heterogeneous processing units

Avoid unnecessary data transfers between accelerators

  • Software VSM for heterogeneous machines

[Diagram: A = A+B on a machine mixing CPUs and GPUs, each memory holding replicates of A and B]

SLIDE 15

The StarPU runtime system

[Diagram: StarPU architecture — HPC applications on top of a high-level data management library and a scheduling engine, with an execution model and specific drivers mastering CPUs, GPUs, SPUs, ... (*PUs)]

SLIDE 16

  • “Do dynamically what can’t be done statically anymore”
  • StarPU provides
  • Task scheduling
  • Memory management
  • Compilers and libraries generate (graphs of) parallel tasks
  • Additional information is welcome!

The need for runtime systems

The StarPU runtime system

[Diagram: StarPU between HPC applications / parallel compilers / parallel libraries above, and drivers (CUDA, OpenCL) for CPU, GPU, ... below]

SLIDE 17

  • StarPU provides a Virtual Shared Memory (VSM) subsystem
  • Weak consistency
  • Replication
  • Single writer
  • High-level API
    – Partitioning filters (a sketch follows below)
  • Input & output of tasks = references to VSM data

Data management

[Diagram: StarPU between HPC applications / parallel compilers / parallel libraries and the drivers (CUDA, OpenCL) for CPU, GPU, ...]
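The partitioning filters mentioned above split a registered piece of data into sub-handles that tasks can then work on independently. A hedged sketch for a vector (filter identifiers have varied across StarPU releases; starpu_vector_filter_block is the 1.x name):

struct starpu_data_filter f = {
    .filter_func = starpu_vector_filter_block,  /* split into equal blocks */
    .nchildren   = 4,
};
starpu_data_partition(vector_handle, &f);

/* depth 1, index 2: the third sub-vector */
starpu_data_handle sub = starpu_data_get_sub_data(vector_handle, 1, 2);

/* ... submit tasks working on the sub-handles ... */

/* gather the pieces back to main memory (node 0) */
starpu_data_unpartition(vector_handle, 0);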

SLIDE 18

  • Tasks =
  • Data input & output
    – Reference to VSM data
  • Multiple implementations
    – E.g. CUDA + CPU implementations
  • Dependencies with other tasks
  • Scheduling hints
  • StarPU provides an open scheduling platform
  • Scheduling algorithms = plug-ins

The StarPU runtime system

Task scheduling

[Diagram: a task f(A RW, B R, C R) with cpu / gpu / spu implementations, dispatched through the StarPU drivers (CUDA, OpenCL) to CPU, GPU, ...]

SLIDE 19


  • Who generates the code?
  • StarPU Task = ~function pointers
  • StarPU doesn't generate code
  • Libraries era
  • PLASMA + MAGMA
  • FFTW + CUFFT...
  • Rely on compilers
  • PGI accelerators
  • CAPS HMPP...

The StarPU runtime system

Task scheduling

[Diagram: the same task f(A RW, B R, C R) with cpu / gpu / spu implementations]

SLIDE 20

The StarPU runtime system

Execution model

[Diagram: StarPU internals — the application feeds a scheduling engine backed by memory management (DSM), with a GPU driver and CPU drivers #k below; data A and B reside in RAM]

SLIDE 21

The StarPU runtime system

Execution model

[Diagram: the application submits the task « A += B » to StarPU]

SLIDE 22

The StarPU runtime system

Execution model

[Diagram: the scheduling engine schedules the task onto a worker]

SLIDE 23

The StarPU runtime system

Execution model

[Diagram: StarPU fetches the input data — B is transferred to the GPU memory]

SLIDE 24

The StarPU runtime system

Execution model

[Diagram: the data fetch continues — A is transferred to the GPU memory]

SLIDE 25

The StarPU runtime system

Execution model

[Diagram: both A and B are now replicated in the GPU memory]

SLIDE 26

The StarPU runtime system

Execution model

[Diagram: the computation A += B is offloaded to the GPU]

SLIDE 27

The StarPU runtime system

Execution model

[Diagram: the driver notifies StarPU of the task's termination]

SLIDE 28

  • History
  • Started about 3 years ago
  • StarPU main core ~ 20k lines of code
  • Written in C
  • 3 core developers

– Cédric Augonnet, Samuel Thibault, Nathalie Furmento

  • Open Source
  • Released under LGPL
  • Sources freely available

– svn repository and nightly tarballs
– See http://runtime.bordeaux.inria.fr/StarPU/

  • Open to external contributors

The StarPU runtime system

Development context

SLIDE 29

  • Supported architectures
  • Multicore CPUs (x86, PPC, ...)
  • NVIDIA GPUs
  • OpenCL devices (e.g. AMD cards)
  • Cell processors (experimental)
  • Supported Operating Systems
  • Linux
  • Mac OS
  • Windows

The StarPU runtime system

Supported platforms

SLIDE 30

  • QR decomposition
  • Mordor8 (UTK) : 16 CPUs (AMD) + 4 GPUs (C1060)

Performance teaser

SLIDE 31

Programming interface

SLIDE 32

  • Makefile flags
  • CFLAGS +=$(shell pkg-config --cflags libstarpu)
  • LDFLAGS+=$(shell pkg-config --libs libstarpu)
  • Headers
  • #include <starpu.h>
  • (De)Initialize StarPU
  • starpu_init(NULL);
  • starpu_shutdown();

Scaling a vector

Launching StarPU
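Putting these flags and calls together, a minimal skeleton looks like the following sketch (error handling kept to a bare minimum):

/* build with:
 *   CFLAGS  += $(shell pkg-config --cflags libstarpu)
 *   LDFLAGS += $(shell pkg-config --libs   libstarpu)
 */
#include <stdio.h>
#include <starpu.h>

int main(void)
{
    /* NULL: use the default configuration (all detected CPUs/GPUs) */
    int ret = starpu_init(NULL);
    if (ret != 0) {
        fprintf(stderr, "starpu_init failed\n");
        return 1;
    }

    /* ... register data, submit tasks ... */

    starpu_shutdown();
    return 0;
}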

SLIDE 33

  • Register a piece of data to StarPU
float array[NX];
for (unsigned i = 0; i < NX; i++)
    array[i] = 1.0f;

starpu_data_handle vector_handle;
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)array,
                            NX, sizeof(array[0]));

  • Unregister data
  • starpu_data_unregister(vector_handle);

Scaling a vector

Data registration

SLIDE 34

  • CPU kernel

Scaling a vector

Defining a codelet

void scal_cpu_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface_s *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
    float *factor = cl_arg;

    for (unsigned i = 0; i < n; i++)
        val[i] *= *factor;
}

SLIDE 35

  • CUDA kernel (compiled with nvcc, in a separate .cu file)

Scaling a vector

Defining a codelet (2)

__global__ void vector_mult_cuda(float *val, unsigned n, float factor)
{
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}

extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface_s *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
    float *factor = (float *)cl_arg;

    /* deliberately naive: a single CUDA thread loops over the vector */
    vector_mult_cuda<<<1, 1>>>(val, n, *factor);
    cudaThreadSynchronize();
}

SLIDE 36

Scaling a vector

Defining a codelet (3)

  • Codelet = multi-versioned kernel
  • Function pointers to the different kernels
  • Number of data parameters managed by StarPU

starpu_codelet scal_cl = {
    .where     = STARPU_CPU | STARPU_CUDA,
    .cpu_func  = scal_cpu_func,
    .cuda_func = scal_cuda_func,
    .nbuffers  = 1
};

SLIDE 37

  • Define a task that scales the vector by a constant

struct starpu_task *task = starpu_task_create();
task->cl = &scal_cl;
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;

float factor = 3.14;
task->cl_arg = &factor;
task->cl_arg_size = sizeof(factor);

starpu_task_submit(task);
starpu_task_wait(task);

Scaling a vector

Defining a task

SLIDE 38

  • Define a task that scales the vector by a constant

float factor = 3.14;
starpu_insert_task(&scal_cl,
                   STARPU_RW, vector_handle,
                   STARPU_VALUE, &factor, sizeof(factor),
                   0);

Scaling a vector

Defining a task, starpu_insert_task helper

SLIDE 39

More details on Data Management

SLIDE 40

  • Memory nodes
  • Each worker is associated with a memory node
  • Multiple workers may share a node
  • Data coherency
  • Keep track of replicates
  • Discard invalid replicates
  • MSI coherency protocol
  • M: Modified
  • S: Shared
  • I: Invalid

StarPU data interfaces

StarPU data coherency protocol

[Diagram: per-node MSI states for data A and B during A = A+B — an RW access to A invalidates the other replicates (→ I I M), while an R access to B adds a Shared replicate (→ S I S)]

SLIDE 41

StarPU data interfaces

StarPU data coherency protocol

[Diagram: next animation step of the MSI protocol for A = A+B]
SLIDE 42

StarPU data interfaces

StarPU data interfaces

[Diagram: data A registered in main memory and replicated on two GPUs]

  • Each piece of data is described by a structure
  • Example: vector interface

    struct starpu_vector_interface_s {
        unsigned nx;
        unsigned elemsize;
        uintptr_t ptr;
    };

  • Matrix formats, ...
  • StarPU ensures that interfaces are coherent
  • StarPU tasks are passed pointers to these interfaces
  • The coherency protocol is independent of the type of interface

[Diagram: each memory node holds its own copy of the interface structure (nx = 1024, elemsize = 4), with ptr filled in where a valid replicate exists (0x340fc0 in RAM, 0xc10000 on the GPU) and NULL otherwise; MSI states I / I / M for data A]

SLIDE 43

StarPU data interfaces

StarPU data interfaces

[Diagram: data A registered from main memory (nx = 1024, elemsize = 4, ptr = 0x340fc0)]

  • Registering a piece of data
  • Generic method
  • Wrappers are available for existing interfaces

starpu_data_register(starpu_data_handle *handleptr, uint32_t home_node,
                     void *interface, struct starpu_data_interface_ops_t *ops);

starpu_vector_data_register(starpu_data_handle *handle, uint32_t home_node,
                            uintptr_t ptr, uint32_t nx, size_t elemsize);

starpu_variable_data_register(starpu_data_handle *handle, uint32_t home_node,
                              uintptr_t ptr, size_t elemsize);

starpu_csr_data_register(starpu_data_handle *handle, uint32_t home_node,
                         uint32_t nnz, uint32_t nrow, uintptr_t nzval,
                         uint32_t *colind, uint32_t *rowptr,
                         uint32_t firstentry, size_t elemsize);
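For instance, using the variable wrapper above to register a single scalar (a sketch; the handle is then passed to tasks like any other data):

float factor = 3.14f;
starpu_data_handle factor_handle;

/* home_node 0 = main memory; the data is described by address and size */
starpu_variable_data_register(&factor_handle, 0,
                              (uintptr_t)&factor, sizeof(factor));

/* ... use factor_handle as a task buffer (e.g. in STARPU_R mode) ... */

starpu_data_unregister(factor_handle);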

SLIDE 44

More details on Task Management

SLIDE 45

Task management

Task API

  • Create tasks
  • Dynamically allocated by starpu_task_create
  • Otherwise, initialized by starpu_task_init
  • Submit a task
  • starpu_task_submit(task)

– blocking if task->synchronous = 1

  • Wait for task termination
  • starpu_task_wait(task);
  • starpu_task_wait_for_all();
  • Destroy tasks
  • starpu_task_destroy(task);

– automatically called if task->destroy = 1

  • starpu_task_deinit(task);
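Combining these calls, a sketch of a blocking submission (scal_cl and vector_handle as defined in the earlier example):

struct starpu_task *task = starpu_task_create();
task->cl = &scal_cl;
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;

task->synchronous = 1;   /* starpu_task_submit() blocks until completion */
task->destroy = 1;       /* the task is then freed automatically */

starpu_task_submit(task);

/* with asynchronous tasks, a global barrier is often simpler: */
/* starpu_task_wait_for_all(); */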
SLIDE 46

Task management

The task structure

  • struct starpu_task
  • Task description
  • struct starpu_codelet_t *cl
  • void *cl_arg : constant data area passed to the codelet
  • Buffers array (accessed data + access mode)

task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;

  • void (*callback_func)(void *);

– void *callback_arg;
– Should not be a blocking call!

  • Extra hints for the scheduler

– e.g. priority level
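A sketch of the callback fields in use, continuing the slide-37 example (the callback runs in a StarPU worker context, hence the no-blocking rule):

#include <stdio.h>
#include <starpu.h>

/* executed by StarPU once the task completes; must not block */
static void scal_done(void *arg)
{
    fprintf(stderr, "scaling finished (arg=%p)\n", arg);
}

/* ... when filling in the task: */
task->callback_func = scal_done;
task->callback_arg  = NULL;
task->priority      = STARPU_MAX_PRIO;   /* extra hint for the scheduler */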

SLIDE 47

Task management

Implicit task dependencies

  • Right-Looking Cholesky decomposition (from PLASMA)
for (k = 0 .. tiles-1) {
    POTRF(A[k,k])
    for (m = k+1 .. tiles-1)
        TRSM(A[k,k], A[m,k])
    for (n = k+1 .. tiles-1)
        SYRK(A[n,k], A[n,n])
    for (n = k+1 .. tiles-1)
        for (m = k+1 .. tiles-1)
            GEMM(A[m,k], A[n,k], A[m,n])
}
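Expressed with starpu_insert_task, the same loop nest can be submitted as-is: tasks are submitted in sequential order and declare their access modes, so StarPU infers the dependencies (e.g. each TRSM reading the POTRF output) automatically. A sketch, assuming codelets potrf_cl, trsm_cl, syrk_cl, gemm_cl and registered tile handles A[m][n] defined elsewhere (these names are illustrative, not from the slides):

int k, m, n;
for (k = 0; k < tiles; k++) {
    starpu_insert_task(&potrf_cl, STARPU_RW, A[k][k], 0);

    for (m = k+1; m < tiles; m++)
        starpu_insert_task(&trsm_cl, STARPU_R,  A[k][k],
                                     STARPU_RW, A[m][k], 0);

    for (n = k+1; n < tiles; n++)
        starpu_insert_task(&syrk_cl, STARPU_R,  A[n][k],
                                     STARPU_RW, A[n][n], 0);

    for (n = k+1; n < tiles; n++)
        for (m = k+1; m < tiles; m++)
            starpu_insert_task(&gemm_cl, STARPU_R,  A[m][k],
                                         STARPU_R,  A[n][k],
                                         STARPU_RW, A[m][n], 0);
}
starpu_task_wait_for_all();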

SLIDE 48

1st hands-on session

SLIDE 49

Task Scheduling

SLIDE 50

Why do we need task scheduling?

Blocked Matrix multiplication

[Performance chart: blocked matrix multiplication on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600]

Things can go (really) wrong even on trivial problems!

  • Static mapping?
    – Not portable, too hard for real-life problems
  • Need dynamic task scheduling
    – Performance models

SLIDE 51

Task scheduling

When a task is submitted, it first goes into a pool of “frozen tasks” until all its dependencies are met. The task is then “pushed” to the scheduler. Idle processing units poll for work (“pop”). Various scheduling policies are available, and policies can even be user-defined (selecting one is sketched below).

Scheduler

[Diagram: tasks are pushed into the scheduler; CPU workers and GPU workers pop tasks from it]
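A sketch of selecting a policy programmatically at initialization time (the STARPU_SCHED environment variable typically achieves the same without recompiling):

#include <starpu.h>

struct starpu_conf conf;
starpu_conf_init(&conf);          /* start from the default configuration */
conf.sched_policy_name = "dmda";  /* e.g. "eager", "dm", "dmda", ... */
starpu_init(&conf);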

SLIDE 52

Task scheduling

[Diagram: animation step — a ready task is pushed to the scheduler while CPU and GPU workers pop work]

SLIDE 53

Task scheduling

[Diagram: animation step — an idle worker pops; which task should the scheduler hand out?]

SLIDE 54

Prediction-based scheduling

Load balancing

[Gantt chart: predicted task placement across cpu #1–#3 and gpu #1–#2 over time]

  • Task completion time estimation
  • History-based
  • User-defined cost function
  • Parametric cost model
  • Can be used to implement scheduling
  • E.g. Heterogeneous Earliest Finish Time
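In practice, a history-based model is attached to a codelet through a perfmodel structure; a sketch extending the scal_cl codelet from earlier (field and constant names follow the public API of that era, modulo renamings across versions):

static struct starpu_perfmodel scal_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "scal",               /* key used to store/load the history */
};

starpu_codelet scal_cl = {
    .where     = STARPU_CPU | STARPU_CUDA,
    .cpu_func  = scal_cpu_func,
    .cuda_func = scal_cuda_func,
    .nbuffers  = 1,
    .model     = &scal_model,       /* enables completion-time prediction */
};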

SLIDE 55

[Gantt chart: animation step — tasks placed according to predicted completion times]

Prediction-based scheduling

Load balancing

SLIDE 56

[Gantt chart: animation step — tasks placed according to predicted completion times]

Prediction-based scheduling

Load balancing

SLIDE 57

[Gantt chart: animation step — tasks placed according to predicted completion times]

Prediction-based scheduling

Load balancing

SLIDE 58

Predicting data transfer overhead

Motivations

  • Hybrid platforms
  • Multicore CPUs and GPUs
  • PCI-e bus is a precious resource
  • Data locality vs. Load balancing
  • Cannot avoid all data transfers
  • Minimize them
  • StarPU keeps track of
  • data replicates
  • on-going data movements

[Diagram: hybrid machine with CPUs and GPUs, data A and B replicated across their memories]

SLIDE 59

[Gantt chart: schedule accounting for data transfer times]

  • Data transfer time
  • Sampling based on off-line calibration
  • Can be used to
  • Better estimate the overall execution time
  • Minimize data movements

Prediction-based scheduling

Load balancing

SLIDE 60

Scheduling in a hybrid environment

  • LU without pivoting (16GB input matrix)
  • 8 CPUs (nehalem) + 3 GPUs (FX5800)

Performance models

[Charts: speed (GFlops, up to ~800) and total data transfers (GB, up to ~60) for the greedy, task model, prefetch and data model scheduling variants]

SLIDE 61

Scheduling in a hybrid environment

Performance models

[Charts: animation step — same LU setup, highlighting the next scheduling variant]

SLIDE 62

Scheduling in a hybrid environment

Performance models

[Charts: animation step — same LU setup, highlighting the next scheduling variant]

SLIDE 63

Scheduling in a hybrid environment

Performance models

[Charts: animation step — same LU setup, highlighting the next scheduling variant]

SLIDE 64

Stencil computation

  • 3D 27-point stencil kernel
  • Straightforward kernel implementation
    – CUDA + CPU
    – 3D torus: no boundary conditions

  • Alternate two layers
  • Parallelization: 1D distribution of 3D blocks
  • 2D and 3D are also doable
  • Block boundaries = shadow cells

Our algorithm


SLIDE 65

Stencil computation

  • Load balancing vs. Data locality
  • « dmda » ≈ minimize (T_compute + β · T_transfer)
  • 1 GPU = 1 color
  • Display which GPU did the computation
  • Both load balancing and data locality are needed
  • No need to statically map the blocks

[Timelines: which GPU computed each block over time (one color per GPU) for β = 0, 0.5, 3 and 6 — problem size 256 × 4096 × 4096, 64 blocks]

SLIDE 66

Stencil computation

  • Impact of scheduling policy
  • 3 GPUs (FX5800) – no CPU used :-(
  • 256 x 4096 x 4096 : 64 blocks

[Charts: performance for β = 0 vs. β = #gpus]

SLIDE 67

Mixing PLASMA and MAGMA with StarPU

« SPLAGMA »

Cholesky & QR decompositions

SLIDE 68

  • State-of-the-art algorithms
  • PLASMA (Multicore CPUs)

– Dynamically scheduled with QUARK

  • MAGMA (Multiple GPUs)

– Hand-coded data transfers
– Static task mapping

  • General SPLAGMA design
  • Use PLASMA algorithm with « magnum tiles »
  • PLASMA kernels on CPUs, MAGMA kernels on GPUs
  • Bypass the QUARK scheduler
  • Programmability
  • Cholesky: ~half a week
  • QR: ~2 days of work
  • Quick algorithmic prototyping

Mixing PLASMA and MAGMA with StarPU

SLIDE 69

  • QR decomposition
  • Mordor8 (UTK) : 16 CPUs (AMD) + 4 GPUs (C1060)

Mixing PLASMA and MAGMA with StarPU

SLIDE 70

  • QR decomposition
  • Mordor8 (UTK) : 16 CPUs (AMD) + 4 GPUs (C1060)

Mixing PLASMA and MAGMA with StarPU

[Chart: QR performance — the MAGMA curve shown for reference]

SLIDE 71

  • QR decomposition
  • Mordor8 (UTK) : 16 CPUs (AMD) + 4 GPUs (C1060)

Mixing PLASMA and MAGMA with StarPU

+12 CPUs: ~200 GFlops gained vs. a theoretical ~150 GFlops! Thanks to heterogeneity

SLIDE 72

  • « Super-Linear » efficiency in QR?
  • Kernel efficiency

Kernel     CPU (GFlops)   GPU (GFlops)   Speedup
sgeqrt     9              30             ~3
stsqrt     12             37             ~3
somqr      8.5            227            ~27
sssmqr     10             285            ~28

  • Task distribution observed on StarPU

– sgeqrt: 20% of tasks on GPUs
– sssmqr: 92.5% of tasks on GPUs

  • Taking advantage of heterogeneity!
    – Only do what you are good at
    – Don't do what you are not good at

Mixing PLASMA and MAGMA with StarPU

SLIDE 73

Performance analysis tools

(will be mostly covered during hands-on)

SLIDE 74

Offline performance analysis

Visualize execution traces

  • Generate a Pajé trace
  • A file of the form /tmp/prof_file_user_<your login> should have been created
  • Call fxt_tool -i /tmp/prof_file_user_yourlogin

– A paje.trace file should be generated in current directory

  • ViTE trace visualization tool
  • Freely available from http://vite.gforge.inria.fr/ (open source!)
  • vite paje.trace

[Screenshot: ViTE view of an execution trace on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600]

SLIDE 75

Experimental features

SLIDE 76

Reduction mode

  • Contribution from a series of tasks into a single buffer
  • e.g. Dot product, Matrix multiplication, Histogram, …
  • New data access mode: REDUX
  • Similar to OpenMP's reduction() clause
  • Looks like R/W mode from the point of view of tasks
  • Tasks actually access a transparent per-PU buffer

– initialized by user-provided “init” function

  • User-provided “reduction” function used to reduce the per-PU buffers into the single buffer when switching back to R or R/W mode
    – Can be optimized according to the machine architecture
  • Preliminary results: ×3 acceleration on a Conjugate Gradient application (see the sketch below)
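A sketch of what REDUX access looks like for a dot product. The init/redux kernels are attached to the data handle; starpu_data_set_reduction_methods and the surrounding codelets (init_cl, redux_cl, dot_cl, and the handles) follow later public releases and are given here as an assumption:

/* zero out a per-worker accumulator */
void init_cpu(void *buffers[], void *cl_arg)
{
    double *v = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    *v = 0.0;
}

/* combine two partial accumulators */
void redux_cpu(void *buffers[], void *cl_arg)
{
    double *a = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    double *b = (double *)STARPU_VARIABLE_GET_PTR(buffers[1]);
    *a += *b;
}

/* attach the methods (init_cl/redux_cl wrap the kernels above),
 * then access dot_handle in STARPU_REDUX mode */
starpu_data_set_reduction_methods(dot_handle, &redux_cl, &init_cl);
starpu_insert_task(&dot_cl, STARPU_REDUX, dot_handle,
                            STARPU_R, x_handle,
                            STARPU_R, y_handle, 0);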

SLIDE 77

How about MPI + StarPU?

  • Save programmers the burden of rewriting their MPI code
  • Keep the same MPI flow
  • Work on StarPU data instead of plain data buffers.
  • StarPU provides support for sending data over MPI
  • starpu_mpi_send/recv, isend/irecv, ...

– Equivalents of MPI_Send/Recv, Isend/Irecv, ... but working on StarPU data
– Plus _submit versions

  • Automatically handles all needed CPU/GPU transfers
  • Automatically handles task/communications dependencies
  • Automatically overlaps MPI communications, CPU/GPU transfers, and CPU/GPU computations
    – Thanks to the data transfer request mechanism

SLIDE 78

MPI ping-pong example

for (loop = 0; loop < NLOOPS; loop++) {
    if (!(loop == 0 && rank == 0))
        MPI_Recv(&data, prev_rank, …);

    increment(&data);

    if (!(loop == NLOOPS-1 && rank == size-1))
        MPI_Send(&data, next_rank, …);
}

SLIDE 79

StarPU-MPI ping-pong example

for (loop = 0; loop < NLOOPS; loop++) {
    if (!(loop == 0 && rank == 0))
        starpu_mpi_irecv_submit(data_handle, prev_rank, …);

    task = starpu_task_create();
    task->cl = &increment_codelet;
    task->buffers[0].handle = data_handle;
    task->buffers[0].mode = STARPU_RW;
    starpu_task_submit(task);

    if (!(loop == NLOOPS-1 && rank == size-1))
        starpu_mpi_isend_submit(data_handle, next_rank, …);
}
starpu_task_wait_for_all();

SLIDE 80

  • LU decomposition
  • MPI+multiGPU
  • 4 x 4 GPUs (GT200)
  • Static MPI distribution
  • 2D block cyclic
  • ~SCALAPACK
  • No pivoting!
  • Currently porting UTK's MAGMA + PLASMA

MPI results with LU

SLIDE 81

  • Data distribution over MPI nodes decided by application
  • But data coherency extended to the MPI level
  • Automatic starpu_mpi_send/recv calls for each task
  • Similar to a DSM, but the granularity is a whole piece of data and a whole task
  • All nodes execute the whole algorithm
  • Actual task distribution follows the data being written to

Sequential-looking code!

MPI version of starpu_insert_task

MPI VSM

SLIDE 82

for (k = 0; k < nblocks; k++) {
    starpu_mpi_insert_task(MPI_COMM_WORLD, &cl11,
                           STARPU_RW, data_handles[k][k], 0);

    for (j = k+1; j < nblocks; j++) {
        starpu_mpi_insert_task(MPI_COMM_WORLD, &cl21,
                               STARPU_R,  data_handles[k][k],
                               STARPU_RW, data_handles[k][j], 0);

        for (i = k+1; i < nblocks; i++)
            if (i <= j)
                starpu_mpi_insert_task(MPI_COMM_WORLD, &cl22,
                                       STARPU_R,  data_handles[k][i],
                                       STARPU_R,  data_handles[k][j],
                                       STARPU_RW, data_handles[i][j], 0);
    }
}
starpu_task_wait_for_all();

MPI version of starpu_insert_task

MPI VSM – Cholesky decomposition

SLIDE 83

2nd hands-on session

SLIDE 84

[Diagram: StarPU as the runtime system between parallel compilers / HPC applications / parallel libraries and the operating system, CPU and GPU]

  • StarPU
  • Freely available under LGPL
  • Task Scheduling
  • Required on hybrid platforms
  • Performance modeling

– Tasks and data transfers

  • Results very close to hand-tuned scheduling
  • Used for various computations
  • Cholesky, QR, LU, FFT, stencil, Conjugate Gradient, ...

http://starpu.gforge.inria.fr

Conclusion

Summary


SLIDE 85

  • Granularity is a major concern
  • Finding the optimal block size?
    – Offline parameter auto-tuning
    – Dynamically adapting the block size
  • Parallel CPU tasks
    – OpenMP, TBB, PLASMA // tasks
    – How to dimension parallel sections?
  • Divisible tasks
    – Who decides to divide tasks?

http://starpu.gforge.inria.fr/

Conclusion

Future work

SLIDE 86

Conclusion

Future work

Thanks for your attention!



SLIDE 89

Performance Models

Our History-based proposition

  • Hypothesis
  • Regular applications
  • Execution time independent of the data content

– Static Flow Control

  • Consequence
  • Data description fully characterizes tasks
  • Example: matrix-vector product

– Unique signature: ((1024, 512), 1024, 1024)
– Per-data signature: CRC(1024, 512) = 0x951ef83b
– Task signature: CRC(CRC(1024, 512), CRC(1024), CRC(1024)) = 0x79df36e2

[Diagram: matrix–vector product — a 1024 × 512 matrix multiplying a vector]

SLIDE 90

Performance Models

Our History-based proposition

  • Generalization is easy
  • Task f(D1, … , Dn)
  • Data

– Signature(Di) = CRC(p1, p2, … , pk)

  • Task ~ Series of data

– Signature(D1, ..., Dn) = CRC(sign(D1), ..., sign(Dn))

  • Systematic method
  • Problem independent
  • Transparent for the programmer
  • Efficient
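Purely illustrative — a sketch of how such signatures could be chained (crc32_acc is a hypothetical hash helper, not a StarPU function):

#include <stddef.h>
#include <stdint.h>

/* hypothetical helper: any 32-bit hash with a seed/accumulator works */
extern uint32_t crc32_acc(uint32_t seed, const void *buf, size_t len);

/* Signature(Di) = CRC of the parameters describing the data layout */
static uint32_t data_signature(uint32_t nx, uint32_t elemsize)
{
    uint32_t sig = crc32_acc(0, &nx, sizeof nx);
    return crc32_acc(sig, &elemsize, sizeof elemsize);
}

/* Signature(D1, ..., Dn) = CRC over the per-data signatures */
static uint32_t task_signature(const uint32_t *sig, unsigned n)
{
    uint32_t acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc = crc32_acc(acc, &sig[i], sizeof sig[i]);
    return acc;
}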
SLIDE 91

Evaluation

Example: LU decomposition

  • Faster
  • No code change!
  • More stable

Speed (GFlop/s):

              16k × 16k        30k × 30k
ref.          89.98 ± .297     130.64 ± .166
1st iter      48.31            96.63
2nd iter      103.62           130.23
3rd iter      103.11           133.50
≥ 4 iter      103.92 ± .046    135.90 ± .064

  • Dynamic calibration
  • Simple, but accurate