Composing multiple StarPU applications over heterogeneous machines: a supervised approach

Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst
Inria, LaBRI, University of Bordeaux


SLIDE 1

Composing multiple StarPU applications over heterogeneous machines: a supervised approach

Andra Hugo

With Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst

Inria, LaBRI, University of Bordeaux

RUNTIME INRIA Group

INRIA Bordeaux Sud-Ouest

SLIDE 3

The increasing role of runtime systems

Code reusability

  • Many HPC applications rely on specific parallel libraries
  • Linear algebra, FFT, Stencils
  • Efficient implementations sitting on top of dynamic runtime systems
  • To deal with hybrid, multicore complex hardware
  • E.g. MKL/OpenMP, MAGMA/StarPU
  • To avoid reinventing the wheel!
  • Some applications may benefit from relying on multiple libraries
  • Potentially using different underlying runtime systems…

[Figure: runtime systems shown: Cilk, OpenMP, Intel TBB, Anthill, Harmony, KAAPI, StarPU, StarSs, DAGuE, Charm++, Qilin]

=> And the performance of the application

SLIDE 5

Struggle for resources

Interferences between parallel libraries

  • Parallel libraries typically allocate and bind one thread per core

Problems:

  • Resource over-subscription
  • Resource under-subscription

Solutions:

  • Stand-alone allocation
  • Hand-made allocation

Examples:

  • Sparse direct solvers
  • Code coupling (multi-physics, multi-scale)
  • Etc…

[Figure: CPU 1, CPU 2, CPU 3, CPU 4, GPU. Example: qr_mumps]

=> Composability problem
SLIDE 6

Composability problem

How to deal with it?

  • Advanced environments allow partitioning of hardware resources
  • Intel TBB
  • The pool of workers is split into arenas
  • Lithe
  • Resource sharing management interface
  • Harts are transferred between parallel libraries
  • Main challenge: Automatically adjusting the amount of resources allocated to each library

SLIDE 8

Our approach: Scheduling Contexts

Toward code composability

  • Isolate concurrent parallel codes
  • Similar to lightweight virtual machines
  • Contexts may expand and shrink
  • Hypervised approach
  • Resize contexts
  • Share resources
  • Maximize overall throughput
  • Use dynamic feedback both from application and runtime

[Figure labels: Context A (Push), Context B (Push), CPU workers, GPU workers, Hypervisor]

SLIDE 9

Tackle the Composability problem

  • Runtime System to validate our proposal
  • Scheduling contexts to isolate parallel codes
  • The Hypervisor to (re)size scheduling contexts
SLIDE 11

Using StarPU as an experimental platform

A runtime system for *PU architectures for studying resource negotiation

  • The StarPU runtime system
  • Dynamically schedule tasks on all processing units
  • See a pool of heterogeneous processing units
  • Avoid unnecessary data transfers between accelerators
  • Software VSM (virtual shared memory) for heterogeneous machines

[Figure: task A = A+B running on a machine made of CPUs and GPUs, each with its own memory (M.), holding copies of A and B]
SLIDE 12

Overview of StarPU

Maximizing PU occupancy, minimizing data transfers

  • Accept tasks that may have multiple implementations
  • Potential inter-dependencies
  • Leads to a directed acyclic graph of tasks
  • Data-flow approach
  • Open, general purpose scheduling platform
  • Scheduling policies = plugins

[Figure labels: HPC Applications, Parallel Compilers, Parallel Libraries, StarPU, Drivers (CUDA, OpenCL), CPU, GPU, MIC; task f (ARW, BR, CR) with cpu, gpu and spu implementations]
SLIDE 13

Tasks scheduling

How does it work?

  • When a task is submitted, it first goes into a pool of “frozen tasks” until all dependencies are met
  • Then, the task is “pushed” to the scheduler
  • Idle processing units actively poll for work (“pop”)
  • What happens inside the scheduler is… up to you!
  • Examples:
  • mct, work stealing, eager, priority

[Figure labels: Push, Scheduler, Pop, CPU workers, GPU workers]
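The push/pop interface described above can be illustrated with a minimal sketch of an eager-style policy: one shared FIFO that ready tasks are pushed into and that idle workers pop from. All names here (`struct task`, `struct eager_queue`, `sched_push`, `sched_pop`) are hypothetical and are not the StarPU scheduler API; the sketch is single-threaded, so a real policy would need to protect the queue with a lock.

```c
#include <stddef.h>

/* Hypothetical types for illustration only (not the StarPU API). */
#define QUEUE_CAPACITY 64

struct task { int id; };

struct eager_queue {
    struct task *slots[QUEUE_CAPACITY];
    size_t head;   /* index of the next task to pop */
    size_t count;  /* number of tasks currently queued */
};

/* "Push": called once all dependencies of a task are met.
 * Returns 0 on success, -1 if the queue is full. */
int sched_push(struct eager_queue *q, struct task *t)
{
    if (q->count == QUEUE_CAPACITY)
        return -1;
    q->slots[(q->head + q->count) % QUEUE_CAPACITY] = t;
    q->count++;
    return 0;
}

/* "Pop": called by an idle worker polling for work.
 * Returns NULL when no task is ready. */
struct task *sched_pop(struct eager_queue *q)
{
    if (q->count == 0)
        return NULL;
    struct task *t = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return t;
}
```

Other policies keep the same two entry points and only change what happens between them, for example one queue per worker with stealing, or a priority order instead of FIFO.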
SLIDE 14

Tackle the Composability problem

  • Runtime System to validate our proposal
  • Scheduling contexts to isolate parallel codes
  • The Hypervisor to (re)size scheduling contexts
SLIDE 15

Scheduling Contexts in StarPU

Extension of StarPU

  • “Virtual” StarPU machines
  • Feature their own scheduler
  • Minimize interferences
  • Enforce data locality
  • Allocation of resources
  • Explicit:
  • Programmer’s input
  • Supervised:
  • Tips on the number of resources
  • Tips on the number of flops
  • Shared processing units
SLIDE 16

Scheduling contexts in StarPU

Easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("mct", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);
SLIDE 17

Scheduling contexts in StarPU

Easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("heft", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

// thread 1:
/* define the context associated to kernel 1 */
starpu_set_sched_ctx(sched_ctx1);
/* submit the set of tasks of parallel kernel 1 */
for (i = 0; i < ntasks1; i++)
    starpu_task_submit(tasks1[i]);

// thread 2:
/* define the context associated to kernel 2 */
starpu_set_sched_ctx(sched_ctx2);
/* submit the set of tasks of parallel kernel 2 */
for (i = 0; i < ntasks2; i++)
    starpu_task_submit(tasks2[i]);
SLIDE 18

Experimental evaluation

Platform and Application

  • 9 CPUs (two Intel hexacore processors, 3 cores devoted to executing GPU drivers) + 3 GPUs
  • MAGMA Linear Algebra Library
  • StarPU Implementation
  • Cholesky Factorization kernel
  • Euler3D solver
  • Computational Fluid Dynamics benchmark
  • Rodinia benchmark suite
  • Iterative solver for 3D Euler equations for compressible fluids
  • StarPU Implementation

[Figure: MAGMA – Cholesky Factorization]

SLIDE 19

Composing Magma and the Euler3D solver

Different parallel kernels

  • Computational Fluid Dynamics:
  • Domain decomposition parallelization
  • Independent tasks per iteration
  • Dependencies between iterations
  • Strong affinity with GPUs
  • 2 sub-domains: 2 GPUs
  • Cholesky Factorization:
  • Scalable on both CPUs & GPUs
  • 1 GPU & 9 CPUs
  • Large number of tasks
  • Contexts’ benefits:
  • Enforcing locality constraints

[Chart: CFD and Cholesky Factorization, Time (s). No contexts: 19.8; 2 contexts: 14.2]
SLIDE 21

Micro-benchmark: 9 Cholesky factorizations in parallel

Gain performance from data locality

  • Mixing parallel kernels:
  • Unnecessary data transfers between Host Memory & GPU memory -> blocking waits
  • GPU Memory flushes

[Chart: Time (s) per configuration, each annotated with a GB value; values listed in the order they appear on the chart]

  • Serial Execution: 44.3 s, 87 GB
  • 1 Context: 9 CPUs / 3 GPUs: 52 s, 113 GB
  • 3 Contexts: 3 x (3 CPUs / 1 GPU): 34.8 s, 37 GB
  • 9 Contexts: 9 x (1 CPU / 0.3 GPUs): 34.4 s, 41 GB
SLIDE 22

Tackle the Composability problem

  • Runtime System to validate our proposal
  • Scheduling contexts to isolate parallel codes
  • The Hypervisor to (re)size scheduling contexts
SLIDE 23

The Hypervisor

What if static dimensioning doesn’t work?

  • Idea:
  • Dynamically resize scheduling contexts
  • Different resizing policies
  • Optimization criteria:
  • Minimize resources’ idle time
  • Maximize the instant speed of the resources/contexts
  • Minimize total execution time of the application
  • Workload of the application provided
  • Linear programs to evaluate the best distribution of the resources
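To make "evaluate the best distribution of the resources" concrete, here is a minimal sketch that replaces the linear program with an exhaustive search over integer splits of the machine between two contexts. This is a swapped-in technique chosen for brevity, not the hypervisor's method, and every name (`struct ctx_model`, `ctx_time`, `best_split`) is hypothetical. It assumes each context has a known remaining workload and a measured speed per CPU and per GPU worker.

```c
#include <float.h>

/* Hypothetical model of one context (illustration only). */
struct ctx_model {
    double work;       /* remaining workload, e.g. in flops */
    double cpu_speed;  /* speed of one CPU worker in this context */
    double gpu_speed;  /* speed of one GPU worker in this context */
};

/* Estimated completion time of a context given its workers. */
double ctx_time(const struct ctx_model *c, int ncpus, int ngpus)
{
    double speed = ncpus * c->cpu_speed + ngpus * c->gpu_speed;
    return speed > 0.0 ? c->work / speed : DBL_MAX;
}

/* Try every integer split of the machine between contexts a and b and
 * keep the one that minimizes the completion time of the slower context.
 * The workers given to a are returned through cpus_a and gpus_a. */
double best_split(const struct ctx_model *a, const struct ctx_model *b,
                  int total_cpus, int total_gpus,
                  int *cpus_a, int *gpus_a)
{
    double best = DBL_MAX;
    for (int nc = 0; nc <= total_cpus; nc++) {
        for (int ng = 0; ng <= total_gpus; ng++) {
            double ta = ctx_time(a, nc, ng);
            double tb = ctx_time(b, total_cpus - nc, total_gpus - ng);
            double t = ta > tb ? ta : tb;
            if (t < best) {
                best = t;
                *cpus_a = nc;
                *gpus_a = ng;
            }
        }
    }
    return best;
}
```

The search is only practical for a handful of contexts and workers; a linear program scales to more contexts and allows fractional shares of a resource.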
SLIDE 24

Dealing with non scalable kernels

Idleness-based policy

  • CFD decomposed in 2 sub-domains
  • Static distribution:
  • CFD: 3 GPUs
  • Cholesky Factorization: 9 CPUs
  • Hypervisor’s intervention:
  • CFD: 2 GPUs
  • Cholesky Factorization: 1 GPU & 9 CPUs

[Chart: Time (s). Static distribution of resources: 53.08; Dynamically adjusted distribution of resources: 14.11]
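An idleness-based policy can be sketched as a periodic check that hands over any worker that has been idle for too long. The code below is a minimal illustration under that assumption; the structure, the function name and the threshold parameter are hypothetical and do not come from the StarPU hypervisor.

```c
/* Hypothetical bookkeeping for one worker (illustration only). */
struct worker_state {
    int ctx;           /* id of the context currently owning the worker */
    double idle_time;  /* seconds spent idle since the last resize */
};

/* Move every worker of context `from` that has been idle for longer than
 * max_idle to context `to`. Returns the number of workers moved. */
int resize_on_idleness(struct worker_state *workers, int nworkers,
                       int from, int to, double max_idle)
{
    int moved = 0;
    for (int i = 0; i < nworkers; i++) {
        if (workers[i].ctx == from && workers[i].idle_time > max_idle) {
            workers[i].ctx = to;
            workers[i].idle_time = 0.0;  /* restart the idle counter */
            moved++;
        }
    }
    return moved;
}
```

In the scenario of this slide, the third GPU of the CFD context has no sub-domain to work on, so its idle time would exceed the threshold and it would be handed to the Cholesky context.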

SLIDE 25

Feedback of the application

Application-driven policy

  • 2 streams of parallel kernels
  • 1 of them pops in from time to time (the green one)
  • The hypervisor: assigns some CPUs to the intruder

[Chart: Time (s). Overlapping contexts: 19.70; Dynamically adjusted distribution of resources: 17.20]

SLIDE 28

Facing irregular applications

Speed-based resizing policies

  • Evaluate the speed of contexts
  • Compute the number of resources of each type of architecture needed by each context
  • How many GPUs/CPUs ?
  • To execute in a minimal amount of time

[Equation annotations: nGPUs in Context c, nCPUs in Context c, Workload of Context c]

[Chart: Incorrect distribution of resources over contexts: 24.8; Speed-based policy corrects the initial distribution of resources: 17.29]
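The equation on this slide did not survive extraction; only its annotations remain. One plausible reconstruction, given here as an assumption rather than the authors' exact formula, is a linear program over the number of GPUs and CPUs per context. The symbols are introduced for illustration: $n_{\mathrm{GPU},c}$ and $n_{\mathrm{CPU},c}$ are the numbers of GPUs and CPUs in context $c$, $v_{\mathrm{GPU},c}$ and $v_{\mathrm{CPU},c}$ the measured speeds of one such worker in that context, $W_c$ the workload of context $c$, and $N_{\mathrm{GPU}}$, $N_{\mathrm{CPU}}$ the machine totals. Maximizing $z = 1/t_{\max}$ instead of minimizing $t_{\max}$ keeps the constraints linear.

```latex
\begin{aligned}
\max_{z,\; n_{\mathrm{GPU},c},\; n_{\mathrm{CPU},c}} \quad & z \\
\text{subject to} \quad
& n_{\mathrm{GPU},c}\, v_{\mathrm{GPU},c} + n_{\mathrm{CPU},c}\, v_{\mathrm{CPU},c} \;\ge\; W_c\, z
  && \forall c \\
& \sum_{c} n_{\mathrm{GPU},c} = N_{\mathrm{GPU}}, \qquad
  \sum_{c} n_{\mathrm{CPU},c} = N_{\mathrm{CPU}}
\end{aligned}
```

The first constraint says that the aggregate speed of context $c$ must be enough to finish its workload $W_c$ within $t_{\max} = 1/z$; the other two say that every worker of the machine is assigned to some context.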

SLIDE 29

Conclusion & Future Work

  • Scheduling Contexts allow using multiple parallel libraries simultaneously
  • Currently implemented in the StarPU runtime system
  • A Hypervisor dynamically shrinks / extends contexts
  • Future Work
  • New metrics to guide resizing
  • More intelligent sharing of resources (GPUs)
  • Extend scheduling contexts to other parallel environments
  • And much more!