Running PEPPHER benchmarks on top of the StarPU runtime system




SLIDE 1

Running PEPPHER benchmarks on top of the StarPU runtime system

22nd January 2011

Cédric Augonnet, Nicolas Collin, Nathalie Furmento, Raymond Namyst, Samuel Thibault

INRIA Bordeaux, LaBRI, Université de Bordeaux

SLIDE 2

The StarPU runtime system
Motivations

  • “do dynamically what can’t be done statically”
  • Typical duties
      • Task scheduling
      • Memory management
  • Compilers and libraries generate (graphs of) parallel tasks
      • Additional information is welcome!

[Figure: software stack with HPC Applications, Parallel Compilers, Parallel Libraries, Runtime system, Operating System, CPU, GPU, …]

SLIDE 3

The StarPU runtime system
Motivations

  • Main Challenges
      • Dynamically schedule tasks on all processing units
          – See a pool of heterogeneous cores
          – Scheduling ≠ offloading
      • Avoid unnecessary data transfers between accelerators
          – Need to keep track of data copies

[Figure: CPUs, GPUs and their memories (M.) holding copies of A and B for the task A = A+B]

SLIDE 4

The StarPU runtime system
Memory Management

  • StarPU provides a Virtual Shared Memory subsystem
      • Weak Consistency
      • Replication
      • Single writer
  • High level API
      • Application registers data
      • Input & output of tasks = reference to registered data

[Figure: software stack with HPC Applications, Parallel Compilers, Parallel Libraries, StarPU, Drivers (CUDA, OpenCL), CPU, GPU, …]
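The replication and single-writer rules can be illustrated with a small sketch. This is not StarPU's implementation or API: the class `DataHandle`, its `acquire` method and the node names `ram` and `gpu0` are hypothetical. The sketch keeps, for one registered piece of data, the set of memory nodes holding a valid copy. It counts a transfer only when a node must read the data and has no valid copy, and it invalidates the other replicas on a write.

```python
class DataHandle:
    """Tracks which memory nodes hold a valid copy of one registered datum."""

    def __init__(self, home_node):
        self.valid = {home_node}   # nodes holding an up-to-date replica
        self.transfers = 0         # number of copies moved between nodes

    def acquire(self, node, mode):
        """Make the data usable on `node`; mode is 'R', 'W' or 'RW'."""
        if mode not in ("R", "W", "RW"):
            raise ValueError(mode)
        # A transfer is needed only if the node reads the data ('R' or 'RW')
        # and does not already hold a valid replica.
        if "R" in mode and node not in self.valid:
            self.transfers += 1
        if "W" in mode:
            # Single writer: every other replica becomes stale.
            self.valid = {node}
        else:
            # Replication: readers share valid copies.
            self.valid.add(node)


a = DataHandle("ram")
b = DataHandle("ram")

# A = A + B executed twice on gpu0: A is read-write, B is read-only.
for _ in range(2):
    a.acquire("gpu0", "RW")
    b.acquire("gpu0", "R")

print(sorted(a.valid), sorted(b.valid), a.transfers + b.transfers)
```

In this sketch the second execution of the task triggers no transfer, because both operands are already valid on the GPU. That is the kind of saving that tracking data copies is meant to provide.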

SLIDE 5

The StarPU runtime system
Task scheduling

  • Tasks =
      • Data input & output
      • Dependencies with other tasks
      • Multiple implementations
          – e.g. CUDA and/or CPU
      • Scheduling hints
  • StarPU provides an Open Scheduling platform
      • Scheduling algorithm = plug-ins

[Figure: software stack with HPC Applications, Parallel Compilers, Parallel Libraries, StarPU, Drivers (CUDA, OpenCL), CPU, GPU, …; a task f with cpu, gpu and spu implementations accessing data (A: RW, B: R, C: R)]
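The task model (several implementations per task, scheduling hints, scheduling policies as plug-ins) can be sketched as follows. All names here (`Codelet`, `Worker`, `eager`, `earliest_finish`, `submit`) and the cost values are assumptions made for illustration; they are not StarPU's API. A policy is simply a function that picks a worker, so swapping the policy does not touch the task definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Codelet:
    """A task type with one implementation per architecture (hypothetical)."""
    name: str
    impls: Dict[str, Callable[[float, float], float]]
    cost: Dict[str, float]          # scheduling hint: estimated run time per arch


@dataclass
class Worker:
    name: str
    arch: str
    busy_until: float = 0.0         # time at which the worker becomes free


def eager(codelet: Codelet, workers: List[Worker]) -> Worker:
    """Policy 1: the capable worker that becomes free first."""
    able = [w for w in workers if w.arch in codelet.impls]
    return min(able, key=lambda w: w.busy_until)


def earliest_finish(codelet: Codelet, workers: List[Worker]) -> Worker:
    """Policy 2: the capable worker with the earliest estimated completion."""
    able = [w for w in workers if w.arch in codelet.impls]
    return min(able, key=lambda w: w.busy_until + codelet.cost[w.arch])


def submit(codelet, args, workers, policy):
    """Pick a worker with the plugged-in policy, then run the matching variant."""
    w = policy(codelet, workers)
    w.busy_until += codelet.cost[w.arch]
    return w.name, codelet.impls[w.arch](*args)


add = Codelet(
    name="add",
    impls={"cpu": lambda a, b: a + b, "cuda": lambda a, b: a + b},
    cost={"cpu": 4.0, "cuda": 1.0},
)
workers = [Worker("cpu0", "cpu"), Worker("gpu0", "cuda")]
placement = [submit(add, (1.0, 2.0), workers, earliest_finish)[0] for _ in range(6)]
print(placement)
```

With these made-up costs the completion-time policy sends most tasks to the faster device but still uses the CPU once the GPU queue is long enough, which illustrates the "scheduling ≠ offloading" point.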

SLIDE 6

PEPPHER Benchmarks

  • Fast Fourier Transform (FFT)
      • Mixing FFTW and CUFFTW
  • Dense Linear Algebra
      • Mixing PLASMA and MAGMA
  • Computational Fluid Dynamics (CFD)
      • Porting Rodinia's CFD

SLIDE 7

Dense Linear Algebra
Mixing PLASMA and MAGMA
(Collaboration with UTK)

SLIDE 8

Mixing PLASMA and MAGMA with StarPU
Background

  • Background
      • Cholesky/LU/QR: Solve dense linear systems
      • UTK: ~leaders in Dense Linear Algebra for 20 years
      • Need performance portability
  • State of the art libraries
      • PLASMA (Multicore CPUs)
      • MAGMA (Multiple GPUs)
  • Our approach
      • Use PLASMA algorithms
      • PLASMA kernels on CPUs, MAGMA kernels on GPUs
      • Schedule tasks with StarPU

SLIDE 9

Mixing PLASMA and MAGMA with StarPU
Productivity

// Sequential Tile Cholesky
FOR k = 0..TILES-1
    DPOTRF(A[k][k])
    FOR m = k+1..TILES-1
        DTRSM(A[k][k], A[m][k])
    FOR n = k+1..TILES-1
        DSYRK(A[n][k], A[n][n])
        FOR m = n+1..TILES-1
            DGEMM(A[m][k], A[n][k], A[m][n])

// Hybrid Tile Cholesky
FOR k = 0..TILES-1
    starpu_Insert_Task(DPOTRF, …)
    FOR m = k+1..TILES-1
        starpu_Insert_Task(DTRSM, …)
    FOR n = k+1..TILES-1
        starpu_Insert_Task(DSYRK, …)
        FOR m = n+1..TILES-1
            starpu_Insert_Task(DGEMM, …)

  • Programmability
      • Cholesky: ~half a week, QR: ~2 days of work, LU: ~time to write new kernels
  • Quick algorithmic prototyping
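To make the loop nest concrete, here is a minimal Python sketch that enumerates the tasks in submission order and infers their dependencies from the tiles they access. It does not call StarPU or any BLAS/LAPACK kernel. The access modes (the last tile of each kernel is read-write, the others read-only) and the helper names `tile_cholesky_tasks` and `dependencies` are assumptions made for illustration.

```python
from collections import Counter


def tile_cholesky_tasks(tiles):
    """Yield (kernel, tile accesses) in the submission order of the loop nest.

    Each access is ((row, col), mode) with mode 'R' or 'RW'.
    """
    for k in range(tiles):
        yield "DPOTRF", [((k, k), "RW")]
        for m in range(k + 1, tiles):
            yield "DTRSM", [((k, k), "R"), ((m, k), "RW")]
        for n in range(k + 1, tiles):
            yield "DSYRK", [((n, k), "R"), ((n, n), "RW")]
            for m in range(n + 1, tiles):
                yield "DGEMM", [((m, k), "R"), ((n, k), "R"), ((m, n), "RW")]


def dependencies(tasks):
    """Infer dependencies from data accesses in submission order.

    A task depends on the last writer of each tile it touches; a writer
    also waits for the readers that came after that last write.
    """
    last_writer, readers, deps = {}, {}, []
    for i, (_, accesses) in enumerate(tasks):
        d = set()
        for tile, mode in accesses:
            if tile in last_writer:
                d.add(last_writer[tile])
            if mode == "RW":
                d.update(readers.get(tile, ()))
                last_writer[tile], readers[tile] = i, []
            else:
                readers.setdefault(tile, []).append(i)
        d.discard(i)
        deps.append(sorted(d))
    return deps


tasks = list(tile_cholesky_tasks(3))
counts = Counter(kernel for kernel, _ in tasks)
deps = dependencies(tasks)
for (kernel, accesses), d in zip(tasks, deps):
    print(kernel, [t for t, _ in accesses], "after tasks", d)
```

For a 3x3 tile matrix this produces 10 tasks (3 DPOTRF, 3 DTRSM, 3 DSYRK, 1 DGEMM). Independent tasks in the resulting graph, such as the two DTRSM of the first step, are the ones a runtime is free to place on different processing units.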
SLIDE 10

Mixing PLASMA and MAGMA with StarPU

  • Cholesky decomposition
      • Hannibal: 8 CPU cores (Nehalem) + 3 GPUs (NV FX5800)

SLIDE 11

Mixing PLASMA and MAGMA with StarPU

  • QR decomposition
      • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

SLIDE 12

Mixing PLASMA and MAGMA with StarPU

  • QR decomposition
      • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

[Figure label: MAGMA]

SLIDE 13

Mixing PLASMA and MAGMA with StarPU

  • QR decomposition
      • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

[Figure annotations: +12 CPUs: ~200 GFlops; Peak: 12 cores ~150 GFlops]

SLIDE 14

Mixing PLASMA and MAGMA with StarPU

  • Memory transfers during Cholesky decomposition

[Figure annotation: ~2.5x fewer transfers]

SLIDE 15

Mixing PLASMA and MAGMA with StarPU
Perspective

  • Add more algorithms
      • 2-sided Factorizations (e.g. Hessenberg)
      • Solvers
  • Going to be released as a standalone library
      • Toward a complete LAPACK implementation for hybrid computing
  • Need autotuning facilities!
  • Next step: integrate MPI
      • On-going work
      • Accelerated SCALAPACK?

SLIDE 16

Rodinia's CFD Solver

SLIDE 17

Rodinia's CFD Solver
Background

  • The Rodinia benchmark suite
      • Covers the different "Berkeley Dwarfs"
      • Available either in OpenMP or in CUDA
      • Neither multi-GPU nor hybrid systems
  • Rodinia's CFD Solver benchmark
      • 3D Euler equations for compressible flow
      • Unstructured Grid Finite Volumes
      • Memory intensive kernel
      • Pre-processing and Post-processing are not available
          – Need to create our own input meshes

SLIDE 18

Rodinia's CFD Solver
Methodology

  • Pre-processing
      • Generated a mesh of the air around a sphere
      • Still very simple!
  • Parallelizing the problem
      • Partition the mesh using SCOTCH
      • 1 task = update 1 part
      • Redundant computation
      • Exchange part boundaries
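The "1 task = update 1 part" scheme with boundary exchange can be sketched on a toy problem. This is an illustration under strong assumptions, not the benchmark's code: a 1D row of cells replaces the unstructured 3D mesh, contiguous chunks replace a SCOTCH partition, and a neighbour average replaces the CFD kernel. All function names are hypothetical. The sketch checks that updating the parts independently, after exchanging one layer of ghost cells, gives the same result as the sequential update.

```python
def step(u):
    """Reference update: each interior cell becomes the mean of its neighbours."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]


def split(u, parts):
    """Cut the cells into contiguous parts (a stand-in for a real mesh partitioner)."""
    size = (len(u) + parts - 1) // parts
    return [u[i:i + size] for i in range(0, len(u), size)]


def exchange(chunks):
    """Boundary exchange: give each part a ghost copy of its neighbours' edge cells."""
    padded = []
    for p, chunk in enumerate(chunks):
        left = [chunks[p - 1][-1]] if p > 0 else []
        right = [chunks[p + 1][0]] if p + 1 < len(chunks) else []
        padded.append((left, chunk, right))
    return padded


def update_part(left, chunk, right):
    """One task: update one part using its ghost cells; domain ends stay fixed."""
    ext = left + chunk + right
    out = []
    for i in range(len(left), len(left) + len(chunk)):
        if i == 0 or i == len(ext) - 1:
            out.append(ext[i])          # physical boundary of the whole domain
        else:
            out.append((ext[i - 1] + ext[i + 1]) / 2)
    return out


u = [float(i * i % 7) for i in range(16)]
ref, chunks = u, split(u, 4)
for _ in range(5):
    ref = step(ref)                                              # sequential
    chunks = [update_part(*halo) for halo in exchange(chunks)]   # one task per part
flat = [x for chunk in chunks for x in chunk]
print(flat == ref)
```

Here the ghost cells are refreshed by an exchange before every step. The "redundant computation" item on the slide points to the usual alternative of recomputing some boundary cells locally instead of exchanging them at every step.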

SLIDE 19

Rodinia's CFD Solver
Post-processing

SLIDE 20

Rodinia's CFD Solver
Preliminary results

  • Problem size
      • 64x64x64 grid, 1.3 million tetrahedra
  • Reference CPU performance
      • 1 core (Intel Westmere X5650)
          – 1.4 s per iteration
      • 12 cores
          – 0.15 s per iteration
  • Preliminary performance with StarPU
      • 1 NVIDIA C2050
          – 53 ms per iteration
      • 2 NVIDIA C2050
          – 28 ms per iteration
  • We need large problems!
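The timings above translate into speedups with a few divisions. The short script below only restates the slide's numbers; the dictionary keys and the helper `speedup` are illustrative names.

```python
# Times per iteration taken from the slide, in seconds.
times = {
    "1 core (X5650)": 1.4,
    "12 cores": 0.15,
    "1 C2050": 0.053,
    "2 C2050": 0.028,
}


def speedup(baseline, config):
    """How many times faster `config` is than `baseline`."""
    return times[baseline] / times[config]


for config in times:
    print(f"{config:>16}: {speedup('1 core (X5650)', config):5.1f}x vs 1 core")

gpu_scaling = speedup("1 C2050", "2 C2050")               # ideal would be 2.0
cpu_efficiency = speedup("1 core (X5650)", "12 cores") / 12
print(f"2 GPUs vs 1 GPU: {gpu_scaling:.2f}x, 12-core efficiency: {cpu_efficiency:.0%}")
```

Relative to one CPU core, this gives roughly 9.3x on 12 cores, 26x on one C2050 and 50x on two. Going from one GPU to two yields about 1.9x, and one GPU is about 2.8x faster than the 12 cores.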

SLIDE 21

Rodinia's CFD Solver
Perspective

  • Port to OpenCL
  • Use hybrid platforms
      • GPUs are much faster than CPUs
          – Memory bound
          – Rather few tasks
      • Parallel CPU tasks
          – Large granularity
  • Heterogeneity-aware data layout
      • CPUs: Arrays of Structures (cache friendly)
      • GPUs: Structures of Arrays (SIMD friendly)
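The two layouts named in the last item can be shown on a tiny example. The field names (`density`, `momentum`, `energy`) and the helpers `aos_to_soa` and `soa_to_aos` are assumptions chosen for illustration; the benchmark's actual variables are not given on the slide.

```python
from array import array

# Array of Structures: one record per cell, fields stored together.
aos = [
    {"density": 1.0, "momentum": 0.5, "energy": 2.5},
    {"density": 1.1, "momentum": 0.4, "energy": 2.4},
    {"density": 0.9, "momentum": 0.6, "energy": 2.6},
]


def aos_to_soa(records):
    """Structure of Arrays: one contiguous array of doubles per field."""
    fields = records[0].keys()
    return {f: array("d", (r[f] for r in records)) for f in fields}


def soa_to_aos(columns):
    """Inverse conversion, e.g. before handing data back to a CPU kernel."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


soa = aos_to_soa(aos)
print(list(soa["density"]))
```

In the SoA form, consecutive threads reading the same field of consecutive cells touch consecutive memory locations, which is what makes it SIMD friendly. A heterogeneity-aware runtime would have to convert between the two forms when data moves between CPU and GPU tasks, as the two helpers suggest.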

SLIDE 22

Conclusion

  • StarPU
      • Data management & Task scheduling
      • Freely available under LGPL on Linux, Mac and Windows
  • Adapted 3 PEPPHER benchmarks
      • FFTW + CUFFTW
      • MAGMA + PLASMA
      • Rodinia's CFD Solver

SLIDE 23

Conclusion

  • Productive approach
      • Rely on existing kernels for CPU/GPU
      • Architecture independent task model
      • Higher-level front-ends would help
          – StarSs, HMPP, Codeplay's Offload
  • Autotuning will be required
      • Need to find optimal granularity
          – Parallel tasks
          – Divisible tasks
      • Select code variants
          – e.g. with SkePU