

SLIDE 1

CPU+GPU Load Balance Guided by Execution Time Prediction

Jean-François Dollinger, Vincent Loechner

Inria CAMUS, ICube Lab., University of Strasbourg jean-francois.dollinger@inria.fr, vincent.loechner@inria.fr

19 January 2015

1 / 29

SLIDE 2

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

1 / 29

SLIDE 3

Introduction

Achieving and predicting performance on CPU/GPU is difficult. It is sensitive to:

  • Input dataset (CUDA grid size, cache effects)
  • Compiler optimizations (unrolling, fission)
  • Cloud infrastructures
  • Hardware availability
  • Efficient exploitation of resources

2 / 29

SLIDE 4

Introduction

Because of dynamic behaviors, compilers miss performance opportunities:

  • PLUTO
  • PPCG
  • Par4All
  • OpenACC/HMPP: manual tuning

→ Automatic methods are the way to go (typical use case)
→ Our interest: polyhedral codes

3 / 29

SLIDE 5

Introduction

How to get performance?

  • Run the right code on the right PU (Processing Unit)
  • Select the best code version on each given PU
  • Ensure load balance between PUs

→ Multi-versioning + runtime code selection = win

4 / 29

SLIDE 6

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

5 / 29

SLIDE 7

Prediction

Overview

Multi-versioning: performance factors

  • Static factors (instructions)
  • External dynamic factors (scheduler)
  • Internal dynamic factors (cache effects, memory contention)

6 / 29

SLIDE 8

Prediction

Overview

Framework overview (diagram):

1 Extract the SCoP:

    #pragma scop
    for(...)
      for(...)
        for(...)
          S0(...);
    #pragma endscop

2 Static code generation:
   Launch PPCG → GPU code versions (version 0, version 1, version 2)
   Launch PLUTO → parallelize and chunk for the CPU:

    #pragma omp parallel for
    for(t0 = lb; t0 < ub; ...)
      for(...)
        for(...)
          S0(...);

   Build templates

3 Offline profiling:
   Profile kernels → kernel duration, ranking tables
   Profile memcpy → memcpy duration, bandwidth table

4 Runtime prediction:
   memcpy duration + kernel duration → schedule (GPU/CPU) per code version

5 Application binary (object file):

    ...
    /* scop */
    call schedule(...)
    call dispatch(...)
    /* endscop */
    ...
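At runtime, the binary replaces each SCoP with calls into the prediction runtime. A minimal sketch of that glue code, with hypothetical types and signatures inferred from the diagram above (not the authors' actual API):

    /* Hypothetical chunk descriptor and runtime entry points (illustrative). */
    typedef struct {
        long lb, ub;   /* iteration sub-range of the outer parallel loop */
        int  pu;       /* processing unit this chunk is assigned to */
        int  version;  /* code version selected for that PU */
    } chunk_t;

    void schedule(chunk_t *chunks, int n_pu, long n_iter); /* uses ranking and bandwidth tables */
    void dispatch(const chunk_t *chunks, int n_pu);        /* runs each chunk on its PU */

    void scop_entry(long n_iter) {
        chunk_t chunks[8];
        int n_pu = 3;                      /* e.g. CPU + 2 GPUs */
        schedule(chunks, n_pu, n_iter);    /* size the chunks from predicted times */
        dispatch(chunks, n_pu);            /* execute them concurrently */
    }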

7 / 29

SLIDE 9

Prediction

Overview

Pedro Framework [Benoit Pradelle et al. 2011]

  • Multi-versioning of polyhedral loop nests
  • Target: multicore CPUs

8 / 29

SLIDE 10

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

9 / 29

SLIDE 11

Prediction

Code generation

Code version

  • Block size
  • Tile size
  • Schedule

→ controlled by PPCG command-line options

PPCG, a source-to-source compiler:

  • Transforms C to CUDA
  • Generates:
  • Ehrhart polynomials
  • Loop nest parameters

Python scripts

  • Fill templates in C code
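For intuition on the generated Ehrhart polynomials: such a polynomial counts the integer points of a parametric loop nest, and the runtime evaluates it to obtain iteration counts. A minimal, self-contained example (illustrative, not actual PPCG output):

    /* The triangular nest
     *   for (i = 0; i < N; i++)
     *     for (j = 0; j <= i; j++)
     *       S0(i, j);
     * executes exactly N*(N+1)/2 instances of S0: its Ehrhart
     * polynomial in the parameter N is (N^2 + N)/2. */
    long iteration_count(long N) {
        return N * (N + 1) / 2;
    }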

10 / 29

SLIDE 12

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

11 / 29

SLIDE 13

Prediction

Data transfers: host ↔ device

  • Parameter: message size
  • Asymmetric and non-uniform bandwidth

Code simulation

  • Parameters: number of CUDA blocks, sequential parameters
  • Load balance
  • Memory contention

How to model the performance curves?

  • Affine intervals detection
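As an illustration of the memcpy profiling step, here is a minimal sketch that measures host→device bandwidth across message sizes with CUDA events (one transfer per size; the actual profiler presumably repeats and averages):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Measure host->device bandwidth for doubling message sizes. */
    void profile_host_to_dev(size_t max_kb) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        for (size_t kb = 1; kb <= max_kb; kb *= 2) {
            size_t bytes = kb * 1024;
            void *host = malloc(bytes), *dev;
            cudaMalloc(&dev, bytes);
            cudaEventRecord(start, 0);
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%zu KB: %.1f MB/s\n", kb, (bytes / 1e6) / (ms / 1e3));
            cudaFree(dev);
            free(host);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }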

12 / 29

SLIDE 14

Prediction

Testbed

1st test platform

  • 2 Nvidia GTX 590 (16 SMs × 32 SPs)
  • Asus P8P67-Pro (PCIe 2, x8 per card)
  • Core i7 2700k, stock

2nd test platform

  • Nvidia GTX 680 (8 SMs × 192 SPs)
  • Asus P8P67-Deluxe (PCIe 2, x16)
  • Core i7 2600

13 / 29

SLIDE 15

Prediction

Data transfers (testbed 1)

[Plot: bandwidth (MB/s) vs. message size (KB) on the GTX 590, real vs. profiled curves for dev→host and host→dev transfers]

14 / 29

SLIDE 16

Prediction

Data transfers (testbed 2)

[Plot: bandwidth (MB/s) vs. message size (KB) on the GTX 680, real vs. profiled curves for dev→host and host→dev transfers]

15 / 29

SLIDE 17

Prediction

Kernel simulation (testbed 1)

[Plot: execution time per iteration (ns) vs. number of CUDA blocks, gemm 32x16 on the GTX 590, real vs. profiled]

16 / 29

SLIDE 18

Prediction

Kernel simulation (testbed 1)

[Plot: execution time per iteration $e_i$ (ns) vs. parameter size $p_i$, syr2k on the GTX 590, real points and the fitted affine model $e_i = \beta p_i + u_i$]

17 / 29
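Each affine interval of the model above can be fitted over the profiled points, for instance by ordinary least squares. A minimal sketch under that assumption (the slides do not show how β and u are obtained):

    /* Fit e = beta * p + u over n profiled (p, e) points. */
    void fit_affine(const double *p, const double *e, int n,
                    double *beta, double *u) {
        double sp = 0, se = 0, spp = 0, spe = 0;
        for (int i = 0; i < n; i++) {
            sp  += p[i];
            se  += e[i];
            spp += p[i] * p[i];
            spe += p[i] * e[i];
        }
        *beta = (n * spe - sp * se) / (n * spp - sp * sp);
        *u    = (se - *beta * sp) / n;
    }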

SLIDE 19

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

18 / 29

SLIDE 20

Runtime

CPU + GPU

Outermost parallel loop split into chunks

  • Each chunk is associated to one PU
  • PU performance differs

→ Ensure load balance

Multi-versioning

  • Code optimized towards its target (PLUTO + PPCG)
  • Multiple code versions (combined)

Two components:

  • Scheduler:
    • Predicts execution time of chunks [B. Pradelle et al.] + [J.-F. Dollinger et al.]
    • Adjusts chunk sizes
  • Dispatcher (a minimal sketch follows)
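A minimal dispatcher sketch overlapping one CPU chunk and one GPU chunk, assuming hypothetical generated entry points launch_gpu_chunk (asynchronous CUDA launch) and run_cpu_chunk (OpenMP parallel loop):

    #include <cuda_runtime.h>

    void launch_gpu_chunk(long lb, long ub);  /* assumed: async kernel launch */
    void run_cpu_chunk(long lb, long ub);     /* assumed: OpenMP parallel loop */

    /* Run iterations [0, split) on the GPU and [split, n) on the CPU. */
    void dispatch_cpu_gpu(long split, long n) {
        launch_gpu_chunk(0, split);   /* returns immediately */
        run_cpu_chunk(split, n);      /* CPU works while the GPU runs */
        cudaDeviceSynchronize();      /* wait for the GPU chunk to finish */
    }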

19 / 29

SLIDE 21

Runtime

CPU + GPU

Scheduler operation

1. $T_0 = t_0 \cdot \mathrm{Card}\,D_0 \approx t_1 \cdot \mathrm{Card}\,D_1 \approx \dots \approx t_n \cdot \mathrm{Card}\,D_n$
2. Each $T_i$ must tend to $\frac{1}{n} \sum_{i=0}^{n-1} t_i \cdot \mathrm{Card}\,D_i = \frac{1}{n}\,T_{\mathrm{all}}$
3. $t_i = f(G_i, \mathit{seq})$ on a GPU
4. $t_i = g(P_i, S_i)$ on a CPU

Eliminate inefficient PUs
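From condition 2, balancing assigns each PU a share of iterations inversely proportional to its predicted time per iteration. A minimal sketch of one such sizing step (hypothetical helper, not the authors' code):

    /* Size chunks so that t_i * Card(D_i) is equal across PUs:
     * Card(D_i) is proportional to 1/t_i. */
    void balance(const double *t, int n_pu, long n_iter, long *card) {
        double inv_sum = 0;
        for (int i = 0; i < n_pu; i++)
            inv_sum += 1.0 / t[i];
        long assigned = 0;
        for (int i = 0; i < n_pu; i++) {
            card[i] = (long)(n_iter * (1.0 / t[i]) / inv_sum);
            assigned += card[i];
        }
        card[0] += n_iter - assigned;   /* give rounding leftovers to PU 0 */
    }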

20 / 29

SLIDE 22

Runtime

CPU + GPU

[Two plots: workload proportion and execution time proportion per PU (y-axis, 0–1) over scheduler steps 1–6 (x-axis), both converging toward a balanced distribution]

21 / 29

SLIDE 23

Runtime

CPU + GPU (speedup to one PU)

[Plot: speedup relative to one PU for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver; configurations: CPU, 1 GPU, CPU+1 GPU, CPU+2 GPUs, CPU+3 GPUs, CPU+4 GPUs]

22 / 29

SLIDE 24

Runtime

CPU + GPU (load imbalance)

[Plot: load imbalance (0–1) for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver; configurations: CPU+1 GPU, CPU+2 GPUs, CPU+3 GPUs, CPU+4 GPUs]

23 / 29

SLIDE 25

Runtime

Multiversioning CPU + GPU (speedup to worst)

[Plot: speedup relative to the worst version for syr2k code versions c1–c9 and the combined version (all); configurations: CPU, 1 GPU, CPU+1 GPU through CPU+4 GPUs]

24 / 29

SLIDE 26

Runtime

Multiversioning CPU + GPU (imbalance)

[Plot: load imbalance for syr2k code versions c1–c9 and the combined version (all); configurations: CPU+1 GPU through CPU+4 GPUs]

25 / 29

SLIDE 27

Conclusion

Framework capabilities

  • Execution time prediction
  • Fastest version selection
  • CPU vs GPU competition
  • CPU + GPU joint usage

Future work

  • Energy consumption

26 / 29

SLIDE 28

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

27 / 29

SLIDE 29

Annex

Offline profiling: ranking table

Number of threads | version 1 | version 2 | version 3
                1 |     40 ms |     55 ms |     32 ms
                2 |     32 ms |     28 ms |     17 ms
                3 |     22 ms |     15 ms |      9 ms
                4 |     14 ms |      7 ms |      8 ms

Online prediction: execution time computation

  • Observation = {2000, 600, 300, 300}

prediction(version 1) = ((2000 − 600) × 40) + ((600 − 300) × 32) + ((300 − 300) × 22) + (300 × 14) = 69800 ms
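This worked example generalizes to any ranking-table column: with per-thread workloads sorted in descending order, each segment between consecutive workloads runs at a different thread count and is charged the corresponding per-iteration time. A sketch of that computation (hypothetical helper matching the numbers above):

    /* obs: per-thread workloads, sorted descending ({2000, 600, 300, 300});
     * tbl: time per iteration with 1..n active threads ({40, 32, 22, 14}).
     * Segment (obs[k] - obs[k+1]) runs while k+1 threads are active. */
    double predict(const long *obs, const double *tbl, int n) {
        double time = 0;
        for (int k = 0; k < n; k++) {
            long next = (k + 1 < n) ? obs[k + 1] : 0;
            time += (double)(obs[k] - next) * tbl[k];
        }
        return time;   /* for the example above: 69800 ms */
    }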

28 / 29

SLIDE 30

Annex

The algorithm stages (see the sketch below):

  • Init.: distribute iterations equally amongst the PUs
  • Repeat 10 times:
    • Compute each chunk's execution time $T_i$
    • $r_i = T_i / T_{\mathrm{all}}$
    • Adjust chunk sizes according to $r_i$
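A minimal sketch of this refinement loop (predict_chunk_time is a hypothetical helper standing in for the prediction tables; assumes at most 16 PUs):

    double predict_chunk_time(int pu, long card);  /* assumed: from prediction tables */

    /* Refine chunk sizes for 10 steps: each PU's target share of the
     * total predicted time is 1/n_pu, so scale card[i] by (1/n_pu)/r_i. */
    void refine(long *card, int n_pu, long n_iter) {
        for (int step = 0; step < 10; step++) {
            double T[16], T_all = 0;
            for (int i = 0; i < n_pu; i++) {
                T[i] = predict_chunk_time(i, card[i]);
                T_all += T[i];
            }
            long assigned = 0;
            for (int i = 0; i < n_pu; i++) {
                double r = T[i] / T_all;            /* r_i = T_i / T_all */
                card[i] = (long)(card[i] * (1.0 / n_pu) / r);
                assigned += card[i];
            }
            card[0] += n_iter - assigned;           /* fix rounding drift */
        }
    }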

29 / 29