

SLIDE 1

CPU+GPU Load Balance Guided by Execution Time Prediction

Jean-François Dollinger, Vincent Loechner

Inria CAMUS, ICube Lab., University of Strasbourg jean-francois.dollinger@inria.fr, vincent.loechner@inria.fr

19 January 2015

1 / 29

SLIDE 2

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

1 / 29

SLIDE 3

Introduction

Achieving and predicting performance on CPU/GPU is difficult. It is sensitive to:

  • Input dataset (CUDA grid size, cache effects)
  • Compiler optimizations (unrolling, fission)
  • Cloud infrastructures
  • Hardware availability
  • Efficient exploitation of resources

2 / 29

SLIDE 4

Introduction

Because of dynamic behaviors, compilers miss performance opportunities:

  • PLUTO
  • PPCG
  • Par4All
  • OpenACC/HMPP: manual tuning

→ Automatic methods are the way to go (typical use case)
→ Our interest: polyhedral codes

3 / 29

SLIDE 5

Introduction

How to get performance?

  • Run the right code on the right PU (Processing Unit)
  • Select the best code version on each given PU
  • Ensure load balance between PUs

→ Multi-versioning + runtime code selection = win

4 / 29

SLIDE 6

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

5 / 29

SLIDE 7

Prediction

Overview

Multi-versioning: performance factors

  • Static factors (instructions)
  • External dynamic factors (scheduler)
  • Internal dynamic factors (cache effects, memory contention)

6 / 29

SLIDE 8

Prediction

Overview

Framework overview (diagram):

1 Extract the SCoP:

    #pragma scop
    for(...)
      for(...)
        for(...)
          S0(...);
    #pragma endscop

2 Static code generation:
   Launch PPCG → GPU code versions (version 0, version 1, version 2)
   Launch PLUTO → parallelize and chunk for the CPU:

    #pragma omp parallel for
    for(t0 = lb; t0 < ub; ...)
      for(...)
        for(...)
          S0(...);

   Build templates

3 Offline profiling:
   Profile kernels → kernel duration, ranking tables
   Profile memcpy → memcpy duration, bandwidth table

4 Runtime prediction:
   memcpy duration + kernel duration → schedule (GPU/CPU) per code version

5 Application binary (object file):

    ...
    /* scop */
    call schedule(...)
    call dispatch(...)
    /* endscop */
    ...
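At runtime, the binary replaces each SCoP with calls into the prediction runtime. A minimal sketch of that glue code, with hypothetical types and signatures inferred from the diagram above (not the authors' actual API):

    /* Hypothetical chunk descriptor and runtime entry points (illustrative). */
    typedef struct {
        long lb, ub;   /* iteration sub-range of the outer parallel loop */
        int  pu;       /* processing unit this chunk is assigned to */
        int  version;  /* code version selected for that PU */
    } chunk_t;

    void schedule(chunk_t *chunks, int n_pu, long n_iter); /* uses ranking and bandwidth tables */
    void dispatch(const chunk_t *chunks, int n_pu);        /* runs each chunk on its PU */

    void scop_entry(long n_iter) {
        chunk_t chunks[8];
        int n_pu = 3;                      /* e.g. CPU + 2 GPUs */
        schedule(chunks, n_pu, n_iter);    /* size the chunks from predicted times */
        dispatch(chunks, n_pu);            /* execute them concurrently */
    }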

7 / 29

SLIDE 9

Prediction

Overview

Pedro Framework [Benoit Pradelle et al. 2011]

  • Multi-versioning of polyhedral loop nests
  • Target: multicore CPUs

8 / 29

SLIDE 10

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

9 / 29

SLIDE 11

Prediction

Code generation

Code version

  • Block size
  • Tile size
  • Schedule

→ controlled by PPCG command-line options

PPCG, a source-to-source compiler:

  • Transforms C to CUDA
  • Generates:
  • Ehrhart polynomials
  • Loop nest parameters

Python scripts

  • Fill templates in C code
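For intuition on the generated Ehrhart polynomials: such a polynomial counts the integer points of a parametric loop nest, and the runtime evaluates it to obtain iteration counts. A minimal, self-contained example (illustrative, not actual PPCG output):

    /* The triangular nest
     *   for (i = 0; i < N; i++)
     *     for (j = 0; j <= i; j++)
     *       S0(i, j);
     * executes exactly N*(N+1)/2 instances of S0: its Ehrhart
     * polynomial in the parameter N is (N^2 + N)/2. */
    long iteration_count(long N) {
        return N * (N + 1) / 2;
    }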

10 / 29

SLIDE 12

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

11 / 29

SLIDE 13

Prediction

Data transfers: host ↔ device

  • Parameter: message size
  • Asymmetric and non-uniform bandwidth

Code simulation

  • Parameters: number of CUDA blocks, sequential parameters
  • Load balance
  • Memory contention

How to model the performance curves?

  • Affine intervals detection
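As an illustration of the memcpy profiling step, here is a minimal sketch that measures host→device bandwidth across message sizes with CUDA events (one transfer per size; the actual profiler presumably repeats and averages):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Measure host->device bandwidth for doubling message sizes. */
    void profile_host_to_dev(size_t max_kb) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        for (size_t kb = 1; kb <= max_kb; kb *= 2) {
            size_t bytes = kb * 1024;
            void *host = malloc(bytes), *dev;
            cudaMalloc(&dev, bytes);
            cudaEventRecord(start, 0);
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%zu KB: %.1f MB/s\n", kb, (bytes / 1e6) / (ms / 1e3));
            cudaFree(dev);
            free(host);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }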

12 / 29

SLIDE 14

Prediction

Testbed

1st test platform

  • 2 Nvidia GTX 590 (16 SMs × 32 SPs)
  • Asus P8P67-Pro (PCIe 2, x8 per card)
  • Core i7 2700k, stock

2nd test platform

  • Nvidia GTX 680 (8 SMs × 192 SPs)
  • Asus P8P67-Deluxe (PCIe 2, x16)
  • Core i7 2600

13 / 29

SLIDE 15

Prediction

Data transfers (testbed 1)

[Plot: bandwidth (MB/s) vs. message size (KB) on the GTX 590, real vs. profiled curves for dev→host and host→dev transfers]

14 / 29

SLIDE 16

Prediction

Data transfers (testbed 2)

[Plot: bandwidth (MB/s) vs. message size (KB) on the GTX 680, real vs. profiled curves for dev→host and host→dev transfers]

15 / 29

SLIDE 17

Prediction

Kernel simulation (testbed 1)

[Plot: execution time per iteration (ns) vs. number of CUDA blocks, gemm 32x16 on the GTX 590, real vs. profiled]

16 / 29

SLIDE 18

Prediction

Kernel simulation (testbed 1)

[Plot: execution time per iteration $e_i$ (ns) vs. parameter size $p_i$, syr2k on the GTX 590, real points and the fitted affine model $e_i = \beta p_i + u_i$]

17 / 29
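Each affine interval of the model above can be fitted over the profiled points, for instance by ordinary least squares. A minimal sketch under that assumption (the slides do not show how β and u are obtained):

    /* Fit e = beta * p + u over n profiled (p, e) points. */
    void fit_affine(const double *p, const double *e, int n,
                    double *beta, double *u) {
        double sp = 0, se = 0, spp = 0, spe = 0;
        for (int i = 0; i < n; i++) {
            sp  += p[i];
            se  += e[i];
            spp += p[i] * p[i];
            spe += p[i] * e[i];
        }
        *beta = (n * spe - sp * se) / (n * spp - sp * sp);
        *u    = (se - *beta * sp) / n;
    }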

SLIDE 19

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

18 / 29

SLIDE 20

Runtime

CPU + GPU

Outermost parallel loop split into chunks

  • Each chunk is associated to one PU
  • PU performance differs

→ Ensure load balance

Multi-versioning

  • Code optimized towards its target (PLUTO + PPCG)
  • Multiple code versions (combined)

Two components:

  • Scheduler:
    • Predicts execution time of chunks [B. Pradelle et al.] + [J.-F. Dollinger et al.]
    • Adjusts chunk sizes
  • Dispatcher (a minimal sketch follows)
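A minimal dispatcher sketch overlapping one CPU chunk and one GPU chunk, assuming hypothetical generated entry points launch_gpu_chunk (asynchronous CUDA launch) and run_cpu_chunk (OpenMP parallel loop):

    #include <cuda_runtime.h>

    void launch_gpu_chunk(long lb, long ub);  /* assumed: async kernel launch */
    void run_cpu_chunk(long lb, long ub);     /* assumed: OpenMP parallel loop */

    /* Run iterations [0, split) on the GPU and [split, n) on the CPU. */
    void dispatch_cpu_gpu(long split, long n) {
        launch_gpu_chunk(0, split);   /* returns immediately */
        run_cpu_chunk(split, n);      /* CPU works while the GPU runs */
        cudaDeviceSynchronize();      /* wait for the GPU chunk to finish */
    }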

19 / 29

SLIDE 21

Runtime

CPU + GPU

Scheduler operation

1. $T_0 = t_0 \cdot \mathrm{Card}\,D_0 \approx t_1 \cdot \mathrm{Card}\,D_1 \approx \dots \approx t_n \cdot \mathrm{Card}\,D_n$
2. Each $T_i$ must tend to $\frac{1}{n} \sum_{i=0}^{n-1} t_i \cdot \mathrm{Card}\,D_i = \frac{1}{n}\,T_{\mathrm{all}}$
3. $t_i = f(G_i, \mathit{seq})$ on a GPU
4. $t_i = g(P_i, S_i)$ on a CPU

Eliminate inefficient PUs
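From condition 2, balancing assigns each PU a share of iterations inversely proportional to its predicted time per iteration. A minimal sketch of one such sizing step (hypothetical helper, not the authors' code):

    /* Size chunks so that t_i * Card(D_i) is equal across PUs:
     * Card(D_i) is proportional to 1/t_i. */
    void balance(const double *t, int n_pu, long n_iter, long *card) {
        double inv_sum = 0;
        for (int i = 0; i < n_pu; i++)
            inv_sum += 1.0 / t[i];
        long assigned = 0;
        for (int i = 0; i < n_pu; i++) {
            card[i] = (long)(n_iter * (1.0 / t[i]) / inv_sum);
            assigned += card[i];
        }
        card[0] += n_iter - assigned;   /* give rounding leftovers to PU 0 */
    }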

20 / 29

SLIDE 22

Runtime

CPU + GPU

[Two plots: workload proportion and execution time proportion per PU (y-axis, 0–1) over scheduler steps 1–6 (x-axis), both converging toward a balanced distribution]

21 / 29

SLIDE 23

Runtime

CPU + GPU (speedup to one PU)

[Plot: speedup relative to one PU for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver; configurations: CPU, 1 GPU, CPU+1 GPU, CPU+2 GPUs, CPU+3 GPUs, CPU+4 GPUs]

22 / 29

SLIDE 24

Runtime

CPU + GPU (load imbalance)

[Plot: load imbalance (0–1) for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver; configurations: CPU+1 GPU, CPU+2 GPUs, CPU+3 GPUs, CPU+4 GPUs]

23 / 29

SLIDE 25

Runtime

Multiversioning CPU + GPU (speedup to worst)

[Plot: speedup relative to the worst version for syr2k code versions c1–c9 and the combined version (all); configurations: CPU, 1 GPU, CPU+1 GPU through CPU+4 GPUs]

24 / 29

SLIDE 26

Runtime

Multiversioning CPU + GPU (imbalance)

[Plot: load imbalance for syr2k code versions c1–c9 and the combined version (all); configurations: CPU+1 GPU through CPU+4 GPUs]

25 / 29

SLIDE 27

Conclusion

Framework capabilities

  • Execution time prediction
  • Fastest version selection
  • CPU vs GPU competition
  • CPU + GPU joint usage

Future work

  • Energy consumption

26 / 29

SLIDE 28

Outline

1 Introduction
2 Prediction
   Overview
   Code generation
   Profiling
3 Runtime
   CPU + GPU
4 Conclusion

27 / 29

SLIDE 29

Annex

Offline profiling: ranking table

Number of threads | version 1 | version 2 | version 3
                1 |     40 ms |     55 ms |     32 ms
                2 |     32 ms |     28 ms |     17 ms
                3 |     22 ms |     15 ms |      9 ms
                4 |     14 ms |      7 ms |      8 ms

Online prediction: execution time computation

  • Observation = {2000, 600, 300, 300}

prediction(version 1) = ((2000 − 600) × 40) + ((600 − 300) × 32) + ((300 − 300) × 22) + (300 × 14) = 69800 ms
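This worked example generalizes to any ranking-table column: with per-thread workloads sorted in descending order, each segment between consecutive workloads runs at a different thread count and is charged the corresponding per-iteration time. A sketch of that computation (hypothetical helper matching the numbers above):

    /* obs: per-thread workloads, sorted descending ({2000, 600, 300, 300});
     * tbl: time per iteration with 1..n active threads ({40, 32, 22, 14}).
     * Segment (obs[k] - obs[k+1]) runs while k+1 threads are active. */
    double predict(const long *obs, const double *tbl, int n) {
        double time = 0;
        for (int k = 0; k < n; k++) {
            long next = (k + 1 < n) ? obs[k + 1] : 0;
            time += (double)(obs[k] - next) * tbl[k];
        }
        return time;   /* for the example above: 69800 ms */
    }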

28 / 29

SLIDE 30

Annex

The algorithm stages (see the sketch below):

  • Init.: distribute iterations equally amongst the PUs
  • Repeat 10 times:
    • Compute each chunk's execution time $T_i$
    • $r_i = T_i / T_{\mathrm{all}}$
    • Adjust chunk sizes according to $r_i$
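A minimal sketch of this refinement loop (predict_chunk_time is a hypothetical helper standing in for the prediction tables; assumes at most 16 PUs):

    double predict_chunk_time(int pu, long card);  /* assumed: from prediction tables */

    /* Refine chunk sizes for 10 steps: each PU's target share of the
     * total predicted time is 1/n_pu, so scale card[i] by (1/n_pu)/r_i. */
    void refine(long *card, int n_pu, long n_iter) {
        for (int step = 0; step < 10; step++) {
            double T[16], T_all = 0;
            for (int i = 0; i < n_pu; i++) {
                T[i] = predict_chunk_time(i, card[i]);
                T_all += T[i];
            }
            long assigned = 0;
            for (int i = 0; i < n_pu; i++) {
                double r = T[i] / T_all;            /* r_i = T_i / T_all */
                card[i] = (long)(card[i] * (1.0 / n_pu) / r);
                assigned += card[i];
            }
            card[0] += n_iter - assigned;           /* fix rounding drift */
        }
    }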

29 / 29