
SLIDE 1

Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications

George Teodoro¹, Rafael Sachetto¹, Olcay Sertel², Metin Gurcan², Wagner Meira Jr.¹, Umit Catalyurek², Renato Ferreira¹

  • 1. Federal University of Minas Gerais, Brazil
  • 2. The Ohio State University, US

IEEE Cluster 2009 1

SLIDE 2

Motivation

  • High performance computing
    – Large clusters of off-the-shelf components
    – Multi-core/Many-core
    – GPGPU
  • Massively parallel
  • High speedups compared to the CPU

SLIDE 3

Motivation

  • But... the GPU is not so fast in all scenarios
  • Current frameworks
    – Assume exclusive use of the GPU or the CPU

SLIDE 4

Goal

  • Target heterogeneous environments
    – Multiple CPU cores/GPUs
    – Distributed environments
  • Efficient coordination of the devices
    – Scheduling tasks according to their specificities
  • High-level programming abstraction

SLIDE 5

Outline

  • Anthill
  • Supporting heterogeneous environments
  • Experimental evaluation
  • Conclusions

SLIDE 6

Anthill

  • Based on the filter-stream model (DataCutter)
    – Application decomposed into a set of filters
    – Communication using streams
    – Transparent instance copies
    – Data flow
    – Multiple dimensions of parallelism
      • Task parallelism
      • Data parallelism
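Anthill's actual API is C-based; as a rough illustration only, the filter-stream decomposition can be sketched in Python, with queues standing in for streams and a sentinel closing the pipeline (all names here are invented, not Anthill's):

```python
from queue import Queue

def run_filter(handler, in_stream, out_stream):
    """Illustrative filter: drain the input stream, apply the
    user handler to each data buffer, forward results downstream."""
    while True:
        item = in_stream.get()
        if item is None:                 # end-of-stream marker
            if out_stream is not None:
                out_stream.put(None)
            return
        result = handler(item)
        if out_stream is not None:
            out_stream.put(result)

# A -> B -> C pipeline: A produces, B and C transform
a_to_b, b_to_c, sink = Queue(), Queue(), Queue()
for x in [1, 2, 3]:                      # filter A: source
    a_to_b.put(x)
a_to_b.put(None)

run_filter(lambda x: x * 10, a_to_b, b_to_c)   # filter B
run_filter(lambda x: x + 1, b_to_c, sink)      # filter C

out = []
while True:
    item = sink.get()
    if item is None:
        break
    out.append(item)
print(out)  # [11, 21, 31]
```

In the real model the filters run concurrently (possibly on different machines) and each may have several transparent copies; here they are drained sequentially only to keep the sketch self-contained.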

SLIDE 7

Anthill


[Figure: filter pipeline A → B → C]

SLIDE 8

Filter programming abstraction

  • Event-driven interface
    – Aligned with the data-flow model
  • The user provides data processing functions to be invoked upon availability of data
  • The system controls the invocation of user functions
    – Dependency analysis
    – Parallelism

SLIDE 9

Event handlers

  • User-provided functions
  • Operate on data objects
    – Update the filter state (global)
    – May trigger communication
    – Return after processing the data element
  • Invoked automatically when data is available
    – And dependencies are met
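A minimal sketch of this contract, in illustrative Python rather than Anthill's C API: the runtime, not the user, decides when to invoke the handler, and does so only once the data element's dependencies are met; the handler may then update the global filter state:

```python
class Filter:
    """Illustrative runtime-side wrapper around a user handler."""

    def __init__(self, handler):
        self.handler = handler   # user-provided function
        self.state = {}          # global filter state
        self.done = set()        # ids of already-processed events

    def deliver(self, event_id, deps, data):
        """Invoke the handler only when all dependencies are met."""
        if not deps <= self.done:
            return False                   # dependency unmet: defer
        self.handler(self.state, data)     # may update state
        self.done.add(event_id)
        return True

# user handler: accumulates a running sum in the filter state
def accumulate(state, value):
    state["sum"] = state.get("sum", 0) + value

f = Filter(accumulate)
assert f.deliver(1, set(), 5)        # no deps: runs immediately
assert not f.deliver(3, {2}, 7)      # event 2 not done yet: deferred
assert f.deliver(2, {1}, 10)         # dep 1 met: runs
assert f.deliver(3, {2}, 7)          # retried: dep 2 now met
print(f.state["sum"])  # 22
```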

SLIDE 10

Supporting heterogeneous resources

  • Event handlers implemented for multiple devices
    – Each filter may be implemented targeting the appropriate device
  • Multiple devices used in parallel
  • The Anthill run-time chooses the device for each event
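As a hedged sketch of this idea (the registration decorator, device names, and dispatch rule are all invented for illustration), one event handler can carry per-device implementations, and the run-time picks which one executes each event:

```python
handlers = {}

def implements(device):
    """Register a device-specific implementation of the handler."""
    def wrap(fn):
        handlers[device] = fn
        return fn
    return wrap

@implements("cpu")
def clamp_cpu(tile):
    # plain CPU loop over pixel values
    return [min(p, 255) for p in tile]

@implements("gpu")
def clamp_gpu(tile):
    # stand-in for a GPU kernel launch; same semantics as the CPU path
    return [min(p, 255) for p in tile]

def run_event(tile, free_devices):
    """Toy run-time choice: use the GPU when idle, else the CPU."""
    device = "gpu" if "gpu" in free_devices else "cpu"
    return device, handlers[device](tile)

dev, out = run_event([100, 300], {"cpu", "gpu"})
print(dev, out)   # gpu [100, 255]
dev2, _ = run_event([1], {"cpu"})
print(dev2)       # cpu
```

Because both implementations share one interface, the devices can be driven in parallel, each pulling events as it becomes free.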

SLIDE 11

Heterogeneous support overview

SLIDE 12

Device scheduler

  • Assumes
    – Events are independent
    – Out-of-order execution
  • Scheduling policies
    – FCFS – first-come, first-served
    – DWRR – dynamic weighted round robin
      • Orders events according to their performance on each device
      • Selects the event with the highest speedup
      • Relies on a user-given performance function
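The DWRR policy above can be sketched as follows; `user_speedup`, its numbers, and the event fields are illustrative stand-ins for the user-given performance function, not values from the paper:

```python
def user_speedup(event, device):
    """User-given estimate: speedup of this event on this device.
    Toy model: high-resolution tiles benefit most from the GPU."""
    if device == "gpu":
        return 30.0 if event["res"] == "high" else 2.0
    return 1.0   # CPU is the baseline

def dwrr_pick(queue, device):
    """When a device becomes free, select (and remove) the queued
    event with the highest expected speedup on that device."""
    best = max(queue, key=lambda e: user_speedup(e, device))
    queue.remove(best)
    return best

queue = [{"id": 1, "res": "low"},
         {"id": 2, "res": "high"},
         {"id": 3, "res": "low"}]

gpu_event = dwrr_pick(queue, "gpu")   # GPU grabs the high-res tile
cpu_event = dwrr_pick(queue, "cpu")   # CPU takes a low-res tile
print(gpu_event["id"], cpu_event["id"])  # 2 1
```

Under FCFS both devices would simply take the head of the queue; DWRR instead matches each event to the device where it pays off most.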

SLIDE 13

Neuroblastoma Image Analysis System

  • Classifies tissues into different subtypes of prognostic significance
  • Very high resolution slides
    – Divided into smaller tiles
  • Multi-resolution image analysis
    – Mimics the way pathologists examine them
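A minimal sketch of this strategy, with an invented confidence threshold and stand-in classifiers: each tile is classified at the cheap 32x32 resolution first, and only ambiguous tiles are recomputed at 512x512, much as a pathologist zooms in only on unclear regions:

```python
CONF_THRESHOLD = 0.8   # illustrative cutoff, not from the paper

def analyze_tile(tile, classify):
    """Cheap low-resolution pass first; escalate only if unsure."""
    label, conf = classify(tile, "32x32")      # cheap pass
    if conf >= CONF_THRESHOLD:
        return label, "32x32"
    label, _ = classify(tile, "512x512")       # expensive pass
    return label, "512x512"

# stand-in classifier: each tile carries precomputed answers
def classify(tile, res):
    return tile[res]

tiles = [
    {"32x32": ("normal", 0.95), "512x512": ("normal", 0.99)},
    {"32x32": ("normal", 0.55), "512x512": ("tumor", 0.90)},
]
results = [analyze_tile(t, classify) for t in tiles]
print(results)  # [('normal', '32x32'), ('tumor', '512x512')]
```

This recompute-on-demand structure is also what makes the CPU/GPU trade-off data dependent: the mix of cheap and expensive tiles varies per image.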

SLIDE 14

Anthill implementation

SLIDE 15

Experimental results

  • Setup
    – 10 PCs with an Intel Core 2 Duo CPU at 2.13 GHz and an NVIDIA GeForce 8800 GT GPU
    – 4 PCs with a dual quad-core AMD Opteron 2.00 GHz processor and an NVIDIA GeForce GTX 260 GPU
    – Input data: images of 26,742 tiles using two resolution levels: 32x32 and 512x512

SLIDE 16

NBIA tasks analysis – performance variation


[Figure: dual quad-core AMD Opteron 2.00 GHz / NVIDIA GeForce GTX 260]

SLIDE 17

Heterogeneous scheduling analysis


                       Resolution
                       Low        High
  1 CPU core – FCFS    263        215
  1 CPU core – DWRR    21592      4

Recalc (%): 12


SLIDE 18

Heterogeneous scheduling analysis


[Figure: FCFS vs. DWRR]

SLIDE 19

Heterogeneous scheduling analysis


# of CPU cores    FCFS (Low / High)    DWRR (Low / High)
1                 637 / 58             10714 / 1
2                 117 / 133            15748 / 2
3                 1925 / 173           18614 / 5
4                 2090 / 219           18634 / 28
5                 2872 / 286           20070 / 40
6                 3819 / 393           20147 / 76
7                 4726 / 478           20266 / 57

SLIDE 20

Distributed environment evaluation

SLIDE 21

Conclusions

  • Relative performance between the CPU and GPU is data dependent
  • Adequate scheduling among heterogeneous processors doubled the performance of the application
  • Neglecting the CPU is a mistake
  • Data flow is an interesting model for exploiting parallelism

SLIDE 22

Future work

  • New scheduling techniques
  • Execution on clusters with heterogeneity among the computing nodes

SLIDE 23

Questions?
