
SLIDE 1

Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications

George Teodoro¹, Rafael Sachetto¹, Olcay Sertel², Metin Gurcan², Wagner Meira Jr.¹, Umit Catalyurek², Renato Ferreira¹

  • 1. Federal University of Minas Gerais, Brazil
  • 2. The Ohio State University, US

IEEE Cluster 2009 1

SLIDE 2

Motivation

  • High performance computing
    – Large clusters of off-the-shelf components
    – Multi-core/Many-core
    – GPGPU
  • Massively parallel
  • High speedups compared to the CPU

SLIDE 3

Motivation

  • But... the GPU is not so fast in all scenarios
  • Current frameworks
    – Assume exclusive use of the GPU or the CPU

SLIDE 4

Goal

  • Target heterogeneous environments
    – Multiple CPU cores/GPUs
    – Distributed environments
  • Efficient coordination of the devices
    – Scheduling tasks according to their specificities
  • High-level programming abstraction

SLIDE 5

Outline

  • Anthill
  • Supporting heterogeneous environments
  • Experimental evaluation
  • Conclusions

SLIDE 6

Anthill

  • Based on the filter-stream model (DataCutter)
    – Application decomposed into a set of filters
    – Communication using streams
    – Transparent instance copies
    – Data flow
    – Multiple dimensions of parallelism
      • Task parallelism
      • Data parallelism
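Anthill's actual API is C-based; as a rough illustration only, the filter-stream decomposition can be sketched in Python, with queues standing in for streams and a sentinel closing the pipeline (all names here are invented, not Anthill's):

```python
from queue import Queue

def run_filter(handler, in_stream, out_stream):
    """Illustrative filter: drain the input stream, apply the
    user handler to each data buffer, forward results downstream."""
    while True:
        item = in_stream.get()
        if item is None:                 # end-of-stream marker
            if out_stream is not None:
                out_stream.put(None)
            return
        result = handler(item)
        if out_stream is not None:
            out_stream.put(result)

# A -> B -> C pipeline: A produces, B and C transform
a_to_b, b_to_c, sink = Queue(), Queue(), Queue()
for x in [1, 2, 3]:                      # filter A: source
    a_to_b.put(x)
a_to_b.put(None)

run_filter(lambda x: x * 10, a_to_b, b_to_c)   # filter B
run_filter(lambda x: x + 1, b_to_c, sink)      # filter C

out = []
while True:
    item = sink.get()
    if item is None:
        break
    out.append(item)
print(out)  # [11, 21, 31]
```

In the real model the filters run concurrently (possibly on different machines) and each may have several transparent copies; here they are drained sequentially only to keep the sketch self-contained.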

SLIDE 7

Anthill


[Figure: filter pipeline A → B → C]

SLIDE 8

Filter programming abstraction

  • Event-driven interface
    – Aligned with the data-flow model
  • The user provides data processing functions to be invoked upon availability of data
  • The system controls the invocation of user functions
    – Dependency analysis
    – Parallelism

SLIDE 9

Event handlers

  • User-provided functions
  • Operate on data objects
    – Update the filter state (global)
    – May trigger communication
    – Return after processing the data element
  • Invoked automatically when data is available
    – And dependencies are met
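A minimal sketch of this contract, in illustrative Python rather than Anthill's C API: the runtime, not the user, decides when to invoke the handler, and does so only once the data element's dependencies are met; the handler may then update the global filter state:

```python
class Filter:
    """Illustrative runtime-side wrapper around a user handler."""

    def __init__(self, handler):
        self.handler = handler   # user-provided function
        self.state = {}          # global filter state
        self.done = set()        # ids of already-processed events

    def deliver(self, event_id, deps, data):
        """Invoke the handler only when all dependencies are met."""
        if not deps <= self.done:
            return False                   # dependency unmet: defer
        self.handler(self.state, data)     # may update state
        self.done.add(event_id)
        return True

# user handler: accumulates a running sum in the filter state
def accumulate(state, value):
    state["sum"] = state.get("sum", 0) + value

f = Filter(accumulate)
assert f.deliver(1, set(), 5)        # no deps: runs immediately
assert not f.deliver(3, {2}, 7)      # event 2 not done yet: deferred
assert f.deliver(2, {1}, 10)         # dep 1 met: runs
assert f.deliver(3, {2}, 7)          # retried: dep 2 now met
print(f.state["sum"])  # 22
```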

SLIDE 10

Supporting heterogeneous resources

  • Event handlers implemented for multiple devices
    – Each filter may be implemented targeting the appropriate device
  • Multiple devices used in parallel
  • The Anthill run-time chooses the device for each event
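As a hedged sketch of this idea (the registration decorator, device names, and dispatch rule are all invented for illustration), one event handler can carry per-device implementations, and the run-time picks which one executes each event:

```python
handlers = {}

def implements(device):
    """Register a device-specific implementation of the handler."""
    def wrap(fn):
        handlers[device] = fn
        return fn
    return wrap

@implements("cpu")
def clamp_cpu(tile):
    # plain CPU loop over pixel values
    return [min(p, 255) for p in tile]

@implements("gpu")
def clamp_gpu(tile):
    # stand-in for a GPU kernel launch; same semantics as the CPU path
    return [min(p, 255) for p in tile]

def run_event(tile, free_devices):
    """Toy run-time choice: use the GPU when idle, else the CPU."""
    device = "gpu" if "gpu" in free_devices else "cpu"
    return device, handlers[device](tile)

dev, out = run_event([100, 300], {"cpu", "gpu"})
print(dev, out)   # gpu [100, 255]
dev2, _ = run_event([1], {"cpu"})
print(dev2)       # cpu
```

Because both implementations share one interface, the devices can be driven in parallel, each pulling events as it becomes free.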

SLIDE 11

Heterogeneous support overview

SLIDE 12

Device scheduler

  • Assumes
    – Events are independent
    – Out-of-order execution
  • Scheduling policies
    – FCFS – first-come, first-served
    – DWRR – dynamic weighted round robin
      • Orders events according to their performance on each device
      • Selects the event with the highest speedup
      • Relies on a user-given performance function
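The DWRR policy above can be sketched as follows; `user_speedup`, its numbers, and the event fields are illustrative stand-ins for the user-given performance function, not values from the paper:

```python
def user_speedup(event, device):
    """User-given estimate: speedup of this event on this device.
    Toy model: high-resolution tiles benefit most from the GPU."""
    if device == "gpu":
        return 30.0 if event["res"] == "high" else 2.0
    return 1.0   # CPU is the baseline

def dwrr_pick(queue, device):
    """When a device becomes free, select (and remove) the queued
    event with the highest expected speedup on that device."""
    best = max(queue, key=lambda e: user_speedup(e, device))
    queue.remove(best)
    return best

queue = [{"id": 1, "res": "low"},
         {"id": 2, "res": "high"},
         {"id": 3, "res": "low"}]

gpu_event = dwrr_pick(queue, "gpu")   # GPU grabs the high-res tile
cpu_event = dwrr_pick(queue, "cpu")   # CPU takes a low-res tile
print(gpu_event["id"], cpu_event["id"])  # 2 1
```

Under FCFS both devices would simply take the head of the queue; DWRR instead matches each event to the device where it pays off most.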

SLIDE 13

Neuroblastoma Image Analysis System

  • Classifies tissues into different subtypes of prognostic significance
  • Very high resolution slides
    – Divided into smaller tiles
  • Multi-resolution image analysis
    – Mimics the way pathologists examine them
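A minimal sketch of this strategy, with an invented confidence threshold and stand-in classifiers: each tile is classified at the cheap 32x32 resolution first, and only ambiguous tiles are recomputed at 512x512, much as a pathologist zooms in only on unclear regions:

```python
CONF_THRESHOLD = 0.8   # illustrative cutoff, not from the paper

def analyze_tile(tile, classify):
    """Cheap low-resolution pass first; escalate only if unsure."""
    label, conf = classify(tile, "32x32")      # cheap pass
    if conf >= CONF_THRESHOLD:
        return label, "32x32"
    label, _ = classify(tile, "512x512")       # expensive pass
    return label, "512x512"

# stand-in classifier: each tile carries precomputed answers
def classify(tile, res):
    return tile[res]

tiles = [
    {"32x32": ("normal", 0.95), "512x512": ("normal", 0.99)},
    {"32x32": ("normal", 0.55), "512x512": ("tumor", 0.90)},
]
results = [analyze_tile(t, classify) for t in tiles]
print(results)  # [('normal', '32x32'), ('tumor', '512x512')]
```

This recompute-on-demand structure is also what makes the CPU/GPU trade-off data dependent: the mix of cheap and expensive tiles varies per image.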

SLIDE 14

Anthill implementation

SLIDE 15

Experimental results

  • Setup
    – 10 PCs with an Intel Core 2 Duo CPU at 2.13 GHz and an NVIDIA GeForce 8800 GT GPU
    – 4 PCs with a dual quad-core AMD Opteron 2.00 GHz processor and an NVIDIA GeForce GTX 260 GPU
    – Input data: images of 26,742 tiles using two resolution levels: 32x32 and 512x512

SLIDE 16

NBIA tasks analysis – performance variation


[Figure: dual quad-core AMD Opteron 2.00 GHz / NVIDIA GeForce GTX 260]

SLIDE 17

Heterogeneous scheduling analysis


                       Resolution
                       Low        High
  1 CPU core – FCFS    263        215
  1 CPU core – DWRR    21592      4

Recalc (%): 12


SLIDE 18

Heterogeneous scheduling analysis


[Figure: FCFS vs. DWRR]

SLIDE 19

Heterogeneous scheduling analysis


# of CPU cores    FCFS (Low / High)    DWRR (Low / High)
1                 637 / 58             10714 / 1
2                 117 / 133            15748 / 2
3                 1925 / 173           18614 / 5
4                 2090 / 219           18634 / 28
5                 2872 / 286           20070 / 40
6                 3819 / 393           20147 / 76
7                 4726 / 478           20266 / 57

SLIDE 20

Distributed environment evaluation

SLIDE 21

Conclusions

  • Relative performance between the CPU and GPU is data dependent
  • Adequate scheduling among heterogeneous processors doubled the performance of the application
  • Neglecting the CPU is a mistake
  • Data flow is an interesting model for exploiting parallelism

SLIDE 22

Future work

  • New scheduling techniques
  • Execution on clusters with heterogeneity among the computing nodes

SLIDE 23

Questions?
