Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications


  1. Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications
     George Teodoro¹, Rafael Sachetto¹, Olcay Sertel², Metin Gurcan², Wagner Meira Jr.¹, Umit Catalyurek², Renato Ferreira¹
     ¹ Federal University of Minas Gerais, Brazil
     ² The Ohio State University, US
     IEEE Cluster 2009

  2. Motivation
     • High performance computing
       – Large clusters of off-the-shelf components
       – Multi-core/Many-core
       – GPGPU
         • Massively parallel
         • High speedups compared to the CPU

  3. Motivation
     • But... the GPU is not so fast in all scenarios...
     • Current frameworks
       – Assume exclusive use of the GPU or the CPU

  4. Goal
     • Target heterogeneous environments
       – Multiple CPU cores/GPUs
       – Distributed environments
     • Efficient coordination of the devices
       – Scheduling tasks according to their specificities
     • High-level programming abstraction

  5. Outline
     • Anthill
     • Supporting heterogeneous environments
     • Experimental evaluation
     • Conclusions

  6. Anthill
     • Based on the filter-stream model (DataCutter)
       – Application decomposed into a set of filters
       – Communication using streams
       – Transparent instance copy
       – Data flow
       – Multiple dimensions of parallelism
         • Task parallelism
         • Data parallelism
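
The filter-stream decomposition above can be pictured with a minimal C++ sketch. This is not Anthill's actual API; the Stream class and thread-per-filter layout are assumptions used only to illustrate two filters exchanging data items over a stream.

```cpp
// Minimal filter-stream sketch: two filters connected by a "stream",
// modeled here as a thread-safe queue. Illustration only, not Anthill's API.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Stream {                       // stands in for a stream between filters
public:
    void put(T value) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(value));
        cv_.notify_one();
    }
    std::optional<T> get() {         // returns empty once the stream is closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        cv_.notify_all();
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

int main() {
    Stream<int> tiles;                            // stream from filter A to filter B
    std::thread filter_a([&] {                    // filter A: emits work items (e.g. tile ids)
        for (int i = 0; i < 8; ++i) tiles.put(i);
        tiles.close();
    });
    std::thread filter_b([&] {                    // filter B: processes items as they arrive
        while (auto t = tiles.get())
            std::cout << "processed tile " << *t << "\n";
    });
    filter_a.join();
    filter_b.join();
}
```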

  7. Anthill
     (diagram: filter graph with filters A, B, and C)

  8. Filter programming abstraction
     • Event-driven interface
       – Aligned with the data-flow model
     • The user provides data processing functions to be invoked upon availability of data
     • The system controls invocation of the user functions
       – Dependency analysis
       – Parallelism

  9. Event handlers
     • User-provided functions
     • Operate on data objects
       – Update the filter state (global)
       – May trigger communication
       – Return after processing the data element
     • Invoked automatically when data is available
       – And dependencies are met
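
Slides 8 and 9 describe the handler abstraction: the user supplies a function and the runtime invokes it whenever a data element (and its dependencies) are ready. The sketch below uses hypothetical names (FilterState, Handler, run_filter, emit) rather than Anthill's real interface, only to show the shape of a handler that updates filter state and forwards results.

```cpp
// Sketch of the event-handler shape on slides 8-9 (hypothetical names).
#include <functional>
#include <iostream>
#include <vector>

struct FilterState { long elements_seen = 0; };   // global (per-filter) state

// Hypothetical handler signature: data element in, results forwarded via emit.
using Handler = std::function<void(FilterState&, int element,
                                   const std::function<void(int)>& emit)>;

// Toy "runtime": invokes the handler whenever a data element is available.
// (In Anthill, invocation also waits until dependencies are met.)
void run_filter(const std::vector<int>& input, Handler handler) {
    FilterState state;
    auto emit = [](int out) { std::cout << "emitted " << out << "\n"; };
    for (int element : input)
        handler(state, element, emit);
    std::cout << "handled " << state.elements_seen << " elements\n";
}

int main() {
    run_filter({1, 2, 3, 4}, [](FilterState& s, int e, const auto& emit) {
        ++s.elements_seen;             // update filter state
        emit(e * e);                   // may trigger communication downstream
    });                                // returns after processing the element
}
```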

  10. Supporting heterogeneous resources
      • Event handlers implemented for multiple devices
        – Each filter may be implemented targeting the appropriate device
      • Multiple devices used in parallel
      • The Anthill run-time chooses the device for each event
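
A rough illustration of the per-device handlers described on slide 10: a filter registers one implementation per device type and a dispatcher invokes the variant chosen for each event. The HandlerSet/dispatch names are invented for this sketch, and the GPU lambda merely stands in for a real CUDA kernel launch.

```cpp
// Sketch of per-device handler variants (illustrative names only).
#include <functional>
#include <iostream>

enum class Device { CPU, GPU };

struct HandlerSet {
    std::function<void(int)> cpu;   // CPU implementation of the event handler
    std::function<void(int)> gpu;   // GPU implementation (would launch a kernel)
};

// Toy dispatcher: in Anthill the device scheduler (slide 12) makes this choice.
void dispatch(const HandlerSet& h, Device d, int event_data) {
    if (d == Device::GPU && h.gpu) h.gpu(event_data);
    else                           h.cpu(event_data);
}

int main() {
    HandlerSet analyze_tile{
        [](int t) { std::cout << "tile " << t << " on CPU\n"; },
        [](int t) { std::cout << "tile " << t << " on GPU\n"; }  // stand-in for CUDA launch
    };
    dispatch(analyze_tile, Device::CPU, 7);
    dispatch(analyze_tile, Device::GPU, 8);
}
```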

  11. Heterogeneous support overview
      (architecture diagram)

  12. Device scheduler
      • Assumes
        – Events are independent
        – Out-of-order execution
      • Scheduling policies
        – FCFS – first-come, first-served
        – DWRR – dynamic weighted round robin
          • Orders events according to their expected performance on each device
          • Selects the event with the highest speedup
          • Speedup estimated by a user-given function
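
The two policies can be sketched as follows, under the slide's assumptions (independent events, out-of-order execution) and with an invented speedup() estimate standing in for the user-given function. Reading DWRR as "give the GPU the queued event with the highest relative speedup, and the CPU the one with the lowest" is one plausible interpretation of the slide, not a statement of the paper's exact algorithm.

```cpp
// Sketch of FCFS vs. DWRR event selection (interpretation of slide 12).
#include <algorithm>
#include <deque>
#include <iostream>

enum class Device { CPU, GPU };

struct Event { int id; bool high_resolution; };

// Stand-in for the user-given function: estimated GPU-over-CPU speedup.
// (Invented numbers; slide 16's chart shows the real per-task variation.)
double speedup(const Event& e) { return e.high_resolution ? 6.0 : 1.2; }

// FCFS: oldest queued event, regardless of which device asks for work.
Event pick_fcfs(std::deque<Event>& q) {
    Event e = q.front(); q.pop_front(); return e;
}

// DWRR: the GPU takes the event it accelerates most; the CPU takes the
// event whose GPU speedup is lowest, i.e. the one it penalizes least.
Event pick_dwrr(std::deque<Event>& q, Device requester) {
    auto cmp = [](const Event& a, const Event& b) { return speedup(a) < speedup(b); };
    auto it = (requester == Device::GPU)
                  ? std::max_element(q.begin(), q.end(), cmp)
                  : std::min_element(q.begin(), q.end(), cmp);
    Event e = *it; q.erase(it); return e;
}

int main() {
    std::deque<Event> q{{0, false}, {1, true}, {2, false}, {3, true}};
    std::cout << "GPU gets event " << pick_dwrr(q, Device::GPU).id << "\n";  // a high-res event
    std::cout << "CPU gets event " << pick_dwrr(q, Device::CPU).id << "\n";  // a low-res event
    std::cout << "FCFS would give event " << pick_fcfs(q).id << "\n";
}
```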

  13. Neuroblastoma Image Analysis (NBIA) System
      • Classifies tissue into subtypes of different prognostic significance
      • Very high resolution slides
        – Divided into smaller tiles
      • Multi-resolution image analysis
        – Mimics the way pathologists examine the slides
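
One way to picture the multi-resolution analysis (and the "Recalc (%)" figure on slide 17) is: classify each tile at the low resolution first and redo only low-confidence tiles at the high resolution. The classifiers, confidence field, and threshold below are all invented for illustration; the real NBIA pipeline is more involved.

```cpp
// Illustrative multi-resolution tile analysis (all functions/values invented).
#include <iostream>
#include <vector>

struct Tile { int id; double low_res_confidence; };

// Hypothetical classifiers standing in for the real per-tile analysis.
int classify_low_res(const Tile& t)  { return t.id % 2; }
int classify_high_res(const Tile& t) { return t.id % 2; }

int main() {
    std::vector<Tile> tiles{{0, 0.95}, {1, 0.40}, {2, 0.88}, {3, 0.30}};
    const double threshold = 0.7;            // assumed confidence threshold
    int recalculated = 0;
    for (const Tile& t : tiles) {
        int label = classify_low_res(t);     // cheap pass (e.g. 32x32 tile)
        if (t.low_res_confidence < threshold) {
            label = classify_high_res(t);    // expensive pass (e.g. 512x512 tile)
            ++recalculated;
        }
        std::cout << "tile " << t.id << " -> class " << label << "\n";
    }
    std::cout << recalculated * 100.0 / tiles.size() << "% recalculated\n";
}
```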

  14. Anthill implementation
      (diagram)

  15. Experimental results
      • Setup
        – 10 PCs with an Intel Core 2 Duo 2.13 GHz CPU / NVIDIA GeForce 8800 GT GPU
        – 4 PCs with a dual quad-core AMD Opteron 2.00 GHz processor / NVIDIA GeForce GTX 260 GPU
        – Input data: images with 26,742 tiles, processed at two resolution levels: 32x32 and 512x512

  16. NBIA task analysis – performance variation
      (chart: dual quad-core AMD Opteron 2.00 GHz / NVIDIA GeForce GTX 260)

  17. Heterogeneous scheduling analysis
      • 16 + 1 = 30??
      • Recalc (%): 12
      • Resolution:            Low      High
        1 CPU core – FCFS      263      215
        1 CPU core – DWRR      21592    4

  18. Heterogeneous scheduling analysis
      (charts: FCFS vs. DWRR)

  19. Heterogeneous scheduling analysis

                            FCFS               DWRR
      # of CPU cores        Low      High      Low      High
      1                     637      58        10714    1
      2                     117      133       15748    2
      3                     1925     173       18614    5
      4                     2090     219       18634    28
      5                     2872     286       20070    40
      6                     3819     393       20147    76
      7                     4726     478       20266    57

  20. Distributed environment evaluation

  21. Conclusions
      • The relative performance of CPU and GPU is data dependent
      • Adequate scheduling among heterogeneous processors doubled the performance of the application
      • Neglecting the CPU is a mistake
      • Data-flow is an interesting model for exploiting parallelism

  22. Future work
      • New scheduling techniques
      • Execution on clusters with heterogeneity among the computing nodes

  23. Questions?
