The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development. Tyler Sorensen (led the work and made the slides), Alastair F. Donaldson (delivered this version of the talk). Imperial College London, UK. UKMAC, May 2016.


  1. The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development. Tyler Sorensen (led the work and made the slides), Alastair F. Donaldson (delivered this version of the talk). Imperial College London, UK. UKMAC, May 2016.

  4. “OpenCL supports a wide range of applications… through a low-level, high-performance, portable abstraction.” (OpenCL 2.1 specification, page 11.) We consider functional portability rather than performance portability.

  5. Example: single-source shortest path application. [Figure panels: Quadro K5200 (Nvidia), Intel HD5500.]

  8. An experience report on OpenCL portability
     • How well is portability evaluated?
     • Our experience running applications on 8 GPUs spanning 4 vendors
     • Recommendations going forward

  10. Portability in research literature
      • Reviewed the 50 most recent OpenCL papers on http://hgpu.org/
      • Only considered papers including GPU targets
      • Only considered papers with some type of experimental evaluation
      • How many different vendors did the study experiment with?

  13. Portability in research literature. Results (number of evaluated vendors): 1 vendor, 58% (29 papers); 2 vendors, 36% (18 papers); 3 vendors, 6% (3 papers).

  15. Portability in research literature. Results (which vendor): Nvidia 39, AMD 23, Intel 8, ARM 3, Imagination 1. Portability is not well tested in research literature!

  17. Applications: part of a larger study on GPU irregular parallelism. https://github.com/pannotia/pannotia

  19. Applications: Pannotia (https://github.com/pannotia/pannotia)
      • Target AMD Radeon HD 7000
      • Written in OpenCL 1.x
      • 4 graph algorithm applications
      • Our aim: run these benchmarks on OpenCL platforms from several vendors
      Application structure: GPU_linear_algebra_routine1; GPU_linear_algebra_routine2; GPU_linear_algebra_routine3; loop until a fixed point is reached.

  21. Applications: LonestarGPU (http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu)
      • Target Nvidia Kepler and Fermi
      • Written in CUDA
      • 4 graph algorithm applications
      • Our aim: port these benchmarks to OpenCL to run across a range of platforms
      [Figure: workgroups wg0, wg1, wg2 and wg3 with a shared worklist.]

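  22. GPUs

      | Chip          | Vendor | Compute Units | OpenCL Version | Type       |
      |---------------|--------|---------------|----------------|------------|
      | GTX 980       | Nvidia | 16            | 1.1            | Discrete   |
      | Quadro K5200  | Nvidia | 12            | 1.1            | Discrete   |
      | Iris 6100     | Intel  | 47            | 2.0            | Integrated |
      | HD 5500       | Intel  | 24            | 2.0            | Integrated |
      | Radeon R9     | AMD    | 28            | 2.0            | Discrete   |
      | Radeon R7     | AMD    | 8             | 2.0            | Integrated |
      | Mali-T628     | ARM    | 4             | 1.2            | Integrated |
      | Mali-T628     | ARM    | 2             | 1.2            | Integrated |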

  23. Portability Issues: 12 issues encountered, grouped into categories
      • 3 Framework bugs
      • 6 Specification limitations
      • 3 Programming bugs

  27. Framework bugs #1: Compiler crash. Platforms: Intel. Compiling several large kernels occasionally crashes the compiler. Workaround: reduce the number of kernels in the file.

  30. Framework bugs #2: Non-terminating loops. Platforms: Nvidia and AMD. This looping idiom, used in kernel code, does not terminate on Nvidia and AMD platforms:

          while (true) {
            more_work = false;
            .. // Do computation,
            .. // if more work, set more_work
            if (!more_work) break;
          }

  31. Framework bugs #2: Non-terminating loops. Platforms: Nvidia and AMD. Workaround: change the while loop to a for loop. The end value of i is consistent across platforms.

          for (int i = 0; i < INT_MAX; i++) {
            more_work = false;
            .. // Do computation,
            .. // if more work, set more_work
            if (!more_work) break;
          }

  32. Framework bugs #3: AMD defunct processes. Platforms: AMD on Linux. Long-running kernels become defunct and un-killable, requiring a reboot. Workaround: switch to Windows OS.

  37. Specification limitations #1: GPU watchdogs. Platforms and operating systems handle watchdogs differently.
      • Windows: controlled with the registry; watchdog kills the entire OpenCL process
      • Linux (Ubuntu): controlled in X server settings; watchdog only kills the kernel
      • Chrome OS: cannot be controlled at all without recompiling the driver

  40. Specification limitations #2: Occupancy vs compute units
      • An OpenCL device has one or more compute units. A workgroup executes on a single compute unit. (Intel OpenCL Optimisation Guide)
      • Persistent thread model (Gupta et al. PIPC’12): once scheduled, a workgroup is guaranteed to make progress
      • LonestarGPU applications depend on this

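  44. Specification limitations #2: Occupancy vs compute units

      | Chip          | Compute units | PT occupancy | Compute units are    |
      |---------------|---------------|--------------|----------------------|
      | GTX 980       | 16            | 32           | safe but not optimal |
      | Quadro K5200  | 12            | 12           | safe and optimal     |
      | Iris 6100     | 47            | 6            | not safe             |
      | HD 5500       | 24            | 3            | not safe             |
      | Radeon R9     | 28            | 48           | safe but not optimal |
      | Radeon R7     | 8             | 16           | safe but not optimal |
      | Mali-T628     | 4             | 4            | safe and optimal     |
      | Mali-T628     | 2             | 2            | safe and optimal     |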

  46. Programming bugs #1: Data races. Application: LonestarGPU bfs and sssp. Fix: add additional synchronisation barriers. [Figure panels: Quadro K5200 (Nvidia), Intel HD5500.]
