The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development
Tyler Sorensen (led the work and made the slides)
Alastair F. Donaldson (delivered this version of the talk)
Imperial College London, UK
UKMAC, May 2016

“OpenCL supports a wide range of applications… through a low-level, high-performance, portable abstraction.”
Page 11: OpenCL 2.1 specification

We consider functional portability rather than performance portability.

Example
• Single-source shortest path application
[Figure: the application on Quadro K5200 (Nvidia) and on Intel HD5500]

An experience report on OpenCL portability
• How well is portability evaluated?
• Our experience running applications on 8 GPUs spanning 4 vendors
• Recommendations going forward

Portability in research literature
• Reviewed the 50 most recent OpenCL papers on http://hgpu.org/
• Only considered papers including GPU targets
• Only considered papers with some type of experimental evaluation
• How many different vendors did each study experiment with?

Portability in research literature
Results (number of evaluated vendors)

Vendors evaluated   Papers   Share
1                   29       58%
2                   18       36%
3                   3        6%

Portability in research literature
Results (which vendor)

Vendor        Papers
Nvidia        39
AMD           23
Intel         8
ARM           3
Imagination   1

Portability is not well tested in research literature!

Applications
• Part of a larger study on GPU irregular parallelism

Applications: Pannotia
• Target AMD Radeon HD 7000
• Written in OpenCL 1.x
• 4 graph algorithm applications
• Our aim: run these benchmarks on OpenCL platforms from several vendors

Application structure:

    GPU_linear_algebra_routine1;
    GPU_linear_algebra_routine2;
    GPU_linear_algebra_routine3;
    Loop until a fixed point is reached.

https://github.com/pannotia/pannotia

Applications: LonestarGPU
• Target Nvidia Kepler and Fermi
• Written in CUDA
• 4 graph algorithm applications
• Our aim: port these benchmarks to OpenCL to run across a range of platforms
[Diagram: a shared worklist and workgroups wg0, wg1, wg2, wg3]
http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu

GPUs

Chip           Vendor   Compute units   OpenCL version   Type
GTX 980        Nvidia   16              1.1              Discrete
Quadro K5200   Nvidia   12              1.1              Discrete
Iris 6100      Intel    47              2.0              Integrated
HD 5500        Intel    24              2.0              Integrated
Radeon R9      AMD      28              2.0              Discrete
Radeon R7      AMD      8               2.0              Integrated
Mali-T628      ARM      4               1.2              Integrated
Mali-T628      ARM      2               1.2              Integrated

Portability issues
12 issues encountered, grouped into categories:
• 3 framework bugs
• 6 specification limitations
• 3 programming bugs

Framework bugs
#1 Compiler crash
Platforms: Intel
Compiling several large kernels occasionally crashes the compiler.
Workaround: reduce the number of kernels in the file.

Framework bugs
#2 Non-terminating loops
Platforms: Nvidia and AMD

This looping idiom is used in kernel code:

    while (true) {
      more_work = false;
      // Do computation;
      // if there is more work, set more_work
      if (!more_work) break;
    }

It does not terminate on Nvidia and AMD platforms!

Workaround: change the while loop to a for loop:

    for (int i = 0; i < INT_MAX; i++) {
      more_work = false;
      // Do computation;
      // if there is more work, set more_work
      if (!more_work) break;
    }

The end value of i is consistent across platforms.
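The bounded-loop workaround can be tried outside a kernel. The following is a minimal host-side C sketch, not code from the benchmarks: it applies the same for-loop idiom to a sequential shortest-path relaxation. The `edge_t` type, the function name and the non-negative, small-weight assumption are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical edge type for illustration; not taken from the benchmarks. */
typedef struct { int src, dst, weight; } edge_t;

/* Relaxes every edge repeatedly until no distance changes (a fixed point).
 * The caller initialises dist[]: 0 for the source node, INT_MAX elsewhere.
 * Assumes non-negative weights small enough that sums do not overflow.
 * Returns the number of passes over the edge list. */
int relax_to_fixed_point(const edge_t *edges, size_t n_edges, int *dist)
{
    assert(dist != NULL);
    assert(edges != NULL || n_edges == 0);

    int passes = 0;
    /* Bounded for loop in place of while (true), as in the workaround. */
    for (int i = 0; i < INT_MAX; i++) {
        bool more_work = false;
        for (size_t e = 0; e < n_edges; e++) {
            int d = dist[edges[e].src];
            if (d != INT_MAX && d + edges[e].weight < dist[edges[e].dst]) {
                dist[edges[e].dst] = d + edges[e].weight;
                more_work = true;   /* a distance improved, so go round again */
            }
        }
        passes = i + 1;
        if (!more_work)
            break;
    }
    return passes;
}
```

The returned pass count plays the role of the end value of i: comparing it across platforms is one way to check that the rewritten loop does the same work everywhere.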

Framework bugs
#3 AMD defunct processes
Platforms: AMD on Linux
Long-running kernels become defunct and unkillable, requiring a reboot.
Workaround: switch to Windows.

Specification limitations
#1 GPU watchdogs
Platforms and operating systems handle watchdogs differently.
• Windows: controlled with the registry; the watchdog kills the entire OpenCL process
• Linux (Ubuntu): controlled in X server settings; the watchdog only kills the kernel
• Chrome OS: cannot be controlled at all without recompiling the driver

Specification limitations
#2 Occupancy vs compute units
An OpenCL device has one or more compute units. A workgroup executes on a single compute unit. (Source: Intel OpenCL Optimisation Guide)
Persistent thread model (Gupta et al. PIPC’12): once scheduled, a workgroup is guaranteed to make progress.
LonestarGPU applications depend on this.

Specification limitations
#2 Occupancy vs compute units

Chip           Compute units   PT occupancy   Compute units as workgroup count
GTX 980        16              32             Safe but not optimal
Quadro K5200   12              12             Safe and optimal
Iris 6100      47              6              Not safe
HD 5500        24              3              Not safe
Radeon R9      28              48             Safe but not optimal
Radeon R7      8               16             Safe but not optimal
Mali-T628      4               4              Safe and optimal
Mali-T628      2               2              Safe and optimal
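The three annotations in the table appear to follow from comparing the reported compute-unit count with the persistent-thread (PT) occupancy. The C sketch below encodes that reading; the rule is an inference from the table rather than something the slides state, and the type and function names are hypothetical.

```c
#include <assert.h>

/* Outcome of launching as many persistent workgroups as the device
 * reports compute units. Names are illustrative only. */
typedef enum {
    SAFE_AND_OPTIMAL,   /* compute units == PT occupancy */
    SAFE_NOT_OPTIMAL,   /* compute units <  PT occupancy */
    NOT_SAFE            /* compute units >  PT occupancy */
} launch_class_t;

/* Classifies a device from its compute-unit count and its PT occupancy. */
launch_class_t classify_compute_unit_launch(int compute_units, int pt_occupancy)
{
    assert(compute_units > 0 && pt_occupancy > 0);
    if (compute_units > pt_occupancy)
        return NOT_SAFE;          /* some workgroups may never be scheduled */
    if (compute_units < pt_occupancy)
        return SAFE_NOT_OPTIMAL;  /* the device could host more workgroups */
    return SAFE_AND_OPTIMAL;
}

/* Caps a requested workgroup count at the PT occupancy. */
int safe_workgroup_count(int requested, int pt_occupancy)
{
    assert(requested > 0 && pt_occupancy > 0);
    return requested < pt_occupancy ? requested : pt_occupancy;
}
```

Under this reading, a GTX 980 (16 compute units, occupancy 32) is safe but leaves capacity unused, while an Iris 6100 (47 compute units, occupancy 6) would need its workgroup count capped at 6.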

Programming bugs
#1 Data races
Application: LonestarGPU bfs and sssp
Fix: add additional synchronisation barriers
[Figure: Quadro K5200 (Nvidia) and Intel HD5500]
