Examining Recent Many-core Architectures and Programming Models Using SHOC

M. Graham Lopez, Jeffrey Young, Jeremy S. Meredith, Philip C. Roth, Mitchel Horton, Jeffrey S. Vetter

PMBS15, Sunday, 15 Nov 2015

ORNL is managed by UT-Battelle for the US Department of Energy


Answering Questions about Heterogeneous Systems

  • How does one device perform relative to another?
  • In which areas is one accelerator better?
  • How do multiple devices perform (separately or in concert)?
  • How do heterogeneous programming models compare?
  • What’s the most productive way to program a given device?

SHOC 1.0


Scalable Heterogeneous Computing Suite

  • Benchmark suite with a focus on scientific computing workloads
  • Both performance and stability testing
  • Supports clusters and individual hosts
    • intra-node parallelism for multiple GPUs per node
    • inter-node parallelism with MPI
  • Both CUDA and OpenCL
  • Three levels of benchmarks:
    • Level 0: very low-level device characteristics (bus speed, max FLOPS)
    • Level 1: low-level algorithmic operations (FFT, GEMM, sorting, n-body)
    • Level 2: application-level kernels (combustion chemistry, clustering)
A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, and J.S. Vetter. "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite." Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.

https://github.com/vetter/shoc
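To give a concrete flavor of a Level 0 measurement, here is a minimal CPU-side sketch (illustrative only, not SHOC code) of a Triad-style streaming-bandwidth probe:

```python
import time
import numpy as np

# Triad kernel: a = b + s * c, timed over arrays large enough to
# stream through memory rather than sit in cache.
n = 10_000_000
b = np.random.rand(n)
c = np.random.rand(n)
s = 1.75

t0 = time.perf_counter()
a = b + s * c
dt = time.perf_counter() - t0

# Three arrays of 8-byte doubles move through memory per element.
gbytes = 3 * n * 8 / 1e9
print(f"Triad bandwidth: {gbytes / dt:.2f} GB/s")
```

The reported number approximates sustainable memory bandwidth, which is why Triad appears throughout the device comparisons later in the deck.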


SHOC 2.0


Recent Additions to SHOC

  • Added new benchmarks
    • Originals focused on floating-point, scientific computing applications
    • New benchmarks: machine learning, data analytics, and integer operations
  • Supports new programming models
    • Original supported OpenCL when it was new
      • Allowed CUDA vs OpenCL comparisons
      • Multiple OpenCL implementations could support one platform
      • Tracking maturity of OpenCL over time
    • New programming models support directives
      • OpenACC, OpenMP + offload
      • Better support for multi-core and new devices (Intel Xeon Phi)

New Benchmarks


MD5Hash

  • MD5 is a cryptographic hash function
    • Heavy use of integer and bitwise operations
    • No floating-point operations
  • Not parallel for a single input string
    • Would be bandwidth-dependent to be useful anyway
  • Instead, do a parallel search for a known, random hash
    • Each thread hashes a large set of short input strings
    • Input strings are generated programmatically from a given key space

[Diagram: threads exhaustively hash candidate keys "aaaa" … "zzzz" to digests (74b873374…, 4c189b020…, 3963a2ba6…, aa836f154…, …, 02c425157…), searching for the target hash]
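The search can be written serially in a few lines; the GPU benchmark assigns each thread a disjoint slice of this keyspace. A sketch using Python's `hashlib` rather than a hand-written MD5 kernel:

```python
import hashlib
import string
from itertools import product

def md5_search(target_hex, length=4, alphabet=string.ascii_lowercase):
    """Enumerate every candidate key in the space and hash it until
    the target digest is found. The parallel version partitions this
    enumeration across threads."""
    for chars in product(alphabet, repeat=length):
        key = "".join(chars)
        if hashlib.md5(key.encode()).hexdigest() == target_hex:
            return key
    return None  # target not in this keyspace

# Hide a key, then recover it by exhaustive search.
target = hashlib.md5(b"shoc").hexdigest()
print(md5_search(target))  # prints shoc
```

Because each candidate is independent and the work is all integer/bitwise arithmetic, the kernel scales almost perfectly with thread count and stresses exactly the units that floating-point benchmarks ignore.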


MD5Hash Results

  • Large generational improvements for NVIDIA
    • Kepler K40 vs Fermi M2090: almost 3x
    • Maxwell 750Ti outperforms the Fermi M2090
  • AMD better overall for integer/bit operations
    • W9100 vs K40: almost 2x

[Chart: GHash/sec (1–7) for NVIDIA M2090, K20m, K40, GTX 750Ti; AMD W9100; Intel i7-4770K]


[Chart: learning rate in training sets/second (10000–40000) on K20 and K40, NN and NN w/ PCIe]

Neural Net (NN)

  • Neural Net is represented by a deep learning algorithm that can identify pictures of handwritten numbers 0–9 from MNIST inputs
  • CUDA version with CUBLAS support
  • Phi/MIC version with OpenMP/offload support
    • Limited MKL use; rectangular matrices impact threading
  • 784 input neurons, ten output neurons, and one hidden layer with thirty neurons
  • 50,000 training sets
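The topology above can be sketched as a plain NumPy forward pass (an illustrative reconstruction in the spirit of Nielsen [1], not the benchmark's CUDA or MIC code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the slide: 784 inputs (28x28 MNIST pixels),
# one hidden layer of 30 neurons, 10 outputs (digits 0-9).
sizes = [784, 30, 10]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x):
    # The hot loop is the matrix-vector product; this is why the CUDA
    # version leans on CUBLAS, and why the tall, skinny (rectangular)
    # matrices matter for MKL threading on the Phi.
    for w, b in zip(weights, biases):
        x = sigmoid(w @ x + b)
    return x

out = feedforward(rng.random((784, 1)))
print(out.shape)  # (10, 1)
```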

[1] M. Nielsen. Neural Networks and Deep Learning. October 2014. https://github.com/mnielsen/neural-networks-and-deep-learning
[2] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. 2014. http://yann.lecun.com/exdb/mnist/
[3] http://eblearn.sourceforge.net/mnist.html

Visualization of Testing Set [3]


Neural Net Results

  • CUBLAS is well tuned for rectangular matrices
    • M2090 outperforms all others
  • MKL does not use threads for these matrices
    • Custom OpenMP code...
    • ...but it was not well vectorized by the compiler
  • Poor thread scaling on Xeon Phi limits its performance


Data Analytics

  • Data analytics is represented by relational algebra kernels like Select, Project, Join, Union
  • These kernels form the basis of read-only analytics for benchmarks like TPC-H [1] that have been accelerated with CUDA [2]
  • SHOC's OpenCL implementation allows for testing on CPU, GPU, and Phi without needing a large database input
  • All tests are standalone with randomly generated tuples
  • More information on the implementation in related work [3]
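For a concrete picture of what these kernels compute, here is a serial Python sketch of Select and Project over randomly generated tuples (illustrative only; SHOC's versions are parallel OpenCL kernels):

```python
import random

# Standalone input, as in SHOC: randomly generated (key, value) tuples.
random.seed(42)
tuples = [(random.randrange(100), random.randrange(1000))
          for _ in range(10_000)]

def select(rows, predicate):
    # SELECT: keep only the rows that satisfy the predicate.
    return [r for r in rows if predicate(r)]

def project(rows, *cols):
    # PROJECT: keep only the requested columns of each row.
    return [tuple(r[c] for c in cols) for r in rows]

matched = select(tuples, lambda r: r[0] < 50)
keys = project(matched, 0)
print(len(keys), "of", len(tuples), "tuples selected")
```

The parallel versions are dominated by data movement and output compaction rather than arithmetic, which is why the results on the next slide are so sensitive to PCIe transfer time and zero-copy support.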

[1] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 2.17.0. 2013. http://www.tpc.org/tpch/
[2] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. "Kernel Weaver: Automatically fusing database primitives for efficient GPU computation." MICRO 2012.
[3] I. Saeed, J. Young, and S. Yalamanchili. "A portable benchmark suite for highly parallel data intensive query processing." PPAA 2015.


Data Analytics Results

  • Kepler GPU performs best with 7.54 giga-ops/second (GOPS); sensitivity to tuning parameters (like workgroup size) makes performance portability difficult for this code
  • Haswell GPU has the best performance when data transfer is included (1.17 GOPS for 256 MB input); the Haswell GPU has the best "zero-copy" semantics of the integrated GPUs

[Charts: Project throughput (queries/second) vs input size (8–1024 MB), without and with PCIe transfer time, for Trinity (CPU/GPU), NVIDIA K20m and M2090, SNB/IVB/HSWL CPUs and GPUs, and Phi 5110]


New Programming Models


Programming Models

  • Originally: CUDA, OpenCL
  • Added: OpenACC, Xeon Phi (OpenMP and LEO)
  • Planned: pure OpenMP
    • when compilers support accelerator features
  • Examples often compare directives to lower-level models
    • Directives aren't expected to outperform, but how much of a loss?
    • What are the other issues (if any)?

SHOC Example Studies


SHOC Example Studies

  • SHOC can be useful for understanding:
    • heterogeneous and many-core system hardware
    • programming heterogeneous systems and accelerators
  • To explore the space of potential studies, we show:
    • example hardware comparisons
    • example programming model comparisons
  • These are example analyses to show possibilities
    • breadth more than depth
    • Others may ask and answer entirely new questions using SHOC

Hardware Comparisons


SHOC Example Hardware Studies

  • Generational improvements for same vendor
    • NVIDIA Fermi M2090 vs Kepler K40
  • Large vs small device in same architectural line
    • NVIDIA K40 (15 SMX) vs Jetson TK1 (1 SMX)
  • Cross-vendor, i.e., different architectures
    • NVIDIA K40 vs AMD W9100
    • NVIDIA K20 vs Intel Xeon Phi (KNC)

Generational Improvement for Same Vendor

  • Host platform differences limited bus speed and impacted PCIe results on the newer device

[Chart: speedup of K40 over M2090 (0x–6x), GPU only and with PCIe]


[Chart: speedup of K40 over TK1 (0x–45x), GPU only and with PCIe]

Large vs Small Device of Same Architecture

  • 15:1 raw SMX ratio; accounting for clock speeds, expect compute ≈ 14:1 and bandwidth ≈ 12:1
  • Similar host–device transfer speeds limit the improvement in "PCIe" benchmarks
  • Unexpected K40 improvements (host/platform, library optimization, or other hardware differences)

Cross-Vendor Comparisons (AMD v NVIDIA, OpenCL)

  • Raw (Level 0) numbers generally better for the W9100, translating into several AMD wins
  • Integer performance on the W9100 is relatively better (MD5Hash) than its floating-point performance

[Chart: speedup of W9100 over K40 (log scale), GPU only and with PCIe]


[Chart: speedup of K20 vs MIC (log scale) across Level 0–2 benchmarks: MaxFLOPS, device and local-memory bandwidth, FFT, GEMM, MD, Scan, Sort, SpMV, Stencil, S3D, and Triad, in SP and DP, with and without PCIe]

Cross-Vendor Comparisons (NVIDIA v Intel)

  • Xeon Phi double precision is relatively better than the K20's (i.e., a bigger win or smaller loss in DP than in SP)
  • Cache size vs local memory effects have complex tradeoffs

Programming Model Comparisons


SHOC Example Programming Model Comparisons

  • Different explicit models
    • CUDA vs OpenCL was a big interest for SHOC 1.0
  • Native versus offload models within a device
    • Xeon Phi with OpenMP
  • Generational improvements/regressions in APIs/compilers
    • OpenACC and OpenMP+LEO
  • Explicit models vs directive models
    • OpenACC vs CUDA
    • OpenMP vs OpenCL

Native vs Offload (Xeon Phi)

  • Benchmarks with PCIe show a bigger improvement in Native mode
    • In particular, see Triad BW
  • However, using the same (offload) directives for both modes causes some Native slowdowns

[Chart: Native vs Offload speedup (log scale) across MaxFLOPS, Device BW, FFT, GEMM, MD, Reduction, Scan, S3D, and Triad benchmarks]


[Chart: Intel 15 vs Intel 13 compiler speedup (log scale) across Level 0–2 benchmarks, with and without PCIe]

Compiler Improvement/Regression (Intel 15 vs 13)

  • Improvements were minimal in the newer compiler
  • But there were several major regressions where the older compiler was faster

[Chart: OpenACC (PGI 13.10, 14.6, 14.7) speedup vs CUDA 6.5 (log scale)]

Explicit vs Directive Models (K40 CUDA vs ACC)

  • Some OpenACC results approached CUDA results; some were over 10x slower
  • Generally saw performance regressions, not improvements, with newer compilers
    • except one case where the older compiler simply generated an incorrect binary

[Chart: OpenMP vs OpenCL speedup (log scale) on MIC across FFT, GEMM, MD, Reduction, Scan, Sort, Stencil, S3D, and Triad benchmarks]

Explicit vs Directive Models (MIC OpenMP vs OpenCL)

  • Level 0 results (not shown) were nearly identical
  • In these Level 1 & 2 kernels, OpenMP was almost always faster than OpenCL

Conclusion


SHOC is useful for benchmarking these systems

  • Wider variety of kernels in SHOC 2.0
    • allows a broader view of device performance
  • Wider variety of programming model support in SHOC 2.0
    • allows a wider array of device support
  • Longitudinal studies
    • across software/hardware generations
  • Cross-sectional studies
    • across APIs, across device vendors
  • Scaling studies
    • device size, device count

Lessons learned in the process

  • Compiler directive support is not yet mature
    • some bugs, occasional language issues
    • many performance regressions over time
    • minor compilation differences impact performance
  • Lack of hardware support hurts performance
    • e.g., shared memory is critical for some kernels but difficult to access with directives
    • potentially work around with API-specific primitives or language features
  • Directives imply portability, but not performance portability
    • difficult to re-imagine key kernels in a directive-centric paradigm

Thanks!

Questions?