ORNL is managed by UT-Battelle for the US Department of Energy
Examining Recent Many-core Architectures and Programming Models Using SHOC
- M. Graham Lopez
M. Graham Lopez, Jeffrey Young, Jeremy S. Meredith, Philip C. Roth, Mitchel Horton, Jeffrey S. Vetter
PMBS15, Sunday, 15 Nov 2015
“The Scalable Heterogeneous Computing (SHOC) Benchmark Suite” Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.
[Figure: keys and their truncated hashes, labeled "threads": aaaa -> 74b873374...., aaab -> 4c189b020...., aaac -> 3963a2ba6...., aaad -> aa836f154...., ..., zzzz -> 02c425157....]
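The key-to-hash mapping in the figure suggests an exhaustive key search. The following is a minimal sketch of that idea, not the benchmark's actual kernel: it assumes the digests are MD5 and that keys are four lowercase letters, and the function name `find_key` is hypothetical. It runs serially, whereas a many-core version would give each thread its own slice of the key space.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def find_key(target_hex, length=4):
    """Return the lowercase key whose MD5 digest equals target_hex, or None.

    Assumption: MD5 digests and fixed-length lowercase keys (aaaa .. zzzz).
    """
    for letters in product(ascii_lowercase, repeat=length):
        key = "".join(letters)
        # Hash the candidate and compare against the target digest.
        if hashlib.md5(key.encode("ascii")).hexdigest() == target_hex:
            return key
    return None


if __name__ == "__main__":
    # Build a target from a known key, then recover it by brute force.
    target = hashlib.md5(b"zzzz").hexdigest()
    print(find_key(target))
```

Each candidate is independent of the others, which is why this kind of search maps naturally onto many threads.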
[Figure: Hash rate in GHash/sec (axis 1 to 7) for NVIDIA m2090, NVIDIA K20m, NVIDIA K40, NVIDIA GTX750Ti, AMD w9100, Intel i7-4770k]
[Figure: Learning Rate in training sets/second (axis 10000 to 40000) on k20 and k40. Series: NN, NN w/ PCIe]
[1] M. Nielsen. Neural networks and deep learning. October 2014. https://github.com/mnielsen/neural-networks-and-deep-learning
[2] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. 2014. http://yann.lecun.com/exdb/mnist/
[3] http://eblearn.sourceforge.net/mnist.html
Visualization of Testing Set [3]
[1] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 2.17.0. 2013. http://www.tpc.org/tpch/
[2] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel weaver: Automatically fusing database primitives for efficient GPU computation. MICRO 2012.
[3] I. Saeed, J. Young, and S. Yalamanchili. A portable benchmark suite for highly parallel data intensive query processing. PPAA 2015.
[Figure, two panels: "Project, no PCIe Transfer Time" and "Project, Transfer Time Included". Y axis: Queries / second. X axis: Input Size (MB), 8 to 1024. Series: Trinity (C), Trinity (G), NV K20m, NV M2090, SNB (C), IVB (C), IVB (G), HSWL (C), HSWL (G), Phi 5110]
[Figure: Speedup K40 over m2090 (axis 0x to 6x). Series: GPU only, With PCIe]
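The "GPU only" and "With PCIe" series differ only in whether host-device transfer time is counted. A minimal sketch of the two ratios follows, assuming speedup is reference time divided by new time; the function names and the timings in the usage lines are hypothetical and are not measurements from the slides.

```python
def speedup_gpu_only(kernel_ref, kernel_new):
    """Speedup counting kernel time only (reference time / new time)."""
    return kernel_ref / kernel_new


def speedup_with_pcie(kernel_ref, xfer_ref, kernel_new, xfer_new):
    """Speedup counting kernel time plus PCIe transfer time on both devices."""
    return (kernel_ref + xfer_ref) / (kernel_new + xfer_new)


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the effect.
    print(speedup_gpu_only(4.0, 1.0))
    print(speedup_with_pcie(4.0, 1.0, 1.0, 1.0))
```

When transfer time is similar on both devices, including it pulls the ratio toward 1x, so a faster kernel shows a smaller end-to-end gain.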
[Figure: Speedup K40 over TK1 (axis 0x to 45x). Series: GPU only, With PCIe]
[Figure: Speedup W9100 over K40 (log scale, 0.1x to 10x). Series: GPU only, With PCIe]
[Figure: Speedup K20 vs MIC (log scale, 0.1 to 10) per SHOC benchmark: MaxFLOPS, Device BW, lmem read/write BW, FFT/iFFT, SGEMM/DGEMM, MD, Scan, Sort, SpMV, Stencil, S3D, Triad (BW); SP and DP variants, with and without PCIe where applicable]
[Figure: Speedup (axis 0.1, 1, 10) per SHOC benchmark: MaxFLOPS, Device BW, FFT/iFFT, SGEMM/DGEMM, MD, Reduction, Scan, S3D, Triad (BW); SP and DP variants, with and without PCIe where applicable]
[Figure: Speedup (axis 0.1, 1, 10) per SHOC benchmark: MaxFLOPS, Device BW, Bus BW (download, readback), FFT/iFFT, SGEMM/DGEMM, MD, Reduction, Scan, Stencil (DP), S3D, Triad (BW); SP and DP variants, with and without PCIe where applicable]
[Figure: Speedup vs CUDA 6.5 (axis 1.E-02 to 1.E+00). Series: OpenACC PGI 13.10, OpenACC PGI 14.6, OpenACC PGI 14.7]
[Figure: Speedup OpenMP vs OpenCL (axis 0.1, 1, 10) per SHOC benchmark: FFT/iFFT, SGEMM/DGEMM, MD, Reduction, Scan, Sort, Stencil, S3D, Triad (BW); SP and DP variants, with and without PCIe where applicable]