GTC 2015 | Mathias Wagner | Indiana University |
GPU vs Xeon Phi: Performance of Bandwidth Bound Applications with a Lattice QCD Case Study
Mathias Wagner
Lattice Quantum ChromoDynamics
and Deep Learning … sorry, not (yet?) here.
$$Z_{\mathrm{QCD}}(T,\mu) = \int \mathcal{D}A\,\mathcal{D}\bar{\Psi}\,\mathcal{D}\Psi\; e^{-S_E(T,\mu)}$$
The action $S_E$ includes an integral over space and time.
sensitive to memory bandwidth
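$$w_x = D_{x,x'} v_{x'} = \sum_{\mu=0}^{3} \left[ \left\{ U_{x,\mu} v_{x+\hat{\mu}} - U^{\dagger}_{x-\hat{\mu},\mu} v_{x-\hat{\mu}} \right\} + \left\{ N_{x,\mu} v_{x+3\hat{\mu}} - N^{\dagger}_{x-3\hat{\mu},\mu} v_{x-3\hat{\mu}} \right\} \right]$$
U: complex 3x3 matrix, 72 bytes for fp32
N: complex 3x3 matrix + U(3) symmetry, 56 bytes for fp32
v: complex 3-dim vector, 24 bytes for fp32
w: complex 3-dim vector, 24 bytes for fp32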
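As a worked check of these sizes, the memory traffic per lattice site can be tallied under the assumption that nothing is reused from cache. The counts are read off the sum over four directions: each direction contributes a forward and a backward term for both U and N, which gives 8 matrices of each kind and 16 input vectors, plus one output vector.

```latex
\begin{align*}
\text{bytes per site}
  &= \underbrace{8 \times 72}_{U} + \underbrace{8 \times 56}_{N}
   + \underbrace{16 \times 24}_{\text{input vectors}}
   + \underbrace{24}_{\text{output vector}} \\
  &= 576 + 448 + 384 + 24 = 1432
\end{align*}
```

The gauge links account for 1024 of these 1432 bytes, which is why the vectors are only a minor part of the traffic.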
Sorry, not the ones with liquid helium cooling and TDP > 300W.
| | 5110 | 7120 | K20 | K20X | K40 |
|---|---|---|---|---|---|
| Cores / SMX | 60 | 61 | 13 | 14 | 15 |
| Vector instructions | 512 bit (16 fp32) | 512 bit (16 fp32) | | | |
| CUDA cores / SMX | | | 192 | 192 | 192 |
| Clock speed [MHz] | 1053 | 1238 - 1333 | 705 | 732 | 745 - 875 |
| Peak fp32 [TFlop/s] | 2.02 | 2.42 | 3.52 | 3.91 | 4.29 |
| Peak fp64 [TFlop/s] | 1.01 | 1.21 | 1.27 | 1.31 | 1.43 |
| Memory [GB] | 8 | 8 | 5 | 6 | 12 |
| Memory bandwidth [GB/s] | 320 | 352 | 208 | 250 | 288 |
| L1 cache per core / SMX [kB] | 32 | 32 | 16-48 + 48 (texture) | 16-48 + 48 (texture) | 16-48 + 48 (texture) |
| L2 cache [MB] | 30 (60 x 0.5) | 30.5 (61 x 0.5) | 1.5 | 1.5 | 1.5 |
| TDP [W] | 225 | 300 | 225 | 235 | 235 |
How can we achieve this performance? How can we saturate the available bandwidth? How much energy does that require?
What performance can we expect on the different accelerators? Is our code optimized?
Estimate: performance = memory bandwidth times arithmetic intensity
[Figure: Dslash performance with ECC in GFlop/s on 5110, 7120, K20 and K40; series: estimate (peak bw), estimate (triad bw), measured]
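The peak-bandwidth estimate can be reproduced in a few lines. This is a minimal sketch, not the author's code: it uses the theoretical bandwidths from the hardware table and the arithmetic intensity of 0.80 flop/byte that the talk quotes for the case without cache reuse. The names `PEAK_BANDWIDTH_GBS` and `estimate_gflops` are made up for illustration, and the triad-based estimate is left out because the measured triad bandwidths are only shown graphically.

```python
# Bandwidth-bound estimate: GFlop/s = bandwidth [GB/s] * arithmetic intensity [flop/byte].

# Theoretical peak memory bandwidth in GB/s, copied from the hardware table.
PEAK_BANDWIDTH_GBS = {"5110": 320, "7120": 352, "K20": 208, "K40": 288}

# Flop per byte for one right-hand side without cache reuse (value quoted in the talk).
ARITHMETIC_INTENSITY = 0.80


def estimate_gflops(bandwidth_gbs, intensity=ARITHMETIC_INTENSITY):
    """Upper bound on sustained GFlop/s for a purely bandwidth-bound kernel."""
    return bandwidth_gbs * intensity


estimates = {name: estimate_gflops(bw) for name, bw in PEAK_BANDWIDTH_GBS.items()}

if __name__ == "__main__":
    for name, gflops in estimates.items():
        print(f"{name}: {gflops:.1f} GFlop/s")
```

Replacing the theoretical bandwidth with a measured triad bandwidth gives the second, more realistic estimate.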
[Figure: memory bandwidth in GB/s on 5110, 7120, K20 and K40; series: theoretical, triad, triad with ECC]
Account for the existence of cache in the performance estimate.
bytes / site: 1024 + (1 - hit rate) x 384 + 24
1024: gauge field
384: 16 vectors, 24 bytes each
24: 1 vector
→ arithmetic intensity 1.07 (w/o cache 0.80)
[Figure: measured Dslash performance with ECC in GFlop/s on 5110, 7120, K20 and K40]
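The cache-aware estimate can be written as a small function of the hit rate. This is a sketch under one assumption that is not on the slide: an operation count of 1146 flop per site, which reproduces the quoted intensities of 0.80 and 1.07. The function names are illustrative.

```python
FLOP_PER_SITE = 1146       # assumption: flop per site, not stated on the slide
GAUGE_BYTES = 1024         # gauge field, 8 x 72 + 8 x 56 bytes
INPUT_VECTOR_BYTES = 384   # 16 input vectors, 24 bytes each
OUTPUT_VECTOR_BYTES = 24   # 1 output vector


def bytes_per_site(hit_rate):
    """Memory traffic per site when a fraction hit_rate of vector loads hits in cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return GAUGE_BYTES + (1.0 - hit_rate) * INPUT_VECTOR_BYTES + OUTPUT_VECTOR_BYTES


def arithmetic_intensity(hit_rate):
    """Flop per byte of memory traffic for a given cache hit rate."""
    return FLOP_PER_SITE / bytes_per_site(hit_rate)


if __name__ == "__main__":
    for hits in (0, 15):
        print(f"hit rate {hits}/16: {arithmetic_intensity(hits / 16):.2f} flop/byte")
```

Only the 384 bytes of input vectors benefit from the cache in this model; the gauge field is read once per site regardless of the hit rate.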
[Figure: GPU memory hierarchy: DRAM, L2, and per-SM caches (L1, read-only, constant)]
– L1 is the “default” –
Hit 7 out of 16 (44% hit rate).
In the z direction we can hit 2 of 4 elements: 9/16 (56% hit rate).
[Figure: Dslash performance on K40 with ECC, 32x8, in GFlop/s; estimates for hit rates 0/16, 3/16, 5/16, 7/16, 9/16 and 15/16 compared with measured]

| hit rate | 0/16 | 3/16 | 5/16 | 7/16 | 9/16 | 15/16 |
|---|---|---|---|---|---|---|
| arithmetic intensity | 0.80 | 0.84 | 0.87 | 0.91 | 0.94 | 1.07 |

profiler: L1 hit rate 44% (L2 7%)
Focus on the arithmetic intensity now … push-ups later. Cache effects help for the vectors, but remember they are only ~25% of the memory traffic. What can we do about the gauge links?
$$\left( w^{(1)}_x, w^{(2)}_x, \dots, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'}, v^{(2)}_{x'}, \dots, v^{(n)}_{x'} \right)$$
| # rhs | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Flop/byte | 0.80 | 1.25 | 1.53 | 1.73 | 1.87 |

[Figure: arithmetic intensity versus # rhs]
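The Flop/byte values follow from loading the gauge links once per site and reusing them for every right-hand side, while the vector traffic grows with the number of right-hand sides. This is a minimal sketch without cache reuse, again assuming 1146 flop per site (not stated in the slides); `multi_rhs_intensity` is an illustrative name.

```python
FLOP_PER_SITE = 1146                     # assumption: flop per site and right-hand side
GAUGE_BYTES = 8 * 72 + 8 * 56            # gauge links, loaded once per site
VECTOR_BYTES_PER_RHS = 16 * 24 + 24      # 16 input vectors + 1 output vector per rhs


def multi_rhs_intensity(n_rhs):
    """Flop per byte when one load of the gauge links serves n_rhs right-hand sides."""
    if n_rhs < 1:
        raise ValueError("n_rhs must be at least 1")
    flop = n_rhs * FLOP_PER_SITE
    traffic = GAUGE_BYTES + n_rhs * VECTOR_BYTES_PER_RHS
    return flop / traffic


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} rhs: {multi_rhs_intensity(n):.2f} flop/byte")
```

In this model the gain saturates: for many right-hand sides the intensity approaches 1146 / 408, about 2.8 flop/byte, because the vector traffic then dominates.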
→ occupancy / spilling
__global__ Dslashreg(w1, w2, w3, v1, v2, v3) {
  ...
  for (xp = ...) {                  // loop over the neighbors xp of site x
    w1(x) += D(x,xp) * v1(xp);      // D(x,xp) is loaded once and reused for each rhs
    w2(x) += D(x,xp) * v2(xp);
    w3(x) += D(x,xp) * v3(xp);
  }
}
→ occupancy / spilling
→ reduce register pressure
→ only one global load
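__global__ Dslashcache(w, v) {
  ...
  for (xp = ...)                              // loop over the neighbors xp of site x
    w(x, offset) += D(x,xp) * v(xp, offset);  // offset selects the right-hand side
}
Layout: v1 at x=0, x=1, …, x=BS-1; then v2 at x=0, x=1, …, x=BS-1; then v3 at x=0, x=1, …, x=BS-1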
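One plausible reading of this layout is a linear ordering in which all BS sites of the first right-hand side come first, followed by all sites of the second, and so on. The sketch below encodes that reading only; the helper names `slot_index` and `layout` are hypothetical, the right-hand side index is zero-based (v1 corresponds to `rhs = 0`), and the actual indexing in the author's kernel is not shown on the slide.

```python
def slot_index(x, rhs, block_size):
    """Linear position of site x (0..block_size-1) of right-hand side rhs (zero-based)."""
    if not 0 <= x < block_size:
        raise ValueError("x must lie inside the block")
    if rhs < 0:
        raise ValueError("rhs must be non-negative")
    return rhs * block_size + x


def layout(block_size, n_rhs):
    """List the (x, rhs) pairs in the order they appear in the layout."""
    return [(x, rhs) for rhs in range(n_rhs) for x in range(block_size)]


if __name__ == "__main__":
    for slot, (x, rhs) in enumerate(layout(block_size=4, n_rhs=3)):
        print(f"slot {slot}: x={x} v{rhs + 1}")
```

With this ordering, neighboring slots hold neighboring sites of the same right-hand side, and entries for the same site x and different right-hand sides sit exactly BS slots apart.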
→ occupancy / spilling
→ reduce register pressure
→ only one global load
__global__ Dslashregcache(w1, w2, w3, v1, v2, v3) {
  ...
  for (xp = ...) {                              // loop over the neighbors xp of site x
    w1(x, offset) += D(x,xp) * v1(xp, offset);  // several rhs per thread in registers,
    w2(x, offset) += D(x,xp) * v2(xp, offset);  // further rhs selected by offset
    w3(x, offset) += D(x,xp) * v3(xp, offset);
  }
}
Layout: v1 at x=0, x=1, …, x=BS-1; then v2 at x=0, x=1, …, x=BS-1; then v3 at x=0, x=1, …, x=BS-1
(each site needs 8 x 72 bytes + 8 x 56 bytes)
[Figure: GFlop/s versus # rhs (1 to 4); series: K20 estimate, K40 estimate, K20 measured, K40 measured]
| Block | [16,4] | [128,4] | [256,4] | [1024,1] |
|---|---|---|---|---|
| regs | 63 | 63 | 63 | 62 |
| (unlabeled) | 0.49 | 0.47 | 0.48 | 0.48 |
| eligible warps | 2.45 | 2.92 | 3.08 | 0.87 |
| IPC | 1.92 | 1.92 | 1.87 | 0.77 |
| TC Hits % | 51.9 | 74.3 | 75.9 | 3.8 |
| L2 (TC) Hits % | 50.0 | 5.6 | 0.0 | 0.0 |
| L1 Hits % | 18.2 | 31.2 | 33.9 | 44.3 |
| L2 (L1) Hits % | 48.4 | 37.1 | 28.9 | 7.1 |
| Tex+L2 Hits % | 75.9 | 75.7 | 75.9 | 3.8 |
| L1+L2 Hits % | 57.8 | 56.7 | 53.0 | 48.3 |
4 x 12 kB texture / read-only cache
[Figure: four schedulers and four texture caches]
[Figure: four texture caches with thread ranges (0…15,0) (0…15,1) (0…15,2) (0…15,3) (16…31,0) (16…31,1) (16…31,2) (16…31,3)]

| Block | TC Hits % | L2 (TC) Hits % | Tex+L2 Hits % |
|---|---|---|---|
| [16,4] | 51.9 | 50.0 | 75.9 |
| [128,4] | 74.3 | 5.6 | 75.7 |
[Figure: four texture caches with thread ranges (0…31,y), (32…63,y), (64…95,y), (96…127,y) for y = 0, 1, 2, 3]
naive 16-fold site fusion: 16 matrices times 16 vectors (one per fused site)
[Figure: real and imaginary parts of matrix and vector]
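To make the idea of site fusion concrete, the sketch below packs the same vector component of 16 sites into one 16-wide array, with real and imaginary parts kept separate, so that one 512-bit register (16 fp32 values) would hold one such array. This is an illustration of a structure-of-arrays layout under that assumption, not the data layout of the author's code; `fuse_sites` and `SIMD_WIDTH` are hypothetical names.

```python
SIMD_WIDTH = 16  # fp32 lanes in one 512-bit vector register


def fuse_sites(vectors):
    """Pack 16 sites (3 complex components each) into 6 arrays of 16 floats.

    The result is ordered [re c0, im c0, re c1, im c1, re c2, im c2];
    lane s of every array belongs to site s.
    """
    if len(vectors) != SIMD_WIDTH:
        raise ValueError("expected exactly 16 sites")
    fused = []
    for component in range(3):
        fused.append([float(site[component].real) for site in vectors])
        fused.append([float(site[component].imag) for site in vectors])
    return fused


if __name__ == "__main__":
    sites = [[complex(s, -s), complex(10 + s, 0), complex(0, 20 + s)] for s in range(16)]
    for row in fuse_sites(sites):
        print(row)
```

With this packing, one vector instruction applies the same arithmetic to 16 sites at once, at the cost of gathering neighbor data into the same lane order.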
[Figure: Gflop/s versus # rhs (1 to 5); series: 16-fold, 16-fold + prefetch, 8-fold, 8-fold + prefetch]
Results for the full conjugate gradient inverter on Xeon Phi and Tesla
[Figure: performance with ECC, 4 rhs, in GFlop/s versus lattice size (16,4; 32,8; 48,12; 32,64; 64,16) on 5110, 7120, K20, K20X and K40]
[Figure: performance on 64^3 x 16 with ECC in GFlop/s versus # rhs (1 to 5) on 5110, 7120, K20, K20X and K40]
[Figure: performance relative to K20, 4 rhs, on 5110, 7120, K20 and K40; series: peak bw, triad bw, 32^3x8 CG, 64^3x16 CG]
How energy efficient are the two architectures? Oh, does anyone wonder about Maxwell in this respect?
[Figure: average solver power in W, 4 rhs, 32x8, on 5110 (est), K20, K40 and M6000; series: TDP, CG ECC, CG noECC]
[Figure: solver performance in GFlop/s on 5110 (est), K20, K40 and M6000; series: CG ECC, CG noECC]
[Figure: energy efficiency in GFlop/s per W on 5110 (est), K20, K40 and M6000; series: CG ECC, CG noECC]
preliminary: code only optimized for Kepler
Contact: mathwagn@indiana.edu, http://linked.in/mathwagn, @mathwagn
Collaborators: P. Steinbrecher (Bielefeld U → Brookhaven National Lab)
References: arXiv:1411.4439 [physics.comp-ph], arXiv:1409.1510 [cs.DC]
Thanks to: Jeongnim Kim (Intel), Mike Clark (Nvidia)