Algorithmic time, energy, and power on candidate HPC compute building blocks
Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc
May 20, 2014. Presented at IPDPS'14.
Contributions
- The energy roofline (IPDPS'13) quantifies the relative energy costs of computation to data movement.
- Applied across server- to mobile-class platforms.
Abstract machine model [figure]: an xPU with a fast memory (total size = Z) attached to a slow memory.
- W (fl)ops, each costing τflop = time / (fl)op
- Q (m)ops, each costing τmem = time / (m)op
[Figure: "Roofline" [1] and "arch line" - relative performance vs. intensity (FLOP:Byte); peak values 14 GFLOP/s and 3.6 GFLOP/J.]

[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
[Figure: Power relative to flop-power vs. intensity (flop:byte), separating power dissipated by compute units from power dissipated by memory units.]
[Figure: Measured power (normalized to flop+const) vs. intensity (FLOP:Byte) for the NVIDIA GTX 580 (GPU-only; roughly 120-380 W) and the Intel i7-950 (desktop; roughly 120-180 W).]
A power cap prevents peak performance: below the cap, power is determined by performance; at the cap, performance is limited by the "usable" power.
μbenchmark for Ivy Bridge (code at http://hpcgarage.org/archline)
- aligned memory loads
- 1 MUL and 1 ADD AVX instruction issued per cycle
- maximize AVX register usage to increase ILP
- parallelized over all available cores

; load eight aligned 32-byte AVX vectors (256 B per loop iteration)
vmovapd ymm0, [rdi - 128]
vmovapd ymm1, [rdi - 96]
vmovapd ymm2, [rdi - 64]
vmovapd ymm3, [rdi - 32]
vmovapd ymm4, [rdi]
vmovapd ymm5, [rdi + 32]
vmovapd ymm6, [rdi + 64]
vmovapd ymm7, [rdi + 96]
; repeat MAD_PER_ELEMENT times: one independent MUL + ADD per register,
; which sets the flop:byte intensity of the kernel
%rep MAD_PER_ELEMENT
    vmulpd ymm0, ymm0, ymm0
    vaddpd ymm8, ymm8, ymm0
    vmulpd ymm1, ymm1, ymm1
    vaddpd ymm9, ymm9, ymm1
    vmulpd ymm2, ymm2, ymm2
    vaddpd ymm10, ymm10, ymm2
    vmulpd ymm3, ymm3, ymm3
    vaddpd ymm11, ymm11, ymm3
    vmulpd ymm4, ymm4, ymm4
    vaddpd ymm12, ymm12, ymm4
    vmulpd ymm5, ymm5, ymm5
    vaddpd ymm13, ymm13, ymm5
    vmulpd ymm6, ymm6, ymm6
    vaddpd ymm14, ymm14, ymm6
    vmulpd ymm7, ymm7, ymm7
    vaddpd ymm15, ymm15, ymm7
%endrep
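A quick sanity check on the intensity this kernel sweeps (arithmetic added here, not on the slide): each iteration loads 8 × 32 B = 256 B, and each %rep repetition executes 8 MULs and 8 ADDs on 4-wide double vectors, i.e., 64 flops. The intensity is therefore roughly (64 × MAD_PER_ELEMENT) / 256 B = MAD_PER_ELEMENT / 4 FLOP per byte.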
μbenchmark for Cortex A9
- aligned memory loads
- 1 ADD every cycle, but 1 MUL only every other cycle
- maximize register usage to increase ILP
- parallelized over all available cores

@ repeat MLA_PER_DOUBLE times: interleave multiply-accumulates and adds
@ across many VFP registers to expose instruction-level parallelism
.rept MLA_PER_DOUBLE
    VMLA.F64 d12, d0, d0
    VADD.F64 d24, d24, d0
    VMLA.F64 d13, d1, d1
    VADD.F64 d25, d25, d1
    VMLA.F64 d14, d2, d2
    VADD.F64 d26, d26, d2
    VMLA.F64 d15, d3, d3
    VADD.F64 d27, d27, d3
    VMLA.F64 d16, d4, d4
    VADD.F64 d28, d28, d4
    VMLA.F64 d17, d5, d5
    VADD.F64 d29, d29, d5
    VMLA.F64 d18, d6, d6
    VADD.F64 d24, d24, d6
    VMLA.F64 d19, d7, d7
    VADD.F64 d25, d25, d7
    VMLA.F64 d20, d8, d8
    VADD.F64 d26, d26, d8
    VMLA.F64 d21, d9, d9
    VADD.F64 d27, d27, d9
    VMLA.F64 d22, d10, d10
    VADD.F64 d28, d28, d10
    VMLA.F64 d23, d11, d11
    VADD.F64 d29, d29, d11
.endr
μbenchmark for Kepler (vs. Fermi)
- the quad warp scheduler selects up to four warps and issues two instructions per warp
- each SMX has 192 cores
- unlike on Fermi, we need more than 1 independent instruction in two of the warps
- theoretical peak is impossible to achieve, likely due to a register-bandwidth limit

// each thread loads one vector element and applies a long chain of
// independent multiply-adds to its four components
uint tid = threadIdx.x + blockIdx.x * blockDim.x;
TYPE tmp1;
float x, y, z, w;
if (tid < num_threads) {
    tmp1 = in[tid];
    x = tmp1.x; y = tmp1.y; z = tmp1.z; w = tmp1.w;
    x = x + x * CONST;
    y = y + y * CONST;
    z = z + z * CONST;
    w = w + w * CONST;
    x = x + x * CONST;
    y = y + y * CONST;
    z = z + z * CONST;
    w = w + w * CONST;
    ...
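Why the extra independent instruction matters (arithmetic added here, based on the figures above): four warps issuing one instruction each cover 4 × 32 = 128 lanes, short of the 192 cores per SMX; only if two of the four warps dual-issue a second, independent instruction can the scheduler cover 4 × 32 + 2 × 32 = 192 lanes per cycle.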
Memory μbenchmarks
- streaming bandwidth: read from both ends of the array so that pre-fetched data is (almost always) consumed
- random access: pointer-chasing arrays
- fully unrolled loops
- tuned over parameters, e.g., # of threads, thread block size
(a pointer-chasing sketch follows below)
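A minimal C sketch of the pointer-chasing idea, not the authors' benchmark code: each element stores the index of the next element to visit, so every load depends on the previous one and the loop exposes memory latency rather than bandwidth. The names (build_chain, chase) and the shuffle-based chain construction are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

/* Build a random cyclic chain: next[i] holds the index to visit after i. */
static void build_chain(size_t *next, size_t n)
{
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)            /* one cycle through all elements */
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);
}

/* Each load depends on the previous one, defeating hardware prefetch. */
static size_t chase(const size_t *next, size_t start, size_t naccesses)
{
    size_t p = start;
    for (size_t k = 0; k < naccesses; k++)
        p = next[p];
    return p;   /* returned so the compiler cannot remove the loop */
}

int main(void)
{
    size_t n = 1u << 24;                      /* 16M elements, ~128 MiB on 64-bit */
    size_t *next = malloc(n * sizeof *next);
    build_chain(next, n);
    printf("%zu\n", chase(next, 0, 10 * n));  /* time this call externally */
    free(next);
    return 0;
}

The streaming-bandwidth kernel (reading from both ends) is a separate routine; this sketch only covers the random-access case.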
[Figure: Power measurement setup - ATX PSU, PowerMon 2, PCIe interposer, GPU, CPU, and motherboard, with separate sense streams; instrumented power bricks feed the ARM dev board and the APU.]
Vendor’s claimed peak Power (empirical) Energy (and empirical throughput) Random access Platform Processor single Gflop/s double Gflop/s
GB/s ⇡1 Watts (idle) ∆⇡ Watts ✏s pJ/flop (Gflop/s) ✏d pJ/flop (Gflop/s) ✏mem pJ/B (GB/s) ✏L1 pJ/B (GB/s) ✏L2 pJ/B (GB/s) ✏rand nJ/access (Macc/s) Desktop CPU “Nehalem” Intel Core i7-950 (45 nm) 107 53.3 25.6 122 (79.9) 44.2 371 (99.4) 670 (49.7) 795 (19.1) 135 (201) 168 (120) 108 (149) NUC CPU “Ivy Bridge” Intel Core i3-3217U (22 nm) 57.6 28.8 25.6 16.5 (13.2) 7.37 14.7 (55.6) 24.3 (27.9) 418 (17.9) 8.75 (201) 14.3 (103) 54.6 (55.3) NUC GPU HD 4000 269 — 25.6 10.1 (13.2)∗ 17.7 76.1 (268) — 837 (15.4) — — — APU CPU “Bobcat” AMD E2-1800 (40 nm) 13.6 5.10 10.7 20.1 (11.8) 1.39 33.5 (13.4) 119 (5.05) 435 (3.32) 84.0 (25.8) 138 (11.6) 75.6 (8.03) APU GPU “Zacate” HD 7340 109 — 10.7 15.6 (11.8) 3.23 5.82 (104) — 333 (8.70) 6.47 (46.0) — 45.8 (115) GTX 580 “Fermi” NVIDIA GF100 (40 nm) 1580 198 192 122 (148)∗ 146 99.7 (1400) 213 (196) 513 (171) 149 (761) 257 (284) 112 (977) GTX 680 “Kepler” NVIDIA GK104 (28 nm) 3530 147 192 66.4 (100)∗ 145 43.2 (3030) 263 (147) 437 (158) 51 (1150) 195 (297) 184 (1420) GTX Titan “Kepler” NVIDIA GK110 (28 nm) 4990 1660 288 123 (72.9) 164 30.4 (4020) 93.9 (1600) 267 (239) 24.4 (1610) 195 (297) 48.0 (968) Xeon Phi “KNC” Intel 5110P (22 nm) 2020 1010 320 180 (90) 36.1 6.05 (2020) 12.4 (1010) 136 (181) 2.19 (2890) 8.65 (591) 5.11 (706) PandaBoard ES “Cortex-A9” TI OMAP 4460 (45 nm) 9.60 3.60 3.20 3.48 (2.74) 1.19 37.2 (9.47) 302 (3.02) 810 (1.28) 79.5 (18.4) 134 (4.12) 60.9 (12.1) Arndale CPU “Cortex-A15” Samsung Exynos 5 (32 nm) 27.2 6.80 12.8 5.50 (1.72) 2.01 107 (15.8) 275 (3.97) 386 (3.94) 76.3 (50.8) 248 (15.2) 138 (14.8) Arndale GPU “Mali T-604” 72.0 — 12.8 1.28 (1.72)∗ 4.83 84.2 (33.0) — 518 (8.39) 71.4 (33.4) — 125 (33.6)
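As an example of reading the table (arithmetic added here, not on the slide): the GTX Titan's fitted flop energy of 30.4 pJ/flop corresponds to at most 1 / 30.4 pJ ≈ 33 Gflop/J before constant power is charged, and its time balance from the measured throughputs is roughly 4020 Gflop/s / 239 GB/s ≈ 17 flops per byte.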
A ~5 TFLOP/s "supercomputer", two ways: 1× GTX Titan GPU at ~287 W and ~5 TFLOP/s, versus 70× Mali T-604 GPUs at ~6 W and ~72 GFLOP/s each (roughly 70 × 72 GFLOP/s ≈ 5 TFLOP/s and 70 × 6 W ≈ 420 W in aggregate).
[Figure: Hypothetical scaled Arndale GPU design vs. GTX Titan - normalized time, energy, and power vs. intensity (single-precision FLOP:Byte) for the GK110 (Kepler), the Samsung Arndale GPU (Exynos 5, ARM Mali T-604), and 70× Arndale GPUs.]
GTX Titan, measured and fitted:
- 16 Gflop/J, 1.3 GB/J
- 4.0 Tflop/s [81%], 240 GB/s [83%]
- 120 W (const) + 160 W (cap) [99%]
[Figure: Power (normalized to const+cap) vs. intensity (single-precision FLOP:Byte) for the GTX Titan, annotated with the constant power π1 and the dynamic range Δπ.]
With a hypothetical power cap Pth = π1 + (1/2)Δπ, performance is unaffected when intensity is low (≤ 2) but throttled everywhere else, most severely around Bτ, where power peaks.
[Figure: Same power-vs-intensity plot with the cap at π1 + (1/2)Δπ marked.]
With a cap Pth = π1 + (1/4)Δπ, the code is throttled everywhere: performance will be lower than expected at all intensities.
[Figure: Same plot with the cap at π1 + (1/4)Δπ marked.]
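For concreteness, plugging in the fitted GTX Titan values above (arithmetic added here, not from the slides): Pth = π1 + Δπ/2 = 120 W + 80 W = 200 W in the first scenario, and Pth = π1 + Δπ/4 = 120 W + 40 W = 160 W in the second, versus the fitted 120 W + 160 W = 280 W uncapped.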
Fitted parameters per platform:
- GTX Titan: 16 Gflop/J, 1.3 GB/J; 4.0 Tflop/s [81%], 240 GB/s [83%]; 120 W (const) + 160 W (cap) [99%]
- Xeon Phi: 11 Gflop/J, 880 MB/J; 2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
- Arndale GPU: 8.1 Gflop/J, 1.5 GB/J; 33 Gflop/s [46%], 8.4 GB/s [66%]; 1.3 W (const) + 4.8 W (cap) [88%]
[Figure: Power (normalized to const+cap) vs. intensity (single-precision FLOP:Byte) for GTX Titan, Xeon Phi, and Arndale GPU; curves labeled C, F, and M in the original.]
[Figure: Flops / Time (normalized to estimated peak) vs. intensity (single-precision FLOP:Byte) for GTX Titan, Xeon Phi, and Arndale GPU, with the same fitted parameters as above; annotated gaps of < 4×, ~5×, and ~8× between platforms.]
Summary
Energy:
- provides a high-level analytical methodology
- supports reasoning about hypothetical systems
- supports inter-platform comparisons of energy costs
Time (performance):
- every cost must be accounted for
- constant power is a huge bottleneck
[Figure: Flops / Energy (normalized to estimated peak) vs. intensity (single-precision FLOP:Byte) for GTX Titan, Xeon Phi, and Arndale GPU.]
[Figure: Energy per single-precision operation (flop, L1, L2, mem, random access) for each platform, relative to a 61.0 pJ baseline; includes constant energy. The platform ordering differs by operation type.]
[Figure: Power prediction error (single-precision), (Model − Measured) / Measured, per platform; capped platforms marked.]
[Figure: Roofline - relative performance (GFLOP/s) vs. intensity (FLOP:Byte), divided into compute-bound and memory-(bandwidth-)bound regions.]

Time model:
T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I),
where I = W/Q is the intensity and Bτ = τmem/τflop is the time balance.
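A worked example using the fitted GTX Titan numbers from earlier (arithmetic added here): with ~4.0 Tflop/s and ~240 GB/s, the time balance is Bτ ≈ 4000 / 240 ≈ 17 flops per byte, so kernels with intensity below roughly 17 FLOP:Byte are bandwidth-bound in time on that card.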
Energy model (same machine model: xPU with fast memory of size Z, W (fl)ops at τflop each, Q (m)ops at τmem each), now with a constant-power term π1:
E = W·εflop + Q·εmem + π1·T = W·εflop · (1 + Bε/I + (π1/εflop)·(T/W)),
where Bε = εmem/εflop is the energy balance (flop : mop).
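For instance (arithmetic added here, using the table above): on the desktop Nehalem, εmem = 795 pJ/B and εs = 371 pJ/flop give an energy balance of Bε ≈ 795 / 371 ≈ 2.1 FLOP:Byte; below that intensity, memory energy dominates flop energy, ignoring the constant term π1·T.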
Time-energy balance gap
[Figure: Time and energy rooflines (GFLOP/s and GFLOP/J) vs. intensity (FLOP:Byte), showing the gap between Bτ and Bε.]
- When Bε > Bτ, an algorithm can be compute-bound in time but memory-bound in energy.
- Optimizing for energy may therefore differ from optimizing for time.
- Improving energy efficiency improves time efficiency, but not vice-versa, breaking "race-to-halt".
Power model:
P = E/T = (εflop/τflop) · min(1 + Bε/I, (I + Bε)/Bτ)
- I → ∞ ⇒ P = εflop/τflop = Pflop
- I → 0 ⇒ P = Pflop · Bε/Bτ
- I = Bτ ⇒ P = Pflop · (1 + Bε/Bτ)
[Figure: Power, relative to flop-power, vs. intensity (flop:byte).]
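A minimal C sketch, not the authors' code, of the time, energy, and power model just described; it treats a "mop" as one byte moved, and the GTX Titan-like constants are rough values read off the table earlier, so all names and numbers are illustrative.

#include <stdio.h>

typedef struct {
    double tau_flop;  /* time per flop [s]          */
    double tau_mem;   /* time per byte moved [s]    */
    double eps_flop;  /* energy per flop [J]        */
    double eps_mem;   /* energy per byte moved [J]  */
    double pi1;       /* constant power [W]         */
} machine_t;

/* T = W*tau_flop * max(1, B_tau / I), with I = W/Q and B_tau = tau_mem/tau_flop */
static double model_time(const machine_t *m, double W, double Q)
{
    double I = W / Q;
    double B_tau = m->tau_mem / m->tau_flop;
    double scale = (B_tau / I > 1.0) ? B_tau / I : 1.0;
    return W * m->tau_flop * scale;
}

/* E = W*eps_flop + Q*eps_mem + pi1 * T */
static double model_energy(const machine_t *m, double W, double Q)
{
    return W * m->eps_flop + Q * m->eps_mem + m->pi1 * model_time(m, W, Q);
}

/* Average power P = E / T */
static double model_power(const machine_t *m, double W, double Q)
{
    return model_energy(m, W, Q) / model_time(m, W, Q);
}

int main(void)
{
    /* Roughly GTX Titan-like: ~4 Tflop/s, ~240 GB/s, 30.4 pJ/flop,
       267 pJ/B, 120 W constant power (values read off the table above). */
    machine_t titan = { 1.0 / 4.0e12, 1.0 / 240e9, 30.4e-12, 267e-12, 120.0 };
    double W = 1.0e12;                       /* 1 Tflop of work */
    for (double I = 0.25; I <= 128.0; I *= 2.0) {
        double Q = W / I;                    /* bytes implied by intensity I */
        printf("I = %7.2f flop/B  T = %.4f s  E = %8.2f J  P = %6.1f W\n",
               I, model_time(&titan, W, Q), model_energy(&titan, W, Q),
               model_power(&titan, W, Q));
    }
    return 0;
}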
[Backup figures: power (normalized to const+cap) and Flops / Time (normalized to 4.0 Tflop/s) vs. intensity (single-precision FLOP:Byte) for GTX Titan, Xeon Phi, and Arndale GPU, with the fitted parameters listed earlier.]