

SLIDE 1

Generalized roofline analysis?

Jee Choi ∙ Marat Dukhan ∙ Richard (Rich) Vuduc
October 2, 2013
Dagstuhl Seminar 13401: Automatic Application Autotuning for HPC Architectures
Follow along at hpcgarage.org/13401

Wednesday, October 2, 13
SLIDE 2

[Scanned page (p. 282) from R.W. Hockney & I.J. Curington, “f½: A parameter to characterize memory and communication bottlenecks,” showing Fig. 2: “The variation of (r̂∞, n̂½) with f for the case of a combined memory I/O and arithmetic pipeline, when the I/O and arithmetic can be overlapped. Full lines: parameters when arithmetic dominates, equations (14b); dotted lines: parameters when I/O dominates, equations (13b). Notation as Fig. 1, and p = 1.5.”]

In order to find out whether I/O or arithmetic dominates, one must examine the breakeven vector length, n₁, at which I/O and arithmetic take equal times. This occurs when t₁ = t₂, whence

n₁ = (n½(m) − z·n½(a)) / (z − 1)   (15)

where z = f·r∞(m)/r∞(a).

The variation of n₁ with z is drawn in Fig. 3 for the case p = n½(m)/n½(a) = 1.5, and the regions of the (z, n₁)-plane corresponding to I/O or arithmetic dominance are shown. If z > p, the arithmetic time dominates for all vector lengths, because n₁ is negative. Equations (14) apply, and the asymptotic performance is constant and equal to the r∞(a) of the arithmetic pipeline. Since this is the situation when f → ∞ we have, by definition, the peak performance p̂∞ = r∞(a), (16a) the same as for the sequential I/O case. If, however, z < 1, I/O dominates for all vector lengths (because n₁ is again negative) and equations (13) apply. The total computation time for f vector operations is constant, hence the asymptotic performance rises linearly with f, reaching the peak performance p̂∞ = r∞(a) when f = r∞(a)/r∞(m). The asymptotic performance reaches half the peak performance when f reaches half this value, hence by definition

f½ = ½·r∞(a)/r∞(m).   (16b)

Thus overlapping halves the value of f½ from that obtained for sequential I/O. Between 1 < z < p, either I/O or arithmetic may dominate depending on the vector length (because, now, n₁ is positive). Figure 2 shows that if I/O dominates (n < n₁) the asymptotic performance r̂∞ can exceed the asymptotic performance of the arithmetic pipeline r∞(a), and it might appear that this is absurd and against physical intuition. However, this is not the case, because r̂∞ is a theoretical asymptotic performance (for n → ∞) which in this case can never be [attained].

R.W. Hockney and I.J. Curington (1989). “f½: A parameter to characterize memory and communication bottlenecks.” doi: 10.1016/0167-8191(89)90100-2

[Roofline plot, “(a) Intel Xeon (Clovertown)”: operational intensity (Flops/Byte, 1/16–16) vs. GFlops/s (1–128). Compute ceilings: peak DP, +balanced mul/add, +SIMD, +ILP, TLP only. Bandwidth ceilings: peak stream bandwidth, +snoop filter (effective vs. ineffective). Kernels shown: LBMHD, FFT (512³), FFT (128³), Stencil.]

S. Williams, A. Waterman, D. Patterson (2009). “Roofline: An insightful visual performance model for multicore architectures.” doi: 10.1145/1498765.1498785

SLIDE 3

[Scanned excerpt from Hockney & Curington (1989), Fig. 2, as on the previous slide.]

R.W. Hockney and I.J. Curington (1989). “f½: A parameter to characterize memory and communication bottlenecks.” doi: 10.1016/0167-8191(89)90100-2

[Intel Xeon (Clovertown) roofline plot, as on the previous slide.]

S. Williams, A. Waterman, D. Patterson (2009). “Roofline: An insightful visual performance model for multicore architectures.” doi: 10.1145/1498765.1498785

Rooflines provide insight into the limits on performance due to intrinsic properties of a computation as they relate to architectural features of a system.

SLIDE 4

Applying ideas of R. Numrich, we can identify a unitless (dimensionless) manifold of all possible systems in the model, and use differential geometry to estimate the “distance” between systems:

u = u(intensity, balance), v = v(intensity, power), f = f(energy-efficiency)

Numrich, R. W. (2010). “Computer performance analysis and the Pi Theorem.” Comp. Sci. R&D. doi:10.1007/s00450-010-0147-8

SLIDE 5

There are many ways to “generalize” the roofline: metrics beyond time, e.g., energy and power; intrinsic algorithmic properties beyond “flop:byte”; hypothetical architectural features; among others. Example: Time, energy, and power of an abstract computation.

[Photo: Jee Choi doing actual science.]

SLIDE 6

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

[Diagram: xPU with fast memory (total size = Z) and slow memory.]

SLIDE 7

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)
I ≡ W/Q = intensity (flop:mop)

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops on the xPU, Q mops to slow memory.]

SLIDE 8

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)
I ≡ W/Q = intensity (flop:mop)
τflop ≡ time per (fl)op
τmem ≡ time per mop
Bτ ≡ τmem/τflop = balance (flop:mop)
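The definitions above translate directly into code. A minimal sketch (the kernel and machine numbers below are hypothetical, not taken from the talk):

```python
# Sketch of the slide's definitions. All numbers are hypothetical.

def intensity(W, Q):
    """I = W / Q: flops per memory operation (flop:mop)."""
    return W / Q

def time_balance(tau_mem, tau_flop):
    """B_tau = tau_mem / tau_flop: the machine's balance (flop:mop)."""
    return tau_mem / tau_flop

# Hypothetical kernel: 2e9 flops, 5e8 mops -> I = 4 flop:mop.
I = intensity(2e9, 5e8)
# Hypothetical machine: 1 ns per mop, 0.25 ns per flop -> B_tau = 4 flop:mop.
B = time_balance(1.0e-9, 0.25e-9)
# Here I == B: the kernel sits exactly at the machine's balance point.
print(I, B)  # 4.0 4.0
```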

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops at τflop = time/flop, Q mops at τmem = time/mop.]

SLIDE 9

[Roofline plot: intensity (FLOP:Byte, 1/2–128) vs. relative performance (1/32–1); the time balance, ≈ 3.6 flop:byte, marks the knee. Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

“Roofline” — Williams et al. (Comm. ACM ’09)
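The roofline itself is just the minimum of two ceilings: attainable GFLOP/s = min(peak, bandwidth × intensity). A minimal sketch; the default peak and bandwidth are illustrative Fermi-class stand-ins (their ratio gives a knee near the 3.6 flop:byte balance quoted on the slide), not figures from the talk:

```python
def attainable_gflops(intensity, peak_gflops=515.0, stream_gbs=144.0):
    """Roofline: attainable GFLOP/s = min(peak, bandwidth * intensity).

    Defaults are illustrative Fermi-class figures; their ratio,
    515/144 ~= 3.6 flop:byte, is the balance (the knee of the roof).
    """
    return min(peak_gflops, stream_gbs * intensity)

for I in (0.5, 3.6, 32.0):
    print(I, attainable_gflops(I))
# 0.5  -> 72.0   (memory-bandwidth bound)
# 3.6  -> 515.0  (at/above the knee: compute bound)
# 32.0 -> 515.0
```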

SLIDE 10

[Roofline plot as above, labeling the knee: Balance (flop : mop).]

“Roofline” — Williams et al. (Comm. ACM ’09)

SLIDE 11

[Roofline plot as above; right of the balance point: “Compute bound.”]

SLIDE 12

[Roofline plot as above; left of the balance point: “Memory (bandwidth) bound”; right: “Compute bound.”]

SLIDE 13

[Roofline plot as above, placing dense matrix multiply in the compute-bound region.]

SLIDE 14

[Roofline plot as above, adding sparse matvec and stencils in the memory-bound region.]

SLIDE 15

[Roofline plot as above, adding FFTs near the balance point.]

SLIDE 16

[Side by side: the Fermi roofline sketch (balance estimate per Keckler et al., IEEE Micro (2011)) and the measured Intel Xeon (Clovertown) roofline of Williams et al., with “Memory (bandwidth) bound” and “Compute bound” regions marked.]

SLIDE 17

[Roofline plot as above, relabeled: “Memory (bandwidth) bound in time” and “Compute bound in time.” Balance estimate for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 18

[“Arch line”: the time roofline (GFLOP/s, knee ≈ 3.6 flop:byte) overlaid with its energy analogue (GFLOP/J, knee ≈ 14 flop:byte), over intensity (FLOP:Byte, 1/2–128). Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 19

[Arch-line plot as above, labeling the energy knee: Energy balance (flop : mop).]

SLIDE 20

[Arch-line plot as above; left of the energy balance: “Memory (bandwidth) bound in energy”; right: “Compute bound in energy.”]

SLIDE 21

[“Power line”: intensity (flop:byte, 1–512) vs. average power relative to flop-power (0.5–8), with the time (≈ 3.6) and energy (≈ 14) balance points marked. Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 22

For real systems, the model should account for “constant power” and “power capping.”

[Measured “Powerline” and “Roofline” curves vs. intensity (flop:byte), values normalized.]

SLIDE 23

For real systems, the model should account for “constant power” and “power capping.”

[Powerline/Roofline plot as above, annotating π0: the constant term.]

SLIDE 24

For real systems, the model should account for “constant power” and “power capping.”

[Powerline/Roofline plot as above, annotating π0 (constant) and ∆π (cap).]

SLIDE 25

[Measured time and energy arch lines vs. intensity (single-precision flop:Byte):
“Desktop GPU” (NVIDIA GTX Titan): 4.0 Tflop/s, 16 Gflop/J, 290 W.
“Mobile GPU” (Samsung/ARM Arndale): 33 Gflop/s, 8.1 Gflop/J, 7.0 W.]

SLIDE 26

So what?

SLIDE 27

So what?

Possibility 1:

A “first principles” view of time & energy in systems.

SLIDE 28

[“Arch line”: time (GFLOP/s, balance ≈ 3.6 flop:byte) and energy (GFLOP/J, balance ≈ 14 flop:byte) rooflines vs. intensity (FLOP:Byte, 1/2–128). Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

“Arch line”: Wεflop / E = 1 / (1 + Bε / I)

SLIDE 29

[Arch-line plot as above, highlighting the time-energy balance gap.]

Time-energy balance gap: what does this imply?

SLIDE 30

[Arch-line plot as above.]

Time-energy balance gap: compute bound in time but memory bound in energy?

SLIDE 31

[Arch-line plot as above.]

Time-energy balance gap: is optimizing for energy harder than optimizing for time?

SLIDE 32

[Arch-line plot as above.]

Time-energy balance gap: energy efficiency likely implies time efficiency, but not vice versa, breaking “race-to-halt.”

SLIDE 33

So what?

Possibility 2:

Another view of power caps and throttling.

SLIDE 34

Recall: “Power line.” Measured power vs. intensity (single-precision flop:Byte, 1/8–512; power normalized, 0.6–1.15), decomposed into Cap / Memory / Compute components, for twelve platforms:

* GTX Titan: 16 Gflop/J; 4.0 Tflop/s [81%], 240 GB/s [83%]; 120 W (const) + 160 W (cap) [99%]
* GTX 680: 15 Gflop/J; 3.0 Tflop/s [86%], 160 GB/s [82%]; 66 W (const) + 140 W (cap) [100%]
* Xeon Phi: 11 Gflop/J; 2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
* NUC GPU: 8.8 Gflop/J; 270 Gflop/s [100%], 15 GB/s [60%]; 10 W (const) + 18 W (cap) [91%]
* Arndale GPU: 8.1 Gflop/J; 33 Gflop/s [46%], 8.4 GB/s [66%]; 1.3 W (const) + 4.8 W (cap) [88%]
* APU GPU: 6.4 Gflop/J; 100 Gflop/s [95%], 8.7 GB/s [81%]; 16 W (const) + 3.2 W (cap) [100%]
* GTX 580: 5.4 Gflop/J; 1.4 Tflop/s [88%], 170 GB/s [89%]; 120 W (const) + 150 W (cap) [94%]
* NUC CPU: 3.2 Gflop/J; 56 Gflop/s [97%], 18 GB/s [70%]; 17 W (const) + 7.4 W (cap) [98%]
* PandaBoard ES: 2.5 Gflop/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%]; 3.5 W (const) + 1.2 W (cap) [95%]
* Arndale CPU: 2.2 Gflop/J; 16 Gflop/s [58%], 3.9 GB/s [31%]; 5.5 W (const) + 2.0 W (cap) [97%]
* APU CPU: 650 Mflop/J; 13 Gflop/s [98%], 3.3 GB/s [31%]; 20 W (const) + 1.4 W (cap) [98%]
* Desktop CPU: 620 Mflop/J; 99 Gflop/s [93%], 19 GB/s [74%]; 120 W (const) + 44 W (cap) [99%]

SLIDE 35

[The same twelve power-profile panels as above, labeled by platform class in panel order: “Desktop GPU” (GTX Titan), “Desktop GPU” (GTX 680), “Desktop xPU” (Xeon Phi), “Mobile GPU” (NUC GPU), “Mobile GPU” (Arndale GPU), “APU GPU,” “Desktop GPU” (GTX 580), “Mobile CPU” (NUC CPU), “Mobile CPU” (PandaBoard ES), “Mobile CPU” (Arndale CPU), “APU CPU,” “Desktop CPU.”]

SLIDE 36

[Plot: throttling coefficient vs. intensity (single-precision FLOP:Byte), for flops and for mops, on an NVIDIA GTX 580 (GF100; Fermi).]

Can infer throttling, given a power cap
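One way to read this (a sketch, not the talk's fitted model): if the predicted power exceeds the cap, the hardware must run slower. Assuming, hypothetically, that dynamic power scales linearly with speed while the constant term does not, the throttling coefficient is a simple ratio:

```python
def throttling_coefficient(p_const, p_dynamic_full, p_cap):
    """Fraction of full speed permitted by a power cap (1.0 = no throttling).

    Hypothetical assumption: dynamic power scales linearly with speed,
    while the constant draw p_const does not scale at all.
    """
    headroom = p_cap - p_const
    if headroom <= 0:
        return 0.0  # the cap cannot even cover constant power
    return min(1.0, headroom / p_dynamic_full)

# Hypothetical GPU: 120 W constant, 220 W dynamic at full speed, 280 W cap.
print(throttling_coefficient(120.0, 220.0, 280.0))  # ~0.727 of full speed
```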

SLIDE 37

So what?

Possibility 3:

Abstract algorithm analysis.

SLIDE 38

Abstract work-communication trade-offs

Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), with intensity I ≡ W/Q.

SLIDE 39

Abstract work-communication trade-offs: Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), I ≡ W/Q.

Speedup: ∆T = T1,1 / Tf,m. “Greenup”: ∆E = E1,1 / Ef,m.

SLIDE 40

Speedup: ∆T = T1,1 / Tf,m. “Greenup”: ∆E = E1,1 / Ef,m.

Abstract work-communication trade-offs: Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), I ≡ W/Q.

A general “greenup” condition:

∆E > 1  ⟹  f < 1 + ((m − 1)/m) · (Bε / I)
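The condition follows from the energy model E = Wεflop · (1 + Bε/I): form ∆E = E1,1/Ef,m and the Wεflop factors cancel. A small numeric check (the bound is from the slide; the numbers are made up):

```python
# Greenup check for Algorithm 1 = (W, Q) vs. Algorithm 2 = (f*W, Q/m).

def greenup(f, m, r):
    """Delta_E = E_{1,1}/E_{f,m}, where r = B_eps/I = (Q*eps_mem)/(W*eps_flop)."""
    return (1.0 + r) / (f + r / m)

def greenup_bound(m, r):
    """Largest work-inflation factor f for which Delta_E > 1."""
    return 1.0 + (m - 1.0) / m * r

# Made-up example: B_eps/I = 7, communication cut 4x (m = 4).
f_max = greenup_bound(4, 7.0)
print(f_max, greenup(f_max, 4, 7.0))  # 6.25 1.0 (Delta_E = 1 exactly at the bound)
```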

SLIDE 41

[Arch-line plot (time and energy rooflines; balances ≈ 3.6 and 14 flop:byte) with Algorithm 1 placed at its intensity. Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 42

[Arch-line plot as above, with Algorithm 1 at its intensity.]

∆E > 1  ⟹  f < 1 + Bε / Bτ

SLIDE 43

So what?

Possibility 4:

Abstract architectural bake-offs.

http://stickofachef.files.wordpress.com/2008/03/iron_chef.jpg

SLIDE 44

[Three-panel comparison vs. intensity (single-precision flop:Byte, 1/8–256): Flop/Time, Flop/Energy, and Power, as relative values, for the GTX Titan and the Arndale GPU.]

SLIDE 45

[Bake-off panels as above, with a ~47⨉ gap annotated.]

SLIDE 46

[Bake-off panels as above, with ~122⨉ and ~47⨉ gaps annotated.]

SLIDE 47

[Bake-off panels as above, with ~122⨉, ~28⨉, and ~47⨉ gaps annotated across the three panels.]

SLIDE 48

[Bake-off panels as above, adding a hypothetical “47 ⨉ Arndale GPU” ensemble to the Flop/Time and Power panels.]

SLIDE 49

[Bake-off panels as above; the 47⨉ Arndale ensemble is annotated “~0.4⨉ as fast.”]

SLIDE 50

[Bake-off panels as above, annotated “~0.4⨉ as fast” and “~1.6⨉ faster — but that’s optimistic!”]

SLIDE 51

[Bake-off panels as above, with one comparison additionally annotated “same!”]

SLIDE 52

[Bake-off panels as above, annotated “same!” and “~2⨉.”]

SLIDE 53

Applying ideas of R. Numrich, we can identify a unitless (dimensionless) manifold of all possible systems in the model, and use differential geometry to estimate the “distance” between systems:

u = u(intensity, balance), v = v(intensity, power), f = f(energy-efficiency)

Numrich, R. W. (2010). “Computer performance analysis and the Pi Theorem.” Comp. Sci. R&D. doi:10.1007/s00450-010-0147-8

SLIDE 54

๏ What are the intrinsic relationships among time, energy, and power?

๏ What do these relationships say about algorithms and software?

๏ About architectures, hardware, and co-design?

๏ Can {roof,arch,power,…}-lines, suitably refined, guide autotuning systems?

[Arch-line sketch posing the question: Bε > Bτ? Alongside, a scatter plot of time balance vs. energy balance (single-precision flop:byte) for many xPUs (CPUs and GPUs: Nehalem, Ivy Bridge, Kepler, Fermi, ARM Cortex-A9, ARM Cortex-A15, ARM Mali T-604, Bobcat, Zacate GPU, KNC), grouped as Mini, Mobile, and Desktop.]

SLIDE 55

Backup

SLIDE 56

Jee Whan Choi: autotuning for power and energy. Aparna Chandramowlishwaran [now R.S. @ MIT]: fast multipole method. Marat Dukhan: math libraries & machine learning. Kent Czechowski: co-design.

2010 Gordon Bell Prize (+ G. Biros); 2010 IPDPS Best Paper (+ K. Knobe, Intel CnC lead); 2012 SIAM Data Mining Best Paper (D. Lee [GE Research] + A. Gray).

See our recent 2013 IPDPS papers, posted at hpcgarage.org/ppam13:
* Jee Choi — “A roofline model of energy.”
* Kent Czechowski — “A theoretical framework for algorithm-architecture co-design.”

SLIDE 57

von Neumann-like system

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops on the xPU, Q mops to slow memory.]

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

hpcgarage.org/modsim13

SLIDE 58

von Neumann-like system

T = max(Wτflop, Qτmem) = Wτflop · max(1, (Q/W)·(τmem/τflop)) = Wτflop · max(1, Bτ/I)

E = Wεflop + Qεmem = Wεflop · (1 + Bε/I)

Consider: Wτflop/T and Wεflop/E.

[Diagram: xPU with fast memory (total size = Z) and slow memory; τflop = time/flop, τmem = time/mop.]

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
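The two expressions differ in one essential way: time takes the max (compute can hide memory traffic), while energy takes the sum (every operation is paid for). A direct transcription, with made-up per-op costs:

```python
# Direct transcription of the slide's time and energy models.
# All machine and kernel parameters below are made up for illustration.

def time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem):
    """Return (T, E) for W flops and Q mops on the modeled machine."""
    I = W / Q                    # intensity (flop:mop)
    B_tau = tau_mem / tau_flop   # time balance
    B_eps = eps_mem / eps_flop   # energy balance
    T = W * tau_flop * max(1.0, B_tau / I)  # max: memory hides under compute
    E = W * eps_flop * (1.0 + B_eps / I)    # sum: every op costs energy
    return T, E

# Hypothetical: 1e9 flops, 1e8 mops; 0.25 ns/flop, 1 ns/mop; 50 pJ/flop, 1 nJ/mop.
T, E = time_energy(1e9, 1e8, 0.25e-9, 1.0e-9, 50e-12, 1.0e-9)
# The slide's efficiency ratios: W*tau_flop/T and W*eps_flop/E.
print(1e9 * 0.25e-9 / T, 1e9 * 50e-12 / E)
```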

SLIDE 59

von Neumann-like system (as above), annotating the minimum time, Wτflop, in T = Wτflop · max(1, Bτ/I).

SLIDE 60

von Neumann-like system (as above), annotating the minimum time and the intensity I (flop : mop).

SLIDE 61

von Neumann-like system (as above), annotating the minimum time, the intensity I, and the balance Bτ (flop : mop).

SLIDE 62

An energy analogue: the same system with per-op energies εmem = energy/mop and εflop = energy/flop, so that

E = Wεflop + Qεmem = Wεflop · (1 + Bε/I)

SLIDE 63

An energy analogue (as above), annotating the energy balance Bε (flop : mop).

SLIDE 64

[Roofline plot: intensity (FLOP:Byte, 1/2–128) vs. relative performance (1/32–1); time balance ≈ 3.6 flop:byte. Balance estimate for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

“Roofline” — Williams et al. (Comm. ACM ’09):

Wτflop / T = 1 / max(1, Bτ / I)

SLIDE 65

[Roofline plot as above, labeling the compute-bound region.]

SLIDE 66

[Roofline plot as above, labeling “Memory (bandwidth) bound” and “Compute bound.”]

SLIDE 67

[Roofline plot as above, placing dense matrix multiply in the compute-bound region.]

SLIDE 68

[Roofline plot as above, adding sparse matvec and stencils in the memory-bound region.]

SLIDE 69

1/32 1/16 1/8 1/4 1/2 1

3.6

GFLOP/s

1/2 1 2 4 8 16 32 64 128

Intensity (FLOP:Byte) Relative performance Balance estimate for a high-end NVIDIA Fermi in double-precision, according to Keckler et al. IEEE Micro (2011)

flop:byte

Compute bound

“Roofline” — Williams et al. (Comm. ACM ’09)

Memory (bandwidth) bound sparse matvec; stencils FFTs Dense matrix multiply

Wτflop T = 1 max

  • 1, Bτ

I

  • hpcgarage.org/modsim13
Wednesday, October 2, 13
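The roofline bound above is easy to evaluate directly. A minimal sketch, using the slide's 3.6 FLOP:Byte balance estimate only as an illustrative input (the function itself is generic):

```python
def roofline_relative_perf(intensity, b_tau):
    """Normalized roofline: (W * tau_flop) / T = 1 / max(1, B_tau / I).

    intensity -- arithmetic intensity I (FLOP : Byte)
    b_tau     -- time balance B_tau = tau_mem / tau_flop (FLOP : Byte)
    """
    return 1.0 / max(1.0, b_tau / intensity)

# Below the balance point the kernel is bandwidth-bound; above it, compute-bound.
low = roofline_relative_perf(0.5, 3.6)    # bandwidth-bound: 0.5/3.6 of peak
high = roofline_relative_perf(16.0, 3.6)  # compute-bound: full peak
print(low, high)
```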
slide-70
SLIDE 70

[Figure: the same roofline plot (GFLOP/s vs. intensity, FLOP:Byte), regions now labeled “Memory (bandwidth) bound in time” and “Compute bound in time.” Balance estimates for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

(W·τ_flop) / T = 1 / max(1, B_τ / I)

hpcgarage.org/modsim13
SLIDES 71-72

[Figure: the energy “arch line” (GFLOP/J) added alongside the time roofline (GFLOP/s), plotted against intensity (FLOP:Byte); balance estimates (3.6 and 14) for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). The second build labels the regions “Memory (bandwidth) bound in energy” and “Compute bound in energy.”]

“Arch line”:

(W·ε_flop) / E = 1 / (1 + B_ε / I)

hpcgarage.org/modsim13
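The energy arch line can be sketched the same way; unlike the time roofline, it has no flat region. The function below is an illustrative sketch (the balance value 14 is taken from the slide's annotation, not a measured constant here):

```python
def archline_relative_eff(intensity, b_eps):
    """Normalized energy efficiency: (W * eps_flop) / E = 1 / (1 + B_eps / I),
    with constant power ignored (pi_0 = 0).

    intensity -- arithmetic intensity I (FLOP : Byte)
    b_eps     -- energy balance B_eps = eps_mem / eps_flop
    """
    return 1.0 / (1.0 + b_eps / intensity)

# Efficiency approaches peak only asymptotically as intensity grows.
half = archline_relative_eff(14.0, 14.0)    # at I = B_eps: exactly half of peak
near = archline_relative_eff(1400.0, 14.0)  # far above balance: close to peak
print(half, near)
```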
slide-73
SLIDE 73

That was theory. What happens in practice? ⇒ Cannot ignore constant power. Let’s add it to our model and measure.

SLIDES 74-75

[Diagram: an xPU attached to fast memory (total size = Z) and slow memory; τ_flop = time/flop, τ_mem = time/mop; the system also draws constant power.]

T = max(W·τ_flop, Q·τ_mem)
  = W·τ_flop · max(1, (Q/W)·(τ_mem/τ_flop))
  = W·τ_flop · max(1, B_τ/I)

E = W·ε_flop + Q·ε_mem + T·π_0
  = W·ε_flop · (1 + B_ε/I + (π_0/ε_flop)·(T/W))

Consider (W·τ_flop)/T and (W·ε_flop)/E.
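The two energy expressions above (additive and factored) must agree, which makes a good numerical check. A sketch with made-up parameter values (every number below is an illustrative assumption, not a measurement):

```python
def time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, pi0):
    """Model from the slide: T = max(W*tau_flop, Q*tau_mem);
    E = W*eps_flop + Q*eps_mem + T*pi0."""
    T = max(W * tau_flop, Q * tau_mem)
    E = W * eps_flop + Q * eps_mem + T * pi0
    return T, E

# Illustrative parameters: 1e9 flops at intensity I = 4 FLOP/Byte.
W, I = 1e9, 4.0
Q = W / I                       # mops (bytes moved)
tau_flop, tau_mem = 1e-11, 2e-11
eps_flop, eps_mem = 5e-11, 2e-10
pi0 = 50.0                      # constant power, watts

T, E = time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, pi0)

# Factored form: E = W*eps_flop * (1 + B_eps/I + (pi0/eps_flop) * (T/W)).
B_eps = eps_mem / eps_flop
E_factored = W * eps_flop * (1 + B_eps / I + (pi0 / eps_flop) * (T / W))
print(T, E, E_factored)  # the two energy expressions agree
```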
slide-76
SLIDE 76

[Figure: measured microbenchmark results vs. model fits, normalized performance (time and energy) against intensity (FLOP:Byte).
NVIDIA GTX 580, double precision: 1.03 = 197 GFLOP/s; 1.72 (static=0), 0.890 = 1.23 GFLOP/J.
Intel i7-950, double precision: 2.08 = 53.3 GFLOP/s; 0.896 (static=0), 1.01 = 0.316 GFLOP/J.]

A microbenchmark study: “GPU energy” includes all GPU card components (card, memory, fan) but excludes the host; the CPU measurement is analogous but excludes the GPU card. Measurements use “PowerMon” (Bedard & Fowler ’09) with sub-millisecond sampling.
SLIDES 77-80

[Figure repeated from the previous slide: measured time and energy results for the NVIDIA GTX 580 and Intel i7-950, double precision, normalized performance vs. intensity (FLOP:Byte); the builds repeat the GTX 580 panel.]

Constant power can shift the energy balance…

…and “race-to-halt” is an artifact of this shift.
SLIDES 81-82

[Figure: measured power vs. intensity (FLOP:Byte), normalized, double precision; levels marked at 84.0 W, 160 W, 212 W, and 288 W for the NVIDIA GTX 580, and at 108 W, 134 W, 169 W, and 195 W for the Intel i7-950.]

Source of error?
slide-83
SLIDE 83

Power capping is critical. It’s also easy to add.

slide-84
SLIDE 84

[Figure: “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte)]

What might a power cap look like?
SLIDES 85-89

Adding a power cap

[Figure (progressive builds): “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte), successively annotated with π_0 (constant power), Δπ (usable power), and π_0 + Δπ (max power).]
slide-90
SLIDE 90

Adding a power cap

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

T_free = max(W·τ_flop, Q·τ_mem)
⇓
T = max( W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ )

[Figure: “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte), annotated with π_0 and Δπ.]
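The capped-time expression adds one term to the free-running max: dynamic energy cannot be dissipated faster than the usable power Δπ. A sketch with illustrative numbers, showing the cap becoming the binding constraint when Δπ is small:

```python
def capped_time(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, delta_pi):
    """T = max(W*tau_flop, Q*tau_mem, (W*eps_flop + Q*eps_mem) / delta_pi)."""
    return max(W * tau_flop,
               Q * tau_mem,
               (W * eps_flop + Q * eps_mem) / delta_pi)

# Illustrative parameters (hypothetical machine; dynamic energy = 0.1 J).
W, Q = 1e9, 2.5e8
args = (1e-11, 2e-11, 5e-11, 2e-10)  # tau_flop, tau_mem, eps_flop, eps_mem

t_loose = capped_time(W, Q, *args, delta_pi=100.0)  # cap not binding: compute-bound
t_tight = capped_time(W, Q, *args, delta_pi=5.0)    # cap binding: 2x slower
print(t_loose, t_tight)
```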
SLIDES 91-92

[Figure: measured time (1.6 TFLOP/s at y=1), energy (5.7 GFLOP/J at y=1), and power (280 Watts at y=1), normalized, vs. intensity (single-precision FLOP:Byte), for the NVIDIA GTX 580 (GF100, Fermi).]

A better fit!
slide-93
SLIDE 93

The model suggests a structure in the time, energy, and power relationships. It also facilitates analysis.

slide-94
SLIDE 94

[Figure: twelve per-device panels of power (normalized) vs. intensity (single-precision flop:Byte), each showing fitted “Cap,” “Memory,” and “Compute” lines (only “Cap” and “Memory” for the NUC GPU; only “Cap” and “Compute” for the NUC CPU and APU CPU). Fitted parameters per device:]

GTX Titan     | 16 Gflop/J  | 4.0 Tflop/s [81%], 240 GB/s [83%]  | 120 W (const) + 160 W (cap) [99%]
GTX 680       | 15 Gflop/J  | 3.0 Tflop/s [86%], 160 GB/s [82%]  | 66 W (const) + 140 W (cap) [100%]
Xeon Phi      | 11 Gflop/J  | 2.0 Tflop/s [100%], 180 GB/s [57%] | 180 W (const) + 36 W (cap) [100%]
NUC GPU       | 8.8 Gflop/J | 270 Gflop/s [100%], 15 GB/s [60%]  | 10 W (const) + 18 W (cap) [91%]
Arndale GPU   | 8.1 Gflop/J | 33 Gflop/s [46%], 8.4 GB/s [66%]   | 1.3 W (const) + 4.8 W (cap) [88%]
APU GPU       | 6.4 Gflop/J | 100 Gflop/s [95%], 8.7 GB/s [81%]  | 16 W (const) + 3.2 W (cap) [100%]
GTX 580       | 5.4 Gflop/J | 1.4 Tflop/s [88%], 170 GB/s [89%]  | 120 W (const) + 150 W (cap) [94%]
NUC CPU       | 3.2 Gflop/J | 56 Gflop/s [97%], 18 GB/s [70%]    | 17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES | 2.5 Gflop/J | 9.5 Gflop/s [99%], 1.3 GB/s [40%]  | 3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU   | 2.2 Gflop/J | 16 Gflop/s [58%], 3.9 GB/s [31%]   | 5.5 W (const) + 2.0 W (cap) [97%]
APU CPU       | 650 Mflop/J | 13 Gflop/s [98%], 3.3 GB/s [31%]   | 20 W (const) + 1.4 W (cap) [98%]
Desktop CPU   | 620 Mflop/J | 99 Gflop/s [93%], 19 GB/s [74%]    | 120 W (const) + 44 W (cap) [99%]
SLIDES 95-97

Example: Caps imply throttling!

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

Throttling factors, i.e., allowable slowdown:

T = max( W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ )
⇓
τ̃_flop ≡ T/W ≡ τ_flop·s_flop,  τ̃_mem ≡ T/Q ≡ τ_mem·s_mem
⇓
s_flop ≡ max{ 1, B_τ/I, (ε_flop/τ_flop)/Δπ · (1 + B_ε/I) },  s_mem ≡ s_flop · I/B_τ
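The throttling factors can be computed directly from the expressions above. In the sketch below, ε_flop/τ_flop is passed in as a single peak-power-per-flop parameter; all values are illustrative assumptions:

```python
def throttle_factors(intensity, b_tau, b_eps, pi_flop, delta_pi):
    """Slowdown factors implied by a power cap:
    s_flop = max(1, B_tau/I, (pi_flop/delta_pi) * (1 + B_eps/I)),
    s_mem  = s_flop * I / B_tau,
    where pi_flop = eps_flop / tau_flop is peak power per flop."""
    s_flop = max(1.0,
                 b_tau / intensity,
                 (pi_flop / delta_pi) * (1.0 + b_eps / intensity))
    s_mem = s_flop * intensity / b_tau
    return s_flop, s_mem

# Illustrative: B_tau = 2, B_eps = 4, pi_flop = 5 W, cap delta_pi = 2.5 W.
s_flop, s_mem = throttle_factors(4.0, 2.0, 4.0, 5.0, 2.5)
print(s_flop, s_mem)  # cap-bound: both flops and mops must slow down
```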
slide-98
SLIDE 98

[Figure: throttling coefficients for flops and mops vs. intensity (single-precision FLOP:Byte), NVIDIA GTX 580 (GF100, Fermi).]
slide-99
SLIDE 99

[Figure: twelve per-device panels of flops/time (Gflop/s) vs. intensity (single-precision flop:Byte), with “Cap,” “Memory,” and “Compute” lines as before. Fitted parameters per device, now with memory energy efficiency (GB/J):]

GTX Titan     | 16 Gflop/J, 1.3 GB/J  | 4.0 Tflop/s [81%], 240 GB/s [83%]  | 120 W (const) + 160 W (cap) [99%]
GTX 680       | 15 Gflop/J, 1.2 GB/J  | 3.0 Tflop/s [86%], 160 GB/s [82%]  | 66 W (const) + 140 W (cap) [100%]
Xeon Phi      | 11 Gflop/J, 880 MB/J  | 2.0 Tflop/s [100%], 180 GB/s [57%] | 180 W (const) + 36 W (cap) [100%]
NUC GPU       | 8.8 Gflop/J, 670 MB/J | 270 Gflop/s [100%], 15 GB/s [60%]  | 10 W (const) + 18 W (cap) [91%]
Arndale GPU   | 8.1 Gflop/J, 1.5 GB/J | 33 Gflop/s [46%], 8.4 GB/s [66%]   | 1.3 W (const) + 4.8 W (cap) [88%]
APU GPU       | 6.4 Gflop/J, 470 MB/J | 100 Gflop/s [95%], 8.7 GB/s [81%]  | 16 W (const) + 3.2 W (cap) [100%]
GTX 580       | 5.3 Gflop/J, 810 MB/J | 1.4 Tflop/s [88%], 170 GB/s [89%]  | 120 W (const) + 150 W (cap) [94%]
NUC CPU       | 3.2 Gflop/J, 750 MB/J | 56 Gflop/s [97%], 18 GB/s [70%]    | 17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES | 2.5 Gflop/J, 280 MB/J | 9.5 Gflop/s [99%], 1.3 GB/s [40%]  | 3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU   | 2.2 Gflop/J, 560 MB/J | 16 Gflop/s [58%], 3.9 GB/s [31%]   | 5.5 W (const) + 2.0 W (cap) [97%]
APU CPU       | 650 Mflop/J, 150 MB/J | 13 Gflop/s [98%], 3.3 GB/s [31%]   | 20 W (const) + 1.4 W (cap) [98%]
Desktop CPU   | 620 Mflop/J, 140 MB/J | 99 Gflop/s [93%], 19 GB/s [74%]    | 120 W (const) + 44 W (cap) [99%]
slide-100
SLIDE 100

[Figure: the same twelve devices and fitted parameters as the previous slide, panels now showing flops/energy (Gflop/J) vs. intensity (single-precision flop:Byte), ordered by memory energy efficiency (GB/J): Arndale GPU, GTX Titan, GTX 680, Xeon Phi, GTX 580, NUC CPU, NUC GPU, Arndale CPU, APU GPU, PandaBoard ES, APU CPU, Desktop CPU.]
slide-101
SLIDE 101

Caps imply throttling!

What power cap will obviate throttling? The balance gap dictates a sufficient (algorithm-independent) condition.

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

Peak power per flop (or mop):

π_flop ≡ ε_flop / τ_flop,  π_mem ≡ ε_mem / τ_mem
⇓
Δπ ≥ π_flop + π_mem = π_flop · (1 + B_ε / B_τ)
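The sufficient condition above is straightforward to evaluate, and checking that the additive and factored forms agree is a good sanity test. A sketch with the same illustrative (made-up) machine parameters as earlier:

```python
def min_usable_power(tau_flop, tau_mem, eps_flop, eps_mem):
    """Algorithm-independent usable-power cap that avoids throttling:
    delta_pi >= pi_flop + pi_mem, with pi_flop = eps_flop/tau_flop
    and pi_mem = eps_mem/tau_mem (peak power per flop / per mop)."""
    return eps_flop / tau_flop + eps_mem / tau_mem

tau_flop, tau_mem = 1e-11, 2e-11
eps_flop, eps_mem = 5e-11, 2e-10

bound = min_usable_power(tau_flop, tau_mem, eps_flop, eps_mem)

# Factored form: pi_flop * (1 + B_eps / B_tau) must give the same bound.
pi_flop = eps_flop / tau_flop
b_eps = eps_mem / eps_flop
b_tau = tau_mem / tau_flop
bound_factored = pi_flop * (1.0 + b_eps / b_tau)
print(bound, bound_factored)  # identical
```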