Generalized roofline analysis?
Jee Choi ∙ Marat Dukhan ∙ Richard (Rich) Vuduc October 2, 2013 Dagstuhl Seminar 13401: Automatic Application Autotuning for HPC Architectures Follow along at hpcgarage.org/13401
Wednesday, October 2, 13
[Figure (scanned from Hockney & Curington): parameters for the case of a combined memory I/O and arithmetic pipeline, when the I/O and arithmetic can be overlapped. Full lines: parameters when arithmetic dominates, equations (14b); dotted lines: parameters when I/O dominates, equations (13b). Notation as Fig. 1, and σ = 1.5.]
In order to find out whether I/O or arithmetic dominates, one must examine the breakeven vector length, n1, at which I/O and arithmetic take equal times. This occurs when t1 = t2, whence

n1 = (n½^(m) − z·n½^(a)) / (z − 1)    (15)

where z = f·r∞^(m)/r∞^(a). The variation of n1 with z is drawn in Fig. 3 for the case σ = n½^(m)/n½^(a) = 1.5, and the regions of the (z, n1)-plane corresponding to I/O or arithmetic dominance are shown. If z > σ, the arithmetic time dominates for all vector lengths, because n1 is negative. Equations (14) apply, and the asymptotic performance is constant and equal to the r∞^(a) of the arithmetic pipeline. Since this is the situation when f → ∞, we have, by definition, the peak performance

r̂∞ = r∞^(a),    (16a)

the same as for the sequential I/O case. If, however, z < 1, I/O dominates for all vector lengths (because n1 is again negative) and equations (13) apply. The total computation time for f vector operations is constant, hence the asymptotic performance rises linearly with f, reaching the peak performance r̂∞ = r∞^(a) when f = r∞^(a)/r∞^(m). The asymptotic performance reaches half the peak performance when f reaches half this value, hence by definition

f½ = ½·r∞^(a)/r∞^(m).    (16b)

Thus overlapping halves the value of f½ from that obtained for sequential I/O. Between 1 < z < σ, either I/O or arithmetic may dominate depending on the vector length (because, now, n1 is positive). Figure 2 shows that if I/O dominates (n < n1) the asymptotic performance can exceed the asymptotic performance of the arithmetic pipeline, r∞^(a); it might appear that this is absurd and against physical intuition. However, this is not the case, because it is a theoretical asymptotic performance (for n → ∞) which in this case can never be reached.
R.W. Hockney and I.J. Curington (1989). “f½: A parameter to characterize memory and communication bottlenecks.” doi: 10.1016/0167-8191(89)90100-2
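The case analysis in the excerpt can be checked numerically. Below is a minimal sketch of equation (15) and the three dominance regimes; the values of n½^(m), n½^(a), and z are illustrative assumptions (chosen so that σ = 1.5, as in the excerpt's Fig. 3), not data from the paper.

```python
# Sketch of the Hockney-Curington overlapped-I/O breakeven analysis.
# Symbols follow the excerpt above; numeric inputs are illustrative.

def breakeven_length(n_half_m, n_half_a, z):
    """Breakeven vector length n1 = (n_half^(m) - z*n_half^(a)) / (z - 1), eq. (15)."""
    return (n_half_m - z * n_half_a) / (z - 1.0)

def regime(n_half_m, n_half_a, z):
    """Which pipeline dominates, per the sign of n1."""
    sigma = n_half_m / n_half_a
    if z > sigma:
        return "arithmetic dominates"   # n1 < 0; equations (14) apply
    if z < 1:
        return "I/O dominates"          # n1 < 0; equations (13) apply
    return "depends on vector length"   # 1 < z < sigma; n1 > 0

# sigma = 1.5, as in the excerpt's Fig. 3
print(regime(1.5, 1.0, 2.0))   # arithmetic dominates
print(regime(1.5, 1.0, 0.5))   # I/O dominates
print(regime(1.5, 1.0, 1.2))   # depends on vector length
```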
[Figure: roofline for an Intel Xeon (Clovertown): GFlops/s vs. operational intensity (flops/byte). Ceilings: peak DP, +balanced mul/add, +SIMD, +ILP, TLP only; peak stream bandwidth, snoop filter effective/ineffective. Kernels: LBMHD, FFT (512³), FFT (128³), Stencil.]
“Roofline: An insightful visual performance model for multicore architectures.” doi: 10.1145/1498765.1498785
Applying ideas of R. Numrich, we can identify a unitless (dimensionless) manifold, with coordinates u(intensity, balance), v(intensity, power), and f(energy-efficiency), and use differential geometry to estimate the "distance" between systems.
Numrich, R. W. (2010). “Computer performance analysis and the Pi Theorem.” Comp. Sci. R&D. doi:10.1007/s00450-010-0147-8
There are many ways to "generalize" the roofline: metrics beyond time, e.g., energy and power; intrinsic algorithmic properties beyond "flop:byte"; hypothetical architectural features; among others. Example: time, energy, and power of an abstract computation.
Jee Choi doing actual science
von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

[Diagram: an xPU with fast memory (total size = Z) connected to slow memory; the xPU executes W (fl)ops at τ_flop = time/flop and moves Q mops at τ_mem = time/mop.]

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)
I ≡ W/Q = intensity (flop:mop)
τ_flop ≡ time per (fl)op
τ_mem ≡ time per mop
B_τ ≡ τ_mem/τ_flop = balance (flop:mop)
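The intensity definition is easy to make concrete. Below, a small sketch computes I for two textbook kernels; the operation and traffic counts are the standard idealized ones (an assumption: perfect reuse of a fast-memory working set of size Z, no write-allocate traffic), and the machine balance B_τ = 8 is hypothetical.

```python
# Intensity I = W/Q for two kernels, using the definitions above.

def intensity(W, Q):
    """I = W / Q, in flop : mop."""
    return W / Q

# dot product of length n: 2n flops, 2n loads -> I = 1, independent of n
n = 10**6
I_dot = intensity(2 * n, 2 * n)

# n x n matrix multiply, blocked for a fast memory of size Z:
# W = 2n^3, Q ~ n^3 / sqrt(Z/3) -> I ~ 2*sqrt(Z/3), growing with Z
Z = 3 * 64**2
I_mm = intensity(2.0 * n**3, n**3 / (Z / 3) ** 0.5)

B_tau = 8.0  # hypothetical machine balance, flop : mop
print(I_dot, "memory-bound" if I_dot < B_tau else "compute-bound")
print(I_mm, "memory-bound" if I_mm < B_tau else "compute-bound")
```

This is the classic contrast: dot products have fixed, low intensity, while blocked matrix multiply can raise its intensity with the fast-memory size Z.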
"Roofline" — Williams et al. (Comm. ACM '09)

[Figure: relative performance (GFLOP/s) vs. intensity (FLOP:Byte); balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). The balance point (flop:byte) separates a memory-(bandwidth-)bound region at low intensity from a compute-bound region at high intensity. Example kernels, in order of increasing intensity: sparse matvec and stencils; FFTs; dense matrix multiply.]
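The roofline on this slide is just the pointwise minimum of two ceilings. A minimal sketch, with assumed (not the slide's) peak and bandwidth values:

```python
# Minimal roofline in the sense of Williams et al.: attainable performance
# is the lesser of peak compute and bandwidth x intensity.

def roofline(intensity, peak_flops, peak_bw):
    """Attainable flop/s at a given intensity (flop/byte)."""
    return min(peak_flops, peak_bw * intensity)

peak = 1.0e12   # 1 Tflop/s (assumed)
bw = 100.0e9    # 100 GB/s  (assumed)
ridge = peak / bw  # balance ("ridge") point, flop:byte

print(ridge)                        # 10.0
print(roofline(1.0, peak, bw))      # bandwidth-bound: 1e11 flop/s
print(roofline(64.0, peak, bw))     # compute-bound: 1e12 flop/s
```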
[Figure: the same Fermi roofline alongside Williams et al.'s Intel Xeon (Clovertown) roofline, with the regions relabeled "memory (bandwidth) bound in time" and "compute bound in time".]
[Figure: time (GFLOP/s) and energy (GFLOP/J) rooflines vs. intensity (FLOP:Byte); balance estimates for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). The time balance sits near 3.6 flop:byte; the energy balance (flop:mop), near 14 flop:byte, separates a memory-(bandwidth-)bound-in-energy region from a compute-bound-in-energy region.]
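The two normalized curves behind a slide like this have different shapes: time hits a hard roofline corner, while energy approaches its peak smoothly. A sketch using the slide's Fermi balance estimates (3.6 and 14 flop:byte); the functional forms are the model's normalized efficiencies:

```python
# Normalized time efficiency follows 1/max(1, B_tau/I) (hard corner);
# normalized energy efficiency follows 1/(1 + B_eps/I) (smooth approach).

B_tau, B_eps = 3.6, 14.0  # time and energy balances (flop:byte), per the slide

def time_eff(I):
    return 1.0 / max(1.0, B_tau / I)

def energy_eff(I):
    return 1.0 / (1.0 + B_eps / I)

for I in (1.0, 3.6, 14.0, 128.0):
    print(I, round(time_eff(I), 3), round(energy_eff(I), 3))
```

Note that time efficiency is exactly 1 at and beyond the time balance, while energy efficiency is only 1/2 at the energy balance and never quite reaches 1.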
"Power line" (average power)

[Figure: power, relative to flop-power, vs. intensity (flop:byte); balance estimates for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011); the time balance (~3.6) and energy balance (~14) flop:byte are marked.]
For real systems, the model should account for "constant power" and "power capping."

[Figure: powerline and roofline, normalized value vs. intensity (flop:byte), with the constant-power term labeled π0.]
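A "power line" follows directly from the model as average power P = E/T, with a constant term and a cap bolted on. The sketch below uses the slide's Fermi balances but otherwise assumed unit costs (ε_flop, τ_flop, π0, and the cap are all made up):

```python
# Average power P = E/T from the balance model, plus constant power and a cap.

eps_flop, tau_flop = 1.0, 1.0   # energy/flop, time/flop (normalized units)
B_tau, B_eps = 3.6, 14.0        # balances (flop:byte), per the slide
pi0, P_cap = 0.5, 3.0           # constant power and cap (in flop-power units)

def avg_power(I):
    E_per_flop = eps_flop * (1.0 + B_eps / I)    # E / W
    T_per_flop = tau_flop * max(1.0, B_tau / I)  # T / W
    return min(E_per_flop / T_per_flop + pi0, P_cap)

# Without the cap, power peaks near the time balance, where both the
# arithmetic and memory units are kept busy; the cap clips that peak.
print(avg_power(0.5), avg_power(3.6), avg_power(512.0))
```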
"Desktop GPU" (NVIDIA GTX Titan): 4.0 Tflop/s, 16 Gflop/J, 290 W.
"Mobile GPU" (Samsung/ARM Arndale): 33 Gflop/s, 8.1 Gflop/J, 7.0 W.

[Figure: time and energy efficiency (higher is better) vs. intensity (single-precision flop:Byte) for the GTX Titan and the Arndale GPU.]
So what?

Possibility 1: A "first principles" view of time & energy in systems.
W·ε_flop/E = 1/(1 + B_ε/I)

[Figure: time and energy rooflines for the Fermi, as above, annotated with the time-energy balance gap between ~3.6 and ~14 flop:byte.]

The time-energy balance gap: what does this imply? A kernel can be compute-bound in time but memory-bound in energy. Is optimizing for energy harder than optimizing for time? Energy-efficiency likely implies time-efficiency, but not vice-versa, breaking "race-to-halt."
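The "balance gap" claim is a simple interval test: whenever B_ε > B_τ, any intensity strictly between the two balances is compute-bound in time yet memory-bound in energy. A sketch with the slide's Fermi estimates:

```python
# Time-energy balance gap: classify a kernel's bound in each metric.

B_tau, B_eps = 3.6, 14.0  # flop:byte, per the slide

def bound_in_time(I):
    return "compute" if I >= B_tau else "memory"

def bound_in_energy(I):
    return "compute" if I >= B_eps else "memory"

for I in (2.0, 8.0, 32.0):
    print(I, bound_in_time(I), bound_in_energy(I))
# I = 8 sits in the gap: compute-bound in time, memory-bound in energy.
```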
So what?

Possibility 2: Another view of power caps and throttling.
Measured devices (single precision), sorted by energy efficiency; each panel plots normalized power vs. intensity (single-precision flop:Byte), with cap, memory, and compute power components marked:

GTX Titan ("Desktop GPU"): 16 Gflop/J; 4.0 Tflop/s [81%], 240 GB/s [83%]; 120 W (const) + 160 W (cap) [99%]
GTX 680 ("Desktop GPU"): 15 Gflop/J; 3.0 Tflop/s [86%], 160 GB/s [82%]; 66 W (const) + 140 W (cap) [100%]
Xeon Phi ("Desktop xPU"): 11 Gflop/J; 2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
NUC GPU ("Mobile GPU"): 8.8 Gflop/J; 270 Gflop/s [100%], 15 GB/s [60%]; 10 W (const) + 18 W (cap) [91%]
Arndale GPU ("Mobile GPU"): 8.1 Gflop/J; 33 Gflop/s [46%], 8.4 GB/s [66%]; 1.3 W (const) + 4.8 W (cap) [88%]
APU GPU: 6.4 Gflop/J; 100 Gflop/s [95%], 8.7 GB/s [81%]; 16 W (const) + 3.2 W (cap) [100%]
GTX 580 ("Desktop GPU"): 5.4 Gflop/J; 1.4 Tflop/s [88%], 170 GB/s [89%]; 120 W (const) + 150 W (cap) [94%]
NUC CPU ("Mobile CPU"): 3.2 Gflop/J; 56 Gflop/s [97%], 18 GB/s [70%]; 17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES ("Mobile CPU"): 2.5 Gflop/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%]; 3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU ("Mobile CPU"): 2.2 Gflop/J; 16 Gflop/s [58%], 3.9 GB/s [31%]; 5.5 W (const) + 2.0 W (cap) [97%]
APU CPU: 650 Mflop/J; 13 Gflop/s [98%], 3.3 GB/s [31%]; 20 W (const) + 1.4 W (cap) [98%]
Desktop CPU: 620 Mflop/J; 99 Gflop/s [93%], 19 GB/s [74%]; 120 W (const) + 44 W (cap) [99%]
[Figure: throttling coefficient vs. intensity (single-precision FLOP:Byte) for flops and mops, NVIDIA GTX 580 (GF100; Fermi).]
So what?

Possibility 3: Abstract algorithm analysis.
Abstract work-communication trade-offs

Algorithm 1 = (W, Q) versus Algorithm 2 = (f·W, Q/m), with I ≡ W/Q.
Speedup: ΔT = T_{1,1}/T_{f,m}. "Greenup": ΔE = E_{1,1}/E_{f,m}.

This yields a general "greenup" condition.
[Figure: time and energy rooflines for the Fermi, as above, with Algorithm 1 marked at its intensity.]

ΔE > 1 ⟹ f < 1 + (B_ε/I)·(1 − 1/m)

(Reading off the energy model: ΔE = (1 + B_ε/I) / (f + B_ε/(m·I)), so the trade-off saves energy only if the extra work f stays below this bound.)
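The greenup condition can be derived directly from the energy model E = W·ε_flop·(1 + B_ε/I), ignoring constant power (an assumption), for Algorithm 1 = (W, Q) versus Algorithm 2 = (f·W, Q/m). A sketch with illustrative numbers:

```python
# Greenup check: E_{1,1} = W*eps_flop*(1 + B_eps/I),
# E_{f,m} = f*W*eps_flop + (Q/m)*eps_mem = W*eps_flop*(f + B_eps/(m*I)),
# so Delta_E = (1 + B_eps/I) / (f + B_eps/(m*I)),
# and Delta_E > 1 iff f < 1 + (B_eps/I)*(1 - 1/m).

def greenup(f, m, I, B_eps):
    return (1.0 + B_eps / I) / (f + B_eps / (m * I))

def greenup_bound(m, I, B_eps):
    """Largest work-inflation factor f that still saves energy."""
    return 1.0 + (B_eps / I) * (1.0 - 1.0 / m)

B_eps, I = 14.0, 2.0       # energy balance and kernel intensity (illustrative)
f, m = 2.0, 8.0            # 2x the flops in exchange for 8x less traffic
print(greenup(f, m, I, B_eps) > 1.0)   # True: a net energy win
print(greenup_bound(m, I, B_eps))      # 7.125
```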
So what?

Possibility 4: Abstract architectural bake-offs.
(Image: http://stickofachef.files.wordpress.com/2008/03/iron_chef.jpg)
[Figure: flop/time, flop/energy, and power vs. intensity (single-precision flop:Byte), relative value, for the GTX Titan and the Arndale GPU. Annotated ratios (Titan : Arndale): ~122× in flop/time at high intensity, ~28× at low intensity, ~47× in power, and ~2× in flop/energy.]

Now add a hypothetical "47 × Arndale GPU," scaled to match the Titan's power: it is only ~0.4× as fast where compute-bound, but ~1.6× faster where bandwidth-bound, and that's at the same power.
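These ratios can be reproduced from the headline specs quoted on the earlier slides (Titan: 4.0 Tflop/s, 240 GB/s, 290 W; Arndale GPU: 33 Gflop/s, 8.4 GB/s, 7.0 W); treat this as back-of-envelope arithmetic, not measurement:

```python
# Bake-off arithmetic: desktop GPU vs. a power-matched pile of mobile GPUs.

titan = {"flops": 4000e9, "bw": 240e9, "power": 290.0}
arndale = {"flops": 33e9, "bw": 8.4e9, "power": 7.0}

n = 47  # replicas of the mobile GPU, roughly matching the Titan's power
cluster = {k: n * arndale[k] for k in ("flops", "bw")}

print(round(titan["flops"] / arndale["flops"]))     # ~121x single-device compute
print(round(titan["bw"] / arndale["bw"]))           # single-device bandwidth ratio
print(cluster["flops"] / titan["flops"])            # < 1: slower when compute-bound
print(cluster["bw"] / titan["bw"])                  # > 1: faster when bandwidth-bound
```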
๏ What are the intrinsic relationships among time, energy, and power?
๏ What do these relationships say about algorithms and software?
๏ About architectures, hardware, and co-design?
๏ Can {roof,arch,power,…}-lines, suitably refined, guide autotuning systems?
[Figure: the time and energy rooflines, with a "?".]

[Figure: energy balance vs. time balance (single-precision flop:byte) for many processors (Ivy Bridge, Kepler, Fermi, ARM Cortex-A9, ARM Cortex-A15, ARM Mali T-604, Bobcat, Zacate GPU, KNC), grouped as mini, mobile, and desktop, and as xPUs vs. GPUs.]
The team: Jee Whan Choi (autotuning for power and energy); Aparna Chandramowlishwaran [now R.S. @ MIT] (fast multipole method); Marat Dukhan (math libraries & machine learning); Kent Czechowski (co-design).

2010 Gordon Bell Prize (+ G. Biros); 2010 IPDPS Best Paper (+ K. Knobe, Intel CnC lead); 2012 SIAM Data Mining Best Paper (D. Lee [GE Research] + A. Gray).

See our recent 2013 IPDPS papers, posted at hpcgarage.org/ppam13:
* Jee Choi — "A roofline model of energy."
* Kent Czechowski — "A theoretical framework for algorithm-architecture co-design."
von Neumann-like system

[Diagram: an xPU with fast memory (total size = Z) connected to slow memory; W (fl)ops at τ_flop = time/flop, Q mops at τ_mem = time/mop.]

W ≡ # (fl)ops; Q ≡ # mem. ops (mops) = Q(Z); I ≡ W/Q.

Minimum time:
T = max(W·τ_flop, Q·τ_mem)
  = W·τ_flop · max(1, (Q/W)·(τ_mem/τ_flop))
  = W·τ_flop · max(1, B_τ/I), where I is the intensity (flop:mop) and B_τ the balance (flop:mop).

An energy analogue, with ε_flop = energy/flop and ε_mem = energy/mop:
E = W·ε_flop + Q·ε_mem
  = W·ε_flop · (1 + B_ε/I), where B_ε is the energy balance (flop:mop).

Consider the normalized efficiencies W·τ_flop/T and W·ε_flop/E.

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

hpcgarage.org/modsim13
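The time and energy expressions above transcribe directly into code. A minimal sketch; the τ and ε unit costs are made up, chosen only so that B_τ and B_ε differ:

```python
# Direct transcription of T = W*tau_flop*max(1, B_tau/I) and
# E = W*eps_flop*(1 + B_eps/I), checked against the unfactored forms.

def model(W, Q, tau_flop, tau_mem, eps_flop, eps_mem):
    I = W / Q
    B_tau = tau_mem / tau_flop
    B_eps = eps_mem / eps_flop
    T = W * tau_flop * max(1.0, B_tau / I)
    E = W * eps_flop * (1.0 + B_eps / I)
    # identical to the unfactored forms:
    assert abs(T - max(W * tau_flop, Q * tau_mem)) < 1e-9 * T
    assert abs(E - (W * eps_flop + Q * eps_mem)) < 1e-9 * E
    return T, E, W * tau_flop / T, W * eps_flop / E

T, E, t_eff, e_eff = model(W=1e9, Q=1e8, tau_flop=1.0, tau_mem=4.0,
                           eps_flop=1.0, eps_mem=20.0)
# Here I = 10, B_tau = 4, B_eps = 20: compute-bound in time (t_eff = 1.0)
# but not in energy (e_eff = 1/3).
print(t_eff, e_eff)
```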
"Roofline" — Williams et al. (Comm. ACM '09): W·τ_flop/T = 1/max(1, B_τ/I).

[Figure: relative performance vs. intensity (FLOP:Byte); balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). Below the balance point: memory-(bandwidth-)bound in time (sparse matvec, stencils; FFTs); above it: compute-bound in time (dense matrix multiply).]
The energy analogue: W·ε_flop/E = 1/(1 + B_ε/I).

[Figure: time (GFLOP/s) and energy (GFLOP/J) curves vs. intensity (FLOP:Byte); balance estimates for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). Below the energy balance: memory-(bandwidth-)bound in energy; above it: compute-bound in energy.]
That was theory. What happens in practice? ⇒ We cannot ignore constant power. Let's add it to our model and measure.
Constant power

[Diagram: the same system, with a constant power term π0 added.]

T is unchanged: T = max(W·τ_flop, Q·τ_mem) = W·τ_flop · max(1, B_τ/I).
E = W·ε_flop + Q·ε_mem + T·π0
  = W·ε_flop · (1 + B_ε/I + (π0/ε_flop)·(T/W))
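The constant-power term changes where the energy balance effectively sits. A sketch, under assumed unit costs, defining the "effective" energy balance as the smallest intensity at which the flop energy matches all other energy (memory plus constant); with π0 = 0 this recovers B_ε:

```python
# Effect of constant power pi0 on the effective energy balance.
# All unit costs below are assumptions, not measurements.

tau_flop, tau_mem = 1.0, 4.0
eps_flop, eps_mem = 1.0, 20.0
B_tau, B_eps = tau_mem / tau_flop, eps_mem / eps_flop

def non_flop_energy_per_flop(I, pi0):
    T_per_flop = tau_flop * max(1.0, B_tau / I)  # T / W
    return eps_flop * B_eps / I + pi0 * T_per_flop

def effective_energy_balance(pi0):
    """Smallest grid intensity where flop energy >= all other energy."""
    I = 2.0 ** -10
    while non_flop_energy_per_flop(I, pi0) > eps_flop:
        I *= 1.01  # 1% geometric search grid
    return I

print(effective_energy_balance(0.0))   # ~B_eps = 20: the original balance
print(effective_energy_balance(0.5))   # larger: constant power shifts the balance
```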
[Figure: measured time and energy vs. intensity (FLOP:Byte), normalized performance. NVIDIA GTX 580, double precision: 197 GFLOP/s, 1.23 GFLOP/J. Intel i7-950, double precision: 53.3 GFLOP/s, 0.316 GFLOP/J.]

A microbenchmark study: "GPU energy" includes all GPU card components (card, memory, fan) but excludes the host; same for the CPU, but without the GPU card. Measurements use "PowerMon" (Bedard & Fowler '09) with sub-millisecond sampling.
Constant power can shift the energy balance …
… and "race-to-halt" is an artifact of this shift.
Measured power

[Plots: power (25–300 W) vs. intensity (FLOP:Byte). GTX 580: 84.0 W, 160 W, 212 W, 288 W; fit 1.72 assuming static=0 vs. 0.890 otherwise. i7-950: 108 W, 134 W, 169 W, 195 W; fit 0.896 (static=0) vs. 1.01.]
Source of error?
Power capping is critical. It's also easy to add.
What might a power cap look like?

[Plot: Powerline and Roofline curves, normalized value vs. intensity (flop:byte)]
Adding a power cap

π_0: constant power
π_0 + Δπ: max power

T_free = max(W·τ_flop, Q·τ_mem)
⇓
T = max(W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ)

[Plot: Powerline and Roofline curves, normalized value vs. intensity (flop:byte)]
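A minimal sketch of the capped model (Python; hypothetical parameter names, illustrative values):

```python
def capped_time(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, dpi):
    """Minimum time under a power cap: average dynamic power
    (W*eps_flop + Q*eps_mem) / T may not exceed the cap dpi,
    which bounds T from below by the dynamic-energy / cap ratio."""
    T_free = max(W * tau_flop, Q * tau_mem)        # uncapped roofline time
    T_cap = (W * eps_flop + Q * eps_mem) / dpi     # time forced by the cap
    return max(T_free, T_cap)

# Illustrative: a tight cap stretches T beyond the uncapped roofline time.
W, Q = 1e9, 1e9
print(capped_time(W, Q, 1e-11, 4e-11, 1e-10, 1e-9, 5.0))    # cap binds
print(capped_time(W, Q, 1e-11, 4e-11, 1e-10, 1e-9, 100.0))  # roofline binds
```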
GTX 580 (GF100 Fermi), with the cap included: Time 1.6 TFLOP/s at y=1; Energy 5.7 GFLOP/J at y=1; Power 280 Watts at y=1.

[Plots: normalized performance vs. intensity (single-precision FLOP:Byte), Time / Energy / Power panels]

A better fit!
The model suggests a structure in the time, energy, and power relationships. It also facilitates analysis.
Fitted model parameters across twelve platforms (single precision), plotted as normalized power vs. intensity (flop:Byte), panels in decreasing Gflop/J. Each panel shows Cap, Memory, and Compute regimes (NUC GPU lacks a Compute regime; NUC CPU and APU CPU lack a Memory regime):

GTX Titan:     16 Gflop/J;  4.0 Tflop/s [81%], 240 GB/s [83%];  120 W (const) + 160 W (cap) [99%]
GTX 680:       15 Gflop/J;  3.0 Tflop/s [86%], 160 GB/s [82%];  66 W (const) + 140 W (cap) [100%]
Xeon Phi:      11 Gflop/J;  2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
NUC GPU:       8.8 Gflop/J; 270 Gflop/s [100%], 15 GB/s [60%];  10 W (const) + 18 W (cap) [91%]
Arndale GPU:   8.1 Gflop/J; 33 Gflop/s [46%], 8.4 GB/s [66%];   1.3 W (const) + 4.8 W (cap) [88%]
APU GPU:       6.4 Gflop/J; 100 Gflop/s [95%], 8.7 GB/s [81%];  16 W (const) + 3.2 W (cap) [100%]
GTX 580:       5.4 Gflop/J; 1.4 Tflop/s [88%], 170 GB/s [89%];  120 W (const) + 150 W (cap) [94%]
NUC CPU:       3.2 Gflop/J; 56 Gflop/s [97%], 18 GB/s [70%];    17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES: 2.5 Gflop/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%];  3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU:   2.2 Gflop/J; 16 Gflop/s [58%], 3.9 GB/s [31%];   5.5 W (const) + 2.0 W (cap) [97%]
APU CPU:       650 Mflop/J; 13 Gflop/s [98%], 3.3 GB/s [31%];   20 W (const) + 1.4 W (cap) [98%]
Desktop CPU:   620 Mflop/J; 99 Gflop/s [93%], 19 GB/s [74%];    120 W (const) + 44 W (cap) [99%]
Example: Caps imply throttling!

π_0: constant power
π_0 + Δπ: max power

Throttling factors, i.e., the allowable slowdown:

T = max(W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ)
⇓
τ̃_flop ≡ T/W ≡ τ_flop·s_flop
τ̃_mem ≡ T/Q ≡ τ_mem·s_mem
⇓
s_flop ≡ max{ 1, B_τ/I, (ε_flop/τ_flop)/Δπ · (1 + B_ε/I) }
s_mem ≡ s_flop · I/B_τ
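The slowdown formulas can be evaluated directly (Python sketch; parameter values are illustrative, not the fitted GTX 580 values):

```python
def throttling_factors(I, B_tau, B_eps, tau_flop, eps_flop, dpi):
    """Throttling (slowdown) factors implied by a dynamic-power cap dpi,
    following s_flop = max(1, B_tau/I, (eps_flop/tau_flop)/dpi * (1 + B_eps/I))."""
    s_flop = max(1.0, B_tau / I,
                 (eps_flop / tau_flop) / dpi * (1.0 + B_eps / I))
    s_mem = s_flop * I / B_tau    # mops slow down in proportion
    return s_flop, s_mem

# Illustrative machine: B_tau = 4 flop:byte, B_eps = 10 flop:byte,
# peak flop power eps_flop/tau_flop = 10 W, cap dpi = 20 W.
for I in (1.0, 4.0, 16.0):
    s_flop, s_mem = throttling_factors(I, 4.0, 10.0, 1e-11, 1e-10, 20.0)
    print(f"I = {I:5.1f}  s_flop = {s_flop:.3f}  s_mem = {s_mem:.3f}")
```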
[Plot: throttling coefficients for flops and mops vs. intensity (single-precision FLOP:Byte), NVIDIA GTX 580 (GF100, Fermi)]
The same fit, plotted as Flops/Time [Gflop/s] vs. intensity (single-precision flop:Byte), now with memory energy efficiency listed as well. Each panel shows Cap, Memory, and Compute regimes (NUC GPU lacks Compute; NUC CPU and APU CPU lack Memory):

GTX Titan:     16 Gflop/J, 1.3 GB/J;  4.0 Tflop/s [81%], 240 GB/s [83%];  120 W (const) + 160 W (cap) [99%]
GTX 680:       15 Gflop/J, 1.2 GB/J;  3.0 Tflop/s [86%], 160 GB/s [82%];  66 W (const) + 140 W (cap) [100%]
Xeon Phi:      11 Gflop/J, 880 MB/J;  2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
NUC GPU:       8.8 Gflop/J, 670 MB/J; 270 Gflop/s [100%], 15 GB/s [60%];  10 W (const) + 18 W (cap) [91%]
Arndale GPU:   8.1 Gflop/J, 1.5 GB/J; 33 Gflop/s [46%], 8.4 GB/s [66%];   1.3 W (const) + 4.8 W (cap) [88%]
APU GPU:       6.4 Gflop/J, 470 MB/J; 100 Gflop/s [95%], 8.7 GB/s [81%];  16 W (const) + 3.2 W (cap) [100%]
GTX 580:       5.3 Gflop/J, 810 MB/J; 1.4 Tflop/s [88%], 170 GB/s [89%];  120 W (const) + 150 W (cap) [94%]
NUC CPU:       3.2 Gflop/J, 750 MB/J; 56 Gflop/s [97%], 18 GB/s [70%];    17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES: 2.5 Gflop/J, 280 MB/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%];  3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU:   2.2 Gflop/J, 560 MB/J; 16 Gflop/s [58%], 3.9 GB/s [31%];   5.5 W (const) + 2.0 W (cap) [97%]
APU CPU:       650 Mflop/J, 150 MB/J; 13 Gflop/s [98%], 3.3 GB/s [31%];   20 W (const) + 1.4 W (cap) [98%]
Desktop CPU:   620 Mflop/J, 140 MB/J; 99 Gflop/s [93%], 19 GB/s [74%];    120 W (const) + 44 W (cap) [99%]
[Plot: the same twelve platforms plotted as Flops/Energy [Gflop/J] vs. intensity (single-precision flop:Byte), with panels ordered by bytes per joule: Arndale GPU (1.5 GB/J), GTX Titan (1.3 GB/J), GTX 680 (1.2 GB/J), Xeon Phi (880 MB/J), GTX 580 (810 MB/J), NUC CPU (750 MB/J), NUC GPU (670 MB/J), Arndale CPU (560 MB/J), APU GPU (470 MB/J), PandaBoard ES (280 MB/J), APU CPU (150 MB/J), Desktop CPU (140 MB/J)]
Caps imply throttling!

What power cap will obviate throttling? The balance gap dictates a sufficient (algorithm-independent) condition.

π_0: constant power
π_0 + Δπ: max power (peak power per flop, or mop)
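One way to read that condition (a numerical check derived from the throttling formula above, not a result stated on the slide in this form; parameter values are illustrative): the cap never throttles, at any intensity, once Δπ covers the peak flop power plus the peak mop power.

```python
# Check numerically: with dpi = pi_flop + pi_mem (peak power per flop plus
# peak power per mop), the cap term in s_flop never exceeds the uncapped
# slowdown max(1, B_tau / I). All parameter values are illustrative.
tau_flop, tau_mem = 1e-11, 4e-11
eps_flop, eps_mem = 1e-10, 1e-9

B_tau = tau_mem / tau_flop                       # time balance (flop:mop)
B_eps = eps_mem / eps_flop                       # energy balance
dpi = eps_flop / tau_flop + eps_mem / tau_mem    # pi_flop + pi_mem

for k in range(-8, 9):
    I = B_tau * 2.0 ** k
    s_free = max(1.0, B_tau / I)                 # slowdown without the cap
    s_cap = (eps_flop / tau_flop) / dpi * (1.0 + B_eps / I)
    assert s_cap <= s_free + 1e-9, (I, s_cap, s_free)
print("cap never binds: dpi >= pi_flop + pi_mem suffices")
```

Equality holds at I = B_τ, so this cap is also the smallest algorithm-independent choice under the model.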