

SLIDE 1

Generalized roofline analysis?

Jee Choi ∙ Marat Dukhan ∙ Richard (Rich) Vuduc
October 2, 2013
Dagstuhl Seminar 13401: Automatic Application Autotuning for HPC Architectures
Follow along at hpcgarage.org/13401

Wednesday, October 2, 13
SLIDE 2

[Scanned page (p. 282) from R.W. Hockney & I.J. Curington, “f½: A parameter to characterize memory and communication bottlenecks,” showing Fig. 2: “The variation of (r̂∞, n̂½) with f for the case of a combined memory I/O and arithmetic pipeline, when the I/O and arithmetic can be overlapped. Full lines: parameters when arithmetic dominates, equations (14b); dotted lines: parameters when I/O dominates, equations (13b). Notation as Fig. 1, and p = 1.5.”]

In order to find out whether I/O or arithmetic dominates, one must examine the breakeven vector length, n₁, at which I/O and arithmetic take equal times. This occurs when t₁ = t₂, whence

n₁ = (n½(m) − z·n½(a)) / (z − 1)   (15)

where z = f·r∞(m)/r∞(a).

The variation of n₁ with z is drawn in Fig. 3 for the case p = n½(m)/n½(a) = 1.5, and the regions of the (z, n₁)-plane corresponding to I/O or arithmetic dominance are shown. If z > p, the arithmetic time dominates for all vector lengths, because n₁ is negative. Equations (14) apply, and the asymptotic performance is constant and equal to the r∞(a) of the arithmetic pipeline. Since this is the situation when f → ∞ we have, by definition, the peak performance p̂∞ = r∞(a), (16a) the same as for the sequential I/O case. If, however, z < 1, I/O dominates for all vector lengths (because n₁ is again negative) and equations (13) apply. The total computation time for f vector operations is constant, hence the asymptotic performance rises linearly with f, reaching the peak performance p̂∞ = r∞(a) when f = r∞(a)/r∞(m). The asymptotic performance reaches half the peak performance when f reaches half this value, hence by definition

f½ = ½·r∞(a)/r∞(m).   (16b)

Thus overlapping halves the value of f½ from that obtained for sequential I/O. Between 1 < z < p, either I/O or arithmetic may dominate depending on the vector length (because, now, n₁ is positive). Figure 2 shows that if I/O dominates (n < n₁) the asymptotic performance r̂∞ can exceed the asymptotic performance of the arithmetic pipeline r∞(a), and it might appear that this is absurd and against physical intuition. However, this is not the case, because r̂∞ is a theoretical asymptotic performance (for n → ∞) which in this case can never be [attained].

R.W. Hockney and I.J. Curington (1989). “f½: A parameter to characterize memory and communication bottlenecks.” doi: 10.1016/0167-8191(89)90100-2

[Roofline plot, “(a) Intel Xeon (Clovertown)”: operational intensity (Flops/Byte, 1/16–16) vs. GFlops/s (1–128). Compute ceilings: peak DP, +balanced mul/add, +SIMD, +ILP, TLP only. Bandwidth ceilings: peak stream bandwidth, +snoop filter (effective vs. ineffective). Kernels shown: LBMHD, FFT (512³), FFT (128³), Stencil.]

S. Williams, A. Waterman, D. Patterson (2009). “Roofline: An insightful visual performance model for multicore architectures.” doi: 10.1145/1498765.1498785

SLIDE 3

[Scanned excerpt from Hockney & Curington (1989), Fig. 2, as on the previous slide.]

R.W. Hockney and I.J. Curington (1989). “f½: A parameter to characterize memory and communication bottlenecks.” doi: 10.1016/0167-8191(89)90100-2

[Intel Xeon (Clovertown) roofline plot, as on the previous slide.]

S. Williams, A. Waterman, D. Patterson (2009). “Roofline: An insightful visual performance model for multicore architectures.” doi: 10.1145/1498765.1498785

Rooflines provide insight into the limits on performance due to intrinsic properties of a computation as they relate to architectural features of a system.

SLIDE 4

Applying ideas of R. Numrich, we can identify a unitless (dimensionless) manifold of all possible systems in the model, and use differential geometry to estimate the “distance” between systems:

u = u(intensity, balance), v = v(intensity, power), f = f(energy-efficiency)

Numrich, R. W. (2010). “Computer performance analysis and the Pi Theorem.” Comp. Sci. R&D. doi:10.1007/s00450-010-0147-8

SLIDE 5

There are many ways to “generalize” the roofline: metrics beyond time, e.g., energy and power; intrinsic algorithmic properties beyond “flop:byte”; hypothetical architectural features; among others. Example: Time, energy, and power of an abstract computation.

[Photo: Jee Choi doing actual science.]

SLIDE 6

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

[Diagram: xPU with fast memory (total size = Z) and slow memory.]

SLIDE 7

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)
I ≡ W/Q = intensity (flop:mop)

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops on the xPU, Q mops to slow memory.]

SLIDE 8

von Neumann system

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)
I ≡ W/Q = intensity (flop:mop)
τflop ≡ time per (fl)op
τmem ≡ time per mop
Bτ ≡ τmem/τflop = balance (flop:mop)
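The definitions above translate directly into code. A minimal sketch (the kernel and machine numbers below are hypothetical, not taken from the talk):

```python
# Sketch of the slide's definitions. All numbers are hypothetical.

def intensity(W, Q):
    """I = W / Q: flops per memory operation (flop:mop)."""
    return W / Q

def time_balance(tau_mem, tau_flop):
    """B_tau = tau_mem / tau_flop: the machine's balance (flop:mop)."""
    return tau_mem / tau_flop

# Hypothetical kernel: 2e9 flops, 5e8 mops -> I = 4 flop:mop.
I = intensity(2e9, 5e8)
# Hypothetical machine: 1 ns per mop, 0.25 ns per flop -> B_tau = 4 flop:mop.
B = time_balance(1.0e-9, 0.25e-9)
# Here I == B: the kernel sits exactly at the machine's balance point.
print(I, B)  # 4.0 4.0
```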

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops at τflop = time/flop, Q mops at τmem = time/mop.]

SLIDE 9

[Roofline plot: intensity (FLOP:Byte, 1/2–128) vs. relative performance (1/32–1); the time balance, ≈ 3.6 flop:byte, marks the knee. Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

“Roofline” — Williams et al. (Comm. ACM ’09)
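The roofline itself is just the minimum of two ceilings: attainable GFLOP/s = min(peak, bandwidth × intensity). A minimal sketch; the default peak and bandwidth are illustrative Fermi-class stand-ins (their ratio gives a knee near the 3.6 flop:byte balance quoted on the slide), not figures from the talk:

```python
def attainable_gflops(intensity, peak_gflops=515.0, stream_gbs=144.0):
    """Roofline: attainable GFLOP/s = min(peak, bandwidth * intensity).

    Defaults are illustrative Fermi-class figures; their ratio,
    515/144 ~= 3.6 flop:byte, is the balance (the knee of the roof).
    """
    return min(peak_gflops, stream_gbs * intensity)

for I in (0.5, 3.6, 32.0):
    print(I, attainable_gflops(I))
# 0.5  -> 72.0   (memory-bandwidth bound)
# 3.6  -> 515.0  (at/above the knee: compute bound)
# 32.0 -> 515.0
```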

SLIDE 10

[Roofline plot as above, labeling the knee: Balance (flop : mop).]

“Roofline” — Williams et al. (Comm. ACM ’09)

SLIDE 11

[Roofline plot as above; right of the balance point: “Compute bound.”]

SLIDE 12

[Roofline plot as above; left of the balance point: “Memory (bandwidth) bound”; right: “Compute bound.”]

SLIDE 13

[Roofline plot as above, placing dense matrix multiply in the compute-bound region.]

SLIDE 14

[Roofline plot as above, adding sparse matvec and stencils in the memory-bound region.]

SLIDE 15

[Roofline plot as above, adding FFTs near the balance point.]

SLIDE 16

[Side by side: the Fermi roofline sketch (balance estimate per Keckler et al., IEEE Micro (2011)) and the measured Intel Xeon (Clovertown) roofline of Williams et al., with “Memory (bandwidth) bound” and “Compute bound” regions marked.]

SLIDE 17

[Roofline plot as above, relabeled: “Memory (bandwidth) bound in time” and “Compute bound in time.” Balance estimate for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 18

[“Arch line”: the time roofline (GFLOP/s, knee ≈ 3.6 flop:byte) overlaid with its energy analogue (GFLOP/J, knee ≈ 14 flop:byte), over intensity (FLOP:Byte, 1/2–128). Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 19

[Arch-line plot as above, labeling the energy knee: Energy balance (flop : mop).]

SLIDE 20

[Arch-line plot as above; left of the energy balance: “Memory (bandwidth) bound in energy”; right: “Compute bound in energy.”]

SLIDE 21

[“Power line”: intensity (flop:byte, 1–512) vs. average power relative to flop-power (0.5–8), with the time (≈ 3.6) and energy (≈ 14) balance points marked. Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 22

For real systems, the model should account for “constant power” and “power capping.”

[Measured “Powerline” and “Roofline” curves vs. intensity (flop:byte), values normalized.]

SLIDE 23

For real systems, the model should account for “constant power” and “power capping.”

[Powerline/Roofline plot as above, annotating π0: the constant term.]

SLIDE 24

For real systems, the model should account for “constant power” and “power capping.”

[Powerline/Roofline plot as above, annotating π0 (constant) and ∆π (cap).]

SLIDE 25

[Measured time and energy arch lines vs. intensity (single-precision flop:Byte):
“Desktop GPU” (NVIDIA GTX Titan): 4.0 Tflop/s, 16 Gflop/J, 290 W.
“Mobile GPU” (Samsung/ARM Arndale): 33 Gflop/s, 8.1 Gflop/J, 7.0 W.]

SLIDE 26

So what?

SLIDE 27

So what?

Possibility 1:

A “first principles” view of time & energy in systems.

SLIDE 28

[“Arch line”: time (GFLOP/s, balance ≈ 3.6 flop:byte) and energy (GFLOP/J, balance ≈ 14 flop:byte) rooflines vs. intensity (FLOP:Byte, 1/2–128). Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

“Arch line”: Wεflop / E = 1 / (1 + Bε / I)

SLIDE 29

[Arch-line plot as above, highlighting the time-energy balance gap.]

Time-energy balance gap: what does this imply?

SLIDE 30

[Arch-line plot as above.]

Time-energy balance gap: compute bound in time but memory bound in energy?

SLIDE 31

[Arch-line plot as above.]

Time-energy balance gap: is optimizing for energy harder than optimizing for time?

SLIDE 32

[Arch-line plot as above.]

Time-energy balance gap: energy efficiency likely implies time efficiency, but not vice versa, breaking “race-to-halt.”

SLIDE 33

So what?

Possibility 2:

Another view of power caps and throttling.

SLIDE 34

Recall: “Power line.” Measured power vs. intensity (single-precision flop:Byte, 1/8–512; power normalized, 0.6–1.15), decomposed into Cap / Memory / Compute components, for twelve platforms:

* GTX Titan: 16 Gflop/J; 4.0 Tflop/s [81%], 240 GB/s [83%]; 120 W (const) + 160 W (cap) [99%]
* GTX 680: 15 Gflop/J; 3.0 Tflop/s [86%], 160 GB/s [82%]; 66 W (const) + 140 W (cap) [100%]
* Xeon Phi: 11 Gflop/J; 2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
* NUC GPU: 8.8 Gflop/J; 270 Gflop/s [100%], 15 GB/s [60%]; 10 W (const) + 18 W (cap) [91%]
* Arndale GPU: 8.1 Gflop/J; 33 Gflop/s [46%], 8.4 GB/s [66%]; 1.3 W (const) + 4.8 W (cap) [88%]
* APU GPU: 6.4 Gflop/J; 100 Gflop/s [95%], 8.7 GB/s [81%]; 16 W (const) + 3.2 W (cap) [100%]
* GTX 580: 5.4 Gflop/J; 1.4 Tflop/s [88%], 170 GB/s [89%]; 120 W (const) + 150 W (cap) [94%]
* NUC CPU: 3.2 Gflop/J; 56 Gflop/s [97%], 18 GB/s [70%]; 17 W (const) + 7.4 W (cap) [98%]
* PandaBoard ES: 2.5 Gflop/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%]; 3.5 W (const) + 1.2 W (cap) [95%]
* Arndale CPU: 2.2 Gflop/J; 16 Gflop/s [58%], 3.9 GB/s [31%]; 5.5 W (const) + 2.0 W (cap) [97%]
* APU CPU: 650 Mflop/J; 13 Gflop/s [98%], 3.3 GB/s [31%]; 20 W (const) + 1.4 W (cap) [98%]
* Desktop CPU: 620 Mflop/J; 99 Gflop/s [93%], 19 GB/s [74%]; 120 W (const) + 44 W (cap) [99%]

SLIDE 35

[The same twelve power-profile panels as above, labeled by platform class in panel order: “Desktop GPU” (GTX Titan), “Desktop GPU” (GTX 680), “Desktop xPU” (Xeon Phi), “Mobile GPU” (NUC GPU), “Mobile GPU” (Arndale GPU), “APU GPU,” “Desktop GPU” (GTX 580), “Mobile CPU” (NUC CPU), “Mobile CPU” (PandaBoard ES), “Mobile CPU” (Arndale CPU), “APU CPU,” “Desktop CPU.”]

SLIDE 36

[Plot: throttling coefficient vs. intensity (single-precision FLOP:Byte), for flops and for mops, on an NVIDIA GTX 580 (GF100; Fermi).]

Can infer throttling, given a power cap
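One way to read this (a sketch, not the talk's fitted model): if the predicted power exceeds the cap, the hardware must run slower. Assuming, hypothetically, that dynamic power scales linearly with speed while the constant term does not, the throttling coefficient is a simple ratio:

```python
def throttling_coefficient(p_const, p_dynamic_full, p_cap):
    """Fraction of full speed permitted by a power cap (1.0 = no throttling).

    Hypothetical assumption: dynamic power scales linearly with speed,
    while the constant draw p_const does not scale at all.
    """
    headroom = p_cap - p_const
    if headroom <= 0:
        return 0.0  # the cap cannot even cover constant power
    return min(1.0, headroom / p_dynamic_full)

# Hypothetical GPU: 120 W constant, 220 W dynamic at full speed, 280 W cap.
print(throttling_coefficient(120.0, 220.0, 280.0))  # ~0.727 of full speed
```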

SLIDE 37

So what?

Possibility 3:

Abstract algorithm analysis.

SLIDE 38

Abstract work-communication trade-offs

Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), with intensity I ≡ W/Q.

SLIDE 39

Abstract work-communication trade-offs: Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), I ≡ W/Q.

Speedup: ∆T = T1,1 / Tf,m. “Greenup”: ∆E = E1,1 / Ef,m.

SLIDE 40

Speedup: ∆T = T1,1 / Tf,m. “Greenup”: ∆E = E1,1 / Ef,m.

Abstract work-communication trade-offs: Algorithm 1 = (W, Q) versus Algorithm 2 = (fW, Q/m), I ≡ W/Q.

A general “greenup” condition:

∆E > 1  ⟹  f < 1 + ((m − 1)/m) · (Bε / I)
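The condition follows from the energy model E = Wεflop · (1 + Bε/I): form ∆E = E1,1/Ef,m and the Wεflop factors cancel. A small numeric check (the bound is from the slide; the numbers are made up):

```python
# Greenup check for Algorithm 1 = (W, Q) vs. Algorithm 2 = (f*W, Q/m).

def greenup(f, m, r):
    """Delta_E = E_{1,1}/E_{f,m}, where r = B_eps/I = (Q*eps_mem)/(W*eps_flop)."""
    return (1.0 + r) / (f + r / m)

def greenup_bound(m, r):
    """Largest work-inflation factor f for which Delta_E > 1."""
    return 1.0 + (m - 1.0) / m * r

# Made-up example: B_eps/I = 7, communication cut 4x (m = 4).
f_max = greenup_bound(4, 7.0)
print(f_max, greenup(f_max, 4, 7.0))  # 6.25 1.0 (Delta_E = 1 exactly at the bound)
```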

SLIDE 41

[Arch-line plot (time and energy rooflines; balances ≈ 3.6 and 14 flop:byte) with Algorithm 1 placed at its intensity. Balance estimates for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

SLIDE 42

[Arch-line plot as above, with Algorithm 1 at its intensity.]

∆E > 1  ⟹  f < 1 + Bε / Bτ

SLIDE 43

So what?

Possibility 4:

Abstract architectural bake-offs.

http://stickofachef.files.wordpress.com/2008/03/iron_chef.jpg

SLIDE 44

[Three-panel comparison vs. intensity (single-precision flop:Byte, 1/8–256): Flop/Time, Flop/Energy, and Power, as relative values, for the GTX Titan and the Arndale GPU.]

SLIDE 45

[Bake-off panels as above, with a ~47⨉ gap annotated.]

SLIDE 46

[Bake-off panels as above, with ~122⨉ and ~47⨉ gaps annotated.]

SLIDE 47

[Bake-off panels as above, with ~122⨉, ~28⨉, and ~47⨉ gaps annotated across the three panels.]

SLIDE 48

[Bake-off panels as above, adding a hypothetical “47 ⨉ Arndale GPU” ensemble to the Flop/Time and Power panels.]

SLIDE 49

[Bake-off panels as above; the 47⨉ Arndale ensemble is annotated “~0.4⨉ as fast.”]

SLIDE 50

[Bake-off panels as above, annotated “~0.4⨉ as fast” and “~1.6⨉ faster — but that’s optimistic!”]

SLIDE 51

[Bake-off panels as above, with one comparison additionally annotated “same!”]

SLIDE 52

[Bake-off panels as above, annotated “same!” and “~2⨉.”]

SLIDE 53

Applying ideas of R. Numrich, we can identify a unitless (dimensionless) manifold of all possible systems in the model, and use differential geometry to estimate the “distance” between systems:

u = u(intensity, balance), v = v(intensity, power), f = f(energy-efficiency)

Numrich, R. W. (2010). “Computer performance analysis and the Pi Theorem.” Comp. Sci. R&D. doi:10.1007/s00450-010-0147-8

SLIDE 54

๏ What are the intrinsic relationships among time, energy, and power?

๏ What do these relationships say about algorithms and software?

๏ About architectures, hardware, and co-design?

๏ Can {roof,arch,power,…}-lines, suitably refined, guide autotuning systems?

[Arch-line sketch posing the question: Bε > Bτ? Alongside, a scatter plot of time balance vs. energy balance (single-precision flop:byte) for many xPUs (CPUs and GPUs: Nehalem, Ivy Bridge, Kepler, Fermi, ARM Cortex-A9, ARM Cortex-A15, ARM Mali T-604, Bobcat, Zacate GPU, KNC), grouped as Mini, Mobile, and Desktop.]

SLIDE 55

Backup

SLIDE 56

Jee Whan Choi: autotuning for power and energy. Aparna Chandramowlishwaran [now R.S. @ MIT]: fast multipole method. Marat Dukhan: math libraries & machine learning. Kent Czechowski: co-design.

2010 Gordon Bell Prize (+ G. Biros); 2010 IPDPS Best Paper (+ K. Knobe, Intel CnC lead); 2012 SIAM Data Mining Best Paper (D. Lee [GE Research] + A. Gray).

See our recent 2013 IPDPS papers, posted at hpcgarage.org/ppam13:
* Jee Choi — “A roofline model of energy.”
* Kent Czechowski — “A theoretical framework for algorithm-architecture co-design.”

SLIDE 57

von Neumann-like system

[Diagram: xPU with fast memory (total size = Z) and slow memory; W (fl)ops on the xPU, Q mops to slow memory.]

W ≡ # (fl)ops
Q ≡ # mem. ops (mops) = Q(Z)

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

hpcgarage.org/modsim13

SLIDE 58

von Neumann-like system

T = max(Wτflop, Qτmem) = Wτflop · max(1, (Q/W)·(τmem/τflop)) = Wτflop · max(1, Bτ/I)

E = Wεflop + Qεmem = Wεflop · (1 + Bε/I)

Consider: Wτflop/T and Wεflop/E.

[Diagram: xPU with fast memory (total size = Z) and slow memory; τflop = time/flop, τmem = time/mop.]

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
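The two expressions differ in one essential way: time takes the max (compute can hide memory traffic), while energy takes the sum (every operation is paid for). A direct transcription, with made-up per-op costs:

```python
# Direct transcription of the slide's time and energy models.
# All machine and kernel parameters below are made up for illustration.

def time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem):
    """Return (T, E) for W flops and Q mops on the modeled machine."""
    I = W / Q                    # intensity (flop:mop)
    B_tau = tau_mem / tau_flop   # time balance
    B_eps = eps_mem / eps_flop   # energy balance
    T = W * tau_flop * max(1.0, B_tau / I)  # max: memory hides under compute
    E = W * eps_flop * (1.0 + B_eps / I)    # sum: every op costs energy
    return T, E

# Hypothetical: 1e9 flops, 1e8 mops; 0.25 ns/flop, 1 ns/mop; 50 pJ/flop, 1 nJ/mop.
T, E = time_energy(1e9, 1e8, 0.25e-9, 1.0e-9, 50e-12, 1.0e-9)
# The slide's efficiency ratios: W*tau_flop/T and W*eps_flop/E.
print(1e9 * 0.25e-9 / T, 1e9 * 50e-12 / E)
```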

SLIDE 59

von Neumann-like system (as above), annotating the minimum time, Wτflop, in T = Wτflop · max(1, Bτ/I).

SLIDE 60

von Neumann-like system (as above), annotating the minimum time and the intensity I (flop : mop).

SLIDE 61

von Neumann-like system (as above), annotating the minimum time, the intensity I, and the balance Bτ (flop : mop).

SLIDE 62

An energy analogue: the same system with per-op energies εmem = energy/mop and εflop = energy/flop, so that

E = Wεflop + Qεmem = Wεflop · (1 + Bε/I)

SLIDE 63

An energy analogue (as above), annotating the energy balance Bε (flop : mop).

SLIDE 64

[Roofline plot: intensity (FLOP:Byte, 1/2–128) vs. relative performance (1/32–1); time balance ≈ 3.6 flop:byte. Balance estimate for a high-end NVIDIA Fermi in double precision, per Keckler et al., IEEE Micro (2011).]

“Roofline” — Williams et al. (Comm. ACM ’09):

Wτflop / T = 1 / max(1, Bτ / I)

SLIDE 65

[Roofline plot as above, labeling the compute-bound region.]

SLIDE 66

[Roofline plot as above, labeling “Memory (bandwidth) bound” and “Compute bound.”]

SLIDE 67

[Roofline plot as above, placing dense matrix multiply in the compute-bound region.]

SLIDE 68

[Roofline plot as above, adding sparse matvec and stencils in the memory-bound region.]

SLIDE 69

1/32 1/16 1/8 1/4 1/2 1

3.6

GFLOP/s

1/2 1 2 4 8 16 32 64 128

Intensity (FLOP:Byte) Relative performance Balance estimate for a high-end NVIDIA Fermi in double-precision, according to Keckler et al. IEEE Micro (2011)

flop:byte

Compute bound

“Roofline” — Williams et al. (Comm. ACM ’09)

Memory (bandwidth) bound sparse matvec; stencils FFTs Dense matrix multiply

Wτflop T = 1 max

  • 1, Bτ

I

  • hpcgarage.org/modsim13
Wednesday, October 2, 13
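The roofline bound above is easy to evaluate directly. A minimal sketch, using the slide's 3.6 FLOP:Byte balance estimate only as an illustrative input (the function itself is generic):

```python
def roofline_relative_perf(intensity, b_tau):
    """Normalized roofline: (W * tau_flop) / T = 1 / max(1, B_tau / I).

    intensity -- arithmetic intensity I (FLOP : Byte)
    b_tau     -- time balance B_tau = tau_mem / tau_flop (FLOP : Byte)
    """
    return 1.0 / max(1.0, b_tau / intensity)

# Below the balance point the kernel is bandwidth-bound; above it, compute-bound.
low = roofline_relative_perf(0.5, 3.6)    # bandwidth-bound: 0.5/3.6 of peak
high = roofline_relative_perf(16.0, 3.6)  # compute-bound: full peak
print(low, high)
```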
slide-70
SLIDE 70

[Figure: the same roofline plot (GFLOP/s vs. intensity, FLOP:Byte), regions now labeled “Memory (bandwidth) bound in time” and “Compute bound in time.” Balance estimates for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

(W·τ_flop) / T = 1 / max(1, B_τ / I)

hpcgarage.org/modsim13
SLIDES 71-72

[Figure: the energy “arch line” (GFLOP/J) added alongside the time roofline (GFLOP/s), plotted against intensity (FLOP:Byte); balance estimates (3.6 and 14) for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). The second build labels the regions “Memory (bandwidth) bound in energy” and “Compute bound in energy.”]

“Arch line”:

(W·ε_flop) / E = 1 / (1 + B_ε / I)

hpcgarage.org/modsim13
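The energy arch line can be sketched the same way; unlike the time roofline, it has no flat region. The function below is an illustrative sketch (the balance value 14 is taken from the slide's annotation, not a measured constant here):

```python
def archline_relative_eff(intensity, b_eps):
    """Normalized energy efficiency: (W * eps_flop) / E = 1 / (1 + B_eps / I),
    with constant power ignored (pi_0 = 0).

    intensity -- arithmetic intensity I (FLOP : Byte)
    b_eps     -- energy balance B_eps = eps_mem / eps_flop
    """
    return 1.0 / (1.0 + b_eps / intensity)

# Efficiency approaches peak only asymptotically as intensity grows.
half = archline_relative_eff(14.0, 14.0)    # at I = B_eps: exactly half of peak
near = archline_relative_eff(1400.0, 14.0)  # far above balance: close to peak
print(half, near)
```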
slide-73
SLIDE 73

That was theory. What happens in practice? ⇒ Cannot ignore constant power. Let’s add it to our model and measure.

SLIDES 74-75

[Diagram: an xPU attached to fast memory (total size = Z) and slow memory; τ_flop = time/flop, τ_mem = time/mop; the system also draws constant power.]

T = max(W·τ_flop, Q·τ_mem)
  = W·τ_flop · max(1, (Q/W)·(τ_mem/τ_flop))
  = W·τ_flop · max(1, B_τ/I)

E = W·ε_flop + Q·ε_mem + T·π_0
  = W·ε_flop · (1 + B_ε/I + (π_0/ε_flop)·(T/W))

Consider (W·τ_flop)/T and (W·ε_flop)/E.
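The two energy expressions above (additive and factored) must agree, which makes a good numerical check. A sketch with made-up parameter values (every number below is an illustrative assumption, not a measurement):

```python
def time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, pi0):
    """Model from the slide: T = max(W*tau_flop, Q*tau_mem);
    E = W*eps_flop + Q*eps_mem + T*pi0."""
    T = max(W * tau_flop, Q * tau_mem)
    E = W * eps_flop + Q * eps_mem + T * pi0
    return T, E

# Illustrative parameters: 1e9 flops at intensity I = 4 FLOP/Byte.
W, I = 1e9, 4.0
Q = W / I                       # mops (bytes moved)
tau_flop, tau_mem = 1e-11, 2e-11
eps_flop, eps_mem = 5e-11, 2e-10
pi0 = 50.0                      # constant power, watts

T, E = time_energy(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, pi0)

# Factored form: E = W*eps_flop * (1 + B_eps/I + (pi0/eps_flop) * (T/W)).
B_eps = eps_mem / eps_flop
E_factored = W * eps_flop * (1 + B_eps / I + (pi0 / eps_flop) * (T / W))
print(T, E, E_factored)  # the two energy expressions agree
```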
slide-76
SLIDE 76

[Figure: measured microbenchmark results vs. model fits, normalized performance (time and energy) against intensity (FLOP:Byte).
NVIDIA GTX 580, double precision: 1.03 = 197 GFLOP/s; 1.72 (static=0), 0.890 = 1.23 GFLOP/J.
Intel i7-950, double precision: 2.08 = 53.3 GFLOP/s; 0.896 (static=0), 1.01 = 0.316 GFLOP/J.]

A microbenchmark study: “GPU energy” includes all GPU card components (card, memory, fan) but excludes the host; the CPU measurement is analogous but excludes the GPU card. Measurements use “PowerMon” (Bedard & Fowler ’09) with sub-millisecond sampling.
SLIDES 77-80

[Figure repeated from the previous slide: measured time and energy results for the NVIDIA GTX 580 and Intel i7-950, double precision, normalized performance vs. intensity (FLOP:Byte); the builds repeat the GTX 580 panel.]

Constant power can shift the energy balance…

…and “race-to-halt” is an artifact of this shift.
SLIDES 81-82

[Figure: measured power vs. intensity (FLOP:Byte), normalized, double precision; levels marked at 84.0 W, 160 W, 212 W, and 288 W for the NVIDIA GTX 580, and at 108 W, 134 W, 169 W, and 195 W for the Intel i7-950.]

Source of error?
slide-83
SLIDE 83

Power capping is critical. It’s also easy to add.

slide-84
SLIDE 84

[Figure: “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte)]

What might a power cap look like?
SLIDES 85-89

Adding a power cap

[Figure (progressive builds): “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte), successively annotated with π_0 (constant power), Δπ (usable power), and π_0 + Δπ (max power).]
slide-90
SLIDE 90

Adding a power cap

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

T_free = max(W·τ_flop, Q·τ_mem)
⇓
T = max( W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ )

[Figure: “Powerline” and “Roofline” panels, normalized value vs. intensity (flop:byte), annotated with π_0 and Δπ.]
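The capped-time expression adds one term to the free-running max: dynamic energy cannot be dissipated faster than the usable power Δπ. A sketch with illustrative numbers, showing the cap becoming the binding constraint when Δπ is small:

```python
def capped_time(W, Q, tau_flop, tau_mem, eps_flop, eps_mem, delta_pi):
    """T = max(W*tau_flop, Q*tau_mem, (W*eps_flop + Q*eps_mem) / delta_pi)."""
    return max(W * tau_flop,
               Q * tau_mem,
               (W * eps_flop + Q * eps_mem) / delta_pi)

# Illustrative parameters (hypothetical machine; dynamic energy = 0.1 J).
W, Q = 1e9, 2.5e8
args = (1e-11, 2e-11, 5e-11, 2e-10)  # tau_flop, tau_mem, eps_flop, eps_mem

t_loose = capped_time(W, Q, *args, delta_pi=100.0)  # cap not binding: compute-bound
t_tight = capped_time(W, Q, *args, delta_pi=5.0)    # cap binding: 2x slower
print(t_loose, t_tight)
```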
SLIDES 91-92

[Figure: measured time (1.6 TFLOP/s at y=1), energy (5.7 GFLOP/J at y=1), and power (280 Watts at y=1), normalized, vs. intensity (single-precision FLOP:Byte), for the NVIDIA GTX 580 (GF100, Fermi).]

A better fit!
slide-93
SLIDE 93

The model suggests a structure in the time, energy, and power relationships. It also facilitates analysis.

slide-94
SLIDE 94

[Figure: twelve per-device panels of power (normalized) vs. intensity (single-precision flop:Byte), each showing fitted “Cap,” “Memory,” and “Compute” lines (only “Cap” and “Memory” for the NUC GPU; only “Cap” and “Compute” for the NUC CPU and APU CPU). Fitted parameters per device:]

GTX Titan     | 16 Gflop/J  | 4.0 Tflop/s [81%], 240 GB/s [83%]  | 120 W (const) + 160 W (cap) [99%]
GTX 680       | 15 Gflop/J  | 3.0 Tflop/s [86%], 160 GB/s [82%]  | 66 W (const) + 140 W (cap) [100%]
Xeon Phi      | 11 Gflop/J  | 2.0 Tflop/s [100%], 180 GB/s [57%] | 180 W (const) + 36 W (cap) [100%]
NUC GPU       | 8.8 Gflop/J | 270 Gflop/s [100%], 15 GB/s [60%]  | 10 W (const) + 18 W (cap) [91%]
Arndale GPU   | 8.1 Gflop/J | 33 Gflop/s [46%], 8.4 GB/s [66%]   | 1.3 W (const) + 4.8 W (cap) [88%]
APU GPU       | 6.4 Gflop/J | 100 Gflop/s [95%], 8.7 GB/s [81%]  | 16 W (const) + 3.2 W (cap) [100%]
GTX 580       | 5.4 Gflop/J | 1.4 Tflop/s [88%], 170 GB/s [89%]  | 120 W (const) + 150 W (cap) [94%]
NUC CPU       | 3.2 Gflop/J | 56 Gflop/s [97%], 18 GB/s [70%]    | 17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES | 2.5 Gflop/J | 9.5 Gflop/s [99%], 1.3 GB/s [40%]  | 3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU   | 2.2 Gflop/J | 16 Gflop/s [58%], 3.9 GB/s [31%]   | 5.5 W (const) + 2.0 W (cap) [97%]
APU CPU       | 650 Mflop/J | 13 Gflop/s [98%], 3.3 GB/s [31%]   | 20 W (const) + 1.4 W (cap) [98%]
Desktop CPU   | 620 Mflop/J | 99 Gflop/s [93%], 19 GB/s [74%]    | 120 W (const) + 44 W (cap) [99%]
SLIDES 95-97

Example: Caps imply throttling!

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

Throttling factors, i.e., allowable slowdown:

T = max( W·τ_flop, Q·τ_mem, (W·ε_flop + Q·ε_mem) / Δπ )
⇓
τ̃_flop ≡ T/W ≡ τ_flop·s_flop,  τ̃_mem ≡ T/Q ≡ τ_mem·s_mem
⇓
s_flop ≡ max{ 1, B_τ/I, (ε_flop/τ_flop)/Δπ · (1 + B_ε/I) },  s_mem ≡ s_flop · I/B_τ
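The throttling factors can be computed directly from the expressions above. In the sketch below, ε_flop/τ_flop is passed in as a single peak-power-per-flop parameter; all values are illustrative assumptions:

```python
def throttle_factors(intensity, b_tau, b_eps, pi_flop, delta_pi):
    """Slowdown factors implied by a power cap:
    s_flop = max(1, B_tau/I, (pi_flop/delta_pi) * (1 + B_eps/I)),
    s_mem  = s_flop * I / B_tau,
    where pi_flop = eps_flop / tau_flop is peak power per flop."""
    s_flop = max(1.0,
                 b_tau / intensity,
                 (pi_flop / delta_pi) * (1.0 + b_eps / intensity))
    s_mem = s_flop * intensity / b_tau
    return s_flop, s_mem

# Illustrative: B_tau = 2, B_eps = 4, pi_flop = 5 W, cap delta_pi = 2.5 W.
s_flop, s_mem = throttle_factors(4.0, 2.0, 4.0, 5.0, 2.5)
print(s_flop, s_mem)  # cap-bound: both flops and mops must slow down
```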
slide-98
SLIDE 98

[Figure: throttling coefficients for flops and mops vs. intensity (single-precision FLOP:Byte), NVIDIA GTX 580 (GF100, Fermi).]
slide-99
SLIDE 99

[Figure: twelve per-device panels of flops/time (Gflop/s) vs. intensity (single-precision flop:Byte), with “Cap,” “Memory,” and “Compute” lines as before. Fitted parameters per device, now with memory energy efficiency (GB/J):]

GTX Titan     | 16 Gflop/J, 1.3 GB/J  | 4.0 Tflop/s [81%], 240 GB/s [83%]  | 120 W (const) + 160 W (cap) [99%]
GTX 680       | 15 Gflop/J, 1.2 GB/J  | 3.0 Tflop/s [86%], 160 GB/s [82%]  | 66 W (const) + 140 W (cap) [100%]
Xeon Phi      | 11 Gflop/J, 880 MB/J  | 2.0 Tflop/s [100%], 180 GB/s [57%] | 180 W (const) + 36 W (cap) [100%]
NUC GPU       | 8.8 Gflop/J, 670 MB/J | 270 Gflop/s [100%], 15 GB/s [60%]  | 10 W (const) + 18 W (cap) [91%]
Arndale GPU   | 8.1 Gflop/J, 1.5 GB/J | 33 Gflop/s [46%], 8.4 GB/s [66%]   | 1.3 W (const) + 4.8 W (cap) [88%]
APU GPU       | 6.4 Gflop/J, 470 MB/J | 100 Gflop/s [95%], 8.7 GB/s [81%]  | 16 W (const) + 3.2 W (cap) [100%]
GTX 580       | 5.3 Gflop/J, 810 MB/J | 1.4 Tflop/s [88%], 170 GB/s [89%]  | 120 W (const) + 150 W (cap) [94%]
NUC CPU       | 3.2 Gflop/J, 750 MB/J | 56 Gflop/s [97%], 18 GB/s [70%]    | 17 W (const) + 7.4 W (cap) [98%]
PandaBoard ES | 2.5 Gflop/J, 280 MB/J | 9.5 Gflop/s [99%], 1.3 GB/s [40%]  | 3.5 W (const) + 1.2 W (cap) [95%]
Arndale CPU   | 2.2 Gflop/J, 560 MB/J | 16 Gflop/s [58%], 3.9 GB/s [31%]   | 5.5 W (const) + 2.0 W (cap) [97%]
APU CPU       | 650 Mflop/J, 150 MB/J | 13 Gflop/s [98%], 3.3 GB/s [31%]   | 20 W (const) + 1.4 W (cap) [98%]
Desktop CPU   | 620 Mflop/J, 140 MB/J | 99 Gflop/s [93%], 19 GB/s [74%]    | 120 W (const) + 44 W (cap) [99%]
slide-100
SLIDE 100

[Figure: the same twelve devices and fitted parameters as the previous slide, panels now showing flops/energy (Gflop/J) vs. intensity (single-precision flop:Byte), ordered by memory energy efficiency (GB/J): Arndale GPU, GTX Titan, GTX 680, Xeon Phi, GTX 580, NUC CPU, NUC GPU, Arndale CPU, APU GPU, PandaBoard ES, APU CPU, Desktop CPU.]
slide-101
SLIDE 101

Caps imply throttling!

What power cap will obviate throttling? The balance gap dictates a sufficient (algorithm-independent) condition.

π_0 : constant power
Δπ : usable power
π_0 + Δπ : max power

Peak power per flop (or mop):

π_flop ≡ ε_flop / τ_flop,  π_mem ≡ ε_mem / τ_mem
⇓
Δπ ≥ π_flop + π_mem = π_flop · (1 + B_ε / B_τ)
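The sufficient condition above is straightforward to evaluate, and checking that the additive and factored forms agree is a good sanity test. A sketch with the same illustrative (made-up) machine parameters as earlier:

```python
def min_usable_power(tau_flop, tau_mem, eps_flop, eps_mem):
    """Algorithm-independent usable-power cap that avoids throttling:
    delta_pi >= pi_flop + pi_mem, with pi_flop = eps_flop/tau_flop
    and pi_mem = eps_mem/tau_mem (peak power per flop / per mop)."""
    return eps_flop / tau_flop + eps_mem / tau_mem

tau_flop, tau_mem = 1e-11, 2e-11
eps_flop, eps_mem = 5e-11, 2e-10

bound = min_usable_power(tau_flop, tau_mem, eps_flop, eps_mem)

# Factored form: pi_flop * (1 + B_eps / B_tau) must give the same bound.
pi_flop = eps_flop / tau_flop
b_eps = eps_mem / eps_flop
b_tau = tau_mem / tau_flop
bound_factored = pi_flop * (1.0 + b_eps / b_tau)
print(bound, bound_factored)  # identical
```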