
SLIDE 1

Algorithmic time, energy, and power on candidate HPC compute building blocks

Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc
May 20, 2014
Presented at IPDPS'14

SLIDE 2

Contributions

  • Energy roofline (IPDPS'13): quantifies the relative energy costs of computation and data movement
  • A power "cap" limits achievable performance
  • μbenchmark suite for exercising different levels of the memory hierarchy
  • Empirical data on systems ranging from server- to mobile-class platforms
  • Analysis using our methodology
SLIDE 3

Roofline in energy (IPDPS'13)

[Figure: machine model. An xPU with a fast memory of total size Z sits in front of slow memory; the computation performs W (fl)ops, each taking τ_flop time, and Q (m)ops, each taking τ_mem time.]

SLIDE 4

Roofline in energy (IPDPS'13)

[Figure: relative performance vs. intensity (FLOP:Byte), comparing the classic time "Roofline" [1] with the energy roofline ("Arch line"); the labeled peaks are 14 GFLOP/s and 3.6 GFLOP/J.]

[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785

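To make the arch line concrete, here is a minimal sketch (an editorial illustration, not from the talk) that evaluates the time roofline and the energy roofline of the model above for a made-up machine; every parameter value below is a placeholder, not a measurement from the paper.

    # Minimal sketch of the time roofline and the energy roofline ("arch line").
    # All parameter values below are illustrative placeholders, not measured data.
    tau_flop = 1.0 / 50e9    # seconds per flop (hypothetical 50 Gflop/s peak)
    tau_mem  = 1.0 / 25e9    # seconds per byte (hypothetical 25 GB/s peak)
    eps_flop = 100e-12       # joules per flop (hypothetical)
    eps_mem  = 600e-12       # joules per byte (hypothetical)
    pi_const = 50.0          # constant power in watts (hypothetical)

    B_tau = tau_mem / tau_flop   # time balance (flop : byte)
    B_eps = eps_mem / eps_flop   # energy balance (flop : byte)

    def gflops_per_sec(I, W=1e9):
        """Time roofline: W / T at intensity I (flop : byte)."""
        T = W * tau_flop * max(1.0, B_tau / I)
        return W / T / 1e9

    def gflops_per_joule(I, W=1e9):
        """Energy roofline ("arch line"): W / E, counting constant power pi_const * T."""
        T = W * tau_flop * max(1.0, B_tau / I)
        E = W * eps_flop + (W / I) * eps_mem + pi_const * T
        return W / E / 1e9

    for I in [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64]:
        print(f"I={I:6.2f}   {gflops_per_sec(I):6.1f} Gflop/s   {gflops_per_joule(I):5.2f} Gflop/J")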
SLIDE 6

Roofline in energy (IPDPS'13)

[Figure: power, relative to flop-power, vs. intensity (flop:byte), decomposed into power dissipated by the compute units and power dissipated by the memory units.]


SLIDE 8

Roofline in energy (IPDPS'13)

[Figure: measured power vs. intensity (FLOP : Byte), normalized to flop+const, for the NVIDIA GTX 580 (GPU only) and the Intel i7-950 (desktop). Annotated values: 8.2, 5.1 (const=0), 4.5 with power levels 120/220/280/380 W for the GTX 580; 4.2, 2.1 (const=0), 2.1 with power levels 120/140/160/180 W for the i7-950.]

SLIDE 9

Roofline in energy (IPDPS'13)

[Figure: same measurements as Slide 8.]

A power cap prevents peak performance.

SLIDE 10

Power Cap

Power is determined by performance; performance is limited by power.

SLIDE 11

Power Cap

Power is determined by performance; performance is limited by power.

[Figure annotation: "usable" power.]

SLIDE 14

Power Cap

[Figure: power, relative to flop-power, vs. intensity (flop:byte).]


SLIDE 26

μbenchmark Suite

  • Intensity (flops, bytes); cache level (shared memory; L1, L2, etc.); random access
  ×
  • Performance; energy
  ×
  • x86 CPU (Intel, AMD); ARM CPU (A9, A15); GPU (NVIDIA, AMD, ARM); Xeon Phi

http://hpcgarage.org/archline

SLIDE 27

μbenchmark Suite

  • CPU intensity μbenchmark for Ivy Bridge
    – aligned memory loads
    – 1 MUL and 1 ADD AVX instruction issued per cycle
    – maximize AVX register usage to increase ILP
    – parallelized over all available cores

    vmovapd ymm0, [rdi - 128]
    vmovapd ymm1, [rdi - 96]
    vmovapd ymm2, [rdi - 64]
    vmovapd ymm3, [rdi - 32]
    vmovapd ymm4, [rdi]
    vmovapd ymm5, [rdi + 32]
    vmovapd ymm6, [rdi + 64]
    vmovapd ymm7, [rdi + 96]
    %rep MAD_PER_ELEMENT
        vmulpd ymm0, ymm0, ymm0
        vaddpd ymm8, ymm8, ymm0
        vmulpd ymm1, ymm1, ymm1
        vaddpd ymm9, ymm9, ymm1
        vmulpd ymm2, ymm2, ymm2
        vaddpd ymm10, ymm10, ymm2
        vmulpd ymm3, ymm3, ymm3
        vaddpd ymm11, ymm11, ymm3
        vmulpd ymm4, ymm4, ymm4
        vaddpd ymm12, ymm12, ymm4
        vmulpd ymm5, ymm5, ymm5
        vaddpd ymm13, ymm13, ymm5
        vmulpd ymm6, ymm6, ymm6
        vaddpd ymm14, ymm14, ymm6
        vmulpd ymm7, ymm7, ymm7
        vaddpd ymm15, ymm15, ymm7
    %endrep

http://hpcgarage.org/archline

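As a back-of-the-envelope check on how this kernel's intensity is dialed (an editorial note, not from the talk): the block above loads 8 x 32 bytes, and each %rep iteration issues 8 multiplies and 8 adds on 4-wide double vectors, so intensity grows linearly with MAD_PER_ELEMENT.

    # Rough intensity of the Ivy Bridge AVX microbenchmark above, assuming the
    # snippet shown is the whole inner-loop body (a simplification).
    BYTES_PER_LOAD     = 32   # one 256-bit vmovapd
    LOADS              = 8    # ymm0..ymm7
    DOUBLES_PER_VECTOR = 4    # 256-bit AVX, double precision

    def intensity(mad_per_element):
        """Flops per byte as a function of the MAD_PER_ELEMENT repeat count."""
        bytes_moved = LOADS * BYTES_PER_LOAD                        # 256 B per block
        flops = mad_per_element * LOADS * 2 * DOUBLES_PER_VECTOR    # mul + add per vector
        return flops / bytes_moved

    for r in [1, 2, 4, 8, 16, 32]:
        print(f"MAD_PER_ELEMENT={r:3d}  ->  {intensity(r):5.2f} flop/byte")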

SLIDE 31

μbenchmark Suite

  • CPU intensity μbenchmark for Cortex-A9
    – aligned memory loads
    – 1 ADD every cycle, but 1 MUL every other cycle
    – maximize register usage to increase ILP
    – parallelized over all available cores

    .rept MLA_PER_DOUBLE
        VMLA.F64 d12, d0, d0
        VADD.F64 d24, d24, d0
        VMLA.F64 d13, d1, d1
        VADD.F64 d25, d25, d1
        VMLA.F64 d14, d2, d2
        VADD.F64 d26, d26, d2
        VMLA.F64 d15, d3, d3
        VADD.F64 d27, d27, d3
        VMLA.F64 d16, d4, d4
        VADD.F64 d28, d28, d4
        VMLA.F64 d17, d5, d5
        VADD.F64 d29, d29, d5
        VMLA.F64 d18, d6, d6
        VADD.F64 d24, d24, d6
        VMLA.F64 d19, d7, d7
        VADD.F64 d25, d25, d7
        VMLA.F64 d20, d8, d8
        VADD.F64 d26, d26, d8
        VMLA.F64 d21, d9, d9
        VADD.F64 d27, d27, d9
        VMLA.F64 d22, d10, d10
        VADD.F64 d28, d28, d10
        VMLA.F64 d23, d11, d11
        VADD.F64 d29, d29, d11
    .endr

http://hpcgarage.org/archline

SLIDE 32

μbenchmark Suite

  • GPU intensity μbenchmark for Kepler (vs. Fermi)
    – the quad warp scheduler selects up to four warps and issues two instructions per warp
    – each SMX has 192 cores, or 6 warp-cores
    – we need more than one independent instruction in two of the warps, unlike Fermi
    – theoretical peak impossible to achieve; likely a register bandwidth limit

    uint tid = threadIdx.x + blockIdx.x * blockDim.x;
    TYPE tmp1;
    float x, y, z, w;
    if (tid < num_threads) {
        tmp1 = in[tid];
        x = tmp1.x; y = tmp1.y; z = tmp1.z; w = tmp1.w;
        x = x + x * CONST; y = y + y * CONST; z = z + z * CONST; w = w + w * CONST;
        x = x + x * CONST; y = y + y * CONST; z = z + z * CONST; w = w + w * CONST;
        ...
        out[tid] = float4(x, y, z, w);
    }

http://hpcgarage.org/archline

SLIDE 33

μbenchmark Suite

  • Memory μbenchmarks
    – read from both ends of the array so that prefetched data is (almost always) consumed
  • Reducing integer overhead
    – pointer-chasing arrays
    – fully unrolled loops
  • Auto-tuning framework for determining optimal performance parameters
    – e.g., number of threads, thread block size

[Figure: two streams (stream 0, stream 1) walking the array from opposite ends, with PF0 and PF1.]

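The pointer-chasing idea can be sketched as follows (an editorial illustration, not code from the suite): build an index array whose entries form a single random cycle, so every load address depends on the previous load and hardware prefetchers get no help.

    import random

    def make_chase(n, seed=0):
        """Return an index array that forms one random cycle of length n."""
        rng = random.Random(seed)
        order = list(range(n))
        rng.shuffle(order)
        nxt = [0] * n
        for i in range(n):
            nxt[order[i]] = order[(i + 1) % n]   # each slot points to the next in the cycle
        return nxt

    def chase(nxt, steps):
        """Follow the chain; each access depends on the result of the previous one."""
        i = 0
        for _ in range(steps):
            i = nxt[i]
        return i

    chain = make_chase(1 << 10)
    _ = chase(chain, len(chain))   # traverse the whole cycle once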
SLIDE 34

Experimental Setup

[Figure: measurement setup. PowerMon 2 sits between the ATX PSU and the CPU/motherboard, a PCIe interposer measures the GPU, and the ARM development board and the APU are measured at their power bricks.]

SLIDE 35

Experimental Results

Columns: Platform | Processor | vendor peak single Gflop/s | double Gflop/s | mem. bw. GB/s | π1 Watts (idle) | Δπ Watts | ε_s pJ/flop (Gflop/s) | ε_d pJ/flop (Gflop/s) | ε_mem pJ/B (GB/s) | ε_L1 pJ/B (GB/s) | ε_L2 pJ/B (GB/s) | ε_rand nJ/access (Macc/s). Group headers: vendor's claimed peak (columns 3-5); power, empirical (6-7); energy with empirical throughput in parentheses (8-12); random access (13).

Desktop CPU | "Nehalem" Intel Core i7-950 (45 nm) | 107 | 53.3 | 25.6 | 122 (79.9) | 44.2 | 371 (99.4) | 670 (49.7) | 795 (19.1) | 135 (201) | 168 (120) | 108 (149)
NUC CPU | "Ivy Bridge" Intel Core i3-3217U (22 nm) | 57.6 | 28.8 | 25.6 | 16.5 (13.2) | 7.37 | 14.7 (55.6) | 24.3 (27.9) | 418 (17.9) | 8.75 (201) | 14.3 (103) | 54.6 (55.3)
NUC GPU | HD 4000 | 269 | — | 25.6 | 10.1 (13.2)* | 17.7 | 76.1 (268) | — | 837 (15.4) | — | — | —
APU CPU | "Bobcat" AMD E2-1800 (40 nm) | 13.6 | 5.10 | 10.7 | 20.1 (11.8) | 1.39 | 33.5 (13.4) | 119 (5.05) | 435 (3.32) | 84.0 (25.8) | 138 (11.6) | 75.6 (8.03)
APU GPU | "Zacate" HD 7340 | 109 | — | 10.7 | 15.6 (11.8) | 3.23 | 5.82 (104) | — | 333 (8.70) | 6.47 (46.0) | — | 45.8 (115)
GTX 580 | "Fermi" NVIDIA GF100 (40 nm) | 1580 | 198 | 192 | 122 (148)* | 146 | 99.7 (1400) | 213 (196) | 513 (171) | 149 (761) | 257 (284) | 112 (977)
GTX 680 | "Kepler" NVIDIA GK104 (28 nm) | 3530 | 147 | 192 | 66.4 (100)* | 145 | 43.2 (3030) | 263 (147) | 437 (158) | 51 (1150) | 195 (297) | 184 (1420)
GTX Titan | "Kepler" NVIDIA GK110 (28 nm) | 4990 | 1660 | 288 | 123 (72.9) | 164 | 30.4 (4020) | 93.9 (1600) | 267 (239) | 24.4 (1610) | 195 (297) | 48.0 (968)
Xeon Phi | "KNC" Intel 5110P (22 nm) | 2020 | 1010 | 320 | 180 (90) | 36.1 | 6.05 (2020) | 12.4 (1010) | 136 (181) | 2.19 (2890) | 8.65 (591) | 5.11 (706)
PandaBoard ES | "Cortex-A9" TI OMAP 4460 (45 nm) | 9.60 | 3.60 | 3.20 | 3.48 (2.74) | 1.19 | 37.2 (9.47) | 302 (3.02) | 810 (1.28) | 79.5 (18.4) | 134 (4.12) | 60.9 (12.1)
Arndale CPU | "Cortex-A15" Samsung Exynos 5 (32 nm) | 27.2 | 6.80 | 12.8 | 5.50 (1.72) | 2.01 | 107 (15.8) | 275 (3.97) | 386 (3.94) | 76.3 (50.8) | 248 (15.2) | 138 (14.8)
Arndale GPU | "Mali T-604" | 72.0 | — | 12.8 | 1.28 (1.72)* | 4.83 | 84.2 (33.0) | — | 518 (8.39) | 71.4 (33.4) | — | 125 (33.6)

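One way to read the table is through the balance points of the cost model on Slides 61-64, e.g. B_eps = eps_mem / eps_flop. The sketch below (an editorial note, not from the talk) computes it for a few rows using the single-precision eps_s and eps_mem values above; a larger B_eps means a kernel needs higher intensity before flop energy, rather than memory energy, dominates.

    # Energy balance points B_eps = eps_mem / eps_flop (flop : byte), from the
    # single-precision flop energy (eps_s, pJ/flop) and DRAM energy (eps_mem, pJ/B)
    # listed in the table above.
    platforms = {
        "Desktop CPU (i7-950)": (371.0, 795.0),
        "GTX Titan (GK110)":    (30.4, 267.0),
        "Xeon Phi (5110P)":     (6.05, 136.0),
        "Arndale GPU (Mali)":   (84.2, 518.0),
    }

    for name, (eps_flop, eps_mem) in platforms.items():
        print(f"{name:24s} B_eps = {eps_mem / eps_flop:5.1f} flop/byte")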
SLIDE 36

Hypothetical system

A 5 TFLOP/s "supercomputer":
  • 1× GTX Titan GPU: ~287 Watts, ~5 TFLOP/s
  • 70× Mali T-604 GPU: ~6 Watts and ~72 GFLOP/s each

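A rough sketch of the comparison behind this slide (editorial, with strong assumptions): it treats 70 Mali GPUs as perfectly additive in throughput and power, ignores interconnect and host overheads, and compares peaks only; the plots on the next slide make the comparison across the full intensity range.

    # Peak-only comparison of the two ways to reach ~5 single-precision TFLOP/s.
    # Per-device numbers are the figures from this slide; linear scaling is assumed.
    titan = {"gflops": 5000.0, "watts": 287.0}   # 1x GTX Titan
    mali  = {"gflops": 72.0,   "watts": 6.0}     # 1x Mali T-604 (Arndale GPU)
    n_mali = 70

    mali_sys = {"gflops": n_mali * mali["gflops"], "watts": n_mali * mali["watts"]}

    for name, sys in [("1x GTX Titan", titan), (f"{n_mali}x Mali T-604", mali_sys)]:
        print(f"{name:16s} {sys['gflops']:7.0f} Gflop/s  {sys['watts']:6.1f} W  "
              f"{sys['gflops'] / sys['watts']:5.1f} Gflop/s per W")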
SLIDE 37

Hypothetical system

[Figure: three panels (Time, Energy, Power) of normalized performance vs. intensity (single-precision FLOP : Byte), comparing the NVIDIA GTX Titan (GK110, Kepler), a single Samsung Arndale GPU (Exynos 5, ARM Mali T-604), and a hypothetical 70× Arndale GPU. Caption: hypothetical scaled Arndale GPU design vs. GTX Titan.]


slide-41
SLIDE 41

Hypothetical system

Time Energy Power

  • 1/4096

1/2048 1/1024 1/512 1/256 1/128 1/64 1/32 1/16 1/8 1/4 1/2 1 1/41/2 1 2 4 8 16 32 641281/41/2 1 2 4 8 16 32 641281/41/2 1 2 4 8 16 32 64128

Intensity (single−precision FLOP : Byte) Normalized performance

  • NVIDIA GTX Titan

GK110; Kepler Samsung Arndale GPU Exynos 5; ARM Mali T−604 Samsung 70 x Arndale GPU Exynos 5; ARM Mali T−604

Hypothetical scaled Arndale GPU design vs. GTX Titan

slide-42
SLIDE 42

Power throttling

Full 1/2 1/4 1/8

16 Gflop/J, 1.3 GB/J 4.0 Tflop/s [81%], 240 GB/s [83%] 120 W (const) + 160 W (cap) [99%]

C C C F M

GTX Titan 1/4 1/2 1 2 1/4 1/2 1 2 4 8 16 32 64 128 1/4

Intensity (single−precision FLOP:Byte) Power [normalized to const+cap]

32 64 128 1/4 1/2 1 2 4

Intensity (single−precision FLOP:Byte)

SLIDE 43

Power throttling

[Figure: same GTX Titan measurements, annotated with π1 and Δπ.]

  • Slightly throttled around B_τ (the time balance point)

SLIDE 44

Power throttling

[Figure: same measurements with the cap set to P_th = (1/2)·Δπ + π1.]

  • Not throttled when intensity is low (≤ 2), but throttled everywhere else

SLIDE 45

Power throttling

[Figure: same measurements with the cap set to P_th = (1/4)·Δπ + π1.]

  • Throttled everywhere
  • Performance will be lower than expected at all intensities

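To make the cap settings concrete, here is a small editorial sketch using the GTX Titan's constant power and full dynamic-power range as annotated on these slides; it only shows where each cap sits relative to pi_1 and delta_pi, not the resulting performance.

    # Cap settings used in the throttling experiment, illustrated for the GTX Titan.
    # pi_1 and delta_pi are the values annotated on these slides.
    pi_1     = 120.0   # constant power, watts
    delta_pi = 160.0   # full dynamic-power range ("cap"), watts

    for frac in (1.0, 0.5, 0.25, 0.125):        # Full, 1/2, 1/4, 1/8
        p_th = frac * delta_pi + pi_1           # P_th = frac * delta_pi + pi_1
        usable = p_th - pi_1                    # power left above the constant term
        print(f"P_th = {frac:5.3f} * delta_pi + pi_1 = {p_th:5.1f} W  "
              f"(usable above constant: {usable:5.1f} W, {100 * frac:.0f}%)")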
SLIDE 48

Power throttling

[Figure: flops/energy, normalized to 16 Gflop/J, vs. intensity (single-precision FLOP:Byte) for all twelve platforms, each at throttling levels Full, 1/2, 1/4, and 1/8. Per-panel annotations:]

  • GTX Titan:     16 Gflop/J, 1.3 GB/J; 4.0 Tflop/s [81%], 240 GB/s [83%]; 120 W (const) + 160 W (cap) [99%]
  • GTX 680:       15 Gflop/J, 1.2 GB/J; 3.0 Tflop/s [86%], 160 GB/s [82%]; 66 W (const) + 140 W (cap) [100%]
  • Xeon Phi:      11 Gflop/J, 880 MB/J; 2.0 Tflop/s [100%], 180 GB/s [57%]; 180 W (const) + 36 W (cap) [100%]
  • NUC GPU:       8.8 Gflop/J, 670 MB/J; 270 Gflop/s [100%], 15 GB/s [60%]; 10 W (const) + 18 W (cap) [91%]
  • Arndale GPU:   8.1 Gflop/J, 1.5 GB/J; 33 Gflop/s [46%], 8.4 GB/s [66%]; 1.3 W (const) + 4.8 W (cap) [88%]
  • APU GPU:       6.4 Gflop/J, 470 MB/J; 100 Gflop/s [95%], 8.7 GB/s [81%]; 16 W (const) + 3.2 W (cap) [100%]
  • GTX 580:       5.3 Gflop/J, 810 MB/J; 1.4 Tflop/s [88%], 170 GB/s [89%]; 120 W (const) + 150 W (cap) [94%]
  • NUC CPU:       3.2 Gflop/J, 750 MB/J; 56 Gflop/s [97%], 18 GB/s [70%]; 17 W (const) + 7.4 W (cap) [98%]
  • PandaBoard ES: 2.5 Gflop/J, 280 MB/J; 9.5 Gflop/s [99%], 1.3 GB/s [40%]; 3.5 W (const) + 1.2 W (cap) [95%]
  • Arndale CPU:   2.2 Gflop/J, 560 MB/J; 16 Gflop/s [58%], 3.9 GB/s [31%]; 5.5 W (const) + 2.0 W (cap) [97%]
  • APU CPU:       650 Mflop/J, 150 MB/J; 13 Gflop/s [98%], 3.3 GB/s [31%]; 20 W (const) + 1.4 W (cap) [98%]
  • Desktop CPU:   620 Mflop/J, 140 MB/J; 99 Gflop/s [93%], 19 GB/s [74%]; 120 W (const) + 44 W (cap) [99%]
SLIDE 49

[Figure: power, normalized to const+cap, vs. intensity (single-precision FLOP:Byte) for the GTX Titan, Xeon Phi, and Arndale GPU at throttling levels Full, 1/2, 1/4, and 1/8, with the per-panel annotations repeated from Slide 48.]

SLIDE 50

[Figure: flops/time, normalized to estimated peak, vs. intensity (single-precision FLOP:Byte) for the GTX Titan, Xeon Phi, and Arndale GPU at throttling levels Full, 1/2, 1/4, and 1/8, with the per-panel annotations repeated from Slide 48. Annotation: < 4×.]

SLIDE 51

[Figure: same as Slide 50. Annotation: ~8×.]

SLIDE 52

[Figure: same as Slide 50. Annotations: ~5×, ~8×.]

SLIDE 53

Conclusions and future work

  • An algorithmic, first-principles approach to energy
    – provides a high-level analytical methodology
    – supports reasoning about hypothetical systems
    – enables inter-platform comparisons of energy costs
  • Energy is fundamentally different from time (performance)
    – every cost must be accounted for
    – constant power is a huge bottleneck

SLIDE 54

Power Cap

[Figure: power, relative to flop-power, vs. intensity (flop:byte).]

SLIDE 55

[Figure: flops/energy, normalized to estimated peak, vs. intensity (single-precision FLOP:Byte) for the GTX Titan, Xeon Phi, and Arndale GPU at throttling levels Full, 1/2, 1/4, and 1/8, with the per-panel annotations repeated from Slide 48.]

SLIDE 56

Breaking down the costs

[Figure: energy per single-precision op, broken down into Flop, L1, L2, Mem, and Rand components, for all platforms (Arndale GPU, GTX Titan, GTX 680, Xeon Phi, GTX 580, NUC CPU, Arndale CPU, APU GPU, PandaBoard ES, APU CPU, Desktop CPU) on a log2 scale; baseline 61.0 pJ, including constant energy.]

SLIDE 57

Breaking down the costs

[Figure: the same energy-per-op breakdown shown as three panels, each ranking the platforms in a different order; baseline 61.0 pJ, including constant energy.]

SLIDE 58

Breaking down the costs

[Figure: same three-panel energy breakdown as Slide 57.]

SLIDE 59

Power cap model error

[Figure: power prediction error, (Model − Measured) / Measured, single precision, uncapped vs. capped, for each platform: Arndale GPU** (Samsung Exynos 5, ARM Mali T-604), NUC GPU** (Intel HD 4000, Ivy Bridge), Arndale CPU** (Samsung Exynos 5, ARM Cortex-A15), GTX 680** (NVIDIA GK104, Kepler), PandaBoard ES** (TI OMAP4460, ARM Cortex-A9), GTX Titan (NVIDIA GK110, Kepler), GTX 580 (NVIDIA GF100, Fermi), Xeon Phi** (Intel 5110P, KNC), Desktop CPU (Intel i7-950, Nehalem), NUC CPU (Intel i3-3217U, Ivy Bridge), APU GPU** (AMD HD 7340, Zacate), APU CPU (AMD E2-1800, Bobcat).]

SLIDE 60

Roofline in time

[Figure: relative performance (GFLOP/s) vs. intensity (FLOP:Byte), showing the memory-bandwidth-bound and compute-bound regimes of the classic roofline.]

SLIDE 61

Cost models

[Machine model: an xPU with a fast memory of total size Z in front of slow memory; W (fl)ops at τ_flop time per flop, Q (m)ops at τ_mem time per mop.]

Time:

  T = max(W·τ_flop, Q·τ_mem)
    = W·τ_flop · max(1, (Q/W)·(τ_mem/τ_flop))
    = W·τ_flop · max(1, B_τ / I)

Energy:

  E = W·ε_flop + Q·ε_mem + π_0·T
    = W·ε_flop · (1 + B_ε/I + (π_0/ε_flop)·(T/W))

SLIDE 62

Cost models

[Machine model: as on Slide 61.]

Time:

  T = max(W·τ_flop, Q·τ_mem) = W·τ_flop · max(1, B_τ / I)

Energy (B_ε is the energy balance, in flop : mop; π_1 is the constant power):

  E = W·ε_flop + Q·ε_mem + π_1·T
    = W·ε_flop · (1 + B_ε/I + (π_1/ε_flop)·(T/W))

SLIDE 63

Roofline in energy

[Figure: relative performance vs. intensity (FLOP:Byte), showing the gap between the time balance point and the energy balance point.]

Time-energy balance gap:

  • 1. A computation can simultaneously be compute-bound in time but memory-bound in energy (when B_τ < I < B_ε).
  • 2. Optimizing for energy is harder than optimizing for time.
  • 3. Energy-efficiency implies time-efficiency, but not vice versa, breaking "race-to-halt".

SLIDE 64

Roofline in power (power line)

[Figure: power, relative to flop-power, vs. intensity (flop:byte).]

  P = E / T
    = [W·ε_flop · (1 + B_ε/I)] / [W·τ_flop · max(1, B_τ/I)]
    = (ε_flop/τ_flop) · min(1 + B_ε/I, (I + B_ε)/B_τ)

  I → ∞   ⇒  P = ε_flop/τ_flop = P_flop
  I → 0   ⇒  P = P_flop · B_ε/B_τ
  I = B_τ ⇒  P = P_flop · (1 + B_ε/B_τ)

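A quick numeric check of the power line (an editorial sketch; B_tau, B_eps, and P_flop below are placeholders, not measured values) reproduces the three limiting cases stated above.

    # Evaluate P(I) = P_flop * min(1 + B_eps/I, (I + B_eps)/B_tau) and check its limits.
    B_tau  = 2.0    # time balance (flop : byte), placeholder
    B_eps  = 8.0    # energy balance (flop : byte), placeholder
    P_flop = 1.0    # flop-bound power; everything is normalized to it

    def power(I):
        return P_flop * min(1.0 + B_eps / I, (I + B_eps) / B_tau)

    print(power(1e9))                                    # I -> infinity: ~P_flop
    print(power(1e-9), P_flop * B_eps / B_tau)           # I -> 0: P_flop * B_eps / B_tau
    print(power(B_tau), P_flop * (1 + B_eps / B_tau))    # peak at I = B_tau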
SLIDE 65

Power Cap

[Figure: power, relative to flop-power, vs. intensity (flop:byte); same as Slide 54.]

SLIDE 66

Power budgeting

SLIDE 67

Power budgeting

[Figure: power, normalized to const+cap, vs. intensity (single-precision FLOP:Byte) for the GTX Titan, Xeon Phi, and Arndale GPU at levels Full, 1/2, 1/4, and 1/8, with the per-panel annotations repeated from Slide 48.]

SLIDE 68

Power budgeting

[Figure: flops/time, normalized to 4.0 Tflop/s, vs. intensity (single-precision FLOP:Byte) for the GTX Titan, Xeon Phi, and Arndale GPU at levels Full, 1/2, 1/4, and 1/8, with the per-panel annotations repeated from Slide 48.]

SLIDE 69

Experimental Setup