

SLIDE 1

ON THE COMMUNICATION COMPLEXITY OF 3D FFTS AND ITS IMPLICATIONS FOR EXASCALE

Kent Czechowski, Chris McClanahan, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc

SLIDE 2

Exascale-ability, today: a 3D FFT with N = 4096³ takes 12.3×10¹² flops on 1.1 TB of data.

SLIDE 3

Exascale-ability, tomorrow: a 3D FFT with N = 131,072³ takes 0.574×10¹⁸ flops on 36 PB of data.

SLIDE 4

[Figure: 3D FFT + (Swim lane 1 vs. Swim lane 2) = ?]

SLIDE 5

Performance Model

SLIDE 6

3D FFT Decompositions. Problem size N = n×n×n.

SLIDE 7

Pencil Decomposition. Problem size N = n×n×n.
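As a minimal sketch (not the authors' code), the phase structure that the pencil decomposition distributes can be shown with numpy on a single node; in the distributed version, an all-to-all transpose between phases re-forms the pencils along the next axis on each node:

```python
import numpy as np

# A 3D FFT over an n x n x n grid is three phases of n^2 1D FFTs of
# size n, one phase per axis. The pencil decomposition assigns each
# node a bundle of these pencils (contiguous 1D lines) and inserts
# an all-to-all transpose between phases.
n = 64
x = np.random.rand(n, n, n) + 1j * np.random.rand(n, n, n)

y = np.fft.fft(x, axis=0)   # phase 1: pencils along the first axis
y = np.fft.fft(y, axis=1)   # phase 2: pencils along the second axis
y = np.fft.fft(y, axis=2)   # phase 3: pencils along the third axis

assert np.allclose(y, np.fft.fftn(x))   # agrees with the direct 3D FFT
```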

SLIDE 8

Distributed 3D FFT - Performance Model

SLIDE 9

Distributed 3D FFT - Performance Model

  • Cache capacity: Z; Memory BW: β_mem
  • Nodes: P; Compute throughput: C_node

Each node computes n²/P 1D FFTs of size n in each of the three phases.

Arithmetic computation time:
  T_flops = 3 × (n²/P) × (5 n log n) / C_node

Memory access time:
  T_mem ≈ 3 × (n²/P) × (A · n · max(log_Z n, 1)) / β_mem
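The two terms translate directly into a Python sketch; this illustrates the slide's model and is not the authors' code. Units are assumed consistent (flops/s and bytes/s), Z is taken in points, and A = 16 bytes per complex-double point is an assumption:

```python
import math

def t_flops(n, P, C_node):
    """Arithmetic time: 3 phases, each with n^2/P 1D FFTs of size n
    at ~5 n log2 n flops apiece, at C_node flops/s per node."""
    return 3 * (n**2 / P) * (5 * n * math.log2(n)) / C_node

def t_mem(n, P, beta_mem, Z, A=16.0):
    """Memory time per the slide's model. Z is cache capacity in
    points; A (assumption: 16 B per complex-double point) converts
    points to bytes for the bandwidth beta_mem in bytes/s."""
    return 3 * (n**2 / P) * (A * n * max(math.log(n, Z), 1.0)) / beta_mem
```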

SLIDE 10

(Same content as Slide 9.)

SLIDE 11

Distributed 3D FFT - Performance Model

(As Slide 9, adding the cache-miss lower bound for a size-n 1D FFT.)

Lower bound (Frigo 1999):
  Θ(1 + (n/L)(1 + log_Z n))

SLIDE 12

Distributed 3D FFT - Performance Model

  • Nodes: P; Network BW: β_link

Two all-to-all communications among √P nodes each.

Network time:
  T_net ≈ 2 × (n³/P) × 2/(3 β_link)
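Extending the sketch with the network term; the 2/(3 β_link) effective-bandwidth factor is read off the slide, and the A = 16 bytes/point conversion is again an assumption:

```python
def t_net(n, P, beta_link, A=16.0):
    """Network time per the slide's model: two sqrt(P)-node
    all-to-alls, each moving ~A*n^3/P bytes per node, with the
    2/(3*beta_link) factor as shown on the slide."""
    return 2 * (A * n**3 / P) * 2 / (3 * beta_link)
```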

SLIDE 13

Validation

SLIDE 14

3D FFT Software = Distributed 3D FFT Framework + Optimized 1D FFT Library
(p3dfft + FFTW, ESSL, or MKL)

SLIDE 15

3D FFT Software = Distributed 3D FFT Framework + Optimized 1D FFT Library
(p3dfft + FFTW, ESSL, MKL, or CUFFT)

SLIDE 16

Test Machines
  • Hopper: 6,392 nodes; Opteron 6100 CPU; processor peak 50.4 GF/s; 6 cores; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s
  • Keeneland: 120 nodes (3 GPUs per node); Tesla M2070 GPU; processor peak 515 GF/s; 448 cores; memory BW 144 GB/s; fast memory 2.7 MB; link BW 2 GB/s
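For reference, the slide's per-node parameters in the units the sketch functions above expect (expressing cache capacity in 16-byte complex points is an assumption):

```python
# Per-node parameters from the slide, in flops/s, bytes/s, and points.
hopper = dict(C_node=50.4e9, beta_mem=21.3e9,
              Z=6e6 / 16,        # 6 MB fast memory, in 16 B points
              beta_link=10e9)
keeneland = dict(C_node=515e9, beta_mem=144e9,
                 Z=2.7e6 / 16,   # 2.7 MB fast memory, in points
                 beta_link=2e9)

# Example: model times for n = 2048 spread over P = 1024 nodes.
n, P = 2048, 1024
print(t_flops(n, P, hopper["C_node"]),
      t_mem(n, P, hopper["beta_mem"], hopper["Z"]))
```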

SLIDE 17

Artifacts

SLIDE 18

[Figure: FFT performance on Keeneland: Gflop/s vs. problem size, GPU vs. CPU.]

SLIDE 19

[Figure: FFT performance on Keeneland: Gflop/s vs. problem size, GPU vs. CPU, with a 20% difference between them.]

SLIDE 20

Artifacts - PCIe Bottleneck

[Figure: Keeneland node diagram: two quad-core CPUs, each with DDR3 DRAM, connected by QPI to I/O hubs; three GPUs attached over PCIe x16; InfiniBand adapter on an I/O hub. Numbered steps trace the data path.]

SLIDE 21

Projections

SLIDE 22

Predicting 2020 Technology

[Figure: component performance relative to 2010 technology, 1990-2020, log scale; historical through 2010, projected thereafter. Projected 2010 to 2020 growth: Compute 59×, Cache 32×, Network BW 22×, Memory BW 10×.]
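The projection mechanics are simple compounding of historical doubling times (tabulated on Slide 34); a sketch:

```python
def project(value_2010, doubling_time_years, years=10):
    """Extrapolate a parameter by compounding its doubling time."""
    return value_2010 * 2 ** (years / doubling_time_years)

# Compute doubles every ~1.7 years: 2**(10/1.7) ~ 59x by 2020,
# e.g. a 50.4 GF/s CPU grows to ~3 TF/s.
print(project(50.4e9, 1.7))   # ~ 3.0e12
```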

SLIDE 23

Technology Extrapolation

2010:
  • CPU-based: processor peak 50.4 GF/s; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s; 79,400 processors
  • GPU-based: processor peak 515 GF/s; memory BW 144 GB/s; fast memory 2.7 MB; link BW 10 GB/s; 6,392 processors

2020:
  • CPU-based: processor peak 3 TF/s; memory BW 206 GB/s; fast memory 192 MB; link BW 218 GB/s; 1.3M processors
  • GPU-based: processor peak 30 TF/s; memory BW 1.4 TB/s; fast memory 86.4 MB; link BW 218 GB/s; 135,000 processors

SLIDE 24

3D FFTs at Exascale (2020, n = 21,000)

[Chart: time in seconds, split into memory and network components.]
  • GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s; 0.528 s (memory), 0.19 s (network)
  • CPU-1 (same peak): 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s; 0.112 s (memory), 0.116 s (network)

SLIDE 25

Performance vs Balance
  • CPU: processor 50.4 GFlop/s; cache 6 MB; memory BW 21.3 GB/s; flop/s : byte/s = 2.3
  • GPU: processor 515 GFlop/s (10.2× the CPU's); cache 2.7 MB; memory BW 144 GB/s (6.7× the CPU's); flop/s : byte/s = 3.6

The GPU offers better performance, but is less balanced.
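The balance numbers follow directly from the peak and bandwidth figures:

```python
# Machine balance = peak flop/s per byte/s of memory bandwidth.
cpu_balance = 50.4e9 / 21.3e9   # ~2.37 (slide reports 2.3)
gpu_balance = 515e9 / 144e9     # ~3.6

peak_ratio = 515 / 50.4         # ~10.2x more compute on the GPU
bw_ratio = 144 / 21.3           # ~6.8x more memory BW (slide: 6.7x)
```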

SLIDE 26

Two Costs: T_memory + T_network

[Figure: grid of nodes, each pairing a processor with its local memory, connected by a network.]

SLIDE 27

(Same as Slide 26.)

SLIDE 28

(Same as Slide 26.)

SLIDE 29

Impact of Machine Balance

[Figure: many thin nodes (Mem + Proc) vs. a few fat nodes (Memory + Processor).]

Number of processors: P = R_peak / C_node

  T_mem ≈ O( (1/R_peak) · (C_node / β_mem) · n³ log_Z n )

  T_net ≈ O( (1/R_peak^κ) · (C_node^κ / β_link) · n³ )
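A sketch of the shape of the network bound: at fixed machine peak R_peak, fatter nodes (larger C_node) mean fewer processors and, through the exponent κ, more network time; κ is topology-dependent and its value is not given on this slide, so it is left as an input:

```python
def t_net_asymptotic(n, R_peak, C_node, beta_link, kappa):
    """Shape of the slide's T_net bound: with P = R_peak / C_node
    nodes, T_net = O((C_node / R_peak)**kappa * n**3 / beta_link).
    kappa is a network-topology exponent; treat it as an input."""
    return (C_node / R_peak) ** kappa * n**3 / beta_link
```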

SLIDE 30

(Same as Slide 29.)

SLIDE 31

(Same as Slide 24.)

SLIDE 32

3D FFTs at Exascale (2020, n = 21,000)

[Chart: time in seconds, split into memory and network components.]
  • GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s; 0.528 s (memory), 0.19 s (network)
  • CPU-1 (same peak): 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s; 0.112 s (memory), 0.116 s (network)
  • CPU-2 (same total): 350k sockets; peak = 1.04 EF/s; bisection = 2.16 PB/s; 0.274 s (memory), 0.444 s (network)
  • CPU-3 (same overlap): 295k sockets; peak = 876 PF/s; bisection = 1.93 PB/s; 0.307 s (memory), 0.528 s (network)
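Plugging these design points into the earlier sketch functions shows how such bars are computed; absolute values will differ from the slide's, since the constant factors (A, the 2/3 network factor) are assumptions in the sketch:

```python
n = 21000
designs = [
    # (name, sockets, beta_mem, Z in points, beta_link) from Slide 23
    ("GPU",   131_000,   1.4e12, 86.4e6 / 16, 218e9),
    ("CPU-1", 1_000_000, 206e9,  192e6 / 16,  218e9),
]
for name, P, beta_mem, Z, beta_link in designs:
    print(name, t_mem(n, P, beta_mem, Z), t_net(n, P, beta_link))
```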

SLIDE 33

Questions?

SLIDE 34

Technology extrapolation:

Parameter                | 2010 value | Doubling time (yrs) | 10-yr factor | 2020 value
Peak: C_CPU              | 50.4 GF/s  | 1.7                 | 59.0×        | 3.0 TF/s
      C_GPU              | 515 GF/s   |                     |              | 30 TF/s
Cores:ᵃ ρ_CPU            | 6          | 1.87                | 40.7×        | 134
        ρ_GPU            | 448        |                     |              | 18k
Memory bandwidth: β_CPU  | 21.3 GB/s  | 3.0                 | 9.7×         | 206 GB/s
                  β_GPU  | 144 GB/s   |                     |              | 1.4 TB/s
Fast memory: Z_CPU       | 6 MB       | 2.0                 | 32.0×        | 192 MB
             Z_GPU       | 2.7 MBᵇ    |                     |              | 86.4 MB
Line size: L_CPU         | 64 B       | 10.2                | 2.0×         | 128 B
           L_GPU         | 128 B      |                     |              | 256 B
Link bandwidth: β_link   | 10 GB/s    | 2.25                | 21.8×        | 218 GB/s
Machine peak: R_peak     | 4 PF/s     | 1.0                 | 1000×        | 4 EF/s
System memory: E         | 635 TB     | 1.3                 | 208×         | 132 PB
Nodes (R_peak/C): P_CPU  | 79,400     | 2.4                 | 17.4×        | 1.3M
                  P_GPU  | 7,770      |                     |              | 135,000

ᵃ Counts are scaled to reflect a 4 PF/s machine.

SLIDE 35

Distributed 3D FFT - Performance Model

  • Cache capacity (Z); Nodes (P); Memory BW (β_mem); Network BW (β_link)

  T_mem ≈ 3 × (n²/P) × (A · n · max(log_Z n, 1)) / β_mem

  T_net ≈ 2 × (n³/P) × 2/(3 β_link)

SLIDE 36

3D FFT on GPU Cluster

DiGPUFFT (GPU) time breakdown:
  • Network: 36%
  • GPU↔CPU transfers: 27%
  • FFT computation: 21%
  • Data shuffle: 16%

SLIDE 37

[Figure: FFT performance on Hopper: time (s) vs. problem size on log scales, split into communication and computation.]