ON THE COMMUNICATION COMPLEXITY OF 3D FFTS AND ITS IMPLICATIONS FOR EXASCALE
Kent Czechowski, Chris McClanahan, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc
Exascale-ability
Today: N = 4096^3, 12.3 × 10^12 flops, 1.1 TB of data
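Nodes: P
Compute throughput: C_node
Memory BW: β_mem
\[ T_{\mathrm{flops}} = 3 \times \frac{n^2}{P} \times \frac{5 n \log n}{C_{\mathrm{node}}} \]
\[ T_{\mathrm{mem}} \approx 3 \times \frac{n^2}{P} \times \frac{A \, n \max(\log_Z n,\ 1.0)}{\beta_{\mathrm{mem}}} \]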
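Lower bound on cache misses for an n-point FFT with fast memory size Z and line size L (Frigo 1999):
\[ \Theta\!\left( 1 + \frac{n}{L}\left( 1 + \log_Z n \right) \right) \]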
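To get a feel for how this bound behaves, the sketch below evaluates the expression inside the Θ for a transform that fits in fast memory and one that does not. The function name, the choice to measure n, Z, and L in elements, and the 16-byte element size are assumptions made for illustration; the bound fixes only the growth rate, not the constant.

```python
import math

def fft_cache_miss_bound(n, Z, L):
    """Evaluate 1 + (n / L) * (1 + log_Z n), the expression inside the Theta bound.

    n: transform length, Z: fast-memory capacity, L: line size.
    All three are taken in the same unit (elements), which is an assumption.
    """
    return 1.0 + (n / L) * (1.0 + math.log(n) / math.log(Z))

# Illustrative sizes only: a 6 MB cache and 64 B lines, 16-byte elements assumed.
Z_elems = 6 * 2**20 // 16
L_elems = 64 // 16

in_cache = fft_cache_miss_bound(4096, Z_elems, L_elems)         # n < Z
out_of_cache = fft_cache_miss_bound(4096**2, Z_elems, L_elems)  # n > Z
```

When n is smaller than Z, log_Z n is below 1 and the bound stays under two passes over the data; once n exceeds Z the logarithmic term dominates. This is the same switch that the max(log_Z n, 1.0) term in T_mem encodes.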
\[ T_{\mathrm{net}} \approx 2 \times \frac{n^3}{P^{2/3} \, \beta_{\mathrm{link}}} \]
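The three terms of the model can be evaluated together, as in the minimal sketch below. Several choices here are assumptions rather than part of the slides: the function name, measuring Z in elements, a 16-byte element (double-precision complex), reading the flop-count logarithm as base 2, and the default A = 1.0, which is only a placeholder for the unspecified constant A.

```python
import math

def fft3d_times(n, P, C_node, beta_mem, beta_link, Z, A=1.0, word_bytes=16):
    """Return (T_flops, T_mem, T_net) in seconds for an n x n x n FFT on P nodes.

    C_node is in flop/s; beta_mem and beta_link are in bytes/s; Z is the
    fast-memory capacity in elements. A and word_bytes are assumed values.
    """
    pencils_per_node = n**2 / P                    # 1-D FFTs per node, per dimension
    t_flops = 3 * pencils_per_node * 5 * n * math.log2(n) / C_node
    log_z_n = math.log(n) / math.log(Z)
    t_mem = 3 * pencils_per_node * A * n * max(log_z_n, 1.0) * word_bytes / beta_mem
    t_net = 2 * n**3 * word_bytes / (P ** (2 / 3) * beta_link)
    return t_flops, t_mem, t_net

# Example: n = 4096 with the 2010 CPU parameters listed in the slides.
P, C_node = 79_400, 50.4e9
t_flops, t_mem, t_net = fft3d_times(
    4096, P, C_node, beta_mem=21.3e9, beta_link=10e9, Z=6 * 2**20 // 16
)
total_flops = t_flops * P * C_node  # should be about 12.3e12 for N = 4096^3
```

Multiplying T_flops back by P and C_node recovers the total operation count, which for n = 4096 agrees with the 12.3 × 10^12 flops quoted for N = 4096^3.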
[Figure: Gflop/s versus problem size, GPU and CPU.]
[Figure: node diagram. Labels: CPU 1 and CPU 2 (Core0 to Core3), DRAM over DDR3, QPI links, two I/O hubs, GPU 1 to GPU 3 on PCIe x16, Infiniband.]
[Figure: component performance relative to 2010 technology; relative performance (log scale) versus year, 1990 to 2020, historical then projected. Projected growth: Compute 59x, Cache 32x, Network BW 22x, Memory BW 10x.]
3-D FFTs at Exascale: Year = 2020, n = 21000
Communication time (seconds)

Configuration      Sockets  Peak       Bisection  Network  Memory
GPU                131k     3.98 EF/s  1.12 PB/s  0.528    0.19
CPU-1: Same Peak   1M       3.98 EF/s  5.29 PB/s  0.112    0.116
CPU: 50.4 GFlop/s, cache 6 MB, memory BW 21.3 GB/s
GPU: 515 GFlop/s, cache 2.7 MB, memory BW 144 GB/s
[Figure: grids of memory-processor node pairs, drawn at several node sizes.]
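\[ T_{\mathrm{mem}} \approx O\!\left( \frac{1}{R_{\mathrm{peak}}} \cdot \frac{C_{\mathrm{node}}}{\beta_{\mathrm{mem}}} \cdot n^3 \log_Z n \right) \]
\[ T_{\mathrm{net}} \approx O\!\left( \frac{1}{R_{\mathrm{peak}}^{\kappa}} \cdot \frac{C_{\mathrm{node}}^{\kappa}}{\beta_{\mathrm{link}}} \cdot n^3 \right) \]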
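One consequence of the network term is that, at a fixed machine peak and link bandwidth, larger nodes mean fewer nodes and therefore a longer network time. The sketch below works this out for two node types. The function name is hypothetical, and κ = 2/3 is an assumption taken from the P^{2/3} factor in the network model; the two throughputs are the projected 2020 GPU and CPU values from the slides.

```python
def net_time_ratio(C_a, C_b, kappa=2 / 3):
    """Ratio T_net(a) / T_net(b) for two node types at equal R_peak and beta_link.

    With T_net ~ (C_node / R_peak)**kappa * n**3 / beta_link, every factor
    except C_node cancels in the ratio. kappa = 2/3 is an assumed value.
    """
    return (C_a / C_b) ** kappa

# Projected 2020 node throughputs: GPU 30 TF/s, CPU 3.0 TF/s.
ratio = net_time_ratio(30e12, 3.0e12)
```

Under these assumptions a node that is ten times faster spends roughly 4.6 times longer in the network phase, which is in line with the GPU and same-peak CPU network times reported for the exascale scenario (0.528 s versus 0.112 s).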
3-D FFTs at Exascale: Year = 2020, n = 21000
Communication time (seconds)

Configuration         Sockets  Peak       Bisection  Network  Memory
GPU                   131k     3.98 EF/s  1.12 PB/s  0.528    0.19
CPU-1: Same Peak      1M       3.98 EF/s  5.29 PB/s  0.112    0.116
CPU-2: Same Total     350k     1.04 EF/s  2.16 PB/s  0.274    0.444
CPU-3: Same Overlap   295k     876 PF/s   1.93 PB/s  0.307    0.528
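As a sanity check, the network times in this scenario can be approximately reproduced from the network model alone. The sketch below assumes a 16-byte element (double-precision complex) and the projected 2020 link bandwidth of 218 GB/s; the function and variable names are hypothetical. CPU-1 is left out because its socket count is shown only as a rounded "1M".

```python
def t_net(n, P, beta_link, word_bytes=16):
    """Network time 2 * n^3 / (P^(2/3) * beta_link), converted to bytes.

    word_bytes = 16 is an assumption, not a value stated on the slide.
    """
    return 2 * n**3 * word_bytes / (P ** (2 / 3) * beta_link)

n = 21_000
beta_link = 218e9  # bytes/s, projected 2020 link bandwidth

sockets = {"GPU": 131e3, "CPU-2": 350e3, "CPU-3": 295e3}
net = {name: t_net(n, P, beta_link) for name, P in sockets.items()}

for name, seconds in net.items():
    print(f"{name}: {seconds:.3f} s")
```

With these inputs the computed values land within a few milliseconds of the plotted network times, and the small gaps are consistent with the socket counts being rounded on the slide.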
counts are scaled to reflect a 4 PF/s machine
Parameter                        2010 value   Doubling time (years)   10-year increase factor   2020 value
Peak: C_CPU                      50.4 GF/s    1.7                     59.0×                     3.0 TF/s
Peak: C_GPU                      515 GF/s     1.7                     59.0×                     30 TF/s
Cores [a]: ρ_CPU                 6            1.87                    40.7×                     134
Cores [a]: ρ_GPU                 448          1.87                    40.7×                     18k
Memory bandwidth: β_CPU          21.3 GB/s    3.0                     9.7×                      206 GB/s
Memory bandwidth: β_GPU          144 GB/s     3.0                     9.7×                      1.4 TB/s
Fast memory: Z_CPU               6 MB         2.0                     32.0×                     192 MB
Fast memory: Z_GPU               2.7 MB [b]   2.0                     32.0×                     86.4 MB
Line size: L_CPU                 64 B         10.2                    2.0×                      128 B
Line size: L_GPU                 128 B        10.2                    2.0×                      256 B
Link bandwidth: β_link           10 GB/s      2.25                    21.8×                     218 GB/s
Machine peak: R_peak             4 PF/s       1.0                     1000×                     4 EF/s
System memory: E                 635 TB       1.3                     208×                      132 PB
Nodes (R_peak / C): P_CPU        79,400       2.4                     17.4×                     1.3M
Nodes (R_peak / C): P_GPU        7,770        2.4                     17.4×                     135,000
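The increase factors and node counts in this table follow from two simple relations, sketched below: a quantity that doubles every d years grows by 2^(10/d) over ten years, and the node count is the machine peak divided by the per-node peak. The function names are hypothetical. Some rows, such as memory bandwidth, reproduce only approximately, presumably because the listed doubling times are rounded.

```python
def ten_year_factor(doubling_time_years, horizon_years=10.0):
    """Growth factor over the horizon for a quantity with the given doubling time."""
    return 2.0 ** (horizon_years / doubling_time_years)

def node_count(R_peak, C_node):
    """Number of nodes needed to reach machine peak R_peak with C_node per node."""
    return R_peak / C_node

compute_factor = ten_year_factor(1.7)    # listed as 59.0x
cache_factor = ten_year_factor(2.0)      # listed as 32.0x
link_factor = ten_year_factor(2.25)      # listed as 21.8x

P_cpu_2010 = node_count(4e15, 50.4e9)    # listed as 79,400
P_gpu_2010 = node_count(4e15, 515e9)     # listed as 7,770
```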
Nodes (P), Memory BW (β_mem), Network BW (β_net)
[Figure: DiGPUFFT (GPU) time breakdown. Network 36%, GPU<->CPU 27%, FFT Comp 21%, Data Shuffle 16%.]
[Figure: time (s) versus problem size on log scales, communication and computation.]