

SLIDE 1

ON THE COMMUNICATION COMPLEXITY OF 3D FFTS AND ITS IMPLICATIONS FOR EXASCALE

Kent Czechowski, Chris McClanahan, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc

SLIDE 2

Exascale-ability, today: a 3D FFT with N = 4096³ takes 12.3×10¹² flops on 1.1 TB of data.

SLIDE 3

Exascale-ability, tomorrow: a 3D FFT with N = 131,072³ takes 0.574×10¹⁸ flops on 36 PB of data.

SLIDE 4

[Figure: 3D FFT + (Swim lane 1 vs. Swim lane 2) = ?]

SLIDE 5

Performance Model

SLIDE 6

3D FFT Decompositions. Problem size N = n×n×n.

SLIDE 7

Pencil Decomposition. Problem size N = n×n×n.
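As a minimal sketch (not the authors' code), the phase structure that the pencil decomposition distributes can be shown with numpy on a single node; in the distributed version, an all-to-all transpose between phases re-forms the pencils along the next axis on each node:

```python
import numpy as np

# A 3D FFT over an n x n x n grid is three phases of n^2 1D FFTs of
# size n, one phase per axis. The pencil decomposition assigns each
# node a bundle of these pencils (contiguous 1D lines) and inserts
# an all-to-all transpose between phases.
n = 64
x = np.random.rand(n, n, n) + 1j * np.random.rand(n, n, n)

y = np.fft.fft(x, axis=0)   # phase 1: pencils along the first axis
y = np.fft.fft(y, axis=1)   # phase 2: pencils along the second axis
y = np.fft.fft(y, axis=2)   # phase 3: pencils along the third axis

assert np.allclose(y, np.fft.fftn(x))   # agrees with the direct 3D FFT
```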

SLIDE 8

Distributed 3D FFT - Performance Model

SLIDE 9

Distributed 3D FFT - Performance Model

  • Cache capacity: Z; Memory BW: β_mem
  • Nodes: P; Compute throughput: C_node

Each node computes n²/P 1D FFTs of size n in each of the three phases.

Arithmetic computation time:
  T_flops = 3 × (n²/P) × (5 n log n) / C_node

Memory access time:
  T_mem ≈ 3 × (n²/P) × (A · n · max(log_Z n, 1)) / β_mem
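The two terms translate directly into a Python sketch; this illustrates the slide's model and is not the authors' code. Units are assumed consistent (flops/s and bytes/s), Z is taken in points, and A = 16 bytes per complex-double point is an assumption:

```python
import math

def t_flops(n, P, C_node):
    """Arithmetic time: 3 phases, each with n^2/P 1D FFTs of size n
    at ~5 n log2 n flops apiece, at C_node flops/s per node."""
    return 3 * (n**2 / P) * (5 * n * math.log2(n)) / C_node

def t_mem(n, P, beta_mem, Z, A=16.0):
    """Memory time per the slide's model. Z is cache capacity in
    points; A (assumption: 16 B per complex-double point) converts
    points to bytes for the bandwidth beta_mem in bytes/s."""
    return 3 * (n**2 / P) * (A * n * max(math.log(n, Z), 1.0)) / beta_mem
```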

SLIDE 10

(Same content as Slide 9.)

SLIDE 11

Distributed 3D FFT - Performance Model

(As Slide 9, adding the cache-miss lower bound for a size-n 1D FFT.)

Lower bound (Frigo 1999):
  Θ(1 + (n/L)(1 + log_Z n))

SLIDE 12

Distributed 3D FFT - Performance Model

  • Nodes: P; Network BW: β_link

Two all-to-all communications among √P nodes each.

Network time:
  T_net ≈ 2 × (n³/P) × 2/(3 β_link)
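Extending the sketch with the network term; the 2/(3 β_link) effective-bandwidth factor is read off the slide, and the A = 16 bytes/point conversion is again an assumption:

```python
def t_net(n, P, beta_link, A=16.0):
    """Network time per the slide's model: two sqrt(P)-node
    all-to-alls, each moving ~A*n^3/P bytes per node, with the
    2/(3*beta_link) factor as shown on the slide."""
    return 2 * (A * n**3 / P) * 2 / (3 * beta_link)
```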

SLIDE 13

Validation

SLIDE 14

3D FFT Software = Distributed 3D FFT Framework + Optimized 1D FFT Library
(p3dfft + FFTW, ESSL, or MKL)

SLIDE 15

3D FFT Software = Distributed 3D FFT Framework + Optimized 1D FFT Library
(p3dfft + FFTW, ESSL, MKL, or CUFFT)

SLIDE 16

Test Machines
  • Hopper: 6,392 nodes; Opteron 6100 CPU; processor peak 50.4 GF/s; 6 cores; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s
  • Keeneland: 120 nodes (3 GPUs per node); Tesla M2070 GPU; processor peak 515 GF/s; 448 cores; memory BW 144 GB/s; fast memory 2.7 MB; link BW 2 GB/s
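For reference, the slide's per-node parameters in the units the sketch functions above expect (expressing cache capacity in 16-byte complex points is an assumption):

```python
# Per-node parameters from the slide, in flops/s, bytes/s, and points.
hopper = dict(C_node=50.4e9, beta_mem=21.3e9,
              Z=6e6 / 16,        # 6 MB fast memory, in 16 B points
              beta_link=10e9)
keeneland = dict(C_node=515e9, beta_mem=144e9,
                 Z=2.7e6 / 16,   # 2.7 MB fast memory, in points
                 beta_link=2e9)

# Example: model times for n = 2048 spread over P = 1024 nodes.
n, P = 2048, 1024
print(t_flops(n, P, hopper["C_node"]),
      t_mem(n, P, hopper["beta_mem"], hopper["Z"]))
```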

SLIDE 17

Artifacts

SLIDE 18

[Figure: FFT performance on Keeneland: Gflop/s vs. problem size, GPU vs. CPU.]

SLIDE 19

[Figure: FFT performance on Keeneland: Gflop/s vs. problem size, GPU vs. CPU, with a 20% difference between them.]

SLIDE 20

Artifacts - PCIe Bottleneck

[Figure: Keeneland node diagram: two quad-core CPUs, each with DDR3 DRAM, connected by QPI to I/O hubs; three GPUs attached over PCIe x16; InfiniBand adapter on an I/O hub. Numbered steps trace the data path.]

SLIDE 21

Projections

SLIDE 22

Predicting 2020 Technology

[Figure: component performance relative to 2010 technology, 1990-2020, log scale; historical through 2010, projected thereafter. Projected 2010 to 2020 growth: Compute 59×, Cache 32×, Network BW 22×, Memory BW 10×.]
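The projection mechanics are simple compounding of historical doubling times (tabulated on Slide 34); a sketch:

```python
def project(value_2010, doubling_time_years, years=10):
    """Extrapolate a parameter by compounding its doubling time."""
    return value_2010 * 2 ** (years / doubling_time_years)

# Compute doubles every ~1.7 years: 2**(10/1.7) ~ 59x by 2020,
# e.g. a 50.4 GF/s CPU grows to ~3 TF/s.
print(project(50.4e9, 1.7))   # ~ 3.0e12
```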

SLIDE 23

Technology Extrapolation

2010:
  • CPU-based: processor peak 50.4 GF/s; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s; 79,400 processors
  • GPU-based: processor peak 515 GF/s; memory BW 144 GB/s; fast memory 2.7 MB; link BW 10 GB/s; 6,392 processors

2020:
  • CPU-based: processor peak 3 TF/s; memory BW 206 GB/s; fast memory 192 MB; link BW 218 GB/s; 1.3M processors
  • GPU-based: processor peak 30 TF/s; memory BW 1.4 TB/s; fast memory 86.4 MB; link BW 218 GB/s; 135,000 processors

SLIDE 24

3D FFTs at Exascale (2020, n = 21,000)

[Chart: time in seconds, split into memory and network components.]
  • GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s; 0.528 s (memory), 0.19 s (network)
  • CPU-1 (same peak): 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s; 0.112 s (memory), 0.116 s (network)

SLIDE 25

Performance vs Balance
  • CPU: processor 50.4 GFlop/s; cache 6 MB; memory BW 21.3 GB/s; flop/s : byte/s = 2.3
  • GPU: processor 515 GFlop/s (10.2× the CPU's); cache 2.7 MB; memory BW 144 GB/s (6.7× the CPU's); flop/s : byte/s = 3.6

The GPU offers better performance, but is less balanced.
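The balance numbers follow directly from the peak and bandwidth figures:

```python
# Machine balance = peak flop/s per byte/s of memory bandwidth.
cpu_balance = 50.4e9 / 21.3e9   # ~2.37 (slide reports 2.3)
gpu_balance = 515e9 / 144e9     # ~3.6

peak_ratio = 515 / 50.4         # ~10.2x more compute on the GPU
bw_ratio = 144 / 21.3           # ~6.8x more memory BW (slide: 6.7x)
```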

SLIDE 26

Two Costs: T_memory + T_network

[Figure: grid of nodes, each pairing a processor with its local memory, connected by a network.]

SLIDE 27

(Same as Slide 26.)

SLIDE 28

(Same as Slide 26.)

SLIDE 29

Impact of Machine Balance

[Figure: many thin nodes (Mem + Proc) vs. a few fat nodes (Memory + Processor).]

Number of processors: P = R_peak / C_node

  T_mem ≈ O( (1/R_peak) · (C_node / β_mem) · n³ log_Z n )

  T_net ≈ O( (1/R_peak^κ) · (C_node^κ / β_link) · n³ )
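A sketch of the shape of the network bound: at fixed machine peak R_peak, fatter nodes (larger C_node) mean fewer processors and, through the exponent κ, more network time; κ is topology-dependent and its value is not given on this slide, so it is left as an input:

```python
def t_net_asymptotic(n, R_peak, C_node, beta_link, kappa):
    """Shape of the slide's T_net bound: with P = R_peak / C_node
    nodes, T_net = O((C_node / R_peak)**kappa * n**3 / beta_link).
    kappa is a network-topology exponent; treat it as an input."""
    return (C_node / R_peak) ** kappa * n**3 / beta_link
```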

SLIDE 30

(Same as Slide 29.)

SLIDE 31

(Same as Slide 24.)

SLIDE 32

3D FFTs at Exascale (2020, n = 21,000)

[Chart: time in seconds, split into memory and network components.]
  • GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s; 0.528 s (memory), 0.19 s (network)
  • CPU-1 (same peak): 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s; 0.112 s (memory), 0.116 s (network)
  • CPU-2 (same total): 350k sockets; peak = 1.04 EF/s; bisection = 2.16 PB/s; 0.274 s (memory), 0.444 s (network)
  • CPU-3 (same overlap): 295k sockets; peak = 876 PF/s; bisection = 1.93 PB/s; 0.307 s (memory), 0.528 s (network)
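Plugging these design points into the earlier sketch functions shows how such bars are computed; absolute values will differ from the slide's, since the constant factors (A, the 2/3 network factor) are assumptions in the sketch:

```python
n = 21000
designs = [
    # (name, sockets, beta_mem, Z in points, beta_link) from Slide 23
    ("GPU",   131_000,   1.4e12, 86.4e6 / 16, 218e9),
    ("CPU-1", 1_000_000, 206e9,  192e6 / 16,  218e9),
]
for name, P, beta_mem, Z, beta_link in designs:
    print(name, t_mem(n, P, beta_mem, Z), t_net(n, P, beta_link))
```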

SLIDE 33

Questions?

SLIDE 34

Technology extrapolation:

Parameter                | 2010 value | Doubling time (yrs) | 10-yr factor | 2020 value
Peak: C_CPU              | 50.4 GF/s  | 1.7                 | 59.0×        | 3.0 TF/s
      C_GPU              | 515 GF/s   |                     |              | 30 TF/s
Cores:ᵃ ρ_CPU            | 6          | 1.87                | 40.7×        | 134
        ρ_GPU            | 448        |                     |              | 18k
Memory bandwidth: β_CPU  | 21.3 GB/s  | 3.0                 | 9.7×         | 206 GB/s
                  β_GPU  | 144 GB/s   |                     |              | 1.4 TB/s
Fast memory: Z_CPU       | 6 MB       | 2.0                 | 32.0×        | 192 MB
             Z_GPU       | 2.7 MBᵇ    |                     |              | 86.4 MB
Line size: L_CPU         | 64 B       | 10.2                | 2.0×         | 128 B
           L_GPU         | 128 B      |                     |              | 256 B
Link bandwidth: β_link   | 10 GB/s    | 2.25                | 21.8×        | 218 GB/s
Machine peak: R_peak     | 4 PF/s     | 1.0                 | 1000×        | 4 EF/s
System memory: E         | 635 TB     | 1.3                 | 208×         | 132 PB
Nodes (R_peak/C): P_CPU  | 79,400     | 2.4                 | 17.4×        | 1.3M
                  P_GPU  | 7,770      |                     |              | 135,000

ᵃ Counts are scaled to reflect a 4 PF/s machine.

SLIDE 35

Distributed 3D FFT - Performance Model

  • Cache capacity (Z); Nodes (P); Memory BW (β_mem); Network BW (β_link)

  T_mem ≈ 3 × (n²/P) × (A · n · max(log_Z n, 1)) / β_mem

  T_net ≈ 2 × (n³/P) × 2/(3 β_link)

SLIDE 36

3D FFT on GPU Cluster

DiGPUFFT (GPU) time breakdown:
  • Network: 36%
  • GPU↔CPU transfers: 27%
  • FFT computation: 21%
  • Data shuffle: 16%

SLIDE 37

[Figure: FFT performance on Hopper: time (s) vs. problem size on log scales, split into communication and computation.]