for BlueGene/P Franz Franchetti 1 , Yevgen Voronenko 2 , Gheorghe - PowerPoint PPT Presentation

Carnegie Mellon
Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P
Franz Franchetti (1), Yevgen Voronenko (2), Gheorghe Almasi (3)
(1) Carnegie Mellon University and SpiralGen, Inc.; (2) AccuRay, Inc.; (3) IBM Research
Presenter: Richard M. Veras, Carnegie Mellon University
This work was supported by NSF, ONR, and ANL BlueGene/Q ESP.

The HPC Challenge’s Global FFT Benchmark
HPC Challenge
- New HPC benchmark suite
- HPL, STREAM, RandomAccess, PTRANS, FFT, DGEMM, and b_eff
- Better characterization than HPL
Global FFT
- Large, parallel 1D FFT across the whole machine
- Strongly limited by the machine’s communication system
- Baseline implementation: FFTE
http://icl.cs.utk.edu/hpcc/
Goal: Auto-generate an efficient Global FFT implementation

Outline
- Spiral: Library Generation
- MPI-Friendly Global FFT Algorithm
- Experimental Results
- Summary

Outline
- Spiral: Library Generation
- MPI-Friendly Global FFT Algorithm
- Experimental Results
- Summary
References:
- M. Püschel, F. Franchetti, Y. Voronenko: Spiral. Encyclopedia of Parallel Computing, D. A. Padua (Editor), 2011.
- Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

Spiral: Automating Library Tuning
- Traditionally: high-performance library optimized for a given platform
- Spiral approach: Spiral generates a high-performance library optimized for a given platform
- Comparable performance

Spiral’s Formal Framework
- Transform = matrix-vector multiplication
  Example: discrete Fourier transform (DFT): output vector (signal) = transform (matrix) × input vector (signal)
- Fast algorithm = sparse matrix factorization = SPL formula
  Example: Cooley-Tukey FFT algorithm

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$
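The Cooley-Tukey example can be checked numerically. The following is a minimal sketch, not Spiral's own code: it builds the four sparse factors of DFT_mn = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L as dense matrices and compares their product with the DFT matrix. It assumes the root of unity ω_n = e^{+2πj/n}, so that the twiddle matrix for m = n = 2 is diag(1, 1, 1, j); all function names are illustrative.

```python
import cmath

def dft(n):
    """n x n DFT matrix with entries w^(k*l), assuming w = exp(+2*pi*j/n)."""
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two dense matrices."""
    rb, cb = len(b), len(b[0])
    return [[a[r // rb][c // cb] * b[r % rb][c % cb]
             for c in range(len(a[0]) * cb)]
            for r in range(len(a) * rb)]

def matmul(a, b):
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]

def twiddle(m, n):
    """Diagonal twiddle matrix with entries w_mn^(i*j), i < m, j < n."""
    w = cmath.exp(2j * cmath.pi / (m * n))
    d = [w ** (i * j) for i in range(m) for j in range(n)]
    return [[d[r] if r == c else 0 for c in range(m * n)] for r in range(m * n)]

def stride_perm(m, n):
    """Stride permutation: output position i*n + j reads input position j*m + i."""
    p = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            p[i * n + j][j * m + i] = 1
    return p

def cooley_tukey(m, n):
    """Product (DFT_m (x) I_n) * T * (I_m (x) DFT_n) * L as a dense matrix."""
    f = matmul(kron(dft(m), identity(n)), twiddle(m, n))
    f = matmul(f, kron(identity(m), dft(n)))
    return matmul(f, stride_perm(m, n))

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    print(max_abs_diff(cooley_tukey(2, 2), dft(4)))  # close to zero
```

With the opposite sign convention (ω_n = e^{-2πj/n}) the same factorization holds with every j replaced by -j.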

Transforms and Breakdown Rules
- “Teaches” Spiral about existing algorithm knowledge (~200 journal papers)
Base case rules
- Teaches Spiral about FFT algorithms

One Approach for all Types of Parallelism
- Multithreading (Multicore)
- Vector SIMD (SSE, VMX/AltiVec, …)
- Message Passing (Clusters, MPP)
- Streaming/multibuffering (Cell)
- Graphics Processors (GPUs)
- Gate-level parallelism (FPGA)
- HW/SW partitioning (CPU + FPGA)

Translating a Formula into Code
[Figure: stages shown are Input, Output, OL formula, Σ-OL (obtained by rewriting), and C code.]

Synthesizing General Size Libraries
Input:
- Transform: [formula image]
- Algorithms: [formula image]
- Vectorization: 2-way SSE
- Threading: Yes
Output (generated by Spiral):
- Optimized library (10,000 lines of C++)
- For general input size (not a collection of fixed sizes)
High-Performance Library
- Vectorized (FFTW-like, MKL-like, ESSL-like)
- Multithreaded
- With runtime adaptation mechanism
- Performance competitive with hand-written code

FFT Needs Local FFTs and Global Transposes
Transpose → Local FFTs → Transpose → Local FFTs → Transpose
FFTs along rows and columns of a distributed square matrix
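The transpose / local FFT sequence above can be sketched in a single process. This is an illustrative sketch under stated assumptions, not the generated benchmark code: the "local FFT" is a naive O(n²) DFT, the sign convention is the common forward one (e^{-2πj/N}), the twiddle scaling between the two FFT stages is written as its own step, and all names are hypothetical. In a distributed run each transpose would be a global communication step.

```python
import cmath

def local_dft(v, sign=-1):
    """Naive DFT of one row; stands in for an optimized node-local FFT."""
    n = len(v)
    return [sum(v[l] * cmath.exp(sign * 2j * cmath.pi * k * l / n)
                for l in range(n))
            for k in range(n)]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

def six_step_fft(x, m, n, sign=-1):
    """DFT of length m*n computed as transposes and row-wise DFTs."""
    assert len(x) == m * n
    a = [x[r * m:(r + 1) * m] for r in range(n)]   # view x as n x m, row-major
    a = transpose(a)                               # 1. transpose to m x n
    a = [local_dft(row, sign) for row in a]        # 2. m local DFTs of length n
    a = [[a[i][k] * cmath.exp(sign * 2j * cmath.pi * i * k / (m * n))
          for k in range(n)]
         for i in range(m)]                        # 3. twiddle scaling
    a = transpose(a)                               # 4. transpose to n x m
    a = [local_dft(row, sign) for row in a]        # 5. n local DFTs of length m
    a = transpose(a)                               # 6. transpose to m x n
    return [v for row in a for v in row]

if __name__ == "__main__":
    x = [complex(i % 5, -(i % 3)) for i in range(16)]
    err = max(abs(p - q) for p, q in zip(six_step_fft(x, 4, 4), local_dft(x)))
    print(err)  # close to zero
```

The square case m = n matches the slide's "distributed square matrix"; the sketch also accepts rectangular splits.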

Linear Memory vs. Tiled Memory
- Column FFTs need contiguous columns
- Row FFTs need contiguous rows
- MPI all-to-all needs contiguous tiles
- Requires MPI all-to-allv (MPI_Alltoallv), explicit copy, or FFT on tiled memory
[Figure labels: Processor i, Processor i+1, Processor i+2; p node]
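To make the "explicit copy" option concrete, here is a small single-process simulation of a distributed transpose. It is a sketch under assumptions, not the benchmark's implementation: p simulated processes each own b consecutive rows of an n x n matrix (n = p·b), the all-to-all exchange is modeled by list indexing instead of MPI calls, and the function name is hypothetical. The packing step is what turns strided row segments into the contiguous tiles a plain all-to-all expects.

```python
def distributed_transpose(local_blocks):
    """Transpose a row-block-distributed square matrix.

    local_blocks[r] is the b x n row block owned by simulated process r,
    with n = p * b. Returns the row blocks of the transposed matrix.
    """
    p = len(local_blocks)
    b = len(local_blocks[0])

    # Pack: for each destination s, copy the b x b tile (columns s*b .. s*b+b-1)
    # out of the row-major block so that it becomes one contiguous unit.
    send = [[[row[s * b:(s + 1) * b] for row in block] for s in range(p)]
            for block in local_blocks]

    # All-to-all: tile s of process r becomes tile r of process s.
    recv = [[send[r][s] for r in range(p)] for s in range(p)]

    # Unpack: transpose each received tile locally and lay tiles side by side.
    out = []
    for s in range(p):
        rows = [[] for _ in range(b)]
        for r in range(p):
            tile = recv[s][r]          # tile[i][j] == A[r*b + i][s*b + j]
            for j in range(b):
                rows[j].extend(tile[i][j] for i in range(b))
        out.append(rows)
    return out

if __name__ == "__main__":
    n, p = 6, 3
    b = n // p
    a = [[i * n + j for j in range(n)] for i in range(n)]
    blocks = [a[r * b:(r + 1) * b] for r in range(p)]
    result = [row for block in distributed_transpose(blocks) for row in block]
    print(result == [list(col) for col in zip(*a)])
```

Running FFTs directly on the tiled layout, the third option on the slide, would avoid the unpack copy; that variant is not shown here.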

MPI All-to-all “Friendly” Six Step FFT
- Standard batch FFT library (on 1D contiguous memory)
- Specialized node FFT library: batch FFT + twiddles on 2D tiled memory
- Standard MPI all-to-all on contiguous data
- Node-local pre-processing (data scrambling)

SIMD Vectorization for FFT
- Standard FFT → automatic formula rewriting → short vector FFT for ν-way vector instructions
- Short vector FFT: vectorized arithmetic plus data reorganization (requires architecture-specific vectorization)
- Only 3 permutations require architecture-specific vectorization
- Works for any N = mn with ν² | N
References:
- F. Franchetti, M. Püschel: “Short Vector Code Generation for the Discrete Fourier Transform,” in Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS '03), pages 58-67.
- F. Franchetti, S. Kral, J. Lorenz, C. W. Ueberhuber: “Efficient Utilization of SIMD Extensions,” Proceedings of the IEEE, Special Issue on “Program Generation, Optimization, and Adaptation,” Vol. 93, No. 2, 2005, pages 409-425.

Rewriting for SMP Parallelization
- Two types of base cases: load-balanced, no false sharing
Reference:
- F. Franchetti, Y. Voronenko, M. Püschel: “FFT Program Generation for Shared Memory: SMP and Multicore,” in Proceedings of Supercomputing (SC), 2006.

BlueGene/P Intrepid at ANL
- 40 racks of Blue Gene/P
- 1024 nodes per rack
- One 850 MHz quad-core processor and 2 GB RAM per node
- Double FPU SIMD
- 3D torus network
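The machine totals implied by these figures follow from simple arithmetic, sketched below. The one value not stated on the slide is the per-core rate of 4 floating-point operations per cycle, which is an assumption here (a 2-way SIMD double FPU issuing fused multiply-adds); constant and function names are illustrative.

```python
RACKS = 40
NODES_PER_RACK = 1024
CORES_PER_NODE = 4
CLOCK_HZ = 850e6
RAM_PER_NODE_GB = 2
# Assumption (not on the slide): 2-way SIMD fused multiply-add = 4 flops/cycle/core.
FLOPS_PER_CYCLE = 4

def machine_totals():
    """Return (nodes, cores, peak Tflop/s, total RAM in TB) for the listed configuration."""
    nodes = RACKS * NODES_PER_RACK
    cores = nodes * CORES_PER_NODE
    peak_tflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
    ram_tb = nodes * RAM_PER_NODE_GB / 1024
    return nodes, cores, peak_tflops, ram_tb

if __name__ == "__main__":
    nodes, cores, peak_tflops, ram_tb = machine_totals()
    print(nodes, cores, round(peak_tflops, 3), ram_tb)
```

Under that assumption the configuration has 40,960 nodes, 163,840 cores, a peak of about 557 Tflop/s, and 80 TB of memory.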

HPC Challenge Global FFT on BlueGene/P
[Figure: 1D Global FFT performance [Gflop/s]. Annotated values: 6.4 Tflop/s; FFTE baseline: 5 Tflop/s.]
Reference:
- G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).
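For reading the Gflop/s axis: FFT benchmarks of this kind conventionally convert a measured time into a rate using the nominal operation count 5·N·log2(N) for a complex FFT of length N. The sketch below assumes that convention; the function name and the example inputs are hypothetical and are not measurements from the slide.

```python
import math

def fft_gflops(n, seconds):
    """Nominal FFT rate in Gflop/s, assuming 5 * n * log2(n) operations."""
    return 5.0 * n * math.log2(n) / seconds / 1e9

if __name__ == "__main__":
    # Hypothetical example: a length-2^20 transform finishing in one second.
    print(fft_gflops(2 ** 20, 1.0))
```

Because the operation count is nominal, two implementations with the same run time report the same rate regardless of how many operations they actually execute.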

Double FPU and Multicore Performance
[Figure, left panel: DFT, double precision, XL C compiler; performance [Mflop/s] vs. problem size (up to 8192). Single BlueGene/L CPU at 700 MHz, IBM T. J. Watson Research Center. BlueGene/L: custom FPU. SIMD vectorization. Series: SPIRAL C99 + 440d, SPIRAL C + 440d, SPIRAL C + 440, FFTW 2.1.5, GNU GSL.]
[Figure, right panel: DFT, double precision, XL C compiler; performance [Mflop/s] vs. problem size (up to 8192). Single BlueGene/P node (4 CPUs) at 850 MHz, Argonne National Laboratory. BlueGene/P: custom FPU + 4 cores. SIMD vectorization + multi-threading. Series: 4 threads (450d), single core (450d), single core (450), GSL 1.5.]
Speedup annotations in the plots: 2x and 3.5x.
References:
- F. Gygi, E. W. Draeger, M. Schulz, B. R. de Supinski, J. A. Gunnels, V. Austel, J. C. Sexton, F. Franchetti, S. Kral, C. W. Ueberhuber, J. Lorenz: Large-Scale Electronic Structure Calculations of High-Z Metals on the BlueGene/L Platform. In Proceedings of Supercomputing, 2006. Winner of the 2006 Gordon Bell Prize (Peak Performance Award).
- J. Lorenz, S. Kral, F. Franchetti, C. W. Ueberhuber: Vectorization Techniques for the Blue Gene/L double FPU. IBM Journal of Research and Development, Vol. 49, No. 2/3, 2005.
