Generating High-Performance General Size Linear Transform Libraries Using Spiral
Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, Markus Püschel
Carnegie Mellon University
HPEC, September 2008, Lexington, MA, USA
This work was supported by the DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel.

The Problem: Example DFT
[Plot: DFT performance on a standard desktop computer; best code vs. Numerical Recipes, with gaps marked 12x and 30x]
• Same operations count ≈ 4n log₂(n)
• Similar plots can be shown for all numerical problems

DFT Plot: Analysis
[Plot: Discrete Fourier Transform (DFT) on 2x Core2Duo 3 GHz; y-axis: performance in Gflop/s (0 to 30); x-axis: input size; annotated gains: multiple threads 2x, vector instructions 3x, memory hierarchy 5x]
• High-performance library development = nightmare
• Automation?

Idea: Textbook to Adaptive Library
Textbook FFT → ? → “FFTW”

Goal: Teach Computers to Write Libraries
Input:
• Transform: [formula missing]
• Algorithm: [formula missing]
• Hardware: 2-way SIMD + multithreaded
Spiral, key technologies:
• Layered domain-specific language
• Algorithm manipulation via rewriting
• Feedback-driven search
Output:
• FFTW equivalent library
• For general input size
• Vectorized and multithreaded
• Performance competitive
Result: Full automation

Contribution: General Size Library
• Fixed size: Transform T → Spiral → DFT of size 1024, called as dft_1024(X, Y);
• General size: Transform T → Spiral → library for DFT of any size, used as Env_1 dft(1024); dft.compute(X, Y);
Fundamentally different problems
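
The difference between the two interfaces can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `DftPlan` is a hypothetical name (not the generated `Env_1` class), `dft_2` is a hand-written stand-in for a fixed-size function such as `dft_1024`, the exponent sign follows the common e^(-2πi/n) convention, and the general-size path uses a naive O(n²) loop rather than a fast algorithm.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Fixed-size style: the size is part of the function name, so every size
// needs its own function. Hand-written 2-point DFT as a stand-in for
// generated code such as dft_1024.
void dft_2(const cplx* X, cplx* Y) {
    Y[0] = X[0] + X[1];
    Y[1] = X[0] - X[1];
}

// General-size style: the size is a runtime argument. DftPlan is a
// hypothetical name; the constructor precomputes a twiddle table once and
// compute() can then be called repeatedly.
class DftPlan {
public:
    explicit DftPlan(std::size_t n) : n_(n), w_(n) {
        const double pi = std::acos(-1.0);
        for (std::size_t k = 0; k < n; ++k)
            w_[k] = std::polar(1.0, -2.0 * pi * static_cast<double>(k) /
                                        static_cast<double>(n));
    }

    // Naive O(n^2) evaluation, used only to keep the sketch short.
    // X and Y must not overlap.
    void compute(const cplx* X, cplx* Y) const {
        for (std::size_t k = 0; k < n_; ++k) {
            cplx sum(0.0, 0.0);
            for (std::size_t j = 0; j < n_; ++j)
                sum += X[j] * w_[(j * k) % n_];
            Y[k] = sum;
        }
    }

private:
    std::size_t n_;
    std::vector<cplx> w_;
};
```

With these definitions, `DftPlan dft(1024); dft.compute(X, Y);` mirrors the general-size call on the slide. The fixed-size function can be fully unrolled at generation time, whereas the library must handle a size that is unknown until run time.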

Beyond Fourier Transform and FFTW
• Cooley-Tukey FFT → Spiral → “FFTW”
• “Cooley-Tukey” DCT → Spiral → “FCTW”
• Overlap-save/add FIR → Spiral → “FIRW”
• Fast Walsh Transform → Spiral → “WHTW”
• Fast Wavelet Transform → Spiral → “FWTW”
• Fast Hartley Transform → Spiral → “FHTW”

Examples of Generated Libraries
[Plots: RDFT, DHT, DCT2, DCT3, DCT4, DFT, Filter, Wavelet]
• 2-way vectorized, 2-threaded
• Most are faster than hand-written libs
• Code size: 8–120 KLOC or 0.5–5 MB
• Generation time: 1–3 hours
Total: 300 KLOC / 13.3 MB of code generated in < 20 hours from a few simple algorithm specs
Intel IPP library 6.0 will include Spiral generated code

I. Background
II. Library Generation
III. Experimental Results
IV. Conclusions and Future Work

Linear Transforms
• Mathematically: matrix-vector product, (output vector) = (transform matrix) · (input vector)
• Examples: [formulas missing]
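
As a concrete instance of a transform as a matrix-vector product, the following sketch builds the DFT matrix explicitly and multiplies it with an input vector. The names `dft_matrix` and `apply_transform` are made up for this example, and the exponent sign (w = e^(-2πi/n)) is an assumed convention; the slides do not preserve their own formula.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Matrix = std::vector<std::vector<cplx>>;

// Build the n x n DFT matrix with entries w^(j*k), where w = exp(-2*pi*i/n).
// The sign of the exponent is a convention assumed for this sketch.
Matrix dft_matrix(std::size_t n) {
    const double pi = std::acos(-1.0);
    Matrix T(n, std::vector<cplx>(n));
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            T[j][k] = std::polar(1.0, -2.0 * pi *
                                          static_cast<double>((j * k) % n) /
                                          static_cast<double>(n));
    return T;
}

// Generic transform application: y = T * x, costing O(n^2) operations.
std::vector<cplx> apply_transform(const Matrix& T, const std::vector<cplx>& x) {
    std::vector<cplx> y(T.size(), cplx(0.0, 0.0));
    for (std::size_t r = 0; r < T.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += T[r][c] * x[c];
    return y;
}
```

Any other linear transform fits the same `apply_transform` call once its matrix is supplied. Fast algorithms avoid this dense O(n²) product by exploiting structure in the matrix.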

Fast Algorithms, Example: 4-point FFT
• Fast algorithms = matrix factorizations
[Formula: the 4-point Fourier transform matrix (12 adds, 4 mults when multiplied with the input vector) written as a product of sparse factors costing 4 adds, 1 mult, and 4 adds; the factors are labeled Fourier transform, Kronecker product, identity, and permutation]
• SPL = mathematical, declarative specification
• Space of algorithms generated using breakdown rules
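
The formula on this slide did not survive extraction. The standard Cooley-Tukey factorization of the 4-point DFT, which matches the operation counts listed, can be written out as follows. The sign convention ω = e^(-2πi/4) = -i is an assumption, and the symbols T^4_2 (diagonal twiddle matrix) and L^4_2 (stride permutation) follow common usage rather than anything preserved in the source.

```latex
\[
\underbrace{\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -i & -1 & i \\
1 & -1 & 1 & -1 \\
1 & i & -1 & -i
\end{bmatrix}}_{\mathrm{DFT}_4}
=
\underbrace{\begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
1 & 0 & -1 & 0 \\
0 & 1 & 0 & -1
\end{bmatrix}}_{\mathrm{DFT}_2 \otimes I_2}
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & -i
\end{bmatrix}}_{T^4_2}
\underbrace{\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 1 & -1
\end{bmatrix}}_{I_2 \otimes \mathrm{DFT}_2}
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}}_{L^4_2}
\]
```

Read from right to left on an input vector: the permutation costs no arithmetic, each of the two butterfly factors costs 4 additions, and the diagonal factor costs 1 multiplication (by -i). Multiplying by the dense matrix on the left instead takes 12 additions and 4 multiplications.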

Examples of Breakdown Rules
[Formulas: DFT Cooley-Tukey rule; DCT “Cooley-Tukey” rule]
• “Teach” Spiral domain knowledge of algorithms. Never obsolete.
• Each rule leads to a library

How Library Generation Works
Transforms + breakdown rules, library target (FFTW, VSIPL, IPP FFT, ...)
→ Library Structure
→ Recursion step closure as Σ-SPL formulas
→ Library Implementation
→ High-performance library

Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT)
[Formula: DFT breakdown rule, illustrated for k=4]
Naive implementation:
    void dft(int n, cplx X[], cplx Y[]) {
        k = choose_factor(n); m = n/k;
        Z = permute(X)
        for i=0 to k-1
            dft_subvec(m, Z, Y, …)
        for i=0 to n-1
            Y[i] = Y[i]*T[i];
        for i=0 to m-1
            dft_strided(k, Y, Y, …)
    }
2 extra functions needed

Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT)
[Formula: DFT breakdown rule]
Naive implementation:
    void dft(int n, cplx X[], cplx Y[]) {
        k = choose_factor(n); m = n/k;
        Z = permute(X)
        for i=0 to k-1
            dft_subvec(m, Z, Y, …)
        for i=0 to n-1
            Y[i] = Y[i]*T[i];
        for i=0 to m-1
            dft_strided(k, Y, Y, …)
    }
2 extra functions needed
Optimized implementation:
    void dft(int n, cplx X[], cplx Y[]) {
        k = choose_factor(n); m = n/k;
        for i=0 to k-1
            dft_strided2(m, X, Y, …)
        for i=0 to m-1
            dft_strided3_scaled(k, Y, Y, T, …)
    }
2 extra functions needed
How to discover these specialized variants automatically?
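
For reference, the naive variant can be turned into a small runnable C++ sketch. It is a hand-written illustration, not generated code. The assumptions are: `choose_factor` returns the smallest nontrivial factor, prime sizes fall back to a direct O(n²) sum, the twiddle factors are computed on the fly instead of being read from a table T, and the exponent sign is e^(-2πi/n). The arguments elided with "…" on the slide are replaced here by an explicit stride parameter.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper: smallest nontrivial factor of n, or n if n is prime.
std::size_t choose_factor(std::size_t n) {
    for (std::size_t k = 2; k * k <= n; ++k)
        if (n % k == 0) return k;
    return n;
}

void dft(std::size_t n, const cplx* X, cplx* Y);

// DFT_k applied in place to the strided vector Y[0], Y[s], ..., Y[(k-1)*s].
void dft_strided(std::size_t k, cplx* Y, std::size_t s) {
    std::vector<cplx> in(k), out(k);
    for (std::size_t i = 0; i < k; ++i) in[i] = Y[i * s];
    dft(k, in.data(), out.data());
    for (std::size_t i = 0; i < k; ++i) Y[i * s] = out[i];
}

// Out-of-place DFT of size n; X and Y must not overlap.
void dft(std::size_t n, const cplx* X, cplx* Y) {
    const double pi = std::acos(-1.0);
    const std::size_t k = choose_factor(n);
    if (k == n) {  // base case (n prime or n == 1): direct summation
        for (std::size_t p = 0; p < n; ++p) {
            cplx sum(0.0, 0.0);
            for (std::size_t j = 0; j < n; ++j)
                sum += X[j] * std::polar(1.0, -2.0 * pi *
                                static_cast<double>((j * p) % n) /
                                static_cast<double>(n));
            Y[p] = sum;
        }
        return;
    }
    const std::size_t m = n / k;
    // Z = permute(X): gather the k interleaved subsequences of X.
    std::vector<cplx> Z(n);
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < m; ++j)
            Z[i * m + j] = X[j * k + i];
    // k DFTs of size m on contiguous subvectors.
    for (std::size_t i = 0; i < k; ++i)
        dft(m, &Z[i * m], &Y[i * m]);
    // Pointwise scaling by twiddle factors w_n^(i*j).
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < m; ++j)
            Y[i * m + j] *= std::polar(1.0, -2.0 * pi *
                                static_cast<double>(i * j) /
                                static_cast<double>(n));
    // m DFTs of size k at stride m.
    for (std::size_t j = 0; j < m; ++j)
        dft_strided(k, Y + j, m);
}
```

The sketch makes three separate passes over the data (permute, scale, strided DFTs). The optimized variant on the slide folds the permutation and the scaling into the two specialized recursive calls, which is exactly the kind of function the generator has to discover.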

Library Structure
• Input:
  - Breakdown rules
• Output:
  - Recursion step closure
  - Σ-SPL implementation of each recursion step
• Parallelization/Vectorization:
  - Adds additional breakdown rules
  - Orthogonal to the closure generation

Computing Recursion Step Closure
• Input: transform T and a breakdown rule
• Output: spawned recursion steps + Σ-SPL implementation
• Algorithm:
  1. Apply the breakdown rule
  2. Convert to Σ-SPL
  3. Apply loop merging + index simplification rules
  4. Extract recursion steps
  5. Repeat until closure is reached
Parametrization (not shown) derives the independent parameter set for each recursion step
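
The "repeat until closure is reached" loop is a fixed-point computation, which can be sketched with a worklist. The code below is a simplified model, not Spiral's implementation: recursion steps are represented by plain strings instead of Σ-SPL formulas, and steps 1 to 4 are collapsed into a caller-supplied `spawn` function. The `toy_spawn` table is made up for illustration; only its first row (dft calling dft_strided2 and dft_strided3_scaled) comes from the slides.

```cpp
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <vector>

// A recursion step is identified by a string signature (an assumption of
// this sketch; the real generator works on Σ-SPL formulas).
using Step = std::string;
// spawn(step) stands in for "apply the breakdown rule, convert to Σ-SPL,
// simplify, extract recursion steps".
using SpawnFn = std::function<std::vector<Step>(const Step&)>;

// Worklist fixed point: keep expanding newly discovered steps until no
// step spawns anything that has not been seen before.
std::set<Step> recursion_step_closure(const Step& root, const SpawnFn& spawn) {
    std::set<Step> closure{root};
    std::deque<Step> worklist{root};
    while (!worklist.empty()) {
        const Step current = worklist.front();
        worklist.pop_front();
        for (const Step& s : spawn(current)) {
            if (closure.insert(s).second)  // true only for unseen steps
                worklist.push_back(s);
        }
    }
    return closure;
}

// Hypothetical spawn table, for illustration only.
std::vector<Step> toy_spawn(const Step& s) {
    if (s == "dft") return {"dft_strided2", "dft_strided3_scaled"};
    if (s == "dft_strided2") return {"dft_strided2", "dft_strided3_scaled"};
    return {"dft_strided3_scaled"};
}
```

Calling `recursion_step_closure("dft", toy_spawn)` terminates because each step enters the worklist at most once. The toy table does not reproduce the closures reported in the slides.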

Recursion Step Closure Examples
• DFT (scalar): 4 mutually recursive functions
  - computed automatically
  - described using Σ-SPL formulas
• DCT4 (vectorized): 17 mutually recursive functions

Base Cases
• Base cases are called “codelets” in FFTW
• Why needed:
  - Closure is converted into mutually recursive functions
  - Recursion must be terminated
  - Larger base cases eliminate overhead from recursion
• How many:
  - In FFTW 3.2: 183 codelets for complex DFT (21 types), 147 codelets for real DFT (18 types)
  - In our generator: # codelet types [symbol missing] # recursion steps
• Obtained by using standard Spiral to generate fixed size code

Library Implementation
• Input:
  - Recursion step closure
  - Σ-SPL implementation of each recursion step (base cases + recursions)
• Output:
  - High-performance library
  - Target language: C++, Java, etc.
• Process:
  - Build library plan
  - Perform hot/cold partitioning
  - Generate target language code

Double Precision Performance: Intel Xeon 5160
2-way vectorization, up to 2 threads
[Plots: Complex DFT, Real DFT, DCT-2, WHT; series: generated library, FFTW, Intel IPP]

FIR Filter Performance
2- and 4-way vectorization, up to 2 threads
[Plots: 8-tap filter, 8-tap wavelet, 32-tap filter, 32-tap wavelet; series: generated library, Intel IPP]

2-D Transforms Performance
2- or 4-way vectorization, up to 2 threads
[Plots: 2-D DFT double, 2-D DFT single, 2-D DCT-2 double, 2-D DCT-2 single; series: generated library, FFTW, Intel IPP]

Customization: Code Size
[Plot: performance of generated libraries of different code sizes, labeled 13 KLOC, 3 KLOC, 2 KLOC, 1.3 KLOC, and 1 KLOC; FFTW: 150 KLOC]

Backend Customization: Java
[Plots: Complex DFT, Real DFT, DCT-2, FIR Filter; series: generated library, JTransforms]
Portable, but only 50% of scalar C performance

Summary
• Full automation: Textbook to adaptive library (FFT → Spiral → “FFTW”, FIR → Spiral → “FIRW”)
• Performance
  - SIMD
  - Multicore
• Customization
• Industry collaboration
  - Intel IPP 6.0 will include Spiral generated code
