Spiral: Computer Generation of Performance Libraries

Oct 29, 2023

SLIDE 1

Carnegie Mellon

Spiral

Computer Generation of Performance Libraries

José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team

Applications Platforms Performance

SLIDE 2

What is Spiral?

Traditionally: High performance library optimized for given platform

Spiral Approach: Spiral → High performance library optimized for given platform

Comparable performance

SLIDE 3

Main Idea: Program Generation

Architectural parameter (ν, p, μ): vector length, #processors, …

Kernel: problem size, algorithm choice

Model: common abstraction = spaces of matching formulas

[Figure labels: architecture space, algorithm space, abstraction, rewriting, defines, pick, search, optimization]

SLIDE 4

How Spiral Works

Problem specification (“DFT 1024” or “DFT”)
→ Algorithm Generation, Algorithm Optimization → algorithm
→ Implementation, Code Optimization → C code
→ Compilation, Compiler Optimizations → Fast executable
Search: fed by measured performance; controls the algorithm and implementation stages
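The search step in this pipeline can be illustrated with a small sketch. It is not Spiral's search: the slide says the search is driven by the performance of the generated executable, whereas this example swaps in a made-up cost model (`direct_cost` and the per-split formula are assumptions) so that it stays deterministic. It uses dynamic programming to pick, for a given size, how to split the transform recursively.

```python
from functools import lru_cache


def direct_cost(n):
    """Hypothetical cost of computing a size-n DFT as a dense matrix-vector product."""
    return n * n


@lru_cache(maxsize=None)
def best_plan(n):
    """Return (cost, plan) for size n under the assumed cost model.

    A plan is either the integer n (compute directly) or a pair
    (plan_m, plan_k) meaning: split n = m * k and recurse on both factors.
    """
    best_cost, best = direct_cost(n), n
    for m in range(2, n):
        if n % m != 0:
            continue
        k = n // m
        cost_m, plan_m = best_plan(m)
        cost_k, plan_k = best_plan(k)
        # Assumed model: k sub-transforms of size m, m sub-transforms of
        # size k, plus n twiddle-factor multiplications.
        cost = k * cost_m + m * cost_k + n
        if cost < best_cost:
            best_cost, best = cost, (plan_m, plan_k)
    return best_cost, best


if __name__ == "__main__":
    for size in (8, 64, 1024):
        print(size, best_plan(size))
```

Replacing the cost formula with a timing of generated code would turn this into an empirical search; memoization keeps each sub-size from being evaluated more than once.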

Spiral

Complete automation of the implementation and optimization task

Basic ideas:

Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms at a high level of abstraction
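A minimal sketch of what a declarative representation plus a rewriting rule can look like, assuming the standard Cooley-Tukey factorization DFT_n = (DFT_m ⊗ I_k) · T · (I_m ⊗ DFT_k) · L for n = m·k. The tuple encoding and all function names here are illustrative assumptions, not Spiral's own language. The rule produces a formula, and the formula is then expanded to a matrix and compared against the DFT definition.

```python
import cmath


def dft(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*r*c/n)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]


def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def kron(a, b):
    """Kronecker (tensor) product of two dense matrices."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def twiddle(m, k):
    """Diagonal matrix with exp(-2*pi*i*a*b/(m*k)) at position a*k + b."""
    n = m * k
    d = [cmath.exp(-2j * cmath.pi * a * b / n)
         for a in range(m) for b in range(k)]
    return [[d[r] if r == c else 0.0 for c in range(n)] for r in range(n)]


def stride_perm(m, k):
    """Permutation matrix that moves input index b*m + a to output a*k + b."""
    n = m * k
    p = [[0.0] * n for _ in range(n)]
    for a in range(m):
        for b in range(k):
            p[a * k + b][b * m + a] = 1.0
    return p


def cooley_tukey(n, m):
    """Rewrite rule: replace the symbol DFT_n by a product of four factors."""
    k = n // m
    return ("compose",
            ("tensor", ("DFT", m), ("I", k)),
            ("T", m, k),
            ("tensor", ("I", m), ("DFT", k)),
            ("L", m, k))


def to_matrix(f):
    """Interpret a formula (nested tuples) as a dense matrix."""
    tag = f[0]
    if tag == "DFT":
        return dft(f[1])
    if tag == "I":
        return identity(f[1])
    if tag == "T":
        return twiddle(f[1], f[2])
    if tag == "L":
        return stride_perm(f[1], f[2])
    if tag == "tensor":
        return kron(to_matrix(f[1]), to_matrix(f[2]))
    if tag == "compose":
        result = to_matrix(f[1])
        for g in f[2:]:
            result = matmul(result, to_matrix(g))
        return result
    raise ValueError(f"unknown formula tag: {tag}")


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    formula = cooley_tukey(8, 2)
    print(formula)
    print(max_abs_diff(to_matrix(formula), dft(8)))
```

Because the formula is data rather than code, the same rule can be applied again to the smaller `("DFT", m)` and `("DFT", k)` symbols, which is what makes a space of alternative algorithms available to search.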

SLIDE 5

Algorithms: Rules in Domain Specific Language

Viterbi Decoding (convolutional encoder, Viterbi decoder)
Linear Transforms
Matrix-Matrix Multiplication
Synthetic Aperture Radar (SAR) (preprocessing, matched filtering, interpolation, 2D iFFT)

SLIDE 6

One Approach for all Types of Parallelism

Multithreading (Multicore)
Vector SIMD (SSE, VMX/Altivec,…)
Message Passing (Clusters, MPP)
Streaming/multibuffering (Cell)
Graphics Processors (GPUs)
Gate-level parallelism (FPGA)
HW/SW partitioning (CPU + FPGA)

SLIDE 7

Example: Code Generation for Multicore CPUs

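Hardware abstraction: shared cache with cache lines

Tensor product: embarrassingly parallel operator

[Figure: block-diagonal matrix with four copies of A mapping x to y, one block each on Processor 0, Processor 1, Processor 2, Processor 3]

Permutation: problematic; may produce false sharing

[Figure: permutation mapping x to y]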
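The two cases on this slide can be sketched as follows, under stated assumptions. `apply_tensor_parallel` computes y = (I_p ⊗ A) x by handing each of p workers one contiguous chunk, and `shared_lines` reports which cache lines would be written by more than one worker for a given write pattern. Both function names and the cache-line model (a line is `line_size` consecutive elements) are hypothetical. Python threads are used only to show the data decomposition; they will not give a real speedup for this arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_block(a, chunk):
    """Dense matrix-vector product for one diagonal block."""
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, chunk)) for row in a]


def apply_tensor_parallel(a, x, p):
    """Compute y = (I_p tensor A) x; worker q handles x[q*n:(q+1)*n]."""
    n = len(a)
    if len(x) != p * n:
        raise ValueError("x must have length p * len(a)")
    chunks = [x[q * n:(q + 1) * n] for q in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        blocks = list(pool.map(lambda c: apply_block(a, c), chunks))
    return [v for block in blocks for v in block]


def shared_lines(writes_per_worker, line_size):
    """Cache lines written by more than one worker (false-sharing candidates)."""
    owners = {}
    for worker, indices in enumerate(writes_per_worker):
        for i in indices:
            owners.setdefault(i // line_size, set()).add(worker)
    return {line for line, who in owners.items() if len(who) > 1}


if __name__ == "__main__":
    butterfly = [[1, 1], [1, -1]]
    print(apply_tensor_parallel(butterfly, [1, 2, 3, 4], p=2))

    p, n, line_size = 4, 4, 4
    # Tensor product: each worker writes one contiguous block of y.
    contiguous = [range(q * n, (q + 1) * n) for q in range(p)]
    # Stride permutation: each worker writes every p-th element of y.
    strided = [range(q, p * n, p) for q in range(p)]
    print(shared_lines(contiguous, line_size))
    print(shared_lines(strided, line_size))
```

With contiguous blocks aligned to the line size, no line has two writers; with the strided pattern every line is touched by every worker, which is the false-sharing risk the slide attributes to permutations.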

SLIDE 8

Spiral: Meta-Tool to Autotuning Libraries

High-Performance Library “FFTW-like”

Spiral Library Generator

Input:

Transform:
Algorithms:
Vectorization: 2-way SSE
Threading: Yes

Output:

Optimized library (10,000 lines of C++)
For general input size (not collection of fixed sizes)
Vectorized
Multithreaded
With runtime adaptation mechanism
Performance competitive with hand-written code

SLIDE 9

Verification and Testing

Verify algorithms symbolically
Check rules through verification of instances
Check code empirically

DFT4([0,1,0,0]) = ?

DFT4_rnd([0.1,1.77,2.28,-55.3]) = ? DFT4([0.1,1.77,2.28,-55.3])
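The empirical checks can be sketched like this. `dft4_unrolled` is a hand-written stand-in for generated code (an assumption, not Spiral output). It is tested two ways: on basis vectors such as [0,1,0,0], where the result must equal one column of the DFT matrix, and on random inputs, where it is compared against a direct evaluation of the definition.

```python
import cmath
import random


def dft_reference(x):
    """Direct evaluation of the DFT definition, used as the trusted reference."""
    n = len(x)
    return [sum(x[c] * cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n))
            for r in range(n)]


def dft4_unrolled(x):
    """Hypothetical stand-in for generated code: unrolled radix-2 DFT of size 4."""
    a0, a1 = x[0] + x[2], x[0] - x[2]
    b0, b1 = x[1] + x[3], x[1] - x[3]
    return [a0 + b0, a1 - 1j * b1, a0 - b0, a1 + 1j * b1]


def max_error(u, v):
    return max(abs(p - q) for p, q in zip(u, v))


def check_basis_vectors(transform, n, tol=1e-9):
    """Basis vector e_k must map to column k of the DFT matrix."""
    for k in range(n):
        e = [1.0 if i == k else 0.0 for i in range(n)]
        column = [cmath.exp(-2j * cmath.pi * r * k / n) for r in range(n)]
        if max_error(transform(e), column) > tol:
            return False
    return True


def check_random(transform, n, trials=100, tol=1e-9, seed=0):
    """Compare against the reference on seeded random real inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-100.0, 100.0) for _ in range(n)]
        if max_error(transform(x), dft_reference(x)) > tol:
            return False
    return True


if __name__ == "__main__":
    print(dft4_unrolled([0, 1, 0, 0]))
    print(check_basis_vectors(dft4_unrolled, 4))
    print(check_random(dft4_unrolled, 4))
    x = [0.1, 1.77, 2.28, -55.3]
    print(max_error(dft4_unrolled(x), dft_reference(x)))
```

The basis-vector test covers the whole matrix for a linear transform, so it is exhaustive in exact arithmetic; the random test additionally exercises floating-point behavior on inputs like the one shown on the slide.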

SLIDE 10

Range: Cell Phone To Supercomputer

Samsung i9100 Galaxy S II
Dual-core ARM at 1.2GHz with NEON ISA
SIMD vectorization + multi-threading

BlueGene/P at Argonne National Laboratory
128k cores (quad-core CPUs) at 850 MHz
SIMD vectorization + multi-threading + MPI
Global FFT (1D FFT, HPC Challenge): 6.4 Tflop/s

[Figure: Global FFT performance [Gflop/s] on BlueGene/P]

G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

SLIDE 11

More Results: Spiral Outperforms Humans

[Figure panels: FFT on Multicore; FFT on FPGA; SAR; SDR. Axis: improvement]

SLIDE 12

More Information: www.spiral.net www.spiralgen.com