Application and Platform Adaptive Scientific Software


SLIDE 1

Texas Learning and Computation Center

Alliance Performance Expedition Workshop, March 14, 2002 — Lennart Johnsson

Application and Platform Adaptive Scientific Software

Lennart Johnsson, Dragan Mirkovic — University of Houston

SLIDE 2

Challenges

  • Diversity of execution environments
    – Growing complexity of modern microprocessors
      • Deep memory hierarchies
      • Out-of-order execution
      • Instruction-level parallelism
    – Growing diversity of platform characteristics
      • SMPs
      • Clusters (employing a range of interconnect technologies)
      • Grids (heterogeneity, wide range of characteristics)
  • Wide range of application needs
    – Dimensionality and sizes
    – Data structures and data types

SLIDE 3

Challenges

  • Algorithmic
    – Unfavorable data access patterns (large 2^n strides)
    – High efficiency of the algorithm
      • Low floating-point vs. load/store ratio
    – Imbalance between additions and multiplications
  • Version explosion
    – Verification
    – Maintenance

SLIDE 4

Approach

  • Automatic algorithm selection: polyalgorithmic functions
  • Code generation from high-level descriptions
  • Extensive application-independent compile-time analysis
  • Integrated performance modeling and analysis
  • Run-time composition dependent on the application and the execution environment
  • Automated installation process

SLIDE 5

Approach

  • Code preparation at installation (platform dependent)
  • Integrated performance models and databases
  • Algorithm selection at run-time from the set defined at installation
  • Program construction at run-time based on the application and performance predictions
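The run-time selection step can be sketched as a cost recursion over candidate factorizations. The cost model below is a toy stand-in, invented for illustration; the library's actual planner consults a performance database measured at installation.

```c
#include <math.h>

/* Hypothetical per-point cost of one radix-r butterfly stage.
 * A real planner would look this up in a measured database. */
static double cost_per_point(int radix)
{
    return 1.0 + 0.5 * radix;   /* toy model, invented for illustration */
}

/* Predicted cost of the cheapest plan for an FFT of size n built from
 * radix-2/4/8 stages.  Recurrence (Cooley-Tukey): an n-point FFT is
 * r FFTs of size n/r plus one combining stage touching all n points:
 *   cost(n) = r*cost(n/r) + n*cost_per_point(r).
 * The outermost radix of the best plan is returned in *first_radix. */
double best_plan(int n, int *first_radix)
{
    if (n == 1) { *first_radix = 1; return 0.0; }
    static const int radices[] = { 2, 4, 8 };
    double best = INFINITY;
    for (int i = 0; i < 3; i++) {
        int r = radices[i], sub;
        if (n % r != 0) continue;
        double c = r * best_plan(n / r, &sub) + n * cost_per_point(r);
        if (c < best) { best = c; *first_radix = r; }
    }
    return best;   /* INFINITY if no supported radix divides n */
}
```

With this toy model a 16-point transform is predicted cheapest as a radix-4 plan; swapping in measured codelet timings is what makes the choice platform adaptive.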

SLIDE 6

The UHFFT: An Adaptive FFT Library

  • Direct application of W_N requires O(N^2) operations
  • Fast algorithms use sparse factorizations of W_N:
    W_N = A_1 A_2 … A_k, where each A_i is sparse and requires O(N) operations, and k = O(log N)
  • The fact that W_N has many sparse factorizations is exploited for performance adaptivity

SLIDE 7

UHFFT Library Architecture

[Diagram] The UHFFT library consists of:

  • FFT code generator (algorithm abstraction): initializer, optimizer, scheduler, unparser
  • Library of FFT modules (generated code)
  • Initialization routines (fixed library code): Mixed-Radix (Cooley-Tukey), Prime Factor Algorithm, Split-Radix Algorithm, Rader's Algorithm
  • Execution routines (fixed library code)
  • Utilities (fixed library code)

SLIDE 8

Performance Tuning Methodology

Installation: input parameters (system specifics, user options) drive the UHFFT code generator, producing the library of FFT modules and the performance database.

Run-time: input parameters (size, dimensions, …) drive initialization, which selects the best plan (factorization); execution then calculates one or more FFTs.
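The bridge between the two phases is the performance database. A miniature of that idea (record layout and function name invented for illustration; the real database is richer):

```c
#include <stddef.h>

/* Installation-time benchmarking fills records like these; run-time
 * initialization queries them to rank codelets when building a plan. */
struct perf_record { int size; int stride; double mflops; };

/* Best measured MFLOPS for a codelet of the given size over all
 * benchmarked strides, or -1.0 if that size was never measured. */
double best_rate(const struct perf_record *db, size_t n, int size)
{
    double best = -1.0;
    for (size_t i = 0; i < n; i++)
        if (db[i].size == size && db[i].mflops > best)
            best = db[i].mflops;
    return best;
}
```

Keeping measurements per (size, stride) matters because, as the codelet plots later in the deck show, performance varies strongly with stride, not just size.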

SLIDE 9

Grid Application Development Software (GrADS)

[Diagram: Program Preparation System and Execution Environment]

SLIDE 10

Characteristics of Some Target Architectures

Processor        | Clock frequency | Peak performance | Cache structure
Alpha EV67/68    | 833 MHz         | 1.66 GFlops      | L1: 64K+64K, L2: 4M
AMD Athlon       | 1.4 GHz         | 1.4 GFlops       | L1: 64K+64K, L2: 256K
MIPS R1x000      | 500 MHz         | 1 GFlop          | L1: 32K+32K, L2: 4M
HP PA 8x00       | 750 MHz         | 3 GFlops         | L1: 1.5M+0.75M
IBM Power3/4     | 375 MHz         | 1.5 GFlops       | L1: 64K+32K, L2: 1-16M
Intel Itanium    | 800 MHz         | 3.2 GFlops       | L1: 16K+16K, L2: 92K, L3: 2-4M
PowerPC G4       | 867 MHz         | 867 MFlops       | L1: 32K+32K, L2: 256K, L3: 1-2M
Intel Pentium IV | 1.8 GHz         | 1.8 GFlops       | L1: 8K+8K, L2: 256K

SLIDE 11

Radix-4 codelet performance, 32-bit architectures

[Chart: Intel PIV 1.8 GHz, AMD Athlon 1.4 GHz, PowerPC G4 867 MHz]

SLIDE 12

Radix-8 codelet performance, 32-bit architectures

[Chart: Intel PIV 1.8 GHz, AMD Athlon 1.4 GHz, PowerPC G4 867 MHz]

SLIDE 13

Codelet performance 32-bit architectures

[Chart: Intel PIV 1.8 GHz, AMD Athlon 1.4 GHz, PowerPC G4 867 MHz]

SLIDE 14

Plan Performance, 32-bit Architectures

SLIDE 15

Itanium

  • Intel Itanium 800 MHz
    – 2 GB SDRAM
    – 2 MB of L3 cache
    – Bus speed: 133 MHz
    – Inherent parallelism in IA-64
    – Multiple FPUs with fused multiply-add instructions
    – Large number of registers provides good support for ILP
    – Relatively small L1 cache (16K+16K)
  • Large codelets do not perform very well
    – Complex scheduling problem
      • Cache reuse and parallelism have opposite requirements
  • OS: HP-UX 11i version 1.5
  • Compiler: gcc version 2.96
  • Compiler options: -O2 -fomit-frame-pointer -funroll-all-loops

SLIDE 16

Itanium Codelet performance examples

Best and “worst”

SLIDE 17

Itanium maximum codelet performance

SLIDE 18

Itanium minimum codelet performance

SLIDE 19

Alpha

  • Compaq Alpha 833 MHz
    – 2 GB SDRAM
    – Bus speed: 133 MHz
    – OS: Tru64 UNIX
    – Compiler: gcc version 2.96
    – Compiler options: -O2 -fomit-frame-pointer -funroll-all-loops
    – Complex-to-complex, out-of-place, double-precision transforms
    – Codelet sizes: 2–25, 32, 36, 45, 64
    – Strides: 2^[0–16]
    – Performance:
      • Absolute: 5·n·log2(n)/t_CPU, in "FLOPS"
      • Relative: absolute / (peak performance of the processor)
    – Peak performance: 1.66 GFLOPS

SLIDE 20

Alpha codelet performance example

SLIDE 21

Power3 codelet performance examples

SLIDE 22

Power3 plan performance example

[Chart: MFLOPS (50–350) vs. plan, for candidate factorizations such as 16·2, 8·4, 4·8, 2·2·2·4, …; 222 MHz]

SLIDE 23

Power3 plan performance

n = 2520 (PFA plan)

[Chart: "MFLOPS" (340–430) across the orderings of the PFA factors 5, 7, 8, 9; 222 MHz]

SLIDE 24

Power3 plan performance

[Chart: plan performance across FFT sizes; 800 MFlops peak; PFA sizes marked]

SLIDE 25

Advantages of the UHFFT Approach

  • Code generator written in C
  • Code is generated at installation
  • Codelet library is tuned to the underlying architecture
  • The whole library can be easily customized through parameter specification
    – No need for laborious manual changes to the source
    – The existing code-generation infrastructure allows easy library extensions
  • Future:
    – Inclusion of vector/streaming instruction-set extensions for various architectures
    – Implementation of new scheduling/optimization algorithms
    – New codelet types and better execution routines
    – Unified algorithm specification at all levels

SLIDE 26

Advantages of the UHFFT Approach

  • UHFFT employs more ways of combining codelets for execution than any other library
  • Better coverage of the space of possible algorithms
  • The PFA yields good performance where the Mixed-Radix (MR) algorithm performs poorly
    – The PFA requires fewer floating-point operations than MR
    – The data access pattern in the PFA is more complex than in MR, but large 2^n strides can be avoided
  • Example: IBM Power3
    – Good: 128-way set-associative L1 data and instruction caches
    – Bad: direct-mapped L2 cache, very vulnerable to cache thrashing despite the large cache size
    – The PFA execution model works better for large FFT sizes
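The PFA applies only when the size splits into pairwise-coprime factors; the natural such split is into prime-power components, e.g. 2520 = 8·9·5·7 (the size used in the Power3 plan chart). A sketch of that split (function name invented for illustration):

```c
/* Split n into its prime-power components, which are pairwise coprime
 * as the Prime Factor Algorithm requires.  Writes up to `max` factors
 * into out[] and returns how many components n has. */
int coprime_factors(int n, int *out, int max)
{
    int count = 0;
    for (int p = 2; p <= n; p++) {
        if (n % p != 0) continue;            /* p is the next prime factor */
        int pp = 1;
        while (n % p == 0) { pp *= p; n /= p; }
        if (count < max) out[count] = pp;    /* record the full prime power */
        count++;
    }
    return count;
}
```

Because the factors are coprime, the PFA can reindex the data so no twiddle-factor multiplications are needed between stages, which is where its lower floating-point count comes from.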

SLIDE 27

Contemplated Extensions

  • Extension to sine and cosine transforms
  • Further optimization of parallel aspects
  • Extension to multigrid methods
  • Extension to wavelets
  • Extension to convolution
  • Extension to Lagrangian finite elements
  • Extensions to spherical transforms
  • Toolbox for code generation and optimization for FFT,
  • Extension to parallel programming paradigms other than MPI

SLIDE 28

Related Efforts

  • Wassem
  • CMSSL
  • CWP
  • FFTW
  • Spiral
SLIDE 29

  • UHFFT Web site:

– http://www.cs.uh.edu/~mirkovic/uhfft

  • Publications

[1] Mirkovic, D. and Johnsson, S.L. (2001). Automatic Performance Tuning in the UHFFT Library. In Proceedings of the 2001 International Conference on Computational Science (ICCS 2001), San Francisco, CA, USA, May 2001. Lecture Notes in Computer Science 2073, Vol. 1, pp. 71–80.

[2] Mirkovic, D., Mahasoom, R., and Johnsson, S.L. (2000). An Adaptive Software Library for Fast Fourier Transforms. In Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, NM, pp. 215–224.

The UHFFT: An Adaptive FFT Library