
HPC Challenge Benchmark

Piotr Luszczek

University of Tennessee Knoxville

SC2004, November 6-12, 2004, Pittsburgh, PA


Contents

1. Introduction
2. Overview
3. Details
4. Submissions
5. Future Directions

Motivation and Sponsors for HPC Challenge

- Uniform benchmarking framework for performance tests
- Measure performance of various memory access patterns
- Testing peta-scale systems
  - Has to challenge all hardware aspects
- Analyzing productivity
  - Implementation in various programming languages
  - Architecture support
- Rules for running and verification
  - Base run required for submission
  - Optimized run possible
  - Verification
  - Reporting all aspects of the run: compiler, libraries, runtime environment
- Sponsors
  - High Productivity Computing Systems (HPCS): DARPA, DOE, NSF


Active Collaborators

- David Bailey, NERSC/LBL
- Jack Dongarra, UTK/ORNL
- Jeremy Kepner, MIT Lincoln Lab
- David Koester, MITRE
- Bob Lucas, ISI/USC
- John McCalpin, IBM Austin
- Rolf Rabenseifner, HLRS Stuttgart
- Daisuke Takahashi, Tsukuba


Testing Scenarios

[Diagram: three run modes over processors P1 ... PN connected by an interconnect — Local (a single process Pr runs the test), Embarrassingly Parallel (every process runs the test independently, no communication), and Global (all processes cooperate across the interconnect).]


Performance Bounds: Memory Access Patterns

[Diagram: benchmarks and applications plotted on axes of spatial vs. temporal locality — PTRANS and STREAM (high spatial, low temporal locality), HPL and DGEMM (high on both), FFT (high temporal locality), RandomAccess (low on both); applications such as CFD, Radar X-section, TSP, and DSP fall between these bounds.]


Effective performance peak: HPL and DGEMM

Effective performance peak (units: TFlop/s and GFlop/s)
- Global (entire system): High Performance Linpack (HPL)
- Local (single node): DGEMM
- Top500 November 2004: 16%-99% of peak (entries #99 and #309)

HPL – High Performance Linpack
- Written by Antoine Petitet (while at ICL)
- Non-trivial configuration (see the HPL.dat sketch below):
  - Global matrix size (≈ total memory)
  - Process grid (≈ square)
  - Blocking factor (for BLAS and BLACS)
- Described at http://www.netlib.org/benchmark/hpl/
- Runs well on CISC, RISC, VLIW, and vector computers
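
HPL takes its tuning parameters from an HPL.dat input file. The abridged excerpt below is a sketch only; the values shown are illustrative, and the problem size, blocking factor, and process grid must be tuned per machine:

    HPLinpack benchmark input file
    HPL.out      output file name (if any)
    1            # of problems sizes (N)
    60000        Ns    (global matrix size, sized to fill memory)
    1            # of NBs
    192          NBs   (blocking factor for BLAS/BLACS)
    1            # of process grids (P x Q)
    8            Ps    (keep P x Q roughly square)
    8            Qs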

DGEMM is matrix-matrix multiplication with double-precision reals; a minimal call sketch follows.
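
The operation DGEMM performs, C = alpha*A*B + beta*C, is exposed through the standard BLAS interface; a minimal sketch in C using the CBLAS binding (the function name local_dgemm and the square column-major layout are illustrative assumptions):

    #include <cblas.h>

    /* C = alpha*A*B + beta*C for square n x n column-major matrices.
       The call performs 2*n^3 flops; timing it gives a local GFlop/s
       rate in the spirit of the DGEMM test. */
    void local_dgemm(int n, double alpha, const double *a,
                     const double *b, double beta, double *c)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, a, n, b, n, beta, c, n);
    }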


Application Bandwidth: PTRANS and STREAM

Measures sustainable bandwidth for stride-one access
- Global: PTRANS
- Local: STREAM

PTRANS – parallel matrix transpose
- Repeated exchanges of large amounts of data
- Depends on global bisection bandwidth

STREAM – simple linear algebra vector kernels (sketched after this list)
- Well known and understood
- Known optimizations:
  - No cache allocation on Crays
  - Threading on IBMs
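
The four STREAM kernels (Copy, Scale, Add, Triad) are plain stride-one sweeps; a minimal sketch in C, assuming the array length N is chosen so each array is several times larger than the caches:

    #define N 2000000  /* pick so each array far exceeds cache size */
    static double a[N], b[N], c[N];

    /* Bandwidth for each kernel = bytes moved / elapsed time. */
    void stream_kernels(double scalar)
    {
        for (long j = 0; j < N; j++) c[j] = a[j];                 /* Copy  */
        for (long j = 0; j < N; j++) b[j] = scalar * c[j];        /* Scale */
        for (long j = 0; j < N; j++) c[j] = a[j] + b[j];          /* Add   */
        for (long j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* Triad */
    }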


Irregular Memory Updates: RandomAccess (GUPS)

Measures ability to hide latencies (local and global)
- Bandwidth (almost) irrelevant
- Important: capacity for simultaneous messages
- Irregularity in data access kills common hardware tricks

Many implementations (core update loop sketched after this list)
- MPI-1: non-blocking Send()/Recv()
- MPI-2: uses Put()/Get()
- UPC: much faster than all of the above

Verification procedure
- Up to 1% of updates may not be performed
- Allows loosening shared-memory consistency
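
At its core, RandomAccess XORs a pseudo-random stream into a large table at addresses derived from the stream itself. A minimal serial sketch in C; the polynomial and the four updates per table word follow the published kernel, but the seed is simplified here (the real benchmark computes per-process starting points):

    #include <stdint.h>

    #define POLY       0x0000000000000007ULL  /* primitive polynomial */
    #define TABLE_SIZE (1ULL << 22)           /* power-of-two table   */

    static uint64_t table[TABLE_SIZE];

    /* Serial sketch: 4 updates per table word, each an XOR at a
       pseudo-random address.  GUPS = updates / (seconds * 1e9). */
    void random_access(void)
    {
        uint64_t ran = 1;                     /* simplified seed */
        for (uint64_t i = 0; i < 4 * TABLE_SIZE; i++) {
            ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
            table[ran & (TABLE_SIZE - 1)] ^= ran;
        }
    }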


Fast Fourier Transform with FFTE

Complex 1-D, double-precision DFT
- 64-bit input vector size
- No mixed-stride memory accesses (as in multi-dimensional FFTs)

Scalability problems
- "Corner turns"
  - Global transpose with MPI_Alltoall() (sketched below)
  - Three transposes (data is never scrambled)
  - But time is not an issue – it runs fast
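
A corner turn amounts to an all-to-all block exchange: the distributed 1-D vector is viewed as an np x (n/np) block matrix, and each rank swaps one block with every other rank. A minimal sketch in C (the function name and layout are illustrative; each complex value is shipped as two doubles, blk complex values per rank pair):

    #include <mpi.h>

    /* Global transpose step of the distributed FFT: exchange one
       block with every rank, then fix each block up locally. */
    void corner_turn(double *send, double *recv, int blk, MPI_Comm comm)
    {
        MPI_Alltoall(send, 2 * blk, MPI_DOUBLE,
                     recv, 2 * blk, MPI_DOUBLE, comm);
    }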


Rules for Running and Reporting

Base run is required to submit to the database
- Reference MPI-1 implementation publicly available
- Each test is checked for correctness

Optimized runs may follow the base run
- Performance-critical (timed) portion of code can be changed
- Changes are to be described upon submission
  - Records effort (productivity) and architecture optimization techniques
- Correctness check doesn't change

Results submitted via web form
- Output file from the run
- Hardware information
- Programming environment: compilers, libraries
- Submission must be confirmed via email
- Data immediately available (no restrictions): HTML, XML, Microsoft Excel


Submission Statistics

- Army computing centers: ARL, ERDC, NAVO, ...
- Government labs: ORNL
- Hardware vendors/integrators
  - Chip makers: Cray, IBM, NEC
  - Integrators: Dalco, Scali
- Universities
  - Europe: Aachen/RWTH, Manchester
  - Asia: Tohoku (Sendai, Japan)
  - North America: Tennessee
- Supercomputing centers: DKRZ (Hamburg), HLRS (Stuttgart), OSC (Ohio), PSC (Pittsburgh)
- Countries: Germany, Japan, Norway, Switzerland, U.K., U.S.A.
- Interconnects: crossbar, fat tree, Omega, tori (1-D, 2-D)
- Processors: CISC, RISC, vector, VLIW


Planned Activities

- Code improvements
  - New languages: Fortran 90, UPC, CAF, ...
  - Automated configuration
- Website/submission improvements
- End-user tools for data analysis
- Reporting guidelines, especially for vendor comparisons
  - Cores
  - Processors
  - Threading: OpenMP; HyperThreading, Simultaneous Multithreading, ...; ViVA (Virtual Vector Architecture)
