
HPC Challenge Benchmark

Piotr Luszczek

University of Tennessee Knoxville

SC2004, November 6-12, 2004, Pittsburgh, PA


Contents

1. Introduction
2. Overview
3. Details
4. Submissions
5. Future Directions

Motivation and Sponsors for HPC Challenge

- Uniform benchmarking framework for performance tests
- Measure performance of various memory access patterns
- Testing peta-scale systems
  - Has to challenge all hardware aspects
- Analyzing productivity
  - Implementation in various programming languages
  - Architecture support
- Rules for running and verification
  - Base run required for submission
  - Optimized run possible
  - Verification
  - Reporting all aspects of the run: compiler, libraries, runtime environment
- Sponsors
  - High Productivity Computing Systems (HPCS): DARPA, DOE, NSF


Active Collaborators

- David Bailey, NERSC/LBL
- Jack Dongarra, UTK/ORNL
- Jeremy Kepner, MIT Lincoln Lab
- David Koester, MITRE
- Bob Lucas, ISI/USC
- John McCalpin, IBM Austin
- Rolf Rabenseifner, HLRS Stuttgart
- Daisuke Takahashi, Tsukuba


Testing Scenarios

[Diagram: three run modes over processors P1 ... PN connected by an interconnect — Local (a single process Pr runs the test), Embarrassingly Parallel (every process runs the test independently, no communication), and Global (all processes cooperate across the interconnect).]


Performance Bounds: Memory Access Patterns

[Diagram: benchmarks and applications plotted on axes of spatial vs. temporal locality — PTRANS and STREAM (high spatial, low temporal locality), HPL and DGEMM (high on both), FFT (high temporal locality), RandomAccess (low on both); applications such as CFD, Radar X-section, TSP, and DSP fall between these bounds.]


Effective performance peak: HPL and DGEMM

Effective performance peak (units: TFlop/s and GFlop/s)
- Global (entire system): High Performance Linpack (HPL)
- Local (single node): DGEMM
- Top500 November 2004: 16%-99% of peak (entries #99 and #309)

HPL – High Performance Linpack
- Written by Antoine Petitet (while at ICL)
- Non-trivial configuration (see the HPL.dat sketch below):
  - Global matrix size (≈ total memory)
  - Process grid (≈ square)
  - Blocking factor (for BLAS and BLACS)
- Described at http://www.netlib.org/benchmark/hpl/
- Runs well on CISC, RISC, VLIW, and vector computers
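
HPL takes its tuning parameters from an HPL.dat input file. The abridged excerpt below is a sketch only; the values shown are illustrative, and the problem size, blocking factor, and process grid must be tuned per machine:

    HPLinpack benchmark input file
    HPL.out      output file name (if any)
    1            # of problems sizes (N)
    60000        Ns    (global matrix size, sized to fill memory)
    1            # of NBs
    192          NBs   (blocking factor for BLAS/BLACS)
    1            # of process grids (P x Q)
    8            Ps    (keep P x Q roughly square)
    8            Qs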

DGEMM is matrix-matrix multiplication with double-precision reals; a minimal call sketch follows.
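
The operation DGEMM performs, C = alpha*A*B + beta*C, is exposed through the standard BLAS interface; a minimal sketch in C using the CBLAS binding (the function name local_dgemm and the square column-major layout are illustrative assumptions):

    #include <cblas.h>

    /* C = alpha*A*B + beta*C for square n x n column-major matrices.
       The call performs 2*n^3 flops; timing it gives a local GFlop/s
       rate in the spirit of the DGEMM test. */
    void local_dgemm(int n, double alpha, const double *a,
                     const double *b, double beta, double *c)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, a, n, b, n, beta, c, n);
    }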


Application Bandwidth: PTRANS and STREAM

Measures sustainable bandwidth for stride-one access
- Global: PTRANS
- Local: STREAM

PTRANS – parallel matrix transpose
- Repeated exchanges of large amounts of data
- Depends on global bisection bandwidth

STREAM – simple linear algebra vector kernels (sketched after this list)
- Well known and understood
- Known optimizations:
  - No cache allocation on Crays
  - Threading on IBMs
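
The four STREAM kernels (Copy, Scale, Add, Triad) are plain stride-one sweeps; a minimal sketch in C, assuming the array length N is chosen so each array is several times larger than the caches:

    #define N 2000000  /* pick so each array far exceeds cache size */
    static double a[N], b[N], c[N];

    /* Bandwidth for each kernel = bytes moved / elapsed time. */
    void stream_kernels(double scalar)
    {
        for (long j = 0; j < N; j++) c[j] = a[j];                 /* Copy  */
        for (long j = 0; j < N; j++) b[j] = scalar * c[j];        /* Scale */
        for (long j = 0; j < N; j++) c[j] = a[j] + b[j];          /* Add   */
        for (long j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* Triad */
    }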


Irregular Memory Updates: RandomAccess (GUPS)

Measures ability to hide latencies (local and global)
- Bandwidth (almost) irrelevant
- Important: capacity for simultaneous messages
- Irregularity in data access kills common hardware tricks

Many implementations (core update loop sketched after this list)
- MPI-1: non-blocking Send()/Recv()
- MPI-2: uses Put()/Get()
- UPC: much faster than all of the above

Verification procedure
- Up to 1% of updates may not be performed
- Allows loosening shared-memory consistency
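
At its core, RandomAccess XORs a pseudo-random stream into a large table at addresses derived from the stream itself. A minimal serial sketch in C; the polynomial and the four updates per table word follow the published kernel, but the seed is simplified here (the real benchmark computes per-process starting points):

    #include <stdint.h>

    #define POLY       0x0000000000000007ULL  /* primitive polynomial */
    #define TABLE_SIZE (1ULL << 22)           /* power-of-two table   */

    static uint64_t table[TABLE_SIZE];

    /* Serial sketch: 4 updates per table word, each an XOR at a
       pseudo-random address.  GUPS = updates / (seconds * 1e9). */
    void random_access(void)
    {
        uint64_t ran = 1;                     /* simplified seed */
        for (uint64_t i = 0; i < 4 * TABLE_SIZE; i++) {
            ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
            table[ran & (TABLE_SIZE - 1)] ^= ran;
        }
    }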


Fast Fourier Transform with FFTE

Complex 1-D, double-precision DFT
- 64-bit input vector size
- No mixed-stride memory accesses (as in multi-dimensional FFTs)

Scalability problems
- "Corner turns"
  - Global transpose with MPI_Alltoall() (sketched below)
  - Three transposes (data is never scrambled)
  - But time is not an issue – it runs fast
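
A corner turn amounts to an all-to-all block exchange: the distributed 1-D vector is viewed as an np x (n/np) block matrix, and each rank swaps one block with every other rank. A minimal sketch in C (the function name and layout are illustrative; each complex value is shipped as two doubles, blk complex values per rank pair):

    #include <mpi.h>

    /* Global transpose step of the distributed FFT: exchange one
       block with every rank, then fix each block up locally. */
    void corner_turn(double *send, double *recv, int blk, MPI_Comm comm)
    {
        MPI_Alltoall(send, 2 * blk, MPI_DOUBLE,
                     recv, 2 * blk, MPI_DOUBLE, comm);
    }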


Rules for Running and Reporting

Base run is required to submit to the database
- Reference MPI-1 implementation publicly available
- Each test is checked for correctness

Optimized runs may follow the base run
- Performance-critical (timed) portion of code can be changed
- Changes are to be described upon submission
  - Records effort (productivity) and architecture optimization techniques
- Correctness check doesn't change

Results submitted via web form
- Output file from the run
- Hardware information
- Programming environment: compilers, libraries
- Submission must be confirmed via email
- Data immediately available (no restrictions): HTML, XML, Microsoft Excel


Submission Statistics

- Army computing centers: ARL, ERDC, NAVO, ...
- Government labs: ORNL
- Hardware vendors/integrators
  - Chip makers: Cray, IBM, NEC
  - Integrators: Dalco, Scali
- Universities
  - Europe: Aachen/RWTH, Manchester
  - Asia: Tohoku (Sendai, Japan)
  - North America: Tennessee
- Supercomputing centers: DKRZ (Hamburg), HLRS (Stuttgart), OSC (Ohio), PSC (Pittsburgh)
- Countries: Germany, Japan, Norway, Switzerland, U.K., U.S.A.
- Interconnects: crossbar, fat tree, Omega, tori (1-D, 2-D)
- Processors: CISC, RISC, vector, VLIW


Planned Activities

- Code improvements
  - New languages: Fortran 90, UPC, CAF, ...
  - Automated configuration
- Website/submission improvements
- End-user tools for data analysis
- Reporting guidelines, especially for vendor comparisons
  - Cores
  - Processors
  - Threading: OpenMP; HyperThreading, Simultaneous Multithreading, ...; ViVA (Virtual Vector Architecture)
