SLIDE 1

HPCG: ONE YEAR LATER

Jack Dongarra & Piotr Luszczek, University of Tennessee/ORNL; Michael Heroux, Sandia National Labs


SLIDE 2

Confessions of an Accidental Benchmarker

  • Appendix B of the LINPACK Users' Guide.
  • Designed to help users extrapolate execution time for the LINPACK software package.
  • First benchmark report from 1977: Cray-1 to DEC PDP-10.
  • Started 36 years ago. The LINPACK code is based on a "right-looking" algorithm: O(n³) floating point operations and O(n²) data movement.

SLIDE 3

TOP500

  • In 1986 Hans Meuer started a list of supercomputers around the world, ranked by peak performance.
  • Hans approached me in 1992 about putting our lists together into the "TOP500".
  • The first TOP500 list appeared in June 1993.


SLIDE 4

HPL has a Number of Problems

  • HPL performance of computer systems is no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
  • Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.


SLIDE 5

Concerns

  • The gap between HPL predictions and real application performance will increase in the future.
  • A computer system with the potential to run HPL at an Exaflop may be a design that is very unattractive for real applications.
  • Future architectures targeted toward good HPL performance will not be a good match for most applications.
  • This leads us to think about a different metric.


SLIDE 6

HPL - Good Things

  • Easy to run.
  • Easy to understand.
  • Easy to check results.
  • Stresses certain parts of the system.
  • Historical database of performance information.
  • Good community outreach tool.
  • "Understandable" to the outside world.
  • "If your computer doesn't perform well on the LINPACK Benchmark, you will probably be disappointed with the performance of your application on the computer."


SLIDE 7

HPL - Bad Things

  • The LINPACK Benchmark is 37 years old.
  • The TOP500 (HPL) is 21.5 years old.
  • Floating point intensive: performs O(n³) floating point operations and moves O(n²) data.
  • No longer so strongly correlated to real apps.
  • Reports peak Flop/s (although hybrid systems see only 1/2 to 2/3 of peak).
  • Encourages poor choices in architectural features.
  • Overall usability of a system is not measured.
  • Used as a marketing tool.
  • Decisions on acquisition are made on one number.
  • Benchmarking for days wastes a valuable resource.


SLIDE 8

Ugly Things about HPL

  • Doesn't probe the architecture; only one data point.
  • Constrains the technology and architecture options for HPC system designers.
  • Skews system design.
  • Floating point benchmarks are not quite as valuable to some as data-intensive system measurements.


SLIDE 9

Many Other Benchmarks

  • TOP500
  • Green 500
  • Graph 500
  • Green/Graph
  • Sustained Petascale Performance
  • HPC Challenge
  • Perfect
  • ParkBench
  • SPEC-hpc
  • Livermore Loops
  • EuroBen
  • NAS Parallel Benchmarks
  • Genesis
  • RAPS
  • SHOC
  • LAMMPS
  • Dhrystone
  • Whetstone


SLIDE 10

Goals for New Benchmark

  • Augment the TOP500 listing with a benchmark that correlates with important scientific and technical apps not well represented by HPL.
  • Encourage vendors to focus on architecture features needed for high performance on those important scientific and technical apps:
  • Stress a balance of floating point and communication bandwidth and latency.
  • Reward investment in high-performance collective ops.
  • Reward investment in high-performance point-to-point messages of various sizes.
  • Reward investment in local memory system performance.
  • Reward investment in parallel runtimes that facilitate intra-node parallelism.
  • Provide an outreach/communication tool:
  • Easy to understand.
  • Easy to optimize.
  • Easy to implement, run, and check results.
  • Provide a historical database of performance information.
  • The new benchmark should have longevity.

SLIDE 11

Proposal: HPCG

  • High Performance Conjugate Gradient (HPCG).
  • Solves Ax = b; A large and sparse, b known, x computed.
  • An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
  • Patterns:
  • Dense and sparse computations.
  • Dense and sparse collectives.
  • Multi-scale execution of kernels via an MG (truncated) V cycle.
  • Data-driven parallelism (unstructured sparse triangular solves).
  • Strong verification and validation properties (via spectral properties of PCG).

SLIDE 12

Model Problem Description

  • Synthetic discretized 3D PDE (FEM, FVM, FDM).
  • Single DOF heat diffusion model.
  • Zero Dirichlet BCs; synthetic RHS such that the solution = 1.
  • Local domain: (nx × ny × nz).
  • Process layout: (npx × npy × npz).
  • Global domain: (nx · npx) × (ny · npy) × (nz · npz).
  • Sparse matrix:
  • 27 nonzeros/row in the interior.
  • 7 – 18 on the boundary.
  • Symmetric positive definite.

SLIDE 13

HPCG Design Philosophy

  • Relevance to a broad collection of important apps.
  • Simple, single number.
  • Few user-tunable parameters and algorithms:
  • The system, not benchmarker skill, should be the primary factor in the result.
  • Algorithmic tricks don't give us relevant information.
  • Algorithm (PCG) is a vehicle for organizing:
  • A known set of kernels.
  • Core compute and data patterns.
  • Tunable over time (as was HPL).
  • Easy to modify:
  • _ref kernels are called by the benchmark kernels.
  • Users can easily replace them with custom versions.
  • Clear policy: only kernels with _ref versions can be modified.

SLIDE 14

Example

  • Build HPCG with default MPI and OpenMP modes enabled, then run:

  export OMP_NUM_THREADS=1
  mpiexec -n 96 ./xhpcg 70 80 90

  • With nx = 70, ny = 80, nz = 90 and a process layout of npx = 4, npy = 4, npz = 6, this results in:
  • Global domain dimensions: 280-by-320-by-540.
  • Number of equations per MPI process: 504,000.
  • Global number of equations: 48,384,000.
  • Global number of nonzeros: 1,298,936,872.
  • Note: Changing OMP_NUM_THREADS does not change any of these values.
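The arithmetic behind these numbers is simple: each global dimension is the local dimension times the process count along that axis. A minimal sketch of that computation (illustrative only; HPCG's own setup code factors the rank count into npx × npy × npz itself):

    // Sketch: derive global problem size from local grid x process layout.
    #include <cstdio>

    int main() {
        long long nx = 70, ny = 80, nz = 90;   // local grid per MPI process
        long long npx = 4, npy = 4, npz = 6;   // process layout for 96 ranks

        long long gnx = nx * npx, gny = ny * npy, gnz = nz * npz;  // 280 x 320 x 540
        long long localEqs  = nx * ny * nz;    // 504,000
        long long globalEqs = gnx * gny * gnz; // 48,384,000

        std::printf("Global domain: %lld x %lld x %lld\n", gnx, gny, gnz);
        std::printf("Equations: %lld per process, %lld global\n", localEqs, globalEqs);
        return 0;
    }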

SLIDE 15

PCG ALGORITHM

u p0 := x0, r0 := b-Ap0 u Loop i = 1, 2, …

  • zi := M-1ri-1
  • if i = 1

§ pi := zi § ai := dot_product(ri-1, z)

  • else

§ ai := dot_product(ri-1, z) § bi := ai/ai-1 § pi := bi*pi-1+zi

  • end if
  • ai := dot_product(ri-1, zi) /dot_product(pi, A*pi)
  • xi+1 := xi + ai*pi
  • ri := ri-1 – ai*A*pi
  • if ||ri||2 < tolerance then Stop

u end Loop ¡ ¡ ¡ ¡ ¡ ¡

http://tiny.cc/hpcg 15
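Below is a minimal, self-contained C++ sketch of the same loop. It is illustrative only: it uses a tiny dense SPD matrix and a Jacobi (diagonal) stand-in for the preconditioner M, whereas HPCG applies the loop to its sparse model problem with an MG preconditioner.

    // Illustrative PCG in the shape of the loop above (not HPCG code).
    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<double>;

    double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j) s += a[j] * b[j];
        return s;
    }

    int main() {
        const int n = 4;
        // Small SPD test matrix (an assumption for the demo); b = A * ones.
        double A[4][4] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
        Vec x(n, 0.0), b(n, 0.0), r(n), z(n), p(n), Ap(n);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) b[i] += A[i][j];

        auto matvec = [&](const Vec& v, Vec& out) {
            for (int i = 0; i < n; ++i) {
                out[i] = 0.0;
                for (int j = 0; j < n; ++j) out[i] += A[i][j] * v[j];
            }
        };

        matvec(x, Ap);
        for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];       // r0 := b - A*x0
        double rzOld = 0.0;
        for (int it = 1; it <= 50; ++it) {
            for (int i = 0; i < n; ++i) z[i] = r[i] / A[i][i]; // z := M^-1 * r
            double rz = dot(r, z);
            if (it == 1) p = z;
            else for (int i = 0; i < n; ++i)
                p[i] = (rz / rzOld) * p[i] + z[i];             // beta := rz/rzOld
            matvec(p, Ap);
            double alpha = rz / dot(p, Ap);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            rzOld = rz;
            if (std::sqrt(dot(r, r)) < 1e-12) {                // ||r||_2 test
                std::printf("Converged in %d iterations\n", it);
                break;
            }
        }
        return 0;
    }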

SLIDE 16

Preconditioner

  • Hybrid geometric/algebraic multigrid:
  • Grid operators generated synthetically:
  • Coarsen by 2 in each of the x, y, z dimensions (a total 8x reduction at each level).
  • Use the same GenerateProblem() function for all levels.
  • Grid transfer operators:
  • Simple injection. Crude, but:
  • Requires no new functions, and no repeat use of other functions.
  • Cheap.
  • Smoother:
  • Symmetric Gauss-Seidel [ComputeSymGS()].
  • Except: perform a halo exchange prior to the sweeps.
  • The number of pre/post sweeps is a tuning parameter.
  • Bottom solve:
  • Right now, just a single call to ComputeSymGS().
  • If there are no coarse grids, the behavior is identical to HPCG 1.X.

  • Symmetric Gauss-Seidel preconditioner.
  • In Matlab that might look like:

  LA = tril(A); UA = triu(A); DA = diag(diag(A));
  x = LA \ y;
  x1 = y - LA*x + DA*x;  % Subtract off extra diagonal contribution
  x = UA \ x1;
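For comparison, here is a hedged C++ sketch of one symmetric Gauss-Seidel sweep over a CSR matrix. The CsrMatrix layout is an assumption for the demo; HPCG's reference ComputeSymGS operates on its own sparse matrix structure and performs a halo exchange before sweeping.

    // Illustrative symmetric Gauss-Seidel sweep: forward, then backward.
    #include <vector>

    struct CsrMatrix {                 // assumed minimal CSR layout
        int n;
        std::vector<int> rowPtr, col;
        std::vector<double> val;       // each row contains its diagonal
    };

    void symgs_sweep(const CsrMatrix& A, const std::vector<double>& r,
                     std::vector<double>& x) {
        for (int i = 0; i < A.n; ++i) {              // forward sweep
            double sum = r[i], diag = 1.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                if (A.col[k] == i) diag = A.val[k];
                else sum -= A.val[k] * x[A.col[k]];
            }
            x[i] = sum / diag;
        }
        for (int i = A.n - 1; i >= 0; --i) {         // backward sweep
            double sum = r[i], diag = 1.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                if (A.col[k] == i) diag = A.val[k];
                else sum -= A.val[k] * x[A.col[k]];
            }
            x[i] = sum / diag;
        }
    }

    int main() {
        // 2x2 SPD example: [[4,1],[1,3]]; one sweep of x toward A*x = r.
        CsrMatrix A{2, {0, 2, 4}, {0, 1, 0, 1}, {4.0, 1.0, 1.0, 3.0}};
        std::vector<double> r = {1.0, 2.0}, x(2, 0.0);
        symgs_sweep(A, r, x);
        return 0;
    }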

SLIDE 17

Problem Setup:
  • Construct Geometry.
  • Generate Problem.
  • Setup Halo Exchange.
  • Initialize Sparse Meta-data.
  • Call the user-defined OptimizeProblem function. This function permits the user to change data structures and perform permutations that can improve execution.

Validation Testing:
  • Perform spectral-properties PCG tests: convergence for 10 distinct eigenvalues, both with no preconditioning and with preconditioning.
  • Symmetry tests: the sparse MV kernel and the MG kernel.

Reference Sparse MV and Gauss-Seidel kernel timing:
  • Time calls to the reference versions of sparse MV and MG for inclusion in the output report.

Reference CG timing and residual reduction:
  • Time the execution of 50 iterations of the reference PCG implementation.
  • Record the reduction of the residual using the reference implementation. The optimized code must attain the same residual reduction, even if more iterations are required.

Optimized CG Setup:
  • Run one set of the optimized PCG solver to determine the number of iterations required to reach the residual reduction of the reference PCG.
  • Record the iteration count as numberOfOptCgIters.
  • Detect failure to converge.
  • Compute how many sets of the optimized PCG solver are required to fill the benchmark timespan. Record this as numberOfCgSets.

Optimized CG timing and analysis:
  • Run numberOfCgSets calls to the optimized PCG solver with numberOfOptCgIters iterations.
  • For each set, record the residual norm.
  • Record the total time.
  • Compute the mean and variance of the residual values.

Report results:
  • Write a log file for diagnostics and debugging.
  • Write a benchmark results file for reporting official information.

SLIDE 18

Example

  • Reference PCG: 50 iterations, residual drop of 1e-6.
  • Optimized PCG: run one set of iterations.
  • Multicolor ordering for symmetric Gauss-Seidel:
  • Better vectorization and threading.
  • But: takes 55 iterations to reach the residual drop of 1e-6.
  • Overhead:
  • The extra 5 iterations.
  • Computing the multicolor ordering.
  • Compute the number of sets we must run to fill the entire execution time:
  • 5h / time-to-compute-1-set.
  • This results in thousands of CG set runs.
  • Run and record the residual for each set.
  • Report the mean and variance (this accounts for the non-associativity of FP addition); see the sketch after this list.
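A small sketch of that final reporting step (names here are illustrative, not HPCG's actual code): each set ends with a residual norm, and the mean and variance over all sets expose run-to-run differences caused by non-associative floating point reductions.

    // Illustrative only: summarize per-set final residual norms.
    #include <cstdio>
    #include <vector>

    struct ResidualStats { double mean, variance; };

    ResidualStats summarize(const std::vector<double>& residuals) {
        double mean = 0.0;
        for (double r : residuals) mean += r;
        mean /= residuals.size();
        double var = 0.0;                 // nonzero variance flags
        for (double r : residuals)        // non-bitwise-reproducible runs
            var += (r - mean) * (r - mean);
        var /= residuals.size();
        return {mean, var};
    }

    int main() {
        std::vector<double> res = {9.99e-7, 1.01e-6, 1.00e-6};  // made-up values
        ResidualStats s = summarize(res);
        std::printf("mean %.3e, variance %.3e\n", s.mean, s.variance);
        return 0;
    }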

SLIDE 19

HPCG Parameters

  • Iterations per set: 50.
  • Total benchmark time for an official result:
  • 3600 seconds.
  • Anything less is reported as a "tuning" result.
  • Default time is 60 seconds.
  • Coarsening: 2x - 2x - 2x (8x total).
  • Number of levels:
  • 4 (including the finest level).
  • Requires nx, ny, nz divisible by 8.
  • Pre/post smoother sweeps: 1 each.
  • Setup time: amortized over 500 iterations.

SLIDE 20

Key Computation Data Patterns

  • Domain decomposition:
  • SPMD (MPI): across domains.
  • Thread/vector (OpenMP, compiler): within domains.
  • Vector ops (see the sketch after this list):
  • AXPY: simple streaming memory ops.
  • DOT/NRM2: blocking collectives.
  • Matrix ops:
  • SpMV: classic sparse kernel (option to reformat).
  • Symmetric Gauss-Seidel: sparse triangular sweep.
  • Exposes real application tradeoffs: threading & convergence vs. SPMD and scaling.
  • Enables leverage of new parallel patterns, e.g., futures.
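A hedged sketch of the two vector-op patterns, assuming MPI (compile with mpicxx): AXPY only streams local memory, while DOT must finish with a blocking collective, which is what makes it a scaling pressure point.

    // Sketch of the AXPY and DOT patterns named above (MPI assumed).
    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // AXPY: y <- alpha*x + y. Pure streaming memory traffic, no messages.
    void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
    }

    // DOT: local partial sum, then a blocking all-reduce collective.
    double dot(const std::vector<double>& x, const std::vector<double>& y) {
        double local = 0.0, global = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;   // every rank blocks here; NRM2 is sqrt(dot(x, x))
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        std::vector<double> x(1000, 1.0), y(1000, 2.0);
        axpy(0.5, x, y);        // y becomes 2.5 everywhere
        double d = dot(x, y);   // 2500 * number_of_ranks
        MPI_Finalize();
        return d > 0.0 ? 0 : 1;
    }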

SLIDE 21

Merits of HPCG

  • Includes major communication/computational patterns.
  • Represents a minimal collection of the major patterns.
  • Rewards investment in:
  • High-performance collective ops.
  • Local memory system performance.
  • Low-latency cooperative threading.
  • Detects/measures variances from bitwise reproducibility.
  • Executes kernels at several (tunable) granularities (see the sketch after this list):
  • nx = ny = nz = 104 gives nlocal = 1,124,864; 140,608; 17,576; 2,197.
  • ComputeSymGS with multicoloring adds one more level:
  • 8 colors.
  • Average size of a color = 275.
  • Size ratio (largest:smallest): 4096.
  • Provides a "natural" incentive to run a big problem.
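The granularity numbers above are just repeated halving: coarsening by 2 in x, y, and z shrinks the local grid 8x per MG level. A quick sketch of that arithmetic:

    // Sketch: local grid points at each of the 4 MG levels for nx=ny=nz=104.
    #include <cstdio>

    int main() {
        long long nx = 104, ny = 104, nz = 104;
        for (int level = 0; level < 4; ++level) {
            // Prints 1,124,864; 140,608; 17,576; 2,197 points per process.
            std::printf("level %d: %lld^3 = %lld points\n",
                        level, nx, nx * ny * nz);
            nx /= 2; ny /= 2; nz /= 2;   // 8x reduction per level
        }
        return 0;
    }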

SLIDE 22

User tuning options

  • MPI ranks vs. threads:
  • MPI-only: strong algorithmic incentive to use.
  • MPI+X: strong resource-management incentive to use.
  • Data structures:
  • Sparse and dense.
  • May not use knowledge of the special sparse structure.
  • May not exploit regularity in data structures (x or y must be accessed indirectly when computing y = Ax).
  • Overhead of analysis/transformation is counted against the time for ten 50-iteration sets (500 iterations).

SLIDE 23

User tuning options

  • Permutations:
  • Can permute the matrix for ComputeSpMV or ComputeMG, or both.
  • Overhead is counted as with data structure transformations.
  • Not permitted:
  • Algorithm changes to CG or MG that change behavior beyond permutations or FP arithmetic.
  • Change in FP precision.
  • Almost anything else not mentioned.

SLIDE 24

HPCG and HPL

  • We are NOT proposing to eliminate HPL as a metric.
  • Its historical importance and community outreach value are too important to abandon.
  • HPCG will serve as an alternate ranking of the TOP500.
  • Or maybe the top 50, for now.

SLIDE 25

HPCG 3.X Features

  • Truer C++ design:
  • We have gradually moved in that direction.
  • No one has complained.
  • Request permutation vectors:
  • Permits an explicit check against reference kernel results.
  • Kernels will remain the same:
  • No disruption of vendor investments.

SLIDE 26

Ongoing Discussion and Feedback

  • June 2013: discussed at ISC.
  • November 2013: discussed at SC13 in Denver during the TOP500 BoF.
  • January 2014: discussed at a DOE workshop.
  • March 2014: discussed at a workshop in DC.
  • June 2014: talk at an ISC session.

SLIDE 27

Signs of Uptake

  • Discussions with, and results from, every vendor.
  • Major, deep technical discussions with several.
  • Same with most LCFs.
  • SC'14 BoF on optimizing HPCG.
  • One ISC'14 and two SC'14 papers submitted (from Nvidia and Intel); 2 of 3 accepted.
  • Optimized results for x86, MIC-based, and Nvidia GPU-based systems.

SLIDE 28

HPL vs. HPCG: Bookends

  • Some see HPL and HPCG as "bookends" of a spectrum.
  • Application teams know where their codes lie on that spectrum.
  • They can gauge performance on a system using both the HPL and HPCG numbers.
  • The cost of HPL execution time is still an issue:
  • We need a lower-cost option; end-to-end HPL runs are too expensive.
  • Work in progress.

SLIDE 29

Site | Computer | Cores | HPL Rmax (Pflops) | HPL Rank | HPCG (Pflops)
---- | -------- | ----- | ----------------- | -------- | -------------
NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.9 | 1 | .580
RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx 8C + Custom | 705,024 | 10.5 | 4 | .427
DOE/OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD 16C + Nvidia Kepler GPU 14C + Custom | 560,640 | 17.6 | 2 | .322
DOE/OS Argonne Nat Lab | Mira, BlueGene/Q, Power BQC 16C 1.60GHz + Custom | 786,432 | 8.59 | 5 | .101#
Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler 14C + Custom | 115,984 | 6.27 | 6 | .099
Leibniz Rechenzentrum | SuperMUC, Intel 8C + IB | 147,456 | 2.90 | 12 | .0833
CEA/TGCC-GENCI | Curie thin nodes, Bullx B510, Intel Xeon 8C 2.7 GHz + IB | 79,504 | 1.36 | 26 | .0491
Exploration and Production Eni S.p.A. | HPC2, Intel Xeon 10C 2.8 GHz + Nvidia Kepler 14C + IB | 62,640 | 3.00 | 11 | .0489
DOE/OS Lawrence Berkeley Nat Lab | Edison, Cray XC30, Intel Xeon 12C 2.4GHz + Custom | 132,840 | 1.65 | 18 | .0439#
Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | 78,848 | .881* | 7 | .0161
Meteo France | Beaufix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 24,192 | .469 (.467*) | 79 | .0110
Meteo France | Prolix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 23,760 | .464 (.415*) | 80 | .00998
U of Toulouse | CALMIP, Bullx DLC, Intel Xeon 10C 2.8 GHz + IB | 12,240 | .255 | 184 | .00725
Cambridge U | Wilkes, Intel Xeon 6C 2.6 GHz + Nvidia Kepler 14C + IB | 3,584 | .240 | 201 | .00385
TiTech | TSUBAME-KFC, Intel Xeon 6C 2.1 GHz + IB | 2,720 | .150 | 436 | .00370

* Scaled to reflect the same number of cores.
# Unoptimized implementation.

SLIDE 30

Site | Computer | Cores | HPL Rmax (Pflops) | HPL Rank | HPCG (Pflops) | HPCG/HPL
---- | -------- | ----- | ----------------- | -------- | ------------- | --------
NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.9 | 1 | .580 | 1.7%
RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx 8C + Custom | 705,024 | 10.5 | 4 | .427 | 4.1%
DOE/OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD 16C + Nvidia Kepler GPU 14C + Custom | 560,640 | 17.6 | 2 | .322 | 1.8%
DOE/OS Argonne Nat Lab | Mira, BlueGene/Q, Power BQC 16C 1.60GHz + Custom | 786,432 | 8.59 | 5 | .101# | 1.2%
Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler 14C + Custom | 115,984 | 6.27 | 6 | .099 | 1.6%
Leibniz Rechenzentrum | SuperMUC, Intel 8C + IB | 147,456 | 2.90 | 12 | .0833 | 2.9%
CEA/TGCC-GENCI | Curie thin nodes, Bullx B510, Intel Xeon 8C 2.7 GHz + IB | 79,504 | 1.36 | 26 | .0491 | 3.6%
Exploration and Production Eni S.p.A. | HPC2, Intel Xeon 10C 2.8 GHz + Nvidia Kepler 14C + IB | 62,640 | 3.00 | 11 | .0489 | 1.6%
DOE/OS Lawrence Berkeley Nat Lab | Edison, Cray XC30, Intel Xeon 12C 2.4GHz + Custom | 132,840 | 1.65 | 18 | .0439# | 2.7%
Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | 78,848 | .881* | 7 | .0161 | 1.8%
Meteo France | Beaufix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 24,192 | .469 (.467*) | 79 | .0110 | 2.4%
Meteo France | Prolix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 23,760 | .464 (.415*) | 80 | .00998 | 2.4%
U of Toulouse | CALMIP, Bullx DLC, Intel Xeon 10C 2.8 GHz + IB | 12,240 | .255 | 184 | .00725 | 2.8%
Cambridge U | Wilkes, Intel Xeon 6C 2.6 GHz + Nvidia Kepler 14C + IB | 3,584 | .240 | 201 | .00385 | 1.6%
TiTech | TSUBAME-KFC, Intel Xeon 6C 2.1 GHz + IB | 2,720 | .150 | 436 | .00370 | 2.5%

* Scaled to reflect the same number of cores.
# Unoptimized implementation.


SLIDE 32

[Chart: "Comparison HPL & HPCG" plotting Flop/s (log scale) against rank for the top 20 systems; series shown: Rpeak and HPL.]

SLIDE 33

[Chart: "Comparison HPL & HPCG" plotting Flop/s (log scale) against rank for the top 20 systems; series shown: Rpeak, HPL, and HPCG.]

SLIDE 34

Optimized Versions of HPCG

  • Intel:
  • MKL has a packaged CPU version of HPCG.
  • See: http://bit.ly/hpcg-intel
  • A packaged Xeon Phi version is in progress, to be released soon.
  • Nvidia:
  • Massimiliano Fatica and Everett Phillips.
  • Binary available.
  • Contact Massimiliano: mfatica@nvidia.com
  • Bull:
  • Developed by CEA, which is requesting the release.

SLIDE 35

Nvidia has HPCG running on their ARM64+K20 systems.


SLIDE 36

HPCG Tech Reports

Toward a New Metric for Ranking High Performance Computing Systems
  • Jack Dongarra and Michael Heroux

HPCG Technical Specification
  • Jack Dongarra, Michael Heroux, and Piotr Luszczek
  • http://tiny.cc/hpcg

SANDIA REPORT
SAND2013-8752, Unlimited Release, Printed October 2013

HPCG Technical Specification

Michael A. Heroux, Sandia National Laboratories (1); Jack Dongarra and Piotr Luszczek, University of Tennessee

Prepared by Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Approved for public release; further dissemination unlimited.

(1) Corresponding author, maherou@sandia.gov