TOWARD A NEW (ANOTHER) METRIC FOR RANKING HIGH PERFORMANCE COMPUTING SYSTEMS
Jack Dongarra & Piotr Luszczek, University of Tennessee/ORNL; Michael Heroux, Sandia National Labs
See: http://tiny.cc/hpcg
Linpack software package
Began in the late 1970s, a time when floating point operations were expensive compared to memory references and data movement.
Since then: an orders-of-magnitude increase in the number of processors, along with corresponding improvements in architectures and algorithms.
That one run cost about $8,600.
[Figure: fraction of Top500 systems by reported HPL run time, from 1 hour up to 61 hours, for each list from June 1993 to June 2013.]
Top500 List(s) (# at No. 1) | Computer | r_max (Tflop/s) | n_max | Hours | MW
6/93 (1) | TMC CM-5/1024 | 0.060 | 52,224 | 0.4 |
11/93 (1) | Fujitsu Numerical Wind Tunnel | 0.124 | 31,920 | 0.1 | 1.0
6/94 (1) | Intel XP/S140 | 0.143 | 55,700 | 0.2 |
11/94 - 11/95 (3) | Fujitsu Numerical Wind Tunnel | 0.170 | 42,000 | 0.1 | 1.0
6/96 (1) | Hitachi SR2201/1024 | 0.220 | 138,240 | 2.2 |
11/96 (1) | Hitachi CP-PACS/2048 | 0.368 | 103,680 | 0.6 |
6/97 - 6/00 (7) | Intel ASCI Red | 2.38 | 362,880 | 3.7 | 0.85
11/00 - 11/01 (3) | IBM ASCI White, SP Power3 375 MHz | 7.23 | 518,096 | 3.6 |
6/02 - 6/04 (5) | NEC Earth-Simulator | 35.9 | 1,000,000 | 5.2 | 6.4
11/04 - 11/07 (7) | IBM BlueGene/L | | | 0.4 | 1.4
6/08 - 6/09 (3) | IBM Roadrunner, PowerXCell 8i 3.2 GHz | 1,105 | 2,329,599 | 2.1 | 2.3
11/09 - 6/10 (2) | Cray Jaguar, XT5-HE 2.6 GHz | 1,759 | 5,474,272 | 17.3 | 6.9
11/10 (1) | NUDT Tianhe-1A, X5670 2.93 GHz + NVIDIA | 2,566 | 3,600,000 | 3.4 | 4.0
6/11 - 11/11 (2) | Fujitsu K computer, SPARC64 VIIIfx | 10,510 | 11,870,208 | 29.5 | 9.9
6/12 (1) | IBM Sequoia BlueGene/Q | 16,324 | 12,681,215 | 23.1 | 7.9
11/12 (1) | Cray XK7 Titan, AMD + NVIDIA Kepler | 17,590 | 4,423,680 | 0.9 | 8.2
6/13 - 11/13 (?) | NUDT Tianhe-2, Intel Ivy Bridge & Xeon Phi | 33,862 | 9,960,000 | 5.4 | 17.8
Each MPI process owns a local grid of dimensions (nx × ny × nz), and the processes are arranged in an (npx × npy × npz) process grid, so the global grid has dimensions (nx · npx) × (ny · npy) × (nz · npz).
export OMP_NUM_THREADS=1
mpiexec -n 96 ./xhpcg 70 80 90
nx = 70, ny = 80, nz = 90; npx = 4, npy = 4, npz = 6
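As a worked example (the arithmetic is derived from the run above): the global grid is (70 · 4) × (80 · 4) × (90 · 6) = 280 × 320 × 540, i.e. 48,384,000 grid points, and therefore 48,384,000 rows in the global sparse matrix, since the generated problem has one equation per grid point.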
Benchmark phases:
1. Problem Setup: construct the synthetic problem and its data structures.
2. Optimize Problem: the OptimizeProblem function permits the user to change data structures and perform a permutation that can improve execution.
3. Validation Testing: symmetry properties of the operators, and CG tests on a modified matrix with known distinct eigenvalues.
4. Reference sparse MV and Gauss-Seidel kernel timing: time the reference versions of the sparse MV and symmetric Gauss-Seidel kernels for inclusion in the output report.
5. Reference CG timing and residual reduction: run the reference CG implementation and record the residual it attains; the optimized code must attain the same residual reduction, even if more iterations are required.
6. Optimized CG Setup: run the optimized CG solver to determine the number of iterations required to reach the reference residual reduction; record this as numberOfOptCgIters. Determine how many sets of the Optimized CG Solver are required to fill the benchmark timespan; record this as numberOfCgSets (a small arithmetic sketch follows this list).
7. Optimized CG timing and analysis: repeated calls to the optimized CG solver, each with numberOfOptCgIters iterations; record the residual norm of each set and the variance of the residual values.
8. Report results: write the results file for reporting, along with diagnostics and debugging output.
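A minimal arithmetic sketch of how numberOfCgSets follows from the benchmark timespan (this is not the HPCG source; the iteration count, per-set time, and required timespan are assumed values for illustration only):

```cpp
// Sketch: derive the number of optimized CG sets needed to fill the timespan.
#include <cmath>
#include <cstdio>

int main() {
  const int    numberOfOptCgIters = 53;      // iterations needed to match the reference residual (assumed)
  const double timePerOptCgSet    = 1.8;     // measured seconds for one optimized CG set (assumed)
  const double benchmarkTimespan  = 3600.0;  // required official run length in seconds (assumed)

  // Enough sets of the optimized solver must be run to fill the timespan.
  const int numberOfCgSets =
      static_cast<int>(std::ceil(benchmarkTimespan / timePerOptCgSet));

  std::printf("numberOfOptCgIters = %d\n", numberOfOptCgIters);
  std::printf("numberOfCgSets     = %d\n", numberOfCgSets);
  return 0;
}
```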
Problem Setup and Optimize Problem: after the problem is constructed, the user-callable OptimizeProblem function may change data structures and apply a permutation that can improve execution.
Symmetry tests: x^T A y = y^T A x and x^T M^{-1} y = y^T M^{-1} x (for the system matrix A and the preconditioner M).
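As an illustration of the test, here is a small self-contained C++ sketch (not HPCG's symmetry-test routine): for a symmetric A, x^T(Ay) and y^T(Ax) should agree to roundoff. The matrix and vectors are made up for the example.

```cpp
#include <cstdio>
#include <vector>
#include <cmath>

int main() {
  const int n = 4;
  // A small symmetric matrix, stored dense for simplicity.
  std::vector<std::vector<double>> A = {
      { 4, -1,  0,  0},
      {-1,  4, -1,  0},
      { 0, -1,  4, -1},
      { 0,  0, -1,  4}};
  std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
  std::vector<double> y = {0.5, -1.0, 2.5, 0.25};

  auto dot = [&](const std::vector<double>& u, const std::vector<double>& v) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
  };
  auto spmv = [&](const std::vector<double>& v) {
    std::vector<double> w(n, 0.0);
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) w[i] += A[i][j] * v[j];
    return w;
  };

  const double xAy = dot(x, spmv(y));
  const double yAx = dot(y, spmv(x));
  std::printf("x^T A y = %.15g, y^T A x = %.15g, |diff| = %.3g\n",
              xAy, yAx, std::fabs(xAy - yAx));
  return 0;
}
```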
Setup:
  LA = tril(A); UA = triu(A); DA = diag(diag(A));
Solve:
  x  = LA \ y;            % forward sweep: solve (D + L) x = y
  x1 = y - LA*x + DA*x;   % subtract off the extra diagonal contribution, leaving y - L*x
  x  = UA \ x1;           % backward sweep: solve (D + U) x = y - L*x
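For reference, a minimal serial C++ sketch of the same sweep on a CSR matrix (an illustration only, not the HPCG reference kernel; the CsrMatrix type and the 4×4 test matrix are assumptions of the example). With x initialized to zero, the forward pass reproduces the (D + L) solve and the backward pass the (D + U) solve shown above.

```cpp
#include <cstdio>
#include <vector>

struct CsrMatrix {                 // compressed sparse row storage
  int n;                           // number of rows
  std::vector<int> rowPtr;         // size n+1
  std::vector<int> col;            // column index of each nonzero
  std::vector<double> val;         // value of each nonzero
};

// One symmetric Gauss-Seidel sweep applied in place to x, with right-hand side r.
void symgs(const CsrMatrix& A, const std::vector<double>& r, std::vector<double>& x) {
  auto row_update = [&](int i) {
    double sum = r[i], diag = 1.0;
    for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      if (A.col[k] == i) diag = A.val[k];        // remember the diagonal entry
      else               sum -= A.val[k] * x[A.col[k]];
    }
    x[i] = sum / diag;
  };
  for (int i = 0; i < A.n; ++i)      row_update(i);  // forward sweep
  for (int i = A.n - 1; i >= 0; --i) row_update(i);  // backward sweep
}

int main() {
  // 4x4 tridiagonal test matrix: 4 on the diagonal, -1 off-diagonal.
  CsrMatrix A{4,
              {0, 2, 5, 8, 10},
              {0, 1,  0, 1, 2,  1, 2, 3,  2, 3},
              {4, -1, -1, 4, -1, -1, 4, -1, -1, 4}};
  std::vector<double> r = {1, 2, 3, 4}, x(4, 0.0);
  symgs(A, r, x);
  for (double xi : x) std::printf("%g ", xi);
  std::printf("\n");
  return 0;
}
```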
[Figure: Results for Cielo (dual-socket 8-core AMD Magny-Cours; each node is 2 × 8 cores at 2.4 GHz = 153.6 Gflop/s peak): theoretical peak, HPL Gflop/s, and HPCG Gflop/s versus number of nodes (1 to 32). Courtesy Mahesh Rajan, Sandia.]
[Figure: Results on 512 MPI processes. Courtesy Kalyan Kumaran, Argonne.]
Average DDOT MPI_Allreduce time: Edison 2.0 s; Red Sky 10.5 s.
Results courtesy of Ludovic Saugé, Bull, and M. Rajan and D. Doerfler, Sandia.
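To make the measurement concrete, here is a minimal MPI sketch of the DDOT pattern whose MPI_Allreduce time is reported above: each rank forms a local partial dot product and a single MPI_Allreduce produces the global sum. This is not the HPCG dot-product code; vector lengths and values are placeholders.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Each rank owns a local piece of the distributed vectors x and y.
  const int localN = 1 << 20;
  std::vector<double> x(localN, 1.0), y(localN, 2.0);

  double localDot = 0.0;
  for (int i = 0; i < localN; ++i) localDot += x[i] * y[i];

  // The global reduction is the communication step whose time was reported above.
  double t0 = MPI_Wtime();
  double globalDot = 0.0;
  MPI_Allreduce(&localDot, &globalDot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  double t1 = MPI_Wtime();

  if (rank == 0)
    std::printf("global dot = %g, allreduce time = %.6f s\n", globalDot, t1 - t0);

  MPI_Finalize();
  return 0;
}
```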
Results courtesy of Ian Karlin, Scott Futral, LLNL
[Figure: Summary of the "as is" HPCG code on the K computer: performance improvement from multithreading optimization, and measured time of the kernels (from the HPCG*.yaml output file), "as is" vs. tuned, broken down by DDOT, WAXPBY, SPMV, SYMGS, and total.]
Multigrid (V-cycle) preconditioner: added to enrich the design tradeoff space.
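To show where the timed kernels and the preconditioner apply fit together, here is a compact, self-contained sketch of a preconditioned CG iteration (not the HPCG source). For brevity it uses a dense 1-D Laplacian and a Jacobi (diagonal) preconditioner as a stand-in for HPCG's symmetric Gauss-Seidel / multigrid V-cycle; all names and sizes are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
using Vec = std::vector<double>;

int main() {
  const int n = 50;
  // SPD test matrix: tridiagonal 1-D Laplacian (2 on the diagonal, -1 off).
  auto spmv = [&](const Vec& p) {                    // SPMV kernel
    Vec v(n);
    for (int i = 0; i < n; ++i)
      v[i] = 2 * p[i] - (i > 0 ? p[i - 1] : 0.0) - (i < n - 1 ? p[i + 1] : 0.0);
    return v;
  };
  auto dot = [&](const Vec& a, const Vec& b) {       // DDOT kernel
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
  };
  auto waxpby = [&](double alpha, const Vec& x, double beta, const Vec& y) {  // WAXPBY kernel
    Vec w(n);
    for (int i = 0; i < n; ++i) w[i] = alpha * x[i] + beta * y[i];
    return w;
  };
  auto precond = [&](const Vec& r) {                 // stand-in for SYMGS / multigrid apply
    Vec z(n);
    for (int i = 0; i < n; ++i) z[i] = r[i] / 2.0;   // Jacobi: divide by the diagonal
    return z;
  };

  Vec b(n, 1.0), x(n, 0.0);
  Vec r = b;                                         // r = b - A*x with x = 0
  Vec z = precond(r);
  Vec p = z;
  double rz = dot(r, z);
  for (int iter = 0; iter < 50 && std::sqrt(dot(r, r)) > 1e-10; ++iter) {
    Vec Ap = spmv(p);                                // SPMV
    double alpha = rz / dot(p, Ap);                  // DDOT
    x = waxpby(1.0, x, alpha, p);                    // WAXPBY
    r = waxpby(1.0, r, -alpha, Ap);                  // WAXPBY
    z = precond(r);                                  // SYMGS / multigrid in HPCG
    double rzNew = dot(r, z);                        // DDOT
    p = waxpby(1.0, z, rzNew / rz, p);               // WAXPBY
    rz = rzNew;
  }
  std::printf("final residual norm = %.3e\n", std::sqrt(dot(r, r)));
  return 0;
}
```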
[Figure: Communication within each V-cycle: message size in bytes, number of point-to-point messages, and number of global collectives at each of the six V-cycle levels (log scale). Graph courtesy Future Technology Group, LBL.]
Related documents:
Toward a New Metric for Ranking High Performance Computing Systems
HPCG Technical Specification (Piotr Luszczek)