Benchmarking Huawei ARM Multi-Core Processors for HPC Workloads
Key Liao
Center for HPC, Shanghai Jiao Tong University
Jan 9th, 2019
About Me
Key Liao (廖秋承)
B.S. from Environment Science and Engineering, SJTU.
HPC Engineer of Center for High Performance Computing, SJTU.
Leader of ARM Research Team at CHPC, SJTU.
Supervisor of SJTU Student HPC Competition Team.
Main Research Area: Computer Architecture Theoretical Computer Performance Evaluation Performance Optimization
Email: keymorrislane@sjtu.edu.cn
Outline
➢ Kunpeng 920
  ➢ Floating-point Arithmetic
  ➢ Memory Subsystem
➢ Proxy Applications
  ➢ TeaLeaf
  ➢ SNAP
  ➢ CloverLeaf
➢ Real-world Applications
  ➢ GTC-P
Chips Information
[Figure: chip diagram labeled with cores (core 0, core 1, core 3, core 4) and groups (grp 0 to grp 5)]
Model                                      Intel Xeon Gold 6148   Hi1616          Kunpeng 920 (Engineering Sample)
Arch                                       Skylake-SP             ARM             ARM
Lithography                                14nm                   16nm            7nm
Main Frequency (GHz)                       2.4                    2.4             2.0
Num of Cores                               20                     32              48
Vectorization Ins/Width                    AVX512/512bits         ASIMD/128bits   ASIMD/128bits
Theoretical DP Peak Performance (GFLOPS)*  1536                   307.2           768
L3 Cache                                   1.375 MB               32MB (shared)   64MB (shared)
DRAM Support                               6 x DDR4-2666          4 x DDR4-2400   8 x DDR4-3200
TDP (W)                                    150                    70              150
Launch Time                                2017                   2016            2019
* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.
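The peak figures in the table follow the usual relation peak = cores x frequency x FLOPs per cycle per core. As a minimal sketch, the snippet below inverts that relation to show which per-core DP FLOPs-per-cycle value each table entry implies. The function name and dictionary layout are illustrative only; the numbers are copied from the table.

```python
# Back out the per-core DP FLOPs per cycle implied by the chip table,
# assuming peak_gflops = cores * frequency_ghz * flops_per_cycle_per_core.

CHIPS = {
    "Xeon Gold 6148": {"cores": 20, "ghz": 2.4, "peak_gflops": 1536.0},
    "Hi1616": {"cores": 32, "ghz": 2.4, "peak_gflops": 307.2},
    "Kunpeng 920": {"cores": 48, "ghz": 2.0, "peak_gflops": 768.0},
}


def implied_flops_per_cycle(cores, ghz, peak_gflops):
    """Per-core DP FLOPs per cycle implied by a chip-level peak (hypothetical helper)."""
    return peak_gflops / (cores * ghz)


for name, chip in CHIPS.items():
    fpc = implied_flops_per_cycle(chip["cores"], chip["ghz"], chip["peak_gflops"])
    print(f"{name}: {fpc:.1f} DP FLOPs/cycle/core")
```

The implied values can then be compared against the per-instruction FMA throughput measured for each chip.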
Platform Information
Platform               6148                       1616                    920
CPU                    Xeon Gold 6148             Hi1616                  Kunpeng 920
Number of Sockets      4                          4                       8
DRAM Size (GB)         2048                       256                     256
DRAM Frequency (MHz)   2666                       2400                    2666
Linux                  CentOS 7.5 Kernel 3.10.0   EulerOS Kernel 4.11.0   EulerOS Kernel 4.14.0
Software on 6148: All with Intel Parallel Studio XE Cluster Version 2019 Update 1 (Education License)
Software on 1616 and 920: Compiler GNU/GCC-8.2.0, MPI Library MVAPICH2-2.3, BLAS Library OpenBLAS 0.3.5
Floating-point Arithmetic
HPL Benchmark on Four Platforms (GFLOPS)
Platform   Single Socket   Dual Socket
2683       360.2           750
6148       955.32          2252.2
1616       220.2           475.2
920        310.5           670.7
[Figure: HPL Efficiency on Four Platforms, Single Socket and Dual Socket, axis 0% to 90%]
- Kunpeng 920 is 41.1% better than Hi1616, compared to a 165.3% increase from Haswell to Skylake in 3 years.
- HPL efficiency on Kunpeng 920 is around 40%, compared to more than 70% on other chips.
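The two bullets above rest on simple ratios. The following is a rough sketch that recomputes them from the single-socket HPL results and the theoretical DP peaks in the chip table. The helper names are made up for illustration, and only the two ARM chips are included for the efficiency calculation.

```python
# Single-socket HPL results (GFLOPS) as read from the HPL chart.
SINGLE_SOCKET_HPL = {"2683": 360.2, "6148": 955.32, "1616": 220.2, "920": 310.5}

# Theoretical DP peaks (GFLOPS) from the chip table, ARM chips only.
PEAK_DP = {"1616": 307.2, "920": 768.0}


def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak reached by HPL (hypothetical helper)."""
    return measured_gflops / peak_gflops


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.41 means 41% better."""
    return new / old - 1.0


arm_gain = relative_gain(SINGLE_SOCKET_HPL["920"], SINGLE_SOCKET_HPL["1616"])
x86_gain = relative_gain(SINGLE_SOCKET_HPL["6148"], SINGLE_SOCKET_HPL["2683"])
print(f"Hi1616 -> Kunpeng 920: {arm_gain:.1%}")
print(f"2683 -> 6148: {x86_gain:.1%}")
for chip, peak in PEAK_DP.items():
    print(f"HPL efficiency on {chip}: {hpl_efficiency(SINGLE_SOCKET_HPL[chip], peak):.1%}")
```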
Floating-point Arithmetic
FMA Instruction Throughput (per core)
              SP Scalar                    DP Scalar                    SP Vector                     DP Vector
Hi1616        2 ins/cycle, 9.596 GFLOPS    2 ins/cycle, 9.596 GFLOPS    1 ins/cycle, 19.194 GFLOPS    1 ins/cycle, 9.596 GFLOPS
Kunpeng 920   2 ins/cycle, 7.989 GFLOPS    2 ins/cycle, 7.989 GFLOPS    2 ins/cycle, 31.954 GFLOPS    1 ins/cycle, 7.989 GFLOPS
- Hi1616
  - 128-bit SIMD
  - SP: 614.4 GFLOPS
  - DP: 307.2 GFLOPS
- Hi1620 (Kunpeng 920)
  - 128-bit SIMD
  - SP: 1,536 GFLOPS
  - DP: 384 GFLOPS
- Throughput of DP SIMD instructions is limited.
- Not a good chip for intense DP computation.
- DP computation is not as important as people used to think.
- Trend toward SVE and VLA (vector-length-agnostic) programming.
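The chip-level SP and DP figures in the list above can be reproduced from the measured FMA issue rate, the SIMD width and the core count. The sketch below assumes each FMA counts as two floating-point operations per lane and uses the nominal frequencies (2.4 GHz and 2.0 GHz) from the chip table; the function name is illustrative.

```python
def peak_gflops(cores, ghz, fma_per_cycle, simd_bits, element_bits):
    """Chip-level peak from vector FMA throughput (hypothetical helper).

    Each FMA performs a multiply and an add on every SIMD lane,
    so it is counted as 2 FLOPs per lane.
    """
    lanes = simd_bits // element_bits
    return cores * ghz * fma_per_cycle * lanes * 2


# Hi1616: 32 cores at 2.4 GHz, 1 vector FMA per cycle for both SP and DP.
print("Hi1616 SP:", peak_gflops(32, 2.4, 1, 128, 32))
print("Hi1616 DP:", peak_gflops(32, 2.4, 1, 128, 64))

# Kunpeng 920: 48 cores at 2.0 GHz, 2 SP but only 1 DP vector FMA per cycle.
print("Kunpeng 920 SP:", peak_gflops(48, 2.0, 2, 128, 32))
print("Kunpeng 920 DP:", peak_gflops(48, 2.0, 1, 128, 64))
```

With only one DP vector FMA per cycle, the DP peak of Kunpeng 920 is a quarter of its SP peak rather than half, which is the limitation the bullets refer to.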
Memory Subsystem
Normalized Bandwidth of Different Memory Layers (relative scale, 6148 = 1)
Layer        6148   920
L1 Read      1      1.1
L1 Write     1      1.6
L2 Read      1      1.2
L2 Write     1      1.6
L3 Read      1      1.2
L3 Write     1      2.0
DRAM Read    1      2.1
DRAM Write   1      3.6

Normalized Average Latency (ns)
Chip   L1      L2      L3      DRAM
6148   1x      1x      1x      1x
920    0.33x   0.71x   1.57x   1.25x
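For context on the DRAM numbers, the theoretical per-socket DRAM bandwidth can be estimated from the "DRAM Support" row of the chip table. This is a back-of-the-envelope sketch, assuming a 64-bit (8-byte) data path per DDR4 channel and the nominal transfer rates; it says nothing about the measured bandwidth, and the test platform runs the Kunpeng 920 memory at 2666 MHz rather than 3200.

```python
def peak_dram_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s (hypothetical helper).

    Assumes each DDR4 channel moves 8 bytes per transfer.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


print("6148 (6 x DDR4-2666):", peak_dram_bandwidth_gbs(6, 2666), "GB/s")
print("Hi1616 (4 x DDR4-2400):", peak_dram_bandwidth_gbs(4, 2400), "GB/s")
print("Kunpeng 920 (8 x DDR4-3200):", peak_dram_bandwidth_gbs(8, 3200), "GB/s")
print("Kunpeng 920 as tested (8 x 2666):", peak_dram_bandwidth_gbs(8, 2666), "GB/s")
```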
Chip Communication - Bandwidth
Platform        Technique         Bandwidth (GB/s)
2680 to 2680    QPI               35.2
6148 to 6148    UPI               40.8
1616 to 1616    Hydra Interface   10.0
920 to 920      Hydra Interface   12.7
Proxy Applications
▪ SNAP
  ▪ A proxy application for a modern deterministic discrete ordinates transport code.
▪ TeaLeaf
  ▪ Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five-point finite difference stencil.
▪ CloverLeaf
  ▪ Solves Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
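To make the "five-point finite difference stencil" in the TeaLeaf description concrete, here is a minimal sketch of one explicit heat-conduction update on a small regular grid. It is not TeaLeaf's own solver; the function name, the grid size and the coefficient `r` (standing for diffusivity x dt / dx^2) are assumptions chosen for illustration.

```python
def five_point_step(u, r):
    """One explicit heat-conduction step using a five-point stencil.

    u is a 2D list of temperatures; boundary cells are held fixed.
    r is an assumed coefficient (diffusivity * dt / dx**2); the explicit
    scheme is stable for r <= 0.25.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Centre cell plus its four direct neighbours: five points in total.
            new[j][i] = u[j][i] + r * (
                u[j - 1][i] + u[j + 1][i] + u[j][i - 1] + u[j][i + 1] - 4.0 * u[j][i]
            )
    return new


# A single hot cell in the middle of a 5 x 5 grid spreads to its neighbours.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after = five_point_step(grid, 0.1)
for row in after:
    print(" ".join(f"{value:.2f}" for value in row))
```

Each update reads five values and writes one, which is why stencil codes of this kind tend to stress memory bandwidth more than arithmetic units.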
Proxy Applications - Results

SNAP Grind Time (ns)
                          6148      920
Single                    0.77      0.766
2-Socket Strong Scaling   0.916     0.56
2-Socket Weak Scaling     0.91      0.462

TeaLeaf Wall Clock (s)
                          6148      920
Single                    601.18    364.42
2-Socket Strong Scaling   339.5     193.05
2-Socket Weak Scaling     989.88    607.13

CloverLeaf-bm16 Wall Clock (s)
                          6148      920
Single                    1342.61   1041.56
2-Socket Strong Scaling   755.5     571.67

CloverLeaf-bm128_short Wall Clock (s)
                          6148      920
Single                    182.58    208.78
2-Socket Strong Scaling   120.65    109.8
Proxy Applications - Results
Normalized Performance of Proxy Applications (920 relative to 6148 = 1)
Application              Single Socket   Dual Socket
SNAP                     1.005           1.636
TeaLeaf                  1.65            1.759
CloverLeaf_bm16          1.289           1.322
CloverLeaf_bm128_short   0.875           1.099
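The normalized figures are consistent with dividing the 6148 time by the 920 time for each application, so values above 1 mean Kunpeng 920 is faster. The sketch below recomputes them from the grind times and wall-clock times reported earlier; the dictionary layout and function name are illustrative, and the dual-socket column is assumed to use the strong-scaling runs.

```python
# (time on 6148, time on 920); SNAP in ns grind time, the others in seconds.
TIMES = {
    "SNAP": {"single": (0.77, 0.766), "dual": (0.916, 0.56)},
    "TeaLeaf": {"single": (601.18, 364.42), "dual": (339.5, 193.05)},
    "CloverLeaf_bm16": {"single": (1342.61, 1041.56), "dual": (755.5, 571.67)},
    "CloverLeaf_bm128_short": {"single": (182.58, 208.78), "dual": (120.65, 109.8)},
}


def normalized_performance(time_6148, time_920):
    """Performance of 920 relative to 6148; lower time means higher performance."""
    return time_6148 / time_920


for app, runs in TIMES.items():
    single = normalized_performance(*runs["single"])
    dual = normalized_performance(*runs["dual"])
    print(f"{app}: single {single:.3f}, dual {dual:.3f}")
```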
Proxy Applications - SNAP
- Generally loads a relatively large data set and performs random accesses within it (dim3_sweep.f90).
- If OpenMP is enabled, threads are spread across the data set.
- MPI_Recv becomes a hotspot after scaling across sockets.

SNAP Grind Time in ns (9600 cells, nang=64, ng=332, nstep=100)
                          6148    920
Single                    0.77    0.766
2-Socket Strong Scaling   0.916   0.56
2-Socket Weak Scaling     0.91    0.462

- Same single-node performance.
- 26.9% speedup on 920 from single socket to 2-socket strong scaling.
Proxy Applications - TeaLeaf
- Memory subsystem bandwidth.
- Problem size: 3840 x 3840, 10000 steps.

TeaLeaf Relative Speedup
                          6148     920
Single                    1.0x     1.0x
2-Socket Strong Scaling   1.77x    1.89x
2-Socket Weak Scaling     0.607x   0.600x
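The relative speedups match the single-socket wall-clock time divided by the two-socket time. As a sketch, the snippet below recomputes them from the TeaLeaf wall-clock results and also derives strong-scaling parallel efficiency (speedup divided by the number of sockets), which is a standard metric but is not shown on the slide. Function names are illustrative.

```python
# TeaLeaf wall-clock times in seconds.
TEALEAF = {
    "6148": {"single": 601.18, "strong": 339.5, "weak": 989.88},
    "920": {"single": 364.42, "strong": 193.05, "weak": 607.13},
}


def relative_speedup(t_single, t_other):
    """Speedup of a run relative to the single-socket run."""
    return t_single / t_other


def parallel_efficiency(speedup, sockets=2):
    """Strong-scaling efficiency: achieved speedup over the ideal speedup."""
    return speedup / sockets


for platform, t in TEALEAF.items():
    strong = relative_speedup(t["single"], t["strong"])
    weak = relative_speedup(t["single"], t["weak"])
    print(
        f"{platform}: strong {strong:.2f}x "
        f"(efficiency {parallel_efficiency(strong):.1%}), weak {weak:.3f}x"
    )
```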
Proxy Applications - CloverLeaf
- Memory subsystem bandwidth.
- But double-precision arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.

CloverLeaf Wall Clock (s) and 2-Socket Strong Scaling Speedup
Benchmark     Platform   Single    2-Socket Strong Scaling   Speedup
bm16          6148       1342.61   755.5                     1.77x
bm16          920        1041.56   571.67                    1.82x
bm128_short   6148       182.58    120.65                    1.51x
bm128_short   920        208.78    109.8                     1.90x
GTC-P
▪ GTC-P: Gyrokinetic Toroidal Code - Princeton.
▪ GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan and TaihuLight, which feature CPU, GPU and many-core processors.
Supported by NSF SAVI Project
[Figure: GTC-P results on Kunpeng 920]
[Figure: GTC-P Performance With Different Combination of Processes and Threads on Kunpeng 920]
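The process/thread figure explores how MPI processes and OpenMP threads are combined on one chip. The combinations actually measured are not recoverable from the slide, so the sketch below only enumerates the candidate pairs that exactly fill the 48 cores of a single Kunpeng 920 socket; the function name is hypothetical.

```python
def process_thread_combinations(total_cores):
    """All (processes, threads_per_process) pairs that use every core exactly once."""
    return [
        (processes, total_cores // processes)
        for processes in range(1, total_cores + 1)
        if total_cores % processes == 0
    ]


# 48 cores per Kunpeng 920 socket, as listed in the chip table.
for processes, threads in process_thread_combinations(48):
    print(f"{processes:2d} MPI processes x {threads:2d} OpenMP threads")
```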