

SLIDE 1

Benchmarking Huawei ARM Multi-Core Processors for HPC workloads

Key Liao Center for HPC Shanghai Jiao Tong University Jan 9th, 2019

SLIDE 2

About Me

Key Liao (廖秋承)

  • B.S. in Environmental Science and Engineering, SJTU.
  • HPC Engineer at the Center for High Performance Computing (CHPC), SJTU.
  • Leader of the ARM Research Team at CHPC, SJTU.
  • Supervisor of the SJTU Student HPC Competition Team.
  • Main research areas: Computer Architecture, Theoretical Computer Performance Evaluation, Performance Optimization.
  • Email: keymorrislane@sjtu.edu.cn

SLIDE 3

Outline

➢ Kunpeng 920
    ➢ Floating-point Arithmetic
    ➢ Memory subsystem
➢ Proxy Applications
    ➢ TeaLeaf
    ➢ SNAP
    ➢ CloverLeaf
➢ Real-world applications
    ➢ GTC-P

SLIDE 4

Chips Information

[Figure: chip topology diagram. Labels: core 0, core 1, core 3, core 4; grp 0 to grp 5.]

SLIDE 5

Chips Information

| Model | Intel Xeon Gold 6148 | Hi1616 | Kunpeng 920 (Engineering Sample) |
|---|---|---|---|
| Arch | Skylake-SP | ARM | ARM |
| Lithography | 14nm | 16nm | 7nm |
| Main Frequency (GHz) | 2.4 | 2.4 | 2.0 |
| Num of Cores | 20 | 32 | 48 |
| Vectorization Ins/Width | AVX512/512 bits | ASIMD/128 bits | ASIMD/128 bits |
| Theoretical DP Peak Performance (GFLOPS)* | 1536 | 307.2 | 768 |
| L3 Cache | 1.375 MB | 32 MB (shared) | 64 MB (shared) |
| DRAM Support | 6 x DDR4-2666 | 4 x DDR4-2400 | 8 x DDR4-3200 |
| TDP (W) | 150 | 70 | 150 |
| Launch Time | 2017 | 2016 | 2019 |

* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.

SLIDE 6

Platform Information

| Platform | 6148 | 1616 | 920 |
|---|---|---|---|
| CPU | Xeon Gold 6148 | Hi1616 | Kunpeng 920 |
| Number of Sockets | 4 | 4 | 8 |
| DRAM Size (GB) | 2048 | 256 | 256 |
| DRAM Frequency (MHz) | 2666 | 2400 | 2666 |
| Linux | CentOS 7.5, Kernel 3.10.0 | EulerOS, Kernel 4.11.0 | EulerOS, Kernel 4.14.0 |

Compiler: All with Intel Parallel Studio XE Cluster Version 2019 Update 1 (Education License), GNU/GCC-8.2.0
MPI Library: MVAPICH2-2.3
BLAS Library: OpenBLAS 0.3.5

SLIDE 7

Floating-point Arithmetic

HPL Benchmark on Four Platforms

| Platform | Single Socket | Dual Socket |
|---|---|---|
| 2683 | 360.2 | 750 |
| 6148 | 955.32 | 2252.2 |
| 1616 | 220.2 | 475.2 |
| 920 | 310.5 | 670.7 |

[Figure: HPL Efficiency on Four Platforms. Bar chart of single-socket and dual-socket efficiency for 2683, 6148, 1616 and 920; y-axis from 0.0% to 90.0%.]

  • Kunpeng 920 is 41.1% better than Hi1616, compared to a 165.3% increase from Haswell to Skylake in 3 years.
  • HPL efficiency on Kunpeng 920 is around 40%, compared to more than 70% on other chips.

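
The two bullets above rest on simple ratios, which the following sketch reproduces. It assumes the HPL scores in the chart are in GFLOPS and takes the theoretical peaks from the Chips Information slide; the function names are illustrative only.

```python
# HPL efficiency is the measured HPL score divided by the theoretical peak.
# Assumption: chart values are GFLOPS; peaks come from the Chips Information table.

def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that HPL actually reaches."""
    return measured_gflops / peak_gflops


def percent_increase(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Values read from the HPL chart.
single_socket = {"2683": 360.2, "6148": 955.32, "1616": 220.2, "920": 310.5}
dual_socket = {"2683": 750.0, "6148": 2252.2, "1616": 475.2, "920": 670.7}

# Theoretical DP peaks (GFLOPS) for the two ARM chips.
peak = {"1616": 307.2, "920": 768.0}

if __name__ == "__main__":
    for chip in ("1616", "920"):
        eff = hpl_efficiency(single_socket[chip], peak[chip])
        print(f"{chip}: single-socket HPL efficiency {eff:.1%}")
    gain_arm = percent_increase(dual_socket["920"], dual_socket["1616"])
    gain_x86 = percent_increase(single_socket["6148"], single_socket["2683"])
    print(f"Hi1616 to Kunpeng 920 (dual socket): +{gain_arm:.1f}%")
    print(f"2683 to 6148 (single socket): +{gain_x86:.1f}%")
```

The results land within rounding of the figures quoted on the slide, which suggests the quoted percentages were derived from these chart values.
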
SLIDE 8

Floating-point Arithmetic

FMA Instruction Throughput

| Chip | SP Scalar | DP Scalar | SP Vector | DP Vector |
|---|---|---|---|---|
| Hi1616 | 2 ins/cycle, 9.596 GFLOPS | 2 ins/cycle, 9.596 GFLOPS | 1 ins/cycle, 19.194 GFLOPS | 1 ins/cycle, 9.596 GFLOPS |
| Kunpeng 920 | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 31.954 GFLOPS | 1 ins/cycle, 7.989 GFLOPS |

  • Hi1616
      • 128-bit SIMD
      • SP: 614.4 GFLOPS
      • DP: 307.2 GFLOPS
  • Hi1620
      • 128-bit SIMD
      • SP: 1,536 GFLOPS
      • DP: 384 GFLOPS
  • Throughput of DP SIMD instructions is limited.
  • Not a good chip for intense DP computation.
  • DP computation is not as important as people used to think.
  • Trend towards SVE and VLA.

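
The per-core and per-chip figures on this slide follow from the FMA issue rate. As a rough sketch, assuming each FMA counts as two floating-point operations per lane, a 128-bit register holds 4 SP or 2 DP lanes, and the clocks and core counts are the nominal ones from the Chips Information slide (`peak_gflops` is an illustrative helper, not part of any benchmark tool):

```python
def peak_gflops(freq_ghz, fma_per_cycle, lanes, cores=1):
    """Peak GFLOPS = clock (GHz) x FMA issued per cycle x lanes x 2 flops per FMA x cores."""
    return freq_ghz * fma_per_cycle * lanes * 2 * cores


# Lanes per instruction: scalar = 1, 128-bit SIMD = 4 (SP) or 2 (DP).
SCALAR, SP_VEC, DP_VEC = 1, 4, 2

if __name__ == "__main__":
    # Per-core figures, to compare with the measured throughput table.
    print("Hi1616 DP vector, one core:", peak_gflops(2.4, 1, DP_VEC))
    print("Kunpeng 920 SP vector, one core:", peak_gflops(2.0, 2, SP_VEC))
    # Whole-chip figures, to compare with the bullet list.
    print("Hi1616 SP, 32 cores:", peak_gflops(2.4, 1, SP_VEC, cores=32))
    print("Hi1620 DP, 48 cores:", peak_gflops(2.0, 1, DP_VEC, cores=48))
```

The measured values in the table sit slightly below these nominal numbers (for example 7.989 against 8.0 GFLOPS), which is consistent with the chips running marginally under their nominal clock during the test.
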
SLIDE 9

Memory Subsystem

Normalized Bandwidth of Different Memory Layers (relative scale, 6148 = 1)

| Layer | 6148 | 920 |
|---|---|---|
| L1 Read | 1 | 1.1 |
| L1 Write | 1 | 1.6 |
| L2 Read | 1 | 1.2 |
| L2 Write | 1 | 1.6 |
| L3 Read | 1 | 1.2 |
| L3 Write | 1 | 2.0 |
| DRAM Read | 1 | 2.1 |
| DRAM Write | 1 | 3.6 |

Normalized Average Latency (ns)

| Chip | L1 | L2 | L3 | DRAM |
|---|---|---|---|---|
| 6148 | 1x | 1x | 1x | 1x |
| 920 | 0.33x | 0.71x | 1.57x | 1.25x |
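
For context on the DRAM figures, the theoretical per-socket DRAM bandwidth can be estimated from the channel count and data rate. This is a sketch under stated assumptions: each DDR4 channel is 64 bits (8 bytes) wide, the "DRAM Frequency" on the Platform Information slide is the data rate in MT/s, and the channel counts are those on the Chips Information slide. These are upper bounds, not the measured values behind the chart.

```python
def dram_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    xeon = dram_bandwidth_gbs(6, 2666)      # Xeon Gold 6148: 6 channels
    kunpeng = dram_bandwidth_gbs(8, 2666)   # Kunpeng 920: 8 channels, as tested
    kunpeng_max = dram_bandwidth_gbs(8, 3200)  # Kunpeng 920 with DDR4-3200
    print(f"6148: {xeon:.1f} GB/s")
    print(f"920 (DDR4-2666): {kunpeng:.1f} GB/s")
    print(f"920 (DDR4-3200): {kunpeng_max:.1f} GB/s")
    print(f"920 / 6148 at equal data rate: {kunpeng / xeon:.2f}x")
```

The channel count alone accounts for a 1.33x theoretical advantage at equal data rate, so the larger ratios measured for DRAM read and write must also reflect how efficiently each memory subsystem uses its channels.
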

SLIDE 10

Chip Communication - Bandwidth

| Platform | Technique | Bandwidth (GB/s) |
|---|---|---|
| 2680 | QPI | 35.2 |
| 6148 | UPI | 40.8 |
| 1616 | Hydra Interface | 10.0 |
| 920 | Hydra Interface | 12.7 |

SLIDE 11

Proxy Applications

▪ SNAP
    ▪ A proxy application for a modern deterministic discrete ordinates transport code.
▪ TeaLeaf
    ▪ Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five point finite difference stencil.
▪ CloverLeaf
    ▪ Solves Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
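
To make the "five point finite difference stencil" mentioned for TeaLeaf concrete, here is a minimal sketch of one explicit heat conduction step on a 2D grid. It is an illustration of the stencil's memory access pattern only, not TeaLeaf's actual solver; the function name, the fixed boundary handling and the parameter `r` (diffusivity x dt / dx^2) are assumptions made for this example.

```python
def heat_step(u, r):
    """Advance a 2D temperature grid by one explicit time step.

    Each interior cell is updated from itself and its four neighbours
    (the five point stencil). Boundary cells are held fixed.
    r = diffusivity * dt / dx**2 and should not exceed 0.25 for stability.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so every update reads the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = u[i][j] + r * (neighbours - 4.0 * u[i][j])
    return new


if __name__ == "__main__":
    # A 5 x 5 grid with a single hot cell in the middle.
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 1.0
    for row in heat_step(grid, r=0.1):
        print(" ".join(f"{value:.2f}" for value in row))
```

Each cell update performs only a handful of floating-point operations on five loaded values, which is why stencil codes of this kind tend to be limited by memory bandwidth rather than arithmetic throughput.
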

SLIDE 12

Proxy Applications - Results

SNAP Grind Time (ns)

| Run | 6148 | 920 |
|---|---|---|
| Single | 0.77 | 0.766 |
| 2-Socket Strong Scaling | 0.916 | 0.56 |
| 2-Socket Weak Scaling | 0.91 | 0.462 |

TeaLeaf Wall Clock (s)

| Run | 6148 | 920 |
|---|---|---|
| Single | 601.18 | 364.42 |
| 2-Socket Strong Scaling | 339.5 | 193.05 |
| 2-Socket Weak Scaling | 989.88 | 607.13 |

CloverLeaf-bm16 Wall Clock (s)

| Run | 6148 | 920 |
|---|---|---|
| Single | 1342.61 | 1041.56 |
| 2-Socket Strong Scaling | 755.5 | 571.67 |

CloverLeaf-bm128_short Wall Clock (s)

| Run | 6148 | 920 |
|---|---|---|
| Single | 182.58 | 208.78 |
| 2-Socket Strong Scaling | 120.65 | 109.8 |

SLIDE 13

Proxy Applications - Results

Normalized Performance of Proxy Applications on Single Socket and Dual Socket (Kunpeng 920 relative to 6148 = 1)

| Application | Single Socket | Dual Socket |
|---|---|---|
| SNAP | 1.005 | 1.636 |
| TeaLeaf | 1.65 | 1.759 |
| CloverLeaf_bm16 | 1.289 | 1.322 |
| CloverLeaf_bm128_short | 0.875 | 1.099 |
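
The normalized figures on this slide can be reproduced from the raw results on the previous slide. A minimal sketch, assuming normalized performance is the 6148 time divided by the Kunpeng 920 time (so values above 1 mean the 920 is faster) and that the dual-socket numbers use the strong scaling runs; the dictionary layout is illustrative.

```python
def normalized_performance(time_6148, time_920):
    """Performance of the 920 relative to the 6148; lower time means higher performance."""
    return time_6148 / time_920


# (6148, 920) pairs: SNAP grind time in ns, all others wall clock in seconds.
single_socket = {
    "SNAP": (0.77, 0.766),
    "TeaLeaf": (601.18, 364.42),
    "CloverLeaf_bm16": (1342.61, 1041.56),
    "CloverLeaf_bm128_short": (182.58, 208.78),
}
dual_socket_strong = {
    "SNAP": (0.916, 0.56),
    "TeaLeaf": (339.5, 193.05),
    "CloverLeaf_bm16": (755.5, 571.67),
    "CloverLeaf_bm128_short": (120.65, 109.8),
}

if __name__ == "__main__":
    for name in single_socket:
        single = normalized_performance(*single_socket[name])
        dual = normalized_performance(*dual_socket_strong[name])
        print(f"{name}: single {single:.3f}, dual {dual:.3f}")
```
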

SLIDE 14

Proxy Applications - SNAP

  • Generally, SNAP loads a relatively big data set and performs random accesses within it (dim3_sweep.f90).
  • If OpenMP is enabled, threads are spread across the data set.
  • MPI_Recv becomes a hotspot after scaling across sockets.

SNAP Grind Time (ns) (9600 cells, nang=64, ng=332, nstep=100)

| Run | 6148 | 920 |
|---|---|---|
| Single | 0.77 | 0.766 |
| 2-Socket Strong Scaling | 0.916 | 0.56 |
| 2-Socket Weak Scaling | 0.91 | 0.462 |

Chart annotations:

  • Same single-node performance
  • 26.9% speedup

SLIDE 15

Proxy Applications - TeaLeaf

  • Performance depends mainly on memory subsystem bandwidth.
  • Problem size: 3840 x 3840, 10000 steps.

TeaLeaf Relative Speedup (relative to single socket)

| Run | 6148 | 920 |
|---|---|---|
| Single | 1.0x | 1.0x |
| 2-Socket Strong Scaling | 1.77x | 1.89x |
| 2-Socket Weak Scaling | 0.607x | 0.600x |
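
The relative speedups shown here are the single-socket wall clock divided by the two-socket wall clock. A short sketch using the TeaLeaf wall clock values from the Results slide; the parallel efficiency helper is an addition for illustration, defined in the usual way as speedup divided by the number of sockets.

```python
def relative_speedup(time_single, time_scaled):
    """Speedup relative to the single-socket run (values below 1 mean the run took longer)."""
    return time_single / time_scaled


def strong_scaling_efficiency(speedup, sockets):
    """Fraction of the ideal speedup achieved when the same problem runs on more sockets."""
    return speedup / sockets


# TeaLeaf wall clock in seconds: (single, 2-socket strong, 2-socket weak).
tealeaf = {
    "6148": (601.18, 339.5, 989.88),
    "920": (364.42, 193.05, 607.13),
}

if __name__ == "__main__":
    for chip, (single, strong, weak) in tealeaf.items():
        s = relative_speedup(single, strong)
        w = relative_speedup(single, weak)
        e = strong_scaling_efficiency(s, sockets=2)
        print(f"{chip}: strong {s:.2f}x (efficiency {e:.0%}), weak {w:.3f}x")
```
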

SLIDE 16

Proxy Applications - CloverLeaf

  • Performance depends mainly on memory subsystem bandwidth.
  • But double-precision floating-point arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.

CloverLeaf-bm16 Wall Clock (s)

| Run | 6148 | 920 |
|---|---|---|
| Single | 1342.61 | 1041.56 |
| 2-Socket Strong Scaling | 755.5 | 571.67 |
| Speedup | 1.77x | 1.82x |

CloverLeaf-bm128_short Wall Clock (s)

| Run | 6148 | 920 |
|---|---|---|
| Single | 182.58 | 208.78 |
| 2-Socket Strong Scaling | 120.65 | 109.8 |
| Speedup | 1.51x | 1.90x |

SLIDE 17

GTC-P

▪ GTC-P: Gyrokinetic Toroidal Code - Princeton
▪ GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan and TaihuLight, which feature CPUs, GPUs and many-core processors.

Supported by NSF SAVI Project

SLIDE 18

GTC-P

Kunpeng 920

SLIDE 19

GTC-P

GTC-P Performance With Different Combination of Processes and Threads on Kunpeng 920

SLIDE 20

GTC-P

Kunpeng 920

SLIDE 21

Conclusion

▪ Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (<4 DP F/B) better than Intel's recent chip at a similar price.
▪ Pro
    ▪ Good topology design for threading.
    ▪ High bandwidth and low latency; does well in many memory-bound apps.
▪ Con
    ▪ Low bandwidth of the Hydra Interface.
    ▪ Low DP arithmetic capability.
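
The arithmetic intensity threshold in the conclusion can be related to a simple roofline estimate, in which attainable performance is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below is an illustration under stated assumptions, not the author's analysis: it uses the 768 GFLOPS DP peak from the Chips Information slide and a hypothetical bandwidth of 170.624 GB/s, which is the theoretical figure for 8 channels of DDR4-2666 rather than a measured one.

```python
def attainable_gflops(intensity_flops_per_byte, peak_gflops, bandwidth_gbs):
    """Roofline model: performance is capped by compute peak or by memory traffic."""
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (flops per byte) at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Assumed Kunpeng 920 parameters (theoretical, see lead-in).
PEAK_DP_GFLOPS = 768.0
DRAM_BANDWIDTH_GBS = 170.624

if __name__ == "__main__":
    ridge = ridge_point(PEAK_DP_GFLOPS, DRAM_BANDWIDTH_GBS)
    print(f"ridge point: {ridge:.2f} flops/byte")
    for intensity in (0.25, 1.0, 4.0, 16.0):
        perf = attainable_gflops(intensity, PEAK_DP_GFLOPS, DRAM_BANDWIDTH_GBS)
        print(f"intensity {intensity:>5} F/B: up to {perf:.1f} GFLOPS")
```

Under these assumptions, kernels below roughly 4.5 flops per byte are limited by memory bandwidth rather than by the chip's modest DP peak, which is in line with the threshold quoted in the conclusion.
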