SLIDE 1

www.anl.gov

Run-to-run Variability on Theta and Best Practices for Performance Benchmarking

ALCF Developer Session – September 26, 2018
Sudheer Chunduri (sudheer@anl.gov)

SLIDE 2

Run-to-run Variability

Equal work is not Equal time

Image courtesy: https://concertio.com/2018/07/02/dealing-with-variability/

SLIDE 3

Equal work is not Equal time

§ Sources of Variability
  § Core level
    • OS noise effects
    • Dynamic frequency scaling
    • Manufacturing variability
  § Node level
    • Shared cache contention on a multi-core processor
  § System level
    • Network congestion due to inter-job interference

§ Challenges
  § Less reliable performance measures (multiple repetitions with statistical-significance analysis are required)
  § Performance tuning – quantifying the impact of a code change is difficult
  § Difficult to predict job duration
    • Lower user productivity
    • Inefficient system utilization
    • Complicates job scheduling
SLIDE 4

Outline

§ Overview of the Theta architecture
§ Evaluation of run-to-run variability on Theta
  § Classify and quantify sources of variability
  § Present ways to mitigate wherever possible
§ Recommended best practices for performance benchmarking

SLIDE 5

Theta System Overview

§ System:
  • Cray XC40 (#21 in the June 2018 Top500; 14 similar systems among the top 50 supercomputers)
  • 4,392 compute nodes / 281,088 cores; 11.69 PF peak performance
§ Processor:
  • 2nd-generation Intel Xeon Phi (Knights Landing) 7230
  • 64 cores; 2 cores per tile with a shared L2 cache
  • 1.3 GHz base frequency, turbo up to 1.5 GHz
§ Node:
  • Single-socket KNL
  • 192 GB DDR4-2400 per node
  • 16 GB MCDRAM per node (cache mode / flat mode)
§ Network:
  • Cray Aries interconnect with Dragonfly topology
  • Adaptive routing

Figures source: Intel, Cray

SLIDE 6

Aspects of Variability Examined

Figures source: Intel, Cray

§ Core level
  • OS noise effects
  • Core-to-core variability
  • Cores within a tile
§ Node level
  • MCDRAM memory mode effects
§ System level
  • Network congestion
  • Node placement and routing mode effects

Examined with micro-benchmarks, mini-apps, and full applications.

SLIDE 7

Core-level Variability

§ Each core runs the MKL DGEMM benchmark
§ Matrix size chosen to fit within the L1 cache

[Figure: DGEMM on 64 cores – per-core times (s) in the 2.85–3.20 s range; max-to-min variation: 11.18%]
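A minimal sketch of this kind of per-core timing experiment, assuming MKL's CBLAS interface; the matrix size, repetition count, and timer are illustrative, not the exact benchmark configuration:

```c
/* Sketch: time many small DGEMMs on one core; one such loop is pinned
 * to each of the 64 cores (e.g. via aprun CPU binding), and the spread
 * of the reported times is the core-to-core variability. */
#include <stdio.h>
#include <mkl.h>        /* cblas_dgemm, dsecnd */

#define N 16            /* assumed size: 3 N*N double matrices fit in L1 */
#define REPS 1000000L   /* illustrative repetition count */

int main(void) {
    double A[N*N], B[N*N], C[N*N];
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = dsecnd();                       /* MKL wall-clock timer */
    for (long r = 0; r < REPS; r++)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    double t1 = dsecnd();

    printf("time: %.3f s\n", t1 - t0);
    return 0;
}
```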

SLIDE 8

Core-level Variability

§ Each core runs the MKL DGEMM benchmark
§ Matrix size chosen to fit within the L1 cache
§ Core specialization: a Cray OS feature that lets users reserve cores for handling system services

[Figures: DGEMM times across cores 0–63, without and with core specialization. Without: max-to-min variation 11.18%. With: max-to-min variation 5.22% (the per-core panel is annotated with max run-to-run variation 5.91% and max-to-min variation 6.01%)]

SLIDE 9

Core-level Variability

§ Benchmark: Selfish
§ Runs in a tight loop and measures the time for each iteration
§ If an iteration takes longer than a given threshold, the timestamp is recorded as a noise event

[Figure: OS noise on a core without core specialization – noise (µs) vs. actual time, with noise events marked]
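To make the mechanism concrete, here is a minimal sketch of a selfish-style detector in C; the threshold, iteration count, and event-buffer size are illustrative assumptions, not the actual benchmark's parameters:

```c
/* Sketch: spin in a tight loop; any iteration that takes noticeably
 * longer than its neighbors was interrupted (OS noise). Events are
 * buffered and printed only at the end, so reporting itself does not
 * add noise. */
#include <stdio.h>
#include <time.h>

#define MAX_EVENTS 10000

static inline double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

int main(void) {
    const double threshold_us = 1.0;   /* assumed detour threshold */
    double when[MAX_EVENTS], len[MAX_EVENTS];
    int nev = 0;

    double prev = now_us();
    for (long i = 0; i < 200000000L && nev < MAX_EVENTS; i++) {
        double t = now_us();
        if (t - prev > threshold_us) {  /* iteration was interrupted */
            when[nev] = t;
            len[nev]  = t - prev;
            nev++;
        }
        prev = t;
    }
    for (int e = 0; e < nev; e++)
        printf("t=%.0f us: noise %.2f us\n", when[e], len[e]);
    return 0;
}
```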

SLIDE 10

Core-level Variability

§ Benchmark: Selfish (as on the previous slide)
§ Core specialization is an effective mitigation for core-level variability

[Figures: OS noise (µs) on a core without vs. with core specialization]

SLIDE 11

Core-level Variability

§ Benchmark: Selfish
  • Small micro-benchmark in the milliseconds range
  • Noise is significant

[Figure: noise (µs) over time]

SLIDE 12

Core-level Variability

§ Benchmark: Selfish
  • Small micro-benchmark in the milliseconds range: noise is significant
  • Micro-benchmark in the seconds range: the impact disappears
§ Time scale matters – runtimes longer than a few seconds don't see the impact

[Figures: noise (µs) for millisecond-scale vs. second-scale runs]

SLIDE 13

Node-level Variability

Variability due to memory mode

§ KNL has two types of memory:
  • DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
  • MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth

SLIDE 14

Node-level Variability

Variability due to memory mode

§ KNL has two types of memory:
  • DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
  • MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth
§ MCDRAM can be operated in two modes: flat mode and cache mode

SLIDE 15

Node-level Variability

Variability due to memory mode

§ KNL has two types of memory:
  • DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
  • MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth
§ MCDRAM can be operated in two modes: flat mode and cache mode
§ Source of variability:
  • In cache mode, MCDRAM operates as a direct-mapped cache in front of DRAM
  • Potential conflicts because of the direct mapping
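In flat mode, MCDRAM appears as a separate NUMA node, so the application (or numactl --membind) chooses where data lives and the direct-mapped conflicts never arise. A minimal sketch using the memkind library's hbwmalloc interface; the array size is illustrative:

```c
/* Sketch: explicitly place a large array in MCDRAM in flat mode,
 * via libmemkind's high-bandwidth allocator. Link with -lmemkind. */
#include <stdio.h>
#include <hbwmalloc.h>   /* hbw_malloc / hbw_free */

int main(void) {
    size_t n = 1UL << 28;                       /* ~2 GB of doubles; */
    double *a = hbw_malloc(n * sizeof *a);      /* fits in 16 GB MCDRAM */
    if (!a) { perror("hbw_malloc"); return 1; }

    for (size_t i = 0; i < n; i++)              /* accesses are served */
        a[i] = 1.0;                             /* from MCDRAM          */

    hbw_free(a);
    return 0;
}
```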

SLIDE 16

Node-level Variability

STREAM TRIAD in flat mode

§ STREAM TRIAD measures memory bandwidth with the kernel A(i) = B(i) + s * C(i)
§ STREAM run on 63 cores, with one core reserved for core specialization, and a 7.5 GB working set
§ Less than 1% variability; ~480 GB/s effective bandwidth

[Figure: bandwidth (GB/s) vs. job number]
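The TRIAD kernel is simple enough to sketch in full; this OpenMP version uses an illustrative array size rather than the ~310M elements per array implied by the 7.5 GB working set on the slide:

```c
/* Sketch of a STREAM-TRIAD-style bandwidth measurement:
 * a(i) = b(i) + s * c(i), timed across one parallel sweep. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n = 1L << 27;   /* illustrative; slide used a 7.5 GB working set */
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b),
           *c = malloc(n * sizeof *c);
    double s = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
    double t1 = omp_get_wtime();

    /* TRIAD moves three arrays per sweep: two reads plus one write. */
    double gbytes = 3.0 * n * sizeof(double) / 1e9;
    printf("bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```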

SLIDE 17

Node-level Variability

STREAM TRIAD in flat mode

§ Less than 1% variability; ~480 GB/s effective bandwidth
§ MCDRAM write counts are consistent across all the nodes

[Figures: bandwidth (GB/s) vs. job number; counter values for DRAM and MCDRAM reads and writes]

SLIDE 18

Node-level Variability

STREAM TRIAD in cache mode

§ Max. 4.5% run-to-run and 2x job-to-job variability
§ ~350 GB/s effective bandwidth
§ Same setup: STREAM on 63 cores with one core reserved for core specialization, 7.5 GB working set

[Figure: bandwidth (GB/s) vs. job number]

SLIDE 19

Node-level Variability

STREAM TRIAD in cache mode

§ Max. 4.5% run-to-run and 2x job-to-job variability; ~350 GB/s effective bandwidth
§ Higher bandwidth correlates with a lower MCDRAM miss ratio (more MCDRAM writes occur due to conflicts!)

[Figures: bandwidth (GB/s) vs. job number; counter values for DRAM reads and writes and MCDRAM hits, misses, reads, and writes]

SLIDE 20

Network-level Variability

§ Cray XC Dragonfly topology
  • Links are potentially shared between user jobs
  • High chance of inter-job contention
§ Sources of variability in inter-job contention:
  • Job size, node placement, workload characteristics, and the co-located job mix

SLIDE 21

Network-level Variability

MPI Collectives

§ MPI_Allreduce using 64 processes with an 8 MB message
§ Repeated 100 times within a job
§ Measured on several days
  • Node placement and job mix change from day to day
§ Isolated system run:
  • < 1% variability
  • Best observed performance
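A minimal sketch of such a collective-latency harness (barrier, time one Allreduce, repeat); the exact harness used for these measurements may differ:

```c
/* Sketch: repeat an 8 MB MPI_Allreduce 100 times and record the
 * per-iteration latency. Process count and message size follow the
 * slide; everything else is illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 8UL * 1024 * 1024 / sizeof(double);   /* 8 MB payload */
    double *in  = calloc(n, sizeof *in);
    double *out = malloc(n * sizeof *out);

    for (int rep = 0; rep < 100; rep++) {
        MPI_Barrier(MPI_COMM_WORLD);                 /* align starts */
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%d %.6f\n", rep, t1 - t0);       /* latency in s */
    }

    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```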

SLIDE 22

Network-level Variability

MPI Collectives

§ MPI_Allreduce using 64 processes with an 8 MB message
§ Repeated 100 times within a job; measured on several days
  • Node placement and job mix change from day to day
§ Isolated system run: < 1% variability (best observed performance)
§ Variability across production jobs is around 35%
§ Much higher variability with smaller message sizes (not shown here)
§ Each box shows the median, the inter-quartile range (IQR), and the outliers

[Figure: 128-node MPI_Allreduce, 8 MB message, 64 PPN – box plots of latency (s) per measurement date across different jobs, with the ideal (isolated) run, the best observed latency, and the median-of-medians (MoM) with ±5% and ±10% bands]
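For reference, a sketch of the box-plot statistics mentioned above (median and IQR) over a set of recorded latencies, using simple index-based quartiles rather than interpolation:

```c
/* Sketch: sort the samples, then read off the median and the
 * inter-quartile range (Q3 - Q1). The sample values are illustrative. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *x, const void *y) {
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

static void box_stats(double *v, int n, double *median, double *iqr) {
    qsort(v, n, sizeof *v, cmp);
    *median = v[n / 2];                  /* index-based median */
    *iqr = v[(3 * n) / 4] - v[n / 4];    /* Q3 - Q1 */
}

int main(void) {
    double lat[] = {0.32, 0.35, 0.33, 0.40, 0.34};   /* example latencies */
    double med, iqr;
    box_stats(lat, 5, &med, &iqr);
    printf("median %.3f s, IQR %.3f s\n", med, iqr);
    return 0;
}
```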

SLIDE 23

Summary on Variability

§ Core-to-core variability due to OS noise
  • Core 0 is slow compared to the rest of the cores
  • Crucial for low-latency MPI benchmarking and for micro-kernel benchmarking
  • Longer time scales don't see the effect
  • Core specialization helps reduce the overhead
  • Frequency-scaling effects are not dominant enough to induce variability
§ Node-level variability due to MCDRAM cache page conflicts
  • Around 2x variability on the STREAM benchmark
  • Linux zone sort helps improve average performance and reduce variability to some extent
  • Example mini-apps that are sensitive: Nekbone, MiniFE
  • For applications whose working set fits within MCDRAM, using flat mode is the mitigation
§ Network-level variability due to inter-job contention
  • Up to 35% for large-message MPI collectives
  • Even higher variability for latency-bound, small-message collectives
  • No obvious mitigation

SLIDE 24

Application Level Variability

Nekbone variability at the node level

§ Nekbone: a mini-app derived from Nek5000
  • Streaming kernels – bandwidth bound – DAXPY+
  • Matrix multiply – compute bound – MXM
  • Communication bound – COMM
§ Flat mode on Theta: max-to-min ratio = 3.57%

[Figure: per-job times (s), 100–800 s – total time, DAXPY+, MXM, COMM]
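The max-to-min ratio quoted throughout these slides can be computed as below; this sketch assumes the ratio is the spread of run times relative to the fastest run, which is one plausible reading of the metric:

```c
/* Sketch: max-to-min variability of repeated run times, as a percent
 * of the fastest run. The run times are illustrative. */
#include <stdio.h>

static double max_to_min_pct(const double *t, int n) {
    double lo = t[0], hi = t[0];
    for (int i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return 100.0 * (hi - lo) / lo;
}

int main(void) {
    double runs[] = {100.0, 101.2, 103.5};   /* example run times (s) */
    printf("max-to-min variation: %.2f%%\n", max_to_min_pct(runs, 3));
    return 0;
}
```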
SLIDE 25

Application Level Variability

Nekbone variability at the node level

§ Nekbone: a mini-app derived from Nek5000
  • Streaming kernels – bandwidth bound – DAXPY+
  • Matrix multiply – compute bound – MXM
  • Communication bound – COMM
§ The problem is memory-bandwidth intensive
§ Flat mode on Theta: 3.57% max-to-min variability
§ Cache mode on Theta: 21.95% max-to-min variability

[Figures: per-job times (s) in flat mode and cache mode – total time, DAXPY+, MXM, COMM]
SLIDE 26

Application Level Variability

Nekbone variability at the network level

§ With a different input, Nekbone is communication bound
§ 32.14% max-to-min variability on 128-node jobs on Theta
§ Variability in total time ~ variability in COMM time

[Figure: 128 nodes on Theta – per-job times (s); max-to-min ratio = 32.1%]

SLIDE 27

Application Level Variability

Nekbone variability at the network level

§ With a different input, Nekbone is communication bound
§ 5 repetitions within each job; all repetitions use the same node allocation
§ 128 nodes on Theta: max-to-min ratio = 32.1%
§ 256 nodes on Theta: run-to-run ratio = 32.1%, job-to-job ratio = 36.9%
§ Variability in total time ~ variability in COMM time

[Figures: per-job times (s) at 128 and 256 nodes]

SLIDE 28

Application Level Variability

MILC variability at the network level

§ MILC (MIMD Lattice Computation)
  • QCD code simulating 4D SU(3) lattice gauge theory
  • Performs large-scale numerical simulations to study quantum chromodynamics (QCD)
  • Compute intensive per lattice site, with a low memory footprint per compute node

SLIDE 29

Application Level Variability

MILC variability at the network level

§ MILC (MIMD Lattice Computation)
  • QCD code simulating 4D SU(3) lattice gauge theory
  • Compute intensive per lattice site, with a low memory footprint per compute node
§ Job-to-job variability:
  • 74.67% on 128-node jobs on Theta
  • 41% on 256-node jobs on Theta
§ A higher total time has a correspondingly higher time in the communication (MPI) part – CrayPAT MPI profiling

[Figure: 128 nodes on Theta – per-job times (s), 1600–3000 s over ~30 jobs; job-to-job range = 74.67%]

SLIDE 30

Impact of Variability on Performance Tuning

§ Nekbone
  • Optimization: libxsmm to optimize small matrix multiplies
  • Impact of the optimization in flat mode: 20.7% (no variability)
  • Cache mode: average performance improvement 18.8% (95% CI)
    • Variability: ~10%
    • Performance improvement range: [+2%, +35%]

[Figure: base vs. optimized times (s), 650–800 s; the +2% case is highlighted]

SLIDE 31

Impact of Variability on Performance Tuning

§ Nekbone (continued): cache-mode performance improvement range [+2%, +35%]

[Figure: base vs. optimized times (s), with both the +2% and +35% extremes highlighted]

SLIDE 32

Impact of Variability on Performance Tuning

§ Nekbone (as above): flat mode 20.7%; cache mode 18.8% average improvement, range [+2%, +35%]
§ MILC
  • Optimization: rank reordering to minimize inter-node traffic
  • Impact of the optimization in a less variable environment: 22%

[Figure: base vs. optimized times (s)]

SLIDE 33

Impact of Variability on Performance Tuning

§ Nekbone: libxsmm to optimize small matrix multiplies
  • Flat mode: 20.7% improvement (no variability)
  • Cache mode: 18.8% average improvement (95% CI); ~10% variability; range [+2%, +35%]
§ MILC: rank reordering to minimize inter-node traffic
  • Less variable environment: 22% improvement
  • Production mode: 23.3% average improvement
  • Variability: 25% in the optimized case, 41% in the base case
  • Performance improvement range: [-14%, +55%]

[Figures: Nekbone base vs. optimized times (s), 650–800 s; MILC base vs. optimized MFlops, 35000–50000]
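A sketch of how such a confidence interval on the mean improvement can be computed from repeated base/optimized runs, using the normal approximation (for so few samples a Student-t multiplier would be more appropriate); the per-run improvement values are illustrative:

```c
/* Sketch: mean improvement with a 95% confidence interval,
 * mean +/- 1.96 * standard error (normal approximation). */
#include <math.h>
#include <stdio.h>

int main(void) {
    /* per-run improvement fractions, e.g. (t_base - t_opt) / t_base */
    double x[] = {0.02, 0.35, 0.21, 0.15, 0.18, 0.22};   /* illustrative */
    int n = sizeof x / sizeof x[0];

    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;

    double var = 0.0;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);                        /* sample variance */
    double half = 1.96 * sqrt(var / n);    /* CI half-width */

    printf("improvement: %.1f%% +/- %.1f%% (95%% CI)\n",
           100 * mean, 100 * half);
    return 0;
}
```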
SLIDE 34

Conclusions

§ Classified and quantified sources of variability on the Xeon Phi based Cray XC
  § Core-level variability due to OS noise
    • Available mitigations: use core specialization (a mechanism to reduce OS noise); exclude tiles 0 and 32
  § Memory-mode variability due to cache-mode page conflicts
    • Available mitigations: run in flat mode
    • Potential mitigations: improved zone sort (part of the Cray software stack)
  § Network variability due to shared network resources
    • Available mitigations: run without other jobs present on the system
    • Potential mitigations: compact job placement with static routing
§ Characterized the impact on applications – up to 70% for MILC; up to 35% for Nekbone
§ Guidelines on performance tuning in the presence of variability:
  § Be aware of network-level congestion, which has no clear mitigation strategy; it can significantly affect communication-intensive applications.
  § Incorporate statistical analysis into performance benchmarking and analysis (see https://htor.inf.ethz.ch/publications/img/hoefler-scientific-benchmarking.pdf for more details on statistics).

SLIDE 35

Questions?