Run-to-run Variability on Theta and Best Practices for Performance Benchmarking



  1. Run-to-run Variability on Theta and Best Practices for Performance Benchmarking. ALCF Developer Session – September 26th, 2018. Sudheer Chunduri, sudheer@anl.gov, www.anl.gov

  2. Run-to-run Variability: Equal work is not equal time. Image courtesy: https://concertio.com/2018/07/02/dealing-with-variability/

  3. Equal work is not equal time
     Sources of variability:
     - Core level: OS noise effects, dynamic frequency scaling, manufacturing variability
     - Node level: shared cache contention on a multi-core
     - System level: network congestion due to inter-job interference
     Challenges:
     - Less reliable performance measures (multiple repetitions with statistical significance analysis are required)
     - Performance tuning: quantifying the impact of a code change is difficult
     - Difficult to predict job duration: less user productivity, inefficient system utilization, complicates job scheduling
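As a concrete illustration of "multiple repetitions with statistical significance analysis", a minimal sketch in plain C of the measurement pattern: repeat a workload many times and report min/median/max (and their spread) rather than a single number. The kernel() here is a placeholder workload, not anything from the presentation.

    /* Sketch: repeat a kernel and summarize the timing distribution. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    static volatile double sink;
    static void kernel(void) {                  /* placeholder workload under test */
        double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i * 1e-9;
        sink = s;
    }

    int main(void) {
        enum { REPS = 100 };                    /* number of repetitions, arbitrary here */
        double t[REPS];
        for (int i = 0; i < REPS; i++) {
            double t0 = now_sec();
            kernel();
            t[i] = now_sec() - t0;
        }
        qsort(t, REPS, sizeof t[0], cmp_double);
        printf("min %.6f  median %.6f  max %.6f  (max/min %.2fx)\n",
               t[0], t[REPS / 2], t[REPS - 1], t[REPS - 1] / t[0]);
        return 0;
    }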

  4. Outline
     - Overview of the Theta architecture
     - Evaluation of run-to-run variability on Theta
     - Classify and quantify sources of variability
     - Present ways to mitigate wherever possible
     - Recommended best practices for performance benchmarking

  5. Theta System Overview
     - System: Cray XC40 (#21 in the Top500, June 2018); 14 similar systems among the top 50 supercomputers; 4,392 compute nodes / 281,088 cores; 11.69 PF peak performance
     - Processor: 2nd-generation Intel Xeon Phi (Knights Landing) 7230; 64 cores, 2 cores per tile with a shared L2; 1.3 GHz base frequency, turbo up to 1.5 GHz
     - Node: single-socket KNL; 192 GB DDR4-2400 per node; 16 GB MCDRAM per node (cache mode / flat mode)
     - Network: Cray Aries interconnect with Dragonfly topology; adaptive routing
     Figures source: Intel, Cray

  6. Aspects of Variability Examined (using micro-benchmarks, mini-apps, and applications)
     - Core level: OS noise effects, core-to-core variability, cores within a tile
     - Node level: MCDRAM memory mode effects
     - System level: network congestion, node placement and routing mode effects
     Figures source: Intel, Cray

  7. Core-level Variability
     - Each core runs the MKL DGEMM benchmark
     - Matrix size chosen so as to fit within the L1 cache
     - Max to Min variation: 11.18%
     [Figure: DGEMM time (s) per core, 64 cores]

  8. Core-level Variability
     - Each core runs the MKL DGEMM benchmark; matrix size chosen so as to fit within the L1 cache
     - Core specialization: a Cray OS feature allowing users to reserve cores for handling system services
     - Without core specialization: Max to Min variation 11.18%
     - With core specialization (cores 0-63): reported Max to Min variation 5.22% and 6.01%, max run-to-run variation 5.91%
     [Figure: DGEMM time (s) on 64 cores, without and with core specialization]
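For reference, a minimal sketch of the per-core measurement pattern behind these plots, not the exact benchmark used on the slide: one MPI rank pinned per core times a small, cache-resident DGEMM, and rank 0 reports the spread across cores. The matrix size, repetition count, and build flags are illustrative assumptions; MKL should be run with one thread per rank. On Cray systems, core specialization is requested at launch time (the aprun -r option reserves cores for system services) rather than in the code.

    /* Sketch: per-core DGEMM timing with one MPI rank per core.
     * Assumes MKL's CBLAS interface and sequential MKL (MKL_NUM_THREADS=1). */
    #include <mpi.h>
    #include <mkl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 32;                       /* illustrative size, small enough to stay cache-resident */
        const int reps = 100000;
        double *a = malloc(n * n * sizeof *a);
        double *b = malloc(n * n * sizeof *b);
        double *c = malloc(n * n * sizeof *c);
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        double t = MPI_Wtime() - t0;

        double tmin, tmax;                      /* spread across cores = core-to-core variability */
        MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("cores=%d  min=%.3fs  max=%.3fs  max-to-min=%.2f%%\n",
                   size, tmin, tmax, 100.0 * (tmax - tmin) / tmin);

        free(a); free(b); free(c);
        MPI_Finalize();
        return 0;
    }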

  9. Core-level Variability
     - Benchmark: Selfish
     - Runs in a tight loop and measures the time for each iteration
     - If an iteration takes longer than a particular threshold, the timestamp (noise) is recorded
     [Figure: Noise (us) vs. actual time, noise events; OS noise effects on a core without core specialization]

  10. Core-level Variability
     - Benchmark: Selfish (tight loop; iterations exceeding the threshold are recorded as noise)
     - Core specialization is an effective mitigation for core-level variability
     [Figures: Noise (us); OS noise effects on a core without core specialization vs. with core specialization]
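A minimal sketch of the selfish-detour idea described on these slides: spin in a tight loop and record any iteration whose duration exceeds a threshold, attributing the excess to OS interference. The threshold and iteration count below are assumptions; the actual Selfish benchmark calibrates its loop and uses lower-overhead timing.

    /* Sketch of a selfish-detour style noise probe. */
    #include <stdio.h>
    #include <time.h>

    #define MAX_EVENTS 10000

    static double now_us(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
    }

    int main(void) {
        const double threshold_us = 5.0;        /* assumed noise threshold */
        const long iters = 100000000L;          /* assumed loop length */
        static double when[MAX_EVENTS], dur[MAX_EVENTS];
        long events = 0;
        double prev = now_us();
        for (long i = 0; i < iters; i++) {
            double t = now_us();
            double dt = t - prev;
            if (dt > threshold_us && events < MAX_EVENTS) {  /* detour: likely preempted by the OS */
                when[events] = t;
                dur[events] = dt;
                events++;
            }
            prev = t;
        }
        for (long e = 0; e < events; e++)        /* report after the loop to avoid perturbing it */
            printf("noise event at %.1f us, duration %.2f us\n", when[e], dur[e]);
        printf("%ld noise events over %ld iterations\n", events, iters);
        return 0;
    }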

  11. Core-level Variability
     - Benchmark: Selfish
     - Small micro-benchmark in the milliseconds range: noise is significant
     [Figure: Noise (us)]

  12. Core-level Variability
     - Benchmark: Selfish
     - Small micro-benchmark in the milliseconds range: noise is significant
     - Micro-benchmark in the seconds range: time scale matters; runtimes greater than seconds don't see the impact
     [Figures: Noise (us) for the milliseconds-range and seconds-range runs]

  13. Node-level Variability
     - Variability due to memory mode
     - KNL has two types of memory:
       - DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
       - MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth

  14. Node-level Variability
     - Variability due to memory mode
     - KNL has two types of memory:
       - DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
       - MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth
     - MCDRAM can be operated in two modes: flat mode or cache mode

  15. Node-level Variability
     - Variability due to memory mode
     - KNL has two types of memory:
       - DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
       - MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth
     - MCDRAM can be operated in two modes: flat mode or cache mode
     - Source of variability: in cache mode, MCDRAM operates as a direct-mapped cache to DRAM; potential conflicts because of the direct mapping
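In flat mode the application must place data in MCDRAM itself. A minimal sketch of one common way to do this, using the memkind hbwmalloc interface (link with -lmemkind); the array size is illustrative, and running an unmodified binary under numactl bound to the MCDRAM NUMA node is an alternative. This is not specific advice from the slides, just a hedged example of explicit placement.

    /* Sketch: allocate a bandwidth-critical array from MCDRAM in flat mode. */
    #include <hbwmalloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1UL << 27;                        /* ~1 GiB of doubles, well within 16 GB MCDRAM */
        int use_hbw = (hbw_check_available() == 0);  /* 0 means high-bandwidth memory is available */
        double *a = use_hbw ? hbw_malloc(n * sizeof *a) : malloc(n * sizeof *a);
        if (!a) return 1;

        for (size_t i = 0; i < n; i++) a[i] = 1.0;   /* touch pages so they are actually placed */
        /* ... bandwidth-critical work on a[] ... */

        if (use_hbw) hbw_free(a); else free(a);
        return 0;
    }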

  16. Node-level Variability: STREAM TRIAD in flat mode
     - STREAM TRIAD benchmark used to measure memory bandwidth with A(i) = B(i) + s * C(i)
     - Run on 63 cores, with one core reserved for core specialization, and a working set of 7.5 GB
     - Less than 1% variability; ~480 GB/s effective bandwidth
     [Figure: Bandwidth (GB/s) vs. job number]
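For clarity, a minimal OpenMP sketch of the TRIAD kernel named above and the bandwidth it implies (three arrays, 8 bytes per element, moved once per iteration). The array length here is an assumption for illustration; the slide's runs use a 7.5 GB working set on 63 cores.

    /* Sketch of the STREAM TRIAD kernel a(i) = b(i) + s*c(i). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        const long n = 100000000L;               /* ~2.4 GB across the three arrays (assumed size) */
        const double s = 3.0;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        if (!a || !b || !c) return 1;

        #pragma omp parallel for                 /* parallel first touch / initialization */
        for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for                 /* the TRIAD kernel itself */
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
        double t = omp_get_wtime() - t0;

        double gbytes = 3.0 * n * sizeof(double) / 1e9;
        printf("TRIAD: %.3f s, %.1f GB/s\n", t, gbytes / t);
        free(a); free(b); free(c);
        return 0;
    }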

  17. Node-level Variability: STREAM TRIAD in flat mode
     - STREAM benchmark using 63 cores (one core reserved for core specialization) and a working set of 7.5 GB
     - Less than 1% variability; ~480 GB/s effective bandwidth
     - MCDRAM writes are consistent across all the nodes
     [Figures: Bandwidth (GB/s) vs. job number; DRAM read/write and MCDRAM read/write counter values]

  18. Node-level Variability: STREAM TRIAD in cache mode
     - STREAM benchmark using 63 cores (one core reserved for core specialization) and a working set of 7.5 GB
     - Max. 4.5% run-to-run and 2X job-to-job variability; ~350 GB/s effective bandwidth
     [Figure: Bandwidth (GB/s) vs. job number]

  19. Node-level Variability: STREAM TRIAD in cache mode
     - STREAM benchmark using 63 cores (one core reserved for core specialization) and a working set of 7.5 GB
     - Max. 4.5% run-to-run and 2X job-to-job variability; ~350 GB/s effective bandwidth
     - Higher bandwidth correlates with a lower MCDRAM miss ratio (more MCDRAM writes due to conflicts)
     [Figures: Bandwidth (GB/s) vs. job number; DRAM reads & writes, MCDRAM hits & misses and reads & writes counter values]

  20. Network-level Variability
     - Cray XC Dragonfly topology
     - Potential link sharing between user jobs
     - High chance of inter-job contention
     - Sources of variability (inter-job contention): size of the job, node placement, workload characteristics, co-located job mix

  21. Network-level Variability: MPI Collectives
     - MPI_Allreduce using 64 processes with an 8 MB message
     - Repeated 100 times within a job
     - Measured on several days (changes in node placement and job mix)
     - Isolated system run: < 1% variability (best observed)

  22. Network-level Variability: MPI Collectives
     - MPI_Allreduce using 64 processes with an 8 MB message, repeated 100 times within a job
     - Measured on several days, with changes in node placement and job mix
     - Isolated system run: < 1% variability (best observed)
     - Variability is around 35%
     - Much higher variability with smaller message sizes (not shown here)
     - Each box shows the median, IQR (inter-quartile range), and the outliers
     [Figure: Latency (s) of 128-node Allreduce, 8 MB message, 64 PPN, per measurement date, compared against the default and best observed runs]
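A minimal sketch of the kind of measurement loop behind such a plot, not the exact benchmark used here: time MPI_Allreduce on an 8 MB buffer (1048576 doubles), repeat it 100 times, and emit per-iteration latencies so medians and IQRs can be computed afterwards. Taking the maximum time over ranks as the collective's latency and using MPI_SUM on doubles are assumptions; the node count and processes per node are launch-time choices, not visible in the code.

    /* Sketch: repeated timing of an 8 MB MPI_Allreduce. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1048576;               /* 1M doubles = 8 MB message */
        const int reps = 100;
        double *in  = malloc(count * sizeof *in);
        double *out = malloc(count * sizeof *out);
        for (int i = 0; i < count; i++) in[i] = 1.0;

        for (int r = 0; r < reps; r++) {
            MPI_Barrier(MPI_COMM_WORLD);          /* align ranks before each repetition */
            double t0 = MPI_Wtime();
            MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double t = MPI_Wtime() - t0;
            double tmax;                          /* slowest rank defines the collective latency */
            MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
            if (rank == 0) printf("%d %.6f\n", r, tmax);
        }

        free(in); free(out);
        MPI_Finalize();
        return 0;
    }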

  23. Summary on Variability
     - Core-to-core variability due to OS noise:
       - Core 0 is slow compared to the rest of the cores
       - Crucial for low-latency MPI benchmarking and for micro-kernel benchmarking
       - Longer time scales don't see the effect
       - Core specialization helps reduce the overhead
       - Frequency scaling effects are not dominant enough to induce variability
     - Node-level variability due to MCDRAM cache page conflicts:
       - Around 2X variability on the STREAM benchmark
       - Linux zone sort helps improve average performance and reduce variability to some extent
       - Example mini-apps that are sensitive: Nekbone, MiniFE
       - For applications whose working set fits within MCDRAM, using flat mode is the mitigation
     - Network-level variability due to inter-job contention:
       - Up to 35% for large-message MPI collectives
       - Even higher variability for latency-bound small collectives
       - No obvious mitigation

  24. Application-Level Variability
     - Nekbone variability at the node level
     - Nekbone: a mini-app derived from Nek5000
       - Streaming kernels: bandwidth bound (DAXPY+)
       - Matrix multiply: compute bound (MXM)
       - Communication bound (COMM)
     - Max. to Min. ratio: 3.5% and 3.57%
     [Figure: Time (s) vs. job number for total time, DAXPY+, MXM, and COMM; flat mode on Theta]
