Run-to-run Variability on Theta and Best Practices for Performance Benchmarking
ALCF Developer Session – September 26th 2018 Sudheer Chunduri sudheer@anl.gov
Run-to-run Variability
Equal work is not equal time.
Image courtesy: https://concertio.com/2018/07/02/dealing-with-variability/
§ Sources of Variability
  § Core level
  § Node level
  § System level
§ Challenges
  § Less reliable performance measures (multiple repetitions with statistical-significance analysis are required)
  § Performance tuning: quantifying the impact of a code change is difficult
  § Difficult to predict job duration
§ Classify and quantify sources of variability
§ Present ways to mitigate wherever possible
Theta
§ System: Cray XC40 (#21 in the June 2018 Top500); 14 similar systems among the top 50 supercomputers; 4,392 compute nodes / 281,088 cores; 11.69 PF peak performance
§ Processor: 2nd-generation Intel Xeon Phi (Knights Landing) 7230; 64 cores, with 2 cores per tile sharing an L2 cache; 1.3 GHz base frequency, turbo up to 1.5 GHz
§ Node: single-socket KNL; 192 GB DDR4-2400 per node; 16 GB MCDRAM per node (Cache mode/Flat mode)
§ Network: Cray Aries interconnect with Dragonfly topology; adaptive routing
[Figures: KNL processor and node diagrams. Source: Intel, Cray]
DGEMM on 64 cores
§ Each core runs the MKL DGEMM benchmark
§ Matrix size chosen so as to fit within the L1 cache
§ Without core specialization: 11.18% max-to-min variation across cores
§ Core specialization, a Cray OS feature allowing users to reserve cores for handling system services, reduces the max-to-min variation to 5.22%
[Figures: per-core DGEMM time (s), cores 0–63, without and with core specialization; one panel also reports a 5.91% max run-to-run variation and a 6.01% max-to-min variation]
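The deck does not include the benchmark source, so here is a minimal sketch of the per-core DGEMM experiment, assuming one OpenMP thread pinned per core (e.g. OMP_PROC_BIND=spread, OMP_PLACES=cores); the matrix size and repetition count are illustrative, not the values used on Theta:

```c
/* Minimal sketch (not the original code): each OpenMP thread, pinned one
 * per core, repeatedly runs a small MKL DGEMM whose working set fits in
 * L1; the per-core totals expose core-to-core variability.
 * Compile with e.g.: icc -qopenmp -mkl dgemm_percore.c */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>

#define N    32       /* 3 * 32*32 doubles = 24 KB, within KNL's 32 KB L1 */
#define REPS 100000   /* illustrative repetition count */

int main(void) {
    double t[256] = {0};
    mkl_set_num_threads(1);          /* one sequential DGEMM per thread */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        double A[N*N], B[N*N], C[N*N];
        for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
        double t0 = omp_get_wtime();
        for (int r = 0; r < REPS; r++)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);
        t[id] = omp_get_wtime() - t0;  /* total DGEMM time on this core */
    }
    for (int i = 0; i < omp_get_max_threads(); i++)
        printf("core %2d: %.3f s\n", i, t[i]);  /* spread = core variability */
    return 0;
}
```

On Cray XC systems, core specialization is requested at launch time rather than in the code (e.g. `aprun -r 1` reserves one core for OS services), so the same binary serves both the baseline and the core-specialized runs.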
OS noise: the Selfish benchmark
§ Benchmark: Selfish — runs in a tight loop and measures the time of each iteration
§ If an iteration takes longer than a chosen threshold, its timestamp is recorded as a noise event
§ Without core specialization, noise events appear regularly; with core specialization they largely disappear — core specialization is an effective mitigation for core-level variability
§ Time scale matters: this is a micro-benchmark in the seconds range, and runtimes greater than seconds don't see the impact
[Figures: noise event durations (µs) over time on a core, without and with core specialization]
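The Selfish benchmark itself is not listed in the deck; a minimal sketch of the technique it describes — a tight loop that flags iterations exceeding a threshold as noise events — might look like this (the 1 µs threshold and the loop count are assumptions):

```c
/* Selfish-style OS-noise detector sketch: time a tight loop and record
 * any iteration whose duration exceeds a threshold as a noise event. */
#include <stdio.h>
#include <time.h>

#define MAX_EVENTS   4096
#define THRESHOLD_NS 1000   /* flag iterations longer than 1 us (assumed) */

static inline long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    long long t_prev = now_ns(), when[MAX_EVENTS], len[MAX_EVENTS];
    int n = 0;
    /* ~1e8 iterations keeps the total runtime in the seconds range,
     * the scale at which the slides say the effect is visible. */
    for (long i = 0; i < 100000000L && n < MAX_EVENTS; i++) {
        long long t = now_ns(), dt = t - t_prev;
        if (dt > THRESHOLD_NS) {     /* iteration was interrupted: noise */
            when[n] = t;
            len[n++] = dt;
        }
        t_prev = t;
    }
    for (int i = 0; i < n; i++)
        printf("%lld %lld\n", when[i], len[i]);  /* timestamp, duration (ns) */
    return 0;
}
```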
Variability due to memory mode
§ DRAM: 192 GB capacity, ~90 GB/s effective bandwidth
§ MCDRAM: 16 GB capacity, ~480 GB/s effective bandwidth
§ MCDRAM can be operated in two modes: Flat mode and Cache mode
§ Source of variability: in Cache mode the MCDRAM acts as a direct-mapped cache for DRAM, so cache page conflicts cause extra misses (quantified on the following slides)
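In Flat mode the 16 GB of MCDRAM appears as a separate NUMA node, so placement is explicit. One common approach (not specific to this deck) is the memkind library's hbwmalloc API, sketched below; binding the whole job with `numactl -m 1 ./app` is the blunter alternative, since MCDRAM is typically NUMA node 1 on a flat-mode KNL:

```c
/* Sketch: explicit MCDRAM allocation in Flat mode via memkind's hbwmalloc.
 * Link with -lmemkind. */
#include <stdio.h>
#include <hbwmalloc.h>

int main(void) {
    if (hbw_check_available() != 0) {   /* 0 means high-bandwidth memory is usable */
        fprintf(stderr, "no high-bandwidth memory (flat-mode MCDRAM) available\n");
        return 1;
    }
    size_t bytes = 1UL << 30;           /* 1 GB, well under the 16 GB of MCDRAM */
    double *buf = hbw_malloc(bytes);
    if (!buf) { fprintf(stderr, "hbw_malloc failed\n"); return 1; }
    buf[0] = 42.0;                      /* touch it: pages land in MCDRAM */
    hbw_free(buf);
    return 0;
}
```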
STREAM TRIAD in Flat mode
§ The STREAM TRIAD benchmark measures memory bandwidth with A(i) = B(i) + s * C(i)
§ Run on 63 cores, with one core reserved for core specialization; working set of 7.5 GB
§ Less than 1% variability at ~480 GB/s effective bandwidth
§ Hardware counters show MCDRAM writes are consistent across all the nodes
[Figures: bandwidth (GB/s) vs. job number; counter values for DRAM reads & writes and MCDRAM reads & writes]
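For reference, the TRIAD kernel is simple enough to sketch in a few lines; the array size below is illustrative rather than the 7.5 GB working set used on Theta:

```c
/* Minimal STREAM TRIAD sketch: a[i] = b[i] + s*c[i], timed with OpenMP.
 * Compile with e.g.: cc -O2 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* 512 MB per array of doubles; adjust as needed */

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c), s = 3.0;
    /* Parallel first touch so pages are placed near the threads using them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];
    double t = omp_get_wtime() - t0;
    /* TRIAD moves 24 bytes per element: two reads and one write */
    printf("TRIAD: %.1f GB/s\n", 24.0 * N / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```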
STREAM TRIAD in Cache mode
§ Same setup: 63 cores with one core for core specialization, working set of 7.5 GB
§ ~350 GB/s effective bandwidth, with noticeably more job-to-job spread (around 2x variability; see the summary slide)
§ Higher bandwidth correlates with a lower MCDRAM miss ratio (more MCDRAM writes due to conflicts!)
[Figures: bandwidth (GB/s) vs. job number; counter values for DRAM reads & writes and MCDRAM hits & misses, reads & writes]
Network-level variability
§ Cray XC Dragonfly topology
§ Links are potentially shared between user jobs, so chances of inter-job contention are high
§ Sources of variability → inter-job contention: size of the job, node placement, workload characteristics, co-located job mix
MPI Collectives
§ MPI_Allreduce on 128 nodes with 64 processes per node and an 8 MB message (1,048,576 doubles)
§ Repeated 100 times within a job; measured on several days, across changes in node placement and job mix
§ Isolated system run: < 1% variability — the best observed
§ In normal operation, variability is around 35%; much higher variability with smaller message sizes (not shown here)
§ Each box shows the median, the IQR (inter-quartile range) and the outliers
[Figure: "128 nodes Allreduce 8MB 64 PPN" — latency (s) boxplots per measurement date against the best observed and the median-of-medians (MoM), with ±5% and ±10% bands; different jobs across dates]
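The measurement harness is not shown in the deck; a minimal sketch of the experiment as described — an 8 MB MPI_Allreduce repeated 100 times with per-repetition latencies recorded — could be:

```c
/* Sketch of the MPI_Allreduce timing experiment described above:
 * an 8 MB allreduce repeated 100 times, one latency line per repetition. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;   /* 1,048,576 doubles = 8 MB */
    double *in  = calloc(n, sizeof *in);
    double *out = calloc(n, sizeof *out);

    for (int rep = 0; rep < 100; rep++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* align ranks before timing */
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;
        if (rank == 0) printf("%d %.6f\n", rep, t);  /* rep, latency (s) */
    }
    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```

Launched, on a Cray XC, with something like `aprun -n $((128*64)) -N 64 ./allreduce_bench` (the node count and PPN are those from the slide).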
Summary of variability sources
§ Core-to-core variability due to OS noise
  § Core 0 is slow compared to the rest of the cores
  § Crucial for low-latency MPI benchmarking and for micro-kernel benchmarking
  § Longer time scales don't see the effect
  § Core specialization helps reduce the overhead
  § Frequency-scaling effects are not dominant enough to induce variability
§ Node-level variability due to MCDRAM cache page conflicts
  § Around 2x variability on the STREAM benchmark
  § Linux zone sort helps improve average performance and reduce variability to some extent
  § Example mini-apps that are sensitive: Nekbone, MiniFE
  § For applications with working sets that fit within MCDRAM, using Flat mode is the mitigation
§ Network-level variability due to inter-job contention
  § Up to 35% for large-message MPI collectives
  § Even higher variability for latency-bound, small-message collectives
  § No obvious mitigation
Nekbone variability at the node level
§ Nekbone: a mini-app derived from Nek5000; this problem configuration is memory-bandwidth intensive
§ Flat mode on Theta: 3.57% max-to-min variability
§ Cache mode on Theta: 22% max-to-min variability
[Figures: Totaltime, DAXPY+MXM, and COMM time (s) vs. job number, in Flat mode and Cache mode]
Nekbone variability at the network level
§ With a different input, Nekbone is communication bound
§ 32.14% variability across 128-node jobs on Theta; variability in total time tracks variability in COMM time
§ On 256 nodes, with 5 repetitions within each job (all repetitions use the same node allocation): run-to-run ratio = 32.1%, job-to-job ratio = 36.9%
[Figures: time (s) for 128-node and 256-node runs on Theta]
MILC variability at the network level
§ MILC: MIMD Lattice Computation — a QCD code simulating 4D SU(3) lattice gauge theory
§ Performs large-scale numerical simulations to study quantum chromodynamics (QCD)
§ Compute intensive per lattice site, with a low memory footprint per compute node
§ Job-to-job variability: 74.67% range on 128-node jobs on Theta; 41% on 256-node jobs
§ Runs with higher total time show correspondingly higher time in the communication (MPI) part — Cray PAT MPI profiling
[Figure: time (s) vs. job number for 128-node runs, ranging roughly 1600–3000 s]
Performance tuning in the presence of variability
Nekbone
§ Optimization: libxsmm to speed up small matrix multiplies
§ Impact of the optimization in Flat mode: 20.7% (no variability)
§ In Cache mode, individual base-vs-optimized comparisons ranged from +2% to +35%; averaging over repetitions gives an 18.8% improvement (95% CI)
MILC
§ Optimization: rank reordering to minimize inter-node traffic
§ Impact of the optimization in a less variable environment: 22%
§ Production-mode average performance improvement: 23.3%
[Figures: Nekbone time (s), Base vs. Optimized; MILC MFlops, Base vs. Optimized]
Conclusions
§ Classified and quantified sources of variability on the Xeon Phi based Cray XC
  § Core-level variability due to OS noise
  § Memory-mode variability due to Cache-mode page conflicts
  § Network variability due to shared network resources
§ Characterized the impact on applications: up to 70% for MILC, up to 35% for Nekbone
§ Guidelines on performance tuning in the presence of variability:
  § Be aware of network-level congestion, which has no clear mitigation strategy and can affect communication-intensive applications
  § Incorporate statistical analysis into performance benchmarking and analysis (see https://htor.inf.ethz.ch/publications/img/hoefler-scientific-benchmarking.pdf for more details on the statistics; a minimal sketch follows below)
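As a concrete starting point for the statistical analysis recommended above, here is a small sketch (with made-up sample timings) that reports the median and IQR — the same nonparametric summary used in the boxplots earlier in the deck:

```c
/* Nonparametric summary (median and IQR) of repeated timings, in the
 * spirit of the scientific-benchmarking guidelines cited above. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Linear-interpolated percentile of a sorted array (0 <= p <= 1). */
static double percentile(const double *v, int n, double p) {
    double idx = p * (n - 1);
    int lo = (int)idx;
    double frac = idx - lo;
    return lo + 1 < n ? v[lo] * (1 - frac) + v[lo + 1] * frac : v[lo];
}

int main(void) {
    /* Illustrative timings (s); in practice, read the benchmark's output. */
    double t[] = {0.36, 0.38, 0.37, 0.41, 0.36, 0.39, 0.52, 0.37};
    int n = sizeof t / sizeof t[0];
    qsort(t, n, sizeof t[0], cmp);
    double q1  = percentile(t, n, 0.25),
           med = percentile(t, n, 0.50),
           q3  = percentile(t, n, 0.75);
    printf("median %.3f s, IQR [%.3f, %.3f] s\n", med, q1, q3);
    return 0;
}
```

Reporting the median with the IQR (rather than a single "best" run) is what makes effects like the +2% vs. +35% Nekbone comparisons above distinguishable from noise.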