Using the Roofline Model and Intel Advisor
Samuel Williams
SWWilliams@lbl.gov Computational Research Division Lawrence Berkeley National Lab
Tuomas Koskela
TKoskela@lbl.gov NERSC Lawrence Berkeley National Lab
Acknowledgements
§ This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
§ This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
§ This research used resources of the Oak Ridge Leadership Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
§ Special Thanks to:
§ Identify performance bottlenecks § Motivate software optimizations § Determine when we’re done optimizing
§ Predict performance on future machines / architectures
§ Models and tools must balance simplicity against accuracy in capturing the computational needs of today’s applications.
§ Historically, many performance models and simulators tracked latencies to predict performance (i.e. counting cycles) § The last two decades saw a number of latency-hiding techniques…
§ Effective latency hiding has resulted in a shift from a latency-limited computing regime to a throughput-limited computing regime
§ The Roofline Model is a throughput-oriented performance model that tracks rates rather than individual latencies, augmented with Little’s Law (concurrency = latency*bandwidth)
(applies to CPUs, GPUs, Google TPUs1, etc…)
§ Three main components:
§ Machine Characterization (realistic performance potential of the system)
§ Application Instrumentation (how the application actually executes)
§ Theoretical bounds (how well the application could perform with perfect compilers, procs, …)
1 Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA, 2017.
https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
§ Ideally, we could always attain peak Flop/s
§ However, finite locality (reuse) limits performance.
§ Plot the performance bound using Arithmetic Intensity (AI) as the x-axis…
§ AI depends on the kernel, problem size, cache performance, etc…
§ Kernels with low AI are ultimately memory bound.
[Roofline figure: Attainable Flop/s vs. Arithmetic Intensity (Flop:Byte), with memory-bound and compute-bound regions]
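The bound sketched above is simply the minimum of the compute ceiling and the bandwidth line. A minimal sketch in C (the peak numbers used in the example below are illustrative placeholders, not measured values):

```c
#include <assert.h>

/* Attainable performance under the basic Roofline bound:
 * min(peak GFLOP/s, arithmetic intensity * peak GB/s). */
double roofline_gflops(double peak_gflops, double peak_gbs, double ai) {
    double mem_bound = ai * peak_gbs;          /* bandwidth-limited rate */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

For example, with an illustrative 2000 GFLOP/s peak and 400 GB/s of DRAM bandwidth, a kernel at AI = 0.5 is capped at 200 GFLOP/s (memory bound), while a kernel at AI = 10 reaches the 2000 GFLOP/s compute ceiling.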
§ Typical machine balance is 5-10 flops per byte…
§ Consider STREAM Triad…
[Roofline figure: Attainable Flop/s vs. Arithmetic Intensity (Flop:Byte), with TRIAD marked on the bandwidth line]
#pragma omp parallel for
for(i=0;i<N;i++){
  Z[i] = X[i] + alpha*Y[i];
}
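As a worked check (my annotation, assuming 8-byte doubles and ignoring write-allocate traffic), TRIAD's intensity can be computed directly from the loop body:

```c
#include <assert.h>

/* Per-iteration arithmetic intensity of STREAM TRIAD:
 * 2 flops (one multiply, one add) vs. three 8B transfers
 * (load X[i], load Y[i], store Z[i]).  Write-allocate traffic
 * would add another 8B per iteration and lower AI further. */
double triad_ai(void) {
    double flops = 2.0;        /* X[i] + alpha*Y[i]   */
    double bytes = 3.0 * 8.0;  /* two loads, one store */
    return flops / bytes;      /* ~0.083 flop/byte     */
}
```

At roughly 0.083 flops per byte, TRIAD sits far below the typical 5-10 flop/byte machine balance, so it is firmly memory bound.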
§ Conversely, 7-point constant coefficient stencil…
§ …higher AI than TRIAD, and thus roughly 3x the flop rate at the same DRAM bandwidth
[Roofline figure: Attainable Flop/s vs. Arithmetic Intensity (Flop:Byte), with both the 7-point stencil and TRIAD marked]
#pragma omp parallel for
for(k=1;k<dim+1;k++){
for(j=1;j<dim+1;j++){
for(i=1;i<dim+1;i++){
  int ijk = i + j*jStride + k*kStride;
  new[ijk] = -6.0*old[ijk]
           + old[ijk-1      ] + old[ijk+1      ]
           + old[ijk-jStride] + old[ijk+jStride]
           + old[ijk-kStride] + old[ijk+kStride];
}}}
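Counting the stencil's work the same way (my annotation, assuming 8-byte doubles; the figures agree with the flop:L1 ≈ 0.11 and flop:DRAM ≈ 0.44 intensities quoted later in these slides):

```c
#include <assert.h>

/* Arithmetic intensities for the 7-point stencil (8B doubles).
 * L1 view: 7 flops vs. 7 loads of old[] + 1 store of new[] = 64B,
 *   since the L1 must service every load -> ~0.11 flop/byte.
 * DRAM view with good cache reuse: each point of old[] is read from
 *   DRAM once and each point of new[] written once (16B)
 *   -> 7/16 ~ 0.44 flop/byte. */
double stencil_ai_l1(void)   { return 7.0 / (8.0 * 8.0); }
double stencil_ai_dram(void) { return 7.0 / (2.0 * 8.0); }
```

The gap between the two intensities is exactly the cache reuse the hierarchical Roofline makes visible.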
§ Real processors have multiple levels of memory
§ We may measure a bandwidth and define an AI for each level, giving each kernel multiple AI’s
§ Performance is then bound by the minimum over all levels of their bandwidths and AI’s
[Hierarchical Roofline figure: Attainable Flop/s vs. Arithmetic Intensity (Flop:Byte), with one bandwidth ceiling per memory level]
§ We have assumed one can attain peak Flop/s with high locality.
§ In reality, this is premised on sufficient ILP, DLP (vectorization), FMA, and TLP…
§ Without these, even high-AI kernels fall below the nominal compute-bound ceiling
§ TLP (single core can’t saturate BW w/scalar code)
[Roofline figure with in-core ceilings: Peak Flop/s, No FMA, No vectorization]
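The gap between these ceilings can be estimated from first principles. A minimal sketch (the core count, clock, and SIMD width in the example are illustrative placeholders, not any specific processor):

```c
#include <assert.h>

/* Illustrative in-core ceiling: peak GFLOP/s ~ cores * GHz
 * * SIMD lanes * 2 (FMA).  Losing FMA halves the ceiling;
 * scalar code divides it by the vector width; running a single
 * thread divides it by the core count. */
double peak_gflops(int cores, double ghz, int simd_lanes, int fma_wide) {
    return cores * ghz * simd_lanes * fma_wide;
}
```

For example, 8 cores at 2.5 GHz with 4-lane double-precision SIMD and FMA give 160 GFLOP/s; the same chip without FMA drops to 80, and scalar non-FMA code to 20, i.e. an 8x spread purely from in-core effects.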
Machine Characterization: potential of my target system
§ Is performance limited by a lack of FMA, DLP, ILP, TLP?
§ Is performance limited by reduced AI (i.e. memory/cache bandwidth)?
§ Is performance limited by NUMA, strided, or random memory access patterns?
Application Instrumentation: properties of my app’s execution
§ What are my app’s flop rate and AI at each memory level?
§ Challenges: vendor specifications are often unattainable “Marketing Numbers”
§ https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
§ Empirical Roofline Toolkit (ERT)
[Empirical Roofline Graphs: Cori / KNL; SummitDev / 4 GPUs — 17904.6 GFLOP/s maximum, L1 6506.5 GB/s, DRAM 1929.7 GB/s (Results.summitdev.ccs.ornl.gov.02.MPI4/Run.001)]
§ Flop counters can be broken/missing in production HW (Haswell) § Counting Loads and Stores is a poor proxy for data movement as they don’t capture reuse § Counting L1 misses is a poor proxy for data movement as they don’t account for HW prefetching. § DRAM counters are accurate, but are privileged and thus nominally inaccessible in user mode § OS/kernel changes must be approved by vendor (e.g. Cray) and the center (e.g. NERSC)
§ NERSC/CRD (==NESAP/SUPER) collaboration developed a practical instrumentation methodology for production systems:
§ Use Intel SDE binary instrumentation to create software Flop counters (could use Byfl as well)
§ Use Intel VTune (vendor- and center-approved) to access uncore counters
§ Together these give accurate measurement of Flops and DRAM data movement on HSW and KNL
http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/
NERSC is LBL’s production computing division
CRD is LBL’s Computational Research Division
NESAP is NERSC’s KNL application readiness project
LBL is part of SUPER (DOE SciDAC3 Computer Science Institute)
§ NESAP is the NERSC KNL application readiness project.
§ NESAP used Roofline to drive optimization and analysis on KNL
§ Doerfler et al., “Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor”, Intel Xeon Phi User Group Workshop (IXPUG), June 2016.
§ Barnes et al., “Evaluating and Optimizing the NERSC Workload on Knights Landing”, Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016.
[Roofline plots on 2P HSW and KNL for MFDn, PICSAR, and EMGeo, each showing its optimization progression — Original, w/Tiling, w/Tiling+Vect; Original, SELL, SB, SELL+SB, nRHS+SELL+SB; 1, 4, 8 RHS — against the wo/FMA ceiling]
§ Having to compose VTune, SDE, and graphing tools worked correctly and benefitted NESAP, but … § …placed a very high burden on users…
§ …still lacked integration with compiler/debugger/disassembly § CRD/NERSC wanted a more integrated solution…
There are two major Roofline variants:
§ Original / DRAM / Hierarchical Roofline…
§ “Cache-Aware” Roofline
§ Why does this matter?
“Cache-Aware” Roofline
§ Captures cache effects
§ Single Arithmetic Intensity
§ AI independent of problem size
§ AI is Flop:Bytes as presented to the L1 cache
§ Memory/Cache/Locality effects are indirectly observed
§ Requires static analysis or binary instrumentation to measure AI

(Hierarchical) Roofline
§ Captures cache effects
§ Multiple Arithmetic Intensities (one per level of memory)
§ AI dependent on problem size (capacity misses reduce AI)
§ AI is Flop:Bytes after being filtered by lower cache levels
§ Memory/Cache/Locality effects are directly observed
§ Requires performance counters to measure AI
#pragma omp parallel for
for(i=0;i<N;i++){
  Z[i] = X[i] + alpha*Y[i];
}
§ L1 AI… 2 flops and three 8B L1 transfers per iteration → 0.083 flops per byte
§ No cache reuse… Z, X, and Y are never reused in iteration i+delta for any delta.
§ … leads to a DRAM AI equal to the L1 AI
“Cache-Aware” Roofline vs. Roofline:
[Side-by-side rooflines for TRIAD — Roofline: multiple AI’s, 1) based on flop:DRAM bytes, 2) based on flop:L1 bytes (the same here); actual performance is the minimum of the two intercepts. Cache-Aware: single AI based on flop:L1 bytes; observed performance is correlated with DRAM bandwidth]
#pragma omp parallel for
for(k=1;k<dim+1;k++){
for(j=1;j<dim+1;j++){
for(i=1;i<dim+1;i++){
  int ijk = i + j*jStride + k*kStride;
  new[ijk] = -6.0*old[ijk]
           + old[ijk-1      ] + old[ijk+1      ]
           + old[ijk-jStride] + old[ijk+jStride]
           + old[ijk-kStride] + old[ijk+kStride];
}}}
§ L1 AI… 7 flops; the L1 must service the full number of loads (7 x 8B loads + 1 x 8B store) → ≈0.11 flops per byte
§ Moderate cache reuse… each old[] value is reused by neighboring iterations, so ideally DRAM sees only one read and one write per point
§ … leads to DRAM AI (≈0.44) larger than the L1 AI
“Cache-Aware” Roofline vs. Roofline:
[Side-by-side rooflines for the stencil — Roofline: multiple AI’s, 1) flop:DRAM ≈ 0.44, 2) flop:L1 ≈ 0.11; actual performance is the minimum of the two. Cache-Aware: single AI based on flop:L1 bytes; observed performance is between the L1 and DRAM lines (== some cache locality)]
“Cache-Aware” Roofline vs. Roofline (larger problem):
[Side-by-side rooflines — Roofline: multiple AI’s, 1) flop:DRAM ≈ 0.20, 2) flop:L1 ≈ 0.11; capacity misses reduce the DRAM AI and performance. Cache-Aware: single AI based on flop:L1 bytes; observed performance is closer to the DRAM line (== less cache locality)]
*DRAM Roofline and OS/X Advisor GUI: These are preview features that may or may not be included in mainline product releases. They may not be stable as they are prototypes incorporating very new functionality. Intel provides preview features in order to collect user feedback and plans further development and productization steps based on the feedback.
§ Integrated Performance Analysis Tool
§ CRD/NERSC began a collaboration with Intel
NESAP is NERSC’s KNL application readiness project
SciDAC is the DOE Office of Science’s Scientific Discovery through Advanced Computing program
ECP is the DOE’s Exascale Computing Project
Background
§ https://software.intel.com/en-us/intel-advisor-xe
§ https://software.intel.com/en-us/articles/getting-started-with-intel-advisor-roofline-feature
§ https://www.youtube.com/watch?v=h2QEM1HpFgg
Running Advisor on NERSC Systems
§ http://www.nersc.gov/users/software/performance-and-debugging-tools/advisor/
§ Compile… use ‘-g’ when compiling
§ Submit Job…
  % salloc --perf=vtune        <<< interactive sessions; --perf only needed for DRAM Roofline
  -or-
  #SBATCH --perf=vtune         <<< batch submissions; --perf only needed for DRAM Roofline
§ Benchmark…
  % module load advisor
  % export ADVIXE_EXPERIMENTAL=roofline_ex   <<< only needed for DRAM Roofline
  % srun [args] advixe-cl -collect survey -no-stack-stitching -project-dir $DIR -- ./a.out [args]
  % srun [args] advixe-cl -collect tripcounts -flops-and-masks -callstack-flops -project-dir $DIR -- ./a.out [args]
§ Use Advisor GUI…
  % module load advisor
  % export ADVIXE_EXPERIMENTAL=roofline_ex   <<< only needed for DRAM Roofline
  % advixe-gui $DIR
§ Apply to a 512³ domain on a single NUMA node (single HSW socket)
§ Create 5 code variants to highlight effects (as seen in Advisor)
ver0. Baseline: thread over outer loop (k), but prevent vectorization

  #pragma novector                              // prevent simd
  int ijk = i*iStride + j*jStride + k*kStride;  // variable iStride to confuse the compiler
ver1. Enable vectorization
int ijk = i + j*jStride + k*kStride; // unit-stride inner loop
ver2. Eliminate capacity misses: 2D tiling of the j-k iteration space  // working set had been O(6MB) per thread
ver3. Improve vectorization: provide aligned pointers and strides
ver4. Force vectorization / cache bypass

  __assume(jStride%8 == 0);    // stride-by-variable is still aligned
  #pragma omp simd             // force simd
  #pragma vector nontemporal   // force cache bypass
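For reference, a hedged sketch of what a ver2-style j-k tiling could look like (the tile-size parameters and min helper below are my illustration, not the actual NESAP code; tiles are chosen so each thread's working set stays cache-resident):

```c
#include <assert.h>
#include <stdlib.h>

static int min_i(int a, int b) { return a < b ? a : b; }

/* 7-point stencil with 2D tiling of the j-k iteration space.
 * dim is the interior size; arrays are (dim+2)^3 with a halo of 1.
 * tj/tk are tile sizes (illustrative; tune so the per-thread working
 * set fits in cache). */
void stencil_tiled(double *restrict out, const double *restrict in,
                   int dim, int tj, int tk) {
    int jStride = dim + 2;
    int kStride = jStride * jStride;
    #pragma omp parallel for collapse(2)
    for (int kk = 1; kk < dim + 1; kk += tk)
      for (int jj = 1; jj < dim + 1; jj += tj)
        for (int k = kk; k < min_i(kk + tk, dim + 1); k++)
          for (int j = jj; j < min_i(jj + tj, dim + 1); j++)
            for (int i = 1; i < dim + 1; i++) {
                int ijk = i + j * jStride + k * kStride;
                out[ijk] = -6.0 * in[ijk]
                         + in[ijk - 1]       + in[ijk + 1]
                         + in[ijk - jStride] + in[ijk + jStride]
                         + in[ijk - kStride] + in[ijk + kStride];
            }
}
```

Calling it with tj = tk = dim degenerates to the untiled loop order, which makes it easy to verify that tiling changes only the traversal order, not the result.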
[Advisor screenshots: Cache-Aware Roofline views of the five stencil variants]

[Advisor screenshots: DRAM Roofline* views of the five stencil variants]
§ Sunday November 12th
§ 8:30am-12pm (half-day tutorial)
§ multi-/manycore focus