A tool for Bottleneck analysis and Performance Prediction for - PowerPoint PPT Presentation

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated Applications S. Madougou, A. Varbanescu, C. de Laat and R. van Nieuwpoort Universiteit van Amsterdam, NL May 23, 2016 S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Motivation Heterogeneous computing emerging as a way to computing efficiency parallel design and programming are the trends Hard to get optimal performance on heterogeneous architectures Need for tools for understanding performance on heterogeneous architectures Different approaches: profilers, simulators, performance models S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Why GPUs? For their popularity Higher pure computing horse-power than CPUs Performance enhancement for more and more applications For the challenge of getting performance on GPUs Fitness to data parallel and specific programing models Exploration of a large optimization space (via tuning, etc) S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Modelling performance, why? Scaling behavior through application parameter space Scaling behavior through hardware parameter space Performance bottlenecks Performance limiting factors S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Performance modelling (PM) Not the first, certainly not the last. Many different approaches: Simulation Analytical Statistical/ML Measurements Current approaches present many shortcomings 1 : 1 Madougou et al., An empirical evaluation of GPGPU performance models, Hetero-Par 2014. S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Main PM Obstacles Complexity Requirement for detailed hardware knowledge Dependence on hardware or application Requiring user intervention Simulation/benchmarking is time consuming S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Machine Learning Trade-Offs Pros: Doesn’t require hardware understanding Doesn’t require software understanding Sparse set of measurements is sufficient Easily publishable buzzword! Cons: Don’t know what is learned Hard to know where bottlenecks are Prone to overfitting S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Some Observations All platforms expose hardware performance counters (PCs) Performance data is easy to extract but hard to interpret S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion PC Measurements and Metrics PC: special-purpose register built into a processor to store the count of an hardware event PCs allow to establish correlation between application code and its mapping to the hardware Choice of tool for PC counting and derived metrics Low level: PAPI, vendor-specific, high level: TAU, HPCToolkit, Score-P, etc LIKWID (CPU), nvprof (GPU) used currently S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Some PCs and Metrics for CPU (Intel Nehalem) metric meaning group inst per br instructions per branch BRANCH branch rate BRANCH br rate volume of data read/write in GByte MEM mem data vol single precision arithmetic performance FLOPS SP SPFlops SPMUOPS single precision vectorization performance FLOPS SP PMUOPS vectorization performance FLOPS SP L1 miss ratio L1 data cache miss ratio CACHE dcache miss rate L1 data cache miss rate CACHE data volume between L2 and L3 L3 L3 data vol loads to stores ratio DATA L2S ratio L1 data TLB miss rate TLB L1DTLB miss rate cycles per instruction Always cpi br mispred rate branch misprediction rate BRANCH S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Some PCs and Metrics for GPU (CUDA CC 2.0) counter meaning average number of replays due to shared memory conflicts shared replay overhead for each instruction executed number of executed shared load (store) shared load|store instructions, increments per warp on a multiprocessor inst replay overhead average number of replays for each instruction executed number of cache lines that hit in L1 l1 global load hit for global memory load accesses number of cache lines that miss in L1 l1 global load miss for global memory load accesses number of executed global load instructions gld request increments per warp on a multiprocessor gst request similar to gld request for store instructions number of global store transactions global store transaction increments per transaction which can be 32,64,96 or 128 bytes requested global memory load throughput gld requested throughput ratio of average active warps achieved occupancy per active cycle to the maximum number of warps per SM memory read throughput at L2 cache l2 read throughput memory write transactions at L2 cache l2 write transactions number of instructions executed per cycle ipc S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion Counter Behavior vs Performance - CPU 2 signature pattern performance behavior HPM (group) different counts of instructions retired or FP load imbalance saturating speedup operations among cores (FLOPS DP,FLOPS SP) saturating speedup across memory BW comparable memory BW saturation cores sharing a memory interface to peak memory BW (MEM) large discrepancy between between low BW utilization despite LD/ST strided memory access simple BW-based model and actual domination, low cache hit ratios, frequent performance evicts/replacements (CACHE,DATA,MEM) performance insensitive large ratio of inst. retired to FP inst. if FP, to problem sizes fitting many cycles per inst. if long-latency arithmetic, bad instruction mix into different cache levels scalar instructions dominating in data-parallel loops (FLOPS DP,FLOPS SP,CPI) large discrepancy between low CPI near theoretical limit if instruction actual performance and simple throughput is the problem, static code analysis limited instruction throughput predictions based on max FLOP/s predicting large pressure on single execution or LD/ST throughput port (FLOPS DP,FLOPS SP,CPI) speedup going down as more cores large non-FP instruction count synchronization overhead are added, no speedup with small (growing with number of cores used), low problem sizes, core busy but low FP CPI (FLOPS DP,FLOPS SP,CPI) very low speedup or slowdown frequent (remote) evicts (CACHE) false cache line sharing even with small core counts 2 J. Treibig et al., Best practices for HPM-assisted performance engineering on modern multicore processors, CoRR, 2012 S. Madougou et al. A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated

A tool for Bottleneck analysis and Performance Prediction for - PowerPoint PPT Presentation

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated Applications S. Madougou, A. Varbanescu, C. de Laat and R. van Nieuwpoort

SynAthina Onli line Tools 1. . A mapping tool 2. A Community Tool 3. An Archive Tool 3. An

von Neumann's bottleneck von Neumann machine One control unit that connects memory and

Structured Prediction Introduction What is structured prediction? CS 6355: Structured Prediction

Branch Prediction Branch Prediction vs vs Execution Time Execution Time Prediction

High Performance Systems EuroMPI 2015 Objectives Yet another performance analysis tool

Wardrop Equilibria and Price of Stability in Bottleneck Games With Splittable Traffic Vladimir

The Information Bottleneck Method Naftali Tishby, Fernando C. Pereira, William Bialek Naftali

More demanding workload Design Goals ____ More demanding workload Bottleneck: Network stack in

Using lasso and related estimators for prediction Di Liu StataCorp July 12, 2019 1 / 20

Prediction and Odds 18.05 Spring 2017 Probabilistic Prediction Also called probabilistic

Using Stata 16s lasso features for prediction and inference Di Liu StataCorp 1 / 50

CS 104 Computer Organization and Design Branch Prediction CS104:Branch Prediction 1 Branch

Exercise 7a: Additional Intra Prediction Modes Implement Additional Block Prediction Modes Add

DeepLoc Data set statistics & performance Protein prediction II Gregor Sturm, Johannes Rest,

Black Box Scanning Tool + White Box Testing Tool Toshis Black Box Scanning Tool Same

Compiler-assisted Performance Analysis Adam Nemet Apple anemet@apple.com Hotspot User

TheScienceofComputingand theEngineeringofSoftware TonyHoare

BLAG: Improving the Accuracy of Blacklists Sivaram Ramanathan 1 , Jelena Mirkovic 1 and Minlan Yu

Quite Simple Approaches for Authorship Attribution, Intrinsic Plagiarism Detection and Sexual

FlowCAP2 Results: Challenges 1, 2, and 3 Nima Aghaeepour CIHR/MSFHR Strategic Training Program in

Data Analytics A (Short) Tour Venkatesh-Prasad Ranganath http://about.me/rvprasad Click to edit

The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems Derrick

Weak Truth Table Degrees of Structures David Belanger 1 April 2012 at UWMadison EMAIL :

Local Generic Formal Fibers of Excellent Rings Williams College SMALL REU 2013 Commutative

A tool for Bottleneck analysis and Performance Prediction for - PowerPoint PPT Presentation

Motivation A Statistical Approach: BlackForest Implementation and Case Studies Conclusion A tool for Bottleneck analysis and Performance Prediction for GPU-accelerated Applications S. Madougou, A. Varbanescu, C. de Laat and R. van Nieuwpoort

SynAthina Onli line Tools 1. . A mapping tool 2. A Community Tool 3. An Archive Tool 3. An

von Neumann's bottleneck von Neumann machine One control unit that connects memory and

Structured Prediction Introduction What is structured prediction? CS 6355: Structured Prediction

Branch Prediction Branch Prediction vs vs Execution Time Execution Time Prediction

High Performance Systems EuroMPI 2015 Objectives Yet another performance analysis tool

Wardrop Equilibria and Price of Stability in Bottleneck Games With Splittable Traffic Vladimir

The Information Bottleneck Method Naftali Tishby, Fernando C. Pereira, William Bialek Naftali

More demanding workload Design Goals ____ More demanding workload Bottleneck: Network stack in

Using lasso and related estimators for prediction Di Liu StataCorp July 12, 2019 1 / 20

Prediction and Odds 18.05 Spring 2017 Probabilistic Prediction Also called probabilistic

Using Stata 16s lasso features for prediction and inference Di Liu StataCorp 1 / 50

CS 104 Computer Organization and Design Branch Prediction CS104:Branch Prediction 1 Branch

Exercise 7a: Additional Intra Prediction Modes Implement Additional Block Prediction Modes Add

DeepLoc Data set statistics &amp; performance Protein prediction II Gregor Sturm, Johannes Rest,

Black Box Scanning Tool + White Box Testing Tool Toshis Black Box Scanning Tool Same

Compiler-assisted Performance Analysis Adam Nemet Apple anemet@apple.com Hotspot User

TheScienceofComputingand theEngineeringofSoftware TonyHoare

BLAG: Improving the Accuracy of Blacklists Sivaram Ramanathan 1 , Jelena Mirkovic 1 and Minlan Yu

Quite Simple Approaches for Authorship Attribution, Intrinsic Plagiarism Detection and Sexual

FlowCAP2 Results: Challenges 1, 2, and 3 Nima Aghaeepour CIHR/MSFHR Strategic Training Program in

Data Analytics A (Short) Tour Venkatesh-Prasad Ranganath http://about.me/rvprasad Click to edit

The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems Derrick

Weak Truth Table Degrees of Structures David Belanger 1 April 2012 at UWMadison EMAIL :

Local Generic Formal Fibers of Excellent Rings Williams College SMALL REU 2013 Commutative

DeepLoc Data set statistics & performance Protein prediction II Gregor Sturm, Johannes Rest,