

SLIDE 1

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao

*University of Edinburgh (UK), Lawrence Livermore National Laboratory

MCHPC'18: Workshop on Memory Centric High Performance Computing

LLNL-PRES-761162. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

SLIDE 2

Complex Memory Hierarchy on GPUs

§ GPUs can greatly improve the performance of HPC applications, but they can be difficult to optimize for due to their complex memory hierarchy
§ Memory hierarchies can change drastically from generation to generation
§ Code optimized for one platform may not retain optimal performance when ported to other platforms

SLIDE 3

Performance can vary widely depending on data placement as well as platform.

[Figure: speedup of each memory-type variant, plotted by "Memory Type [Platform]" for Kepler, Maxwell, Pascal, and Volta. Left panel: Matrix-Matrix Multiplication (speedups up to roughly 4x). Right panel: Sparse Matrix-Vector Multiplication (speedups between roughly 0.0x and 1.5x).]

SLIDE 4

Challenges

§ Different memory variants (global, constant, texture, shared) can have a significant impact on program performance (see the sketch after this list)
§ But identifying the best-performing variant is a non-obvious and complex decision to make
§ Given a default global variant, can the best-performing memory variant be determined automatically?
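To make these variants concrete, here is a minimal CUDA sketch (not taken from the slides) of one hypothetical 1-D stencil kernel written three ways: coefficients read from global memory, coefficients in __constant__ memory, and the input tile staged in shared memory. A texture/read-only variant would additionally route loads through __ldg() or a texture object. All names, sizes, and the kernel itself are illustrative assumptions, not the paper's benchmarks.

    // Illustrative only: a 1-D stencil whose memory placement varies.
    // RADIUS, BLOCK, and the kernel names are hypothetical.
    #define RADIUS 3
    #define BLOCK  256

    __constant__ float c_coeff[2 * RADIUS + 1];  // constant-memory copy

    // Global-memory variant: coefficient reads travel through L2/DRAM.
    __global__ void stencil_global(const float *in, float *out,
                                   const float *coeff, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += coeff[k + RADIUS] * in[i + k];
      out[i] = acc;
    }

    // Constant-memory variant: every thread in a warp reads the same
    // coefficient address, the best case for the constant cache.
    __global__ void stencil_constant(const float *in, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += c_coeff[k + RADIUS] * in[i + k];
      out[i] = acc;
    }

    // Shared-memory variant: each block stages its input tile plus halo
    // once, then re-reads it on chip. Launch with BLOCK threads per
    // block; assumes n is a multiple of BLOCK.
    __global__ void stencil_shared(const float *in, float *out,
                                   const float *coeff, int n) {
      __shared__ float tile[BLOCK + 2 * RADIUS];
      int i = blockIdx.x * BLOCK + threadIdx.x;
      int t = threadIdx.x + RADIUS;
      tile[t] = in[i];
      if (threadIdx.x < RADIUS) {      // load left and right halos
        tile[t - RADIUS] = (i >= RADIUS) ? in[i - RADIUS] : 0.0f;
        tile[t + BLOCK]  = (i + BLOCK < n) ? in[i + BLOCK] : 0.0f;
      }
      __syncthreads();
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += coeff[k + RADIUS] * tile[t + k];
      out[i] = acc;
    }

Which of these wins depends on the access pattern, the data size, and the GPU generation, which is exactly why the slides frame placement as a prediction problem rather than a fixed rule.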

SLIDE 5

Proposed Solution

§ Use machine learning to develop a predictive model that determines the best data placement for a given application on a particular platform
§ Use the model to predict the best placement at run time
§ The approach involves three stages:
— offline training
— feature and model selection
— online inference

SLIDE 6

Approach: Offline Training - Data Collection

[Diagram: the overall approach. Offline training: representative kernels and their four program variants (global, shared, constant, texture) go through training data collection using nvprof and hardware info; feature extraction and labelling produce the training data used to build the classifier. Online inference: feature extraction using CUPTI feeds the classifier, which outputs the best variant.]

§ Data collection draws on nvprof metrics and events
SLIDE 7

Approach: Model Building - Determine the Best Version, Features, and Model

[Diagram: the same pipeline as Slide 6, here highlighting the model-building stage: feature extraction and labelling of the training data, followed by classifier construction.]

SLIDE 8

Approach: Online Inference - Use the Model to Determine the Best Placement at Run Time

[Diagram: the same pipeline as Slide 6, here highlighting online inference: feature extraction using CUPTI feeds the classifier, which outputs the best variant.]

SLIDE 9

Methodology

In order to build the model:

§ 4 different generations of NVIDIA GPUs were used: Kepler, Maxwell, Pascal, and Volta
§ 8 programs × 3 input data sizes × 3 thread block sizes × 4 memory variants were run, i.e. 288 configurations per platform
§ The programs included MD, SPMV, CFD, MM, ConvolutionSeparable, ParticleFilter, etc.

SLIDE 10

Offline Training

§ Metric and event data from nvprof for the global variant, along with hardware data, were collected
§ The best-performing variant (the class label) for each version run was appended
§ Benchmarks were run 10 times on each platform, with 5 initial iterations to warm up the GPU (a timing sketch follows below)
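As an illustration of this measurement protocol (a minimal sketch under assumptions, not the paper's harness), CUDA events can time each pre-built variant with 5 warm-up launches followed by 10 measured launches; the launch callback is a hypothetical stand-in for launching one memory-placement variant.

    #include <cuda_runtime.h>

    // Mean kernel time over `runs` launches after `warmups` warm-up
    // launches. `launch` wraps one launch of a placement variant.
    static float time_variant(void (*launch)(),
                              int warmups = 5, int runs = 10) {
      for (int i = 0; i < warmups; ++i) launch();   // warm up the GPU
      cudaEvent_t beg, end;
      cudaEventCreate(&beg);
      cudaEventCreate(&end);
      cudaEventRecord(beg);
      for (int i = 0; i < runs; ++i) launch();
      cudaEventRecord(end);
      cudaEventSynchronize(end);
      float ms = 0.0f;
      cudaEventElapsedTime(&ms, beg, end);          // total elapsed ms
      cudaEventDestroy(beg);
      cudaEventDestroy(end);
      return ms / runs;
    }

    // The fastest of the four variants becomes the class label that is
    // appended to the nvprof-derived feature vector for this run.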

SLIDE 11

Feature Selection

§ The number of features was narrowed down from 241 to 16 using a correlation-based feature selection (CFS) algorithm
§ A partial list:

Feature Name: Meaning
achieved_occupancy: ratio of average active warps to the maximum number of warps
l2_read_transactions, l2_write_transactions: memory read/write transactions at the L2 cache
gld_throughput: global memory load throughput
warp_execution_efficiency: ratio of average active threads to the maximum number of threads

SLIDE 12

Model Selection

§ 10-fold cross-validation was used during evaluation
§ Overall, decision-tree classifiers showed great promise (>95% prediction accuracy)

Classifier: Prediction Accuracy (%)
RandomForest: 95.7
LogitBoost: 95.5
IterativeClassifierOptimizer: 95.5
SimpleLogistic: 95.4
JRip: 95.0

SLIDE 13

Runtime Prediction

§ The classifier JRip was selected from the group of top five performing classifier models
§ JRip is a propositional rule learner, which results in a decision tree
§ The model then reads in input from CUPTI calls (the API underlying nvprof), which can access hardware counters in real time, and outputs its class (a sketch follows below)
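As a rough sketch of what run-time feature extraction could look like (an assumption-laden reconstruction, not the paper's implementation): the CUPTI event API can read a counter around a launch of the default global variant, and the learned rules then map feature values to a class. The event name, the threshold, and the one-rule "classifier" below are placeholders; a real deployment would use CUPTI callbacks and one collection pass per metric or event, which is exactly the overhead noted later under Limitations.

    #include <cstdio>
    #include <cuda.h>
    #include <cupti.h>   // link with -lcupti

    __global__ void default_global_variant() {}  // placeholder kernel

    int main() {
      cudaFree(0);                        // force CUDA context creation
      CUcontext ctx;  cuCtxGetCurrent(&ctx);
      CUdevice dev;   cuDeviceGet(&dev, 0);

      // Collect the counter per kernel launch.
      cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);

      CUpti_EventID ev;
      // Event names vary by GPU generation; "inst_executed" is common.
      cuptiEventGetIdFromName(dev, "inst_executed", &ev);

      CUpti_EventGroup grp;
      cuptiEventGroupCreate(ctx, &grp, 0);
      cuptiEventGroupAddEvent(grp, ev);
      cuptiEventGroupEnable(grp);

      default_global_variant<<<1, 32>>>();
      cudaDeviceSynchronize();

      uint64_t value = 0;
      size_t bytes = sizeof(value);
      cuptiEventGroupReadEvent(grp, CUPTI_EVENT_READ_FLAG_NONE, ev,
                               &bytes, &value);
      cuptiEventGroupDisable(grp);
      cuptiEventGroupDestroy(grp);

      // Stand-in for the learned JRip rules: this threshold is invented.
      const char *best = (value > 1000000) ? "shared" : "global";
      printf("predicted best variant: %s\n", best);
      return 0;
    }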

SLIDE 14

Preliminary Results

§ Results from this initial exploration show that there is great potential for predictive modeling for data placement on GPUs
§ Overall, 95% accuracy is achievable, and it is higher for cases where global or texture memory is the best performer

[Figure: percentage of correct predictions ("% Predicted", 25-100) for each best-performing memory type: constant, global, shared, texture.]

SLIDE 15

Runtime Validation

§ The JRip model was tested on a new benchmark: an acoustic application
§ The model correctly predicted the best-performing version on two platforms

SLIDE 16

Limitations

§ Currently, all versions need to be pre-compiled for run-time prediction; ideally, the model would be built into a compiler
§ CUPTI calls are slow and require as many iterations as there are metrics and events to collect
§ This would be acceptable for benchmarks with many iterations, but other kinds of applications would need a workaround

SLIDE 17

Conclusion

§ Machine learning has shown great potential for data placement prediction on a range of applications
§ More work needs to be done to acquire hardware counters from applications in a timely manner
§ The approach could be reused for other optimizations, such as data layouts
