Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling - PowerPoint PPT Presentation


  1. Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling. Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao. *University of Edinburgh (UK), Lawrence Livermore National Laboratory. MCHPC'18: Workshop on Memory Centric High Performance Computing. LLNL-PRES-761162. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Complex Memory Hierarchy on GPUs § GPUs can greatly improve the performance of HPC applications, but they can be difficult to optimize for due to their complex memory hierarchy § Memory hierarchies can change drastically from generation to generation § Codes optimized for one platform may not retain optimal performance when ported to other platforms

  3. Performance can vary widely depending on data placement as well as platform. [Figure: speedup of memory-placement variants for Matrix-Matrix Multiplication (top) and Sparse Matrix-Vector Multiplication (bottom), one panel each for Kepler, Maxwell, Pascal, and Volta; x-axis: memory-type variant per platform, y-axis: speedup.]

  4. Challenges § Different memory variants (global/constant/texture/shared) can have a significant impact on program performance § But identifying the best-performing variant is a non-obvious and complex decision to make § Given a default global variant, can the best-performing memory variant be determined automatically? (A sketch of what these variants look like in code follows below.)
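For readers unfamiliar with the variant space, the following minimal CUDA sketch (illustrative, not from the talk) shows how the same simple scaling kernel can be written against three of the four memory spaces; a texture variant would additionally bind the input to a texture object and is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define N 1024

// Coefficients placed in constant memory: cached and broadcast-friendly
// when all threads in a warp read the same element. Filled from the host
// with cudaMemcpyToSymbol(c_coeff, ...).
__constant__ float c_coeff[4];

// Baseline "global" variant: every operand comes from global memory.
__global__ void scale_global(const float *in, const float *coeff, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[i] * coeff[i % 4];
}

// "Constant" variant: identical logic, but the coefficients are read
// through the constant cache instead of global memory.
__global__ void scale_constant(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[i] * c_coeff[i % 4];
}

// "Shared" variant: stage the coefficients in shared memory once per block.
__global__ void scale_shared(const float *in, const float *coeff, float *out)
{
    __shared__ float s_coeff[4];
    if (threadIdx.x < 4)
        s_coeff[threadIdx.x] = coeff[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[i] * s_coeff[i % 4];
}
```

Which of these runs fastest depends on access pattern, input size, and GPU generation, which is exactly why the choice is hard to make by hand.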

  5. Proposed Solution § Use machine learning to develop a predictive model that determines the best data placement for a given application on a particular platform § Apply the model at run time to predict the best placement § Involves three stages: — offline training — feature and model selection — online inference

  6. Approach: Offline Training - Data collection. [Diagram: representative kernels yield program variants (global, shared, constant, texture); training data is collected using nvprof and hardware info; features are extracted from nvprof metrics and events via CUPTI and labelled; the resulting training data feeds a classifier, which outputs the best variant.]

  7. Approach: Model Building - determine the best version, features, and model. [Same pipeline diagram as slide 6, with the model-building stage highlighted.]

  8. Approach: Online Inference - use the model to determine the best placement at run time. [Same pipeline diagram as slide 6, with the online-inference stage highlighted. A hypothetical sketch of this stage follows below.]
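To make the online-inference stage concrete, here is a minimal CUDA/C++ sketch, not from the talk: `Features`, `predict_variant`, the thresholds, and the kernel names are all illustrative stand-ins for the CUPTI-derived feature vector and the trained model.

```cuda
// Placeholder for the feature vector the paper derives from CUPTI
// metrics/events (achieved occupancy, L2 transactions, ...).
struct Features {
    float achieved_occupancy;
    float gld_throughput;        // global load throughput, GB/s
    float l2_read_transactions;
};

enum Variant { GLOBAL, SHARED, CONSTANT, TEXTURE };

// Stand-in for the trained classifier; in the paper the decision logic
// is learned offline (a JRip rule set), not hand-written like this.
Variant predict_variant(const Features &f)
{
    if (f.achieved_occupancy < 0.3f) return SHARED;   // illustrative rule
    if (f.gld_throughput > 200.0f)   return TEXTURE;  // illustrative rule
    return GLOBAL;
}

// Pre-compiled kernel variants (see the limitation on slide 16).
__global__ void kernel_global(float *d)   { /* ... */ }
__global__ void kernel_shared(float *d)   { /* ... */ }
__global__ void kernel_constant(float *d) { /* ... */ }
__global__ void kernel_texture(float *d)  { /* ... */ }

void launch_best(const Features &f, float *d)
{
    switch (predict_variant(f)) {
    case GLOBAL:   kernel_global<<<128, 256>>>(d);   break;
    case SHARED:   kernel_shared<<<128, 256>>>(d);   break;
    case CONSTANT: kernel_constant<<<128, 256>>>(d); break;
    case TEXTURE:  kernel_texture<<<128, 256>>>(d);  break;
    }
}
```

In the paper's setting the branch logic would come from the learned rule set rather than hand-written thresholds, and every variant must already be compiled into the binary for the dispatch to work.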

  9. Methodology In order to build the model: § 4 different generations of NVIDIA GPUs were used: — Kepler — Maxwell — Pascal — Volta § 8 programs × 3 input data sizes × 3 thread block sizes × 4 variants (MD, SPMV, CFD, MM, ConvolutionSeparable, ParticleFilter, etc.)

  10. Offline Training § Metric and event data from nvprof for the global variant, along with hardware data, were collected § The best-performing variant (the class label) for each version run was appended § Benchmarks were run 10 times on each platform, with 5 initial iterations to warm up the GPU

  11. Feature Selection § The number of features was narrowed down from 241 to 16 using the correlation-based feature selection (CFS) algorithm § A partial list:
      — achieved_occupancy: ratio of average active warps to the maximum number of warps
      — l2_read_transactions, l2_write_transactions: memory read/write transactions at the L2 cache
      — gld_throughput: global memory load throughput
      — warp_execution_efficiency: ratio of average active threads to the maximum number of threads
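The slides name CFS but do not define it; for reference, Hall's correlation-based feature selection scores a candidate subset S of k features by the merit below, where r_cf is the mean feature-class correlation and r_ff the mean feature-feature intercorrelation (this formulation is from the CFS literature, not the slides):

```latex
M_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1)\, \overline{r_{ff}}}}
```

Subsets whose features correlate strongly with the class but weakly with each other score highest, which is consistent with the small, non-redundant feature list above.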

  12. Model Selection § Used 10-fold cross validation during evaluation § Overall, decision-tree and rule-based classifiers showed great promise (>95% prediction accuracy):
      — RandomForest: 95.7%
      — LogitBoost: 95.5%
      — IterativeClassifierOptimizer: 95.5%
      — SimpleLogistic: 95.4%
      — JRip: 95.0%

  13. Runtime Prediction § The classifier JRip was selected from the group of top five performing classifier models § JRip is a propositional rule learner, which produces a compact set of if-then decision rules § At run time, the model reads its input from CUPTI calls (the profiling API underlying nvprof), which can access hardware counters in real time, and outputs the predicted class
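As a rough illustration of the run-time collection path, the sketch below reads a single hardware counter through CUPTI's event API; the event name `inst_executed` and the continuous collection mode are assumptions chosen for brevity, and collecting a full metric set may require replaying the kernel once per event group, which is the overhead discussed on slide 16.

```cuda
#include <cuda.h>
#include <cupti.h>
#include <cstdio>
#include <cstdint>

// Reduce error handling to a macro for readability.
#define CHECK(call) do {                                      \
    CUptiResult s = (call);                                   \
    if (s != CUPTI_SUCCESS) {                                 \
        const char *msg; cuptiGetResultString(s, &msg);       \
        fprintf(stderr, "CUPTI error: %s\n", msg);            \
        return 1;                                             \
    }                                                         \
} while (0)

int main()
{
    CUcontext ctx; CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Collect continuously instead of per-kernel to keep the sketch short.
    CHECK(cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS));

    CUpti_EventGroup group;
    CUpti_EventID eventId;
    CHECK(cuptiEventGroupCreate(ctx, &group, 0));
    // Event names vary by GPU generation; "inst_executed" is widely available.
    CHECK(cuptiEventGetIdFromName(dev, "inst_executed", &eventId));
    CHECK(cuptiEventGroupAddEvent(group, eventId));
    CHECK(cuptiEventGroupEnable(group));

    // ... launch the instrumented kernel here ...

    uint64_t value = 0;
    size_t bytes = sizeof(value);
    CHECK(cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                   eventId, &bytes, &value));
    printf("inst_executed = %llu\n", (unsigned long long)value);

    CHECK(cuptiEventGroupDisable(group));
    CHECK(cuptiEventGroupDestroy(group));
    cuCtxDestroy(ctx);
    return 0;
}
```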

  14. Preliminary Results [Figure: percentage of correct predictions per memory type (global, shared, constant, texture).] • Results from this initial exploration show that there is great potential for predictive modeling for data placement on GPUs • Overall, 95% accuracy is achievable; accuracy is higher for the cases where global or texture memory is the best performer

  15. Runtime Validation § The JRip model was tested on a new benchmark, an acoustic application § The model correctly predicted the best-performing version on two platforms

  16. Limitations § Currently, all versions need to be pre-compiled for run-time prediction; ideally, the model would be built into a compiler § CUPTI calls are slow and require as many iterations as there are metrics and events to collect § This would be acceptable for benchmarks with many iterations, but for other kinds of applications a workaround would be needed

  17. Conclusion § Machine learning has shown great potential for data placement prediction across a range of applications § More work needs to be done to acquire hardware counters from applications in a timely manner § The approach could be reused for other optimizations, such as data layouts

