

SLIDE 1

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao

*University of Edinburgh (UK), Lawrence Livermore National Laboratory

MCHPC'18: Workshop on Memory Centric High Performance Computing

LLNL-PRES-761162. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

SLIDE 2

Complex Memory Hierarchy on GPUs

§ GPUs can greatly improve the performance of HPC applications, but they can be difficult to optimize for due to their complex memory hierarchy
§ Memory hierarchies can change drastically from generation to generation
§ Code optimized for one platform may not retain optimal performance when ported to other platforms

SLIDE 3

Performance can vary widely depending on data placement as well as platform.

[Figure: speedup of each memory-type variant, plotted by "Memory Type [Platform]" for Kepler, Maxwell, Pascal, and Volta. Left panel: Matrix-Matrix Multiplication (speedups up to roughly 4x). Right panel: Sparse Matrix-Vector Multiplication (speedups between roughly 0.0x and 1.5x).]

SLIDE 4

Challenges

§ Different memory variants (global, constant, texture, shared) can have a significant impact on program performance (see the sketch after this list)
§ But identifying the best-performing variant is a non-obvious and complex decision to make
§ Given a default global variant, can the best-performing memory variant be determined automatically?
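To make these variants concrete, here is a minimal CUDA sketch (not taken from the slides) of one hypothetical 1-D stencil kernel written three ways: coefficients read from global memory, coefficients in __constant__ memory, and the input tile staged in shared memory. A texture/read-only variant would additionally route loads through __ldg() or a texture object. All names, sizes, and the kernel itself are illustrative assumptions, not the paper's benchmarks.

    // Illustrative only: a 1-D stencil whose memory placement varies.
    // RADIUS, BLOCK, and the kernel names are hypothetical.
    #define RADIUS 3
    #define BLOCK  256

    __constant__ float c_coeff[2 * RADIUS + 1];  // constant-memory copy

    // Global-memory variant: coefficient reads travel through L2/DRAM.
    __global__ void stencil_global(const float *in, float *out,
                                   const float *coeff, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += coeff[k + RADIUS] * in[i + k];
      out[i] = acc;
    }

    // Constant-memory variant: every thread in a warp reads the same
    // coefficient address, the best case for the constant cache.
    __global__ void stencil_constant(const float *in, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += c_coeff[k + RADIUS] * in[i + k];
      out[i] = acc;
    }

    // Shared-memory variant: each block stages its input tile plus halo
    // once, then re-reads it on chip. Launch with BLOCK threads per
    // block; assumes n is a multiple of BLOCK.
    __global__ void stencil_shared(const float *in, float *out,
                                   const float *coeff, int n) {
      __shared__ float tile[BLOCK + 2 * RADIUS];
      int i = blockIdx.x * BLOCK + threadIdx.x;
      int t = threadIdx.x + RADIUS;
      tile[t] = in[i];
      if (threadIdx.x < RADIUS) {      // load left and right halos
        tile[t - RADIUS] = (i >= RADIUS) ? in[i - RADIUS] : 0.0f;
        tile[t + BLOCK]  = (i + BLOCK < n) ? in[i + BLOCK] : 0.0f;
      }
      __syncthreads();
      if (i < RADIUS || i >= n - RADIUS) return;
      float acc = 0.0f;
      for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += coeff[k + RADIUS] * tile[t + k];
      out[i] = acc;
    }

Which of these wins depends on the access pattern, the data size, and the GPU generation, which is exactly why the slides frame placement as a prediction problem rather than a fixed rule.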

SLIDE 5

Proposed Solution

§ Use machine learning to develop a predictive model that determines the best data placement for a given application on a particular platform
§ Use the model to predict the best placement at run time
§ The approach involves three stages:
— offline training
— feature and model selection
— online inference

SLIDE 6

Approach: Offline Training - Data Collection

[Diagram: the overall approach. Offline training: representative kernels and their four program variants (global, shared, constant, texture) go through training data collection using nvprof and hardware info; feature extraction and labelling produce the training data used to build the classifier. Online inference: feature extraction using CUPTI feeds the classifier, which outputs the best variant.]

§ Data collection draws on nvprof metrics and events
SLIDE 7

Approach: Model Building - Determine the Best Version, Features, and Model

[Diagram: the same pipeline as Slide 6, here highlighting the model-building stage: feature extraction and labelling of the training data, followed by classifier construction.]

SLIDE 8

Approach: Online Inference - Use the Model to Determine the Best Placement at Run Time

[Diagram: the same pipeline as Slide 6, here highlighting online inference: feature extraction using CUPTI feeds the classifier, which outputs the best variant.]

SLIDE 9

Methodology

In order to build the model:

§ 4 different generations of NVIDIA GPUs were used: Kepler, Maxwell, Pascal, and Volta
§ 8 programs × 3 input data sizes × 3 thread block sizes × 4 memory variants were run, i.e. 288 configurations per platform
§ The programs included MD, SPMV, CFD, MM, ConvolutionSeparable, ParticleFilter, etc.

SLIDE 10

Offline Training

§ Metric and event data from nvprof for the global variant, along with hardware data, were collected
§ The best-performing variant (the class label) for each version run was appended
§ Benchmarks were run 10 times on each platform, with 5 initial iterations to warm up the GPU (a timing sketch follows below)
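As an illustration of this measurement protocol (a minimal sketch under assumptions, not the paper's harness), CUDA events can time each pre-built variant with 5 warm-up launches followed by 10 measured launches; the launch callback is a hypothetical stand-in for launching one memory-placement variant.

    #include <cuda_runtime.h>

    // Mean kernel time over `runs` launches after `warmups` warm-up
    // launches. `launch` wraps one launch of a placement variant.
    static float time_variant(void (*launch)(),
                              int warmups = 5, int runs = 10) {
      for (int i = 0; i < warmups; ++i) launch();   // warm up the GPU
      cudaEvent_t beg, end;
      cudaEventCreate(&beg);
      cudaEventCreate(&end);
      cudaEventRecord(beg);
      for (int i = 0; i < runs; ++i) launch();
      cudaEventRecord(end);
      cudaEventSynchronize(end);
      float ms = 0.0f;
      cudaEventElapsedTime(&ms, beg, end);          // total elapsed ms
      cudaEventDestroy(beg);
      cudaEventDestroy(end);
      return ms / runs;
    }

    // The fastest of the four variants becomes the class label that is
    // appended to the nvprof-derived feature vector for this run.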

SLIDE 11

Feature Selection

§ The number of features was narrowed down from 241 to 16 using a correlation-based feature selection (CFS) algorithm
§ A partial list:

Feature Name: Meaning
achieved_occupancy: ratio of average active warps to the maximum number of warps
l2_read_transactions, l2_write_transactions: memory read/write transactions at the L2 cache
gld_throughput: global memory load throughput
warp_execution_efficiency: ratio of average active threads to the maximum number of threads

SLIDE 12

Model Selection

§ 10-fold cross-validation was used during evaluation
§ Overall, decision-tree classifiers showed great promise (>95% prediction accuracy)

Classifier: Prediction Accuracy (%)
RandomForest: 95.7
LogitBoost: 95.5
IterativeClassifierOptimizer: 95.5
SimpleLogistic: 95.4
JRip: 95.0

SLIDE 13

Runtime Prediction

§ The classifier JRip was selected from the group of top five performing classifier models
§ JRip is a propositional rule learner, which results in a decision tree
§ The model then reads in input from CUPTI calls (the API underlying nvprof), which can access hardware counters in real time, and outputs its class (a sketch follows below)
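As a rough sketch of what run-time feature extraction could look like (an assumption-laden reconstruction, not the paper's implementation): the CUPTI event API can read a counter around a launch of the default global variant, and the learned rules then map feature values to a class. The event name, the threshold, and the one-rule "classifier" below are placeholders; a real deployment would use CUPTI callbacks and one collection pass per metric or event, which is exactly the overhead noted later under Limitations.

    #include <cstdio>
    #include <cuda.h>
    #include <cupti.h>   // link with -lcupti

    __global__ void default_global_variant() {}  // placeholder kernel

    int main() {
      cudaFree(0);                        // force CUDA context creation
      CUcontext ctx;  cuCtxGetCurrent(&ctx);
      CUdevice dev;   cuDeviceGet(&dev, 0);

      // Collect the counter per kernel launch.
      cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);

      CUpti_EventID ev;
      // Event names vary by GPU generation; "inst_executed" is common.
      cuptiEventGetIdFromName(dev, "inst_executed", &ev);

      CUpti_EventGroup grp;
      cuptiEventGroupCreate(ctx, &grp, 0);
      cuptiEventGroupAddEvent(grp, ev);
      cuptiEventGroupEnable(grp);

      default_global_variant<<<1, 32>>>();
      cudaDeviceSynchronize();

      uint64_t value = 0;
      size_t bytes = sizeof(value);
      cuptiEventGroupReadEvent(grp, CUPTI_EVENT_READ_FLAG_NONE, ev,
                               &bytes, &value);
      cuptiEventGroupDisable(grp);
      cuptiEventGroupDestroy(grp);

      // Stand-in for the learned JRip rules: this threshold is invented.
      const char *best = (value > 1000000) ? "shared" : "global";
      printf("predicted best variant: %s\n", best);
      return 0;
    }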

SLIDE 14

Preliminary Results

§ Results from this initial exploration show that there is great potential for predictive modeling for data placement on GPUs
§ Overall, 95% accuracy is achievable, and it is higher for cases where global or texture memory is the best performer

[Figure: percentage of correct predictions ("% Predicted", 25-100) for each best-performing memory type: constant, global, shared, texture.]

SLIDE 15

Runtime Validation

§ The JRip model was tested on a new benchmark: an acoustic application
§ The model correctly predicted the best-performing version on two platforms

SLIDE 16

Limitations

§ Currently, all versions need to be pre-compiled for run-time prediction; ideally, the model would be built into a compiler
§ CUPTI calls are slow and require as many iterations as there are metrics and events to collect
§ This would be acceptable for benchmarks with many iterations, but other kinds of applications would need a workaround

SLIDE 17

Conclusion

§ Machine learning has shown great potential for data placement prediction on a range of applications
§ More work needs to be done to acquire hardware counters from applications in a timely manner
§ The approach could be reused for other optimizations, such as data layouts
