Autotuning OpenCL Workgroup Size for Stencil Patterns, Chris Cummins (PowerPoint PPT Presentation)



SLIDE 1

Autotuning OpenCL Workgroup Size for Stencil Patterns

SLIDE 2

Chris Cummins

http://chriscummins.cc

SLIDE 3

Stencils & Workgroup size

SLIDE 4

Stencils & Workgroup size

SLIDE 5

input, stencil, output

SLIDE 6

input, stencil, output, border region, element

SLIDE 7

input, stencil, output; 10^6 border regions, 10^6 elements

SLIDE 8

input, stencil, output; 10^6 border regions, 10^6 elements. Multiple independent computations.

SLIDE 9

input, stencil, output; 10^6 border regions, 10^6 elements. Multiple (overlapping) memory accesses.

SLIDE 10

SLIDE 11

input, stencil, output, border region, element

SLIDE 12

input, stencil, output, border region, element, kernel

SLIDE 13

input, stencil, output, border region, element, work-item, kernel

SLIDE 14

Work-item Workgroup Matrix Tile

wc wr

Border region

SLIDE 15

Stencils & Workgroup size

SLIDE 16

Stencils & Workgroup size

SLIDE 17

Work-item, Workgroup, Matrix, Tile, wc, wr, Border region

SLIDE 18

Workgroup size affects:

  • mapping to SIMD hardware.
  • device occupancy.
  • local memory utilisation.

SLIDE 19

Pop Quiz!

SLIDE 20

What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on:

  • 1. AMD HD7990?
  • 2. Nvidia GTX Titan?
  • 3. Intel i7-3820?
SLIDE 21

What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on:

  • 1. AMD HD7990? 64 x 4
  • 2. Nvidia GTX Titan? 96 x 4
  • 3. Intel i7-3820? 40 x 24

SLIDE 22

What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running:

  • 1. Sobel edge detection?
  • 2. Heat equation?
  • 3. Game of life?
SLIDE 23

What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running:

  • 1. Sobel edge detection? 256 x 2
  • 2. Heat equation? 128 x 2
  • 3. Game of life? 32 x 6

SLIDE 24

What is the best workgroup size for …

  • 1. Intel i5-2430, game of life, 4096 x 4096?
  • 2. Nvidia GTX 690, threshold, 512 x 512?
  • 3. Intel i7-3820, NMS, 512 x 512?
SLIDE 25

What is the best workgroup size for …

  • 1. Intel i5-2430, game of life, 4096 x 4096? 196 x 20
  • 2. Nvidia GTX 690, threshold, 512 x 512? 32 x 4
  • 3. Intel i7-3820, NMS, 512 x 512? 88 x 8

SLIDE 26

One size does not fit all!

SLIDE 27

Choosing workgroup size depends on:

  • 1. Device
  • 2. Program
  • 3. Dataset
SLIDE 28

Optimisation space

(3D plot axes: rows, cols, performance)

SLIDE 29

SLIDE 30

Same stencil! Different device!

SLIDE 31

Same device! Different stencil!

SLIDE 32

SLIDE 33

Workgroup Size + Stencils

  • 1. Non-linear, non-continuous
  • 2. Depends on device, program, dataset
  • 3. Not all values are legal
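The legality constraint can be sketched in a few lines. This is a minimal illustration, not the deck's actual rule: the device limits used here (a maximum of 1024 work-items per workgroup, and per-dimension maxima) are assumed placeholder values standing in for what OpenCL reports via `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_WORK_ITEM_SIZES`.

```python
# Sketch of the workgroup-size legality constraint. The device limits are
# illustrative placeholders, not values queried from a real device.

def is_legal(wc, wr, max_wg_size=1024, max_dim=(1024, 1024)):
    """A workgroup size (wc, wr) is legal only if each dimension fits the
    device's per-dimension limit and the total number of work-items does
    not exceed the device's maximum workgroup size."""
    return (0 < wc <= max_dim[0] and
            0 < wr <= max_dim[1] and
            wc * wr <= max_wg_size)

# Enumerate the legal subset of a candidate space:
legal = [(wc, wr) for wc in range(2, 102, 2) for wr in range(2, 102, 2)
         if is_legal(wc, wr)]
```

This is why the optimisation space is non-continuous: the legal region is an irregular subset of the (wc, wr) grid, and it changes with the device and kernel.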

SLIDE 34

Autotuning

SLIDE 35

Set a workgroup size. Execute and time the program.

SLIDE 40

Set a workgroup size. Execute and time the program. Repeat.

… (continue until done / bored)

Pick the best one you tried

(iterative compilation)
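The loop above can be sketched as a random search. This is a self-contained illustration: `time_kernel` is a made-up analytic stand-in for actually executing and timing the OpenCL stencil, not a real measurement.

```python
import random

# Sketch of the iterative-compilation loop: try workgroup sizes, time each,
# keep the best. `time_kernel` is a hypothetical runtime model so the
# example runs without a GPU.

def time_kernel(wc, wr):
    # Made-up model: penalise workgroups far from 256 work-items, and
    # widths that are not SIMD-friendly multiples of 32.
    items = wc * wr
    return abs(items - 256) / 256.0 + 0.1 * (wc % 32 != 0)

def iterative_search(candidates, budget=50, seed=42):
    rng = random.Random(seed)
    best, best_time = None, float("inf")
    for _ in range(budget):               # continue until done / bored
        wc, wr = rng.choice(candidates)   # set a workgroup size
        t = time_kernel(wc, wr)           # execute and time the program
        if t < best_time:
            best, best_time = (wc, wr), t
    return best                           # pick the best one you tried

candidates = [(wc, wr) for wc in (8, 16, 32, 64) for wr in (2, 4, 8, 16)]
best = iterative_search(candidates)
```

Each iteration costs one full program execution, which is exactly the weakness the next slides point out.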

SLIDE 41

BAD!

SLIDE 42

BAD! Takes a long time.

SLIDE 43

BAD! Takes a long time. Must be repeated for every new “x”: device, program, dataset.

SLIDE 44

Let’s improve

SLIDE 45

Set a workgroup size. Execute and time the program. … (continue until done / bored). Pick the best one you tried.

SLIDE 46

Each trial yields just 1 data point.

SLIDE 47

Training: collect data points, extract “features”, train machine learning classifier. Deployment: extract “features”, input to classifier.
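The train/deploy pipeline above can be sketched with a toy model. Everything here is illustrative: the feature vectors (compute units, stencil border, dataset rows) and the 1-nearest-neighbour “classifier” are simple stand-ins for the real features and the Weka/random-forest models used later in the deck.

```python
# Sketch of the pipeline: data points -> features -> model -> prediction.
# A toy 1-nearest-neighbour lookup stands in for a trained classifier;
# the scenarios and labels are invented.

def extract_features(scenario):
    # scenario: (compute_units, stencil_border, dataset_rows) -- assumed.
    return tuple(float(x) for x in scenario)

def train(data_points):
    # data_points: list of (scenario, best_workgroup_size) observations.
    return [(extract_features(s), wg) for s, wg in data_points]

def predict(model, scenario):
    feat = extract_features(scenario)
    dist = lambda a: sum((x - y) ** 2 for x, y in zip(a, feat))
    # Return the label of the closest training point.
    return min(model, key=lambda m: dist(m[0]))[1]

training = [((16, 1, 512), (64, 4)),    # GPU-like device (hypothetical)
            ((4, 1, 512), (40, 24))]    # CPU-like device (hypothetical)
model = train(training)
```

The key difference from iterative compilation: once trained, a prediction costs one model lookup, not hundreds of program executions.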

SLIDE 48

GOOD!

SLIDE 49

GOOD! Can make predictions on unseen “x”: device, program, dataset.

SLIDE 50

GOOD! Can make predictions on unseen “x”: device, program, dataset.

Many unanswered questions …

SLIDE 51

Questions:

  • 1. What features do we need?
  • 2. What programs do we train on?
  • 3. How do we make predictions?
SLIDE 53

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 55

Device features: How many compute units? How much memory? Cache size? etc.

SLIDE 56

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 59

(diagram: stencil extents Sn, Ss, Se, Sw around element xi,j, reaching from xi-2,j-2 to xi+2,j+2)

Kernel features: How big is the border region? What shape is it? How many instructions? What type of instructions? etc.
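One of these kernel features can be computed directly from the stencil extents in the diagram. A minimal sketch, assuming the border region is the halo of out-of-grid elements the stencil reads around a rows × cols grid (the function name is mine, not from the deck):

```python
# Sketch: deriving the "border region" kernel feature from the stencil
# extents Sn, Ss, Se, Sw (north, south, east, west reaches).

def border_region_size(rows, cols, sn, ss, se, sw):
    """Number of halo elements: the padded grid (extended by the stencil
    reach on each side) minus the grid proper."""
    padded = (rows + sn + ss) * (cols + se + sw)
    return padded - rows * cols

# A stencil reaching 1 element in each direction on a 4096 x 4096 grid:
halo = border_region_size(4096, 4096, 1, 1, 1, 1)
```

The shape of the stencil (e.g. Sn != Se) is a separate feature, which is why the extents are kept as four numbers rather than one.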

SLIDE 60

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 63

Dataset features: How big is the data? What type is the input? What type is the output?

SLIDE 64

  • 1. Device
  • 2. Kernel
  • 3. Dataset
SLIDE 67

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on?
  • 3. How do we make predictions?
SLIDE 68

  • 1. Learn by example
  • 2. Learn by exploration

SLIDE 69

Learn by example: use benchmark programs and hope that they are representative.

SLIDE 71

Learn by exploration: create our own benchmarks and explore the (huge!) program space.
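Exploration can be sketched as sampling the benchmark parameter space. This is only an illustration of the idea, not the deck's actual generator: the parameter names and ranges below are arbitrary assumptions.

```python
import random

# Illustrative "learn by exploration": sample random points in a made-up
# stencil-benchmark parameter space instead of relying on a fixed suite.

def random_stencil_benchmark(rng):
    return {
        "north": rng.randint(0, 10),   # stencil extents (arbitrary ranges)
        "south": rng.randint(0, 10),
        "east": rng.randint(0, 10),
        "west": rng.randint(0, 10),
        "instructions": rng.randint(1, 500),
        "datatype": rng.choice(["int", "float", "double"]),
    }

rng = random.Random(0)
suite = [random_stencil_benchmark(rng) for _ in range(100)]
```

Each sampled benchmark, once executed across workgroup sizes, contributes training points that a fixed benchmark suite might never cover.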

SLIDE 73

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on? ✓
  • 3. How do we make predictions?
SLIDE 74

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 76

Predict category (optimal workgroup size) for scenario: 32 x 4, 128 x 2, 48 x 12, …

SLIDE 79

Predicted category may be incorrect!

SLIDE 80

Predicted category may be invalid!
SLIDE 81

Fallback Handlers:

  • 1. Baseline: “pick something we know is safe”
  • 2. Random: “pick a random value”
  • 3. Nearest Neighbour: “pick the closest value we think will work”
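The nearest-neighbour fallback can be sketched directly: when the classifier's prediction is invalid for the current device/kernel, pick the closest legal workgroup size. The legality rule (product of dimensions at most 1024) and the candidate grid are illustrative assumptions.

```python
# Sketch of the NearestNeighbour fallback handler: replace an invalid
# predicted workgroup size with the nearest legal one (Euclidean distance).
# The legality limit of 1024 work-items is an assumed placeholder.

def nearest_legal(predicted, legal_sizes):
    pc, pr = predicted
    return min(legal_sizes,
               key=lambda s: (s[0] - pc) ** 2 + (s[1] - pr) ** 2)

legal = [(wc, wr) for wc in range(4, 129, 4) for wr in range(2, 33, 2)
         if wc * wr <= 1024]

# A prediction of 200 x 4 exceeds the candidate space; fall back:
choice = nearest_legal((200, 4), legal)
```

Compared with the Baseline handler, this preserves as much of the classifier's (presumably informative) prediction as legality allows.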

SLIDE 85

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 88

Predict the runtime of the program for each candidate workgroup size. Search for the lowest predicted runtime.
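This predict-then-search scheme can be sketched as an argmin over candidates. The "regressor" here is a hard-coded toy function standing in for a trained model (the deck uses a random forest regressor); its shape is entirely made up. The speedup regressor of the later slides works the same way, with argmax over predicted speedups instead.

```python
# Sketch of runtime-regression autotuning: predict a runtime for every
# candidate workgroup size, then pick the minimum. The toy "regressor"
# below stands in for a trained model.

def predicted_runtime(features, wc, wr):
    # Hypothetical model: larger workgroups amortise overhead; widths that
    # are not multiples of 16 pay a penalty. Purely illustrative.
    items = wc * wr
    return 1.0 / items + (0.5 if wc % 16 else 0.0)

def tune(features, candidates):
    # Search the candidate space for the lowest predicted runtime.
    return min(candidates, key=lambda s: predicted_runtime(features, *s))

candidates = [(wc, wr) for wc in (8, 16, 32, 64) for wr in (2, 4, 8, 16)]
best = tune({}, candidates)
```

Unlike the classifier, the regressor can rank every candidate, at the cost of one model evaluation per candidate — the source of the higher prediction time reported later.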

SLIDE 89

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 92

Predict the speedup of workgroup size A over B for the program. Search for the highest predicted speedup.

SLIDE 93

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 96

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on? ✓
  • 3. How do we make predictions? ✓
SLIDE 97

Experiment

SLIDE 98

Implementation

  • Modified SkelCL stencil pattern
  • Python server process for autotuning
  • 5 classifiers, random forest regressor

SLIDE 99

Experimental Setup

  • 6 stencil benchmarks + synthetic
  • 7 different GPUs & CPUs
  • 4 dataset sizes
  • Exhaustive search of the workgroup size space for each

SLIDE 100

Results

SLIDE 101

Optimisation space (3D plot axes: rows, cols, optimality)

SLIDE 102

(heatmap: oracle frequency (log) over workgroup dimensions wc × wr)

SLIDE 103

32% of optimal workgroup sizes are unique

SLIDE 104

32% of optimal workgroup sizes are unique; the most common is optimal only 15% of the time

SLIDE 105

(plot: speedup (log) per scenario, sorted by descending max speedup; series: Max, w(4×4), w(32×4))

SLIDE 108

Annotations: upper bound (average 15.14x), static tuning, human expert

SLIDE 109

Autotuning: Classification

SLIDE 110

(plots: accuracy, illegal/refused predictions, speedup and performance for ZeroR, NaiveBayes, SMO, SimpleLogistic, J48 and RandomForest, each with Baseline, Random and NearestNeighbour fallback handlers)

SLIDE 112

26% optimal; 90% optimal

SLIDE 113

Nearest neighbour fallback is best

SLIDE 114

(plot: classification time (ms) per classifier)

SLIDE 115

2.5ms RTT

SLIDE 116

Autotuning: Regression

SLIDE 117

(plots: speedup and performance of runtime regression vs speedup regression, across Kernel, 10-fold, Device, Synthetic and Dataset splits)

SLIDE 118

Speedup regression achieves the highest speedup

SLIDE 119

(plots: classification time (ms) and accuracy, runtime vs speedup regression)

SLIDE 120

40x slower than J48
SLIDE 121

(plot: speedup over human expert for J48, NaiveBayes, RandomForest, SimpleLogistic, SMO, Runtime Regression and Speedup Regression, ignoring cases where the human expert is invalid)

SLIDE 122

The approaches appear similar …

SLIDE 123

… but have very different prediction characteristics (predicted workgroup sizes over rows × columns)

SLIDE 124

Conclusions

SLIDE 125

  • Average 15x speedup between best and worst workgroup size
  • The best workgroup size depends on device, kernel, and dataset
  • Static tuning achieves only 26% of optimal performance

SLIDE 126

  • We present three methodologies for autotuning OpenCL workgroup size
  • There are trade-offs between prediction cost and training cost
  • We achieve an average 1.22x speedup over a human expert, with increased reliability

SLIDE 127

Details in the paper!

SLIDE 128

Autotuning OpenCL Workgroup Size for Stencil Patterns

http://chriscummins.cc