Guided Profiling for Auto-Tuning Array Layouts on GPUs
Nicolas Weber, Sandra C. Amend and Michael Goesele, TU Darmstadt
Motivation
- Memory access is one of the most important performance factors in CUDA applications
  - CUDA Programming Guide: one of the three basic optimization strategies is to "Optimize memory usage to achieve maximum memory throughput"
- Performance differences of up to an order of magnitude between the best and worst implementation
- Experience alone does not guarantee finding the optimal configuration
Motivation
- Tedious to optimize in large GPU applications
  - Layouts: Array of Structs (AoS), Structure of Arrays (SoA), AoSoA
  - Transpositions of multi-dimensional arrays
  - Size of L1 cache vs. shared memory
  - Memory placement: global, texture, shared, local and constant memory
- Changing GPU architectures require re-optimization
  - The memory hierarchy has changed in every architecture
- Automated optimization for most GPUs and algorithms
  - We develop an open-source auto-tuner that automatically optimizes array accesses in CUDA applications (with minimal programming overhead)
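The three layouts named above can be sketched in a few lines of C++; the element type, array size and tile width here are illustrative, not the tuner's actual types:

```cpp
#include <array>
#include <cstddef>

// Array of Structs (AoS): all fields of one element are contiguous.
struct ParticleAoS { float x, y, z; };
using AoS = std::array<ParticleAoS, 256>;

// Structure of Arrays (SoA): each field is a contiguous array, so
// neighboring GPU threads reading the same field touch consecutive
// addresses (coalesced accesses).
struct SoA { std::array<float, 256> x, y, z; };

// AoSoA: small SoA tiles (here 32 elements, one warp) stored in an
// outer array -- a compromise between the two layouts.
struct TileSoA { std::array<float, 32> x, y, z; };
using AoSoA = std::array<TileSoA, 256 / 32>;

// Accessors hiding the layout behind one logical index.
inline float& get_x(AoS& a, std::size_t i)   { return a[i].x; }
inline float& get_x(SoA& s, std::size_t i)   { return s.x[i]; }
inline float& get_x(AoSoA& t, std::size_t i) { return t[i / 32].x[i % 32]; }
```

Hiding the index computation behind accessors like these is what lets an auto-tuner swap layouts without touching kernel logic.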
What is the optimal configuration for a kernel?
- Difficult to find an analytical solution
  - Memory access can be sensitive to the input data
  - Different optima for varying input data
  - Many GPU architectures with different memory hierarchies
- Empirical profiling
  - Requires compiling & executing many different implementations
  - Very time-intensive
High Dimensionality
- Up to several million configurations!
- 1000000 ∙
5𝑡 𝐷𝑝𝑛𝑞𝑗𝑚𝑏𝑢𝑗𝑝𝑜 𝑢𝑗𝑛𝑓 16 (𝐷𝑝𝑠𝑓𝑡)
+ 0.5𝑡 𝐹𝑦𝑓𝑑𝑣𝑢𝑗𝑝𝑜 𝑢𝑗𝑛𝑓 ≥ 9 𝑒𝑏𝑧𝑡
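The estimate above can be sanity-checked with a few lines of arithmetic, using the slide's numbers (5 s compile time parallelized over 16 cores, 0.5 s execution per configuration):

```cpp
// Back-of-the-envelope cost of exhaustive profiling: each configuration
// costs 5 s to compile (parallelized over 16 CPU cores) plus 0.5 s to
// execute (serially, on one GPU). Numbers are taken from the slide.
constexpr double exhaustive_seconds(double configurations) {
    return configurations * (5.0 / 16.0 + 0.5);
}

constexpr double as_days(double seconds) {
    return seconds / (24.0 * 3600.0);
}
```

For 1,000,000 configurations this gives roughly 9.4 days, matching the slide's lower bound of 9 days.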
[Diagram: an application consists of kernels (A, B, C) accessing arrays (A, B, C); tunable per array: transposition, layout and memory placement; tunable per kernel: the L1 cache size.]
Measured Kernel Execution Time

[Plot: measured execution time (ms) of all 5,184 configurations, ordered by runtime.]
Toy Example: Performance Estimation 2D

[Diagram: 2D configuration space over memory placement (Global, Texture) and layout (AoS, SoA, AoSoA); three measured configurations: 7 ms, 4 ms and 5 ms. Case labels from the slide: A == 0 && B == 0, A != B, A == 1 && B == 1.]
Toy Example: Performance Estimation 2D

- Predicted execution time = execution time of the base configuration + Σ Δ(base, support configurations)

[Diagram: the same 2D grid; deltas relative to the base (+2 ms, -1 ms) are added to the base time (5 ms) to predict an unmeasured configuration (6 ms), while 7 ms and 4 ms remain measured values.]
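The prediction rule above (base time plus the summed per-dimension deltas of the support configurations) can be sketched in a few lines; the dimension and value names here are illustrative, not the actual tool's API:

```cpp
#include <map>
#include <string>

// Per-dimension deltas measured against a base configuration:
// deltas[d][v] = time(base with dimension d set to v) - time(base).
using Deltas = std::map<std::string, std::map<std::string, double>>;

// Additive model from the slides: the predicted time of an arbitrary
// configuration is the base time plus the sum of per-dimension deltas.
double predict(double base_ms,
               const Deltas& deltas,
               const std::map<std::string, std::string>& config) {
    double t = base_ms;
    for (const auto& [dim, value] : config)
        t += deltas.at(dim).at(value);
    return t;
}
```

For example, a 5 ms base with deltas of +2 ms and -1 ms yields a 6 ms prediction.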
Toy Example: Performance Estimation 3D

[Diagram: the same estimation extended to a third dimension, again spanning memory placement (Global, Texture) and layout (AoS, SoA, AoSoA).]
Non-Linear Relationship
- Not all configurations are linearly related to each other
- Shared dimensions
  - Affect all arrays
  - e.g., the L1 cache size
- Independent dimensions
  - Affect only one array
  - Layout, memory and transposition
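Because of this, the additive model is only trusted within a fixed setting of every shared dimension: configurations are grouped by those settings before fitting. A minimal sketch (field names illustrative):

```cpp
#include <map>
#include <string>
#include <vector>

struct Config {
    std::string l1_mode;  // shared dimension ("EQ", "L1" or "SM"): affects all arrays
    std::string layout;   // independent dimensions: affect only one array each
    std::string memory;
};

std::map<std::string, std::vector<Config>>
split_into_domains(const std::vector<Config>& configs) {
    // Configurations are only (approximately) linearly related within one
    // prediction domain, i.e. for a fixed value of every shared dimension,
    // so group by that value before fitting the additive model per domain.
    std::map<std::string, std::vector<Config>> domains;
    for (const Config& c : configs) domains[c.l1_mode].push_back(c);
    return domains;
}
```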
Toy Example: Prediction Domains

[Diagram: three prediction domains, one per L1-cache setting (L1, EQ, SM); each domain contains its own 2D grid over memory placement (Global, Texture) and layout (AoS, SoA, AoSoA).]
Real Example: Measured Time

[Plot: measured time (ms) of all 5,184 configurations, ordered by runtime.]
Real Example: Base Configurations

[Plot: the measured curve with the base configurations marked.]
Real Example: Support Configurations

[Plot: the measured curve with base and support configurations marked.]
Real Example: Prediction

[Plot: predicted times overlaid on the measured curve, with base and support configurations marked.]
Real Example: Prediction (zoom in)

[Plot: zoom onto configurations 100-600 and times 65-90 ms; predicted vs. measured.]
Real Example: Prediction (zoom in)

[Plot: the same zoom with the best predicted configuration marked.]

- Measured: 44 of 5,184 configurations (0.85%)
- Our result: 72.52 ms (+3.59 ms over the true optimum)
- Min: 68.93 ms, Max: 526.48 ms, Avg: 300.75 ms
EVALUATION
- 1. Benchmark: BitonicSort
  - struct { long a; int b; short c; char d; }
  - Sorting for each field, A < B < C < D
  - Values limited to 0…1023 to produce equal columns
  - 2 Kernels
  - 27 configurations
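The benchmark's element type and its per-field sorting can be sketched as follows; this is a CPU stand-in built on std::sort, while the actual benchmark runs a bitonic sorting network in CUDA:

```cpp
#include <algorithm>
#include <vector>

// Element type from the BitonicSort benchmark: one sort key per field,
// with field sizes long > int > short > char.
struct Elem { long a; int b; short c; char d; };

// The benchmark sorts once per field; a generic key selector lets the
// same routine sort by a, b, c or d.
template <typename Key>
void sort_by(std::vector<Elem>& v, Key key) {
    std::sort(v.begin(), v.end(),
              [&](const Elem& x, const Elem& y) { return key(x) < key(y); });
}
```

Because every field of every element is touched, the struct's memory layout (AoS vs. SoA vs. AoSoA) directly determines how well the GPU's accesses coalesce.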
- 2. Benchmark: KD-Tree Builder
  - 9 Kernels
  - > 570k configurations
- 3. Benchmark: REYES
  - 4 Kernels
  - > 2.4M configurations
Profiling Algorithms
- Exhaustive Search [Muralidharan et al. 2014]
  - Tries all possible configurations
- Greedy Profiling [Liu et al. 2008]
  - Optimizes one dimension after another
- Evolutionary Algorithm [Jordan et al. 2012]
  - Starts with a random population of configurations
  - Good configurations are kept
  - Bad configurations are mutated, recombined or replaced by random samples
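The greedy strategy can be sketched as a simple coordinate descent over the tuning dimensions; this is a sketch, with the `measure` callback standing in for compiling and timing a kernel configuration:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Config = std::map<std::string, std::string>;

// Greedy profiling: fix an order of tuning dimensions, and for each one
// keep the best-measured value before moving on to the next dimension.
Config greedy_profile(
    const std::vector<std::pair<std::string, std::vector<std::string>>>& dims,
    Config config,
    const std::function<double(const Config&)>& measure) {
    for (const auto& [dim, values] : dims) {
        double best_time = measure(config);
        std::string best_value = config[dim];
        for (const std::string& v : values) {
            Config trial = config;
            trial[dim] = v;  // vary one dimension, keep the rest fixed
            double t = measure(trial);
            if (t < best_time) { best_time = t; best_value = v; }
        }
        config[dim] = best_value;  // commit before tuning the next dimension
    }
    return config;
}
```

Greedy search measures only O(sum of dimension sizes) configurations instead of their product, but can miss optima that depend on interactions between dimensions.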
Evaluation
- Profiling Algorithms
  - Exhaustive Search (E)
  - Greedy Algorithm (G)
  - Evolutionary Algorithm (A)
  - Our Algorithm (P)
- GPUs
  - GeForce GTX980 (Maxwell)
  - Tesla K20 (Kepler)
- CUDA WatchDog: kills configurations whose execution time exceeds that of the best configuration found so far
QUALITY
Execution Speed Up: GTX980 w/o WatchDog

[Bar chart, higher is better: execution speedup of E, G, A and P on Bitonic, KD-Tree and Reyes; axis from 1.00 to 1.50.]
Execution Speed Up: Tesla K20 w/o WatchDog

[Bar chart, higher is better: execution speedup of E, G, A and P on Bitonic, KD-Tree and Reyes; axis from 1.00 to 1.50.]
Execution Speed Up: Tesla K20 with WatchDog

[Bar chart, higher is better: execution speedup of E, EW, G, GW, P and PW on Bitonic, KD-Tree and Reyes; axis from 1.00 to 1.50.]
SPEED UP
Profiling Speed Up: BitonicSort (higher is better)

          E     EW    G     GW    A     P     PW
GTX980    1.00  1.00  0.95  0.90  1.00  0.96  0.94
K20       1.00  0.85  1.03  0.84  0.91  0.92  1.00
Profiling Speed Up: KD-Tree Builder (higher is better)

          E    EW   G      GW     A      P        PW
GTX980    1.0  1.0  138.6  158.1  70.6   374.8    387.5
K20       1.0  1.1  865.2  956.4  335.6  1,327.8  1,588.2
Profiling Speed Up: REYES (higher is better)

          E    EW   G        GW       A        P        PW
GTX980    1.0  1.0  930.3    987.3    949.4    3,547.5  3,387.1
K20       1.0  1.0  2,882.7  3,111.3  1,900.5  8,657.3  9,395.9
Summary

- Introduced a prediction-guided profiling algorithm
  - Up to 5.5x faster than other state-of-the-art methods, while achieving comparable results
  - Up to 9,300x faster than exhaustive search (10 days 20 hours reduced to 1 minute 40 seconds)
- Limitations
  - No global optimization: only one kernel is optimized at a time