Guided Profiling for Auto-Tuning Array Layouts on GPUs



SLIDE 1

Guided Profiling for Auto-Tuning Array Layouts on GPUs

Nicolas Weber, Sandra C. Amend and Michael Goesele TU Darmstadt

SLIDE 2

Motivation

  • Memory access is one of the most important performance factors in CUDA applications

  • CUDA Programming Guide
  • “Optimize memory usage to achieve maximum memory throughput” is one of its three basic optimization strategies

  • Performance differences of up to an order of magnitude between the best and the worst implementation

  • Experience alone does not guarantee finding the optimal configuration

SLIDE 3

Motivation

  • Tedious to optimize in big GPU applications
  • Layouts: Array of Structs (AoS), Structure of Arrays (SoA), AoSoA
  • Transpositions of multi-dimensional arrays
  • Size of L1 cache / shared memory
  • Memory placement: Global, Texture, Shared, Local and Constant memory
  • Changing GPU architectures require reoptimization
  • The memory hierarchy has changed in every architecture
  • Automated optimization for most GPUs and algorithms
  • We develop an open-source auto-tuner that automatically optimizes array accesses in CUDA applications (with minimal programming overhead)
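The three layout families above differ only in how the same logical array is laid out in memory. A minimal host-side C++ sketch (field names, element type and sizes are illustrative, not from the talk):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2-field element used for illustration only.
struct Particle { float x; float y; };

// Array of Structs: the fields of one element sit next to each other.
struct AoS { Particle data[1024]; };

// Structure of Arrays: each field gets its own contiguous array, so
// thread i reading x[i] produces coalesced loads on a GPU.
struct SoA { float x[1024]; float y[1024]; };

// AoSoA: SoA tiles of a fixed width (e.g. one warp), stored as an array.
constexpr std::size_t TILE = 32;
struct Tile  { float x[TILE]; float y[TILE]; };
struct AoSoA { Tile tiles[1024 / TILE]; };

// Accessing field x of element i in each layout:
float aos_x(const AoS& a, std::size_t i)     { return a.data[i].x; }
float soa_x(const SoA& a, std::size_t i)     { return a.x[i]; }
float aosoa_x(const AoSoA& a, std::size_t i) { return a.tiles[i / TILE].x[i % TILE]; }
```

All three expose the same logical index `i`; only the address computation changes, which is why an auto-tuner can swap layouts without touching the algorithm.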

SLIDE 4

What is the optimal configuration for a kernel?

  • Difficult to find an analytical solution
  • Memory access can be input-data sensitive
  • Different optima for varying input data
  • Many GPU architectures with different memory hierarchies
  • Empirical profiling
  • Requires compiling and executing many different implementations
  • Very time intensive

SLIDE 5

High Dimensionality

  • Up to several million configurations!
  • 1,000,000 ∙ ( 5 s compilation time / 16 cores + 0.5 s execution time ) ≥ 9 days
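The estimate above follows from simple arithmetic; the constants below mirror the slide's assumptions (5 s compile time spread over 16 cores, plus 0.5 s execution per configuration):

```cpp
#include <cassert>

// Back-of-the-envelope check of the slide's profiling-time estimate.
constexpr double kConfigs      = 1e6;          // number of configurations
constexpr double kCompileSec   = 5.0 / 16.0;   // compilation parallelized over 16 cores
constexpr double kExecSec      = 0.5;          // execution time per configuration
constexpr double kTotalSeconds = kConfigs * (kCompileSec + kExecSec);
constexpr double kTotalDays    = kTotalSeconds / 86400.0;  // about 9.4 days
```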

[Diagram: the Application contains Kernels A, B and C accessing Arrays A, B and C; each array has Transposition, Layout and Memory optimization dimensions, plus the shared L1 cache size]

SLIDE 6

Measured Kernel Execution Time

[Plot: measured execution time in ms over the 5,184 configurations, ordered by runtime; annotated regions: A == 0 && B == 0, A != B, A == 1 && B == 1]

SLIDE 7

Toy Example: Performance Estimation 2D

[Figure: 2D configuration grid, Layout (AoS, SoA, AoSoA) × Memory (Global, Texture), with three measured times: 7 ms, 4 ms and 5 ms]

SLIDE 8

Toy Example: Performance Estimation 2D

  • Predicted Execution Time
  • Execution time of Base + Sum( Δ(Base, Support Configurations) )

[Figure: 2D grid with the base configuration (7 ms) and the support configurations (5 ms, 4 ms); the remaining configuration's time is predicted as 6 ms from the base plus the supports' deltas]
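The additive model above can be sketched in a few lines of C++; `predict` and the times used below are illustrative stand-ins, not the auto-tuner's actual code:

```cpp
#include <cassert>
#include <vector>

// Predict a combined configuration's time as:
//   base + sum over supports of (support - base)
// i.e. each independently varied dimension contributes its delta
// relative to the measured base configuration.
double predict(double base_ms, const std::vector<double>& support_ms) {
    double delta_sum = 0.0;
    for (double s : support_ms) delta_sum += (s - base_ms);
    return base_ms + delta_sum;
}
```

With one support the prediction is just that support's measured time; with several, the deltas stack, which is exactly the linearity assumption the next slides examine.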

SLIDE 9

Toy Example: Performance Estimation 3D

[Figure: 3D extension of the toy example, adding a third configuration dimension to the Layout (AoS, SoA, AoSoA) × Memory (Global, Texture) grid]

SLIDE 10

Non-Linear Relationship

  • Not all configurations are linearly related to each other
  • Shared dimensions
  • Affect all arrays
  • L1 cache size
  • Independent dimensions
  • Only affect one array
  • Layout, memory and transposition


SLIDE 11


Toy Example: Prediction Domains

[Figure: three prediction domains, one per L1 cache setting (EQ, L1, SM); each is a Layout (AoS, SoA, AoSoA) × Memory (Global, Texture) grid]

SLIDE 12

Toy Example: Prediction Domains

[Figure: the three prediction domains again, one per L1 cache setting (EQ, L1, SM), each a Layout (AoS, SoA, AoSoA) × Memory (Global, Texture) grid]

SLIDE 13

Real Example: Measured Time

[Plot: measured execution time in ms over the 5,184 configurations, ordered by runtime]

SLIDE 14

Real Example: Base Configurations

[Plot: as above, with the base configurations highlighted]

SLIDE 15

Real Example: Support Configurations

15

560 5184

Time (ms) configurations ordered by runtime Measured Support Base

SLIDE 16

Real Example: Prediction

[Plot: measured vs. predicted execution times over all 5,184 configurations, ordered by runtime, with base and support configurations highlighted]

SLIDE 17

Real Example: Prediction (zoom in)

[Plot: zoom-in on the fastest configurations (≈65–90 ms), predicted vs. measured]

SLIDE 18

Real Example: Prediction (zoom in)

[Plot: zoom-in on the fastest configurations (≈65–90 ms), predicted vs. measured]

SLIDE 19

Real Example: Prediction (zoom in)

[Plot: zoom-in on the fastest configurations (≈65–90 ms), predicted vs. measured]

SLIDE 20

Real Example: Prediction (zoom in)

[Plot: zoom-in on the fastest configurations (≈65–90 ms), predicted vs. measured, with the best predicted configuration marked]

  • Measured: 44 / 5,184 configurations (0.85%)
  • Our result: 72.52 ms (+3.59 ms over the optimum)
  • Min: 68.93 ms, Max: 526.48 ms, Avg: 300.75 ms

SLIDE 21

EVALUATION


SLIDE 22
  • 1. Benchmark: BitonicSort

struct { long a; int b; short c; char d; }

  • Sorting for each field, A < B < C < D
  • Values limited to 0…1023 to cause equal columns
  • 2 Kernels
  • 27 configurations
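One way to read the struct and the rule “A < B < C < D” is a lexicographic comparison in which field `a` has the highest priority, so ties fall through to the later fields (consistent with limiting the values to 0…1023 to force equal columns). That interpretation is an assumption, sketched here in plain C++:

```cpp
#include <cassert>
#include <cstdint>

// The benchmark element from the slide, with explicit field widths.
struct Element { std::int64_t a; std::int32_t b; std::int16_t c; std::int8_t d; };

// Lexicographic "sort by every field" comparison: a, then b, then c, then d.
bool less(const Element& l, const Element& r) {
    if (l.a != r.a) return l.a < r.a;
    if (l.b != r.b) return l.b < r.b;
    if (l.c != r.c) return l.c < r.c;
    return l.d < r.d;
}
```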


SLIDE 23
  • 2. Benchmark: KD-Tree Builder
  • 9 Kernels
  • > 570k configurations


SLIDE 24
  • 3. Benchmark: REYES
  • 4 Kernels
  • > 2.4M configurations


SLIDE 25

Profiling Algorithms

  • Exhaustive Search [Muralidharan et al. 2014]
  • Tries all possible configurations
  • Greedy Profiling [Liu et al. 2008]
  • Optimizes the dimensions one after another
  • Evolutionary Algorithm [Jordan et al. 2012]
  • Starts with a random population of configurations
  • Good configurations are kept
  • Bad configurations are mutated, recombined or replaced by random samples
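Greedy profiling as described above can be sketched as follows; `measure` is a placeholder for compiling and timing a real kernel configuration, and the whole function is an illustration rather than the cited implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Greedy profiling: fix all dimensions, then optimize one dimension at a
// time, keeping the best value found before moving to the next dimension.
// dim_sizes[d] is the number of values dimension d can take.
std::vector<std::size_t> greedy_profile(
    const std::vector<std::size_t>& dim_sizes,
    const std::function<double(const std::vector<std::size_t>&)>& measure) {
    std::vector<std::size_t> best(dim_sizes.size(), 0);
    for (std::size_t d = 0; d < dim_sizes.size(); ++d) {
        double best_time = measure(best);
        for (std::size_t v = 1; v < dim_sizes[d]; ++v) {
            std::vector<std::size_t> cand = best;
            cand[d] = v;
            double t = measure(cand);
            if (t < best_time) { best_time = t; best = cand; }
        }
    }
    return best;
}
```

The cost is the sum of the dimension sizes rather than their product, which is why greedy profiling is fast but can miss optima when dimensions interact non-linearly, as the prediction-domain slides showed.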


SLIDE 26

Evaluation

  • Profiling Algorithms
  • Exhaustive Search (E)
  • Greedy Algorithm (G)
  • Evolutionary Algorithm (A)
  • Our Algorithm (P)
  • GPUs
  • GeForce GTX980 (Maxwell)
  • Tesla K20 (Kepler)
  • CUDA WatchDog: kills configurations that exceed the execution time of the best one found so far

SLIDE 27

QUALITY


SLIDE 28

Execution Speed Up: GTX980 w/o WatchDog

Higher is better

[Bar chart: execution speedup (scale 1.00–1.50) on Bitonic, KD-Tree and Reyes for algorithms E, G, A and P]

SLIDE 29

Execution Speed Up: Tesla K20 w/o WatchDog

Higher is better

[Bar chart: execution speedup (scale 1.00–1.50) on Bitonic, KD-Tree and Reyes for algorithms E, G, A and P]

SLIDE 30

Execution Speed Up: Tesla K20 with WatchDog

Higher is better

[Bar chart: execution speedup (scale 1.00–1.50) on Bitonic, KD-Tree and Reyes for algorithms E, EW, G, GW, P and PW]

SLIDE 31

SPEED UP


SLIDE 32

Profiling Speed Up: BitonicSort

Higher is better

| Algorithm | GTX980 | K20 |
| --- | --- | --- |
| E | 1.00 | 1.00 |
| EW | 1.00 | 0.85 |
| G | 0.95 | 1.03 |
| GW | 0.90 | 0.84 |
| A | 1.00 | 0.91 |
| P | 0.96 | 0.92 |
| PW | 0.94 | 1.00 |

SLIDE 33

Profiling Speed Up: KD-Tree Builder

Higher is better

| Algorithm | GTX980 | K20 |
| --- | --- | --- |
| E | 1.0 | 1.0 |
| EW | 1.0 | 1.1 |
| G | 138.6 | 865.2 |
| GW | 158.1 | 956.4 |
| A | 70.6 | 335.6 |
| P | 374.8 | 1,327.8 |
| PW | 387.5 | 1,588.2 |

SLIDE 34

Profiling Speed Up: REYES

Higher is better

| Algorithm | GTX980 | K20 |
| --- | --- | --- |
| E | 1.0 | 1.0 |
| EW | 1.0 | 1.0 |
| G | 930.3 | 2,882.7 |
| GW | 987.3 | 3,111.3 |
| A | 949.4 | 1,900.5 |
| P | 3,547.5 | 8,657.3 |
| PW | 3,387.1 | 9,395.9 |

SLIDE 35

Summary

  • Introduced a prediction-guided profiling algorithm
  • up to 5.5x faster than other state-of-the-art methods
  • while achieving comparable results
  • up to 9,300x faster than exhaustive search
  • 10 days 20 hours → 1 minute 40 seconds
  • Limitations
  • No global optimization → only one kernel at a time is optimized

SLIDE 36

Thank you for your attention!

Source Code available @ http://tinyurl.com/matog (BSD 3-Clause license)

Contact: matog@gris.tu-darmstadt.de