Autotuning OpenCL Workgroup Size for Stencil Patterns, Chris Cummins (PowerPoint PPT Presentation)



SLIDE 1

Autotuning OpenCL Workgroup Size for Stencil Patterns

SLIDE 2

Chris Cummins

http://chriscummins.cc

SLIDE 3

Stencils & Workgroup size

SLIDE 4

Stencils & Workgroup size

SLIDE 5

input, stencil, output

SLIDE 6

input, stencil, output, border region, element

SLIDE 7

input, stencil, output; 10^6 border regions, 10^6 elements

SLIDE 8

input, stencil, output; 10^6 border regions, 10^6 elements. Multiple independent computations.

SLIDE 9

input, stencil, output; 10^6 border regions, 10^6 elements. Multiple (overlapping) memory accesses.

SLIDE 10

SLIDE 11

input, stencil, output, border region, element

SLIDE 12

input, stencil, output, border region, element, kernel

SLIDE 13

input, stencil, output, border region, element, work-item, kernel

SLIDE 14

Work-item Workgroup Matrix Tile

wc wr

Border region

SLIDE 15

Stencils & Workgroup size

SLIDE 16

Stencils & Workgroup size

SLIDE 17

Work-item, Workgroup, Matrix, Tile, wc, wr, Border region

SLIDE 18

Workgroup size affects:

  • mapping to SIMD hardware.
  • device occupancy.
  • local memory utilisation.

SLIDE 19

Pop Quiz!

SLIDE 20

What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on:

  • 1. AMD HD7990?
  • 2. Nvidia GTX Titan?
  • 3. Intel i7-3820?
SLIDE 21

What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on:

  • 1. AMD HD7990? 64 x 4
  • 2. Nvidia GTX Titan? 96 x 4
  • 3. Intel i7-3820? 40 x 24

SLIDE 22

What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running:

  • 1. Sobel edge detection?
  • 2. Heat equation?
  • 3. Game of life?
SLIDE 23

What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running:

  • 1. Sobel edge detection? 256 x 2
  • 2. Heat equation? 128 x 2
  • 3. Game of life? 32 x 6

SLIDE 24

What is the best workgroup size for …

  • 1. Intel i5-2430, game of life, 4096 x 4096?
  • 2. Nvidia GTX 690, threshold, 512 x 512?
  • 3. Intel i7-3820, NMS, 512 x 512?
SLIDE 25

What is the best workgroup size for …

  • 1. Intel i5-2430, game of life, 4096 x 4096? 196 x 20
  • 2. Nvidia GTX 690, threshold, 512 x 512? 32 x 4
  • 3. Intel i7-3820, NMS, 512 x 512? 88 x 8

SLIDE 26

One size does not fit all!

SLIDE 27

Choosing workgroup size depends on:

  • 1. Device
  • 2. Program
  • 3. Dataset
SLIDE 28

Optimisation space

(3D plot axes: rows, cols, performance)

SLIDE 29

SLIDE 30

Same stencil! Different device!

SLIDE 31

Same device! Different stencil!

SLIDE 32

SLIDE 33

Workgroup Size + Stencils

  • 1. Non-linear, non-continuous
  • 2. Depends on device, program, dataset
  • 3. Not all values are legal
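The legality constraint can be sketched in a few lines. This is a minimal illustration, not the deck's actual rule: the device limits used here (a maximum of 1024 work-items per workgroup, and per-dimension maxima) are assumed placeholder values standing in for what OpenCL reports via `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_WORK_ITEM_SIZES`.

```python
# Sketch of the workgroup-size legality constraint. The device limits are
# illustrative placeholders, not values queried from a real device.

def is_legal(wc, wr, max_wg_size=1024, max_dim=(1024, 1024)):
    """A workgroup size (wc, wr) is legal only if each dimension fits the
    device's per-dimension limit and the total number of work-items does
    not exceed the device's maximum workgroup size."""
    return (0 < wc <= max_dim[0] and
            0 < wr <= max_dim[1] and
            wc * wr <= max_wg_size)

# Enumerate the legal subset of a candidate space:
legal = [(wc, wr) for wc in range(2, 102, 2) for wr in range(2, 102, 2)
         if is_legal(wc, wr)]
```

This is why the optimisation space is non-continuous: the legal region is an irregular subset of the (wc, wr) grid, and it changes with the device and kernel.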

SLIDE 34

Autotuning

SLIDE 35

Set a workgroup size. Execute and time the program.

SLIDE 40

Set a workgroup size. Execute and time the program. Repeat.

… (continue until done / bored)

Pick the best one you tried

(iterative compilation)
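The loop above can be sketched as a random search. This is a self-contained illustration: `time_kernel` is a made-up analytic stand-in for actually executing and timing the OpenCL stencil, not a real measurement.

```python
import random

# Sketch of the iterative-compilation loop: try workgroup sizes, time each,
# keep the best. `time_kernel` is a hypothetical runtime model so the
# example runs without a GPU.

def time_kernel(wc, wr):
    # Made-up model: penalise workgroups far from 256 work-items, and
    # widths that are not SIMD-friendly multiples of 32.
    items = wc * wr
    return abs(items - 256) / 256.0 + 0.1 * (wc % 32 != 0)

def iterative_search(candidates, budget=50, seed=42):
    rng = random.Random(seed)
    best, best_time = None, float("inf")
    for _ in range(budget):               # continue until done / bored
        wc, wr = rng.choice(candidates)   # set a workgroup size
        t = time_kernel(wc, wr)           # execute and time the program
        if t < best_time:
            best, best_time = (wc, wr), t
    return best                           # pick the best one you tried

candidates = [(wc, wr) for wc in (8, 16, 32, 64) for wr in (2, 4, 8, 16)]
best = iterative_search(candidates)
```

Each iteration costs one full program execution, which is exactly the weakness the next slides point out.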

SLIDE 41

BAD!

SLIDE 42

BAD! Takes a long time.

SLIDE 43

BAD! Takes a long time. Must be repeated for every new “x”: device, program, dataset.

SLIDE 44

Let’s improve

SLIDE 45

Set a workgroup size. Execute and time the program. … (continue until done / bored). Pick the best one you tried.

SLIDE 46

Each trial yields just 1 data point.

SLIDE 47

Training: collect data points, extract “features”, train machine learning classifier. Deployment: extract “features”, input to classifier.
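The train/deploy pipeline above can be sketched with a toy model. Everything here is illustrative: the feature vectors (compute units, stencil border, dataset rows) and the 1-nearest-neighbour “classifier” are simple stand-ins for the real features and the Weka/random-forest models used later in the deck.

```python
# Sketch of the pipeline: data points -> features -> model -> prediction.
# A toy 1-nearest-neighbour lookup stands in for a trained classifier;
# the scenarios and labels are invented.

def extract_features(scenario):
    # scenario: (compute_units, stencil_border, dataset_rows) -- assumed.
    return tuple(float(x) for x in scenario)

def train(data_points):
    # data_points: list of (scenario, best_workgroup_size) observations.
    return [(extract_features(s), wg) for s, wg in data_points]

def predict(model, scenario):
    feat = extract_features(scenario)
    dist = lambda a: sum((x - y) ** 2 for x, y in zip(a, feat))
    # Return the label of the closest training point.
    return min(model, key=lambda m: dist(m[0]))[1]

training = [((16, 1, 512), (64, 4)),    # GPU-like device (hypothetical)
            ((4, 1, 512), (40, 24))]    # CPU-like device (hypothetical)
model = train(training)
```

The key difference from iterative compilation: once trained, a prediction costs one model lookup, not hundreds of program executions.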

SLIDE 48

GOOD!

SLIDE 49

GOOD! Can make predictions on unseen “x”: device, program, dataset.

SLIDE 50

GOOD! Can make predictions on unseen “x”: device, program, dataset.

Many unanswered questions …

SLIDE 51

Questions:

  • 1. What features do we need?
  • 2. What programs do we train on?
  • 3. How do we make predictions?
SLIDE 53

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 55

Device features: How many compute units? How much memory? Cache size? etc.

SLIDE 56

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 59

(diagram: stencil extents Sn, Ss, Se, Sw around element xi,j, reaching from xi-2,j-2 to xi+2,j+2)

Kernel features: How big is the border region? What shape is it? How many instructions? What type of instructions? etc.
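One of these kernel features can be computed directly from the stencil extents in the diagram. A minimal sketch, assuming the border region is the halo of out-of-grid elements the stencil reads around a rows × cols grid (the function name is mine, not from the deck):

```python
# Sketch: deriving the "border region" kernel feature from the stencil
# extents Sn, Ss, Se, Sw (north, south, east, west reaches).

def border_region_size(rows, cols, sn, ss, se, sw):
    """Number of halo elements: the padded grid (extended by the stencil
    reach on each side) minus the grid proper."""
    padded = (rows + sn + ss) * (cols + se + sw)
    return padded - rows * cols

# A stencil reaching 1 element in each direction on a 4096 x 4096 grid:
halo = border_region_size(4096, 4096, 1, 1, 1, 1)
```

The shape of the stencil (e.g. Sn != Se) is a separate feature, which is why the extents are kept as four numbers rather than one.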

SLIDE 60

  • 1. Device
  • 2. Kernel
  • 3. Dataset

SLIDE 63

Dataset features: How big is the data? What type is the input? What type is the output?

SLIDE 64

  • 1. Device
  • 2. Kernel
  • 3. Dataset
SLIDE 67

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on?
  • 3. How do we make predictions?
SLIDE 68

  • 1. Learn by example
  • 2. Learn by exploration

SLIDE 69

Learn by example: use benchmark programs and hope that they are representative.

SLIDE 71

Learn by exploration: create our own benchmarks and explore the (huge!) program space.
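Exploration can be sketched as sampling the benchmark parameter space. This is only an illustration of the idea, not the deck's actual generator: the parameter names and ranges below are arbitrary assumptions.

```python
import random

# Illustrative "learn by exploration": sample random points in a made-up
# stencil-benchmark parameter space instead of relying on a fixed suite.

def random_stencil_benchmark(rng):
    return {
        "north": rng.randint(0, 10),   # stencil extents (arbitrary ranges)
        "south": rng.randint(0, 10),
        "east": rng.randint(0, 10),
        "west": rng.randint(0, 10),
        "instructions": rng.randint(1, 500),
        "datatype": rng.choice(["int", "float", "double"]),
    }

rng = random.Random(0)
suite = [random_stencil_benchmark(rng) for _ in range(100)]
```

Each sampled benchmark, once executed across workgroup sizes, contributes training points that a fixed benchmark suite might never cover.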

SLIDE 73

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on? ✓
  • 3. How do we make predictions?
SLIDE 74

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 76

Predict category (optimal workgroup size) for scenario: 32 x 4, 128 x 2, 48 x 12, …

SLIDE 79

Predicted category may be incorrect!

SLIDE 80

Predicted category may be invalid!
SLIDE 81

Fallback Handlers:

  • 1. Baseline: “pick something we know is safe”
  • 2. Random: “pick a random value”
  • 3. Nearest Neighbour: “pick the closest value we think will work”
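The nearest-neighbour fallback can be sketched directly: when the classifier's prediction is invalid for the current device/kernel, pick the closest legal workgroup size. The legality rule (product of dimensions at most 1024) and the candidate grid are illustrative assumptions.

```python
# Sketch of the NearestNeighbour fallback handler: replace an invalid
# predicted workgroup size with the nearest legal one (Euclidean distance).
# The legality limit of 1024 work-items is an assumed placeholder.

def nearest_legal(predicted, legal_sizes):
    pc, pr = predicted
    return min(legal_sizes,
               key=lambda s: (s[0] - pc) ** 2 + (s[1] - pr) ** 2)

legal = [(wc, wr) for wc in range(4, 129, 4) for wr in range(2, 33, 2)
         if wc * wr <= 1024]

# A prediction of 200 x 4 exceeds the candidate space; fall back:
choice = nearest_legal((200, 4), legal)
```

Compared with the Baseline handler, this preserves as much of the classifier's (presumably informative) prediction as legality allows.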

SLIDE 85

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 88

Predict the runtime of the program for each candidate workgroup size. Search for the lowest predicted runtime.
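This predict-then-search scheme can be sketched as an argmin over candidates. The "regressor" here is a hard-coded toy function standing in for a trained model (the deck uses a random forest regressor); its shape is entirely made up. The speedup regressor of the later slides works the same way, with argmax over predicted speedups instead.

```python
# Sketch of runtime-regression autotuning: predict a runtime for every
# candidate workgroup size, then pick the minimum. The toy "regressor"
# below stands in for a trained model.

def predicted_runtime(features, wc, wr):
    # Hypothetical model: larger workgroups amortise overhead; widths that
    # are not multiples of 16 pay a penalty. Purely illustrative.
    items = wc * wr
    return 1.0 / items + (0.5 if wc % 16 else 0.0)

def tune(features, candidates):
    # Search the candidate space for the lowest predicted runtime.
    return min(candidates, key=lambda s: predicted_runtime(features, *s))

candidates = [(wc, wr) for wc in (8, 16, 32, 64) for wr in (2, 4, 8, 16)]
best = tune({}, candidates)
```

Unlike the classifier, the regressor can rank every candidate, at the cost of one model evaluation per candidate — the source of the higher prediction time reported later.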

SLIDE 89

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 92

Predict the speedup of workgroup size A over B for the program. Search for the highest predicted speedup.

SLIDE 93

  • 1. Classifier
  • 2. Runtime Regressor
  • 3. Speedup Regressor
SLIDE 96

Questions:

  • 1. What features do we need? ✓
  • 2. What programs do we train on? ✓
  • 3. How do we make predictions? ✓
SLIDE 97

Experiment

SLIDE 98

Implementation

  • Modified SkelCL stencil pattern
  • Python server process for autotuning
  • 5 classifiers, random forest regressor

SLIDE 99

Experimental Setup

  • 6 stencil benchmarks + synthetic
  • 7 different GPUs & CPUs
  • 4 dataset sizes
  • Exhaustive search of the workgroup size space for each

SLIDE 100

Results

SLIDE 101

Optimisation space (3D plot axes: rows, cols, optimality)

SLIDE 102

(heatmap: oracle frequency (log) over workgroup dimensions wc × wr)

SLIDE 103

32% of optimal workgroup sizes are unique

SLIDE 104

32% of optimal workgroup sizes are unique; the most common is optimal only 15% of the time

SLIDE 105

(plot: speedup (log) per scenario, sorted by descending max speedup; series: Max, w(4×4), w(32×4))

SLIDE 108

Annotations: upper bound (average 15.14x), static tuning, human expert

SLIDE 109

Autotuning: Classification

SLIDE 110

(plots: accuracy, illegal/refused predictions, speedup and performance for ZeroR, NaiveBayes, SMO, SimpleLogistic, J48 and RandomForest, each with Baseline, Random and NearestNeighbour fallback handlers)

SLIDE 112

26% optimal; 90% optimal

SLIDE 113

Nearest neighbour fallback is best

SLIDE 114

(plot: classification time (ms) per classifier)

SLIDE 115

2.5ms RTT

SLIDE 116

Autotuning: Regression

SLIDE 117

(plots: speedup and performance of runtime regression vs speedup regression, across Kernel, 10-fold, Device, Synthetic and Dataset splits)

SLIDE 118

Speedup regression achieves the highest speedup

SLIDE 119

(plots: classification time (ms) and accuracy, runtime vs speedup regression)

SLIDE 120

40x slower than J48
SLIDE 121

(plot: speedup over human expert for J48, NaiveBayes, RandomForest, SimpleLogistic, SMO, Runtime Regression and Speedup Regression, ignoring cases where the human expert is invalid)

SLIDE 122

The approaches appear similar …

SLIDE 123

… but have very different prediction characteristics (predicted workgroup sizes over rows × columns)

SLIDE 124

Conclusions

SLIDE 125

  • Average 15x speedup between best and worst workgroup size
  • The best workgroup size depends on device, kernel, and dataset
  • Static tuning achieves only 26% of optimal performance

SLIDE 126

  • We present three methodologies for autotuning OpenCL workgroup size
  • There are trade-offs between prediction cost and training cost
  • We achieve an average 1.22x speedup over a human expert, with increased reliability

SLIDE 127

Details in the paper!

SLIDE 128

Autotuning OpenCL Workgroup Size for Stencil Patterns

http://chriscummins.cc