Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming
A. Mazaheri¹, T. Beringer¹, M. Moskewicz², F. Wolf¹, A. Jannesari³
¹TU Darmstadt, ²Tesla Inc., ³Iowa State University
30.04.2020, EuroSys'20, Heraklion, Crete, Greece


SLIDE 1
  • A. Mazaheri¹, T. Beringer¹, M. Moskewicz², F. Wolf¹, A. Jannesari³

¹TU Darmstadt, ²Tesla Inc., ³Iowa State University

30.04.2020, EuroSys'20, Heraklion, Crete, Greece

Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming

Image: Freepik.com

SLIDE 2

4/27/20 | Department of Computer Science | Laboratory for Parallel Programming | Arya Mazaheri | 2

Neural networks are everywhere

Object detection, speech recognition, sentiment analysis, semantic segmentation, autonomous cars, translation, music composition, intelligent agents, word prediction

SLIDE 3

Convolutional neural networks

Figure: CNN architecture with feature-map visualization; layer types: convolution+ReLU, max pooling, fully connected+ReLU, softmax; final output vector of size 1×1×1000

SLIDE 4

Convolution & tensors

Input tensor: C × H × W; kernel tensor: OC × IC × M × N; output tensor: H′ × W′


SLIDE 8

Convolution & tensors

Input tensor: C × H × W; kernel tensor: OC × IC × M × N; output tensor: H′ × W′

Element-wise multiplication, then summation

  • Convolutions dominate computation (>90% of runtime)
  • Similar to generalized matrix-matrix multiplication → massive GPU parallelism


SLIDE 9

Pipeline: input (tiled) → input/filter transformation → element-wise multiplication → output transformation → output

Winograd convolution F(m, r)

  • Sample F(m = 2×2, r = 3×3) Winograd convolution

Internal tile size: α = m + r − 1

Research questions:

  • Can we reduce the overhead of Winograd transformations?
  • How to choose the right α?
  • How to run Winograd efficiently on a wide range of GPU platforms?
SLIDE 10

Winograd code generation workflow

Frontends: Caffe, TensorFlow, …

Example model description (myCNN.proto):

layers {
  layer { Conv1 top=data bottom=conv1 }
  layer { Conv2 top=conv1 … }
  …
}

Graph-level optimization: compute graph (CG) → annotated CG → refined CG, e.g. fusing conv1 + activation before fc1.

Winograd convolution codegen: given HW info and a Winograd specification F(m, r), a transformation-recipe generator fills a database of transformation matrices; per-operation templates are then instantiated by template meta-programming (C++ metacode). For example, the template

KERNEL conv(in, filts) // CUCL IN img:chan:y:x
main(cf1, cf2) { %(filts_buf_loads); }

is specialized into

KERNEL conv(in, filts) // CUCL IN img:chan:y:x
main(cf1, cf2) { filts_buf[0 + tid] = filts[tid]; }

Generated CUDA/OpenCL/Vulkan kernels: SGEMM library, non-fused Winograd, fused Winograd, direct convolution.

Auto-tuning & variant selection then picks the best variant for the target HW: Nvidia GPUs, AMD GPUs, Qualcomm Snapdragon, and new targets.

SLIDE 11

Optimizing Winograd transformations [Symbolic analysis]

$$Gg = \begin{bmatrix}
-1\cdot g_{0,0} + 0 + 0 & -1\cdot g_{0,1} + 0 + 0 & -1\cdot g_{0,2} + 0 + 0 \\
\tfrac{1}{2}g_{0,0} + \tfrac{1}{2}g_{1,0} + \tfrac{1}{2}g_{2,0} & \tfrac{1}{2}g_{0,1} + \tfrac{1}{2}g_{1,1} + \tfrac{1}{2}g_{2,1} & \tfrac{1}{2}g_{0,2} + \tfrac{1}{2}g_{1,2} + \tfrac{1}{2}g_{2,2} \\
\tfrac{1}{2}g_{0,0} - \tfrac{1}{2}g_{1,0} + \tfrac{1}{2}g_{2,0} & \tfrac{1}{2}g_{0,1} - \tfrac{1}{2}g_{1,1} + \tfrac{1}{2}g_{2,1} & \tfrac{1}{2}g_{0,2} - \tfrac{1}{2}g_{1,2} + \tfrac{1}{2}g_{2,2} \\
0 + 0 + 1\cdot g_{2,0} & 0 + 0 + 1\cdot g_{2,1} & 0 + 0 + 1\cdot g_{2,2}
\end{bmatrix}$$

  • Represent the target matrix by symbols
  • Perform multiplication and obtain the results

for (i = 0; i < alpha; i++) {
    for (j = 0; j < r; j++) {
        res[i][j] = 0;
        for (k = 0; k < r; k++)
            res[i][j] += G[i][k] * g[k][j];
    }
}

Matrix multiplication code before optimization

SLIDE 12

Optimizing Winograd transformations [Remove 1,0s]

Dropping the terms multiplied by 0 and folding the ±1 factors:

$$Gg = \begin{bmatrix}
-g_{0,0} & -g_{0,1} & -g_{0,2} \\
\tfrac{1}{2}g_{0,0} + \tfrac{1}{2}g_{1,0} + \tfrac{1}{2}g_{2,0} & \tfrac{1}{2}g_{0,1} + \tfrac{1}{2}g_{1,1} + \tfrac{1}{2}g_{2,1} & \tfrac{1}{2}g_{0,2} + \tfrac{1}{2}g_{1,2} + \tfrac{1}{2}g_{2,2} \\
\tfrac{1}{2}g_{0,0} - \tfrac{1}{2}g_{1,0} + \tfrac{1}{2}g_{2,0} & \tfrac{1}{2}g_{0,1} - \tfrac{1}{2}g_{1,1} + \tfrac{1}{2}g_{2,1} & \tfrac{1}{2}g_{0,2} - \tfrac{1}{2}g_{1,2} + \tfrac{1}{2}g_{2,2} \\
g_{2,0} & g_{2,1} & g_{2,2}
\end{bmatrix}$$

SLIDE 13

Optimizing Winograd transformations [Index representation]

All columns share the same structure, so a single symbolic column with index j suffices:

$$(Gg)_{\cdot,j} = \begin{bmatrix}
-g_{0,j} \\
\tfrac{1}{2}g_{0,j} + \tfrac{1}{2}g_{1,j} + \tfrac{1}{2}g_{2,j} \\
\tfrac{1}{2}g_{0,j} - \tfrac{1}{2}g_{1,j} + \tfrac{1}{2}g_{2,j} \\
g_{2,j}
\end{bmatrix}$$

SLIDE 14

Optimizing Winograd transformations [Factorization]

Factoring out the common coefficient ½:

$$(Gg)_{\cdot,j} = \begin{bmatrix}
-g_{0,j} \\
\tfrac{1}{2}\,(g_{0,j} + g_{1,j} + g_{2,j}) \\
\tfrac{1}{2}\,(g_{0,j} - g_{1,j} + g_{2,j}) \\
g_{2,j}
\end{bmatrix}$$

SLIDE 15

Optimizing Winograd transformations [Common subexpression elimination]

Extracting the common subexpression cse1:

$$(Gg)_{\cdot,j} = \begin{bmatrix}
-g_{0,j} \\
\tfrac{1}{2}\,(\mathit{cse1} + g_{1,j}) \\
\tfrac{1}{2}\,(\mathit{cse1} - g_{1,j}) \\
g_{2,j}
\end{bmatrix}, \qquad \mathit{cse1} = g_{0,j} + g_{2,j}$$

SLIDE 16

Optimizing Winograd transformations [Code generation]

$$(Gg)_{\cdot,j} = \begin{bmatrix}
-g_{0,j} \\
\tfrac{1}{2}\,(\mathit{cse1} + g_{1,j}) \\
\tfrac{1}{2}\,(\mathit{cse1} - g_{1,j}) \\
g_{2,j}
\end{bmatrix}, \qquad \mathit{cse1} = g_{0,j} + g_{2,j}$$

Before optimizations:

for (i = 0; i < alpha; i++) {
    for (j = 0; j < r; j++) {
        Gg[i][j] = 0;
        for (k = 0; k < r; k++)
            Gg[i][j] += G[i][k] * g[k][j];
    }
}

After optimizations:

for (j = 0; j < 3; j++) {
    cse1 = g[0][j] + g[2][j];
    Gg[0][j] = -g[0][j];
    Gg[1][j] = 0.5 * (cse1 + g[1][j]);
    Gg[2][j] = 0.5 * (cse1 - g[1][j]);
    Gg[3][j] = g[2][j];
}

SLIDE 17

Performance auto-tuning

Tuning knobs:

  • Winograd variant
  • Thread blocking
  • Register blocking
  • Loop-unrolling factor
  • Winograd output tile size

Example tensor-operation kernel fragment (7×7 filter transform):

float g[7][7];
float Gg[8][7];
float tmp[8][8];
const GASQ float *B = filts_ref + (k * 3 + c) * 7 * 7;
for (int i = 0; i < 7; ++i) {
    for (int j = 0; j < 7; ++j) {
        g[i][j] = B[7*j + i];
    }
}

The auto-tuner benchmarks the generated variants and keeps the lowest-runtime kernel.

SLIDE 18

Winograd convolution accuracy

Figure: L1-norm error (10⁻⁷ to 10⁻¹, log scale) and error-increase rate versus internal tile size α = 4…16; the configuration with the lowest error growth is annotated.

  • L1-norm error analysis for various Winograd internal tile sizes

SLIDE 19

Winograd transformation optimization results

Figure: arithmetic reduction ratio (0–0.6) of the transformation steps alone and of the whole Winograd algorithm for F(2,r)…F(9,r) with r = 3, 5, 7, and runtime (ms) of non-optimized vs. optimized kernels for 3×3, 5×5, and 7×7 convolutions, with speedup ratios of roughly 1.0–1.4×.

  • Overall arithmetic reduction ratios related to the transformation steps and the whole Winograd algorithm for a single tile
  • Runtime comparison on an Nvidia GTX 1080 Ti
SLIDE 20

Performance portability [Nvidia GPU]

  • Runtime comparison of Winograd kernels generated by our method with the cuDNN vendor library on an Nvidia GTX 1080 Ti

Figure: runtime (ms, log scale) of cuDNN Fastest, Boda No-Winograd, cuDNN Winograd, and Boda Winograd across workloads (1.0×10⁸ to 4.5×10⁹), plus the Boda Winograd / cuDNN Winograd ratio; up to 8.1× speedup.

SLIDE 21

Performance portability [AMD GPU]

  • Runtime comparison of Winograd kernels generated by our method with the MIOpen vendor library on an AMD Radeon RX 580

Figure: runtime (ms, log scale) of MIOpen Fastest, Boda No-Winograd, MIOpen Winograd, and Boda Winograd across workloads (1.0×10⁸ to 4.5×10⁹), plus the Boda Winograd / MIOpen Winograd ratio; up to 1.9× speedup.

SLIDE 22

Performance portability [Mobile GPU]

  • Validated the effect of auto-tuning on a Mali G71 GPU
  • ARM Compute Library as the baseline

Figure: runtime (ms, log scale) of ARM Compute Library Winograd, Boda without auto-tuning, and Boda with auto-tuning across workloads (1.0×10⁸ to 4.5×10⁹), plus the auto-tuning / no-auto-tuning speedup; up to 1.74× speedup.

SLIDE 23

Conclusion

  • Symbolic analysis → optimizing the Winograd transformation steps
  • Meta-programming → enhancing the performance portability of Winograd convolutions

1. Efficient Winograd convolution is tricky to implement
2. With α = 8: largest arithmetic reduction at acceptable accuracy
3. Performance portability demonstrated on three different GPU architectures

SLIDE 24

Questions?

Contact me if you are interested: Arya Mazaheri <mazaheri@cs.tu-darmstadt.de>

Image designed by vectorpouch / Freepik
