Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming


  1. Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming. A. Mazaheri¹, T. Beringer¹, M. Moskewicz², F. Wolf¹, A. Jannesari³ (¹TU Darmstadt, ²Tesla Inc., ³Iowa State University). EuroSys'20, Heraklion, Crete, Greece, 30.04.2020.

  2. Neural networks are everywhere: object detection, semantic segmentation, autonomous cars, speech recognition, translation, music composition, sentiment analysis, word prediction, intelligent agents.

  3. Convolutional neural networks: a stack of Convolution+ReLU, max-pooling, fully-connected+ReLU, and softmax layers (feature-map visualization; final classifier output 1×1×1000).

  4-8. Convolution & tensors
   • Input tensor: C × H × W; kernel tensor: OC × IC × M × N; output tensor: H′ × W′
   • Each output element is computed in two steps: (1) element-wise multiplication of a kernel with an input window, (2) summation of the products
   • Convolutions dominate computation (>90% of runtime)
   • Similar to generalized matrix-matrix multiply → massive GPU parallelism (a direct-convolution sketch follows)
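To make the shapes and the two-step computation concrete, here is a minimal direct-convolution sketch in Python/NumPy (stride 1, no padding); the function name is hypothetical, while the layouts follow the slide's C × H × W and OC × IC × M × N conventions:

    import numpy as np

    def direct_conv(x, w):
        """x: input tensor C x H x W; w: kernel tensor OC x IC x M x N (IC == C).
        Returns output tensor OC x H' x W' with H' = H-M+1, W' = W-N+1."""
        C, H, W = x.shape
        OC, IC, M, N = w.shape
        assert IC == C
        y = np.zeros((OC, H - M + 1, W - N + 1), dtype=x.dtype)
        for oc in range(OC):
            for i in range(H - M + 1):
                for j in range(W - N + 1):
                    # (1) element-wise multiplication, (2) summation
                    y[oc, i, j] = np.sum(x[:, i:i + M, j:j + N] * w[oc])
        return y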

  9. Winograd convolution F(m, r). Example: F(m = 2×2, r = 3×3).
   Pipeline: input (tiled) → input/filter transformation → element-wise multiplication → output transformation → output. Internal tile size: α = m + r − 1.
   Research questions:
   • Can we reduce the overhead of the Winograd transformations?
   • How do we properly choose the right α?
   • How do we run Winograd efficiently on a wide range of GPU platforms?
   (A worked 1-D F(2, 3) example follows.)
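As a worked 1-D illustration of F(m, r), the sketch below computes F(2, 3): m = 2 outputs from r = 3 filter taps with an internal tile of α = 2 + 3 − 1 = 4, using the standard Lavin-Gray transformation matrices (the paper's generated variants may use different matrices):

    import numpy as np

    # Standard F(2,3) transformation matrices (Lavin & Gray)
    BT = np.array([[1, 0, -1, 0],
                   [0, 1, 1, 0],
                   [0, -1, 1, 0],
                   [0, 1, 0, -1]], dtype=float)
    G = np.array([[1, 0, 0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0, 0, 1]], dtype=float)
    AT = np.array([[1, 1, 1, 0],
                   [0, 1, -1, -1]], dtype=float)

    def winograd_f23(d, g):
        """d: input tile (alpha = 4 values), g: 3-tap filter -> 2 outputs."""
        U = G @ g      # filter transformation
        V = BT @ d     # input transformation
        M = U * V      # element-wise multiplication: 4 multiplies instead of 6
        return AT @ M  # output transformation

    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([1.0, 0.0, -1.0])
    print(winograd_f23(d, g))  # [-2. -2.], matches np.convolve(d, g[::-1], 'valid')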

  10. Winograd code generation workflow. A CNN frontend (Caffe, TensorFlow, …) parses the network description (e.g., myCNN.proto with its conv1, conv2, … layers) into a compute graph (CG). Graph-level optimization annotates and refines the CG, e.g., fusing a convolution with its activation (conv1 + activation). The Winograd convolution code generator then performs per-operation variant selection (direct, non-fused Winograd, fused Winograd), drawing on transformation templates, a database of transformation matrices per Winograd spec. F(m, r), an SGEMM recipe generator, and a template meta-programming library (C++ metacode), e.g., expanding a kernel template such as KERNEL conv(in, filts) // CUCL IN img:chan:y:x with generated bodies like filts_buf[0+tid] = filts[tid]. The result is CUDA/OpenCL/Vulkan kernels, which auto-tuning and hardware variant selection map onto Nvidia GPUs, AMD GPUs, Qualcomm Snapdragon, and new targets.
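The template meta-programming step can be sketched as plain string-template expansion: a kernel template with a named hole (like the %(filts_buf_loads) placeholder on the slide) is filled with generated code. The expansion logic and the unroll parameter below are hypothetical simplifications, not the actual library:

    # Hypothetical sketch of expanding a CUCL kernel template.
    KERNEL_TEMPLATE = """KERNEL conv(in, filts) // CUCL IN img:chan:y:x
    main(cf1, cf2) {
      %(filts_buf_loads)s
    }"""

    def gen_filts_buf_loads(unroll):
        # Emit unrolled filter-buffer loads; 'unroll' is one tuning knob.
        return "\n      ".join(
            f"filts_buf[{i} + tid] = filts[tid + {i}];" for i in range(unroll))

    print(KERNEL_TEMPLATE % {"filts_buf_loads": gen_filts_buf_loads(2)})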

  11. Optimizing Winograd transformations [Symbolic analysis]
   • Represent the target matrix by symbols
   • Perform the multiplication and obtain the result symbolically
   For the filter transformation Gg (α = 4, r = 3) with

   G = [ -1    0    0
          1/2  1/2  1/2
          1/2 -1/2  1/2
          0    0    1 ]

   each column j of the symbolic product expands to

   [ -1×g[0][j] + 0 + 0
     1/2×g[0][j] + 1/2×g[1][j] + 1/2×g[2][j]
     1/2×g[0][j] - 1/2×g[1][j] + 1/2×g[2][j]
     0 + 0 + 1×g[2][j] ]

   Matrix-multiplication code before optimization:

   for (i = 0; i < alpha; i++) {
     for (j = 0; j < r; j++) {
       res[i][j] = 0;
       for (k = 0; k < r; k++)
         res[i][j] += G[i][k] * g[k][j];
     }
   }
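A sketch of this symbolic analysis using SymPy (the authors' actual symbolic engine may differ): build G with exact rational entries and multiply by a symbolic filter column. Additions of 0 and multiplications by ±1 fold away automatically, which already yields the simplified form shown on the next slide:

    import sympy as sp

    g0, g1, g2 = sp.symbols('g0 g1 g2')  # one symbolic filter column
    half = sp.Rational(1, 2)
    G = sp.Matrix([[-1, 0, 0],
                   [half, half, half],
                   [half, -half, half],
                   [0, 0, 1]])
    Gg = G * sp.Matrix([g0, g1, g2])
    print(Gg)  # [-g0, g0/2 + g1/2 + g2/2, g0/2 - g1/2 + g2/2, g2]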

  12. Optimizing Winograd transformations [Remove 1,0s]
   Dropping additions of 0 and multiplications by 1 (with -1 kept as a sign flip) simplifies each column to

   [ -g[0][j]
     1/2×g[0][j] + 1/2×g[1][j] + 1/2×g[2][j]
     1/2×g[0][j] - 1/2×g[1][j] + 1/2×g[2][j]
     g[2][j] ]

  13. Optimizing Winograd transformations [Index representation]
   Every column has the same structure, so the expressions are represented once with a symbolic column index j instead of being expanded for each of the r columns:

   [ -g[0][j]
     1/2×g[0][j] + 1/2×g[1][j] + 1/2×g[2][j]
     1/2×g[0][j] - 1/2×g[1][j] + 1/2×g[2][j]
     g[2][j] ]

  14. Optimizing Winograd transformations [Factorization]
   Factor out the common coefficient 1/2:

   1/2×g[0][j] + 1/2×g[2][j] + 1/2×g[1][j]  =  1/2 (g[0][j] + g[2][j] + g[1][j])
   1/2×g[0][j] + 1/2×g[2][j] - 1/2×g[1][j]  =  1/2 (g[0][j] + g[2][j] - g[1][j])

  15. Optimizing Winograd transformations [Common subexpression elimination]
   The subterm g[0][j] + g[2][j] appears twice; extracting it as cse1 = g[0][j] + g[2][j] gives

   [ -g[0][j]
     1/2 (cse1 + g[1][j])
     1/2 (cse1 - g[1][j])
     g[2][j] ]

   (A SymPy sketch of factorization and CSE follows.)
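Continuing the SymPy sketch, factorization and common-subexpression elimination can be reproduced with sp.factor and sp.cse; this illustrates the technique and is not the paper's implementation:

    import sympy as sp

    g0, g1, g2 = sp.symbols('g0 g1 g2')
    rows = [-g0,
            sp.factor(g0/2 + g1/2 + g2/2),  # -> (g0 + g1 + g2)/2
            sp.factor(g0/2 - g1/2 + g2/2),  # -> (g0 - g1 + g2)/2
            g2]
    repl, reduced = sp.cse(rows)
    print(repl)     # typically [(x0, g0 + g2)], i.e. the slide's cse1
    print(reduced)  # typically [-g0, (g1 + x0)/2, (-g1 + x0)/2, g2]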

  16. Optimizing Winograd transformations [Code generation]
   The optimized expressions (with cse1 = g[0][j] + g[2][j]) are emitted as straight-line code.

   Before optimizations:

   for (i = 0; i < alpha; i++) {
     for (j = 0; j < r; j++) {
       Gg[i][j] = 0;
       for (k = 0; k < r; k++)
         Gg[i][j] += G[i][k] * g[k][j];
     }
   }

   After optimizations:

   for (j = 0; j < r; j++) {
     cse1 = g[0][j] + g[2][j];
     Gg[0][j] = -g[0][j];
     Gg[1][j] = 0.5*(cse1 + g[1][j]);
     Gg[2][j] = 0.5*(cse1 - g[1][j]);
     Gg[3][j] = g[2][j];
   }
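The final emission step can likewise be sketched with SymPy's C printer; the variable and array names below are hypothetical, and the paper's meta-programming library works differently:

    import sympy as sp

    g0, g1, g2 = sp.symbols('g0 g1 g2')
    rows = [-g0, (g0 + g2 + g1) / 2, (g0 + g2 - g1) / 2, g2]
    repl, reduced = sp.cse(rows)
    for name, expr in repl:             # shared terms, e.g. cse1 = g0 + g2
        print(f"float {name} = {sp.ccode(expr)};")
    for i, expr in enumerate(reduced):  # one straight-line assignment per row
        print(f"Gg[{i}][j] = {sp.ccode(expr)};")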

  17. Performance auto-tuning
   Tuning knobs:
   • Winograd variant
   • Thread blocking
   • Register blocking
   • Loop unrolling factor
   • Winograd output tile size
   Example tensor-operation kernel fragment (7×7 filter transformation load):

   float g[7][7];
   float Gg[8][7];
   float tmp[8][8];
   const GASQ float *B = filts_ref + (k * 3 + c) * 7 * 7;
   for (int i = 0; i < 7; ++i) {
     for (int j = 0; j < 7; ++j) {
       g[i][j] = B[7*j + i];
     }
   }

   The tuner benchmarks such candidate kernels and keeps the lowest-runtime one (see the sketch below).
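A minimal sketch of how tuning over these knobs can work: benchmark every configuration in a small grid and keep the fastest. The knob values and the compile_and_run helper (with its dummy cost model) are hypothetical:

    import itertools

    # Hypothetical tuning space mirroring the knobs above
    knobs = {
        "variant":      ["non-fused", "fused"],
        "thread_block": [(8, 8), (16, 16)],
        "reg_block":    [2, 4],
        "unroll":       [1, 2, 4],
        "tile_m":       [2, 4, 6],  # Winograd output tile size
    }

    def compile_and_run(cfg):
        """Hypothetical: build and benchmark the kernel for cfg.
        A real tuner would time the generated kernel; here a dummy cost."""
        return 1.0 / (cfg["reg_block"] * cfg["unroll"])

    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*knobs.values()):
        cfg = dict(zip(knobs.keys(), values))
        t = compile_and_run(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    print(best_cfg, best_time)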

  18. Winograd convolution accuracy
   • L1-norm error analysis for various Winograd internal tile sizes
   [Plot: L1-norm error (10⁻⁷ to 10⁻¹, log scale) and error increase rate (2 to 6) versus α = 4 … 16; error grows with α, and the point of lowest error growth is marked.]

  19. Winograd transformation optimization results
   • Overall arithmetic reduction ratios related to the transformation steps and the whole Winograd algorithm for a single tile, α = 8
   [Plot: reduction ratios (0.0 to 0.6) for F(2,3) … F(10,3), F(2,5) … F(10,5), and F(2,7) … F(10,7), shown for the transformations alone and for the whole Winograd algorithm.]
   • Runtime comparison on an Nvidia 1080 Ti
   [Plot: non-optimized vs. optimized runtimes (ms) and speedup ratios for 3×3, 5×5, and 7×7 convolutions across F(2,3) … F(9,3), F(2,5) … F(9,5), and F(2,7) … F(9,7).]
