SWIRL++: Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks



SLIDE 1

SWIRL++

  • T. Rusira et al

LCPC’19 Convolution System Overview Model Guided Optimization

Code Variants Memory Cost Space Pruning Heuristics Unrolling

Evaluation

Performance Empirical Stability

Conclusion References Appendices

SWIRL experiments SuRF

SWIRL++: Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks

Tharindu Rusira (1), Anand Venkat (2), Raj Barik (3), Mary Hall (1)

  • 1. University of Utah
  • 2. Intel Labs
  • 3. Uber Technologies Inc.

LCPC’19 22 Oct 2019

1 / 29

SLIDE 2

Overview

◮ Convolution
◮ System Overview
◮ Model Guided Optimization
  • Code Variants
  • Memory Cost
  • Space Pruning Heuristics
  • Unrolling
◮ Evaluation
  • Performance
  • Empirical Stability
◮ Conclusion
◮ Appendices
  • SWIRL experiments
  • SuRF

SLIDE 3

2D-Convolution

https://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html

SLIDE 4

2D-Convolution

SLIDE 5

2D-Convolution

SLIDE 6

2D-Convolution

SLIDE 7

SWIRL [4]

Provides compiler optimizations for Latte.py [2] through transformation recipes.

  • 2. https://github.com/IntelLabs/Latte.py
  • 4. Venkat, A. et al. SWIRL: High-performance many-core CPU code generation for deep neural networks. IJHPCA, 33(6)

SLIDE 8

Transformation Recipes

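$I \in \mathbb{R}^{N \times C_B \times H \times W \times C}$, $W \in \mathbb{R}^{K_B \times C_B \times P \times Q \times K \times C}$, $O \in \mathbb{R}^{N \times K_B \times P \times Q \times K}$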
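As a rough illustration of the blocked layouts above, the NumPy sketch below converts a plain (N, C, H, W) input tensor into an (N, C_B, H, W, C) layout. It assumes that C_B counts channel blocks and that the trailing C is the block size; the helper name `to_blocked_layout` and the block size used in the example are illustrative and are not taken from SWIRL or Latte.py.

```python
import numpy as np


def to_blocked_layout(x, block):
    """Convert an (N, C, H, W) array to (N, C // block, H, W, block)."""
    n, c, h, w = x.shape
    if c % block != 0:
        raise ValueError("channel count must be a multiple of the block size")
    # Split C into (C_B, block), then move the block dimension innermost
    # so that one block of channels is contiguous in memory.
    blocked = x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2)
    return np.ascontiguousarray(blocked)


# Example: 8 channels split into 2 blocks of 4.
x = np.arange(2 * 8 * 3 * 3).reshape(2, 8, 3, 3)
y = to_blocked_layout(x, block=4)
```

Element `y[n, cb, h, w, c]` holds `x[n, cb * block + c, h, w]`, so channels within a block become the fastest-varying dimension, which is what makes the innermost loop a natural target for vectorization.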

SLIDE 9

SWIRL’s output

SLIDE 10

SWIRL++

Manual exploration of optimizing transformations does not always guarantee high performance. SWIRL++ is a step towards automation.

SLIDE 11

Code Variants

Given a loop order $L$, apply varying $T_L$, $P_L$, $V_L$ to generate convolution variants with constraints $H_L$. Return the best $k$ variants that minimize the cost.
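The selection step described above can be sketched as a filtered top-k search. This is a minimal sketch, not SWIRL++'s implementation: the function `top_k_variants`, the dictionary encoding of a variant, and the toy cost function are all hypothetical stand-ins for the real transformation parameters and memory cost model.

```python
import heapq
from itertools import product


def top_k_variants(tile_choices, cost, constraints, k):
    """Return the k lowest-cost tile assignments that satisfy every constraint."""
    names = sorted(tile_choices)
    # Enumerate the cross product of all tile-factor choices.
    candidates = (
        dict(zip(names, combo))
        for combo in product(*(tile_choices[n] for n in names))
    )
    # Keep only the variants accepted by every pruning heuristic.
    feasible = (v for v in candidates if all(h(v) for h in constraints))
    return heapq.nsmallest(k, feasible, key=cost)


# Toy example: two tiled dimensions, one constraint, a made-up cost.
choices = {"K": [1, 2, 4], "C": [1, 2, 4]}
constraints = [lambda v: v["K"] * v["C"] <= 8]
toy_cost = lambda v: (v["K"] - 4) ** 2 + 10 * (v["C"] - 2) ** 2

best = top_k_variants(choices, toy_cost, constraints, k=2)
```

Returning k variants rather than one leaves room for a final empirical pass over the short list, since a cost model may rank the truly fastest variant second or third.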

SLIDE 12

Memory cost of Convolution

SLIDE 13

Memory cost of Convolution

SLIDE 14

Memory cost of Convolution

SLIDE 15

Memory cost of Convolution

SLIDE 16

Memory cost of Convolution

SLIDE 17

Memory cost of Convolution

SLIDE 18

Memory cost of Convolution

SLIDE 19

Space Pruning Heuristics

$H_L$ defines a set of heuristics to restrict the search space.

◮ Data layouts and candidate loop orders are selected a priori
◮ Two outermost loops are parallelized with omp parallel for collapse(2)
◮ Feature map dimensions are tiled by SIMDWIDTH and the inner loop is vectorized
◮ K, C, P, Q dimensions are candidates for tiling
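To give a feel for how these heuristics shrink the search space, the sketch below enumerates candidate tile factors for K, C, P and Q. Several points are assumptions made for illustration only: the feature map dimensions are taken to be K and C, SIMDWIDTH is set to 16 (fp32 lanes in a 512-bit register), and tile factors are restricted to exact divisors of each extent. The names `divisors` and `candidate_tilings` are hypothetical.

```python
from itertools import product

SIMD_WIDTH = 16  # assumption: fp32 lanes in one 512-bit vector register


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def candidate_tilings(K, C, P, Q, simd=SIMD_WIDTH):
    """Yield tile-factor dicts for K, C, P, Q after blocking K and C by simd."""
    if K % simd or C % simd:
        raise ValueError("K and C must be multiples of the SIMD width")
    # K and C are tiled at the granularity of SIMD-width blocks.
    extents = {"K": K // simd, "C": C // simd, "P": P, "Q": Q}
    dims = list(extents)
    for combo in product(*(divisors(extents[d]) for d in dims)):
        yield dict(zip(dims, combo))


# A small layer: 64 output channels, 32 input channels, 4 x 4 spatial extent.
tilings = list(candidate_tilings(K=64, C=32, P=4, Q=4))
```

For this small layer the space has 3 × 2 × 3 × 3 = 54 points, few enough for a cost model to score exhaustively.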

SLIDE 20

Loop Unrolling

◮ Unroll candidate loops after tile factors are determined
◮ Unroll factors are derived to fully utilize the register file
◮ If P, Q are candidates for unrolling, the corresponding tile factors p, q are determined such that p × q ≥ REGS to fully hide FMA latency
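One way to read the last rule is as a small search over (p, q) pairs. The sketch below follows the condition exactly as stated on the slide (p × q ≥ REGS) and picks the pair with the smallest product; restricting p and q to divisors of P and Q, the tie-breaking rule, and the function name `unroll_factors` are assumptions added for illustration. REGS defaults to 32, the vector register count of the evaluation platform.

```python
REGS = 32  # vector registers on the evaluation platform


def unroll_factors(P, Q, regs=REGS):
    """Pick divisors (p, q) of (P, Q) with the smallest product p * q >= regs."""
    best = None
    for p in (d for d in range(1, P + 1) if P % d == 0):
        for q in (d for d in range(1, Q + 1) if Q % d == 0):
            if p * q < regs:
                continue  # does not meet the stated condition
            if best is None or p * q < best[0] * best[1]:
                best = (p, q)
    return best  # None when even p = P, q = Q falls short of regs


factors = unroll_factors(28, 28)
```

For a 28 × 28 output this yields p = q = 7 (49 unrolled iterations); a 4 × 4 output cannot satisfy the condition and the function returns `None`.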

SLIDE 21

Performance Results

Platform: dual-socket Intel Xeon Platinum 8280 (Cascade Lake), 2 × 28 cores at 2.7 GHz (max 4.0 GHz), 192 GB memory, 32 KB L1, 1 MB L2, 38.5 MB L3 cache, and 32 512-bit vector registers. Compiler: icc 18.0.1.

SLIDE 22

Empirical Stability

An autotuner based on Search using Random Forest (SuRF) [1, 3] is integrated to traverse the same search space.

Figure panels: (a)-(f) VGG, (g)-(i) Overfeat, (j)-(l) Inception

  • 1. Balaprakash, P. et al. Autotuning in High-Performance Computing Applications. Proceedings of the IEEE 106(11), 2018
  • 3. Nelson, T. et al. Generating efficient tensor contractions for GPUs. ICPP’15.

SLIDE 23

SWIRL++ Top-1 vs. Top-k

Relative speedups of the best among Top-k variants with respect to Top-1
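The metric in this plot can be written down in a few lines. The sketch below assumes the measured runtimes are listed in the model's rank order, so the first entry belongs to the Top-1 variant; the function name and the sample numbers are made up for illustration.

```python
def topk_speedup(runtimes):
    """Speedup of the fastest Top-k variant over the Top-1 variant.

    runtimes: measured times of the Top-k variants in model-rank order,
    so runtimes[0] is the variant the model ranked first.
    """
    if not runtimes:
        raise ValueError("need at least one measured runtime")
    return runtimes[0] / min(runtimes)


# Made-up timings: the model's second choice is actually the fastest.
speedup = topk_speedup([3.0, 2.0, 4.0])
```

A value of 1.0 means the model's first choice was already the fastest of the k variants; larger values show how much an empirical pass over the Top-k list would gain.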

SLIDE 24

Conclusion

◮ A model-driven approach for automating SWIRL optimizations based on cache footprint analysis
◮ Empirical evaluation of the stability and effectiveness of the model
◮ Achieves performance comparable to TF-MKL and the hand-optimized SWIRL compiler on Intel Xeon Cascade Lake
◮ Possible generalizations to extend into other applications/domains

SLIDE 25

References I

[1] Balaprakash, P., Dongarra, J., Gamblin, T., Hall, M., Hollingsworth, J.K., Norris, B., Vuduc, R.: Autotuning in high-performance computing applications. Proceedings of the IEEE 106(11), 2068–2083 (Nov 2018). https://doi.org/10.1109/JPROC.2018.2841200

[2] IntelLabs: Latte.py. https://github.com/IntelLabs/Latte.py

[3] Nelson, T., Rivera, A., Balaprakash, P., Hall, M., Hovland, P.D., Jessup, E., Norris, B.: Generating efficient tensor contractions for GPUs. In: 2015 44th International Conference on Parallel Processing, pp. 969–978. IEEE (2015)

[4] Venkat, A., Rusira, T., Barik, R., Hall, M., Truong, L.: SWIRL: High-performance many-core CPU code generation for deep neural networks. The International Journal of High Performance Computing Applications 33(6) (2019). https://doi.org/10.1177/1094342019866247

SLIDE 26

The End

SLIDE 27

Appendix-A

Figure: SWIRL performance breakdown on Skylake (AVX-512)

SLIDE 28

Appendix-B

Figure: SWIRL performance comparison with LATTE compiler on Broadwell (AVX-2)

SLIDE 29

Appendix-C

Figure: SuRF
