Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity
2020/11 • Shanghai
Cong Guo1, Bo Yang Hsueh2, Jingwen Leng1, Yuxian Qiu1, Yue Guan1, Zehuan Wang2, Xiaoying Jia2, Xipeng Li2, Yuhao Zhu3 and Minyi Guo1
1Shanghai Jiao Tong University, Emerging Parallel Computing Center,
REArch (Resilient and Efficient Architecture) group
2NVIDIA, 3University of Rochester
Img2col
General Matrix Multiplication (GEMM). Convolution operations that dominate computer vision models are converted to GEMM via img2col. NLP models are naturally equivalent to GEMM operations.
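A minimal NumPy sketch of the img2col lowering that turns a convolution into a GEMM (stride 1, no padding; the function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def img2col(x, kh, kw):
    """Unroll convolution input patches into columns so that
    convolution becomes a single GEMM (stride 1, no padding)."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]       # one receptive field
            cols[:, i * out_w + j] = patch.ravel()
    return cols

# Convolution as GEMM: a weight matrix of shape (out_channels, C*kh*kw)
# times img2col(x), then reshape the result to (out_channels, out_h, out_w).
```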
NVIDIA GPU Tensor Core, Google TPU, Cambricon MLU, Huawei Ascend, Alibaba Hanguang
[Song Han et al., NIPS'15]
VGG-16 92.5% Sparsity
Fewer parameters and lower computation cost
Explosive Model Size
Enormous computation cost and memory usage.
Pruning
Model sizes: ResNet-50 (26M), ELMo, GPT-1, BERT-large (340M), GPT-2 (1.5Bn), GPT-2 8B (8.3Bn)
No constraint (1*1 block)
Software: MKL, cuSparse. Hardware: OuterSPACE [HPCA'18], SpArch [HPCA'20]
n*n block
Software: Block-sparse [arXiv'17]: 8x8, 16x16 (CUDA); 32x32, 64x64 (Tensor)
BW is friendly to dense GEMM accelerators.
Fixed sparsity of each vector
Software: Balanced Sparsity [AAAI'19]. Hardware: Bank-Balanced Sparsity [FPGA'19], Sparse Tensor Core [MICRO'19], Tesla A100 [GTC'20]
Regular, structured: low accuracy, high efficiency. Irregular, random: high accuracy, low efficiency. The goal is to balance the two.
[Song Han et al., NIPS'15] [Block-Sparse, arXiv'17] [Balanced Sparsity, AAAI'19] [Sparse Tensor Core, MICRO'19]
Pattern        Core    Library       Speedup  Density
Element Wise   CUDA    cuSparse      0.06x    63%
Vector Wise    CUDA    cuSparse      0.07x    56%
Block Wise     Tensor  Block-sparse  0.33x    50%

GPU: Tesla V100 32GB. Workload: BERT (MNLI). Software: TensorFlow 1.15 (fine-tune); cuBLAS, cuSparse and Block-sparse (inference). Accuracy loss < 1%.
Tile-Wise Sparsity: an algorithm-software co-designed pruning method that reduces DNN latency on existing dense architectures while maintaining high accuracy, without special hardware support.
[GPU (SIMD) architecture diagram: SM with instruction cache, warp scheduler, dispatch unit, register file, CUDA cores (FP/INT), SFU, LD/ST units, operand collector, interconnect network, cache, and DRAM]
Software: CUTLASS (tiling GEMM). Hardware: Tesla V100. GEMM: C = A × B
GEMM dimensions: M, N, K. Tx = G = granularity, Ty = tile length (y), C = pruned column. The key idea of our tile-wise pattern is to prune each B tile with regular row and column pruning. The tiling-based GEMM is widely used in dense GEMM accelerators such as the TPU, not only the GPU.
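A minimal NumPy sketch of per-tile row/column pruning (the magnitude-based selection and the fixed per-tile ratios are illustrative assumptions; the paper ranks weights by importance scores):

```python
import numpy as np

def tile_wise_prune(B, G=128, col_ratio=0.5, row_ratio=0.25):
    """Prune whole columns and rows inside each G-column tile of B,
    so every surviving tile remains a smaller but dense matrix."""
    K, N = B.shape
    pruned = B.copy()
    for start in range(0, N, G):
        tile = pruned[:, start:start + G]          # view into one tile
        # Column pruning: drop the lowest-norm columns of this tile.
        col_norm = np.linalg.norm(tile, axis=0)
        tile[:, np.argsort(col_norm)[:int(tile.shape[1] * col_ratio)]] = 0.0
        # Row pruning: drop the lowest-norm rows of this tile.
        row_norm = np.linalg.norm(tile, axis=1)
        tile[np.argsort(row_norm)[:int(K * row_ratio)], :] = 0.0
    return pruned
```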
Re-organize: after pruning, the tile lengths become uneven (e.g., G + 4, G + 3, G + 2, G − 9).
Pruning flow: pre-trained model → pruning → fine-tune.
Gradual pruning (more details in the paper): importance score, global weight pruning, apriori tuning, and the uneven distribution of EW; a minimal sketch of the importance-score step follows the citations below.
[P. Molchanov et al., CVPR'19] [Song Han et al., NIPS'15]
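A minimal sketch of the importance-score step, assuming the first-order Taylor criterion of Molchanov et al. and a caller-supplied sparsity schedule (both are assumptions about details not spelled out on the slide):

```python
import numpy as np

def taylor_importance(weight, grad):
    """First-order Taylor importance: score = (w * dL/dw)^2."""
    return (weight * grad) ** 2

def gradual_global_prune(weights, grads, step_sparsity):
    """One step of gradual pruning: rank all weights globally by importance
    and zero out the lowest-scoring fraction (step_sparsity in [0, 1])."""
    scores = np.concatenate([taylor_importance(w, g).ravel()
                             for w, g in zip(weights, grads)])
    threshold = np.quantile(scores, step_sparsity)
    return [np.where(taylor_importance(w, g) <= threshold, 0.0, w)
            for w, g in zip(weights, grads)]
```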
Baseline GEMM Tiling
Sparsity in the global view, density in the core computation: execute efficiently with CUTLASS!
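A minimal NumPy sketch of the baseline tiled GEMM structure (tile sizes are illustrative; CUTLASS performs the same decomposition with on-chip tiles):

```python
import numpy as np

def tiled_gemm(A, B, tile_n=128, tile_k=64):
    """Compute C = A @ B as a sum of dense sub-GEMMs over tiles of B."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for n0 in range(0, N, tile_n):
        for k0 in range(0, K, tile_k):
            # Each (k0, n0) tile of B contributes one dense partial product.
            C[:, n0:n0 + tile_n] += A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, n0:n0 + tile_n]
    return C
```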
Preprocessed (offline): re-organize the pruned tiles. Run-time (inside the GEMM kernel): index with offset arrays, e.g. A[offsetA + offsetk[k]], B[offsetB], C[offsetC + offsetn[n]].
Column skipping is efficient; row skipping causes uncoalesced memory access and performance degradation, so we transpose to eliminate the uncoalescing. The transpose is fused with img2col on CNNs; transpose is essentially free on GPU and TPU.
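A minimal NumPy sketch of how one condensed tile is multiplied with column/row skipping via index arrays (the names kept_rows/kept_cols stand in for the offsetk/offsetn arrays above and are illustrative):

```python
import numpy as np

def tile_gemm_skip(A, B_tile_dense, kept_rows, kept_cols, C, tile_start):
    """Multiply one condensed B tile: gather the surviving rows (columns of A),
    run a dense GEMM, and scatter to the surviving output columns of C."""
    A_sub = A[:, kept_rows]                  # M x K' gather (row skipping of B)
    C_tile = A_sub @ B_tile_dense            # dense GEMM on the condensed K' x N' tile
    C[:, tile_start + kept_cols] += C_tile   # column skipping on the output side
    return C
```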
Fused with the transpose on BERT, based on NVIDIA Faster Transformer.
Transpose fusion
Multi-Stream Condensed Tile: concurrent kernel execution. Overlap the computation of different tiles by assigning them to different streams, and rely on the underlying scheduler to maximize resource utilization.
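A hypothetical CuPy sketch of the multi-stream idea, overlapping the GEMMs of differently sized condensed tiles on separate CUDA streams (the function and tile layout are assumptions, not the released kernel):

```python
import cupy as cp

def multi_stream_tile_gemm(A, condensed_tiles):
    """Queue each condensed tile's dense GEMM on its own stream so the
    scheduler can overlap them and keep the GPU busy."""
    streams = [cp.cuda.Stream(non_blocking=True) for _ in condensed_tiles]
    results = []
    for stream, B_tile in zip(streams, condensed_tiles):
        with stream:                 # kernels below are issued on this stream
            results.append(A @ B_tile)
    for stream in streams:
        stream.synchronize()         # wait for all tile GEMMs to finish
    return results
```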
Hardware: NVIDIA Tesla V100 32GB GPU

Sparsity pattern:
Pattern            Core                     Library
Tile Wise (TW)     Tensor-fp16, CUDA-fp32   Tile Sparsity*
Block Wise (BW)    Tensor-fp16              Block-sparse
Element Wise (EW)  CUDA-fp32                cuSparse
Vector Wise (VW)   CUDA-fp32                cuSparse**

*Based on CUTLASS 1.3. **V100 cannot support the sparse tensor core.
DNN models and datasets:
Models          Datasets
BERT-Base       GLUE (MNLI, MRPC, SST, CoLA, RTE, QNLI), SQuAD
VGG-16 (CNN)    ImageNet
NMT (LSTM)      IWSLT English-Vietnamese
In the rest of this section, we focus on the GEMM execution time unless explicitly mentioned otherwise.
Workload:
Pattern   Granularity   Critical Sparsity*
Dense     128           0%
EW        32            ~85%
BW        64            ~85%
TW        64            75%
TW        128           40%
* The critical sparsity is the sparsity at which the pruning method starts to outperform the dense model latency; the lower, the better. At 75% sparsity, TW-128 has an accuracy loss of about 0.9% and 2.4% compared to EW and the baseline dense model, respectively. With only 40% sparsity, TW-128 starts to outperform the dense model latency. BW-64 experiences the most drastic accuracy drop of 4% at 75% sparsity. BW-64 is faster than the dense model only when the sparsity is greater than 85%, which leads to an accuracy loss as high as 10%.
TW exceeds BW in both speedup and model accuracy. G = 128 is sufficient to maintain model accuracy while providing a significant latency reduction for TW.
[Accuracy/BLEU vs. sparsity (%) curves for EW, TW, VW, and BW on BERT-MNLI, BERT-SQuAD, NMT, and VGG]
Workload:
Pattern   Granularity
EW        1 * 1
BW        32 * 32
TW        128
VW        16
EW is the best; BW is the worst. The accuracies of TW and VW are similar when the sparsity is below 70%. At high sparsity (> 70%), TW generally outperforms VW, with the exception of NMT.
BERT accuracy loss < 3%; VGG accuracy loss < 1%; NMT BLEU loss < 1. Tensor cores: TW achieves a 1.95× speedup. CUDA cores: TW achieves a 2.86× speedup. TW achieves a meaningful latency reduction on both tensor cores and CUDA cores owing to its compatibility with dense GEMM, while all other sparsity patterns cause an actual slowdown.
Optimization 1 (transpose): without transpose, performance degradation; with an explicit transpose, ~10% overhead; with the fused transpose, ~2% overhead.
Fused transpose: 2% overhead. End-to-end: 1.61× speedup. GEMM: 2.26× faster than dense and 2.63× faster than without the transpose.
Time (ms)   Dense         W/o Transpose   Explicit Transpose   Fused Transpose
GEMM        32.38 (71%)   37.6            14.29                14.29 (51%)
non-GEMM    12.99         12.99           12.99                13.93
Transpose   -             -               5.18                 -
Total       45.37         50.59           32.46                28.22
Speedup     1             0.9             1.4                  1.61
BERT-Base model with 75% sparsity on tensor core
Without the fused transpose, the explicit transpose adds ~10% overhead.
TW achieves a meaningful speedup on both tensor cores (1.95×) and CUDA cores (2.86×) with high model accuracy, while all other sparsity patterns cause an actual slowdown. The tiling GEMM algorithm is widely used in dense GEMM-based accelerators; in other words, supporting TW on other platforms such as the TPU is feasible.
https://github.com/clevercool/TileSparsity
Proposed a new DNN model sparsity design insight based on the Tile-Wise algorithm-software co-design.