CMSC5743 L02: CNN Accurate Speedup I
Bei Yu
(Latest update: September 28, 2020)
Fall 2020
1 / 31
These slides contain/adapt materials developed by Minsik Cho and Daniel Brand (2017), "MEC: memory-efficient convolution for deep neural network".
[Figure: a 5×5 input activation grid (a–y), a 3×3 weight (1–9), and a 3×3 output activation grid (A–I)]

H: Height of Input Activation
W: Width of Input Activation
R: Height of Weight
S: Width of Weight
P: Height of Output Activation
Q: Width of Output Activation
stride: # of rows/columns traversed per step
padding: # of zero rows/columns added

A = a∗1 + b∗2 + c∗3 + f∗4 + g∗5 + h∗6 + k∗7 + l∗8 + m∗9
I = m∗1 + n∗2 + o∗3 + r∗4 + s∗5 + t∗6 + w∗7 + x∗8 + y∗9

P = (H − R + 2∗pad) / stride + 1
Q = (W − S + 2∗pad) / stride + 1

4 / 31
[Figure: input activation with C channels, K weight filters of size R×S×C, and output activation with K channels, for a batch of N inputs]

H: Height of Input Activation
W: Width of Input Activation
R: Height of Weight
S: Width of Weight
P: Height of Output Activation
Q: Width of Output Activation
stride: # of rows/columns traversed per step
padding: # of zero rows/columns added
C: # of Input Channels
K: # of Output Channels
N: Batch size

5 / 31
[Figure: a convolution computed directly on the input grid]

Direct convolution: No extra memory overhead

6 / 31
Memory hierarchy: Processor, L1$, L2$, Main Memory, Secondary Memory.
Transfer granularity: 4-8 bytes (word) between processor and L1$; 8-32 bytes (block) between caches; 1 to 4 blocks between L2$ and main memory; 1,024+ bytes (disk sector = page) between main memory and secondary memory.

7 / 31
[Figure: im2col lowering of the padded 5×5 input into a 25×9 matrix; multiplying by the 9×1 flattened 3×3 kernel yields the 25×1 output]

8 / 31
1 https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolution_layer/making_faster
10 / 31
[Figure: MEC lowers the padded 7×7 input into a 5×21 matrix with overlapping partitions A-E; each 5×9 submatrix, shifted by stride∗S = 3 columns, is multiplied by the 9×1 kernel to produce one output row (P, Q, R, S, T)]

2 Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: Proc. ICML.

11 / 31
[Figure: the lowered matrices built by MEC (5×21) and im2col (25×9) for the same input and kernel, side by side]

3 Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: Proc. ICML.

12 / 31
13 / 31
1: for all w[i] do
2:
3:
4:
5:
6: end for

14 / 31
[Figure: sparse convolution written as Y = w ∗ X]

16 / 31
4 Jongsoo Park et al. (2017). "Faster CNNs with direct sparse convolutions and guided pruning". In: Proc. ICLR.

17 / 31
[Figure: a flattened weight tensor stored as its non-zero values plus their offsets]

(5th, 7th, 11th elements are non-zero) Offset=5 points to the 5th element

19 / 31
[Figure: input activation (H×W×C), weights (R×S×C×K), output activation (P×Q×K), batch size N]

for (n = 0; n < N; n++) {
  for (k = 0; k < K; k++) {
    for (p = 0; p < P; p++) {
      for (q = 0; q < Q; q++) {
        OA[n][k][p][q] = 0;
        for (r = 0; r < R; r++) {
          for (s = 0; s < S; s++) {
            for (c = 0; c < C; c++) {
              h = p * stride - pad + r;
              w = q * stride - pad + s;
              OA[n][k][p][q] += IA[n][c][h][w] * W[k][c][r][s];
            }
          }
        }
        OA[n][k][p][q] = Activation(OA[n][k][p][q]);
      }
    }
  }
}

20 / 31
[Figure: 1-D convolution example with input activation (a-e), weight (1-3), and output activation (A-C)]

Output Stationary (OS) Dataflow
Weight Stationary (WS) Dataflow

21 / 31
[Figure: 1-D convolution example with input activation (a-e), weight (1-3), and output activation (A-C); the table shows which index is touched in each cycle]

for (q = 0; q < Q; q++) {   // Q = 9
  for (s = 0; s < S; s++) { // S = 4
    OA[q] += IA[q+s] * W[s];
  }
}

22 / 31
[Figure: dataflow animation over the 2-D convolution example, shown in four steps]

23 / 31
[Figure: 1-D convolution example with input activation (a-e), weight (1-3), and output activation (A-C); the table shows which index is touched in each cycle]

24 / 31
[Figure: dataflow animation over the 2-D convolution example, shown in three steps]

25 / 31
[Figure: input activation (H×W×C), weights (R×S×C×K), output activation (P×Q×K)]

for (n = 0; n < N; n++) {
  for (k = 0; k < K; k++) {
    for (p = 0; p < P; p++) {
      for (q = 0; q < Q; q++) {
        OA[n][k][p][q] = 0;
        for (r = 0; r < R; r++) {
          for (s = 0; s < S; s++) {
            for (c = 0; c < C; c++) {
              h = p * stride - pad + r;
              w = q * stride - pad + s;
              OA[n][k][p][q] += IA[n][c][h][w] * W[k][c][r][s];
            }
          }
        }
      }
    }
  }
}

27 / 31
for (n = 0; n < N; n++) {
  for (r = 0; r < R; r++) {
    for (s = 0; s < S; s++) {
      for (c = 0; c < C; c++) {
        for (k = 0; k < K; k++) {
          float curr_w = W[r][s][c][k];
          for (p = 0; p < P; p++) {
            for (q = 0; q < Q; q++) {
              h = p * stride - pad + r;
              w = q * stride - pad + s;
              OA[n][k][p][q] += IA[n][c][h][w] * curr_w;
            }
          }
        }
      }
    }
  }
}

27 / 31
for (n = 0; n < N; n++) {
  for (r = 0; r < R; r++) {
    for (s = 0; s < S; s++) {
      spatial_for (c = 0; c < C; c++) {
        spatial_for (k = 0; k < K; k++) {
          float curr_w = W[r][s][c][k];
          for (p = 0; p < P; p++) {
            for (q = 0; q < Q; q++) {
              h = p * stride - pad + r;
              w = q * stride - pad + s;
              OA[n][k][p][q] += IA[n][c][h][w] * curr_w;
            }
          }
        }
      }
    }
  }
}

27 / 31
for (n = 0; n < N; n++) {
  for (r = 0; r < R; r++) {
    for (s = 0; s < S; s++) {
      for (c_t = 0; c_t < C/16; c_t++) {
        for (k_t = 0; k_t < K/64; k_t++) {
          spatial_for (c_s = 0; c_s < 16; c_s++) {
            spatial_for (k_s = 0; k_s < 64; k_s++) {
              int curr_c = c_t * 16 + c_s;
              int curr_k = k_t * 64 + k_s;
              float curr_w = W[r][s][curr_c][curr_k];
              for (p = 0; p < P; p++) {
                for (q = 0; q < Q; q++) {
                  h = p * stride - pad + r;
                  w = q * stride - pad + s;
                  OA[n][curr_k][p][q] += IA[n][curr_c][h][w] * curr_w;
                }
              }
            }
          }
        }
      }
    }
  }
}

27 / 31
◮ https://youtu.be/3uiEyEKji0M
◮ "We generate schedules for Halide programs using tree search over the space of schedules guided by a learned cost model and optional autotuning. The cost model is trained by benchmarking thousands of randomly-generated Halide programs and schedules. The resulting code significantly outperforms prior work and human experts." 5

5 Andrew Adams et al. (2019). "Learning to optimize halide with tree search and random programs". In: ACM Trans. Graph. 38.4, 121:1-121:12. doi: 10.1145/3306346.3322967.

28 / 31
6 Zhihao Jia, Matei Zaharia, and Alex Aiken (2019). "Beyond Data and Model Parallelism for Deep Neural Networks". In: Proc. MLSys. url: https://proceedings.mlsys.org/book/265.pdf.

29 / 31
7 Tianqi Chen et al. (2018). "Learning to Optimize Tensor Programs". In: Proc. NeurIPS, pp. 3393-3404. url: http://papers.nips.cc/paper/7599-learning-to-optimize-tensor-programs.

30 / 31
8 Lianmin Zheng et al. (2020). "Ansor: Generating High-Performance Tensor Programs for Deep Learning". In: CoRR abs/2006.06762. url: https://arxiv.org/abs/2006.06762.

31 / 31