
slide-1
SLIDE 1

CNVLUTIN: Ineffectual-neuron-free DNN computing

  • J. Albericio, P. Judd, T. Hetherington*,
  • T. Aamodt*, N. E. Jerger, A. Moshovos

Please cite the original source.

slide-2
SLIDE 2

CNVLUTIN: Ineffectual-neuron-free DNN computing

  • J. Albericio
  • P. Judd
  • T. Hetherington*
  • T. Aamodt*
  • N. Enright Jerger
  • A. Moshovos
slide-3
SLIDE 3

DNNs = SIMD Heaven

[Figure: a wide SIMD unit; 100's to 1000's of parallel multiplies (×) feed one adder (+)]


slide-8
SLIDE 8

CNVLUTIN: Smarter SIMD

52% Performance — 2x ED2P, out-of-the-box networks

slide-9
SLIDE 9

Outline

  • 1. What’s a CNN?
  • 2. A wide SIMD design
  • 3. CNVLUTIN: Skipping neurons in a wide SIMD design
  • 4. Evaluation
  • 5. Our approach
slide-10
SLIDE 10

What’s a CNN?

[Figure: an input image passes through 10’s of layers and comes out classified: “Korean mask!”]
slide-12
SLIDE 12

What’s a CNN?

Neurons (Input)

11
slide-13
SLIDE 13

What’s a CNN?

Synapses (Filters) Neurons (Input)
slide-15
SLIDE 15

What’s a CNN?

Neurons (Output)
slide-19
SLIDE 19

What’s a CNN?

Convolution → ReLU → Pool, repeated over 10’s of layers

Korean mask!

slide-20
SLIDE 20

What’s a CNN?

A typical CNN layer chains three stages:

  • Convolution: inner products (× and +)
  • ReLU: negatives to 0
  • Pool: data size reduction
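The three stages above can be sketched in a few lines of Python. This is a toy 1-D version with illustrative names (`convolve`, `relu`, `max_pool`); real CNN layers operate on 3-D neuron arrays:

```python
# Minimal sketch of one CNN layer: convolution (inner products),
# ReLU (negatives to 0), then max-pooling (data size reduction).
# 1-D toy version; real layers are 3-D.

def convolve(neurons, synapses):
    """Slide the filter over the input and take inner products."""
    k = len(synapses)
    return [sum(n * s for n, s in zip(neurons[i:i + k], synapses))
            for i in range(len(neurons) - k + 1)]

def relu(values):
    """ReLU: negatives become 0 -- the source of the runtime zeroes."""
    return [max(0, v) for v in values]

def max_pool(values, width=2):
    """Pooling: keep the max of each window, shrinking the data."""
    return [max(values[i:i + width]) for i in range(0, len(values), width)]

neurons = [1, -2, 3, 0, 2, -1]
synapses = [1, 0, -1]          # a 3-tap filter
out = max_pool(relu(convolve(neurons, synapses)))   # [0, 1]
```

Note how ReLU turns every negative inner product into a zero that the next layer will then multiply by.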
slide-21
SLIDE 21

~90%

Time spent in convolutions

slide-22
SLIDE 22

Lots of Runtime Zeroes

[Bar chart: fraction of zero neurons in multiplications (0.1 to 0.6) for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, and the average]
slide-23
SLIDE 23

Lots of Runtime Zeroes

[Bar chart: fraction of zero neurons in multiplications] Waste of time and energy!!!
slide-24
SLIDE 24

Lots of Runtime Zeroes

[Bar chart: fraction of zero neurons in multiplications] Waste of time and energy!!! Dynamically generated = not predictable
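A quick way to see the waste is to count zero activations at run time. A hypothetical helper (post-ReLU values assumed), since the zeroes depend on the input and cannot be predicted statically:

```python
# Sketch (hypothetical helper): measure the fraction of zero activations
# a layer produces at run time. Every zero still occupies a multiplier
# slot in a plain SIMD design.

def zero_fraction(activations):
    return sum(1 for a in activations if a == 0) / len(activations)

acts = [0, 3, 0, 7, 0, 1, 0, 0]   # post-ReLU activations for one input
frac = zero_fraction(acts)         # 5 of 8 multiplications are wasted
```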
slide-25
SLIDE 25

How to compute DNNs: DaDianNao*

[Unit diagram: SB (eDRAM) holds synapses for Filter 0–Filter 15; NBin feeds Neuron Lane 0–Neuron Lane 15; 16 inner-product units (IP0–IP15), each with multipliers, an adder tree, and activation f, write to NBout]

*Chen et al., MICRO 2014
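One DaDianNao cycle can be sketched as follows. This is a hypothetical down-scaled model (4 lanes instead of 16; `dadiannao_cycle` is an illustrative name) showing why zero neurons still cost multiplier slots:

```python
# Hypothetical sketch of one DaDianNao-style cycle: LANES neuron values
# are broadcast to every filter; each filter multiplies them against its
# own synapses and an adder tree accumulates into its partial sum.

LANES = 4  # 16 in the real design

def dadiannao_cycle(neurons, filters, partial_sums):
    assert len(neurons) == LANES
    for f, synapses in enumerate(filters):
        # All lanes advance in lock-step: zeroes are multiplied anyway.
        partial_sums[f] += sum(n * s for n, s in zip(neurons, synapses))
    return partial_sums

neurons = [2, 0, 1, 0]                 # the zeroes still burn multipliers
filters = [[1, 1, 1, 1], [0, 2, 0, 2]]
sums = dadiannao_cycle(neurons, filters, [0, 0])   # [3, 0]
```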
slide-26
SLIDE 26

Processing in DaDianNao

[Animation: the 16 neuron lanes broadcast neuron values to the 16 synapse lanes of each filter (Filter 0 … Filter 15)]
slide-28
SLIDE 28

Processing in DaDianNao

[Animation: multiplication of corresponding neuron and synapse elements across the 16 neuron lanes and each filter's 16 synapse lanes]
slide-29
SLIDE 29

Zero-skipping in DaDianNao?

[Animation: the incoming neuron stream, zeroes included, feeds the 16 neuron lanes and each filter's 16 synapse lanes]
slide-30
SLIDE 30

Zero-skipping in DaDianNao?

[Animation: zero removal compacts the neuron stream before it reaches the 16 neuron lanes]
slide-32
SLIDE 32

Zero-skipping in DaDianNao?

[Animation: after zero removal, the surviving neurons are multiplied against the synapse lanes of Filter 0 … Filter 15]
slide-33
SLIDE 33

Zero-skipping in DaDianNao?

[Animation: after zero removal, each neuron lane reaches a different position in its stream]

Lanes can no longer operate in lock-step!
slide-34
SLIDE 34

CNVLUTIN: Decoupling Lanes

[Diagram. DaDianNao: the 16 neuron lanes advance together, each feeding one synapse lane in every filter (Filter 0 … Filter 15). CNVLUTIN: Subunit 0 … Subunit 15, each pairing one neuron lane with its own synapse lane across all filters]

slide-35
SLIDE 35

CNVLUTIN: Decoupling Lanes

[Diagram: in each subunit (Subunit 0 … Subunit 15), the neuron lane holds the non-zero neurons together with their offsets; the offset selects the matching synapse in the subunit's synapse lane for Filter 0 … Filter 15]

slide-37
SLIDE 37

CNVLUTIN: Decoupling Lanes

[Animation: each subunit (Subunit 0 … Subunit 15) multiplies its current non-zero neuron by the offset-selected synapse of every filter]
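The decoupled subunit can be sketched as follows, a hypothetical model with illustrative names: each packed entry carries a neuron value plus its offset, and the offset indexes the matching synapse in every filter, so lanes advance independently of one another:

```python
# Hypothetical sketch of one CNVLUTIN subunit: the neuron lane holds only
# non-zero neurons, each paired with its offset within the brick; the
# offset picks the matching synapse in every filter.

def subunit(packed, filters):
    """packed: list of (value, offset); filters: per-filter synapse bricks."""
    sums = [0] * len(filters)
    for value, offset in packed:          # one non-zero neuron per cycle
        for f, synapses in enumerate(filters):
            sums[f] += value * synapses[offset]
    return sums

# Brick [0, 5, 0, 3] packs to [(5, 1), (3, 3)] -- the zeroes never execute.
packed = [(5, 1), (3, 3)]
filters = [[1, 2, 3, 4], [4, 3, 2, 1]]
sums = subunit(packed, filters)           # [22, 18]
```

The result matches the dense inner product with the zeroes kept, but takes only as many cycles as there are non-zero neurons.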

slide-38
SLIDE 38

CNVLUTIN: Ineffectual-neuron Filtering

22

Layer i Layer i+1

slide-39
SLIDE 39

CNVLUTIN: Ineffectual-neuron Filtering

23

Layer i → Encoder → eDRAM → Dispatcher → Layer i+1
slide-40
SLIDE 40

CNVLUTIN: Ineffectual-neuron Filtering

23 Layer i eDRAM Encoder Layer i+1 Dispatcher
slide-41
SLIDE 41

CNVLUTIN: Ineffectual-neuron Filtering

[Diagram: the Encoder rewrites each neuron brick (Brick 0, 1, 2) into ZF (zero-free) form, packing the non-zero values together with their offsets; the packed neurons are stored in eDRAM, and the Dispatcher's offset unit and buffers stream them to the units]
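The ZF packing can be sketched as a simple encode/decode pair (hypothetical helper names; the real encoder works on 16-neuron bricks in hardware):

```python
# Hypothetical sketch of zero-free (ZF) packing: each neuron brick is
# rewritten as its non-zero values plus their offsets, which is what the
# dispatcher later streams to the neuron lanes.

def zf_encode(brick):
    """Return (values, offsets) keeping only the non-zero neurons."""
    pairs = [(v, i) for i, v in enumerate(brick) if v != 0]
    values = [v for v, _ in pairs]
    offsets = [i for _, i in pairs]
    return values, offsets

def zf_decode(values, offsets, size):
    """Inverse transform, to check nothing is lost."""
    brick = [0] * size
    for v, i in zip(values, offsets):
        brick[i] = v
    return brick

brick = [0, 0, 7, 0, 0, 6, 5, 0]
values, offsets = zf_encode(brick)     # ([7, 6, 5], [2, 5, 6])
```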
slide-44
SLIDE 44

CNVLUTIN: Computation Slicing

[Diagram: Neuron Lane 0, Neuron Lane 1, …, Neuron Lane 15]
slide-45
SLIDE 45

Methodology

  • In-house timing simulator: baseline + CNVLUTIN
  • Logic + SRAM: Synthesis on 65nm TSMC
  • eDRAM model: Destiny
  • DNNs: Trained models from Caffe model zoo
25
slide-46
SLIDE 46

Area

Only 4.5% area overhead

26
slide-47
SLIDE 47

Speedup: ineffectual = 0

[Bar chart: speedup (0.5–2x) over the baseline for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, and the geometric mean; higher is better]

slide-49
SLIDE 49

Speedup: ineffectual = 0

[Bar chart: speedup (0.5–2x) per network; higher is better]

1.37x performance on average

slide-50
SLIDE 50

Loosening the Ineffectual Neuron Criterion

28

“If all you have is a hammer, everything looks like a nail” (Maslow’s hammer)

Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33

CNVLUTIN’s criterion so far: ineffectual = zero

slide-51
SLIDE 51

Loosening the Ineffectual Neuron Criterion

Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33

Example: consider a neuron ineffectual if value < 2

“If all you have is a hammer, everything looks like a nail” (Maslow’s hammer)

slide-52
SLIDE 52

Speedup: ineffectual >= 0

1.52x performance, no accuracy lost

[Bar chart: speedup (0.5–2x) for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, and the geometric mean, with two bars per network: “only 0’s” vs. “0’s and more”; higher is better]

slide-55
SLIDE 55

Loosening the Ineffectual Neuron Criterion

Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33

Example: consider a neuron ineffectual if value < 8

“If all you have is a hammer, everything looks like a nail” (Maslow’s hammer)

slide-56
SLIDE 56

Loosening the Ineffectual Neuron Criterion

Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33

Example: consider a neuron ineffectual if value < 32

“If all you have is a hammer, everything looks like a nail” (Maslow’s hammer)

slide-57
SLIDE 57

Trading accuracy for performance

34
SLIDE 59

Trading accuracy for performance

[Plot: trading accuracy for performance, up to +60% performance for a 2% accuracy loss]
slide-60
SLIDE 60

CNVLUTIN: Smarter SIMD

52% Performance — 2x ED2P, no accuracy lost

  • Out-of-the-box networks
slide-61
SLIDE 61

Our Approach

36

Value-Aware Deep Learning Acceleration

  • arXiv, a while ago: Reduced-Precision Strategies for Bounded Memory in DNNs
  • ICS 2016: Proteus: Exploiting Numerical Precision Variability in DNNs
  • Today: CNVLUTIN: Ineffectual-Neuron-Free DNN Computing
  • CAL (to appear): how to offer performance that scales linearly with required numerical precision
  • More things coming soon :-)
slide-62
SLIDE 62

CNVLUTIN: Ineffectual-neuron-free DNN computing

  • J. Albericio, P. Judd, T. Hetherington*,
  • T. Aamodt*, N. E. Jerger, A. Moshovos

Additional Material

Please cite the original source.

slide-63
SLIDE 63

Execution Activity Breakdown

38
slide-64
SLIDE 64

Cnvlutin: Maintaining Wide NM accesses

Dispatcher: reads neuron bricks (up to 16 neurons) and maintains one per neuron lane

[Diagram: a brick buffer gathers neuron blocks (n0, n1, n2, …, n15) from NM Bank 0 through NM Bank 15 and broadcasts them]
slide-65
SLIDE 65

Cnvlutin: Maintaining Wide NM accesses

#1: Partition NM into 16 slices over 16 banks

[Diagram: slices 0–15 feed Neuron Lane 0, Neuron Lane 1, …, Neuron Lane 15]
slide-66
SLIDE 66

Cnvlutin: Maintaining Wide NM accesses

#2: Fetch and maintain one container per slice (a container holds up to 16 non-zero neurons)

[Diagram: Neuron Lane 0, Neuron Lane 1, …, Neuron Lane 15]
slide-67
SLIDE 67

Cnvlutin: Maintaining Wide NM accesses

#3: Keep neuron lanes supplied with one neuron per cycle

[Diagram: Neuron Lane 0, Neuron Lane 1, …, Neuron Lane 15]
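The three dispatcher steps can be sketched as follows (a hypothetical software model with illustrative names; in hardware the containers are refilled from the NM banks as they drain):

```python
from collections import deque

# Hypothetical sketch of the dispatcher: the neuron memory is split into
# per-lane slices, one container of packed (value, offset) neurons is held
# per slice, and each cycle every lane that still has work pops one entry.

def dispatch(slices):
    """slices: per-lane lists of packed (value, offset) entries."""
    containers = [deque(s) for s in slices]       # one container per slice
    schedule = []
    while any(containers):
        # One neuron per lane per cycle; drained lanes get None.
        cycle = [c.popleft() if c else None for c in containers]
        schedule.append(cycle)
    return schedule

slices = [[(5, 1), (3, 3)], [(2, 0)]]             # lane 1 runs out first
sched = dispatch(slices)
```

Because lanes hold different numbers of non-zero neurons, some lanes finish earlier, which is exactly why the lanes must be decoupled.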
slide-68
SLIDE 68

Cnvlutin: Summary

[Unit diagram: Subunit 0 … Subunit 15, each with its own NBin neuron lane and offsets, per-filter synapse lanes from the partitioned SB (eDRAM), multipliers, an adder tree, activation f, an encoder, and NBout to/from central eDRAM]

Decoupled neuron lanes: neuron + coordinate, proceeding independently. Partitioned SB: 16-wide accesses, 1 synapse per filter.
slide-70
SLIDE 70

Cnvlutin: Summary

[Unit diagram, as on the summary slide above]

1-wide neuron lanes; 16-synapse synapse lanes. Decoupled neuron lanes: neuron + coordinate, proceeding independently. Partitioned SB: 16-wide accesses, 1 synapse per filter.