SLIDE 1

Riptide: Fast End-to-End Binarized Neural Networks

Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwetak Patel

SLIDE 2

Canziani et al., “An analysis of deep neural network models for practical applications.” 2016

SLIDE 3
  • Quantize floats to +/-1, e.g. 1.122 * -3.112 ==> 1 * -1
  • Notice:
  • 1 * 1 = 1
  • 1 * -1 = -1
  • -1 * 1 = -1
  • -1 * -1 = 1
  • Replacing -1 with 0, this is just XNOR
  • Retrain the model to convergence

[Figure: 64 floats (1.2, 3.12, -11.2, 3.4, -2.12, -132.1, …, 0.2, -121.1, …) quantized and packed into a single word (0b110100…1, i.e. 0xD0…): 64 floats become 64 bits.]

A[:64] · W[:64] == 2 * popc(A_bits XNOR W_bits) - 64, where A_bits and W_bits are the 64 packed sign bits

1-bit Matrix Operations

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016
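To make the identity concrete, here is a minimal C sketch (ours, not from the slides) that packs +/-1 values into a 64-bit word and evaluates the dot product with XNOR and popcount; __builtin_popcountll stands in for the hardware popcount instruction.

#include <stdint.h>

/* Pack 64 +/-1 values into one word: bit i set means x[i] == +1. */
uint64_t pack64(const float *x) {
    uint64_t bits = 0;
    for (int i = 0; i < 64; i++)
        if (x[i] > 0) bits |= (uint64_t)1 << i;
    return bits;
}

/* Dot product of two 64-element +/-1 vectors from their packed bits:
   matches minus mismatches = 2 * popcount(xnor) - 64. */
int dot64(uint64_t a, uint64_t w) {
    return 2 * __builtin_popcountll(~(a ^ w)) - 64;
}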

SLIDE 4

Full precision:
    float x[], y[], w[];
    ...
    for i in 1…N:
        y[j] += x[i] * w[i];

Binary:
    unsigned long x_b[], y[], w_b[];
    ...
    for i in 1…N/64:
        y[j] += 2*popc(not(x_b[i] xor w_b[i])) - 64;

2N ops vs. 3N/64 ops, i.e. 2N / (3N/64) ≈ 43, so roughly ~40x faster and 32x smaller weights. Typically lose ~10% accuracy.

1-bit Matrix Operations: Cost/Benefit

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 5

32x smaller

1-bit Matrix Operations: Cost/Benefit

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 6

~40x faster

1-bit Matrix Operations: Cost/Benefit


SLIDE 7

1-bit Matrix Operations: Cost/Benefit


Runtime: 380 ms (Unoptimized Binary Network) vs. 1904 ms (Full Precision Baseline)

SLIDE 8

Implementation Challenges

  • CPUs have no native support for low-bit data types (e.g. uint1, uint2), so we need to work on packed data
  • Optimizations must be implemented from scratch and tuned for a specific CPU
  • No optimized linear algebra libraries like BLAS to leverage, unlike conventional deep learning, which has optimized libraries and hardware support
  • Baselines are incredibly well optimized

SLIDE 9

Are Binary Networks Actually Fast?

The majority of work on binarization is simulated.

  • Which binarization techniques can be implemented efficiently?
  • What are the runtime bottlenecks in a binary model?
  • How do I deploy a fast binary model on my platform?

To address these questions we introduce Riptide.


SLIDE 10


A one-stop solution to training and deploying fast binary networks on a variety of hardware platforms.

  • Addresses implementation issues in mixed polarity quantization
  • Introduces the Fused Glue operation, removing all floating-point arithmetic from binary models
  • Provides high-performance bitserial operators through TVM
  • Yields 4-12X speedups across various models and bitwidths while maintaining state-of-the-art accuracy
  • Available open-source today at github.com/jwfromm/Riptide
SLIDE 11

Implementing Binary Layers

[Diagram: features (float array) ⊛ kernels (float array) = activations (float array), computed with multiply-accumulate.]

SLIDE 12

Implementing Binary Layers

[Diagram: same as above, but the input features (float array) now pass through a quantizer QA to become an int array before the multiply-accumulate.]

SLIDE 13

Implementing Binary Layers

[Diagram: both inputs are now quantized: features pass through QA and kernels pass through QW, giving int arrays on both sides of the multiply-accumulate.]

SLIDE 14

Implementing Binary Layers

[Diagram: the fully binarized layer: quantized features (int array, via QA) and quantized kernels (int array, via QW) are combined with a bitserial accumulate, producing activations as an int array.]

SLIDE 15

Quantization Polarity

Bipolar Quantization
  • Quantization function: Q(y) = sign(y), with values {-1, +1}
  • Implemented with bitwise-xnor and popcount
  • Well-suited for weights, which represent correlation (1) or inverse-correlation (-1)

Unipolar Quantization
  • Quantization function: Q(y) = (y > 0), with values {0, 1}
  • Implemented with bitwise-and and popcount
  • Well-suited for activations, which represent pattern-match (1) or no pattern-match (0)
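In C-like terms (a sketch of ours, not the slides' code), the two polarities reduce to different one-word bit kernels:

#include <stdint.h>

/* bipolar x bipolar (+1/-1 values): matches minus mismatches */
int dot_bipolar(uint64_t a, uint64_t w) {
    return 2 * __builtin_popcountll(~(a ^ w)) - 64;
}

/* unipolar x unipolar (0/1 values): count positions where both bits are 1 */
int dot_unipolar(uint64_t a, uint64_t w) {
    return __builtin_popcountll(a & w);
}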

SLIDE 16

Quantization Polarity

  • XnorNet (all bipolar) -> 44.2% accuracy
  • DorefaNet (bipolar weights, unipolar activations) -> 50.0% accuracy

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016

[Example: A (unipolar): 1 1 0 1 0 1 …; W (bipolar): 0 1 0 0 1 0 …; expected elementwise products: 0 1 0 -1 0 -1 …]

Because a 0 bit means different things under the two polarities, mixed polarity is not implementable with a single xnor- or and-popcount.

SLIDE 17

Mixed Polarity Operation


Count the bit multiplications whose output should be +1, then subtract the cases whose output should be -1.

  • Enables mixed polarity binary networks (see the sketch after this list)
  • Doubles the amount of inner-loop compute but does not require additional memory operations
  • Mixed polarity may offer compelling points on the speedup-versus-accuracy curve compared to pure bipolar
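A minimal sketch of that inner loop (assuming unipolar activation bits and bipolar weight bits; names are ours): one and-popcount counts the products that should be +1, a second counts those that should be -1.

#include <stdint.h>

/* a: unipolar activation bits (1 => value 1, 0 => value 0)
   w: bipolar weight bits      (1 => +1,      0 => -1)      */
int dot_mixed(uint64_t a, uint64_t w) {
    int pos = __builtin_popcountll(a &  w);   /* products that should be +1 */
    int neg = __builtin_popcountll(a & ~w);   /* products that should be -1 */
    return pos - neg;   /* twice the bitwise work, no extra memory traffic */
}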
SLIDE 18

Multibit Quantization

Quantization function: Q(y) = linear(y)

[Figure: uniformly spaced quantization levels, e.g. 0.3, 0.6, ….]

  • Translates naturally to an integer representation (see the sketch below)
  • Does not necessarily fit the activation distribution

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
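A minimal sketch of n-bit linear quantization in this style (the clamp range and function name are our assumptions): clamp to [0, 1], then round onto 2^n - 1 uniform steps, which maps directly to an integer code.

#include <math.h>
#include <stdint.h>

/* Quantize y to an n-bit unsigned integer code on a uniform [0, 1] grid. */
uint32_t quantize_linear(float y, int n) {
    float levels  = (float)((1u << n) - 1);
    float clamped = fminf(fmaxf(y, 0.0f), 1.0f);
    return (uint32_t)lrintf(clamped * levels);   /* code in 0 .. 2^n - 1 */
}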

SLIDE 19

Multibit Quantization

Quantization function: Q(y) = HWGQ(y)

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

[Figure: non-uniform quantization levels, e.g. 0.2, 0.6, 1.1, 2.1.]

  • Better fit for a Gaussian activation distribution
  • Not implementable with bitserial operations
SLIDE 20

Multibit Quantization

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

Each HWGQ level (0.2, 0.6, 1.1, 2.1) is identified by a unique bit pair (00, 01, 10, 11) rather than by a weighted sum of its bits (01 + 10 ≠ 11), and those unique bit combinations are lost during popcount.

SLIDE 21

Implementing Binary Layers

[Diagram: quantized features (int array, via QA) and quantized kernels (int array, via QW) produce activations (int array) through a bitwise accumulate; example layer dimensions 128 x 128 x 256.]

  • QW: 1-bit bipolar quantization
  • QA: N-bit linear bipolar or unipolar quantization
  • Accumulation: xnor-popcount / mixed-polarity popcount
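Putting the pieces together, a sketch (ours) of a bitserial dot product for N-bit unipolar activations against 1-bit bipolar weights: each activation bit-plane is combined with the weight bits using the mixed-polarity kernel and weighted by its place value. Buffer layout and names are illustrative.

#include <stdint.h>

/* a_planes[b] holds bit b of 64 consecutive activations; w holds 64 bipolar weight bits. */
int dot_bitserial(const uint64_t *a_planes, int nbits, uint64_t w) {
    int acc = 0;
    for (int b = 0; b < nbits; b++) {
        int pos = __builtin_popcountll(a_planes[b] &  w);
        int neg = __builtin_popcountll(a_planes[b] & ~w);
        acc += (pos - neg) << b;   /* weight bit-plane b by 2^b */
    }
    return acc;
}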

SLIDE 22

Implementing Binary Models

[Diagram: a full-precision network (Conv/Dense layers) alongside its binary counterpart (QConv/QDense layers).]

SLIDE 23

Implementing Binary Models

[Diagram: in the binary network, each QConv is surrounded by floating-point "glue" layers: BatchNorm, Dequantize, WeightScale, Activation, Quantize, and Bitpack.]

Computational complexity (per output tensor of size HWF):
  • BatchNorm: 4HWF
  • Dequantize: HWF
  • WeightScale: 4HWF
  • Activation: HWF
  • Quantize: 5HWF
  • Bitpack: 3HWF
  • QConv: NKKFHWC / 43 (the full-precision convolution cost divided by the ~43x binary speedup)
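As a rough illustration (our numbers, not the slide's): for a 3x3 binary convolution with C = F = 256 channels, the core compute is about K*K*C/43 ≈ 9*256/43 ≈ 54 operations per output element, while the glue layers add 4 + 1 + 4 + 1 + 5 + 3 = 18, roughly a quarter of the total; for a thinner layer with C = 64, the glue (18) actually exceeds the binary convolution (≈ 13).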

SLIDE 24

Estimated Impact of Glue Layers

  • The impact of glue layers is too high
  • We must derive binarized glue for decent end-to-end speedups

SLIDE 25

Weight Scaling

β = mean(|W|)
y = β · a, where a is the output of the binary convolution (QConv)

  • Introduced in XnorNet
  • Allows scale of weights to be preserved
  • Brought accuracy from 27% to 44%
  • Now used ubiquitously

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 26

Quantized Weight Scaling

β = mean(|W|)
y = β · a, with β replaced by AP2(β) so the multiply becomes a shift

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

  • Use the approximate power of 2 (AP2) of the scale (sketched below)
  • Replaces the multiply with a bitwise shift
  • Constant at inference time
  • Requires only a single instruction
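A small sketch of the AP2 idea (function names are ours): the scale exponent is rounded to the nearest power of two offline and applied to the integer accumulator at inference time with a single shift.

#include <math.h>
#include <stdint.h>

/* Offline: nearest power-of-two exponent of the weight scale beta = mean(|W|). */
int ap2_exponent(float beta) {
    return (int)lrintf(log2f(beta));
}

/* Inference: scale an integer accumulator by 2^e with one shift. */
int32_t apply_scale(int32_t acc, int e) {
    return e >= 0 ? acc << e : acc >> -e;   /* arithmetic shift on most targets */
}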
SLIDE 27

BatchNormalization

  • Centers and scales output activations
  • Essential for quantization, used in all binary techniques
  • Must derive quantized versions of both centering and scaling

μ = (1/N) Σ_i a_i
σ² = (1/N) Σ_i (a_i - μ)²
â_i = (a_i - μ) / sqrt(σ² + ε)

Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” 2015

SLIDE 28

Binary Scaling

  • We can simply compute the AP2 of the standard deviation, so the division by σ becomes a bitwise shift

SLIDE 29

Binary Center

To add a constant to a binarized tensor, we must use Fixed Point Quantization (FPQ) with the same number of bits and the same scale.

[Figure: a packed bit string (1 1 1 1 0 1 1 0 …) forming the N-bit input to the next layer, followed by wb fractional bits with place values 1/2, 1/4, 1/8, ….]

B = N + wb
S = 1 + Σ_{i=1..wb} [1/(2^N - 1)] · (1/2)^i = 1 + [1/(2^N - 1)] · (1 - 1/2^wb)
μ̂ = FPQ(μ, B, S)

SLIDE 30

Fused Glue Operation

B = N + wb
S = 1 + [1/(2^N - 1)] · (1 - 1/2^wb)
μ̂ = FPQ(μ, B, S)

This is the fused glue operation. All terms are constant at runtime except the activation a, and it requires only two integer operations.
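A minimal sketch (ours) of what the fused glue reduces to at runtime: the FPQ'd batch-norm center and the AP2 shift (folding the weight scale and 1/σ) are precomputed offline, so each accumulator needs only an integer add and a shift. All names are illustrative.

#include <stdint.h>

typedef struct {
    int32_t center;   /* batch-norm center, FPQ'd to the layer's bits and scale */
    int     shift;    /* AP2 exponent folding the weight scale and 1/sigma      */
} fused_glue_t;

/* Two integer operations per activation; clipping and bit-packing follow. */
static inline int32_t fused_glue(int32_t acc, fused_glue_t g) {
    return (acc + g.center) >> g.shift;
}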

SLIDE 31

Fully Binarized Network

Traditional Binary Network glue (per QConv): BatchNorm 4HWF + Dequantize HWF + WeightScale 4HWF + Activation HWF + Quantize 5HWF + Bitpack 3HWF = 18HWF total

Fully Binarized Network glue (per QConv): Fused Glue 2HWF + Clip HWF + Bitpack 3HWF = 6HWF total

  • 3X fewer glue operations
  • No floating-point data
  • No multiplication or division
SLIDE 32

FBN Accuracy

  • Our system is comparable to state-of-the-art techniques
  • Unipolar quantization yields higher accuracies, as expected
  • Effective across various models

SLIDE 33

Measurement Platform


Raspberry Pi ARM Cortex-A53

  • Widely available and inexpensive
  • Representative of IoT devices (Qualcomm Snapdragons, Azure Sphere)
  • Resource constrained / in need of acceleration
SLIDE 34

Separates computation and implementation into a declaration and a schedule. Schedules contain knobs that are tuned for the backend.

TVM: an optimizing deep learning compiler.

[Diagram: the Tensor Expression Language defines operators; AutoTVM optimizes tensor operators by searching the schedule optimization space.]

Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” 2018

SLIDE 35

TVM Schedule Intrinsics

  • Tiling: break computation into chunks for better locality
  • Vectorization: use hardware SIMD instructions for more efficient operation execution
  • Parallelization: leverage MIMD facilities such as multiple cores
  • Loop Unrolling: replicate the body of loops to reduce overhead (see the sketch below)

Chen et al., “Learning to Optimize Tensor Programs.” 2019
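As a plain-C illustration of what these transformations do to an inner loop (hand-written by us, not TVM-generated code): the accumulation is tiled into cache-friendly chunks and the innermost loop is unrolled; a vectorizing compiler can then map the unrolled body onto SIMD lanes.

#include <stdint.h>

#define TILE 256   /* illustrative tile size; this is the kind of knob AutoTVM tunes */

/* Accumulate xnor-popcount over nwords packed words (assumes nwords % TILE == 0). */
int64_t popcount_accumulate_tiled(const uint64_t *a, const uint64_t *w, int nwords) {
    int64_t total = 0;
    for (int t = 0; t < nwords; t += TILE) {              /* tiling */
        int64_t acc = 0;
        for (int i = t; i < t + TILE; i += 4) {           /* unrolled by 4 */
            acc += __builtin_popcountll(~(a[i + 0] ^ w[i + 0]));
            acc += __builtin_popcountll(~(a[i + 1] ^ w[i + 1]));
            acc += __builtin_popcountll(~(a[i + 2] ^ w[i + 2]));
            acc += __builtin_popcountll(~(a[i + 3] ^ w[i + 3]));
        }
        total += acc;
    }
    return total;   /* matching-bit count; rescale to a +/-1 dot with 2*total - 64*nwords */
}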

SLIDE 36

Fast Popcount

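The slide's details are not preserved in this transcript; as an illustration, one standard way to speed up popcount on ARM NEON (which may or may not be exactly what Riptide does) is to take per-byte popcounts with vcnt and defer the horizontal reduction with pairwise widening accumulates.

#include <arm_neon.h>
#include <stdint.h>

/* xnor-popcount over `bytes` bytes of packed data (assumes bytes % 16 == 0).
   Note: the uint16 lanes can saturate on very long rows; widen periodically. */
uint32_t xnor_popcount_neon(const uint8_t *a, const uint8_t *w, int bytes) {
    uint16x8_t acc = vdupq_n_u16(0);
    for (int i = 0; i < bytes; i += 16) {
        uint8x16_t x = vmvnq_u8(veorq_u8(vld1q_u8(a + i), vld1q_u8(w + i)));  /* xnor */
        acc = vpadalq_u8(acc, vcntq_u8(x));   /* per-byte popcount, widen-accumulate */
    }
    return vaddvq_u32(vpaddlq_u16(acc));      /* horizontal sum (AArch64) */
}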

SLIDE 37

[Diagram: the binary convolution pipeline. Int-N bit-packed activations feed BinaryConv, which produces Int-16 popcount accumulations (2NHWC bytes); a fused shift/scale produces Int-16 quantized pre-packed values, and a separate Bit Pack stage compresses them into Int-N bit-packed outputs (a small fraction of the Int-16 buffer size).]

SLIDE 38

Bitpack Fusion

[Diagram: the same pipeline with the shift/scale and Bit Pack stages fused into the BinaryConv epilogue, so the large Int-16 intermediate buffers (2NHWC bytes) are never written out to memory; only the Int-N bit-packed activations and outputs are materialized.]
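A sketch (ours) of the bit-packing step that the fusion pushes into the convolution's epilogue, assuming a bit-plane layout: bit b of 64 consecutive quantized activations is gathered into one 64-bit word per plane.

#include <stdint.h>

/* Pack bit-plane `bit` of 64 n-bit quantized activations into one word. */
uint64_t pack_bitplane(const uint8_t *q, int bit) {
    uint64_t word = 0;
    for (int i = 0; i < 64; i++)
        word |= (uint64_t)((q[i] >> bit) & 1) << i;
    return word;
}

When this step runs fused with the convolution, the Int-16 accumulations never round-trip through memory, which is where the fusion's speedup comes from (see Slide 39).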

SLIDE 39

Impact of Optimizations

  • The combination of TVM optimizations gives a 12X speedup over the baseline
  • Each optimization has a significant impact
  • Speedups from bitpack fusion are due to fewer memory operations
  • With a high-quality implementation, we can study our design choices

SLIDE 40

Optimization Ablation Study

  • Removing any optimization has a significant impact on performance
  • Using fused glue gives nearly a 2X speedup, as predicted by the op-count estimates

SLIDE 41

Glue Layer Impact

  • Glue consistently takes a similar amount of time as the core compute layers
  • Our fused glue operation almost completely removes this cost

SLIDE 42

Impact of Polarity

  • The baseline is near optimal
  • Quantized layers have much more memory overhead
  • Although unipolar quantization has twice as many operations, it is only marginally slower than bipolar quantization

SLIDE 43

Cumulative Speedup


SLIDE 44

Layerwise Speedup

  • Speedup is not consistent across layers
  • It may be possible to design a network of binarizable layers

SLIDE 45

Thank You!

Code: github.com/jwfromm/Riptide

Paper:

SLIDE 46

Backup Slides


SLIDE 47