SLIDE 1

CMSC5743 L06: Binary/Ternary Network

Bei Yu

(Latest update: November 2, 2020)

Fall 2020


SLIDE 2

These slides contain/adapt materials developed by

◮ Ritchie Zhao et al. (2017). “Accelerating binarized convolutional neural networks with software-programmable FPGAs”. In: Proc. FPGA, pp. 15–24

◮ Mohammad Rastegari et al. (2016). “XNOR-Net: ImageNet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542

SLIDE 3

Motivation

Binary / Ternary Net: Motivation

[Figure: histogram of trained weight values (x-axis: weight value, roughly −0.05 to 0.05; y-axis: count, up to ~6400) ⇒ its quantized counterpart. Most trained weights cluster near zero, which motivates replacing them with a few binary/ternary levels.]

SLIDE 4

Binarized Neural Networks (BNN)

[Figure: in a CNN, a floating-point input map (e.g. 2.4, 6.2, …) is multiplied by floating-point weights (e.g. 0.8, 0.1, …) to give a floating-point output map (e.g. 5.0, 9.1, …). In a BNN, a binary input map (±1) is multiplied by binary weights (±1) to give an integer output map (e.g. 1, −3, …), which is binarized after batch normalization:]

y_b = +1 if y ≥ 0, −1 otherwise

Key Differences
  1. Inputs are binarized (−1 or +1)
  2. Weights are binarized (−1 or +1)
  3. Results are binarized after batch normalization
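The arithmetic above is small enough to sketch directly. Below is a minimal NumPy sketch of one binarized layer (function names, shapes, and the batch-norm constants are illustrative, not from the slides): binary inputs and weights accumulate to integer outputs, and batch normalization plus re-binarization map them back to ±1.

import numpy as np

def sign_binarize(x):
    # Map real values to {-1, +1}; sign(0) is taken as +1.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_dense(x_bin, w_bin, gamma, beta, mu, sigma):
    # Binary-times-binary dot products accumulate to integers (the "Integer" map).
    y_int = x_bin.astype(np.int32) @ w_bin.astype(np.int32)
    # Batch normalization on the integer outputs...
    y_bn = gamma * (y_int - mu) / sigma + beta
    # ...followed by re-binarization: +1 if y >= 0 else -1.
    return sign_binarize(y_bn)

x = sign_binarize(np.random.randn(4, 8))   # binarized input map
w = sign_binarize(np.random.randn(8, 3))   # binarized weights
out = bnn_dense(x, w, gamma=1.0, beta=0.0, mu=0.0, sigma=1.0)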

SLIDE 5
BNN CIFAR-10 Architecture [2]

  • 6 conv layers, 3 dense layers, 3 max pooling layers
  • All conv filters are 3x3
  • First conv layer takes in floating-point input
  • 13.4 Mbits total model size (after hardware optimizations)

[Figure: feature map dimensions 32x32 → 16x16 → 8x8 → 4x4; number of feature maps 3 → 128 → 128 → 256 → 256 → 512 → 512 → 1024 → 1024 → 10.]

[2] M. Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv:1602.02830, Feb 2016.

SLIDE 6
Advantages of BNN

  • 1. Floating point ops replaced with binary logic ops
    – Encode {+1,−1} as {0,1} → multiplies become XORs
    – Conv/dense layers do dot products → XOR and popcount
    – Operations can map to LUT fabric as opposed to DSPs
  • 2. Binarized weights may reduce total model size
    – Fewer bits per weight may be offset by having more weights

b1   b2   b1 × b2          b1   b2   b1 XOR b2
+1   +1     +1              0    0       0
+1   −1     −1              0    1       1
−1   +1     −1              1    0       1
−1   −1     +1              1    1       0
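A small sketch of the encoding trick above (names are illustrative): mapping +1 → 0 and −1 → 1 turns each ±1 product into an XOR, so a length-n dot product reduces to n − 2 · popcount(a XOR b).

import numpy as np

def dot_pm1(a, b):
    # Reference dot product over {-1,+1} vectors.
    return int(np.dot(a, b))

def dot_xor_popcount(a, b):
    # Encode +1 -> 0, -1 -> 1. A product of -1 corresponds to a bit
    # mismatch, so dot = n - 2 * popcount(a_bits XOR b_bits).
    a_bits = (a < 0).astype(np.uint8)
    b_bits = (b < 0).astype(np.uint8)
    mismatches = int(np.count_nonzero(a_bits ^ b_bits))  # popcount of XOR
    return len(a) - 2 * mismatches

a = np.random.choice([-1, 1], size=64)
b = np.random.choice([-1, 1], size=64)
assert dot_pm1(a, b) == dot_xor_popcount(a, b)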

SLIDE 7

BNN vs CNN Parameter Efficiency

Architecture            Depth   Param Bits (Float)   Param Bits (Fixed-Point)   Error Rate (%)
ResNet [3] (CIFAR-10)    164         51.9M                 13.0M*                  11.26
BNN [2]                    9           —                   13.4M                   11.40

Comparison:
  – Conservative assumption: ResNet can use 8-bit weights
  – BNN is based on VGG (less advanced architecture)
  – BNN seems to hold promise!

* Assuming each float param can be quantized to 8-bit fixed-point

[2] M. Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv:1602.02830, Feb 2016.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV 2016.

SLIDE 8

Overview

◮ Minimize the Quantization Error
◮ Reduce the Gradient Error



SLIDE 16

Training Binary Weight Networks

Naive Solution: 1. Train a network with real-valued weights; 2. Binarize the weights afterwards (WB = sign(W)).

1 Mohammad Rastegari et al. (2016). “XNOR-Net: ImageNet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542.

SLIDE 17

[Figure: AlexNet Top-1 accuracy (%) on ILSVRC2012 — Full Precision: 56.7, Naive binarization: 0.2.]

SLIDE 18

[Figure: real-valued weight filters W ∈ R are mapped by binarization to binary filters WB ∈ B.]


SLIDE 20

Binary Weight Network

Train for binary weights:

  1. Randomly initialize W
  2. For iter = 1 to N
  3.   Load a random input image X
  4.   WB = sign(W)
  5.   α = ‖W‖ℓ1 / n
  6.   Forward pass with α, WB
  7.   Compute loss function C
  8.   ∂C/∂W = Backward pass with α, WB
  9.   Update W (W = W − ∂C/∂W)
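A minimal NumPy sketch of the loop above for a single dense layer, assuming a toy squared loss and treating sign() as identity in the backward pass (the straight-through behavior implied by step 8; all names, data, and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 1)) * 0.1    # 1. real-valued weights, kept for updates
lr = 0.01

for it in range(100):                    # 2. training loop
    X = rng.standard_normal((16, 8))     # 3. load a random input batch
    t = X.sum(axis=1, keepdims=True)     #    toy regression target
    WB = np.where(W >= 0, 1.0, -1.0)     # 4. WB = sign(W)
    alpha = np.abs(W).sum() / W.size     # 5. alpha = ||W||_l1 / n
    y = X @ (alpha * WB)                 # 6. forward pass with alpha, WB
    C = 0.5 * np.mean((y - t) ** 2)      # 7. loss function C
    dC_dy = (y - t) / len(y)
    dC_dW = X.T @ dC_dy * alpha          # 8. backward pass with alpha, WB;
                                         #    straight-through: d sign(W)/dW ~ 1
    W -= lr * dC_dW                      # 9. update the real-valued W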


SLIDE 27

[Figure: AlexNet Top-1 accuracy (%) on ILSVRC2012 — Full Precision: 56.7, Naive: 0.2, Binary Weight: 56.8.]


SLIDE 31

(1) Binarizing Weights: real-valued filter W → binary filter B = sign(W), with a scaling factor α.

(2) Binarizing Input: the naive approach computes a separate scaling factor for every sliding window over X, repeating work in the overlapping areas — inefficient. The efficient approach first averages over channels, c = mean_i |X:,:,i|, then convolves c with an average filter, so each window's scaling factor K is computed in one pass.

(3) Convolution with XNOR-Bitcount: with both operands binary, the convolution reduces to XNOR and bitcount, rescaled by α and K.
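A sketch of the efficient input-scaling step, following the XNOR-Net paper's construction (array shapes and names are my own): average |X| over channels once, then filter with a uniform k × k kernel so every window's scaling factor comes from a single pass.

import numpy as np

def input_scales(X, k):
    # X: (H, W, C) real-valued input. Returns one scaling factor per
    # k x k window: A = mean over channels of |X|, K = A * average filter.
    A = np.abs(X).mean(axis=2)                 # c = mean_i |X[:, :, i]|
    H, W = A.shape
    K = np.empty((H - k + 1, W - k + 1))
    for i in range(K.shape[0]):                # correlation with a k x k
        for j in range(K.shape[1]):            # average filter (entries 1/k^2)
            K[i, j] = A[i:i + k, j:j + k].mean()
    return K

X = np.random.randn(8, 8, 3)
K = input_scales(X, k=3)    # shared by all filters; no per-window recompute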


SLIDE 33

[Figure: AlexNet Top-1 accuracy (%) on ILSVRC2012 — Full Precision: 56.7, Naive: 0.2, Binary Weight: 56.8, Binary Weight + Binary Input: 30.5.]

SLIDE 34

Network Structure in XNOR-Networks

[Figure: sign(x) maps activations to −1/+1. A typical block in CNN: Conv → BNorm → Activ → Pool. With binarization before max-pooling, the pool sees only ±1 values:]

✗ Information Loss — max-pooling over binary values mostly returns +1, with multiple maximums per window.


SLIDE 36

Network Structure in XNOR-Networks

[Figure: XNOR-Net reorders the block to BNorm → BinActiv → BinConv → Pool, so max-pooling operates on real-valued convolution outputs rather than on ±1 values:]

✓ Information loss avoided ✓ Multiple-maximums problem avoided


SLIDE 38

[Figure: AlexNet Top-1 accuracy (%) on ILSVRC2012 — Full Precision: 56.7, Naive: 0.2, Binary Weight: 56.8, Binary Weight + Input: 30.5, XNOR-Net: 44.2.]

✓ 32× smaller model — float vs. binary model sizes: AlexNet 245 MB → 7.4 MB, VGG 500 MB → 16 MB, ResNet-18 100 MB → 1.5 MB.

✓ 58× less computation — [plots: speedup vs. number of channels (1 to 1024) and vs. filter size (up to ~20×20), reaching roughly 50–65×.]

SLIDE 39

[Figure: AlexNet Top-1 & Top-5 accuracy (%) on ILSVRC2012 comparison.]

SLIDE 40

Motivation and Intuition

Motivation
◮ Naive methods suffer from accuracy loss (Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David (2015). “BinaryConnect: Training deep neural networks with binary weights during propagations”. In: Advances in Neural Information Processing Systems, pp. 3123–3131; Matthieu Courbariaux, Itay Hubara, et al. (2016). “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1”. In: arXiv preprint arXiv:1602.02830)

Intuition
◮ The quantized parameters should approximate the full-precision parameters as closely as possible

SLIDE 41

ABC-Net

Towards Accurate Binary Convolutional Neural Network


SLIDE 42

ABC-Net

Contribution
◮ Approximate full-precision weights with a linear combination of multiple binary weight bases
◮ Introduce multiple binary activations

SLIDE 43

ABC-Net

Weights Binarization
◮ Weight tensor in one layer: W ∈ R^{w×h×c_in×c_out}
◮ Binary bases: B_1, B_2, …, B_M ∈ {−1,+1}^{w×h×c_in×c_out}

W ≈ α_1 B_1 + α_2 B_2 + … + α_M B_M
B_i = F_{u_i}(W) = sign(W̄ + u_i std(W)), i = 1, 2, …, M

where W̄ = W − mean(W), and u_i is a shift parameter (e.g. u_i = −1 + (i − 1) · 2/(M − 1)).

α can be calculated via min_α J(α) = ‖W − Bα‖²
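The weight approximation translates almost line-for-line into NumPy; a sketch using the slide's example spacing of u_i (all names are illustrative):

import numpy as np

def abc_weight_bases(W, M=3):
    # Approximate W by sum_m alpha_m * B_m with B_m in {-1,+1}.
    Wbar = W - W.mean()
    u = [-1 + (i - 1) * 2.0 / (M - 1) for i in range(1, M + 1)]  # u_i shifts
    Bs = [np.where(Wbar + ui * W.std() >= 0, 1.0, -1.0) for ui in u]
    Bmat = np.stack([B.ravel() for B in Bs], axis=1)             # columns = bases
    alpha, *_ = np.linalg.lstsq(Bmat, W.ravel(), rcond=None)     # min_a ||W - Ba||^2
    return alpha, Bs

W = np.random.randn(3, 3, 16, 32)
alpha, Bs = abc_weight_bases(W, M=3)
W_hat = sum(a * B for a, B in zip(alpha, Bs))   # linear combination approximating W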

SLIDE 44

ABC-Net

Forward and Backward
◮ Forward:

B_1, B_2, …, B_M = F_{u_1}(W), F_{u_2}(W), …, F_{u_M}(W)
solve min_α J(α) = ‖W − Bα‖² for α
O = Σ_{m=1}^{M} α_m Conv(B_m, A)

◮ Backward (STE for the sign function):

∂c/∂W = (∂c/∂O) Σ_{m=1}^{M} α_m (∂O/∂B_m)(∂B_m/∂W)
       ≈ (∂c/∂O) Σ_{m=1}^{M} α_m (∂O/∂B_m)        (STE: ∂B_m/∂W ≈ 1)
       = Σ_{m=1}^{M} α_m (∂c/∂B_m)

SLIDE 45

ABC-Net

Multiple Binary Activations
◮ Bounded activation function with h_v(x) ∈ [0, 1]:

h_v(x) = clip(x + v, 0, 1)

where v is a shift parameter.

◮ Binarization function:

H_v(R) := 2 · I_{h_v(R) ≥ 0.5} − 1
A_1, A_2, …, A_N = H_{v_1}(R), H_{v_2}(R), …, H_{v_N}(R)
R ≈ β_1 A_1 + β_2 A_2 + … + β_N A_N

where R is the real-valued activation.

◮ A_1, A_2, …, A_N are the bases used to represent the real-valued activations.
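The activation side in the same style; the shift parameters v_n and combination weights β_n are trained in ABC-Net, but here they are fixed, assumed values for illustration:

import numpy as np

def h(x, v):
    # Bounded activation h_v(x) = clip(x + v, 0, 1).
    return np.clip(x + v, 0.0, 1.0)

def H(R, v):
    # Binarization H_v(R) = 2 * 1[h_v(R) >= 0.5] - 1.
    return np.where(h(R, v) >= 0.5, 1.0, -1.0)

R = np.random.randn(4, 4)           # real-valued activation
vs = [-0.25, 0.0, 0.25]             # shift parameters v_1..v_N (assumed values)
betas = [0.3, 0.4, 0.3]             # beta_n, learned in ABC-Net; fixed here
As = [H(R, v) for v in vs]          # binary activation bases A_1..A_N
R_hat = sum(b * A for b, A in zip(betas, As))   # R ~ sum_n beta_n A_n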

SLIDE 46

ABC-Net

◮ ApproxConv is expected to approximate the conventional full-precision convolution with a linear combination of binary convolutions
◮ The right part of the figure is the overall block structure of the convolution in ABC-Net. The input is binarized using different functions H_{v_1}, H_{v_2}, H_{v_3}

Conv(W, R) ≈ Conv(Σ_{m=1}^{M} α_m B_m, Σ_{n=1}^{N} β_n A_n) = Σ_{m=1}^{M} Σ_{n=1}^{N} α_m β_n Conv(B_m, A_n)
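Putting the two sets of bases together, the convolution expands into M × N binary convolutions. A 1-D toy sketch with np.convolve standing in for Conv (the bases and scales below are made-up values in the spirit of the two sketches above):

import numpy as np

# toy binary bases and scales for weights (B_m, alpha_m) and inputs (A_n, beta_n)
Bs = [np.array([1., -1., 1.]), np.array([-1., 1., 1.])]
alphas = [0.7, 0.3]
As = [np.array([1., 1., -1., 1., -1.]), np.array([-1., 1., 1., 1., 1.])]
betas = [0.6, 0.4]

# Conv(W, R) ~ sum_m sum_n alpha_m * beta_n * Conv(B_m, A_n)
out = sum(a * b * np.convolve(A, B, mode='valid')
          for a, B in zip(alphas, Bs)
          for b, A in zip(betas, As))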

SLIDE 47

ABC-Net

Read the paper² if you want to learn the specific details of the algorithm.

2 Xiaofan Lin, Cong Zhao, and Wei Pan (2017). “Towards accurate binary convolutional neural network”. In: Advances in Neural Information Processing Systems, pp. 345–353.

SLIDE 48

Overview

◮ Minimize the Quantization Error
◮ Reduce the Gradient Error

SLIDE 49

Motivation and Intuition

Motivation
◮ Although the STE is often adopted to estimate gradients in back-propagation, there is an obvious gradient mismatch between the gradient of the binarization (sign) function and the STE's surrogate
◮ Under the restriction of the STE, parameters outside the range [−1, +1] will not be updated
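Concretely, the STE passes the upstream gradient through sign() unchanged where |x| ≤ 1 and zeroes it elsewhere, which is exactly why parameters outside [−1, +1] stop updating. A minimal sketch (illustrative names):

import numpy as np

def sign_forward(x):
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward_ste(x, grad_out):
    # Straight-through estimator: dL/dx = dL/dy * 1[|x| <= 1].
    # Parameters with |x| > 1 receive zero gradient and freeze.
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-1.5, -0.3, 0.2, 2.0])
g = np.ones_like(x)                 # upstream gradient dL/dy
print(sign_backward_ste(x, g))      # -> [0., 1., 1., 0.]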

SLIDE 50

Bi-Real

Bi-real net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm


SLIDE 51

Bi-Real

Naive Binarization Function
◮ Recall the partial derivative calculation in back-propagation:

∂L/∂A_r^{l,t} = (∂L/∂A_b^{l,t}) · (∂A_b^{l,t}/∂A_r^{l,t}) = (∂L/∂A_b^{l,t}) · (∂ Sign(A_r^{l,t})/∂A_r^{l,t}) ≈ (∂L/∂A_b^{l,t}) · (∂F(A_r^{l,t})/∂A_r^{l,t})

◮ The Sign function is non-differentiable, so F is a differentiable approximation of the Sign function

SLIDE 52

Bi-Real

∂L/∂A_r^{l,t} = (∂L/∂A_b^{l,t}) · (∂A_b^{l,t}/∂A_r^{l,t}) = (∂L/∂A_b^{l,t}) · (∂ Sign(A_r^{l,t})/∂A_r^{l,t}) ≈ (∂L/∂A_b^{l,t}) · (∂F(A_r^{l,t})/∂A_r^{l,t})

Approximation of the Sign function
◮ Naive approximation: F(x) = clip(x, −1, 1), see fig. (b)
◮ More precise approximation in Bi-Real, see fig. (c):

ApproxSign(x) =  −1         if x < −1
                 2x + x²    if −1 ≤ x < 0
                 2x − x²    if 0 ≤ x < 1
                 1          otherwise

∂ApproxSign(x)/∂x =  2 + 2x    if −1 ≤ x < 0
                     2 − 2x    if 0 ≤ x < 1
                     0         otherwise
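The piecewise definition vectorizes directly; a sketch (illustrative):

import numpy as np

def approx_sign(x):
    # Bi-Real's piecewise-quadratic approximation of sign(x).
    return np.where(x < -1, -1.0,
           np.where(x < 0, 2 * x + x ** 2,
           np.where(x < 1, 2 * x - x ** 2, 1.0)))

def approx_sign_grad(x):
    # dApproxSign/dx: a triangle on [-1, 1], zero elsewhere.
    return np.where((x >= -1) & (x < 0), 2 + 2 * x,
           np.where((x >= 0) & (x < 1), 2 - 2 * x, 0.0))

xs = np.linspace(-2, 2, 9)
print(approx_sign(xs))
print(approx_sign_grad(xs))   # replaces the STE's hard window in backprop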

SLIDE 53

Bi-Real

Read the paper³ if you want to learn the specific details of the algorithm.

3 Zechun Liu et al. (2018). “Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm”. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737.

SLIDE 54

Trained Ternary Quantization⁴

[Figure: overview of the trained ternary quantization procedure.]

4 Chenzhuo Zhu et al. (2017). “Trained ternary quantization”. In: Proc. ICLR.
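The overview figure itself is lost in extraction. As a hedged sketch of the procedure the caption refers to (per Zhu et al. 2017, weights are quantized to {−W_n, 0, +W_p} with a threshold proportional to the largest weight magnitude, and the two scales W_p, W_n are learned; the code is my own illustration, not the authors'):

import numpy as np

def ttq_quantize(w, Wp, Wn, t=0.05):
    # Trained ternary quantization: threshold delta = t * max|w|,
    # then map to {+Wp, 0, -Wn}. Wp and Wn are learned scalars.
    delta = t * np.abs(w).max()
    q = np.zeros_like(w)
    q[w > delta] = Wp
    q[w < -delta] = -Wn
    return q

w = np.random.randn(64)
print(ttq_quantize(w, Wp=1.2, Wn=0.8))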

SLIDE 55

Trained Ternary Quantization⁴

[Figure: ternary weight values (above) and distributions (below) over training iterations for different layers of ResNet-20 on CIFAR-10.]

SLIDE 56

Reading List

◮ Hyeonuk Kim et al. (2017). “A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks”. In: Proc. DAC, 60:1–60:6

◮ Jungwook Choi et al. (2018). “PACT: Parameterized clipping activation for quantized neural networks”. In: arXiv preprint arXiv:1805.06085

◮ Dongqing Zhang et al. (2018). “LQ-Nets: Learned quantization for highly accurate and compact deep neural networks”. In: Proc. ECCV, pp. 365–382

◮ Aojun Zhou et al. (2017). “Incremental network quantization: Towards lossless CNNs with low-precision weights”. In: arXiv preprint arXiv:1702.03044

◮ Zhaowei Cai et al. (2017). “Deep learning with low precision by half-wave Gaussian quantization”. In: Proc. CVPR, pp. 5918–5926