Learning Accurate Low-bit Deep Neural Networks with Stochastic Quantization


  1. Learning Accurate Low-bit Deep Neural Networks with Stochastic Quantization
     Yinpeng Dong [1], Renkun Ni [2], Jianguo Li [3], Yurong Chen [3], Jun Zhu [1], Hang Su [1]
     [1] Department of CST, Tsinghua University  [2] University of Virginia  [3] Intel Labs China

  2. Deep Learning is Everywhere
     Self-driving, AlphaGo, machine translation, Dota 2

  3. Limitations
     - More data + deeper models → more FLOPs + larger memory
     - Computation intensive
     - Memory intensive
     - Hard to deploy on mobile devices

  4. Low-bit DNNs for Efficient Inference
     - High redundancy in DNNs
     - Quantize full-precision (32-bit) weights to binary (1-bit) or ternary (2-bit) weights
     - Replace multiplications (in convolutions) with additions and subtractions

  5. Typical Low-bit DNNs
     - BinaryConnect: B_i = +1 with probability p = σ(W_i), and -1 with probability 1 - p
     - BWN: minimize ||W - αB||, with α = (1/n) Σ_i |W_i| and B_i = sign(W_i)
     - TWN: minimize ||W - αT||, with
         T_i = +1 if W_i > Δ,  0 if |W_i| ≤ Δ,  -1 if W_i < -Δ
         Δ = 0.7 · (1/n) Σ_i |W_i|,  α = (1/|I_Δ|) Σ_{i ∈ I_Δ} |W_i|,  where I_Δ = {i : |W_i| > Δ}
     (an illustrative sketch of the BWN and TWN quantizers follows below)
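
To make the two main quantizers concrete, here is a minimal NumPy sketch of the BWN and TWN rules stated above; the function names and the per-filter treatment are illustrative assumptions, not the paper's released implementation.

```python
# Minimal NumPy sketch of the BWN and TWN quantizers summarized above.
import numpy as np

def bwn_quantize(w):
    """BWN: approximate w by alpha * sign(w)."""
    alpha = np.mean(np.abs(w))            # alpha = (1/n) * sum_i |W_i|
    return alpha * np.where(w >= 0, 1.0, -1.0)

def twn_quantize(w):
    """TWN: approximate w by alpha * t, with t in {-1, 0, +1}."""
    delta = 0.7 * np.mean(np.abs(w))      # threshold Delta = 0.7 * (1/n) * sum_i |W_i|
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    above = np.abs(w) > delta
    alpha = np.abs(w[above]).mean() if above.any() else 0.0   # mean |W_i| above Delta
    return alpha * t

w = np.random.randn(64)                   # one filter's full-precision weights
print(bwn_quantize(w)[:4])
print(twn_quantize(w)[:4])
```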

  6. Training & Inference of Low-bit DNNs
     - Let W be the full-precision weights and Q the low-bit weights (B, T, αB, or αT)
     - Forward propagation: quantize W to Q and perform the convolution or multiplication with Q
     - Backward propagation: use Q to compute the gradients
     - Parameter update: W_{t+1} = W_t - η_t · ∂L/∂Q_t
     - Inference: only the low-bit weights Q need to be kept
     (a toy sketch of one training step follows below)
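
A toy, self-contained sketch of one such training iteration (helper names like train_step are assumptions, not the paper's code): the forward pass uses the quantized weights, the gradient is taken with respect to those quantized weights, and the update is applied to the full-precision copy.

```python
# Toy sketch of one low-bit training step: quantize for the forward pass,
# differentiate w.r.t. the quantized weights, update the full-precision weights.
import numpy as np

def bwn_quantize(w):
    return np.mean(np.abs(w)) * np.sign(w)        # alpha * sign(w), as on slide 5

def train_step(w, loss_grad_wrt_q, lr):
    q = bwn_quantize(w)        # forward: quantize W to Q and use Q in conv/matmul
    g = loss_grad_wrt_q(q)     # backward: gradient of the loss w.r.t. Q
    return w - lr * g          # update: apply that gradient to the full-precision W

target = np.full(8, 0.5)                          # hypothetical regression target
w = np.random.randn(8)
for _ in range(200):
    w = train_step(w, lambda q: 2.0 * (q - target), lr=0.01)
print(bwn_quantize(w))         # at inference time only the low-bit Q is kept
```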

  7. Motivations
     - Existing methods quantize all weights simultaneously
     - The quantization error ||W - Q|| may be large for some elements/filters
     - This induces inappropriate gradient directions
     - Idea: quantize only a portion of the weights at each iteration
       - Stochastic selection of which weights to quantize
       - Can be applied to any low-bit setting

  8. Roulette Selection Algorithm
     [Figure: a 4-channel weight matrix (C1-C4) with per-channel quantization errors; a stochastic
      partition with ratio r = 50% uses two roulette spins (v = 0.58 and v = 0.37), selecting
      channels C2 and C3 for quantization and yielding a hybrid weight matrix of quantized and
      full-precision channels]
     - Quantization error: e_i = ||W_i - Q_i||_1 / ||W_i||_1
     - Quantization probability: a larger quantization error means a smaller quantization
       probability, e.g. p_i ∝ 1/e_i
     - Quantization ratio r: gradually increased to 100%
     (see the selection sketch below)
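
The following NumPy sketch illustrates the channel selection step described above; the helper names are assumptions, and sampling without replacement via Generator.choice stands in for the repeated roulette spins.

```python
# Illustrative sketch of roulette channel selection with p_i inversely
# proportional to the per-channel quantization error.
import numpy as np

def quantization_error(w, q):
    """Per-channel relative L1 error: e_i = ||W_i - Q_i||_1 / ||W_i||_1."""
    return np.abs(w - q).sum(axis=1) / np.abs(w).sum(axis=1)

def roulette_select(w, q, ratio, rng):
    """Select round(ratio * #channels) channels to quantize, with probability
    inversely proportional to their quantization error (p_i ∝ 1 / e_i)."""
    e = quantization_error(w, q)
    p = 1.0 / (e + 1e-7)
    p /= p.sum()
    n_quant = int(round(ratio * w.shape[0]))
    return rng.choice(w.shape[0], size=n_quant, replace=False, p=p)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 5))                           # 4 filters, 5 weights each
q = np.sign(w) * np.abs(w).mean(axis=1, keepdims=True)    # BWN-style per-filter quantization
print("channels to quantize:", roulette_select(w, q, ratio=0.5, rng=rng))
```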

  9. Training & Inference
     - Hybrid weight matrix Q~: Q~_i = Q_i if channel i is selected, and W_i otherwise
     - Parameter update: W_{t+1} = W_t - η_t · ∂L/∂Q~_t
     - Inference: all weights are quantized; use Q to perform inference
     (see the sketch below)
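
A minimal sketch of forming the hybrid weight matrix Q~ and applying its gradient to the full-precision weights (variable names are illustrative; the gradient is a stand-in for what backpropagation would produce).

```python
# Selected channels carry quantized values; the rest stay full precision.
import numpy as np

def hybrid_weights(w, q, selected_channels):
    q_tilde = w.copy()
    q_tilde[selected_channels] = q[selected_channels]   # quantize only the chosen channels
    return q_tilde

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 5))
q = np.sign(w) * np.abs(w).mean(axis=1, keepdims=True)  # BWN-style per-filter quantization
q_tilde = hybrid_weights(w, q, selected_channels=[1, 2])

lr = 0.1
grad_wrt_q_tilde = rng.standard_normal((4, 5))          # stand-in for dL/dQ~ from backprop
w = w - lr * grad_wrt_q_tilde                           # update the full-precision W
print(q_tilde)
```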

  10. Ablation Studies
      - Selection granularity: filter-level > element-level
      - Selection/partition algorithms: stochastic (roulette) > deterministic (sorting) ≈ fixed
        (selection only at the first iteration)
      - Quantization probability functions: linear > sigmoid > constant ≈ softmax,
        where p_i = exp(f_i) / Σ_j exp(f_j) and f_i = 1/e_i
      - Quantization ratio update scheme: exponential > fine-tuned > uniform,
        e.g. 50% → 75% → 87.5% → 100%
      (a sketch of the exponential schedule and the softmax probability follows below)
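
A small sketch of two of the items above: an exponential quantization-ratio schedule that reproduces the 50% → 75% → 87.5% → 100% stages (how the stages are generated is an assumption), and the softmax probability variant with f_i = 1/e_i.

```python
# Exponential ratio schedule and softmax quantization probability (illustrative).
import numpy as np

def exponential_ratio_schedule(num_stages=4):
    """Each stage halves the remaining full-precision portion (assumed rule)."""
    return [1.0 - 0.5 ** s for s in range(1, num_stages)] + [1.0]

def softmax_probability(e, eps=1e-7):
    f = 1.0 / (e + eps)
    z = np.exp(f - f.max())          # subtract the max for numerical stability
    return z / z.sum()

print(exponential_ratio_schedule())                          # [0.5, 0.75, 0.875, 1.0]
print(softmax_probability(np.array([0.85, 0.2, 0.1, 0.05])))
```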

  11. Results -- CIFAR
      Test error (%) of VGG-9 and ResNet-56 trained with 5 different methods on CIFAR-10
      and CIFAR-100:

                       CIFAR-10              CIFAR-100
      Method   Bits    VGG-9   ResNet-56     VGG-9   ResNet-56
      FWN      32      9.00    6.69          30.68   29.49
      BWN      1       10.67   16.42         37.68   35.01
      SQ-BWN   1       9.40    7.15          35.25   31.56
      TWN      2       9.87    7.64          34.80   32.09
      SQ-TWN   2       8.37    6.20          34.24   28.90

      [Figure: training loss curves (loss vs. iterations, 0-256k) comparing FWN, BWN and SQ-BWN,
       and FWN, TWN and SQ-TWN]

  12. Results -- ImageNet
      Top-1/top-5 error (%) of AlexNet-BN and ResNet-18 trained with 5 different methods
      on ImageNet:

                       AlexNet-BN            ResNet-18
      Method   Bits    top-1   top-5         top-1   top-5
      FWN      32      44.18   20.83         34.80   13.60
      BWN      1       51.22   27.18         45.20   21.08
      SQ-BWN   1       48.78   24.86         41.64   18.35
      TWN      2       47.54   23.81         39.83   17.02
      SQ-TWN   2       44.70   21.40         36.18   14.26

  13. Conclusions
      - We propose a stochastic quantization algorithm for training low-bit DNNs
      - The algorithm can be flexibly applied to all low-bit settings
      - It consistently improves performance
      - We release our code to the public for future development:
        https://github.com/dongyp13/Stochastic-Quantization

  14. Q & A
