Learning Accurate Low-bit Deep Neural Networks with Stochastic Quantization

Yinpeng Dong¹, Renkun Ni², Jianguo Li³, Yurong Chen³, Jun Zhu¹, Hang Su¹
¹ Department of CST, Tsinghua University   ² University of Virginia   ³ Intel Labs China
Deep Learning is Everywhere: Self-Driving, AlphaGo, Machine Translation, Dota 2
Limitations

- More data + deeper models → more FLOPs + larger memory
- Computation intensive
- Memory intensive
- Hard to deploy on mobile devices
Low-bit DNNs for Efficient Inference

- High redundancy in DNN weights;
- Quantize full-precision (32-bit) weights to binary (1-bit) or ternary (2-bit) weights;
- Replace multiplication (convolution) with addition and subtraction;
Typical Low-bit DNNs

- BinaryConnect: stochastic binarization,
  $B_i = +1$ with probability $p = \sigma(W_i)$, and $B_i = -1$ with probability $1 - p$
- BWN: minimize $\|W - \alpha B\|^2$, with
  $B_i = \mathrm{sign}(W_i)$ and $\alpha = \frac{1}{n}\sum_{i=1}^{n} |W_i|$
- TWN: minimize $\|W - \alpha T\|^2$, with
  $T_i = +1$ if $W_i > \Delta$, $T_i = 0$ if $|W_i| \le \Delta$, $T_i = -1$ if $W_i < -\Delta$,
  $\Delta = \frac{0.7}{n}\sum_{i=1}^{n} |W_i|$, and $\alpha = \frac{1}{|I_\Delta|}\sum_{i \in I_\Delta} |W_i|$ with $I_\Delta = \{i : |W_i| > \Delta\}$
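For concreteness, here is a minimal NumPy sketch of the BWN and TWN quantizers defined above; the function names are my own, and the scaling factors follow the formulas on this slide.

```python
import numpy as np

def bwn_quantize(w):
    """Binary-Weight-Network quantizer: W ~= alpha * B, with B = sign(W)."""
    alpha = np.abs(w).mean()            # alpha = (1/n) * sum_i |W_i|
    b = np.where(w >= 0, 1.0, -1.0)     # B_i = sign(W_i)
    return alpha, b

def twn_quantize(w):
    """Ternary-Weight-Network quantizer: W ~= alpha * T, with T in {-1, 0, +1}."""
    delta = 0.7 * np.abs(w).mean()      # Delta = (0.7/n) * sum_i |W_i|
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    mask = np.abs(w) > delta            # I_Delta = {i : |W_i| > Delta}
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha, t
```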
Training & Inference of Low-bit DNNs

- Let $W$ be the full-precision weights and $Q$ the low-bit weights ($B$, $T$, $\alpha B$, or $\alpha T$).
- Forward propagation: quantize $W$ to $Q$ and perform convolution or multiplication with $Q$.
- Backward propagation: use $Q$ to calculate gradients.
- Parameter update: $W_{t+1} = W_t - \eta_t \frac{\partial L}{\partial Q_t}$
- Inference: only the low-bit weights $Q$ need to be kept.
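As a rough illustration of this scheme (not the authors' released implementation), the hypothetical PyTorch layer below keeps the full-precision weights W but convolves with their BWN quantization Q, so the optimizer step updates W with the gradient computed through Q.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BWNConv2d(nn.Conv2d):
    """Convolution layer that keeps full-precision weights W but convolves
    with their BWN quantization Q = alpha * sign(W) in the forward pass."""

    def forward(self, x):
        w = self.weight
        alpha = w.abs().mean()              # alpha = (1/n) * sum_i |W_i|
        q = alpha * torch.sign(w)           # Q = alpha * B,  B_i = sign(W_i)
        # Straight-through trick: the forward pass uses Q, the gradient flows
        # back to W, so the step realizes W_{t+1} = W_t - eta_t * dL/dQ_t.
        q = w + (q - w).detach()
        return F.conv2d(x, q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Only Q (i.e., B plus one scalar alpha per layer) needs to be stored for inference.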
Motivations

- Existing methods quantize all weights simultaneously;
- The quantization error $\|W - Q\|$ may be large for some elements/filters;
- Large-error elements induce inappropriate gradient directions.

Our approach:
- Quantize only a portion of the weights at each iteration;
- Select that portion stochastically;
- The scheme can be applied to any low-bit setting.
Roulette Selection Algorithm

[Figure: a 4-channel weight matrix is stochastically partitioned with r = 50%; two roulette spins (v = 0.58, v = 0.37) select channels C2 and C3 for quantization, while C1 and C4 keep their full-precision values in the hybrid weight matrix.]

- Quantization error (per channel $i$): $e_i = \|W_i - Q_i\|_1 / \|W_i\|_1$
- Quantization probability: larger quantization error means smaller quantization probability, e.g. $p_i \propto 1/e_i$
- Quantization ratio $r$: gradually increased to 100%
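A minimal sketch of this roulette selection, assuming channel-major weight tensors and $p_i \propto 1/e_i$; the function name and the epsilon smoothing are my own assumptions.

```python
import numpy as np

def roulette_select(W, Q, r, eps=1e-7):
    """Pick round(r * num_channels) channels to quantize, favoring channels
    with small relative quantization error e_i = ||W_i - Q_i||_1 / ||W_i||_1."""
    n = W.shape[0]                                    # channels along axis 0
    e = (np.abs(W - Q).reshape(n, -1).sum(1)
         / (np.abs(W).reshape(n, -1).sum(1) + eps))
    f = 1.0 / (e + eps)                               # p_i proportional to 1/e_i
    selected, candidates = [], list(range(n))
    for _ in range(int(round(r * n))):
        p = f[candidates] / f[candidates].sum()
        idx = np.random.choice(len(candidates), p=p)  # one spin of the roulette
        selected.append(candidates.pop(idx))
    return selected
```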
Training & Inference

- Hybrid weight matrix: $\widetilde{W}_i = Q_i$ if channel $i$ is selected, $W_i$ otherwise
- Parameter update: $W_{t+1} = W_t - \eta_t \frac{\partial L}{\partial \widetilde{W}_t}$
- Inference: all weights are quantized; use $Q$ to perform inference
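Continuing the same sketch, forming the hybrid weight matrix and applying one step with the gradient taken through it (variable names are illustrative):

```python
import numpy as np

def hybrid_weights(W, Q, selected):
    """W_tilde: quantized rows for the selected channels, full precision otherwise."""
    W_tilde = W.copy()
    W_tilde[selected] = Q[selected]
    return W_tilde

# One training step, conceptually:
#   W_tilde = hybrid_weights(W, Q, roulette_select(W, Q, r))
#   grad    = backprop(loss, W_tilde)     # dL / dW_tilde
#   W      -= lr * grad                   # W_{t+1} = W_t - eta_t * dL/dW_tilde
```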
Ablation Studies

- Selection granularity:
  - Filter-level > element-level
- Selection/partition algorithms:
  - Stochastic (roulette) > deterministic (sorting) ≈ fixed (selection only at the first iteration)
- Quantization probability functions:
  - Linear > sigmoid > constant ≈ softmax, where the softmax variant normalizes the channel scores as $p_i = \exp(f_i) / \sum_j \exp(f_j)$, with $f_i$ a decreasing function of the error $e_i$
- Quantization ratio update scheme:
  - Exponential > fine-tune > uniform
  - 50% → 75% → 87.5% → 100%
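One hedged reading of the exponential schedule listed above (50% → 75% → 87.5% → 100%) is to halve the remaining full-precision fraction at each stage; the helper below is only an interpretation, not the paper's exact schedule.

```python
def sq_ratio_schedule(num_stages=4):
    """Returns [0.5, 0.75, 0.875, 1.0] for num_stages=4: each stage halves the
    un-quantized portion, and the last stage quantizes everything."""
    return [1.0 - 0.5 ** (k + 1) for k in range(num_stages - 1)] + [1.0]
```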
Results -- CIFAR

Test error (%) of VGG-9 and ResNet-56 trained with 5 different methods on CIFAR-10 and CIFAR-100:

Method    Bits   CIFAR-10 VGG-9   CIFAR-10 ResNet-56   CIFAR-100 VGG-9   CIFAR-100 ResNet-56
FWN       32     9.00             6.69                 30.68             29.49
BWN       1      10.67            16.42                37.68             35.01
SQ-BWN    1      9.40             7.15                 35.25             31.56
TWN       2      9.87             7.64                 34.80             32.09
SQ-TWN    2      8.37             6.20                 34.24             28.90

[Figure: training loss vs. iterations (k), comparing FWN, BWN, SQ-BWN (left) and FWN, TWN, SQ-TWN (right).]
Results -- ImageNet

Test error (%) of AlexNet-BN and ResNet-18 trained with 5 different methods on ImageNet:

Method    Bits   AlexNet-BN top-1   AlexNet-BN top-5   ResNet-18 top-1   ResNet-18 top-5
FWN       32     44.18              20.83              34.80             13.60
BWN       1      51.22              27.18              45.20             21.08
SQ-BWN    1      48.78              24.86              41.64             18.35
TWN       2      47.54              23.81              39.83             17.02
SQ-TWN    2      44.70              21.40              36.18             14.26
Conclusions

- We propose a stochastic quantization algorithm for training low-bit DNNs;
- The algorithm can be flexibly applied to any low-bit setting;
- It consistently improves performance;
- We release our code publicly for future development:
  https://github.com/dongyp13/Stochastic-Quantization
Q & A