
Training of Convolutional Neural Networks (CNNs) Typical Datasets - PowerPoint PPT Presentation

Multi-Precision Policy Enforced Training (MuPPET): a precision-switching strategy for quantised fixed-point training of CNNs


  1. Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs
Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, Christos-Savvas Bouganis
intelligent Digital Systems Lab, Dept. of Electrical and Electronic Engineering
www.imperial.ac.uk/idsl

  2. Training of Convolutional Neural Networks (CNNs)
Typical Datasets:
• CIFAR10 - 10 categories, 60,000 images
• CIFAR100 - 100 categories, 60,000 images
• ImageNet - 1,000 categories, 1.2 million images
Typical Networks: (illustrated on the slide)

  3. Motivation
• Training Time
  - Enable wider experimentation with training, e.g. Neural Architecture Search
  - Increase productivity of deep learning practitioners
• Power Consumption
  - Reduce cost of training in large data centers
  - Perform training on edge devices
• Exploit low-precision hardware capabilities
  - NVIDIA Turing Architecture (GPU)
  - Microsoft Brainwave (FPGA)
  - Google TPU (ASIC)

  4. Goal: Perform quantised training of CNNs while maintaining FP32 accuracy and producing a model that performs inference at FP32

  5. Contributions of this paper
• A generalisable policy that decides at run time the appropriate points to increase the precision of the training process without impacting final test accuracy
  - Datasets: CIFAR10, CIFAR100, ImageNet
  - Networks: AlexNet, ResNet, GoogLeNet
  - Up to 1.84x training-time improvement with negligible loss in accuracy
• Extending training to bit-widths as low as 8 bits to leverage the low-precision capabilities of modern processing systems
• An open-source PyTorch implementation of the MuPPET framework with emulated quantised computations

  6. Background: Mixed-Precision Training
• Current state of the art: mixed-precision training (Micikevicius et al., 2018) [6]
  - Maintains a master copy of the weights at FP32
  - Quantises weights and activations to FP16 for all computations
  - Accumulates FP16 gradients into the FP32 master copy of the weights
• Incurs an accuracy drop if precision below FP16 is used
[6] Micikevicius, P. et al. Mixed Precision Training. In International Conference on Learning Representations (ICLR), 2018
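As a rough illustration of the scheme summarised above, the sketch below performs one training step with an FP32 master copy and FP16 compute. It assumes the model has already been converted with `.half()` and that `master_params` holds FP32 clones of its parameters; the function name, learning rate and fixed `loss_scale` are illustrative choices, not the exact recipe of Micikevicius et al. or of MuPPET.

```python
import torch

def mixed_precision_step(model, master_params, batch, target, loss_fn,
                         lr=0.01, loss_scale=1024.0):
    """One SGD step with an FP32 master copy and FP16 computation (sketch)."""
    # 1. Refresh the FP16 working weights from the FP32 master copy
    with torch.no_grad():
        for p_half, p_master in zip(model.parameters(), master_params):
            p_half.copy_(p_master.half())

    # 2. FP16 forward/backward; the loss is scaled so small gradients
    #    remain representable in FP16
    output = model(batch.half())
    loss = loss_fn(output, target) * loss_scale
    model.zero_grad()
    loss.backward()

    # 3. Unscale the FP16 gradients and accumulate the update into FP32
    with torch.no_grad():
        for p_half, p_master in zip(model.parameters(), master_params):
            p_master -= lr * (p_half.grad.float() / loss_scale)
```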

  7. Multilevel optimisation formulation
• Hierarchical formulation that progressively increases the precision of computations, maintaining an FP32 master copy of the weights at every level:
$$
\begin{aligned}
&\min_{x^{(r_O)} \in \mathbb{R}^{E}} \ \mathrm{Loss}(\,\cdot\,) && \text{(FP32 master copy of weights)}\\
&\quad \min_{x^{(r_{O-1})} \in \mathbb{R}^{E}} \ \mathrm{Loss}(\,\cdot\,) && \text{(FP32 master copy of weights)}\\
&\qquad \min_{x^{(r_{O-2})} \in \mathbb{R}^{E}} \ \mathrm{Loss}(\,\cdot\,)\\
&\qquad\quad \ddots\\
&\qquad\qquad \min_{x^{\mathrm{FP32}} \in \mathbb{R}^{E}} \ \mathrm{Loss}\big(g(x^{\mathrm{FP32}})\big)
\end{aligned}
$$
• The proposed policy decides at run time the epochs at which these precision changes need to be made

  8. Background: Gradient Diversity
• Yin et al. (2018) compute the diversity between minibatches within an epoch, where $\nabla g_j(x)$ is the gradient of the weights for minibatch $j$:
$$
\Delta_S(x) \;=\; \frac{\sum_{j=1}^{n} \left\| \nabla g_j(x) \right\|_2^2}{\left\| \sum_{j=1}^{n} \nabla g_j(x) \right\|_2^2}
\;=\; \frac{\sum_{j=1}^{n} \left\| \nabla g_j(x) \right\|_2^2}{\sum_{j=1}^{n} \left\| \nabla g_j(x) \right\|_2^2 + \sum_{j \neq k} \left\langle \nabla g_j(x), \nabla g_k(x) \right\rangle}
$$
• Modified for MuPPET to compute the diversity between minibatches across epochs, where $\nabla g^{l}_{m}(x)$ is the gradient of the last minibatch for layer $m$ in epoch $l$, $\mathcal{L}$ is the set of layers, and $r$ is the resolution in epochs ($r \leq$ current epoch $k$):
$$
\Delta_S(x)_k \;=\; \frac{1}{|\mathcal{L}|} \sum_{\forall m \in \mathcal{L}} \frac{\sum_{l=k-r}^{k} \left\| \nabla g^{l}_{m}(x) \right\|_2^2}{\left\| \sum_{l=k-r}^{k} \nabla g^{l}_{m}(x) \right\|_2^2}
$$
• This gives the average gradient diversity across all layers over the last $r$ epochs
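A minimal PyTorch sketch of the two quantities above, assuming `grad_history[layer]` stores the last-minibatch gradient tensor collected at the end of each epoch; the names and bookkeeping are illustrative rather than the authors' implementation.

```python
import torch

def gradient_diversity(grads):
    """Gradient diversity of a list of gradient tensors (Yin et al., 2018):
    sum of squared norms divided by the squared norm of the sum."""
    flat = [g.flatten() for g in grads]
    sum_sq_norms = sum(g.pow(2).sum() for g in flat)
    sq_norm_of_sum = torch.stack(flat).sum(dim=0).pow(2).sum()
    return sum_sq_norms / sq_norm_of_sum

def inter_epoch_diversity(grad_history, r):
    """MuPPET-style metric: per-layer diversity of the last-minibatch
    gradients from epochs k-r .. k, averaged over all layers."""
    per_layer = [gradient_diversity(grads[-(r + 1):])
                 for grads in grad_history.values()]
    return torch.stack(per_layer).mean()
```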

  9. Precision Switching Policy: Methodology
• Every r epochs:
  - The inter-epoch gradient diversity is calculated
  - Given the epoch e at which the precision last switched from level q_{n-1} to q_n, and the current epoch j:
$$
S(j) = \left\{ \Delta_S(x)_k \;:\; e \leq k \leq j \right\}, \qquad p = \frac{\max S(j)}{\Delta_S(x)_j}
$$
• An empirically chosen decaying threshold is placed on p:
$$
T = \beta + \gamma e^{-\mu j}
$$
• If p violates T more than a set number of times, a precision switch to the next level is triggered
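The decision rule can be sketched as the small state machine below; the values of β, γ, μ and the violation limit ν are placeholders for the empirically chosen constants (the paper's actual values are not reproduced here).

```python
import math

class PrecisionSwitchPolicy:
    """Sketch of a MuPPET-style precision-switching decision."""
    def __init__(self, beta=1.0, gamma=1.0, mu=0.1, nu=3):
        self.beta, self.gamma, self.mu, self.nu = beta, gamma, mu, nu
        self.history = []       # gradient diversities since the last switch
        self.violations = 0

    def update(self, epoch, diversity):
        """Record this epoch's diversity; return True if precision should increase."""
        self.history.append(diversity)
        p = max(self.history) / diversity
        threshold = self.beta + self.gamma * math.exp(-self.mu * epoch)
        if p > threshold:
            self.violations += 1
        if self.violations > self.nu:
            # Trigger a switch and reset the state for the new precision level
            self.history, self.violations = [], 0
            return True
        return False
```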

  10. Precision Switching Policy: Hypotheses
• Intuition
  - Low gradient diversity increases the value of p
  - The likelihood of observing r gradients across r epochs that have low diversity at early stages of training is low
  - If this happens, it may imply that information is being lost due to quantisation (high p value)
• Generalisability of the ratio $p = \max S(j) \,/\, \Delta_S(x)_j$
  - Generalisability across epochs
  - Generalisability across networks and datasets

  11. Precision Switching Policy: Generalisability
• Similar values across various networks and datasets
• The decaying threshold accounts for volatility in the early stages of training

  12. Precision Switching Policy: Adaptability
• Is it better than randomly switching?
• Does it tailor to the network and dataset?

  13. Precision Switching Policy: Performance (Accuracy)
• Networks: AlexNet, ResNet18/20, GoogLeNet
• Datasets: CIFAR10, CIFAR100 (hyperparameter tuning), ImageNet (application)
• Precisions: 8-, 12-, 14- and 16-bit dynamic fixed-point (emulated) and 32-bit floating-point
• Training with MuPPET matches the accuracy of standard FP32 training when trained with identical SGD hyperparameters

  14. Quantised Training

  15-20. Quantisation (one scheme built up over several slides)
• Quantised signed INT from the original value:
$$
x_q^{\{weights,\,act\}} = \left\lfloor x^{\{weights,\,act\}} \cdot 2^{\,s^{\{weights,\,act\}}} \right\rfloor
$$
• Scale factor, computed from the wordlength upper bound $UB$, the wordlength lower bound $LB$, and the maximum and minimum values $X_{max}$, $X_{min}$ of the weights/activations of the current layer:
$$
s^{\{weights,\,act\}} = \left\lfloor \log_2\!\left( \min\!\left( \frac{UB + 0.5}{X_{max}},\; \frac{LB - 0.5}{X_{min}} \right) \right) \right\rfloor
$$
• The quantisation configuration is defined per network layer and per optimisation (precision) level by the network wordlength and the corresponding scale factors
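A sketch of this quantisation step in PyTorch, assuming per-tensor scaling; the function name, the rounding mode and the handling of all-positive or all-negative tensors are illustrative choices, not taken from the MuPPET code.

```python
import math
import torch

def dynamic_fixed_point_quantise(x, wordlength):
    """Quantise a tensor to signed fixed point with a power-of-two scale factor."""
    ub = 2 ** (wordlength - 1) - 1     # wordlength upper bound (e.g. 127 for 8-bit)
    lb = -(2 ** (wordlength - 1))      # wordlength lower bound (e.g. -128 for 8-bit)

    # Scale factor: largest power of two that maps the tensor's range into [lb, ub]
    x_max, x_min = x.max().item(), x.min().item()
    candidates = []
    if x_max > 0:
        candidates.append((ub + 0.5) / x_max)
    if x_min < 0:
        candidates.append((lb - 0.5) / x_min)
    s = math.floor(math.log2(min(candidates))) if candidates else 0

    # Quantised signed INT (round-to-nearest here; the exact rounding mode is an assumption)
    x_q = torch.round(x * (2 ** s)).clamp(lb, ub)
    return x_q, s
```

A tensor quantised this way can be mapped back to real values as `x_q / 2**s`, which is what an FP32 emulation of the fixed-point arithmetic would operate on.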

  21. Quantisation Emulation
• No ML framework support for reduced-precision hardware (e.g. NVIDIA Turing architecture)
• GEMM times profiled using NVIDIA's CUTLASS library
• Training profiled through PyTorch
  - Quantisation of weights, activations and gradients
  - All gradient diversity calculations
• 12- and 14-bit fixed point profiled as 16-bit fixed point
  - Included for future custom-precision hardware
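To illustrate how such an emulation might look in plain FP32 PyTorch, the sketch below (reusing the hypothetical `dynamic_fixed_point_quantise` from the previous section) snaps inputs and weights to their fixed-point grids, runs the GEMM in floating point, and rescales the result; it mirrors the idea of emulated quantised computation rather than the framework's actual code.

```python
import torch

def emulated_quantised_linear(x, weight, wordlength=8):
    """Emulate a fixed-point linear layer: quantise operands, FP32 GEMM, rescale."""
    x_q, s_x = dynamic_fixed_point_quantise(x, wordlength)
    w_q, s_w = dynamic_fixed_point_quantise(weight, wordlength)
    # The accumulation is integer-valued but stored and executed in FP32,
    # which is how reduced-precision arithmetic is emulated without hardware support
    acc = x_q @ w_q.t()
    return acc / (2 ** (s_x + s_w))
```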

  22. Performance (Wall-clock time) - Current Implementation

  23. Performance (Wall-clock time) - Ideal
