Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs - PowerPoint PPT Presentation


SLIDE 1

intelligent Digital Systems Lab
www.imperial.ac.uk/idsl
Dept. of Electrical and Electronic Engineering

Multi-Precision Policy Enforced Training (MuPPET)

A precision-switching strategy for quantised fixed-point training of CNNs

Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, Christos-Savvas Bouganis


SLIDE 2

Training of Convolutional Neural Networks (CNNs)

Typical Datasets
  • CIFAR10: 10 categories, 60,000 images
  • CIFAR100: 100 categories, 60,000 images
  • ImageNet: 1,000 categories, 1.2 million images

Typical Networks

SLIDE 3

Motivation

  • Enable wider experimentation with training, e.g. Neural Architecture Search
  • Increase productivity of deep learning practitioners
  • Reduce cost of training in large data centers
  • Perform training on edge devices

Reducing training time and power consumption means exploiting low-precision hardware capabilities:
  • NVIDIA Turing Architecture (GPU)
  • Microsoft Brainwave (FPGA)
  • Google TPU (ASIC)
SLIDE 4

Goal

Perform quantised training of CNNs while maintaining FP32 accuracy and producing a model that performs inference at FP32

SLIDE 5

Contributions of this paper

  • A generalisable policy that decides at run time the appropriate points to increase the precision of the training process without impacting final test accuracy
  • Datasets: CIFAR10, CIFAR100, ImageNet
  • Networks: AlexNet, ResNet, GoogLeNet
  • Up to 1.84x training-time improvement with negligible loss in accuracy
  • Extending training to bit-widths as low as 8-bit to leverage the low-precision capabilities of modern processing systems
  • An open-source PyTorch implementation of the MuPPET framework with emulated quantised computations

SLIDE 6

Background: Mixed Precision Training

  • Current state of the art: mixed-precision training (Micikevicius et al., 2018) [6]
  • Maintains a master copy of the weights at FP32
  • Quantises weights and activations to FP16 for all computations
  • Accumulates FP16 gradients into the FP32 master copy of the weights
  • Incurs an accuracy drop if precision below FP16 is utilised

[6] Micikevicius, P. et al. Mixed Precision Training. In International Conference on Learning Representations (ICLR), 2018
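As a point of reference, this style of mixed-precision training is exposed in stock PyTorch through torch.cuda.amp. A minimal sketch, with a placeholder model and a dummy one-batch loader standing in for real training data:

```python
import torch

model = torch.nn.Linear(512, 10).cuda()                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()                      # loss scaling avoids FP16 underflow

loader = [(torch.randn(32, 512).cuda(),                   # dummy data, one batch
           torch.randint(0, 10, (32,)).cuda())]

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                       # forward pass runs in FP16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()                         # scaled FP16 gradients
    scaler.step(optimizer)                                # unscales, updates FP32 master weights
    scaler.update()
```

The FP32 master copy lives in the optimiser's view of the parameters; only the forward/backward arithmetic drops to FP16, which is the property MuPPET pushes further down, to 8-bit fixed-point.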

SLIDE 7

Multilevel optimisation formulation

  • Hierarchical formulation that progressively increases the precision of computations, from the lowest quantisation level up to FP32:

$$\min_{x^{(q_1)} \in \mathbb{R}^D} \mathrm{Loss}\big(g_{q_1}(x^{(q_1)})\big) \;\rightarrow\; \min_{x^{(q_2)} \in \mathbb{R}^D} \mathrm{Loss}\big(g_{q_2}(x^{(q_2)})\big) \;\rightarrow\; \cdots \;\rightarrow\; \min_{x_{\mathrm{FP32}} \in \mathbb{R}^D} \mathrm{Loss}\big(g(x_{\mathrm{FP32}})\big)$$

The proposed policy decides at run time the epochs at which these precision changes need to be made. An FP32 master copy of the weights is maintained throughout.
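Written out with the warm start made explicit (a paraphrase of the nesting above, not the paper's exact notation), each level's solution initialises the next, more precise level:

$$x^{*(q_{i+1})} = \arg\min_{x^{(q_{i+1})} \in \mathbb{R}^D} \mathrm{Loss}\big(g_{q_{i+1}}(x^{(q_{i+1})})\big), \qquad x^{(q_{i+1})}_0 = x^{*(q_i)}$$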

SLIDE 8

Background: Gradient Diversity

  • Yin et al., 2018 compute the diversity between minibatches within an epoch:

$$\Delta_S(x) = \frac{\sum_{j=1}^{n} \|\nabla g_j(x)\|_2^2}{\big\|\sum_{j=1}^{n} \nabla g_j(x)\big\|_2^2} = \frac{\sum_{j=1}^{n} \|\nabla g_j(x)\|_2^2}{\sum_{j=1}^{n} \|\nabla g_j(x)\|_2^2 + \sum_{j \neq k} \langle \nabla g_j(x), \nabla g_k(x) \rangle}$$

where $\nabla g_j(x)$ is the gradient of the weights for minibatch $j$.

  • Modified for MuPPET to compute the diversity between minibatches across epochs:

$$\Delta_S(x)_j = \frac{1}{|\mathcal{L}|} \sum_{\forall l \in \mathcal{L}} \frac{\sum_{k=j-r}^{j} \|\nabla g_l^k(x)\|_2^2}{\big\|\sum_{k=j-r}^{j} \nabla g_l^k(x)\big\|_2^2}$$

where $\mathcal{L}$ is the set of network layers, $r$ is the resolution in epochs, $\nabla g_l^k(x)$ is the gradient of the last minibatch for layer $l$ in epoch $k$, and $\Delta_S(x)_j$ is the average gradient diversity across all layers over the last $r$ epochs, up to epoch $j$.
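To make the base quantity concrete, here is a minimal PyTorch sketch of the intra-epoch gradient diversity of Yin et al., computed over a list of per-minibatch gradient tensors (the helper name and the flattening into one vector per minibatch are choices of this sketch):

```python
import torch

def gradient_diversity(grads: list) -> float:
    """Delta_S(x): sum of squared per-minibatch gradient norms divided by
    the squared norm of the summed gradients (Yin et al., 2018)."""
    flat = [g.flatten() for g in grads]                     # one vector per minibatch
    numerator = sum(torch.dot(g, g).item() for g in flat)   # sum_j ||grad_j||^2
    summed = torch.stack(flat).sum(dim=0)                   # sum_j grad_j
    return numerator / torch.dot(summed, summed).item()     # / ||sum_j grad_j||^2
```

Identical minibatch gradients give Delta_S of roughly 1/n (low diversity), while mutually orthogonal gradients give Delta_S = 1; MuPPET's inter-epoch variant applies the same ratio per layer over the last r epochs and averages it across layers.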

SLIDE 9

Precision Switching Policy: Methodology

  • Every r epochs, the inter-epoch gradient diversity $\Delta_S(x)_j$ is calculated
  • Given an epoch e at which the precision switched from level q_{n-1} to q_n, and the current epoch j:

$$S(j) = \{\Delta_S(x)_i \;\; \forall\, e \leq i \leq j\}, \qquad p = \frac{\max S(j)}{\Delta_S(x)_j}$$

  • An empirically chosen decaying threshold is placed on p:

$$T = \alpha + \beta e^{-\lambda j}$$

  • If p violates T more than a set number of times, a precision switch to the next quantisation level is triggered (a sketch of this decision follows below)

SLIDE 10

Precision Switching Policy: Hypotheses

$$p = \frac{\max S(j)}{\Delta_S(x)_j}$$

Generalisability across epochs
  • Intuition:
    • Low gradient diversity increases the value of p
    • The likelihood of observing r gradients across r epochs that have low diversity at early stages of training is low
    • If this happens, it may imply that information is being lost due to quantisation (high p value)

Generalisability across networks and datasets

SLIDE 11

Precision Switching Policy: Generalisability

  • Similar p values observed across various networks and datasets
  • The decaying threshold accounts for volatility in the early stages of training
SLIDE 12

Precision Switching Policy: Adaptability

  • Is it better than randomly switching?
  • Does it tailor to network and dataset?
SLIDE 13

Precision Switching Policy: Performance (Accuracy)

  • Networks: AlexNet, ResNet18/20, GoogLeNet
  • Datasets: CIFAR10, CIFAR100 (hyperparameter tuning), ImageNet (application)
  • Precisions: 8-, 12-, 14-, 16-bit dynamic fixed-point (emulated) and 32-bit floating-point
  • Training with MuPPET matches the accuracy of standard FP32 training when trained with identical SGD hyperparameters

SLIDE 14

Quantised Training

SLIDE 15

Quantisation

Each original value (a weight or activation) is mapped to a quantised signed integer:

$$x_{quant}^{\{wght,\,act\}} = \left\lfloor x^{\{wght,\,act\}} \cdot 2^{s^{\{wght,\,act\}}} \right\rceil$$

The scale factor $s$ is chosen so that the values fit the wordlength's representable range:

$$s^{\{wght,\,act\}} = \left\lfloor \log_2 \left( \min \left( \frac{UB + 0.5}{X_{max}^{\{wght,\,act\}}},\; \frac{LB - 0.5}{X_{min}^{\{wght,\,act\}}} \right) \right) \right\rfloor$$

where $UB$ and $LB$ are the wordlength's upper and lower bounds, and $X_{max}$, $X_{min}$ are the largest and smallest values among a layer's weights or activations. The quantisation configuration ties the network wordlength and optimisation level to the current layer, across all network layers.

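A minimal sketch of this scheme, assuming a signed wordlength n so that UB = 2^(n-1) - 1 and LB = -2^(n-1); the function name and the all-zero guard are illustrative:

```python
import math
import torch

def quantise(x: torch.Tensor, wordlength: int):
    """Dynamic fixed-point quantisation: choose the power-of-two scale
    that fits the tensor's range into a signed integer wordlength, then
    round to nearest. Returns the integer-valued tensor and exponent s."""
    ub = 2 ** (wordlength - 1) - 1                 # wordlength upper bound UB
    lb = -(2 ** (wordlength - 1))                  # wordlength lower bound LB
    x_max, x_min = x.max().item(), x.min().item()
    candidates = []
    if x_max > 0:
        candidates.append((ub + 0.5) / x_max)
    if x_min < 0:
        candidates.append((lb - 0.5) / x_min)      # both negative, so the ratio is positive
    if not candidates:                             # all-zero tensor: nothing to scale
        return x.clone(), 0
    s = math.floor(math.log2(min(candidates)))     # s = floor(log2(min(...)))
    return torch.round(x * 2.0 ** s), s            # the +/-0.5 margins keep rounding in range
```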

SLIDE 21

Quantisation Emulation

  • No ML framework support for reduced-precision hardware (e.g. NVIDIA Turing architecture)
  • GEMM profiled using NVIDIA's CUTLASS library
  • Training profiled through PyTorch:
    • Quantisation of weights, activations and gradients
    • All gradient diversity calculations
  • 12- and 14-bit fixed-point profiled as 16-bit fixed-point
    • Included for future custom-precision hardware

See the emulation sketch below.
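Building on that, emulated quantised computation can be sketched as: quantise both GEMM operands, multiply the integer-valued FP32 tensors, and rescale the accumulator. This reuses the quantise() sketch above and is an illustration, not the MuPPET codebase's API:

```python
def emulated_quantised_linear(x, w, wordlength=8):
    """Emulate an n-bit fixed-point GEMM in FP32 arithmetic."""
    xq, sx = quantise(x, wordlength)            # integer-valued FP32 operands
    wq, sw = quantise(w, wordlength)
    return (xq @ wq.t()) * 2.0 ** (-(sx + sw))  # dequantise the accumulated result
```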
SLIDE 22

Performance (Wall-clock time)

Current Implementation

SLIDE 23

Performance (Wall-clock time)

Ideal