Quantization for TVM
Ziheng Jiang, TVM Conference, Dec 12th 2018


SLIDE 1

Quantization for TVM

Ziheng Jiang TVM Conference, Dec 12th 2018

SLIDE 2

What is Quantization?

[Figure omitted; source: Han et al.]

Converting weight values from floating point to low-bit integers (e.g., 8-bit precision) without a significant accuracy drop.
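Concretely, a minimal numpy sketch of symmetric 8-bit weight quantization (an illustrative assumption, not TVM's implementation; the max-abs scale choice is just one common option):

import numpy as np

def quantize_int8(w):
    # choose a scale so the largest-magnitude weight maps to the int8 extreme
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # recover an approximation of the original float weights
    return q.astype(np.float32) * scale

w = np.random.randn(64).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale)).max())  # small rounding error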

SLIDE 3

[Diagram: train in a DL framework, convert via the frontend to Relay (high-level graph IR), apply quantization, deploy.]

Gain compression & acceleration:

  • Less storage space
  • Faster arithmetic operations
  • Friendly to accelerators and ultra-low-power embedded devices

SLIDE 4

Choice Spaces for Quantization

  • number of bits: 4-bit, 8-bit, 16-bit
  • quantization scheme: symmetric, asymmetric, etc.
  • hardware constraints: e.g., prefer integer shifts over floating-point multiplication (see the sketch below)
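For the last point: if the requantization scale is constrained to a power of two, the float multiply can be replaced by an integer add-and-shift. A minimal sketch under that assumption (names are illustrative, not TVM's API):

import numpy as np

acc = np.array([12345, -6789], dtype=np.int32)   # int32 conv accumulators

# float requantization: scale the accumulator by 1/256 and round
out_float = np.clip(np.round(acc * (1.0 / 256.0)), -128, 127).astype(np.int8)

# hardware-friendly version: with scale = 2**-8, round-to-nearest becomes
# "add half the divisor, then arithmetic right shift by 8"
out_shift = np.clip((acc + (1 << 7)) >> 8, -128, 127).astype(np.int8)

assert (out_float == out_shift).all()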

Goal

Instead of proposing "the only right way to achieve quantization in TVM," we would like to build a quantization workflow that can be customized flexibly.

SLIDE 5

[Diagram: the graph at three stages. Original: Conv2D → BatchNorm → ReLU → Conv2D, with weights W1, W2, all edges f32. After Annotate: BatchNorm is lowered to Mul/Add, and SimulatedQuantize nodes are inserted on the inputs of Conv2D, Mul/Add, and ReLU (edges still f32). After Realize: each SimulatedQuantize becomes concrete Mul/Shift/Clip/Cast ops, with Conv2D taking i8 inputs and producing i32 accumulators.]

SimQ simulates the rounding and saturation errors introduced by quantization. Its arguments get tuned during calibration.

SimQ(x; nbit, r, sign) = Clip(Round(x / r * 2^(nbit − sign))) * r / 2^(nbit − sign)
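A runnable numpy sketch of this formula (illustrative; the clip bounds to the representable integer range are an assumption, and the actual Relay pass differs in detail):

import numpy as np

def simulated_quantize(x, nbit=8, r=8.0, sign=True):
    # number of positive quantization levels implied by the bit budget and sign bit
    levels = 2.0 ** (nbit - sign)
    # scale onto the integer grid, round, and saturate like the target dtype would
    q = np.clip(np.round(x / r * levels), -levels if sign else 0.0, levels - 1)
    # map back to float: the value stays f32 but now carries quantization error
    return q * r / levels

x = np.array([-10.0, -5.0, 0.1, 5.0, 10.0], dtype=np.float32)
print(simulated_quantize(x))  # snapped to the 8-bit grid, saturated near ±8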

SLIDE 6

Code Sample

# user can override the annotate function for an op
@register_annotate_function("nn.conv2d", override=True)
def annotate_conv2d(ref_call, new_args, ctx):
    lhs, rhs = new_args
    lhs = attach_simulated_quantize(lhs, sign=False, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=False, rounding='stochastic_round')
    return expr.Call(ref_call.op, [lhs, rhs], ref_call.attrs)

# assuming we have an existing mxnet model, convert it to a relay graph
graph, params = relay.frontend.from_mxnet(mxnet_model)

# quantize the relay graph under a chosen configuration
with qconfig(nbit_dict={QFieldKind.ACTIVATION: 24},
             global_scale=8.0,
             skip_k_conv=1):
    qgraph, qparams = quantize(graph, params)

# ...build and deploy it locally or remotely with tvm

SLIDE 7

End-to-End Performance

Global Scale    Accuracy
2.0             64.1%
4.0             68.1%
8.0             69.5%
16.0            69.6%

Accuracy with ResNet18 under different global scales (original: 70.8%)
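If global_scale plays the role of the range r in the SimQ formula above (an assumption based on the code sample), then 8-bit signed quantization with a global scale of 8.0 has step size r / 2^(nbit − sign) = 8.0 / 128 = 0.0625 and covers roughly [−8.0, 8.0): a scale of 2.0 saturates more activations, while 16.0 spends resolution on values that rarely occur, which matches the accuracy trend in the table.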

Demonstration with 8-bit Symmetric Quantization

Time (ms)    Cortex-A53    VTA
ResNet18     307.09        64.87
MobileNet    131.14        51.96