

  1. CMSC5743 L05: Quantization Bei Yu (Latest update: October 12, 2020) Fall 2020 1 / 25

  2. Overview Fixed-Point Representation Non-differentiable Quantization Differentiable Quantization Reading List 2 / 25

  3. Overview Fixed-Point Representation Non-differentiable Quantization Differentiable Quantization Reading List 3 / 25

  4. These slides contain/adapt materials developed by:
  ◮ Hardware for Machine Learning, Shao, Spring 2020 @ UCB
  ◮ 8-bit Inference with TensorRT
  ◮ Junru Wu et al. (2018). “Deep k-Means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions”. In: Proc. ICML
  ◮ Shijin Zhang et al. (2016). “Cambricon-X: An accelerator for sparse neural networks”. In: Proc. MICRO. IEEE, pp. 1–12
  ◮ Jorge Albericio et al. (2016). “Cnvlutin: Ineffectual-neuron-free deep neural network computing”. In: ACM SIGARCH Computer Architecture News 44.3, pp. 1–13
  3 / 25

  5. Scientific Notation (Decimal representation) Example: 6.02 × 10^23, where 6.02 is the mantissa, 23 is the exponent, 10 is the radix (base), and the decimal point sits after the leading digit.
  • Normalized form: no leading 0s (exactly one digit to the left of the decimal point)
  • Alternatives for representing 1/1,000,000,000:
    • Normalized: 1.0 × 10^-9
    • Not normalized: 0.1 × 10^-8, 10.0 × 10^-10
  4 / 25

  6. Scientific Notation (Binary representation) Example: 1.01₂ × 2^-1, where 1.01 is the mantissa, -1 is the exponent, 2 is the radix (base), and the separator is the "binary point".
  • Computer arithmetic that supports this is called floating point, because it represents numbers where the binary point is not fixed, as it is for integers.
  5 / 25

  7. Normalized Form
  ◮ Floating-point numbers can have multiple forms, e.g.
    0.232 × 10^4 = 2.32 × 10^3 = 23.2 × 10^2 = 2320. × 10^0 = 232000. × 10^-2
  ◮ It is desirable for each number to have a unique representation => Normalized Form
  ◮ We normalize mantissas to the range [1..R), where R is the base, e.g.:
    ◮ [1..2) for BINARY
    ◮ [1..10) for DECIMAL
  6 / 25

  8. Floating-Point Representation
  • Normal format: +1.xxx…x₂ × 2^(yyy…y₂)
  • Bit layout (32 bits): bit 31 = S (1 bit), bits 30–23 = Exponent (8 bits), bits 22–0 = Significand (23 bits)
  • S represents the sign, Exponent represents the y's, Significand represents the x's
  • Represents numbers as small as 2.0 × 10^-38 and as large as 2.0 × 10^38
  7 / 25

  9. Floating-Point Representation (FP32)
  • IEEE 754 Floating-Point Standard
  • Called Biased Notation, where the bias is the number subtracted from the Exponent field to get the real exponent
  • IEEE 754 uses a bias of 127 for single precision: subtract 127 from the Exponent field to get the actual exponent
  • 1023 is the bias for double precision
  • Summary (single precision, or fp32): bit 31 = S (1 bit), bits 30–23 = Exponent (8 bits), bits 22–0 = Significand (23 bits)
  • Value = (-1)^S × (1 + Significand) × 2^(Exponent - 127)
  8 / 25

  10. Floating-Point Representation (FP16)
  • IEEE 754 Floating-Point Standard
  • Called Biased Notation, where the bias is the number subtracted from the Exponent field to get the real exponent
  • IEEE 754 uses a bias of 15 for half precision: subtract 15 from the Exponent field to get the actual exponent
  • Summary (half precision, or fp16): bit 15 = S (1 bit), bits 14–10 = Exponent (5 bits), bits 9–0 = Significand (10 bits)
  • Value = (-1)^S × (1 + Significand) × 2^(Exponent - 15)
  9 / 25
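To make the biased-notation formula concrete, here is a minimal decoding sketch (my own illustration, not code from the slides): it splits a raw bit pattern into sign, exponent, and significand fields and applies (-1)^S × (1 + Significand) × 2^(Exponent - bias). It covers normalized numbers only; subnormals, infinities, and NaNs are ignored.

```python
# Minimal sketch (illustration only): decode an IEEE 754 bit pattern using the
# biased-notation formula (-1)^S * (1 + significand) * 2^(exponent - bias).
# Normalized numbers only; subnormals, infinities and NaNs are not handled.
import struct

def decode(bits, exp_bits, frac_bits, bias):
    sign = (bits >> (exp_bits + frac_bits)) & 0x1
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    significand = fraction / (1 << frac_bits)      # the fractional "xxx...x" part
    return (-1) ** sign * (1 + significand) * 2.0 ** (exponent - bias)

# Single precision (fp32): 1 sign bit, 8 exponent bits (bias 127), 23 significand bits.
print(decode(0x41200000, exp_bits=8, frac_bits=23, bias=127))    # 10.0
# Cross-check against the machine's own fp32 interpretation.
print(struct.unpack('>f', (0x41200000).to_bytes(4, 'big'))[0])   # 10.0

# Half precision (fp16): 1 sign bit, 5 exponent bits (bias 15), 10 significand bits.
print(decode(0x3C00, exp_bits=5, frac_bits=10, bias=15))         # 1.0
```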

  11. Question: What is the IEEE single-precision number 40C0 0000₁₆ in decimal? 10 / 25

  12. Question: What is -0.5₁₀ in IEEE single-precision binary floating-point format? 11 / 25

  13. Fixed-Point Arithmetic
  • Integers with a binary point and a bias
  • "slope and bias": y = s*x + z
  • Qm.n: m (# of integer bits), n (# of fractional bits)
  3-bit codes under different slope/bias settings (column headers show the per-bit weights):

  Bits | s=1, z=0      | s=1/4, z=0      | s=4, z=0      | s=1.5, z=10
       | (2^2 2^1 2^0) | (2^0 2^-1 2^-2) | (2^4 2^3 2^2) | (2^2 2^1 2^0)
  000  | 0             | 0               | 0             | 1.5*0 + 10
  001  | 1             | 1/4             | 4             | 1.5*1 + 10
  010  | 2             | 2/4             | 8             | 1.5*2 + 10
  011  | 3             | 3/4             | 12            | 1.5*3 + 10
  100  | 4             | 1               | 16            | 1.5*4 + 10
  101  | 5             | 5/4             | 20            | 1.5*5 + 10
  110  | 6             | 6/4             | 24            | 1.5*6 + 10
  111  | 7             | 7/4             | 28            | 1.5*7 + 10
  12 / 25
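As a rough illustration of the slope-and-bias view (not code from the lecture), the sketch below maps a real value to a 3-bit code with y = s*q + z and back; the round-to-nearest and clipping policy is an assumption on my part.

```python
# Sketch of the "slope and bias" view of fixed point: a b-bit code q represents
# the real value y = s*q + z. The round-to-nearest and clipping policy below is
# an assumption for illustration, not something fixed by the slide.

def quantize(y, s, z, bits=3):
    q = round((y - z) / s)                  # nearest code
    return max(0, min((1 << bits) - 1, q))  # clip to the representable range

def dequantize(q, s, z):
    return s * q + z

# Reproduce the last column of the table above: s = 1.5, z = 10, 3-bit codes.
print([dequantize(q, s=1.5, z=10) for q in range(8)])   # 10.0, 11.5, ..., 20.5

# Unsigned Q1.2 fixed point (1 integer bit, 2 fractional bits) is the special
# case s = 2**-2, z = 0 on a 3-bit code, i.e. the second column of the table.
print(quantize(0.75, s=0.25, z=0, bits=3))               # code 3 -> 3 * 0.25 = 0.75
```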

  14. Hardware Implications: Multipliers [figure: fixed-point multiplier vs. floating-point multiplier] 13 / 25

  15. Overview Fixed-Point Representation Non-differentiable Quantization Differentiable Quantization Reading List 14 / 25

  16. Greedy Layer-wise Quantization¹: Quantization flow
  ◮ For a fixed-point number, its representation is:
      n = Σ_{i=0}^{bw-1} B_i · 2^i · 2^-fl,
    where bw is the bit width and fl is the fractional length, which is dynamic for different layers and feature map sets while static in one layer.
  ◮ Weight quantization: find the optimal fl for the weights:
      fl = arg min_fl Σ |W_float - W(bw, fl)|,
    where W is a weight and W(bw, fl) represents the fixed-point format of W under the given bw and fl.
  ¹ Jiantao Qiu et al. (2016). “Going deeper with embedded FPGA platform for convolutional neural network”. In: Proc. FPGA, pp. 26–35.
  14 / 25
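A minimal sketch of this weight-quantization step (my illustration; the fl candidate range and the signed round-and-clip behaviour are assumptions): for a fixed bit width bw, sweep candidate fractional lengths and keep the one minimizing the summed absolute error.

```python
# Sketch of greedy layer-wise weight quantization (illustration only): for a
# fixed bit width bw, sweep candidate fractional lengths fl and keep the one
# minimizing sum |W_float - W(bw, fl)|. The candidate range and the signed
# round-and-clip policy are assumptions.
import numpy as np

def to_fixed(w, bw, fl):
    """Round w to signed bw-bit fixed point with fractional length fl."""
    qmin, qmax = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(w * 2.0 ** fl), qmin, qmax)
    return q * 2.0 ** -fl                        # value = integer code * 2^-fl

def best_fl(x, bw=8, candidates=range(-8, 16)):
    errors = {fl: np.abs(x - to_fixed(x, bw, fl)).sum() for fl in candidates}
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000)             # toy layer weights
fl = best_fl(w, bw=8)
print(fl, np.abs(w - to_fixed(w, 8, fl)).max())  # chosen fl and worst-case error
```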

  17. Greedy Layer-wise Quantization: Quantization flow
  [figure: quantization flow. Input images and a floating-point CNN model enter the weight quantization phase (weight dynamic range analysis, then the weight quantization configuration) followed by the data quantization phase (layer-by-layer feature-map dynamic range analysis from Layer 1 to Layer N, finding the optimal quantization strategy), producing the weight and data quantization configuration and a fixed-point CNN model.]
  ◮ Feature quantization: find the optimal fl for features:
      fl = arg min_fl Σ |x⁺_float - x⁺(bw, fl)|,
    where x⁺ represents the result of a layer when we denote the computation of a layer as x⁺ = A · x.
  15 / 25
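The data (feature) quantization phase can be sketched the same way (again my illustration; the single linear layer x⁺ = A·x and the calibration batch are toy assumptions): run sample inputs through the layer in floating point and pick the fl that minimizes the error on the collected outputs.

```python
# Sketch of the data (feature) quantization phase (illustration only): run a
# calibration batch through a layer in floating point and pick the fl that
# minimizes sum |x+_float - x+(bw, fl)| over the collected outputs. The linear
# layer x+ = A @ x and the calibration data are toy assumptions.
import numpy as np

def to_fixed(x, bw, fl):
    """Round x to signed bw-bit fixed point with fractional length fl."""
    qmin, qmax = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    return np.clip(np.round(x * 2.0 ** fl), qmin, qmax) * 2.0 ** -fl

rng = np.random.default_rng(1)
A = rng.normal(scale=0.1, size=(64, 128))    # toy layer weights
calib = rng.normal(size=(128, 32))           # small calibration batch of inputs
x_plus = A @ calib                           # floating-point layer outputs x+ = A x

errors = {fl: np.abs(x_plus - to_fixed(x_plus, 8, fl)).sum() for fl in range(-8, 16)}
fl_feat = min(errors, key=errors.get)        # fl minimizing the feature error
print(fl_feat)
```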

  18. Dynamic-Precision Data Quantization Results [table: results not recoverable from the transcript] 16 / 25

  19. Industrial Implementations – Nvidia TensorRT: INT8 Inference, No-Saturation Quantization
  [figure: no saturation maps -|max| … 0 … +|max| linearly onto -127 … 0 … 127; the alternative saturates values above a |threshold| T to 127. No saturation: significant accuracy loss in general. Saturation: no accuracy improvement for weights, improved accuracy for activations; which |threshold| is optimal?]
  ◮ Map the maximum value to 127, with uniform step length.
  ◮ Suffers from outliers.
  17 / 25
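A tiny sketch of the no-saturation mapping (my illustration; the function name and toy data are made up): the scale is |max| / 127, so a single outlier stretches the scale and most values collapse into a handful of INT8 bins.

```python
# Sketch of no-saturation INT8 quantization (illustration only): map
# -|max| ... +|max| linearly onto -127 ... 127. A single outlier stretches the
# scale for every other value.
import numpy as np

def quantize_no_saturation(x):
    scale = np.abs(x).max() / 127.0          # |max| maps to 127
    q = np.round(x / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
acts = rng.normal(scale=1.0, size=10000)     # toy activations
acts[0] = 40.0                               # one outlier

q, scale = quantize_no_saturation(acts)
print(scale)                                 # ~0.315: typical values occupy only a few INT8 bins
print(np.abs(acts - q * scale).mean())       # mean reconstruction error
```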

  20. Industrial Implementations – Nvidia TensorRT: INT8 Inference, Saturation Quantization
  [figure: values above |threshold| T are saturated to 127, values below -|T| to -127]
  ◮ Set a threshold |T| and treat it as the maximum value (values beyond ±|T| are clipped).
  ◮ Divide the value domain into 2048 groups (histogram bins).
  ◮ Traverse all the possible thresholds to find the best one with minimum KL divergence.
  18 / 25
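The threshold search can be sketched as a simplified version of TensorRT's entropy calibration (this is my simplification, not the published pseudo-code; the real procedure handles empty bins and outliers more carefully, and the function names are illustrative): histogram the absolute activations into 2048 bins, then for each candidate threshold compare the clipped reference distribution P against its 128-level quantized approximation Q and keep the threshold with minimum KL divergence.

```python
# Simplified sketch of entropy (KL) calibration for the INT8 threshold:
# histogram |activations| into 2048 bins, then for each candidate threshold
# compare the clipped reference distribution P with its 128-level quantized
# approximation Q and keep the threshold with minimum KL divergence.
# This simplifies the published pseudo-code (outlier/empty-bin handling differs).
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

def entropy_calibrate(activations, num_bins=2048, num_levels=128):
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    hist = hist.astype(np.float64)
    best_i, best_div = num_levels, np.inf
    for i in range(num_levels, num_bins + 1):
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()                  # fold outliers into the last bin
        # Build Q: merge the i bins into num_levels groups, then expand back.
        groups = np.array_split(hist[:i], num_levels)
        q = np.concatenate([np.full(len(g), g.sum() / len(g)) for g in groups])
        div = kl_divergence(p, q)
        if div < best_div:
            best_i, best_div = i, div
    return edges[best_i]                         # the calibrated |threshold|

rng = np.random.default_rng(0)
acts = rng.laplace(scale=1.0, size=100000)       # heavy-tailed toy activations
T = entropy_calibrate(acts)
print(T, T / 127.0)                              # threshold and resulting INT8 scale
```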
