Rate Distortion for Model Compression: From Theory To Practice


SLIDE 1

Rate Distortion for Model Compression: From Theory To Practice

Weihao Gao∗, Yu-Han Liu†, Chong Wang‡ and Sewoong Oh§

∗UIUC, †Google, ‡Bytedance, §Univ of Washington

June 10, 2019



SLIDE 7

Motivation

Neural networks are becoming more and more powerful.
They are also becoming larger and larger:

LeNet 40K, AlexNet 62M, BERT 110M (base) / 340M (large) parameters

Compressing models is necessary to save:

training and inference time
storage space, e.g., for mobile apps

Two fundamental questions about model compression:

1. Is there any theoretical understanding of the fundamental limit of model compression algorithms?

2. How can theoretical understanding help us improve practical compression algorithms?



SLIDE 10

Fundamental limit for model compression

Trade-off between compression ratio and quality of compressed model

Figure 1: Trade-off between compression ratio and cross entropy loss (curves: uncompressed, baseline, proposed).

Fundamental question: given a pretrained model $f_w(x)$, how well can we compress the model at a given compression ratio?



SLIDE 15

Rate distortion for model compression

We bring the tool of rate distortion theory from information theory.

Rate: average number of bits to represent the parameters

Distortion: difference between the compressed model and the original model

For regression: $d(w, \hat{w}) = \mathbb{E}_X\left[\|f_w(X) - f_{\hat{w}}(X)\|^2\right]$

For classification: $d(w, \hat{w}) = \mathbb{E}_X\left[D_{\mathrm{KL}}\left(f_{\hat{w}}(X) \,\|\, f_w(X)\right)\right]$

Rate-distortion theorem for model compression:
$$R(D) = \min_{P_{\hat{W}|W}:\ \mathbb{E}[d(W, \hat{W})] \le D} I(W; \hat{W})$$

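Both distortions are expectations over the data, so they can be estimated by Monte Carlo sampling. A minimal NumPy sketch for the classification case, assuming `f_orig` and `f_comp` are callables returning class probabilities (the helper names are ours, not from the talk):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) along the last axis, for batches of categorical distributions.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def estimate_distortion(f_orig, f_comp, X):
    # Monte Carlo estimate of d(w, w_hat) = E_X[ D_KL(f_what(X) || f_w(X)) ];
    # note the compressed model is the first argument of the KL divergence.
    return float(np.mean(kl_divergence(f_comp(X), f_orig(X))))
```

For regression, the same estimator applies with the squared error $\|f_w(X) - f_{\hat{w}}(X)\|^2$ in place of the KL term.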


SLIDE 21

Our contributions

Generally, it is intractable to evaluate R(D) due to:

the high dimensionality of the parameters
complicated non-linearities

In this talk, our contributions are:

For the linear regression model, we give a lower bound on R(D) and an algorithm achieving the lower bound.
Inspired by the optimal algorithm, we propose two “golden rules” for model compression.
We prove the optimality of the proposed “golden rules” for a one-layer ReLU network.
We show that an algorithm following the “golden rules” performs better on real models.



SLIDE 24

Linear regression

Consider the linear regression model $f_w(x) = w^T x$ and the following assumptions:

Weights $W$ are drawn from $\mathcal{N}(0, \Sigma_W)$.
Data $X$ has zero mean, $\mathbb{E}[X_i^2] = \lambda_{x,i}$, and $\mathbb{E}[X_i X_j] = 0$ for $i \neq j$.

Theorem: the rate distortion function is lower bounded by
$$R(D) \ \ge\ \underline{R}(D) = \frac{1}{2}\log\det(\Sigma_W) - \sum_{i=1}^{m} \frac{1}{2}\log(D_i),$$
where
$$D_i = \begin{cases} \mu/\lambda_{x,i} & \text{if } \mu < \lambda_{x,i}\,\mathbb{E}_W[W_i^2], \\ \mathbb{E}_W[W_i^2] & \text{if } \mu \ge \lambda_{x,i}\,\mathbb{E}_W[W_i^2], \end{cases}$$
and $\mu$ is chosen such that $\sum_{i=1}^{m} \lambda_{x,i} D_i = D$.

The lower bound is tight for linear regression.

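Since the bound has a reverse water-filling form, it is straightforward to evaluate numerically. A minimal NumPy sketch, assuming a diagonal $\Sigma_W$ (so the $\log\det$ term splits per coordinate) and bisecting on the water level $\mu$; the function and argument names are ours:

```python
import numpy as np

def rd_lower_bound(var_w, lam_x, D, tol=1e-10):
    # Evaluate R_(D) = 1/2 sum_i log(var_w[i]) - 1/2 sum_i log(D_i) in nats,
    # with D_i = min(mu / lam_x[i], var_w[i]) and mu chosen so that
    # sum_i lam_x[i] * D_i = D (reverse water-filling).
    # Requires 0 < D <= sum(lam_x * var_w).
    def total_distortion(mu):
        return np.sum(lam_x * np.minimum(mu / lam_x, var_w))

    lo, hi = 0.0, np.max(lam_x * var_w)  # total_distortion(hi) is the maximum
    while hi - lo > tol:                 # total_distortion is increasing in mu
        mu = (lo + hi) / 2
        if total_distortion(mu) < D:
            lo = mu
        else:
            hi = mu
    D_i = np.minimum(lo / lam_x, var_w)
    return 0.5 * np.sum(np.log(var_w)) - 0.5 * np.sum(np.log(D_i))

# e.g. rd_lower_bound(var_w=np.array([1.0, 0.5, 0.25]), lam_x=np.ones(3), D=0.5)
```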


SLIDE 27

From theory to practice

Two “golden rules” of the optimal compressor:

1. Orthogonality: $\mathbb{E}_{W,\hat{W}}\left[\hat{W}^T \Sigma_X (W - \hat{W})\right] = 0$

2. Minimization: $\mathbb{E}_{W,\hat{W}}\left[(W - \hat{W})^T \Sigma_X (W - \hat{W})\right]$ should be minimized, given a certain rate.

Modified “golden rules” for practice:

1. Orthogonality: $\hat{w}^T I_w (w - \hat{w}) = 0$

2. Minimization: $(w - \hat{w})^T I_w (w - \hat{w})$ is minimized, given certain constraints.

Here $I_w$ is the weight importance matrix:

For regression, $I_w = \mathbb{E}_X\left[\nabla_w f_w(X)\,(\nabla_w f_w(X))^T\right]$

For classification, $I_w = \mathbb{E}_X\left[(\nabla_w f_w(X))\,\mathrm{diag}\!\left[f_w^{-1}(X)\right](\nabla_w f_w(X))^T\right]$

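In practice the expectation over $X$ is estimated on a finite sample, and (as in the experiments later) only the diagonal of $I_w$ is kept. A sketch for the classification case; `jacobian_fn` and `predict_fn` are hypothetical stand-ins for the model's per-input Jacobian and output probabilities:

```python
import numpy as np

def importance_diag(jacobian_fn, predict_fn, X):
    # Estimates diag(I_w)_j ~ E_X[ sum_k (d f_k / d w_j)^2 / f_k(X) ]
    # for a classifier with K outputs and m parameters.
    acc = None
    for x in X:
        J = jacobian_fn(x)                    # shape (K, m)
        p = np.maximum(predict_fn(x), 1e-12)  # shape (K,), guard tiny probs
        contrib = np.sum(J**2 / p[:, None], axis=0)
        acc = contrib if acc is None else acc + contrib
    return acc / len(X)
```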


SLIDE 31

Optimality of “golden rules”

One-layer ReLU model $f_w(x) = \mathrm{ReLU}(w^T x)$.

Data $X$ has zero mean, $\mathbb{E}[X_i^2] = \lambda_{x,i}$, and $\mathbb{E}[X_i X_j] = 0$ for $i \neq j$.

For pruning and quantization algorithms, if a compressor minimizes $(w - \hat{w})^T I_w (w - \hat{w})$, it automatically satisfies orthogonality: $\hat{w}^T I_w (\hat{w} - w) = 0$. Hence, for pruning and quantization, minimizing the objective $(w - \hat{w})^T I_w (w - \hat{w})$ is equivalent to minimizing the MSE loss.

For practical models, we test the objective on real data.



SLIDE 36

Real data experiment

CIFAR-10 with 5 conv layers + 3 fc layers (more experiments in the full paper)

Algorithms:

Pruning: the same prune ratio for all conv and fc layers
Quantization: the same number of clusters for all conv and fc layers

Recall that for the classification problem, $I_w = \mathbb{E}_X\left[(\nabla_w f_w(X))\,\mathrm{diag}\!\left[f_w^{-1}(X)\right](\nabla_w f_w(X))^T\right]$

We drop the off-diagonal terms of $I_w$ and compare with the baseline $I_w$ = identity.

Table 1: Comparison of unsupervised compression objectives.

Baseline: $\sum_{i=1}^m (w_i - \hat{w}_i)^2$
Proposed: $\sum_{i=1}^m \mathbb{E}_X\!\left[\frac{(\nabla_{w_i} f_w(X))^2}{f_w(X)}\right](w_i - \hat{w}_i)^2$

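With a diagonal importance, both algorithms reduce to simple weighted problems, as in the sketch below (our helper names; `importance` is the estimated diagonal of $I_w$, and all-ones recovers the baseline):

```python
import numpy as np

def prune_by_importance(w, importance, keep_ratio):
    # Pruning w[i] to zero costs importance[i] * w[i]**2 in the objective
    # sum_i importance[i] * (w[i] - w_hat[i])**2, so keep the largest costs.
    cost = importance * w**2
    k = int(np.ceil(keep_ratio * w.size))
    keep = np.argsort(cost)[-k:]
    w_hat = np.zeros_like(w)
    w_hat[keep] = w[keep]
    return w_hat

def quantize_by_importance(w, importance, n_clusters, n_iters=50):
    # Weighted 1-D k-means: locally minimizes
    # sum_i importance[i] * (w[i] - centers[assign[i]])**2.
    centers = np.quantile(w, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centers[c] = np.average(w[mask], weights=importance[mask] + 1e-12)
    return centers[assign]
```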

SLIDE 37

Real data experiment

Figure 2: Results for the unsupervised experiment: accuracy and cross entropy vs. compression ratio (uncompressed, baseline, proposed). Left: pruning. Right: quantization.



SLIDE 40

Real data experiment

In the previous experiments, we did not use the training labels.

To use the training labels, treat the loss function $L_w(x, y) = L(f_w(x), y)$ as the function to be compressed and define $I_w = \mathbb{E}\left[\nabla_w L_w(X, Y)\,(\nabla_w L_w(X, Y))^T\right]$

By first- and second-order approximations of $L$, we propose:

Table 2: Comparison of supervised compression objectives.

Baseline: $\sum_{i=1}^m (w_i - \hat{w}_i)^2$
Gradient (1st-order approx. of $L$): $\sum_{i=1}^m \mathbb{E}[(\nabla_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^2$
Hessian ([LeCun '90]): $\sum_{i=1}^m \mathbb{E}[\nabla^2_{w_i} L_w(X, Y)](w_i - \hat{w}_i)^2$
Gradient+Hessian (2nd-order approx. of $L$): $\sum_{i=1}^m \mathbb{E}[(\nabla_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^2 + \frac{1}{4}\sum_{i=1}^m \mathbb{E}[(\nabla^2_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^4$

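A sketch of evaluating the Gradient+Hessian objective for a candidate $\hat{w}$ (our function name; `grad_sq` and `hess_sq` are the per-weight moments $\mathbb{E}[(\nabla_{w_i} L)^2]$ and $\mathbb{E}[(\nabla^2_{w_i} L)^2]$ estimated on training batches):

```python
import numpy as np

def gradient_hessian_objective(w, w_hat, grad_sq, hess_sq):
    # sum_i E[(dL/dw_i)^2] * (w_i - w_hat_i)^2
    #   + (1/4) sum_i E[(d^2 L/dw_i^2)^2] * (w_i - w_hat_i)^4
    d = w - w_hat
    return float(np.sum(grad_sq * d**2) + 0.25 * np.sum(hess_sq * d**4))
```

Dropping the quartic term recovers the Gradient objective; replacing `grad_sq` with the mean second derivative $\mathbb{E}[\nabla^2_{w_i} L]$ and dropping the quartic term recovers the Hessian objective of [LeCun '90].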

SLIDE 41

Real data experiment

Figure 3: Results for the supervised experiment: accuracy and cross entropy vs. compression ratio (uncompressed, baseline, gradient, hessian, gradient+hessian). Left: pruning. Right: quantization.


SLIDE 42

Thanks

Thank you for your attention! Please visit our poster (#169) tonight.
