

  1. Neural Network Compression: Linear Neural Reconstruction
     David A. R. Robin, internship with Swayambhoo Jain, Feb-Aug 2019
     Report: www.robindar.com/m1-internship/report.pdf

  2. Neural networks: X ∈ R^d ↦ W_2 · σ_1(W_1 · σ_0(W_0 · X))
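For concreteness, a minimal NumPy sketch of this feed-forward map; the layer widths and the choice of ReLU non-linearities are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

d, h1, h2, h3 = 784, 300, 100, 10           # illustrative layer widths
rng = np.random.default_rng(0)
W0 = rng.standard_normal((h1, d)) * 0.01    # weights of layer 0
W1 = rng.standard_normal((h2, h1)) * 0.01   # weights of layer 1
W2 = rng.standard_normal((h3, h2)) * 0.01   # weights of layer 2

def forward(x):
    # X in R^d  ->  W2 . sigma1(W1 . sigma0(W0 . X))
    return W2 @ relu(W1 @ relu(W0 @ x))

y = forward(rng.standard_normal(d))         # an h3-dimensional output
```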

  3. Notations (when considering a single layer)
     - d ∈ N: number of inputs of the layer
     - h ∈ N: number of outputs of the layer
     - W ∈ R^{h×d}: weights of the layer (acting on R^d)
     - D: distribution of inputs to the layer
     - X ∼ D: input to the layer (random variable)

  4. Previously in Network Compression

  5. Previously in Network Compression: Pruning
     - Pruning: remove weights (i.e. connections)
     - Assumption: pruning small-magnitude weights |w_q| causes only a small loss increase (even when pruning several weights at once)
     - Pruning algorithm: prune, retrain, repeat
     - Result: 90% of weights removed at the same accuracy (high compressibility)

  6. Previously in Network Compression: Explaining Pruning
     Magnitude-based pruning requires retraining. It can keep neurons with no inputs or no outputs (in red)¹, as well as redundant neurons (in blue) that could be discarded at no cost. This redundancy is not leveraged: can we take advantage of redundancies?
     ¹ Given enough retraining with weight decay, these will be discarded.

  7. Previously in Network Compression: Low-rank
     min_{P ∈ R^{h×r}, Q ∈ R^{d×r}} ‖W − P·Q^T‖_2
     Problems with low-rank: it keeps the hidden neuron count intact, and it is data-agnostic.
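For reference, this data-agnostic factorization can be sketched with a truncated SVD, which is optimal for this objective in both the spectral and Frobenius norms (NumPy; the rank r is an assumed hyperparameter):

```python
import numpy as np

def low_rank_factorization(W, r):
    """Rank-r factorization minimizing ||W - P Q^T|| (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :r] * s[:r]     # h x r
    Q = Vt[:r, :].T          # d x r
    return P, Q

W = np.random.default_rng(0).standard_normal((100, 300))
P, Q = low_rank_factorization(W, r=20)
print(np.linalg.norm(W - P @ Q.T))   # approximation error
```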

  8. Contribution

  9. Activation reconstruction
     L-layer feed-forward flow:
     - Z_0: input to the network
     - Z_{k+1} = σ_k(W_k · Z_k)
     - Z_L is used as the prediction
     Weight approximation (theirs): Ŵ_k ≈ W_k. Activation reconstruction (ours): Ẑ_k ≈ Z_k.
     We have more than the weights, we have the activations, and we only need σ_k(Ŵ_k Z_k) ≈ σ_k(W_k Z_k):
     - if Ẑ_k ≈ Z_k and σ_k(Ŵ_k Z_k) ≈ σ_k(W_k Z_k), then Ẑ_{k+1} ≈ Z_{k+1}
     - Ŵ_k ≈ W_k implies σ_k(Ŵ_k Z_k) ≈ σ_k(W_k Z_k), but is a stronger requirement than needed

  10. Linear activation reconstruction
      Ŵ_k ≈ W_k ⇒ Ŵ_k Z_k ≈ W_k Z_k ⇒ σ_k(Ŵ_k Z_k) ≈ σ_k(W_k Z_k)
      The first condition (Ŵ_k ≈ W_k) is sub-optimal because it is data-agnostic.
      The third (σ_k(Ŵ_k Z_k) ≈ σ_k(W_k Z_k)) is non-convex and non-smooth.
      Let's aim for the middle one: Ŵ_k · Z_k ≈ W_k · Z_k.

  11. Low-rank inspiration
      Low-rank with activation reconstruction gives
      min_{P ∈ R^{h×r}, Q ∈ R^{d×r}} E_X ‖W X − P Q^T X‖_2^2
      Q: feature extractor, P: linear reconstruction from the extracted features.
      Knowing the right rank r to use is hard. A soft low-rank version would use the nuclear norm ‖·‖_* instead:
      min_M E_X ‖W X − M X‖_2^2 + λ · ‖M‖_*
      where λ controls the tradeoff between compression and accuracy.

  12. Neuron removal
      If the i-th column of M is zero (C_i(M) = 0), then X_i is never used and neuron no. i can be removed.
      Column-sparse matrices remove neurons. The characterization of such matrices is reminiscent of low-rank:
      - Low-rank: M = P Q^T, with P ∈ R^{h×r_Q}, Q ∈ R^{d×r_Q}
      - Column-sparse: M = P C^T, with P ∈ R^{h×r_C}, C ∈ {0,1}^{d×r_C}, C^T 1_d = 1_{r_C}
      The feature extractor Q becomes a feature selector C.
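To make the structure concrete, a small NumPy illustration (dimensions and selected indices are arbitrary) of how a feature selector C makes M = P·C^T column-sparse, so that only the selected input neurons are ever read:

```python
import numpy as np

h, d, r = 4, 6, 2
rng = np.random.default_rng(0)
P = rng.standard_normal((h, r))

# C in {0,1}^{d x r} with C^T 1_d = 1_r: each column selects exactly one input neuron
selected = [1, 4]                      # indices of the kept input neurons
C = np.zeros((d, r))
C[selected, np.arange(r)] = 1.0

M = P @ C.T                            # column-sparse: zero outside 'selected'
print(np.nonzero(np.abs(M).sum(axis=0))[0])    # -> [1 4]

# M @ X only reads rows 1 and 4 of X, so the other input neurons can be removed
X = rng.standard_normal((d, 5))
assert np.allclose(M @ X, P @ X[selected])
```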

  13. Leveraging consecutive layers
      Restricting to feature selectors, we gain an interesting property: a feature selector's action commutes with non-linearities.
      For a three-layer network:
      W_3 · σ_2(W_2 · σ_1(W_1 · X))
      ≈ P_3 C_3^T · σ_2(P_2 C_2^T · σ_1(P_1 C_1^T · X))
      = P_3 · σ_2(C_3^T P_2 · σ_1(C_2^T P_1 · C_1^T X))
      = Ŵ_3 · σ_2(Ŵ_2 · σ_1(Ŵ_1 · C_1^T X))
      Memory footprint:
      - original: h_3 × h_2 + h_2 × h_1 + h_1 × d
      - compressed: h_3 × r_3 + r_3 × r_2 + r_2 × r_1 + α · log_2 (d choose r_1)
      h_2 and h_1 are gone! Only h_3 (number of outputs) and d (number of inputs) remain.
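A quick sanity check of this footprint count with made-up layer widths and ranks; reading α as a bits-to-parameters conversion factor is my assumption here:

```python
import math

# Illustrative parameter-count comparison with assumed widths and ranks.
d, h1, h2, h3 = 784, 300, 100, 10      # original widths (assumed)
r1, r2, r3 = 150, 60, 10               # selected-feature counts (assumed)

original = h3 * h2 + h2 * h1 + h1 * d

# Encoding which r1 of the d inputs are kept costs log2(C(d, r1)) bits; alpha
# converts bits into the same unit as the weight counts (assumed: 32-bit floats).
selection_bits = math.log2(math.comb(d, r1))
alpha = 1.0 / 32.0
compressed = h3 * r3 + r3 * r2 + r2 * r1 + alpha * selection_bits

print(f"original: {original} weights, compressed: {compressed:.0f} weight-equivalents")
```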

  14. Optimality of feature selectors
      A feature selector's action commutes with non-linearities:
      C ∈ {0,1}^{d×r}, C^T 1_d = 1_r ⇒ P C^T · σ(U) = P · σ(C^T U)
      We only need the commutation property. Could we use something less extreme than feature selectors?
      Lemma (commutation lemma). Let C be a linear operator and σ: x ↦ max(0, x) the pointwise ReLU. If C's action commutes with σ, then C is a feature selector.
      Answer: no, not even when all the σ_k are ReLU.
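A quick numerical illustration of the two directions (not a proof): a feature selector commutes with the pointwise ReLU, while a generic linear map does not (NumPy, arbitrary sizes):

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))

# A feature selector (picks rows 0 and 2): commutes with the pointwise ReLU
C = np.zeros((4, 2))
C[[0, 2], [0, 1]] = 1.0
print(np.allclose(C.T @ relu(U), relu(C.T @ U)))   # True

# A generic linear map: does not commute with ReLU
G = rng.standard_normal((2, 4))
print(np.allclose(G @ relu(U), relu(G @ U)))       # False (almost surely)
```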

  15. Comparison with low-rank
      Hidden neurons are deleted. Note how this avoids the pruning drawbacks discussed before.

  16. Comparison with low-rank
      - Low-rank: M = P Q^T, with P ∈ R^{h×r_Q}, Q ∈ R^{d×r_Q}
      - Column-sparse: M = P C^T, with P ∈ R^{h×r_C}, C ∈ {0,1}^{d×r_C}, C^T 1_d = 1_{r_C}
      For the same ℓ_2 error, low-rank is less constrained, hence r_Q ≤ r_C. But it does not remove hidden neurons, which may dominate its cost.
      Two regimes:
      - Heavy overparameterization (r_C ≪ d): use column-sparse
      - Light overparameterization (r_C ≈ d): use low-rank
      Once neurons have been removed, it is still possible to apply a low-rank approximation on top of the first compression.

  17. Solving for column-sparse

  18. Linear Neural Reconstruction Problem
      Using the ℓ_{2,1} norm as a proxy for the number of non-zero columns, we consider the following relaxation:
      min_M E_X ‖W X − M X‖_2^2 + λ · ‖M‖_{2,1}        (1)
      where ‖M‖_{2,1} = Σ_j sqrt(Σ_i M_{i,j}^2) is the ℓ_{2,1} norm of M, i.e. the sum of the ℓ_2-norms of its columns.
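The proximal operator of this regularizer, used by the solver introduced below, has a simple closed form: column-wise group soft-thresholding. A minimal NumPy sketch:

```python
import numpy as np

def l21_norm(M):
    # sum of the l2 norms of the columns of M
    return np.linalg.norm(M, axis=0).sum()

def prox_l21(M, t):
    """Proximal operator of t * ||.||_{2,1}: shrink each column towards zero,
    zeroing it entirely when its l2 norm falls below t."""
    norms = np.linalg.norm(M, axis=0)
    return M * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

M = np.random.default_rng(0).standard_normal((5, 8))
print(l21_norm(M), l21_norm(prox_l21(M, t=1.0)))
```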

  19. Auto-correlation factorization
      The sum over the training set can be factored away. Writing A = W − M, we have
      E_X ‖A X‖_2^2 = E_X Tr(A · X X^T · A^T) = Tr(A · (E_X X X^T) · A^T)
      where R = E_X[X X^T] ∈ R^{d×d} is the auto-correlation matrix.
      The objective can then be evaluated in O(h d^2), which does not depend on the number of samples.
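A small NumPy check of this factorization, assuming the sampled layer inputs are stacked as the columns of a d × N matrix X, so the expectation is replaced by an empirical average:

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, N = 50, 100, 10_000
W = rng.standard_normal((h, d))
X = rng.standard_normal((d, N))          # N sampled layer inputs, one per column
M = rng.standard_normal((h, d))

R = (X @ X.T) / N                        # auto-correlation matrix, d x d

# (1/N) ||(W - M) X||_F^2  =  Tr((W - M) R (W - M)^T)
A = W - M
direct = np.linalg.norm(A @ X, "fro") ** 2 / N
factored = np.trace(A @ R @ A.T)         # O(h d^2), independent of N
print(np.allclose(direct, factored))
```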

  20. Efficient solving
      Our problem is strictly convex, hence solvable to the global optimum.
      We solve it with FISTA (Fast Iterative Shrinkage-Thresholding Algorithm), an accelerated proximal gradient method with an O(1/k^2) convergence rate.
      Lemma (convergence rate). Let ℒ: M ↦ (1/2) · E_X ‖W X − M X‖_2^2 + λ · ‖M‖_{2,1}, let (M_k)_k be the iterates obtained by the FISTA algorithm, M* the global optimum, and L = λ_max(E_X[X X^T]). Then
      ℒ(M_k) − ℒ(M*) ≤ (2L / k^2) · ‖M_0 − M*‖_F^2

  21. Extension to convolutional layers
      For each output position (u, v) in output channel j, write X_i^{(u,v)} for the associated input patch, which is multiplied with W_j to produce (W ∗ X_i)_{j,u,v}. Then
      ‖W ∗ X_i‖_2^2 = Σ_{u,v} Σ_j ⟨W_j, X_i^{(u,v)}⟩^2   (the sum of the elementwise product W_j ⊙ X_i^{(u,v)}, squared)
      hence
      R ∝ Σ_i Σ_{u,v} vec(X_i^{(u,v)}) · vec(X_i^{(u,v)})^T
      This rewriting holds for any stride, padding or dilation values. We then use the more general Group Lasso penalty instead of ℓ_{2,1}.
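A sketch of how R can be accumulated from input patches for one image (NumPy, stride 1, no padding; the explicit im2col-style loop below is only illustrative):

```python
import numpy as np

def conv_autocorrelation(X, kh, kw):
    """Accumulate R ~ sum_{u,v} vec(X^(u,v)) vec(X^(u,v))^T for one input X
    of shape (c, H, W), kernel size (kh, kw), stride 1, no padding."""
    c, H, Wd = X.shape
    dim = c * kh * kw
    R = np.zeros((dim, dim))
    for u in range(H - kh + 1):
        for v in range(Wd - kw + 1):
            patch = X[:, u:u + kh, v:v + kw].reshape(-1)   # vec(X^(u,v))
            R += np.outer(patch, patch)
    return R

X = np.random.default_rng(0).standard_normal((3, 8, 8))
R = conv_autocorrelation(X, kh=3, kw=3)
print(R.shape)   # (27, 27)
```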

  22. Results

  23. General results
      Architecture     | Type          | Top-1 error | Top-5 error | Comp. rate | Size
      -----------------|---------------|-------------|-------------|------------|---------
      LeNet-300-100    | Baseline      | 1.68 %      | -           | -          | 1.02 MiB
      LeNet-300-100    | Compressed    | 1.71 %      | -           | 46 %       | 482 KiB
      LeNet-300-100    | Retrained (1) | 1.64 %      | -           | 29 %       | 307 KiB
      LeNet-5 (Caffe)  | Baseline      | 0.74 %      | -           | -          | 1.64 MiB
      LeNet-5 (Caffe)  | Compressed    | 0.78 %      | -           | 16 %       | 276 KiB
      LeNet-5 (Caffe)  | Retrained (1) | 0.78 %      | -           | 10 %       | 177 KiB
      AlexNet          | Baseline      | 43.48 %     | 20.93 %     | -          | 234 MiB
      AlexNet          | Compressed    | 45.36 %     | 21.90 %     | 39 %       | 91 MiB

  24. Reconstruction chaining

  25. Extension to arbitrary output
      We can extend the previous problem to reconstruct an arbitrary output Y:
      min_M (1 / 2N) Σ_i ‖Y_i − M X_i‖_2^2 + λ · ‖M‖_{2,1}        (2)
      FISTA is adapted by simply changing the gradient step to dA = Y X^T − A X X^T, where Y X^T can be precomputed as well.
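In code, this only changes which moments are precomputed. A sketch with assumed shapes (Y: h × N targets, X: d × N inputs); dA below is the negative gradient of the smooth term of (2):

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, N = 30, 60, 5_000
X = rng.standard_normal((d, N))
Y = rng.standard_normal((h, N))

# Precompute once; every iteration then costs O(h d^2)
R = (X @ X.T) / N        # d x d
B = (Y @ X.T) / N        # h x d, "Y X^T" up to the 1/N factor

def neg_gradient(A):
    # dA = (Y X^T - A X X^T) / N: negative gradient of (1/2N) sum_i ||Y_i - A X_i||^2
    return B - A @ R

print(neg_gradient(np.zeros((h, d))).shape)
```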

  26. Three chaining strategies
      Consider a feed-forward fully connected network with input Z_0, weights (W_k)_k and non-linearities (σ_k)_k, so that Z_{k+1} = σ_k(W_k · Z_k).
      - Parallel:  Y = W_k · Z_k,           X = Z_k
      - Top-down:  Y = W_k · Z_k,           X = Ẑ_k
      - Bottom-up: Y = C_{k+1}^T W_k · Z_k, X = Z_k

  27. Three chaining strategies
      Figure: diagrams of the top-down, bottom-up and parallel chainings (operators: original layer, feature extraction, reconstruction; activations: original, extracted, reconstructed; the minimized error differs per strategy).

  28. Reconstruction chaining
      Figure: performance of the reconstruction chainings (LeNet-5 Caffe).

  29. Tackling the Lasso bias
      Lasso regularization → shrinkage effect → bias in the solution.
      We limit this effect by solving twice:
      - Solve for (P, C^T) and retain only C
      - Solve for P with C fixed and no penalty
      The second step is just a linear regression.
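A sketch of this two-stage debiasing in NumPy, assuming M is the solution of the ℓ_{2,1}-regularized problem and R = E[X X^T] is the precomputed auto-correlation matrix; the closed-form refit via normal equations is my reading of the "linear regression" step:

```python
import numpy as np

def debias(W, M, R, eps=1e-8):
    """Keep the column support of M (the selector C), then refit P without
    penalty: P = argmin E||W X - P C^T X||^2, in closed form."""
    support = np.flatnonzero(np.linalg.norm(M, axis=0) > eps)   # selected inputs
    Rss = R[np.ix_(support, support)]        # E[X_S X_S^T]
    Rfs = R[:, support]                      # E[X X_S^T]
    P = W @ Rfs @ np.linalg.pinv(Rss)        # unpenalized least-squares refit
    M_debiased = np.zeros_like(M)
    M_debiased[:, support] = P
    return M_debiased

# Usage (hypothetical names): M_hat = debias(W, M, R)
```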

  30. Influence of debiasing Figure: Influence of debiasing on reconstruction quality (LeNet-5 Caffe)

  31. Appendix

  32. Fast Iterative Shrinkage-Thresholding
      Algorithm 1: FISTA with fixed step size
      input:  X ∈ R^{h×N}: inputs to the layer, W ∈ R^{o×h}: weights to approximate, λ: hyperparameter
      output: M ∈ R^{o×h}: reconstruction
        R ← X X^T / N
        L ← largest eigenvalue of R
        M ← 0 ∈ R^{o×h}, P ← 0 ∈ R^{o×h}
        t ← λ / L, k ← 1, θ ← 1
        repeat
          θ ← (k − 1) / (k + 2), k ← k + 1
          A ← M + θ (M − P)
          dA ← (W − A) R
          P ← M, M ← prox_{t‖·‖_{2,1}}(A + dA / L)
        until desired convergence
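For reference, a runnable NumPy transcription of this algorithm. Note that dA = (W − A)·R is the negative gradient of the smooth term, so the proximal step uses A + dA/L; the fixed iteration budget and the usage line are assumptions standing in for "until desired convergence":

```python
import numpy as np

def prox_l21(M, t):
    # column-wise group soft-thresholding: prox of t * ||.||_{2,1}
    norms = np.linalg.norm(M, axis=0)
    return M * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def fista_l21(W, X, lam, n_iter=500):
    """min_M  1/(2N) ||W X - M X||_F^2 + lam * ||M||_{2,1}  (FISTA, fixed step)."""
    N = X.shape[1]
    R = (X @ X.T) / N                       # auto-correlation matrix
    L = np.linalg.eigvalsh(R)[-1]           # largest eigenvalue = Lipschitz constant
    M = np.zeros_like(W)
    P = np.zeros_like(W)                    # previous iterate
    t = lam / L
    for k in range(1, n_iter + 1):
        theta = (k - 1) / (k + 2)
        A = M + theta * (M - P)             # extrapolation (momentum) point
        dA = (W - A) @ R                    # negative gradient of the smooth term
        P, M = M, prox_l21(A + dA / L, t)   # forward-backward step
    return M

# Usage: M = fista_l21(W, X, lam=1e-2); zero columns of M mark removable input neurons.
```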

  33. Convergence guarantees
      Lemma. Let ℒ: M ↦ (1/2) · E_X ‖W X − M X‖_2^2 + λ · ‖M‖_{2,1}, let (M_k)_k be the iterates obtained by FISTA as described above, M* the global optimum, and L = λ_max(E_X[X X^T]). Then
      ℒ(M_k) − ℒ(M*) ≤ (2L / k^2) · ‖M_0 − M*‖_F^2
      Choosing M_0 = 0, we can refine this bound using
      ‖M*‖_F^2 ≤ ‖M*‖_{2,1} · min(√d, ‖M*‖_{2,1})
      and, by definition of M*, for all M: ‖M*‖_{2,1} ≤ (1/λ) · ℒ(M).
