Bayesian Sparsification of Deep Complex-valued Networks


  1. Bayesian Sparsification of Deep Complex-valued Networks. Ivan Nazarov, Evgeny Burnaev. ADASE, Skoltech, Moscow, Russia

  2. Synopsis
     Motivation for C-valued neural networks
     ◮ perform better for naturally C-valued data
     ◮ use half as much storage, but the same number of flops
     Proposal: Sparse Variational Dropout for C-valued neural networks
     ◮ a Bayesian sparsification method built on C-valued distributions
     ◮ empirically explore the compression-performance trade-off
     Conclusions
     ◮ C-valued methods compress similarly to their R-valued predecessors
     ◮ final performance benefits from fine-tuning the sparsified network
     ◮ a SOTA C-valued network on MusicNet compresses by 50-100× at a moderate performance penalty

  3. C-valued neural networks: Applications
     Data with a natural C-valued representation
     ◮ radar and satellite imaging [Hirose, 2009, Hänsch and Hellwich, 2010, Zhang et al., 2017]
     ◮ magnetic resonance imaging [Hui and Smith, 1995, Wang et al., 2020]
     ◮ radio signal classification [Yang et al., 2019, Tarver et al., 2019]
     ◮ spectral speech modelling and music transcription [Wisdom et al., 2016, Trabelsi et al., 2018, Yang et al., 2019]
     Exploring benefits beyond C-valued data
     ◮ sequence modelling, dynamical system identification [Danihelka et al., 2016, Wisdom et al., 2016]
     ◮ image classification, road / lane segmentation [Popa, 2017, Trabelsi et al., 2018, Gaudet and Maida, 2018]
     ◮ unitary transition matrices in recurrent networks [Arjovsky et al., 2016, Wisdom et al., 2016]

  4. C-valued neural networks: Implementation
     Geometric representation $\mathbb{C} \simeq \mathbb{R}^2$
     ◮ $z = \Re z + i\,\Im z$, with $i^2 = -1$
     ◮ $\Re z$ and $\Im z$ are the real and imaginary parts of $z$
     A C-valued network is an intricate double-R network that respects C-arithmetic (checked numerically in the sketch below):
     ◮ R-VNN linear operation: $\begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$
     ◮ C-VNN linear operation: $\begin{pmatrix} W_{11} & -W_{21} \\ W_{21} & W_{11} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$
     Activations $z \mapsto \sigma(z)$, e.g. $r e^{i\phi} \mapsto \sigma(r, \phi)$ or $z \mapsto \sigma(\Re z) + i\,\sigma(\Im z)$
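To make the double-R view concrete, here is a minimal NumPy sketch (not the authors' implementation; all variable names are illustrative) verifying that a complex matrix-vector product equals the structured real product with blocks [[A, -B], [B, A]]:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy C-valued "layer": y = W z with W and z complex.
n_out, n_in = 3, 4
W = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))
z = rng.normal(size=n_in) + 1j * rng.normal(size=n_in)

# Native complex arithmetic.
y_complex = W @ z

# The same operation as a double-real network: stack real and imaginary
# parts and apply the structured block matrix [[A, -B], [B, A]],
# where A = Re(W) and B = Im(W).
A, B = W.real, W.imag
W_real = np.block([[A, -B], [B, A]])      # shape (2*n_out, 2*n_in)
x = np.concatenate([z.real, z.imag])      # shape (2*n_in,)
y_real = W_real @ x

# Both paths agree: Re(y) stacked over Im(y).
assert np.allclose(y_real, np.concatenate([y_complex.real, y_complex.imag]))
print("double-real form matches complex arithmetic")
```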

  5. Sparsity and compression
     Improve power, storage or throughput efficiency of deep nets
     ◮ knowledge distillation [Hinton et al., 2015, Balasubramanian, 2016]
     ◮ network pruning [LeCun et al., 1990, Seide et al., 2011, Zhu and Gupta, 2018]
     ◮ low-rank matrix / tensor decomposition [Denton et al., 2014, Novikov et al., 2015]
     ◮ quantization and fixed-point arithmetic [Courbariaux et al., 2015, Han et al., 2016, Chen et al., 2017]
     Applications to C-VNNs:
     ◮ C-modulus pruning, quantization with k-means in R² [Wu et al., 2019]
     ◮ ℓ1 regularization for hyper-complex-valued networks [Vecchi et al., 2020]

  6. Sparse Variational Dropout [Molchanov et al., 2017]
     Variational inference with an automatic relevance determination effect:
     $\max_{q \in \mathcal{Q}} \; \underbrace{\mathbb{E}_{w \sim q} \log p(\mathcal{D} \mid w)}_{\text{data model likelihood}} \;-\; \underbrace{\mathrm{KL}(q \,\|\, \pi)}_{\text{variational regularization}}$   (ELBO)
     prior $\pi$ → data model likelihood → posterior $q$ (close to $p(w \mid \mathcal{D})$)
     Factorized Gaussian dropout posterior family $\mathcal{Q}$ (see the sketch below)
     ◮ $w_{ij} \sim q(w_{ij}) = \mathcal{N}(w_{ij} \mid \mu_{ij}, \alpha_{ij} \mu_{ij}^2)$, with $\alpha_{ij} > 0$ and $\mu_{ij} \in \mathbb{R}$
     Factorized prior
     ◮ (VD) $\pi(w_{ij}) \propto \frac{1}{|w_{ij}|}$ [Molchanov et al., 2017]
     ◮ (ARD) $\pi(w_{ij}) = \mathcal{N}(w_{ij} \mid 0, \frac{1}{\tau_{ij}})$ [Kharitonov et al., 2018]
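A small Python sketch of the real-valued case, for orientation: reparameterized sampling from the Gaussian dropout posterior and the approximate negative KL for the log-uniform (VD) prior, using the polynomial-sigmoid fit reported in Molchanov et al. (2017). This is an illustrative sketch, not the authors' code; the constants are those reported in that paper.

```python
import numpy as np

def sample_gaussian_dropout_weights(mu, log_alpha, rng):
    """Reparameterized draw w = mu + sqrt(alpha)*|mu|*eps, eps ~ N(0, 1),
    i.e. w ~ N(mu, alpha * mu^2), the Gaussian dropout posterior."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_alpha) * np.abs(mu) * eps

def neg_kl_vd_approx(log_alpha):
    """Approximate -KL(q || pi) per weight for the log-uniform prior,
    with the fitted constants from Molchanov et al. (2017)."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    alpha = np.exp(log_alpha)
    return k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * np.log1p(1.0 / alpha) - k1

rng = np.random.default_rng(0)
mu = rng.normal(size=(5, 3))
log_alpha = np.full_like(mu, -1.0)            # one dropout rate per weight
w = sample_gaussian_dropout_weights(mu, log_alpha, rng)
elbo_reg = neg_kl_vd_approx(log_alpha).sum()  # enters the ELBO with a plus sign
print(w.shape, float(elbo_reg))
```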

  7. C-valued Variational Dropout: our proposal
     Factorized complex-valued posterior $q(w) = \prod_{ij} q(w_{ij})$ (sampling sketch below)
     ◮ $w_{ij}$ are independent $\mathcal{CN}(w \mid \mu, \sigma^2, \sigma^2 \xi)$ with $\sigma^2 = \alpha |\mu|^2$ and $|\xi| \le 1$:
       $\begin{pmatrix} \Re w \\ \Im w \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \Re \mu \\ \Im \mu \end{pmatrix}, \; \frac{\sigma^2}{2} \begin{pmatrix} 1 + \Re \xi & \Im \xi \\ \Im \xi & 1 - \Re \xi \end{pmatrix} \right)$
     ◮ $w_{ij}$ are circularly symmetric about $\mu_{ij}$ ($\xi_{ij} = 0$)
     ◮ relevance $\propto \frac{1}{\alpha_{ij}}$, and $\frac{2 |w_{ij} - \mu_{ij}|^2}{\alpha_{ij} |\mu_{ij}|^2}$ is $\chi^2_2$
     Factorized complex-valued priors $\pi$
     ◮ (C-VD) $\pi(w_{ij}) \propto |w_{ij}|^{-\rho}$, $\rho \ge 1$
     ◮ (C-ARD) $\pi(w_{ij}) = \mathcal{CN}(0, \frac{1}{\tau_{ij}}, 0)$
     [Figure: density plots of the complex Gaussian $\mathcal{CN}(0, 1, \eta e^{i\phi})$, $|\eta| \le 1$, over the real and imaginary axes for several values of $\eta$ and $\phi$]
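A minimal sketch of sampling from this posterior via its 2D real representation, assuming the covariance block above; it also checks the circularly symmetric case ($\xi = 0$) and the $\chi^2_2$ statistic. Names are illustrative, not the authors' API.

```python
import numpy as np

def sample_cn(mu, sigma2, xi, size, rng):
    """Draw samples from CN(mu, sigma2, sigma2*xi) using its 2D real
    Gaussian representation (the covariance block shown above)."""
    cov = 0.5 * sigma2 * np.array([[1 + xi.real, xi.imag],
                                   [xi.imag, 1 - xi.real]])
    mean = np.array([mu.real, mu.imag])
    re, im = rng.multivariate_normal(mean, cov, size=size).T
    return re + 1j * im

rng = np.random.default_rng(0)
mu, alpha = 1.0 + 2.0j, 0.5
sigma2 = alpha * abs(mu) ** 2                  # posterior scale sigma^2 = alpha*|mu|^2
w = sample_cn(np.complex128(mu), sigma2, np.complex128(0.0), 100_000, rng)

# With xi = 0 the draws are circularly symmetric about mu, and
# 2*|w - mu|^2 / (alpha*|mu|^2) should be chi-squared with 2 dof (mean 2).
stat = 2.0 * np.abs(w - mu) ** 2 / (alpha * abs(mu) ** 2)
print(np.mean(stat))   # close to 2
```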

  8. C-valued Variational Dropout: the $\mathrm{KL}(q \,\|\, \pi)$ term in the (ELBO)
     $\mathrm{KL}(q \,\|\, \pi) = \sum_{ij} \mathrm{KL}(q(w_{ij}) \,\|\, \pi(w_{ij}))$ (per-weight terms sketched below)
     (C-VD) improper prior:
     $\mathrm{KL}_{ij} \propto \frac{\rho - 2}{2} \log |\mu_{ij}|^2 + \log \frac{1}{\alpha_{ij}} - \frac{\rho}{2} \operatorname{Ei}\!\left(-\frac{1}{\alpha_{ij}}\right), \qquad \operatorname{Ei}(x) = \int_{-\infty}^{x} e^t t^{-1} \, dt$
     (C-ARD) the prior is optimized w.r.t. $\tau_{ij}$ by empirical Bayes:
     $\mathrm{KL}_{ij} = -1 - \log \sigma^2_{ij} \tau_{ij} + \tau_{ij} \left(\sigma^2_{ij} + |\mu_{ij}|^2\right), \qquad \min_{\tau_{ij}} \mathrm{KL}_{ij} = \log\left(1 + \frac{1}{\alpha_{ij}}\right)$
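A small Python sketch of both per-weight KL terms as reconstructed above, assuming SciPy's `expi` for the exponential integral Ei; function and variable names are illustrative, and the C-VD term is only defined up to an additive constant.

```python
import numpy as np
from scipy.special import expi   # exponential integral Ei(x)

def kl_cvd(log_alpha, mu, rho=2.0):
    """Per-weight KL for the C-VD improper prior |w|^(-rho),
    up to an additive constant, following the expression above."""
    alpha = np.exp(log_alpha)
    return (0.5 * (rho - 2.0) * np.log(np.abs(mu) ** 2)
            - log_alpha
            - 0.5 * rho * expi(-1.0 / alpha))

def kl_card(log_alpha):
    """Per-weight KL for the C-ARD prior after the empirical-Bayes
    minimization over tau: log(1 + 1/alpha)."""
    return np.log1p(np.exp(-log_alpha))

log_alpha = np.linspace(-4.0, 4.0, 9)
mu = np.ones_like(log_alpha) * (0.3 + 0.4j)
print(kl_cvd(log_alpha, mu, rho=2.0))
print(kl_card(log_alpha))
```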

  9. Experiments: Goals and Setup
     We run experiments on several datasets to
     ◮ validate the proposed C-valued sparsification methods
     ◮ explore the compression-performance profiles
     ◮ compare against the R-valued Sparse Variational Dropout
     Pipeline: 'pre-train' → 'compress' → 'fine-tune'
     ◮ 'compress' with R-/C-valued Variational Dropout layers
     ◮ 'fine-tune' the pruned network, keeping weights with $\log \alpha_{ij} \le -\tfrac{1}{2}$ (see the sketch after this list)
     Compression-stage objective:
     $\max_{q} \; \mathbb{E}_{w \sim q} \log p(\mathcal{D} \mid w) - \beta\, \mathrm{KL}(q \,\|\, \pi)$   (β-ELBO)
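A minimal sketch of the pruning rule and the resulting compression rate, assuming per-weight log-alpha arrays; the threshold follows the slide, everything else (names, shapes) is illustrative.

```python
import numpy as np

def prune_mask(log_alpha, threshold=-0.5):
    """Keep weights whose log-alpha is at most the threshold;
    a larger alpha means a noisier, less relevant weight."""
    return log_alpha <= threshold

def compression_rate(masks):
    """Total number of weights divided by the number kept after pruning."""
    total = sum(m.size for m in masks)
    kept = sum(int(m.sum()) for m in masks)
    return total / max(kept, 1)

rng = np.random.default_rng(0)
log_alphas = [rng.normal(loc=2.0, scale=3.0, size=(256, 128)),
              rng.normal(loc=2.0, scale=3.0, size=(128, 10))]
masks = [prune_mask(la) for la in log_alphas]
print(f"compression: x{compression_rate(masks):.1f}")
```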

  10. Experiments: Datasets
     Four MNIST-like datasets
     ◮ channel features (R ↪ C) or 2d Fourier features (see the sketch after this list)
     ◮ a fixed random subset of 10k training samples
     ◮ simple dense and convolutional nets
     CIFAR10 dataset (R³ ↪ C³)
     ◮ random cropping and horizontal flipping
     ◮ a C-valued variant of VGG16 [Simonyan and Zisserman, 2015]
     Music transcription on MusicNet [Thickstun et al., 2017]
     ◮ an audio dataset of 330 annotated musical compositions
     ◮ use the power spectrum to tell which piano keys are pressed
     ◮ compress the deep C-VNN proposed by [Trabelsi et al., 2018]
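One plausible way to build the two C-valued feature variants named above, sketched in Python; this is an assumption about the preprocessing, not necessarily the authors' exact pipeline.

```python
import numpy as np

def embed_real_as_complex(img):
    """Naive R -> C embedding: pixel values become the real part,
    with a zero imaginary part ('channel features')."""
    return img.astype(np.complex64)

def fourier_features(img):
    """2d Fourier features: the complex spectrum of the image."""
    return np.fft.fft2(img).astype(np.complex64)

img = np.random.default_rng(0).random((28, 28)).astype(np.float32)
print(embed_real_as_complex(img).dtype, fourier_features(img).shape)
```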

  11. Results: CIFAR10
     [Figure: compression-accuracy trade-off on CIFAR10 (raw), β = 0.5; x-axis: compression (×100 to ×1000), y-axis: test accuracy (0.80 to 0.92); curves: C VGG ARD, R VGG ARD, C VGG VD, R VGG VD]
     The networks are a C-valued version of VGG16 [Simonyan and Zisserman, 2015] and its R-valued counterpart.

  12. Results: MusicNet
     [Figure: compression-average precision trade-off on MusicNet (fft), β = 0.5; x-axis: compression (×10 to ×1000), y-axis: average precision (0.600 to 0.750), with the Trabelsi et al. (2018) baseline marked; curves: C DeepConvNet ARD, C DeepConvNet VD, C DeepConvNet k3 VD]
     The network is the C-VNN of Trabelsi et al. [2018].
