SLIDE 1

Bayesian Sparsification of Deep Complex-valued Networks

Ivan Nazarov, Evgeny Burnaev

ADASE, Skoltech, Moscow, Russia

SLIDE 4

Synopsis

Motivation for C-valued neural networks

◮ perform better for naturally C-valued data
◮ use half as much storage, but the same number of flops

Propose Sparse Variational Dropout for C-valued neural networks

◮ Bayesian sparsification method with C-valued distributions
◮ empirically explore the compression-performance trade-off

Conclusions

◮ C-valued methods compress similarly to their R-valued predecessors
◮ final performance benefits from fine-tuning the sparsified network
◮ compress a SOTA CVNN on MusicNet by 50–100× at a moderate performance penalty


SLIDE 6

C-valued neural networks: Applications

Data with natural C-valued representation

◮ radar and satellite imaging

[Hirose, 2009, Hänsch and Hellwich, 2010, Zhang et al., 2017]

◮ magnetic resonance imaging

[Hui and Smith, 1995, Wang et al., 2020]

◮ radio signal classification

[Yang et al., 2019, Tarver et al., 2019]

◮ spectral speech modelling and music transcription

[Wisdom et al., 2016, Trabelsi et al., 2018, Yang et al., 2019]

Exploring benefits beyond C-valued data

◮ sequence modelling, dynamical system identification

[Danihelka et al., 2016, Wisdom et al., 2016]

◮ image classification, road / lane segmentation

[Popa, 2017, Trabelsi et al., 2018, Gaudet and Maida, 2018]

◮ unitary transition matrices in recurrent networks

[Arjovsky et al., 2016, Wisdom et al., 2016]

SLIDE 7

C-valued neural networks: Implementation

Geometric representation C ≃ R²

◮ z = ℜz + ℑz, 2 = −1 ◮ ℜz and ℑz are real and imaginary parts of z

An intricate double-R network that respects C-arithmetic

RVNN linear operation:
  [ W11  W12 ]   [ x1 ]
  [ W21  W22 ] × [ x2 ]

CVNN linear operation:
  [ W11  −W21 ]   [ x1 ]
  [ W21   W11 ] × [ x2 ]

Activations z → σ(z), e.g. re^{iφ} → σ(r, φ) or z → σ(ℜz) + iσ(ℑz).
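To make the shared-block structure concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' code; all names are hypothetical) of the CVNN linear operation, storing ℜW and ℑW (W11 and W21 above) as two real matrices:

```python
import torch


class ComplexLinear(torch.nn.Module):
    """C-valued linear map y = W x, stored as the real pair (Re W, Im W).

    Equivalent to the 2x2 real block matrix [[W_re, -W_im], [W_im, W_re]]
    acting on (Re x, Im x): the same flops as a double-width real layer,
    but half the free parameters.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w_re = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.w_im = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))

    def forward(self, x_re: torch.Tensor, x_im: torch.Tensor):
        # (W_re + i W_im)(x_re + i x_im)
        #   = (W_re x_re - W_im x_im) + i (W_im x_re + W_re x_im)
        y_re = x_re @ self.w_re.t() - x_im @ self.w_im.t()
        y_im = x_re @ self.w_im.t() + x_im @ self.w_re.t()
        return y_re, y_im
```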


SLIDE 8

Sparsity and compression

Improve power, storage or throughput efficiency of deep nets

◮ Knowledge distillation

[Hinton et al., 2015, Balasubramanian, 2016]

◮ Network pruning

[LeCun et al., 1990, Seide et al., 2011, Zhu and Gupta, 2018]

◮ Low-rank matrix / tensor decomposition

[Denton et al., 2014, Novikov et al., 2015]

◮ Quantization and fixed point arithmetic

[Courbariaux et al., 2015, Han et al., 2016, Chen et al., 2017]

Applications to CVNN:

◮ C-modulus pruning, quantization with k-means in R²,

[Wu et al., 2019]

◮ ℓ1 regularization for hyper-complex-valued networks,

[Vecchi et al., 2020]

SLIDE 10

Sparse Variational Dropout

[Molchanov et al., 2017]

Variational Inference with automatic relevance determination effect:

max_{q ∈ Q}  E_{w∼q} log p(D | w)  −  KL(q ‖ π)        (ELBO)
             (data model likelihood)   (variational regularization)

prior π → data model likelihood → posterior q (close to p(w | D))

Factorized Gaussian dropout posterior family Q
◮ w_ij ∼ q(w_ij) = N(w_ij | µ_ij, α_ij µ_ij²), α_ij > 0, and µ_ij ∈ R

Factorized prior
◮ (VD) π(w_ij) ∝ 1/|w_ij|   [Molchanov et al., 2017]
◮ (ARD) π(w_ij) = N(w_ij | 0, 1/τ_ij)   [Kharitonov et al., 2018]
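For reference, a minimal PyTorch sketch of such a layer (our own illustration; names and initial values are hypothetical, and the standard local reparameterization trick stands in for sampling weights directly):

```python
import torch
import torch.nn.functional as F


class VDLinear(torch.nn.Module):
    """Linear layer with the factorized Gaussian dropout posterior
    q(w_ij) = N(w_ij | mu_ij, alpha_ij * mu_ij^2)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_alpha = torch.nn.Parameter(torch.full((out_features, in_features), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # local reparameterization: sample the pre-activations, whose mean and
        # variance under q(w) are available in closed form
        mean = F.linear(x, self.mu)
        var = F.linear(x * x, self.log_alpha.exp() * self.mu ** 2)
        return mean + var.clamp(min=1e-12).sqrt() * torch.randn_like(mean)

    def kl_ard(self) -> torch.Tensor:
        # (ARD) the empirical-Bayes minimization over tau_ij collapses the
        # penalty to KL_ij = 1/2 * log(1 + 1/alpha_ij) in the real-valued case
        return 0.5 * torch.log1p(torch.exp(-self.log_alpha)).sum()
```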

SLIDE 13

C-valued Variational Dropout

Our proposal

Factorized complex-valued posterior q(w) = ∏_ij q(w_ij)

◮ w_ij are independent CN(w | µ, σ², σ²ξ), with σ² = α|µ|² and |ξ| ≤ 1:

  (ℜw, ℑw)ᵀ ∼ N₂( (ℜµ, ℑµ)ᵀ,  (σ²/2) [ 1 + ℜξ    ℑξ   ]
                                       [   ℑξ    1 − ℜξ ] )

◮ w_ij are circularly symmetric about µ_ij (ξ_ij = 0)
◮ relevance ∝ 1/α_ij, and 2|w_ij − µ_ij|² / (α_ij |µ_ij|²) is χ²₂

Factorized complex-valued priors π
◮ (C-VD) π(w_ij) ∝ |w_ij|⁻ρ, ρ ≥ 1
◮ (C-ARD) π(w_ij) = CN(0, 1/τ_ij, 0)

[Figure: density contours of CN(0, 1, ηe^{iφ}) for |η| ≤ 1 in the (real, imag) plane]
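In code, the circularly symmetric case ξ = 0 reduces to two independent real Gaussians of variance σ²/2 each; a minimal sketch under the definitions above (function name hypothetical):

```python
import torch


def sample_circular_cgauss(mu_re, mu_im, log_alpha):
    """Draw w ~ CN(mu, alpha * |mu|^2, 0): circularly symmetric about mu,
    so Re(w - mu) and Im(w - mu) are i.i.d. N(0, sigma^2 / 2).
    Consequently 2 |w - mu|^2 / (alpha |mu|^2) is chi-squared with 2 dof."""
    sigma2 = log_alpha.exp() * (mu_re ** 2 + mu_im ** 2)   # sigma^2 = alpha |mu|^2
    std = (0.5 * sigma2).clamp(min=1e-12).sqrt()
    w_re = mu_re + std * torch.randn_like(mu_re)
    w_im = mu_im + std * torch.randn_like(mu_im)
    return w_re, w_im
```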

SLIDE 14

C-valued Variational Dropout

KL(q ‖ π) term in (ELBO):   KL(q ‖ π) = Σ_ij KL(q(w_ij) ‖ π(w_ij))

(C-VD) improper prior:
KL_ij ∝ (ρ − 2)/2 · log |µ_ij|² + log(1/α_ij) − (ρ/2) · Ei(−1/α_ij),
where Ei(x) = ∫_{−∞}^{x} eᵗ t⁻¹ dt

(C-ARD) prior is optimized w.r.t. τ_ij in empirical Bayes:
KL_ij = −1 − log(σ²_ij τ_ij) + τ_ij (σ²_ij + |µ_ij|²)
min_{τ_ij} KL_ij = log(1 + 1/α_ij)
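The empirical-Bayes reduction can be verified numerically; a small sanity check assuming NumPy/SciPy, with arbitrary values of α and |µ|²:

```python
import numpy as np
from scipy.optimize import minimize_scalar


def kl_c_ard(tau, sigma2, mu_abs2):
    # KL_ij(tau) = -1 - log(sigma^2 tau) + tau (sigma^2 + |mu|^2)
    return -1.0 - np.log(sigma2 * tau) + tau * (sigma2 + mu_abs2)


alpha, mu_abs2 = 0.3, 2.0
sigma2 = alpha * mu_abs2                       # sigma^2 = alpha |mu|^2
res = minimize_scalar(kl_c_ard, bounds=(1e-8, 1e3), args=(sigma2, mu_abs2),
                      method="bounded")
print(res.fun, np.log1p(1.0 / alpha))          # both ~ log(1 + 1/alpha) = 1.466
```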
SLIDE 17

Experiments: Goals and Setup

We conduct numerous experiments on various datasets to

◮ validate the proposed C-valued sparsification methods
◮ explore the compression-performance profiles
◮ compare to the R-valued Sparse Variational Dropout

‘pre-train’ → ‘compress’ → ‘fine-tune’

◮ ‘compress’ with R/C-Variational Dropout layers
◮ ‘fine-tune’ the pruned network (keep weights with log α_ij ≤ −½)

max_{q}  E_{w∼q} log p(D | w) − β KL(q ‖ π)        (β-ELBO)

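The ‘compress’ → ‘fine-tune’ handoff amounts to freezing a binary mask computed from log α; a sketch (assuming the hypothetical VDLinear-style layer sketched earlier, and the −½ threshold quoted above):

```python
import torch


@torch.no_grad()
def prune(layer, threshold: float = -0.5):
    """Keep only weights with log alpha_ij <= threshold; zero out the rest
    so the pruned network can be fine-tuned through a fixed mask."""
    mask = (layer.log_alpha <= threshold).to(layer.mu.dtype)
    layer.mu.mul_(mask)
    return mask  # reapply after each fine-tuning step: layer.mu.data.mul_(mask)
```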

SLIDE 20

Experiments: Datasets

Four MNIST-like datasets

◮ channel features (R ↪ C) or 2d Fourier features
◮ fixed random subset of 10k train samples
◮ simple dense and convolutional nets

CIFAR10 dataset (R³ ↪ C³)

◮ random cropping and horizontal flipping
◮ C-valued variant of VGG16 [Simonyan and Zisserman, 2015]

Music transcription on MusicNet [Thickstun et al., 2017]

◮ audio dataset of 330 annotated musical compositions
◮ use the power spectrum to tell which piano keys are pressed
◮ compress the deep CVNN proposed by [Trabelsi et al., 2018]


SLIDE 21

Results: CIFAR10

[Figure: trade-off on CIFAR10 (raw): accuracy (0.80–0.92) vs. compression (×100 to ×1000); curves: C VGG ARD, C VGG VD, R VGG ARD, R VGG VD]

C-valued version of VGG16 [Simonyan and Zisserman, 2015]


SLIDE 22

Results: MusicNet

[Figure: trade-off on MusicNet (fft): average precision (0.600–0.750) vs. compression (×10 to ×1000), with the Trabelsi et al. (2018) baseline marked; curves: C DeepConvNet ARD, C DeepConvNet VD, C DeepConvNet k3 VD]

The CVNN of Trabelsi et al. [2018]


SLIDE 23

MusicNet: Effects of pruning threshold

[Figure: two panels, C-ARD and C-VD for DeepConvNet (MusicNet): average precision (0.550–0.750) vs. compression (×100 to ×1000); pre- and post-fine-tune curves at C = 0.005 and C = 0.05 for several pruning thresholds, with the Trabelsi et al. (2018) baseline marked]

Effect of threshold on the CVNN of Trabelsi et al. [2018]


SLIDE 25

Summary: Results

Bayesian sparsification of C-valued networks

◮ proposed the C-VD and C-ARD methods
◮ investigated the performance-compression trade-off
◮ compressed the CVNN of Trabelsi et al. [2018] by 50–100× at a small performance penalty

Experiments

◮ C-VD and C-ARD show a trade-off similar to the R-valued methods
◮ R-valued networks tend to compress better than C-valued nets
◮ fine-tuning improves performance in the high-compression regime
◮ β in (β-ELBO) influences compression more strongly than the pruning threshold


SLIDE 26

Summary: Limitations

Circular symmetry of the posterior q(w_ij) about µ_ij implies independence of ℜw_ij and ℑw_ij

◮ modelling corr(w_ij, w̄_ij), i.e. allowing ξ_ij ≠ 0, gives a better variational approximation

Factorized q implies parameter independence

◮ structured sparsity is desirable for fast computations and hardware implementations
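Lifting the circularity restriction only changes the sampler; a sketch (hypothetical helper, all tensor arguments) that draws from the N₂ form on the posterior slide with ξ ≠ 0 via a 2×2 Cholesky factor:

```python
import torch


def sample_noncircular_cgauss(mu_re, mu_im, sigma2, xi_re, xi_im):
    """Draw w ~ CN(mu, sigma^2, sigma^2 * xi) with |xi| <= 1, i.e. a
    non-circular complex Gaussian with correlated real/imaginary parts."""
    c11 = 0.5 * sigma2 * (1.0 + xi_re)             # var of Re(w)
    c22 = 0.5 * sigma2 * (1.0 - xi_re)             # var of Im(w)
    c21 = 0.5 * sigma2 * xi_im                     # cov(Re(w), Im(w))
    # Cholesky factor of the 2x2 covariance (PSD whenever |xi| <= 1)
    l11 = c11.clamp(min=1e-12).sqrt()
    l21 = c21 / l11
    l22 = (c22 - l21 ** 2).clamp(min=1e-12).sqrt()
    e1, e2 = torch.randn_like(mu_re), torch.randn_like(mu_im)
    return mu_re + l11 * e1, mu_im + l21 * e1 + l22 * e2
```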
