SLIDE 1

Bayesian Sparsification of Deep Complex-valued Networks

Ivan Nazarov, Evgeny Burnaev

ADASE, Skoltech, Moscow, Russia

SLIDE 4

Synopsis

Motivation for C-valued neural networks

◮ perform better for naturally C-valued data
◮ use half as much storage, but the same number of flops

Propose Sparse Variational Dropout for C-valued neural networks

◮ Bayesian sparsification method with C-valued distributions
◮ empirically explore the compression-performance trade-off

Conclusions

◮ C-valued methods compress similarly to their R-valued predecessors
◮ final performance benefits from fine-tuning the sparsified network
◮ compress a SOTA CVNN on MusicNet by 50–100× at a moderate performance penalty


SLIDE 6

C-valued neural networks: Applications

Data with natural C-valued representation

◮ radar and satellite imaging

[Hirose, 2009, Hänsch and Hellwich, 2010, Zhang et al., 2017]

◮ magnetic resonance imaging

[Hui and Smith, 1995, Wang et al., 2020]

◮ radio signal classification

[Yang et al., 2019, Tarver et al., 2019]

◮ spectral speech modelling and music transcription

[Wisdom et al., 2016, Trabelsi et al., 2018, Yang et al., 2019]

Exploring benefits beyond C-valued data

◮ sequence modelling, dynamical system identification

[Danihelka et al., 2016, Wisdom et al., 2016]

◮ image classification, road / lane segmentation

[Popa, 2017, Trabelsi et al., 2018, Gaudet and Maida, 2018]

◮ unitary transition matrices in recurrent networks

[Arjovsky et al., 2016, Wisdom et al., 2016]

SLIDE 7

C-valued neural networks: Implementation

Geometric representation C ≃ R²

◮ z = ℜz + ℑz, 2 = −1 ◮ ℜz and ℑz are real and imaginary parts of z

An intricate double-R network that respects C-arithmetic

RVNN linear operation:
  [ W11  W12 ]   [ x1 ]
  [ W21  W22 ] × [ x2 ]

CVNN linear operation:
  [ W11  −W21 ]   [ x1 ]
  [ W21   W11 ] × [ x2 ]

Activations z → σ(z), e.g. re^{iφ} → σ(r, φ) or z → σ(ℜz) + iσ(ℑz).
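To make the shared-block structure concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' code; all names are hypothetical) of the CVNN linear operation, storing ℜW and ℑW (W11 and W21 above) as two real matrices:

```python
import torch


class ComplexLinear(torch.nn.Module):
    """C-valued linear map y = W x, stored as the real pair (Re W, Im W).

    Equivalent to the 2x2 real block matrix [[W_re, -W_im], [W_im, W_re]]
    acting on (Re x, Im x): the same flops as a double-width real layer,
    but half the free parameters.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w_re = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.w_im = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))

    def forward(self, x_re: torch.Tensor, x_im: torch.Tensor):
        # (W_re + i W_im)(x_re + i x_im)
        #   = (W_re x_re - W_im x_im) + i (W_im x_re + W_re x_im)
        y_re = x_re @ self.w_re.t() - x_im @ self.w_im.t()
        y_im = x_re @ self.w_im.t() + x_im @ self.w_re.t()
        return y_re, y_im
```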


SLIDE 8

Sparsity and compression

Improve power, storage or throughput efficiency of deep nets

◮ Knowledge distillation

[Hinton et al., 2015, Balasubramanian, 2016]

◮ Network pruning

[LeCun et al., 1990, Seide et al., 2011, Zhu and Gupta, 2018]

◮ Low-rank matrix / tensor decomposition

[Denton et al., 2014, Novikov et al., 2015]

◮ Quantization and fixed point arithmetic

[Courbariaux et al., 2015, Han et al., 2016, Chen et al., 2017]

Applications to CVNN:

◮ C-modulus pruning, quantization with k-means in R²,

[Wu et al., 2019]

◮ ℓ1 regularization for hyper-complex-valued networks,

[Vecchi et al., 2020]

SLIDE 10

Sparse Variational Dropout

[Molchanov et al., 2017]

Variational Inference with automatic relevance determination effect:

max_{q ∈ Q}  E_{w∼q} log p(D | w)  −  KL(q ‖ π)        (ELBO)
             (data model likelihood)   (variational regularization)

prior π → data model likelihood → posterior q (close to p(w | D))

Factorized Gaussian dropout posterior family Q
◮ w_ij ∼ q(w_ij) = N(w_ij | µ_ij, α_ij µ_ij²), α_ij > 0, and µ_ij ∈ R

Factorized prior
◮ (VD) π(w_ij) ∝ 1/|w_ij|   [Molchanov et al., 2017]
◮ (ARD) π(w_ij) = N(w_ij | 0, 1/τ_ij)   [Kharitonov et al., 2018]
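For reference, a minimal PyTorch sketch of such a layer (our own illustration; names and initial values are hypothetical, and the standard local reparameterization trick stands in for sampling weights directly):

```python
import torch
import torch.nn.functional as F


class VDLinear(torch.nn.Module):
    """Linear layer with the factorized Gaussian dropout posterior
    q(w_ij) = N(w_ij | mu_ij, alpha_ij * mu_ij^2)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_alpha = torch.nn.Parameter(torch.full((out_features, in_features), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # local reparameterization: sample the pre-activations, whose mean and
        # variance under q(w) are available in closed form
        mean = F.linear(x, self.mu)
        var = F.linear(x * x, self.log_alpha.exp() * self.mu ** 2)
        return mean + var.clamp(min=1e-12).sqrt() * torch.randn_like(mean)

    def kl_ard(self) -> torch.Tensor:
        # (ARD) the empirical-Bayes minimization over tau_ij collapses the
        # penalty to KL_ij = 1/2 * log(1 + 1/alpha_ij) in the real-valued case
        return 0.5 * torch.log1p(torch.exp(-self.log_alpha)).sum()
```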

SLIDE 13

C-valued Variational Dropout

Our proposal

Factorized complex-valued posterior q(w) = ∏_ij q(w_ij)

◮ w_ij are independent CN(w | µ, σ², σ²ξ), with σ² = α|µ|² and |ξ| ≤ 1:

  (ℜw, ℑw)ᵀ ∼ N₂( (ℜµ, ℑµ)ᵀ,  (σ²/2) [ 1 + ℜξ    ℑξ   ]
                                       [   ℑξ    1 − ℜξ ] )

◮ w_ij are circularly symmetric about µ_ij (ξ_ij = 0)
◮ relevance ∝ 1/α_ij, and 2|w_ij − µ_ij|² / (α_ij |µ_ij|²) is χ²₂

Factorized complex-valued priors π
◮ (C-VD) π(w_ij) ∝ |w_ij|⁻ρ, ρ ≥ 1
◮ (C-ARD) π(w_ij) = CN(0, 1/τ_ij, 0)

[Figure: density contours of CN(0, 1, ηe^{iφ}) for |η| ≤ 1 in the (real, imag) plane]
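In code, the circularly symmetric case ξ = 0 reduces to two independent real Gaussians of variance σ²/2 each; a minimal sketch under the definitions above (function name hypothetical):

```python
import torch


def sample_circular_cgauss(mu_re, mu_im, log_alpha):
    """Draw w ~ CN(mu, alpha * |mu|^2, 0): circularly symmetric about mu,
    so Re(w - mu) and Im(w - mu) are i.i.d. N(0, sigma^2 / 2).
    Consequently 2 |w - mu|^2 / (alpha |mu|^2) is chi-squared with 2 dof."""
    sigma2 = log_alpha.exp() * (mu_re ** 2 + mu_im ** 2)   # sigma^2 = alpha |mu|^2
    std = (0.5 * sigma2).clamp(min=1e-12).sqrt()
    w_re = mu_re + std * torch.randn_like(mu_re)
    w_im = mu_im + std * torch.randn_like(mu_im)
    return w_re, w_im
```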

SLIDE 14

C-valued Variational Dropout

KL(q ‖ π) term in (ELBO):   KL(q ‖ π) = Σ_ij KL(q(w_ij) ‖ π(w_ij))

(C-VD) improper prior:
KL_ij ∝ (ρ − 2)/2 · log |µ_ij|² + log(1/α_ij) − (ρ/2) · Ei(−1/α_ij),
where Ei(x) = ∫_{−∞}^{x} eᵗ t⁻¹ dt

(C-ARD) prior is optimized w.r.t. τ_ij in empirical Bayes:
KL_ij = −1 − log(σ²_ij τ_ij) + τ_ij (σ²_ij + |µ_ij|²)
min_{τ_ij} KL_ij = log(1 + 1/α_ij)
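The empirical-Bayes reduction can be verified numerically; a small sanity check assuming NumPy/SciPy, with arbitrary values of α and |µ|²:

```python
import numpy as np
from scipy.optimize import minimize_scalar


def kl_c_ard(tau, sigma2, mu_abs2):
    # KL_ij(tau) = -1 - log(sigma^2 tau) + tau (sigma^2 + |mu|^2)
    return -1.0 - np.log(sigma2 * tau) + tau * (sigma2 + mu_abs2)


alpha, mu_abs2 = 0.3, 2.0
sigma2 = alpha * mu_abs2                       # sigma^2 = alpha |mu|^2
res = minimize_scalar(kl_c_ard, bounds=(1e-8, 1e3), args=(sigma2, mu_abs2),
                      method="bounded")
print(res.fun, np.log1p(1.0 / alpha))          # both ~ log(1 + 1/alpha) = 1.466
```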
SLIDE 17

Experiments: Goals and Setup

We conduct numerous experiments on various datasets to

◮ validate the proposed C-valued sparsification methods
◮ explore the compression-performance profiles
◮ compare to the R-valued Sparse Variational Dropout

‘pre-train’ → ‘compress’ → ‘fine-tune’

◮ ‘compress’ with R/C-Variational Dropout layers
◮ ‘fine-tune’ the pruned network (keep weights with log α_ij ≤ −½)

max_{q}  E_{w∼q} log p(D | w) − β KL(q ‖ π)        (β-ELBO)

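The ‘compress’ → ‘fine-tune’ handoff amounts to freezing a binary mask computed from log α; a sketch (assuming the hypothetical VDLinear-style layer sketched earlier, and the −½ threshold quoted above):

```python
import torch


@torch.no_grad()
def prune(layer, threshold: float = -0.5):
    """Keep only weights with log alpha_ij <= threshold; zero out the rest
    so the pruned network can be fine-tuned through a fixed mask."""
    mask = (layer.log_alpha <= threshold).to(layer.mu.dtype)
    layer.mu.mul_(mask)
    return mask  # reapply after each fine-tuning step: layer.mu.data.mul_(mask)
```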

SLIDE 20

Experiments: Datasets

Four MNIST-like datasets

◮ channel features (R ↪ C) or 2d Fourier features
◮ fixed random subset of 10k train samples
◮ simple dense and convolutional nets

CIFAR10 dataset (R³ ↪ C³)

◮ random cropping and horizontal flipping
◮ C-valued variant of VGG16 [Simonyan and Zisserman, 2015]

Music transcription on MusicNet [Thickstun et al., 2017]

◮ audio dataset of 330 annotated musical compositions
◮ use the power spectrum to tell which piano keys are pressed
◮ compress the deep CVNN proposed by [Trabelsi et al., 2018]


SLIDE 21

Results: CIFAR10

[Figure: trade-off on CIFAR10 (raw): accuracy (0.80–0.92) vs. compression (×100 to ×1000); curves: C VGG ARD, C VGG VD, R VGG ARD, R VGG VD]

C-valued version of VGG16 [Simonyan and Zisserman, 2015]


SLIDE 22

Results: MusicNet

[Figure: trade-off on MusicNet (fft): average precision (0.600–0.750) vs. compression (×10 to ×1000), with the Trabelsi et al. (2018) baseline marked; curves: C DeepConvNet ARD, C DeepConvNet VD, C DeepConvNet k3 VD]

The CVNN of Trabelsi et al. [2018]


SLIDE 23

MusicNet: Effects of pruning threshold

[Figure: two panels, C-ARD and C-VD for DeepConvNet (MusicNet): average precision (0.550–0.750) vs. compression (×100 to ×1000); pre- and post-fine-tune curves at C = 0.005 and C = 0.05 for several pruning thresholds, with the Trabelsi et al. (2018) baseline marked]

Effect of threshold on the CVNN of Trabelsi et al. [2018]


SLIDE 25

Summary: Results

Bayesian sparsification of C-valued networks

◮ proposed the C-VD and C-ARD methods
◮ investigated the performance-compression trade-off
◮ compressed the CVNN of Trabelsi et al. [2018] by 50–100× at a small performance penalty

Experiments

◮ C-VD and C-ARD show a trade-off similar to the R-valued methods
◮ R-valued networks tend to compress better than C-valued nets
◮ fine-tuning improves performance in the high-compression regime
◮ β in (β-ELBO) influences compression more strongly than the pruning threshold


SLIDE 26

Summary: Limitations

Circular symmetry of the posterior q(w_ij) about µ_ij implies independence of ℜw_ij and ℑw_ij

◮ modelling corr(w_ij, w̄_ij), i.e. allowing ξ_ij ≠ 0, gives a better variational approximation

Factorized q implies parameter independence

◮ structured sparsity is desirable for fast computations and hardware implementations
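Lifting the circularity restriction only changes the sampler; a sketch (hypothetical helper, all tensor arguments) that draws from the N₂ form on the posterior slide with ξ ≠ 0 via a 2×2 Cholesky factor:

```python
import torch


def sample_noncircular_cgauss(mu_re, mu_im, sigma2, xi_re, xi_im):
    """Draw w ~ CN(mu, sigma^2, sigma^2 * xi) with |xi| <= 1, i.e. a
    non-circular complex Gaussian with correlated real/imaginary parts."""
    c11 = 0.5 * sigma2 * (1.0 + xi_re)             # var of Re(w)
    c22 = 0.5 * sigma2 * (1.0 - xi_re)             # var of Im(w)
    c21 = 0.5 * sigma2 * xi_im                     # cov(Re(w), Im(w))
    # Cholesky factor of the 2x2 covariance (PSD whenever |xi| <= 1)
    l11 = c11.clamp(min=1e-12).sqrt()
    l21 = c21 / l11
    l22 = (c22 - l21 ** 2).clamp(min=1e-12).sqrt()
    e1, e2 = torch.randn_like(mu_re), torch.randn_like(mu_im)
    return mu_re + l11 * e1, mu_im + l21 * e1 + l22 * e2
```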
