SLIDE 1

Multiscale Sparse Models in Deep Convolutional Networks

Tomas Anglès, Roberto Leonarduzzi, Stéphane Mallat, Louis Thiry, John Zarka, Sixin Zhang
Collège de France, École Normale Supérieure, Flatiron Institute

www.di.ens.fr/data

SLIDE 2

Deep Convolutional Network

  • Deep convolutional neural network to predict y = f(x) (Y. LeCun):

x ∈ R^d → L1 → ρ → L2 → ρ → ⋯ → Lj → ρ → linear → f̃(x)

  • Each Lj computes spatial convolutions and linear combinations of channels, moving along a scale axis.
  • ρ(a) = max(a, 0) is the ReLU non-linearity.
  • Supervised learning of the Lj from n examples {xi, f(xi)}i≤n.
  • Exceptional results for images, speech, language, bio-data, quantum chemistry regressions, ...
  • How does it reduce dimensionality to a low dimension? Multiscale structure, sparsity, invariants.
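
A minimal sketch of this generic architecture, in PyTorch; the depth, channel widths and output dimension are illustrative choices, not the talk's:

```python
import torch
import torch.nn as nn

# Generic deep convolutional network f~(x): a cascade of convolutions L_j
# and ReLUs rho, followed by a final linear layer. Strided convolutions
# move one step along the scale axis.
class DeepConvNet(nn.Module):
    def __init__(self, in_channels=3, widths=(32, 64, 128), n_outputs=10):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:
            # L_j: spatial convolution + linear combination of channels
            layers += [nn.Conv2d(c, w, kernel_size=3, stride=2, padding=1),
                       nn.ReLU()]  # rho(a) = max(a, 0)
            c = w
        self.features = nn.Sequential(*layers)
        self.linear = nn.Linear(c, n_outputs)

    def forward(self, x):
        h = self.features(x)
        h = h.mean(dim=(2, 3))      # global spatial average
        return self.linear(h)       # f~(x)

net = DeepConvNet()
y = net(torch.randn(1, 3, 64, 64))  # prediction for a random input x
```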

SLIDE 3

Statistical Models from 1 Example

  • Supervised network training (ex: on ImageNet).
  • For 1 realisation x of X, compute each layer.
  • Compute correlation statistics of network coefficients (M. Bethge et al.).
  • Synthesize x̃ having similar statistics: x has 6·10^4 pixels, x̃ is specified by 2·10^5 correlations.
  • What mathematical interpretation?
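
A sketch of this synthesis-by-statistics loop. The "network coefficients" here are a toy stand-in (fixed random convolutional features and their channel correlations), not a trained ImageNet network:

```python
import torch

# Synthesize x~ by gradient descent so that its correlation statistics
# match those of a single observed realisation x.
torch.manual_seed(0)
conv = torch.nn.Conv2d(1, 16, 7, padding=3)   # fixed random filters
for p in conv.parameters():
    p.requires_grad_(False)

def stats(img):
    h = torch.relu(conv(img))                   # "layer coefficients"
    h = h.flatten(2)                            # channels x positions
    return h @ h.transpose(1, 2) / h.shape[-1]  # channel correlations

x = torch.rand(1, 1, 64, 64)                    # one realisation of X
target = stats(x)

x_tilde = torch.rand(1, 1, 64, 64, requires_grad=True)
opt = torch.optim.Adam([x_tilde], lr=0.02)
for _ in range(500):
    opt.zero_grad()
    loss = ((stats(x_tilde) - target) ** 2).sum()
    loss.backward()
    opt.step()
```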

SLIDE 4

Learned Generative Networks

  • Wasserstein autoencoder: trained on n examples {xi}i≤n.
  • Encoder Φ (a cascade L1, ρ, ..., Lj, ρ) computes a code Z = Φ(X) that is nearly Gaussian white; the decoder G (layers W1, W2, ..., Wj) computes X̃ = G(Z).
  • Network trained on bedroom images: decoded interpolations Z = αZ1 + (1 − α)Z2 between two codes Z1, Z2 produce a linearization of deformations.
  • Network trained on faces of celebrities: samples G(Z).
  • What mathematical interpretation?
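
A minimal sketch of the latent interpolation Z = αZ1 + (1 − α)Z2. Phi and G below are untrained placeholders with hypothetical sizes; any trained encoder/decoder pair can be substituted:

```python
import torch
import torch.nn as nn

# Encoder Z = Phi(X) and decoder X~ = G(Z) as flat placeholder networks.
Phi = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))
G = nn.Sequential(nn.Linear(128, 64 * 64), nn.Unflatten(1, (1, 64, 64)))

x1, x2 = torch.rand(2, 1, 1, 64, 64)    # two input images
z1, z2 = Phi(x1), Phi(x2)
frames = [G(a * z1 + (1 - a) * z2)       # decoded interpolation path
          for a in torch.linspace(0, 1, 5)]
```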

SLIDE 5

Image Classification: ImageNet 2012

1000 classes, 1.2 million labeled training images of 224 × 224 pixels.

Top 5 error: Alex-Net 20%, ResNet 10%.

SLIDE 6

Scale Separation and Interactions

  • Interactions of d bodies represented by x(u): particles, pixels, ...
  • Dimension reduction: multiscale regrouping of the interactions of d bodies into interactions of O(log d) groups ⇒ wavelet transforms.
  • Scale separation, and interactions across scales: how to capture scale interactions?
  • Critical harmonic analysis problems since the 1970's.

SLIDE 7

Overview

  • Scale separation with wavelets and interactions through phase
  • Linear scale interaction models:

– Compressive signal approximations
– Stochastic models of stationary processes

  • Non-linear scale interaction models with sparse dictionaries:

– Generative autoencoders
– Classification of ImageNet

  • All these roads lead to Convolutional Neural Networks…

SLIDE 8

Scale Separation with Wavelets

  • Wavelet transform (invertible):

Wx = ( x ⋆ φ_{2^J} , x ⋆ ψλ )λ

  • Wavelet filter ψ(u) = real part + i · imaginary part, rotated by θ and dilated by 2^j:

ψλ(u) = 2^{−2j} ψ(2^{−j} rθ u) , λ = (2^j, θ)

  • In Fourier: (x ⋆ ψλ)^(ω) = x̂(ω) ψ̂λ(ω), with each ψ̂_{2^j,θ}(ω) localized in the (ω1, ω2) frequency plane.
  • Zero mean and no correlations across scales:

Σu x ⋆ ψλ(u) (x ⋆ ψλ′(u))* = Σω |x̂(ω)|² ψ̂λ(ω) ψ̂λ′(ω)* ≈ 0 if λ ≠ λ′

  • Problem: the two phases fluctuate, so these linear correlations carry no scale interactions.
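
A sketch of such a filter bank: a Morlet-like mother wavelet, rotated by θ and dilated by 2^j, applied as a product in the Fourier domain. All parameters (σ, ξ, sizes) are illustrative choices:

```python
import numpy as np

def morlet_2d(N, j, theta, sigma=0.8, xi=3 * np.pi / 4):
    # psi_lambda(u) = 2^{-2j} psi(2^{-j} r_theta u): Gaussian envelope,
    # oscillation at frequency xi, plus a zero-mean correction.
    g = np.mgrid[-N // 2:N // 2, -N // 2:N // 2].astype(float)
    r = np.stack([np.cos(theta) * g[0] + np.sin(theta) * g[1],
                  -np.sin(theta) * g[0] + np.cos(theta) * g[1]]) / 2 ** j
    env = np.exp(-(r[0] ** 2 + r[1] ** 2) / (2 * sigma ** 2))
    wave = np.exp(1j * xi * r[0])
    psi = env * (wave - (env * wave).sum() / env.sum())  # zero mean
    return psi / 2 ** (2 * j)

def wavelet_transform(x, J=3, n_theta=4):
    # x * psi_lambda computed as a Fourier-domain product, as on the slide
    xf = np.fft.fft2(x)
    return {(j, t): np.fft.ifft2(
                xf * np.fft.fft2(np.fft.ifftshift(
                    morlet_2d(x.shape[0], j, np.pi * t / n_theta))))
            for j in range(J) for t in range(n_theta)}

Wx = wavelet_transform(np.random.rand(64, 64))
```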

SLIDE 9

Wavelet Transform Filter Cascade

[Figure: filter cascade covering scales from 2^0 to 2^J.]

How to capture multiscale similarities? ReLU & phase.

SLIDE 10

Rectified Wavelet Coefficients

  • Multiphase real wavelets: ψα,λ = Real(e^{−iα} ψλ)
  • Rectified with ρ(a) = max(a, 0):

Ux = ( x ⋆ φ_{2^J} , ρ(x ⋆ ψα,λ) )α,λ

  • Linearly invertible: ρ(a) − ρ(−a) = a ⇒ x = U^{−1}Ux with U^{−1} linear.
  • The ReLU creates non-zero means and correlations across scales:

Σu ρ(x ⋆ ψα,λ(u)) and Σu ρ(x ⋆ ψα,λ(u)) ρ(x ⋆ ψα′,λ′(u)) : conv. net. coefficients
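
A one-line numeric check of the identity behind this linear invertibility: the rectified coefficients at opposite phases recover the signed coefficient.

```python
import numpy as np

# rho(a) - rho(-a) = a: phases alpha and alpha + pi give psi and -psi,
# so their rectified coefficients reconstruct x * psi_{alpha,lambda}.
rho = lambda a: np.maximum(a, 0)
a = np.random.randn(1000)       # stand-in for x * psi_{alpha,lambda}(u)
assert np.allclose(rho(a) - rho(-a), a)
```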

SLIDE 11

Linear Rectifiers Act on Phase

  • Writing x ⋆ ψλ = |x ⋆ ψλ| e^{iφ(x⋆ψλ)}, and using the homogeneity ρ(αa) = α ρ(a) for α > 0:

Ux(u, α, λ) = ρ(x ⋆ Real(e^{iα} ψλ)) = ρ(Real(e^{iα} x ⋆ ψλ)) = |x ⋆ ψλ| ρ(cos(α + φ(x ⋆ ψλ)))

  • Phase harmonics: ∀ z = |z| e^{iφ(z)} ∈ C, define [z]^k = |z| e^{ikφ(z)}.
  • A ReLU computes phase harmonics, as does any homogeneous non-linearity ρ.
  • Theorem: taking the Fourier transform along the phase α, with γ(α) = ρ(cos α),

Ûx(u, k, λ) = γ̂(k) |x ⋆ ψλ(u)| e^{ikφ(x⋆ψλ(u))} = γ̂(k) [x ⋆ ψλ(u)]^k
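
A numeric verification of the theorem for one complex coefficient z = x ⋆ ψλ(u): the FFT along a fine grid of phases α matches γ̂(k)[z]^k up to discretization error.

```python
import numpy as np

rho = lambda a: np.maximum(a, 0)
z = 2.0 * np.exp(1j * 0.7)               # stand-in for x * psi_lambda(u)
n = 4096
alphas = 2 * np.pi * np.arange(n) / n

U = rho(np.real(np.exp(1j * alphas) * z))        # Ux(u, alpha, lambda)
U_hat = np.fft.fft(U) / n                        # Fourier along the phase
gamma_hat = np.fft.fft(rho(np.cos(alphas))) / n  # gamma(alpha) = rho(cos alpha)

for k in range(4):
    harmonic = gamma_hat[k] * abs(z) * np.exp(1j * k * np.angle(z))
    assert np.allclose(U_hat[k], harmonic, atol=1e-6)  # gamma_hat(k) [z]^k
```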

SLIDE 12

Frequency Transpositions

  • Phase harmonics: [x ⋆ ψλ]^k = |x ⋆ ψλ(u)| e^{ikφ(x⋆ψλ(u))}
  • The harmonic k performs a non-linear frequency dilation / transposition with no time dilation: the spectrum of x ⋆ ψλ, centered at λ, is transposed to kλ (k = 1, 2, 3 → λ, 2λ, 3λ).
  • x ⋆ ψλ and x ⋆ ψλ′ are not correlated, but [x ⋆ ψλ]^k and x ⋆ ψλ′ become correlated if kλ ≈ λ′.

SLIDE 13

Scale Transposition with Harmonics

[Figure: modulus |x ⋆ ψ_{j,θ}(u)|, phase φ(x ⋆ ψ_{j,θ}(u)) and harmonic phase kφ(x ⋆ ψ_{j,θ}(u)) for k = 2, across scales j, with Fourier supports in the (ω1, ω2) plane.]

  • Phase harmonics perform frequency transpositions across scales: multiplying the phase by k = 2 doubles the central frequency, so coefficients at scale 2^j become correlated with coefficients at the finer scale 2^{j−1}.

SLIDE 14

Linear Prediction Across Scales/Freq.

  • ReLU means and correlations, invariant to translations:

M(α, λ) = d^{−1} Σu ρ(x ⋆ ψα,λ(u))
C(α, λ, α′, λ′) = d^{−1} Σu ρ(x ⋆ ψα,λ(u)) ρ(x ⋆ ψα′,λ′(u))

  • Define a linear autoregressive model from low to high frequencies: ρ(x ⋆ ψα′,λ′) is linearly predicted from the lower-frequency ρ(x ⋆ ψα,λ). A code sketch of these moments follows.
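
A sketch of these translation-invariant moments in 1D, with Gabor-like multiphase wavelets; all parameters are illustrative:

```python
import numpy as np

rho = lambda a: np.maximum(a, 0)

def rectified(x, j, alpha, xi=3 * np.pi / 4, sigma=2.0):
    # rho(x * psi_{alpha,lambda}), psi_{alpha,lambda} = Real(e^{-i alpha} psi_lambda)
    t = np.arange(-32, 32, dtype=float)
    psi = np.exp(-(t / (sigma * 2 ** j)) ** 2) * np.exp(1j * xi * t / 2 ** j) / 2 ** j
    return rho(np.convolve(x, np.real(np.exp(-1j * alpha) * psi), mode="same"))

x = np.random.randn(1024)
d = len(x)
U = {(a, j): rectified(x, j, a)
     for a in (0, np.pi / 2, np.pi, 3 * np.pi / 2) for j in range(4)}

M = {k: u.sum() / d for k, u in U.items()}       # means M(alpha, lambda)
C = {(k1, k2): (U[k1] * U[k2]).sum() / d          # correlations C(...)
     for k1 in U for k2 in U}
```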

SLIDE 15

Compressive Reconstructions

  • From m phase harmonic means Mx and covariances Cx (Gaspar Rochette, Sixin Zhang):

x̃ = argmin_y ‖Cx − Cy + (Mx − My)(Mx − My)*‖²

  • If x ⋆ ψλ is sparse then x is recovered from m ≪ d moments.
  • Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{−2}.
  • [Figure: PSNR (dB) versus log10(m/d).]
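
A sketch of reconstruction by gradient descent on such a moment-matching objective. For brevity the statistic below is a generic differentiable stand-in; plugging in the phase harmonic means and covariances of slide 14 would give the objective above:

```python
import torch

torch.manual_seed(0)

def moments(x):
    # toy differentiable statistic standing in for (Mx, Cx)
    u = torch.relu(torch.fft.fft(x).real)
    return torch.stack([u.mean(), (u * u).mean()])

x = torch.randn(256)
target = moments(x).detach()

y = torch.randn(256, requires_grad=True)   # x~ initialised at random
opt = torch.optim.Adam([y], lr=0.05)
for _ in range(2000):                      # descend on the moment mismatch
    opt.zero_grad()
    loss = ((moments(y) - target) ** 2).sum()
    loss.backward()
    opt.step()
```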

SLIDE 16

Compressive Reconstructions

  • Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{−1}.
  • [Figure: PSNR (dB) versus log10(m/d).]

SLIDE 17

Gaussian Models of Stationary Proc.

  • Gaussian model x̃ with the same power spectrum as x (d = 6·10^4), from d empirical moments:

d^{−1} Σu x(u) x(u − τ)

  • No correlation is captured across scales and frequencies: random phases.
  • How to capture non-Gaussianity and long range interactions?
  • Kolmogorov model: what stochastic models for turbulence?
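
A sketch of this Gaussian model: keep the empirical power spectrum, randomize the Fourier phases. Taking the real part at the end is a shortcut; an exact sampler would impose Hermitian symmetry on the phases.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128))       # stand-in for the data

spectrum = np.abs(np.fft.fft2(x))         # second-order moments only
phases = np.exp(2j * np.pi * rng.random(x.shape))   # random phases
x_tilde = np.real(np.fft.ifft2(spectrum * phases))  # Gaussian-model sample
```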

SLIDE 18

Models of Stationary Processes

  • Stationary processes conditioned by translation-invariant moments (Sixin Zhang).
  • If X is ergodic then the empirical moments converge as d → ∞:

d^{−1} Σu ρ(x ⋆ ψα,λ(u)) → E( ρ(X ⋆ ψα,λ) )
d^{−1} Σu ρ(x ⋆ ψα,λ(u)) ρ(x ⋆ ψα′,λ′(u)) → E( ρ(X ⋆ ψα,λ) ρ(X ⋆ ψα′,λ′) )

SLIDE 19

Ergodic Stationary Processes

  • Syntheses x̃ of dimension d = 6·10^4 conditioned by m = 3·10^3 moments (Sixin Zhang).
  • Same quality as with learned deep networks, with far fewer moments.
  • Phase coherence is captured.

SLIDE 20

Multifractal Models

  • Multifractal properties: E[|X ⋆ ψλ|^q] ∼ 2^{jζ(q)}
  • Probability distribution P(|x|).
  • Leverage correlation: L(τ) = E[|X(t + τ)|² X(t)].
  • Financial S&P 500 returns (Roberto Leonarduzzi): models conditioned without high-order moments reproduce high-order moments and time asymmetry. A scaling-exponent estimation sketch follows.
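
A sketch of estimating ζ(q) from the scaling of empirical wavelet-like moments; increments at dyadic lags stand in for X ⋆ ψj, and the toy input is Brownian-like, so the slopes should come out near q/2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(2 ** 18))   # toy Brownian-like process

qs, js = [1.0, 2.0, 3.0], np.arange(1, 9)
logm = np.array([[np.log2(np.mean(np.abs(x[2 ** j:] - x[:-2 ** j]) ** q))
                  for j in js] for q in qs])  # log2 E|X * psi_j|^q vs j
zeta = [np.polyfit(js, row, 1)[0] for row in logm]   # slopes ~ zeta(q)
```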

SLIDE 21

Learned Generative Networks

  • Variational autoencoder: trained on n examples {xi}i≤n.
  • Encoder Φ (a cascade L1, ρ, ..., Lj, ρ) computes Z = Φ(X) ≈ Gaussian white; decoder G (layers W1, W2, ..., Wj) computes X̃ = G(Z).
  • Network trained on bedroom images: interpolations Z = αZ1 + (1 − α)Z2 give a linearization of deformations.
  • Encoder Lipschitz continuous to actions of deformations.
  • How to build such autoencoders?

SLIDE 22

Averaged Rectified Wavelets

  • Scale separation and spatial averaging with φJ, at a large scale 2^J:

SJx = Ux ⋆ φJ = ( x ⋆ φ_{2^J}(2^J n) , ρ(x ⋆ ψα,λ) ⋆ φJ(2^J n) )α,λ

  • Gaussianization.
  • Linearizes small deformations:

Theorem: if Dτx(u) = x(u − τ(u)) then lim_{J→∞} ‖SJ Dτx − SJx‖ ≤ C ‖∇τ‖∞ ‖x‖

SLIDE 23

Multiscale Autoencoder

  • Encoder: x ↦ Ux ⋆ φJ, then a linear spatial decorrelation and dimension reduction (Id − Pr) I, from d = 10^4 down to d′ = 10^2, giving Z ≈ white noise.
  • Innovations: prediction errors are decorrelated across scales, ≈ Gaussian.
  • Generator: Z ↦ pseudo-inverse (Id − Pr)^{−1} ↦ Ix + ε ↦ sparse deconvolution by a convolutional network with a non-linear dictionary model ↦ Ux ⋆ φJ + ε′ ↦ linear inverse U^{−1} (deconvolution) ↦ x̃ (Tomas Anglès).

SLIDE 24

Progressive Sparse Deconvolution

  • Progressive sparse deconvolution of x ⋆ φj for j decreasing: a CNN maps Ux ⋆ φ_{j+1} + ε to Ux ⋆ φj + ε′.
  • Learns a dictionary Dj in which Ux ⋆ φj is sparse: the CNN computes a sparse code α so that Ux ⋆ φj + ε′ = Dj α, by minimising the average error ‖ε′‖² over a data basis.
  • The CNN is learned jointly with Dj (Tomas Anglès).
  • What sparse code is computed by the CNN? Could it be an l1 sparse code?

SLIDE 25

Training Reconstruction

  • Training examples xi and reconstructions G(SJ(xi)), with 2^J = 16 (Tomas Anglès).
  • [Figure: training reconstructions on two data bases: polygons and celebrities.]

SLIDE 26

Testing Reconstruction

  • Test examples xt and reconstructions G(SJ(xt)) (Tomas Anglès).

SLIDE 27

Generative Interpolations

  • Decoded interpolations G(Z) with Z = αZ1 + (1 − α)Z2, between codes Z1 and Z2 (Tomas Anglès).
  • [Figure: interpolation paths on polygons and celebrities.]

SLIDE 28

Random Sampling

  • Images synthesised from a Gaussian white noise (Tomas Anglès).

SLIDE 29

Classification by Dictionary Learning

  • 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
  • Pipeline: phase harmonics Ux, averaging Ux ⋆ φ_{2^J} (spatial pooling at scale 2^J), then a linear (logistic) classifier W over the 10^3 classes.
  • Top 5 error: Alex-Net 20%, Wavelets 70%.
  • Credit: Louis Thiry, John Zarka.

SLIDE 30

Classification by Dictionary Learning

  • Same pipeline, with a sparse dictionary expansion inserted: phase harmonics and averaging Ux ⋆ φ_{2^J} (spatial pooling), an l1 sparse code α computed by a CNN in a dictionary D, a linear dimension reduction L (invariant, sparse, multiscale features), then a logistic classifier W over the 10^3 classes.
  • Credit: Louis Thiry, John Zarka.

SLIDE 31

l1 Sparse Coding: LISTA

  • l1 sparse coefficients in a convolutional dictionary D:

α̃ = argmin_z ‖x − Dz‖² + γ ‖z‖1

  • LISTA (Gregor & LeCun): a deep neural network implemented with D and We, a CNN with soft-threshold non-linearity iterating

αk+1 = soft-thresh( (Id + WeD) αk − We x ) = soft-thresh( αk + We(Dαk − x) )

(choosing We = −η Dᵀ with a small step η recovers the ISTA iteration).
  • Can be used to learn the dictionary D (Giryes et al.).
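
A minimal numeric sketch of these iterations with the untrained choice We = −η Dᵀ (plain ISTA); in LISTA, We, the thresholds and D would be learned by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)              # normalized dictionary atoms

alpha_true = np.zeros(256)
alpha_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
x = D @ alpha_true                           # sparse synthesis x = D alpha

soft = lambda a, t: np.sign(a) * np.maximum(np.abs(a) - t, 0)
eta = 1.0 / np.linalg.norm(D, 2) ** 2        # step size 1 / ||D||^2
We = -eta * D.T
gamma = 0.01                                 # l1 penalty weight

alpha = np.zeros(256)
for _ in range(200):                         # unrolled iterations
    alpha = soft(alpha + We @ (D @ alpha - x), eta * gamma)
```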

SLIDE 32

Classification by Dictionary Learning

  • Pipeline: phase harmonics and averaging Ux ⋆ φ_{2^J} (spatial pooling), l1 sparse coding α by LISTA in a dictionary D (sparse informative patterns across scales), linear dimension reduction L, logistic classifier W; end-to-end optimisation of L, D, W.
  • The resulting CNN architecture: convolutions, ReLUs and soft-thresholdings.
  • Top 5 error: Alex-Net 20%, Wavelets 70%, Wavelets + Sparse 30%.
  • Credit: Louis Thiry, John Zarka.

SLIDE 33

Multiscale Approximations

  • A ReLU on multiscale wavelet filters can produce scale interactions: it creates phase harmonics.
  • Autoregressive models over multiscale phase harmonics approximate sparse signals and large classes of non-Gaussian and long range interaction processes.
  • Non-linear models based on sparse dictionaries may reproduce some CNN results for generation and classification.
  • Still need functional analysis models and approximation theorems with decay rates.