SLIDE 1
Multiscale Sparse Models in Deep Convolutional Networks
Tomás Anglès, Roberto Leonarduzzi, Stéphane Mallat, Louis Thiry, John Zarka, Sixin Zhang
Collège de France, École Normale Supérieure, Flatiron Institute
www.di.ens.fr/data
SLIDE 2 Deep Convolutional Network
- Deep convolutional neural network to predict y = f(x), for x ∈ R^d:
  x → L_1 → ρ → ... → L_j → ρ → ... → linear → f̃(x)
- L_j: spatial convolutions and linear combinations of channels, organised along a scale axis
- ρ(a) = max(a, 0) (ReLU)
- Supervised learning of the L_j from n examples {x_i, f(x_i)}_{i≤n}
- Exceptional results for images, speech, language, bio-data, quantum chemistry regressions, ...
- How does it reduce dimensionality? Multiscale structure, sparsity and invariants map x ∈ R^d to a low dimension. (A minimal sketch of such a network follows.)
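A minimal sketch of such a network in Python, assuming 1D signals and random (untrained) filters; the layer widths and kernel sizes below are illustrative assumptions, not the slides' architecture:

import numpy as np

rng = np.random.default_rng(0)

def conv_relu_layer(x, w):
    """One layer L_j followed by rho: x is (channels_in, d), w is (c_out, c_in, k)."""
    c_out, c_in, _ = w.shape
    y = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            # spatial convolution; the sum over i is the linear combination of channels
            y[o] += np.convolve(x[i], w[o, i], mode="same")
    return np.maximum(y, 0.0)                    # rho(a) = max(a, 0)

x = rng.standard_normal((1, 64))                 # x in R^d with d = 64
w1 = rng.standard_normal((8, 1, 5)) * 0.2        # L_1
w2 = rng.standard_normal((16, 8, 5)) * 0.05      # L_2
h = conv_relu_layer(conv_relu_layer(x, w1), w2)  # cascade L_1, rho, L_2, rho
f_tilde = rng.standard_normal(16) @ h.mean(axis=1)   # final linear output ~ f(x)
print(f_tilde)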
SLIDE 3 Statistical Models from 1 Example
- Supervised network training (e.g., on ImageNet)
- For 1 realisation x of X (6×10^4 pixels), compute each layer
- Compute the correlation statistics of the network coefficients (2×10^5 correlations) [M. Bethge et al.]
- Synthesize x̃ having similar statistics
- What mathematical interpretation?
SLIDE 4 Learned Generative Networks
- Wasserstein autoencoder trained on n examples {x_i}_{i≤n}:
  encoder Z = Φ(X): X → L_1 → ρ → ... → L_j → ρ → Z
  decoder X̃ = G(Z): Z → W_1 → W_2 → ... → W_j → X̃, with Z Gaussian white
- Network trained on bedroom images: interpolations Z = αZ_1 + (1 − α)Z_2 between codes Z_1 and Z_2
- Network trained on faces of celebrities: samples G(Z)
- What mathematical interpretation? Linearization of deformations.
SLIDE 5
Image Classification: ImageNet 2012
1000 classes, 1.2 million labeled training images of 224 × 224 pixels
                Alex-Net   ResNet
  Top-5 error   20%        10%
SLIDE 6 Scale Separation and Interactions
- Interactions of d bodies represented by x(u): particles, pixels, ...
- Multiscale regrouping of the interactions of d bodies into interactions of O(log d) groups ⇒ wavelet transforms: scale separation and interactions across scales
- How to capture scale interactions? Critical harmonic analysis problems since the 1970's.
SLIDE 7 Overview
- Scale separation with wavelets and interactions through phase
- Linear scale interaction models:
  – Compressive signal approximations
  – Stochastic models of stationary processes
- Non-linear scale interaction models with sparse dictionaries:
  – Generative autoencoders
  – Classification of ImageNet
- All these roads lead to Convolutional Neural Networks…
SLIDE 8 Scale Separation with Wavelets
- Wavelets rotated and dilated: ψ_λ(u) = 2^{−2j} ψ(2^{−j} r_θ u), with λ = (2^j, θ); each ψ̂_{2^j,θ}(ω) covers a band of the frequency plane (ω_1, ω_2), with real and imaginary parts.
- Wavelet transform, invertible: Wx = ( x ⋆ φ_{2^J} , x ⋆ ψ_λ )_λ
- In the Fourier domain: (x ⋆ ψ_λ)^∧(ω) = x̂(ω) ψ̂_λ(ω)
- Zero mean and no correlations across scales:
  Σ_u x ⋆ ψ_λ(u) (x ⋆ ψ_{λ'}(u))^* = Σ_ω |x̂(ω)|² ψ̂_λ(ω) ψ̂_{λ'}(ω)^* ≈ 0 if λ ≠ λ'
- Problem: the dependence across scales is carried by the phases. (See the sketch below for the transform and the decorrelation.)
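A sketch of this scale separation, assuming 1D signals and Gaussian band-pass filters (psi_hat, xi0 and sigma are illustrative stand-ins for the rotated and dilated wavelets): x ⋆ ψ_λ is computed in the Fourier domain, and coefficients at distinct scales come out near-decorrelated:

import numpy as np

d = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
omega = np.fft.fftfreq(d) * 2 * np.pi

def psi_hat(j, xi0=3.0, sigma=0.6):
    """Analytic band-pass filter centred at 2^{-j} xi0 (assumption: Gaussian window)."""
    return np.exp(-((omega - 2.0**(-j) * xi0) ** 2) / (2 * (2.0**(-j) * sigma) ** 2))

# x * psi_j computed as ifft(fft(x) * psi_hat_j)
wx = {j: np.fft.ifft(np.fft.fft(x) * psi_hat(j)) for j in range(4)}

# sum_u (x * psi_j)(u) (x * psi_{j+1})(u)^* is small relative to the auto term
for j in range(3):
    auto = abs(np.vdot(wx[j], wx[j])) / d
    cross = abs(np.vdot(wx[j + 1], wx[j])) / d
    print(f"j={j}: auto {auto:.4f}, cross with j+1 {cross:.4f}")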
SLIDE 9
Wavelet Transform Filter Cascade
[Figure: cascade of wavelet filters across scales, from 2^0 to 2^J]
- How to capture multiscale similarities? ReLU & phase.
SLIDE 10 Rectified Wavelet Coefficients
- Multiphase real wavelets ψ_{α,λ} = Real(e^{−iα} ψ_λ), rectified with ρ(a) = max(a, 0):
  Ux = ( x ⋆ φ_{2^J} , ρ(x ⋆ ψ_{α,λ}) )_{α,λ}
- Linearly invertible: ρ(a) − ρ(−a) = a ⇒ x = U^{−1} Ux with U^{−1} linear
- The ReLU creates non-zero means and correlations across scales:
  Σ_u ρ(x ⋆ ψ_{α,λ}(u)) and Σ_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u)) : conv. net. coefficients
  (A numerical check of the invertibility follows.)
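A numerical check of the linear invertibility, assuming a stand-in complex array z for x ⋆ ψ_λ: the phases α and α + π give ρ(a) and ρ(−a), and ρ(a) − ρ(−a) = a recovers the linear coefficients:

import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(256) + 1j * rng.standard_normal(256)  # stand-in for x * psi_lambda
rho = lambda t: np.maximum(t, 0.0)

for alpha in (0.0, np.pi / 2):
    a = np.real(np.exp(-1j * alpha) * z)      # x * psi_{alpha,lambda}
    # the phase pair (alpha, alpha + pi) inverts the rectifier: rho(a) - rho(-a) = a
    assert np.allclose(rho(a) - rho(-a), a)
print("U is linearly invertible from phase pairs (alpha, alpha + pi)")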
SLIDE 11
Linear Rectifiers act on Phase
- Ux(u, α, λ) = ρ(x ⋆ Real(e^{iα} ψ_λ)(u)) = ρ(Real(e^{iα} x ⋆ ψ_λ(u)))
- With x ⋆ ψ_λ = |x ⋆ ψ_λ| e^{iφ(x⋆ψ_λ)} and homogeneity (ρ(βa) = β ρ(a) if β > 0):
  Ux(u, α, λ) = |x ⋆ ψ_λ(u)| ρ(cos(α + φ(x ⋆ ψ_λ(u))))
- Phase harmonics: for all z = |z| e^{iφ(z)} ∈ C, define [z]^k := |z| e^{ikφ(z)}
- A ReLU computes phase harmonics (as does any homogeneous non-linearity ρ), through a Fourier transform along the phase α:
  Theorem: with γ(α) = ρ(cos α),
  Ûx(u, k, λ) = γ̂(k) |x ⋆ ψ_λ(u)| e^{ik φ(x⋆ψ_λ(u))}
  (A numerical check follows.)
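A numerical check of the theorem, for a single stand-in coefficient z = x ⋆ ψ_λ(u): sampling Ux on a phase grid and taking an FFT along α recovers γ̂(k) |z| e^{ikφ(z)}, up to phase-grid discretisation error:

import numpy as np

rho = lambda t: np.maximum(t, 0.0)
z = 1.7 * np.exp(1j * 0.9)                      # stand-in value of x * psi_lambda(u)
N = 256
alpha = 2 * np.pi * np.arange(N) / N

Ux = np.abs(z) * rho(np.cos(alpha + np.angle(z)))    # Ux(u, alpha, lambda)
Ux_hat = np.fft.fft(Ux) / N                          # Fourier coefficients along alpha
gamma_hat = np.fft.fft(rho(np.cos(alpha))) / N       # gamma(alpha) = rho(cos alpha)

for k in range(4):                                   # phase harmonics k = 0, 1, 2, 3
    predicted = gamma_hat[k] * np.abs(z) * np.exp(1j * k * np.angle(z))
    print(f"k={k}: FFT {Ux_hat[k]:.4f}  theorem {predicted:.4f}")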
SLIDE 12
Phase Harmonics: Frequency Transpositions
- [x ⋆ ψ_λ]^k = |x ⋆ ψ_λ(u)| e^{ik φ(x⋆ψ_λ(u))}
- A phase harmonic performs a non-linear frequency dilation / transposition with no time dilation: for k = 1, 2, 3 the frequency support of (x ⋆ ψ_λ)^∧ moves from λ to 2λ, 3λ, towards that of (x ⋆ ψ_{λ'})^∧.
- x ⋆ ψ_λ and x ⋆ ψ_{λ'} are not correlated, but [x ⋆ ψ_λ]^k and x ⋆ ψ_{λ'} are correlated if kλ ≈ λ'.
SLIDE 13
Scale Transposition with Harmonics
[Figure: modulus |x ⋆ ψ_{j,θ}(u)|, phase φ(x ⋆ ψ_{j,θ}(u)) and harmonic phase k φ(x ⋆ ψ_{j,θ}(u)) for k = 2, with frequency supports shown in the (ω_1, ω_2) plane across scales j]
- The harmonic k = 2 transposes the frequency support across scales: phase harmonics become correlated with wavelet coefficients at neighbouring scales.
SLIDE 14 Linear Prediction Across Scales/Freq.
- ReLU means and correlations, invariant to translations:
  M(α, λ) = d^{−1} Σ_u ρ(x ⋆ ψ_{α,λ}(u))
  C(α, λ, α', λ') = d^{−1} Σ_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u))
- Define a linear autoregressive model from low to high frequencies: linear prediction of ρ(x ⋆ ψ_{α',λ'}) from the ρ(x ⋆ ψ_{α,λ}) at lower frequencies. (A sketch of the moments follows.)
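A sketch of these moments, assuming a stand-in random array for the rectified wavelet coefficients; the last lines fit the linear prediction of one channel from the others by least squares:

import numpy as np

rng = np.random.default_rng(0)
n_alpha, n_lambda, d = 4, 5, 512
coeffs = rng.standard_normal((n_alpha, n_lambda, d))   # stand-in for x * psi_{alpha,lambda}(u)
r = np.maximum(coeffs, 0.0)                            # rho(x * psi_{alpha,lambda}(u))

M = r.mean(axis=-1)                                    # M(alpha, lambda), shape (4, 5)
flat = r.reshape(n_alpha * n_lambda, d)
C = flat @ flat.T / d                                  # C(alpha, lambda, alpha', lambda')

# linear autoregressive prediction of one channel from the others
target, predictors = flat[0], flat[1:]
w, *_ = np.linalg.lstsq(predictors.T, target, rcond=None)
rel_err = np.linalg.norm(target - predictors.T @ w) / np.linalg.norm(target)
print(M.shape, C.shape, f"relative prediction error {rel_err:.3f}")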
SLIDE 15 Compressive Reconstructions
- From m phase harmonic means Mx and covariances Cx:
  x̃ = argmin_y ‖ Cx − Cy + (Mx − My)(Mx − My)^* ‖²
- If x ⋆ ψ_λ is sparse then x is recovered from m ≪ d moments.
- Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{−2}
  [Figure: PSNR (dB) versus log10(m/d)]
  Gaspar Rochette, Sixin Zhang
  (A gradient-descent sketch of this reconstruction follows.)
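A hedged sketch of this reconstruction, assuming random stand-in filters and using PyTorch autodiff to descend the moment-matching loss from the slide; an illustration, not the authors' implementation:

import torch

def phase_moments(y, filters):
    """Means and correlations of rectified filter responses (filters given in Fourier)."""
    w = torch.fft.ifft(torch.fft.fft(y) * filters).real
    r = torch.relu(w)
    return r.mean(dim=-1), (r @ r.T) / y.shape[-1]

d, n_filt = 256, 8
torch.manual_seed(0)
filters = torch.randn(n_filt, d, dtype=torch.cfloat)   # stand-in for psi_hat_{alpha,lambda}
x = torch.randn(d)
Mx, Cx = phase_moments(x, filters)

y = torch.randn(d, requires_grad=True)                 # reconstruction variable
opt = torch.optim.Adam([y], lr=0.05)
for _ in range(300):
    My, Cy = phase_moments(y, filters)
    dM = Mx - My
    loss = ((Cx - Cy + torch.outer(dM, dM)) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final moment-matching loss: {loss.item():.3e}")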
SLIDE 16
Compressive Reconstructions
- Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{−1}
  [Figure: PSNR (dB) versus log10(m/d)]
SLIDE 17
Gaussian Models of Stationary Processes
- Gaussian model x̃ with the same power spectrum as x (d = 6×10^4), estimated from d empirical moments: d^{−1} Σ_u x(u) x(u − τ)
- No correlation is captured across scales and frequencies: random phases.
- Kolmogorov model: what stochastic models for turbulence? How to capture non-Gaussianity and long-range interactions?
SLIDE 18 Models of Stationary Processes
- Stationary processes conditioned by translation-invariant moments.
- If ergodic, the empirical moments converge as d → ∞:
  d^{−1} Σ_u ρ(x ⋆ ψ_{α,λ}(u)) → E( ρ(X ⋆ ψ_{α,λ}) )
  d^{−1} Σ_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u)) → E( ρ(X ⋆ ψ_{α,λ}) ρ(X ⋆ ψ_{α',λ'}) )
  Sixin Zhang
SLIDE 19 Ergodic Stationary Processes
- Syntheses x̃ conditioned on m = 3×10^3 moments of x (d = 6×10^4): same quality as with learned deep networks, with far fewer moments.
- Phase coherence is captured.
  Sixin Zhang
SLIDE 20 Multifractal Models
- Multifractal properties: E[ |X ⋆ ψ_j|^q ] ∼ 2^{jζ(q)}
- Probability distribution: P(|x|)
- Leverage correlation: L(τ) = E[ |X(t + τ)|² X(t) ]
- Financial S&P 500 returns: the model reproduces the high-order moments and the time asymmetry, which are lost without high-order moments. (An estimation sketch for ζ(q) follows.)
  Roberto Leonarduzzi
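A sketch of estimating ζ(q) from the scaling law above, assuming Haar-like increments as stand-ins for the wavelet transform and a random walk as toy data (for which ζ(q) ≈ q/2):

import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(2**16))      # toy signal with Brownian-like scaling

def wavelet_moment(x, j, q):
    """q-th moment of Haar-like increments at scale 2^j (a simplifying assumption)."""
    s = 2**j
    return np.mean(np.abs(x[s:] - x[:-s]) ** q)

# regress log2 of the empirical q-th moments against the scale index j
js = np.arange(1, 9)
for q in (1, 2, 4):
    logm = [np.log2(wavelet_moment(x, j, q)) for j in js]
    zeta_q = np.polyfit(js, logm, 1)[0]
    print(f"q={q}: zeta(q) ~ {zeta_q:.2f}")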
SLIDE 21 Learned Generative Networks
- Variational autoencoder trained on n examples {x_i}_{i≤n}:
  encoder Z = Φ(X): X → L_1 → ρ → ... → L_j → ρ → Z
  decoder X̃ = G(Z): Z → W_1 → W_2 → ... → W_j → X̃, with Z Gaussian white
- Network trained on bedroom images: interpolations Z = αZ_1 + (1 − α)Z_2 between codes Z_1 and Z_2
- Linearization of deformations: the encoder is Lipschitz continuous to the action of deformations
- How to build such autoencoders?
SLIDE 22 Averaged Rectified Wavelets
- Scale separation and spatial averaging at a large scale 2^J with φ_J:
  S_J x = Ux ⋆ φ_J = ( x ⋆ φ_{2^J}(2^J n) , ρ(x ⋆ ψ_{α,λ}) ⋆ φ_J(2^J n) )_{α,λ}
- Linearizes small deformations:
  Theorem: if D_τ x(u) = x(u − τ(u)) then lim_{J→∞} ‖S_J D_τ x − S_J x‖ ≤ C ‖∇τ‖_∞ ‖x‖
- Gaussianization. (A minimal sketch of S_J follows.)
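A minimal sketch of S_J, assuming the same 1D Gaussian band-pass filters as in the earlier sketch, a Gaussian low-pass φ_J, and four phases α; each rectified channel is averaged by φ_J and subsampled at 2^J:

import numpy as np

d, J = 1024, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
omega = np.fft.fftfreq(d) * 2 * np.pi

def band_pass(j, xi0=3.0, sigma=0.6):
    """Stand-in wavelet filter at scale 2^j (assumption: Gaussian window)."""
    return np.exp(-((omega - 2.0**(-j) * xi0) ** 2) / (2 * (2.0**(-j) * sigma) ** 2))

phi_J_hat = np.exp(-(omega ** 2) * (2.0**J) ** 2 / 2)   # low-pass phi_J at scale 2^J

def smooth(y):
    return np.real(np.fft.ifft(np.fft.fft(y) * phi_J_hat))

channels = [smooth(x)[::2**J]]                           # x * phi_{2^J}(2^J n)
for j in range(J):
    w = np.fft.ifft(np.fft.fft(x) * band_pass(j))
    for alpha in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2):
        r = np.maximum(np.real(np.exp(-1j * alpha) * w), 0.0)
        channels.append(smooth(r)[::2**J])               # rho(x * psi_{alpha,j}) * phi_J(2^J n)
SJx = np.stack(channels)
print(SJx.shape)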
SLIDE 23 Multiscale Autoencoder
- Encoder: x (d = 10^4) → Ux ⋆ φ_J → spatial decorrelation and dimension reduction Id − P_r → innovation Ix → linear map L → Z (d_0 = 10^2), with Z ≈ Gaussian white noise
- Innovations: prediction errors are decorrelated across scales
- Decoder: Z → L^{−1} → pseudo-inverse (Id − P_r)^{−1} of Ix + ε → sparse deconvolution of Ux ⋆ φ_J + ε' by a convolutional network (CNN) → U^{−1} → x̃
- Non-linear dictionary model
  Tomás Anglès
SLIDE 24 Progressive Sparse Deconvolution
- Progressive sparse deconvolution of x ⋆ φ_j for decreasing j: a CNN maps Ux ⋆ φ_{j+1} + ε to Ux ⋆ φ_j + ε'.
- Learns a dictionary D_j in which Ux ⋆ φ_j is sparse: the CNN computes a sparse code α such that Ux ⋆ φ_j + ε' = D_j α, by minimising the average error ‖ε'‖² over a database.
- The CNN is learned jointly with D_j. What sparse code is computed by the CNN? Could it be an ℓ1 sparse code? (A sketch of the dictionary fit follows.)
  Tomás Anglès
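A sketch of the dictionary fit, assuming the sparse codes α are given (random stand-ins here; the slides compute them with a CNN): one least-squares update of D_j minimising the mean error ‖ε'‖², in a row-vector convention targets ≈ codes @ D:

import numpy as np

rng = np.random.default_rng(0)
n, p, m = 200, 64, 128                         # examples, target dim, dictionary atoms
targets = rng.standard_normal((n, p))          # stand-ins for U x_i * phi_j
codes = rng.standard_normal((n, m)) * (rng.random((n, m)) < 0.1)   # sparse stand-in codes

D = np.linalg.lstsq(codes, targets, rcond=None)[0]   # one dictionary-update step
err = np.mean(np.sum((targets - codes @ D) ** 2, axis=1))
print(f"mean ||eps'||^2 = {err:.3f}")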
SLIDE 25
Training Reconstruction
- Training images x_i and their reconstructions G(S_J(x_i)), with 2^J = 16.
- Databases: polygons and faces of celebrities.
  Tomás Anglès
SLIDE 26
Testing Reconstruction
- Test images x_t and their reconstructions G(S_J(x_t)).
  Tomás Anglès
SLIDE 27
Generative Interpolations
- Decoded interpolations G(Z) with Z = αZ_1 + (1 − α)Z_2, between codes Z_1 and Z_2.
- Shown for celebrities and polygons.
  Tomás Anglès
SLIDE 28 Random Sampling
- Images synthesised from a Gaussian white noise
Tomás Anglès
SLIDE 29 Classification by Dictionary Learning
- 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
- Pipeline: x → phase harmonics U (10^3 channels) → spatial pooling: averaging Ux ⋆ φ_J at scale 2^J → linear classifier W (logistic) → class.

                Alex-Net   Wavelets
  Top-5 error   20%        70%

  Louis Thiry, John Zarka
SLIDE 30 Classification by Dictionary Learning
- 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
- Pipeline: x → phase harmonics U (10^3 channels) → spatial pooling: averaging Ux ⋆ φ_J at scale 2^J → linear dimension reduction L → sparse dictionary expansion α (CNN, D) → linear classifier W (logistic) → class.
- Invariants and a sparse multiscale representation: the expansion is an ℓ1 sparse coding in D.
  Louis Thiry, John Zarka
SLIDE 31 l1 Sparse Coding: LISTA
- ℓ1 sparse coefficients in a convolutional dictionary D:
  α̃ = argmin_z ‖x − Dz‖² + γ ‖z‖_1
- LISTA (Gregor & LeCun): a deep neural network implemented with D and W_e, a CNN with soft-threshold non-linearity:
  α_{k+1} = soft-thresh( α_k − W_e (D α_k − x) )
- Can be used to learn the dictionary D (Giryes et al.). (A minimal ISTA sketch follows.)
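A minimal ISTA sketch of this ℓ1 sparse coding; LISTA would replace the fixed matrix W_e below with a learned one and truncate to a few iterations:

import numpy as np

def soft_thresh(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def ista(x, D, gamma=0.1, n_iter=200):
    """Minimise ||x - D alpha||^2 + gamma ||alpha||_1 by proximal gradient descent."""
    L = np.linalg.norm(D, 2) ** 2               # spectral norm squared: step-size control
    We = D.T / L                                # LISTA learns this matrix instead
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_thresh(alpha - We @ (D @ alpha - x), gamma / (2 * L))
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((32, 64))               # stand-in (non-convolutional) dictionary
x = D @ (rng.standard_normal(64) * (rng.random(64) < 0.1))   # signal with a sparse code
alpha = ista(x, D)
print(f"non-zeros in the recovered code: {np.count_nonzero(np.abs(alpha) > 1e-6)}")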
SLIDE 32 Classification by Dictionary Learning
- 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
- Pipeline: x → phase harmonics U (10^3 channels) → spatial pooling: averaging Ux ⋆ φ_J at scale 2^J → linear dimension reduction L → sparse dictionary expansion α (LISTA, D) → linear classifier W (logistic) → class.
- CNN architecture: convolutions, ReLU and soft-thresholdings; D: sparse informative patterns across scales; end-to-end optimisation of L, D, W.

                Alex-Net   Wavelets   Wavelets + Sparse
  Top-5 error   20%        70%        30%

  Louis Thiry, John Zarka
SLIDE 33 Multiscale Approximations
- A ReLU on multiscale wavelet filters can produce scale interactions: it creates phase harmonics.
- Autoregressive models over multiscale phase harmonics approximate sparse signals and large classes of non-Gaussian, long-range interaction processes.
- Non-linear models based on sparse dictionaries may reproduce some CNN results for generation and classification.
- Still needed: functional analysis models and approximation theorems with decay rates.