Deep Neural Network Mathematical Mysteries for High Dimensional Learning
Stéphane Mallat, École Normale Supérieure
www.di.ens.fr/data

High Dimensional Learning
- High-dimensional x = (x(1), ..., x(d)) ∈ R^d
- Classification: estimate a class label f(x), given n sample values {x_i, y_i = f(x_i)}_{i≤n}
Image Classification
- d = 10^6
[Figure: example classes: Anchor, Joshua Tree, Beaver, Lotus, Water Lily]
- Huge variability inside classes ⇒ find invariants.
High Dimensional Learning
- High-dimensional x = (x(1), ..., x(d)) ∈ R^d
- Regression: approximate a function f(x), given n sample values {x_i, y_i = f(x_i) ∈ R}_{i≤n}
- Examples: astronomy, quantum chemistry, physics: the energy f(x) of a state vector x. Importance of symmetries.
Curse of Dimensionality
- f(x) can be approximated from examples {x_i, f(x_i)}_i by local interpolation if f is regular and there are close examples.
- Problem: in high dimension, ‖x − x_i‖ is always large.
- Need ε^(−d) points to cover [0, 1]^d at a Euclidean distance ε.
[Figure: a uniform grid of ε-spaced sample points covering [0, 1]^2, d = 2]
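A quick numerical illustration (my sketch, not from the slides): with a fixed budget of n random examples in [0, 1]^d, the distance from a query point x to its nearest example grows rapidly with d, so local interpolation breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed sampling budget

for d in [2, 10, 50, 100]:
    X = rng.random((n, d))   # n examples x_i, uniform in [0, 1]^d
    q = rng.random(d)        # a query point x
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:3d}  nearest example at distance {dists.min():.3f}")
```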
Multiscale Separation
- Variables x(u) indexed by a low-dimensional u: time/space, pixels in images, particles in physics, words in text...
- Multiscale interactions of d variables: from d^2 interactions to O(log₂ d) multiscale interactions.
- Multiscale analysis: wavelets on groups of symmetries; hierarchical architecture.
Overview
- One-hidden-layer networks, approximation theory and the curse of dimensionality
- Kernel learning
- Dimension reduction with change of variables
- Deep Neural networks and symmetry groups
- Wavelet Scattering transforms
- Applications and many open questions
Understanding Deep Convolutional Networks, arXiv 2016.
Learning as an Approximation
- To estimate f(x) from a sampling {x_i, y_i = f(x_i)}_{i≤M}, we must build an M-parameter approximation f_M of f.
- Precise sparse approximation requires some "regularity".
- For binary classification, f(x) = 1 if x ∈ Ω and −1 if x ∉ Ω: f(x) = sign(f̃(x)) where f̃ is potentially regular.
- What type of regularity? How to compute f_M?
1 Hidden Layer Neural Networks
- f_M(x) = Σ_{n=1}^{M} α_n ρ(w_n · x + b_n), with w_n · x = Σ_k w_{k,n} x_k
- One-hidden-layer neural network: the {w_{k,n}}_{k,n} and {α_n}_n are learned: a non-linear approximation.
- Theorem (Cybenko, Hornik, Stinchcombe, White): for "reasonable" bounded ρ(u), for all f ∈ L²[0, 1]^d and appropriate choices of w_{n,k} and α_n: lim_{M→∞} ‖f − f_M‖ = 0.
- No big deal: the curse of dimensionality is still there.
- The ρ(w_n · x + b_n) are ridge functions, constant in the directions orthogonal to w_n.
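A minimal sketch of the one-hidden-layer approximation f_M(x) = Σ_n α_n ρ(w_n·x + b_n). Taking random w_n, b_n and fitting only the α_n by least squares is my illustrative choice, not the construction behind the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda u: np.maximum(u, 0.0)   # rho(u) = max(u, 0)

d, M, n = 2, 200, 1000
X = rng.random((n, d))                           # samples x_i in [0, 1]^d
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])    # a toy target f(x_i)

W = rng.normal(size=(M, d))                      # ridge directions w_n
b = rng.normal(size=M)                           # offsets b_n
H = relu(X @ W.T + b)                            # hidden layer rho(w_n.x + b_n)
alpha, *_ = np.linalg.lstsq(H, y, rcond=None)    # output weights alpha_n

f_M = H @ alpha
print("training RMSE:", np.sqrt(np.mean((f_M - y) ** 2)))
```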
- For nearly all ρ: essentially the same approximation results.
- Fourier series: ρ(u) = e^{iu} gives f_M(x) = Σ_{n=1}^{M} α_n e^{i w_n · x}.
Piecewise Linear Approximation
[Figure: piecewise linear interpolation of f(x) at knots spaced by ε]
- Piecewise linear approximation with ρ(u) = max(u, 0): f̃(x) = Σ_n a_n ρ(x − nε)
- Need M = ε^(−1) points to cover [0, 1] at a distance ε.
- If f is Lipschitz, |f(x) − f(x′)| ≤ C |x − x′|, then |f(x) − f̃(x)| ≤ C ε ⇒ ‖f − f_M‖ ≤ C M^(−1).
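A sketch checking the M^(−1) rate in d = 1 with the ReLU basis ρ(x − nε); fitting the a_n by least squares on a fine grid is my choice of construction.

```python
import numpy as np

relu = lambda u: np.maximum(u, 0.0)
f = lambda x: np.abs(np.sin(4 * x))          # a Lipschitz function on [0, 1]

x = np.linspace(0.0, 1.0, 4000)
for M in [10, 20, 40, 80]:
    eps = 1.0 / M
    knots = np.arange(M) * eps
    Phi = relu(x[:, None] - knots[None, :])  # features rho(x - n*eps)
    a, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
    err = np.max(np.abs(Phi @ a - f(x)))
    print(f"M={M:3d}  sup error = {err:.4f}  (M * err = {M * err:.2f})")
```

M · err stays roughly constant: the error decays like M^(−1), as the Lipschitz bound predicts.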
Linear Ridge Approximation
- Piecewise linear ridge approximation for x ∈ [0, 1]^d, with ρ(u) = max(u, 0): f̃(x) = Σ_n a_n ρ(w_n · x − nε)
- If f is Lipschitz, |f(x) − f(x′)| ≤ C ‖x − x′‖, then sampling at a distance ε gives |f(x) − f̃(x)| ≤ C ε.
- Need M = ε^(−d) points to cover [0, 1]^d at a distance ε ⇒ ‖f − f_M‖ ≤ C M^(−1/d). Curse of dimensionality!
- What prior condition makes learning possible?
Approximation with Regularity
- Approximation of regular functions in C^s[0, 1]^d: ∀x, u, |f(x) − p_u(x)| ≤ C |x − u|^s with p_u(x) polynomial.
- |x − u| ≤ ε^(1/s) ⇒ |f(x) − p_u(x)| ≤ C ε. Need M = ε^(−d/s) points to cover [0, 1]^d at a distance ε^(1/s) ⇒ ‖f − f_M‖ ≤ C M^(−s/d).
- Cannot do better in C^s[0, 1]^d; not good because s ≪ d. Failure of classical approximation theory.
Kernel Learning
- Data: x ∈ R^d. Change of variable Φ(x) = {φ_k(x)}_{k≤d′} to nearly linearize f(x), which is approximated by a 1D projection: f̃(x) = ⟨Φ(x), w⟩ = Σ_k w_k φ_k(x).
- The metric ‖x − x′‖ becomes ‖Φ(x) − Φ(x′)‖, with Φ(x) ∈ R^{d′} and a linear classifier w.
- What "regularity" of f is needed?
- How and when is it possible to find such a Φ?
Increase Dimensionality
- Proposition: there exists a hyperplane separating any two subsets of N points {Φx_i}_i in dimension d′ > N + 1 if the {Φx_i}_i are not in an affine subspace of dimension < N.
⇒ Choose Φ to increase dimensionality!
- Example: Gaussian kernel ⟨Φ(x), Φ(x′)⟩ = exp(−‖x − x′‖² / (2σ²)): Φ(x) is of dimension d′ = ∞, overfitting.
- Problem: generalisation. If σ is small, this behaves like a nearest-neighbor classifier (see the sketch below).
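A toy sketch (my setup) of a Gaussian-kernel classifier. As σ shrinks, the kernel matrix tends to the identity and the prediction at x is dominated by the nearest training point: agreement with the 1-nearest-neighbor rule goes to 1, illustrating the overfitting regime.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.random((n, d))
y = np.sign(X[:, 0] - X[:, 1])                    # labels +/- 1
Xt = rng.random((500, d))                         # test points
yt = np.sign(Xt[:, 0] - Xt[:, 1])

def gram(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

nn = y[np.argmin(((Xt[:, None, :] - X[None, :, :]) ** 2).sum(-1), axis=1)]
for sigma in [1.0, 0.2, 0.05]:
    K = gram(X, X, sigma) + 1e-3 * np.eye(n)      # small ridge for stability
    w = np.linalg.solve(K, y)                     # kernel regression on labels
    pred = np.sign(gram(Xt, X, sigma) @ w)
    print(f"sigma={sigma:4.2f}  test acc={np.mean(pred == yt):.3f}  "
          f"agreement with 1-NN={np.mean(pred == nn):.3f}")
```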
Reduction of Dimensionality
- Discriminative change of variable Φ(x): Φ(x) ≠ Φ(x′) if f(x) ≠ f(x′) ⇒ ∃ f̃ with f(x) = f̃(Φ(x)).
- If f̃ is Lipschitz in z = Φ(x), |f̃(z) − f̃(z′)| ≤ C ‖z − z′‖, then |f(x) − f(x′)| ≤ C ‖Φ(x) − Φ(x′)‖.
- Discriminative: ‖Φ(x) − Φ(x′)‖ ≥ C₁ |f(x) − f(x′)|.
- For x ∈ Ω, if Φ(Ω) is bounded with a low dimension d′ ⇒ ‖f − f_M‖ ≤ C M^(−1/d′).
Deep Convolution Networks
[Figure: x → L₁ → ρ → L₂ → ρ → ... → Φ(x) → linear classification → y = f̃(x), alternating linear convolutions L_j with a non-linear scalar "neuron" ρ(u) = max(u, 0)]
- The revival of neural networks: Y. LeCun.
- Optimize the L_j under architecture constraints: over 10^9 parameters.
- Hierarchical invariants, linearization.
- Exceptional results for images, speech, language, bio-data... Why does it work so well? A difficult problem.
ImageNet Data Basis
- Database with 1 million images and 2000 classes.
- ImageNet supervised training: 1.2·10^6 examples, 10^3 classes.

AlexNet Deep Convolution Network
- A. Krizhevsky, I. Sutskever, G. Hinton, 2012: 15.3% testing error.
- Newer networks reach 5% errors, with up to 150 layers!
Image Classification, Scene Labeling, Car Driving

Why Understanding?
- Adversarial examples: x̃ = x + ε with ‖ε‖ < 10^(−2) ‖x‖; x is correctly classified, x̃ is classified as "ostrich".
- Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus.
- Trial-and-error testing cannot guarantee reliability.
Deep Convolutional Networks
[Figure: x(u) → ρL₁ → x₁(u, k₁) → ρL₂ → x₂(u, k₂) → ... → ρL_J → x_J(u, k_J) → classification]
- x_j = ρ L_j x_{j−1}, where L_j is a linear combination of convolutions and subsampling, summed across channels: x_j(u, k_j) = ρ( Σ_k x_{j−1}(·, k) ∗ h_{k_j,k}(u) )
- ρ is contractive: |ρ(u) − ρ(u′)| ≤ |u − u′|, e.g. ρ(u) = max(u, 0) or ρ(u) = |u|.
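A sketch of one such layer in plain numpy: convolutions summed across channels, ReLU, and subsampling. The random filters h are placeholders for learned ones.

```python
import numpy as np

def conv_layer(x, h, stride=2):
    """x: (K_in, N) input channels; h: (K_out, K_in, w) filters.
    Computes rho(sum_k x[k] * h[kj, k]) and subsamples by `stride`."""
    K_out, K_in, w = h.shape
    out = np.zeros((K_out, x.shape[1]))
    for kj in range(K_out):
        for k in range(K_in):                     # sum across channels
            out[kj] += np.convolve(x[k], h[kj, k], mode="same")
    return np.maximum(out, 0.0)[:, ::stride]      # rho = max(u, 0), subsample

rng = np.random.default_rng(3)
x0 = rng.normal(size=(1, 64))                     # input x(u), one channel
x1 = conv_layer(x0, rng.normal(size=(8, 1, 5)))   # x_1(u, k_1)
x2 = conv_layer(x1, rng.normal(size=(16, 8, 5)))  # x_2(u, k_2)
print(x1.shape, x2.shape)                         # (8, 32) (16, 16)
```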
Linearisation in Deep Networks
- Trained on a database of faces: linearization.
- On a database including bedrooms: interpolations.
- A. Radford, L. Metz, S. Chintala
Many Questions
- Why convolutions? Translation covariance.
- Why no overfitting? Contractions, dimension reduction.
- Why a hierarchical cascade?
- Why introduce non-linearities?
- How and what to linearise?
- What are the roles of the multiple channels in each layer?
Linear Dimension Reduction
- Level sets of f(x): Ω_t = {x : f(x) = t} (classes Ω₁, Ω₂, Ω₃).
- If the level sets (classes) are parallel to a linear space, then variables are eliminated by linear projections: invariants.
Linearise for Dimensionality Reduction
- Level sets of f(x): Ω_t = {x : f(x) = t} (classes Ω₁, Ω₂, Ω₃).
- If the level sets Ω_t are not parallel to a linear space: linearise them with a change of variable Φ(x), then reduce dimension with linear projections.
- Difficult because the Ω_t are high-dimensional, irregular, and known only on few samples.
Level Set Geometry: Symmetries
- A symmetry is an operator g which globally preserves level sets: ∀x, f(g.x) = f(x).
- If g₁ and g₂ are symmetries then g₁.g₂ is also a symmetry: f(g₁.g₂.x) = f(g₂.x) = f(x).
- Curse of dimensionality ⇒ the geometry is not local but global, characterised by the global symmetries.
Groups of Symmetries
- G = { all symmetries } is a group (unknown): closure: ∀(g, g′) ∈ G², g.g′ ∈ G; inverse: ∀g ∈ G, g⁻¹ ∈ G; associative: (g.g′).g″ = g.(g′.g″). If commutative, g.g′ = g′.g: Abelian group.
- Group of dimension n if it has n generators: g = g₁^{p₁} g₂^{p₂} ... g_n^{p_n}
- Lie group: infinitesimally small generators (Lie algebra).
Translation and Deformations
[Video of Philipp Scott Johnson]
- Digit classification (classes Ω₃, Ω₅):
- Globally invariant to the translation group: a small group.
- Locally invariant to small diffeomorphisms x′(u) = x(u − τ(u)): a huge group.
Frequency Transpositions
[Figure: spectrograms (t, log(ω)) of the word "encyclopaedias" at two pitches]
- Frequency transposition invariance is needed for speech recognition, not for speaker recognition.
- Time and frequency translations and deformations: H, the Heisenberg group of "time-frequency" translations.
[Figure: spectrogram (t, log(ω)) of a frequency-transposed sound]

Rotation and Scaling Variability
- Rotation and deformations: SO(2) × Diff(SO(2)) group.
- Scaling and deformations: R × Diff(R) group.
Linearize Symmetries
- Linearise symmetries with a change of variable Φ(x): Φ must linearize the orbits {g.x}_{g∈G}.
[Figure: orbits x, g₁.x, ..., g₁^p.x and x′, g₁.x′, ..., g₁^p.x′ mapped to straightened trajectories Φ(x), Φ(g₁^p.x), Φ(x′), Φ(g₁^p.x′)]
- Lipschitz: ∀x, g: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖g‖.
Translation and Deformations
[Video of Philipp Scott Johnson]
- Digit classification: globally invariant to the translation group, locally invariant to small diffeomorphisms.
- Linearize small diffeomorphisms ⇒ Lipschitz regular.
Translations and Deformations
- Invariance to translations: g.x(u) = x(u − c) ⇒ Φ(g.x) = Φ(x).
- Small diffeomorphisms g.x(u) = x(u − τ(u)), with metric ‖g‖ = ‖∇τ‖_∞ (maximum scaling); linearisation by Lipschitz continuity: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖∇τ‖_∞.
- Discriminative change of variable: ‖Φ(x) − Φ(x′)‖ ≥ C₁ |f(x) − f(x′)|.
- Fourier transform x̂(ω) = ∫ x(t) e^{−iωt} dt: the modulus Φ(x) = |x̂| = |x̂_c| is invariant to translations. Is it stable: ‖|x̂| − |x̂_τ|‖ ≤ C ‖∇τ‖_∞ ‖x‖?
Fourier Deformation Instability
- Translation: x_c(t) = x(t − c) ⇒ x̂_c(ω) = e^{−icω} x̂(ω).
- Instability to small deformations x_τ(t) = x(t − τ(t)), e.g. τ(t) = ε t: ||x̂_τ(ω)| − |x̂(ω)|| is big at high frequencies ω.
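A numerical check (my toy example): for a windowed high-frequency signal, a translation leaves |x̂| unchanged, while a small dilation τ(t) = εt shifts the spectral energy by εω, which at high ω exceeds the spectral width, so |x̂| changes by an amount of order ‖x‖ even though ‖∇τ‖_∞ = ε is small.

```python
import numpy as np

N, xi, eps = 2048, 400.0, 0.02        # xi: carrier frequency, eps = grad(tau)
t = np.arange(N) / N

def xfun(s):                          # Gaussian window times high-freq cosine
    return np.exp(-((s - 0.5) ** 2) / 0.02) * np.cos(2 * np.pi * xi * s)

x = xfun(t)
x_shift = np.roll(x, 20)              # translation: |FFT| exactly unchanged
x_dil = xfun((1 - eps) * t)           # dilation x(t - eps*t)

mod = lambda s: np.abs(np.fft.fft(s))
rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(b)
print("translation:", rel(mod(x_shift), mod(x)))     # ~ machine precision
print("small dilation:", rel(mod(x_dil), mod(x)))    # order 1
```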
Deep Neural Network Mathematical Mysteries for High Dimensional Learning
Stéphane Mallat École Normale Supérieure
www.di.ens.fr/data
Deep Convolutional Trees
[Figure: x(u) → ρL₁ → x₁(u, k₁) → ... → ρL_J → x_J(u, k_J) → classification]
- x_j = ρ L_j x_{j−1}, where L_j is composed of convolutions and subsamplings: x_j(u, k_j) = ρ( x_{j−1}(·, k) ∗ h_{k_j,k}(u) )
- No channel communication: how far can we go? Why a hierarchical cascade?
Translations and Deformations
- Invariance to translations: g.x(u) = x(u − c) ⇒ Φ(g.x) = Φ(x).
- Small diffeomorphisms g.x(u) = x(u − τ(u)), with metric ‖g‖ = ‖∇τ‖_∞; linearisation by Lipschitz continuity: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖∇τ‖_∞.
- Discriminative change of variable: ‖Φ(x) − Φ(x′)‖ ≥ C₁ |f(x) − f(x′)|.
Overview Part II
- Wavelet Scattering transform along translations
- Generation of textures and random processes
- Channel connections for more general groups
- Image and audio classification with small training sets
- Quantum chemistry
- Open problems
Understanding Deep Convolutional Networks, arXiv 2016.
Multiscale Wavelet Transform
- Dilated wavelets: ψ_λ(t) = 2^{−j/Q} ψ(2^{−j/Q} t) with λ = 2^{−j/Q}: Q-constant band-pass filters ψ̂_λ.
[Figure: frequency supports |ψ̂_λ(ω)|² of the band-pass filters and |φ̂(ω)|² of the low-pass average]
- Wavelet transform: Wx = ( x ∗ φ_{2^J}(t), x ∗ ψ_λ(t) )_{λ≤2^J}, where φ_{2^J} is an average and the ψ_λ carry the higher frequencies.
- x ∗ ψ_λ(t) = ∫ x(u) ψ_λ(t − u) du ⇒ (x ∗ ψ_λ)^(ω) = x̂(ω) ψ̂_λ(ω).
- Preserves the norm: ‖Wx‖² = ‖x‖².
- Wavelets are uniformly stable to deformations: if ψ_{λ,τ}(t) = ψ_λ(t − τ(t)) then ‖ψ_λ − ψ_{λ,τ}‖ ≤ C sup_t |∇τ(t)|.
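A sketch of a dyadic complex filter bank built directly in the Fourier domain. The Gaussian band-pass profiles are my simplified stand-ins for proper Morlet wavelets.

```python
import numpy as np

def filter_bank(N, J=6, Q=1):
    """Band-pass filters psi_hat at frequencies lam = lam0 * 2^(-j/Q),
    plus a low-pass phi_hat at scale 2^J (Gaussian profiles as stand-ins)."""
    omega = np.fft.fftfreq(N)                # frequencies in cycles/sample
    psis, lam0 = [], 0.25
    for j in range(J * Q):
        lam = lam0 * 2 ** (-j / Q)           # center frequency
        bw = lam / (2 * Q)                   # constant-Q bandwidth
        psis.append(np.exp(-((omega - lam) ** 2) / (2 * bw ** 2)))
    phi = np.exp(-(omega ** 2) / (2 * (lam0 * 2 ** (-J)) ** 2))
    return psis, phi

def wavelet_transform(x, psis, phi):
    xhat = np.fft.fft(x)
    coeffs = [np.fft.ifft(xhat * p) for p in psis]   # complex x * psi_lam
    avg = np.fft.ifft(xhat * phi).real               # average x * phi_2J
    return coeffs, avg

x = np.random.default_rng(4).normal(size=1024)
coeffs, avg = wavelet_transform(x, *filter_bank(1024))
print(len(coeffs), coeffs[0].shape, avg.shape)
```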
Why Wavelets?
- Wavelets separate multiscale information.
- Wavelets provide sparse representations.

Singular Functions
[Figure: a singular signal x(t) and its envelope |x ∗ ψ_{λ1}(t)| = |∫ x(u) ψ_{λ1}(t − u) du|, with a wavelet ψ_{λ1} of support 1/λ1]
Scattering Transform
[Figure: |x ∗ ψ_{λ1}(t)| across time t and frequencies λ1]

Time-Frequency Fibers
[Figure: wavelet transform modulus |W| of x(t) in the (time, log ω) plane]

- First wavelet transform: W₁x = ( x ∗ φ_{2^J}, x ∗ ψ_{λ1} )_{λ1}.
- Modulus improves invariance: |W₁|x = ( x ∗ φ_{2^J}, |x ∗ ψ_{λ1}| )_{λ1}.
Wavelet Translation Invariance
- Complex wavelet: x ∗ ψ_{λ1}(t) = x ∗ ψ^a_{λ1}(t) + i x ∗ ψ^b_{λ1}(t), with modulus |x ∗ ψ_{λ1}(t)| = ( |x ∗ ψ^a_{λ1}(t)|² + |x ∗ ψ^b_{λ1}(t)|² )^{1/2}.
- Averaging |x ∗ ψ_{λ1}| ∗ φ_{2^J}(t) gives local translation invariance at scale 2^J, and full translation invariance when 2^J = ∞.
- Second wavelet transform modulus: |W₂| |x ∗ ψ_{λ1}| = ( |x ∗ ψ_{λ1}| ∗ φ_{2^J}(t), ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}(t)| )_{λ2}.
[Figure: a second wavelet ψ_{λ2} applied to the envelope |x ∗ ψ_{λ1}(t)|]
Amplitude Modulation
- Harmonic sound: x(t) = a(t) (e ∗ h)(t) with a varying amplitude a(t).
[Figure: first-order windowed scattering at small and large scales (λ1 = log(ω1)) and second-order windowed scattering (λ2 = log(ω2)) over a 512 ms window, showing |x ∗ ψ_{λ1}|(t), |x ∗ ψ_{λ1}| ∗ φ(t) and ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ(t) for λ1 = log(1977 Hz); bands at 18 Hz and 1977 Hz]
Scattering Convolution Network
[Figure: cascade x(t) → |W₁| → |x ∗ ψ_{λ1}(t)| → |W₂| → ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}(t)| → |W₃| → ..., with no vertical connection]
- Outputs are time averages over windows of 2^J ms: |x ∗ ψ_{λ1}| ∗ φ_{2^J}(t) (1D) and ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ_{2^J}(t) (3D).
- With Q₁ = 16, Q₂ = 1: the first order is a Mel frequency spectrum, the second order a modulation spectrum.
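A self-contained sketch of this two-layer cascade, with the same simplified Gaussian filters as above, computing x ∗ φ, |x ∗ ψ_{λ1}| ∗ φ and ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ along the paths λ2 < λ1 of the tree.

```python
import numpy as np

N, J = 1024, 5
omega = np.fft.fftfreq(N)
lams = [0.25 * 2 ** (-j) for j in range(J)]
psi_hat = {l: np.exp(-((omega - l) ** 2) / (2 * (l / 2) ** 2)) for l in lams}
phi_hat = np.exp(-(omega ** 2) / (2 * (0.25 * 2 ** (-J)) ** 2))

conv = lambda s, fhat: np.fft.ifft(np.fft.fft(s) * fhat)
low = lambda s: conv(s, phi_hat).real                  # average * phi_2J

x = np.random.default_rng(5).normal(size=N)
S0 = low(x)                                            # x * phi
S1 = {l1: low(np.abs(conv(x, psi_hat[l1]))) for l1 in lams}
S2 = {}
for l1 in lams:
    env = np.abs(conv(x, psi_hat[l1]))                 # |x * psi_l1|
    for l2 in [l for l in lams if l < l1]:             # lower frequencies only
        S2[(l1, l2)] = low(np.abs(conv(env, psi_hat[l2])))
print(len(S1), "first-order and", len(S2), "second-order paths")
```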
Scale Separation with Wavelets
- Wavelet filter ψ(u), rotated and dilated: ψ_{2^j,θ}(u) = 2^{−j} ψ(2^{−j} r_θ u), with real and imaginary parts and frequency supports |ψ̂_λ(ω)|² in the (ω1, ω2) plane.
- Wavelet transform: Wx = ( x ∗ φ_{2^J}(u), x ∗ ψ_{2^j,θ}(u) )_{j≤J,θ}, where φ_{2^J} is an average and the ψ_{2^j,θ} carry the higher frequencies; x ∗ ψ_{2^j,θ}(u) = ∫ x(v) ψ_{2^j,θ}(u − v) dv.
- Preserves the norm: ‖Wx‖² = ‖x‖².
Averaging Pyramid
- Multiscale averaging by a cascade of pair averagings: Hx(u) = (x(2u) + x(2u + 1)) / 2
[Figure: x(u), H²x, H³x, H⁴x]

Haar Filtering
- Hx(u) = x ∗ h(2u) = (x(2u) + x(2u + 1)) / √2 and Gx(u) = x ∗ g(2u) = (x(2u) − x(2u + 1)) / √2, for u ≤ d/2, where h is a low-frequency and g a high-frequency filter.
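A sketch of this Haar cascade (orthonormal normalisation by √2, so the norm is preserved).

```python
import numpy as np

def haar_step(x):
    """One filtering + subsampling step: low-pass H and high-pass G."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_transform(x, J):
    """Cascade H...H, keeping the G details at each depth j <= J."""
    details = []
    for _ in range(J):
        x, g = haar_step(x)
        details.append(g)
    return x, details          # averages at scale 2^J, wavelet coefficients

x = np.arange(16, dtype=float)
approx, details = haar_transform(x, 3)
norm2 = np.sum(approx ** 2) + sum(np.sum(d ** 2) for d in details)
print([d.shape for d in details], "norm preserved:",
      np.isclose(norm2, np.sum(x ** 2)))
```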
Haar Wavelet Transform
[Figure: cascade of H and G filters of depth j ≤ J, computing the averages x ∗ φ_J(2^J k) with the scaling function φ_J(u) of support 2^J, and the wavelet coefficients x ∗ ψ_j(2^j k) with wavelets ψ_j(u) of support 2^j]
Fast Wavelet Filter Bank
[Figure: |W₁| computed by a cascade of a low-pass filter H and band-pass filters G₁, ..., G₄, producing |x ∗ ψ_{2^j,θ}| across scales 2⁰, 2¹, 2², ..., 2^J]
Wavelet Filter Bank
- ρ(α) = |α| gives a sparse representation |x ∗ ψ_{2^j,θ}| of x(u).
- If u ≥ 0 then ρ(u) = u: ρ has no effect after an averaging.
Wavelet Convolution Network Tree
[Figure: scattering tree across scales 2⁰, 2¹, 2², ..., 2^J: x → |W₁| → |x ∗ ψ_{λ1}|, |x ∗ ψ_{λ′1}| → |W₂| → ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| → |W₃| → |||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ ψ_{λ3}| → |W₄| → ..., with averaged outputs x ∗ φ_J, |x ∗ ψ_{λ1}| ∗ φ_J, |x ∗ ψ_{λ′1}| ∗ φ_J, ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ_J, ...]
- S₄x = |L₄| |L₃| |L₂| |L₁| x.
Contraction
- ρ(u) = |u|. ρ has no effect after an averaging.
- Wx = ( x ∗ φ(t), x ∗ ψ_λ(t) )_{t,λ} is linear and ‖Wx‖ = ‖x‖.
- |W|x = ( x ∗ φ(t), |x ∗ ψ_λ(t)| )_{t,λ} is non-linear, but:
- it preserves the norm: ‖|W|x‖ = ‖x‖;
- it is contractive: ‖|W|x − |W|y‖ ≤ ‖x − y‖, because ||a| − |b|| ≤ |a − b| for (a, b) ∈ C².
Scattering Properties
- S_J x = ( x ∗ φ_{2^J}, |x ∗ ψ_{λ1}| ∗ φ_{2^J}, ||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ_{2^J}, |||x ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ ψ_{λ3}| ∗ φ_{2^J}, ... )_{λ1,λ2,λ3,...} = ... |W₃| |W₂| |W₁| x
- Theorem: for appropriate wavelets, a scattering is contractive, ‖S_J x − S_J y‖ ≤ ‖x − y‖ (L² stability); norm preserving, ‖S_J x‖ = ‖x‖; and translation invariant and deformation stable: if D_τ x(u) = x(u − τ(u)) then lim_{J→∞} ‖S_J D_τ x − S_J x‖ ≤ C ‖∇τ‖_∞ ‖x‖.
- Key ingredients: ‖W_k x‖ = ‖x‖ ⇒ ‖|W_k x| − |W_k x′|‖ ≤ ‖x − x′‖, and the commutator Lemma: ‖[W_k, D_τ]‖ = ‖W_k D_τ − D_τ W_k‖ ≤ C ‖∇τ‖_∞.
Digit Classification: MNIST
- MNIST database (LeCun et al.); classification errors by Joan Bruna.
- Pipeline: x → S_J x → supervised linear classifier → y = f(x).
- No learning in S_J: invariant to translations and to specific deformations, linearises small deformations, separates different patterns.
- Classification errors:
  Training size | Conv. Net. | Scattering
  50000 | 0.4% | 0.4%
Classification of Stationary Textures
[Figure: 2D turbulence textures, classes Ω₁, Ω₂]
- What stochastic models? Non-Gaussian, with long-range dependence.
- Can we "Gaussianize" (linearize) such distributions in a reduced-dimensional space?
- J. Bruna
Classification of Textures
- CUReT database; pipeline: x → S_J x → supervised linear classifier → y = f(x).
- Classification errors:
  Training per class | Fourier Spectr. | Scattering
  46 | 1% | 0.2%
Scattering Moments of Processes (J. Bruna)
- The scattering transform of a stationary process X(t) is a stationary vector:
  S_J X = ( X ∗ φ_{2^J}(t), |X ∗ ψ_{λ1}| ∗ φ_{2^J}(t), ||X ∗ ψ_{λ1}| ∗ ψ_{λ2}| ∗ φ_{2^J}(t), ... )_{λ1,λ2,λ3,...}
- Scattering moments: E(SX) = ( E(X), E(|X ∗ ψ_{λ1}|), E(||X ∗ ψ_{λ1}| ∗ ψ_{λ2}|), ... )_{λ1,λ2,λ3,...}
- Central limit theorem: as J → ∞, under "weak" ergodicity conditions, S_J X has a Gaussian distribution N(E(SX), Σ_J) with Σ_J → 0.
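A sketch estimating first-order scattering moments E(|X ∗ ψ_{λ1}|) by time averaging one long realisation (same simplified Gaussian filters as before). A Gaussian white noise and a variance-matched sparse spike process have the same power spectrum, yet their first-order scattering moments differ, which second-order moments alone cannot detect.

```python
import numpy as np

N = 2 ** 16
omega = np.fft.fftfreq(N)
lams = [0.25 * 2 ** (-j) for j in range(6)]
psi = {l: np.exp(-((omega - l) ** 2) / (2 * (l / 2) ** 2)) for l in lams}
conv = lambda s, fh: np.fft.ifft(np.fft.fft(s) * fh)

rng = np.random.default_rng(6)
gauss = rng.normal(size=N)                       # Gaussian white noise
p = 0.01                                         # sparse spikes, same variance
spikes = rng.binomial(1, p, N) * rng.normal(size=N) / np.sqrt(p)

for name, X in [("gaussian", gauss), ("spikes  ", spikes)]:
    m1 = [np.mean(np.abs(conv(X, psi[l]))) for l in lams]   # E|X * psi_l1|
    print(name, np.round(m1, 3))
```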
Representation of Audio Textures (Joan Bruna)
- Reconstruction: compute X̃ which minimises ‖S_J X̃ − S_J X‖² by gradient descent (a sketch follows below).
- Audio examples: original, paper, cocktail party.
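A toy sketch of this reconstruction idea using PyTorch autograd (assumed available). Here Φ is a fixed random |convolution| + time-average, a crude stand-in for S_J, and x̃ is optimised by gradient descent to match Φ(x).

```python
import torch

torch.manual_seed(0)
N, K, w = 1024, 32, 65
filters = torch.randn(K, 1, w)            # fixed random filter bank

def phi(x):
    """Stand-in for S_J: modulus of convolutions, then time average."""
    z = torch.nn.functional.conv1d(x.view(1, 1, N), filters, padding=w // 2)
    return torch.abs(z).mean(dim=2).flatten()

x = torch.randn(N)                        # "texture" whose statistics we match
target = phi(x).detach()

x_tilde = torch.randn(N, requires_grad=True)      # random initialisation
opt = torch.optim.Adam([x_tilde], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = ((phi(x_tilde) - target) ** 2).sum()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```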
Ergodic Texture Reconstructions (Joan Bruna)
[Figure: spectrograms (t, ω) of an applause texture: original, Gaussian model in time, Gaussian model in scattering]
- 2D turbulence: E(|x ∗ ψ_{λ1}|), E(||x ∗ ψ_{λ1}| ∗ ψ_{λ2}|).
- Textures of N pixels: a Gaussian process model has N second-order moments; second-order scattering uses O(log² N) moments.
Ising Model and Inverse Problem (Bruna, Dokmanić, Maarten de Hoop)
- p(x) = Z_β^{−1} exp( −β Σ_{i,j} J_{i,j} x(i) x(j) ) with x(i) = ±1.
[Figure: Gaussian and Ising fields across temperatures β around β_c, reconstructed from low-resolution data by TV optimisation and by scattering prediction]
Deep Convolutional Trees
[Figure: x(u) → ρL₁ → x₁(u, k₁) → ... → ρL_J → x_J(u, k_J) → classification]
- x_j = ρ L_j x_{j−1}, where L_j is composed of convolutions and subsamplings: x_j(u, k_j) = ρ( x_{j−1}(·, k) ∗ h_{k_j,k}(u) )
- No channel communication: what limitations?
Rotation and Scaling Invariance (Laurent Sifre)
- UIUC database: 25 classes. Scattering classification errors:
  Training | Scat. Translation
  20 | 20%
Deep Convolutional Networks
[Figure: x(u) → ρL₁ → x₁(u, k₁) → ... → ρL_J → x_J(u, k_J) → classification]
- x_j = ρ L_j x_{j−1}, where L_j is a linear combination of convolutions and subsampling, summed across channels: x_j(u, k_j) = ρ( Σ_k x_{j−1}(·, k) ∗ h_{k_j,k}(u) )
- What is the role of channel connections? Linearize other symmetries beyond translations.
Rotation Invariance
[Figure: |W₁| outputs x ∗ φ_J and |x ∗ ψ_{2^j,θ}| across scales 2^j ≤ 2^J and angles θ]
- Channel connections linearize other symmetries.
- Invariance to rotations is computed by convolutions along the rotation variable θ with wavelet filters ⇒ invariance to rigid movements.
Extension to Rigid Movements (Laurent Sifre)
- Group of rigid displacements: translations and rotations.
- Action on wavelet coefficients x_j(u, θ) = |x ∗ ψ_{2^j,θ}(u)|: a rotation and translation of the image, x(r_α(u − c)), becomes a rotation and translation plus an angle translation of the coefficients, x_j(r_α(u − c), θ − α). Need to capture the variability of spatial directions.
- To build invariants: a second wavelet transform on L²(G), with wavelets ψ_{λ2}(u, θ).
Extension to Rigid Movements (Laurent Sifre)
- Scattering on rigid movements: x(u) → |W₁| (wavelets on translations) → x_j(u, θ) → |W₂|, |W₃| (wavelets on rigid movements), with invariant outputs ∫ x(u) du, ∫ x_j(u, θ) du dθ and ∫ |x_j ⊛ ψ_{λ2}(v, θ)| du dθ.
- Convolutions of x_j(u, θ) on the rigid-movement group, with separable wavelets ψ_{θ,2^j}(u₁, u₂) ψ_{2^k}(θ):
  x ⊛ ψ_λ(u, θ) = ∫₀^{2π} ( ∫_{R²} x(u′, θ′) ψ_{θ,2^j}(r_{θ′}(u − u′)) du′ ) ψ_{2^k}(θ − θ′) dθ′
Rotation and Scaling Invariance (Laurent Sifre)
- UIUC database: 25 classes. Scattering classification errors:
  Training | Scat. Translation | Scat. Rigid Movt.
  20 | 20% | 0.6%
Learning Physics: N-Body Problem (Matthew Hirn, N. Poilvert)
- Energy of d interacting bodies: can we learn the interaction energy f(x) of a system from x = {n positions and values}?
- Astronomy, quantum chemistry.
Density Functional Theory
- Kohn-Sham model: E(ρ) = T(ρ) + ∫ ρ(u) V(u) du + (1/2) ∫∫ ρ(u) ρ(v) / |u − v| du dv + E_xc(ρ), i.e. kinetic energy + electron-nuclei attraction + electron-electron Coulomb repulsion + exchange-correlation energy.
- Molecular energy at equilibrium: f(x) = E(ρ_x) = min_ρ E(ρ).
Quantum Chemistry Invariants (Matthew Hirn)
- Quantum chemistry: f(x) is invariant to rigid movements and stable to deformations.
- It depends on the true electronic density (Kohn-Sham): the ground-state electronic density is computed with Schrödinger, while a naive density ρ̃_x can be computed as a sum of blobs.
- Can we estimate f(x) from a naive electronic density?
Quantum Regression (N. Poilvert)
- Linear regressions computed with invariant changes of variables Φx = {φ_n(ρ̃_x)}_n, such as scattering coefficients or squared Fourier modulus coefficients: f_M(x) = Σ_{k=1}^{M} w_k φ_{n_k}(ρ̃_x), selecting M coefficients (see the sketch below).
- Regression coefficients w_k: an equivalent potential.
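A sketch of the sparse regression f_M(x) = Σ_k w_k φ_{n_k}: greedy (matching-pursuit style) selection of M dictionary features, with synthetic random features standing in for the invariant coefficients φ_n(ρ̃_x).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 300, 200                          # molecules, dictionary size
Phi = rng.normal(size=(n, p))            # stand-in features phi_n(rho_x_i)
f = Phi[:, [3, 17, 42]] @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=n)

selected, residual = [], f.copy()
for _ in range(5):                       # select M = 5 features greedily
    scores = np.abs(Phi.T @ residual)
    scores[selected] = -np.inf           # do not reselect
    selected.append(int(np.argmax(scores)))
    w, *_ = np.linalg.lstsq(Phi[:, selected], f, rcond=None)
    residual = f - Phi[:, selected] @ w
print("selected features:", selected)
print("residual RMSE:", np.sqrt(np.mean(residual ** 2)))
```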
Scattering Dictionary
- First order over rotations θ1 and scales j1: |ρ ∗ ψ_{j1,θ1}(u)| of ρ(u).
- Recover translation variability: |ρ ∗ ψ_{j1,θ1}| ∗ ψ_{j2,θ2}(u).
- Recover rotation variability: |ρ ∗ ψ_{j1,·}(u)| ⊛ ψ_{l2}(θ1).
- Combine to recover roto-translation variability: ||ρ ∗ ψ_{j1,·}| ∗ ψ_{j2,θ2}(u) ⊛ ψ_{l2}(θ1)|.
Scattering Regression
- Database {x_i, f(x_i)}_{i≤N} of 4357 planar molecules.
[Figure: regression error (kcal/mol) versus model complexity log₂(M) for Fourier, wavelet scattering and Coulomb dictionaries, with errors of 5.8, 14.2 and 16.7 kcal/mol; state of the art: 1.8 kcal/mol]
- Interaction terms across scales: Fourier vs. scattering.
Time-Frequency Translation Group (J. Andén and V. Lostanlen)
- Time-frequency wavelet convolutions in the (t, log λ) plane: |x ∗ ψ_λ| ∗ φ_J and ||x ∗ ψ_λ| ∗ ψ_α ∗ ψ_β| ∗ φ_J.
Joint Time-Frequency Scattering (J. Andén and V. Lostanlen)
[Figure: audio reconstructions from the original, from time scattering, and from time-frequency scattering]
Musical Instrument Classification (J. Andén and V. Lostanlen)
- MedleyDB: 8 classes (clarinet, electric guitar, female singer, flute, piano, tenor saxophone, trumpet, violin), 10k training examples. Class-wise average error:
  MFCC audio descriptors | 0.39
  time scattering | 0.31
  ConvNet | 0.31
  time-frequency scattering | 0.18
Environmental Sound Classification (J. Andén and V. Lostanlen)
- UrbanSound8k: 10 classes (air conditioner, car horn, children playing, dog barks, drilling, engine at idle, gunshot, jackhammer, siren, street music), 8k training examples. Class-wise average error:
  MFCC audio descriptors | 0.39
  time scattering | 0.27
  ConvNet (Piczak, MLSP 2015) | 0.26
  time-frequency scattering | 0.20
Complex Image Classification (Edouard Oyallon)
[Figure: example classes: boat, water lily, metronome, beaver, Joshua tree, anchor]
- Pipeline: x → S_J x (no learning) → supervised linear classifier → y = f(x).
- Errors:
  Database | Deep-Net | Scat/Unsupervised
  CIFAR-10 | 7% | 20%
Linearisation in Deep Networks
- Trained on a database of faces: linearization.
- On a database including bedrooms: interpolations.
- A. Radford, L. Metz, S. Chintala
Deep Convolutional Networks
[Figure: x(u) → ρL₁ → x₁(u, k₁) → ... → ρL_J → x_J(u, k_J) → classification]
- The convolution network operators L_j have many roles:
  – Linearize non-linear transformations (symmetries)
  – Reduce dimension with projections
  – Memory storage of « characteristic » structures
- Difficult to separate these roles when analyzing learned networks.
Open Problems
- Can we recover symmetry groups from the matrices L_j? What kind of groups?
- Can we characterise the regularity of f(x) from these groups?
- Can we define classes of high-dimensional « regular » functions that are well approximated by deep neural networks?
- Can we get approximation theorems with errors depending on the number of training examples, with a fast decay?
Conclusions
- Deep convolutional networks have spectacular high-dimensional approximation capabilities.
- They seem to compute hierarchical invariants of complex symmetries.
- They are used as models in physiological vision and audition.
- Close links with particle and statistical physics.
- Understanding them is an outstanding mathematical problem.