

SLIDE 1

Deep Neural Network Mathematical Mysteries for High Dimensional Learning

Stéphane Mallat École Normale Supérieure

www.di.ens.fr/data

SLIDE 2

High Dimensional Learning

  • High-dimensional x = (x(1), ..., x(d)) ∈ R^d.
  • Classification: estimate a class label f(x), given n sample values {x_i, y_i = f(x_i)}_{i≤n}.

Image classification: d = 10^6.

[Images: example classes — Anchor, Joshua Tree, Beaver, Lotus Water Lily]

Huge variability inside classes ⇒ find invariants.

SLIDE 3

High Dimensional Learning

  • High-dimensional x = (x(1), ..., x(d)) ∈ R^d.
  • Regression: approximate a functional f(x), given n sample values {x_i, y_i = f(x_i) ∈ R}_{i≤n}.

Examples: astronomy, quantum chemistry. Physics: energy f(x) of a state vector x. Importance of symmetries.

SLIDE 4

Curse of Dimensionality

  • f(x) can be approximated from examples {x_i, f(x_i)}_i by local interpolation if f is regular and there are close examples.
  • Problem: ‖x − x_i‖ is always large.
  • Need ε^{−d} points to cover [0,1]^d at a Euclidean distance ε.

[Figure: regular grid of sample points covering [0,1]^2, d = 2]
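The ε^{−d} count can be felt numerically. Below is a minimal sketch (my own illustration; names and values are assumptions, not from the slides) showing that with a fixed sample budget n, the distance from a query point to its nearest neighbor in [0,1]^d becomes large as d grows.

```python
# Curse of dimensionality: with n fixed, no training point stays close.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed number of training samples

for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                  # n points uniform in [0,1]^d
    x = rng.random(d)                       # one query point
    dists = np.linalg.norm(X - x, axis=1)
    print(f"d={d:5d}  min ||x - x_i|| = {dists.min():.3f}")
# Typical pairwise distances grow like sqrt(d/6), so local
# interpolation from {x_i, f(x_i)} has nothing "local" to work with.
```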

SLIDE 5

Multiscale Separation

  • Variables x(u) indexed by a low-dimensional u: time/space... pixels in images, particles in physics, words in text...
  • Multiscale interactions of d variables: from d^2 interactions to O(log_2 d) multiscale interactions ⇒ hierarchical architecture.
  • Multiscale analysis: wavelets on groups of symmetries.

SLIDE 6

Overview

  • 1 hidden layer networks, approximation theory and the curse of dimensionality
  • Kernel learning
  • Dimension reduction with change of variables
  • Deep Neural networks and symmetry groups
  • Wavelet Scattering transforms
  • Applications and many open questions

Understanding Deep Convolutional Networks, arXiv 2016.

SLIDE 7

Learning as an Approximation

  • To estimate f(x) from a sampling {x_i, y_i = f(x_i)}_{i≤M}, we must build an M-parameter approximation f_M of f.
  • Precise sparse approximation requires some "regularity".
  • For binary classification, f(x) = 1 if x ∈ Ω and −1 if x ∉ Ω, so f(x) = sign(f̃(x)) where f̃ is potentially regular.
  • What type of regularity? How to compute f_M?

SLIDE 8

1 Hidden Layer Neural Networks

  • One-hidden-layer network over x ∈ R^d:

    f_M(x) = Σ_{n=1}^{M} α_n ρ(w_n·x + b_n),  with w_n·x = Σ_k w_{k,n} x_k.

  • {w_{k,n}}_{k,n} and {α_n}_n are learned: a non-linear approximation. Each unit computes a ridge function ρ(x·w_n + b_n), constant in the directions orthogonal to w_n.

Theorem (Cybenko, Hornik, Stinchcombe, White): for "reasonable" bounded ρ(u) and appropriate choices of w_n, b_n and α_n,

    ∀ f ∈ L^2[0,1]^d,  lim_{M→∞} ‖f − f_M‖ = 0.

No big deal: the curse of dimensionality is still there.
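To make the f_M form concrete, here is a hedged sketch: the w_n and b_n are drawn at random and only the α_n are fitted by least squares (a random-features simplification of the theorem's "appropriate choices", not a procedure from the slides).

```python
# f_M(x) = sum_n alpha_n * rho(w_n . x + b_n) with rho = ReLU, d = 1.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                 # target on [0, 1]
x = np.linspace(0, 1, 200)[:, None]
M = 50                                       # number of hidden units
w = 10 * rng.standard_normal((1, M))         # random inner weights w_n
b = rng.uniform(-10, 10, size=M)             # random biases b_n
feats = np.maximum(x @ w + b, 0)             # rho(w_n . x + b_n)
alpha, *_ = np.linalg.lstsq(feats, f(x[:, 0]), rcond=None)
print("max error:", np.abs(feats @ alpha - f(x[:, 0])).max())
```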

SLIDE 9

1 Hidden Layer Neural Networks

  • Same architecture: f_M(x) = Σ_{n=1}^{M} α_n ρ(w_n·x + b_n).
  • Fourier series: ρ(u) = e^{iu} gives f_M(x) = Σ_{n=1}^{M} α_n e^{i w_n·x}.
  • For nearly all ρ: essentially the same approximation results.

SLIDE 10

Piecewise Linear Approximation

  • Piecewise linear approximation with ρ(u) = max(u, 0):

    f̃(x) = Σ_n a_n ρ(x − nε)

  • Need M = ε^{−1} points to cover [0,1] at a distance ε.
  • If f is Lipschitz, |f(x) − f(x′)| ≤ C |x − x′|, then |f(x) − f̃(x)| ≤ C ε, hence ‖f − f_M‖ ≤ C M^{−1}.
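A numerical sketch of this bound (my own toy, with hypothetical names): fit a Lipschitz f on [0,1] with M ReLU knots ρ(x − nε), ε = 1/M, and watch the error shrink. For merely Lipschitz f the guaranteed rate is M^{−1}; smoother targets can decay faster.

```python
# Piecewise-linear approximation f~(x) = sum_n a_n rho(x - n*eps).
import numpy as np

f = lambda x: np.abs(np.sin(4 * x))          # Lipschitz target
x = np.linspace(0, 1, 2000)

for M in (10, 20, 40, 80):
    knots = np.linspace(0, 1, M, endpoint=False)          # n*eps, eps=1/M
    feats = np.maximum(x[:, None] - knots[None, :], 0)    # rho(x - n*eps)
    feats = np.hstack([np.ones((x.size, 1)), feats])      # affine offset
    a, *_ = np.linalg.lstsq(feats, f(x), rcond=None)
    print(f"M={M:3d}  max error = {np.abs(feats @ a - f(x)).max():.4f}")
```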

SLIDE 11

Linear Ridge Approximation

  • Piecewise linear ridge approximation for x ∈ [0,1]^d, with ρ(u) = max(u, 0):

    f̃(x) = Σ_n a_n ρ(w_n·x − nε)

  • Sampling at a distance ε: if f is Lipschitz, |f(x) − f(x′)| ≤ C ‖x − x′‖, then |f(x) − f̃(x)| ≤ C ε.
  • Need M = ε^{−d} points to cover [0,1]^d at a distance ε, so ‖f − f_M‖ ≤ C M^{−1/d}.

Curse of dimensionality!

SLIDE 12

Approximation with Regularity

  • What prior condition makes learning possible?
  • Approximation of regular functions in C^s[0,1]^d:

    ∀ x, u:  |f(x) − p_u(x)| ≤ C |x − u|^s,  with p_u(x) a polynomial.

  • |x − u| ≤ ε^{1/s} ⇒ |f(x) − p_u(x)| ≤ C ε. Need M = ε^{−d/s} points to cover [0,1]^d at a distance ε^{1/s}, hence ‖f − f_M‖ ≤ C M^{−s/d}.
  • Cannot do better in C^s[0,1]^d; not good because s ≪ d.

Failure of classical approximation theory.

SLIDE 13

Kernel Learning

  • Data: x ∈ R^d. Change of variable Φ(x) = {φ_k(x)}_{k≤d′} to nearly linearize f(x), which is approximated by a 1D projection:

    f̃(x) = ⟨Φ(x), w⟩ = Σ_k w_k φ_k(x).

  • The metric changes from ‖x − x′‖ to ‖Φ(x) − Φ(x′)‖, and a linear classifier w acts on Φ(x) ∈ R^{d′}.
  • What "regularity" of f is needed?
  • How and when is it possible to find such a Φ?
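A minimal kernel-learning sketch (my own illustration; sigma, lam and the toy target are assumptions): the feature map Φ is never built explicitly, only the inner products ⟨Φ(x), Φ(x′)⟩ are needed, here a Gaussian kernel with ridge-regularised linear regression in feature space.

```python
# Kernel ridge regression: f~(x) = <Phi(x), w> via the kernel trick.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                     # training points in R^5
y = np.sin(X.sum(axis=1))                    # toy target f(x)
sigma, lam = 1.0, 1e-6

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * sigma**2))             # <Phi(x_i), Phi(x_j)>
c = np.linalg.solve(K + lam * np.eye(len(X)), y)

x_new = rng.random(5)
k = np.exp(-((X - x_new) ** 2).sum(-1) / (2 * sigma**2))
print("prediction:", k @ c, " truth:", np.sin(x_new.sum()))
```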

SLIDE 14

Increase Dimensionality

Proposition: there exists a hyperplane separating any two subsets of N points {Φx_i}_i in dimension d′ > N + 1 if the {Φx_i}_i are not in an affine subspace of dimension < N.

⇒ Choose Φ increasing dimensionality!

Example: Gaussian kernel ⟨Φ(x), Φ(x′)⟩ = exp(−‖x − x′‖² / (2σ²)); Φ(x) is of dimension d′ = ∞.

Problem: generalisation. If σ is small, this behaves like a nearest-neighbor classifier: overfitting.

SLIDE 15

Reduction of Dimensionality

  • Discriminative change of variable Φ(x):

    Φ(x) ≠ Φ(x′) if f(x) ≠ f(x′)  ⇒  ∃ f̃ with f(x) = f̃(Φ(x)).

  • With z = Φ(x), f̃ Lipschitz, |f̃(z) − f̃(z′)| ≤ C ‖z − z′‖, is equivalent to Φ being discriminative: ‖Φ(x) − Φ(x′)‖ ≥ C^{−1} |f(x) − f(x′)|.
  • For x ∈ Ω, if Φ(Ω) is bounded and has a low dimension d′:

    ⇒ ‖f − f_M‖ ≤ C M^{−1/d′}.

SLIDE 16

Deep Convolution Networks

  • The revival of neural networks: Y. LeCun.

x → L_1 → ρ → L_2 → ρ → ... → Φ(x) → linear classification → y = f̃(x)

where each L_j is a linear convolution and ρ is a non-linear scalar "neuron", ρ(u) = max(u, 0).

  • Optimize the L_j under architecture constraints: over 10^9 parameters.
  • Exceptional results for images, speech, language, bio-data...
  • Hierarchical invariants and linearization. Why does it work so well? A difficult problem.

SLIDE 17

ImageNet Database

  • A database with 1 million images and 2000 classes.
SLIDE 18

AlexNet Deep Convolution Network

  • ImageNet supervised training: 1.2·10^6 examples, 10^3 classes.
  • A. Krizhevsky, I. Sutskever, G. Hinton, 2012: 15.3% testing error.

[Figure: learned first-layer filters, which resemble wavelets]

Newer networks reach 5% errors, with up to 150 layers!

SLIDE 19

Image Classification

SLIDE 20

Scene Labeling / Car Driving

SLIDE 21

Why Understanding?

  • Adversarial examples (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus): x̃ = x + ε with ‖ε‖ < 10^{−2} ‖x‖.

[Figure: x correctly classified; the perturbed x̃ classified as "ostrich"]

  • Trial-and-error testing cannot guarantee reliability.
SLIDE 22

Deep Convolutional Networks

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • L_j is a linear combination of convolutions and subsampling, summing across channels:

    x_j(u, k_j) = ρ( Σ_k x_{j−1}(·, k) ⋆ h_{k_j,k}(u) ),  i.e. x_j = ρ L_j x_{j−1}.

  • ρ is contractive: |ρ(u) − ρ(u′)| ≤ |u − u′|, e.g. ρ(u) = max(u, 0) or ρ(u) = |u|.
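The layer formula, as a toy sketch in code (shapes, strides and filter values are illustrative assumptions, not a real trained network): each output channel sums convolutions of all input channels, applies the contractive ρ, and subsamples.

```python
# One layer x_j = rho(L_j x_{j-1}) for 1D signals.
import numpy as np

def layer(x, h, stride=2):
    """x: (K_in, N) input channels; h: (K_out, K_in, S) filters."""
    K_out, K_in, S = h.shape
    out = []
    for kj in range(K_out):
        acc = np.zeros(x.shape[1])
        for k in range(K_in):                          # sum across channels
            acc += np.convolve(x[k], h[kj, k], mode="same")
        out.append(np.maximum(acc, 0)[::stride])       # rho, then subsample
    return np.array(out)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 256))                     # one input channel
h1 = rng.standard_normal((8, 1, 9)) / 9                # 8 output channels
print(layer(x0, h1).shape)                             # (8, 128)
```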

SLIDE 23

Linearisation in Deep Networks

  • Trained on a database of faces: linearization.
  • On a database including bedrooms: interpolations.
  • A. Radford, L. Metz, S. Chintala.
SLIDE 24

Many Questions

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • Why convolutions? Translation covariance.
  • Why no overfitting? Contractions, dimension reduction.
  • Why a hierarchical cascade?
  • Why introduce non-linearities?
  • How and what to linearise?
  • What are the roles of the multiple channels in each layer?

SLIDE 25

Linear Dimension Reduction

  • Level sets of f(x): Ω_t = {x : f(x) = t}, the classes Ω_1, Ω_2, Ω_3.
  • If level sets (classes) are parallel to a linear space, then variables are eliminated: classes are separated by linear projections Φ(x), which are invariants.

SLIDE 26

Linearise for Dimensionality Reduction

  • Level sets of f(x): Ω_t = {x : f(x) = t}, the classes Ω_1, Ω_2, Ω_3.
  • If level sets Ω_t are not parallel to a linear space:
  • linearise them with a change of variable Φ(x),
  • then reduce dimension with linear projections.
  • Difficult because the Ω_t are high-dimensional, irregular, and known only on few samples.

SLIDE 27

Level Set Geometry: Symmetries

  • A symmetry is an operator g which preserves the level sets (the classes Ω_1, Ω_2) globally:

    ∀ x, f(g.x) = f(x).

  • If g_1 and g_2 are symmetries then g_1.g_2 is also a symmetry: f(g_1.g_2.x) = f(g_2.x) = f(x).
  • Curse of dimensionality ⇒ not local but global geometry: level sets are characterised by their global symmetries.

SLIDE 28

Groups of symmetries

  • G = { all symmetries } is a group (unknown):
    closure: ∀ (g, g′) ∈ G², g.g′ ∈ G;
    inverse: ∀ g ∈ G, g^{−1} ∈ G;
    associative: (g.g′).g″ = g.(g′.g″);
    if commutative, g.g′ = g′.g: Abelian group.
  • Group of dimension n if it has n generators: g = g_1^{p_1} g_2^{p_2} ... g_n^{p_n}.
  • Lie group: infinitesimal generators (Lie algebra).
SLIDE 29

Translation and Deformations

[Video of Philipp Scott Johnson]

  • Digit classification, classes Ω_3 and Ω_5:
  • globally invariant to the translation group: small;
  • locally invariant to small diffeomorphisms x′(u) = x(u − τ(u)): a huge group.

SLIDE 30

Frequency Transpositions

H: Heisenberg group of "time-frequency" translations.

[Figure: spectrograms (log(ω) versus t) of the word "encyclopaedias" at two transpositions]
SLIDE 31

Frequency Transpositions

  • Frequency transposition invariance is needed for speech recognition, not for speaker recognition.
  • Time and frequency translations and deformations.

[Figure: spectrogram, log(ω) versus t]
SLIDE 32

Rotation and Scaling Variability

  • Rotation and deformations: the group SO(2) × Diff(SO(2)).
  • Scaling and deformations: the group R × Diff(R).

SLIDE 33

Linearize Symmetries

  • A change of variable Φ(x) must linearize the orbits {g.x}_{g∈G}.

[Figure: orbits g_1^p.x and g_1^p.x′ mapped by Φ to straight trajectories Φ(g_1^p.x), Φ(g_1^p.x′)]

  • Linearise symmetries with a change of variable Φ(x).
  • Lipschitz: ∀ x, g: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖g‖.
SLIDE 34

Translation and Deformations

[Video of Philipp Scott Johnson]

  • Digit classification:
  • globally invariant to the translation group;
  • locally invariant to small diffeomorphisms x′(u) = x(u − τ(u)).
  • Linearize small diffeomorphisms ⇒ Lipschitz regular.

SLIDE 35

Translations and Deformations

  • Invariance to translations: g.x(u) = x(u − c) ⇒ Φ(g.x) = Φ(x).
  • Small diffeomorphisms: g.x(u) = x(u − τ(u)), with metric ‖g‖ = ‖∇τ‖_∞, the maximum scaling. Linearisation means Lipschitz continuity:

    ‖Φ(x) − Φ(g.x)‖ ≤ C ‖∇τ‖_∞.

  • Discriminative change of variable: ‖Φ(x) − Φ(x′)‖ ≥ C^{−1} |f(x) − f(x′)|.
SLIDE 36

Fourier Deformation Instability

  • Fourier transform: x̂(ω) = ∫ x(t) e^{−iωt} dt. A translation x_c(t) = x(t − c) gives x̂_c(ω) = e^{−icω} x̂(ω), so the modulus is invariant to translations: Φ(x) = |x̂| = |x̂_c|.
  • Instability to small deformations x_τ(t) = x(t − τ(t)): even for τ(t) = ε t, the difference | |x̂_τ(ω)| − |x̂(ω)| | is big at high frequencies, so ‖|x̂| − |x̂_τ|‖ is not controlled by ‖∇τ‖_∞ ‖x‖.
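Both halves of this slide can be checked numerically. A sketch (my own construction; the signal and ε are arbitrary): the Fourier modulus is exactly invariant to a circular translation, while a 2% dilation τ(t) = εt moves the high-frequency content across bins and produces a large modulus difference.

```python
import numpy as np

N = 1024
t = np.arange(N)
# narrow-band signal at a high frequency, with a smooth envelope
x = np.cos(2 * np.pi * 0.3 * t) * np.exp(-((t - N / 2) / 100) ** 2)

x_shift = np.roll(x, 5)                      # translation x(t - c)
eps = 0.02
x_dil = np.interp(t * (1 - eps), t, x)       # dilation x(t - eps*t)

mod = lambda s: np.abs(np.fft.rfft(s))
print("translation:", np.linalg.norm(mod(x_shift) - mod(x)))  # ~ 0
print("dilation:   ", np.linalg.norm(mod(x_dil) - mod(x)))    # large
```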

SLIDE 37

Deep Neural Network Mathematical Mysteries for High Dimensional Learning

Stéphane Mallat École Normale Supérieure

www.di.ens.fr/data

SLIDE 38

Deep Convolutional Trees

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • L_j is composed of convolutions and subsamplings, with no channel communication:

    x_j(u, k_j) = ρ( x_{j−1}(·, k) ⋆ h_{k_j,k}(u) ),  i.e. x_j = ρ L_j x_{j−1}.

  • How far can we go? Why a hierarchical cascade?

SLIDE 39

Translations and Deformations

  • Invariance to translations: g.x(u) = x(u − c) ⇒ Φ(g.x) = Φ(x).
  • Small diffeomorphisms: g.x(u) = x(u − τ(u)), with metric ‖g‖ = ‖∇τ‖_∞. Linearisation by Lipschitz continuity: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖∇τ‖_∞.
  • Discriminative change of variable: ‖Φ(x) − Φ(x′)‖ ≥ C^{−1} |f(x) − f(x′)|.
SLIDE 40

Overview Part II

  • Wavelet Scattering transform along translations
  • Generation of textures and random processes
  • Channel connections for more general groups
  • Image and audio classification with small training sets
  • Quantum chemistry
  • Open problems

Understanding Deep Convolutional Networks, arXiv 2016.

SLIDE 41

Multiscale Wavelet Transform

  • Dilated wavelets: ψ_λ(t) = 2^{−j/Q} ψ(2^{−j/Q} t) with λ = 2^{−j/Q}: Q-constant band-pass filters ψ̂_λ.
  • Wavelet transform:

    Wx = ( x ⋆ φ_{2^J}(t), x ⋆ ψ_λ(t) )_{λ≤2^J}

with x ⋆ φ_{2^J} an average and the x ⋆ ψ_λ the higher frequencies. It preserves the norm: ‖Wx‖ = ‖x‖.

  • Convolution: x ⋆ ψ_λ(t) = ∫ x(u) ψ_λ(t − u) du ⇒ (x ⋆ ψ_λ)^(ω) = x̂(ω) ψ̂_λ(ω).

[Figure: tiling of the frequency axis by the band-passes |ψ̂_λ(ω)|² and the low-pass |φ̂(ω)|²]
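A sketch of such a filter bank in the Fourier domain, with Gaussian bumps standing in for the band-pass profiles ψ̂_λ (the slides do not specify the filter shape; all constants here are assumptions):

```python
import numpy as np

N, J = 1024, 5
omega = np.fft.fftfreq(N)                    # frequencies in [-1/2, 1/2)
psi_hat = []
for j in range(J):
    lam = 2.0 ** (-j - 2)                    # dyadic center frequencies
    psi_hat.append(np.exp(-((omega - lam) / (lam / 2)) ** 2))
phi_hat = np.exp(-(omega / 2.0 ** (-J - 2)) ** 2)     # low-pass average

x = np.random.default_rng(0).standard_normal(N)
x_hat = np.fft.fft(x)
coeffs = [np.fft.ifft(x_hat * p) for p in psi_hat]    # x * psi_lambda
average = np.fft.ifft(x_hat * phi_hat).real           # x * phi_2J
print(len(coeffs), average.shape)
```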

SLIDE 42

Why Wavelets?

  • Wavelets are uniformly stable to deformations: if ψ_{λ,τ}(t) = ψ_λ(t − τ(t)) then

    ‖ψ_λ − ψ_{λ,τ}‖ ≤ C sup_t |∇τ(t)|.

  • Wavelets separate multiscale information.
  • Wavelets provide sparse representations.
SLIDE 43

Singular Functions

  • |x ⋆ ψ_{λ1}(t)| = | ∫ x(u) ψ_{λ1}(t − u) du |.

[Figure: a singular signal x(t), the wavelet ψ_{λ1} of support ~1/λ1, and the envelope |x ⋆ ψ_{λ1}(t)|]

SLIDE 44

Scattering Transform: Time-Frequency Fibers

  • Wavelet transform modulus |W|: x(t) → |x ⋆ ψ_{λ1}(t)|, with λ1 indexing the log-frequency axis and t the time axis.

[Figure: scalogram |x ⋆ ψ_{λ1}(t)| over the (t, λ1) plane]

SLIDE 45

Wavelet Translation Invariance

  • First wavelet transform: W_1 x = ( x ⋆ φ_{2^J}, x ⋆ ψ_{λ1} )_{λ1}. The modulus improves invariance:

    |W_1| x = ( x ⋆ φ_{2^J}, |x ⋆ ψ_{λ1}| )_{λ1}

where, for a complex wavelet ψ_{λ1} = ψ^a_{λ1} + i ψ^b_{λ1}:

    |x ⋆ ψ_{λ1}(t)| = ( |x ⋆ ψ^a_{λ1}(t)|² + |x ⋆ ψ^b_{λ1}(t)|² )^{1/2}.

  • |x ⋆ ψ_{λ1}| ⋆ φ_{2^J}(t): local translation invariance at scale 2^J; full translation invariance when 2^J = ∞.
  • Second wavelet transform modulus:

    |W_2| |x ⋆ ψ_{λ1}| = ( |x ⋆ ψ_{λ1}| ⋆ φ_{2^J}, ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| )_{λ2}.
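Chaining the two transforms gives a toy scattering cascade; the sketch below reuses the Fourier-domain filter bank idea (the filter shapes are my assumptions; packages such as kymatio implement this properly).

```python
import numpy as np

def bank(N, J):
    omega = np.fft.fftfreq(N)
    psis = [np.exp(-((omega - 2.0 ** (-j - 2)) / 2.0 ** (-j - 3)) ** 2)
            for j in range(J)]
    phi = np.exp(-(omega / 2.0 ** (-J - 2)) ** 2)
    return psis, phi

def conv(x, filt_hat):                        # convolution via FFT
    return np.fft.ifft(np.fft.fft(x) * filt_hat)

N, J = 1024, 4
psis, phi = bank(N, J)
x = np.random.default_rng(0).standard_normal(N)

S1, S2 = [], []
for p1 in psis:
    u1 = np.abs(conv(x, p1))                  # |x * psi_l1|
    S1.append(conv(u1, phi).real)             # first-order output
    for p2 in psis:                           # in practice only l2 < l1 matters
        u2 = np.abs(conv(u1, p2))             # ||x * psi_l1| * psi_l2|
        S2.append(conv(u2, phi).real)         # second-order output
print(len(S1), len(S2))                       # 4 and 16 paths
```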

SLIDE 46

Singular Functions

  • |x ⋆ ψ_{λ1}(t)| = | ∫ x(u) ψ_{λ1}(t − u) du |.

[Figure: as in slide 43, with a second wavelet ψ_{λ2} applied to the envelope |x ⋆ ψ_{λ1}|]

SLIDE 47

Amplitude Modulation

  • Harmonic sound: x(t) = a(t) (e ⋆ h)(t), with a varying amplitude a(t).

[Figure: first-order windowed scattering |x ⋆ ψ_{λ1}| ⋆ φ(t) at small and large scales, and second-order windowed scattering ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ φ(t) at large scale, for λ1 = log(1977 Hz) (band #75); 512 ms window; the amplitude modulation appears near 18 Hz]

SLIDE 48

Scattering Convolution Network

x(t) → |W_1| → |x ⋆ ψ_{λ1}(t)| → |W_2| → ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}(t)| → |W_3| → ...

  • Outputs, by time averaging over a window of size 2^J:
    x ⋆ φ_{2^J}(t);
    |x ⋆ ψ_{λ1}| ⋆ φ_{2^J}(t) — 1D in λ1; with Q_1 = 16 this is a Mel frequency spectrum;
    ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ φ_{2^J}(t) — 3D in (λ1, λ2, t); with Q_2 = 1 this is a modulation spectrum.
  • No vertical connections between channels.

SLIDE 49

Scale Separation with Wavelets

  • Wavelet filter ψ(u), rotated and dilated: ψ_{2^j,θ}(u) = 2^{−j} ψ(2^{−j} r_θ u).
  • Wavelet transform:

    Wx = ( x ⋆ φ_{2^J}(u), x ⋆ ψ_{2^j,θ}(u) )_{j≤J,θ}

with x ⋆ φ_{2^J} an average and the x ⋆ ψ_{2^j,θ} the higher frequencies. It preserves the norm: ‖Wx‖ = ‖x‖.

  • x ⋆ ψ_{2^j,θ}(u) = ∫ x(v) ψ_{2^j,θ}(u − v) dv is complex-valued (real and imaginary parts).

[Figure: real and imaginary parts of rotated, dilated wavelets, and their frequency supports |ψ̂_λ(ω)|² in the (ω1, ω2) plane]

SLIDE 50

Averaging Pyramid

  • Multiscale averaging by a cascade of pair averagings:

    Hx(u) = ( x(2u) + x(2u + 1) ) / 2

[Figure: x(u) and its successive averages H²x, H³x, H⁴x]
SLIDE 51

Haar Filtering

  • Hx(u) = x ⋆ h(2u) and Gx(u) = x ⋆ g(2u), where h is a low-frequency and g a high-frequency filter.
  • For Haar, a signal {x(u)}_{u≤d} is split into

    H: { (x(2u) + x(2u+1)) / √2 }_{u≤d/2}
    G: { (x(2u) − x(2u+1)) / √2 }_{u≤d/2}
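The Haar pair from this slide, directly as code:

```python
import numpy as np

def H(x):   # low-pass:  (x(2u) + x(2u+1)) / sqrt(2)
    return (x[0::2] + x[1::2]) / np.sqrt(2)

def G(x):   # high-pass: (x(2u) - x(2u+1)) / sqrt(2)
    return (x[0::2] - x[1::2]) / np.sqrt(2)

x = np.arange(8, dtype=float)
print(H(x))   # local averages
print(G(x))   # local differences
# Energy is preserved: ||Hx||^2 + ||Gx||^2 = ||x||^2
print(np.sum(H(x) ** 2) + np.sum(G(x) ** 2), np.sum(x ** 2))
```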
SLIDE 52

Haar Wavelet Transform

  • Cascading H with one G branch at each depth j computes the wavelet transform W_1: the G output at depth j gives x ⋆ ψ_j(2^j k), and the final H output gives the average x ⋆ φ_J(2^J k).

[Figure: filter-bank tree of H and G over depth j; the Haar wavelet ψ_j(u) of support 2^j and the scaling function φ_J(u) of support 2^J]
SLIDE 53

Fast Wavelet Filter Bank

[Figure: |W_1| computed by a filter bank H, G_1, G_2, G_3, G_4 across scales 2^0 to 2^J, producing the moduli |x ⋆ ψ_{2^1,θ}|, ...]
SLIDE 54

Wavelet Filter Bank

  • x(u) → |W_1| → wavelet moduli |x ⋆ ψ_{2^j,θ}| at scales 2^0, 2^1, 2^2, ..., 2^J: a sparse representation, with ρ(α) = |α|.
  • If u ≥ 0 then ρ(u) = u: ρ has no effect after an averaging.

SLIDE 55

Wavelet Convolution Network Tree

  • Cascade of wavelet-modulus operators |W_1|, |W_2|, |W_3|, |W_4| over scales 2^0, 2^1, 2^2, 2^3, up to 2^J:

    S_4 x = |L_4| |L_3| |L_2| |L_1| x

with paths x → |x ⋆ ψ_{λ1}| → ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| → |||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ ψ_{λ3}| → ..., each averaged at the output: x ⋆ φ_J, |x ⋆ ψ_{λ1}| ⋆ φ_J, ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ φ_J, ...

  • ρ has no effect after an averaging.

SLIDE 56

Contraction

  • Wx = ( x ⋆ φ(t), x ⋆ ψ_λ(t) )_{t,λ} is linear and ‖Wx‖ = ‖x‖.
  • |W|x = ( x ⋆ φ(t), |x ⋆ ψ_λ(t)| )_{t,λ} is non-linear, with ρ(u) = |u|, but:
  • it preserves the norm: ‖|W|x‖ = ‖x‖;
  • it is contractive: ‖|W|x − |W|y‖ ≤ ‖x − y‖, because ||a| − |b|| ≤ |a − b| for (a, b) ∈ C².
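The key inequality ||a| − |b|| ≤ |a − b| is elementary; a quick numerical check (my own, passing by the triangle inequality):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
b = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
# | |a| - |b| | <= |a - b| pointwise => the modulus is 1-Lipschitz,
# so |W| inherits the contraction from the linear isometry W.
assert np.all(np.abs(np.abs(a) - np.abs(b)) <= np.abs(a - b) + 1e-12)
print("contraction inequality holds on all samples")
```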

SLIDE 57

Scattering Properties

    S_J x = ( x ⋆ φ_{2^J}, |x ⋆ ψ_{λ1}| ⋆ φ_{2^J}, ||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ φ_{2^J}, |||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ ψ_{λ3}| ⋆ φ_{2^J}, ... )_{λ1,λ2,λ3,...}
        = ... |W_3| |W_2| |W_1| x

Theorem: for appropriate wavelets, a scattering is
  • contractive, ‖S_J x − S_J y‖ ≤ ‖x − y‖ (L² stability), since ‖W_k x‖ = ‖x‖ implies ‖|W_k x| − |W_k x′|‖ ≤ ‖x − x′‖;
  • norm preserving, ‖S_J x‖ = ‖x‖;
  • translation invariant and deformation stable: if D_τ x(u) = x(u − τ(u)) then

    lim_{J→∞} ‖S_J D_τ x − S_J x‖ ≤ C ‖∇τ‖_∞ ‖x‖,

which follows from the Lemma: ‖[W_k, D_τ]‖ = ‖W_k D_τ − D_τ W_k‖ ≤ C ‖∇τ‖_∞.

SLIDE 58

Digit Classification: MNIST (Joan Bruna)

x → S_J x → supervised linear classifier → y = f(x)

  • Invariant to translations, linearises small deformations, invariant to specific deformations, separates different patterns; no learning in the representation.

Classification errors:

  Training size | Conv. Net. (LeCun et al.) | Scattering
  50000         | 0.4%                      | 0.4%
SLIDE 59

Classification of Stationary Textures

[Figure: 2D turbulence; texture classes Ω_1, Ω_2]

  • What stochastic models? Non-Gaussian, with long-range dependence.
  • Can we "Gaussianize" (linearize) such distributions in a reduced-dimensional space?
SLIDE 60

Classification of Textures (J. Bruna)

CUReT database. x → S_J x → supervised linear classifier → y = f(x).

Classification errors:

  Training per class | Fourier Spectr. | Scattering
  46                 | 1%              | 0.2%
SLIDE 61

Scattering Moments of Processes (J. Bruna)

  • The scattering transform of a stationary process X(t) is a stationary vector:

    S_J X = ( X ⋆ φ_{2^J}(t), |X ⋆ ψ_{λ1}| ⋆ φ_{2^J}(t), ||X ⋆ ψ_{λ1}| ⋆ ψ_{λ2}| ⋆ φ_{2^J}(t), ... )_{λ1,λ2,...}

  • Its expectation defines the scattering moments:

    E(SX) = ( E(X), E(|X ⋆ ψ_{λ1}|), E(||X ⋆ ψ_{λ1}| ⋆ ψ_{λ2}|), ... )_{λ1,λ2,...}

  • Central limit theorem: with "weak" ergodicity conditions, as J → ∞, S_J X tends to the Gaussian distribution N(E(SX), Σ_J) with Σ_J → 0.
SLIDE 62

Scattering Moments of Processes

  • Same stationary vector S_J X and Gaussian limit as above.
  • Reconstruction: compute X̃ which minimises ‖S_J X̃ − S_J X‖², by gradient descent.
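A toy version of this reconstruction idea (heavily simplified, my own stand-in: Φ averages the Fourier modulus over octave bands instead of computing a full scattering transform, and scipy's L-BFGS-B approximates the gradient numerically):

```python
import numpy as np
from scipy.optimize import minimize

N = 64
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 4 * np.arange(N) / N) + 0.3 * rng.standard_normal(N)

def Phi(s):
    m = np.abs(np.fft.rfft(s))
    edges = [1, 2, 4, 8, 16, 32]              # octave bands
    return np.array([m[lo:hi].mean() for lo, hi in zip(edges, edges[1:])])

target = Phi(x)                               # moments of the original
loss = lambda s: np.sum((Phi(s) - target) ** 2)
res = minimize(loss, rng.standard_normal(N), method="L-BFGS-B")
print("final loss:", res.fun)
# The minimiser matches the moments of x, not x itself: it is a new
# "texture" with the same summary statistics.
```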
SLIDE 63

Representation of Audio Textures (Joan Bruna)

[Figure: spectrograms (ω versus t) of original textures (paper, cocktail party, applause), of realisations Gaussian in time, and of realisations Gaussian in scattering]

SLIDE 64

Ergodic Texture Reconstructions (Joan Bruna)

  • Textures of N pixels: a Gaussian process model uses N second-order moments.
  • Second-order scattering uses O(log² N) moments: E(|x ⋆ ψ_{λ1}|), E(||x ⋆ ψ_{λ1}| ⋆ ψ_{λ2}|).

[Figure: 2D turbulence reconstructions]

SLIDE 65

Ising Model and Inverse Problem (Bruna, Dokmanic, Maarten de Hoop)

  • Ising model:

    p(x) = Z_β^{−1} exp( −β Σ_{i,j} J_{i,j} x(i) x(j) ),  with x(i) = ±1.

[Figure: Gaussian and Ising fields across temperatures β around the critical β_c, with their scattering representations; super-resolution from low-resolution inputs by TV optimisation versus scattering prediction]

SLIDE 66

Deep Convolutional Trees

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • L_j is composed of convolutions and subsamplings, with no channel communication:

    x_j(u, k_j) = ρ( x_{j−1}(·, k) ⋆ h_{k_j,k}(u) ),  i.e. x_j = ρ L_j x_{j−1}.

  • No channel communication: what limitations?

SLIDE 67

Rotation and Scaling Invariance (Laurent Sifre)

UIUC database: 25 classes. Scattering classification errors:

  Training | Scat. Translation
  20       | 20%
SLIDE 68

Deep Convolutional Networks

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • L_j is a linear combination of convolutions and subsampling, summing across channels:

    x_j(u, k_j) = ρ( Σ_k x_{j−1}(·, k) ⋆ h_{k_j,k}(u) ),  i.e. x_j = ρ L_j x_{j−1}.

  • What is the role of channel connections? Linearize other symmetries beyond translations.

SLIDE 69

Rotation Invariance

  • Channel connections linearize other symmetries.
  • Invariance to rotations is computed by convolutions along the rotation variable θ with wavelet filters ⇒ invariance to rigid movements; see the sketch below.

[Figure: |W_1| outputs |x ⋆ ψ_{2^j,θ}| organised over scale 2^j and angle θ, with the average x ⋆ φ_J]
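The θ-translation mechanism is easy to see in code (a toy sketch; a true image rotation also acts on the spatial variable u, which is ignored here):

```python
import numpy as np

rng = np.random.default_rng(0)
xj = rng.random((16, 16, 8))                  # x_j(u, theta), 8 angles

rotated = np.roll(xj, 3, axis=2)              # input rotation -> theta shift
inv = xj.mean(axis=2)                         # average along theta
inv_rot = rotated.mean(axis=2)
print(np.allclose(inv, inv_rot))              # True: rotation-invariant
# Replacing the plain mean by wavelet convolutions along theta keeps
# more information while staying covariant, as on the slide.
```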

SLIDE 70

Extension to Rigid Movements (Laurent Sifre)

  • Group of rigid displacements: translations and rotations. Need to capture the variability of spatial directions.
  • Action on wavelet coefficients x_j(u, θ) = |x ⋆ ψ_{2^j,θ}(u)|: a rotation and translation of the image, x(r_α(u − c)), becomes a rotation and translation in u together with a translation in angle:

    x_j(r_α(u − c), θ − α).

SLIDE 71

Extension to Rigid Movements (Laurent Sifre)

  • To build invariants: a second wavelet transform on L²(G), with wavelets ψ_{λ2}(u, θ) defined on the rigid-motion group:

    x ⊛ ψ_λ(u, θ) = ∫₀^{2π} ( ∫_{R²} x(u′, θ′) ψ_{θ,2^j}(r_{θ′}(u − u′)) du′ ) ψ_{2^k}(θ − θ′) dθ′

  • Scattering on rigid movements: x(u) → |W_1| → x_j(u, θ) → wavelets on the rigid-motion group |W_2| → |x_j ⊛ ψ_{λ2}(v, θ)| → |W_3| → ..., with invariant outputs ∫ x(u) du, ∫∫ x_j(u, θ) du dθ, ∫∫ |x_j ⊛ ψ_{λ2}(v, θ)| du dθ, ...

SLIDE 72

Rotation and Scaling Invariance (Laurent Sifre)

UIUC database: 25 classes. Scattering classification errors:

  Training | Scat. Translation | Scat. Rigid Mvt.
  20       | 20%               | 0.6%

SLIDE 73

Learning Physics: N-Body Problem (Matthew Hirn, N. Poilvert)

  • Energy of d interacting bodies: can we learn the interaction energy f(x) of a system, with x given by n positions and values?
  • Examples: astronomy, quantum chemistry.
SLIDE 74

Density Functional Theory

  • Kohn-Sham model of the molecular energy:

    E(ρ) = T(ρ) + ∫ ρ(u) V(u) du + (1/2) ∫∫ ρ(u) ρ(v) / |u − v| du dv + E_xc(ρ)

with T the kinetic energy, ∫ ρV the electron-nuclei attraction, the double integral the electron-electron Coulomb repulsion, and E_xc the exchange-correlation energy.

  • At equilibrium: f(x) = E(ρ_x) = min_ρ E(ρ).
SLIDE 75

Quantum Chemistry Invariants

  • Quantum chemistry: f(x) is invariant to rigid movements and stable to deformations. It depends on the true electronic density (Kohn-Sham), the ground-state electronic density computed with the Schrödinger equation.
  • Can we estimate f(x) from a naive electronic density ρ̃_x, computed as a sum of blobs?
SLIDE 76

Quantum Regression (Matthew Hirn, N. Poilvert)

  • Linear regressions computed with invariant changes of variables Φx = {φ_n(ρ̃_x)}_n: scattering coefficients and squared Fourier modulus coefficients:

    f_M(x) = Σ_{k=1}^{M} w_k φ_{n_k}(ρ̃_x)

  • Regression coefficients w_k: an equivalent potential.
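A sketch of this kind of M-term regression (my own illustration with an abstract feature matrix; building actual scattering features of ρ̃_x is out of scope): greedy selection of M dictionary columns with least-squares weights, in the spirit of orthogonal matching pursuit.

```python
import numpy as np

def greedy_regression(Phi, y, M):
    """Phi: (n_samples, n_features); select M features greedily."""
    chosen, residual = [], y.copy()
    for _ in range(M):
        corr = np.abs(Phi.T @ residual)
        corr[chosen] = -np.inf                 # do not reselect
        chosen.append(int(np.argmax(corr)))
        A = Phi[:, chosen]
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ w
    return chosen, w

rng = np.random.default_rng(0)
Phi = rng.standard_normal((300, 100))          # invariant features phi_n(x_i)
y = Phi[:, [3, 17, 42]] @ np.array([1.0, -2.0, 0.5])
idx, w = greedy_regression(Phi, y, M=3)
print(sorted(idx), w)                          # recovers features 3, 17, 42
```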

SLIDE 77

Scattering Dictionary

  • Start from the density ρ(u) and its wavelet moduli |ρ ∗ ψ_{j1,θ1}(u)| over rotations θ1 and scales j1.
  • Recover translation variability: |ρ ∗ ψ_{j1,θ1}| ∗ ψ_{j2,θ2}(u).
  • Recover rotation variability: |ρ ∗ ψ_{j1,·}(u)| ⊛ ψ_{l2}(θ1).
  • Combine to recover roto-translation variability: ||ρ ∗ ψ_{j1,·}| ∗ ψ_{j2,θ2}(u) ⊛ ψ_{l2}(θ1)|.

SLIDE 78

Scattering Regression

  • Database {x_i, f(x_i)}_{i≤N} of 4357 planar molecules.
  • Interaction terms across scales: Fourier versus scattering.

[Plot: regression error versus model complexity log₂(M), comparing Fourier, wavelet scattering and Coulomb dictionaries; the quoted errors are 5.8, 14.2 and 16.7 kcal/mol, versus a 1.8 kcal/mol state of the art]

SLIDE 79

Time-Frequency Translation Group (J. Andén and V. Lostanlen)

  • Time-frequency wavelet convolutions over the (t, log λ) plane:

    |x ⋆ ψ_λ| ⋆ φ_J  and  ||x ⋆ ψ_λ| ⋆ ψ_α ⋆ ψ_β| ⋆ φ_J,

with ψ_α filtering along time and ψ_β along the log-frequency variable log λ.

[Figure: wavelets in the (t, log λ) plane]
SLIDE 80

Joint Time-Frequency Scattering (J. Andén and V. Lostanlen)

[Figure panels: Original, Time Scattering, Time/Freq Scattering]
SLIDE 81

Musical Instrument Classification (J. Andén and V. Lostanlen)

MedleyDB: 8 classes (clarinet, electric guitar, female singer, flute, piano, tenor saxophone, trumpet, violin), 10k training examples. Class-wise average error:

  MFCC audio descriptors     0.39
  time scattering            0.31
  ConvNet                    0.31
  time-frequency scattering  0.18
SLIDE 82

Environmental Sound Classification (J. Andén and V. Lostanlen)

UrbanSound8k: 10 classes (air conditioner, car horns, children playing, dog barks, drilling, engine at idle, gunshot, jackhammer, siren, street music), 8k training examples. Class-wise average error:

  MFCC audio descriptors        0.39
  time scattering               0.27
  ConvNet (Piczak, MLSP 2015)   0.26
  time-frequency scattering     0.20
SLIDE 83

Complex Image Classification (Edouard Oyallon)

[Images: example classes — Boat, Water Lily, Metronome, Beaver, Joshua Tree, Anchor]

x → S_J x → supervised linear classifier → y = f(x), with no learning in the representation.

  Database | Deep-Net | Scat/Unsupervised
  CIFAR-10 | 7%       | 20%

SLIDE 84

Linearisation in Deep Networks

  • Trained on a database of faces: linearization.
  • On a database including bedrooms: interpolations.
  • A. Radford, L. Metz, S. Chintala.
SLIDE 85

Deep Convolutional Networks

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • The convolution network operators L_j have many roles:
    – linearize non-linear transformations (symmetries);
    – reduce dimension with projections;
    – memory storage of "characteristic" structures.
  • Difficult to separate these roles when analyzing learned networks.

SLIDE 86

Open Problems

x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification

  • Can we recover symmetry groups from the matrices L_j? What kinds of groups?
  • Can we characterise the regularity of f(x) from these groups?
  • Can we define classes of high-dimensional "regular" functions that are well approximated by deep neural networks?
  • Can we get approximation theorems giving errors depending on the number of training examples, with a fast decay?

SLIDE 87

Conclusions

  • Deep convolutional networks have spectacular high-dimensional approximation capabilities.
  • They seem to compute hierarchical invariants of complex symmetries.
  • Used as models in physiological vision and audition.
  • Close links with particle and statistical physics.
  • Outstanding mathematical problems to understand them: notions of complexity, regularity, approximation theorems…

Understanding Deep Convolutional Networks, arXiv 2016.