SLIDE 1

Harmonic Analysis of Deep Convolutional Networks

Yuan Yao (HKUST). Based on talks by Mallat, Bölcskei, and others.


SLIDE 2

Acknowledgement

A follow-up course at HKUST: https://deeplearning-math.github.io/

SLIDE 3

High Dimensional Natural Image Classification

Given $n$ sample values $\{x_i,\ y_i = f(x_i)\}_{i \le n}$:

  • High-dimensional data: $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$
  • Classification: estimate a class label $f(x)$

Image classification: $d = 10^6$ pixels. (Example classes: Anchor, Joshua Tree, Beaver, Lotus, Water Lily.)

Huge variability inside classes ⇒ find invariants.

SLIDE 4

Curse of Dimensionality

  • Analysis in high dimension: $x \in \mathbb{R}^d$ with $d \ge 10^6$.
  • Points are far away in high dimension $d$:
  • 10 points cover $[0,1]$ at a distance $10^{-1}$; 100 points are needed for $[0,1]^2$.
  • Covering $[0,1]^d$ at that distance needs $10^d$ points: impossible if $d \ge 20$ ⇒ Euclidean metrics are not appropriate on raw data.
  • Points are concentrated in the $2^d$ corners of the cube:
$$\lim_{d \to \infty} \frac{\mathrm{vol}(\text{sphere of radius } r)}{\mathrm{vol}([0,r]^d)} = 0.$$
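As a sanity check on the corner-concentration claim, here is a short Monte Carlo sketch (illustrative, not from the slides): it estimates the probability that a uniform point in $[0,1]^d$ falls in the inscribed ball of radius $1/2$.

```python
# Illustrative Monte Carlo sketch: the fraction of the unit cube occupied by
# the inscribed ball vanishes as d grows, so uniform samples concentrate in
# the 2^d corners.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 5, 10, 20]:
    pts = rng.uniform(0.0, 1.0, size=(100_000, d))
    # Inside the inscribed ball iff the distance to the cube's center <= 1/2.
    in_ball = np.linalg.norm(pts - 0.5, axis=1) <= 0.5
    print(f"d = {d:2d}:  P(inscribed ball) ~ {in_ball.mean():.5f}")
```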

SLIDE 5

A Blessing from the Physical World? Multiscale "Compositional" Sparsity

  • Variables $x(u)$ indexed by a low-dimensional $u$: time/space...
    pixels in images, particles in physics, words in text...
  • Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log_2 d)$ multiscale interactions.
  • Multiscale analysis: wavelets on groups of symmetries, in a hierarchical architecture.

SLIDE 6

Learning as an Approximation

  • To estimate $f(x)$ from a sampling $\{x_i,\ y_i = f(x_i)\}_{i \le M}$, we must build an $M$-parameter approximation $f_M$ of $f$.
  • Precise sparse approximation requires some "regularity".
  • For binary classification,
$$f(x) = \begin{cases} 1 & \text{if } x \in \Omega \\ -1 & \text{if } x \notin \Omega \end{cases} \qquad f(x) = \mathrm{sign}(\tilde f(x)),$$
    where $\tilde f$ is potentially regular.
  • What type of regularity? How to compute $f_M$?

SLIDE 7

SLIDE 8

1 Hidden Layer Neural Networks

One-hidden-layer neural network on input $x \in \mathbb{R}^d$:
$$f_M(x) = \sum_{n=1}^{M} \alpha_n\, \rho(w_n \cdot x + b_n), \qquad w_n \cdot x = \sum_k w_{k,n}\, x_k,$$
where $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: a non-linear approximation.

Fourier series correspond to $\rho(u) = e^{iu}$:
$$f_M(x) = \sum_{n=1}^{M} \alpha_n\, e^{i w_n \cdot x}.$$
For nearly all $\rho$: essentially the same approximation results.
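A minimal numpy sketch of this one-hidden-layer parameterization (the names `f_M` and `rho` are illustrative; the weights here are random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 10, 64                      # input dimension, number of hidden units

W = rng.normal(size=(M, d))        # hidden weights w_n (rows)
b = rng.normal(size=M)             # biases b_n
alpha = rng.normal(size=M) / M     # output weights alpha_n

def rho(u):
    """Rectifier non-linearity rho(u) = max(u, 0)."""
    return np.maximum(u, 0.0)

def f_M(x):
    """f_M(x) = sum_n alpha_n * rho(w_n . x + b_n)."""
    return alpha @ rho(W @ x + b)

x = rng.normal(size=d)
print(f_M(x))   # a scalar prediction; training would fit W, b, alpha to data
```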

SLIDE 9

Piecewise Linear Approximation

  • Piecewise linear approximation of $f : [0,1] \to \mathbb{R}$ with $\rho(u) = \max(u, 0)$:
$$\tilde f(x) = \sum_n a_n\, \rho(x - n\epsilon).$$
  • Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$.
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,|x - x'|$, then $|f(x) - \tilde f(x)| \le C\,\epsilon$, hence
$$\|f - f_M\| \le C\, M^{-1}.$$
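An illustrative numpy check of this rate (the grid construction, target function, and least-squares fit are my own sketch, not from the slides):

```python
import numpy as np

f = lambda x: np.abs(np.sin(2 * np.pi * x))    # a Lipschitz target on [0, 1]

for M in [10, 20, 40, 80]:
    eps = 1.0 / M
    knots = np.arange(M) * eps                 # n * eps, n = 0..M-1
    xs = np.linspace(0.0, 1.0, 2000)
    # Design matrix of shifted ReLUs rho(x - n*eps): spans piecewise linear
    # functions with breakpoints at the knots.
    Phi = np.maximum(xs[:, None] - knots[None, :], 0.0)
    a, *_ = np.linalg.lstsq(Phi, f(xs), rcond=None)
    err = np.max(np.abs(Phi @ a - f(xs)))
    # The error stays well within the C * M^(-1) Lipschitz bound (and decays
    # faster here, since this target is smooth between its kinks).
    print(f"M = {M:3d}:  sup-error ~ {err:.5f}")
```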

SLIDE 10

Linear Ridge Approximation

  • Piecewise linear ridge approximation of $f$ on $x \in [0,1]^d$, with $\rho(u) = \max(u, 0)$:
$$\tilde f(x) = \sum_n a_n\, \rho(w_n \cdot x - n\epsilon).$$
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,\|x - x'\|$, then sampling at a distance $\epsilon$ gives $|f(x) - \tilde f(x)| \le C\,\epsilon$.
  • But we need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$, hence
$$\|f - f_M\| \le C\, M^{-1/d}. \qquad \text{Curse of dimensionality!}$$

SLIDE 11

Approximation with Regularity

  • What prior condition makes learning possible?
  • Approximation of regular functions in $C^s[0,1]^d$:
$$\forall x, u \quad |f(x) - p_u(x)| \le C\,|x - u|^s, \quad \text{with } p_u(x) \text{ a polynomial}.$$
  • $|x - u| \le \epsilon^{1/s} \Rightarrow |f(x) - p_u(x)| \le C\,\epsilon$. Need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$, hence
$$\|f - f_M\| \le C\, M^{-s/d}.$$
  • One cannot do better in $C^s[0,1]^d$; this is not good because $s \ll d$.

Failure of classical approximation theory.

SLIDE 12

Kernel Learning

  • Data: $x \in \mathbb{R}^d$, with metric $\|x - x'\|$.
  • Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$, with metric $\|\Phi(x) - \Phi(x')\|$, chosen to nearly linearize $f(x)$, which is then approximated by a linear classifier (a 1D projection $w$):
$$\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k\, \phi_k(x).$$
  • What "regularity" of $f$ is needed?
  • How and when is it possible to find such a $\Phi$?
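As one concrete instance of such a $\Phi$ (my choice for illustration; the slides do not fix one), random Fourier features approximate a Gaussian kernel and are paired with a linear readout:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 2, 200                       # input dim, feature dim d'

# Random Fourier feature map Phi(x), approximating a Gaussian kernel
# (Rahimi & Recht, 2007).
Omega = rng.normal(scale=2.0, size=(d_prime, d))
tau = rng.uniform(0.0, 2 * np.pi, size=d_prime)

def Phi(X):
    return np.sqrt(2.0 / d_prime) * np.cos(X @ Omega.T + tau)

# Toy labels: a non-linear rule a linear classifier on raw x cannot fit.
X = rng.normal(size=(500, d))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)

# Linear classifier in feature space: ridge-regularized least squares.
Z = Phi(X)
w = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(d_prime), Z.T @ y)
acc = np.mean(np.sign(Z @ w) == y)
print(f"train accuracy of a linear readout on Phi(x): {acc:.2f}")
```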

SLIDE 13

SLIDE 14

Reduction of Dimensionality

In the spirit of Fisher's Linear Discriminant Analysis.

  • Discriminative change of variable $\Phi(x)$: if $\Phi(x) \ne \Phi(x')$ whenever $f(x) \ne f(x')$, then $\exists\, \tilde f$ with $f(x) = \tilde f(\Phi(x))$.
  • Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1}\, |f(x) - f(x')|$.
  • If $\tilde f$ is Lipschitz in $z = \Phi(x)$: $|\tilde f(z) - \tilde f(z')| \le C\, \|z - z'\|$, i.e. $|f(x) - f(x')| \le C\, \|\Phi(x) - \Phi(x')\|$.
  • For $x \in \Omega$, if $\Phi(\Omega)$ is bounded with a low dimension $d'$, then $\|f - f_M\| \le C\, M^{-1/d'}$.

SLIDE 15

Deep Convolutional Networks

  • The revival of neural networks: Y. LeCun.

$$x \ \to\ L_1 \to \rho \ \to\ L_2 \to \rho \ \to\ \cdots \ \to\ \Phi(x) \ \to\ \text{linear classification} \ \to\ y = \tilde f(x)$$

  • Each $L_j$ is a linear convolution; $\rho$ is a non-linear scalar "neuron", $\rho(u) = \max(u, 0)$.
  • Optimize the $L_j$ with architecture constraints: over $10^9$ parameters.
  • Exceptional results for images, speech, language, bio-data...
  • Hierarchical invariants; linearization. Why does it work so well? A difficult problem.

SLIDE 16

Deep Convolutional Networks

$$x(u) \ \xrightarrow{\rho L_1}\ x_1(u, k_1) \ \to\ x_2(u, k_2) \ \to\ \cdots \ \xrightarrow{\rho L_J}\ x_J(u, k_J) \ \to\ \text{classification}$$

  • $L_j$ is a linear combination of convolutions and subsampling, summed across channels:
$$x_j = \rho\, L_j\, x_{j-1}, \qquad x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big).$$
  • $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
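A bare-bones numpy rendering of one such layer, $x_j(u, k_j) = \rho\big(\sum_k x_{j-1}(\cdot,k) \star h_{k_j,k}(u)\big)$, in 1D (filter shapes and sizes are illustrative):

```python
import numpy as np

def rho(u):
    # Contractive point-wise non-linearity: |rho(u) - rho(u')| <= |u - u'|
    return np.maximum(u, 0.0)

def conv_layer(x_prev, h, subsample=2):
    """x_prev: (K_in, N) channels; h: (K_out, K_in, w) filters.
    Returns x_j of shape (K_out, N // subsample)."""
    K_out, K_in, _ = h.shape
    out = np.zeros((K_out, x_prev.shape[1]))
    for kj in range(K_out):
        for k in range(K_in):            # sum across input channels
            out[kj] += np.convolve(x_prev[k], h[kj, k], mode="same")
    return rho(out[:, ::subsample])      # subsampling, then non-linearity

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 256))            # 1 input channel, 256 samples
h1 = rng.normal(size=(8, 1, 5))          # layer 1: 8 output channels
h2 = rng.normal(size=(16, 8, 5))         # layer 2: 16 output channels
x1 = conv_layer(x, h1)                   # shape (8, 128)
x2 = conv_layer(x1, h2)                  # shape (16, 64)
print(x1.shape, x2.shape)
```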

SLIDE 17

Many Questions

  • Why convolutions? Translation covariance.
  • Why no overfitting? Contractions, dimension reduction.
  • Why a hierarchical cascade?
  • Why introduce non-linearities?
  • How and what to linearise?
  • What are the roles of the multiple channels in each layer?

(Network diagram, as on the previous slide: $x(u) \xrightarrow{\rho L_1} x_1(u, k_1) \to \cdots \xrightarrow{\rho L_J} x_J(u, k_J) \to$ classification.)

SLIDE 18

Linear Dimension Reduction

  • Level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$, e.g. classes $\Omega_1, \Omega_2, \Omega_3$.
  • Classify by linear projections: invariants.
  • If the level sets (classes) are parallel to a linear space, then variables are eliminated by projecting along it: $x \mapsto \Phi(x)$.

SLIDE 19

Linearise for Dimensionality Reduction

  • Level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$, e.g. classes $\Omega_1, \Omega_2, \Omega_3$.
  • If the level sets $\Omega_t$ are not parallel to a linear space:
  • Linearise them with a change of variable $\Phi(x)$,
  • then reduce dimension with linear projections.
  • Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only on few samples.

SLIDE 20

Level Set Geometry: Symmetries

  • A symmetry is an operator $g$ which preserves the level sets (classes $\Omega_1, \Omega_2$), globally:
$$\forall x, \quad f(g.x) = f(x).$$
  • If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry:
$$f(g_1.g_2.x) = f(g_2.x) = f(x).$$
  • Curse of dimensionality ⇒ level sets are characterised not by their local but by their global geometry: their global symmetries.

SLIDE 21

Groups of Symmetries

  • $G = \{\text{all symmetries}\}$ is a group (unknown a priori):
    Closure: $\forall (g, g') \in G^2,\ g.g' \in G$
    Inverse: $\forall g \in G,\ g^{-1} \in G$
    Associative: $(g.g').g'' = g.(g'.g'')$
    If commutative, $g.g' = g'.g$: Abelian group.
  • A group of dimension $n$ has $n$ generators:
$$g = g_1^{p_1}\, g_2^{p_2} \cdots g_n^{p_n}.$$
  • Lie group: infinitesimal generators (a Lie algebra).
SLIDE 22

Translation and Deformations

(Video of Philipp Scott Johnson: $x(u)$.)

  • Digit classification ($\Omega_3$ vs. $\Omega_5$):
  • Globally invariant to the translation group: a small group.
  • Locally invariant to small diffeomorphisms $x'(u) = x(u - \tau(u))$: a huge group.

SLIDE 23

Rotation and Scaling Variability

  • Rotation and deformations. Group: $SO(2) \times \mathrm{Diff}(SO(2))$.
  • Scaling and deformations. Group: $\mathbb{R} \times \mathrm{Diff}(\mathbb{R})$.

SLIDE 24

Linearize Symmetries

  • A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$.

(Figure: orbits $x, g_1.x, \ldots, g_1^p.x$ and $x', g_1.x', \ldots, g_1^p.x'$ are curved in input space; their images $\Phi(x), \ldots, \Phi(g_1^p.x)$ and $\Phi(x'), \ldots, \Phi(g_1^p.x')$ lie along straight lines.)

  • Linearise symmetries with a change of variable $\Phi(x)$.
  • Lipschitz: $\forall x, g: \ \|\Phi(x) - \Phi(g.x)\| \le C\, \|g\|$.
SLIDE 25

Translation and Deformations

(Video of Philipp Scott Johnson: $x(u)$ and $x'(u)$.)

  • Digit classification:
  • Globally invariant to the translation group.
  • Locally invariant to small diffeomorphisms.
  • Linearize small diffeomorphisms ⇒ $\Phi$ Lipschitz regular.

SLIDE 26

Translations and Deformations

  • Invariance to translations:
$$g.x(u) = x(u - c) \ \Rightarrow\ \Phi(g.x) = \Phi(x).$$
  • Small diffeomorphisms: $g.x(u) = x(u - \tau(u))$, with metric $\|g\| = \|\nabla \tau\|_\infty$ (maximum scaling). Linearisation by Lipschitz continuity:
$$\|\Phi(x) - \Phi(g.x)\| \le C\, \|\nabla \tau\|_\infty.$$
  • Discriminative change of variable:
$$\|\Phi(x) - \Phi(x')\| \ge C^{-1}\, |f(x) - f(x')|.$$
SLIDE 27

Fourier Deformation Instability

  • Fourier transform $\hat x(\omega) = \int x(t)\, e^{-i\omega t}\, dt$; translation gives a phase:
$$x_c(t) = x(t - c) \ \Rightarrow\ \hat x_c(\omega) = e^{-ic\omega}\, \hat x(\omega).$$
  • The modulus is invariant to translations: $\Phi(x) = |\hat x| = |\hat x_c|$. Do we then get $\|\,|\hat x| - |\hat x_\tau|\,\| \le C\, \|\nabla\tau\|_\infty\, \|x\|$?
  • No: instabilities to small deformations $x_\tau(t) = x(t - \tau(t))$. Even for the dilation $\tau(t) = \epsilon\, t$, the difference $|\,|\hat x_\tau(\omega)| - |\hat x(\omega)|\,|$ is big at high frequencies $\omega$.
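A quick numerical sketch of this instability (the signal choice and discretization are mine): a Gabor atom is slightly dilated, and the Fourier-modulus difference grows with the atom's center frequency.

```python
import numpy as np

N = 4096
t = np.linspace(-1.0, 1.0, N)
eps = 0.02                                   # small dilation tau(t) = eps * t

def gabor(t, xi):
    # Gaussian window modulated at frequency xi (rad per unit)
    return np.exp(-t**2 / 0.02) * np.cos(xi * t)

for xi in [50.0, 200.0, 800.0]:              # increasing center frequency
    x = gabor(t, xi)
    x_tau = gabor(t * (1 - eps), xi)          # x_tau(t) = x(t - eps * t)
    dmod = np.abs(np.abs(np.fft.fft(x)) - np.abs(np.fft.fft(x_tau)))
    rel = np.linalg.norm(dmod) / np.linalg.norm(np.fft.fft(x))
    print(f"xi = {xi:5.0f}:  relative Fourier-modulus error ~ {rel:.3f}")
# The error grows with xi: the dilation shifts the spectrum by about eps * xi,
# which exceeds the spectral bandwidth once xi is large.
```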

SLIDE 28

Wavelet Transform

  • Complex wavelet: $\psi(t) = \psi^a(t) + i\, \psi^b(t)$.
  • Dilated: $\psi_\lambda(t) = 2^{-j}\, \psi(2^{-j} t)$ with $\lambda = 2^{-j}$, and
$$x \star \psi_\lambda(t) = \int x(u)\, \psi_\lambda(t - u)\, du.$$
  • Wavelet transform:
$$Wx = \begin{pmatrix} x \star \phi(t) \\ x \star \psi_\lambda(t) \end{pmatrix}_{t, \lambda}$$
  • Unitary: $\|Wx\|^2 = \|x\|^2$.

(Figure: the squared spectra $|\hat\psi_\lambda(\omega)|^2$ tile the frequency axis $\omega$, with $|\hat\phi(\omega)|^2$ covering the low frequencies of $\hat x(\omega)$.)
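A compact sketch of such a transform with Morlet-like filters built directly in the Fourier domain (the filter parameters are illustrative, and this particular frame is only approximately unitary):

```python
import numpy as np

N, J = 1024, 6
omega = np.fft.fftfreq(N) * 2 * np.pi        # frequency grid, rad/sample

def morlet_hat(omega, xi=3.0, sigma=0.85):
    # Gaussian bump centered at xi: an analytic (complex) band-pass filter
    return np.exp(-((omega - xi) ** 2) / (2 * sigma ** 2))

# Dilated wavelets psi_lambda, lambda = 2^-j, plus a Gaussian low-pass phi
psi_hats = [morlet_hat(omega * 2 ** j) for j in range(J)]
phi_hat = np.exp(-(omega ** 2) * (2 ** (2 * J)) / 2)

rng = np.random.default_rng(0)
x = rng.normal(size=N)
x_hat = np.fft.fft(x)

low = np.fft.ifft(x_hat * phi_hat)                     # x * phi
bands = [np.fft.ifft(x_hat * ph) for ph in psi_hats]   # x * psi_lambda (complex)
energy = np.sum(np.abs(low) ** 2) + sum(np.sum(np.abs(b) ** 2) for b in bands)
print(f"||Wx||^2 / ||x||^2 ~ {energy / np.sum(np.abs(x) ** 2):.2f}")
# A tight (unitary) frame would give exactly 1; these analytic filters cover
# only positive frequencies, so this rough sketch lands well below that.
```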

SLIDE 29

Image Wavelet Transform

  • Complex wavelet: $\psi(t) = \psi^a(t) + i\, \psi^b(t)$, $t = (t_1, t_2)$,
    rotated and dilated: $\psi_\lambda(t) = 2^{-j}\, \psi(2^{-j} r\, t)$ with $\lambda = (2^j, r)$.

(Figure: real and imaginary parts of the rotated, dilated wavelets; the spectra $|\hat\psi_\lambda(\omega)|^2$ tile the 2-D frequency plane $(\omega_1, \omega_2)$.)

  • Wavelet transform:
$$Wx = \begin{pmatrix} x \star \phi(t) \\ x \star \psi_\lambda(t) \end{pmatrix}_{t, \lambda}$$
  • Unitary: $\|Wx\|^2 = \|x\|^2$.

SLIDE 30

Why Wavelets?

  • Wavelets are uniformly stable to deformations:
$$\text{if } \psi_{\lambda,\tau}(t) = \psi_\lambda(t - \tau(t)) \text{ then } \|\psi_\lambda - \psi_{\lambda,\tau}\| \le C\, \sup_t |\nabla\tau(t)|.$$
  • Wavelets separate multiscale information.
  • Wavelets provide sparse representations.
SLIDE 31

Why Wavelets?

  • Wavelets are uniformly stable to deformations:
$$\text{if } \psi_{\lambda,\tau}(t) = \psi_\lambda(t - \tau(t)) \text{ then } \|\psi_\lambda - \psi_{\lambda,\tau}\| \le C\, \sup_t |\nabla\tau(t)|.$$
  • Wavelets are sparse representations of functions.
  • Wavelets separate multiscale information.
  • Wavelets can be locally translation invariant.

SLIDE 32

Sparsity of Wavelet Transforms

  • Singular functions $x(t)$ yield sparse wavelet coefficients:
$$|x \star \psi_{\lambda_1}(t)| = \Big| \int x(u)\, \psi_{\lambda_1}(t - u)\, du \Big|.$$

(Figure: a wavelet $\psi_{\lambda_1}$ of support $\sim 1/\lambda_1$ slides over $x(t)$; the modulus $|x \star \psi_{\lambda_1}(t)|$ is large only near the singularities of $x$.)

SLIDE 33

Singularity is Preserved in the Multiscale Transform

  • As before, $|x \star \psi_{\lambda_1}(t)| = |\int x(u)\, \psi_{\lambda_1}(t - u)\, du|$ is large near the singularities of $x$.
  • A second wavelet transform modulus $|W_2|$, with wavelets $\psi_{\lambda_2}$, captures its variations:
$$|W_2|\, |x \star \psi_{\lambda_1}| = \begin{pmatrix} |x \star \psi_{\lambda_1}| \star \phi_{2^J}(t) \\ \big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}(t)\, \big| \end{pmatrix}$$

SLIDE 34

Wavelet Translation Invariance

  • Complex wavelet coefficients:
$$x \star \psi_{\lambda_1}(t) = x \star \psi^a_{\lambda_1}(t) + i\, x \star \psi^b_{\lambda_1}(t).$$

SLIDE 35

Wavelet Translation Invariance

  • The modulus $|x \star \psi_{\lambda_1}|$ is a regular envelope (a "pooling"):
$$|x \star \psi_{\lambda_1}(t)| = \sqrt{|x \star \psi^a_{\lambda_1}(t)|^2 + |x \star \psi^b_{\lambda_1}(t)|^2}.$$

SLIDE 36

Wavelet Translation Invariance

  • The modulus $|x \star \psi_{\lambda_1}|$ is a regular envelope.
  • The average $|x \star \psi_{\lambda_1}| \star \phi(t)$ is invariant to translations that are small relative to the support of $\phi$.

SLIDE 37

Wavelet Translation Invariance

  • The modulus $|x \star \psi_{\lambda_1}|$ is a regular envelope.
  • The average $|x \star \psi_{\lambda_1}| \star \phi(t)$ is invariant to translations that are small relative to the support of $\phi$. In the limit of full averaging,
$$\lim_{\phi \to 1}\ |x \star \psi_{\lambda_1}| \star \phi(t) = \int |x \star \psi_{\lambda_1}(u)|\, du = \|x \star \psi_{\lambda_1}\|_1.$$
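A small numerical check of this averaging (the wavelet and low-pass choices are mine, reusing the Fourier-domain filters sketched earlier): the averaged modulus is exactly covariant to shifts, and barely changes under a small shift of the input.

```python
import numpy as np

N = 1024
omega = np.fft.fftfreq(N) * 2 * np.pi
psi_hat = np.exp(-((omega * 4 - 3.0) ** 2) / (2 * 0.85 ** 2))  # one band-pass wavelet
phi_hat = np.exp(-(omega ** 2) * 32.0 ** 2 / 2)                # wide Gaussian low-pass

def avg_modulus(x):
    # | x * psi | * phi, computed with circular FFT convolutions
    u = np.abs(np.fft.ifft(np.fft.fft(x) * psi_hat))
    return np.real(np.fft.ifft(np.fft.fft(u) * phi_hat))

rng = np.random.default_rng(0)
x = rng.normal(size=N)
x_shift = np.roll(x, 5)                  # a small translation

a, b = avg_modulus(x), avg_modulus(x_shift)
rel_cov = np.linalg.norm(a - np.roll(b, -5)) / np.linalg.norm(a)
print(f"after undoing the shift:  {rel_cov:.2e}")   # ~ 0: exactly covariant
rel_inv = np.linalg.norm(a - b) / np.linalg.norm(a)
print(f"without undoing it:       {rel_inv:.2e}")   # small: near-invariant
```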

SLIDE 38

Recovering Lost Information

  • The high frequencies of $|x \star \psi_{\lambda_1}|$, lost by the averaging, are recovered in its own wavelet coefficients:
$$W |x \star \psi_{\lambda_1}| = \begin{pmatrix} |x \star \psi_{\lambda_1}| \star \phi(t) \\ |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}(t) \end{pmatrix}_{t, \lambda_2}$$
  • Translation invariance by time-averaging the amplitude, for all $\lambda_1, \lambda_2$:
$$\big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\, \big| \star \phi(t).$$

SLIDE 39

Wavelet Filter Bank

  • $|W_1|$ maps $x(u)$ to $\{\, |x \star \psi_{2^j, \theta}(u)|\, \}$ over scales $2^0, 2^1, \ldots, 2^J$ and orientations $\theta$, with $\rho(\alpha) = |\alpha|$.
  • Sparse representation.
  • If $u \ge 0$ then $\rho(u) = u$: $\rho$ has no effect after an averaging.

(Figure: filter-bank outputs $|x \star \psi_{2^1, \theta}|$ and $|x \star \psi_{2^2, \theta}|$ at successive scales.)

SLIDE 40

Contraction

$$Wx = \begin{pmatrix} x \star \phi(t) \\ x \star \psi_\lambda(t) \end{pmatrix}_{t,\lambda} \text{ is linear, with } \|Wx\| = \|x\|;$$
$$|W|x = \begin{pmatrix} x \star \phi(t) \\ |x \star \psi_\lambda(t)| \end{pmatrix}_{t,\lambda} \text{ is non-linear, with } \rho(u) = |u|.$$

  • $|W|$ preserves the norm: $\|\, |W| x\, \| = \|x\|$.
  • $|W|$ is contractive: $\|\, |W|x - |W|y\, \| \le \|x - y\|$, because for $(a,b) \in \mathbb{C}^2$, $|\,|a| - |b|\,| \le |a - b|$.

SLIDE 41

Wavelet Scattering Network

  • Cascade of contractive operators:
$$\|\, |W_k| x - |W_k| x'\, \| \le \|x - x'\|, \quad \text{with } \|\, |W_k| x\, \| = \|x\|.$$
  • The cascade $x \to |W_1| \to |W_2| \to |W_3| \to \cdots$ outputs, at successive orders,
$$x \star \phi, \qquad |x \star \psi_{\lambda_1}| \star \phi, \qquad \big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\, \big| \star \phi, \ \ldots$$
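A minimal order-2 scattering sketch along these lines, reusing the Fourier-domain Morlet-like filters from the earlier snippet (all filter choices are illustrative, not Mallat's exact design):

```python
import numpy as np

N, J = 1024, 5
omega = np.fft.fftfreq(N) * 2 * np.pi
psi_hats = [np.exp(-((omega * 2 ** j - 3.0) ** 2) / (2 * 0.85 ** 2))
            for j in range(J)]
phi_hat = np.exp(-(omega ** 2) * (2 ** (2 * J)) / 2)

# Normalize so the Littlewood-Paley sum is <= 1: each |W_k| is then non-expansive.
lp_max = (phi_hat ** 2 + sum(p ** 2 for p in psi_hats)).max()
psi_hats = [p / np.sqrt(lp_max) for p in psi_hats]
phi_hat = phi_hat / np.sqrt(lp_max)

conv = lambda x, h_hat: np.fft.ifft(np.fft.fft(x) * h_hat)

def scattering2(x):
    """Orders 0-2: x*phi, |x*psi_j1|*phi, ||x*psi_j1|*psi_j2|*phi (j2 > j1)."""
    S = [np.real(conv(x, phi_hat))]                    # order 0
    for j1 in range(J):
        u1 = np.abs(conv(x, psi_hats[j1]))             # |x * psi_j1|
        S.append(np.real(conv(u1, phi_hat)))           # order 1
        for j2 in range(j1 + 1, J):   # only coarser scales carry the envelope
            u2 = np.abs(conv(u1, psi_hats[j2]))
            S.append(np.real(conv(u2, phi_hat)))       # order 2
    return np.concatenate(S)

rng = np.random.default_rng(0)
x = rng.normal(size=N)
y = x + 0.1 * rng.normal(size=N)
d_in = np.linalg.norm(x - y)
d_out = np.linalg.norm(scattering2(x) - scattering2(y))
print(f"contractive: {d_out:.3f} <= {d_in:.3f} is {d_out <= d_in}")
```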

SLIDE 42

Stability of Wavelet Scattering Transform