SLIDE 1 Harmonic Analysis of Deep Convolutional Neural Networks
Helmut Bölcskei
Department of Information Technology and Electrical Engineering
October 2017
joint work with Thomas Wiatowski and Philipp Grohs
SLIDE 2
ImageNet
SLIDE 3
ImageNet
[Example images with labels: ski, rock, coffee, plant]
SLIDE 4
ImageNet
[Example images with labels: ski, rock, coffee, plant]
CNNs win the ImageNet 2015 challenge [He et al., 2015]
SLIDE 5
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
SLIDE 6
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos .”
SLIDE 7
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber .”
SLIDE 8
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber conducting the .”
SLIDE 9
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber conducting the Vienna Philharmonic’s .”
SLIDE 10
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert .”
SLIDE 11
Describing the content of an image
CNNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”
SLIDE 12 Feature extraction and classification
input f (image) → non-linear feature extraction → feature vector Φ(f) → linear classifier
⟨w, Φ(f)⟩ ≥ 0 ⇒ Shannon,   ⟨w, Φ(f)⟩ < 0 ⇒ von Neumann
SLIDE 13
Why non-linear feature extractors?
Task: Separate two categories of data through a linear classifier
[Figure: two classes of data points in the plane; circle of radius 1 marked]
Goal: ⟨w, f⟩ > 0 for one class, ⟨w, f⟩ < 0 for the other
SLIDE 14
Why non-linear feature extractors?
Task: Separate two categories of data through a linear classifier
[Figure: two classes of data points in the plane; circle of radius 1 marked]
Goal: ⟨w, f⟩ > 0 for one class, ⟨w, f⟩ < 0 for the other — not possible!
SLIDE 15 Why non-linear feature extractors?
Task: Separate two categories of data through a linear classifier
[Figure: two classes of data points in the plane; circle of radius 1 marked]
⟨w, f⟩ > 0 for one class, ⟨w, f⟩ < 0 for the other — not possible!
Non-linear feature map Φ(f) = (‖f‖, 1)^T: ⟨w, Φ(f)⟩ > 0 / ⟨w, Φ(f)⟩ < 0 — possible with w = (1, −1)^T
SLIDE 16 Why non-linear feature extractors?
Task: Separate two categories of data through a linear classifier
Φ(f) = (‖f‖, 1)^T ⇒ Φ is invariant to the angular component of the data
SLIDE 17 Why non-linear feature extractors?
Task: Separate two categories of data through a linear classifier
Φ(f) = (‖f‖, 1)^T ⇒ Φ is invariant to the angular component of the data
⇒ Linear separability in feature space!
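A small numerical sketch of this toy example (my own illustration; the concentric-circle radii and sample counts are arbitrary choices): points on two circles around the origin are not linearly separable in R^2, but after the feature map Φ(f) = (‖f‖, 1)^T they are separated by w = (1, −1)^T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes of points on concentric circles of radius 0.5 and 1.5
# (illustrative radii; the slide only indicates a unit-radius decision boundary).
theta = rng.uniform(0.0, 2.0 * np.pi, size=(2, 200))
inner = 0.5 * np.stack([np.cos(theta[0]), np.sin(theta[0])], axis=1)
outer = 1.5 * np.stack([np.cos(theta[1]), np.sin(theta[1])], axis=1)

def phi(points):
    """Non-linear feature map Phi(f) = (||f||, 1)."""
    return np.stack([np.linalg.norm(points, axis=1), np.ones(len(points))], axis=1)

w = np.array([1.0, -1.0])            # linear classifier in feature space
scores_inner = phi(inner) @ w        # = ||f|| - 1 < 0 for the inner class
scores_outer = phi(outer) @ w        # = ||f|| - 1 > 0 for the outer class
print((scores_inner < 0).all(), (scores_outer > 0).all())   # True True
```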
SLIDE 18
Translation invariance
Handwritten digits from the MNIST database [LeCun & Cortes, 1998]
Feature vector should be invariant to spatial location ⇒ translation invariance
SLIDE 19
Deformation insensitivity
Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters
SLIDE 20 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])
feature map:
f → |f ∗ g_{λ_1^(k)}| → ||f ∗ g_{λ_1^(k)}| ∗ g_{λ_2^(l)}| → ···
f → |f ∗ g_{λ_1^(p)}| → ||f ∗ g_{λ_1^(p)}| ∗ g_{λ_2^(r)}| → ···
SLIDE 21 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])
feature map:
f → |f ∗ g_{λ_1^(k)}| → ||f ∗ g_{λ_1^(k)}| ∗ g_{λ_2^(l)}| → ···
f → |f ∗ g_{λ_1^(p)}| → ||f ∗ g_{λ_1^(p)}| ∗ g_{λ_2^(r)}| → ···
in addition, every node emits an output through a low-pass filter: f ∗ χ_1 at the root, (· ∗ χ_2) applied to the first-layer maps, (· ∗ χ_3) applied to the second-layer maps, ...
SLIDE 22 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])
feature map → feature vector Φ(f):
f → |f ∗ g_{λ_1^(k)}| → ||f ∗ g_{λ_1^(k)}| ∗ g_{λ_2^(l)}| → ···
f → |f ∗ g_{λ_1^(p)}| → ||f ∗ g_{λ_1^(p)}| ∗ g_{λ_2^(r)}| → ···
the feature vector Φ(f) collects the low-pass outputs f ∗ χ_1, |f ∗ g_{λ_1}| ∗ χ_2, ||f ∗ g_{λ_1}| ∗ g_{λ_2}| ∗ χ_3, ... of all layers
General scattering networks guarantee [Wiatowski & HB, 2015]
- (vertical) translation invariance
- small deformation sensitivity
essentially irrespective of filters, non-linearities, and poolings!
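The structure above can be sketched in a few lines of code (my own toy illustration for 1-D discrete signals; the Haar-like filters, box low-pass, and DFT-based circular convolution are choices made here, not prescribed by the slides): each layer convolves the previous layer's feature maps with a filter bank g_λ and takes the modulus, while the feature vector Φ(f) collects the low-pass outputs · ∗ χ_n of all layers.

```python
import numpy as np

def conv_circ(x, h):
    """Circular convolution via the FFT (signals treated as periodic)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, len(x))))

def scattering_features(f, filter_banks, lowpasses):
    """Toy scattering-type feature extractor.

    filter_banks[n] : list of band-pass filters g_lambda used in layer n
    lowpasses[n]    : output-generating low-pass filter chi_{n+1} for layer n
    Returns the concatenation of all low-pass outputs (the feature vector Phi(f)).
    """
    features = [conv_circ(f, lowpasses[0])]      # chi_1 applied to the input itself
    layer = [f]
    for g_bank, chi in zip(filter_banks, lowpasses[1:]):
        layer = [np.abs(conv_circ(u, g)) for u in layer for g in g_bank]
        features += [conv_circ(u, chi) for u in layer]
    return np.concatenate(features)

# Example with crude Haar-like band-pass and box low-pass filters.
g_bank = [np.array([1.0, -1.0]), np.array([1.0, 0.0, -1.0])]
chi = np.ones(4) / 4.0
f = np.sin(2 * np.pi * np.arange(64) / 64)
Phi = scattering_features(f, [g_bank, g_bank], [chi, chi, chi])
print(Phi.shape)   # (64 * (1 + 2 + 4),) = (448,)
```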
SLIDE 23 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:
A_n ‖f‖_2^2 ≤ ‖f ∗ χ_n‖_2^2 + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖_2^2 ≤ B_n ‖f‖_2^2,   ∀f ∈ L^2(R^d)
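For such semi-discrete (convolutional) frames the frame condition is equivalent to the Littlewood-Paley condition A_n ≤ |χ̂_n(ω)|^2 + Σ_{λ_n} |ĝ_{λ_n}(ω)|^2 ≤ B_n for a.e. ω, so the bounds can be read off in the Fourier domain. A minimal sketch (my own, for 1-D discrete filters on a periodic grid):

```python
import numpy as np

def frame_bounds(lowpass, bandpass_filters, n_fft=512):
    """Estimate the frame bounds A_n, B_n of a semi-discrete (convolutional) frame.

    In the Fourier domain the frame condition reads
        A_n <= |chi_hat(w)|^2 + sum_lambda |g_hat_lambda(w)|^2 <= B_n   (a.e. w),
    so A_n and B_n are the min and max of this Littlewood-Paley sum over frequency.
    """
    lp_sum = np.abs(np.fft.fft(lowpass, n_fft)) ** 2
    for g in bandpass_filters:
        lp_sum = lp_sum + np.abs(np.fft.fft(g, n_fft)) ** 2
    return lp_sum.min(), lp_sum.max()

# Haar-type pair: chi = averaging filter, g = difference filter.
chi = np.array([0.5, 0.5])
g = np.array([0.5, -0.5])
A, B = frame_bounds(chi, [g])
print(A, B)   # both ~1.0 here, i.e., a Parseval frame (A_n = B_n = 1)
```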
SLIDE 24 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:
A_n ‖f‖_2^2 ≤ ‖f ∗ χ_n‖_2^2 + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖_2^2 ≤ B_n ‖f‖_2^2,   ∀f ∈ L^2(R^d)
e.g.: Structured filters
SLIDE 25 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:
A_n ‖f‖_2^2 ≤ ‖f ∗ χ_n‖_2^2 + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖_2^2 ≤ B_n ‖f‖_2^2,   ∀f ∈ L^2(R^d)
e.g.: Unstructured filters
SLIDE 26 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:
A_n ‖f‖_2^2 ≤ ‖f ∗ χ_n‖_2^2 + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖_2^2 ≤ B_n ‖f‖_2^2,   ∀f ∈ L^2(R^d)
e.g.: Learned filters
SLIDE 27 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Non-linearities: Point-wise and Lipschitz-continuous:
‖M_n(f) − M_n(h)‖_2 ≤ L_n ‖f − h‖_2,   ∀ f, h ∈ L^2(R^d)
SLIDE 28 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Non-linearities: Point-wise and Lipschitz-continuous:
‖M_n(f) − M_n(h)‖_2 ≤ L_n ‖f − h‖_2,   ∀ f, h ∈ L^2(R^d)
⇒ Satisfied by virtually all non-linearities used in the deep learning literature!
ReLU: L_n = 1;  modulus: L_n = 1;  logistic sigmoid: L_n = 1/4;  ...
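These Lipschitz constants are easy to check numerically; the sketch below (my own) lower-bounds L_n by sampling random pairs of points:

```python
import numpy as np

def lipschitz_estimate(sigma, n_pairs=200_000, scale=10.0, seed=0):
    """Lower-bound the Lipschitz constant of a point-wise non-linearity sigma
    by sampling random pairs (x, y) and maximizing |sigma(x) - sigma(y)| / |x - y|."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(-scale, scale, size=(2, n_pairs))
    mask = np.abs(x - y) > 1e-9                      # avoid division by ~0
    return np.max(np.abs(sigma(x[mask]) - sigma(y[mask])) / np.abs(x[mask] - y[mask]))

relu = lambda t: np.maximum(t, 0.0)
modulus = lambda t: np.abs(t)
logistic = lambda t: 1.0 / (1.0 + np.exp(-t))

print(lipschitz_estimate(relu))       # ~1.0
print(lipschitz_estimate(modulus))    # ~1.0
print(lipschitz_estimate(logistic))   # ~0.25 (the sigmoid's maximal slope, at 0)
```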
SLIDE 29 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Pooling: In continuous time according to f ↦ S_n^{d/2} P_n(f)(S_n ·), where S_n ≥ 1 is the pooling factor and P_n : L^2(R^d) → L^2(R^d) is R_n-Lipschitz-continuous
SLIDE 30 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Pooling: In continuous time according to f ↦ S_n^{d/2} P_n(f)(S_n ·), where S_n ≥ 1 is the pooling factor and P_n : L^2(R^d) → L^2(R^d) is R_n-Lipschitz-continuous
⇒ Emulates most poolings used in the deep learning literature!
e.g.: Pooling by sub-sampling: P_n(f) = f with R_n = 1
SLIDE 31 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → non-lin. → pool. (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Pooling: In continuous time according to f ↦ S_n^{d/2} P_n(f)(S_n ·), where S_n ≥ 1 is the pooling factor and P_n : L^2(R^d) → L^2(R^d) is R_n-Lipschitz-continuous
⇒ Emulates most poolings used in the deep learning literature!
e.g.: Pooling by averaging: P_n(f) = f ∗ φ_n with R_n = ‖φ_n‖_1
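In discrete time, the two example pooling operators reduce to (optional) filtering followed by decimation by S; a sketch under that interpretation (my own 1-D discretization of f ↦ S^{d/2} P_n(f)(S_n ·)):

```python
import numpy as np

def pool_subsample(f, S):
    """Pooling by sub-sampling: P(f) = f (R = 1), then decimate by S.
    The sqrt(S) factor mirrors the S^{d/2} normalization for d = 1."""
    return np.sqrt(S) * f[::S]

def pool_average(f, S):
    """Pooling by averaging: P(f) = f * phi with a length-S box filter
    (so R = ||phi||_1 = 1), then decimate by S."""
    phi = np.ones(S) / S
    return np.sqrt(S) * np.convolve(f, phi, mode="same")[::S]

f = np.sin(2.0 * np.pi * np.arange(16) / 16)
print(pool_subsample(f, 2).shape, pool_average(f, 2).shape)   # (8,) (8,)
```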
SLIDE 32 Vertical translation invariance
Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N. Let the pooling factors be S_n ≥ 1, n ∈ N. Then,
|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) )
for all f ∈ L^2(R^d), t ∈ R^d, n ∈ N.
SLIDE 33 Vertical translation invariance
Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N. Let the pooling factors be S_n ≥ 1, n ∈ N. Then,
|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) )
for all f ∈ L^2(R^d), t ∈ R^d, n ∈ N.
⇒ Features become more invariant with increasing network depth!
SLIDE 34 Vertical translation invariance
Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N. Let the pooling factors be S_n ≥ 1, n ∈ N. Then,
|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) )
for all f ∈ L^2(R^d), t ∈ R^d, n ∈ N.
Full translation invariance: If lim_{n→∞} S_1 · S_2 · ⋯ · S_n = ∞, then
lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0
SLIDE 35 Vertical translation invariance
Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N. Let the pooling factors be S_n ≥ 1, n ∈ N. Then,
|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) )
for all f ∈ L^2(R^d), t ∈ R^d, n ∈ N.
The condition B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N, is easily satisfied by normalizing the filters {g_{λ_n}}_{λ_n ∈ Λ_n}.
SLIDE 36 Vertical translation invariance
Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy B_n ≤ min{1, L_n^{-2} R_n^{-2}}, ∀ n ∈ N. Let the pooling factors be S_n ≥ 1, n ∈ N. Then,
|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) )
for all f ∈ L^2(R^d), t ∈ R^d, n ∈ N.
⇒ applies to general filters, non-linearities, and poolings
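Worked instance (my own numbers, not from the slides): with pooling factor S_n = 2 in every layer, the bound reads |||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / 2^n ), so each additional layer halves the residual translation sensitivity; since S_1 ⋯ S_n = 2^n → ∞, full translation invariance is obtained in the limit n → ∞.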
SLIDE 37
Philosophy behind invariance results
Mallat’s “horizontal” translation invariance [Mallat, 2012]:
lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
“Vertical” translation invariance:
lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
SLIDE 38 Philosophy behind invariance results
Mallat’s “horizontal” translation invariance [Mallat, 2012]:
lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
- features become invariant in every network layer, but needs J → ∞
“Vertical” translation invariance:
lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
- features become more invariant with increasing network depth
SLIDE 39 Philosophy behind invariance results
Mallat’s “horizontal” translation invariance [Mallat, 2012]:
lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
- features become invariant in every network layer, but needs J → ∞
- applies to wavelet transform and modulus non-linearity without pooling
“Vertical” translation invariance:
lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
- features become more invariant with increasing network depth
- applies to general filters, general non-linearities, and general poolings
SLIDE 40
Non-linear deformations
Non-linear deformation (Fτf)(x) = f(x − τ(x)), where τ : Rd → Rd For “small” τ:
SLIDE 41
Non-linear deformations
Non-linear deformation (Fτf)(x) = f(x − τ(x)), where τ : Rd → Rd For “large” τ:
SLIDE 42
Deformation sensitivity for signal classes
Consider (F_τ f)(x) = f(x − τ(x)) = f(x − e^{−x^2})
[Figure: f_1 and F_τ f_1 (left), f_2 and F_τ f_2 (right)]
For a given τ, the amount of deformation induced can depend drastically on f ∈ L^2(R^d)
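A small numerical illustration of this point (my own choice of signals, not the ones plotted on the slide): applying the same warp τ(x) = e^{−x^2} to a slowly varying signal and to a rapidly oscillating one produces L^2 perturbations of very different size.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 4001)
dx = x[1] - x[0]
tau = np.exp(-x**2)                              # deformation tau(x) = e^{-x^2}

def deform(f, x, tau):
    """(F_tau f)(x) = f(x - tau(x)), implemented by linear interpolation."""
    return np.interp(x - tau, x, f)

f_slow = np.exp(-x**2 / 4.0)                     # slowly varying signal
f_fast = np.cos(20.0 * x) * np.exp(-x**2 / 4.0)  # rapidly oscillating signal

for name, f in [("slow", f_slow), ("fast", f_fast)]:
    err = np.sqrt(np.sum((deform(f, x, tau) - f) ** 2) * dx)
    print(name, round(err, 3))   # the same tau perturbs the oscillating signal far more
```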
SLIDE 43 Philosophy behind deformation stability/sensitivity bounds
Mallat’s deformation stability bound [Mallat, 2012]:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^{−J}‖τ‖_∞ + J‖Dτ‖_∞ + ‖D^2τ‖_∞) ‖f‖_W,   for all f ∈ H_W ⊆ L^2(R^d)
- The signal class H_W and the corresponding norm ‖·‖_W depend on the mother wavelet (and hence the network)
Our deformation sensitivity bound:
|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀f ∈ C ⊆ L^2(R^d)
- The signal class C (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network
SLIDE 44 Philosophy behind deformation stability/sensitivity bounds
Mallat’s deformation stability bound [Mallat, 2012]:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^{−J}‖τ‖_∞ + J‖Dτ‖_∞ + ‖D^2τ‖_∞) ‖f‖_W,   for all f ∈ H_W ⊆ L^2(R^d)
- Signal class description complexity implicit via the norm ‖·‖_W
Our deformation sensitivity bound:
|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀f ∈ C ⊆ L^2(R^d)
- Signal class description complexity explicit via C_C:
  - L-band-limited functions: C_C = O(L)
  - cartoon functions of size K: C_C = O(K^{3/2})
  - M-Lipschitz functions: C_C = O(M)
SLIDE 45 Philosophy behind deformation stability/sensitivity bounds
Mallat’s deformation stability bound [Mallat, 2012]:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^{−J}‖τ‖_∞ + J‖Dτ‖_∞ + ‖D^2τ‖_∞) ‖f‖_W,   for all f ∈ H_W ⊆ L^2(R^d)
Our deformation sensitivity bound:
|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀f ∈ C ⊆ L^2(R^d)
- Decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)
SLIDE 46 Philosophy behind deformation stability/sensitivity bounds
Mallat’s deformation stability bound [Mallat, 2012]:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^{−J}‖τ‖_∞ + J‖Dτ‖_∞ + ‖D^2τ‖_∞) ‖f‖_W,   for all f ∈ H_W ⊆ L^2(R^d)
- The bound depends explicitly on higher-order derivatives of τ
Our deformation sensitivity bound:
|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀f ∈ C ⊆ L^2(R^d)
- The bound depends on the derivative of τ only implicitly, via the condition ‖Dτ‖_∞ ≤ 1/(2d)
SLIDE 47 Philosophy behind deformation stability/sensitivity bounds
Mallat’s deformation stability bound [Mallat, 2012]:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^{−J}‖τ‖_∞ + J‖Dτ‖_∞ + ‖D^2τ‖_∞) ‖f‖_W,   for all f ∈ H_W ⊆ L^2(R^d)
- The bound is coupled to horizontal translation invariance:
  lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
Our deformation sensitivity bound:
|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀f ∈ C ⊆ L^2(R^d)
- The bound is decoupled from vertical translation invariance:
  lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀f ∈ L^2(R^d), ∀t ∈ R^d
SLIDE 48
CNNs in a nutshell
CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!
SLIDE 49 CNNs in a nutshell
CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
SLIDE 50 CNNs in a nutshell
CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
Such depths (and breadths) pose formidable computational challenges in training and operating the network!
SLIDE 51
Topology reduction
Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers
SLIDE 52
Topology reduction
Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ
SLIDE 53
Topology reduction
Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector
SLIDE 54
Topology reduction
Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector For a fixed (possibly small) depth, design CNNs that capture “most” of the input signal energy
SLIDE 55 Building blocks
Basic operations in the n-th network layer: f → g_{λ_n^(k)} → | · | → ↓S (one branch for each filter g_{λ_n^(k)}, ..., g_{λ_n^(r)})
Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}
Non-linearity: Modulus | · |
Pooling: Sub-sampling with pooling factor S ≥ 1
SLIDE 56 Demodulation effect of modulus non-linearity
Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}
[Figure: input spectrum f̂(ω)]
SLIDE 57 Demodulation effect of modulus non-linearity
Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}
[Figure: input spectrum f̂(ω) and the band-pass filter ĝ_{λ_n}(ω)]
SLIDE 58 Demodulation effect of modulus non-linearity
Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}
[Figure: input spectrum f̂(ω) and the band-pass filter ĝ_{λ_n}(ω)]
Modulus squared: |f ∗ g_{λ_n}(x)|^2 = | ∫_R f̂(ω) ĝ_{λ_n}(ω) e^{2πixω} dω |^2
SLIDE 59 Demodulation effect of modulus non-linearity
Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}
[Figure: input spectrum f̂(ω), the band-pass filter ĝ_{λ_n}(ω), and the spectrum of |f ∗ g_{λ_n}|, which is demodulated, i.e., concentrated around ω = 0]
Φ(f) is then generated via the low-pass output filter χ_{n+1}
Do all non-linearities demodulate?
High-pass filtered signal:
[Figure: |F(f ∗ g_λ)(ω)|, a band of width 2R located away from ω = 0]
SLIDE 61
Do all non-linearities demodulate?
High-pass filtered signal:
[Figure: |F(f ∗ g_λ)(ω)|, a band of width 2R located away from ω = 0]
Modulus: Yes!
[Figure: |F(|f ∗ g_λ|)(ω)| concentrated in [−2R, 2R]] ... but with (small) tails!
SLIDE 62
Do all non-linearities demodulate?
High-pass filtered signal:
[Figure: |F(f ∗ g_λ)(ω)|, a band of width 2R located away from ω = 0]
Modulus squared: Yes, and sharply so!
[Figure: |F(|f ∗ g_λ|^2)(ω)| supported in [−2R, 2R]] ... but | · |^2 is not Lipschitz-continuous!
SLIDE 63
Do all non-linearities demodulate?
High-pass filtered signal:
[Figure: |F(f ∗ g_λ)(ω)|, a band of width 2R located away from ω = 0]
Rectified linear unit: No!
[Figure: |F(ReLU(f ∗ g_λ))(ω)| retains significant content in the original band]
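These spectral pictures are easy to reproduce numerically; the sketch below (my own, with an arbitrary narrow-band test signal standing in for f ∗ g_λ) compares how much energy | · |, | · |^2, and ReLU move into the low-pass band and how much they leave near the carrier:

```python
import numpy as np

n = 2048
t = np.arange(n) / n
# Narrow-band test signal standing in for f * g_lambda: carrier at frequency 300
# with a smooth envelope (both choices are arbitrary).
u = np.exp(-((t - 0.5) ** 2) / 0.01) * np.cos(2.0 * np.pi * 300.0 * t)

def band_energy_fraction(x, f_lo, f_hi):
    """Fraction of the signal energy contained in the frequency band [f_lo, f_hi]."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / len(x))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.sum(np.abs(X[band]) ** 2) / np.sum(np.abs(X) ** 2)

for name, y in [("modulus", np.abs(u)),
                ("modulus^2", np.abs(u) ** 2),
                ("ReLU", np.maximum(u, 0.0))]:
    low = band_energy_fraction(y, 0.0, 50.0)          # demodulated (low-pass) content
    carrier = band_energy_fraction(y, 250.0, 350.0)   # content left near the carrier
    print(f"{name:10s} low-band {low:.2f}   carrier-band {carrier:.2f}")
# Modulus and modulus squared push most of the energy toward omega = 0,
# whereas ReLU leaves a substantial fraction sitting at the carrier frequency.
```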
SLIDE 64 First goal: Quantify feature map energy decay
W_1(f), W_2(f), ...: energy of the feature maps in the corresponding network layers
f → |f ∗ g_{λ_1^(k)}| → ||f ∗ g_{λ_1^(k)}| ∗ g_{λ_2^(l)}| → ···
f → |f ∗ g_{λ_1^(p)}| → ||f ∗ g_{λ_1^(p)}| ∗ g_{λ_2^(r)}| → ···
outputs of every layer are generated via the low-pass filters χ_1, χ_2, χ_3, ...
SLIDE 65 Assumptions (on the filters)
i) Analyticity: For every filter g_{λ_n} there exists a (not necessarily canonical) orthant H_{λ_n} ⊆ R^d such that supp(ĝ_{λ_n}) ⊆ H_{λ_n}.
ii) High-pass: There exists δ > 0 such that Σ_{λ_n ∈ Λ_n} |ĝ_{λ_n}(ω)|^2 = 0, a.e. ω ∈ B_δ(0).
SLIDE 66 Assumptions (on the filters)
i) Analyticity: For every filter g_{λ_n} there exists a (not necessarily canonical) orthant H_{λ_n} ⊆ R^d such that supp(ĝ_{λ_n}) ⊆ H_{λ_n}.
ii) High-pass: There exists δ > 0 such that Σ_{λ_n ∈ Λ_n} |ĝ_{λ_n}(ω)|^2 = 0, a.e. ω ∈ B_δ(0).
⇒ Comprises various constructions of WH (Weyl-Heisenberg) filters, wavelets, ridgelets, (α)-curvelets, shearlets
e.g.: analytic band-limited curvelets: [Figure: tiling of the (ω_1, ω_2) frequency plane]
SLIDE 67 Input signal classes
Sobolev functions of order s ≥ 0:
H^s(R^d) = { f ∈ L^2(R^d) | ∫_{R^d} (1 + |ω|^2)^s |f̂(ω)|^2 dω < ∞ }
SLIDE 68 Input signal classes
Sobolev functions of order s ≥ 0:
H^s(R^d) = { f ∈ L^2(R^d) | ∫_{R^d} (1 + |ω|^2)^s |f̂(ω)|^2 dω < ∞ }
- H^s(R^d) contains a wide range of practically relevant signal classes
SLIDE 69 Input signal classes
Sobolev functions of order s ≥ 0:
H^s(R^d) = { f ∈ L^2(R^d) | ∫_{R^d} (1 + |ω|^2)^s |f̂(ω)|^2 dω < ∞ }
- H^s(R^d) contains a wide range of practically relevant signal classes
- square-integrable functions L^2(R^d) = H^0(R^d)
SLIDE 70 Input signal classes
Sobolev functions of order s ≥ 0:
H^s(R^d) = { f ∈ L^2(R^d) | ∫_{R^d} (1 + |ω|^2)^s |f̂(ω)|^2 dω < ∞ }
- H^s(R^d) contains a wide range of practically relevant signal classes
- square-integrable functions L^2(R^d) = H^0(R^d)
- L-band-limited functions L^2_L(R^d) ⊆ H^s(R^d), ∀L > 0, ∀s ≥ 0
SLIDE 71 Input signal classes
Sobolev functions of order s ≥ 0:
H^s(R^d) = { f ∈ L^2(R^d) | ∫_{R^d} (1 + |ω|^2)^s |f̂(ω)|^2 dω < ∞ }
- H^s(R^d) contains a wide range of practically relevant signal classes
- square-integrable functions L^2(R^d) = H^0(R^d)
- L-band-limited functions L^2_L(R^d) ⊆ H^s(R^d), ∀L > 0, ∀s ≥ 0
- cartoon functions [Donoho, 2001]: C_CART ⊆ H^s(R^d), ∀s ∈ [0, 1/2)
Handwritten digits from MNIST database [LeCun & Cortes, 1998]
SLIDE 72 Exponential energy decay
Theorem
Let the filters be
- wavelets with mother wavelet satisfying supp(ψ̂) ⊆ [r^{−1}, r], r > 1, or
- Weyl-Heisenberg (WH) filters with prototype function satisfying supp(ĝ) ⊆ [−R, R], R > 0.
Then, for every f ∈ H^s(R^d), there exists β > 0 such that
W_n(f) = O( a^{−n(2s+β+1)} ),
where a = (r^2+1)/(r^2−1) in the wavelet case, and a = 1/2 + 1/R in the WH case.
SLIDE 73 Exponential energy decay
Theorem
Let the filters be
- wavelets with mother wavelet satisfying supp(ψ̂) ⊆ [r^{−1}, r], r > 1, or
- Weyl-Heisenberg (WH) filters with prototype function satisfying supp(ĝ) ⊆ [−R, R], R > 0.
Then, for every f ∈ H^s(R^d), there exists β > 0 such that
W_n(f) = O( a^{−n(2s+β+1)} ),
where a = (r^2+1)/(r^2−1) in the wavelet case, and a = 1/2 + 1/R in the WH case.
⇒ the decay factor a is explicit and can be tuned via r, R
SLIDE 74 Exponential energy decay
Exponential energy decay: W_n(f) = O( a^{−n(2s+β+1)} )
SLIDE 75 Exponential energy decay
Exponential energy decay: W_n(f) = O( a^{−n(2s+β+1)} )
- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ(1 + |ω|^2)^{−(s/2 + 1/4 + β/4)},   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
SLIDE 76 Exponential energy decay
Exponential energy decay: W_n(f) = O( a^{−n(2s+β+1)} )
- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ(1 + |ω|^2)^{−(s/2 + 1/4 + β/4)},   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
- smoother input signals (i.e., s↑) lead to faster energy decay
SLIDE 77 Exponential energy decay
Exponential energy decay: W_n(f) = O( a^{−n(2s+β+1)} )
- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ(1 + |ω|^2)^{−(s/2 + 1/4 + β/4)},   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
- smoother input signals (i.e., s↑) lead to faster energy decay
- pooling through sub-sampling f ↦ S^{1/2} f(S·) leads to decay factor a/S
SLIDE 78 Exponential energy decay
Exponential energy decay: W_n(f) = O( a^{−n(2s+β+1)} )
- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ(1 + |ω|^2)^{−(s/2 + 1/4 + β/4)},   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
- smoother input signals (i.e., s↑) lead to faster energy decay
- pooling through sub-sampling f ↦ S^{1/2} f(S·) leads to decay factor a/S
What about general filters? ⇒ polynomial energy decay!
SLIDE 79
... our second goal ... trivial null-space for Φ
Why trivial null-space? Feature space:
[Figure: feature vectors on either side of a hyperplane with normal w]
Class 1: ⟨w, Φ(f)⟩ > 0,   Class 2: ⟨w, Φ(f)⟩ < 0
SLIDE 80
... our second goal ... trivial null-space for Φ
Why trivial null-space? Feature space:
[Figure: feature vectors on either side of a hyperplane with normal w; Φ(f*) lies at the origin]
Class 1: ⟨w, Φ(f)⟩ > 0,   Class 2: ⟨w, Φ(f)⟩ < 0
Non-trivial null-space: ∃ f* ≠ 0 such that Φ(f*) = 0
⇒ ⟨w, Φ(f*)⟩ = 0 for all w!  ⇒ these f* become unclassifiable!
SLIDE 81 ... our second goal ...
Trivial null-space for the feature extractor:
{ f ∈ L^2(R^d) | Φ(f) = 0 } = {0}
The feature extractor Φ(·) = ⋃_{n=0}^∞ Φ^n(·) shall satisfy
A ‖f‖_2^2 ≤ |||Φ(f)|||^2 ≤ B ‖f‖_2^2,   ∀f ∈ L^2(R^d),
for some constants A, B > 0.
SLIDE 82
“Energy conservation”
Theorem
For the frame upper bounds {B_n}_{n∈N} and frame lower bounds {A_n}_{n∈N}, define B := ∏_{n=1}^∞ max{1, B_n} and A := ∏_{n=1}^∞ min{1, A_n}. If 0 < A ≤ B < ∞, then
A ‖f‖_2^2 ≤ |||Φ(f)|||^2 ≤ B ‖f‖_2^2,   ∀ f ∈ L^2(R^d).
SLIDE 83 “Energy conservation”
Theorem
For the frame upper bounds {B_n}_{n∈N} and frame lower bounds {A_n}_{n∈N}, define B := ∏_{n=1}^∞ max{1, B_n} and A := ∏_{n=1}^∞ min{1, A_n}. If 0 < A ≤ B < ∞, then
A ‖f‖_2^2 ≤ |||Φ(f)|||^2 ≤ B ‖f‖_2^2,   ∀ f ∈ L^2(R^d).
- For Parseval frames (i.e., A_n = B_n = 1, n ∈ N), this yields |||Φ(f)|||^2 = ‖f‖_2^2
SLIDE 84 “Energy conservation”
Theorem
For the frame upper bounds {B_n}_{n∈N} and frame lower bounds {A_n}_{n∈N}, define B := ∏_{n=1}^∞ max{1, B_n} and A := ∏_{n=1}^∞ min{1, A_n}. If 0 < A ≤ B < ∞, then
A ‖f‖_2^2 ≤ |||Φ(f)|||^2 ≤ B ‖f‖_2^2,   ∀ f ∈ L^2(R^d).
- For Parseval frames (i.e., A_n = B_n = 1, n ∈ N), this yields |||Φ(f)|||^2 = ‖f‖_2^2
- Connection to energy decay:
  ‖f‖_2^2 = Σ_{k=0}^{n−1} |||Φ^k(f)|||^2 + W_n(f),   with W_n(f) → 0 as n → ∞
SLIDE 85
... and our third goal ...
For a given CNN, specify the number of layers needed to capture “most” of the input signal energy
SLIDE 86 ... and our third goal ...
For a given CNN, specify the number of layers needed to capture “most” of the input signal energy.
How many layers n are needed to have at least ((1 − ε) · 100)% of the input signal energy be contained in the feature vector, i.e.,
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{n} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2,   ∀f ∈ L^2(R^d)?
SLIDE 87 Number of layers needed
Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ N. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If
n ≥ log_a( L / (1 − √(1 − ε)) ),
then
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{n} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2.
SLIDE 88 Number of layers needed
Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ N. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If
n ≥ log_a( L / (1 − √(1 − ε)) ),
then
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{n} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2.
⇒ also guarantees a trivial null-space for ⋃_{k=0}^{n} Φ^k(f)
SLIDE 89 Number of layers needed
Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ N. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If
n ≥ log_a( L / (1 − √(1 − ε)) ),
then
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{n} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2.
- the lower bound on n depends on
  - the description complexity of the input signals (i.e., the bandwidth L)
  - the decay factor (wavelets: a = (r^2+1)/(r^2−1), WH filters: a = 1/2 + 1/R)
SLIDE 90 Number of layers needed
Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ N. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If
n ≥ log_a( L / (1 − √(1 − ε)) ),
then
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{n} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2.
- the lower bound on n depends on
  - the description complexity of the input signals (i.e., the bandwidth L)
  - the decay factor (wavelets: a = (r^2+1)/(r^2−1), WH filters: a = 1/2 + 1/R)
- similar estimates hold for Sobolev input signals and for general filters (polynomial decay!)
SLIDE 91
Number of layers needed
Numerical example for bandwidth L = 1:
(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)       2     3     4      6     8      11
WH filters (R = 1)     2     4     5      8     10     14
general filters        2     3     7      19    39     199
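The wavelet and WH rows of the table follow directly from the depth estimate n ≥ log_a( L / (1 − √(1 − ε)) ) with a = (r^2+1)/(r^2−1) and a = 1/2 + 1/R, respectively; a quick sketch (my own; the general-filter row stems from the polynomial-decay estimate and is not reproduced here):

```python
import math

def layers_needed(a, L, keep):
    """Smallest n with n >= log_a( L / (1 - sqrt(1 - eps)) ), where keep = 1 - eps."""
    eps = 1.0 - keep
    return math.ceil(math.log(L / (1.0 - math.sqrt(1.0 - eps)), a))

L = 1.0
a_wavelet = (2**2 + 1) / (2**2 - 1)   # r = 2  ->  a = 5/3
a_wh = 0.5 + 1.0 / 1.0                # R = 1  ->  a = 3/2

for keep in (0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
    print(keep, layers_needed(a_wavelet, L, keep), layers_needed(a_wh, L, keep))
# Reproduces the wavelet row 2 3 4 6 8 11 and the WH row 2 4 5 8 10 14.
```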
SLIDE 92
Number of layers needed
Numerical example for bandwidth L = 1:
(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)       2     3     4      6     8      11
WH filters (R = 1)     2     4     5      8     10     14
general filters        2     3     7      19    39     199
SLIDE 93 Number of layers needed
Numerical example for bandwidth L = 1:
(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)       2     3     4      6     8      11
WH filters (R = 1)     2     4     5      8     10     14
general filters        2     3     7      19    39     199
Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
SLIDE 94
... our fourth and last goal ...
For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy
SLIDE 95 ... our fourth and last goal ...
For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy.
Recall: Let the filters be
- wavelets with mother wavelet satisfying supp(ψ̂) ⊆ [r^{−1}, r], r > 1, or
- Weyl-Heisenberg filters with prototype function satisfying supp(ĝ) ⊆ [−R, R], R > 0.
SLIDE 96 ... our fourth and last goal ...
For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy.
For fixed depth N, we want to choose r in the wavelet case and R in the WH case such that
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{N} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2,   ∀f ∈ L^2(R^d).
SLIDE 97 Depth-constrained networks
Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ N. Let the input signal f be L-band-limited, and fix ε ∈ (0, 1) and N ∈ N. If, in the wavelet case,
1 < r ≤ ( (κ + 1)/(κ − 1) )^{1/2},
or, in the WH case,
0 < R ≤ 1/(κ − 1/2),
where κ := ( L / (1 − √(1 − ε)) )^{1/N}, then
(1 − ε) ‖f‖_2^2 ≤ Σ_{k=0}^{N} |||Φ^k(f)|||^2 ≤ ‖f‖_2^2.
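With the theorem in this form, the admissible filter parameters for a prescribed depth N can be computed directly; the sketch below (my own, and tied to the reconstruction of the bounds as given above) evaluates κ and the resulting limits on r and R:

```python
import math

def design_bounds(L, keep, N):
    """Largest admissible wavelet 'bandwidth' r and WH support R for depth N,
    obtained by requiring the decay factor a to satisfy a >= kappa."""
    eps = 1.0 - keep
    kappa = (L / (1.0 - math.sqrt(1.0 - eps))) ** (1.0 / N)
    r_max = math.sqrt((kappa + 1.0) / (kappa - 1.0))   # from (r^2+1)/(r^2-1) >= kappa
    R_max = 1.0 / (kappa - 0.5)                        # from 1/2 + 1/R >= kappa
    return kappa, r_max, R_max

# Example: L = 1, keep 95% of the energy with only N = 3 layers.
kappa, r_max, R_max = design_bounds(L=1.0, keep=0.95, N=3)
print(round(kappa, 3), round(r_max, 3), round(R_max, 3))
# Smaller depth N forces a larger kappa, hence a smaller r (narrower wavelet bands),
# i.e., more filters per layer: the depth-width tradeoff discussed next.
```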
SLIDE 98 Depth-width tradeoff
Spectral supports of the wavelet filters:
[Figure: frequency axis ω with band edges r^{−1}, 1, r, r^2, r^3, ... reaching up to the input bandwidth L]
SLIDE 99 Depth-width tradeoff
Spectral supports of the wavelet filters:
[Figure: frequency axis ω with band edges r^{−1}, 1, r, r^2, r^3, ... reaching up to the input bandwidth L]
Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet
SLIDE 100 Depth-width tradeoff
Spectral supports of the wavelet filters:
[Figure: frequency axis ω with band edges r^{−1}, 1, r, r^2, r^3, ... reaching up to the input bandwidth L]
Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal
SLIDE 101 Depth-width tradeoff
Spectral supports of the wavelet filters:
[Figure: frequency axis ω with band edges r^{−1}, 1, r, r^2, r^3, ... reaching up to the input bandwidth L]
Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal ⇒ larger number of filters in the first layer
SLIDE 102 Depth-width tradeoff
Spectral supports of the wavelet filters:
[Figure: frequency axis ω with band edges r^{−1}, 1, r, r^2, r^3, ... reaching up to the input bandwidth L]
Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal ⇒ larger number of filters in the first layer ⇒ depth-width tradeoff
SLIDE 103
Yours truly
SLIDE 104 Experiment: Handwritten digit classification
- Dataset: MNIST database of handwritten digits [LeCun &
Cortes, 1998]; 60,000 training and 10,000 test images
- Φ-network: D = 3 layers; same filters, non-linearities, and
pooling operators in all layers
- Classifier: SVM with radial basis function kernel [Vapnik, 1995]
- Dimensionality reduction: Supervised orthogonal least squares
scheme [Chen et al., 1991]
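A sketch of such a pipeline (my own, not the authors' code: the scattering-type features below use crude Haar-like filters, SelectKBest stands in for the supervised orthogonal least squares reduction of [Chen et al., 1991], and the data subset and hyperparameters are arbitrary):

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simple 2-D Haar-type filters (a stand-in for the Haar / bi-orthogonal wavelet banks).
FILTERS = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]]),
           np.array([[1.0, -1.0], [-1.0, 1.0]])]
LOWPASS = np.ones((4, 4)) / 16.0

def scatter2d(img, depth=3):
    """Toy scattering-type features: per layer, |img * g| for each filter g;
    the feature vector collects low-pass averages with 4x4 sub-sampling."""
    feats, layer = [], [img]
    for _ in range(depth):
        layer = [np.abs(convolve2d(u, g, mode="same")) for u in layer for g in FILTERS]
        feats += [convolve2d(u, LOWPASS, mode="same")[::4, ::4].ravel() for u in layer]
    return np.concatenate([img[::4, ::4].ravel()] + feats)

X_raw, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_raw, y = X_raw[:10_000] / 255.0, y[:10_000]      # subset to keep the sketch fast
X = np.stack([scatter2d(x.reshape(28, 28)) for x in X_raw])

# SelectKBest replaces the orthogonal least squares scheme; SVC uses an RBF kernel.
clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=500), SVC(kernel="rbf"))
clf.fit(X[:8000], y[:8000])
print("test accuracy:", clf.score(X[8000:], y[8000:]))
```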
SLIDE 105
Experiment: Handwritten digit classification
Classification error in percent:
                 Haar wavelet                    Bi-orthogonal wavelet
        abs     ReLU    tanh    LogSig    abs     ReLU    tanh    LogSig
n.p.    0.57    0.57    1.35    1.49      0.51    0.57    1.12    1.22
sub.    0.69    0.66    1.25    1.46      0.61    0.61    1.20    1.18
max.    0.58    0.65    0.75    0.74      0.52    0.64    0.78    0.73
avg.    0.55    0.60    1.27    1.35      0.58    0.59    1.07    1.26
(n.p. = no pooling, sub. = pooling by sub-sampling, max. = max-pooling, avg. = pooling by averaging)
SLIDE 106 Experiment: Handwritten digit classification
Classification error in percent:
                 Haar wavelet                    Bi-orthogonal wavelet
        abs     ReLU    tanh    LogSig    abs     ReLU    tanh    LogSig
n.p.    0.57    0.57    1.35    1.49      0.51    0.57    1.12    1.22
sub.    0.69    0.66    1.25    1.46      0.61    0.61    1.20    1.18
max.    0.58    0.65    0.75    0.74      0.52    0.64    0.78    0.73
avg.    0.55    0.60    1.27    1.35      0.58    0.59    1.07    1.26
(n.p. = no pooling, sub. = pooling by sub-sampling, max. = max-pooling, avg. = pooling by averaging)
- modulus and ReLU perform better than tanh and LogSig
SLIDE 107 Experiment: Handwritten digit classification
Classification error in percent:
                 Haar wavelet                    Bi-orthogonal wavelet
        abs     ReLU    tanh    LogSig    abs     ReLU    tanh    LogSig
n.p.    0.57    0.57    1.35    1.49      0.51    0.57    1.12    1.22
sub.    0.69    0.66    1.25    1.46      0.61    0.61    1.20    1.18
max.    0.58    0.65    0.75    0.74      0.52    0.64    0.78    0.73
avg.    0.55    0.60    1.27    1.35      0.58    0.59    1.07    1.26
(n.p. = no pooling, sub. = pooling by sub-sampling, max. = max-pooling, avg. = pooling by averaging)
- modulus and ReLU perform better than tanh and LogSig
- results with pooling (S = 2) are competitive with those without
pooling, at significantly lower computational cost
SLIDE 108 Experiment: Handwritten digit classification
Classification error in percent:
                 Haar wavelet                    Bi-orthogonal wavelet
        abs     ReLU    tanh    LogSig    abs     ReLU    tanh    LogSig
n.p.    0.57    0.57    1.35    1.49      0.51    0.57    1.12    1.22
sub.    0.69    0.66    1.25    1.46      0.61    0.61    1.20    1.18
max.    0.58    0.65    0.75    0.74      0.52    0.64    0.78    0.73
avg.    0.55    0.60    1.27    1.35      0.58    0.59    1.07    1.26
(n.p. = no pooling, sub. = pooling by sub-sampling, max. = max-pooling, avg. = pooling by averaging)
- modulus and ReLU perform better than tanh and LogSig
- results with pooling (S = 2) are competitive with those without
pooling, at significantly lower computational cost
- state-of-the-art: 0.43 [Bruna and Mallat, 2013]
- similar feature extraction network with directional, non-separable
wavelets and no pooling
- significantly higher computational complexity
SLIDE 109 Energy decay: Related work
[Waldspurger, 2017]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
SLIDE 110 Energy decay: Related work
[Waldspurger, 2017]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
SLIDE 111 Energy decay: Related work
[Waldspurger, 2017]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
- applies to 1-D real-valued band-limited input signals f ∈ L2(R)
SLIDE 112 Energy decay: Related work
[Czaja and Li, 2016]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g., wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
SLIDE 113 Energy decay: Related work
[Czaja and Li, 2016]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g., wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
SLIDE 114 Energy decay: Related work
[Czaja and Li, 2016]: Exponential energy decay W_n(f) = O(a^{−n}), for some unspecified a > 1.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g., wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
- applies to d-dimensional complex-valued input signals
f ∈ L2(Rd)