Harmonic Analysis of Deep Convolutional Neural Networks


slide-1
SLIDE 1

Harmonic Analysis of Deep Convolutional Neural Networks

Helmut Bölcskei

Department of Information Technology and Electrical Engineering

October 2017

joint work with Thomas Wiatowski and Philipp Grohs

slide-2
SLIDE 2

ImageNet

slide-3
SLIDE 3

ImageNet

ski rock coffee plant

slide-4
SLIDE 4

ImageNet

ski rock coffee plant

CNNs win the ImageNet 2015 challenge [He et al., 2015]

slide-5
SLIDE 5

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

slide-6
SLIDE 6

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos .”

slide-7
SLIDE 7

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber .”

slide-8
SLIDE 8

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the .”

slide-9
SLIDE 9

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s .”

slide-10
SLIDE 10

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert .”

slide-11
SLIDE 11

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”

slide-12
SLIDE 12

Feature extraction and classification

input: f → non-linear feature extraction → feature vector Φ(f) → linear classifier

output: $\langle w, \Phi(f)\rangle > 0 \Rightarrow$ “Shannon”,  $\langle w, \Phi(f)\rangle < 0 \Rightarrow$ “von Neumann”
slide-13
SLIDE 13

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

Separating $\langle w, f\rangle > 0$ from $\langle w, f\rangle < 0$

slide-14
SLIDE 14

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

Separating $\langle w, f\rangle > 0$ from $\langle w, f\rangle < 0$: not possible!

slide-15
SLIDE 15

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

Separating $\langle w, f\rangle > 0$ from $\langle w, f\rangle < 0$: not possible!

With $\Phi(f) = \begin{pmatrix} \|f\| \\ 1 \end{pmatrix}$, separating $\langle w, \Phi(f)\rangle > 0$ from $\langle w, \Phi(f)\rangle < 0$ is possible with $w = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$

slide-16
SLIDE 16

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier, with $\Phi(f) = \begin{pmatrix} \|f\| \\ 1 \end{pmatrix}$

⇒ Φ is invariant to the angular component of the data
slide-17
SLIDE 17

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier, with $\Phi(f) = \begin{pmatrix} \|f\| \\ 1 \end{pmatrix}$

⇒ Φ is invariant to the angular component of the data

⇒ Linear separability in feature space!
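A minimal numerical sketch of this toy example (an assumed setup: two classes on concentric circles of radius 0.5 and 1.5; the feature map and weight vector are the ones from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: radius 0.5 (inner) and 1.5 (outer).
angles = rng.uniform(0, 2 * np.pi, size=(2, 200))
inner = 0.5 * np.stack([np.cos(angles[0]), np.sin(angles[0])], axis=1)
outer = 1.5 * np.stack([np.cos(angles[1]), np.sin(angles[1])], axis=1)

# Radial feature map Phi(f) = (||f||, 1): invariant to the angular component.
def phi(points):
    return np.stack([np.linalg.norm(points, axis=1), np.ones(len(points))], axis=1)

w = np.array([1.0, -1.0])  # <w, Phi(f)> = ||f|| - 1

print("inner class all negative:", np.all(phi(inner) @ w < 0))  # True
print("outer class all positive:", np.all(phi(outer) @ w > 0))  # True
```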

slide-18
SLIDE 18

Translation invariance

Handwritten digits from the MNIST database [LeCun & Cortes, 1998]

Feature vector should be invariant to spatial location ⇒ translation invariance

slide-19
SLIDE 19

Deformation insensitivity

Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters

slide-20
SLIDE 20

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

feature map

[Network tree: each branch iterates convolution and modulus, $f \mapsto |f * g_{\lambda_1^{(k)}}| \mapsto \big||f * g_{\lambda_1^{(k)}}| * g_{\lambda_2^{(l)}}\big| \mapsto \dots$, with parallel branches for the other filter indices $\lambda_1^{(p)}, \lambda_2^{(r)}, \dots$]

slide-21
SLIDE 21

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

feature map

[Same network tree; in addition every node is convolved with an output-generating filter, i.e. $f * \chi_1$, $|f * g_{\lambda_1^{(k)}}| * \chi_2$, $\big||f * g_{\lambda_1^{(k)}}| * g_{\lambda_2^{(l)}}\big| * \chi_3$, and likewise for the other branches]

slide-22
SLIDE 22

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

feature map → feature vector Φ(f)

[Same network tree; the outputs $(\cdot) * \chi_n$ of all nodes are collected into the feature vector Φ(f)]

General scattering networks guarantee [Wiatowski & HB, 2015]

  • (vertical) translation invariance
  • small deformation sensitivity

essentially irrespective of filters, non-linearities, and poolings!
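A toy sketch of such a two-layer scattering-type feature extractor in 1-D (not the construction analyzed here; hypothetical Gaussian band-pass filters defined in the frequency domain, modulus non-linearity, Gaussian low-pass output filter χ, circular convolutions via the FFT):

```python
import numpy as np

def gauss(freqs, center, width):
    """Gaussian bump in the frequency domain, centered at `center`."""
    return np.exp(-((freqs - center) ** 2) / (2 * width ** 2))

def scattering_features(f, centers=(0.05, 0.1, 0.2), width=0.02, chi_width=0.01):
    """Two-layer scattering-type feature maps of a 1-D signal f (toy sketch)."""
    n = len(f)
    freqs = np.fft.fftfreq(n)
    chi_hat = gauss(freqs, 0.0, chi_width)               # low-pass output filter
    g_hats = [gauss(freqs, c, width) for c in centers]   # band-pass filters

    def conv(x, h_hat):
        return np.fft.ifft(np.fft.fft(x) * h_hat)

    outputs = [np.real(conv(f, chi_hat))]                    # layer-0 output f * chi
    layer1 = [np.abs(conv(f, g_hat)) for g_hat in g_hats]    # |f * g_lambda1|
    for u in layer1:
        outputs.append(np.real(conv(u, chi_hat)))            # |f * g| * chi
        for g_hat in g_hats:
            v = np.abs(conv(u, g_hat))                       # ||f * g| * g'|
            outputs.append(np.real(conv(v, chi_hat)))
    return np.concatenate(outputs)                           # feature vector Phi(f)

t = np.arange(1024)
f = np.cos(2 * np.pi * 0.07 * t) * np.exp(-((t - 512) / 150.0) ** 2)
print(scattering_features(f).shape)   # 13 feature maps of length 1024
```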

slide-23
SLIDE 23

Building blocks

Basic operations in the n-th network layer: the input is convolved with each filter $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, then passed through a non-linearity and a pooling operation.

Filters: semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying

$$A_n\|f\|_2^2 \;\le\; \|f * \chi_n\|_2^2 + \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \;\le\; B_n\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d)$$
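For convolutional filters this frame condition is a Littlewood-Paley condition on the transfer functions, $A_n \le |\hat\chi_n(\omega)|^2 + \sum_{\lambda_n}|\hat g_{\lambda_n}(\omega)|^2 \le B_n$ for a.e. ω. A small numerical sketch with hypothetical Gaussian bump filters:

```python
import numpy as np

n = 4096
freqs = np.fft.fftfreq(n)          # normalized frequencies in [-0.5, 0.5)

def bump(center, width=0.03):
    # even (real) filter: Gaussian bump at +/- center in the frequency domain
    return np.exp(-((np.abs(freqs) - center) ** 2) / (2 * width ** 2))

chi_hat = bump(0.0)                                       # low-pass chi_n
g_hats = [bump(c) for c in np.arange(0.05, 0.51, 0.05)]   # band-pass g_lambda_n

# Littlewood-Paley sum: pointwise spectral energy of the filter bank.
lp = np.abs(chi_hat) ** 2 + sum(np.abs(g) ** 2 for g in g_hats)

# Frame bounds A_n, B_n are the essential inf/sup of the Littlewood-Paley sum.
print("A_n ~ %.3f,  B_n ~ %.3f" % (lp.min(), lp.max()))
```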

slide-24
SLIDE 24

Building blocks

Basic operations in the n-th network layer: the input is convolved with each filter $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, then passed through a non-linearity and a pooling operation.

Filters: semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying

$$A_n\|f\|_2^2 \;\le\; \|f * \chi_n\|_2^2 + \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \;\le\; B_n\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d)$$

e.g.: Structured filters

slide-25
SLIDE 25

Building blocks

Basic operations in the n-th network layer: the input is convolved with each filter $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, then passed through a non-linearity and a pooling operation.

Filters: semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying

$$A_n\|f\|_2^2 \;\le\; \|f * \chi_n\|_2^2 + \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \;\le\; B_n\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d)$$

e.g.: Unstructured filters

slide-26
SLIDE 26

Building blocks

Basic operations in the n-th network layer: the input is convolved with each filter $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, then passed through a non-linearity and a pooling operation.

Filters: semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying

$$A_n\|f\|_2^2 \;\le\; \|f * \chi_n\|_2^2 + \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \;\le\; B_n\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d)$$

e.g.: Learned filters

slide-27
SLIDE 27

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by a non-linearity and a pooling operation.

Non-linearities: point-wise and Lipschitz-continuous,

$$\|M_n(f) - M_n(h)\|_2 \;\le\; L_n \|f - h\|_2, \quad \forall\, f, h \in L^2(\mathbb{R}^d)$$

slide-28
SLIDE 28

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by a non-linearity and a pooling operation.

Non-linearities: point-wise and Lipschitz-continuous,

$$\|M_n(f) - M_n(h)\|_2 \;\le\; L_n \|f - h\|_2, \quad \forall\, f, h \in L^2(\mathbb{R}^d)$$

⇒ Satisfied by virtually all non-linearities used in the deep learning literature! ReLU: $L_n = 1$; modulus: $L_n = 1$; logistic sigmoid: $L_n = \tfrac{1}{4}$; ...
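A quick numerical sanity check of these Lipschitz constants (a sketch; the supremum is estimated over random pairs of points):

```python
import numpy as np

rng = np.random.default_rng(0)

def lipschitz_estimate(fn, trials=100000, scale=5.0):
    """Estimate sup |fn(x) - fn(y)| / |x - y| over random real pairs."""
    x, y = rng.uniform(-scale, scale, (2, trials))
    return np.max(np.abs(fn(x) - fn(y)) / np.abs(x - y))

relu = lambda x: np.maximum(x, 0.0)
modulus = np.abs
logistic = lambda x: 1.0 / (1.0 + np.exp(-x))

print("ReLU    :", lipschitz_estimate(relu))      # ~1
print("modulus :", lipschitz_estimate(modulus))   # ~1
print("sigmoid :", lipschitz_estimate(logistic))  # ~0.25
```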

slide-29
SLIDE 29

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by a non-linearity and a pooling operation.

Pooling: in continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, where $S_n \ge 1$ is the pooling factor and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ is $R_n$-Lipschitz-continuous

slide-30
SLIDE 30

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by a non-linearity and a pooling operation.

Pooling: in continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, where $S_n \ge 1$ is the pooling factor and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ is $R_n$-Lipschitz-continuous

⇒ Emulates most poolings used in the deep learning literature! e.g.: pooling by sub-sampling $P_n(f) = f$ with $R_n = 1$

slide-31
SLIDE 31

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by a non-linearity and a pooling operation.

Pooling: in continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, where $S_n \ge 1$ is the pooling factor and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ is $R_n$-Lipschitz-continuous

⇒ Emulates most poolings used in the deep learning literature! e.g.: pooling by averaging $P_n(f) = f * \phi_n$ with $R_n = \|\phi_n\|_1$
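A discrete sketch of the two pooling examples with pooling factor $S_n = 2$ (hypothetical length-4 averaging kernel φ; the factor $S_n^{1/2}$ mirrors the continuous-time normalization):

```python
import numpy as np

S = 2                                   # pooling factor S_n
f = np.sin(2 * np.pi * 0.03 * np.arange(256))

# Pooling by sub-sampling: P_n(f) = f, then f -> S^(1/2) f(S * .)
pooled_sub = np.sqrt(S) * f[::S]

# Pooling by averaging: P_n(f) = f * phi_n with a short averaging kernel phi_n,
# and R_n = ||phi_n||_1; here a hypothetical length-4 box filter.
phi = np.ones(4) / 4.0
pooled_avg = np.sqrt(S) * np.convolve(f, phi, mode="same")[::S]

print(pooled_sub.shape, pooled_avg.shape)        # (128,) (128,)
print("R_n = ||phi||_1 =", np.abs(phi).sum())    # 1.0
```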

slide-32
SLIDE 32

Vertical translation invariance

Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors be $S_n \ge 1$, $n \in \mathbb{N}$. Then

$$|||\Phi_n(T_t f) - \Phi_n(f)||| \;=\; O\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

slide-33
SLIDE 33

Vertical translation invariance

Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors be $S_n \ge 1$, $n \in \mathbb{N}$. Then

$$|||\Phi_n(T_t f) - \Phi_n(f)||| \;=\; O\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

⇒ Features become more invariant with increasing network depth!

slide-34
SLIDE 34

Vertical translation invariance

Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors be $S_n \ge 1$, $n \in \mathbb{N}$. Then

$$|||\Phi_n(T_t f) - \Phi_n(f)||| \;=\; O\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

Full translation invariance: if $\lim_{n\to\infty} S_1 \cdot S_2 \cdots S_n = \infty$, then

$$\lim_{n\to\infty} |||\Phi_n(T_t f) - \Phi_n(f)||| = 0$$

slide-35
SLIDE 35

Vertical translation invariance

Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors be $S_n \ge 1$, $n \in \mathbb{N}$. Then

$$|||\Phi_n(T_t f) - \Phi_n(f)||| \;=\; O\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

The condition $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$, $\forall n \in \mathbb{N}$, is easily satisfied by normalizing the filters $\{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$.

slide-36
SLIDE 36

Vertical translation invariance

Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors be $S_n \ge 1$, $n \in \mathbb{N}$. Then

$$|||\Phi_n(T_t f) - \Phi_n(f)||| \;=\; O\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

⇒ applies to general filters, non-linearities, and poolings

slide-37
SLIDE 37

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:
$$\lim_{J\to\infty} |||\Phi_W(T_t f) - \Phi_W(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

“Vertical” translation invariance:
$$\lim_{n\to\infty} |||\Phi_n(T_t f) - \Phi_n(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

slide-38
SLIDE 38

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:
$$\lim_{J\to\infty} |||\Phi_W(T_t f) - \Phi_W(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

  • features become invariant in every network layer, but needs J → ∞

“Vertical” translation invariance:
$$\lim_{n\to\infty} |||\Phi_n(T_t f) - \Phi_n(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

  • features become more invariant with increasing network depth
slide-39
SLIDE 39

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:
$$\lim_{J\to\infty} |||\Phi_W(T_t f) - \Phi_W(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

  • features become invariant in every network layer, but needs J → ∞
  • applies to wavelet transform and modulus non-linearity without pooling

“Vertical” translation invariance:
$$\lim_{n\to\infty} |||\Phi_n(T_t f) - \Phi_n(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

  • features become more invariant with increasing network depth
  • applies to general filters, general non-linearities, and general poolings

slide-40
SLIDE 40

Non-linear deformations

Non-linear deformation (Fτf)(x) = f(x − τ(x)), where τ : Rd → Rd For “small” τ:

slide-41
SLIDE 41

Non-linear deformations

Non-linear deformation (Fτf)(x) = f(x − τ(x)), where τ : Rd → Rd For “large” τ:

slide-42
SLIDE 42

Deformation sensitivity for signal classes

Consider $(F_\tau f)(x) = f(x - \tau(x)) = f(x - e^{-x^2})$

[Plots: $f_1(x)$ with $(F_\tau f_1)(x)$, and $f_2(x)$ with $(F_\tau f_2)(x)$]

For a given τ, the amount of deformation induced can depend drastically on $f \in L^2(\mathbb{R}^d)$
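A small sketch reproducing the flavor of this example (hypothetical choices of f₁ and f₂): the same deformation τ(x) = e^{-x²} barely moves a slowly varying signal but induces a large L²-error for an oscillatory one.

```python
import numpy as np

x = np.linspace(-5, 5, 4000)
dx = x[1] - x[0]
tau = np.exp(-x ** 2)                      # deformation tau(x) = e^{-x^2}

def deform(f):
    """(F_tau f)(x) = f(x - tau(x)), evaluated by linear interpolation."""
    return np.interp(x - tau, x, f(x))

f1 = lambda t: np.exp(-t ** 2 / 4.0)                      # slowly varying signal
f2 = lambda t: np.cos(20 * t) * np.exp(-t ** 2 / 4.0)     # high-frequency signal

for name, f in [("f1", f1), ("f2", f2)]:
    err = np.sqrt(np.sum((deform(f) - f(x)) ** 2) * dx)
    print(name, "||F_tau f - f||_2 =", round(err, 3))
# The deformation error is much larger for the oscillatory f2 than for f1.
```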

slide-43
SLIDE 43

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:
$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \;\le\; C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\,\|f\|_W,$$
for all $f \in H_W \subseteq L^2(\mathbb{R}^d)$

  • The signal class $H_W$ and the corresponding norm $\|\cdot\|_W$ depend on the mother wavelet (and hence the network)

Our deformation sensitivity bound:
$$|||\Phi(F_\tau f) - \Phi(f)||| \;\le\; C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d)$$

  • The signal class $\mathcal{C}$ (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network

slide-44
SLIDE 44

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:
$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \;\le\; C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\,\|f\|_W,$$
for all $f \in H_W \subseteq L^2(\mathbb{R}^d)$

  • Signal class description complexity implicit via the norm $\|\cdot\|_W$

Our deformation sensitivity bound:
$$|||\Phi(F_\tau f) - \Phi(f)||| \;\le\; C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d)$$

  • Signal class description complexity explicit via $C_{\mathcal{C}}$
  • L-band-limited functions: $C_{\mathcal{C}} = O(L)$
  • cartoon functions of size K: $C_{\mathcal{C}} = O(K^{3/2})$
  • M-Lipschitz functions: $C_{\mathcal{C}} = O(M)$
slide-45
SLIDE 45

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:
$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \;\le\; C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\,\|f\|_W,$$
for all $f \in H_W \subseteq L^2(\mathbb{R}^d)$

Our deformation sensitivity bound:
$$|||\Phi(F_\tau f) - \Phi(f)||| \;\le\; C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d)$$

  • Decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)

slide-46
SLIDE 46

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:
$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \;\le\; C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\,\|f\|_W,$$
for all $f \in H_W \subseteq L^2(\mathbb{R}^d)$

  • The bound depends explicitly on higher-order derivatives of τ

Our deformation sensitivity bound:
$$|||\Phi(F_\tau f) - \Phi(f)||| \;\le\; C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d)$$

  • The bound depends on the derivative of τ only implicitly, via the condition $\|D\tau\|_\infty \le \frac{1}{2d}$

slide-47
SLIDE 47

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:
$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \;\le\; C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\,\|f\|_W,$$
for all $f \in H_W \subseteq L^2(\mathbb{R}^d)$

  • The bound is coupled to horizontal translation invariance
$$\lim_{J\to\infty} |||\Phi_W(T_t f) - \Phi_W(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

Our deformation sensitivity bound:
$$|||\Phi(F_\tau f) - \Phi(f)||| \;\le\; C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d)$$

  • The bound is decoupled from vertical translation invariance
$$\lim_{n\to\infty} |||\Phi_n(T_t f) - \Phi_n(f)||| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d$$

slide-48
SLIDE 48

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!

slide-49
SLIDE 49

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]

  • Network depth: 152 layers
  • average # of nodes per layer: 472
  • # of FLOPS for a single forward pass: 11.3 billion
slide-50
SLIDE 50

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]

  • Network depth: 152 layers
  • average # of nodes per layer: 472
  • # of FLOPS for a single forward pass: 11.3 billion

Such depths (and breadths) pose formidable computational challenges in training and operating the network!

slide-51
SLIDE 51

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

slide-52
SLIDE 52

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee trivial null-space for the feature extractor Φ

slide-53
SLIDE 53

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee trivial null-space for the feature extractor Φ

Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector

slide-54
SLIDE 54

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee trivial null-space for the feature extractor Φ

Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector

For a fixed (possibly small) depth, design CNNs that capture “most” of the input signal energy

slide-55
SLIDE 55

Building blocks

Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n^{(k)}}, \dots, g_{\lambda_n^{(r)}}$, followed by the modulus $|\cdot|$ and sub-sampling $\downarrow S$.

Filters: semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$
Non-linearity: modulus $|\cdot|$
Pooling: sub-sampling with pooling factor $S \ge 1$

slide-56
SLIDE 56

Demodulation effect of modulus non-linearity

Components of feature vector given by $|f * g_{\lambda_n}| * \chi_{n+1}$

[Spectra: $\hat f(\omega)$, the band-pass filter $\hat g_{\lambda_n}(\omega)$, and the low-pass output filter $\hat\chi_{n+1}(\omega)$]
slide-57
SLIDE 57

Demodulation effect of modulus non-linearity

Components of feature vector given by $|f * g_{\lambda_n}| * \chi_{n+1}$

[Spectra: $\hat f(\omega)$, $\hat g_{\lambda_n}(\omega)$, $\hat\chi_{n+1}(\omega)$, and the band-pass spectrum $\hat f(\omega)\,\hat g_{\lambda_n}(\omega)$]

slide-58
SLIDE 58

Demodulation effect of modulus non-linearity

Components of feature vector given by $|f * g_{\lambda_n}| * \chi_{n+1}$

[Spectra: $\hat f(\omega)$, $\hat g_{\lambda_n}(\omega)$, $\hat\chi_{n+1}(\omega)$, and the band-pass spectrum $\hat f(\omega)\,\hat g_{\lambda_n}(\omega)$]

Modulus squared $|f * g_{\lambda_n}(x)|^2$: spectrum demodulated towards ω = 0

slide-59
SLIDE 59

Demodulation effect of modulus non-linearity

Components of feature vector given by $|f * g_{\lambda_n}| * \chi_{n+1}$

[Spectra: $\hat f(\omega)$, $\hat g_{\lambda_n}(\omega)$, $\hat\chi_{n+1}(\omega)$; the band-pass spectrum $\hat f(\omega)\,\hat g_{\lambda_n}(\omega)$; and the demodulated spectrum $\widehat{|f * g_{\lambda_n}|}(\omega)$, re-centered around ω = 0 and picked up for Φ(f) via $\chi_{n+1}$]

slide-60
SLIDE 60

Do all non-linearities demodulate?

High-pass filtered signal:

[Spectrum $\mathcal{F}(f * g_\lambda)(\omega)$: high-pass, supported in bands of width 2R away from ω = 0]

slide-61
SLIDE 61

Do all non-linearities demodulate?

High-pass filtered signal:

[Spectrum $\mathcal{F}(f * g_\lambda)(\omega)$: high-pass, supported away from ω = 0]

Modulus: Yes!

[Spectrum $|\mathcal{F}(|f * g_\lambda|)(\omega)|$: concentrated around ω = 0]

... but (small) tails!

slide-62
SLIDE 62

Do all non-linearities demodulate?

High-pass filtered signal:

[Spectrum $\mathcal{F}(f * g_\lambda)(\omega)$: high-pass, supported away from ω = 0]

Modulus squared: Yes, and sharply so!

[Spectrum $|\mathcal{F}(|f * g_\lambda|^2)(\omega)|$: supported in [−2R, 2R]]

... but not Lipschitz-continuous!

slide-63
SLIDE 63

Do all non-linearities demodulate?

High-pass filtered signal:

[Spectrum $\mathcal{F}(f * g_\lambda)(\omega)$: high-pass, supported away from ω = 0]

Rectified linear unit: No!

[Spectrum $|\mathcal{F}(\mathrm{ReLU}(f * g_\lambda))(\omega)|$: retains a strong high-pass component]
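A sketch of the demodulation effect (hypothetical band-pass input z standing in for f ∗ g_λ with an analytic filter, so z is complex-valued; the ReLU is applied to its real part): the modulus and the squared modulus move the spectral energy to ω ≈ 0, while the ReLU leaves a strong copy of the original band.

```python
import numpy as np

n = 4096
t = np.arange(n)
# Band-pass "analytic" signal: low-pass envelope modulating a complex carrier.
envelope = np.exp(-((t - n / 2) / (n / 8)) ** 2)
z = envelope * np.exp(2j * np.pi * 0.2 * t)

def low_freq_energy_fraction(y, cutoff=0.05):
    """Fraction of spectral energy of y within |omega| < cutoff (normalized freq.)."""
    spec = np.abs(np.fft.fft(y)) ** 2
    return spec[np.abs(np.fft.fftfreq(n)) < cutoff].sum() / spec.sum()

print("band-pass input :", round(low_freq_energy_fraction(z), 3))               # ~0
print("modulus         :", round(low_freq_energy_fraction(np.abs(z)), 3))       # ~1
print("modulus squared :", round(low_freq_energy_fraction(np.abs(z) ** 2), 3))  # ~1
print("ReLU (real part):", round(low_freq_energy_fraction(np.maximum(z.real, 0)), 3))  # ~0.4
```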

slide-64
SLIDE 64

First goal: Quantify feature map energy decay

[Network tree as on the earlier slides; $W_1(f), W_2(f), \dots$ denote the total energy of the feature maps propagated into layers 1, 2, ...]

slide-65
SLIDE 65

Assumptions (on the filters)

i) Analyticity: For every filter $g_{\lambda_n}$ there exists a (not necessarily canonical) orthant $H_{\lambda_n} \subseteq \mathbb{R}^d$ such that $\mathrm{supp}(\hat g_{\lambda_n}) \subseteq H_{\lambda_n}$.

ii) High-pass: There exists δ > 0 such that
$$\sum_{\lambda_n \in \Lambda_n} |\hat g_{\lambda_n}(\omega)|^2 = 0, \quad \text{a.e. } \omega \in B_\delta(0).$$

slide-66
SLIDE 66

Assumptions (on the filters)

i) Analyticity: For every filter $g_{\lambda_n}$ there exists a (not necessarily canonical) orthant $H_{\lambda_n} \subseteq \mathbb{R}^d$ such that $\mathrm{supp}(\hat g_{\lambda_n}) \subseteq H_{\lambda_n}$.

ii) High-pass: There exists δ > 0 such that
$$\sum_{\lambda_n \in \Lambda_n} |\hat g_{\lambda_n}(\omega)|^2 = 0, \quad \text{a.e. } \omega \in B_\delta(0).$$

⇒ Comprises various constructions of WH filters, wavelets, ridgelets, (α)-curvelets, shearlets

e.g.: analytic band-limited curvelets: [figure: tiling of the (ω₁, ω₂) frequency plane]

slide-67
SLIDE 67

Input signal classes

Sobolev functions of order s ≥ 0:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1+|\omega|^2)^s\, |\hat f(\omega)|^2\, d\omega < \infty \Big\}$$

slide-68
SLIDE 68

Input signal classes

Sobolev functions of order s ≥ 0:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1+|\omega|^2)^s\, |\hat f(\omega)|^2\, d\omega < \infty \Big\}$$

  • $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes
slide-69
SLIDE 69

Input signal classes

Sobolev functions of order s ≥ 0:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1+|\omega|^2)^s\, |\hat f(\omega)|^2\, d\omega < \infty \Big\}$$

  • $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes
  • square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$
slide-70
SLIDE 70

Input signal classes

Sobolev functions of order s ≥ 0:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1+|\omega|^2)^s\, |\hat f(\omega)|^2\, d\omega < \infty \Big\}$$

  • $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes
  • square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$
  • L-band-limited functions $L^2_L(\mathbb{R}^d) \subseteq H^s(\mathbb{R}^d)$, ∀L > 0, ∀s ≥ 0

slide-71
SLIDE 71

Input signal classes

Sobolev functions of order s ≥ 0:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1+|\omega|^2)^s\, |\hat f(\omega)|^2\, d\omega < \infty \Big\}$$

  • $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes
  • square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$
  • L-band-limited functions $L^2_L(\mathbb{R}^d) \subseteq H^s(\mathbb{R}^d)$, ∀L > 0, ∀s ≥ 0
  • cartoon functions [Donoho, 2001] $\mathcal{C}_{\mathrm{CART}} \subseteq H^s(\mathbb{R}^d)$, $\forall s \in [0, \tfrac{1}{2})$

Handwritten digits from MNIST database [LeCun & Cortes, 1998]

slide-72
SLIDE 72

Exponential energy decay

Theorem: Let the filters be wavelets with mother wavelet satisfying $\mathrm{supp}(\hat\psi) \subseteq [r^{-1}, r]$, $r > 1$, or Weyl-Heisenberg (WH) filters with prototype function satisfying $\mathrm{supp}(\hat g) \subseteq [-R, R]$, $R > 0$. Then, for every $f \in H^s(\mathbb{R}^d)$, there exists β > 0 such that

$$W_n(f) = O\!\left(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\right),$$

where $a = \frac{r^2+1}{r^2-1}$ in the wavelet case and $a = \frac{1}{2} + \frac{1}{R}$ in the WH case.

slide-73
SLIDE 73

Exponential energy decay

Theorem: Let the filters be wavelets with mother wavelet satisfying $\mathrm{supp}(\hat\psi) \subseteq [r^{-1}, r]$, $r > 1$, or Weyl-Heisenberg (WH) filters with prototype function satisfying $\mathrm{supp}(\hat g) \subseteq [-R, R]$, $R > 0$. Then, for every $f \in H^s(\mathbb{R}^d)$, there exists β > 0 such that

$$W_n(f) = O\!\left(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\right),$$

where $a = \frac{r^2+1}{r^2-1}$ in the wavelet case and $a = \frac{1}{2} + \frac{1}{R}$ in the WH case.

⇒ decay factor a is explicit and can be tuned via r, R

slide-74
SLIDE 74

Exponential energy decay

Exponential energy decay: $W_n(f) = O\big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\big)$

slide-75
SLIDE 75

Exponential energy decay

Exponential energy decay: $W_n(f) = O\big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\big)$

  • β > 0 determines the decay of $\hat f(\omega)$ (as |ω| → ∞) according to $|\hat f(\omega)| \le \mu\,(1+|\omega|^2)^{-(\frac{s}{2} + \frac{1}{4} + \frac{\beta}{4})}$, ∀ |ω| ≥ L, for some µ > 0, and L acts as an “effective bandwidth”

slide-76
SLIDE 76

Exponential energy decay

Exponential energy decay: $W_n(f) = O\big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\big)$

  • β > 0 determines the decay of $\hat f(\omega)$ (as |ω| → ∞) according to $|\hat f(\omega)| \le \mu\,(1+|\omega|^2)^{-(\frac{s}{2} + \frac{1}{4} + \frac{\beta}{4})}$, ∀ |ω| ≥ L, for some µ > 0, and L acts as an “effective bandwidth”

  • smoother input signals (i.e., s↑) lead to faster energy decay
slide-77
SLIDE 77

Exponential energy decay

Exponential energy decay: $W_n(f) = O\big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\big)$

  • β > 0 determines the decay of $\hat f(\omega)$ (as |ω| → ∞) according to $|\hat f(\omega)| \le \mu\,(1+|\omega|^2)^{-(\frac{s}{2} + \frac{1}{4} + \frac{\beta}{4})}$, ∀ |ω| ≥ L, for some µ > 0, and L acts as an “effective bandwidth”
  • smoother input signals (i.e., s↑) lead to faster energy decay
  • pooling through sub-sampling $f \mapsto S^{1/2} f(S\,\cdot)$ leads to decay factor $\frac{a}{S}$

slide-78
SLIDE 78

Exponential energy decay

Exponential energy decay: $W_n(f) = O\big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\big)$

  • β > 0 determines the decay of $\hat f(\omega)$ (as |ω| → ∞) according to $|\hat f(\omega)| \le \mu\,(1+|\omega|^2)^{-(\frac{s}{2} + \frac{1}{4} + \frac{\beta}{4})}$, ∀ |ω| ≥ L, for some µ > 0, and L acts as an “effective bandwidth”
  • smoother input signals (i.e., s↑) lead to faster energy decay
  • pooling through sub-sampling $f \mapsto S^{1/2} f(S\,\cdot)$ leads to decay factor $\frac{a}{S}$

What about general filters? ⇒ polynomial energy decay!

slide-79
SLIDE 79

... our second goal ... trivial null-space for Φ

Why trivial null-space?

[Feature space sketch: hyperplane with normal w separating $\langle w, \Phi(f)\rangle > 0$ from $\langle w, \Phi(f)\rangle < 0$]

slide-80
SLIDE 80

... our second goal ... trivial null-space for Φ

Why trivial null-space?

[Feature space sketch: $\Phi(f^*)$ lies on the separating hyperplane]

Non-trivial null-space: ∃ f* ≠ 0 such that Φ(f*) = 0 ⇒ $\langle w, \Phi(f^*)\rangle = 0$ for all w! ⇒ these f* become unclassifiable!

slide-81
SLIDE 81

... our second goal ...

Trivial null-space for feature extractor: $\{ f \in L^2(\mathbb{R}^d) \mid \Phi(f) = 0 \} = \{0\}$

Feature extractor $\Phi(\cdot) = \bigcup_{n=0}^{\infty} \Phi_n(\cdot)$ shall satisfy

$$A\|f\|_2^2 \;\le\; |||\Phi(f)|||^2 \;\le\; B\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d),$$

for some A, B > 0.

slide-82
SLIDE 82

“Energy conservation”

Theorem: For the upper frame bounds $\{B_n\}_{n\in\mathbb{N}}$ and lower frame bounds $\{A_n\}_{n\in\mathbb{N}}$, define $B := \prod_{n=1}^{\infty} \max\{1, B_n\}$ and $A := \prod_{n=1}^{\infty} \min\{1, A_n\}$. If $0 < A \le B < \infty$, then

$$A\|f\|_2^2 \;\le\; |||\Phi(f)|||^2 \;\le\; B\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

slide-83
SLIDE 83

“Energy conservation”

Theorem: For the upper frame bounds $\{B_n\}_{n\in\mathbb{N}}$ and lower frame bounds $\{A_n\}_{n\in\mathbb{N}}$, define $B := \prod_{n=1}^{\infty} \max\{1, B_n\}$ and $A := \prod_{n=1}^{\infty} \min\{1, A_n\}$. If $0 < A \le B < \infty$, then

$$A\|f\|_2^2 \;\le\; |||\Phi(f)|||^2 \;\le\; B\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

  • For Parseval frames (i.e., $A_n = B_n = 1$, $n \in \mathbb{N}$), this yields $|||\Phi(f)|||^2 = \|f\|_2^2$

slide-84
SLIDE 84

“Energy conservation”

Theorem: For the upper frame bounds $\{B_n\}_{n\in\mathbb{N}}$ and lower frame bounds $\{A_n\}_{n\in\mathbb{N}}$, define $B := \prod_{n=1}^{\infty} \max\{1, B_n\}$ and $A := \prod_{n=1}^{\infty} \min\{1, A_n\}$. If $0 < A \le B < \infty$, then

$$A\|f\|_2^2 \;\le\; |||\Phi(f)|||^2 \;\le\; B\|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

  • For Parseval frames (i.e., $A_n = B_n = 1$, $n \in \mathbb{N}$), this yields $|||\Phi(f)|||^2 = \|f\|_2^2$
  • Connection to energy decay: $\|f\|_2^2 = \sum_{k=0}^{n-1} |||\Phi_k(f)|||^2 + W_n(f)$, with $W_n(f) \to 0$

slide-85
SLIDE 85

... and our third goal ...

For a given CNN, specify the number of layers needed to capture “most” of the input signal energy

slide-86
SLIDE 86

... and our third goal ...

For a given CNN, specify the number of layers needed to capture “most” of the input signal energy

How many layers n are needed to have at least $((1-\varepsilon)\cdot 100)\%$ of the input signal energy be contained in the feature vector, i.e.,

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{n} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

slide-87
SLIDE 87

Number of layers needed

Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal f be L-band-limited, and let $\varepsilon \in (0,1)$. If

$$n \;\ge\; \log_a\!\left(\frac{L}{1 - \sqrt{1-\varepsilon}}\right),$$

then

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{n} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2.$$

slide-88
SLIDE 88

Number of layers needed

Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal f be L-band-limited, and let $\varepsilon \in (0,1)$. If

$$n \;\ge\; \log_a\!\left(\frac{L}{1 - \sqrt{1-\varepsilon}}\right),$$

then

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{n} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2.$$

⇒ also guarantees trivial null-space for the depth-n feature extractor $\bigcup_{k=0}^{n} \Phi_k(\cdot)$

slide-89
SLIDE 89

Number of layers needed

Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal f be L-band-limited, and let $\varepsilon \in (0,1)$. If

$$n \;\ge\; \log_a\!\left(\frac{L}{1 - \sqrt{1-\varepsilon}}\right),$$

then

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{n} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2.$$

  • lower bound depends on
  • description complexity of input signals (i.e., bandwidth L)
  • decay factor (wavelets: $a = \frac{r^2+1}{r^2-1}$, WH filters: $a = \frac{1}{2} + \frac{1}{R}$)

slide-90
SLIDE 90

Number of layers needed

Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal f be L-band-limited, and let $\varepsilon \in (0,1)$. If

$$n \;\ge\; \log_a\!\left(\frac{L}{1 - \sqrt{1-\varepsilon}}\right),$$

then

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{n} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2.$$

  • lower bound depends on
  • description complexity of input signals (i.e., bandwidth L)
  • decay factor (wavelets: $a = \frac{r^2+1}{r^2-1}$, WH filters: $a = \frac{1}{2} + \frac{1}{R}$)
  • similar estimates for Sobolev input signals and for general filters (polynomial decay!)

slide-91
SLIDE 91

Number of layers needed

Numerical example for bandwidth L = 1:

(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)        2     3      4     6      8     11
WH filters (R = 1)      2     4      5     8     10     14
general filters         2     3      7    19     39    199
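The wavelet and WH rows follow directly from $n \ge \log_a\big(L/(1-\sqrt{1-\varepsilon})\big)$ with the decay factors stated above; a short script reproducing them (the general-filter row comes from the polynomial-decay estimate and is not recomputed here):

```python
import math

eps_grid = [0.75, 0.5, 0.25, 0.1, 0.05, 0.01]   # corresponds to (1 - eps) = 0.25 ... 0.99

def layers_needed(a, eps, L=1.0):
    """Smallest n with n >= log_a( L / (1 - sqrt(1 - eps)) )."""
    return math.ceil(math.log(L / (1.0 - math.sqrt(1.0 - eps)), a))

r, R = 2.0, 1.0
a_wavelet = (r ** 2 + 1) / (r ** 2 - 1)   # 5/3
a_wh = 0.5 + 1.0 / R                      # 3/2

print("wavelets  :", [layers_needed(a_wavelet, e) for e in eps_grid])  # [2, 3, 4, 6, 8, 11]
print("WH filters:", [layers_needed(a_wh, e) for e in eps_grid])       # [2, 4, 5, 8, 10, 14]
```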

slide-92
SLIDE 92

Number of layers needed

Numerical example for bandwidth L = 1:

(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)        2     3      4     6      8     11
WH filters (R = 1)      2     4      5     8     10     14
general filters         2     3      7    19     39    199

slide-93
SLIDE 93

Number of layers needed

Numerical example for bandwidth L = 1:

(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)        2     3      4     6      8     11
WH filters (R = 1)      2     4      5     8     10     14
general filters         2     3      7    19     39    199

Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]

  • Network depth: 152 layers
  • average # of nodes per layer: 472
  • # of FLOPS for a single forward pass: 11.3 billion
slide-94
SLIDE 94

... our fourth and last goal ...

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy

slide-95
SLIDE 95

... our fourth and last goal ...

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy

Recall: Let the filters be wavelets with mother wavelet satisfying $\mathrm{supp}(\hat\psi) \subseteq [r^{-1}, r]$, $r > 1$, or Weyl-Heisenberg filters with prototype function satisfying $\mathrm{supp}(\hat g) \subseteq [-R, R]$, $R > 0$.

slide-96
SLIDE 96

... our fourth and last goal ...

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy

For fixed depth N, we want to choose r in the wavelet case and R in the WH case so that

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{N} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

slide-97
SLIDE 97

Depth-constrained networks

Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal f be L-band-limited, and fix $\varepsilon \in (0,1)$ and $N \in \mathbb{N}$. If, in the wavelet case,

$$1 < r \le \left(\frac{\kappa+1}{\kappa-1}\right)^{1/2},$$

or, in the WH case,

$$0 < R \le \frac{1}{\kappa - \frac{1}{2}},$$

where $\kappa := \left(\frac{L}{1-\sqrt{1-\varepsilon}}\right)^{1/N}$, then

$$(1-\varepsilon)\|f\|_2^2 \;\le\; \sum_{k=0}^{N} |||\Phi_k(f)|||^2 \;\le\; \|f\|_2^2.$$
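A quick evaluation of this design rule (a sketch, assuming L = 1 and ε = 0.01 as in the earlier table): for a prescribed depth N, compute κ and from it the admissible mother-wavelet bandwidth r and WH support R.

```python
import math

def design(N, L=1.0, eps=0.01):
    """Admissible r (wavelets) and R (WH filters) for a prescribed depth N."""
    kappa = (L / (1.0 - math.sqrt(1.0 - eps))) ** (1.0 / N)
    r_max = math.sqrt((kappa + 1.0) / (kappa - 1.0))   # wavelet: 1 < r <= r_max
    R_max = 1.0 / (kappa - 0.5)                        # WH:      0 < R <= R_max
    return kappa, r_max, R_max

for N in (3, 5, 10):
    kappa, r_max, R_max = design(N)
    print(f"N={N}:  kappa={kappa:.2f}  r <= {r_max:.2f}  R <= {R_max:.2f}")
```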

slide-98
SLIDE 98

Depth-width tradeoff

Spectral supports of wavelet filters:

[Figure: supports of $\hat\psi, \hat g_1, \hat g_2, \hat g_3$ along the ω-axis at scales $1/r, 1, r, r^2, r^3$, covering the band up to L]
slide-99
SLIDE 99

Depth-width tradeoff

Spectral supports of wavelet filters:

[Figure: supports of $\hat\psi, \hat g_1, \hat g_2, \hat g_3$ along the ω-axis at scales $1/r, 1, r, r^2, r^3$, covering the band up to L]

Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet

slide-100
SLIDE 100

Depth-width tradeoff

Spectral supports of wavelet filters:

[Figure: supports of $\hat\psi, \hat g_1, \hat g_2, \hat g_3$ along the ω-axis at scales $1/r, 1, r, r^2, r^3$, covering the band up to L]

Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal

slide-101
SLIDE 101

Depth-width tradeoff

Spectral supports of wavelet filters:

[Figure: supports of $\hat\psi, \hat g_1, \hat g_2, \hat g_3$ along the ω-axis at scales $1/r, 1, r, r^2, r^3$, covering the band up to L]

Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal ⇒ larger number of filters in the first layer

slide-102
SLIDE 102

Depth-width tradeoff

Spectral supports of wavelet filters:

[Figure: supports of $\hat\psi, \hat g_1, \hat g_2, \hat g_3$ along the ω-axis at scales $1/r, 1, r, r^2, r^3$, covering the band up to L]

Smaller depth N ⇒ smaller “bandwidth” r of mother wavelet ⇒ larger number of wavelets (O(logr(L))) to cover the spectral support [−L, L] of input signal ⇒ larger number of filters in the first layer ⇒ depth-width tradeoff

slide-103
SLIDE 103

Yours truly

slide-104
SLIDE 104

Experiment: Handwritten digit classification

  • Dataset: MNIST database of handwritten digits [LeCun &

Cortes, 1998]; 60,000 training and 10,000 test images

  • Φ-network: D = 3 layers; same filters, non-linearities, and

pooling operators in all layers

  • Classifier: SVM with radial basis function kernel [Vapnik, 1995]
  • Dimensionality reduction: Supervised orthogonal least squares

scheme [Chen et al., 1991]
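Not the code used for these results; a minimal sketch of a pipeline in this spirit, with a 2-level Haar wavelet transform and modulus standing in for the Φ-network and an RBF-kernel SVM as classifier (requires scikit-learn and PyWavelets; the dimensionality-reduction step is omitted, and small subsets are used to keep the run short):

```python
import numpy as np
import pywt
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

def haar_scattering_features(img, levels=2):
    """Toy Phi: iterate (Haar wavelet transform -> modulus), collect low-pass outputs."""
    feats, maps = [], [img]
    for _ in range(levels):
        new_maps = []
        for u in maps:
            cA, (cH, cV, cD) = pywt.dwt2(u, "haar")
            feats.append(cA.ravel())                          # output-generating low-pass
            new_maps += [np.abs(c) for c in (cH, cV, cD)]     # modulus of band-pass maps
        maps = new_maps
    feats += [u.ravel() for u in maps]
    return np.concatenate(feats)

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X.reshape(-1, 28, 28) / 255.0

# Small subsets keep the sketch fast; the experiment in the talk uses all 60k/10k images.
idx_train, idx_test = np.arange(10000), np.arange(60000, 62000)
F_train = np.array([haar_scattering_features(img) for img in X[idx_train]])
F_test = np.array([haar_scattering_features(img) for img in X[idx_test]])

clf = SVC(kernel="rbf").fit(F_train, y[idx_train])
print("test error: %.2f%%" % (100 * (1 - clf.score(F_test, y[idx_test]))))
```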

slide-105
SLIDE 105

Experiment: Handwritten digit classification

Classification error in percent:

                 Haar wavelet                    Bi-orthogonal wavelet
         abs    ReLU   tanh   LogSig     abs    ReLU   tanh   LogSig
n.p.     0.57   0.57   1.35   1.49       0.51   0.57   1.12   1.22
sub.     0.69   0.66   1.25   1.46       0.61   0.61   1.20   1.18
max.     0.58   0.65   0.75   0.74       0.52   0.64   0.78   0.73
avg.     0.55   0.60   1.27   1.35       0.58   0.59   1.07   1.26

slide-106
SLIDE 106

Experiment: Handwritten digit classification

Classification error in percent:

                 Haar wavelet                    Bi-orthogonal wavelet
         abs    ReLU   tanh   LogSig     abs    ReLU   tanh   LogSig
n.p.     0.57   0.57   1.35   1.49       0.51   0.57   1.12   1.22
sub.     0.69   0.66   1.25   1.46       0.61   0.61   1.20   1.18
max.     0.58   0.65   0.75   0.74       0.52   0.64   0.78   0.73
avg.     0.55   0.60   1.27   1.35       0.58   0.59   1.07   1.26

  • modulus and ReLU perform better than tanh and LogSig
slide-107
SLIDE 107

Experiment: Handwritten digit classification

Classification error in percent:

                 Haar wavelet                    Bi-orthogonal wavelet
         abs    ReLU   tanh   LogSig     abs    ReLU   tanh   LogSig
n.p.     0.57   0.57   1.35   1.49       0.51   0.57   1.12   1.22
sub.     0.69   0.66   1.25   1.46       0.61   0.61   1.20   1.18
max.     0.58   0.65   0.75   0.74       0.52   0.64   0.78   0.73
avg.     0.55   0.60   1.27   1.35       0.58   0.59   1.07   1.26

  • modulus and ReLU perform better than tanh and LogSig
  • results with pooling (S = 2) are competitive with those without pooling, at significantly lower computational cost

slide-108
SLIDE 108

Experiment: Handwritten digit classification

Classification error in percent:

                 Haar wavelet                    Bi-orthogonal wavelet
         abs    ReLU   tanh   LogSig     abs    ReLU   tanh   LogSig
n.p.     0.57   0.57   1.35   1.49       0.51   0.57   1.12   1.22
sub.     0.69   0.66   1.25   1.46       0.61   0.61   1.20   1.18
max.     0.58   0.65   0.75   0.74       0.52   0.64   0.78   0.73
avg.     0.55   0.60   1.27   1.35       0.58   0.59   1.07   1.26

  • modulus and ReLU perform better than tanh and LogSig
  • results with pooling (S = 2) are competitive with those without pooling, at significantly lower computational cost

  • state-of-the-art: 0.43 [Bruna and Mallat, 2013]
  • similar feature extraction network with directional, non-separable

wavelets and no pooling

  • significantly higher computational complexity
slide-109
SLIDE 109

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • 1-D wavelet filters
  • every network layer equipped with the same set of wavelets
slide-110
SLIDE 110

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • 1-D wavelet filters
  • every network layer equipped with the same set of wavelets
  • vanishing moments condition on the mother wavelet
slide-111
SLIDE 111

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • 1-D wavelet filters
  • every network layer equipped with the same set of wavelets
  • vanishing moments condition on the mother wavelet
  • applies to 1-D real-valued band-limited input signals f ∈ L2(R)
slide-112
SLIDE 112

Energy decay: Related work

[Czaja and Li, 2016]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)

  • every network layer equipped with the same set of filters
slide-113
SLIDE 113

Energy decay: Related work

[Czaja and Li, 2016]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)

  • every network layer equipped with the same set of filters
  • analyticity and vanishing moments conditions on the filters
slide-114
SLIDE 114

Energy decay: Related work

[Czaja and Li, 2016]: Exponential energy decay Wn(f) = O(a−n), for some unspecified a > 1.

  • d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)

  • every network layer equipped with the same set of filters
  • analyticity and vanishing moments conditions on the filters
  • applies to d-dimensional complex-valued input signals

f ∈ L2(Rd)