

SLIDE 1

Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

Zhiyuan Li

Joint work with Sanjeev Arora, Yi Zhang

Princeton University

August 19, 2020 @ IJTCS


SLIDE 2

Introduction

Table of Contents

1. Introduction
2. Intuition and Warm-up example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms

SLIDES 3-8

Introduction

Introduction

- CNNs (convolutional neural networks) often perform better than their fully-connected counterparts, FC nets, especially on vision tasks.
- Not an issue of expressiveness: FC nets easily reach full training accuracy, yet still generalize poorly.
- Often explained by a "better inductive bias". Ex: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm.
- Question: Can we justify this rigorously by showing a sample complexity separation?
- Since ultra-wide FC nets can simulate any CNN, the hurdle is to show that (S)GD on FC nets does not find those CNNs with good generalization.

This Work: A single distribution + a single target function which can be learnt by a CNN with constantly many samples, but for which SGD on FC nets of any depth and width requires Ω(d²) samples.

SLIDES 9-15

Introduction

Setting

- Binary classification: $\mathcal{Y} = \{-1, 1\}$; data domain $\mathcal{X} = \mathbb{R}^d$.
- Joint distribution $P$ supported on $\mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \{-1, 1\}$. In this talk, $P_{Y|X}$ is always a deterministic function $h^* : \mathbb{R}^d \to \{-1, 1\}$, i.e. $P = P_{\mathcal{X}} \diamond h^*$.
- A learning algorithm $\mathcal{A}$ maps a sequence of training data $\{x_i, y_i\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ to a hypothesis $\mathcal{A}(\{x_i, y_i\}_{i=1}^n) \in \mathcal{Y}^{\mathcal{X}}$. $\mathcal{A}$ may also be random.

Two examples, kernel regression and ERM (empirical risk minimization):

$$\mathrm{REG}_K(\{x_i, y_i\}_{i=1}^n)(x) := \mathbb{1}\left[K(x, X_n) \cdot K(X_n, X_n)^{\dagger} y \ge 0\right],$$

$$\mathrm{ERM}_{\mathcal{H}}(\{x_i, y_i\}_{i=1}^n) = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^n \mathbb{1}\left[h(x_i) \neq y_i\right].^1$$

¹ Strictly speaking, ERM_H is not a well-defined algorithm. In this talk, we consider the worst performance over all the empirical risk minimizers in H.
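To make the two running examples concrete, here is a minimal NumPy sketch of REG_K (an illustration added to this transcript, not the speaker's code; the choice of kernel and the {−1, +1} output convention for the indicator are assumptions):

```python
import numpy as np

def reg_k(K, X_train, y_train):
    """REG_K sketch: predicts 1[K(x, X_n) K(X_n, X_n)^+ y >= 0],
    mapped to {-1, +1}. K(A, B) returns the Gram matrix between
    the rows of A and the rows of B."""
    alpha = np.linalg.pinv(K(X_train, X_train)) @ y_train
    def h(X_test):
        return np.where(K(X_test, X_train) @ alpha >= 0, 1, -1)
    return h

# Usage with the linear (inner-product) kernel K(x, y) = <x, y>:
linear_kernel = lambda A, B: A @ B.T
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.where(X[:, 0] >= 0, 1, -1)
h = reg_k(linear_kernel, X, y)
print("training accuracy:", (h(X) == y).mean())
```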

SLIDES 16-17

Introduction

Setting

$\mathrm{err}_P(h) = \mathbb{P}_{(X,Y) \sim P}\left[h(X) \neq Y\right]$.

Sample complexity for a single joint distribution P:
- The (ε, δ)-sample complexity, denoted $N(\mathcal{A}, P, \varepsilon, \delta)$, is the smallest number n such that w.p. 1 − δ over the randomness of $\{x_i, y_i\}_{i=1}^n$, $\mathrm{err}_P(\mathcal{A}(\{x_i, y_i\}_{i=1}^n)) \le \varepsilon$.
- The ε-expected sample complexity, $N^*(\mathcal{A}, P, \varepsilon)$, is the smallest number n such that $\mathbb{E}_{(x_i, y_i) \sim P}\left[\mathrm{err}_P(\mathcal{A}(\{x_i, y_i\}_{i=1}^n))\right] \le \varepsilon$.

Sample complexity for a family of distributions $\mathcal{P}$:
$$N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta) = \max_{P \in \mathcal{P}} N(\mathcal{A}, P, \varepsilon, \delta); \qquad N^*(\mathcal{A}, \mathcal{P}, \varepsilon) = \max_{P \in \mathcal{P}} N^*(\mathcal{A}, P, \varepsilon).$$

Fact: $N^*(\mathcal{A}, \mathcal{P}, \varepsilon + \delta) \le N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta) \le N^*(\mathcal{A}, \mathcal{P}, \varepsilon\delta)$, ∀ε, δ ∈ [0, 1]. (The left inequality uses $\mathrm{err}_P \le 1$; the right one follows from Markov's inequality.)

SLIDES 18-21

Introduction

Parametric Models

A parametric model $M : \mathcal{W} \to \mathcal{Y}^{\mathcal{X}}$ is a functional mapping from a weight $W$ to a hypothesis $M(W) : \mathcal{X} \to \mathcal{Y}$.

Fully-connected (FC) neural networks, $\mathbb{R}^d \to \mathbb{R}$:
$$\mathrm{FC\text{-}NN}[W](x) = W_L \sigma(W_{L-1} \cdots \sigma(W_2 \sigma(W_1 x + b_1) + b_2) \cdots + b_{L-1}) + b_L,$$
where $W = (\{W_i\}_{i=1}^L, \{b_i\}_{i=1}^L)$, $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $b_i \in \mathbb{R}^{d_i}$, $d_0 = d$, and $d_L = 1$. Here $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function, and we abuse notation so that σ is also defined for vector inputs, i.e. $[\sigma(x)]_i = \sigma(x_i)$.

Convolutional neural networks (CNN), $\mathbb{R}^d \to \mathbb{R}$:
$$\mathrm{CNN}[W](x) = \sum_{i=1}^{r} a_i\, \sigma\!\left([w * x]_{d'(i-1)+1 \,:\, d'i}\right) + b,$$
where $W = (w, a, b) \in \mathbb{R}^k \times \mathbb{R}^r \times \mathbb{R}$ and $d = d'r$. Here $* : \mathbb{R}^k \times \mathbb{R}^d \to \mathbb{R}^d$ is the (circular) convolution operator, defined as $[w * x]_i = \sum_{j=1}^{k} w_j x_{[(i-j-1) \bmod d] + 1}$, and $\sigma : \mathbb{R}^{d'} \to \mathbb{R}$ is the composition of pooling and an element-wise non-linearity.

It is not possible to separate every learning algorithm on FC nets from CNNs, since FC nets can simulate CNNs.

Question: What property of SGD prevents it from finding CNNs among FC nets?
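The two parametric models above can be sketched directly from their definitions. The following NumPy code is an illustration added here (not the speaker's implementation); the tanh activation and mean-of-ReLU pooling are arbitrary choices for σ:

```python
import numpy as np

def fc_nn(Ws, bs, x, sigma=np.tanh):
    """FC-NN[W](x): Ws[i] has shape (d_i, d_{i-1}), bs[i] has shape
    (d_i,); the last layer is affine with no nonlinearity."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = sigma(W @ h + b)
    return Ws[-1] @ h + bs[-1]

def circ_conv(w, x):
    """Circular convolution: [w*x]_i = sum_j w_j x_{((i-j-1) mod d)+1},
    written 0-indexed below."""
    d, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i - j - 1) % d] for j in range(k))
                     for i in range(d)])

def cnn(w, a, b, x, sigma=lambda z: np.mean(np.maximum(z, 0.0))):
    """CNN[W](x) = sum_i a_i sigma([w*x]_{patch i}) + b, where
    sigma: R^{d'} -> R composes a nonlinearity with pooling."""
    r = len(a)
    d_prime = len(x) // r
    patches = circ_conv(w, x).reshape(r, d_prime)
    return sum(a[i] * sigma(patches[i]) for i in range(r)) + b

# Smoke test: d = 12, filter size k = 3, r = 4 patches.
rng = np.random.default_rng(0)
print(cnn(rng.normal(size=3), rng.normal(size=4), 0.1, rng.normal(size=12)))
```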

SLIDE 22

Intuition and Warm-up example

Table of Contents

1. Introduction
2. Intuition and Warm-up example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms

SLIDES 23-25

Intuition and Warm-up example

Key Intuition: Equivariance

Definition (Equivariant Algorithms). A learning algorithm $\mathcal{A}$ is $G_{\mathcal{X}}$-equivariant iff for any dataset $\{x_i, y_i\}_{i=1}^n$ and ∀$g \in G_{\mathcal{X}}$, $x \in \mathcal{X}$,
$$\mathcal{A}(\{g(x_i), y_i\}_{i=1}^n)(g(x)) \stackrel{d}{=} \left[\mathcal{A}(\{x_i, y_i\}_{i=1}^n)\right](x),$$
where $\stackrel{d}{=}$ denotes equality in distribution.

- SGD for FC nets is O(d)-equivariant (a.k.a. orthogonal/rotation equivariant).
- Algorithmic equivariance constraints lead to sample complexity lower bounds.
- Convolution and pooling layers in CNNs break these constraints.

SLIDES 26-31

Intuition and Warm-up example

Warmup: an Ω(d) lower bound against orthogonal equivariant algorithms

$\mathcal{X} = \mathbb{R}^d$, $P_c = \mathrm{Unif}\{(e_i y,\, cy) \mid i \in [d],\, y = \pm 1\}$, $c \in \{-1, 1\}$.

- With global average pooling: only 1 sample is required.
- For an orthogonal equivariant $\mathcal{A}$: ∀$R \in O(d)$, $\mathcal{A}(\{(Rx_i, y_i)\}_{i=1}^n)(Rx) = \mathcal{A}(\{(x_i, y_i)\}_{i=1}^n)(x)$.
- Let $S = \{x_i, y_i\}_{i=1}^n$. Then $\mathcal{A}(S)(x) = f_S(x^\top x_1, \ldots, x^\top x_n)$, i.e.
$$(x^\top x_1, \ldots, x^\top x_n) = (x'^\top x_1, \ldots, x'^\top x_n) \implies \mathcal{A}(S)(x) = \mathcal{A}(S)(x').$$
- When $n \le \frac{d}{2}$: w.p. $\frac{1}{2}$ over a fresh test point, $\mathcal{A}(S)(x) = f_S(0, \ldots, 0)$,
- ⟹ error at least $\frac{1}{2}$ w.p. $\frac{1}{2}$.
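A quick numeric illustration of this warm-up (a sketch added here; the two learners and all names are illustrative): a "global average pooling" learner recovers c from a single sample because sum(e_i y) = y, while a linear-kernel regression, which is orthogonal equivariant, outputs a constant on every coordinate it has never seen:

```python
import numpy as np
rng = np.random.default_rng(0)

d, n, c = 50, 10, -1                      # n <= d/2 training samples
idx = rng.integers(0, d, size=n)
ys  = rng.choice([-1, 1], size=n)
X   = np.eye(d)[idx] * ys[:, None]        # x_i = e_{idx_i} * y_i
lab = c * ys                              # labels c * y

# Global-average-pooling learner: one sample pins down c = label * sum(x).
c_hat = lab[0] * X[0].sum()
gap_h = lambda Z: np.where(c_hat * Z.sum(axis=1) >= 0, 1, -1)

# Orthogonal equivariant learner (linear-kernel regression): it only sees
# inner products x^T x_i, which are all zero on unseen coordinates.
alpha = np.linalg.pinv(X @ X.T) @ lab
ker_h = lambda Z: np.where(Z @ X.T @ alpha >= 0, 1, -1)

# Test on coordinates not present in the training set:
unseen = np.setdiff1d(np.arange(d), idx)
yt = rng.choice([-1, 1], size=len(unseen))
Zt = np.eye(d)[unseen] * yt[:, None]
print("GAP accuracy:   ", (gap_h(Zt) == c * yt).mean())   # 1.0
print("kernel accuracy:", (ker_h(Zt) == c * yt).mean())   # ~0.5 (constant output)
```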

SLIDES 32-34

Intuition and Warm-up example

Related Work

- [DWZ+18] proved Θ(#filter size) worst-case sample complexity for two-layer CNNs, better than the folklore Ω(d) lower bound for the linear function class. Not a sample complexity separation, as their upper and lower bounds are proved on different classes of tasks.
- [Ng04] showed that every orthogonal equivariant algorithm requires Ω(d) samples to learn a fixed linear function for all distributions. However, this doesn't imply a sample complexity separation between FC nets and CNNs on image distributions or other natural distributions.
- Recently, there has been progress in showing lower bounds against learning with kernels. [WLLM19] constructed a single task on which they proved a sample complexity separation between learning with neural networks vs. with neural tangent kernels [JGH18]. Relatedly, [AZL19] showed a sample complexity lower bound against all kernels for a family of tasks, i.e., learning k-XOR on the hypercube.

SLIDES 35-43

Intuition and Warm-up example

Our Contributions

- Prove orthogonal/permutation equivariance for a broad class of gradient-based methods for FC nets.
- Identify a sufficient condition, easy to check, for general iterative algorithms to be equivariant.
- Sample complexity lower bounds for equivariant algorithms via reduction:
  - Ω(d²/ε) lower bound for O(d)-equivariance, all distributions, and a single quadratic function.
  - Ω(d²) lower bound for O(d)-equivariance, a single Gaussian distribution, and a single quadratic function. Ex: CIFAR-10 has 50k images of size 32 × 32 × 3. To learn whether the red channel or the green channel has larger signal strength (in the ℓ2 sense), FC nets need around 32⁴ ≈ 1M images if the image distribution is complex enough, e.g. close to i.i.d. Gaussian.
  - Ω(d) lower bound for permutation equivariance, a single distribution and a single function.
- All the above problems can be learnt by a simple 2-layer CNN with GD using O(1) samples.

SLIDE 44

Identifying Algorithmic Equivariance

Table of Contents

1. Introduction
2. Intuition and Warm-up example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms

SLIDE 45

Identifying Algorithmic Equivariance

Iterative Algorithms on Parametric Models

Algorithm 1: Iterative algorithm A
  Input: initial parameter distribution $P_{\mathrm{init}}$ supported in $\mathcal{W} = \mathbb{R}^m$; total iterations T; training dataset $\{x_i, y_i\}_{i=1}^n$; parametric model $M : \mathcal{W} \to \mathcal{Y}^{\mathcal{X}}$; (possibly random) iterative update rule $F(W, M, \{x_i, y_i\}_{i=1}^n)$.
  Output: hypothesis $h : \mathcal{X} \to \mathcal{Y}$.
  Sample $W^{(0)} \sim P_{\mathrm{init}}$.
  for t = 0 to T − 1 do
    $W^{(t+1)} = F(W^{(t)}, M, \{x_i, y_i\}_{i=1}^n)$.
  return $h = \mathrm{sign}\left(M[W^{(T)}]\right)$.

Examples (gradient-based iterative algorithms):
- SGD (+ ℓ2 regularization) (+ BatchNorm)
- SGD + Momentum / Adam / AdaGrad (with the history-dependent update $W^{(t+1)} = F(\{W^{(t')}\}_{t'=1}^{t}, M, \{x_i, y_i\}_{i=1}^n)$)
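Algorithm 1 is easy to mirror in code. Below is a minimal Python sketch added for illustration (`sample_init`, `update_rule`, and `model` are hypothetical callables standing in for P_init, F, and M; the logistic-loss GD rule and the linear model are arbitrary instantiations):

```python
import numpy as np

def iterative_algorithm(sample_init, update_rule, model, data, T, rng):
    """Algorithm 1: draw W^(0) ~ P_init, iterate W^(t+1) = F(W^(t), M, data),
    and return h = sign(M[W^(T)])."""
    W = sample_init(rng)
    for _ in range(T):
        W = update_rule(W, model, data)
    f = model(W)
    return lambda x: np.sign(f(x))

def gd_update(W, model, data, eta=0.1):
    """One instantiation of F: full-batch GD on logistic loss for a
    linear model M(W)(x) = <W, x>."""
    X, y = data
    grad = -(y / (1 + np.exp(y * (X @ W)))) @ X / len(y)
    return W - eta * grad

# Usage:
model = lambda W: (lambda x: x @ W)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
h = iterative_algorithm(lambda r: r.normal(size=4), gd_update,
                        model, (X, y), T=200, rng=rng)
print("training accuracy:", (h(X) == y).mean())
```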

SLIDES 46-50

Identifying Algorithmic Equivariance

Gradient Descent for FC Nets

$$\mathrm{FC\text{-}NN}[W](x) = W_L \sigma(W_{L-1} \cdots \sigma(W_2 \sigma(W_1 x + b_1) + b_2) \cdots + b_{L-1}) + b_L.$$

Algorithm 2: Gradient descent for FC-NN (FC networks)
  Input: initial parameter distribution $P_{\mathrm{init}}$; total iterations T; training dataset $\{x_i, y_i\}_{i=1}^n$; loss function ℓ.
  Sample $W^{(0)} \sim P_{\mathrm{init}}$.
  for t = 0 to T − 1 do
    $W^{(t+1)} = W^{(t)} - \eta \sum_{i=1}^n \nabla \ell\left(\mathrm{FC\text{-}NN}[W^{(t)}](x_i), y_i\right)$.
  return $h = \mathrm{sign}\left(\mathrm{FC\text{-}NN}[W^{(T)}]\right)$.

Goal: $\mathrm{FC\text{-}NN}[\widetilde{W}^{(t)}](Rx) = \mathrm{FC\text{-}NN}[W^{(t)}](x)$, where $\widetilde{W}$ is trained on $\{Rx_i\}$ and $W$ on $\{x_i\}$.

Claim (writing $W_{-1}$ for all parameters other than the first layer $W_1$):
$$\widetilde{W}_1^{(0)} = W_1^{(0)} R^{-1},\ \ \widetilde{W}_{-1}^{(0)} = W_{-1}^{(0)} \implies \widetilde{W}_1^{(t)} = W_1^{(t)} R^{-1},\ \ \widetilde{W}_{-1}^{(t)} = W_{-1}^{(t)},\ \ \forall t.$$

Induction step: if $\widetilde{W} = (\widetilde{W}_1, \widetilde{W}_{-1}) = (W_1 R^{-1}, W_{-1})$, then ∀$R \in O(d)$,
- $\nabla_{\widetilde{W}_1} \ell(\mathrm{FC\text{-}NN}[\widetilde{W}](Rx_i), y_i) = \nabla_{W_1} \ell(\mathrm{FC\text{-}NN}[W](x_i), y_i)\, R^{-1}$ (chain rule);
- $\nabla_{\widetilde{W}_{-1}} \ell(\mathrm{FC\text{-}NN}[\widetilde{W}](Rx_i), y_i) = \nabla_{W_{-1}} \ell(\mathrm{FC\text{-}NN}[W](x_i), y_i)$ (since $\widetilde{W}_1 R x_i = W_1 x_i$).
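The claim can be checked numerically. This sketch (added here, not the speaker's code; two layers, tanh activation, and squared loss are arbitrary choices) trains on {x_i} and on {Rx_i} from initializations related by W₁ → W₁R⁻¹ and confirms FC-NN[W̃](Rx) = FC-NN[W](x):

```python
import numpy as np
rng = np.random.default_rng(1)

def forward(W1, W2, X):                  # 2-layer FC net, tanh activation
    return np.tanh(X @ W1.T) @ W2

def grads(W1, W2, X, y):                 # gradients of 0.5 * sum of squares
    H = np.tanh(X @ W1.T)                # (n, m)
    r = H @ W2 - y                       # residuals
    gW2 = H.T @ r
    gH = np.outer(r, W2) * (1 - H**2)    # back through tanh
    return gH.T @ X, gW2

d, m, n, eta = 6, 8, 12, 0.05
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random rotation

W1 = rng.normal(size=(m, d)); W2 = rng.normal(size=m)
V1, V2 = W1 @ R.T, W2.copy()             # tilde-W1^(0) = W1^(0) R^{-1}
for _ in range(100):                     # train on {x_i} and on {R x_i}
    g1, g2 = grads(W1, W2, X, y);        W1 -= eta * g1; W2 -= eta * g2
    h1, h2 = grads(V1, V2, X @ R.T, y);  V1 -= eta * h1; V2 -= eta * h2

x = rng.normal(size=d)                   # FC-NN[W~](Rx) == FC-NN[W](x)
print(forward(W1, W2, x[None]), forward(V1, V2, (R @ x)[None]))
```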

SLIDE 51

Identifying Algorithmic Equivariance

Sufficient Conditions for General Equivariance

Theorem. The iterative algorithm A is $G_{\mathcal{X}}$-equivariant if the following conditions are met:
1. There is a group $G_{\mathcal{W}}$ acting on $\mathcal{W}$ and a group isomorphism $\tau : G_{\mathcal{X}} \to G_{\mathcal{W}}$ such that $M[\tau(g)(W)](g(x)) = M[W](x)$, ∀$x \in \mathcal{X}$, $W \in \mathcal{W}$, $g \in G_{\mathcal{X}}$.
2. The initialization $P_{\mathrm{init}}$ is invariant under the group $G_{\mathcal{W}}$, i.e. ∀$g \in G_{\mathcal{W}}$, $P_{\mathrm{init}} = P_{\mathrm{init}} \circ g^{-1}$.
3. The update rule F is invariant under any joint group action $(g, \tau(g))$, ∀$g \in G_{\mathcal{X}}$. In other words, $[\tau(g)](F(W, M, \{x_i, y_i\}_{i=1}^n)) = F([\tau(g)](W), M, \{g(x_i), y_i\}_{i=1}^n)$.

Remark: (1) is the minimum expressiveness requirement, (2) is the induction basis, and (3) is the induction step.

SLIDES 52-55

Identifying Algorithmic Equivariance

Examples of Equivariance

| Symmetry       | Sign Flip              | Permutation   | Orthogonal      | Linear          |
|----------------|------------------------|---------------|-----------------|-----------------|
| Matrix group   | Diagonal, M_ii = ±1    | Permutation   | Orthogonal      | Invertible      |
| Algorithms     | AdaGrad, Adam          | AdaGrad, Adam | GD              | Newton's method |
| Initialization | Symmetric distribution | i.i.d.        | i.i.d. Gaussian | All zero        |
| Regularization | ℓp norm                | ℓp norm       | ℓ2 norm         | None            |

Table 1: Examples of gradient-based equivariant training algorithms for FC networks. The initialization requirement is only for the first layer of the network.

Equivariance for non-iterative algorithms:
- Kernel regression: if the kernel K is $G_{\mathcal{X}}$-equivariant, i.e., ∀$g \in G_{\mathcal{X}}$, $x, y \in \mathcal{X}$, $K(g(x), g(y)) = K(x, y)$, then the algorithm $\mathrm{REG}_K$ is $G_{\mathcal{X}}$-equivariant.
- Any inner-product kernel, i.e. $K(x, y) = f(\langle x, y \rangle)$, is O(d)-equivariant, including NTK. CNTK [ADH+19] is translation- and flip-equivariant on images. (Acceleration when data augmentation is on [LWY+19].)
- ERM: if $\mathcal{F} = \mathcal{F} \circ G_{\mathcal{X}}$ and $\operatorname*{argmin}_{h \in \mathcal{F}} \sum_{i=1}^n \mathbb{1}[h(x_i) \neq y_i]$ is unique, then $\mathrm{ERM}_{\mathcal{F}}$ is $G_{\mathcal{X}}$-equivariant.
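A short check of the kernel-regression bullet (added illustration; the kernel f(⟨x, y⟩) = exp(⟨x, y⟩) is an arbitrary inner-product kernel): rotating the training and test inputs by the same R leaves all Gram entries, hence the REG_K prediction, unchanged.

```python
import numpy as np
rng = np.random.default_rng(2)

K = lambda A, B: np.exp(A @ B.T)        # inner-product kernel f(<x, y>)

d, n = 5, 8
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random rotation
x = rng.normal(size=(1, d))

pred = lambda Xtr, xt: np.sign(K(xt, Xtr) @ np.linalg.pinv(K(Xtr, Xtr)) @ y)
# Inner products are rotation-invariant, so the two predictions coincide:
print(pred(X, x), pred(X @ R.T, x @ R.T))
```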

SLIDE 56

Lower Bound for Equivariant Algorithms

Table of Contents

1. Introduction
2. Intuition and Warm-up example
3. Identifying Algorithmic Equivariance
4. Lower Bound for Equivariant Algorithms

SLIDES 57-59

Lower Bound for Equivariant Algorithms

Recap: Upper and Lower Bounds Related to VC Dimension

Growth function: $\Pi_{\mathcal{H}}(n) := \sup_{x_1, \ldots, x_n \in \mathcal{X}} \left|\{(h(x_1), \ldots, h(x_n)) \mid h \in \mathcal{H}\}\right|$.
VC dimension: $\mathrm{VCdim}(\mathcal{H}) := \max\{n \mid \Pi_{\mathcal{H}}(n) = 2^n\}$.

Lemma (Sauer-Shelah). $\Pi_{\mathcal{H}}(n) \le \left(\frac{en}{\mathrm{VCdim}(\mathcal{H})}\right)^{\mathrm{VCdim}(\mathcal{H})}$ for $n \ge \mathrm{VCdim}(\mathcal{H})$.

Theorem ([BEHW89]). If A is consistent and ranges in $\mathcal{H}$, then for any distribution $P_{\mathcal{X}}$ and ∀ 0 < ε, δ < 1,
$$N(\mathcal{A}, P_{\mathcal{X}} \diamond \mathcal{H}, \varepsilon, \delta) = O\!\left(\frac{\mathrm{VCdim}(\mathcal{H}) \ln \frac{1}{\varepsilon} + \ln \frac{1}{\delta}}{\varepsilon}\right). \quad (1)$$
Let $\mathcal{P}_{\mathcal{X}}$ be the set of all possible distributions on $\mathcal{X}$; then for any 0 < ε, δ < 1 and any A,
$$N(\mathcal{A}, \mathcal{P}_{\mathcal{X}} \diamond \mathcal{H}, \varepsilon, \delta) = \Omega\!\left(\frac{\mathrm{VCdim}(\mathcal{H}) + \ln \frac{1}{\delta}}{\varepsilon}\right). \quad (2)$$

SLIDES 60-62

Lower Bound for Equivariant Algorithms

Reduction to Learning with Algorithmic Equivariance

Notation: define $P_{\mathcal{X}} \circ g$ by $X \sim P_{\mathcal{X}} \iff g^{-1}(X) \sim P_{\mathcal{X}} \circ g$, and $P \circ g$ by $(X, Y) \sim P \iff (g^{-1}(X), Y) \sim P \circ g$, where $P = P_{\mathcal{X}} \diamond h$. That is, $(P_{\mathcal{X}} \diamond h) \circ g = (P_{\mathcal{X}} \circ g) \diamond (h \circ g^{-1})$.

Thus A being $G_{\mathcal{X}}$-equivariant implies $N^*(\mathcal{A}, P, \varepsilon) = N^*(\mathcal{A}, P \circ g, \varepsilon)$, ∀$g \in G_{\mathcal{X}}$. Consequently,
$$N^*(\mathcal{A}, P, \varepsilon) = N^*(\mathcal{A}, P \circ G_{\mathcal{X}}, \varepsilon). \quad (3)$$

Lemma: let $\mathfrak{A}$ be the set of all algorithms and $\mathfrak{A}_{G_{\mathcal{X}}}$ the set of all $G_{\mathcal{X}}$-equivariant algorithms; then
$$\inf_{\mathcal{A} \in \mathfrak{A}_{G_{\mathcal{X}}}} N^*(\mathcal{A}, \mathcal{P}, \varepsilon) \ \ge\ \inf_{\mathcal{A} \in \mathfrak{A}} N^*(\mathcal{A}, \mathcal{P} \circ G_{\mathcal{X}}, \varepsilon). \quad (4)$$
Equality is attained when $G_{\mathcal{X}}$ is a compact group.

SLIDES 63-64

Lower Bound for Equivariant Algorithms

Reduction to Learning with Algorithmic Equivariance

Lemma (restated): $\inf_{\mathcal{A} \in \mathfrak{A}_{G_{\mathcal{X}}}} N^*(\mathcal{A}, \mathcal{P}, \varepsilon) \ge \inf_{\mathcal{A} \in \mathfrak{A}} N^*(\mathcal{A}, \mathcal{P} \circ G_{\mathcal{X}}, \varepsilon)$, with equality when $G_{\mathcal{X}}$ is a compact group.

Proof of equality: let μ be the Haar measure on $G_{\mathcal{X}}$, i.e. ∀$S \subseteq G_{\mathcal{X}}$, $g \in G_{\mathcal{X}}$, $\mu(S) = \mu(g \circ S)$. Given any algorithm A, construct
$$\mathcal{A}'(\{x_i, y_i\}_{i=1}^n) = \mathcal{A}(\{g(x_i), y_i\}_{i=1}^n) \circ g, \quad g \sim \mu.$$
By the definition of the Haar measure, A′ is $G_{\mathcal{X}}$-equivariant.
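The Haar-averaging construction can be sketched as a wrapper (an added illustration; `A` stands for any base training function, and the QR-based sampler is the standard way to draw Haar-distributed orthogonal matrices):

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Draw g ~ Haar measure on O(d): QR of a Gaussian matrix, with the
    usual sign correction on the diagonal of R."""
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    return Q * np.sign(np.diag(R))

def equivariantize(A, d, rng):
    """The lemma's construction: A'(S) = A(g(S)) o g with g ~ Haar.
    `A(X, y)` returns a predictor `h(Z) -> labels`."""
    def A_prime(X, y):
        g = haar_orthogonal(d, rng)
        h = A(X @ g.T, y)                # train on {g(x_i), y_i}
        return lambda Z: h(Z @ g.T)      # predict as h(g(x))
    return A_prime
```

Wrapping any learner this way makes it equivariant in distribution, at the price of effectively facing the augmented class H ∘ G_X — exactly the reduction the lemma formalizes.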

SLIDES 65-68

Lower Bound for Equivariant Algorithms

Reduction to Learning with Algorithmic Equivariance

Lemma (recap): $\inf_{\mathcal{A} \in \mathfrak{A}_{G_{\mathcal{X}}}} N^*(\mathcal{A}, \mathcal{P}, \varepsilon) \ge \inf_{\mathcal{A} \in \mathfrak{A}} N^*(\mathcal{A}, \mathcal{P} \circ G_{\mathcal{X}}, \varepsilon)$.

Theorem: suppose $P_{\mathcal{X}}$ is invariant under the group $G_{\mathcal{X}}$, i.e., $P_{\mathcal{X}} \circ G_{\mathcal{X}} = P_{\mathcal{X}}$. Then
$$\inf_{\mathcal{A} \in \mathfrak{A}_{G_{\mathcal{X}}}} N^*(\mathcal{A}, P_{\mathcal{X}} \diamond \mathcal{H}, \varepsilon) \ \ge\ \inf_{\mathcal{A} \in \mathfrak{A}} N^*(\mathcal{A}, P_{\mathcal{X}} \diamond (\mathcal{H} \circ G_{\mathcal{X}}), \varepsilon). \quad (5)$$
Equality is attained when $G_{\mathcal{X}}$ is a compact group.

Proof: $(P_{\mathcal{X}} \diamond \mathcal{H}) \circ G_{\mathcal{X}} = \cup_{g \in G_{\mathcal{X}}} (P_{\mathcal{X}} \circ g) \diamond (\mathcal{H} \circ g^{-1}) = \cup_{g \in G_{\mathcal{X}}} P_{\mathcal{X}} \diamond (\mathcal{H} \circ g^{-1}) = P_{\mathcal{X}} \diamond (\mathcal{H} \circ G_{\mathcal{X}})$.

Take-home message: learning under an equivariance constraint is as hard as learning the augmented function class.

SLIDES 69-71

Lower Bound for Equivariant Algorithms

Separation on a Single Function + All Distributions

Construction: let $\mathcal{X} = \mathbb{R}^{2d}$ and $h^*(x) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\right)$.

Theorem (single function, all distributions). Let $\mathcal{P} = \{\text{all distributions}\} \diamond \{h^*\}$. For any orthogonal equivariant algorithm A, $N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta) = \Omega\left((d^2 + \ln\frac{1}{\delta})/\varepsilon\right)$, while there is a 2-layer CNN architecture such that $N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\!\left(\frac{1}{\varepsilon}\left(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$.

Proof of lower bound: $\begin{pmatrix} I_d & 0 \\ 0 & -I_d \end{pmatrix}$ is orthogonally similar to $\begin{pmatrix} 0 & U \\ U^\top & 0 \end{pmatrix}$ for any $U \in O(d)$. Thus $\mathcal{H} = \{h_U \mid U \in O(d)\} \subseteq h^* \circ G_{\mathcal{X}}$, where $h_U(x) = \mathrm{sign}\left(x_{1:d}^\top U x_{d+1:2d}\right)$. It suffices to show $\mathrm{VCdim}(\mathcal{H}) = \Omega(d^2)$.
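The similarity claim is easy to verify numerically (added sketch): the block matrix squares to the identity, so its eigenvalues are ±1 with multiplicity d each, matching diag(I_d, −I_d).

```python
import numpy as np
rng = np.random.default_rng(6)

d = 6
U, _ = np.linalg.qr(rng.normal(size=(d, d)))       # U in O(d)
M = np.block([[np.zeros((d, d)), U],
              [U.T, np.zeros((d, d))]])
# M^2 = I_{2d}, so M is orthogonally similar to diag(I_d, -I_d);
# note x^T M x = 2 * x_{1:d}^T U x_{d+1:2d}, so h_U(x) = h*(Rx) for
# the orthogonal R diagonalizing M.
print(np.sort(np.linalg.eigvalsh(M)))              # -1 (d times), +1 (d times)
```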

SLIDE 72

Lower Bound for Equivariant Algorithms

Separation on a Single Function + All Distributions

Proof of lower bound (cont'd): we claim $\mathcal{H}$ shatters $\{e_i + e_{d+j}\}_{1 \le i < j \le d}$, i.e. O(d) can shatter $\{e_i e_j^\top\}_{1 \le i < j \le d}$, which implies $\mathrm{VCdim}(\mathcal{H}) \ge \frac{d(d-1)}{2}$. Let $\mathfrak{so}(d) = \{M \in \mathbb{R}^{d \times d} \mid M = -M^\top\}$; we know $\exp(u) = I_d + u + \frac{u^2}{2} + \cdots \in SO(d)$, ∀$u \in \mathfrak{so}(d)$. Thus for any sign pattern $\{\sigma_{ij}\}_{1 \le i < j \le d}$, let $u = \sum_{1 \le i < j \le d} \sigma_{ij} \left(e_i e_j^\top - e_j e_i^\top\right)$; as λ → 0,
$$\mathrm{sign}\left(\left\langle \exp(\lambda u),\, e_i e_j^\top \right\rangle\right) = \mathrm{sign}\left(0 + \lambda \sigma_{ij} + O(\lambda^2)\right) = \mathrm{sign}\left(\sigma_{ij} + O(\lambda)\right) = \sigma_{ij}.$$
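The shattering argument can be verified numerically (added sketch, using scipy.linalg.expm for the matrix exponential; d and λ are arbitrary small values):

```python
import numpy as np
from scipy.linalg import expm
rng = np.random.default_rng(3)

d, lam = 8, 1e-3
sigma = np.triu(rng.choice([-1.0, 1.0], size=(d, d)), k=1)  # sign pattern
u = sigma - sigma.T                     # u in so(d): skew-symmetric
U = expm(lam * u)                       # exp(lambda * u) lies in SO(d)

# sign(<exp(lam*u), e_i e_j^T>) = sign(U_ij) should match sigma_ij:
i, j = np.triu_indices(d, k=1)
print(np.all(np.sign(U[i, j]) == sigma[i, j]))   # True for small lambda
```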

SLIDES 73-74

Lower Bound for Equivariant Algorithms

Separation on a Single Function + All Distributions

Proof of upper bound: $N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\!\left(\frac{1}{\varepsilon}\left(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$. It suffices to construct a CNN with constant VC dimension that can still express the target quadratic function. Let $\sigma : \mathbb{R}^d \to \mathbb{R}$, $\sigma(x) = \sum_{i=1}^{d} x_i^2$ (square activation + average pooling); we have
$$\mathcal{F}_{\mathrm{CNN}} = \left\{ \mathrm{sign}\left( \sum_{i=1}^{2} a_i \sum_{j=1}^{d} x_{(i-1)d+j}^2\, w_1^2 + b \right) \ \middle|\ a_1, a_2, w_1, b \in \mathbb{R} \right\}.$$

Remark: the upper bound would still hold for 2-layer CNNs with constantly larger filter size and number of channels. The point here is to show how simple the target is, and the huge loss in sample efficiency caused by ignoring the prior knowledge of the task, i.e. by learning with an orthogonal equivariant algorithm.
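A sketch of this constant-VC-dimension hypothesis class (added illustration): with a₁ = 1, a₂ = −1, w₁ = 1, b = 0 it reproduces h* exactly.

```python
import numpy as np
rng = np.random.default_rng(4)

def f_cnn(x, a1, a2, w1, b):
    """A member of F_CNN: two patches of size d, square activation with
    sum pooling, shared scalar filter weight w1."""
    d = len(x) // 2
    s1, s2 = np.sum(x[:d] ** 2), np.sum(x[d:] ** 2)
    return np.sign(a1 * s1 * w1**2 + a2 * s2 * w1**2 + b)

h_star = lambda x: np.sign(np.sum(x[:len(x)//2]**2) - np.sum(x[len(x)//2:]**2))

X = rng.normal(size=(1000, 20))          # 2d = 20
print(all(f_cnn(x, 1.0, -1.0, 1.0, 0.0) == h_star(x) for x in X))  # True
```

Only four real parameters (a₁, a₂, w₁, b) vary, which is what keeps the VC dimension of this class constant.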

SLIDES 75-77

Lower Bound for Equivariant Algorithms

Separation on a Single Function + a Single Distribution

Construction: let $\mathcal{X} = \mathbb{R}^{2d}$, $h^*(x) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\right)$, and $P_{\mathcal{X}} = N(0, I_{2d})$.

Theorem: let $\mathcal{P} = \{P_{\mathcal{X}} \diamond h^*\}$. There is a constant $\varepsilon_0 > 0$ such that if A is orthogonal equivariant, then
$$N^*(\mathcal{A}, \mathcal{P}, \varepsilon_0) = \Omega(d^2). \quad (6)$$

Proof sketch: define $h_U = \mathrm{sign}\left(x_{1:d}^\top U x_{d+1:2d}\right)$, ∀$U \in \mathbb{R}^{d \times d}$; we have $\mathcal{H} = \{h_U \mid U \in O(d)\} \subseteq h^* \circ O(2d)$. Thus it suffices to show $N^*(\mathcal{A}, N(0, I_{2d}) \diamond \mathcal{H}, \varepsilon_0) = \Omega(d^2)$ for any algorithm A.

Theorem (Benedek-Itai's lower bound [BI91]). For any algorithm A that (ε, δ)-learns $\mathcal{H}$ with n i.i.d. samples from a fixed $P_{\mathcal{X}}$, it must hold that
$$\Pi_{\mathcal{H}}(n) \ge (1 - \delta)\, D(\mathcal{H}, \rho_{\mathcal{X}}, 2\varepsilon). \quad (7)$$
Since $\Pi_{\mathcal{H}}(n) \le 2^n$, we have $N(\mathcal{A}, P_{\mathcal{X}} \diamond \mathcal{H}, \varepsilon, \delta) \ge \log_2 D(\mathcal{H}, \rho_{\mathcal{X}}, 2\varepsilon) + \log_2(1 - \delta)$. Here $D(\mathcal{H}, \rho_{\mathcal{X}}, 2\varepsilon)$ is the packing number w.r.t. $\rho_{\mathcal{X}}$, where $\rho_{\mathcal{X}}(h, h') = \mathbb{P}_{X \sim P_{\mathcal{X}}}\left[h(X) \neq h'(X)\right]$.

SLIDES 78-81

Lower Bound for Equivariant Algorithms

Separation on a Single Function + a Single Distribution

Recall $h_U = \mathrm{sign}\left(x_{1:d}^\top U x_{d+1:2d}\right)$ and $\mathcal{H} = \{h_U \mid U \in O(d)\} \subseteq h^* \circ O(2d)$, together with Benedek-Itai's lower bound [BI91]: $N(\mathcal{A}, P_{\mathcal{X}} \diamond \mathcal{H}, \varepsilon, \delta) \ge \log_2 D(\mathcal{H}, \rho_{\mathcal{X}}, 2\varepsilon) + \log_2(1 - \delta)$.

Proof sketch of $\log_2 D(\mathcal{H}, \rho_{\mathcal{X}}, 2\varepsilon) = \Omega(d^2)$:
1. $\rho_{\mathcal{X}}(h_U, h_V) = \Omega\!\left(\frac{\|U - V\|_F}{\sqrt{d}}\right)$.
2. ∀$u, v \in \mathfrak{so}(d)$ with $\|u\|_\infty, \|v\|_\infty \le \frac{\pi}{4}$: $\|\exp(u) - \exp(v)\|_F = \Omega(\|u - v\|_F)$. [Sza97]
3. Cover the spectral-norm ball in the tangent space of SO(d) at $I_d$ via a volume argument:
$$D(\mathcal{H}, \rho_{\mathcal{X}}, \varepsilon_0) \ \ge\ D\!\left(\mathfrak{so}(d) \cap \tfrac{\pi}{4} B_\infty^{d^2},\ \|\cdot\|_F / \sqrt{d},\ O(\varepsilon_0)\right) \ \ge\ \frac{\mathrm{vol}\!\left(\mathfrak{so}(d) \cap C\sqrt{d}\, B_\infty^{d^2}\right)}{\mathrm{vol}\!\left(\mathfrak{so}(d) \cap \varepsilon_0 B_2^{d^2}\right)} \ \ge\ \left(\frac{C}{\varepsilon_0}\right)^{\frac{d(d-1)}{2}}.$$
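Step 1 can be sanity-checked by Monte Carlo (added sketch; the perturbed orthogonal matrices come from a QR re-orthogonalization, and no constants are calibrated — this only checks the scaling, not the proof):

```python
import numpy as np
rng = np.random.default_rng(5)

def h(U, X):
    """h_U(x) = sign(x_{1:d}^T U x_{d+1:2d}), vectorized over rows of X."""
    d = U.shape[0]
    return np.sign(np.einsum('ni,ij,nj->n', X[:, :d], U, X[:, d:]))

d, n = 10, 100000
X = rng.normal(size=(n, 2 * d))                    # X ~ N(0, I_{2d})
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
for eps in [0.5, 0.1, 0.02]:
    V, _ = np.linalg.qr(Q + eps * rng.normal(size=(d, d)))
    rho = (h(Q, X) != h(V, X)).mean()              # P[h_U(X) != h_V(X)]
    print(np.linalg.norm(Q - V) / np.sqrt(d), rho)
```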

SLIDES 82-85

Lower Bound for Equivariant Algorithms

Conclusions

- Sufficient conditions for iterative algorithms to be equivariant.
- SGD + a fully-connected first layer + i.i.d. Gaussian initialization (+ momentum) (+ BatchNorm) is orthogonal equivariant.
- The worst-case sample complexity under an equivariance constraint is equal to that of the augmented function class.
- There is a quadratic function that can be learnt by a CNN with constantly many samples for any distribution, but learning it on a d-dimensional Gaussian distribution requires Ω(d²) samples for any orthogonal equivariant algorithm.

SLIDE 86

Lower Bound for Equivariant Algorithms

Thank You!

SLIDES 87-88

References

[ADH+19] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141-8150, 2019.

[AZL19] Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pages 9015-9025, 2019.

[BEHW89] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, October 1989.

[BI91] Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377-389, 1991.

[DWZ+18] Simon S. Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R. Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pages 373-383, 2018.

[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.

[LWY+19] Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.

[Ng04] Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78, 2004.

[Sza97] Stanislaw J. Szarek. Metric entropy of homogeneous spaces. arXiv preprint math/9701213, 1997.

[WLLM19] Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs. their induced kernel. In Advances in Neural Information Processing Systems, pages 9709-9721, 2019.