SLIDE 1

Approximation power of deep networks

Matus Telgarsky <mjt@illinois.edu> (with help from many friends!)

SLIDE 2

Goal: in some prediction problem, replace f : R^d → R with a neural network g : R^d → R.

SLIDES 3–5

Goal: in some prediction problem, replace f : R^d → R with a neural network g : R^d → R. Primary setting: statistical learning theory; thus compare

  ∫ ℓ(f(x), y) dP(x, y)   vs.   ∫ ℓ(g(x), y) dP(x, y).

◮ Upper bounds: if ℓ(·, y) is 1-Lipschitz, then

  ∫ (ℓ(g(x), y) − ℓ(f(x), y)) dP(x, y) ≤ ∫ |g(x) − f(x)| dP(x, y);

we make this small everywhere (universal/uniform/L∞(P) approximation), or in L1(P).

◮ Lower bounds: we want large error on a large set; as a surrogate, make |g − f| large in L1(P) or L1(Unif).

SLIDES 6–8

By deep networks we mostly mean

  x ↦ A_L σ_{L−1}(· · · σ_1(A_1 x + b_1) · · ·) + b_L,

where the nonlinearity/activation/transfer σ_i is applied coordinate-wise. There are many conventions; we will briefly discuss others. We'll mostly stick to the ReLU z ↦ max{0, z} (Fukushima '80); it's easy to convert between choices.
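To pin down this notation, here is a minimal sketch (plain NumPy; the weights, shapes, and seed are illustrative, not from the talk) of the map x ↦ A_L σ_{L−1}(· · · σ_1(A_1 x + b_1) · · ·) + b_L with every σ_i the ReLU:

```python
import numpy as np

def relu(z):
    # ReLU activation, applied coordinate-wise: z -> max{0, z}
    return np.maximum(0.0, z)

def deep_net(x, weights, biases):
    """Evaluate x -> A_L relu(... relu(A_1 x + b_1) ...) + b_L.

    weights = [A_1, ..., A_L], biases = [b_1, ..., b_L]; the activation
    is applied after every affine layer except the last.
    """
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = relu(A @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(1)]
print(deep_net(rng.standard_normal(3), weights, biases))
```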

SLIDE 9

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

SLIDES 10–12

Univariate functions via step activations:

  x ↦ 2·1[x − 3 ≥ 0] + 1[x − 5 ≥ 0] + 2·1[x − 7 ≥ 0] − 1[x − 13 ≥ 0] + · · ·

◮ Remark. By contrast, polynomials struggle with flat regions.
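As a quick sanity check, a sketch of evaluating such a sum of steps (weights and thresholds taken from the display above):

```python
import numpy as np

def step_sum(x, weights, thresholds):
    # x -> sum_i w_i * 1[x - t_i >= 0]: piecewise constant, flat between thresholds
    x = np.asarray(x, dtype=float)
    return sum(w * (x >= t) for w, t in zip(weights, thresholds))

weights, thresholds = [2.0, 1.0, 2.0, -1.0], [3.0, 5.0, 7.0, 13.0]
print(step_sum([0.0, 4.0, 6.0, 10.0, 14.0], weights, thresholds))  # 0, 2, 3, 5, 4
```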
SLIDES 13–18

Smooth univariate functions via step activations.

Approach #1: subdivide the range; Lip/ε steps.

Approach #2: by the FTC, for x ≥ 0,

  f(x) = f(0) + ∫₀^x f′(b) db = f(0) + ∫₀^∞ 1[x − b ≥ 0] f′(b) db.

This is a density over infinitely many steps/nodes! Sampling needs on average Lip/ε² steps.

Remarks. ◮ An infinite-width network! ◮ A refined, average-case estimate! (It captures flat regions.)
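A Monte Carlo sketch of Approach #2, assuming f′ is available (here f = sin on [0, π]; uniform threshold sampling with importance weights rather than sampling proportionally to |f′|, which is a simplification):

```python
import numpy as np

def sample_step_net(f_prime, x_max, n, rng):
    """Sample an n-node step network approximating f(x) - f(0) on [0, x_max],
    using f(x) = f(0) + int 1[x - b >= 0] f'(b) db: draw thresholds b_i
    uniformly and weight each step by f'(b_i) * x_max / n."""
    b = rng.uniform(0.0, x_max, size=n)
    w = f_prime(b) * x_max / n
    return lambda x: np.sum(w * (np.subtract.outer(x, b) >= 0), axis=-1)

rng = np.random.default_rng(0)
g = sample_step_net(np.cos, np.pi, 2000, rng)      # target: sin
x = np.linspace(0.0, np.pi, 9)
print(np.max(np.abs(g(x) - np.sin(x))))            # small, shrinking like 1/sqrt(n)
```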

SLIDES 19–23

Univariate functions via ReLU activations. Include the ReLU z ↦ max{0, z} with a change of slope at each threshold. How about smooth functions? For x ≥ 0,

  f(x) = f(0) + σ_r(x) f′(0) + ∫₀^∞ σ_r(x − b) f″(b) db.

Need on average smooth/ε² ReLUs! (In some sense optimal (Savarese-Evron-Soudry-Srebro '19).)
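A numerical sketch of that second-derivative representation (f = exp, chosen because the integral is easy to check by hand; the grid discretization is for illustration):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def relu_rep(f0, fp0, f_pp, x, grid):
    # Discretize f(x) = f(0) + relu(x) f'(0) + int relu(x - b) f''(b) db
    db = grid[1] - grid[0]
    return f0 + relu(x) * fp0 + np.sum(relu(x - grid) * f_pp(grid)) * db

grid = np.linspace(0.0, 3.0, 3000)
for x in [0.5, 1.5, 2.5]:
    print(x, relu_rep(1.0, 1.0, np.exp, x, grid), np.exp(x))  # nearly equal
```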

SLIDES 24–27

Multivariate, but finitely many points. With probability 1, a random line has unique projections (distinct points project to distinct scalars), so we've reduced to the univariate case. Caveats: ◮ The representation size may have blown up. ◮ Not our original goal.
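A quick check of the "unique projections" fact (a random Gaussian direction, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))        # 10 distinct points in R^5
a = rng.standard_normal(5)              # random direction
proj = X @ a                            # scalar projections onto the line
print(len(np.unique(proj)) == len(X))   # True with probability 1
```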

SLIDES 28–34

Approximate a multivariate box. Supporting hyperplanes: sum the indicator of one halfspace per face! . . . oops: in the 2-d figure, the sum is 4 inside the box but 3 in the edge regions and 2 in the corner regions, so it does not vanish away from the box. Fix #1: product the halfspaces together! (We'll return to this...) Fix #2: add a layer, thresholding at 3.5! ...how about one ReLU/hidden layer?
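A sketch of Fix #2 for a general box (step activations; for d = 2 the threshold 2d − 0.5 is the slide's 3.5):

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)

def box_indicator(x, lo, hi):
    """Two-layer threshold net for the box [lo_1, hi_1] x ... x [lo_d, hi_d].

    Layer 1: one halfspace indicator per face (2d nodes); the sum is 2d
    exactly inside the box. Layer 2: threshold that sum at 2d - 0.5."""
    faces = np.sum(step(x - lo)) + np.sum(step(hi - x))
    return step(faces - (2 * len(lo) - 0.5))

print(box_indicator(np.array([0.5, 0.5]), np.zeros(2), np.ones(2)))  # 1.0
print(box_indicator(np.array([1.5, 0.5]), np.zeros(2), np.ones(2)))  # 0.0
```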

SLIDES 35–38

Approximate a multivariate ball. Fix #3: add all the hyperplanes! The resulting radial function is constant within the ball and attenuates away from it. Bad news: good approximation seems to require 2^d nodes. . . (We'll come back to this.)

SLIDES 39–41

Combinations of radial bumps. Normalize bumps/RBFs into a density p; convolve with f:

  |f(x) − ∫ f(z) p(x − z) dz| = |f(x) − ∫ f(x − z) p(z) dz|
                              = |∫ f(x) p(z) dz − ∫ f(x − z) p(z) dz|
                              ≤ ∫ |f(x) − f(x − z)| p(z) dz,

which is small if p(z) ≈ 0 for large z. Size estimate: (d·Lip/ε)^{O(d)}. (Mhaskar-Micchelli '92, BJTX '19.)
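A kernel-smoothing sketch in this spirit (a discretized, per-point-normalized Gaussian bump combination rather than a literal convolution; the target f and bandwidth are illustrative):

```python
import numpy as np

def rbf_smooth(f, x, centers, width):
    """Approximate f(x) by a normalized combination of radial bumps
    centered on a grid, i.e. a discretization of (f * p)(x)."""
    w = np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * width ** 2))
    w /= w.sum()                                  # normalize into a density
    return np.sum(w * np.apply_along_axis(f, 1, centers))

grid1d = np.linspace(-1.0, 1.0, 41)
centers = np.stack(np.meshgrid(grid1d, grid1d), axis=-1).reshape(-1, 2)
f = lambda z: np.sin(z[0]) + z[1] ** 2
x = np.array([0.3, -0.4])
print(rbf_smooth(f, x, centers, 0.05), f(x))      # nearly equal
```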

SLIDE 42

So far:
◮ Easy univariate constructions.
◮ 3-layer box constructions over R^d: size (Lip/ε)^{O(d)}.
◮ 2-layer RBF convolutions over R^d: size (d·Lip/ε)^{O(d)}.

Remarks.
◮ Impractical constructions! Bad Lipschitz constants.
◮ Contrast with polynomials: flat pieces.
◮ Usefulness of infinite width! Note also: E σ_r(aᵀx) = ½ E|aᵀx| = ‖x‖/√(2π) for standard Gaussian a.
◮ Poor complexity measures outside the univariate case!

SLIDE 43

Interlude: three questions.

1. Are fixed DN architectures closed under addition?
2. Can RNNs model Turing machines? (Diagram: an RNN unrolled over time, with shared cell f, states s_1, s_2, s_3, outputs y_1, y_2, y_3, inputs x_1, x_2, x_3.)
3. Given continuous g : R^d → R, can we construct custom univariate activations so that

  g(x) =? Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) ) ?

SLIDE 44

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

SLIDES 45–48

Bumps via multiplication.

(Figure: cos(x)^p on [−1.5, 1.5] for p ∈ {1, 2, 4, 8, 16, 32}; larger p concentrates the curve into a bump at 0.)

Univariate bump: cos(x)^p for large p. Multivariate bump:

  1[‖x‖_∞ ≤ 1] = Π_{i=1}^{d} 1[|x_i| ≤ 1],  and correspondingly  Π_{i=1}^{d} cos(x_i)^p.

To remove the product: cos(x) cos(x) = ½(cos(2x) + 1), and more generally

  2 cos(x_1) cos(x_2) = cos(x_1 + x_2) + cos(x_1 − x_2).
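A sketch of that product-to-sum reduction, expanding a d-fold product of cosines into 2^{d−1} plain cosines of linear forms (so no multiplication nodes are needed):

```python
import itertools
import numpy as np

def cos_product_as_sum(x):
    """prod_i cos(x_i) = 2^{-(d-1)} * sum over sign patterns s of
    cos(x_1 + sum_j s_j x_{j+1}), by repeated 2cos(a)cos(b) = cos(a+b)+cos(a-b)."""
    d = len(x)
    total = sum(np.cos(x[0] + np.dot(s, x[1:]))
                for s in itertools.product([1.0, -1.0], repeat=d - 1))
    return total / 2 ** (d - 1)

x = np.array([0.3, 1.1, -0.7])
print(cos_product_as_sum(x), np.prod(np.cos(x)))   # equal
```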

SLIDES 49–50

Weierstrass approximation theorem. Theorem (Weierstrass, 1885). Polynomials can uniformly approximate continuous functions over compact sets.

Remarks. ◮ Not a consequence of interpolation: one must control behavior between the interpolation points. ◮ The proofs are interesting; e.g., Bernstein (Bernstein polynomials and tail bounds), Weierstrass (Gaussian smoothing gives analytic functions), . . . ◮ Stone-Weierstrass theorem: polynomial-like function families (e.g., closed under multiplication) also approximate continuous functions.

SLIDES 51–54

Theorem (Hornik-Stinchcombe-White '89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0 and lim_{z→+∞} σ(z) = 1, and define

  H_σ := { x ↦ σ(aᵀx − b) : (a, b) ∈ R^{d+1} }.

Then span(H_σ) uniformly approximates continuous functions on compact sets.

Proof #1. H_cos is closed under products since 2 cos(a) cos(b) = cos(a + b) + cos(a − b). Now uniformly approximate a fixed element of H_cos with span(H_σ). (Univariate fitting.)

Proof #2. H_exp is closed under products since e^a e^b = e^{a+b}. Now uniformly approximate a fixed element of H_exp with span(H_σ). (Univariate fitting.)

Remarks.
◮ The ReLU is fine: use σ(z) := σ_r(z) − σ_r(z − 1).
◮ Size estimate: expanding terms, we seem to get (Lip/ε)^{Ω(d)}.
◮ Best conditions on σ (Leshno-Lin-Pinkus-Schocken '93): the theorem holds iff σ is not a polynomial.
◮ A hint about DNs: no need for explicit multiplication?

SLIDE 55

Other proofs.
◮ (Cybenko '89.) Assume for contradiction that you miss some functions. By duality, 0 = ∫ σ(aᵀx − b) dµ(x) for some nonzero signed measure µ and all (a, b). Using Fourier analysis, one can show this implies µ = 0. . .
◮ (Leshno-Lin-Pinkus-Schocken '93.) If σ is a polynomial, . . . ; otherwise one can (roughly) obtain derivatives of all orders, hence polynomials of all orders.
◮ (Barron '93.) Consider the activation x ↦ exp(i aᵀx) and the infinite-width form ∫ exp(i aᵀx) f̂(a) da. Take the real part and sample (Maurey) to get g ∈ span(H_cos); convert to span(H_σ) as before.
◮ (Funahashi '89.) Also Fourier-based, measure-theoretic.

SLIDE 56

"Universal approximation" (uniform approximation of continuous functions on compact sets).
◮ Elementary proof: RBFs (Mhaskar-Micchelli '92; BJTX '19).
◮ Slick proof: Stone-Weierstrass with H_cos or H_exp (Hornik-Stinchcombe-White '89).
◮ Proof with size estimates beating (Lip/ε)^d, indeed depending on the norm of the Fourier transform of the gradient, related to a "sampling measure": (Barron '93).

Remarks.
◮ Exhibits nothing special about DNs; indeed, the same proofs work for boosting, RBF SVMs, . . .
◮ The size estimates are huge (soon we'll see d^{Ω(d)}).
◮ The proofs use nice representation "tricks" (e.g., Leshno et al.'s "iff not a polynomial").

SLIDE 57

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

SLIDES 58–60

Radial functions are easy with two ReLU layers. Consider f(‖x‖₂²) where f has Lipschitz constant Lip.
◮ Pick h(x) ≈_ε ‖x‖₂² = Σ_i x_i² with d·Lip/ε ReLUs.
◮ Pick g ≈_ε f with Lip/ε ReLUs; then

  |f(‖x‖₂²) − g(h(x))| ≤ |f(‖x‖₂²) − f(h(x))| + |f(h(x)) − g(h(x))| ≤ Lip·|‖x‖₂² − h(x)| + ε ≤ 2ε.

Remarks.
◮ The final size of g ∘ h is poly(Lip, d, 1/ε).
◮ This proof style is "typical"/lazy; it (problematically) pays in the Lipschitz constant.
◮ That was easy/intuitive; how about the single-hidden-layer case?...
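A sketch of the composition g ∘ h (each univariate piece is a one-hidden-layer ReLU interpolant; the target f = cos and grid sizes are illustrative):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def pwl_relu(f, lo, hi, n):
    """One-hidden-layer ReLU net interpolating univariate f at n knots on [lo, hi]."""
    knots = np.linspace(lo, hi, n)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    w = np.concatenate([[slopes[0]], np.diff(slopes)])   # slope changes
    return lambda x: vals[0] + np.sum(w * relu(np.subtract.outer(x, knots[:-1])), axis=-1)

sq = pwl_relu(lambda t: t ** 2, -1.0, 1.0, 64)           # h: sum of d squarers
h = lambda x: np.sum([sq(xi) for xi in x])
f = np.cos
g = pwl_relu(f, 0.0, 3.0, 64)                            # g approximates f
x = np.array([0.6, -0.3, 0.4])
print(g(h(x)), f(np.sum(x ** 2)))                        # nearly equal
```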

SLIDES 61–62

Radial functions are not easy with only one ReLU layer (I). Theorem (Eldan-Shamir, 2015). There exist a radial function f, expressible with two ReLU layers of width poly(d), and a probability measure P, so that every g with a single ReLU layer of width 2^{O(d)} satisfies

  ∫ (f(x) − g(x))² dP(x) ≥ Ω(1).

Proof hints. Apply the Fourier isometry and consider the transforms. The transform of g is supported on a small set of tubes; the transform of f has large mass that the tubes can't reach.

SLIDES 63–64

Radial functions are not easy with only one ReLU layer (II). Theorem (Daniely, 2017). Let (x, x′) ∼ P be uniform on two sphere surfaces, and define h(x, x′) = sin(π d³ xᵀx′). Every g with a single ReLU layer of width d^{O(d)} and weight magnitudes O(2^d) satisfies

  ∫ (h(x, x′) − g(x, x′))² dP(x, x′) ≥ Ω(1),

yet h can be approximated to accuracy ε with two ReLU layers of size poly(d, 1/ε).

Proof hints. Spherical harmonics reduce this to a univariate problem; apply region counting.

SLIDE 65

Approximation of high-dimensional radial functions. (Figure: a radial function contour plot.) If we can approximate each shell, we can approximate the overall function.

SLIDES 66–68

Approximation of a high-dimensional radial shell. Let's approximate a single shell; consider

  x ↦ 1[ ‖x‖ ∈ [1 − 1/d, 1] ],

which has a constant fraction of the sphere's volume. We can't cut too deeply: that gives bad error on the inner zero part. . . . . . but then we need to cover exponentially many caps.

SLIDES 69–70

Let's go back to the drawing board; what do shallow representations do exceptionally badly? One weakness: their complexity scales with the number of bumps.

SLIDES 71–75

Consider the tent map

  ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x,       x ∈ [0, 1/2),
                                    2(1 − x), x ∈ [1/2, 1].

(Figures: ∆, ∆² = ∆ ∘ ∆, and ∆^k on [0, 1].)

What is the effect of composition?

  f(∆(x)) = { f(2x),       x ∈ [0, 1/2): f squeezed into [0, 1/2],
              f(2(1 − x)), x ∈ [1/2, 1]: f reversed and squeezed.

∆^k uses O(k) layers & nodes, but has O(2^k) bumps.
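A sketch that builds ∆^k and counts its oscillations numerically (grid-based local-maximum counting, for illustration):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
tent = lambda x: relu(2 * x) - relu(4 * x - 2)      # the tent map Delta

def tent_k(x, k):
    # Delta^k: k-fold composition, an O(k)-layer, O(k)-node ReLU network
    for _ in range(k):
        x = tent(x)
    return x

x = np.linspace(0.0, 1.0, 100001)
for k in [1, 2, 5, 10]:
    y = tent_k(x, k)
    peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))   # strict local maxima
    print(k, peaks)   # 1, 2, 16, 512: the bump count doubles per composition
```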

SLIDES 76–80

Theorem (T '15). Let a number of layers k ≥ 1 be given. There exists a ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k² + 9 nodes, and 2k² + 6 layers, such that every ReLU network g : R^d → R with ≤ k layers and ≤ 2^k nodes satisfies

  ∫_{[0,1]} |f(x) − g(x)| dx ≥ 1/32.

Proof.
1. A g with few oscillations can't approximate an oscillatory, regular f.
2. There exists a regular, oscillatory f: take f = ∆^{k²+3}.
3. Width m and depth L ⟹ few ((O(m))^L) oscillations.

Rediscovered many times; (T '15) gives an elementary univariate argument; multivariate arguments appear in (Warren '68), (Arnold ?), (Montufar-Pascanu-Cho-Bengio '14), (BT '18), . . .

SLIDES 81–88

A g with few oscillations versus a highly oscillatory f: does ∫_{[0,1]} |g − f| have to be large? Oscillation alone is not enough; f should also be regular (teeth of uniform size). Then:

g with few oscillations; f highly oscillatory, regular ⟹ ∫_{[0,1]} |g − f| large.

(Figures: a piecewise-affine g with few pieces against the triangle wave f; within most teeth of f, g does not oscillate and must miss a constant fraction of the tooth.)

Let's use f = ∆^{k²+3}.

SLIDES 89–90

Story from benefits of depth.
◮ Certain radial functions have a polynomial-width representation with 2 ReLU layers, but need exponential width with 1 ReLU layer.
◮ ∆^{k²+3} can be written with O(k²) depth and O(1) width, but requires width Ω(2^k) at depth O(k).

Remarks.
◮ ∆^k is 2^k-Lipschitz; possibly nonsensical, unrealistic.
◮ These results have stood for a few years now; many "technical" questions remain, as do "realistic" ones.

SLIDE 91

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

SLIDES 92–100

h_k := the piecewise-affine interpolant of x ↦ x² at {0, 1/2^k, 2/2^k, . . . , 2^k/2^k}.

(Figures: h₁, h₂, and h₁ − h₂ = ∆²/4²; each refinement subtracts a scaled tent composition.)

Thus h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i.

◮ h_k needs k = O(ln(1/ε)) to ε-approximate x ↦ x² (Yarotsky '16), with matching lower bounds.
◮ Squaring implies multiplication via polarization: xᵀy = ½(‖x + y‖² − ‖x‖² − ‖y‖²).
◮ This implies efficient approximation of polynomials; can we do more?
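A numerical check of the identity h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i and its exponential accuracy (the error should be 4^{−(k+1)}, the piecewise-linear interpolation error of x²):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
tent = lambda x: relu(2 * x) - relu(4 * x - 2)

def h(x, k):
    # h_k(x) = x - sum_{i<=k} Delta^i(x)/4^i: interpolates x^2 at {j/2^k}
    out, t = x.copy(), x.copy()
    for i in range(1, k + 1):
        t = tent(t)
        out = out - t / 4.0 ** i
    return out

x = np.linspace(0.0, 1.0, 2 ** 14 + 1)
for k in [1, 2, 4, 8]:
    print(k, np.max(np.abs(h(x, k) - x ** 2)))   # 4^{-2}, 4^{-3}, 4^{-5}, 4^{-9}
```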

SLIDES 101–110

Theorem (Yarotsky '16). Let a dimension d and a smoothness order r be given. For any f : [0, 1]^d → R with all rth-order derivatives bounded by 1, there exists a network g with C_{d,r} ln(e/ε) layers and C_{d,r} ε^{−d/r} ln(e/ε) nodes so that

  sup_{x ∈ [0,1]^d} |f(x) − g(x)| ≤ ε.

Proof. The conditions imply accurate local Taylor expansions, so f can be written as a linear combination over a basis of polynomials multiplied by local bumps.

Remarks.
◮ There is depth, but it is function-independent: only the basis coefficients use f. In that sense this is a shallow representation.
◮ The Lipschitz constant is possibly bad: the tent construction at accuracy ε is 1/ε-Lipschitz, and the bumps are (1/ε)^{d/r}-Lipschitz.
◮ There is parallel and subsequent work with similar proof ideas and Lipschitz constants: (Safran-Shamir '16), (Petersen-Voigtlaender '17), (Schmidt-Hieber '17).
◮ Another appearance of polynomials in DNs: sum-product networks. These were the first to have a depth separation (Delalleau-Bengio '11).
◮ DNs can approximate polynomials efficiently, but the reverse is false: a single ReLU requires degree 1/ε.
◮ Polynomials cannot handle flat regions well; this is used above, and in approximating rational functions (T '17).
◮ Corresponding lower bounds indicate that depth is needed.

SLIDE 111

Interlude: three questions.

1. Are fixed DN architectures closed under addition? No: add together perturbed copies of ∆^k.
2. Can RNNs model Turing machines? (Diagram: an RNN unrolled over time, with shared cell f, states s_1, s_2, s_3, outputs y_1, y_2, y_3, inputs x_1, x_2, x_3.) Hint: ReLU networks can compute exact Boolean formulas. Set f to the state transition table, and encode the tape in s.
3. Given continuous g : R^d → R, can we construct custom univariate activations so that g(x) =? Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) )? Hint? Wouldn't this contradict a Hilbert problem?

SLIDE 112

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

SLIDE 113

Generative modeling. Typical setup: a pushforward measure g#µ, meaning sample x ∼ µ and output g(x). Many easy constructions have bad/∞ Lipschitz constants! E.g., mapping the uniform distribution onto [0, 1/2] ∪ (3/2, 2]. Some literature: (Arora-Ge-Liang-Ma-Zhang '17, BT '18, Bai-Ma-Risteski '19, Elchanan's talk this week!)
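A sketch of why that example forces a bad Lipschitz constant: a continuous pushforward of Unif[0, 1] onto the two intervals needs a very steep ramp (here a logistic step with illustrative steepness) to empty out the gap (1/2, 3/2]:

```python
import numpy as np

def g(z, steep=1e3):
    # Smooth surrogate for a jump at z = 1/2; Lipschitz constant ~ steep/4
    jump = 1.0 / (1.0 + np.exp(-steep * (z - 0.5)))
    return z + jump    # lower half stays in [0, 1/2], upper half shifts to (3/2, 2]

rng = np.random.default_rng(0)
s = g(rng.uniform(size=100000))
print(np.mean(s <= 0.5), np.mean(s > 1.5))   # each ~ 0.5; the gap is (almost) empty
```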

SLIDE 114

Randomly initialized networks. An approximation fact in recent optimization papers: a small perturbation of a random initialization gives any function you want! (Du-Lee-Li-Wang-Zhai '18, Allen-Zhu-Li-Song '18.) There is residual error from the noise; approximating high-Lipschitz functions is problematic! (BJTX '19.)

SLIDES 115–116

Randomly sampled networks. Theorem. With probability ≥ 1 − 1/e,

  sup_{‖x‖₂ ≤ 1} | ∫ σ_r(aᵀx − b) dµ(a, b) − (‖µ‖₁/N) Σ_{i=1}^{N} σ_r(a_iᵀx − b_i) | ≤ O(B ‖µ‖₁ / √N),

where the support of µ has ‖(a, b)‖ ≤ B.

Proof. Invoke Rademacher complexity, but swap the roles of inputs and parameters. (Koiran-Gurvits '97, Sun-Gilbert-Tewari '18, BJTX '19.) Also Maurey's Lemma (Barron '93).
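A Monte Carlo sketch of the statement (µ taken to be a Gaussian probability measure, so ‖µ‖₁ = 1; a huge sample stands in for the integral):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def mu_sampler(n, rng):
    # Sample (a, b) from mu, here a standard Gaussian probability measure
    return rng.standard_normal((n, 3)), rng.standard_normal(n)

def net(x, n, rng):
    # Width-n average (1/n) sum_i relu(a_i^T x - b_i)
    a, b = mu_sampler(n, rng)
    return np.mean(relu(a @ x - b))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.5, 0.2])
target = net(x, 1_000_000, rng)             # proxy for the integral
for N in [10, 100, 1000, 10000]:
    print(N, abs(net(x, N, rng) - target))  # decays roughly like 1/sqrt(N)
```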

SLIDES 117–119

Adversarial stability. Adversarial examples lower-bound the Lipschitz constant. . . . . . but a bad Lipschitz constant can be good for adversarial examples! Given the existence of adversarial examples, is uniform approximation too stringent?
slide-120
SLIDE 120

Turing machines and RNNs f f f s1 s2 s3 y1 y2 y3 x1 x2 x3 ◮ Make f the TM state transition table, s the tape.

slide-121
SLIDE 121

Turing machines and RNNs f f f s1 s2 s3 y1 y2 y3 x1 x2 x3 ◮ Make f the TM state transition table, s the tape. ◮ x → 1[x ≥ 0] is not computable; bits need a special encoding within s.

slide-122
SLIDE 122

Turing machines and RNNs f f f s1 s2 s3 y1 y2 y3 x1 x2 x3 ◮ Make f the TM state transition table, s the tape. ◮ x → 1[x ≥ 0] is not computable; bits need a special encoding within s. ◮ Use a robust “cantor-like” encoding. (Siegelmann-Sontag ’94.)
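A simplified sketch of such a Cantor-like tape encoding (the exact Siegelmann-Sontag construction differs; using base-4 digits in {1, 3} keeps distinct tapes uniformly separated, so reading a bit never requires a sharp threshold):

```python
def encode(bits):
    # Tape -> number with base-4 digits in {1, 3} (bit 0 -> 1, bit 1 -> 3)
    s = 0.0
    for b in reversed(bits):
        s = (s + (3 if b else 1)) / 4.0
    return s

def pop(s):
    # Read and remove the leading bit; 4*s is never near a digit boundary
    digit = int(4 * s)          # in {1, 3}
    return digit == 3, 4 * s - digit

s = encode([1, 0, 1, 1])
out = []
for _ in range(4):
    bit, s = pop(s)
    out.append(int(bit))
print(out)   # [1, 0, 1, 1]
```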

SLIDES 123–128

Kolmogorov-Arnold '56. There exist continuous h_{i,j} : R → R (i = 0, . . . , 2d; j = 1, . . . , d) so that for any continuous g : R^d → R there exist continuous f_i : R → R (i = 0, . . . , 2d) with

  g(x) = Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) ).

Proof sketch.
Step 1. Fix a target accuracy ε > 0.
Step 2. Choose f : R → R and a nearly injective Q : R^d → R with g ≈ f(Q(x)).
Step 3. Replace the near-injection Q : R^d → R with Σ_j h_j(x_j). (Figure: a grid with coordinates at 1, 2, 3, 4 and at 2√2, 3√2, 4√2.)
Step 4. Replace f(Σ_j h_j(x_j)) with staggered versions Σ_i f_i(Σ_j h_{i,j}(x_j)); for any x ∈ [0, 1]^d, at least half are correct.
Step 5. Embed the solutions for infinitely many ε into one.

SLIDE 129

Main story.
◮ We can fit continuous functions in various ways, but the size is bad ((d·Lip/ε)^{O(d)}).
◮ Composition and depth bring some concrete benefits: exponential reductions in width!
◮ Polynomials may be approximated efficiently, but so can some non-polynomials (Sobolev balls, rational functions, flat regions, . . . ).

Remarks.
◮ Refined depth separations (e.g., from a single added layer) and practical depth separations are still elusive.
◮ Refined, average-case complexity measures are elusive.

SLIDES 130–131

Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

Thanks. . . any questions?