Deep Approximation via Deep Learning. Zuowei Shen, Department of Mathematics, National University of Singapore (PowerPoint presentation).

SLIDE 1

Deep Approximation via Deep Learning

Zuowei Shen

Department of Mathematics, National University of Singapore

SLIDE 2

Outline

1. Introduction to approximation theory

2. Approximation of functions by compositions

3. Approximation rate in terms of the number of neurons


SLIDE 7

A brief introduction

For a given function f : R^d → R and ε > 0, approximation is to find a simple function g such that ‖f − g‖ < ε. The function g : R^n → R can be as simple as g(x) = a · x. To make sense of this approximation, we need to find a map T : R^d → R^n such that ‖f − g ∘ T‖ < ε. In practice, we only have sample data {(x_i, f(x_i))}_{i=1}^m of f, so one needs to develop algorithms to find T.

1. Classical approximation: T is independent of f and the data, while n depends on ε.

2. Learning: T is learned from data and determined by a few parameters; n depends on ε.

3. Deep learning: T is fully learned from data with a huge number of parameters. T is a composition of many simple maps, and n can be independent of ε.

SLIDE 9

Classical approximation

Linear approximation: Given a finite fixed set of generators {φ_1, . . . , φ_n}, e.g. splines, wavelet frames, finite elements, or generators of reproducing kernel Hilbert spaces. Define T = [φ_1, φ_2, . . . , φ_n]^T : R^d → R^n and g(x) = a · x. Linear approximation is to find a ∈ R^n such that g ∘ T = Σ_{i=1}^n a_i φ_i ∼ f. It is linear because f_1 ∼ g_1, f_2 ∼ g_2 ⇒ f_1 + f_2 ∼ g_1 + g_2.

The best n-term approximation: Given a dictionary D that can have infinitely many generators, e.g. D = {φ_i}_{i=1}^∞, define T = [φ_1, φ_2, . . .]^T : R^d → R^∞ and g(x) = a · x. The best n-term approximation of f is to find a with n nonzero terms such that g ∘ T ∼ f is the best approximation among all n-term choices. It is nonlinear because f_1 ∼ g_1, f_2 ∼ g_2 does not imply f_1 + f_2 ∼ g_1 + g_2, as the supports of a_1 and a_2 depend on f_1 and f_2.

SLIDE 13

Examples

Consider the function space L^2(R^d), and let {φ_i}_{i=1}^∞ be an orthonormal basis of L^2(R^d).

Linear approximation: For a given n, T = [φ_1, . . . , φ_n]^T and g = a · x, where a_j = ⟨f, φ_j⟩. Denote H = span{φ_1, . . . , φ_n} ⊆ L^2(R^d). Then g ∘ T = Σ_{i=1}^n ⟨f, φ_i⟩ φ_i is the orthogonal projection of f onto the space H and is the best approximation of f from the space H.

g ∘ T provides a good approximation of f when the sequence {⟨f, φ_j⟩}_{j=1}^∞ decays fast as j → +∞. Therefore,

1. Linear approximation provides a good approximation for smooth functions.

2. Advantage: it is a good approximation scheme when d is small, the domain is simple, and the function form is complicated but smooth.

3. Disadvantage: it does not do well if d is big and/or the domain of f is complex.
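The projection formula above can be checked numerically. A minimal numpy sketch (the cosine basis and the smooth target function are assumptions of this demo, not from the slides):

```python
import numpy as np

# Orthonormal cosine basis of L^2([0,1]): phi_0(x) = 1, phi_j(x) = sqrt(2) cos(pi j x).
def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

f = lambda x: np.exp(-x) * np.sin(3.0 * x)   # a smooth target (illustrative choice)

x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
fx = f(x)

def projection_error(n):
    """L^2 error of g o T = sum_{j<n} <f, phi_j> phi_j, the projection onto H."""
    g = np.zeros_like(x)
    for j in range(n):
        a_j = np.sum(fx * phi(j, x)) * dx    # a_j = <f, phi_j> (Riemann sum)
        g += a_j * phi(j, x)
    return np.sqrt(np.sum((fx - g) ** 2) * dx)

errors = [projection_error(n) for n in (2, 4, 8, 16)]
# Because f is smooth, the coefficients <f, phi_j> decay quickly and so does the error.
```

Adding basis elements can only shrink the projection error, and for a smooth f the fast decay of ⟨f, φ_j⟩ makes it shrink quickly, which is exactly item 1 above.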

SLIDE 15

Examples

The best n-term approximation: T = (φ_j)_{j=1}^∞ : R^d → R^∞ and g(x) = a · x, where each a_j is

    a_j = ⟨f, φ_j⟩ for the largest n terms in the sequence {|⟨f, φ_j⟩|}_{j=1}^∞, and a_j = 0 otherwise.

The approximation of f by g ∘ T depends less on the decay of the sequence {|⟨f, φ_j⟩|}_{j=1}^∞. Therefore,

1. The best n-term approximation is better than the linear approximation when f is nonsmooth.

2. It is not a good scheme if d is big and/or the domain of f is complex.
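The thresholding rule above admits a short numerical sketch (the cosine basis, the discontinuous target, and the truncation J = 200 are assumptions of this demo): keep the n coefficients of largest magnitude and compare against the first-n (linear) choice.

```python
import numpy as np

# Orthonormal cosine basis of L^2([0,1]), as in the previous example.
def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
fx = (x > 0.35).astype(float)        # nonsmooth target: a jump at x = 0.35

J = 200                              # truncate the dictionary for the demo
coeffs = np.array([np.sum(fx * phi(j, x)) * dx for j in range(J)])

def recon_error(support):
    g = np.zeros_like(x)
    for j in support:
        g += coeffs[j] * phi(j, x)
    return np.sqrt(np.sum((fx - g) ** 2) * dx)

n = 10
linear_err = recon_error(range(n))                 # first n terms: linear scheme
best_support = np.argsort(-np.abs(coeffs))[:n]     # n largest |<f, phi_j>|
best_err = recon_error(best_support)               # best n-term: support adapts to f
```

Since the first n indices form one admissible support, the best n-term error is never worse than the linear one; for a nonsmooth f like this jump it is strictly better, illustrating item 1 above.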

SLIDE 18

Approximation for deep learning

Given data {(x_i, f(x_i))}_{i=1}^m:

1. The key to deep learning is to construct a T from the given data and a chosen g.

2. T can simplify the domain of f through a change of variables while keeping the key features of the domain of f, so that

3. it is robust to approximate f by g ∘ T.

SLIDE 21

Classical approximation vs deep learning

For both linear and best n-term approximation, T is fixed. Neither is suitable for approximating f when f is defined on a complex domain, e.g. a manifold in a very high dimensional space. For deep learning, T is constructed from and adapted to the given data. T changes variables and maps the domain of f to match that of a simple function g. It is normally used to approximate f on a complex domain.

What is the mathematics behind this? Setting: construct a measurable map T : R^d → R^n and a simple function g (e.g. g = a · x) from data, such that the features of the domain of f can be rearranged by T to match those of g. This leads to g ∘ T providing a good approximation of f.
SLIDE 22

Outline

1. Introduction to approximation theory

2. Approximation of functions by compositions

3. Approximation rate in terms of the number of neurons

SLIDE 24

Approximation by compositions (with Qianxiao Li and Cheng Tai)

Question 1: For given f and g, is there a measurable T : R^d → R^n such that f = g ∘ T?

Answer: Yes! We have proven:

Theorem. Let f : R^d → R and g : R^n → R, and assume Im(f) ⊆ Im(g) and g is continuous. Then there exists a measurable map T : R^d → R^n such that f = g ∘ T, a.e.

This is an existence proof; T cannot be written out analytically. This leads to the following relaxed question.
SLIDE 26

Approximation by compositions

Question 2: For an arbitrarily given ε > 0, can one construct a measurable T : R^d → R^n such that ‖f − g ∘ T‖ ≤ ε?

Answer: Yes!

Theorem. Let f : R^d → R and g : R^n → R, and assume Im(f) ⊆ Im(g). For an arbitrarily given ε > 0, a measurable map T : R^d → R^n can be constructed in terms of f and g such that ‖f − g ∘ T‖ ≤ ε.

While T can be written out in terms of f and g, T can be complex to construct when only sample data of f is given. This leads to:
SLIDE 27

Approximation by compositions

Question 3: Can T be a composition of simple maps? That is, can we write T = T_1 ∘ · · · ∘ T_J, where each T_i, i = 1, 2, . . . , J, is simple, e.g. a "perturbation of identity"?

Answer: Yes!

Theorem. Let f : R^d → R and g : R^n → R. For an arbitrarily given ε > 0, if Im(f) ⊆ Im(g), then there exist J simple maps T_i, i = 1, 2, . . . , J, such that T = T_1 ∘ T_2 ∘ · · · ∘ T_J : R^d → R^n and ‖f − g ∘ T_1 ∘ · · · ∘ T_J‖ ≤ ε.

The proof of existence of the maps T_i is constructive. In fact, an algorithm can be devised to carry it out approximately in practice.

SLIDE 28

Algorithm

Input: hypothesis spaces I, H; loss functions L, L'; tolerance ε.
Data: {(x_i, f(x_i))}_{i=1}^N.
Result: a function f_n that approximates the given f.
Initialization: set f_0 = g, with Im g ⊇ Im f.
for j from 0 to n − 1 do
    I_j = argmin_{I ∈ I} (1/N) Σ_{i=1}^N L(I(x_i), 1_{|f_j − f| > ε}(x_i));
    h_j = argmin_{h ∈ H} (1/N) Σ_{i=1}^N L'(f(x_i), f_j ∘ T_{h,j}(x_i)),
        where T_{h,j}(x) := I_j(x) h(x) + [1 − I_j(x)] x;
    set f_{j+1} = f_j ∘ T_{h_j, j}
end
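A toy 1-D rendering of this loop (all concrete choices below, the target f, the outer function g, the tiny hypothesis spaces, and the squared losses, are illustrative assumptions, not the paper's setup). I_j is taken to be the empirical indicator 1{|f_j − f| > ε} on the sample grid, and H is the set of affine maps h(x) = a x + b, searched by brute force:

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 400)
f = np.sin(2.0 * np.pi * xs)                 # samples f(x_i) of the target
g = lambda x: np.clip(x, -1.0, 1.0)          # a simple outer function g (assumed)
eps = 0.05

fj = g(xs)                                   # f_0 = g, restricted to the grid
mse0 = np.mean((f - fj) ** 2)

for j in range(5):
    I = (np.abs(fj - f) > eps).astype(float)         # where f_j still fails
    best_loss, best_fj = np.inf, fj
    for a in np.linspace(-3.0, 3.0, 61):             # brute-force h(x) = a*x + b
        for b in np.linspace(-3.0, 3.0, 61):
            Th = I * (a * xs + b) + (1.0 - I) * xs   # T_{h,j} = I*h + (1 - I)*id
            cand = np.interp(Th, xs, fj)             # f_j o T_{h,j} via grid
                                                     # interpolation (clamped
                                                     # outside [0, 1])
            loss = np.mean((f - cand) ** 2)
            if loss < best_loss:
                best_loss, best_fj = loss, cand
    fj = best_fj                                     # f_{j+1} = f_j o T_{h_j, j}

mse_final = np.mean((f - fj) ** 2)
failure_measure = np.mean(np.abs(f - fj) > eps)      # empirical D_eps(f, f_n)
```

Because the identity map (a = 1, b = 0) is in the search grid, each layer of composition can only improve the fit, which is the "systematic improvement" claimed on the next slide.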


SLIDE 31

Advantage of multi-level composition

For any given approximator, this algorithm systematically improves its performance by adding one more layer of composition. The performance improvement can be quantified by

    D_ε(f, g ∘ T) = D_ε(f, g) · ((1 − r)/(1 − a))^p,

where a, r, p can be estimated at each stage to see if we can go further. This procedure also naturally picks up some multi-scale structure.


SLIDE 33

Ideas

Classical approximation subdivides the domain. The key to a good approximation is to reproduce polynomials locally; the smoothness of f is needed. It is a local approach (e.g. Riemann integration, the TV method). The alternative approach subdivides the range. The key to a good approximation is the location, volume, and geometry of f^{-1}(B_i); the smoothness of f is no longer important. It is non-local (e.g. Lebesgue integration, the non-local TV method). Our theory and algorithm iteratively rearrange f^{-1}(B_i) by constructing T so that it matches g^{-1}(B_i); consequently, g ∘ T approximates f well.

[Figure: the domain of f partitioned into the preimages f^{-1}(B_i) of range bins B_i.]

SLIDE 34

A Binary Classification Toy Problem

[Figure: panels showing the labels f(x) ∈ {0, 1} over (x1, x2); the initial approximation f0 and its error set; the transformed coordinates T0(x); the indicator I0; and the accuracy of the compositions f0, . . . , f5 against fully connected baselines FC-1, FC-2, FC-3.]

SLIDE 35

Other Classification and Regression Benchmarks

[Figure: train/test accuracy versus composition step f0, . . . , f5 on (a) MNIST, (b) Fashion-MNIST, (c) SGEMM [1].]

Remark: For the image classification problems, h and I are composed of small convolution blocks with 4-32 channels and 2-4 layers each; f0 is linear.

Q. Li, Z. Shen, and C. Tai. Deep approximation of functions via composition (2019).

[1] Cedric Nugteren and Valeriu Codreanu. MCSoC, 2015. (http://ieeexplore.ieee.org/document/7328205/)

SLIDE 36

Other Classification and Regression Benchmarks

[Figure: the same benchmarks as the previous slide, panels (d) MNIST, (e) Fashion-MNIST, (f) SGEMM [1].]

Remark: The last problem is regression, with fully connected blocks for h and I. "Accuracy" is defined as in the preceding theory: D_ε(f, f_j) = µ{|f − f_j| > ε}. Here we take ε = 0.1.

Q. Li, Z. Shen, and C. Tai. Deep approximation of functions via composition (2019).

[1] Cedric Nugteren and Valeriu Codreanu. MCSoC, 2015. (http://ieeexplore.ieee.org/document/7328205/)

SLIDE 37

Outline

1. Introduction to approximation theory

2. Approximation of functions by compositions

3. Approximation rate in terms of the number of neurons

SLIDE 38

The Best N-term Approximation via Dictionary with Compositions (with Haizhao Yang and Shijun Zhang)

N-term approximation: Given a dictionary D and f, the best n-term approximation from D is to find φ*_i ∈ D and a*_i ∈ R such that g = Σ_{i=1}^n a*_i φ*_i is a solution of

    inf_{a_i ∈ R, φ_i ∈ D} ‖f − Σ_{i=1}^n a_i φ_i‖.
SLIDE 41

The Best N-term Approximation via Dictionary with Compositions

The first dictionary is defined as D_1 := {σ(W · x + b) : W ∈ R^d, b ∈ R}. Each element of D_1 is a piecewise linear function. When d = 1, for an arbitrary Lipschitz continuous f on [0, 1], the best n-term approximation from D_1 achieves the approximation rate O(n^{-1}).
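This O(n^{-1}) rate is easy to check numerically: for d = 1 a sum of n ReLU terms is exactly a piecewise linear function with n knots, so interpolating f at equispaced knots realizes an n-term element of D_1. A sketch (the Lipschitz test function is an assumption of the demo):

```python
import numpy as np

# Piecewise linear interpolation at n equispaced knots = an n-term ReLU sum.
# For Lipschitz f the uniform error decays like O(n^{-1}).
f = lambda x: np.abs(x - 0.3) + 0.5 * np.abs(x - 0.7)   # Lipschitz (illustrative)
x = np.linspace(0.0, 1.0, 10001)

def pwl_error(n):
    knots = np.linspace(0.0, 1.0, n + 1)
    return np.max(np.abs(f(x) - np.interp(x, knots, f(knots))))

e8, e16, e32 = pwl_error(8), pwl_error(16), pwl_error(32)
# Doubling the number of knots roughly halves the worst-case error.
```

The error is concentrated at the kinks of f, and shrinks proportionally to the knot spacing, which is the O(n^{-1}) behavior quoted above.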

SLIDE 45

The Best N-term Approximation via Dictionary with Compositions

Dictionary via compositions: Choosing h_1, h_2, . . . , h_n ∈ D_1 and denoting the column vector [h_1, h_2, . . . , h_n]^T by h, the second dictionary is defined as D_2 := {σ(W · h + b) : W ∈ R^n, b ∈ R}. Each element of D_2 is a composition of piecewise linear functions. Compositions of piecewise linear functions are still piecewise linear functions, so this process can continue inductively to derive the multilayer composition dictionaries D_3, . . . , D_L.

SLIDE 48

The Best N-term Approximation via Dictionary with Compositions

The n-term approximation from D_2 can be implemented numerically by ReLU networks with 2 hidden layers. When d = 1, for any Lipschitz continuous f on [0, 1], the best n-term approximation from D_2 achieves the approximation rate O(n^{-2}).

SLIDE 52

The Best N-term Approximation via Dictionary with Compositions

dictionary    corresponding network    approximation rate
D_1           1 hidden layer           O(n^{-1})
D_2           2 hidden layers          O(n^{-2})

The dictionary with composition improves the n-term approximation rate! For a fixed L ≥ 3, can the dictionary D_L attain the n-term approximation rate O(n^{-L})?

SLIDE 57

The answer is unlikely, unless one does something else... Given L ≥ 1, there exists f with Lipschitz constant 1 such that the n-term approximation error from D_L cannot be better than O(n^{-(2+ρ)}) for sufficiently large n and any ρ > 0. "Multilayer composition multiplies the approximation rate" is only true for 2 hidden layers, not for L ≥ 3. That means one cannot expect to reach the n-term approximation rate O(n^{-L}) for the multilayer composition dictionary D_L with fixed L ≥ 3.

How about the case d > 1? For any Lipschitz continuous f on [0, 1]^d, the best n-term approximation from the dictionary with composition achieves the approximation rate O(n^{-2/d}).

Z. Shen, H. Yang, and S. Zhang. Nonlinear Approximation via Compositions. arXiv:1902.10170 (2019).

SLIDE 60

Approximation Rate of ReLU Networks

For given N, L ∈ N+ with L > 1, design a ReLU network φ with O(NL) neurons. Question: what is the approximation rate of this ReLU network? Suppose f is Lipschitz with constant ν; then

    ‖f − φ‖_{L^p([0,1]^d)} ≤ 40 ν √d N^{-2/d} L^{-2/d}, for p ∈ [1, ∞).

When d > 1, the width is max{8d⌊N^{1/d}⌋ + 4d, 12N + 14}.
slide-61
SLIDE 61

Approximation Rate of ReLU Networks

For general continuous functions, define the modulus of continuity, for any r > 0, as ωf(r) := sup{|f(x) − f(y)| : x, y ∈ [0, 1]d, |x − y| ≤ r}.

SLIDE 62

Approximation Rate of ReLU Networks

For general continuous functions, define the modulus of continuity, for any r > 0, as

    ω_f(r) := sup{|f(x) − f(y)| : x, y ∈ [0, 1]^d, |x − y| ≤ r}.

Theorem. Let f be continuous. For all L > 1, N ∈ N+ and all p ∈ [1, ∞), there exists a ReLU network φ with width max{8d⌊N^{1/d}⌋ + 4d, 12N + 14} and depth 9L + 12 such that

    ‖f − φ‖_{L^p([0,1]^d)} ≤ 5 ω_f(8 √d N^{-2/d} L^{-2/d}).

The rate O(N^{-2/d} L^{-2/d}) is nearly optimal.
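The bound above can be evaluated once ω_f is estimated. A small d = 1 sketch (the non-Lipschitz target and the grid-based estimate of ω_f are assumptions of this demo):

```python
import numpy as np

f = lambda x: np.sqrt(np.abs(x - 0.5))     # continuous on [0,1] but not Lipschitz
xs = np.linspace(0.0, 1.0, 200001)
fx = f(xs)
dx = xs[1] - xs[0]

def omega(r):
    """Grid estimate of omega_f(r) = sup{|f(x)-f(y)| : |x-y| <= r}.
    For this target the sup is attained at |x - y| = r (concave modulus),
    so a single shift of about r suffices for the demo."""
    k = max(1, int(round(r / dx)))
    return float(np.max(np.abs(fx[k:] - fx[:-k])))

d = 1
bounds = []
for N, L in [(4, 4), (8, 8), (16, 16)]:
    r = 8.0 * np.sqrt(d) * N ** (-2.0 / d) * L ** (-2.0 / d)
    bounds.append(5.0 * omega(r))   # theorem bound 5 * omega_f(8 sqrt(d) N^{-2/d} L^{-2/d})
# Larger N and L shrink the argument of omega_f, hence the guaranteed error.
```

Note that the bound degrades gracefully for merely continuous f: it is driven entirely by how fast ω_f(r) → 0, with no Lipschitz constant required.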

SLIDE 67

Approximation Rate of ReLU Networks

1. N ≥ 1, L = 1: rate O(N^{-1/d}), well known.

2. N = 2d + 10, L sufficiently large: rate O(L^{-2/d}), Yarotsky, 2018.

3. N ≥ 1, L = 2, 3: rate O(N^{-2/d}), implied by the results of N-term approximation, Shen, Yang, Zhang, 2019.

4. N ≥ 1, L ≥ 1: rate O(N^{-2/d} L^{-2/d}), Shen, Yang, Zhang, 2019.

SLIDE 70

Approximation on a Low-Dimensional Manifold

Question: When data is concentrated around a low-dimensional manifold, can we use the intrinsic dimension in the approximation rate estimate?

Answer: Yes. We can achieve the rate O(N^{-2/d_δ} L^{-2/d_δ}) for Lipschitz functions on a small neighborhood of a d_M-dimensional manifold M ⊆ [0, 1]^d, where d_δ = O(d_M ln d).

How about general continuous functions? We can extend the result to arbitrary continuous functions by using ω_f(·) as defined previously.

SLIDE 73

Approximation on a Low-Dimensional Manifold

Define the ε-neighborhood of a d_M-dimensional manifold M ⊆ [0, 1]^d as

    M_ε := {x ∈ [0, 1]^d : inf{|x − y| : y ∈ M} ≤ ε}.

Let ρ(·) be a probability density function (PDF) supported on M_ε, and let µ_ρ(·) be its measure, i.e. µ_ρ(E) := ∫_E ρ(x) dx for any measurable set E.

Theorem. Let f ∈ C([0, 1]^d). For all N, L ∈ N+ and ε ∈ (0, 1) (ε can be O(N^{-2/d_δ} L^{-2/d_δ})), there exists a ReLU network φ with width max{8d_δ⌊N^{1/d_δ}⌋ + 4d_δ, 12N + 14} and depth 9L + 12 such that

    ‖f − φ‖_{L^p([0,1]^d, µ_ρ)} ≤ 3 ω_f(8d ε) + 5 ω_f(32d N^{-2/d_δ} L^{-2/d_δ}),

where d_δ = O(d_M ln d) and p ∈ [1, ∞).

Zuowei Shen, Haizhao Yang, Shijun Zhang. Deep Network Approximation Characterized by Number of Neurons. 2019.

SLIDE 74

Happy Birthday John!

http://www.math.nus.edu.sg/~matzuows/