slide-1
SLIDE 1

UAT: From Shallow to Deep

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

January 30, 2020

1 / 22

slide-7
SLIDE 7

Logistics

– LaTeX source of homework posted in Canvas (Thanks to Logan Stapleton!)
– Mind your LaTeX! Mind your math!
  * Ten Signs a Claimed Mathematical Breakthrough is Wrong
  * Paper Gestalt (50%/18%, 2009) ⇒ Deep Paper Gestalt (50%/0.4%, 2018)
– Matrix Cookbook? Yes and No

2 / 22

slide-8
SLIDE 8

Outline

– Recap and more thoughts
– From shallow to deep NNs

3 / 22

slide-9
SLIDE 9

Supervised learning as function approximation

– Underlying true function: f0
– Training data: yi ≈ f0(xi)
– Choose a family of functions H, so that there exists f ∈ H with f close to f0
– Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids? etc.)
– Optimization & generalization: how to find the best f ∈ H matters

We focus on approximation capacity now.

4 / 22

slide-12
SLIDE 12

Approximation capacities of NNs

– A single neuron has limited capacity
– Deep NNs with linear activations are no better: the stacked layers collapse into a single linear map
– Add in both depth and nonlinear activation: a two-layer network with a nonlinear hidden layer and linear activation at the output

Universal approximation theorem (informal): the 2-layer network can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide.
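To see why stacking linear layers adds nothing, here is a minimal NumPy check (layer widths, seed, and values are arbitrary illustrative choices): the composition of linear maps is itself one linear map.

```python
import numpy as np

# A "deep" network with identity (linear) activations equals a single linear layer.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(5, 4))
W3 = rng.normal(size=(1, 5))

x = rng.normal(size=3)
deep_linear = W3 @ (W2 @ (W1 @ x))      # three stacked linear layers
collapsed = (W3 @ W2 @ W1) @ x          # one equivalent linear map

print(np.allclose(deep_linear, collapsed))  # True: no capacity gained from depth alone
```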

5 / 22

slide-13
SLIDE 13

[A] universal approximation theorem (UAT)

Theorem (UAT, [Cybenko, 1989, Hornik, 1991])

Let σ : R → R be a nonconstant, bounded, and continuous function. Let Im denote the m-dimensional unit hypercube [0, 1]^m, and let C(Im) denote the space of real-valued continuous functions on Im. Then, given any ε > 0 and any function f ∈ C(Im), there exist an integer N, real constants vi, bi ∈ R, and real vectors wi ∈ R^m for i = 1, . . . , N, such that we may define

F(x) = ∑_{i=1}^{N} vi σ(wi⊺ x + bi)

as an approximate realization of the function f; that is,

|F(x) − f(x)| < ε for all x ∈ Im.
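A minimal numerical illustration of this form (not from the slides; the target function, width, and sampling are illustrative choices): freeze random hidden parameters wi, bi and fit only the output weights vi by least squares to a continuous 1-D target on [0, 1].

```python
import numpy as np

# F(x) = sum_i v_i * sigmoid(w_i * x + b_i) fitted to a 1-D continuous target on [0, 1].
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 200                                   # hidden width
w = rng.normal(scale=10.0, size=N)        # w_i
b = rng.uniform(-10.0, 10.0, size=N)      # b_i

x = np.linspace(0.0, 1.0, 500)
f = np.sin(6 * np.pi * x)                 # some continuous target on [0, 1]

Phi = sigmoid(np.outer(x, w) + b)         # Phi[j, i] = sigmoid(w_i * x_j + b_i)
v, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # solve for output weights only

F = Phi @ v
print("max |F - f| on the grid:", np.abs(F - f).max())
```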

6 / 22

slide-19
SLIDE 19

Thoughts

– Approximate continuous functions with vector outputs, i.e., Im → R^n? Think of the component functions.
– Map to [0, 1], {−1, +1}, or [0, ∞)? Choose an appropriate activation σ at the output,

  F(x) = σ(∑_{i=1}^{N} vi σ(wi⊺ x + bi)),

  ... universality holds in modified form.
– Get deeper? Three-layer NN? Change to matrix-vector notation for convenience and write F(x) = w⊺σ(W2 σ(W1 x + b1) + b2) as ∑_k wk gk(x): the wk's linearly combine functions gk that all share the same two-layer form (see the sketch below).
– For geeks: approximate both f and f′? Check out [Hornik et al., 1990].
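A minimal shape-level sketch of that rewriting (all sizes and values are arbitrary): the three-layer forward pass equals a linear combination of the hidden outputs gk(x), each of which is itself a two-layer network.

```python
import numpy as np

# F(x) = w^T sigma(W2 sigma(W1 x + b1) + b2) read as sum_k w_k g_k(x),
# where g_k(x) is the k-th coordinate of sigma(W2 sigma(W1 x + b1) + b2).
def sigma(z):
    return np.tanh(z)

rng = np.random.default_rng(1)
m, h1, h2 = 3, 8, 5                       # input dim and the two hidden widths
W1, b1 = rng.normal(size=(h1, m)), rng.normal(size=h1)
W2, b2 = rng.normal(size=(h2, h1)), rng.normal(size=h2)
w = rng.normal(size=h2)

x = rng.normal(size=m)
g = sigma(W2 @ sigma(W1 @ x + b1) + b2)   # g_k(x), k = 1, ..., h2
F = w @ g                                 # three-layer network output
F_as_combo = sum(w[k] * g[k] for k in range(h2))

print(np.allclose(F, F_as_combo))         # True
```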

7 / 22

slide-24
SLIDE 24

Learn to take square-root

Suppose we lived in a time when the square-root was not yet defined ...

– Training data: {(xi, xi^2)}_i, where xi ∈ R
– Forward (squaring): if x → y, then −x → y also
– To invert, what should the output be? What if we just throw in the training data and fit (see the sketch below)?
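A minimal experiment along these lines (the architecture, optimizer, and ranges are illustrative choices, not from the slides): regress x from x^2 when both signs of x appear in the training data. Since +x and −x share the same input x^2, a least-squares fit is pulled toward their average, roughly 0, rather than toward either square root.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-2.0, 2.0, 401).unsqueeze(1)   # targets x_i, both signs present
y = x ** 2                                        # inputs x_i^2

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    opt.zero_grad()
    loss = ((net(y) - x) ** 2).mean()             # MSE regression of x from x^2
    loss.backward()
    opt.step()

print(net(torch.tensor([[4.0]])))                 # typically near 0, not near +2 or -2
```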

8 / 22

slide-25
SLIDE 25

Visual “proof” of UAT

9 / 22

slide-29
SLIDE 29

What about ReLU?

(figures: a ReLU, and a difference of ReLU's)

– What happens when the slopes of the ReLU's are changed?
– How general can σ be? ... it is enough that σ is not a polynomial [Leshno et al., 1993]
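A minimal sketch of the difference-of-ReLU's construction (offsets and slopes are illustrative): scaling the slope controls how sharply the ramp rises, so a steep difference of ReLU's behaves like a step, and sums of such pieces give staircase-like approximations.

```python
import numpy as np

# a*relu(x - t) - a*relu(x - t - 1/a): rises linearly from 0 to 1 on [t, t + 1/a],
# flat elsewhere. A larger slope a makes it closer to a unit step at x = t.
def relu(z):
    return np.maximum(z, 0.0)

def ramp(x, t, a):
    return a * relu(x - t) - a * relu(x - t - 1.0 / a)

x = np.linspace(-1.0, 2.0, 7)
print(ramp(x, t=0.0, a=1.0))     # gentle ramp
print(ramp(x, t=0.0, a=100.0))   # nearly a step at x = 0
```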

10 / 22

slide-30
SLIDE 30

Outline

– Recap and more thoughts
– From shallow to deep NNs

11 / 22

slide-34
SLIDE 34

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large?

What happens in 1D? Assume the target f is 1-Lipschitz, i.e.,

|f(x) − f(y)| ≤ |x − y|, ∀ x, y ∈ R.

For ε accuracy, need about 1/ε bumps.
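A minimal numeric check of that 1/ε counting (the target function and ε are illustrative): approximate a 1-Lipschitz f on [0, 1] by a function that is constant on 1/ε intervals of width ε; since f moves by at most ε across each interval, the error stays within ε.

```python
import numpy as np

# Staircase approximation of a 1-Lipschitz function with about 1/eps constant "bumps".
f = np.cos                        # 1-Lipschitz on [0, 1]
eps = 0.05
n_pieces = 20                     # about 1/eps pieces
edges = np.linspace(0.0, 1.0, n_pieces + 1)
mids = (edges[:-1] + edges[1:]) / 2.0

x = np.linspace(0.0, 1.0, 10_001)
idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_pieces - 1)
staircase = f(mids)[idx]          # piecewise-constant approximation

print(n_pieces, np.abs(f(x) - staircase).max())   # 20 pieces, max error below eps
```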

12 / 22

slide-37
SLIDE 37

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large?

What happens in 2D? Visual proof in 2D first: σ(w⊺x + b), with σ the sigmoid, approaches a 2D step function as w is made large.

Credit: CMU 11-785
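A minimal numeric illustration of that limit (the direction w, offset b, and grid are illustrative): scaling up w pushes σ(w⊺x + b) toward 0/1 on either side of the line w⊺x + b = 0.

```python
import numpy as np

# sigmoid(c * (w.x + b)) on a 2-D grid: as the scale c grows, values concentrate
# near 0 on one side of the line w.x + b = 0 and near 1 on the other.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))  # clip to avoid overflow warnings

w, b = np.array([1.0, -1.0]), 0.2
xx, yy = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
lin = w[0] * xx + w[1] * yy + b

for c in (1.0, 10.0, 100.0):
    vals = sigmoid(c * lin)
    print(c, float(vals.min()), float(vals.max()))   # values spread out toward {0, 1}
```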

13 / 22

slide-41
SLIDE 41

Visual proof for 2D functions

Keep increasing the number of step functions that are distributed evenly ...

Image Credit: CMU 11-785

14 / 22

slide-47
SLIDE 47

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large? What happens in 2D?

Image Credit: CMU 11-785

Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ ‖x − y‖_2, ∀ x, y ∈ R^2. For ε accuracy, need O(ε^-2) bumps. What about the n-D case? O(ε^-n).
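To put numbers on that O(ε^-n) count (ε = 0.1 is an illustrative choice): covering the unit cube with bumps of side ε needs about (1/ε)^n of them, which explodes with dimension.

```python
# Bump count (1/eps)^n for covering the unit cube in n dimensions.
eps = 0.1
for n in (1, 2, 10, 100):
    print(n, (1.0 / eps) ** n)   # 10, 100, 1e10, 1e100
```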

15 / 22

slide-50
SLIDE 50

What’s good about deep NNs?

– Learn Boolean functions (f : {+1, −1}^n → {+1, −1}): DNNs can have #nodes linear in n, whereas a 2-layer NN needs exponentially many nodes (more in HW1; see the sketch below)
– What general functions set deep and shallow NNs apart? One family: compositional functions [Poggio et al., 2017]
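As one concrete, purely illustrative example of how depth exploits Boolean structure, here is parity on ±1 bits computed by a balanced binary tree of 2-input XORs: depth about log2(n) and roughly n − 1 gates, each realizable by a constant number of units, so the node count is linear in n. (Parity is a standard example with this tree structure; the specific function and lower bound treated in HW1 may differ.)

```python
import numpy as np

# With bits encoded as +1/-1 (0 -> +1, 1 -> -1), XOR of two bits is their product,
# and the parity of n bits is a balanced binary tree of 2-input XORs.
def parity_tree(bits):
    layer = list(bits)
    while len(layer) > 1:
        nxt = [layer[i] * layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:          # carry an unpaired bit up one level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = np.random.default_rng(2).choice([+1, -1], size=16)
print(parity_tree(bits), int(np.prod(bits)))   # the two agree
```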

16 / 22

slide-51
SLIDE 51

Compositional functions

W_m^n: class of n-variable functions with partial derivatives up to m-th order
W_m^{n,2} ⊂ W_m^n: the compositional subclass following binary tree structures

from [Poggio et al., 2017]; see Sec 4.2 of [Poggio et al., 2017] for the lower bound
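A minimal sketch of such a binary-tree compositional function (the bivariate constituent h below is an arbitrary smooth choice): an 8-variable function built entirely from two-variable pieces. A deep network can mirror the tree, approximating each bivariate node with a small sub-network, which is the mechanism behind the favorable rates in [Poggio et al., 2017].

```python
import numpy as np

# f(x1, ..., x8) composed along a balanced binary tree of bivariate functions h.
def h(a, b):
    return np.tanh(a + 2.0 * b)          # illustrative smooth two-variable constituent

def f(x):                                # x has 8 entries
    l1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    l2 = [h(l1[0], l1[1]), h(l1[2], l1[3])]
    return h(l2[0], l2[1])

print(f(np.arange(8, dtype=float) / 8.0))
```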

17 / 22

slide-53
SLIDE 53

Nonsmooth activation

(figures: a terse version of UAT; shallow vs. deep, from [Poggio et al., 2017])

18 / 22

slide-55
SLIDE 55

Width-bounded DNNs

– Narrower than n + 4 is fine, but no narrower than n − 1 (from [Lu et al., 2017]; see also [Kidger and Lyons, 2019])
– Deep vs. shallow: still an active area of research

19 / 22

slide-57
SLIDE 57

Number one principle of DL

Fundamental theorem of DNNs: universal approximation theorems

Fundamental slogan of DL: where there is a mapping, there is a NN ... and fit it!

20 / 22

slide-58
SLIDE 58

References i

[Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314.

[Hornik, 1991] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

[Hornik et al., 1990] Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560.

[Kidger and Lyons, 2019] Kidger, P. and Lyons, T. (2019). Universal approximation with deep narrow networks. arXiv:1905.08539.

[Leshno et al., 1993] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

[Lu et al., 2017] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239.

21 / 22

slide-59
SLIDE 59

References ii

[Poggio et al., 2017] Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519.

22 / 22