Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems (PowerPoint PPT presentation)


SLIDE 1

Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems

Vaggos Chatziafratis (Stanford & Google NY)

Sai Ganesh Nagarajan (SUTD)

Ioannis Panageas (SUTD => UC Irvine)

SLIDE 2

Deep Neural Networks

Are Deeper NNs more powerful?

SLIDE 3

Approximation Theory (1885-today)

  • ReLU activation units
  • Semi-algebraic units [Telgarsky ’15,’16]: piecewise polynomials, max/min gates, and (boosted) decision trees

SLIDE 4

Expressivity of NNs

Which functions can NNs approximate?

Cybenko [1989]:

Any continuous function can be approximated arbitrarily well by a 1-hidden-layer sigmoid net (with "some", possibly very large, width).
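As a toy illustration of the theorem's flavor (this is a minimal sketch, not Cybenko's actual construction: the hidden layer is fixed as equispaced sigmoid steps and only the output-layer weights are fit by least squares; all names and constants here are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on [0, 1].
x = np.linspace(0.0, 1.0, 400)
target = np.sin(2.0 * np.pi * x)

# One hidden layer of sigmoids with fixed, equispaced transition points;
# only the output-layer weights are trained (here: linear least squares).
width = 200
centers = np.linspace(0.0, 1.0, width)
H = sigmoid(200.0 * (x[:, None] - centers[None, :]))   # hidden activations
out_w, *_ = np.linalg.lstsq(H, target, rcond=None)

max_err = np.max(np.abs(H @ out_w - target))           # sup error on the grid
```

With enough width the error can be driven arbitrarily low, which is the theorem's promise; the catch, stressed on the next slides, is how large that width must be.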

SLIDE 5


in practice: bounded resources!

SLIDE 6

Depth Separation Results

Is there a function expressible by a deep NN that cannot be approximated with a much wider shallow NN?

Yes! Challenging!

SLIDE 7


L=100: 400 vs 10,000 ReLUs

Tent or Triangle map

SLIDE 8

Prior Work

[Telgarsky’15,’16] Tantalizing open questions:

  • 1. Can we understand larger families of functions?
  • 2. Why is the tent map suitable to prove depth separations? (what if we slightly tweak the tent map?)

[Figure: plot of f(x) vs x]
SLIDE 13
Connections to Dynamical Systems [ICLR’20]: Our work in ICML 2020

  • 1. We get L1-approximation error and not just classification error.
  • 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
  • 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
  • 4. Experimental validation of our theoretical results.

SLIDE 14

Tent Map (by Telgarsky)

SLIDE 15

exponentially many bumps

Repeated Compositions

SLIDE 16

[Figure: plot of f^6(x) on [0,1], showing exponentially many bumps]

ReLU NN: the number of linear regions doubles with each composition (2^n linear regions after n compositions).
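The bump counting above can be reproduced in a few lines: the tent map t(x) = min(2x, 2 - 2x) is exactly representable with two ReLUs on [0,1] (t(x) = 2 relu(x) - 4 relu(x - 1/2)), so n compositions cost O(n) ReLUs of depth, yet t^n has 2^n linear pieces and 2^(n-1) bumps. A small numerical check (function names are ours):

```python
import numpy as np

def tent(x):
    """Telgarsky's tent (triangle) map on [0, 1]: slope +2, then -2."""
    return np.minimum(2.0 * x, 2.0 - 2.0 * x)

def compose_tent(x, n):
    """n-fold composition t^n(x); each step costs only 2 ReLUs in depth."""
    for _ in range(n):
        x = tent(x)
    return x

def count_bumps(n, grid=2**16):
    """Count local maxima of t^n on a dyadic grid (exact in float arithmetic)."""
    x = np.linspace(0.0, 1.0, grid + 1)
    y = compose_tent(x, n)
    d = np.diff(y)
    return int(np.sum((d[:-1] > 0) & (d[1:] < 0)))

for n in (1, 2, 3, 6):
    print(n, count_bumps(n))  # 1, 2, 4, 32: doubles with each composition
```

A shallow ReLU net with W units has only O(W) linear pieces, so matching t^n in width alone needs exponentially many units.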

SLIDE 17

Our starting observation: Period 3

SLIDE 18

Li-Yorke Chaos (1975): "period three implies chaos". A continuous interval map with a point of period 3 has points of every period, plus an uncountable scrambled set.

SLIDE 19

Sharkovsky’s Theorem (1964)
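Sharkovsky's theorem totally orders the naturals as 3 ≻ 5 ≻ 7 ≻ … ≻ 2·3 ≻ 2·5 ≻ … ≻ 4·3 ≻ 4·5 ≻ … ≻ 8 ≻ 4 ≻ 2 ≻ 1: if a continuous interval map has a point of least period p, it also has points of every period that comes after p in this ordering (so period 3 forces all periods). The ordering can be encoded as a sort key; a minimal sketch (the helper name is ours):

```python
def sharkovsky_key(n):
    """Sort key realizing Sharkovsky's ordering: odd numbers > 1 first,
    then 2*odd, 4*odd, ..., and finally powers of two in decreasing order."""
    a = 0
    while n % 2 == 0:
        n //= 2
        a += 1
    # n is now the odd part; a is the exponent of the power-of-two factor.
    if n > 1:
        return (0, a, n)      # 2^a * odd: smaller a first, then smaller odd
    return (1, -a, 0)         # pure powers of two: larger powers come first

periods = [1, 2, 3, 4, 5, 6, 8, 12]
print(sorted(periods, key=sharkovsky_key))  # [3, 5, 6, 12, 8, 4, 2, 1]
```

Sorting by this key puts 3 first and 1 last, matching the ordering above: period 3 is the strongest possible assumption on a map.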

SLIDE 21

Period-dependent Trade-offs [ICLR 2020]

Main Lemma:

SLIDE 22

Period-dependent Trade-offs [ICLR 2020]

Main Lemma:

Informal Main Result:


SLIDE 24

Examples [ICLR 2020]

[Figures: plots of f(x) vs x for maps with period 3, period 4, and period 5]


SLIDE 26

Further connections to Dynamical Systems: Our work in ICML 2020

SLIDE 27

Further connections to Dynamical Systems: Our work in ICML 2020

  • 1. We get L1-approximation error and not just classification error.

SLIDE 28

Further connections to Dynamical Systems:

Our work in ICML 2020

Is it so hard to obtain L1 guarantees? Period 3 of f only informs us about 3 values of f.

SLIDE 29
Further connections to Dynamical Systems: Our work in ICML 2020

SLIDE 30

Periods, Oscillations, Lipschitz

Lemma (Lower Bound on L):

Informal Main Result (Lipschitz matches oscillations):

SLIDE 31

Proof Sketch

Definitions:

Fact [Telgarsky’16]:

SLIDE 32

Proof Sketch

Definitions:

Claim:

SLIDE 33

Proof Sketch

SLIDE 34
Further connections to Dynamical Systems: Our work in ICML 2020

SLIDE 35

Periods, Oscillations

If f has period p, how many oscillations?

Main Lemma:

Period-specific threshold phenomenon:

SLIDE 36

Proof Sketch: if f has period p, how many oscillations? The number of oscillations grows at a rate given by the root of a characteristic polynomial.

SLIDE 37

Proof Sketch: if f has period p, how many oscillations?

SLIDE 38

Tight examples - Sensitivity

A function of period p whose Lipschitz constant matches the oscillation growth. If the slope is less than 1.618 (the golden ratio), then no period 3 appears.
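The sensitivity claim is easy to check numerically for the family f(x) = s|x| - 1 (the map used in our experiments): solutions of f^3(x) = x are exactly the points of period 1 or 3, so counting them on either side of the threshold shows where a 3-cycle is born. A small sketch (function names are ours):

```python
import numpy as np

def count_period3_roots(s, grid=200_001):
    """Count solutions of f^3(x) = x on [-1, 1] for f(x) = s|x| - 1,
    via sign changes of g(x) = f^3(x) - x on a fine grid."""
    x = np.linspace(-1.0, 1.0, grid)
    y = x.copy()
    for _ in range(3):
        y = s * np.abs(y) - 1.0
    g = np.sign(y - x)
    g = g[g != 0]                      # drop exact zeros, if any
    return int(np.sum(g[:-1] != g[1:]))

phi = (1.0 + 5.0 ** 0.5) / 2.0         # golden ratio, about 1.618
print(count_period3_roots(1.55))       # 1: only the fixed point, no 3-cycle
print(count_period3_roots(1.70))       # 7: the fixed point plus two 3-cycles
```

At s = φ exactly, {0, -1, φ - 1} is a 3-cycle: f(0) = -1, f(-1) = φ - 1, and f(φ - 1) = φ² - φ - 1 = 0; below φ only the fixed point survives.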

SLIDE 39
Further connections to Dynamical Systems: Our work in ICML 2020

SLIDE 40

Experimental Section Goals:

  • 1. Instantiate benefits of depth for a period-specific task.
  • 2. Validate our theoretical threshold for separating shallow NNs from deep.

Setting: f(x) = 1.618|x| - 1
Training: define a regression task on 10K datapoints chosen uniformly at random by evaluating f. We use Adam as the optimizer and train for 1500 epochs.
Overfitting: not a concern; we are interested in representation.
Width: 20; #layers: 1 up to 5.
Easy Task: we take only 8 compositions of f.
Hard Task: we take 40 compositions of f.
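The data for both tasks can be generated directly; a minimal sketch of the setup (we assume the domain [-1, 1]; variable names are ours, and the Adam training loop itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The map from the slides: slope 1.618, at the period-3 threshold."""
    return 1.618 * np.abs(x) - 1.0

def compose(x, k):
    """k-fold composition f^k(x): k = 8 (easy task), k = 40 (hard task)."""
    for _ in range(k):
        x = f(x)
    return x

# 10K datapoints chosen uniformly at random, labeled by evaluating f^k.
X = rng.uniform(-1.0, 1.0, size=10_000)
y_easy = compose(X, 8)
y_hard = compose(X, 40)
```

Each (X, y) pair then feeds a width-20 MLP regressor of 1 to 5 layers; the hard task's 40 compositions oscillate far more than such shallow nets can track.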

SLIDE 41

[Figures: regression error vs depth for the easy task; classification error vs depth for the easy task, appearing in our ICLR 2020 paper]

Easy Task: we take only 8 compositions of f. Adding depth does help in reducing error.

SLIDE 42

Hard Task: we take 40 compositions of f. Error (blue line) is independent of depth and is extremely close to the theoretical bound (orange line).

slide-43
SLIDE 43

Recap

A natural property of continuous functions: the period.

  • 1. Sharp depth-width trade-offs and L1-separations; simple constructions useful for proving separations.
  • 2. Tight connections between Lipschitz constant, periods, and oscillations.

Future Work

  • Understanding optimization (e.g., Malach, Shalev-Shwartz ’19)
  • Unifying notions of complexity used for separations: trajectory length, global curvature, algebraic varieties
  • Topological entropy from Dynamical Systems

SLIDE 44

Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems

Vaggos Chatziafratis (Stanford & Google NY)

Sai Ganesh Nagarajan (SUTD)

Ioannis Panageas (SUTD => UC Irvine)

ICLR 2020 spotlight talk: https://iclr.cc/virtual_2020/poster_BJe55gBtvH.html

MIT Mifods Talk by Panageas (2020): https://www.youtube.com/watch?v=HNQ204BmOQ8