Better Depth-Width Trade-offs for Neural Networks through the Lens of Dynamical Systems


  1. Better Depth-Width Trade-offs for Neural Networks through the Lens of Dynamical Systems. Ioannis Panageas (SUTD => UC Irvine), Vaggos Chatziafratis (Stanford & Google NY), Sai Ganesh Nagarajan (SUTD).

  2. Deep Neural Networks: Are deeper NNs more powerful?

  3. Approximation Theory (1885–today). ReLU activation units; semi-algebraic units [Telgarsky '15, '16]: piecewise polynomials, max/min gates, and (boosted) decision trees.

  4–5. Expressivity of NNs: Which functions can NNs approximate? Cybenko [1989]: any continuous function can be represented by a 1-hidden-layer sigmoid net (with "some" width). In practice, though: bounded resources!
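For reference, the standard statement behind this slide (a paraphrase of Cybenko's theorem, not text from the deck):

```latex
% Cybenko (1989), universal approximation; standard statement, paraphrased:
% for any continuous f on [0,1]^d and any eps > 0 there exist N, a_i, b_i, w_i with
\sup_{x \in [0,1]^d} \Bigl|\, f(x) - \sum_{i=1}^{N} a_i\, \sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon ,
% where sigma is any continuous sigmoidal function. The width N may need to be
% enormous, which is exactly the "bounded resources" caveat on the slide.
```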

  6–7. Depth Separation Results: Is there a function expressible by a deep NN that cannot be approximated by a much wider shallow NN? Yes! (But proving it is challenging.) Example: L = 100, 400 vs. 10,000 ReLUs; the tent (triangle) map.
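For concreteness, the tent (triangle) map used in these separations is the standard one (definition supplied here; it matches the unit-interval plots later in the deck):

```latex
% Tent map on [0,1] (standard definition):
f(x) =
\begin{cases}
2x, & 0 \le x \le \tfrac12, \\
2(1-x), & \tfrac12 < x \le 1,
\end{cases}
% and the n-fold composition f^{(n)} has 2^{n-1} "tents", i.e. 2^n monotone
% linear pieces: oscillations grow exponentially with the number of compositions.
```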

  8–12. Prior Work [Telgarsky '15, '16]. Tantalizing open questions: 1. Can we understand larger families of functions? 2. Why is the tent map suitable for proving depth separations? (What if we slightly tweak the tent map?) [Figure: plots of f(x) on [0, 5].]

  13. Our Work in ICML 2020. Connections to Dynamical Systems [ICLR '20]: 1. We get L1-approximation error, not just classification error. 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations. 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples. 4. Experimental validation of our theoretical results.

  14. Tent Map (by Telgarsky)

  15. Repeated Compositions: exponentially many bumps.

  16. Repeated Compositions as a ReLU NN: the number of linear regions grows exponentially. [Figure: plot of f^(6)(x) on [0, 1], showing exponentially many bumps.]
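A minimal numerical sketch (my own illustration, not code from the paper) of how the tent map is exactly a two-ReLU layer, and of the exponential growth in linear regions under composition; the identity tent(x) = relu(2x) − relu(4x − 2) on [0, 1] is easy to verify case by case:

```python
import numpy as np

def tent(x):
    """Tent map written exactly as a two-ReLU layer: relu(2x) - relu(4x-2) on [0,1]."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(2 * x) - relu(4 * x - 2)

def monotone_pieces(ys):
    """Count maximal monotone pieces via direction changes on a fine grid."""
    s = np.sign(np.diff(ys))
    s = s[s != 0]                         # ignore flat steps
    return 1 + int(np.sum(s[1:] != s[:-1]))

xs = np.linspace(0.0, 1.0, 100_001)
ys = xs.copy()
for n in range(1, 7):
    ys = tent(ys)                         # f^(n): depth n, only 2 ReLUs per layer
    print(n, monotone_pieces(ys))         # prints 2, 4, 8, ..., 64 = 2^n pieces

# A 1-hidden-layer ReLU net has at most (width + 1) linear pieces, so matching
# f^(6) exactly would need width >= 2^6 - 1: depth buys pieces exponentially.
```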

  17. Our starting observation: Period 3

  18. Li-Yorke Chaos (1975)
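The theorem behind this slide (the standard statement of Li–Yorke, supplied here for readability; the scraped slide shows only the title):

```latex
% Li & Yorke (1975), "Period three implies chaos"; standard statement:
% if f : [a,b] -> [a,b] is continuous and has a point of period 3, then
% (i) f has periodic points of every period n >= 1, and (ii) there exists an
% uncountable scrambled set S such that for all x != y in S:
\limsup_{n\to\infty} \bigl| f^{(n)}(x) - f^{(n)}(y) \bigr| > 0 ,
\qquad
\liminf_{n\to\infty} \bigl| f^{(n)}(x) - f^{(n)}(y) \bigr| = 0 .
```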

  19–20. Sharkovsky's Theorem (1964)
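Sharkovsky's ordering of the natural numbers (standard statement, added for readability): if a continuous interval map has a point of period p, it has points of every period q appearing after p in the ordering below; period 3 sits first and hence forces all periods.

```latex
% Sharkovsky's ordering (standard):
3 \triangleright 5 \triangleright 7 \triangleright \cdots
\triangleright 2\cdot 3 \triangleright 2\cdot 5 \triangleright \cdots
\triangleright 2^{2}\cdot 3 \triangleright 2^{2}\cdot 5 \triangleright \cdots
\triangleright 2^{3} \triangleright 2^{2} \triangleright 2 \triangleright 1 .
```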

  21–23. Period-dependent Trade-offs [ICLR 2020]. Main Lemma: [stated on slide]. Informal Main Result: [stated on slide].

  24. Examples [ICLR 2020]: maps of period 3, period 4, and period 5. [Figure: plots of f(x) on [0, 5] for each period.]

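Since the lemma itself did not survive the scrape, here is a hedged reconstruction of the shape of these trade-offs, assembled from the rest of the deck (the exact constants and conditions are in the ICLR 2020 paper):

```latex
% Hedged reconstruction, shape only; the exact statement is in the ICLR 2020 paper.
% (1) If f has a periodic point of period p, then f^{(n)} oscillates at least
%     \rho(p)^n times for a period-dependent rate \rho(p) > 1 (for p = 3,
%     slides 38 and 40 suggest \rho = \varphi \approx 1.618).
% (2) A ReLU network of depth L and width w is piecewise linear with at most
%     roughly (2w)^L pieces (a piece-counting bound of Telgarsky's type).
% Combining (1) and (2), capturing the oscillations of f^{(n)} requires
(2w)^{L} \;\ge\; \rho(p)^{\,n}
\quad\Longrightarrow\quad
w \;\ge\; \tfrac{1}{2}\, \rho(p)^{\,n/L} ,
% i.e. the width must grow exponentially in n/L: a depth-width trade-off.
```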

  26–27. Our Work in ICML 2020 (agenda recap; see slide 13). Next: 1. L1-approximation error, not just classification error.

  28. Our Work in ICML 2020. Is it so hard to obtain L1 guarantees? A period-3 orbit of f only informs us about 3 values of f.
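To spell this out (my own one-line illustration): a period-3 orbit constrains f at just three points, so by itself it cannot control an average-case (L1) error over the whole interval:

```latex
% A period-3 orbit \{x_0, x_1, x_2\} pins down f at only three points,
f(x_0) = x_1 , \qquad f(x_1) = x_2 , \qquad f(x_2) = x_0 ,
% and leaves f completely unconstrained between them, which is why L1
% (average-case) guarantees need more than the periodic orbit itself.
```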

  29. Agenda recap (see slide 13). Next: 2. Tight connections between the Lipschitz constant, periods of f, and oscillations.

  30. Periods, Oscillations, Lipschitz. Lemma (lower bound on L): [stated on slide]. Informal Main Result (Lipschitz matches oscillations): [stated on slide].
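A hedged sketch of the mechanism, from standard facts rather than the slide's exact lemma: Lipschitz constants multiply under composition, and each full oscillation of fixed height forces slope, so exponentially many oscillations force a matching Lipschitz constant.

```latex
% Sketch from standard facts (not the slide's exact statements):
% Lipschitz constants multiply under composition,
\mathrm{Lip}\bigl(f^{(n)}\bigr) \;\le\; \mathrm{Lip}(f)^{\,n} ,
% while t full oscillations of amplitude h over an interval of length \ell force
\mathrm{Lip}\bigl(f^{(n)}\bigr) \;\ge\; \frac{2\,t\,h}{\ell} .
% With t \approx \rho^n oscillations, taking n-th roots gives \mathrm{Lip}(f) \ge \rho:
% the Lipschitz constant must match the oscillation growth rate.
```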

  31. Proof Sketch. Definitions: [on slide]. Fact [Telgarsky '16]: [on slide].

  32. Proof Sketch. Definitions: [on slide]. Claim: [on slide].

  33. Proof Sketch

  34. Agenda recap (see slide 13). Next: 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.

  35. Periods, Oscillations: If f has period p, how many oscillations? Main Lemma: [stated on slide]. Period-specific threshold phenomenon: [stated on slide].

  36–37. Proof Sketch: If f has period p, how many oscillations? The oscillation count grows like the root of a characteristic polynomial.
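A quick numerical check of this growth rate (my own sketch, using the map f(x) = 1.618|x| − 1 that appears later in the experiments): counting monotone pieces of f^(n) on a fine grid, the ratio of consecutive counts should approach the characteristic root, here the golden ratio ≈ 1.618.

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2                  # golden ratio, ~1.618

def f(x):
    """The talk's example map f(x) = 1.618|x| - 1, a self-map of [-1, 1]."""
    return PHI * np.abs(x) - 1

def monotone_pieces(ys):
    """Count maximal monotone pieces via direction changes on a fine grid."""
    s = np.sign(np.diff(ys))
    s = s[s != 0]
    return 1 + int(np.sum(s[1:] != s[:-1]))

xs = np.linspace(-1.0, 1.0, 2_000_001)
ys = xs.copy()
counts = []
for n in range(1, 16):
    ys = f(ys)
    counts.append(monotone_pieces(ys))

print(counts)                              # oscillation (lap) counts of f^(n)
print([round(b / a, 3) for a, b in zip(counts, counts[1:])])  # ratios -> ~1.618
```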


  38. Tight Examples and Sensitivity. A function of period p whose Lipschitz constant matches the oscillation growth. Sensitivity: if the slope is less than 1.618, then no period 3 appears.
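The 1.618 threshold is, presumably (it matches the map f(x) = 1.618|x| − 1 used in the experiments), the golden ratio:

```latex
% The threshold is (an assumption, but consistent with the deck's use of
% f(x) = 1.618|x| - 1) the golden ratio:
\varphi \;=\; \frac{1+\sqrt{5}}{2} \;\approx\; 1.618 ,
\qquad \varphi^{2} = \varphi + 1 \quad (\text{positive root of } x^{2} - x - 1 = 0) .
% For slopes below \varphi the map s|x| - 1 has no period-3 orbit; at \varphi
% one appears, so the Lipschitz bound matches the oscillation growth exactly.
```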

  39. Agenda recap (see slide 13). Next: 4. Experimental validation of our theoretical results.

  40. Experimental Section. Goals: 1. Instantiate the benefits of depth for a period-specific task. 2. Validate our theoretical threshold for separating shallow NNs from deep ones. Setting: f(x) = 1.618|x| − 1; width 20; #layers from 1 up to 5. Easy task: only 8 compositions of f. Hard task: 40 compositions of f. Training: a regression task on 10K data points chosen uniformly at random, labeled by evaluating f; Adam optimizer, trained for 1500 epochs. Overfitting: not a concern here, since we are interested in representation.
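A minimal PyTorch sketch of this setup as described on the slide; width 20, depths 1–5, Adam, 1500 epochs, and 10K uniform samples come from the slide, while the learning rate, full-batch training, and the sampling interval [−1, 1] are my assumptions:

```python
import torch
import torch.nn as nn

PHI = 1.618  # slope from the slides: f(x) = 1.618|x| - 1

def f_composed(x, n):
    """Apply f(x) = 1.618|x| - 1 n times (n = 8: easy task, n = 40: hard task)."""
    for _ in range(n):
        x = PHI * x.abs() - 1
    return x

def make_mlp(depth, width=20):
    """ReLU MLP: `depth` hidden layers of `width` units, scalar input and output."""
    layers, d_in = [], 1
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.ReLU()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    return nn.Sequential(*layers)

# 10K points; the sampling range [-1, 1] is an assumption (f maps it into itself).
x = torch.rand(10_000, 1) * 2 - 1
y = f_composed(x, n=8)            # easy task; use n=40 for the hard task

for depth in range(1, 6):
    model = make_mlp(depth)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a guess
    for epoch in range(1500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    print(f"depth={depth}  final MSE={loss.item():.5f}")
```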

  41. Easy Task: we take only 8 compositions of f. [Figures: classification error vs. depth for the easy task, as appearing in our ICLR 2020 paper; regression error vs. depth for the easy task.] Adding depth does help in reducing the error.

  42. Hard Task: we take 40 compositions of f. The error (blue line) is independent of depth and is extremely close to the theoretical bound (orange line).

  43. Recap. A natural property of continuous functions: the period. 1. Sharp depth-width trade-offs and L1-separations. 2. Tight connections between Lipschitz constants, periods, and oscillations. 3. Simple constructions useful for proving separations. Future Work: understanding optimization (e.g., Malach & Shalev-Shwartz '19); unifying the notions of complexity used for separations (trajectory length, global curvature, algebraic varieties); topological entropy from dynamical systems.

  44. Better Depth-Width Trade-offs for Neural Networks through the Lens of Dynamical Systems. MIT MIFODS talk by Panageas (2020): https://www.youtube.com/watch?v=HNQ204BmOQ8 ; ICLR 2020 spotlight talk: https://iclr.cc/virtual_2020/poster_BJe55gBtvH.html . Ioannis Panageas (SUTD => UC Irvine), Vaggos Chatziafratis (Stanford & Google NY), Sai Ganesh Nagarajan (SUTD).
