SLIDE 1
Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems
Vaggos Chatziafratis (Stanford & Google NY)
Sai Ganesh Nagarajan (SUTD)
Ioannis Panageas (SUTD => UC Irvine)
SLIDE 2
Deep Neural Networks
Are Deeper NNs more powerful?
SLIDE 3
Approximation Theory (1885-today)
ReLU activation units
Semi-algebraic units [Telgarsky '15, '16]: piecewise polynomials, max/min gates, and (boosted) decision trees
SLIDE 4
Expressivity of NNs
Which functions can NNs approximate?
Cybenko [1989]:
Any continuous function can be approximated by a net with one hidden layer of sigmoid units (with "some" width).
SLIDE 5
in practice: bounded resources!
SLIDE 6
Depth Separation Results
Is there a function expressible by a deep NN that cannot be approximated with a much wider shallow NN?
Yes! But proving such a separation is challenging.
SLIDE 7
Tent or Triangle map
L = 100: 400 vs. 10,000 ReLUs
SLIDE 8
Prior Work [Telgarsky '15, '16]
Tantalizing open questions:
- 1. Can we understand larger families of functions?
- 2. Why is the tent map suitable for proving depth separations?
(what if we slightly tweak the tent map?)
[Figure: plot of f(x) vs. x, both axes ranging from 0.5 to 5]
SLIDE 13
Connections to Dynamical Systems [ICLR'20]: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
- 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
- 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
- 4. Experimental validation of our theoretical results.
SLIDE 14
Tent Map (by Telgarsky)
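For reference (standard facts; the original slide showed only a plot): the tent map and its exact representation with two ReLUs,
f(x) = 2x for 0 <= x <= 1/2, f(x) = 2(1 - x) for 1/2 < x <= 1, i.e.,
f(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x - 1/2) on [0, 1].
Composing f with itself t times therefore costs a ReLU network only O(t) units, which is the seed of the depth separation.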
SLIDE 15
Repeated Compositions
exponentially many bumps
SLIDE 16
Repeated Compositions: exponentially many bumps
[Figure: plot of f^6(x) vs. x on [0, 1]]
ReLU NN: #linearRegions:
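A minimal NumPy sketch of the blow-up (illustrative only; the function names are ours, not from the talk): the t-fold composition f^t of the tent map is piecewise linear with 2^t pieces, i.e., exponentially many bumps.

import numpy as np

def tent(x):
    # Tent map: 2x on [0, 1/2], 2(1 - x) on (1/2, 1].
    return np.where(x <= 0.5, 2 * x, 2 * (1 - x))

def iterate(f, t, x):
    # Evaluate the t-fold composition f^t pointwise.
    for _ in range(t):
        x = f(x)
    return x

t = 6
xs = np.linspace(0.0, 1.0, 200_001)
ys = iterate(tent, t, xs)
# Each alternation in the sign of the discrete slope marks a new linear piece;
# drop exact zeros (samples straddling a peak) before counting alternations.
signs = np.sign(np.diff(ys))
signs = signs[signs != 0]
pieces = 1 + int(np.count_nonzero(signs[1:] != signs[:-1]))
print(pieces)  # 2**6 = 64 linear pieces for f^6, i.e., 32 bumps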
SLIDE 17
Our starting observation: Period 3
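Concretely, the tent map has a period-3 orbit, easy to verify by hand:
2/9 -> 4/9 -> 8/9 -> 2/9, since f(2/9) = 4/9, f(4/9) = 8/9, and f(8/9) = 2(1 - 8/9) = 2/9.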
SLIDE 18
Li-Yorke Chaos (1975): "period three implies chaos". If a continuous interval map has a point of period 3, then it has points of every period.
SLIDE 19
Sharkovsky’s Theorem (1964)
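For reference, the standard statement (rendered as an image on the slide). Sharkovsky's ordering of the natural numbers is
3 \succ 5 \succ 7 \succ \cdots \succ 2\cdot3 \succ 2\cdot5 \succ \cdots \succ 2^2\cdot3 \succ 2^2\cdot5 \succ \cdots \succ 2^3 \succ 2^2 \succ 2 \succ 1,
and the theorem says: if a continuous map f : [a, b] -> [a, b] has a point of least period m, then it has a point of least period n for every n with m \succ n. In particular, period 3 implies every period.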
SLIDE 21
Period-dependent Trade-offs [ICLR 2020]
Main Lemma:
SLIDE 22
Informal Main Result:
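The lemma and result bodies were figures on the original slides. A hedged sketch of their message, reconstructed from the surrounding slides (\rho_p > 1 denotes the period-dependent oscillation growth rate of Slides 35-37, and the piece-counting fact is the one cited on Slide 31):
If f has a point of period p, with p not a power of 2, then the t-fold composition f^{(t)} oscillates \Omega(\rho_p^t) times, while a ReLU network of depth d and width w is piecewise linear with at most w^{O(d)} pieces; hence approximating f^{(t)} forces w \ge \rho_p^{\Omega(t/d)}.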
SLIDE 24
Examples [ICLR 2020]
[Figure: example maps of periods 3, 4, and 5; each panel plots f(x) vs. x with axes from 0.5 to 5]
SLIDE 26
Further connections to Dynamical Systems: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
- 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
- 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
- 4. Experimental validation of our theoretical results.
SLIDE 27
Further connections to Dynamical Systems: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
SLIDE 28
Further connections to Dynamical Systems: our work in ICML 2020
Is it so hard to obtain L1 guarantees? A period-3 orbit only informs us about 3 values of f, yet an L1 bound must control the function on a set of large measure.
SLIDE 29
Further connections to Dynamical Systems: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
- 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
- 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
- 4. Experimental validation of our theoretical results.
SLIDE 30
Periods, Oscillations, Lipschitz
Lemma (Lower Bound on L):
Informal Main Result (Lipschitz matches oscillations):
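The precise statements were images. One concrete instance, consistent with the 1.618 threshold appearing on Slides 38 and 40: for period 3, the Lipschitz constant L of f must satisfy
L \ge \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618,
the golden ratio, i.e., the largest root of z^2 - z - 1.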
SLIDE 31
Proof Sketch
Definitions:
Fact [Telgarsky '16]:
SLIDE 32
Proof Sketch
Definitions:
Claim:
SLIDE 33
Proof Sketch
SLIDE 34
Further connections to Dynamical Systems: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
- 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
- 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
- 4. Experimental validation of our theoretical results.
SLIDE 35
Periods, Oscillations
If f has period p, how many oscillations?
Main Lemma:
Period-specific threshold phenomenon:
SLIDE 36
Proof Sketch: if f has period p, how many oscillations?
The number of oscillations grows like the largest root of a characteristic polynomial.
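A worked instance for period 3 (the general recurrence was an image; this case matches the 1.618 threshold elsewhere in the deck): if the period-3 oscillation counts satisfy a Fibonacci-type recurrence
O_t = O_{t-1} + O_{t-2},
then the characteristic polynomial is z^2 - z - 1, whose largest root is \rho_3 = \varphi \approx 1.618.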
SLIDE 37
Proof Sketch: if f has period p, how many oscillations?
SLIDE 38
Tight examples & sensitivity
A function of period p whose Lipschitz constant matches the oscillation growth rate.
Sensitivity: if the slope is less than 1.618 (the golden ratio), then no period-3 point appears.
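A quick check that the 1.618 threshold is exact: at slope \varphi (using \varphi^2 = \varphi + 1), the map f(x) = \varphi|x| - 1 from Slide 40 has the period-3 orbit \{0, -1, \varphi - 1\}:
f(0) = -1, \quad f(-1) = \varphi - 1, \quad f(\varphi - 1) = \varphi(\varphi - 1) - 1 = \varphi^2 - \varphi - 1 = 0.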
SLIDE 39
Further connections to Dynamical Systems: our work in ICML 2020
- 1. We get L1-approximation error and not just classification error.
- 2. We show tight connections between the Lipschitz constant, periods of f, and oscillations.
- 3. Sharper period-dependent depth-width trade-offs and easy constructions of examples.
- 4. Experimental validation of our theoretical results.
SLIDE 40
Experimental Section Goals:
- 1. Instantiate the benefits of depth for a period-specific task.
- 2. Validate our theoretical threshold for separating shallow NNs from deep NNs.
Setting: f(x) = 1.618|x| - 1
Training: a regression task on 10K datapoints chosen uniformly at random by evaluating f. We use Adam as the optimizer and train for 1500 epochs.
Overfitting: not a concern, since we are interested in representation power.
Width: 20; #layers: 1 up to 5.
Easy task: only 8 compositions of f. Hard task: 40 compositions of f.
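A minimal sketch of this training setup, assuming PyTorch; the learning rate, the full-batch loop, and the helper names are illustrative simplifications, not the authors' exact code.

import torch
import torch.nn as nn

PHI = 1.618  # slope from the slide: f(x) = 1.618|x| - 1

def f(x):
    return PHI * x.abs() - 1

def target(x, compositions):
    # Evaluate the k-fold composition f^k (k = 8 easy, k = 40 hard).
    y = x
    for _ in range(compositions):
        y = f(y)
    return y

def mlp(depth, width=20):
    # ReLU MLP with `depth` hidden layers of the given width.
    layers, d_in = [], 1
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.ReLU()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    return nn.Sequential(*layers)

# 10K points uniform in [-1, 1]; f maps [-1, 1] into itself for slopes <= 2.
x = torch.rand(10_000, 1) * 2 - 1
y = target(x, compositions=8)  # easy task; use 40 for the hard task

for depth in range(1, 6):  # width 20, 1 up to 5 hidden layers, as on the slide
    net = mlp(depth)
    opt = torch.optim.Adam(net.parameters())
    for _ in range(1500):  # "epochs" here are full-batch steps, a simplification
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    print(f"depth={depth}  final MSE={loss.item():.4f}")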
SLIDE 41
Easy task (8 compositions of f): adding depth does help in reducing error.
[Figures: regression error vs. depth and classification error vs. depth for the easy task]
SLIDE 42
Hard task (40 compositions of f): error (blue line) is independent of depth and is extremely close to the theoretical bound (orange line).
[Figure: error vs. depth for the hard task]
SLIDE 43
Recap
A natural property of continuous functions: the period.
- 1. Sharp depth-width trade-offs and L1-separations; simple constructions useful for proving separations.
- 2. Tight connections between Lipschitz constants, periods, and oscillations.
Future Work
- Understanding optimization (e.g., Malach & Shalev-Shwartz '19)
- Unifying notions of complexity used for separations: trajectory length, global curvature, algebraic varieties
- Topological entropy from dynamical systems
SLIDE 44
Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems
Vaggos Chatziafratis (Stanford & Google NY)
Sai Ganesh Nagarajan (SUTD)
Ioannis Panageas (SUTD => UC Irvine)
ICLR 2020 spotlight talk: https://iclr.cc/virtual_2020/poster_BJe55gBtvH.html
MIT MIFODS talk by Panageas (2020): https://www.youtube.com/watch?v=HNQ204BmOQ8