slide-1
SLIDE 1

UAT: From Shallow to Deep

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

January 30, 2020

1 / 22

slide-7
SLIDE 7

Logistics

– LaTeX source of homework posted in Canvas (Thanks to Logan Stapleton!)
– Mind your LaTeX! Mind your math!
  * Ten Signs a Claimed Mathematical Breakthrough is Wrong
  * Paper Gestalt (50%/18%, 2009) ⇒ Deep Paper Gestalt (50%/0.4%, 2018)
– Matrix Cookbook? Yes and No

2 / 22

slide-8
SLIDE 8

Outline

– Recap and more thoughts
– From shallow to deep NNs

3 / 22

slide-9
SLIDE 9

Supervised learning as function approximation

– Underlying true function: f0
– Training data: yi ≈ f0(xi)
– Choose a family of functions H, so that there exists f ∈ H with f close to f0
– Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids? etc.)
– Optimization & generalization: how to find the best f ∈ H matters

We focus on approximation capacity now.

4 / 22

slide-12
SLIDE 12

Approximation capacities of NNs

– A single neuron has limited capacity
– Deep NNs with linear activations are no better: the stacked layers collapse into a single linear map
– Add in both depth and nonlinear activation: a two-layer network with a nonlinear hidden layer and linear activation at the output

Universal approximation theorem (informal): the 2-layer network can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide.
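To see why stacking linear layers adds nothing, here is a minimal NumPy check (layer widths, seed, and values are arbitrary illustrative choices): the composition of linear maps is itself one linear map.

```python
import numpy as np

# A "deep" network with identity (linear) activations equals a single linear layer.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(5, 4))
W3 = rng.normal(size=(1, 5))

x = rng.normal(size=3)
deep_linear = W3 @ (W2 @ (W1 @ x))      # three stacked linear layers
collapsed = (W3 @ W2 @ W1) @ x          # one equivalent linear map

print(np.allclose(deep_linear, collapsed))  # True: no capacity gained from depth alone
```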

5 / 22

slide-13
SLIDE 13

[A] universal approximation theorem (UAT)

Theorem (UAT, [Cybenko, 1989, Hornik, 1991])

Let σ : R → R be a nonconstant, bounded, and continuous function. Let Im denote the m-dimensional unit hypercube [0, 1]^m, and let C(Im) denote the space of real-valued continuous functions on Im. Then, given any ε > 0 and any function f ∈ C(Im), there exist an integer N, real constants vi, bi ∈ R, and real vectors wi ∈ R^m for i = 1, . . . , N, such that we may define

F(x) = ∑_{i=1}^{N} vi σ(wi⊺ x + bi)

as an approximate realization of the function f; that is,

|F(x) − f(x)| < ε for all x ∈ Im.
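A minimal numerical illustration of this form (not from the slides; the target function, width, and sampling are illustrative choices): freeze random hidden parameters wi, bi and fit only the output weights vi by least squares to a continuous 1-D target on [0, 1].

```python
import numpy as np

# F(x) = sum_i v_i * sigmoid(w_i * x + b_i) fitted to a 1-D continuous target on [0, 1].
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 200                                   # hidden width
w = rng.normal(scale=10.0, size=N)        # w_i
b = rng.uniform(-10.0, 10.0, size=N)      # b_i

x = np.linspace(0.0, 1.0, 500)
f = np.sin(6 * np.pi * x)                 # some continuous target on [0, 1]

Phi = sigmoid(np.outer(x, w) + b)         # Phi[j, i] = sigmoid(w_i * x_j + b_i)
v, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # solve for output weights only

F = Phi @ v
print("max |F - f| on the grid:", np.abs(F - f).max())
```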

6 / 22

slide-19
SLIDE 19

Thoughts

– Approximate continuous functions with vector outputs, i.e., Im → R^n? Think of the component functions.
– Map to [0, 1], {−1, +1}, or [0, ∞)? Choose an appropriate activation σ at the output,

  F(x) = σ(∑_{i=1}^{N} vi σ(wi⊺ x + bi)),

  ... universality holds in modified form.
– Get deeper? Three-layer NN? Change to matrix-vector notation for convenience and write F(x) = w⊺σ(W2 σ(W1 x + b1) + b2) as ∑_k wk gk(x): the wk's linearly combine functions gk that all share the same two-layer form (see the sketch below).
– For geeks: approximate both f and f′? Check out [Hornik et al., 1990].
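A minimal shape-level sketch of that rewriting (all sizes and values are arbitrary): the three-layer forward pass equals a linear combination of the hidden outputs gk(x), each of which is itself a two-layer network.

```python
import numpy as np

# F(x) = w^T sigma(W2 sigma(W1 x + b1) + b2) read as sum_k w_k g_k(x),
# where g_k(x) is the k-th coordinate of sigma(W2 sigma(W1 x + b1) + b2).
def sigma(z):
    return np.tanh(z)

rng = np.random.default_rng(1)
m, h1, h2 = 3, 8, 5                       # input dim and the two hidden widths
W1, b1 = rng.normal(size=(h1, m)), rng.normal(size=h1)
W2, b2 = rng.normal(size=(h2, h1)), rng.normal(size=h2)
w = rng.normal(size=h2)

x = rng.normal(size=m)
g = sigma(W2 @ sigma(W1 @ x + b1) + b2)   # g_k(x), k = 1, ..., h2
F = w @ g                                 # three-layer network output
F_as_combo = sum(w[k] * g[k] for k in range(h2))

print(np.allclose(F, F_as_combo))         # True
```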

7 / 22

slide-24
SLIDE 24

Learn to take square-root

Suppose we lived in a time when the square-root was not yet defined ...

– Training data: {(xi, xi^2)}_i, where xi ∈ R
– Forward (squaring): if x → y, then −x → y also
– To invert, what should the output be? What if we just throw in the training data and fit (see the sketch below)?
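A minimal experiment along these lines (the architecture, optimizer, and ranges are illustrative choices, not from the slides): regress x from x^2 when both signs of x appear in the training data. Since +x and −x share the same input x^2, a least-squares fit is pulled toward their average, roughly 0, rather than toward either square root.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-2.0, 2.0, 401).unsqueeze(1)   # targets x_i, both signs present
y = x ** 2                                        # inputs x_i^2

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    opt.zero_grad()
    loss = ((net(y) - x) ** 2).mean()             # MSE regression of x from x^2
    loss.backward()
    opt.step()

print(net(torch.tensor([[4.0]])))                 # typically near 0, not near +2 or -2
```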

8 / 22

slide-25
SLIDE 25

Visual “proof” of UAT

9 / 22

slide-29
SLIDE 29

What about ReLU?

(figures: a ReLU, and a difference of ReLU's)

– What happens when the slopes of the ReLU's are changed?
– How general can σ be? ... it is enough that σ is not a polynomial [Leshno et al., 1993]
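A minimal sketch of the difference-of-ReLU's construction (offsets and slopes are illustrative): scaling the slope controls how sharply the ramp rises, so a steep difference of ReLU's behaves like a step, and sums of such pieces give staircase-like approximations.

```python
import numpy as np

# a*relu(x - t) - a*relu(x - t - 1/a): rises linearly from 0 to 1 on [t, t + 1/a],
# flat elsewhere. A larger slope a makes it closer to a unit step at x = t.
def relu(z):
    return np.maximum(z, 0.0)

def ramp(x, t, a):
    return a * relu(x - t) - a * relu(x - t - 1.0 / a)

x = np.linspace(-1.0, 2.0, 7)
print(ramp(x, t=0.0, a=1.0))     # gentle ramp
print(ramp(x, t=0.0, a=100.0))   # nearly a step at x = 0
```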

10 / 22

slide-30
SLIDE 30

Outline

– Recap and more thoughts
– From shallow to deep NNs

11 / 22

slide-34
SLIDE 34

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large?

What happens in 1D? Assume the target f is 1-Lipschitz, i.e.,

|f(x) − f(y)| ≤ |x − y|, ∀ x, y ∈ R.

For ε accuracy, need about 1/ε bumps.
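A minimal numeric check of that 1/ε counting (the target function and ε are illustrative): approximate a 1-Lipschitz f on [0, 1] by a function that is constant on 1/ε intervals of width ε; since f moves by at most ε across each interval, the error stays within ε.

```python
import numpy as np

# Staircase approximation of a 1-Lipschitz function with about 1/eps constant "bumps".
f = np.cos                        # 1-Lipschitz on [0, 1]
eps = 0.05
n_pieces = 20                     # about 1/eps pieces
edges = np.linspace(0.0, 1.0, n_pieces + 1)
mids = (edges[:-1] + edges[1:]) / 2.0

x = np.linspace(0.0, 1.0, 10_001)
idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_pieces - 1)
staircase = f(mids)[idx]          # piecewise-constant approximation

print(n_pieces, np.abs(f(x) - staircase).max())   # 20 pieces, max error below eps
```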

12 / 22

slide-37
SLIDE 37

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large?

What happens in 2D? Visual proof in 2D first: σ(w⊺x + b), with σ the sigmoid, approaches a 2D step function as w is made large.

Credit: CMU 11-785
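A minimal numeric illustration of that limit (the direction w, offset b, and grid are illustrative): scaling up w pushes σ(w⊺x + b) toward 0/1 on either side of the line w⊺x + b = 0.

```python
import numpy as np

# sigmoid(c * (w.x + b)) on a 2-D grid: as the scale c grows, values concentrate
# near 0 on one side of the line w.x + b = 0 and near 1 on the other.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))  # clip to avoid overflow warnings

w, b = np.array([1.0, -1.0]), 0.2
xx, yy = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
lin = w[0] * xx + w[1] * yy + b

for c in (1.0, 10.0, 100.0):
    vals = sigmoid(c * lin)
    print(c, float(vals.min()), float(vals.max()))   # values spread out toward {0, 1}
```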

13 / 22

slide-41
SLIDE 41

Visual proof for 2D functions

Keep increasing the number of step functions that are distributed evenly ...

Image Credit: CMU 11-785

14 / 22

slide-47
SLIDE 47

What’s bad about shallow NNs?

From UAT, “... there exist an integer N, ...”, but how large? What happens in 2D?

Image Credit: CMU 11-785

Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ ‖x − y‖_2, ∀ x, y ∈ R^2. For ε accuracy, need O(ε^-2) bumps. What about the n-D case? O(ε^-n).
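To put numbers on that O(ε^-n) count (ε = 0.1 is an illustrative choice): covering the unit cube with bumps of side ε needs about (1/ε)^n of them, which explodes with dimension.

```python
# Bump count (1/eps)^n for covering the unit cube in n dimensions.
eps = 0.1
for n in (1, 2, 10, 100):
    print(n, (1.0 / eps) ** n)   # 10, 100, 1e10, 1e100
```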

15 / 22

slide-50
SLIDE 50

What’s good about deep NNs?

– Learn Boolean functions (f : {+1, −1}^n → {+1, −1}): DNNs can have #nodes linear in n, whereas a 2-layer NN needs exponentially many nodes (more in HW1; see the sketch below)
– What general functions set deep and shallow NNs apart? One family: compositional functions [Poggio et al., 2017]
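As one concrete, purely illustrative example of how depth exploits Boolean structure, here is parity on ±1 bits computed by a balanced binary tree of 2-input XORs: depth about log2(n) and roughly n − 1 gates, each realizable by a constant number of units, so the node count is linear in n. (Parity is a standard example with this tree structure; the specific function and lower bound treated in HW1 may differ.)

```python
import numpy as np

# With bits encoded as +1/-1 (0 -> +1, 1 -> -1), XOR of two bits is their product,
# and the parity of n bits is a balanced binary tree of 2-input XORs.
def parity_tree(bits):
    layer = list(bits)
    while len(layer) > 1:
        nxt = [layer[i] * layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:          # carry an unpaired bit up one level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = np.random.default_rng(2).choice([+1, -1], size=16)
print(parity_tree(bits), int(np.prod(bits)))   # the two agree
```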

16 / 22

slide-51
SLIDE 51

Compositional functions

W_m^n: class of n-variable functions with partial derivatives up to m-th order
W_m^{n,2} ⊂ W_m^n: the compositional subclass following binary tree structures

from [Poggio et al., 2017]; see Sec 4.2 of [Poggio et al., 2017] for the lower bound
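A minimal sketch of such a binary-tree compositional function (the bivariate constituent h below is an arbitrary smooth choice): an 8-variable function built entirely from two-variable pieces. A deep network can mirror the tree, approximating each bivariate node with a small sub-network, which is the mechanism behind the favorable rates in [Poggio et al., 2017].

```python
import numpy as np

# f(x1, ..., x8) composed along a balanced binary tree of bivariate functions h.
def h(a, b):
    return np.tanh(a + 2.0 * b)          # illustrative smooth two-variable constituent

def f(x):                                # x has 8 entries
    l1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    l2 = [h(l1[0], l1[1]), h(l1[2], l1[3])]
    return h(l2[0], l2[1])

print(f(np.arange(8, dtype=float) / 8.0))
```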

17 / 22

slide-53
SLIDE 53

Nonsmooth activation

(figures: a terse version of UAT; shallow vs. deep, from [Poggio et al., 2017])

18 / 22

slide-55
SLIDE 55

Width-bounded DNNs

– Narrower than n + 4 is fine, but no narrower than n − 1 (from [Lu et al., 2017]; see also [Kidger and Lyons, 2019])
– Deep vs. shallow: still an active area of research

19 / 22

slide-57
SLIDE 57

Number one principle of DL

Fundamental theorem of DNNs: universal approximation theorems

Fundamental slogan of DL: where there is a mapping, there is a NN ... and fit it!

20 / 22

slide-58
SLIDE 58

References i

[Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314.

[Hornik, 1991] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

[Hornik et al., 1990] Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560.

[Kidger and Lyons, 2019] Kidger, P. and Lyons, T. (2019). Universal approximation with deep narrow networks. arXiv:1905.08539.

[Leshno et al., 1993] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

[Lu et al., 2017] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239.

21 / 22

slide-59
SLIDE 59

References ii

[Poggio et al., 2017] Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519.

22 / 22