UAT: From Shallow to Deep
  1. UAT: From Shallow to Deep. Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities. January 30, 2020.

  2. Logistics
     – LaTeX source of the homework is posted on Canvas (thanks to Logan Stapleton!)
     – Mind LaTeX! Mind your math!
       * Ten Signs a Claimed Mathematical Breakthrough is Wrong
       * Paper Gestalt (50% / 18%, 2009) ⇒ Deep Paper Gestalt (50% / 0.4%, 2018)
     – Matrix Cookbook? Yes and no

  3. Outline
     – Recap and more thoughts
     – From shallow to deep NNs

  4. Supervised learning as function approximation
     – Underlying true function: f_0
     – Training data: y_i ≈ f_0(x_i)
     – Choose a family of functions H so that there exists f ∈ H close to f_0
     – Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids? etc.; see the sketch below)
     – Optimization & generalization: how to find the best f ∈ H matters
     We focus on approximation capacity now.
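To make "the choice of H matters" concrete, here is a minimal sketch (my own illustration, not from the slides) that fits two hypothesis classes, degree-1 and degree-9 polynomials, to noisy samples of a hypothetical target f_0(x) = sin(2πx); the richer class tracks f_0 far more closely.

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = lambda x: np.sin(2 * np.pi * x)           # hypothetical underlying true function

# Training data: y_i ≈ f0(x_i)
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = f0(x_train) + 0.05 * rng.standard_normal(50)

x_test = np.linspace(0.0, 1.0, 1000)
for degree in (1, 9):                          # two choices of the hypothesis class H
    coeffs = np.polyfit(x_train, y_train, degree)
    err = np.max(np.abs(np.polyval(coeffs, x_test) - f0(x_test)))
    print(f"degree {degree}: max |f - f0| ≈ {err:.3f}")
```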

  5. Approximation capacities of NNs
     – A single neuron has limited capacity
     – Deep NNs with linear activations are no better (see the check below)
     – Add both depth and nonlinear activations ⇒ universal approximation theorem: a two-layer network (with linear activation at the output) can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide.
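A quick numpy check of the second bullet (my own sketch, not from the slides): composing linear layers without a nonlinearity collapses to a single linear map, so depth alone adds no approximation capacity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
# Equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_linear_layers, one_linear_layer))  # True
```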

  6. [A] universal approximation theorem (UAT)
     Theorem (UAT, [Cybenko, 1989, Hornik, 1991]). Let σ : R → R be a nonconstant, bounded, and continuous function. Let I_m denote the m-dimensional unit hypercube [0, 1]^m, and let C(I_m) denote the space of real-valued continuous functions on I_m. Then, given any ε > 0 and any function f ∈ C(I_m), there exist an integer N, real constants v_i, b_i ∈ R, and real vectors w_i ∈ R^m for i = 1, ..., N, such that we may define
        F(x) = ∑_{i=1}^{N} v_i σ(w_i^⊺ x + b_i)
     as an approximate realization of the function f; that is, |F(x) − f(x)| < ε for all x ∈ I_m.
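A minimal numeric illustration of the theorem's form (my own sketch; the random-feature fitting below is a convenience for the demo, not part of the theorem): draw the w_i, b_i at random, fit only the output weights v_i by least squares, and watch the error of F(x) = ∑_i v_i σ(w_i^⊺ x + b_i) shrink as the width N grows, here for a hypothetical target f(x) = sin(2πx) on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # a nonconstant, bounded, continuous σ
f = lambda x: np.sin(2 * np.pi * x)            # hypothetical target in C([0, 1])

x_fit  = np.linspace(0.0, 1.0, 500)
x_test = np.linspace(0.0, 1.0, 2000)
for N in (5, 50, 200):                         # width of the single hidden layer
    w = rng.normal(0.0, 20.0, size=N)          # random hidden weights ...
    b = -w * rng.uniform(0.0, 1.0, size=N)     # ... with transitions placed inside [0, 1]
    features = lambda u: sigmoid(np.outer(u, w) + b)
    v, *_ = np.linalg.lstsq(features(x_fit), f(x_fit), rcond=None)  # fit output weights v_i only
    print(N, np.max(np.abs(features(x_test) @ v - f(x_test))))      # error (typically) shrinks with N
```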

  7. Thoughts
     – Approximate continuous functions with vector outputs, i.e., I_m → R^n? Think of the component functions.
     – Map to [0, 1], {−1, +1}, [0, ∞)? Choose an appropriate activation σ at the output,
          F(x) = σ(∑_{i=1}^{N} v_i σ(w_i^⊺ x + b_i));
       universality holds in modified form.
     – Get deeper? Three-layer NN? Change to matrix-vector notation for convenience and write
          F(x) = w^⊺ σ(W_2 σ(W_1 x + b_1) + b_2) = ∑_k w_k g_k(x),
       i.e., use the w_k's to linearly combine the functions g_k (see the sketch below).
     – For geeks: approximate both f and f′? Check out [Hornik et al., 1990].
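A small numpy check of the rewriting in the third bullet (my own sketch): the three-layer network with a linear output is exactly a linear combination, with weights w_k, of the functions g_k computed by the units of the last hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.standard_normal((8, 3)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((6, 8)), rng.standard_normal(6)
w = rng.standard_normal(6)
x = rng.standard_normal(3)

def g(k, x):
    # Output of the k-th unit in the last hidden layer: itself a two-layer sub-network.
    return sigmoid(W2[k] @ sigmoid(W1 @ x + b1) + b2[k])

F = w @ sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)          # three-layer NN, linear output
F_as_sum = sum(w[k] * g(k, x) for k in range(len(w)))    # same thing, written as Σ_k w_k g_k(x)
print(np.allclose(F, F_as_sum))                          # True
```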

  8. Learn to take square-root
     Suppose we lived in a time when the square root was not yet defined ...
     – Training data: (x_i, x_i^2), where x_i ∈ R
     – Forward: if x ↦ y, then −x ↦ y also
     – To invert, what should we output? What if we just throw in the training data? (See the sketch below.)
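A minimal numpy sketch of what can go wrong (my own illustration; the slide only poses the question): regress x_i against x_i^2 with x_i drawn from both signs. For each input value y = x^2 the data contain the conflicting targets +√y and −√y, so a least-squares fit is pulled toward their average, roughly 0, rather than toward either branch of the square root.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)          # both signs present
inputs, targets = x**2, x                      # try to learn the inverse map x^2 -> x

# Simple nonlinear regression with random sigmoid features, fit by least squares.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w = rng.normal(0.0, 20.0, size=100)
b = -w * rng.uniform(0.0, 1.0, size=100)
features = lambda u: sigmoid(np.outer(u, w) + b)
v, *_ = np.linalg.lstsq(features(inputs), targets, rcond=None)

test = np.array([0.04, 0.25, 0.81])            # square roots would be 0.2, 0.5, 0.9 (or the negatives)
print(features(test) @ v)                      # instead, predictions hover near 0
```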

  9. Visual "proof" of UAT (see the sketch below).
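The slide's visuals are not reproduced here; below is a minimal numpy sketch of the standard construction behind such a visual proof (my own illustration): a difference of two steep sigmoids acts as an approximate indicator "bump", and a sum of bumps weighted by the target's values approximates the target on [0, 1].

```python
import numpy as np

sigmoid = lambda z: 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable logistic sigmoid

def bump(x, left, right, k):
    # Difference of two steep sigmoids: ≈ 1 on (left, right), ≈ 0 elsewhere.
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

def bump_approx(f, x, n_bumps):
    # Tile slightly past [0, 1] to avoid boundary effects; weight each bump by f at its center.
    edges = np.linspace(-0.05, 1.05, n_bumps + 1)
    k = 200.0 * n_bumps                              # steep enough that each bump saturates
    return sum(f(0.5 * (l + r)) * bump(x, l, r) for l, r in zip(edges[:-1], edges[1:]))

f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t        # hypothetical continuous target
x = np.linspace(0.0, 1.0, 2000)
for n in (10, 100, 1000):
    print(n, np.max(np.abs(bump_approx(f, x, n) - f(x))))   # error shrinks as n grows
```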

  10. What about ReLU?
      [Figures: a ReLU; a difference of two ReLUs]
      What happens when the slopes of the ReLUs are changed? (See the sketch below.)
      How general can σ be? It is enough that σ is not a polynomial [Leshno et al., 1993].
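A minimal numpy sketch for the ReLU case (my own illustration): by choosing the slope changes of shifted ReLUs, their sum reproduces any piecewise-linear interpolant of the target, and adding knots drives the error down.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def relu_interpolant(f, knots):
    # Piecewise-linear interpolant of f written as a sum of shifted ReLUs:
    #   F(x) = f(knots[0]) + Σ_i (s_i - s_{i-1}) * relu(x - knots[i]),  with s_{-1} = 0,
    # where s_i is the slope of the interpolant on [knots[i], knots[i+1]].
    slopes = np.diff(f(knots)) / np.diff(knots)
    coeffs = np.diff(slopes, prepend=0.0)            # slope change contributed at each knot
    return lambda x: f(knots[0]) + sum(
        c * relu(x - t) for c, t in zip(coeffs, knots[:-1]))

f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t        # hypothetical continuous target
x = np.linspace(0.0, 1.0, 2000)
for n_knots in (11, 101, 1001):
    F = relu_interpolant(f, np.linspace(0.0, 1.0, n_knots))
    print(n_knots, np.max(np.abs(F(x) - f(x))))      # error shrinks as knots are added
```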

  11. Outline
      – Recap and more thoughts
      – From shallow to deep NNs

  12. What's bad about shallow NNs?
      From the UAT, "... there exist an integer N ...", but how large?
      What happens in 1D? Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ |x − y| for all x, y ∈ R. For ε accuracy, we need about 1/ε bumps (a short calculation follows).
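A one-step calculation behind the 1/ε count (my own filling-in, using the bump construction sketched earlier, in which each bump is a difference of two sigmoids and so costs a constant number of hidden units):

```latex
% Tile [0,1] with N bumps of width 1/N and set F equal to f(c_i) on the i-th bump (center c_i).
% For x in the i-th bump and f 1-Lipschitz:
|F(x) - f(x)| = |f(c_i) - f(x)| \le |c_i - x| \le \frac{1}{2N},
\qquad\text{so } |F - f| \le \varepsilon \text{ once } N \ge \frac{1}{2\varepsilon}.
```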

  13. What's bad about shallow NNs?
      From the UAT, "... there exist an integer N ...", but how large? What happens in 2D?
      Visual proof in 2D first: σ(w^⊺ x + b), with σ the sigmoid, approaches a 2D step function as ∥w∥ is made large (see the check below). (Credit: CMU 11-785)
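A quick numeric check of this claim (my own sketch; the slide's plots are from CMU 11-785): grow ∥w∥ while scaling b along so the decision boundary stays fixed, and σ(w^⊺ x + b) on a grid over [−1, 1]^2 becomes essentially 0/1-valued, i.e., a 2D step.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-1.0, 1.0, 201)
X, Y = np.meshgrid(xs, xs)
u, b0 = np.array([1.0, 2.0]) / np.sqrt(5.0), 0.1   # fixed boundary: u . x + b0 = 0

for scale in (1, 10, 100):
    w, b = scale * u, scale * b0                   # growing ||w||, same boundary
    Z = sigmoid(w[0] * X + w[1] * Y + b)
    print(scale, np.mean((Z < 0.01) | (Z > 0.99))) # fraction of the grid that is ≈ 0 or ≈ 1
```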

  14. Visual proof for 2D functions
      Keep increasing the number of step functions that are distributed evenly ... (Image credit: CMU 11-785)
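A minimal numeric sketch of one such construction (my own illustration; the actual figures are from CMU 11-785, and the tangent-circle arrangement below is my assumption of how the steps are "distributed evenly"): sum many half-plane step functions whose boundary lines are tangent to a small circle, with evenly spaced normals. The normalized sum is ≈ 1 inside the circle and decays toward 1/2 away from it, so thresholding it just below 1 carves out a localized 2D "bump", the analogue of the 1D bumps.

```python
import numpy as np

def normalized_step_sum(point, center, radius, n_steps):
    # n_steps half-plane indicators; each boundary line is tangent to the circle
    # of the given radius around `center`, with evenly spaced outward normals.
    angles = 2.0 * np.pi * np.arange(n_steps) / n_steps
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (n_steps, 2)
    return np.mean(normals @ (point - center) <= radius)           # value in [0, 1]

center, radius, n = np.zeros(2), 0.2, 10_000
for d in (0.0, 0.1, 0.2, 0.4, 2.0):                                # distance from the center
    print(d, round(normalized_step_sum(np.array([d, 0.0]), center, radius, n), 3))
# ≈ 1.0 up to the circle, then decays toward 0.5 far away.
```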

  15. What's bad about shallow NNs?
      From the UAT, "... there exist an integer N ...", but how large? What happens in 2D? (Image credit: CMU 11-785)
      Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ ∥x − y∥_2 for all x, y ∈ R^2.
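The transcript stops here; by analogy with the 1D count above (my extrapolation, not taken from these slides), covering [0, 1]^2 with bumps of diameter about 2ε keeps the error of a 1-Lipschitz target below ε but needs on the order of 1/ε^2 of them:

```latex
% Each 2D bump of diameter ~ 2\varepsilon keeps |F(\mathbf{x}) - f(\mathbf{x})| \le \|\mathbf{c}_i - \mathbf{x}\|_2 \le \varepsilon,
% while tiling the unit square requires roughly
\#\{\text{bumps}\} \approx \left(\frac{1}{2\varepsilon}\right)^{2} = \Theta\!\left(\varepsilon^{-2}\right).
```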
