1. Fundamental Belief: Universal Approximation Theorems
Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities
January 29, 2020

2–5. Logistics
– HW 0 posted (due: midnight Feb 07)
– Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.) now available at the UMN library (limited e-access)
– Guest lectures (Feb 04: tutorial on NumPy, SciPy, and Colab; bring your laptops if possible!)
– Feb 06: discussion of the course project & ideas

6. Outline
– Recap
– Why should we trust NNs?
– Suggested reading

7–9. Recap I
– Biological neuron vs. artificial neuron
– Biological NN vs. artificial NN
– Artificial NN: (over)simplification at both the neuron and connection levels

10–11. Recap II
Zoo of NN models in ML:
– Linear regression
– Perceptron and logistic regression
– Softmax regression
– Multilayer perceptron (feedforward NNs)
Also:
– Support vector machines (SVM)
– PCA (autoencoder)
– Matrix factorization

12–13. Recap III
Brief history of NNs:
– 1943: first NNs invented (McCulloch and Pitts)
– 1958–1969: perceptron (Rosenblatt)
– 1969: Perceptrons (Minsky and Papert), the end of the perceptron era
– 1980s–1990s: Neocognitron, CNNs, back-propagation, SGD (still in use today)
– 1990s–2010s: SVMs, AdaBoost, decision trees and random forests
– 2010s–now: DNNs and deep learning

14. Outline
– Recap
– Why should we trust NNs?
– Suggested reading

15–16. Supervised learning

General view:
– Gather training data (x_1, y_1), ..., (x_n, y_n)
– Choose a family of functions H, so that there is an f ∈ H ensuring y_i ≈ f(x_i) for all i
– Set up a loss function ℓ
– Find an f ∈ H to minimize the average loss: min_{f ∈ H} (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i))

NN view:
– Gather training data (x_1, y_1), ..., (x_n, y_n)
– Choose an NN with k neurons, so that there is a group of weights (w_1, ..., w_k, b_1, ..., b_k) ensuring y_i ≈ {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i) for all i
– Set up a loss function ℓ
– Find weights (w_1, ..., w_k, b_1, ..., b_k) to minimize the average loss: min_{w's, b's} (1/n) Σ_{i=1}^n ℓ[y_i, {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i)]

Why should we trust NNs?
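To make the NN view concrete, here is a minimal NumPy sketch of the minimization above for a one-hidden-layer network with k neurons. The squared loss, the ReLU activation, the toy target sin(3x), and plain full-batch gradient descent are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: y_i ~= f0(x_i) for an unknown f0 (here f0(x) = sin(3x)).
n, k = 200, 32                                   # n samples, k hidden neurons
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X)

# Weights (w_1, ..., w_k, b_1, ..., b_k) plus a linear output layer.
W1 = rng.normal(size=(1, k)); b1 = np.zeros(k)
w2 = rng.normal(scale=1 / np.sqrt(k), size=(k, 1)); b2 = np.zeros(1)

def nn(X):
    """{NN(w's, b's)}(x): one hidden ReLU layer, linear output."""
    H = np.maximum(0.0, X @ W1 + b1)
    return H @ w2 + b2, H

lr = 0.05
for step in range(2001):
    pred, H = nn(X)
    resid = pred - y
    loss = np.mean(resid ** 2)                   # (1/n) sum_i loss(y_i, NN(x_i))
    if step % 500 == 0:
        print(step, loss)
    # Backpropagate the gradient of the average loss through the two layers.
    g2, gb2 = H.T @ resid * (2 / n), resid.mean(0) * 2
    dH = (resid @ w2.T) * (H > 0)
    g1, gb1 = X.T @ dH * (2 / n), dH.mean(0) * 2
    W1 -= lr * g1; b1 -= lr * gb1
    w2 -= lr * g2; b2 -= lr * gb2
```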

17–23. Function approximation
A more accurate description of supervised learning:
– Underlying true function: f_0
– Training data: y_i ≈ f_0(x_i)
– Choose a family of functions H, so that there exists an f ∈ H with f close to f_0
– Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids?)
– Optimization & generalization: how we find the best f ∈ H matters
We focus on approximation capacity for now.
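As a toy illustration of why the choice of H matters (my own example, not from the slides): fit the same data once with affine functions and once with the span of fixed random ReLU features; on a nonlinear target the richer class attains a much smaller training error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unknown true function f0 and noisy training data y_i ~= f0(x_i).
f0 = lambda x: np.sin(3 * x)
x = rng.uniform(-1, 1, size=200)
y = f0(x) + 0.05 * rng.normal(size=x.shape)

def fit_and_error(Phi, y):
    """Least-squares fit within the span of the feature map Phi (the class H)."""
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ coef - y) ** 2)

# H1: affine functions x -> a*x + b.
Phi_lin = np.column_stack([x, np.ones_like(x)])

# H2: span of 50 fixed random ReLU neurons x -> max(0, w*x + b).
w, b = rng.normal(size=50), rng.uniform(-1, 1, size=50)
Phi_relu = np.maximum(0.0, np.outer(x, w) + b)

print("affine class error :", fit_and_error(Phi_lin, y))
print("ReLU-feature error :", fit_and_error(Phi_relu, y))
```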

24–25. A word on notation
– k-layer NNs: with k layers of weights
– k-hidden-layer NNs: with k hidden layers of nodes (i.e., (k + 1)-layer NNs)
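Under this counting convention, a 1-hidden-layer network is the same thing as a 2-layer network: two layers of weights, one hidden layer of nodes. A tiny sketch (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# A 1-hidden-layer NN, equivalently a 2-layer NN: two layers of weights.
d_in, width, d_out = 3, 16, 1
W1, b1 = rng.normal(size=(d_in, width)), np.zeros(width)   # layer 1 of weights
W2, b2 = rng.normal(size=(width, d_out)), np.zeros(d_out)  # layer 2 of weights

def nn(x):
    h = np.maximum(0.0, x @ W1 + b1)   # the single hidden layer of nodes
    return h @ W2 + b2                 # output layer

print(nn(rng.normal(size=(5, d_in))).shape)  # (5, 1)
```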

26–31. First trial
Think of single-output (i.e., R-valued) problems first.
A single neuron: H = { x ↦ σ(w⊺x + b) } (here and throughout, σ denotes the activation)
– σ identity or linear: linear functions
– σ the sign function, sign(w⊺x + b) (perceptron): 0/1 function with a hyperplane threshold
– σ(z) = 1/(1 + e^{-z}) (sigmoid): { x ↦ 1/(1 + e^{-(w⊺x + b)}) }
– σ(z) = max(0, z) (ReLU): { x ↦ max(0, w⊺x + b) }
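A minimal NumPy sketch of this hypothesis class. The 0/1 convention for the sign branch follows the slide; the particular w, b, and x are arbitrary.

```python
import numpy as np

# A single neuron: x -> sigma(w^T x + b), for several choices of sigma.
activations = {
    "identity": lambda z: z,                      # linear functions
    "sign":     lambda z: (z > 0).astype(float),  # perceptron: 0/1 hyperplane threshold
    "sigmoid":  lambda z: 1.0 / (1.0 + np.exp(-z)),
    "relu":     lambda z: np.maximum(0.0, z),
}

def neuron(w, b, sigma):
    """Return the function x -> sigma(w^T x + b), a member of H."""
    return lambda x: sigma(x @ w + b)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.1])
for name, sigma in activations.items():
    print(name, float(neuron(w, b, sigma)(x)))
```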

32–36. Second trial
Think of single-output (i.e., R-valued) problems first.
Add depth! But make all hidden-node activations identity or linear:
σ(w_L⊺(W_{L-1}(⋯(W_1 x + b_1)⋯) + b_{L-1}) + b_L)
No better than a single neuron! Why? A composition of affine maps is itself affine, so the nested stack collapses to a single w⊺x + b inside σ.
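A quick numerical check of this collapse (my own illustration, with identity activations at the hidden nodes): stacking linear layers yields exactly one affine map, so applying σ on top gives nothing beyond a single neuron.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three layers of weights with identity activations in between.
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)
w3, b3 = rng.normal(size=(8, 1)), rng.normal(size=1)

def deep_linear(x):
    return ((x @ W1 + b1) @ W2 + b2) @ w3 + b3

# The same map written as a single neuron w_eff^T x + b_eff.
w_eff = W1 @ W2 @ w3
b_eff = (b1 @ W2 + b2) @ w3 + b3

x = rng.normal(size=(10, 4))
print(np.allclose(deep_linear(x), x @ w_eff + b_eff))  # True
```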

37–38. Third trial
Think of single-output (i.e., R-valued) problems first.
Add both depth & nonlinearity!
A two-layer network, linear activation at the output.
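To preview why this combination is powerful (the subject of the universal approximation theorems), here is a small sketch of my own, with sin(3x) and the knot placement as illustrative choices: a two-layer ReLU network whose output weights are fit by least squares already tracks a 1-D function closely.

```python
import numpy as np

# With hidden ReLU units max(0, x - t_j) placed at knots t_j, a two-layer
# network can realize continuous piecewise-linear functions of one variable,
# which can approximate a continuous f0 on an interval arbitrarily well.
f0 = lambda x: np.sin(3 * x)

knots = np.linspace(-1, 1, 40)               # one hidden ReLU per knot
x = np.linspace(-1, 1, 1000)

# Hidden layer: h_j(x) = max(0, x - t_j); output layer: linear combination + bias.
H = np.maximum(0.0, x[:, None] - knots[None, :])
Phi = np.column_stack([H, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(Phi, f0(x), rcond=None)
approx = Phi @ coef

print("max |f0 - NN| on [-1, 1]:", np.max(np.abs(f0(x) - approx)))
```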
