  1. Fundamental Belief: Universal Approximation Theorems
     Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities
     January 29, 2020

  2. Logistics
     – HW 0 posted (due: midnight Feb 07)
     – Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.) now available at the UMN library (limited e-access)
     – Guest lecture (Feb 04): tutorial on Numpy, Scipy, Colab; bring your laptops if possible!
     – Feb 06: discussion of the course project & ideas

  3. Outline
     – Recap
     – Why should we trust NNs?
     – Suggested reading

  4. Recap I
     – biological neuron vs. artificial neuron
     – biological NN vs. artificial NN
     – Artificial NN: an (over)-simplification at both the neuron and connection levels

  5. Recap II: Zoo of NN models in ML
     – Linear regression
     – Perceptron and logistic regression
     – Softmax regression
     – Multilayer perceptron (feedforward NNs)
     Also:
     – Support vector machines (SVM)
     – PCA (autoencoder)
     – Matrix factorization

  6. Recap III: Brief history of NNs
     – 1943: first NNs invented (McCulloch and Pitts)
     – 1958–1969: perceptron (Rosenblatt)
     – 1969: Perceptrons (Minsky and Papert), the end of the perceptron era
     – 1980s–1990s: Neocognitron, CNNs, back-prop, SGD, the tools we still use today
     – 1990s–2010s: SVMs, AdaBoost, decision trees and random forests
     – 2010s–now: DNNs and deep learning

  7. Outline
     – Recap
     – Why should we trust NNs?
     – Suggested reading

  8. Supervised learning
     General view:
     – Gather training data (x_1, y_1), ..., (x_n, y_n)
     – Choose a family of functions H, so that there is an f ∈ H with y_i ≈ f(x_i) for all i
     – Set up a loss function ℓ
     – Find an f ∈ H to minimize the average loss: min_{f ∈ H} (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i))
     NN view:
     – Gather training data (x_1, y_1), ..., (x_n, y_n)
     – Choose a NN with k neurons, so that there is a group of weights (w_1, ..., w_k, b_1, ..., b_k) with y_i ≈ {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i) for all i
     – Set up a loss function ℓ
     – Find weights (w_1, ..., w_k, b_1, ..., b_k) to minimize the average loss: min_{w's, b's} (1/n) ∑_{i=1}^n ℓ[y_i, {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i)]
     Why should we trust NNs?
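
To make the NN view concrete, here is a minimal NumPy sketch (not from the slides) that fits a one-hidden-layer ReLU network with k neurons to toy data by running plain gradient descent on the average squared loss. The choice of toy target, k, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np

# Toy training data: y_i ≈ f0(x_i) for some unknown f0 (here f0(x) = sin(3x)).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * x) + 0.05 * rng.normal(size=x.shape)

# One-hidden-layer network: NN(x) = sum_j v_j * relu(w_j x + b_j) + c
k = 20
W = rng.normal(size=(1, k))         # hidden weights w_1, ..., w_k
b = np.zeros(k)                     # hidden biases  b_1, ..., b_k
v = 0.1 * rng.normal(size=(k, 1))   # output weights
c = 0.0                             # output bias

lr, n = 0.1, x.shape[0]
for step in range(2000):
    # Forward pass
    z = x @ W + b                   # (n, k) pre-activations
    h = np.maximum(z, 0.0)          # ReLU activations
    pred = h @ v + c                # (n, 1) network outputs
    resid = pred - y
    loss = np.mean(resid ** 2)      # average loss (1/n) sum_i ℓ(y_i, NN(x_i))
    # Backward pass: gradients of the average loss w.r.t. all weights and biases
    g_pred = 2.0 * resid / n
    g_v, g_c = h.T @ g_pred, g_pred.sum()
    g_z = (g_pred @ v.T) * (z > 0)
    g_W, g_b = x.T @ g_z, g_z.sum(axis=0)
    # Gradient descent step
    W -= lr * g_W; b -= lr * g_b; v -= lr * g_v; c -= lr * g_c

print(f"final average loss: {loss:.4f}")
```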

  9. Function approximation
     A more accurate description of supervised learning:
     – Underlying true function: f_0
     – Training data: y_i ≈ f_0(x_i)
     – Choose a family of functions H, so that ∃ f ∈ H such that f and f_0 are close
     – Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids? etc.)
     – Optimization & generalization: how to find the best f ∈ H matters
     We focus on approximation capacity now.
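
A quick way to see why the choice of H matters is to fit the same noisy samples of f_0 with two classes of different capacity. The sketch below is an illustration under assumed choices (affine functions versus degree-7 polynomials as stand-in classes, an arbitrary f_0 and noise level), not part of the lecture.

```python
import numpy as np

# True function f0 and noisy samples y_i ≈ f0(x_i)
rng = np.random.default_rng(0)
f0 = lambda x: np.sin(3 * x)
x = rng.uniform(-1, 1, 200)
y = f0(x) + 0.05 * rng.normal(size=x.shape)

# Two hypothesis classes H of different approximation capacity
for degree in (1, 7):
    coeffs = np.polyfit(x, y, deg=degree)            # best f in H by least squares
    err = np.mean((np.polyval(coeffs, x) - y) ** 2)  # average squared loss
    print(f"degree {degree}: average squared loss {err:.4f}")
```

The affine class cannot get close to f_0 no matter how well it is optimized, while the richer class can; that gap is what "approximation capacity" refers to.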

  10. A word on notation
     – k-layer NNs: NNs with k layers of weights
     – k-hidden-layer NNs: NNs with k hidden layers of nodes (i.e., (k+1)-layer NNs)

  11. First trial
     Think of single-output (i.e., ℝ-valued) problems first.
     A single neuron: H = { x ↦ σ(w⊺x + b) } (again, the activation is always written as σ)
     – σ identity or linear: linear functions
     – σ the sign function, sign(w⊺x + b) (perceptron): 0/1 functions with a hyperplane threshold
     – σ(z) = 1/(1 + e^{-z}): x ↦ 1/(1 + e^{-(w⊺x + b)})
     – σ(z) = max(0, z) (ReLU): x ↦ max(0, w⊺x + b)
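
A minimal sketch of the single-neuron class H = { x ↦ σ(w⊺x + b) } with the four activations above; the threshold version returns 0/1 following the slide's convention rather than ±1, and the weights and input are arbitrary example values.

```python
import numpy as np

def single_neuron(x, w, b, sigma):
    """One member of H = { x -> sigma(w^T x + b) }."""
    return sigma(x @ w + b)

# The activations discussed on the slide
identity = lambda z: z
threshold = lambda z: (z > 0).astype(float)      # 0/1 hyperplane threshold
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.1])
for name, sigma in [("identity", identity), ("threshold", threshold),
                    ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, single_neuron(x, w, b, sigma))
```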

  12. Second trial
     Think of single-output (i.e., ℝ-valued) problems first.
     Add depth! But make all hidden-node activations identity or linear:
     σ(w_L⊺(W_{L-1}(⋯(W_1 x + b_1) + ⋯) + b_{L-1}) + b_L)
     No better than a single neuron! Why?
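
The reason is that a composition of affine maps is itself affine, so the whole stack collapses to σ(w̃⊺x + b̃) for some w̃, b̃. The small NumPy check below (dimensions chosen arbitrarily) makes this concrete for three stacked linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                            # input dimension

# Three stacked layers with identity (linear) hidden activations
W1, b1 = rng.normal(size=(5, d)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
w3, b3 = rng.normal(size=3), rng.normal()

def deep_linear(x):
    return w3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The equivalent single neuron: w_eff^T x + b_eff
w_eff = w3 @ W2 @ W1
b_eff = w3 @ (W2 @ b1 + b2) + b3

x = rng.normal(size=d)
print(deep_linear(x), w_eff @ x + b_eff)         # the two outputs agree
```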

  13. Third trial
     Think of single-output (i.e., ℝ-valued) problems first.
     Add both depth & nonlinearity! A two-layer network with linear activation at the output.
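
As a hint of what depth 2 plus nonlinearity buys, the sketch below (an illustration, not from the slides) hand-wires four hidden ReLU neurons with a linear output to realize a bump that is 0 outside [a, b] and 1 in the middle; sums of such bumps are the usual building blocks in constructive universal-approximation arguments. The values of a, b, and slope are illustrative assumptions.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def bump(x, a=0.0, b=1.0, slope=10.0):
    """Two-layer net: 4 hidden ReLU neurons, linear output.

    Rises from 0 to 1 on [a, a + 1/slope], stays at 1, then falls back
    to 0 on [b - 1/slope, b] (assumes b - a > 2/slope).
    """
    return slope * (relu(x - a) - relu(x - a - 1/slope)
                    - relu(x - b + 1/slope) + relu(x - b))

x = np.linspace(-0.5, 1.5, 9)
print(np.round(bump(x), 2))    # ≈ 0 outside [0, 1], 1 in the middle
```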
