  1. Fundamental Belief: Universal Approximation Theorems
     Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities
     January 29, 2020

  2. Logistics
     – HW 0 posted (due: midnight Feb 07)
     – Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.) now available at the UMN library (limited e-access)
     – Guest lecture (Feb 04): tutorial on Numpy, Scipy, Colab; bring your laptops if possible!
     – Feb 06: discussion of the course project & ideas

  3. Outline
     – Recap
     – Why should we trust NNs?
     – Suggested reading

  4. Recap I
     – biological neuron vs. artificial neuron
     – biological NN vs. artificial NN
     – Artificial NN: an (over)-simplification at both the neuron and connection levels

  5. Recap II: Zoo of NN models in ML
     – Linear regression
     – Perceptron and logistic regression
     – Softmax regression
     – Multilayer perceptron (feedforward NNs)
     Also:
     – Support vector machines (SVM)
     – PCA (autoencoder)
     – Matrix factorization

  6. Recap III: Brief history of NNs
     – 1943: first NNs invented (McCulloch and Pitts)
     – 1958–1969: perceptron (Rosenblatt)
     – 1969: Perceptrons (Minsky and Papert), the end of the perceptron era
     – 1980s–1990s: Neocognitron, CNNs, back-prop, SGD, the tools we still use today
     – 1990s–2010s: SVMs, AdaBoost, decision trees and random forests
     – 2010s–now: DNNs and deep learning

  7. Outline
     – Recap
     – Why should we trust NNs?
     – Suggested reading

  8. Supervised learning
     General view:
     – Gather training data (x_1, y_1), ..., (x_n, y_n)
     – Choose a family of functions H, so that there is an f ∈ H with y_i ≈ f(x_i) for all i
     – Set up a loss function ℓ
     – Find an f ∈ H to minimize the average loss: min_{f ∈ H} (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i))
     NN view:
     – Gather training data (x_1, y_1), ..., (x_n, y_n)
     – Choose a NN with k neurons, so that there is a group of weights (w_1, ..., w_k, b_1, ..., b_k) with y_i ≈ {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i) for all i
     – Set up a loss function ℓ
     – Find weights (w_1, ..., w_k, b_1, ..., b_k) to minimize the average loss: min_{w's, b's} (1/n) ∑_{i=1}^n ℓ[y_i, {NN(w_1, ..., w_k, b_1, ..., b_k)}(x_i)]
     Why should we trust NNs?
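
To make the NN view concrete, here is a minimal NumPy sketch (not from the slides) that fits a one-hidden-layer ReLU network with k neurons to toy data by running plain gradient descent on the average squared loss. The choice of toy target, k, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np

# Toy training data: y_i ≈ f0(x_i) for some unknown f0 (here f0(x) = sin(3x)).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * x) + 0.05 * rng.normal(size=x.shape)

# One-hidden-layer network: NN(x) = sum_j v_j * relu(w_j x + b_j) + c
k = 20
W = rng.normal(size=(1, k))         # hidden weights w_1, ..., w_k
b = np.zeros(k)                     # hidden biases  b_1, ..., b_k
v = 0.1 * rng.normal(size=(k, 1))   # output weights
c = 0.0                             # output bias

lr, n = 0.1, x.shape[0]
for step in range(2000):
    # Forward pass
    z = x @ W + b                   # (n, k) pre-activations
    h = np.maximum(z, 0.0)          # ReLU activations
    pred = h @ v + c                # (n, 1) network outputs
    resid = pred - y
    loss = np.mean(resid ** 2)      # average loss (1/n) sum_i ℓ(y_i, NN(x_i))
    # Backward pass: gradients of the average loss w.r.t. all weights and biases
    g_pred = 2.0 * resid / n
    g_v, g_c = h.T @ g_pred, g_pred.sum()
    g_z = (g_pred @ v.T) * (z > 0)
    g_W, g_b = x.T @ g_z, g_z.sum(axis=0)
    # Gradient descent step
    W -= lr * g_W; b -= lr * g_b; v -= lr * g_v; c -= lr * g_c

print(f"final average loss: {loss:.4f}")
```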

  9. Function approximation
     A more accurate description of supervised learning:
     – Underlying true function: f_0
     – Training data: y_i ≈ f_0(x_i)
     – Choose a family of functions H, so that ∃ f ∈ H such that f and f_0 are close
     – Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids? etc.)
     – Optimization & generalization: how to find the best f ∈ H matters
     We focus on approximation capacity now.
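
A quick way to see why the choice of H matters is to fit the same noisy samples of f_0 with two classes of different capacity. The sketch below is an illustration under assumed choices (affine functions versus degree-7 polynomials as stand-in classes, an arbitrary f_0 and noise level), not part of the lecture.

```python
import numpy as np

# True function f0 and noisy samples y_i ≈ f0(x_i)
rng = np.random.default_rng(0)
f0 = lambda x: np.sin(3 * x)
x = rng.uniform(-1, 1, 200)
y = f0(x) + 0.05 * rng.normal(size=x.shape)

# Two hypothesis classes H of different approximation capacity
for degree in (1, 7):
    coeffs = np.polyfit(x, y, deg=degree)            # best f in H by least squares
    err = np.mean((np.polyval(coeffs, x) - y) ** 2)  # average squared loss
    print(f"degree {degree}: average squared loss {err:.4f}")
```

The affine class cannot get close to f_0 no matter how well it is optimized, while the richer class can; that gap is what "approximation capacity" refers to.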

  10. A word on notation
     – k-layer NNs: NNs with k layers of weights
     – k-hidden-layer NNs: NNs with k hidden layers of nodes (i.e., (k+1)-layer NNs)

  11. First trial
     Think of single-output (i.e., ℝ-valued) problems first.
     A single neuron: H = { x ↦ σ(w⊺x + b) } (again, the activation is always written as σ)
     – σ identity or linear: linear functions
     – σ the sign function, sign(w⊺x + b) (perceptron): 0/1 functions with a hyperplane threshold
     – σ(z) = 1/(1 + e^{-z}): x ↦ 1/(1 + e^{-(w⊺x + b)})
     – σ(z) = max(0, z) (ReLU): x ↦ max(0, w⊺x + b)
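
A minimal sketch of the single-neuron class H = { x ↦ σ(w⊺x + b) } with the four activations above; the threshold version returns 0/1 following the slide's convention rather than ±1, and the weights and input are arbitrary example values.

```python
import numpy as np

def single_neuron(x, w, b, sigma):
    """One member of H = { x -> sigma(w^T x + b) }."""
    return sigma(x @ w + b)

# The activations discussed on the slide
identity = lambda z: z
threshold = lambda z: (z > 0).astype(float)      # 0/1 hyperplane threshold
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.1])
for name, sigma in [("identity", identity), ("threshold", threshold),
                    ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, single_neuron(x, w, b, sigma))
```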

  12. Second trial
     Think of single-output (i.e., ℝ-valued) problems first.
     Add depth! But make all hidden-node activations identity or linear:
     σ(w_L⊺(W_{L-1}(⋯(W_1 x + b_1) + ⋯) + b_{L-1}) + b_L)
     No better than a single neuron! Why?
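
The reason is that a composition of affine maps is itself affine, so the whole stack collapses to σ(w̃⊺x + b̃) for some w̃, b̃. The small NumPy check below (dimensions chosen arbitrarily) makes this concrete for three stacked linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                            # input dimension

# Three stacked layers with identity (linear) hidden activations
W1, b1 = rng.normal(size=(5, d)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
w3, b3 = rng.normal(size=3), rng.normal()

def deep_linear(x):
    return w3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The equivalent single neuron: w_eff^T x + b_eff
w_eff = w3 @ W2 @ W1
b_eff = w3 @ (W2 @ b1 + b2) + b3

x = rng.normal(size=d)
print(deep_linear(x), w_eff @ x + b_eff)         # the two outputs agree
```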

  13. Third trial
     Think of single-output (i.e., ℝ-valued) problems first.
     Add both depth & nonlinearity! A two-layer network with linear activation at the output.
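
As a hint of what depth 2 plus nonlinearity buys, the sketch below (an illustration, not from the slides) hand-wires four hidden ReLU neurons with a linear output to realize a bump that is 0 outside [a, b] and 1 in the middle; sums of such bumps are the usual building blocks in constructive universal-approximation arguments. The values of a, b, and slope are illustrative assumptions.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def bump(x, a=0.0, b=1.0, slope=10.0):
    """Two-layer net: 4 hidden ReLU neurons, linear output.

    Rises from 0 to 1 on [a, a + 1/slope], stays at 1, then falls back
    to 0 on [b - 1/slope, b] (assumes b - a > 2/slope).
    """
    return slope * (relu(x - a) - relu(x - a - 1/slope)
                    - relu(x - b + 1/slope) + relu(x - b))

x = np.linspace(-0.5, 1.5, 9)
print(np.round(bump(x), 2))    # ≈ 0 outside [0, 1], 1 in the middle
```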
