Learning From Data, Lecture 22: Neural Networks and Overfitting



SLIDE 1

Learning From Data Lecture 22 Neural Networks and Overfitting

  • Approximation vs. Generalization
  • Regularization and Early Stopping
  • Minimizing Ein More Efficiently

  • M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: Neural Networks and Fitting the Data

Forward Propagation:

x = x(0) --W(1)--> s(1) --θ--> x(1) --W(2)--> s(2) ··· --W(L)--> s(L) --θ--> x(L) = h(x)

s(ℓ) = (W(ℓ))ᵀ x(ℓ−1),    x(ℓ) = [ 1 ; θ(s(ℓ)) ]

(Compute h and Ein.)

Choose W = {W(1), W(2), . . . , W(L)} to minimize Ein.

Gradient descent: W(t + 1) ← W(t) − η ∇Ein(W(t))

Computing the gradient requires ∂e/∂W(ℓ), which in turn requires the sensitivities δ(ℓ) = ∂e/∂s(ℓ):

∂e/∂W(ℓ) = x(ℓ−1) (δ(ℓ))ᵀ

Backpropagation (run backward, from the output layer to the first):

δ(1) ←− δ(2) ··· ←− δ(L−1) ←− δ(L),    δ(ℓ) = θ′(s(ℓ)) ⊗ [ W(ℓ+1) δ(ℓ+1) ]₁,…,d(ℓ)   (keep components 1, . . . , d(ℓ); drop the bias component)
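For concreteness, here is a minimal numpy sketch of one forward and one backward pass for a tanh network with squared error on a single example. The choice θ = tanh, the bias-row layout of the weight matrices, and the function name forward_backward are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def forward_backward(W, x, y):
    """One forward and one backward pass; squared error e = (h(x) - y)^2.

    W : list of weight matrices, W[l] of shape (d_{l-1} + 1, d_l) (bias row first).
    x : input vector of length d_0 (without the bias coordinate).
    y : target value.
    Returns h(x) and the per-layer gradients de/dW^(l).
    """
    L = len(W)
    xs = [np.concatenate(([1.0], x))]        # x^(0), with the bias coordinate
    ss = [None]                              # s^(0) is unused
    for l in range(1, L + 1):
        s = W[l - 1].T @ xs[l - 1]           # s^(l) = (W^(l))^T x^(l-1)
        a = np.tanh(s)                       # θ(s^(l)), with θ = tanh
        ss.append(s)
        xs.append(a if l == L else np.concatenate(([1.0], a)))  # add bias except at the output

    h = xs[L]                                # network output h(x)

    # Backward pass: δ^(L) from the output error, then propagate backwards.
    deltas = [None] * (L + 1)
    deltas[L] = 2 * (h - y) * (1 - h ** 2)               # θ'(s) = 1 - tanh(s)^2
    for l in range(L - 1, 0, -1):
        back = W[l] @ deltas[l + 1]                      # W^(l+1) δ^(l+1), includes the bias row
        deltas[l] = (1 - np.tanh(ss[l]) ** 2) * back[1:] # keep components 1..d^(l)

    grads = [np.outer(xs[l - 1], deltas[l]) for l in range(1, L + 1)]  # x^(l-1) (δ^(l))^T
    return h, grads
```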

[Figures: log10(error) versus log10(iteration) for gradient descent and SGD; the resulting fit on the digits data (average intensity vs. symmetry).]


SLIDE 3

2-Layer Neural Network

[Figure: a 2-layer network: the input x feeds m hidden units with weight vectors v1, . . . , vm, whose outputs are combined with weights w0, w1, . . . , wm to produce h(x).]

h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵀ x) )

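As a quick illustration of the 2-layer formula above, a minimal sketch assuming θ = tanh for both the hidden and output units; the function name two_layer_h is made up.

```python
import numpy as np

def two_layer_h(x, V, w):
    """h(x) = θ( w0 + Σ_{j=1}^{m} w_j θ(v_jᵀ x) ), with θ = tanh (an assumption).

    V : (m, d) matrix whose rows are the hidden weight vectors v_j.
    w : vector of length m + 1; w[0] is the bias w0.
    """
    hidden = np.tanh(V @ x)                  # θ(v_jᵀ x) for j = 1, ..., m
    return np.tanh(w[0] + w[1:] @ hidden)    # θ( w0 + Σ_j w_j · hidden_j )
```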

SLIDE 4

The Neural Network has a Tunable Transform

Neural Network:       h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵀ x) )

Nonlinear Transform:  h(x) = θ( w0 + Σ_{j=1}^{d̃} wj Φj(x) )

k-RBF-Network:        h(x) = θ( w0 + Σ_{j=1}^{k} wj φ(‖x − µj‖) )

Approximation: Ein = O(1/m).


SLIDE 5

Generalization

MLP:  dvc = O(md log(md))
tanh: dvc = O(md(m + d))

With m = √N: convergence to optimal for the MLP, just like k-NN.

Semi-parametric, because you still have to learn the parameters.
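A rough back-of-the-envelope for why m = √N works (my own gloss; constants and log factors suppressed): combine the approximation behaviour Ein = O(1/m) from the previous slide with the VC generalization bound,

    Eout ≲ Ein + O( √(dvc / N) ) = O(1/m) + O( √( m d log(md) / N ) ).

With m = √N the first term is O(1/√N) and the second is roughly O( (d log N / √N)^{1/2} ); both vanish as N → ∞, which is the sense in which the MLP converges to optimal.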


SLIDE 6

Regularization – Weight Decay

Eaug(w) = (1/N) Σ_{n=1}^{N} ( h(xn; w) − yn )² + (λ/N) Σ_{ℓ,i,j} ( w(ℓ)ij )²

∂Eaug(w)/∂W(ℓ) = ∂Ein(w)/∂W(ℓ) + (2λ/N) W(ℓ)

(The first term, ∂Ein(w)/∂W(ℓ), comes from backpropagation.)
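A sketch of how the weight-decay term enters the gradient, reusing the hypothetical forward_backward (and numpy as np) from the recap sketch; the name augmented_gradient, the argument lam, and averaging Ein over the N examples are assumptions.

```python
def augmented_gradient(W, X, Y, lam):
    """∂Eaug/∂W^(l) = ∂Ein/∂W^(l) + (2λ/N) W^(l), with Ein averaged over N examples."""
    N = len(X)
    grads = [np.zeros_like(Wl) for Wl in W]
    for x, y in zip(X, Y):
        _, g = forward_backward(W, x, y)        # per-example ∂e/∂W^(l) via backpropagation
        for l in range(len(W)):
            grads[l] += g[l] / N
    for l in range(len(W)):
        grads[l] += (2.0 * lam / N) * W[l]      # weight-decay contribution
    return grads
```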


SLIDE 7

Weight Decay with Digits Data

[Figure: decision boundaries on the digits data (average intensity vs. symmetry); left: no weight decay, right: weight decay with λ = 0.01.]


SLIDE 8

Early Stopping

Gradient descent: w1 = w0 − η g0 / ‖g0‖

H1 = {w : ‖w − w0‖ ≤ η}
H2 = H1 ∪ {w : ‖w − w1‖ ≤ η}
H3 = H2 ∪ {w : ‖w − w2‖ ≤ η}

Each iteration explores a larger hypothesis set: H1 ⊂ H2 ⊂ H3 ⊂ H4 ⊂ ···, so the effective dvc(Ht) grows with the number of iterations t.

[Figures: Ein(wt), the penalty Ω(dvc(Ht)) and Eout(wt) versus iteration t, with Eout minimized at t∗; the gradient-descent path from w(0) to w(t∗) over contours of constant Ein.]


SLIDE 9

Early Stopping on Digits Data

[Figures: log10(error) versus iteration t on the digits data, showing Ein and Eval with the validation error minimized at t∗; the resulting decision boundary in the (average intensity, symmetry) plane.]

Use a validation set to determine t∗. Output w∗; do not retrain on all the data up to iteration t∗.
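A sketch of early stopping against a validation set, assuming a gradient routine like augmented_gradient above (or plain ∇Ein), a fixed learning rate, and snapshotting the best weights seen; all names and defaults here are illustrative.

```python
def train_with_early_stopping(W, grad_fn, val_error, eta=0.01, n_iter=10**5):
    """Gradient descent that returns the weights w* minimizing the validation error.

    grad_fn(W)   -> list of gradient matrices computed on the training set.
    val_error(W) -> Eval of the current weights on the validation set.
    """
    best_W = [Wl.copy() for Wl in W]
    best_val, t_star = float("inf"), 0
    for t in range(1, n_iter + 1):
        g = grad_fn(W)
        W = [Wl - eta * gl for Wl, gl in zip(W, g)]     # W(t) = W(t-1) - η ∇E(W(t-1))
        Ev = val_error(W)
        if Ev < best_val:                               # new minimum of Eval: remember w*, t*
            best_val, t_star = Ev, t
            best_W = [Wl.copy() for Wl in W]
    return best_W, t_star                               # output w*; do not retrain up to t*
```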



SLIDE 10

Minimizing Ein

  • 1. Use regression for classification
  • 2. Use better algorithms than gradient descent

[Figure: log10(error) versus optimization time (sec) for gradient descent and conjugate gradients.]


SLIDE 11

Beefing Up Gradient Descent

Determine the gradient g.

[Figure: two in-sample error surfaces Ein(w) over the weights w; one shallow, one deep.]

Shallow: use a large η. Deep: use a small η.


SLIDE 12

Variable Learning Rate Gradient Descent

1: Initialize w(0) and η0 at t = 0. Set α > 1 and β < 1.
2: while stopping criterion has not been met do
3:     Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:     if Ein(w(t) + ηt v(t)) < Ein(w(t)) then
5:         accept: w(t + 1) = w(t) + ηt v(t); increase η: ηt+1 = α ηt.        [α ∈ [1.05, 1.1]]
6:     else
7:         reject: w(t + 1) = w(t); decrease η: ηt+1 = β ηt.                  [β ∈ [0.7, 0.8]]
8:     end if
9:     Iterate to the next step, t ← t + 1.
10: end while
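A sketch of this accept/reject scheme; flattening the weights into a single vector, the error_fn/grad_fn interfaces, and the fixed iteration budget are assumptions.

```python
def variable_eta_gd(w, error_fn, grad_fn, eta=0.1, alpha=1.1, beta=0.8, n_iter=1000):
    """Variable learning rate gradient descent (accept the step only if Ein decreases).

    w : weight vector (e.g. all W^(l) flattened).
    error_fn(w) -> Ein(w);  grad_fn(w) -> ∇Ein(w).
    """
    E = error_fn(w)
    for t in range(n_iter):
        v = -grad_fn(w)                         # v(t) = -g(t)
        E_new = error_fn(w + eta * v)
        if E_new < E:                           # accept: keep the step, grow η
            w, E = w + eta * v, E_new
            eta *= alpha                        # α ∈ [1.05, 1.1]
        else:                                   # reject: keep w, shrink η
            eta *= beta                         # β ∈ [0.7, 0.8]
    return w
```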


SLIDE 13

Steepest Descent - Line Search

1: Initialize w(0) and set t = 0.
2: while stopping criterion has not been met do
3:     Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:     Let η∗ = argmin_η Ein(w(t) + η v(t)).
5:     w(t + 1) = w(t) + η∗ v(t).
6:     Iterate to the next step, t ← t + 1.
7: end while

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice; a sketch follows below.

[Figures: bracketing the line-search minimum with η1 < η2 < η3, where E(η2) lies below E(η1) and E(η3), and the next bisection point η̄; the step from w(t) to w(t + 1) over contours of constant Ein, showing the search direction v(t) and the new negative gradient −g(t + 1).]
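A sketch of that bracket-and-bisect line search. The doubling scheme for the initial bracket, the tolerance, and the assumption that a small first step decreases E (true when v is a descent direction) are my own choices.

```python
def line_search(E, eta=1e-3, tol=1e-8):
    """Approximately minimize the one-dimensional E(η), e.g. E(η) = Ein(w(t) + η v(t))."""
    # 1. Bracket the minimum: double η until the error goes back up,
    #    giving η1 < η2 < η3 with E(η2) <= E(η1) and E(η2) <= E(η3).
    e1, e2, e3 = 0.0, eta, 2 * eta
    while E(e3) < E(e2):
        e1, e2, e3 = e2, e3, 2 * e3

    # 2. Bisect: test the midpoint of the larger sub-interval and keep a
    #    bracketing triple around the smallest value found so far.
    while e3 - e1 > tol:
        m = 0.5 * (e1 + e2) if (e2 - e1) > (e3 - e2) else 0.5 * (e2 + e3)
        lo, hi = (m, e2) if m < e2 else (e2, m)
        if E(lo) <= E(hi):
            e1, e2, e3 = e1, lo, hi
        else:
            e1, e2, e3 = lo, hi, e3
    return e2
```

Step 4 of the algorithm then becomes something like eta_star = line_search(lambda e: Ein_of(w + e * v)), where Ein_of is whatever routine evaluates the in-sample error.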


SLIDE 14

Comparison of Optimization Heuristics

[Figure: log10(error) versus optimization time (sec) for gradient descent, variable η, and steepest descent.]

                                 Optimization time
Method                          10 sec      1,000 sec     50,000 sec
Gradient Descent                0.122       0.0214        0.0113
Stochastic Gradient Descent     0.0203      0.000447      1.6310 × 10⁻⁵
Variable Learning Rate          0.0432      0.0180        0.000197
Steepest Descent                0.0497      0.0194        0.000140


SLIDE 15

Conjugate Gradients

  • 1. Line search, just like steepest descent.
  • 2. Choose a better direction than −g (a sketch follows after the comparison below).

[Figures: the step from w(t) to w(t + 1) over contours of constant Ein, with the successive conjugate-gradient directions v(t) and v(t + 1); log10(error) versus optimization time (sec) for conjugate gradients and steepest descent.]

                                 Optimization time
Method                          10 sec      1,000 sec      50,000 sec
Stochastic Gradient Descent     0.0203      0.000447       1.6310 × 10⁻⁵
Steepest Descent                0.0497      0.0194         0.000140
Conjugate Gradients             0.0200      1.13 × 10⁻⁶    2.73 × 10⁻⁹

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
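A sketch of the "better direction" idea as nonlinear conjugate gradients with the Polak-Ribière+ coefficient; the slide does not say which variant the lecture uses, and line_search is the hypothetical routine sketched earlier.

```python
def conjugate_gradient_descent(w, error_fn, grad_fn, n_iter=100):
    """Nonlinear conjugate gradients on numpy weight vectors."""
    g = grad_fn(w)
    v = -g                                                   # first direction: steepest descent
    for t in range(n_iter):
        eta = line_search(lambda e: error_fn(w + e * v))     # 1. line search, as in steepest descent
        w = w + eta * v
        g_new = grad_fn(w)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))       # Polak-Ribière+ coefficient (assumed variant)
        v = -g_new + beta * v                                # 2. a better direction than -g
        g = g_new
    return w
```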
