NPFL114, Lecture 2

Training Neural Networks

Milan Straka

March 11, 2019

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (unless otherwise stated)


Estimators and Bias

An estimator is a rule for computing an estimate of a given value, often an expectation of some random value(s). The bias of an estimator is the difference between the expected value of the estimator and the true value being estimated. If the bias is zero, we call the estimator unbiased; otherwise we call it biased. If we have a sequence of estimates, it can also happen that the bias converges to zero.

Consider the well-known sample estimate of variance. Given independent and identically distributed random variables $x_1, \ldots, x_n$, we might estimate the mean and the variance as
$$\hat\mu = \frac{1}{n} \sum_i x_i, \qquad \hat\sigma^2 = \frac{1}{n} \sum_i (x_i - \hat\mu)^2.$$
Such an estimate is biased, because $\mathbb{E}[\hat\sigma^2] = \sigma^2 \left(1 - \frac{1}{n}\right)$, but the bias converges to zero with increasing $n$.

Also, an unbiased estimator does not necessarily have small variance – in some cases it can have large variance, so a biased estimator with smaller variance might be preferred.
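For illustration (not part of the original slides), a minimal NumPy simulation of the bias of the $\frac{1}{n}$ variance estimate; the sample size, the distribution, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0           # variance of N(0, 2^2)
n, trials = 5, 100_000   # a small sample size makes the bias clearly visible

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)

biased = ((samples - mu_hat) ** 2).sum(axis=1) / n          # divides by n
unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (n - 1)  # divides by n - 1

print(biased.mean())     # approximately true_var * (1 - 1/n) = 3.2
print(unbiased.mean())   # approximately true_var = 4.0
```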



Machine Learning Basics

We usually have a training set, which is assumed to consist of examples generated independently from a data generating distribution. The goal of optimization is to match the training set as well as possible. However, the main goal of machine learning is to perform well on previously unseen data, which is measured by the so-called generalization error or test error. We typically estimate the generalization error using a test set of examples independent of the training set, but generated by the same data generating distribution.



Machine Learning Basics

Challenges in machine learning:

  • underfitting
  • overfitting

Figure 5.2, page 113 of Deep Learning Book, http://deeplearningbook.org



Machine Learning Basics

We can control whether a model underfits or overfits by modifying its capacity (its representational capacity and its effective capacity).

Figure 5.3, page 115 of Deep Learning Book, http://deeplearningbook.org

The No free lunch theorem (Wolpert, 1996) states that averaging over all possible data distributions, every classification algorithm achieves the same overall error when processing unseen examples. In a sense, no machine learning algorithm is universally better than others.



Machine Learning Basics

Any change in a machine learning algorithm that is designed to reduce generalization error but not necessarily its training error is called regularization.

$L_2$ regularization (also called weight decay) penalizes models with large weights, i.e., adds a penalty of $\|\theta\|^2$ to the loss.

Figure 5.5, page 119 of Deep Learning Book, http://deeplearningbook.org
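As a small added sketch (assuming a linear model and an arbitrary penalty strength, neither of which is specified on the slide), the penalty is simply added to the training loss:

```python
import numpy as np

def mse_with_l2(w, X, y, l2_strength=0.01):
    """Mean squared error of a linear model plus an L2 (weight decay) penalty."""
    predictions = X @ w
    mse = np.mean((predictions - y) ** 2)
    penalty = l2_strength * np.sum(w ** 2)   # lambda * ||w||^2
    return mse + penalty
```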


Machine Learning Basics

Hyperparameters are not adapted by the learning algorithm itself. Usually a validation set or development set is used to estimate the generalization error, allowing hyperparameters to be updated accordingly.


Loss Function

A model is usually trained in order to minimize the loss on the training data. Assuming that a model computes $f(x; \theta)$ using parameters $\theta$, the mean square error is computed as
$$\frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}; \theta) - y^{(i)}\right)^2.$$
A common principle used to design loss functions is the maximum likelihood principle.


Maximum Likelihood Estimation

Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ be training data drawn independently from the data-generating distribution $p_{\text{data}}$. We denote the empirical data distribution as $\hat p_{\text{data}}$. Let $p_{\text{model}}(x; \theta)$ be a family of distributions. The maximum likelihood estimation of $\theta$ is:

$$\begin{aligned}
\theta_{\text{ML}} &= \arg\max_\theta p_{\text{model}}(X; \theta) \\
&= \arg\max_\theta \prod_{i=1}^m p_{\text{model}}(x^{(i)}; \theta) \\
&= \arg\min_\theta \sum_{i=1}^m -\log p_{\text{model}}(x^{(i)}; \theta) \\
&= \arg\min_\theta \mathbb{E}_{x \sim \hat p_{\text{data}}} \left[-\log p_{\text{model}}(x; \theta)\right] \\
&= \arg\min_\theta H\left(\hat p_{\text{data}}, p_{\text{model}}(x; \theta)\right) \\
&= \arg\min_\theta D_{\text{KL}}\left(\hat p_{\text{data}} \,\|\, p_{\text{model}}(x; \theta)\right) + H(\hat p_{\text{data}})
\end{aligned}$$


Maximum Likelihood Estimation

MLE can be easily generalized to the conditional case, where our goal is to predict $y$ given $x$:

$$\begin{aligned}
\theta_{\text{ML}} &= \arg\max_\theta p_{\text{model}}(Y \mid X; \theta) \\
&= \arg\max_\theta \prod_{i=1}^m p_{\text{model}}(y^{(i)} \mid x^{(i)}; \theta) \\
&= \arg\min_\theta \sum_{i=1}^m -\log p_{\text{model}}(y^{(i)} \mid x^{(i)}; \theta)
\end{aligned}$$

The resulting loss function is called negative log likelihood, or cross-entropy, or Kullback-Leibler divergence.
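For a categorical classifier, the negative log likelihood above is the familiar cross-entropy of the gold labels under the predicted distribution. A minimal NumPy sketch added for illustration (the softmax model and all names are assumptions, not the lecture's code):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def negative_log_likelihood(logits, targets):
    """Mean cross-entropy of gold class indices under the predicted distribution."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Two examples, three classes; the gold classes are 0 and 2.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
print(negative_log_likelihood(logits, np.array([0, 2])))
```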


Properties of Maximum Likelihood Estimation

Assume that the true data generating distribution $p_{\text{data}}$ lies within the model family $p_{\text{model}}(\cdot; \theta)$, and assume there exists a unique $\theta_{p_{\text{data}}}$ such that $p_{\text{data}} = p_{\text{model}}(\cdot; \theta_{p_{\text{data}}})$.

MLE is a consistent estimator. If we denote $\theta_m$ to be the parameters found by MLE for a training set with $m$ examples generated by the data generating distribution, then $\theta_m$ converges in probability to $\theta_{p_{\text{data}}}$. Formally, for any $\varepsilon > 0$, $P(\|\theta_m - \theta_{p_{\text{data}}}\| > \varepsilon) \to 0$ as $m \to \infty$.

MLE is in a sense the most statistically efficient estimator. For any consistent estimator, we might consider the average distance of $\theta_m$ and $\theta_{p_{\text{data}}}$, formally $\mathbb{E}_{x_1, \ldots, x_m \sim p_{\text{data}}}\left[\|\theta_m - \theta_{p_{\text{data}}}\|_2^2\right]$. It can be shown (Rao 1945, Cramér 1946) that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

Therefore, for reasons of consistency and efficiency, maximum likelihood is often considered the preferred estimator for machine learning.


Mean Square Error as MLE

Assume our goal is to perform regression, i.e., to predict $p(y \mid x)$ for $y \in \mathbb{R}$. Let $\hat y(x; \theta)$ give a prediction of the mean of $y$. We define $p(y \mid x)$ as $N(y; \hat y(x; \theta), \sigma^2)$ for a given fixed $\sigma$. Then:

$$\begin{aligned}
\arg\max_\theta p(y \mid x; \theta)
&= \arg\min_\theta \sum_{i=1}^m -\log p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \arg\min_\theta \sum_{i=1}^m -\log \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y^{(i)} - \hat y(x^{(i)}; \theta))^2}{2\sigma^2}} \\
&= \arg\min_\theta -m \log (2\pi\sigma^2)^{-1/2} + \sum_{i=1}^m \frac{(y^{(i)} - \hat y(x^{(i)}; \theta))^2}{2\sigma^2} \\
&= \arg\min_\theta \sum_{i=1}^m \frac{(y^{(i)} - \hat y(x^{(i)}; \theta))^2}{2\sigma^2} \\
&= \arg\min_\theta \sum_{i=1}^m \left(y^{(i)} - \hat y(x^{(i)}; \theta)\right)^2.
\end{aligned}$$
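A quick numeric check (added here, not from the slides) that the Gaussian negative log likelihood and the mean squared error order predictions the same way; the data and $\sigma$ are arbitrary:

```python
import numpy as np

def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Negative log likelihood of y_true under N(y_pred, sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + (y_true - y_pred) ** 2 / (2 * sigma ** 2))

y_true = np.array([1.0, 2.0, 3.0])
for y_pred in (np.array([1.1, 1.9, 3.2]), np.array([0.0, 0.0, 0.0])):
    mse = np.mean((y_true - y_pred) ** 2)
    print(f"MSE={mse:.3f}  NLL={gaussian_nll(y_true, y_pred):.3f}")
# The prediction with the smaller MSE also has the smaller NLL; up to an
# additive constant and scaling, the two objectives agree.
```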


Gradient Descent

Figure 4.1, page 83 of Deep Learning Book, http://deeplearningbook.org

Let a model compute $f(x; \theta)$ using parameters $\theta$, and for a given loss function $L$ denote
$$J(\theta) = \mathbb{E}_{(x, y) \sim \hat p_{\text{data}}} L(f(x; \theta), y).$$
In order to compute $\arg\min_\theta J(\theta)$ we may use gradient descent:
$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta).$$
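A minimal sketch of this update rule on a toy objective (added for illustration; the function, learning rate, and iteration count are arbitrary):

```python
def J(theta):
    return (theta - 3.0) ** 2       # toy loss with minimum at theta = 3

def grad_J(theta):
    return 2.0 * (theta - 3.0)      # analytic gradient of J

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta -= alpha * grad_J(theta)  # theta <- theta - alpha * grad J(theta)
print(theta)                        # close to 3.0
```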


Gradient Descent Variants

(Regular) Gradient Descent

We use all training data to compute $J(\theta)$.

Online (or Stochastic) Gradient Descent

We estimate the expectation in $J(\theta)$ using a single randomly sampled example from the training data:
$$J(\theta) = L(f(x; \theta), y) \quad \text{for a randomly chosen } (x, y) \text{ from } \hat p_{\text{data}}.$$
Such an estimate is unbiased, but very noisy.

Minibatch SGD

Minibatch SGD is a trade-off between gradient descent and SGD – the expectation in $J(\theta)$ is estimated using $m$ random independent examples from the training data:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^m L(f(x^{(i)}; \theta), y^{(i)}) \quad \text{for randomly chosen } (x^{(i)}, y^{(i)}) \text{ from } \hat p_{\text{data}}.$$
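For intuition, an added NumPy sketch comparing the full-batch gradient of a linear regression loss with single-example and minibatch estimates; the synthetic data and the batch size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def gradient(X_batch, y_batch, w):
    """Gradient of 1/m * sum (x.w - y)^2 with respect to w."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

full = gradient(X, y, w)                  # regular gradient descent
single = gradient(X[:1], y[:1], w)        # online / stochastic estimate
minibatch = gradient(X[:32], y[:32], w)   # minibatch estimate

for name, g in [("single", single), ("minibatch", minibatch)]:
    # The minibatch estimate is typically much closer to the full gradient.
    print(name, np.linalg.norm(g - full))
```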


Stochastic Gradient Descent Convergence

It can be proven (under somewhat stricter conditions than stated here) that if the loss function is convex and continuous and has a unique optimum, then SGD converges to the unique optimum almost surely if the sequence of learning rates $\alpha_i$ fulfills the following conditions:
$$\sum_i \alpha_i = \infty, \qquad \sum_i \alpha_i^2 < \infty.$$

For non-convex loss functions, we can get guarantees of converging to a local optimum only. Note that finding a global minimum of an arbitrary function is at least NP-hard.

In the last year, there have been several improvements:

  • For some models with high capacity, it can be proven that SGD will reach a global optimum by showing it will reach zero training error.
  • Neural networks can be easily modified so that the augmented version has no local minima. Therefore, if such a network converges, it converged to a global minimum. However, the training process can still fail to converge by increasing the size of the parameters $\|\theta\|$ beyond any limit.


Loss Function Visualization

Visualization of loss function of ResNet-56 (0.85 million parameters) with/without skip connections:

Figure 1 of paper "Visualizing the Loss Landscape of Neural Nets", https://arxiv.org/abs/1712.09913.


Loss Function Visualization

Visualization of loss function of ResNet-110 without skip connections and DenseNet-121.

Figure 4 of paper "Visualizing the Loss Landscape of Neural Nets", https://arxiv.org/abs/1712.09913.


Backpropagation

Assume we want to compute partial derivatives of a given loss function $J$ in a computation graph where each input $x_i$ passes through a function $f_i$ to produce $y_i = f_i(x_i)$, and the $y_i$ are combined by $g$ into $z = g(y)$. Let $\frac{\partial J}{\partial z}$ be known. Then:

$$\frac{\partial J}{\partial y_i} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial y_i} = \frac{\partial J}{\partial z} \frac{\partial g(y)}{\partial y_i}$$

$$\frac{\partial J}{\partial x_i} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_i} = \frac{\partial J}{\partial z} \frac{\partial g(y)}{\partial y_i} \frac{\partial f(x_i)}{\partial x_i}$$


Backpropagation Example

Diagram: a small network with inputs $x_1 = 1$, $x_2 = 2$, a ReLU hidden layer connected to the input layer by weights $w_1 = 2$, $w_2 = 1$, $w_3 = 1$, $w_4 = -2$, $w_5 = 1$, $w_6 = 2$, and an output layer connected to the hidden layer by weights $w_7 = -1$, $w_8 = 3$, $w_9 = 2$.


Backpropagation Example

Diagram: the same network as on the previous slide, now with the forward-pass values filled in: hidden pre-activations $i_1 = 4$, $i_2 = -3$, $i_3 = 5$, ReLU outputs $h_1 = 4$, $h_2 = 0$, $h_3 = 5$, and output $o = 6$.


Backpropagation Example

Diagram: the same network, now with the backward pass. With the squared error loss $L = (\text{output} - \text{gold})^2$ and the forward values $h_1 = 4$, $h_2 = 0$, $h_3 = 5$, $o = 6$, the derivatives are:

$\frac{\partial L}{\partial o} = 2(\text{output} - \text{gold}) = 6$

$\frac{\partial L}{\partial w_7} = \frac{\partial L}{\partial o} \frac{\partial o}{\partial w_7} = \frac{\partial L}{\partial o} h_1 = 24$, $\quad \frac{\partial L}{\partial w_8} = \frac{\partial L}{\partial o} h_2 = 0$, $\quad \frac{\partial L}{\partial w_9} = \frac{\partial L}{\partial o} h_3 = 30$

$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial o} w_7 = -6$, $\quad \frac{\partial L}{\partial h_2} = \frac{\partial L}{\partial o} w_8 = 18$, $\quad \frac{\partial L}{\partial h_3} = \frac{\partial L}{\partial o} w_9 = 12$

$\frac{\partial L}{\partial i_1} = \frac{\partial L}{\partial h_1} \cdot 1 = -6$, $\quad \frac{\partial L}{\partial i_2} = \frac{\partial L}{\partial h_2} \cdot 0 = 0$, $\quad \frac{\partial L}{\partial i_3} = \frac{\partial L}{\partial h_3} \cdot 1 = 12$

$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial i_1} x_1 = -6$, $\quad \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial i_1} x_2 = -12$, $\quad \frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial i_2} x_1 = 0$, $\quad \frac{\partial L}{\partial w_4} = \frac{\partial L}{\partial i_2} x_2 = 0$, $\quad \frac{\partial L}{\partial w_5} = \frac{\partial L}{\partial i_3} x_1 = 12$, $\quad \frac{\partial L}{\partial w_6} = \frac{\partial L}{\partial i_3} x_2 = 24$

$\frac{\partial L}{\partial x_1} = \sum_j \frac{\partial L}{\partial i_j} \frac{\partial i_j}{\partial x_1} = 0$, $\quad \frac{\partial L}{\partial x_2} = \sum_j \frac{\partial L}{\partial i_j} \frac{\partial i_j}{\partial x_2} = 18$

This is meant to be frightening – you do not do this manually when training.
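To check that a framework reproduces these numbers, here is a small NumPy verification added for illustration. The gold target is not shown on the slide, but $\frac{\partial L}{\partial o} = 2(6 - \text{gold}) = 6$ implies gold $= 3$, which is assumed below:

```python
import numpy as np

x = np.array([1.0, 2.0])          # x1, x2
W1 = np.array([[2.0, 1.0],        # w1, w2 -> i1
               [1.0, -2.0],       # w3, w4 -> i2
               [1.0, 2.0]])       # w5, w6 -> i3
w2 = np.array([-1.0, 3.0, 2.0])   # w7, w8, w9
gold = 3.0                        # inferred target (see above)

i = W1 @ x                        # pre-activations: [4, -3, 5]
h = np.maximum(i, 0.0)            # ReLU outputs:    [4,  0, 5]
o = w2 @ h                        # output: 6

dL_do = 2.0 * (o - gold)          # 6
dL_dw2 = dL_do * h                # [24, 0, 30]
dL_dh = dL_do * w2                # [-6, 18, 12]
dL_di = dL_dh * (i > 0)           # [-6, 0, 12]
dL_dW1 = np.outer(dL_di, x)       # [[-6, -12], [0, 0], [12, 24]]
dL_dx = W1.T @ dL_di              # [0, 18]
print(o, dL_dw2, dL_dW1, dL_dx)
```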


Backpropagation Algorithm

Forward Propagation

Input: Network with nodes $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$ numbered in topological order. Each node's value is computed as $u^{(i)} = f^{(i)}(A^{(i)})$ for $A^{(i)}$ being the set of values of the predecessors $P(u^{(i)})$ of $u^{(i)}$.
Output: Value of $u^{(n)}$.

For $i = 1, \ldots, n$:
  • $A^{(i)} \leftarrow \{u^{(j)} \mid j \in P(u^{(i)})\}$
  • $u^{(i)} \leftarrow f^{(i)}(A^{(i)})$
Return $u^{(n)}$


Backpropagation Algorithm

Simple Variant of Backpropagation

Input: The network as in the forward propagation algorithm.
Output: Partial derivatives $g^{(i)} = \frac{\partial u^{(n)}}{\partial u^{(i)}}$ of $u^{(n)}$ with respect to all $u^{(i)}$.

Run forward propagation to compute all $u^{(i)}$.
$g^{(n)} \leftarrow 1$
For $i = n - 1, \ldots, 1$:
  • $g^{(i)} \leftarrow \sum_{j:\, i \in P(u^{(j)})} g^{(j)} \frac{\partial u^{(j)}}{\partial u^{(i)}}$
Return $g$

In practice, we do not usually represent networks as collections of scalar nodes; instead, we represent them as collections of tensor functions – most usually functions $f: \mathbb{R}^n \to \mathbb{R}^m$. Then $\frac{\partial f(x)}{\partial x}$ is a Jacobian. However, the backpropagation algorithm is analogous.
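A minimal scalar-node sketch of the two algorithms above, added as an illustration; the graph representation and all names are assumptions, not the lecture's code:

```python
import math

# Each node: (function, list of predecessor indices, partial-derivative function).
# Nodes are listed in topological order; the last value is the output u^(n).
def forward(nodes, inputs):
    values = list(inputs)                      # the first values are the inputs
    for f, preds, _ in nodes:
        values.append(f(*[values[p] for p in preds]))
    return values

def backward(nodes, values, num_inputs):
    g = [0.0] * len(values)
    g[-1] = 1.0                                # g^(n) = du^(n)/du^(n) = 1
    for j in range(len(nodes) - 1, -1, -1):    # reverse topological order
        _, preds, dfd = nodes[j]
        node_index = num_inputs + j
        for k, p in enumerate(preds):          # g^(i) += g^(j) * du^(j)/du^(i)
            g[p] += g[node_index] * dfd(k, *[values[q] for q in preds])
    return g

# Example graph computing exp(u1 * u2) for inputs u1 = 2, u2 = 3.
nodes = [
    (lambda a, b: a * b, [0, 1], lambda k, a, b: b if k == 0 else a),
    (math.exp,           [2],    lambda k, a: math.exp(a)),
]
vals = forward(nodes, [2.0, 3.0])
print(vals[-1], backward(nodes, vals, 2)[:2])  # derivatives w.r.t. the two inputs
```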


Neural Network Architecture à la '80s

Diagram: a fully connected network with an input layer ($x_1, \ldots, x_4$), a hidden layer ($h_1, \ldots, h_4$), and an output layer ($o_1, o_2$).


Neural Network Activation Functions

Hidden Layers Derivatives

$\sigma$: $\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$

$\tanh$: $\frac{d\tanh(x)}{dx} = 1 - \tanh(x)^2$

ReLU: $\frac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ \text{NaN} & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$
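A small NumPy sketch of these derivatives, added for illustration; returning 0 for the ReLU derivative at exactly 0 is a common implementation choice rather than the NaN shown above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    return (x > 0).astype(float)   # derivative taken as 0 at x == 0

xs = np.array([-2.0, 0.0, 2.0])
print(d_sigmoid(xs), d_tanh(xs), d_relu(xs))
```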


Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) Algorithm

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$.
Output: Updated parameters $\theta$.

Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $\theta \leftarrow \theta - \alpha g$
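A runnable sketch of this loop for linear regression (added for illustration; the synthetic data, batch size, and fixed step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=512)

theta, alpha, batch_size = np.zeros(3), 0.1, 32
for step in range(500):                               # fixed step count as the stopping criterion
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    g = 2.0 * xb.T @ (xb @ theta - yb) / batch_size   # grad of 1/m * sum (x.theta - y)^2
    theta -= alpha * g                                # theta <- theta - alpha * g
print(theta)                                          # close to [1.0, -2.0, 0.5]
```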


SGD With Momentum

Figure 8.5, page 297 of Deep Learning Book, http://deeplearningbook.org

SGD With Momentum

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$, momentum $\beta$.
Output: Updated parameters $\theta$.

Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $v \leftarrow \beta v - \alpha g$
  • $\theta \leftarrow \theta + v$


SGD With Nesterov Momentum

https://github.com/cs231n/cs231n.github.io/blob/master/assets/nn3/nesterov.jpeg

SGD With Nesterov Momentum

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$, momentum $\beta$.
Output: Updated parameters $\theta$.

Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $\theta \leftarrow \theta + \beta v$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $v \leftarrow \beta v - \alpha g$
  • $\theta \leftarrow \theta - \alpha g$
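A compact sketch of the momentum and Nesterov updates from this and the previous slide, written as pure functions of the optimizer state; `grad_fn` is an assumed callable returning the minibatch gradient at given parameters (an added illustration):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, alpha=0.01, beta=0.9):
    g = grad_fn(theta)             # gradient at the current parameters
    v = beta * v - alpha * g
    return theta + v, v

def nesterov_step(theta, v, grad_fn, alpha=0.01, beta=0.9):
    lookahead = theta + beta * v   # gradient is evaluated after the momentum step
    g = grad_fn(lookahead)
    v = beta * v - alpha * g
    return lookahead - alpha * g, v

# Toy usage on J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(100):
    theta, v = nesterov_step(theta, v, lambda t: 2.0 * t)
print(theta)                       # close to the origin
```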


Algorithms with Adaptive Learning Rates

AdaGrad (2011)

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$, constant $\varepsilon$ (usually $10^{-8}$).
Output: Updated parameters $\theta$.

Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $r \leftarrow r + g^2$
  • $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{r + \varepsilon}} g$


Algorithms with Adaptive Learning Rates

AdaGrad has favourable convergence properties (being faster than regular SGD) for convex loss landscapes. In this setting, gradients converge to zero reasonably fast.

However, for non-convex losses, gradients can stay quite large for a long time. In that case, the algorithm behaves as if decreasing the learning rate by a factor of $1/\sqrt{t}$, because if each $g_t \approx g$, then after $t$ steps $r \approx t \cdot g^2$ and therefore
$$\frac{\alpha}{\sqrt{r + \varepsilon}} \approx \frac{\alpha / \sqrt{t}}{\sqrt{g^2 + \varepsilon / t}}.$$


Algorithms with Adaptive Learning Rates

RMSProp (2012)

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$, momentum $\beta$ (usually 0.9), constant $\varepsilon$ (usually $10^{-8}$).
Output: Updated parameters $\theta$.

Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $r \leftarrow \beta r + (1 - \beta) g^2$
  • $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{r + \varepsilon}} g$

However, after the first step, $r = (1 - \beta) g^2$, which for the default $\beta = 0.9$ gives $r = 0.1 g^2$, a biased estimate (but the bias converges to zero exponentially fast).
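An added sketch of the two accumulator rules (AdaGrad from the earlier slide and RMSProp from this one); the $\varepsilon$ and $\beta$ defaults match the slides, while the learning rates and the toy objective are arbitrary choices:

```python
import numpy as np

def adagrad_step(theta, r, g, alpha=0.01, eps=1e-8):
    r = r + g ** 2                                  # accumulate all squared gradients
    return theta - alpha / np.sqrt(r + eps) * g, r

def rmsprop_step(theta, r, g, alpha=0.001, beta=0.9, eps=1e-8):
    r = beta * r + (1.0 - beta) * g ** 2            # exponential moving average instead
    return theta - alpha / np.sqrt(r + eps) * g, r

# Toy usage on J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta, r = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(5000):
    theta, r = rmsprop_step(theta, r, 2.0 * theta)
print(theta)                                        # ends up near the origin
```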


Algorithms with Adaptive Learning Rates

Adam (2014)

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$ (default 0.001), constant $\varepsilon$ (usually $10^{-8}$).
Input: Momentum $\beta_1$ (default 0.9), momentum $\beta_2$ (default 0.999).
Output: Updated parameters $\theta$.

$s \leftarrow 0$, $r \leftarrow 0$, $t \leftarrow 0$
Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $t \leftarrow t + 1$
  • $s \leftarrow \beta_1 s + (1 - \beta_1) g$ (biased first moment estimate)
  • $r \leftarrow \beta_2 r + (1 - \beta_2) g^2$ (biased second moment estimate)
  • $\hat s \leftarrow s / (1 - \beta_1^t)$, $\hat r \leftarrow r / (1 - \beta_2^t)$
  • $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{\hat r} + \varepsilon} \hat s$
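A direct transcription of the algorithm above into NumPy (added illustration; `grad_fn` is an assumed placeholder for the minibatch gradient computation):

```python
import numpy as np

def adam(grad_fn, theta, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    s = np.zeros_like(theta)           # biased first moment estimate
    r = np.zeros_like(theta)           # biased second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        s = beta1 * s + (1 - beta1) * g
        r = beta2 * r + (1 - beta2) * g ** 2
        s_hat = s / (1 - beta1 ** t)   # bias-corrected first moment
        r_hat = r / (1 - beta2 ** t)   # bias-corrected second moment
        theta = theta - alpha / (np.sqrt(r_hat) + eps) * s_hat
    return theta

# Toy usage on J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(adam(lambda t: 2.0 * t, np.array([1.0, -1.0]), steps=2000))
```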


Algorithms with Adaptive Learning Rates

Adam (2014)

The same algorithm as on the previous slide, with the bias correction folded into the learning rate $\alpha_t$.

Input: NN computing function $f(x; \theta)$ with initial value of parameters $\theta$.
Input: Learning rate $\alpha$ (default 0.001), constant $\varepsilon$ (usually $10^{-8}$).
Input: Momentum $\beta_1$ (default 0.9), momentum $\beta_2$ (default 0.999).
Output: Updated parameters $\theta$.

$s \leftarrow 0$, $r \leftarrow 0$, $t \leftarrow 0$
Repeat until stopping criterion is met:
  • Sample a minibatch of $m$ training examples $(x^{(i)}, y^{(i)})$
  • $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  • $t \leftarrow t + 1$
  • $s \leftarrow \beta_1 s + (1 - \beta_1) g$ (biased first moment estimate)
  • $r \leftarrow \beta_2 r + (1 - \beta_2) g^2$ (biased second moment estimate)
  • $\alpha_t \leftarrow \alpha \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$
  • $\theta \leftarrow \theta - \frac{\alpha_t}{\sqrt{r + \varepsilon}} s$


Adam Bias Correction

After $t$ steps, we have
$$r_t = (1 - \beta_2) \sum_{i=1}^t \beta_2^{t-i} g_i^2.$$
Assuming that the second moment $\mathbb{E}[g_i^2]$ is stationary, we have
$$\begin{aligned}
\mathbb{E}[r_t] &= \mathbb{E}\left[(1 - \beta_2) \sum_{i=1}^t \beta_2^{t-i} g_i^2\right] \\
&= \mathbb{E}[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^t \beta_2^{t-i} \\
&= \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t),
\end{aligned}$$
and analogously for the correction of $s$.


Adaptive Optimizers Animations

http://2.bp.blogspot.com/-q6l20Vs4P_w/VPmIC7sEhnI/AAAAAAAACC4/g3UOUX2r_yA/s400/ found at http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Adaptive Optimizers Animations

http://2.bp.blogspot.com/-L98w-SBmF58/VPmICIjKEKI/AAAAAAAACCs/rrFz3VetYmM/s400/ found at http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Adaptive Optimizers Animations

http://3.bp.blogspot.com/-nrtJPrdBWuE/VPmIB46F2aI/AAAAAAAACCw/vaE_B0SVy5k/s400/ found at http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Adaptive Optimizers Animations

http://1.bp.blogspot.com/-K_X-yud8nj8/VPmIBxwGlsI/AAAAAAAACC0/JS-h1fa09EQ/s400/ found at http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Learning Rate Schedules

Even if RMSProp and Adam are adaptive, they still usually require a carefully tuned decreasing learning rate for top-notch performance.

  • Exponential decay: the learning rate is multiplied by a constant each batch/epoch/several epochs, i.e., $\alpha = \alpha_{\text{initial}} c^t$. Often used for convolutional networks (image recognition etc.).
  • Polynomial decay: the learning rate is multiplied by some polynomial of $t$. Inverse time decay uses $\alpha = \alpha_{\text{initial}} \frac{1}{t}$ and has theoretical guarantees of convergence, but is usually too fast for deep neural networks. Inverse square root decay uses $\alpha = \alpha_{\text{initial}} \frac{1}{\sqrt{t}}$ and is currently used by the best machine translation models.
  • Cosine decay, restarts, warmup, …

The tf.keras.optimizers.schedules module offers several such learning rate schedules, which can be passed to any Keras optimizer directly as a learning rate.
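A short example of the mechanism mentioned above; the exact argument names follow the TensorFlow 2 API and should be verified against the installed version:

```python
import tensorflow as tf

# Exponential decay: multiply the learning rate by 0.96 every 1000 steps.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True,
)

# The schedule object is passed directly as the learning rate of any optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
print(schedule(0), schedule(1000), schedule(2000))  # 0.1, 0.096, 0.09216
```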
