CS7015 (Deep Learning) : Lecture 8

Regularization: Bias Variance Tradeoff, l2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout

Mitesh M. Khapra

Department of Computer Science and Engineering Indian Institute of Technology Madras


Acknowledgements: Chapter 7 of the Deep Learning book; Ali Ghodsi's video lectures on regularization (Lecture 2.1 and Lecture 2.2); Dropout: A Simple Way to Prevent Neural Networks from Overfitting.


Module 8.1 : Bias and Variance


We will begin with a quick overview of bias, variance and the trade-off between them.


The points were drawn from a sinusoidal function (the true f(x)). Let us consider the problem of fitting a curve through a given set of points. We consider two models:

Simple (degree 1):  y = f̂(x) = w_1 x + w_0

Complex (degree 25):  y = f̂(x) = Σ_{i=1}^{25} w_i x^i + w_0

Note that in both cases we are making an assumption about how y is related to x. We have no idea about the true relation f(x). The training data consists of 100 points.


The points were drawn from a sinusoidal function (the true f(x)). We sample 25 points from the training data and train a simple and a complex model. We repeat the process k times to train multiple models (each model sees a different sample of the training data). We make a few observations from these plots.
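To make this experiment concrete, here is a minimal numpy sketch (an illustration, not the lecture's code; the noise level, the number of models k = 20, and a sample size of 30 are my own assumptions) that trains simple and complex polynomial models on different samples of noisy sinusoidal data and reports their squared bias and variance:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # the unknown "true" function: a sinusoid, as in the lecture's example
    return np.sin(2 * np.pi * x)

# full training data: 100 noisy points drawn from the sinusoid
x_all = np.linspace(0, 1, 100)
y_all = true_f(x_all) + rng.normal(0.0, 0.2, size=x_all.shape)

x_grid = np.linspace(0, 1, 200)
k = 20          # number of models, each trained on a different sample
n_sample = 30   # points per sample (a bit more than 25 so the degree-25 fit is determined)

for degree in (1, 25):
    preds = []
    for _ in range(k):
        idx = rng.choice(len(x_all), size=n_sample, replace=False)
        p = Polynomial.fit(x_all[idx], y_all[idx], deg=degree)  # high-degree fits may be ill-conditioned
        preds.append(p(x_grid))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)  # squared gap between the average model and f
    variance = np.mean(preds.var(axis=0))                        # spread of the models around their mean
    print(f"degree {degree:2d}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

The simple (degree-1) fits should show large bias and small variance, and the complex (degree-25) fits the opposite, which is exactly the observation made on the next slides.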


Simple models trained on different samples of the data do not differ much from each other. However, they are very far from the true sinusoidal curve (underfitting). On the other hand, complex models trained on different samples of the data are very different from each other (high variance).


Green line: average value of f̂(x) for the simple model. Blue curve: average value of f̂(x) for the complex model. Red curve: the true model f(x).

Let f(x) be the true model (sinusoidal in this case) and f̂(x) be our estimate of the model (simple or complex, in this case). Then

Bias(f̂(x)) = E[f̂(x)] − f(x)

where E[f̂(x)] is the average (or expected) value of the model. We can see that for the simple model the average value (green line) is very far from the true value f(x) (the sinusoidal function). Mathematically, this means that the simple model has high bias. On the other hand, the complex model has low bias.


We now define

Variance(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]

(the standard definition from statistics). Roughly speaking, it tells us how much the different f̂(x)'s (trained on different samples of the data) differ from each other. It is clear that the simple model has low variance whereas the complex model has high variance.


In summary (informally): a simple model has high bias and low variance, whereas a complex model has low bias and high variance. There is always a trade-off between bias and variance. Both bias and variance contribute to the mean squared error; let us see how.


Module 8.2 : Train error vs Test error


Consider a new point (x, y) which was not seen during training. If we use the model f̂(x) to predict the value of y, then the mean squared error is E[(y − f̂(x))²] (the average squared error in predicting y over many such unseen points). We can show (see the linked proof) that

E[(y − f̂(x))²] = Bias² + Variance + σ² (irreducible error)


[Figure: error vs. model complexity; high bias at low complexity, high variance at high complexity, and a sweet spot (perfect tradeoff, ideal model complexity) in between]

E[(y − f̂(x))²] = Bias² + Variance + σ² (irreducible error)

The parameters of f̂(x) (all the w_i's) are trained using a training set {(x_i, y_i)}_{i=1}^{n}. However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training. This gives rise to two quantities of interest: train_err (say, mean squared error) and test_err (say, mean squared error). Typically these errors exhibit the trend shown in the figure above.


Intuitions developed so far: let there be n training points and m test (validation) points.

train_err = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²

test_err = (1/m) Σ_{i=n+1}^{n+m} (y_i − f̂(x_i))²

As the model complexity increases, train_err becomes overly optimistic and gives us a wrong picture of how close f̂ is to f. The validation error gives the real picture of how close f̂ is to f. We will now concretize this intuition mathematically and eventually show how to account for the optimism in the training error.


Let D = {x_i, y_i}_{i=1}^{m+n}; then for any point (x, y) we have y_i = f(x_i) + ε_i, which means that y_i is related to x_i by some true function f, but there is also some noise ε in the relation. For simplicity, we assume ε ∼ N(0, σ²), and of course we do not know f. Further, we use f̂ to approximate f and estimate its parameters using T ⊂ D such that y_i = f̂(x_i). We are interested in knowing E[(f̂(x_i) − f(x_i))²], but we cannot estimate this directly because we do not know f. We will see how to estimate it empirically using the observations y_i and predictions ŷ_i.


E[(ŷ_i − y_i)²] = E[(f̂(x_i) − f(x_i) − ε_i)²]   (∵ y_i = f(x_i) + ε_i)
  = E[(f̂(x_i) − f(x_i))² − 2ε_i(f̂(x_i) − f(x_i)) + ε_i²]
  = E[(f̂(x_i) − f(x_i))²] − 2E[ε_i(f̂(x_i) − f(x_i))] + E[ε_i²]

∴ E[(f̂(x_i) − f(x_i))²] = E[(ŷ_i − y_i)²] − E[ε_i²] + 2E[ε_i(f̂(x_i) − f(x_i))]


We will take a small detour to understand how to empirically estimate an expectation, and then return to our derivation.


Suppose we have observed the goals scored (z) in k matches as z_1 = 2, z_2 = 1, z_3 = 0, ..., z_k = 2. Now we can empirically estimate E[z], i.e., the expected number of goals scored, as

E[z] = (1/k) Σ_{i=1}^{k} z_i

Analogy with our derivation: we have a certain number of observations y_i and predictions ŷ_i using which we can estimate

E[(ŷ_i − y_i)²] = (1/m) Σ_{i=1}^{m} (ŷ_i − y_i)²


... returning to our derivation


E[(f̂(x_i) − f(x_i))²] = E[(ŷ_i − y_i)²] − E[ε_i²] + 2E[ε_i(f̂(x_i) − f(x_i))]

We can empirically evaluate the R.H.S. using training observations or test observations.

Case 1: Using test observations

E[(f̂(x_i) − f(x_i))²]   (true error)
  = (1/m) Σ_{i=n+1}^{n+m} (ŷ_i − y_i)²   (empirical estimate of the error)
  − (1/m) Σ_{i=n+1}^{n+m} ε_i²   (small constant)
  + 2E[ε_i(f̂(x_i) − f(x_i))]   (= covariance(ε_i, f̂(x_i) − f(x_i)))

∵ covariance(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[X(Y − µ_Y)] (if µ_X = E[X] = 0) = E[XY] − E[X µ_Y] = E[XY] − µ_Y E[X] = E[XY]


For Case 1, none of the test observations participated in the estimation of f̂(x) [the parameters of f̂(x) were estimated using only the training data].

∴ ε ⊥ (f̂(x_i) − f(x_i))
∴ E[ε_i · (f̂(x_i) − f(x_i))] = E[ε_i] · E[f̂(x_i) − f(x_i)] = 0 · E[f̂(x_i) − f(x_i)] = 0

∴ true error = empirical test error + small constant

Hence, we should always use a validation set (independent of the training set) to estimate the error.


Case 2: Using training observations

E[(f̂(x_i) − f(x_i))²]   (true error)
  = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²   (empirical estimate of the error)
  − (1/n) Σ_{i=1}^{n} ε_i²   (small constant)
  + 2E[ε_i(f̂(x_i) − f(x_i))]   (= covariance(ε_i, f̂(x_i) − f(x_i)))

Now, ε is not independent of f̂(x), because ε was used for estimating the parameters of f̂(x). So E[ε_i · (f̂(x_i) − f(x_i))] cannot be factored as E[ε_i] · E[f̂(x_i) − f(x_i)] and is not 0 in general.

Hence, the empirical train error is smaller than the true error and does not give a true picture of the error. But how is this related to model complexity? Let us see.


Module 8.3 : True error and Model complexity


Using Stein's Lemma (and some trickery) we can show that

(1/n) Σ_{i=1}^{n} ε_i(f̂(x_i) − f(x_i)) = (σ²/n) Σ_{i=1}^{n} ∂f̂(x_i)/∂y_i

When will ∂f̂(x_i)/∂y_i be high? When a small change in the observation causes a large change in the estimate f̂. Can you link this to model complexity? Yes: a complex model will be more sensitive to changes in observations, whereas a simple model will be less sensitive. Hence, we can say that

true error = empirical train error + small constant + Ω(model complexity)


Let us verify that a complex model is indeed more sensitive to minor changes in the data. We have fitted a simple and a complex model to some given data. We now change one of the data points. The simple model does not change much, as compared to the complex model.


Hence, while training, instead of minimizing the training error L_train(θ) alone, we should minimize (with respect to θ)

L̃(θ) = L_train(θ) + Ω(θ)

where Ω(θ) would be high for complex models and small for simple models. Ω(θ) acts as an approximation to (σ²/n) Σ_{i=1}^{n} ∂f̂(x_i)/∂y_i.

This is the basis for all regularization methods. We can show that l1 regularization, l2 regularization, early stopping and injecting noise into the input are all instances of this form of regularization.


[Figure: error vs. model complexity, showing high bias at low complexity, high variance at high complexity, and a sweet spot in between]

Ω(θ) should ensure that the model has reasonable complexity; it stands in for the term (σ²/n) Σ_{i=1}^{n} ∂f̂(x_i)/∂y_i.


Why do we care about this bias-variance tradeoff and model complexity? Deep neural networks are highly complex models: many parameters, many non-linearities. It is easy for them to overfit and drive the training error to 0. Hence we need some form of regularization.


Different forms of regularization: l2 regularization, dataset augmentation, parameter sharing and tying, adding noise to the inputs, adding noise to the outputs, early stopping, ensemble methods, dropout.


Module 8.4 : l2 regularization


For l2 regularization we have

L̃(w) = L(w) + (α/2) ||w||²

For SGD (or its variants), we are interested in ∇L̃(w) = ∇L(w) + αw.

Update rule: w_{t+1} = w_t − η∇L(w_t) − ηαw_t

This requires only a very small modification to the code. Let us see the geometric interpretation.
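As a sketch of how small the modification is (illustrative code, not taken from the course; the toy quadratic loss and the constants are my own assumptions), the l2 penalty only adds the extra −ηαw term to a plain SGD step:

```python
import numpy as np

def sgd_step(w, grad_L, eta=0.1, alpha=0.0):
    """One SGD step on the regularized loss L~(w) = L(w) + (alpha/2)||w||^2.

    grad_L is the gradient of the *unregularized* loss at w; the l2 penalty
    only contributes the extra -eta*alpha*w term.
    """
    return w - eta * grad_L - eta * alpha * w

# toy example: quadratic loss L(w) = 0.5 * ||w - w_target||^2
w_target = np.array([2.0, -3.0])
w = np.zeros(2)
for _ in range(100):
    grad_L = w - w_target                      # gradient of the unregularized loss
    w = sgd_step(w, grad_L, eta=0.1, alpha=0.1)

print(w)                      # shrunk towards 0 relative to w_target
print(w_target / (1 + 0.1))   # closed-form regularized optimum for this quadratic
```

For this toy quadratic the iterate converges to w_target/(1 + α), i.e., the regularized optimum is a shrunken version of the unregularized one, which is exactly the geometric picture developed next.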


Assume w* is the optimal solution for L(w) [not L̃(w)], i.e., the solution in the absence of regularization (w* optimal ⇒ ∇L(w*) = 0). Consider u = w − w*. Using a Taylor series approximation (up to 2nd order):

L(w* + u) = L(w*) + u^T ∇L(w*) + ½ u^T H u

L(w) = L(w*) + (w − w*)^T ∇L(w*) + ½ (w − w*)^T H(w − w*)
     = L(w*) + ½ (w − w*)^T H(w − w*)   (∵ ∇L(w*) = 0)

∇L(w) = ∇L(w*) + H(w − w*) = H(w − w*)

Now, ∇L̃(w) = ∇L(w) + αw = H(w − w*) + αw


Let w̃ be the optimal solution for L̃(w) [i.e., the regularized loss].

∵ ∇L̃(w̃) = 0
H(w̃ − w*) + αw̃ = 0
∴ (H + αI)w̃ = Hw*
∴ w̃ = (H + αI)^{-1} Hw*

Notice that if α → 0 then w̃ → w* [no regularization]. But we are interested in the case when α ≠ 0. Let us analyse that case.


If H is symmetric positive semi-definite, H = QΛQ^T [Q is orthogonal, QQ^T = Q^T Q = I].

w̃ = (H + αI)^{-1} Hw*
  = (QΛQ^T + αI)^{-1} QΛQ^T w*
  = (QΛQ^T + αQIQ^T)^{-1} QΛQ^T w*
  = [Q(Λ + αI)Q^T]^{-1} QΛQ^T w*
  = (Q^T)^{-1} (Λ + αI)^{-1} Q^{-1} QΛQ^T w*
  = Q(Λ + αI)^{-1} ΛQ^T w*   (∵ (Q^T)^{-1} = Q)

∴ w̃ = QDQ^T w*, where D = (Λ + αI)^{-1}Λ is a diagonal matrix which we will examine in more detail soon.


w̃ = Q(Λ + αI)^{-1}ΛQ^T w* = QDQ^T w*

where (Λ + αI)^{-1} is diagonal with entries 1/(λ_1 + α), 1/(λ_2 + α), ..., 1/(λ_n + α), so D = (Λ + αI)^{-1}Λ is diagonal with entries λ_1/(λ_1 + α), λ_2/(λ_2 + α), ..., λ_n/(λ_n + α).

So what is happening here? w* first gets rotated by Q^T to give Q^T w*. If α = 0, then Q simply rotates Q^T w* back to give w*. If α ≠ 0, let us see what D looks like and what happens then.


With w̃ = QDQ^T w* and D = (Λ + αI)^{-1}Λ as above, each element i of Q^T w* gets scaled by λ_i/(λ_i + α) before it is rotated back by Q:

if λ_i >> α then λ_i/(λ_i + α) ≈ 1
if λ_i << α then λ_i/(λ_i + α) ≈ 0

Thus only significant directions (those with larger eigenvalues) will be retained.

Effective number of parameters = Σ_{i=1}^{n} λ_i/(λ_i + α) < n
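A quick numerical check of this picture (an illustrative sketch, not part of the lecture): for a random symmetric positive semi-definite H, the ridge solution (H + αI)^{-1}Hw* coincides with rotating w* by Q^T, scaling each component by λ_i/(λ_i + α), and rotating back.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 5, 0.5

# random symmetric positive semi-definite H and an arbitrary unregularized optimum w*
A = rng.normal(size=(n, n))
H = A @ A.T
w_star = rng.normal(size=n)

# direct regularized solution: w~ = (H + alpha*I)^{-1} H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

# eigen-view: rotate by Q^T, scale each component by lambda_i / (lambda_i + alpha), rotate back
lam, Q = np.linalg.eigh(H)
D = np.diag(lam / (lam + alpha))
w_tilde_eig = Q @ D @ Q.T @ w_star

print(np.allclose(w_tilde, w_tilde_eig))                       # True
print(lam / (lam + alpha))                                     # per-direction shrinkage factors
print("effective parameters:", np.sum(lam / (lam + alpha)))    # strictly less than n
```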


The weight vector w* is getting rotated to w̃. All of its elements are shrinking, but some are shrinking more than others. This ensures that only important features are given high weights.


Module 8.5 : Dataset augmentation


[Given training data: an image with label = 2] We exploit the fact that certain transformations of the image do not change its label: the same image rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, or with some pixels changed still has label = 2. [Augmented data is created using some knowledge of the task.]


Typically, more data means better learning. Dataset augmentation works well for image classification / object recognition tasks, and has also been shown to work well for speech. For some tasks it may not be clear how to generate such data.
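As an illustration (a sketch under the assumption that images are plain numpy arrays; the helper name augment and the particular parameters are mine), label-preserving transformations such as the rotations, shifts and blurring mentioned above can be generated with scipy.ndimage:

```python
import numpy as np
from scipy import ndimage

def augment(image, rng):
    """Return a list of label-preserving variants of a single (H, W) image."""
    return [
        ndimage.rotate(image, angle=20, reshape=False),    # rotate by 20 degrees
        ndimage.rotate(image, angle=65, reshape=False),    # rotate by 65 degrees
        ndimage.shift(image, shift=(3, 0)),                # shift vertically
        ndimage.shift(image, shift=(0, 3)),                # shift horizontally
        ndimage.gaussian_filter(image, sigma=1.0),         # blur
        image + rng.normal(0, 0.05, size=image.shape),     # change some pixels
    ]

rng = np.random.default_rng(0)
image = rng.random((28, 28))        # stand-in for a digit image whose label is "2"
augmented = augment(image, rng)     # every variant keeps the label "2"
print(len(augmented), augmented[0].shape)
```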


Module 8.6 : Parameter Sharing and tying


Parameter sharing: used in CNNs; the same filter is applied at different positions of the image, i.e., the same weight matrix acts on different input neurons. Parameter tying: typically used in autoencoders [figure: x → h(x) → x̂]; the encoder and decoder weights are tied.
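A minimal sketch of parameter tying in an autoencoder (my own illustration; the lecture only states the idea): the decoder reuses the transpose of the encoder matrix, so there is a single weight matrix to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 3

W = rng.normal(scale=0.1, size=(d_hidden, d_in))   # single shared weight matrix
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_in)

def forward(x):
    h = np.tanh(W @ x + b_enc)    # encoder: h(x) = g(Wx + b)
    x_hat = W.T @ h + b_dec       # decoder reuses W^T (tied weights)
    return h, x_hat

x = rng.normal(size=d_in)
h, x_hat = forward(x)
print(h.shape, x_hat.shape)       # (3,) (8,)
```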


Module 8.7 : Adding Noise to the inputs


[Figure: x → x̃ → h(x̃) → x̂, where P(x̃|x) is the noise process] We saw this in the denoising autoencoder. We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularisation). It can also be viewed as data augmentation.

[Figure: a network whose inputs are the noise-corrupted values x_1 + ε_1, x_2 + ε_2, ..., x_k + ε_k, ..., x_n + ε_n, with ε ∼ N(0, σ²)]

x̃_i = x_i + ε_i

ŷ = Σ_{i=1}^{n} w_i x_i

ỹ = Σ_{i=1}^{n} w_i x̃_i = Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} w_i ε_i = ŷ + Σ_{i=1}^{n} w_i ε_i

We are interested in E[(ỹ − y)²]:

E[(ỹ − y)²] = E[(ŷ + Σ_{i=1}^{n} w_i ε_i − y)²]
            = E[((ŷ − y) + Σ_{i=1}^{n} w_i ε_i)²]
            = E[(ŷ − y)²] + E[2(ŷ − y) Σ_{i=1}^{n} w_i ε_i] + E[(Σ_{i=1}^{n} w_i ε_i)²]
            = E[(ŷ − y)²] + 0 + E[Σ_{i=1}^{n} w_i² ε_i²]   (∵ ε_i is independent of ε_j and of (ŷ − y))
            = E[(ŷ − y)²] + σ² Σ_{i=1}^{n} w_i²   (same as the L2 norm penalty)
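A quick Monte Carlo check of this result (an illustrative sketch, not from the slides; the dimensions and noise level are my own assumptions): for a fixed linear model, the expected squared error under input noise is close to the clean squared error plus σ² Σ w_i².

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 10, 0.1, 200000

w = rng.normal(size=n)      # fixed linear model
x = rng.normal(size=n)      # a single input
y = rng.normal()            # its (arbitrary) target
y_hat = w @ x               # clean prediction

# prediction with Gaussian noise added to the inputs, averaged over many noise draws
eps = rng.normal(0.0, sigma, size=(trials, n))
y_tilde = (x + eps) @ w
mc_error = np.mean((y_tilde - y) ** 2)

predicted = (y_hat - y) ** 2 + sigma ** 2 * np.sum(w ** 2)
print(mc_error, predicted)  # the two numbers should nearly match
```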


Module 8.8 : Adding Noise to the outputs


Hard targets: the true distribution is p = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0} (a one-hot vector with 1 at the correct class) and q is the estimated distribution.

minimize: −Σ_{i=0}^{9} p_i log q_i

Intuition: do not trust the true labels, they may be noisy. Instead, use soft targets.


Soft targets: replace the one-hot target with a smoothed distribution (true distribution + noise), p = {ε/9, ε/9, 1 − ε, ε/9, ..., ε/9}, where ε is a small positive constant and q is the estimated distribution.

minimize: −Σ_{i=0}^{9} p_i log q_i
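A small numpy sketch of soft targets (label smoothing); the helper names and the value of ε are illustrative assumptions.

```python
import numpy as np

def soft_targets(label, num_classes=10, eps=0.1):
    """One-hot target smoothed: 1 - eps at the true class, eps/(num_classes - 1) elsewhere."""
    p = np.full(num_classes, eps / (num_classes - 1))
    p[label] = 1.0 - eps
    return p

def cross_entropy(p, q):
    # minimize: -sum_i p_i log q_i
    return -np.sum(p * np.log(q + 1e-12))

q = np.array([0.02, 0.03, 0.80, 0.05, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01])  # model's estimate
print(cross_entropy(soft_targets(2), q))   # loss with soft targets
hard = np.eye(10)[2]
print(cross_entropy(hard, q))              # loss with hard targets, for comparison
```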


Module 8.9 : Early stopping

[Figure: training and validation error vs. steps; the validation error stops improving at step k − p, training is stopped at step k, and the model stored at step k − p is returned]

Track the validation error and have a patience parameter p. If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p. Basically, stop the training early before it drives the training error to 0 and blows up the validation error.
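A sketch of this early stopping loop (illustrative; train_step and validation_error are hypothetical helpers standing in for one training update and one evaluation pass):

```python
import copy

def train_with_early_stopping(model, train_step, validation_error, patience=5, max_steps=1000):
    """Stop when the validation error has not improved for `patience` steps and
    return the snapshot taken at the best point (step k - p)."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_improvement = 0

    for step in range(max_steps):
        train_step(model)                      # one update (mini-batch / epoch) on the training set
        err = validation_error(model)          # track the validation error
        if err < best_err:
            best_err = err
            best_model = copy.deepcopy(model)  # remember the model stored at step k - p
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1
            if steps_since_improvement >= patience:
                break                          # no improvement in the previous p steps: stop
    return best_model
```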


Early stopping is very effective and the most widely used form of regularization. It can be used even with other regularizers (such as l2). How does it act as a regularizer? We will first see an intuitive explanation and then a mathematical analysis.


Recall that the update rule in SGD is

w_{t+1} = w_t − η∇w_t = w_0 − η Σ_{i=1}^{t} ∇w_i

Let τ be the maximum value of ∇w_i; then

|w_{t+1} − w_0| ≤ ηt|τ|

Thus, t controls how far w_t can go from the initial w_0. In other words, it controls the space of exploration.


We will now see a mathematical analysis of this


Recall that the Taylor series approximation for L(w) is

L(w) = L(w*) + (w − w*)^T ∇L(w*) + ½ (w − w*)^T H(w − w*)
     = L(w*) + ½ (w − w*)^T H(w − w*)   [w* is optimal, so ∇L(w*) = 0]

∇(L(w)) = H(w − w*)

Now the SGD update rule is:

w_t = w_{t−1} − η∇L(w_{t−1})
    = w_{t−1} − ηH(w_{t−1} − w*)
    = (I − ηH)w_{t−1} + ηHw*


w_t = (I − ηH)w_{t−1} + ηHw*

Using the EVD of H as H = QΛQ^T, we get:

w_t = (I − ηQΛQ^T)w_{t−1} + ηQΛQ^T w*

If we start with w_0 = 0 then we can show that (see Appendix)

w_t = Q[I − (I − ηΛ)^t]Q^T w*

Compare this with the expression we had for the optimum w̃ with L2 regularization:

w̃ = Q[I − (Λ + αI)^{-1}α]Q^T w*

We observe that w_t = w̃ if we choose η, t and α such that (I − ηΛ)^t = (Λ + αI)^{-1}α.
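A small numerical check of the closed form w_t = Q[I − (I − ηΛ)^t]Q^T w* (an illustrative sketch, not part of the lecture; the matrix size, learning rate and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, t = 4, 0.05, 50

A = rng.normal(size=(n, n))
H = A @ A.T                     # Hessian of the quadratic approximation
w_star = rng.normal(size=n)     # unregularized optimum

# run t gradient steps of w <- (I - eta*H) w + eta*H w*, starting from w_0 = 0
w = np.zeros(n)
for _ in range(t):
    w = (np.eye(n) - eta * H) @ w + eta * H @ w_star

# closed form from the slide: w_t = Q [I - (I - eta*Lambda)^t] Q^T w*
lam, Q = np.linalg.eigh(H)
w_closed = Q @ np.diag(1 - (1 - eta * lam) ** t) @ Q.T @ w_star

print(np.allclose(w, w_closed))   # True
```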


Things to remember: early stopping only allows t updates to the parameters. If a parameter w corresponds to a dimension which is important for the loss L(θ), then ∂L(θ)/∂w will be large. However, if a parameter is not important (∂L(θ)/∂w is small), then its updates will be small and the parameter will not be able to grow large in t steps. Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).


Module 8.10 : Ensemble methods


[Figure: an ensemble where the outputs y_lr (Logistic Regression), y_svm (SVM) and y_nb (Naive Bayes), each computed from inputs x_1, ..., x_4, are combined into y_final]

Combine the outputs of different models to reduce generalization error. The models can correspond to different classifiers, or to different instances of the same classifier trained with: different hyperparameters, different features, or different samples of the training data.


[Figure: three logistic regression models, each trained on a different sample of the data, combined into y_final]

Bagging: form an ensemble using different instances of the same classifier, each trained with a different sample of the data (sampling with replacement). From a given dataset, construct multiple training sets by sampling with replacement (T_1, T_2, ..., T_k), and train the i-th instance of the classifier using training set T_i.


When would bagging work? Consider a set of k logistic regression models, and suppose that each model makes an error ε_i on a test example. Let the ε_i be drawn from a zero-mean multivariate normal distribution with Variance = E[ε_i²] = V and Covariance = E[ε_iε_j] = C.

The error made by the average prediction of all the models is (1/k) Σ_i ε_i. The expected squared error is:

mse = E[((1/k) Σ_i ε_i)²]
    = (1/k²) E[Σ_i Σ_{j=i} ε_iε_j + Σ_i Σ_{j≠i} ε_iε_j]
    = (1/k²) E[Σ_i ε_i² + Σ_i Σ_{j≠i} ε_iε_j]
    = (1/k²) (Σ_i E[ε_i²] + Σ_i Σ_{j≠i} E[ε_iε_j])
    = (1/k²) (kV + k(k − 1)C)
    = (1/k)V + ((k − 1)/k)C


mse = (1/k)V + ((k − 1)/k)C

When would bagging work? If the errors of the models are perfectly correlated, then V = C and mse = V [bagging does not help: the mse of the ensemble is as bad as that of the individual models]. If the errors of the models are independent or uncorrelated, then C = 0 and the mse of the ensemble reduces to (1/k)V. On average, the ensemble will perform at least as well as its individual members.
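A quick simulation of this formula (illustrative, not from the lecture; k, V and C are arbitrary choices): draw correlated zero-mean errors for k models and compare the ensemble's empirical mse with (1/k)V + ((k − 1)/k)C.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, C, trials = 10, 1.0, 0.3, 200000

# covariance matrix with variance V on the diagonal and covariance C off the diagonal
cov = np.full((k, k), C) + (V - C) * np.eye(k)
errors = rng.multivariate_normal(np.zeros(k), cov, size=trials)  # one row of k model errors per trial

ensemble_error = errors.mean(axis=1)           # error of the average prediction
mse_empirical = np.mean(ensemble_error ** 2)
mse_formula = V / k + (k - 1) / k * C

print(mse_empirical, mse_formula)              # should nearly agree
print("single model mse:", V)                  # the ensemble is no worse than an individual model
```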


Module 8.11 : Dropout


Typically, model averaging (a bagging ensemble) always helps. However, training several large neural networks to form an ensemble is prohibitively expensive. Option 1: train several neural networks having different architectures (obviously expensive). Option 2: train multiple instances of the same network using different training samples (again expensive). Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications.


Dropout is a technique which addresses both these issues. Effectively, it allows training several neural networks without any significant computational overhead. It also gives an efficient approximate way of combining exponentially many different neural networks.


Dropout refers to dropping out units. We temporarily remove a node and all its incoming and outgoing connections, resulting in a thinned network. Each node is retained with a fixed probability, typically p = 0.5 for hidden nodes and p = 0.8 for visible nodes.


Suppose a neural network has n nodes. Using the dropout idea, each node can be retained or dropped. For example, in the case above we drop 5 nodes to get a thinned network. Given a total of n nodes, what is the total number of thinned networks that can be formed? 2^n. Of course, this is prohibitively large and we cannot possibly train so many networks. Trick: (1) share the weights across all the networks, and (2) sample a different network for each training instance. Let us see how.


We initialize all the parameters (weights) of the network and start training. For the first training instance (or mini-batch), we apply dropout, resulting in a thinned network. We compute the loss and backpropagate. Which parameters will we update? Only those which are active.


For the second training instance (or mini-batch), we again apply dropout, resulting in a different thinned network. We again compute the loss and backpropagate to the active weights. If a weight was active for both training instances, then it would have received two updates by now; if it was active for only one of the training instances, then it would have received only one update by now. Each thinned network gets trained rarely (or even never), but the parameter sharing ensures that no model has untrained or poorly trained parameters.


[Figure: at training time a unit is present with probability p and has outgoing weights w_1, w_2, w_3, w_4; at test time the unit is always present and the weights are scaled to pw_1, pw_2, pw_3, pw_4]

What happens at test time? It is impossible to aggregate the outputs of 2^n thinned networks. Instead, we use the full neural network and scale the output of each node by the fraction of times it was on during training.
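A sketch of dropout for one layer (illustrative numpy, not the lecture's code): at training time each unit is kept with probability p and a different mask is sampled per training instance; at test time the full layer is used with activations scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, p, train):
    """h: activations of a hidden layer; p: probability of *retaining* a unit."""
    if train:
        mask = rng.random(h.shape) < p   # sample a thinned network for this instance
        return h * mask
    return h * p                         # test time: keep all units, scale by p

h = rng.normal(size=5)
print(dropout_layer(h, p=0.5, train=True))    # some units zeroed out
print(dropout_layer(h, p=0.5, train=False))   # all units present, scaled by 0.5
```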


Dropout essentially applies a masking noise to the hidden units and prevents them from co-adapting. Essentially, a hidden unit cannot rely too much on other units, as they may get dropped out at any time. Each hidden unit has to learn to be more robust to these random dropouts.


Here is an example of how dropout helps in ensuring redundancy and robustness. Suppose a hidden unit h_i learns to detect a face by firing on detecting a nose. Dropping h_i then corresponds to erasing the information that a nose exists. The model should then learn another h_i which redundantly encodes the presence of a nose, or it should learn to detect the face using other features.


Recap: l2 regularization, dataset augmentation, parameter sharing and tying, adding noise to the inputs, adding noise to the outputs, early stopping, ensemble methods, dropout.


Appendix


To prove: the two equations below are equivalent.

w_t = (I − ηQΛQ^T)w_{t−1} + ηQΛQ^T w*
w_t = Q[I − (I − ηΛ)^t]Q^T w*

Proof by induction. Base case: t = 1 and w_0 = 0.

w_1 according to the first equation: w_1 = (I − ηQΛQ^T)w_0 + ηQΛQ^T w* = ηQΛQ^T w*
w_1 according to the second equation: w_1 = Q(I − (I − ηΛ)^1)Q^T w* = Q(ηΛ)Q^T w* = ηQΛQ^T w*


Induction step: assume the two equations are equivalent for the t-th step, i.e.

w_t = (I − ηQΛQ^T)w_{t−1} + ηQΛQ^T w* = Q[I − (I − ηΛ)^t]Q^T w*

We show that the equivalence then holds for the (t + 1)-th step:

w_{t+1} = (I − ηQΛQ^T)w_t + ηQΛQ^T w*
        = (I − ηQΛQ^T)Q(I − (I − ηΛ)^t)Q^T w* + ηQΛQ^T w*   (using w_t = Q[I − (I − ηΛ)^t]Q^T w*)
        = Q(I − (I − ηΛ)^t)Q^T w* − ηQΛQ^T Q(I − (I − ηΛ)^t)Q^T w* + ηQΛQ^T w*


Continuing,

w_{t+1} = Q(I − (I − ηΛ)^t)Q^T w* − ηQΛQ^T Q(I − (I − ηΛ)^t)Q^T w* + ηQΛQ^T w*
        = Q(I − (I − ηΛ)^t)Q^T w* − ηQΛ(I − (I − ηΛ)^t)Q^T w* + ηQΛQ^T w*   (∵ Q^T Q = I)
        = Q[(I − (I − ηΛ)^t) − ηΛ(I − (I − ηΛ)^t) + ηΛ]Q^T w*
        = Q[I − (I − ηΛ)^t + ηΛ(I − ηΛ)^t]Q^T w*
        = Q[I − (I − ηΛ)^t(I − ηΛ)]Q^T w*
        = Q(I − (I − ηΛ)^{t+1})Q^T w*
