Machine Learning - MT 2016
4 & 5. Basis Expansion, Regularization, Validation
Varun Kanade
University of Oxford
October 19 & 24, 2016
Outline
◮ Basis function expansion to capture non-linear relationships
◮ Understanding the bias-variance tradeoff
◮ Overfitting and Regularization
◮ Bayesian View of Machine Learning
◮ Cross-validation to perform model selection
1
Outline
Basis Function Expansion
Overfitting and the Bias-Variance Tradeoff
Ridge Regression and Lasso
Bayesian Approach to Machine Learning
Model Selection
Linear Regression : Polynomial Basis Expansion

φ(x) = [1, x, x^2]
w_0 + w_1 x + w_2 x^2 = φ(x) · [w_0, w_1, w_2]

2
Linear Regression : Polynomial Basis Expansion
φ(x) = [1, x, x^2, ..., x^d]

Model: y = w^T φ(x) + ε

Here w ∈ R^M, where M is the number of expanded features
2
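As a concrete illustration, here is a minimal sketch of polynomial basis expansion and least-squares fitting in one dimension. It assumes numpy; the helper name poly_features and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input array to phi(x) = [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Toy data: a noisy quadratic relationship
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(30)

Phi = poly_features(x, degree=2)              # N x 3 design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares fit of y = Phi w
print(w)                                      # roughly [1, 2, -3]
```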
Linear Regression : Polynomial Basis Expansion
Getting more data can avoid overfitting!
2
Polynomial Basis Expansion in Higher Dimensions
Basis expansion can be performed in higher dimensions
We're still fitting linear models, but using more features

y = w · φ(x) + ε

Linear Model: φ(x) = [1, x_1, x_2]
Quadratic Model: φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]

Using degree-d polynomials in D dimensions results in ≈ D^d features!
3
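A short sketch of the same idea in higher dimensions, assuming scikit-learn is available (PolynomialFeatures is used purely for illustration). The exact number of monomials of degree at most d in D variables is C(D + d, d), which grows like D^d up to a factor of d!.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(5, 2)                   # N = 5 points in D = 2 dimensions
Phi = PolynomialFeatures(degree=2).fit_transform(X)
print(Phi.shape)                            # (5, 6): [1, x1, x2, x1^2, x1*x2, x2^2]

# Exact count of monomials of degree <= d in D variables: C(D + d, d)
for D, d in [(2, 2), (10, 3), (100, 10)]:
    print(D, d, comb(D + d, d))
```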
Basis Expansion Using Kernels
We can use kernels as features

A Radial Basis Function (RBF) kernel with width parameter γ is defined as
κ(x, x′) = exp(−γ ‖x − x′‖^2)

Choose centres µ_1, µ_2, . . . , µ_M

Feature map: φ(x) = [1, κ(µ_1, x), . . . , κ(µ_M, x)]

y = w_0 + w_1 κ(µ_1, x) + · · · + w_M κ(µ_M, x) + ε = w · φ(x) + ε

How do we choose the centres?
4
Basis Expansion Using Kernels
One reasonable choice is to use the data points themselves as centres for the kernels

We still need to choose the width parameter γ for the RBF kernel κ(x, x′) = exp(−γ ‖x − x′‖^2)

As with the choice of degree in polynomial basis expansion, depending on the width of the kernel, overfitting or underfitting may occur

◮ Overfitting occurs if the width is too small, i.e., γ very large
◮ Underfitting occurs if the width is too large, i.e., γ very small
5
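A minimal sketch of RBF basis expansion with the training points themselves as centres, assuming numpy; the function name rbf_features, the value γ = 1.0 and the toy data are illustrative.

```python
import numpy as np

def rbf_features(X, centres, gamma):
    """Feature map phi(x) = [1, k(mu_1, x), ..., k(mu_M, x)] with an RBF kernel."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dists)
    return np.hstack([np.ones((X.shape[0], 1)), K])

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

Phi = rbf_features(X, centres=X, gamma=1.0)   # data points as centres
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```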
When the kernel width is too large
6
When the kernel width is too small
6
When the kernel width is chosen suitably
6
Big Data: When the kernel width is too large
7
Big Data: When the kernel width is too small
7
Big Data: When the kernel width is chosen suitably
7
Basis Expansion using Kernels
◮ Overfitting occurs if the kernel width is too small, i.e., γ very large
  ◮ Having more data can help reduce overfitting!
◮ Underfitting occurs if the width is too large, i.e., γ very small
  ◮ Extra data does not help at all in this case!
◮ When the data lies in a high-dimensional space we may encounter the curse of dimensionality
  ◮ If the width is too large then we may underfit
  ◮ Might need an exponentially large (in the dimension) sample to use modest-width kernels
◮ Connection to Problem 1 on Sheet 1
8
Outline
Basis Function Expansion
Overfitting and the Bias-Variance Tradeoff
Ridge Regression and Lasso
Bayesian Approach to Machine Learning
Model Selection
The Bias Variance Tradeoff

[Figure: a sequence of fits ranging from High Bias (underfitting) to High Variance (overfitting)]

9
The Bias Variance Tradeoff
◮ Having high bias means that we are underfitting
◮ Having high variance means that we are overfitting
◮ The terms bias and variance in this context are precisely defined statistical notions
◮ See Problem Sheet 2, Q3 for precise calculations in one particular context
◮ See Secs. 7.1-3 in the HTF book for a much more detailed description
10
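For reference, the decomposition these terms come from, stated for squared error at a fixed input x under y = f(x) + ε with E[ε] = 0 and Var(ε) = σ²; this is the standard identity rather than a derivation from the slides. The expectation is over both the training set that produced the estimator and the new noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```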
Learning Curves
Suppose we've trained a model and used it to make predictions, but in reality the predictions are often poor

◮ How can we know whether we have high bias (underfitting), high variance (overfitting), or neither?
◮ Should we add more features (higher-degree polynomials, lower-width kernels, etc.) to make the model more expressive?
◮ Should we simplify the model (lower-degree polynomials, larger-width kernels, etc.) to reduce the number of parameters?
◮ Should we try to obtain more data?
◮ Often there is a computational and monetary cost to using more data
11
Learning Curves
Split the data into a training set and a test set
Train on increasing sizes of data
Plot the training error and test error as a function of training-set size

[Two learning-curve plots, titled "More data is not useful" and "More data would be useful"]
12
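A sketch of how such learning curves can be computed, assuming numpy and ordinary least squares as the learner; the function learning_curve and the synthetic data are illustrative.

```python
import numpy as np

def learning_curve(X, y, fit, predict, sizes, n_test=200):
    """Train on growing subsets and record train/test mean squared error."""
    X_tr, y_tr, X_te, y_te = X[:-n_test], y[:-n_test], X[-n_test:], y[-n_test:]
    curves = []
    for n in sizes:
        w = fit(X_tr[:n], y_tr[:n])
        train_err = np.mean((predict(X_tr[:n], w) - y_tr[:n]) ** 2)
        test_err = np.mean((predict(X_te, w) - y_te) ** 2)
        curves.append((n, train_err, test_err))
    return curves

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.3 * rng.standard_normal(1000)

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda X, w: X @ w
for n, tr, te in learning_curve(X, y, fit, predict, sizes=[10, 30, 100, 300, 800]):
    print(f"n={n:4d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```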
Overfitting: How does it occur?
When dealing with high-dimensional data (which may be caused by basis expansion), even for a linear model we have many parameters

With D = 100 input variables and degree-10 polynomial basis expansion we have ∼ 10^20 parameters!

Enrico Fermi to Freeman Dyson:
"I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." [video]

How can we prevent overfitting?
13
Overfitting: How does it occur?
Suppose we have D = 100 and N = 100, so that X is 100 × 100
Suppose every entry of X is drawn from N(0, 1)
And let y_i = x_{i,1} + ε_i, where ε_i ∼ N(0, σ^2), for σ = 0.2
14
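A sketch reproducing this setup with numpy and unregularized least squares; exact numbers depend on the random seed, but the pattern (near-zero training error, much larger test error) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma = 100, 100, 0.2

X = rng.standard_normal((N, D))
y = X[:, 0] + sigma * rng.standard_normal(N)        # only the first feature matters

w = np.linalg.lstsq(X, y, rcond=None)[0]             # unregularized least squares

X_test = rng.standard_normal((1000, D))
y_test = X_test[:, 0] + sigma * rng.standard_normal(1000)

print("train MSE:", np.mean((X @ w - y) ** 2))             # essentially 0: we interpolate
print("test MSE :", np.mean((X_test @ w - y_test) ** 2))   # much larger than sigma^2
print("largest spurious weight:", np.abs(w[1:]).max())     # weights on irrelevant features
```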
Outline
Basis Function Expansion
Overfitting and the Bias-Variance Tradeoff
Ridge Regression and Lasso
Bayesian Approach to Machine Learning
Model Selection
Ridge Regression
Suppose we have data (x_i, y_i)_{i=1}^N, where x ∈ R^D with D ≫ N

One idea to avoid overfitting is to add a penalty term for weights

Least Squares Objective
L(w) = (Xw − y)^T (Xw − y)

Ridge Regression Objective
L_ridge(w) = (Xw − y)^T (Xw − y) + λ ∑_{i=1}^D w_i^2
15
Ridge Regression
We add a penalty term for weights to control model complexity
Should not penalise the constant term w_0 for being large
16
Ridge Regression
Should translating and scaling inputs contribute to model complexity?

Suppose y = w_0 + w_1 x

Suppose x is temperature in °C and x′ is the same temperature in °F

So y = (w_0 − (160/9) w_1) + (5/9) w_1 x′

In one case the "model complexity" is w_1^2, in the other it is (25/81) w_1^2 < w_1^2 / 3

Should try and avoid dependence on scaling and translation of variables
17
Ridge Regression
Before optimising the ridge objective, it's a good idea to standardise all inputs (mean 0 and variance 1)

If in addition we centre the outputs, i.e., the outputs have mean 0, then the constant term is unnecessary (Exercise on Sheet 2)

Then find w that minimises the objective function
L_ridge(w) = (Xw − y)^T (Xw − y) + λ w^T w
18
Deriving Estimate for Ridge Regression
Suppose we have data (x_i, y_i)_{i=1}^N, with inputs standardised and outputs centred

We want to derive an expression for w that minimises
L_ridge(w) = (Xw − y)^T (Xw − y) + λ w^T w
           = w^T X^T X w − 2 y^T X w + y^T y + λ w^T w

Let's take the gradient of the objective with respect to w:
∇_w L_ridge = 2 (X^T X) w − 2 X^T y + 2 λ w = 2 ((X^T X + λ I_D) w − X^T y)

Set the gradient to 0 and solve for w:
(X^T X + λ I_D) w = X^T y
w_ridge = (X^T X + λ I_D)^{-1} X^T y
19
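A direct implementation of this closed form, assuming numpy; the synthetic data reuses the D = N = 100 example from earlier in the lecture, and the grid of λ values is illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: w = (X^T X + lam * I)^(-1) X^T y.
    Assumes inputs are standardised and outputs centred, as on the slide."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
y = X[:, 0] + 0.2 * rng.standard_normal(100)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:7.2f}  ||w||^2={w @ w:8.3f}  w_1={w[0]:.3f}")
```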
Ridge Regression

Minimise (Xw − y)^T (Xw − y) + λ w^T w
Minimise (Xw − y)^T (Xw − y) subject to w^T w ≤ R

20
Ridge Regression
As we decrease λ the magnitudes of weights start increasing
21
Summary : Ridge Regression
In ridge regression, in addition to the residual sum of squares, we penalise the sum of squares of the weights

Ridge Regression Objective
L_ridge(w) = (Xw − y)^T (Xw − y) + λ w^T w

This is also called ℓ2-regularization or weight decay

Penalising weights "encourages fitting signal rather than just noise"
22
The Lasso
Lasso (least absolute shrinkage and selection operator) minimises the following objective function

Lasso Objective
L_lasso(w) = (Xw − y)^T (Xw − y) + λ ∑_{i=1}^D |w_i|

◮ As with ridge regression, there is a penalty on the weights
◮ The absolute value function does not allow for a simple closed-form expression (ℓ1-regularization)
◮ However, there are advantages to using the lasso, as we shall see next
23
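One standard way to minimise this objective is iterative soft-thresholding (ISTA); the sketch below assumes numpy and is not necessarily the algorithm used in the course. The value λ = 50 and the iteration count are arbitrary choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=2000):
    """Minimise (Xw - y)^T (Xw - y) + lam * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
y = X[:, 0] + 0.2 * rng.standard_normal(100)

w = lasso_ista(X, y, lam=50.0)
print("non-zero weights:", np.sum(np.abs(w) > 1e-6))   # most weights are exactly 0
```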
The Lasso : Optimization

Minimise (Xw − y)^T (Xw − y) + λ ∑_{i=1}^D |w_i|
Minimise (Xw − y)^T (Xw − y) subject to ∑_{i=1}^D |w_i| ≤ R

24
The Lasso Paths
As we decrease λ the magnitudes of weights start increasing
25
Comparing Ridge Regression and the Lasso
When using the Lasso, weights are often exactly 0. Thus, Lasso gives sparse models.
26
Overfitting: How does it occur?
We have D = 100 and N = 100 so that X is 100 × 100
Every entry of X is drawn from N(0, 1)
y_i = x_{i,1} + ε_i, with ε_i ∼ N(0, σ^2), σ = 0.2

No regularization
27
Overfitting: How does it occur?
We have D = 100 and N = 100 so that X is 100 × 100
Every entry of X is drawn from N(0, 1)
y_i = x_{i,1} + ε_i, with ε_i ∼ N(0, σ^2), σ = 0.2

Ridge
27
Overfitting: How does it occur?
We have D = 100 and N = 100 so that X is 100 × 100
Every entry of X is drawn from N(0, 1)
y_i = x_{i,1} + ε_i, with ε_i ∼ N(0, σ^2), σ = 0.2

Lasso
27
Outline
Basis Function Expansion
Overfitting and the Bias-Variance Tradeoff
Ridge Regression and Lasso
Bayesian Approach to Machine Learning
Model Selection
Least Squares and MLE (Gaussian Noise)

Least Squares
Objective Function: L(w) = ∑_{i=1}^N (y_i − w · x_i)^2

MLE (Gaussian Noise)
Likelihood: p(y | X, w) = (1 / (2πσ^2)^{N/2}) ∏_{i=1}^N exp(−(y_i − w · x_i)^2 / (2σ^2))

For estimating w, the negative log-likelihood under Gaussian noise has the same form as the least squares objective

Alternatively, we can model the data (only the y_i's) as being generated from a distribution defined by exponentiating the negative of the objective function
28
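Spelling out the one-line computation behind this claim: taking the negative log of the likelihood above gives, up to terms that do not depend on w, the least squares objective scaled by 1/(2σ²).

```latex
-\log p(y \mid X, w)
  = \frac{N}{2}\log(2\pi\sigma^2)
  + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w \cdot x_i)^2
```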
What Data Model Produces the Ridge Objective?
We have the Ridge Regression Objective; let D = (x_i, y_i)_{i=1}^N denote the data

L_ridge(w; D) = (y − Xw)^T (y − Xw) + λ w^T w

Let's rewrite this objective slightly, scaling by 1/(2σ^2) and setting λ = σ^2/τ^2. To avoid ambiguity, we'll denote the rescaled objective by L̃:

L̃_ridge(w; D) = (1/(2σ^2)) (y − Xw)^T (y − Xw) + (1/(2τ^2)) w^T w

Let Σ = σ^2 I_N and Λ = τ^2 I_D, where I_m denotes the m × m identity matrix

L̃_ridge(w; D) = (1/2) (y − Xw)^T Σ^{-1} (y − Xw) + (1/2) w^T Λ^{-1} w

Taking the negation of L̃_ridge(w; D) and exponentiating gives us a non-negative function of w and D, which after normalisation gives a density function

f(w; D) = exp(−(1/2) (y − Xw)^T Σ^{-1} (y − Xw)) · exp(−(1/2) w^T Λ^{-1} w)

29
Bayesian Linear Regression (and connections to Ridge)
Let's start with the form of the density function we had on the previous slide and factor it:

f(w; D) = exp(−(1/2) (y − Xw)^T Σ^{-1} (y − Xw)) · exp(−(1/2) w^T Λ^{-1} w)

We'll treat σ as fixed and not treat it as a parameter. Up to a constant factor (which doesn't matter when optimising w.r.t. w), we can rewrite this as

p(w | X, y)  ∝  N(y | Xw, Σ) · N(w | 0, Λ)
 (posterior)     (likelihood)     (prior)

where N(· | µ, Σ) denotes the density of the multivariate normal distribution with mean µ and covariance matrix Σ

◮ What the ridge objective is actually finding is the maximum a posteriori (MAP) estimate, which is a mode of the posterior distribution
◮ The linear model is as described before, with Gaussian noise
◮ The prior distribution on w is assumed to be a spherical Gaussian
30
Bayesian Machine Learning
In the discriminative framework, we model the output y as a probability distribution given the input x and the parameters w, say p(y | w, x)

In the Bayesian view, we assume a prior on the parameters w, say p(w)

This prior represents a "belief" about the model; the uncertainty in our knowledge is expressed mathematically as a probability distribution

When observations D = (x_i, y_i)_{i=1}^N are made, the belief about the parameters w is updated using Bayes' rule

Bayes Rule: For events A, B,  Pr[A | B] = Pr[B | A] · Pr[A] / Pr[B]

The posterior distribution on w given the data D becomes: p(w | D) ∝ p(y | w, X) · p(w)
31
Coin Toss Example
Let us consider the Bernoulli model for a coin toss, for θ ∈ [0, 1]: p(H | θ) = θ

Suppose after three independent coin tosses, you get T, T, T.

What is the maximum likelihood estimate for θ?

What is the posterior distribution over θ, assuming a uniform prior on θ?
32
Coin Toss Example
Let us consider the Bernoulli model for a coin toss, for θ ∈ [0, 1]: p(H | θ) = θ

Suppose after three independent coin tosses, you get T, T, T.

What is the maximum likelihood estimate for θ?

What is the posterior distribution over θ, assuming a Beta(2, 2) prior on θ?
32
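For reference, both questions are answered by the standard Beta-Bernoulli conjugacy (a reminder rather than part of the slides): with h heads in n tosses and a Beta(a, b) prior,

```latex
p(\theta \mid \text{data}) \;\propto\; \theta^{h}(1-\theta)^{n-h}\cdot\theta^{a-1}(1-\theta)^{b-1}
\;\;\Rightarrow\;\; \theta \mid \text{data} \sim \mathrm{Beta}(a+h,\; b+n-h)
```

Here n = 3 and h = 0, so the MLE is h/n = 0; a uniform prior (Beta(1, 1)) gives posterior Beta(1, 4), and a Beta(2, 2) prior gives Beta(2, 5).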
Full Bayesian Prediction
Let us recall the posterior distribution over parameters w in the Bayesian approach:

p(w | X, y)  ∝  p(y | X, w) · p(w)
 (posterior)    (likelihood)   (prior)

◮ If we use the MAP estimate, as we get more samples the posterior peaks at the MLE
◮ When data is scarce, rather than picking a single estimator (like MAP) we can sample from the full posterior

For x_new, we can output the entire distribution over our prediction y as

p(y | D) = ∫ p(y | w, x_new) · p(w | D) dw
               (model)          (posterior)

This integration is often computationally very hard!
33
Full Bayesian Approach for Linear Regression
For the linear model with Gaussian noise and a Gaussian prior on w, the full Bayesian predictive distribution for a new point x_new can be expressed in closed form:

p(y | D, x_new, σ^2) = N(w_map^T x_new, (σ(x_new))^2)
See Murphy Sec 7.6 for calculations
34
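A sketch of the underlying closed-form posterior and predictive for the linear-Gaussian model with prior w ∼ N(0, τ²I) and noise variance σ² (standard formulas, see Murphy Sec 7.6); the numerical values of sigma2 and tau2 below are illustrative.

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(m, V) over w for prior N(0, tau2 * I) and noise variance sigma2."""
    D = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(D) / tau2    # posterior precision
    V = np.linalg.inv(A)                       # posterior covariance
    m = V @ X.T @ y / sigma2                   # posterior mean (= MAP = ridge with lam = sigma2/tau2)
    return m, V

def predictive(x_new, m, V, sigma2):
    """Posterior predictive mean and variance of y at a single input x_new."""
    return m @ x_new, x_new @ V @ x_new + sigma2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.standard_normal(50)

m, V = bayes_linreg_posterior(X, y, sigma2=0.04, tau2=1.0)
print(predictive(np.array([1.0, 0.0, 0.0]), m, V, sigma2=0.04))
```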
Summary : Bayesian Machine Learning
In the Bayesian view, in addition to modelling the output y as a random variable given the parameters w and input x, we also encode prior belief about the parameters w as a probability distribution p(w).

◮ If the prior has a parametric form, its parameters are called hyperparameters
◮ The posterior over the parameters w is updated given data
◮ Either pick point (plugin) estimates, e.g., maximum a posteriori
◮ Or, as in the full Bayesian approach, use the entire posterior to make predictions (this is often computationally intractable)
◮ How do we choose the prior?
35
Outline
Basis Function Expansion
Overfitting and the Bias-Variance Tradeoff
Ridge Regression and Lasso
Bayesian Approach to Machine Learning
Model Selection
How to Choose Hyper-parameters?
◮ So far, we were just trying to estimate the parameters w
◮ For Ridge Regression or Lasso, we need to choose λ
◮ If we perform basis expansion:
  ◮ For kernels, we need to pick the width parameter γ
  ◮ For polynomials, we need to pick the degree d
◮ For more complex models there may be more hyperparameters
36
Using a Validation Set
◮ Divide the data into parts: training, validation (and testing)
◮ Grid Search: choose values for the hyperparameters from a finite set
◮ Train the model using the training set and evaluate on the validation set

λ       training error (%)   validation error (%)
0.01            –                    89
0.1             –                    43
1               2                    12
10             10                     8
100            25                    27

◮ Pick the value of λ that minimises the validation error
◮ Typically, split the data as 80% for training, 20% for validation
37
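A sketch of this grid-search procedure with ridge regression and an 80/20 split, assuming numpy; the grid of λ values and the synthetic data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, 0] + 0.2 * rng.standard_normal(200)

n_train = int(0.8 * len(X))                    # 80% training, 20% validation
X_tr, y_tr, X_val, y_val = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)) for lam in grid}
best_lam = min(scores, key=scores.get)         # pick the lambda with lowest validation error
print(scores, "-> best lambda:", best_lam)
```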
Training and Validation Curves
◮ Plot of training and validation error vs λ for Lasso
◮ Validation error curve is U-shaped
38
K-Fold Cross Validation
When data is scarce, instead of splitting as training and validation:
◮ Divide data into K parts
◮ Use K − 1 parts for training and 1 part as validation
◮ Commonly set K = 5 or K = 10
◮ When K = N (the number of datapoints), it is called LOOCV (leave-one-out cross validation)
Run 1:  valid  train  train  train  train
Run 2:  train  valid  train  train  train
Run 3:  train  train  valid  train  train
Run 4:  train  train  train  valid  train
Run 5:  train  train  train  train  valid
39
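A sketch of K-fold cross-validation around the ridge estimator, assuming numpy; K = 5, the λ grid and the data are illustrative.

```python
import numpy as np

def k_fold_cv_error(X, y, lam, K=5, seed=0):
    """Average validation MSE of ridge regression over K folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                            X[train].T @ y[train])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
y = X[:, 0] + 0.2 * rng.standard_normal(100)

for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, k_fold_cv_error(X, y, lam))
```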
Overfitting on the Validation Set
Suppose you do all the right things
◮ Train on the training set
◮ Choose hyperparameters using proper validation
◮ Test on the test set (real world), and your error is unacceptably high!
What would you do?
40
Winning Kaggle without reading the data!
Suppose the task is to predict N binary labels

Algorithm (Wacky Boosting):
1. Choose y^1, . . . , y^k ∈ {0, 1}^N randomly
2. Set I = {i | accuracy(y^i) > 51%}
3. Output ŷ_j = majority{y^i_j | i ∈ I}

Source: blog.mrtz.org
41
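A small simulation of this attack, assuming numpy and following the algorithm as stated above: random label vectors that happen to score above 51% on the public leaderboard split are aggregated by a coordinate-wise majority vote, which overfits the leaderboard while doing nothing on the rest of the data. All sizes and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 2000, 500
true = rng.integers(0, 2, N)                    # hidden test labels
public = rng.permutation(N)[:N // 2]            # the "public leaderboard" half
private = np.setdiff1d(np.arange(N), public)

def leaderboard_acc(y_hat):
    return np.mean(y_hat[public] == true[public])

candidates = rng.integers(0, 2, (k, N))          # step 1: random guesses
I = [i for i in range(k) if leaderboard_acc(candidates[i]) > 0.51]   # step 2
y_out = (candidates[I].mean(axis=0) > 0.5).astype(int)               # step 3: majority vote

print("public  accuracy:", leaderboard_acc(y_out))                   # well above 50%
print("private accuracy:", np.mean(y_out[private] == true[private])) # about 50%
```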
Next Time
◮ Optimization Algorithms
◮ Read up on gradients, multivariate calculus
42