  1. Machine Learning - MT 2016, Lectures 4 & 5: Basis Expansion, Regularization, Validation. Varun Kanade, University of Oxford, October 19 & 24, 2016

  2. Outline ◮ Basis function expansion to capture non-linear relationships ◮ Understanding the bias-variance tradeoff ◮ Overfitting and Regularization ◮ Bayesian View of Machine Learning ◮ Cross-validation to perform model selection

  3. Outline ◮ Basis Function Expansion ◮ Overfitting and the Bias-Variance Tradeoff ◮ Ridge Regression and Lasso ◮ Bayesian Approach to Machine Learning ◮ Model Selection

  4–5. Linear Regression: Polynomial Basis Expansion (figure slides)

  6–7. Linear Regression: Polynomial Basis Expansion. Feature map $\phi(x) = [1, x, x^2]$, so that $w_0 + w_1 x + w_2 x^2 = \phi(x) \cdot [w_0, w_1, w_2]$ (figure slides)

  8. Linear Regression: Polynomial Basis Expansion. Feature map $\phi(x) = [1, x, x^2, \dots, x^d]$. Model: $y = w^T \phi(x) + \epsilon$. Here $w \in \mathbb{R}^M$, where $M$ is the number of expanded features
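
A minimal NumPy sketch of this model, not taken from the lecture: the synthetic data, the degree, and the helper name poly_features are illustrative assumptions.

```python
import numpy as np

def poly_features(x, d):
    """Map a 1-D input array to the expanded features [1, x, x^2, ..., x^d]."""
    return np.vstack([x**k for k in range(d + 1)]).T

# Illustrative synthetic data (not from the lecture): a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

d = 3                                          # degree of the expansion (assumed)
Phi = poly_features(x, d)                      # N x (d+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least-squares fit of y ~ Phi w

x_test = np.linspace(-1, 1, 5)
y_hat = poly_features(x_test, d) @ w           # predictions w^T phi(x)
```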

  9. Linear Regression: Polynomial Basis Expansion. Getting more data can avoid overfitting!

  10. Polynomial Basis Expansion in Higher Dimensions. Basis expansion can be performed in higher dimensions; we're still fitting linear models, but using more features: $y = w \cdot \phi(x) + \epsilon$. Linear model: $\phi(x) = [1, x_1, x_2]$. Quadratic model: $\phi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$. Using degree-$d$ polynomials in $D$ dimensions results in $\approx D^d$ features!
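
To make the feature blow-up concrete, here is a hedged sketch that enumerates all monomials up to total degree $d$ with itertools; the function name poly_features_nd and the random data are illustrative, not from the slides.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features_nd(X, d):
    """All monomials of the columns of X up to total degree d (including the constant 1)."""
    N, D = X.shape
    cols = [np.ones(N)]                              # degree-0 term
    for degree in range(1, d + 1):
        for idx in combinations_with_replacement(range(D), degree):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

X = np.random.default_rng(0).standard_normal((5, 2))   # D = 2 inputs (illustrative)
Phi = poly_features_nd(X, 2)
# Columns: 1, x1, x2, x1^2, x1*x2, x2^2 -- the quadratic feature map above, up to ordering.
print(Phi.shape)   # (5, 6)
```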

  11. Basis Expansion Using Kernels. We can use kernels as features. A Radial Basis Function (RBF) kernel with width parameter $\gamma$ is defined as $\kappa(x', x) = \exp(-\gamma \|x - x'\|^2)$. Choose centres $\mu_1, \mu_2, \dots, \mu_M$. Feature map: $\phi(x) = [1, \kappa(\mu_1, x), \dots, \kappa(\mu_M, x)]$, so $y = w_0 + w_1 \kappa(\mu_1, x) + \cdots + w_M \kappa(\mu_M, x) + \epsilon = w \cdot \phi(x) + \epsilon$. How do we choose the centres?
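
A possible implementation sketch of this feature map, fitting the weights by ordinary least squares as in the earlier slides; rbf_features, the synthetic data, and the value of gamma are illustrative assumptions.

```python
import numpy as np

def rbf_features(X, centres, gamma):
    """Feature map phi(x) = [1, k(mu_1, x), ..., k(mu_M, x)] with an RBF kernel."""
    # Squared distances between every input row and every centre.
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq_dists)                     # N x M kernel features
    return np.hstack([np.ones((X.shape[0], 1)), K])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))                  # illustrative 1-D data
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)

centres = X.copy()     # data points themselves as centres (one reasonable choice)
gamma = 10.0           # width parameter: too large -> overfit, too small -> underfit
Phi = rbf_features(X, centres, gamma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
```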

  12. Basis Expansion Using Kernels. One reasonable choice is to use the data points themselves as centres for the kernels. We still need to choose the width parameter $\gamma$ of the RBF kernel $\kappa(x, x') = \exp(-\gamma \|x - x'\|^2)$. As with the choice of degree in polynomial basis expansion, overfitting or underfitting may occur depending on the width of the kernel: ◮ Overfitting occurs if the width is too small, i.e., $\gamma$ very large ◮ Underfitting occurs if the width is too large, i.e., $\gamma$ very small

  13. When the kernel width is too large (figure)

  14. When the kernel width is too small (figure)

  15. When the kernel width is chosen suitably (figure)

  16. Big Data: When the kernel width is too large (figure)

  17. Big Data: When the kernel width is too small (figure)

  18. Big Data: When the kernel width is chosen suitably (figure)

  19. Basis Expansion Using Kernels ◮ Overfitting occurs if the kernel width is too small, i.e., $\gamma$ very large ◮ Having more data can help reduce overfitting! ◮ Underfitting occurs if the width is too large, i.e., $\gamma$ very small ◮ Extra data does not help at all in this case! ◮ When the data lies in a high-dimensional space we may encounter the curse of dimensionality ◮ If the width is too large then we may underfit ◮ We might need an exponentially large (in the dimension) sample to use modest-width kernels ◮ Connection to Problem 1 on Sheet 1

  20. Outline ◮ Basis Function Expansion ◮ Overfitting and the Bias-Variance Tradeoff ◮ Ridge Regression and Lasso ◮ Bayesian Approach to Machine Learning ◮ Model Selection

  21–26. The Bias-Variance Tradeoff (figure slides contrasting high-bias and high-variance fits)

  27. The Bias-Variance Tradeoff ◮ Having high bias means that we are underfitting ◮ Having high variance means that we are overfitting ◮ The terms bias and variance in this context are precisely defined statistical notions ◮ See Problem Sheet 2, Q3 for precise calculations in one particular context ◮ See Secs. 7.1–7.3 of the HTF book for a much more detailed description

  28. Learning Curves. Suppose we've trained a model and used it to make predictions, but in reality the predictions are often poor. ◮ How can we know whether we have high bias (underfitting), high variance (overfitting), or neither? ◮ Should we add more features (higher-degree polynomials, lower-width kernels, etc.) to make the model more expressive? ◮ Should we simplify the model (lower-degree polynomials, larger-width kernels, etc.) to reduce the number of parameters? ◮ Should we try to obtain more data? ◮ Often there is a computational and monetary cost to using more data

  29. Learning Curves. Split the data into a training set and a test set. Train on increasing sizes of data, then plot the training error and test error as a function of training-set size. (Figures: one case where more data is not useful, one where more data would be useful.)
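
A rough sketch of how such learning curves could be computed; the synthetic dataset, the fixed cubic feature map, and the subset sizes are illustrative choices, not from the lecture.

```python
import numpy as np

def mse(w, Phi, y):
    return float(np.mean((Phi @ w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(200)
Phi = np.hstack([X ** k for k in range(4)])       # cubic features [1, x, x^2, x^3]

# Hold out a fixed test set; train on growing prefixes of the training set.
Phi_tr, y_tr, Phi_te, y_te = Phi[:150], y[:150], Phi[150:], y[150:]
for n in (10, 25, 50, 100, 150):
    w, *_ = np.linalg.lstsq(Phi_tr[:n], y_tr[:n], rcond=None)
    print(n, mse(w, Phi_tr[:n], y_tr[:n]), mse(w, Phi_te, y_te))

# Reading the curves: a persistent gap between training and test error suggests
# high variance (more data helps); two high, flat curves suggest high bias.
```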

  30. Overfitting: How does it occur? When dealing with high-dimensional data (which may be caused by basis expansion), even for a linear model we have many parameters. With $D = 100$ input variables and degree-10 polynomial basis expansion we have $\sim 10^{20}$ parameters! Enrico Fermi to Freeman Dyson: “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” [video] How can we prevent overfitting?

  31. Overfitting: How does it occur? Suppose we have $D = 100$ and $N = 100$, so that $X$ is $100 \times 100$. Suppose every entry of $X$ is drawn from $\mathcal{N}(0, 1)$, and let $y_i = x_{i,1} + \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.2$
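
A sketch reproducing this setup, under the assumption that the slide's point is that unregularised least squares interpolates the noise when $D = N$; the random seed and the printed diagnostics are my own additions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma = 100, 100, 0.2
X = rng.standard_normal((N, D))                     # every entry ~ N(0, 1)
y = X[:, 0] + sigma * rng.standard_normal(N)        # y_i = x_{i,1} + noise

w, *_ = np.linalg.lstsq(X, y, rcond=None)           # ordinary least squares
print("training MSE:", np.mean((X @ w - y) ** 2))   # essentially 0: noise fit exactly
print("w[0]:", w[0])                                # the one truly relevant weight
print("largest irrelevant |w_i|:", np.abs(w[1:]).max())

# Fresh data from the same model exposes the overfitting.
X_new = rng.standard_normal((N, D))
y_new = X_new[:, 0] + sigma * rng.standard_normal(N)
print("test MSE:", np.mean((X_new @ w - y_new) ** 2))
```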

  32. Outline ◮ Basis Function Expansion ◮ Overfitting and the Bias-Variance Tradeoff ◮ Ridge Regression and Lasso ◮ Bayesian Approach to Machine Learning ◮ Model Selection

  33. Ridge Regression. Suppose we have data $\langle (x_i, y_i) \rangle_{i=1}^N$, where $x \in \mathbb{R}^D$ with $D \gg N$. One idea to avoid overfitting is to add a penalty term for the weights. Least squares objective: $L(w) = (Xw - y)^T (Xw - y)$. Ridge regression objective: $L_{\mathrm{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda \sum_{i=1}^D w_i^2$

  34. Ridge Regression. We add a penalty term for the weights to control model complexity. We should not penalise the constant term $w_0$ for being large

  35. Ridge Regression. Should translating and scaling inputs contribute to model complexity? Suppose $\hat{y} = w_0 + w_1 x$, where $x$ is temperature in °C and $x'$ is temperature in °F. Then $\hat{y} = \left( w_0 - \frac{160}{9} w_1 \right) + \frac{5}{9} w_1 x'$. In one case the “model complexity” is $w_1^2$; in the other it is $\frac{25}{81} w_1^2 < \frac{1}{3} w_1^2$. We should try to avoid dependence on scaling and translation of the variables

  36. Ridge Regression. Before optimising the ridge objective, it's a good idea to standardise all inputs (mean 0 and variance 1). If, in addition, we centre the outputs, i.e., the outputs have mean 0, then the constant term is unnecessary (Exercise on Sheet 2). Then find $w$ that minimises the objective function $L_{\mathrm{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w$

  37. Deriving the Estimate for Ridge Regression. Suppose the data $\langle (x_i, y_i) \rangle_{i=1}^N$ has standardised inputs and centred output. We want to derive an expression for the $w$ that minimises $L_{\mathrm{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w = w^T X^T X w - 2 y^T X w + y^T y + \lambda w^T w$. Taking the gradient of the objective with respect to $w$: $\nabla_w L_{\mathrm{ridge}} = 2 (X^T X) w - 2 X^T y + 2 \lambda w = 2 \left( (X^T X + \lambda I_D) w - X^T y \right)$. Setting the gradient to 0 and solving for $w$: $(X^T X + \lambda I_D) w = X^T y$, so $w_{\mathrm{ridge}} = (X^T X + \lambda I_D)^{-1} X^T y$
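
A sketch of this closed-form estimate in NumPy, with the standardisation and centring of the previous slide folded in; the names ridge_fit and ridge_predict, the value of lambda, and the reuse of the $D = N = 100$ example from slide 31 are illustrative assumptions. Solving the linear system with np.linalg.solve avoids forming the explicit inverse.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate w = (X^T X + lam * I)^{-1} X^T y,
    computed after standardising the inputs and centring the output."""
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - X_mean) / X_std                   # columns with mean 0, variance 1
    y_mean = y.mean()
    yc = y - y_mean                             # centred output: no intercept needed
    D = Xs.shape[1]
    w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(D), Xs.T @ yc)
    return w, (X_mean, X_std, y_mean)

def ridge_predict(X, w, stats):
    X_mean, X_std, y_mean = stats
    return ((X - X_mean) / X_std) @ w + y_mean

# Illustrative data, as in the D = N = 100 example from slide 31.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
y = X[:, 0] + 0.2 * rng.standard_normal(100)
w, stats = ridge_fit(X, y, lam=10.0)
y_hat = ridge_predict(X, w, stats)
```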

  38–45. Ridge Regression (figure slides comparing two formulations): minimise $(Xw - y)^T (Xw - y)$ subject to $w^T w \le R$, versus minimise $(Xw - y)^T (Xw - y) + \lambda w^T w$

  46. Ridge Regression. As we decrease $\lambda$, the magnitudes of the weights start increasing
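
A short sketch tracing this behaviour on the same illustrative $D = N = 100$ data: as $\lambda$ shrinks towards 0, the norm of the ridge solution grows towards the unregularised least-squares fit. The grid of $\lambda$ values is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))                 # illustrative data, as in slide 31
y = X[:, 0] + 0.2 * rng.standard_normal(100)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)           # standardised inputs
yc = y - y.mean()                                   # centred output

# Regularisation path: as lambda decreases, the weight magnitudes grow.
for lam in (100.0, 10.0, 1.0, 0.1, 0.01):
    w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)
    print(f"lambda = {lam:6.2f}   ||w||_2 = {np.linalg.norm(w):8.3f}")
```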

  47. Summary: Ridge Regression. In ridge regression, in addition to the residual sum of squares we penalise the sum of squares of the weights. Ridge regression objective: $L_{\mathrm{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w$. This is also called $\ell_2$-regularization or weight decay. Penalising weights “encourages fitting signal rather than just noise”
