
Regression and Generalization (CE-717: Machine Learning, Sharif University of Technology) - PowerPoint PPT Presentation



  1. Regression and generalization. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2018

  2. Topics
  - Beyond linear regression models
  - Evaluation & model selection
  - Regularization
  - Bias-Variance

  3. Recall: Linear regression (squared loss)
  - Linear regression functions:
    $f: \mathbb{R} \to \mathbb{R}$, $f(x; \mathbf{w}) = w_0 + w_1 x$
    $f: \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
  - $\mathbf{w} = (w_0, w_1, \dots, w_d)$: the parameters we need to set.
  - Minimizing the squared loss for linear regression: $J(\mathbf{w}) = \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2$
  - We obtain $\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
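The closed-form solution on this slide can be sketched in a few lines of NumPy. This is a minimal example with hypothetical, noise-free data generated by $y = 2 + 3x$, so the normal equations recover the parameters exactly:

```python
import numpy as np

# Hypothetical noise-free 1-D data generated by y = 2 + 3x,
# so the least-squares fit should recover w = [2, 3] exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Design matrix X: a column of ones (for the intercept w0) next to x.
X = np.column_stack([np.ones_like(x), x])

# Normal-equation solution w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # [2. 3.]
```

Solving the linear system with `np.linalg.solve` is preferred over explicitly inverting $\mathbf{X}^\top \mathbf{X}$, which is slower and numerically less stable.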

  4. Beyond linear regression
  - How to extend linear regression to non-linear functions?
    - Transform the data using basis functions
    - Learn a linear regression on the new feature vectors (obtained by the basis functions)

  5. Beyond linear regression
  - $m$-th order polynomial regression (univariate $f: \mathbb{R} \to \mathbb{R}$):
    $f(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$
  - Solution: $\hat{\mathbf{w}} = (\mathbf{X}'^\top \mathbf{X}')^{-1} \mathbf{X}'^\top \mathbf{y}$,
    where $\mathbf{X}'$ has rows $[1, x^{(i)}, (x^{(i)})^2, \dots, (x^{(i)})^m]$ for $i = 1, \dots, n$,
    $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^\top$, and $\mathbf{w} = [w_0, \dots, w_m]^\top$.
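The matrix $\mathbf{X}'$ above is a Vandermonde matrix, which NumPy can build directly. A small sketch with hypothetical data drawn from an exact cubic, so the degree-3 least-squares fit recovers the generating coefficients:

```python
import numpy as np

def poly_design(x, m):
    """Design matrix X' with rows [1, x, x^2, ..., x^m]."""
    return np.vander(x, m + 1, increasing=True)

# Hypothetical data from an exact cubic, so a degree-3 fit recovers it.
x = np.linspace(-1.0, 1.0, 8)
y = 1.0 - 2.0 * x + 0.5 * x**3

Xp = poly_design(x, 3)
# Least-squares solution of X' w = y (equivalent to the normal equations).
w_hat, *_ = np.linalg.lstsq(Xp, y, rcond=None)
print(np.round(w_hat, 6))  # ~ [1, -2, 0, 0.5]
```

`np.linalg.lstsq` solves the least-squares problem via an SVD, which behaves better than forming $\mathbf{X}'^\top \mathbf{X}'$ when the polynomial order grows and the columns become nearly dependent.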

  6. Polynomial regression: example
  - [Figure: fits of order $m = 1, 3, 5, 7$]

  7. Generalized linear regression
  - Linear combination of fixed non-linear functions of the input vector:
    $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$
  - $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_j: \mathbb{R}^d \to \mathbb{R}$

  8. Basis functions: examples
  - Linear
  - Polynomial (univariate)

  9. Basis functions: examples
  - Gaussian: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right)$
  - Sigmoid: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{\sigma_j}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

  10. Radial basis functions: prototypes
  - Predictions based on similarity to "prototypes": $\phi_j(\mathbf{x}) = \exp\left(-\frac{1}{2\sigma_j^2} \lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2\right)$
  - Measures the similarity to the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$
  - $\sigma^2$ controls how quickly the feature vanishes as a function of the distance to the prototype.
  - The training examples themselves could serve as prototypes.
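The RBF feature map can be sketched as follows; as the slide suggests, the (hypothetical) training examples themselves are used as prototypes, and a single shared bandwidth $\sigma$ is assumed for simplicity:

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)), one column per prototype."""
    # Pairwise squared distances between every input and every prototype.
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

# The training examples themselves serve as the prototypes here.
X = np.array([[0.0], [1.0], [2.0]])
Phi = rbf_features(X, X, sigma=1.0)
# Each point is maximally similar to itself: the diagonal is exp(0) = 1,
# and similarity decays with distance from the prototype.
print(Phi.shape)  # (3, 3)
```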

  11. Generalized linear regression: optimization
  - $J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2$
  - $\boldsymbol{\Phi}$: design matrix with rows $[1, \phi_1(\mathbf{x}^{(i)}), \dots, \phi_m(\mathbf{x}^{(i)})]$,
    $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^\top$, $\mathbf{w} = [w_0, \dots, w_m]^\top$
  - Solution: $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$
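The normal-equation solution with a design matrix $\boldsymbol{\Phi}$ can be sketched like this. The basis (a bias plus sine and cosine features) and the data are hypothetical, chosen so the target lies exactly in the span of the basis and the recovered weights are known:

```python
import numpy as np

def fit_glr(Phi, y):
    """Normal-equation solution w_hat = (Phi^T Phi)^{-1} Phi^T y."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Hypothetical basis: a bias plus two fixed non-linear functions of x.
x = np.linspace(0.0, 1.0, 10)
Phi = np.column_stack([np.ones_like(x),
                       np.sin(2 * np.pi * x),
                       np.cos(2 * np.pi * x)])
# Target expressible exactly in this basis: y = 0.5 + sin(2*pi*x),
# so the fit should return weights close to [0.5, 1, 0].
y = 0.5 + np.sin(2 * np.pi * x)

w_hat = fit_glr(Phi, y)
print(np.round(w_hat, 6))
```

Note that only $\boldsymbol{\Phi}$ changes relative to plain linear regression; the optimization is identical because the model is still linear in $\mathbf{w}$.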

  12. Model complexity and overfitting
  - With limited training data, a model may achieve zero training error but a large test error:
    - Training (empirical) loss: $\frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 \approx 0$
    - Expected (true) loss: $E_{\mathbf{x},y}\left[\left(y - f(\mathbf{x}; \mathbf{w})\right)^2\right] \gg 0$
  - Over-fitting: the training loss no longer bears any relation to the test (generalization) loss.
    - The model fails to generalize to unseen examples.

  13. Polynomial regression
  - [Figure: fits of order $m = 0, 1, 3, 9$] [Bishop]

  14. Polynomial regression: training and test error
  - $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2}$
  - [Figure: training and test RMSE vs. polynomial order $m$] [Bishop]

  15. Over-fitting causes
  - Model complexity
    - E.g., a model with a large number of parameters (degrees of freedom)
  - Low number of training data
    - Small data size compared to the complexity of the model

  16. Model complexity
  - Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
  - [Figure: fits of order $m = 0, 1, 3, 9$] [Bishop]

  17. Number of training data & overfitting
  - The over-fitting problem becomes less severe as the size of the training data increases.
  - [Figure: $m = 9$ fits with $n = 15$ vs. $n = 100$ training points] [Bishop]

  18. How to evaluate the learner's performance?
  - Generalization error: the true (or expected) error that we would like to optimize
  - Two ways to assess the generalization error:
    - Practical: use a separate data set to test the model
    - Theoretical: law of large numbers
      - Statistical bounds on the difference between the training and expected errors

  19. Avoiding over-fitting
  - Determine a suitable value for the model complexity (model selection)
    - Simple hold-out method
    - Cross-validation
  - Regularization (Occam's razor)
    - Explicit preference towards simple models
    - Penalize the model complexity in the objective function
  - Bayesian approach

  20. Evaluation and model selection
  - Evaluation: we need to measure how well the learned function predicts the target for unseen examples.
  - Model selection: most of the time we need to select among a set of models.
    - Example: polynomials with different degree $m$
    - We thus need to evaluate these models first.

  21. Model selection
  - The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters).
  - Hyperparameters are the tunable aspects of the model that the learning algorithm does not select.
  (This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/)

  22. Model selection
  - Model selection is the process by which we choose the "best" model from among a set of candidates.
    - Assumes access to a function capable of measuring the quality of a model
    - Typically done "outside" the main training algorithm
  - Model selection / hyperparameter optimization is just another form of learning.
  (This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/)

  23. Simple hold-out: model selection
  - Steps:
    - Divide the training data into a training set and a validation set
    - Use only the training set to train each candidate model
    - Evaluate each learned model on the validation set:
      $J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
    - Choose the model with the lowest validation-set error
  - Usually too wasteful of valuable training data:
    - Training data may be limited.
    - On the other hand, a small validation set gives a relatively noisy estimate of performance.
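The hold-out steps above can be sketched as follows. The dataset (a noisy cubic), the split sizes, and the candidate polynomial orders are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: a noisy cubic.
x = rng.uniform(-1.0, 1.0, 30)
y = x**3 - x + rng.normal(0.0, 0.05, size=x.shape)

# Divide the data into a training part and a held-out validation part.
idx = rng.permutation(len(x))
train, val = idx[:20], idx[20:]

def validation_error(m):
    """Train an m-th order polynomial on the training split,
    then return its mean squared error on the validation split."""
    w = np.polyfit(x[train], y[train], m)
    return np.mean((y[val] - np.polyval(w, x[val])) ** 2)

errors = {m: validation_error(m) for m in (1, 3, 9)}
best_m = min(errors, key=errors.get)  # model chosen by validation error
```

Only the training split ever touches `np.polyfit`; the validation split is reserved for computing $J_v$.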

  24. Simple hold-out: training, validation, and test sets
  - Simple hold-out chooses the model that minimizes the error on the validation set.
  - $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error:
    - An extra parameter (e.g., the degree of the polynomial) has been fit to this set.
  - Estimate the generalization error on the test set:
    - The performance of the selected model is finally evaluated on the test set.
  - [Figure: data split into training, validation, and test sets]

  25. Cross-validation (CV): evaluation
  - $k$-fold cross-validation steps:
    - Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
    - For $j = 1$ to $k$:
      - Choose the $j$-th group as the held-out validation group
      - Train the model on all but the $j$-th group of data
      - Evaluate the model on the held-out group
  - The performance scores of the model from the $k$ runs are averaged.
    - The average error can be considered an estimate of the true performance.
  - [Figure: the $k$ runs, each holding out a different fold]
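The $k$-fold steps above can be sketched directly. Using noise-free cubic data (a hypothetical choice) makes the expected result obvious: a degree-3 polynomial predicts every held-out fold essentially perfectly:

```python
import numpy as np

def k_fold_cv_error(x, y, m, k=5, seed=0):
    """Average held-out MSE of an m-th order polynomial over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))              # shuffle the dataset ...
    folds = np.array_split(idx, k)             # ... and partition into k groups
    errs = []
    for j in range(k):
        val = folds[j]                         # j-th group held out
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        w = np.polyfit(x[train], y[train], m)  # train on all but the j-th group
        errs.append(np.mean((y[val] - np.polyval(w, x[val])) ** 2))
    return np.mean(errs)                       # scores from the k runs, averaged

# Noise-free cubic data: a degree-3 polynomial generalizes perfectly.
x = np.linspace(-1.0, 1.0, 25)
y = x**3
err = k_fold_cv_error(x, y, m=3)
print(err)  # essentially zero
```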

  26. Cross-validation (CV): model selection
  - For each model, first find the average error by CV.
  - The model with the best average performance is selected.

  27. Cross-validation: polynomial regression example
  - 5-fold CV, averaged over 100 runs:
    - $m = 3$: CV MSE = 1.45
    - $m = 1$: CV MSE = 0.30
    - $m = 5$: CV MSE = 45.44
    - $m = 7$: CV MSE = 31759

  28. Leave-one-out cross-validation (LOOCV)
  - When data is particularly scarce, use cross-validation with $k = N$.
  - Leave-one-out treats each training sample in turn as a test example, with all other samples as the training set.
  - Use for small datasets, when training data is valuable.
  - LOOCV can be computationally expensive, as $N$ training steps are required.
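A minimal LOOCV sketch, again on hypothetical noise-free data so the outcome is predictable: since the data are exactly linear, the degree-1 fit predicts each held-out point perfectly.

```python
import numpy as np

def loocv_error(x, y, m):
    """Each sample is held out once, so N separate fits are required."""
    n = len(x)
    sq_errs = []
    for i in range(n):
        keep = np.arange(n) != i                  # all samples except the i-th
        w = np.polyfit(x[keep], y[keep], m)
        sq_errs.append((y[i] - np.polyval(w, x[i])) ** 2)
    return np.mean(sq_errs)

# Noise-free linear data: the degree-1 fit predicts each held-out point exactly.
x = np.linspace(-1.0, 1.0, 12)
y = 2.0 * x + 1.0
err = loocv_error(x, y, m=1)
print(err)  # essentially zero
```

The loop makes the $N$-training-steps cost visible: each iteration refits the model from scratch on $N - 1$ samples.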

  29. Regularization
  - Add a penalty term to the cost function to discourage the coefficients from reaching large values.
  - Ridge regression (weight decay):
    $J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$
    $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$
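The ridge closed form can be sketched as follows, with a hypothetical ill-posed setup (degree-9 polynomial features of a noisy sine) where the shrinkage effect of the penalty is easy to see:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution w_hat = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Hypothetical setup: degree-9 polynomial features of a noisy sine.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)
Phi = np.vander(x, 10, increasing=True)

w_small = ridge_fit(Phi, y, lam=1e-6)
w_large = ridge_fit(Phi, y, lam=1.0)
# The penalty shrinks the coefficients: larger lambda, smaller weights.
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # True
```

Adding $\lambda \mathbf{I}$ also makes $\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I}$ invertible even when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ alone is singular or badly conditioned.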

  30. Polynomial order
  - Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
  - The magnitude of the coefficients typically gets larger as $m$ increases.
  - [Table: coefficient magnitudes for increasing $m$] [Bishop]

  31. Regularization parameter
  - [Table: coefficients $\hat{w}_0, \dots, \hat{w}_9$ of the $m = 9$ fit for $\ln \lambda = -\infty$ (no regularization) vs. $\ln \lambda = -18$] [Bishop]

  32. Regularization parameter
  - Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.
  - [Figure: error vs. the regularization parameter] [Bishop]

  33. Choosing the regularization parameter
  - Consider a set of models with different values of $\lambda$:
    - Find $\hat{\mathbf{w}}$ for each model based on the training data
    - Find $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$) for each model:
      $J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
    - Select the model with the best $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$)
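The selection procedure above can be sketched as a sweep over candidate $\lambda$ values. The train/validation data (a noisy sine), the degree-9 feature map, and the candidate grid are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution for a given regularization strength."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Hypothetical train/validation split of a noisy sine.
x_tr = rng.uniform(0.0, 1.0, 15)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0.0, 0.1, 15)
x_va = rng.uniform(0.0, 1.0, 50)
y_va = np.sin(2 * np.pi * x_va) + rng.normal(0.0, 0.1, 50)

Phi_tr = np.vander(x_tr, 10, increasing=True)   # degree-9 polynomial features
Phi_va = np.vander(x_va, 10, increasing=True)

# One fitted model per candidate lambda; keep the one with the best J_v.
val_err = {}
for lam in (1e-8, 1e-4, 1e-2, 1.0, 100.0):
    w = ridge_fit(Phi_tr, y_tr, lam)
    val_err[lam] = np.mean((y_va - Phi_va @ w) ** 2)
best_lam = min(val_err, key=val_err.get)
```

The grid is usually taken on a log scale, since useful $\lambda$ values can span many orders of magnitude.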

  34. The approximation-generalization trade-off
  - A small true error shows good approximation of the target function $f$ out of sample.
  - More complex $\mathcal{H}$ ⇒ better chance of approximating $f$
  - Less complex $\mathcal{H}$ ⇒ better chance of generalizing out of sample
