Learning: Linear Methods


CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Spring 2019. Soleymani. Some slides are based on Klein and Abbeel, CS188, UC Berkeley.


1. Square error loss function for classification! ($K = 2$)
Squared error loss is not suitable for classification:
- Least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary).
- Least-squares loss also lacks robustness to noise.
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0 - y^{(i)}\right)^2$

2. Notation
- $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
- $\mathbf{x} = [1, x_1, \ldots, x_d]^T$
- $w_0 + w_1 x_1 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$
- We show the input by $\mathbf{x}$ or $\boldsymbol{\phi}(\mathbf{x})$.

3. SSE cost function for classification ($K = 2$)
Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$?
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)}\right)^2$, where $\operatorname{sign}(z) = \begin{cases} -1, & z < 0 \\ 1, & z \ge 0 \end{cases}$
$J(\mathbf{w})$ is a piecewise constant function and shows the number of misclassifications, i.e., the training error incurred in classifying the training samples.

4. Perceptron algorithm
- Linear classifier
- Two-class: $y \in \{-1, 1\}$; $y = -1$ for $C_2$, $y = 1$ for $C_1$
- Goal: $\forall i,\ \mathbf{x}^{(i)} \in C_1 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} > 0$ and $\forall i,\ \mathbf{x}^{(i)} \in C_2 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} < 0$
- $h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$

5. Perceptron criterion
$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T \mathbf{x}^{(i)} y^{(i)}$
$\mathcal{M}$: subset of the training data that are misclassified.
Many solutions? Which solution among them?

6. Cost functions
Figure [Duda, Hart, and Stork, 2002]: the number of misclassifications $J(\mathbf{w})$ and the perceptron criterion $J_P(\mathbf{w})$ plotted as functions of the weights. There may be many solutions in these cost functions.

7. Batch Perceptron
"Gradient descent" to solve the optimization problem:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^t)$, where $\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Batch Perceptron converges in a finite number of steps for linearly separable data:
Initialize $\mathbf{w}$
Repeat: $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Until convergence
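A possible NumPy rendering of this batch update (the data layout, variable names, and step size are my assumptions, not prescribed by the slides):

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, max_iters=1000):
    """Batch perceptron: X is (N, d+1) with a leading 1 for the bias,
    y is (N,) with labels in {-1, +1}."""
    w = np.zeros(X.shape[1])              # initialize w
    for _ in range(max_iters):
        scores = X @ w
        mis = y * scores <= 0             # misclassified samples (set M); boundary counted as error
        if not mis.any():                 # converged: no misclassifications
            break
        # w <- w + eta * sum_{i in M} x^(i) y^(i)
        w += eta * (X[mis] * y[mis, None]).sum(axis=0)
    return w
```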

8. Stochastic gradient descent for Perceptron
- Single-sample perceptron: if $\mathbf{x}^{(i)}$ is misclassified, $\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.
Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):
Initialize $\mathbf{w}$, $t \leftarrow 0$
Repeat: $t \leftarrow t + 1$; $i \leftarrow t \bmod N$; if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
Until all patterns are properly classified
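A minimal sketch of the fixed-increment, single-sample variant (the epoch cap and array conventions are my own; labels are assumed to be in {−1, +1} with a leading 1 in each input for the bias):

```python
import numpy as np

def fixed_increment_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1).
    X is (N, d+1) with a leading 1 for the bias, y has labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for i in range(len(X)):           # i <- t mod N: cycle through samples
            if y[i] * (w @ X[i]) <= 0:    # x^(i) is misclassified
                w += X[i] * y[i]          # w <- w + x^(i) y^(i)
                errors += 1
        if errors == 0:                   # all patterns properly classified
            return w
    return w
```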

9. Weight Updates

10. Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., $y = y^*$), no change!
  - If wrong: adjust the weight vector: $\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$

11. Example

12. Perceptron: Example
Change $\mathbf{w}$ in a direction that corrects the error. [Bishop]

13. Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., $y = y^*$), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if $y^*$ is $-1$.

14. Examples: Perceptron
- Separable case

15. Convergence of Perceptron [Duda, Hart & Stork, 2002]
For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge.

16. Multiclass Decision Rule
- If we have multiple classes:
  - A weight vector for each class: $\mathbf{w}_y$
  - Score (activation) of a class $y$: $\mathbf{w}_y^T \mathbf{x}$
  - Prediction: the highest score wins, $\hat{y} = \arg\max_y \mathbf{w}_y^T \mathbf{x}$
- Binary = multiclass where the negative class has weight zero

17. Learning: Multiclass Perceptron
- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights
- If correct, no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer

18. Example: Multiclass Perceptron
Training sentences: "win the vote", "win the election", "win the game".
Initial weight vectors, one per class (all entries zero except the BIAS of the first class):

  Feature | Class 1 | Class 2 | Class 3
  BIAS    |   1     |   0     |   0
  win     |   0     |   0     |   0
  game    |   0     |   0     |   0
  vote    |   0     |   0     |   0
  the     |   0     |   0     |   0
  ...     |   ...   |   ...   |   ...
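As a sketch of how the update on the previous slide plays out on word-count features like these, here is a small dictionary-based implementation; the class names are placeholders, since the slide does not name the classes:

```python
from collections import defaultdict

def features(sentence):
    """Bag-of-words features with a BIAS term, matching the table above."""
    f = defaultdict(float)
    f["BIAS"] = 1.0
    for word in sentence.split():
        f[word] += 1.0
    return f

def perceptron_step(weights, x_feats, y_true):
    """One multiclass perceptron update: predict with current weights; if wrong,
    lower the score of the predicted (wrong) class and raise the true class."""
    scores = {c: sum(w.get(k, 0.0) * v for k, v in x_feats.items())
              for c, w in weights.items()}
    y_pred = max(scores, key=scores.get)
    if y_pred != y_true:
        for k, v in x_feats.items():
            weights[y_true][k] = weights[y_true].get(k, 0.0) + v  # raise right answer
            weights[y_pred][k] = weights[y_pred].get(k, 0.0) - v  # lower wrong answer
    return y_pred

# Hypothetical usage; "class_2" as the true label is only an illustration.
weights = {"class_1": {}, "class_2": {}, "class_3": {}}
perceptron_step(weights, features("win the vote"), "class_2")
```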

19. Properties of Perceptrons
- Separability: true if some parameters get the training set perfectly classified (separable vs. non-separable cases shown in the figure)
- Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability

20. Examples: Perceptron
- Non-separable case

21. Examples: Perceptron
- Non-separable case

22. Discriminative approach: logistic regression ($K = 2$)
$h(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$, with $\mathbf{x} = [1, x_1, \ldots, x_d]^T$ and $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
$\sigma(\cdot)$ is an activation function: the sigmoid (logistic) function $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

23. Logistic regression: cost function
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
$J(\mathbf{w}) = \sum_{i=1}^{N} \left[ -y^{(i)} \log \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) - (1 - y^{(i)}) \log\left(1 - \sigma(\mathbf{w}^T \mathbf{x}^{(i)})\right) \right]$
$J(\mathbf{w})$ is convex w.r.t. the parameters.

24. Logistic regression: loss function
$\operatorname{Loss}(y, f(\mathbf{x}; \mathbf{w})) = -y \log \sigma(\mathbf{x}; \mathbf{w}) - (1 - y) \log(1 - \sigma(\mathbf{x}; \mathbf{w}))$
Since $y = 1$ or $y = 0$:
$\operatorname{Loss}(y, \sigma(\mathbf{x}; \mathbf{w})) = \begin{cases} -\log \sigma(\mathbf{x}; \mathbf{w}) & \text{if } y = 1 \\ -\log(1 - \sigma(\mathbf{x}; \mathbf{w})) & \text{if } y = 0 \end{cases}$
How is it related to the zero-one loss? $\operatorname{Loss}(y, \hat{y}) = \begin{cases} 1 & \hat{y} \ne y \\ 0 & \hat{y} = y \end{cases}$
where $\sigma(\mathbf{x}; \mathbf{w}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$

25. Logistic regression: gradient descent
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{N} \left( \sigma(\mathbf{x}^{(i)}; \mathbf{w}) - y^{(i)} \right) \mathbf{x}^{(i)}$
Is it similar to the gradient of SSE for linear regression?
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{N} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$
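A hedged NumPy sketch of this gradient-descent loop (the step size and iteration count are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X is (N, d+1) with a leading 1, y is (N,) with labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)        # sigma(x^(i); w) for all i
        grad = X.T @ (p - y)      # sum_i (sigma(x^(i); w) - y^(i)) x^(i)
        w -= eta * grad
    return w
```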

26. Multi-class logistic regression
- $h(\mathbf{x}; \mathbf{W}) = [h_1(\mathbf{x}; \mathbf{W}), \ldots, h_K(\mathbf{x}; \mathbf{W})]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- $h_k(\mathbf{x}; \mathbf{W}) = \dfrac{\exp(\mathbf{w}_k^T \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})}$

27. Logistic regression: multi-class
$\hat{\mathbf{W}} = \arg\min_{\mathbf{W}} J(\mathbf{W})$
$J(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log h_k(\mathbf{x}^{(i)}; \mathbf{W})$
$\mathbf{y}$ is a vector of length $K$ (1-of-K coding), e.g., $\mathbf{y} = [0, 0, 1, 0]^T$ when the target class is $C_3$.
$\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$

28. Logistic regression: multi-class
$\mathbf{w}_j^{t+1} = \mathbf{w}_j^t - \eta \nabla_{\mathbf{w}_j} J(\mathbf{W}^t)$
$\nabla_{\mathbf{w}_j} J(\mathbf{W}) = \sum_{i=1}^{N} \left( h_j(\mathbf{x}^{(i)}; \mathbf{W}) - y_j^{(i)} \right) \mathbf{x}^{(i)}$
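The same loop for the multi-class (softmax) case might look like the sketch below; the max-shift inside the softmax is a standard numerical-stability trick and is my addition, not something stated on the slide:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax of a score matrix S (N, K), shifted by the row max for stability."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, Y, eta=0.1, n_iters=1000):
    """X is (N, d+1) with a leading 1; Y is (N, K) in 1-of-K coding."""
    N, d = X.shape
    K = Y.shape[1]
    W = np.zeros((d, K))          # one weight vector per class (columns)
    for _ in range(n_iters):
        H = softmax(X @ W)        # h_j(x^(i); W)
        grad = X.T @ (H - Y)      # sum_i (h_j(x^(i); W) - y_j^(i)) x^(i), per class j
        W -= eta * grad
    return W
```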

29. Multi-class classifier
- $h(\mathbf{x}; \mathbf{W}) = [h_1(\mathbf{x}; \mathbf{W}), \ldots, h_K(\mathbf{x}; \mathbf{W})]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- In linear classifiers, $\mathbf{W}$ is $d \times K$, where $d$ is the number of features
- $\mathbf{W}^T \mathbf{x}$ gives us a vector
- $h(\mathbf{x}; \mathbf{W})$ contains $K$ numbers giving class scores for the input $\mathbf{x}$

30. Example
Output obtained from $\mathbf{W}^T \mathbf{x} + \mathbf{b}$, where $\mathbf{x} = [x_1, \ldots, x_{784}]^T$ is a $28 \times 28$ image stretched into a vector, $\mathbf{W}^T = [\mathbf{w}_1^T; \ldots; \mathbf{w}_{10}^T]$ is $10 \times 784$, and $\mathbf{b} = [b_1, \ldots, b_{10}]^T$.
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

31. Example
How can we tell whether this $\mathbf{W}$ and $\mathbf{b}$ is good or bad?
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

32. Bias can also be included in the $\mathbf{W}$ matrix.
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

33. Softmax classifier loss: example
$L^{(i)} = -\log \dfrac{e^{s_{y^{(i)}}}}{\sum_{j=1}^{K} e^{s_j}}$
$L^{(1)} = -\log 0.13 = 0.89$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.
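A small NumPy check of this loss. The score vector below is chosen so that the correct-class probability comes out to about 0.13, matching the figure; note that the natural log of 0.13 is about 2.04, while a base-10 log gives the 0.89 shown, so the log base is my reading of the slide:

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """Cross-entropy (softmax) loss for one example's score vector."""
    e = np.exp(scores - scores.max())     # shift by the max for numerical stability
    probs = e / e.sum()
    return -np.log(probs[correct_class])  # natural log here

# Illustrative scores; probs[0] is roughly 0.13 for this vector.
print(softmax_loss(np.array([3.2, 5.1, -1.7]), correct_class=0))
```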

34. Support Vector Machines
- Maximizing the margin: good according to intuition, theory, practice
- Support vector machines (SVMs) find the separator with the max margin

35. Hard-margin SVM: optimization problem
$\max_{\mathbf{w}, w_0} \dfrac{2}{\lVert \mathbf{w} \rVert}$
s.t. $\mathbf{w}^T \mathbf{x}^{(n)} + w_0 \ge 1$ for all $y^{(n)} = 1$, and $\mathbf{w}^T \mathbf{x}^{(n)} + w_0 \le -1$ for all $y^{(n)} = -1$
Figure: the decision boundary $\mathbf{w}^T \mathbf{x} + w_0 = 0$ with the margin hyperplanes $\mathbf{w}^T \mathbf{x} + w_0 = \pm 1$; margin width $= \dfrac{2}{\lVert \mathbf{w} \rVert}$.

36. Distance between a point $\mathbf{x}^{(n)}$ and the plane
$\text{distance} = \dfrac{\lvert \mathbf{w}^T \mathbf{x}^{(n)} + w_0 \rvert}{\lVert \mathbf{w} \rVert}$

37. Hard-margin SVM: optimization problem
We can equivalently optimize:
$\min_{\mathbf{w}, w_0} \dfrac{1}{2} \mathbf{w}^T \mathbf{w}$
s.t. $y^{(n)} \left( \mathbf{w}^T \mathbf{x}^{(n)} + w_0 \right) \ge 1, \quad n = 1, \ldots, N$
- It is a convex Quadratic Programming (QP) problem
- There are computationally efficient packages to solve it
- It has a global minimum (if any)
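One rough sketch of handing this QP to an off-the-shelf package: scikit-learn's linear SVC with a very large C approximates the hard-margin solution (the toy data and the choice of C are assumptions for illustration, not part of the slide):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data, made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C makes the soft-margin solution approach the hard-margin one.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]   # separating hyperplane w^T x + w_0 = 0
print(w, w0, 2.0 / np.linalg.norm(w))     # margin width 2 / ||w||
```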

38. Error measure
- Margin violation amount $\xi_n$ ($\xi_n \ge 0$): $y^{(n)} \left( \mathbf{w}^T \mathbf{x}^{(n)} + w_0 \right) \ge 1 - \xi_n$
- Total violation: $\sum_{n=1}^{N} \xi_n$

39. Soft-margin SVM: optimization problem
SVM with slack variables: allows samples to fall within the margin, but penalizes them.
$\min_{\mathbf{w}, w_0, \{\xi_n\}} \dfrac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n$
s.t. $y^{(n)} \left( \mathbf{w}^T \mathbf{x}^{(n)} + w_0 \right) \ge 1 - \xi_n$ and $\xi_n \ge 0$, $n = 1, \ldots, N$
$\xi_n$: slack variables
- $0 < \xi_n < 1$: $\mathbf{x}^{(n)}$ is correctly classified but inside the margin
- $\xi_n > 1$: $\mathbf{x}^{(n)}$ is misclassified

40. Soft-margin SVM: cost function
$\min_{\mathbf{w}, w_0, \{\xi_n\}} \dfrac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n$
s.t. $y^{(n)} \left( \mathbf{w}^T \mathbf{x}^{(n)} + w_0 \right) \ge 1 - \xi_n$, $\xi_n \ge 0$
It is equivalent to the unconstrained optimization problem:
$\min_{\mathbf{w}, w_0} \dfrac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \max\left(0,\ 1 - y^{(n)} (\mathbf{w}^T \mathbf{x}^{(n)} + w_0)\right)$
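A minimal sub-gradient descent sketch of this unconstrained hinge-loss form (the learning rate, iteration count, and data layout are assumptions; the slides do not specify an optimizer):

```python
import numpy as np

def soft_margin_svm_sgd(X, y, C=1.0, eta=0.01, n_iters=1000):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_n max(0, 1 - y_n (w^T x_n + w0)).
    X is (N, d), y has labels in {-1, +1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + w0)
        viol = margins < 1                                  # samples violating the margin
        grad_w = w - C * (X[viol] * y[viol, None]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= eta * grad_w
        w0 -= eta * grad_w0
    return w, w0
```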

41. Multi-class SVM
$J(\mathbf{W}) = \dfrac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\mathbf{W})$
Hinge loss (with $s_j \equiv h_j(\mathbf{x}^{(i)}; \mathbf{W}) = \mathbf{w}_j^T \mathbf{x}^{(i)}$):
$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left(0,\ 1 + s_j - s_{y^{(i)}}\right) = \sum_{j \ne y^{(i)}} \max\left(0,\ 1 + \mathbf{w}_j^T \mathbf{x}^{(i)} - \mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)}\right)$
L2 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{l=1}^{d} w_{l,k}^2$

42. Multi-class SVM loss: example
3 training examples, 3 classes. With some $\mathbf{W}$ the scores $s_j = \mathbf{w}_j^T \mathbf{x}^{(i)}$ are as shown in the figure; $L^{(i)} = \sum_{j \ne y^{(i)}} \max(0, 1 + s_j - s_{y^{(i)}})$.
$L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$
$L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2.0 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$
$L^{(3)} = \max(0, 1 + 2.2 - (-3.1)) + \max(0, 1 + 2.5 - (-3.1)) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$
$L = \dfrac{1}{N} \sum_{i=1}^{N} L^{(i)} = \dfrac{2.9 + 0 + 12.9}{3} = 5.27$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.
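The following NumPy snippet reproduces these three per-example losses; the grouping of the scores into rows and the choice of correct-class labels follow my reading of the figure:

```python
import numpy as np

def multiclass_hinge_loss(scores, correct):
    """L^(i) = sum_{j != y_i} max(0, 1 + s_j - s_{y_i})."""
    margins = np.maximum(0.0, 1.0 + scores - scores[correct])
    margins[correct] = 0.0                 # skip the j = y_i term
    return margins.sum()

# Score rows from the example above; correct classes assumed to be 0, 1, 2.
S = np.array([[3.2, 5.1, -1.7],
              [1.3, 4.9,  2.0],
              [2.2, 2.5, -3.1]])
labels = [0, 1, 2]
losses = [multiclass_hinge_loss(s, c) for s, c in zip(S, labels)]
print(losses, np.mean(losses))             # [2.9, 0.0, 12.9], average about 5.27
```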

43. Recap
We need $\nabla_{\mathbf{W}} L$ to update the weights.
- L2 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{l=1}^{d} w_{l,k}^2$
- L1 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{l=1}^{d} \lvert w_{l,k} \rvert$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

44. Generalized linear
Linear combination of fixed non-linear functions of the input vector:
$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_m \phi_m(\mathbf{x})$
$\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$

45. Basis functions: examples
- Linear
- Polynomial (univariate)

46. Polynomial regression: example
Fits of order $m = 1, 3, 5, 7$ (see figure).
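A short sketch of fitting such polynomials by least squares on the basis functions $\phi_j(x) = x^j$; the synthetic data here is made up for illustration and is not the dataset used in the figure:

```python
import numpy as np

def poly_design(x, m):
    """Polynomial basis functions phi_j(x) = x^j, j = 0..m, for univariate x."""
    return np.vander(x, m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least-squares fit of an order-m polynomial (no regularization)."""
    Phi = poly_design(x, m)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Illustrative noisy sinusoidal data.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
for m in (1, 3, 5, 7):
    print(m, fit_poly(x, y, m))            # fitted coefficients per order
```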

47. Generalized linear classifier
Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]$
$\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$
Find a hyper-plane in the transformed feature space: $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$

48. Model complexity and overfitting
With limited training data, models may achieve zero training error but a large test error.
- Training (empirical) loss: $\dfrac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$
- Expected (true) loss: $\mathbb{E}_{\mathbf{x}, y} \left[ \left( y - f(\mathbf{x}; \boldsymbol{\theta}) \right)^2 \right] \gg 0$
Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss; the model fails to generalize to unseen examples.

49. Polynomial regression
Fits of order $m = 0, 1, 3, 9$ (see figure). [Bishop]

50. Over-fitting causes
- Model complexity: e.g., a model with a large number of parameters (degrees of freedom)
- Small amount of training data: small data size compared to the complexity of the model

51. Model complexity
Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values ($m = 0, 1, 3, 9$). [Bishop]

52. Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data increases ($m = 9$, with $n = 15$ vs. $n = 100$). [Bishop]

53. How to evaluate the learner's performance?
- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error:
  - Practical: use a separate data set to test the model
  - Theoretical: law of large numbers (statistical bounds on the difference between the training and expected errors)

54. Avoiding over-fitting
- Determine a suitable value for model complexity (model selection):
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's razor):
  - Explicit preference towards simple models
  - Penalize the model complexity in the objective function

55. Model Selection
- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

56. Model Selection
- Model selection is the process by which we choose the "best" model from among a set of candidates
- Assume access to a function capable of measuring the quality of a model
- Typically done "outside" the main training algorithm
- Model selection / hyperparameter optimization is just another form of learning
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

57. Simple hold-out: model selection
Steps:
- Divide the training data into a training set and a validation set
- Use only the training set to train a set of models
- Evaluate each learned model on the validation set: $J_v(\mathbf{w}) = \dfrac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$
- Choose the best model based on the validation-set error
Usually too wasteful of valuable training data:
- Training data may be limited
- On the other hand, a small validation set gives a relatively noisy estimate of performance

58. Simple hold-out: training, validation, and test sets
- Simple hold-out chooses the model that minimizes the error on the validation set
- $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error: an extra parameter (e.g., the degree of the polynomial) is fit to this set
- Estimate the generalization error using the test set: the performance of the selected model is finally evaluated on the test set
(Training | Validation | Test)

59. Cross-Validation (CV): Evaluation
$k$-fold cross-validation steps:
- Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
- For $i = 1$ to $k$:
  - Choose the $i$-th group as the held-out validation group
  - Train the model on all but the $i$-th group of data
  - Evaluate the model on the held-out group
- The performance scores of the model from the $k$ runs are averaged; the average error rate can be considered an estimate of the true performance
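A minimal sketch of this procedure; the function signature and the `fit`/`error` callbacks are my own conventions, not part of the slides:

```python
import numpy as np

def k_fold_cv_error(X, y, fit, error, k=5, seed=0):
    """k-fold cross-validation: average validation error over k held-out folds.
    X, y are NumPy arrays; fit(X_tr, y_tr) returns a model,
    error(model, X_va, y_va) returns a scalar error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # shuffle the dataset
    folds = np.array_split(idx, k)            # k groups of roughly equal size
    errs = []
    for i in range(k):
        va = folds[i]                         # i-th group held out for validation
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr], y[tr])
        errs.append(error(model, X[va], y[va]))
    return float(np.mean(errs))               # averaged performance over the k runs
```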

60. Cross-Validation (CV): Model Selection
- For each model, we first find the average error by CV
- The model with the best average performance is selected

61. Cross-validation: polynomial regression example
5-fold CV, 100 runs, averaged:
- $m = 1$: CV MSE = 1.45
- $m = 3$: CV MSE = 0.30
- $m = 5$: CV MSE = 45.44
- $m = 7$: CV MSE = 31759

62. Regularization
Adding a penalty term to the cost function to discourage the coefficients from reaching large values.
Ridge regression (weight decay):
$J(\mathbf{w}) = \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$
$\hat{\mathbf{w}} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
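The closed-form ridge solution above can be computed directly; a minimal sketch, assuming the design matrix $\boldsymbol{\Phi}$ has already been built from the basis functions:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge regression: w_hat = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
```

With lam = 0 this reduces to ordinary least squares, while larger lam shrinks the coefficient magnitudes, which is the effect discussed on the following slides.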

63. Polynomial order
- Polynomials with larger $m$ become increasingly tuned to the random noise on the target values
- The magnitude of the coefficients typically gets larger as $m$ increases [Bishop]

64. Regularization parameter
For $m = 9$: table of the fitted coefficients $\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_9$ for $\ln \lambda = -\infty$ and $\ln \lambda = -18$ (values shown in the figure). [Bishop]

65. Regularization parameter
Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
