  1. Deep Learning Basics Lecture 3: Regularization I Princeton University COS 495 Instructor: Yingyu Liang

  2. What is regularization?
     β€’ In general: any method to prevent overfitting or help the optimization
     β€’ Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization

  3. Review: overfitting

  4. Overfitting example: regression using polynomials, $y = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  5. Overfitting example: regression using polynomials. Figure from Pattern Recognition and Machine Learning, Bishop

  6. Overfitting
     β€’ Empirical loss and expected loss are different
     β€’ The smaller the data set, the larger the difference between the two
     β€’ The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
     β€’ Such a hypothesis has small training error but large test error (overfitting)

  7. Prevent overfitting
     β€’ Larger data set helps
     β€’ Throwing away useless hypotheses also helps
     β€’ Classical regularization: some principled ways to constrain hypotheses
     β€’ Other types of regularization: data augmentation, early stopping, etc.

  8. Different views of regularization

  9. Regularization as hard constraint
     β€’ Training objective: $\min_f \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$, subject to $f \in \mathcal{H}$
     β€’ When parametrized: $\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $\theta \in \Omega$

  10. Regularization as hard constraint
     β€’ When $\Omega$ is measured by some quantity $R$: $\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $R(\theta) \le r$
     β€’ Example: $\ell_2$ regularization: $\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $\|\theta\|_2^2 \le r^2$
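
The slides do not prescribe an algorithm for the hard-constraint form; one common choice, consistent with the "projection back to the feasible set" idea mentioned on slide 17, is projected gradient descent. Below is a minimal sketch (assumed, not from the lecture; data and names are illustrative) for the constraint $\|\theta\|_2^2 \le r^2$:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the l2 ball {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gd(grad_loss, dim, r, eta=0.01, steps=1000):
    """Projected gradient descent for: min L(theta) subject to ||theta||_2 <= r."""
    theta = np.zeros(dim)
    for _ in range(steps):
        theta = theta - eta * grad_loss(theta)  # unconstrained gradient step
        theta = project_l2_ball(theta, r)       # enforce the hard constraint
    return theta

# Toy usage with a least-squares loss (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.normal(size=100)
grad = lambda theta: 2.0 / len(y) * X.T @ (X @ theta - y)
theta_hat = projected_gd(grad, dim=3, r=1.0)
print(theta_hat, np.linalg.norm(theta_hat))  # the returned theta satisfies ||theta|| <= 1
```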

  11. Regularization as soft constraint
     β€’ The hard-constraint optimization is equivalent to a soft-constraint one: $\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$ for some regularization parameter $\lambda^* > 0$
     β€’ Example: $\ell_2$ regularization: $\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$

  12. Regularization as soft constraint
     β€’ Shown by the Lagrange multiplier method: $\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$
     β€’ Suppose $\theta^*$ is the optimum of the hard-constraint optimization: $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$
     β€’ Suppose $\lambda^*$ is the corresponding optimum of the inner max; then $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$

  13. Regularization as Bayesian prior
     β€’ Bayesian view: everything is a distribution
     β€’ Prior over the hypotheses: $p(\theta)$
     β€’ Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
     β€’ Likelihood: $p(\{x_i, y_i\} \mid \theta)$
     β€’ Bayes' rule: $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$

  14. Regularization as Bayesian prior
     β€’ Bayes' rule: $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
     β€’ Maximum a posteriori (MAP): $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$
     β€’ The first term plays the role of the regularization; the second term is the MLE loss

  15. Regularization as Bayesian prior
     β€’ Example: $\ell_2$ loss with $\ell_2$ regularization: $\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2 + \lambda^* \|\theta\|_2^2$
     β€’ Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
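
To make the correspondence concrete, here is a small numerical check (not part of the lecture; data and names are illustrative) that the MAP estimate under a Gaussian likelihood $y_i \sim \mathcal{N}(\theta^T x_i, \sigma^2)$ and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ coincides with the $\ell_2$-regularized (ridge) least-squares solution with $\lambda^* = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma, tau = 0.5, 1.0
y = X @ theta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2

# Ridge / soft-constraint solution: argmin ||X theta - y||^2 + lam * ||theta||^2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP solution: argmax log p(y | X, theta) + log p(theta)
#             = argmin (1/(2 sigma^2)) ||X theta - y||^2 + (1/(2 tau^2)) ||theta||^2
theta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2, X.T @ y / sigma**2)

print(np.allclose(theta_ridge, theta_map))  # True: same minimizer
```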

  16. Three views
     β€’ Typical choice for optimization: the soft-constraint form $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$
     β€’ Hard-constraint and Bayesian views: conceptual, or used for derivation

  17. Three views
     β€’ Hard constraint preferred if:
        β€’ the explicit bound $R(\theta) \le r$ is known
        β€’ the soft constraint causes the optimization to get trapped in a local minimum with small $\theta$
        β€’ projection back to the feasible set leads to stability
     β€’ Bayesian view preferred if:
        β€’ the prior distribution is known

  18. Some examples

  19. Classical regularization
     β€’ Norm penalty
        β€’ $\ell_2$ regularization
        β€’ $\ell_1$ regularization
     β€’ Robustness to noise

  20. π‘š 2 regularization 𝑀(πœ„) + 𝛽 𝑀 𝑆 πœ„ = ΰ·  ΰ·  2 min 2 | πœ„| 2 πœ„ β€’ Effect on (stochastic) gradient descent β€’ Effect on the optimal solution

  21. Effect on gradient descent
     β€’ Gradient of the regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$
     β€’ Gradient descent update: $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta \alpha)\theta - \eta \nabla \hat{L}(\theta)$
     β€’ Terminology: weight decay
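
A minimal sketch of this update on a least-squares loss (illustrative, not from the slides); it shows the two equivalent ways of writing the weight-decay step:

```python
import numpy as np

def grad_loss(theta, X, y):
    """Gradient of the unregularized loss L(theta) = (1/n) * ||X theta - y||^2."""
    return 2.0 / len(y) * X.T @ (X @ theta - y)

def gd_weight_decay(X, y, alpha=0.1, eta=0.01, steps=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Equivalent forms of the same update:
        #   theta <- theta - eta * (grad_loss(theta) + alpha * theta)
        #   theta <- (1 - eta * alpha) * theta - eta * grad_loss(theta)
        theta = (1.0 - eta * alpha) * theta - eta * grad_loss(theta, X, y)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(gd_weight_decay(X, y))
```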

  22. Effect on the optimal solution
     β€’ Consider a quadratic approximation around $\theta^*$: $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
     β€’ Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$ and $\nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$

  23. Effect on the optimal solution
     β€’ Gradient of the regularized objective: $\nabla \hat{L}_R(\theta) \approx H (\theta - \theta^*) + \alpha \theta$
     β€’ At the optimum $\theta_R^*$ of the regularized objective: $0 = \nabla \hat{L}_R(\theta_R^*) \approx H (\theta_R^* - \theta^*) + \alpha \theta_R^*$, so $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$

  24. Effect on the optimal solution
     β€’ The optimum: $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
     β€’ Suppose $H$ has the eigendecomposition $H = Q \Lambda Q^T$; then $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$
     β€’ Effect: rescale $\theta^*$ along the eigenvectors of $H$ (the component along eigenvector $i$ is scaled by $\lambda_i / (\lambda_i + \alpha)$)
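
A quick numerical check (illustrative) that, for a quadratic loss with Hessian $H$, the $\ell_2$-regularized minimizer equals $(H + \alpha I)^{-1} H \theta^*$, i.e. $\theta^*$ rescaled by $\lambda_i/(\lambda_i + \alpha)$ along each eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 4, 0.5
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)          # a positive definite Hessian
theta_star = rng.normal(size=d)  # unregularized minimizer

# Minimize 0.5*(theta - theta*)^T H (theta - theta*) + 0.5*alpha*||theta||^2 in closed form
theta_reg = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)

# Same thing via the eigendecomposition H = Q Lambda Q^T
lam, Q = np.linalg.eigh(H)
theta_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ theta_star))

print(np.allclose(theta_reg, theta_eig))  # True
```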

  25. Effect on the optimal solution
     β€’ Notation in the figure: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$
     β€’ Figure from Deep Learning, Goodfellow, Bengio and Courville

  26. π‘š 1 regularization 𝑀 𝑆 πœ„ = ΰ·  ΰ·  min 𝑀(πœ„) + 𝛽| πœ„ | 1 πœ„ β€’ Effect on (stochastic) gradient descent β€’ Effect on the optimal solution

  27. Effect on gradient descent
     β€’ Gradient of the regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \, \mathrm{sign}(\theta)$, where sign applies to each element of $\theta$
     β€’ Gradient descent update: $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \, \mathrm{sign}(\theta)$
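
A minimal sketch (illustrative, not from the slides) of this subgradient update on a least-squares loss; the penalty contributes a constant-magnitude step $-\eta\alpha\,\mathrm{sign}(\theta)$ per coordinate, which pushes small weights toward zero:

```python
import numpy as np

def grad_loss(theta, X, y):
    """Gradient of the unregularized loss L(theta) = (1/n) * ||X theta - y||^2."""
    return 2.0 / len(y) * X.T @ (X @ theta - y)

def subgd_l1(X, y, alpha=0.5, eta=0.01, steps=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - eta * grad_loss(theta, X, y) - eta * alpha * np.sign(theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.1 * rng.normal(size=200)
print(subgd_l1(X, y))  # coordinates tied to the zero coefficients stay near zero
```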

  28. Effect on the optimal solution
     β€’ Consider a quadratic approximation around $\theta^*$: $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
     β€’ Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$

  29. Effect on the optimal solution
     β€’ Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$); not true in general, but assumed to get some intuition
     β€’ The regularized objective is then (ignoring constants) $\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$
     β€’ The optimum $\theta_R^*$: $(\theta_R^*)_i \approx \max\{\theta_i^* - \frac{\alpha}{H_{ii}}, 0\}$ if $\theta_i^* \ge 0$, and $(\theta_R^*)_i \approx \min\{\theta_i^* + \frac{\alpha}{H_{ii}}, 0\}$ if $\theta_i^* < 0$

  30. Effect on the optimal solution
     β€’ Effect: induce sparsity
     β€’ [Figure: $(\theta_R^*)_i$ plotted against $\theta_i^*$; the output is exactly zero on the interval $[-\frac{\alpha}{H_{ii}}, \frac{\alpha}{H_{ii}}]$]

  31. Effect on the optimal solution
     β€’ Further assume that $H$ is diagonal
     β€’ Compact expression for the optimum $\theta_R^*$: $(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\{|\theta_i^*| - \frac{\alpha}{H_{ii}}, 0\}$
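
A small check (illustrative) that this soft-thresholding formula matches a brute-force minimization of the per-coordinate objective $\frac{1}{2} H_{ii}(\theta_i - \theta_i^*)^2 + \alpha|\theta_i|$:

```python
import numpy as np

def soft_threshold(t_star, alpha, h):
    """Closed-form minimizer of 0.5*h*(t - t_star)^2 + alpha*|t|."""
    return np.sign(t_star) * np.maximum(np.abs(t_star) - alpha / h, 0.0)

def brute_force(t_star, alpha, h):
    """Grid search over t as a sanity check."""
    ts = np.linspace(-5, 5, 200001)
    obj = 0.5 * h * (ts - t_star) ** 2 + alpha * np.abs(ts)
    return ts[np.argmin(obj)]

for t_star in [-2.0, -0.3, 0.0, 0.4, 1.5]:
    closed = soft_threshold(t_star, alpha=0.5, h=1.0)
    brute = brute_force(t_star, alpha=0.5, h=1.0)
    print(f"theta*={t_star:+.2f}  closed-form={closed:+.3f}  brute-force={brute:+.3f}")
```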

  32. Bayesian view
     β€’ $\ell_1$ regularization corresponds to a Laplacian prior: $p(\theta) \propto \exp(-\alpha \sum_i |\theta_i|)$
     β€’ $\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha \|\theta\|_1 + \text{constant}$
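
As a quick numeric illustration (not from the slides; uses scipy), the negative log-density of a factorized Laplace prior with scale $1/\alpha$ differs from $\alpha\|\theta\|_1$ only by a constant independent of $\theta$, which is why MAP with this prior adds an $\ell_1$ penalty to the loss:

```python
import numpy as np
from scipy.stats import laplace

alpha = 0.7
theta = np.array([0.5, -1.2, 3.0])

# Laplace density with scale b = 1/alpha: p(t) = (alpha/2) * exp(-alpha*|t|)
neg_log_prior = -laplace.logpdf(theta, loc=0.0, scale=1.0 / alpha).sum()
l1_penalty = alpha * np.abs(theta).sum()

# The difference is the normalization constant d * log(2/alpha), independent of theta.
print(neg_log_prior - l1_penalty, len(theta) * np.log(2.0 / alpha))
```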
