Gradient Descent for L2 Penalized Logistic Regression
Mike Hughes - Tufts COMP 135 - Spring 2019

  1. Gradient Descent for L2 Penalized Logistic Regr.
     Objective:
     $\min_{w \in \mathbb{R}^M} \; \underbrace{- \sum_{n=1}^{N} \log \mathrm{BernPMF}\big(t_n \mid \sigma(w^T \phi(x_n))\big) + \frac{\lambda}{2} w^T w}_{L(w)}$
     You need to specify: max. num. iterations $T$, initial step size $s_0$, step size schedule $s_t$, and a convergence threshold.
     input: initial $w \in \mathbb{R}^M$
     input: initial step size $s_0 \in \mathbb{R}^+$
     while not converged:
         $w \leftarrow w - s_t \nabla_w L(w)$
         $s_t \leftarrow \mathrm{decay}(s_0, t)$
         $t \leftarrow t + 1$

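A minimal NumPy sketch of the objective $L(w)$ and the loop above, assuming binary labels $t_n \in \{0, 1\}$ and a feature matrix `Phi` of shape (N, M); the helper names (`l2_logistic_loss`, `l2_logistic_grad`, `gradient_descent`) and the gradient-norm convergence test are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.special import expit as sigmoid  # numerically stable logistic sigmoid


def l2_logistic_loss(w, Phi, t, lam):
    # L(w) = -sum_n log BernPMF(t_n | sigma(w^T phi(x_n))) + (lam / 2) * w^T w
    p = sigmoid(Phi @ w)
    eps = 1e-12  # guard against log(0)
    log_lik = np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    return -log_lik + 0.5 * lam * (w @ w)


def l2_logistic_grad(w, Phi, t, lam):
    # Gradient of L(w): Phi^T (sigma(Phi w) - t) + lam * w
    return Phi.T @ (sigmoid(Phi @ w) - t) + lam * w


def gradient_descent(Phi, t, lam=1.0, s0=0.1, T=1000, tol=1e-6,
                     decay=lambda s0, i: s0):
    w = np.zeros(Phi.shape[1])           # initial w in R^M
    for i in range(1, T + 1):            # at most T iterations
        g = l2_logistic_grad(w, Phi, t, lam)
        if np.linalg.norm(g) < tol:      # convergence threshold on ||grad||
            break
        w = w - decay(s0, i) * g         # w <- w - s_t * grad_w L(w)
    return w
```

For example, `w = gradient_descent(Phi, t, lam=1.0, s0=0.05)` fits the penalized model with a constant step size.
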
  2. Will gradient descent always find the same solution?

  3. Will gradient descent always find the same solution? Yes, if the loss has a single minimum like the one shown; not if multiple local minima exist.

  4. Loss for logistic regression is convex!

  5. Intuition: 1D gradient descent. Choosing a good step size matters! [Figure: plots of $g(y)$ versus $y$]

  6. Log likelihood vs. iterations (Figure Credit: Emily Fox, UW). Maximizing likelihood: higher is better! (Could multiply by -1 and minimize instead.)

  7. If step size is too small (Figure Credit: Emily Fox, UW)

  8. If step size is large (Figure Credit: Emily Fox, UW)

  9. If step size is too large (Figure Credit: Emily Fox, UW)

  10. If step size is way too large (Figure Credit: Emily Fox, UW)

  11. Rules for picking step sizes
      • Never try just one! Usually you want the largest step size that doesn't diverge.
      • Try several values (exponentially spaced) until you find one clearly too small and one clearly too large (unhelpful oscillation / divergence).
      • Always make trace plots! Show the loss, the norm of the gradient, and the parameter values versus epoch.
      • Smarter choices for step size: decaying methods, search methods, second-order methods.

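One way to act on "try several exponentially spaced values" and "always make trace plots", sketched with matplotlib; it reuses the hypothetical `l2_logistic_loss` / `l2_logistic_grad` helpers from the earlier sketch, and the candidate grid and iteration count are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np


def trace_losses(Phi, t, lam, step_size, n_iters=200):
    # Run plain gradient descent with a fixed step size and record the loss
    # after every update, so the run can be inspected on a trace plot.
    w = np.zeros(Phi.shape[1])
    losses = []
    for _ in range(n_iters):
        w = w - step_size * l2_logistic_grad(w, Phi, t, lam)
        losses.append(l2_logistic_loss(w, Phi, t, lam))
    return losses


def plot_step_size_sweep(Phi, t, lam=1.0):
    # Exponentially spaced candidates: some clearly too small, some divergent.
    for s in [10.0 ** k for k in range(-5, 1)]:
        plt.plot(trace_losses(Phi, t, lam, s), label=f"s = {s:g}")
    plt.xlabel("iteration")
    plt.ylabel("loss L(w)")
    plt.legend()
    plt.show()
```
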
  12. Decaying step sizes
      input: initial $w \in \mathbb{R}^M$
      input: initial step size $s_0 \in \mathbb{R}^+$
      while not converged:
          $w \leftarrow w - s_t \nabla_w L(w)$
          $s_t \leftarrow \mathrm{decay}(s_0, t)$
          $t \leftarrow t + 1$
      Linear decay: $s_t = s_0 / (k t)$
      Exponential decay: $s_t = s_0 e^{-kt}$
      Often helpful, but hard to get right!

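The two schedules as small Python functions that plug into the `decay(s_0, t)` slot of the loop; the decay constant `k` and its default values are illustrative assumptions.

```python
import numpy as np


def linear_decay(s0, t, k=1.0):
    # s_t = s0 / (k * t): step size shrinks like 1/t (use with t >= 1).
    return s0 / (k * t)


def exponential_decay(s0, t, k=0.01):
    # s_t = s0 * exp(-k * t): step size shrinks geometrically.
    return s0 * np.exp(-k * t)


# Example: pass one of these into the earlier gradient descent sketch, e.g.
# w = gradient_descent(Phi, t, s0=0.5, decay=lambda s0, i: exponential_decay(s0, i, k=0.01))
```
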
  13. Searching for a good step size
      Goal: $\min_x f(x)$
      Step direction: $\Delta x = -\nabla_x f(x)$
      Exact line search: expensive, but the gold standard. Search for the best scalar $s \ge 0$:
      $s^* = \arg\min_{s \ge 0} f(x + s \Delta x)$
      [Figure: possible step lengths along the direction $\Delta x$]

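Exact line search can be approximated with an off-the-shelf 1-D minimizer; a sketch using `scipy.optimize.minimize_scalar`, where the upper bound `s_max` on the search interval is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def exact_line_search(f, x, delta_x, s_max=10.0):
    # Approximate s* = argmin_{0 <= s <= s_max} f(x + s * delta_x).
    res = minimize_scalar(lambda s: f(x + s * delta_x),
                          bounds=(0.0, s_max), method="bounded")
    return res.x


# Example on f(x) = ||x||^2 with a steepest-descent direction:
f = lambda x: float(x @ x)
x = np.array([3.0, -4.0])
delta_x = -2.0 * x                         # -grad f(x)
s_star = exact_line_search(f, x, delta_x)  # about 0.5 for this quadratic
```
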
  14. Searching for a good step size
      Goal: $\min_x f(x)$
      Step direction: $\Delta x = -\nabla_x f(x)$
      Backtracking line search: more efficient!
      $\hat{s} \leftarrow 1$
      while $f(x) + \alpha \hat{s} \nabla_x f(x)^T \Delta x < f(x + \hat{s} \Delta x)$:  (reduced-slope linear extrapolation still below the actual value)
          $\hat{s} \leftarrow 0.9 \cdot \hat{s}$

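A minimal implementation of the backtracking loop above; the 0.9 shrink factor comes from the slide, while the slope-reduction factor `alpha=0.5` is an assumed value.

```python
import numpy as np


def backtracking_line_search(f, grad_f, x, alpha=0.5, shrink=0.9):
    g = grad_f(x)
    delta_x = -g                   # steepest-descent direction
    s = 1.0
    # Shrink s while f at the trial point still lies above the linear
    # extrapolation from x whose slope has been reduced by factor alpha.
    while f(x + s * delta_x) > f(x) + alpha * s * (g @ delta_x):
        s *= shrink
    return s


# Example on a simple quadratic:
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
s = backtracking_line_search(f, grad_f, np.array([3.0, -4.0]))
```
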
  15. Backtracking line search
      Python: scipy.optimize.line_search
      [Figure: linear extrapolation with slope reduced by factor $\alpha$; acceptable vs. rejected step sizes]
      $\hat{s} \leftarrow 1$
      while $f(x) + \alpha \hat{s} \nabla_x f(x)^T \Delta x < f(x + \hat{s} \Delta x)$:
          $\hat{s} \leftarrow 0.9 \cdot \hat{s}$

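SciPy's routine named on the slide handles the search directly (it enforces Wolfe conditions, a refinement of the simple backtracking rule); a sketch on an illustrative quadratic.

```python
import numpy as np
from scipy.optimize import line_search

f = lambda x: float(x @ x)   # objective
grad_f = lambda x: 2.0 * x   # its gradient

x = np.array([3.0, -4.0])
pk = -grad_f(x)              # descent direction

# line_search returns (step, n_f_evals, n_grad_evals, new_f, old_f, new_slope);
# the step is None if no acceptable step size was found.
step, *_ = line_search(f, grad_f, x, pk)
if step is not None:
    x_new = x + step * pk
```
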
  16. More resources on step sizes! Online textbook: Convex Optimization (Boyd and Vandenberghe): http://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

  17. 2nd-order methods for gradient descent. Big idea: the 2nd derivative can help!

  18. Newton's method: use the second derivative to rescale the step size!
      Goal: $\min_x f(x)$
      Step direction: $\Delta x = -H(x)^{-1} \nabla_x f(x)$
      Will step directly to the minimum if $f$ is quadratic!
      In high dimensions, need the Hessian matrix $H(x)$.

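A sketch of a single Newton step as defined above, solving the linear system with the Hessian rather than forming its inverse; the quadratic example is illustrative and, as the slide says, one step lands exactly on its minimum.

```python
import numpy as np


def newton_step(grad_f, hess_f, x):
    # Delta x = -H(x)^{-1} grad f(x); solve the system instead of inverting H.
    return x - np.linalg.solve(hess_f(x), grad_f(x))


# Quadratic f(x) = 0.5 x^T A x - b^T x: one step from 0 reaches the minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b
hess_f = lambda x: A
x_min = newton_step(grad_f, hess_f, np.zeros(2))  # equals np.linalg.solve(A, b)
```
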
  19. Animation of Newton's method: to optimize, we want to find zeros of the first derivative $f'(x)$!

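The idea behind the animation, written as 1-D Newton root-finding on the first derivative via $x \leftarrow x - f'(x)/f''(x)$; the example function and starting point are illustrative.

```python
def newton_1d(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    # Find a zero of f'(x), i.e. a stationary point of f, via x <- x - f'(x) / f''(x).
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Example: f(x) = x^4 - 3x^2 + x, so f'(x) = 4x^3 - 6x + 1 and f''(x) = 12x^2 - 6.
fprime = lambda x: 4 * x**3 - 6 * x + 1
fsecond = lambda x: 12 * x**2 - 6
x_star = newton_1d(fprime, fsecond, x0=1.5)  # converges to a stationary point near 1.13
```
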
  20. L-BFGS: gold-standard approximate 2nd-order GD
      Python: scipy.optimize.fmin_l_bfgs_b
      L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS)
      • Provide loss and gradient functions
      • Approximates the Hessian via the recent history of gradient steps
      Newton step: $\Delta x = -H(x)^{-1} \nabla_x f(x)$. In high dimensions this needs the Hessian, which is quadratic in the length of $x$ and expensive.
      L-BFGS step: $\Delta x = -\hat{H}(x)^{-1} \nabla_x f(x)$, using a low-rank approximation $\hat{H}$ instead.

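Hooking an L2-penalized logistic regression loss into the SciPy function named on the slide; `l2_logistic_loss` and `l2_logistic_grad` are the hypothetical helpers from the first sketch.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b


def fit_l2_logistic_lbfgs(Phi, t, lam=1.0):
    # Provide the loss and its gradient; L-BFGS builds its own low-rank
    # Hessian approximation from the recent history of gradient steps.
    w0 = np.zeros(Phi.shape[1])
    w_opt, loss_opt, info = fmin_l_bfgs_b(func=l2_logistic_loss, x0=w0,
                                          fprime=l2_logistic_grad,
                                          args=(Phi, t, lam))
    return w_opt
```

The same fit can also be run through `scipy.optimize.minimize(l2_logistic_loss, w0, jac=l2_logistic_grad, args=(Phi, t, lam), method="L-BFGS-B")`.
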