  1. STAT 339 Approximate Inference I 15 March 2017 Colin Reimer Dawson

  2. Outline
  ▸ Approximation Methods
  ▸ Motivating Model: Logistic Regression
  ▸ Newton-Raphson Method

  3. Approximation Methods
  ▸ Thus far we have done a lot of calculus and probability math to find exact optima/posteriors/predictive distributions for simple models.
  ▸ We relied heavily on some strong assumptions (e.g., i.i.d. Normal errors, conjugate priors, some parameters fixed, etc.).
  ▸ In general, the "nice" properties that made exact solutions possible will not be present.
  ▸ Hence we need to rely on approximations to our optima/distributions/etc.

  4. Two Classes of Approximation
  We can either
  1. Solve for an approximate solution exactly
  ▸ Settling for local optima
  ▸ Making the "least bad" simplifying assumptions to make analytic solutions possible
  2. Solve for an exact solution approximately
  ▸ Numerical/stochastic integration
  ▸ Stochastic search

  5. Logistic Regression
  ▸ Consider binary classification ($t \in \{0, 1\}$) where we want to model $P(t = 1)$ as an explicit function of a feature vector $\mathbf{x}$.
  ▸ Linear regression? $\hat{P}(t_n = 1 \mid \mathbf{x}_n) = \mathbf{x}_n \mathbf{w}$
  ▸ This can work if we only care about whether $\hat{P}(t = 1) > 0.5$, but a linear model can return invalid probabilities.
  ▸ Not great if we want to quantify uncertainty.

  6. Modeling a Transformed Probability
  ▸ Idea: keep the linear dependence, but instead of modeling $P(t = 1 \mid \mathbf{x})$ directly, model a nonlinear function of $P$ that is not bounded to $[0, 1]$.
  ▸ The odds:
    $$\mathrm{Odds}(t = 1) := \frac{P(t = 1)}{1 - P(t = 1)} \in [0, \infty)$$
  ▸ The log odds or logit:
    $$\mathrm{Logit}(t = 1) := \log\left(\frac{P(t = 1)}{1 - P(t = 1)}\right) \in (-\infty, \infty)$$
  ▸ Nice property: equal probabilities correspond to $\mathrm{Logit} = 0$.

  7. Logit Transformation
  [Figure: logit(p) plotted against p ∈ (0, 1); the curve ranges over roughly (−6, 6) and crosses 0 at p = 0.5.]

  8. Logistic Transformation
  ▸ $\eta = \mathrm{Logit}(p) = \log\left(\frac{p}{1 - p}\right)$
  ▸ The inverse is the logistic function:
    $$p = \mathrm{Logistic}(\eta) = \mathrm{Logit}^{-1}(\eta) = \frac{\exp\{\eta\}}{1 + \exp\{\eta\}}$$
  [Figure: logistic(η) plotted for η ∈ (−5, 5); an S-shaped curve rising from 0 to 1, equal to 0.5 at η = 0.]
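To make the pair of transformations concrete, here is a minimal sketch in Python with NumPy (the function names `logit` and `logistic` are ours) verifying that the two are inverses:

```python
import numpy as np

def logit(p):
    """Log odds: maps a probability in (0, 1) to the whole real line."""
    return np.log(p / (1 - p))

def logistic(eta):
    """Inverse of the logit; 1/(1 + e^{-eta}) equals e^eta/(1 + e^eta)."""
    return 1 / (1 + np.exp(-eta))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))            # [-2.197...  0.  2.197...]
print(logistic(logit(p)))  # recovers [0.1, 0.5, 0.9]
```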

  9. A Linear Model of the Logit
  ▸ Having defined $\eta$ with an unrestricted range, we can now model $\eta_n = \mathbf{x}_n \mathbf{w}$.
  ▸ Or, equivalently,
    $$P(t_n = 1 \mid \mathbf{x}_n) = \frac{\exp\{\mathbf{x}_n \mathbf{w}\}}{1 + \exp\{\mathbf{x}_n \mathbf{w}\}}$$
  ▸ With an independence assumption, this yields the likelihood function
    $$L(\mathbf{w}) = P(\mathbf{t} \mid \mathbf{X}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{1 - t_n}$$
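As a sanity check, a short sketch (all data and weights below are made up for illustration) that evaluates these per-point probabilities and the resulting likelihood:

```python
import numpy as np

# Toy data: rows of X are feature vectors x_n (first column is a bias term),
# t holds the binary targets, w is an arbitrary candidate weight vector.
X = np.array([[1.0, -0.5],
              [1.0,  0.3],
              [1.0,  2.0]])
t = np.array([0, 1, 1])
w = np.array([-0.2, 1.5])

p = 1 / (1 + np.exp(-(X @ w)))         # P(t_n = 1 | x_n) = e^{x_n w}/(1 + e^{x_n w})
L = np.prod(p**t * (1 - p)**(1 - t))   # the likelihood above
print(p, L)
```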

  10. MLE for w
  ▸ The likelihood for $\mathbf{w}$ is
    $$L(\mathbf{w}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{1 - t_n}$$
  ▸ The log likelihood is
    $$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left( t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}}) \right)$$
  ▸ The $d$th coordinate of the gradient is
    $$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left( t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}} \right)$$
  ▸ Good luck solving for $\mathbf{w}$ analytically...
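A minimal sketch of the log likelihood and its gradient in code (NumPy; the function names are ours, and `X`, `t` are assumed to be an N × D design matrix and a 0/1 target vector):

```python
import numpy as np

def log_likelihood(w, X, t):
    a = X @ w                                   # a_n = x_n w
    return np.sum(t * a - np.log1p(np.exp(a)))  # Σ_n [t_n x_n w − log(1 + e^{x_n w})]

def gradient(w, X, t):
    p = 1 / (1 + np.exp(-(X @ w)))              # p_n = e^{x_n w} / (1 + e^{x_n w})
    return X.T @ (t - p)                        # d-th entry: Σ_n (t_n − p_n) x_nd
```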

  11. Gradient Ascent/Descent

  12. Iterative Optimization
  ▸ We want to try to find a peak of the log likelihood iteratively: make a guess, improve near the guess, rinse and repeat until you can't improve further.
  ▸ Many algorithms exist to do this kind of thing.
  ▸ One good one when we have a gradient is Newton-Raphson (an old, old method originally used to find roots of polynomials).

  13. Newton-Raphson Optimization
  ▸ Setting: we have a function $f(w)$ and want to find $\hat{w}$ s.t. $f(\hat{w}) = 0$.
  ▸ Algorithm:
    1. Pick an initial guess $\hat{w}^{(0)}$.
    2. For $n = 0, 1, \ldots$, while $|f(\hat{w}^{(n)})| > \varepsilon$:
       a. Approximate $f$ around $\hat{w}^{(n)}$ with a line, $\tilde{f}_n(w)$.
       b. Find $\hat{w}^{(n+1)}$ so that $\tilde{f}_n(\hat{w}^{(n+1)}) = 0$.
  ▸ How to do 2a and 2b?
       a. Use the tangent line: $\tilde{f}_n(w) = f(\hat{w}^{(n)}) + f'(\hat{w}^{(n)})(w - \hat{w}^{(n)})$
       b. Set this to zero and solve:
          $$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f(\hat{w}^{(n)})}{f'(\hat{w}^{(n)})}$$
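A minimal sketch of this zero-finding loop (Python; the function name, tolerance, and iteration cap are our choices), applied to $f(w) = w^2 - 2$, whose positive root is $\sqrt{2}$:

```python
def newton_raphson(f, fprime, w0, eps=1e-10, max_iter=50):
    """Find w with f(w) = 0 by repeatedly zeroing the tangent line."""
    w = w0
    for _ in range(max_iter):
        if abs(f(w)) <= eps:
            break
        w = w - f(w) / fprime(w)   # step 2b: zero of the tangent at w
    return w

root = newton_raphson(lambda w: w**2 - 2, lambda w: 2 * w, w0=1.0)
print(root)  # ≈ 1.41421356...
```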

  14. OK, but isn't that just for zero-finding?
  ▸ Yes, but the stumbling block in our problem (maximum likelihood) was that we could set the gradient to zero but couldn't solve for $\mathbf{w}$ analytically!
  ▸ When optimizing, we want to find zeroes of $f'(w)$, so our update step is
    $$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$
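Reusing the same loop on the derivative (a sketch; the example function $f(w) = \log w - w$, with maximizer $w = 1$, is ours):

```python
f_prime  = lambda w: 1 / w - 1     # derivative of f(w) = log(w) − w
f_second = lambda w: -1 / w**2     # second derivative

w = 0.5                                # initial guess
for _ in range(20):
    w = w - f_prime(w) / f_second(w)   # Newton step applied to f'
print(w)  # ≈ 1.0, the maximizer
```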

  15. Why/when does this work?
  Intermediate value theorem: if $f : [a, b] \to \mathbb{R}$ is continuous, $u$ is real, and $f(a) > u > f(b)$, then there is some $c \in (a, b)$ such that $f(c) = u$.
  But we need a reasonable initialization, or the algorithm can diverge. Also, it only finds a local optimum.
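To see the initialization caveat concretely, here is a small sketch (our example, not from the slides): Newton's method on $f(w) = \arctan(w)$, whose only root is $w = 0$, converges from a nearby start but diverges from a far one:

```python
import math

f      = math.atan
fprime = lambda w: 1 / (1 + w**2)

for w0 in (1.0, 2.0):                # near vs. far initialization
    w = w0
    for _ in range(6):
        w = w - f(w) / fprime(w)     # same Newton update as before
    print(w0, "->", w)
# From 1.0 the iterates shrink toward the root at 0;
# from 2.0 each step overshoots further and the iterates blow up.
```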

  16. Multivariate functions
  ▸ Recall our log likelihood for logistic regression:
    $$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left( t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}}) \right)$$
  ▸ The $d$th coordinate of the gradient is
    $$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left( t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}} \right)$$
  ▸ This is a function with a vector input.

  17. Multivariate Derivatives
  ▸ The analog of the first derivative is the gradient vector,
    $$\nabla f(\mathbf{w}) = \left( \frac{\partial f(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_D} \right)^{T}$$
  ▸ The analog of the second derivative is the matrix of second partial derivatives, which is called the Hessian matrix:
    $$\mathbf{H}_f(\mathbf{w}) = \begin{pmatrix} \frac{\partial^2 f(\mathbf{w})}{\partial w_1^2} & \frac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_1} & \frac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_2} & \cdots & \frac{\partial^2 f(\mathbf{w})}{\partial w_D^2} \end{pmatrix}$$
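For checking analytic derivations like the ones above, a generic finite-difference sketch (ours; the step sizes are rough defaults) that approximates the gradient and Hessian numerically:

```python
import numpy as np

def num_gradient(f, w, h=1e-6):
    """Central-difference approximation to the gradient of f at w."""
    D = len(w)
    g = np.zeros(D)
    for d in range(D):
        e = np.zeros(D); e[d] = h
        g[d] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def num_hessian(f, w, h=1e-4):
    """Finite-difference approximation to the Hessian of f at w."""
    D = len(w)
    H = np.zeros((D, D))
    for d in range(D):
        e = np.zeros(D); e[d] = h
        H[:, d] = (num_gradient(f, w + e) - num_gradient(f, w - e)) / (2 * h)
    return H

# Quick check on f(w) = w·w, whose Hessian is 2I:
print(num_hessian(lambda w: w @ w, np.array([1.0, 2.0])))  # ≈ [[2, 0], [0, 2]]
```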

  18. Multivariate Update Equation
  ▸ The update equation for a function of one variable is
    $$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$
  ▸ For more than one variable, this becomes
    $$\hat{\mathbf{w}}^{(n+1)} = \hat{\mathbf{w}}^{(n)} - \mathbf{H}_f(\hat{\mathbf{w}}^{(n)})^{-1}\, \nabla f(\hat{\mathbf{w}}^{(n)})$$
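In code, one multivariate step is usually computed with a linear solve rather than an explicit matrix inverse; a minimal sketch (ours, assuming `grad` and `hess` are callables returning the gradient vector and Hessian matrix):

```python
import numpy as np

def newton_step(w, grad, hess):
    """One update: w_new = w − H⁻¹ ∇f, via a linear solve for stability."""
    return w - np.linalg.solve(hess(w), grad(w))
```

Solving the system H x = ∇f directly is cheaper and numerically better behaved than forming H⁻¹ and multiplying.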

  19. Example: MLE for Logistic Regression
  ▸ Recall our log likelihood for logistic regression:
    $$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left( t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}}) \right)$$
  ▸ The $d$th coordinate of the gradient is
    $$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left( t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}} \right)$$
  ▸ The $(d, d')$ entry of the Hessian is
    $$\frac{\partial^2 \log L}{\partial w_d\, \partial w_{d'}} = -\sum_{n=1}^{N} \frac{e^{\mathbf{x}_n \mathbf{w}}}{(1 + e^{\mathbf{x}_n \mathbf{w}})^2}\, x_{nd}\, x_{nd'}$$
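Putting the pieces together, an end-to-end sketch (toy data generated on the spot; all names are ours) of Newton-Raphson maximum likelihood for logistic regression. Note that e^{x_n w}/(1 + e^{x_n w})² = p_n(1 − p_n), which the code uses for the Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])  # bias + features
w_true = np.array([-1.0, 2.0, 0.5])
t = (rng.uniform(size=N) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)

w = np.zeros(D)                        # initial guess
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ w)))     # p_n = P(t_n = 1 | x_n)
    grad = X.T @ (t - p)               # gradient of log L
    H = -(X.T * (p * (1 - p))) @ X     # Hessian: −Σ_n p_n(1−p_n) x_n x_nᵀ
    step = np.linalg.solve(H, grad)    # H⁻¹ ∇ log L
    w = w - step                       # multivariate Newton update
    if np.max(np.abs(step)) < 1e-8:
        break
print(w)  # should land near w_true for a sample this size
```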

  20. Solution Path

  21. Classification Result

  22. Nonlinear Classification Result
