STAT 339 Approximate Inference I
15 March 2017
Colin Reimer Dawson
Outline
▸ Approximation Methods
▸ Motivating Model: Logistic Regression
▸ Newton-Raphson Method
Approximation Methods
▸ Thus far we have done a lot of calculus and probability
math to find exact optima/posterior/predictive distributions for simple models.
▸ We relied heavily on some strong assumptions (e.g., i.i.d.
Normal errors, conjugate priors, some parameters fixed, etc.)
▸ In general, the “nice” properties that made exact solutions
possible will not be present.
▸ Hence we need to rely on approximations to our optima/distributions/etc.
Two Classes of Approximation
We can either
1. Solve for an approximate solution exactly
▸ Settling for local optima
▸ Making the “least bad” simplifying assumptions to make analytic solutions possible
2. Solve for an exact solution approximately
▸ Numerical/stochastic integration
▸ Stochastic search
Logistic Regression
▸ Consider binary
classification (t ∈ {0,1}) where we want to model P(t = 1) as an explicit function of feature vector x.
▸ Linear regression?
$$\hat{P}(t_n = 1 \mid \mathbf{x}_n) = \mathbf{x}_n\mathbf{w}$$
▸ This can work if we only care about whether $\hat{P}(t = 1) > 0.5$, but a linear model can return invalid probabilities
▸ Not great if we want to
quantify uncertainty
Modeling a Transformed Probability
▸ Idea: keep the linear dependence idea, but instead of modeling P(t = 1 ∣ x) directly, model a nonlinear function of P that is not bounded to [0,1].
▸ The odds:
$$\mathrm{Odds}(t = 1) := \frac{P(t = 1)}{1 - P(t = 1)} \in [0, \infty)$$
▸ The log odds or logit:
$$\mathrm{Logit}(t = 1) := \log\left(\frac{P(t = 1)}{1 - P(t = 1)}\right) \in (-\infty, \infty)$$
▸ Nice property: equal probabilities correspond to Logit = 0.
Logit Transformation
[Figure: logit(p) plotted against p for p ∈ (0, 1)]
Logistic Transformation
▸ $\eta = \mathrm{Logit}(p) = \log\left(\frac{p}{1 - p}\right)$
▸ Inverse is the logistic function:
$$p = \mathrm{Logistic}(\eta) = \mathrm{Logit}^{-1}(\eta) = \frac{\exp\{\eta\}}{1 + \exp\{\eta\}}$$
[Figure: logistic(η) plotted against η]
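As a quick illustration, here is a minimal numpy sketch of the two transformations; the function names logit and logistic are just illustrative choices, not from the slides.

```python
import numpy as np

def logit(p):
    """Log odds: maps p in (0, 1) to the whole real line."""
    return np.log(p / (1.0 - p))

def logistic(eta):
    """Inverse of the logit: maps the real line back into (0, 1)."""
    return np.exp(eta) / (1.0 + np.exp(eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)                          # approx [-2.197, 0.0, 2.197]
print(np.allclose(logistic(eta), p))    # True: logistic undoes logit
```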
A Linear Model of the Logit
▸ Having defined η with an unrestricted range, we can now model $\eta_n = \mathbf{x}_n\mathbf{w}$
▸ Or, equivalently,
$$P(t_n = 1 \mid \mathbf{x}_n) = \frac{\exp\{\mathbf{x}_n\mathbf{w}\}}{1 + \exp\{\mathbf{x}_n\mathbf{w}\}}$$
▸ With an independence assumption, this yields a likelihood function
$$L(\mathbf{w}) = P(\mathbf{t} \mid \mathbf{X}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n\mathbf{w}}}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)^{1 - t_n}$$
MLE for w
▸ The likelihood for w is
$$L(\mathbf{w}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n\mathbf{w}}}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)^{1 - t_n}$$
▸ The log likelihood is
$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n\mathbf{w} - \log(1 + e^{\mathbf{x}_n\mathbf{w}})\right)$$
▸ The dth coordinate of the gradient is
$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n\mathbf{w}}}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)$$
▸ Good luck solving for w analytically...
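For concreteness, here is a hedged numpy sketch of the log likelihood and its gradient as written above, assuming X is an N-by-D design matrix, t is a length-N vector of 0/1 labels, and w is a length-D weight vector (the function and variable names are mine, not from the slides).

```python
import numpy as np

def log_likelihood(w, X, t):
    """log L(w; X, t) = sum_n [ t_n (x_n w) - log(1 + exp(x_n w)) ]"""
    eta = X @ w                              # linear predictor for each case
    return np.sum(t * eta - np.log1p(np.exp(eta)))

def gradient(w, X, t):
    """d-th coordinate: sum_n [ t_n x_nd - x_nd exp(x_n w) / (1 + exp(x_n w)) ]"""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # fitted P(t_n = 1 | x_n)
    return X.T @ (t - p)                     # collects all D coordinates at once
```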
Gradient Ascent/Descent
Iterative Optimization
▸ We want to try to find a peak of the log likelihood
iteratively: make a guess, improve near the guess, rinse and repeat until you can’t improve further
▸ Many algorithms exist to do this kind of thing
▸ One good one when we have a gradient is Newton-Raphson (old, old method originally used to find roots of polynomials)
Newton-Raphson Optimization
▸ Setting: have a function f(w); want to find $\hat{w}$ s.t. $f(\hat{w}) = 0$.
▸ Algorithm:
1. Pick an initial guess, $\hat{w}^{(0)}$.
2. For n = 0, 1, ... while $|f(\hat{w}^{(n)})| > \varepsilon$:
a. Approximate f around $\hat{w}^{(n)}$ with a line, $\tilde{f}^{(n)}(w)$.
b. Find $\hat{w}^{(n+1)}$ so that $\tilde{f}^{(n)}(\hat{w}^{(n+1)}) = 0$.
▸ How to do 2a and 2b?
a. Use the tangent line: i.e.,
$$\tilde{f}^{(n)}(w) = f(\hat{w}^{(n)}) + f'(\hat{w}^{(n)})(w - \hat{w}^{(n)})$$
b. Set this to zero and solve to find $\hat{w}^{(n+1)}$:
$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f(\hat{w}^{(n)})}{f'(\hat{w}^{(n)})}$$
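A minimal sketch of this root-finding loop in Python, assuming f and its derivative are available as functions (newton_root is my name for it, not from the slides):

```python
def newton_root(f, f_prime, w0, eps=1e-10, max_iter=100):
    """Find w with f(w) = 0 by repeatedly jumping to the root of the tangent line."""
    w = w0
    for _ in range(max_iter):
        if abs(f(w)) <= eps:                 # stopping rule from step 2
            break
        w = w - f(w) / f_prime(w)            # steps 2a-2b: zero of the tangent at w
    return w

# Example: the positive root of f(w) = w^2 - 2 is sqrt(2)
print(newton_root(lambda w: w**2 - 2, lambda w: 2 * w, w0=1.0))  # ~1.414213562
```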
OK, but isn’t that just for zero-finding?
▸ Yes, but the stumbling block in our problem (maximum likelihood) was that we could not solve for w after setting the gradient to zero!
▸ When optimizing, we want to find zeroes of f′(w). So our update step is
$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$
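The same loop, applied to f′ instead of f, becomes an optimizer. A small self-contained sketch; the concave test function is my own illustrative choice:

```python
def newton_maximize(f_prime, f_double_prime, w0, eps=1e-10, max_iter=100):
    """Find a stationary point of f by running Newton-Raphson on f'."""
    w = w0
    for _ in range(max_iter):
        if abs(f_prime(w)) <= eps:
            break
        w = w - f_prime(w) / f_double_prime(w)
    return w

# Example: f(w) = log(w) - w is concave on (0, inf) with its maximum at w = 1
print(newton_maximize(lambda w: 1.0 / w - 1.0, lambda w: -1.0 / w**2, w0=0.5))  # ~1.0
```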
Why/when does this work?
Intermediate value theorem: If f ∶ [a,b] → R is continuous, u is real, and f(a) > u > f(b), then there is some c ∈ (a,b) so that f(c) = u. In particular, if f′ is continuous and changes sign on an interval, there is a stationary point inside it for Newton-Raphson to find.
But we need to find a reasonable initialization, or the algorithm could diverge. Also, it only finds a local optimum.
Multivariate functions
▸ Recall our log likelihood for logistic regression:
$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n\mathbf{w} - \log(1 + e^{\mathbf{x}_n\mathbf{w}})\right)$$
▸ The dth coordinate of the gradient is
$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n\mathbf{w}}}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)$$
▸ This is a function with a vector input.
Multivariate Derivatives
▸ The analog of the first derivative is the gradient vector,
$$\nabla f(\mathbf{w}) = \left(\frac{\partial f(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_D}\right)^T$$
▸ The analog of the second derivative is the matrix of second partial derivatives, which is called the Hessian matrix.
$$H_f(\mathbf{w}) = \begin{pmatrix} \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1^2} & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_1} & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_2} & \cdots & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D^2} \end{pmatrix}$$
Multivariate Update Equation
▸ The update equation for a function of one variable is
$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$
▸ For more than one variable, this becomes
$$\hat{\mathbf{w}}^{(n+1)} = \hat{\mathbf{w}}^{(n)} - H_f^{-1}(\hat{\mathbf{w}}^{(n)})\, \nabla f(\hat{\mathbf{w}}^{(n)})$$
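A sketch of the multivariate update in numpy, assuming grad and hess are callables returning ∇f and H_f at a point; solving the linear system H·step = ∇f avoids forming the inverse explicitly (names are my own):

```python
import numpy as np

def newton_optimize(grad, hess, w0, eps=1e-8, max_iter=100):
    """Multivariate Newton-Raphson: w <- w - H_f(w)^{-1} grad f(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) <= eps:
            break
        w = w - np.linalg.solve(hess(w), g)  # same as w - H^{-1} g, but more stable
    return w
```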
Example: MLE for Logistic Regression
▸ Recall our log likelihood for logistic regression:
$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n\mathbf{w} - \log(1 + e^{\mathbf{x}_n\mathbf{w}})\right)$$
▸ The dth coordinate of the gradient is
$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n\mathbf{w}}}{1 + e^{\mathbf{x}_n\mathbf{w}}}\right)$$
▸ The d,d′ entry in the Hessian is
$$\frac{\partial^2 \log L}{\partial w_d \partial w_{d'}} = -\sum_{n=1}^{N} x_{nd}\, x_{nd'}\, \frac{e^{\mathbf{x}_n\mathbf{w}}}{(1 + e^{\mathbf{x}_n\mathbf{w}})^2}$$
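Putting the pieces together, a hedged sketch of Newton-Raphson MLE for logistic regression: in matrix form the Hessian above is -Xᵀ diag(p(1-p)) X, since p(1-p) = e^{x_n w}/(1 + e^{x_n w})². The function name and the synthetic-data check at the end are purely illustrative, not from the slides.

```python
import numpy as np

def logistic_mle_newton(X, t, n_iter=25):
    """Newton-Raphson MLE for logistic regression (a sketch; no step-size safeguards)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))            # fitted P(t_n = 1 | x_n)
        grad = X.T @ (t - p)                          # gradient of log L
        hess = -X.T @ (X * (p * (1.0 - p))[:, None])  # Hessian of log L
        w = w - np.linalg.solve(hess, grad)           # multivariate Newton update
    return w

# Tiny synthetic check (illustrative data only): true weights are [-1, 2]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-1.0, 2.0]))))
t = (rng.random(200) < p_true).astype(float)
print(logistic_mle_newton(X, t))                      # should land near [-1, 2]
```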