
SLIDE 1

STAT 339 Approximate Inference I

15 March 2017
Colin Reimer Dawson

SLIDE 2

Outline

▸ Approximation Methods
▸ Motivating Model: Logistic Regression
▸ Newton-Raphson Method

SLIDE 3

Approximation Methods

▸ Thus far we have done a lot of calculus and probability math to find exact optima/posterior/predictive distributions for simple models.

▸ We relied heavily on some strong assumptions (e.g., i.i.d. Normal errors, conjugate priors, some parameters fixed, etc.)

▸ In general, the “nice” properties that made exact solutions possible will not be present.

▸ Hence we need to rely on approximations to our optima/distributions/etc.
SLIDE 4

Two Classes of Approximation

We can either

1. Solve for an approximate solution exactly
   ▸ Settling for local optima
   ▸ Making the “least bad” simplifying assumptions to make analytic solutions possible

2. Solve for an exact solution approximately
   ▸ Numerical/stochastic integration
   ▸ Stochastic search

SLIDE 5

Logistic Regression

▸ Consider binary classification (t ∈ {0,1}) where we want to model P(t = 1) as an explicit function of the feature vector x.

▸ Linear regression?

$$\hat{P}(t_n = 1 \mid \mathbf{x}_n) = \mathbf{x}_n \mathbf{w}$$

▸ This can work if we only care about whether $\hat{P}(t = 1) > 0.5$, but a linear model can return invalid “probabilities” (values outside [0,1]).

▸ Not great if we want to quantify uncertainty.

SLIDE 6

Modeling a Transformed Probability

▸ Idea: keep the linear dependence, but instead of modeling P(t = 1 ∣ x) directly, model a nonlinear function of P that is not bounded to [0,1].

▸ The odds:

$$\mathrm{Odds}(t = 1) := \frac{P(t = 1)}{1 - P(t = 1)} \in [0, \infty)$$

▸ The log odds, or logit:

$$\mathrm{Logit}(t = 1) := \log\left(\frac{P(t = 1)}{1 - P(t = 1)}\right) \in (-\infty, \infty)$$

▸ Nice property: equal probabilities (P(t = 1) = 1/2) correspond to Logit = 0.

SLIDE 7

Logit Transformation

[Figure: the logit function, logit(p), plotted against p ∈ (0, 1).]

SLIDE 8

Logistic Transformation

▸ $\eta = \mathrm{Logit}(p) = \log\left(\frac{p}{1 - p}\right)$

▸ Inverse is the logistic function:

$$p = \mathrm{Logistic}(\eta) = \mathrm{Logit}^{-1}(\eta) = \frac{\exp\{\eta\}}{1 + \exp\{\eta\}}$$

[Figure: the logistic function, logistic(η), plotted against η; it rises from 0 to 1.]
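A minimal numerical sketch of this pair of transformations (NumPy-based; the function names `logit` and `logistic` are ours, not from the course code):

```python
import numpy as np

def logit(p):
    """Log odds: maps p in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def logistic(eta):
    """Inverse of logit: maps any real eta back into (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)                        # approximately [-2.197, 0.0, 2.197]
print(np.allclose(logistic(eta), p))  # True: logistic undoes logit
```

(For very large positive η this form of `logistic` can overflow; the algebraically equivalent 1 / (1 + exp(−η)) is the usual numerically safer choice there.)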

SLIDE 9

A Linear Model of the Logit

▸ Having defined η with an unrestricted range, we can now model $\eta_n = \mathbf{x}_n \mathbf{w}$

▸ Or, equivalently,

$$P(t_n = 1 \mid \mathbf{x}_n) = \frac{\exp\{\mathbf{x}_n \mathbf{w}\}}{1 + \exp\{\mathbf{x}_n \mathbf{w}\}}$$

▸ With an independence assumption, this yields a likelihood function (sketched numerically below)

$$L(\mathbf{w}) = P(\mathbf{t} \mid \mathbf{X}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{1 - t_n}$$
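As a rough sketch of what this model computes (NumPy-based; `p_t1` and `likelihood` are names we are inventing here, and `X` is assumed to have one row per observation):

```python
import numpy as np

def p_t1(X, w):
    """P(t_n = 1 | x_n) = exp(x_n w) / (1 + exp(x_n w)), one value per row of X."""
    eta = X @ w                       # eta_n = x_n w
    return np.exp(eta) / (1 + np.exp(eta))

def likelihood(w, X, t):
    """Product over n of p_n^{t_n} * (1 - p_n)^{1 - t_n}."""
    p = p_t1(X, w)
    return np.prod(p**t * (1 - p)**(1 - t))
```

In practice the product underflows quickly as N grows, which is one reason the next slide switches to the log likelihood.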

SLIDE 10

MLE for w

▸ The likelihood for w is

$$L(\mathbf{w}) = \prod_{n=1}^{N} \left(\frac{e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{t_n} \left(\frac{1}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)^{1 - t_n}$$

▸ The log likelihood is

$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}})\right)$$

▸ The dth coordinate of the gradient is

$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)$$

▸ Good luck solving for w analytically... (a numerical version of the log likelihood and gradient is sketched below)
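A minimal sketch of these two quantities in code (NumPy-based; the function names are ours, and the gradient is written in the equivalent vectorized form X.T (t − p)):

```python
import numpy as np

def log_likelihood(w, X, t):
    """sum_n [ t_n (x_n w) - log(1 + exp(x_n w)) ]"""
    eta = X @ w
    return np.sum(t * eta - np.log1p(np.exp(eta)))

def gradient(w, X, t):
    """Vector whose dth entry is sum_n [ t_n x_nd - x_nd exp(x_n w) / (1 + exp(x_n w)) ]."""
    p = 1 / (1 + np.exp(-(X @ w)))    # P(t_n = 1 | x_n)
    return X.T @ (t - p)
```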

SLIDE 11

Gradient Ascent/Descent

SLIDE 12

Iterative Optimization

▸ We want to find a peak of the log likelihood iteratively: make a guess, improve near the guess, rinse and repeat until you can't improve further

▸ Many algorithms exist to do this kind of thing

▸ One good one when we have a gradient is Newton-Raphson (an old, old method originally used to find roots of polynomials)

SLIDE 13

Newton-Raphson Optimization

▸ Setting: we have a function f(w); we want to find $\hat{w}$ s.t. $f(\hat{w}) = 0$.

▸ Algorithm:

  1. Pick an initial guess $\hat{w}^{(0)}$.
  2. For n = 0, 1, ..., while $|f(\hat{w}^{(n)})| > \varepsilon$:
     a. Approximate f around $\hat{w}^{(n)}$ with a line, $\tilde{f}^{(n)}(w)$.
     b. Find $\hat{w}^{(n+1)}$ so that $\tilde{f}^{(n)}(\hat{w}^{(n+1)}) = 0$.

▸ How to do 2a and 2b?

  a. Use the tangent line; i.e.,

$$\tilde{f}^{(n)}(w) = f(\hat{w}^{(n)}) + f'(\hat{w}^{(n)})(w - \hat{w}^{(n)})$$

  b. Set this to zero and solve to find $\hat{w}^{(n+1)}$:

$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f(\hat{w}^{(n)})}{f'(\hat{w}^{(n)})}$$

(A small zero-finding sketch in code follows below.)
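A minimal sketch of the zero-finding loop above (plain Python; the function name, tolerance, and iteration cap are ours):

```python
def newton_raphson_root(f, f_prime, w0, eps=1e-10, max_iter=100):
    """Find w with f(w) = 0 by repeatedly jumping to the root of the tangent line."""
    w = w0
    for _ in range(max_iter):
        if abs(f(w)) <= eps:
            break
        w = w - f(w) / f_prime(w)   # zero of the tangent line at the current guess
    return w

# Example: the positive root of f(w) = w^2 - 2 is sqrt(2).
print(newton_raphson_root(lambda w: w**2 - 2, lambda w: 2 * w, w0=1.0))  # ~1.4142135...
```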

SLIDE 14

OK, but isn’t that just for zero-finding?

▸ Yes, but the stumbling block in our problem (maximum likelihood) was that we could set the gradient to zero but not solve the resulting equation for w!

▸ When optimizing, we want to find zeroes of f′(w). So our update step is

$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$
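The zero-finding sketch above turns into an optimizer just by handing it f′ and f″; a self-contained one-variable version (the name and the toy example are ours):

```python
def newton_optimize_1d(g_prime, g_second, w0, eps=1e-10, max_iter=100):
    """Find a stationary point of g by running Newton-Raphson on g'."""
    w = w0
    for _ in range(max_iter):
        if abs(g_prime(w)) <= eps:
            break
        w = w - g_prime(w) / g_second(w)
    return w

# Toy example: g(w) = -(w - 3)^2 has g'(w) = -2(w - 3) and g''(w) = -2; its peak is at w = 3.
print(newton_optimize_1d(lambda w: -2 * (w - 3), lambda w: -2.0, w0=0.0))  # 3.0
```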

SLIDE 15

Why/when does this work?

Intermediate value theorem: If f ∶ [a,b] → R is continuous, u is real, and f(a) > u > f(b), then there is some c ∈ (a,b) so that f(c) = u.

But we need to find a reasonable initialization, or the algorithm could diverge. Also, it only finds a local optimum.
SLIDE 16

Multivariate functions

▸ Recall our log likelihood for logistic regression:

$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}})\right)$$

▸ The dth coordinate of the gradient is

$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)$$

▸ This is a function with a vector input.

SLIDE 17

Multivariate Derivatives

▸ The analog of the first derivative is the gradient vector,

$$\nabla f(\mathbf{w}) = \left(\frac{\partial f(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_D}\right)^{T}$$

▸ The analog of the second derivative is the matrix of second partial derivatives, which is called the Hessian matrix (sketched numerically below):

$$H_f(\mathbf{w}) = \begin{pmatrix} \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1^2} & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_1 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_1} & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D \partial w_2} & \cdots & \dfrac{\partial^2 f(\mathbf{w})}{\partial w_D^2} \end{pmatrix}$$
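One concrete way to see what this matrix contains is a central finite-difference approximation (a rough NumPy sketch; `hessian_fd`, the step size, and the toy function are ours):

```python
import numpy as np

def hessian_fd(f, w, h=1e-4):
    """Approximate the D x D matrix of second partial derivatives of f at w."""
    D = len(w)
    I = np.eye(D)
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            # Central difference for the mixed partial d^2 f / (dw_i dw_j).
            H[i, j] = (f(w + h*I[i] + h*I[j]) - f(w + h*I[i] - h*I[j])
                       - f(w - h*I[i] + h*I[j]) + f(w - h*I[i] - h*I[j])) / (4 * h**2)
    return H

# Toy check: f(w) = w_1^2 + 3 w_1 w_2 has Hessian [[2, 3], [3, 0]].
f = lambda w: w[0]**2 + 3 * w[0] * w[1]
print(np.round(hessian_fd(f, np.array([1.0, 2.0])), 3))
```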

SLIDE 18

Multivariate Update Equation

▸ The update equation for a function of one variable is

$$\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$$

▸ For more than one variable, this becomes

$$\hat{\mathbf{w}}^{(n+1)} = \hat{\mathbf{w}}^{(n)} - H_f^{-1}(\hat{\mathbf{w}}^{(n)})\, \nabla f(\hat{\mathbf{w}}^{(n)})$$
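In code, one would typically not form the inverse Hessian explicitly; solving the linear system is cheaper and more numerically stable. A minimal sketch of a single multivariate step (NumPy-based; `newton_step` is our name):

```python
import numpy as np

def newton_step(w, grad, hess):
    """One multivariate Newton-Raphson update: w - H^{-1} grad, via a linear solve."""
    return w - np.linalg.solve(hess, grad)
```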

SLIDE 19

Example: MLE for Logistic Regression

▸ Recall our log likelihood for logistic regression:

$$\log L(\mathbf{w}; \mathbf{X}, \mathbf{t}) = \sum_{n=1}^{N} \left(t_n \mathbf{x}_n \mathbf{w} - \log(1 + e^{\mathbf{x}_n \mathbf{w}})\right)$$

▸ The dth coordinate of the gradient is

$$\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - \frac{x_{nd}\, e^{\mathbf{x}_n \mathbf{w}}}{1 + e^{\mathbf{x}_n \mathbf{w}}}\right)$$

▸ The (d, d′) entry of the Hessian is

$$\frac{\partial^2 \log L}{\partial w_d \partial w_{d'}} = -\sum_{n=1}^{N} \frac{x_{nd}\, x_{nd'}\, e^{\mathbf{x}_n \mathbf{w}}}{(1 + e^{\mathbf{x}_n \mathbf{w}})^2}$$
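Putting the pieces together, a rough sketch of the full Newton-Raphson loop for the logistic-regression MLE (NumPy-based; the function name, iteration count, and toy data are ours, and there are no safeguards for separated data or bad steps):

```python
import numpy as np

def logistic_mle_newton(X, t, n_iter=25):
    """Maximize the logistic-regression log likelihood with Newton-Raphson."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w)))       # P(t_n = 1 | x_n)
        grad = X.T @ (t - p)                  # gradient of log L
        H = -(X.T * (p * (1 - p))) @ X        # Hessian of log L
        w = w - np.linalg.solve(H, grad)      # multivariate Newton update
    return w

# Toy usage: intercept plus one feature, true weights roughly [-1, 2].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
t = (rng.random(500) < 1 / (1 + np.exp(-(X @ np.array([-1.0, 2.0]))))).astype(float)
print(logistic_mle_newton(X, t))  # should land in the neighborhood of [-1, 2]
```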

SLIDE 20

Solution Path

SLIDE 21

Classification Result

SLIDE 22

Nonlinear Classification Result