LOGISTIC REGRESSION, GRADIENT DESCENT, NEWTON - PowerPoint PPT Presentation


SLIDE 1

Matthieu R Bloch Thursday, January 30, 2020

LOGISTIC REGRESSION, GRADIENT DESCENT, NEWTON

SLIDE 2

LOGISTICS

TAs and office hours
- Monday: Mehrdad (TSRB 523a) - 2pm-3:15pm
- Tuesday: TJ (VL C449 Cubicle D) - 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB 423) - 12:00pm-1:15pm
- Thursday: Hossein (VL C449 Cubicle B) - 10:45am-12:00pm
- Friday: Brighton (TSRB 523a) - 12pm-1:15pm

Homework 1
- Hard deadline Friday January 31, 2020 (11:59pm EST) (Wednesday February 5, 2020 for DL)

Homework 2 posted
- Due Wednesday February 7, 2020 11:59pm EST (Wednesday February 14, 2020 for DL)

SLIDE 3

RECAP: GENERATIVE MODELS

- Quadratic Discriminant Analysis: classes distributed according to $\mathcal{N}(\mu_k, \Sigma_k)$
  - Covariance matrices are class dependent, but the decision boundary is not linear anymore
  - Generative model rarely accurate
  - Number of parameters to estimate: $K-1$ class priors, $Kd$ means, $\frac{d(d+1)}{2}$ elements per covariance matrix
  - Works well if $N \gg d$; works poorly if $N \ll d$ without other tricks (dimensionality reduction, structured covariance)
- Biggest concern: "one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling $p(x|y)$]." (Vapnik, 1998)
- Revisit the binary classifier with LDA:
  $$\eta_1(x) = \frac{\pi_1 \phi(x; \mu_1, \Sigma)}{\pi_1 \phi(x; \mu_1, \Sigma) + \pi_0 \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \exp(-(w^\top x + b))}$$
- We do not need to estimate the full joint distribution!

SLIDE 4

SLIDE 5

LOGISTIC REGRESSION

- Assume that $\eta(x)$ is of the form $\frac{1}{1+\exp(-(w^\top x + b))}$
- Estimate $\hat{w}$ and $\hat{b}$ from the data directly
- Plug in the result to obtain $\hat{\eta}(x) = \frac{1}{1+\exp(-(\hat{w}^\top x + \hat{b}))}$
- The function $x \mapsto \frac{1}{1+e^{-x}}$ is called the logistic function
- The binary logistic classifier is $h_{\mathrm{LC}}(x) = \mathbf{1}\{\hat{\eta}(x) \ge \frac{1}{2}\} = \mathbf{1}\{\hat{w}^\top x + \hat{b} \ge 0\}$ (linear)
- How do we estimate $\hat{w}$ and $\hat{b}$?
  - From the LDA analysis: $\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$ and $\hat{b} = -\frac{1}{2}\hat{\mu}_1^\top \hat{\Sigma}^{-1}\hat{\mu}_1 + \frac{1}{2}\hat{\mu}_0^\top \hat{\Sigma}^{-1}\hat{\mu}_0 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}$
  - Direct estimation of $(\hat{w}, \hat{b})$ from maximum likelihood
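As a quick illustration of the logistic function and the resulting linear classifier, here is a minimal NumPy sketch (the names `logistic` and `predict` are ours, not from the slides):

```python
import numpy as np

def logistic(z):
    """Logistic function x -> 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b):
    """Binary logistic classifier: 1{eta_hat(x) >= 1/2} = 1{w^T x + b >= 0}."""
    return (X @ w + b >= 0).astype(int)

# Thresholding eta_hat at 1/2 is the same as thresholding w^T x + b at 0.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(predict(X, w, b))  # -> [1 0]
```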

SLIDE 6

MLE FOR LOGISTIC REGRESSION

- We have a parametric density model for $p_\theta(y|x)$
- Standard trick: $\tilde{x} = [1, x^\top]^\top$ and $\theta = [b, w^\top]^\top$
- This allows us to lump in the offset and write $\eta(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})}$
- Given our dataset $\{(\tilde{x}_i, y_i)\}_{i=1}^N$ the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i | \tilde{x}_i)$
- For $K = 2$ with $\mathcal{Y} = \{0, 1\}$ we obtain
  $$L(\theta) \triangleq \prod_{i=1}^N \eta(\tilde{x}_i)^{y_i} (1 - \eta(\tilde{x}_i))^{1 - y_i}$$
  $$\ell(\theta) = \sum_{i=1}^N \left( y_i \log \eta(\tilde{x}_i) + (1 - y_i) \log(1 - \eta(\tilde{x}_i)) \right)$$
  $$\ell(\theta) = \sum_{i=1}^N \left( y_i \theta^\top \tilde{x}_i - \log(1 + e^{\theta^\top \tilde{x}_i}) \right)$$
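The two forms of the log-likelihood can be checked against each other numerically; the sketch below (variable names are ours) uses `np.logaddexp(0, z)` as a stable way to compute $\log(1 + e^z)$:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random data standing in for {(x_tilde_i, y_i)}; first column of 1s is the offset trick.
rng = np.random.default_rng(0)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.integers(0, 2, size=N)
theta = rng.normal(size=d + 1)

eta = logistic(X @ theta)
# Form 1: sum of y_i log eta_i + (1 - y_i) log(1 - eta_i)
ll1 = np.sum(y * np.log(eta) + (1 - y) * np.log1p(-eta))
# Form 2: sum of y_i theta^T x_i - log(1 + exp(theta^T x_i))
z = X @ theta
ll2 = np.sum(y * z - np.logaddexp(0.0, z))
assert np.isclose(ll1, ll2)
```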

SLIDE 7

SLIDE 8

SLIDE 9

FINDING THE MLE

- A necessary condition for optimality is $\nabla_\theta \ell(\theta) = 0$
- Here this means
  $$\sum_{i=1}^N \tilde{x}_i \left( y_i - \frac{1}{1 + \exp(-\theta^\top \tilde{x}_i)} \right) = 0$$
- System of $d + 1$ non-linear equations!
- Use a numerical algorithm to find the solution of $\mathrm{argmin}_\theta\, (-\ell(\theta))$
- Provable convergence when $-\ell$ is convex
- We will discuss two techniques: gradient descent and Newton's method
- There are many more, especially useful in high dimension
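The closed-form gradient above can be sanity-checked against finite differences of the log-likelihood; a minimal sketch (all names are ours):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(theta, X, y):
    # l(theta) = sum_i (y_i theta^T x_i - log(1 + exp(theta^T x_i)))
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))

def grad_loglik(theta, X, y):
    # Gradient: sum_i x_i (y_i - 1/(1 + exp(-theta^T x_i)))
    return X.T @ (y - logistic(X @ theta))

rng = np.random.default_rng(1)
N, d = 40, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.integers(0, 2, size=N)
theta = rng.normal(size=d + 1)

# Central finite differences along each coordinate direction
eps = 1e-6
g_fd = np.array([(loglik(theta + eps * e, X, y) - loglik(theta - eps * e, X, y)) / (2 * eps)
                 for e in np.eye(d + 1)])
assert np.allclose(grad_loglik(theta, X, y), g_fd, atol=1e-4)
```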

SLIDE 10

WRAPPING UP PLUGIN METHODS

- Naive Bayes, LDA, and logistic regression are all plugin methods that result in linear classifiers
  - Naive Bayes: plugin method based on density estimation; scales well to high dimensions and naturally handles mixtures of discrete and continuous features
  - Linear discriminant analysis: better if the Gaussianity assumptions are valid
  - Logistic regression: models only the conditional distribution $P_{y|x}$, not $P_{y,x}$; valid for a larger class of distributions; fewer parameters to estimate
- Plugin methods can be useful in practice, but ultimately they are very limited
  - There are always distributions where our assumptions are violated
  - If our assumptions are wrong, the output is totally unpredictable
  - Can be hard to verify whether our assumptions are right
  - Require solving a more difficult problem as an intermediate step

SLIDE 11

GRADIENT DESCENT

- Consider the canonical problem $\min_{x \in \mathbb{R}^d} f(x)$ with $f : \mathbb{R}^d \to \mathbb{R}$
- Find the minimum iteratively by "rolling downhill": start from a point $x^{(0)}$ and set
  $$x^{(1)} = x^{(0)} - \eta \nabla f(x)\big|_{x=x^{(0)}}, \quad x^{(2)} = x^{(1)} - \eta \nabla f(x)\big|_{x=x^{(1)}}, \quad \cdots$$
  where $\eta$ is the step size
- Choice of step size really matters: too small and convergence takes forever; too big and the iterates might never converge
- Many variants of gradient descent
  - Momentum: $v_t = \gamma v_{t-1} + \eta \nabla f(x)\big|_{x=x^{(t)}}$ and $x^{(t+1)} = x^{(t)} - v_t$
  - Accelerated: $v_t = \gamma v_{t-1} + \eta \nabla f(x)\big|_{x=x^{(t)} - \gamma v_{t-1}}$ and $x^{(t+1)} = x^{(t)} - v_t$
- In practice, the gradient has to be evaluated from data
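Basic gradient descent on the logistic-regression objective $-\ell(\theta)$ can be sketched as follows (a toy example with synthetic data and a hand-picked step size, not from the slides):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_neg_loglik(theta, X, y):
    # Gradient of -l(theta): sum_i x_i (eta(x_i) - y_i)
    return X.T @ (logistic(X @ theta) - y)

# Synthetic 1-D data with an offset column; labels are a noisy sign of the feature.
rng = np.random.default_rng(2)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])
y = (X[:, 1] + 0.3 * rng.normal(size=N) > 0).astype(float)

theta = np.zeros(2)
eta = 0.01  # step size: too small -> slow convergence, too big -> may diverge
for _ in range(2000):
    theta = theta - eta * grad_neg_loglik(theta, X, y)

acc = np.mean((logistic(X @ theta) >= 0.5) == y)
print(acc)
```

With a fixed step size this is the plain update $\theta^{(j+1)} = \theta^{(j)} - \eta \nabla(-\ell)(\theta^{(j)})$; momentum or acceleration would only change the last two lines of the loop.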

SLIDE 12

NEWTON’S METHOD

- The Newton-Raphson method uses the second derivative to automatically adapt the step size:
  $$x^{(j+1)} = x^{(j)} - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x) \Big|_{x=x^{(j)}}$$
- Hessian matrix ($d \times d$):
  $$\nabla^2 f(x) = \begin{bmatrix}
  \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
  \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\
  \vdots & \vdots & \ddots & \vdots \\
  \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2}
  \end{bmatrix}$$
- Newton’s method is much faster when the dimension is small but impractical when $d$ is large
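For the logistic-regression objective $-\ell(\theta)$, the Hessian has the closed form $X^\top \mathrm{diag}(\eta_i(1-\eta_i)) X$, so a Newton iteration can be sketched as below (our own illustration on synthetic data; the slides do not give this code):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])
y = (X[:, 1] + 0.3 * rng.normal(size=N) > 0).astype(float)

theta = np.zeros(2)
for _ in range(10):  # Newton typically converges in a handful of iterations
    eta_x = logistic(X @ theta)
    grad = X.T @ (eta_x - y)          # gradient of -l(theta)
    W = eta_x * (1.0 - eta_x)         # per-sample Hessian weights eta(1 - eta)
    H = X.T @ (W[:, None] * X)        # Hessian of -l(theta), (d+1) x (d+1)
    theta = theta - np.linalg.solve(H, grad)  # Newton step: theta - H^{-1} grad

acc = np.mean((logistic(X @ theta) >= 0.5) == y)
print(acc)
```

Solving the $(d+1) \times (d+1)$ linear system at each step is exactly what becomes impractical when $d$ is large.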

SLIDE 13

STOCHASTIC GRADIENT DESCENT

- Often have a loss function of the form $\ell(\theta) = \sum_{i=1}^N \ell_i(\theta)$ where $\ell_i(\theta) = f(x_i, y_i, \theta)$
- The gradient is $\nabla_\theta \ell(\theta) = \sum_{i=1}^N \nabla \ell_i(\theta)$ and the gradient descent update is
  $$\theta^{(j+1)} = \theta^{(j)} - \eta \sum_{i=1}^N \nabla \ell_i(\theta)$$
- Problematic if the dataset is huge or if not all the data is available
- Use an iterative technique instead: $\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_i(\theta)$
- Tons of variations of the principle: batch, minibatch, Adagrad, RMSprop, Adam, etc.

SLIDE 14
