

SLIDE 1

Machine Learning (CSE 446): Probabilistic View of Logistic Regression and Linear Regression

Sham M Kakade

© 2018 University of Washington   cse446-staff@cs.washington.edu

SLIDE 2

Announcements

◮ Midterm: Weds, Feb 7th. Policies:

  ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
  ◮ You must turn in your sheet of notes, with your name on it, at the conclusion of the exam, even if you never looked at it.
  ◮ You may not use electronic devices of any sort.

◮ Today:

  ◮ Review: Regularization and Optimization
  ◮ New: (wrap up GD) + probabilistic modeling!

SLIDE 3

Review


SLIDE 4

Regularization / Ridge Regression

◮ Regularize the optimization problem:

  \min_{w}\; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + \lambda \|w\|^2 \;=\; \min_{w}\; \frac{1}{N} \|Y - Xw\|^2 + \lambda \|w\|^2

◮ This particular case: "Ridge" Regression, Tikhonov regularization
◮ The solution is the regularized least squares (ridge) estimator:

  w_{\text{ridge}} = \left( \frac{1}{N} X^\top X + \lambda I \right)^{-1} \frac{1}{N} X^\top Y

◮ Regularization is often necessary for the "exact" solution method (regardless of whether d is bigger or smaller than N).
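For concreteness, here is a minimal numpy sketch of the closed-form ridge solution above; it assumes X stacks the feature vectors as rows (shape N × d) and Y is the length-N label vector, and the function name ridge_estimator is chosen only for illustration.

    import numpy as np

    def ridge_estimator(X, Y, lam):
        """Closed-form ridge solution: ((1/N) X^T X + lam I)^{-1} (1/N) X^T Y."""
        N, d = X.shape
        A = (X.T @ X) / N + lam * np.eye(d)   # (1/N) X^T X + lambda I
        b = (X.T @ Y) / N                     # (1/N) X^T Y
        return np.linalg.solve(A, b)          # solve A w = b rather than forming the inverse

Solving the linear system directly is the usual numerically stable choice; with lam > 0 the matrix A is invertible whether d is bigger or smaller than N.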

SLIDE 5

Gradient Descent

◮ Want to solve:

  \min_{z}\; F(z)

◮ How should we update z?

SLIDE 6

Gradient Descent

Data: function F : R^d → R, number of iterations K, step sizes η^(1), ..., η^(K)
Result: z ∈ R^d
initialize: z^(0) = 0;
for k ∈ {1, ..., K} do
    z^(k) = z^(k−1) − η^(k) · ∇_z F(z^(k−1));
end
return z^(K);
Algorithm 1: GradientDescent
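A minimal Python sketch of Algorithm 1, assuming the gradient of F is supplied as a callable grad_F (a name used here only for illustration):

    import numpy as np

    def gradient_descent(grad_F, d, K, step_sizes):
        """Algorithm 1 (GradientDescent): start at z = 0 and take K gradient steps."""
        z = np.zeros(d)                           # z^(0) = 0
        for k in range(K):
            z = z - step_sizes[k] * grad_F(z)     # z^(k) = z^(k-1) - eta^(k) * grad F(z^(k-1))
        return z                                  # z^(K)

    # Example: minimize F(z) = ||z - 1||^2, whose gradient is 2(z - 1).
    z_hat = gradient_descent(lambda z: 2.0 * (z - np.ones(3)), d=3, K=100, step_sizes=[0.1] * 100)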

SLIDE 7

Today


SLIDE 8

Gradient Descent: Convergence

◮ Denote:

  ◮ z^* = argmin_z F(z): the global minimum
  ◮ z^(k): our parameter after k updates.

◮ Thm: Suppose F is convex and "L-smooth". Using a fixed step size η ≤ 1/L, we have:

  F(z^{(k)}) - F(z^*) \;\leq\; \frac{\|z^{(0)} - z^*\|^2}{\eta \cdot k}

  That is, the convergence rate is O(1/k).

◮ This Thm applies to both the square loss and the logistic loss!
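As an illustrative check of the theorem (not part of the slides), the sketch below runs gradient descent on a small synthetic least-squares objective, which is convex and L-smooth, and verifies that the stated bound holds with the fixed step size η = 1/L; the data and sizes here are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    Y = rng.normal(size=50)
    N = X.shape[0]

    F = lambda z: np.mean((Y - X @ z) ** 2)              # convex, L-smooth objective
    grad_F = lambda z: (2.0 / N) * (X.T @ (X @ z - Y))   # its gradient
    L = (2.0 / N) * np.linalg.eigvalsh(X.T @ X).max()    # smoothness constant for this F
    eta = 1.0 / L

    z_star = np.linalg.lstsq(X, Y, rcond=None)[0]        # global minimizer of F
    z = np.zeros(5)                                      # z^(0) = 0
    for k in range(1, 201):
        z = z - eta * grad_F(z)
        bound = np.sum(z_star ** 2) / (eta * k)          # ||z^(0) - z*||^2 / (eta * k)
        assert F(z) - F(z_star) <= bound + 1e-12         # the theorem's guarantee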

SLIDE 9

Proof intuition: smoothness and GD Convergence

◮ L-smooth functions: "The gradients don't change quickly." Precisely, for all z, z′:

  \|\nabla F(z) - \nabla F(z')\| \;\leq\; L \|z - z'\|

◮ Proof idea:

  1. If our gradient is large, we will make good progress decreasing our function value.
  2. If our gradient is small, we must have value near the optimal value.

SLIDE 10

A better idea?

◮ Remember the Bayes optimal classifier. D(x, y) is the true probability of (x, y).

  f^{(BO)}(x) = \arg\max_{y}\, D(x, y) = \arg\max_{y}\, D(y \mid x)

◮ Of course, we don’t have D(y | x).

Probabilistic machine learning: define a probabilistic model relating random variables x to y and estimate its parameters.


SLIDE 11

A Probabilistic Model for Binary Classification: Logistic Regression

◮ For Y ∈ {−1, 1} define pw,b(Y | X) as:

  • 1. Transform feature vector x via the “activation” function:

a = w · x + b

  • 2. Transform a into a binomial probability by passing it through the logistic function:

  p_{w,b}(Y = +1 \mid x) = \frac{1}{1 + \exp(-a)} = \frac{1}{1 + \exp(-(w \cdot x + b))}

[Figure: plot of the logistic function 1/(1 + exp(−a)); the horizontal axis runs roughly from −10 to 10 and the vertical axis from 0 to 1.]

◮ If we learn p_{w,b}(Y | x), we can (almost) do whatever we like!
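A minimal Python sketch of this model (the function names are illustrative, not from the slides):

    import numpy as np

    def p_plus_one(x, w, b):
        """p_{w,b}(Y = +1 | x) = 1 / (1 + exp(-(w . x + b)))."""
        a = np.dot(w, x) + b                      # step 1: activation
        return 1.0 / (1.0 + np.exp(-a))           # step 2: logistic function

    def p_label(y, x, w, b):
        """p_{w,b}(Y = y | x) for y in {-1, +1}; note p(-1 | x) = 1 - p(+1 | x)."""
        return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))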

SLIDE 12

Maximum Likelihood Estimation

The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).

◮ Mathematically: find the \hat{w} that maximizes the probability of the labels y_1, ..., y_N given the inputs x_1, ..., x_N.

◮ Note, by the i.i.d. assumption:

  D(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{n=1}^{N} D(y_n \mid x_n)

◮ The Maximum Likelihood Estimator (the 'MLE') is:

  \hat{w} = \arg\max_{w} \prod_{n=1}^{N} p_w(y_n \mid x_n)
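An illustrative sketch (not from the slides) of the quantity the MLE maximizes, for the logistic model with the bias folded into w for brevity:

    import numpy as np

    def likelihood(w, X, y):
        """prod_{n=1}^{N} p_w(y_n | x_n) for the logistic model, with labels y_n in {-1, +1}."""
        per_example = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # p_w(y_n | x_n) for each n
        return np.prod(per_example)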

SLIDE 13

Maximum Likelihood Estimation and the Log loss

◮ The ’MLE’ is:

  \hat{w} = \arg\max_{w} \prod_{n=1}^{N} p_w(y_n \mid x_n)
          = \arg\max_{w} \log \prod_{n=1}^{N} p_w(y_n \mid x_n)
          = \arg\max_{w} \sum_{n=1}^{N} \log p_w(y_n \mid x_n)
          = \arg\min_{w} \sum_{n=1}^{N} -\log p_w(y_n \mid x_n)

◮ This is referred to as the log loss.
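Since log is monotone, maximizing the product of probabilities is the same as minimizing this sum of negative log-probabilities. A small sketch of that sum for the logistic model (again with the bias folded into w):

    import numpy as np

    def log_loss(w, X, y):
        """sum_n -log p_w(y_n | x_n); its minimizer over w is exactly the MLE."""
        per_example = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # p_w(y_n | x_n)
        return -np.sum(np.log(per_example))

Working with the sum of logs also avoids the numerical underflow that the raw product suffers when N is large.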

SLIDE 14

The MLE for Logistic Regression

◮ The MLE for the logistic regression model:

  \arg\min_{w} \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) = \arg\min_{w} \sum_{n=1}^{N} \log\left(1 + \exp(-y_n\, w \cdot x_n)\right)

◮ This is the logistic loss function that we saw earlier.
◮ How do we find the MLE?
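One standard answer, sketched here as an illustration (the step size is chosen arbitrarily rather than derived from the smoothness constant): run gradient descent from Algorithm 1 on the logistic loss.

    import numpy as np

    def logistic_loss_grad(w, X, y):
        """Gradient of sum_n log(1 + exp(-y_n w . x_n)) with respect to w."""
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = 1 - p_w(y_n | x_n)
        return -X.T @ (y * s)

    def fit_logistic_mle(X, y, eta=0.01, K=1000):
        """Approximate the MLE by gradient descent on the logistic loss."""
        w = np.zeros(X.shape[1])
        for _ in range(K):
            w = w - eta * logistic_loss_grad(w, X, y)
        return w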

SLIDE 15

Derivation for Log loss for Logistic Regression: scratch space


SLIDE 16

Linear Regression as a Probabilistic Model

Linear regression defines pw(Y | X) as follows:

  • 1. Observe the feature vector x; transform it via the activation function:

µ = w · x

  • 2. Let µ be the mean of a normal distribution and define the density:

  p_w(Y \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(Y - \mu)^2}{2\sigma^2} \right)

  • 3. Sample Y from pw(Y | x).
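A minimal Python sketch of this generative model (names are illustrative; sigma is treated as a fixed, known parameter here):

    import numpy as np

    def p_w_density(y, x, w, sigma=1.0):
        """Density of Y given x: normal with mean mu = w . x and variance sigma^2."""
        mu = np.dot(w, x)                                               # step 1: activation
        return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    def sample_y(x, w, sigma=1.0, rng=None):
        """Step 3: sample Y from p_w(Y | x)."""
        rng = np.random.default_rng() if rng is None else rng
        return rng.normal(np.dot(w, x), sigma)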


SLIDE 17

Linear Regression-MLE is (Unregularized) Squared Loss Minimization!

  \arg\min_{w} \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) \;\equiv\; \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \underbrace{(y_n - w \cdot x_n)^2}_{\text{SquaredLoss}_n(w, b)}

Where did the variance go?
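As a worked step (not on the original slide), expanding the negative log-density from the previous slide shows where it goes:

  -\log p_w(y_n \mid x_n) = \frac{(y_n - w \cdot x_n)^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)

The variance only rescales the sum by the positive constant 1/(2σ²) and shifts it by N·log(σ√(2π)); neither changes the argmin over w, so for any fixed σ the MLE coincides with (unregularized) squared loss minimization.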
