Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent - PowerPoint PPT Presentation



SLIDE 1

Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

1 / 12

SLIDE 2

Announcements

◮ Midterm: Weds, Feb 7th. Policies:

  ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
  ◮ You must turn your sheet of notes in, with your name on it, at the conclusion of the exam, even if you never looked at it.
  ◮ You may not use electronic devices of any sort.

◮ A few comments on the course difficulty
◮ Today:

  New: GD and SGD


SLIDE 3

Course difficulty

Why is it difficult/what should we learn?

◮ homeworks
◮ exams
◮ grading


SLIDE 4

Review


SLIDE 5

Gradient Descent: Convergence

◮ Denote:

  z∗ = argmin_z F(z): the global minimum
  z(k): our parameter after k updates.

◮ Thm: Suppose F is convex and “L-smooth” (e.g. this works for the square loss and the logistic loss). Using a fixed step size η ≤ 1/L, we have:

  F(z(k)) − F(z∗) ≤ ‖z(0) − z∗‖² / (η · k)

  That is, the convergence rate is O(1/k).

◮ A constant learning rate means no parameter tuning!
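The O(1/k) guarantee can be checked numerically. The sketch below (not from the slides) runs fixed-step gradient descent on an illustrative convex, L-smooth function F(z) = z² (so L = 2 and z∗ = 0) and records both the gap F(z(k)) − F(z∗) and the theorem's bound at each step; the starting point and step size are arbitrary choices.

```python
def gd_gaps(z0, eta, steps):
    """Run fixed-step GD on F(z) = z^2 and return (gap, bound) pairs,
    where gap = F(z(k)) - F(z*) and bound = ||z(0) - z*||^2 / (eta * k)."""
    z = z0
    out = []
    for k in range(1, steps + 1):
        z = z - eta * (2.0 * z)              # gradient of z^2 is 2z
        out.append((z * z, z0 * z0 / (eta * k)))
    return out
```

Any fixed η ≤ 1/L = 0.5 works here; the gap should sit below the bound at every iteration.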


SLIDE 6

Probabilistic machine learning:

◮ define a probabilistic model relating random variables x to y
◮ estimate its parameters.


SLIDE 7

A Probabilistic Model for Binary Classification: Logistic Regression

◮ For Y ∈ {−1, 1} define pw,b(Y | X) as:

  1. Transform the feature vector x via the “activation” function:

     a = w · x + b

  2. Transform a into a binomial probability by passing it through the logistic function:

     pw,b(Y = +1 | x) = 1 / (1 + exp(−a)) = 1 / (1 + exp(−(w · x + b)))

  [Figure: plot of the logistic function 1/(1 + exp(−a)) for a ∈ [−10, 10].]
◮ If we learn pw,b(Y | x), we can (almost) do whatever we like!
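The two-step recipe above can be sketched directly; `logistic_prob` is an illustrative helper name, and the weights below are arbitrary. Writing the logistic function with −y·a covers both labels at once, since pw,b(+1 | x) + pw,b(−1 | x) = 1.

```python
import math

def logistic_prob(w, b, x, y):
    """p_{w,b}(Y = y | x) for y in {-1, +1}: compute the activation
    a = w . x + b, then pass it through the logistic function."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-y * a))
```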


SLIDE 8

Maximum Likelihood Estimation and the Log loss

The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).

◮ Mathematically: find ŵ that maximizes the probability of the labels y1, . . . , yn given the inputs x1, . . . , xn.
◮ The Maximum Likelihood Estimator (the ’MLE’) is:

  ŵ = argmax_w ∏_{n=1}^{N} pw(yn | xn) = argmin_w ∑_{n=1}^{N} − log pw(yn | xn)


SLIDE 9

The MLE for Logistic Regression

◮ the MLE for the logistic regression model:

  argmin_w ∑_{n=1}^{N} − log pw(yn | xn) = argmin_w ∑_{n=1}^{N} log(1 + exp(−yn w · xn))

◮ This is the logistic loss function that we saw earlier. ◮ How do we find the MLE?
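The equality on this slide is easy to verify term by term: for each example, −log pw(yn | xn) is exactly log(1 + exp(−yn w · xn)). A minimal sketch (helper names are illustrative, bias term omitted as on the slide):

```python
import math

def neg_log_lik(w, x, y):
    """-log p_w(y | x) under the logistic model (no bias term)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(1.0 / (1.0 + math.exp(-y * a)))

def logistic_loss(w, x, y):
    """The logistic loss log(1 + exp(-y * (w . x)))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * a))
```

The two functions agree up to floating-point rounding on any example, which is the per-term content of the identity above.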


SLIDE 10

Derivation for Log loss for Logistic Regression: scratch space


SLIDE 11

Today


SLIDE 12

Linear Regression as a Probabilistic Model

Linear regression defines pw(Y | X) as follows:

  1. Observe the feature vector x; transform it via the activation function:

     µ = w · x

  2. Let µ be the mean of a normal distribution and define the density:

     pw(Y | x) = (1 / (σ√(2π))) exp(−(Y − µ)² / (2σ²))

  3. Sample Y from pw(Y | x).
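The three steps above are a generative story, so they can be sketched as a sampler; `sample_y` is an illustrative name and the weights and σ below are arbitrary.

```python
import random

def sample_y(w, x, sigma, rng):
    """Generate Y from the linear-regression model: compute the
    activation mu = w . x, then draw Y ~ Normal(mu, sigma^2)."""
    mu = sum(wi * xi for wi, xi in zip(w, x))   # step 1: activation
    return rng.gauss(mu, sigma)                  # steps 2-3: sample
```

Averaging many draws recovers µ = w · x, which is what makes µ the model's prediction.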


SLIDE 13

Linear Regression-MLE is (Unregularized) Squared Loss Minimization!

  argmin_w ∑_{n=1}^{N} − log pw(yn | xn) ≡ argmin_w (1/N) ∑_{n=1}^{N} (yn − w · xn)²

  where each summand (yn − w · xn)² is SquaredLossn(w, b).

Where did the variance go? What is GD here?
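The variance question can be answered numerically: the average Gaussian negative log-likelihood differs from the scaled average squared loss only by a constant that does not depend on w, so both objectives have the same minimizer. A sketch with illustrative toy data:

```python
import math

def avg_gaussian_nll(w, xs, ys, sigma):
    """(1/N) sum_n -log p_w(y_n | x_n) for the Gaussian linear model."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * math.log(2.0 * math.pi * sigma ** 2) \
                 + (y - mu) ** 2 / (2.0 * sigma ** 2)
    return total / len(xs)

def avg_squared_loss(w, xs, ys):
    """(1/N) sum_n (y_n - w . x_n)^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sum(wi * xi for wi, xi in zip(w, x))
        total += (y - mu) ** 2
    return total / len(xs)
```

For every w, avg_gaussian_nll(w) = (1/2)log(2πσ²) + avg_squared_loss(w)/(2σ²): σ only shifts and rescales the objective, which is where the variance "went".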


SLIDE 14

Loss Minimization & Gradient Descent

  w∗ = argmin_w (1/N) ∑_{n=1}^{N} ℓ(xn, yn, w) + R(w)

  where ℓn(w) denotes ℓ(xn, yn, w).

What is GD here? What do we do if N is large?
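GD on this objective averages the per-example gradients over all N points at every step, which is exactly what becomes expensive when N is large. A minimal sketch of one full-batch step, assuming the unregularized squared loss (R(w) = 0); `gd_step` is an illustrative name.

```python
def gd_step(w, xs, ys, eta):
    """One full-batch GD step on (1/N) sum_n (y_n - w . x_n)^2.
    Note the O(N * d) cost: every example contributes to the gradient."""
    n, d = len(xs), len(w)
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = y - sum(wi * xi for wi, xi in zip(w, x))
        for j in range(d):
            grad[j] -= 2.0 * residual * x[j] / n   # gradient of the average loss
    return [wj - eta * gj for wj, gj in zip(w, grad)]
```

The inner loop over all of `xs` is the pain point that motivates the stochastic variant on the next slides.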


SLIDE 15

Stochastic Gradient Descent (SGD): by example

  argmin_w (1/N) ∑_{n=1}^{N} (yn − w · xn)²

◮ Gradient descent:
◮ Note we are computing an average. What is a crude way to estimate an average?
◮ Stochastic gradient descent:

Will it converge?


SLIDE 16

Stochastic Gradient Descent (SGD): by example


Will it converge? If the step size in SGD is a constant, we will not converge: each stochastic gradient is noisy, so the iterates keep bouncing around the minimizer rather than settling on it.


SLIDE 17

Stochastic Gradient Descent (SGD) (without regularization)

Data: loss functions ℓ(·), training data, number of iterations K, step sizes η(1), . . . , η(K)
Result: parameters w ∈ Rd

initialize: w(0) = 0;
for k ∈ {1, . . . , K} do
    i ∼ Uniform({1, . . . , N});
    w(k) = w(k−1) − η(k) · ∇w ℓi(w(k−1));
end
return w(K);

Algorithm 1: SGD
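Algorithm 1 translates almost line for line into code. The sketch below uses illustrative names (`sgd`, `grad_loss`, `squared_loss_grad`); `grad_loss` plays the role of ∇w ℓi.

```python
import random

def sgd(grad_loss, data, K, etas, d, seed=0):
    """Algorithm 1 (SGD): at each step draw one example uniformly at
    random and take a gradient step on that example's loss alone."""
    rng = random.Random(seed)
    w = [0.0] * d                                 # w^(0) = 0
    for k in range(1, K + 1):
        x, y = data[rng.randrange(len(data))]     # i ~ Uniform({1,...,N})
        g = grad_loss(w, x, y)                    # grad_w l_i(w^(k-1))
        w = [wj - etas[k - 1] * gj for wj, gj in zip(w, g)]
    return w                                      # w^(K)

def squared_loss_grad(w, x, y):
    """Gradient of (y - w . x)^2 with respect to w."""
    residual = y - sum(wi * xi for wi, xi in zip(w, x))
    return [-2.0 * residual * xi for xi in x]
```

Each step costs O(d) rather than O(N · d). In the toy run below the data is noiseless (every example agrees on the same w), so even a constant step size happens to converge; with noisy data the constant-step caveat from the previous slide applies.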


SLIDE 18

Stochastic Gradient Descent: Convergence

  w∗ = argmin_w (1/N) ∑_{n=1}^{N} ℓn(w)

◮ w(k): our parameter after k updates.
◮ Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There exists a way to decrease our step sizes η(k) over time so that our function value F(w(k)) will converge to the minimal function value F(w∗).
◮ This Thm is different from GD in that we need to turn down our step sizes over time!
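One common decreasing schedule consistent with the theorem is η(k) = η0/√k (this particular choice is an illustration, not taken from the slide). The sketch below runs SGD with that schedule on a noisy 1-D squared-loss problem whose examples are an equal mix of y = −1 and y = +1, so w∗ = 0; all names and constants are illustrative.

```python
import random

def sgd_decaying(K, eta0=0.5, seed=0):
    """SGD on (1/N) sum_n (w - y_n)^2 with steps eta(k) = eta0 / sqrt(k),
    where the y_n are an equal mix of -1 and +1 (so w* = 0)."""
    rng = random.Random(seed)
    ys = [-1.0, 1.0]
    w = 0.9                                   # a deliberately bad start
    for k in range(1, K + 1):
        y = ys[rng.randrange(2)]              # sample one example
        eta = eta0 / k ** 0.5                 # decreasing step size
        w = w - eta * 2.0 * (w - y)           # gradient of (w - y)^2 is 2(w - y)
    return w
```

Because each update is a convex combination of w and the sampled label (for η ≤ 0.5), the iterate stays in [−1, 1] while the shrinking steps damp the noise over time.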
