SLIDE 1

Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

SLIDE 2

Announcements

◮ Assignment 2 due tomorrow.
◮ Midterm: Wednesday, Feb 7th.
◮ Quiz section: review.
◮ Today: Regularization and Optimization!

SLIDE 3

Review

SLIDE 4

Relax!

◮ The mis-classification optimization problem:

min_w (1/N) Σ_{n=1}^N 1[ y_n (w · x_n) ≤ 0 ]

◮ Instead, use a loss function ℓ(y_n, w · x_n) and solve a relaxation:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n)

SLIDE 5

Relax!

◮ The mis-classification optimization problem:

min_w (1/N) Σ_{n=1}^N 1[ y_n (w · x_n) ≤ 0 ]

◮ Instead, use a loss function ℓ(y_n, w · x_n) and solve a relaxation:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n)

◮ What do we want? Speed? Accuracy?
◮ How do we get it?

SLIDE 6

Some loss functions:

◮ The square loss:

ℓ(y, w · x) = (y − w · x)²

◮ The logistic loss:

ℓ_logistic(y, w · x) = log(1 + exp(−y w · x)).

◮ They both “upper bound” the mistake rate (a quick numeric comparison follows below).
◮ Instead, let's care about “regression,” where y is real-valued.
◮ What if we have multiple classes (not just binary classification)?
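
As a small illustration of these relaxations, here is a minimal NumPy sketch (my own, not course code; all names are made up) comparing the 0-1 mistake indicator with the square and logistic losses on one example:

```python
import numpy as np

def zero_one_loss(y, score):
    """Mistake indicator: 1 if the prediction's sign disagrees with the label y in {-1, +1}."""
    return float(y * score <= 0)

def square_loss(y, score):
    """Square loss (y - w . x)^2."""
    return (y - score) ** 2

def logistic_loss(y, score):
    """Logistic loss log(1 + exp(-y (w . x)))."""
    return np.log1p(np.exp(-y * score))

w = np.array([0.5, -1.0])
x = np.array([2.0, 2.0])
y = 1.0
score = w @ x                       # w . x = -1.0
print(zero_one_loss(y, score),      # 1.0: the prediction's sign disagrees with y
      square_loss(y, score),        # 4.0
      logistic_loss(y, score))      # log(1 + e) ~ 1.313
```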

SLIDE 7

Least squares: let’s minimize it!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y
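
A minimal NumPy sketch of computing this estimator (my own illustration with synthetic data, not course code); solving the normal equations X⊤X w = X⊤Y avoids forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))                    # N x d data matrix
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy targets

# Least squares estimator: solve (X^T X) w = X^T Y.
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, and more robust numerically: a dedicated least-squares solver.
w_alt, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_ls, w_alt))                # True
```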

SLIDE 8

Matrix calculus proof: scratch space
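
The slide leaves this as scratch space; a standard derivation (my reconstruction, not the lecture's handwritten work) sets the gradient of the squared error to zero:

∇_w ‖Y − Xw‖² = −2 X⊤(Y − Xw) = 0
⟹ X⊤X w = X⊤Y
⟹ w_least squares = (X⊤X)⁻¹ X⊤Y   (when X⊤X is invertible)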

SLIDE 9

Matrix calculus proof: scratch space

SLIDE 10

Let’s remember our linear system solving!

SLIDE 11

Today

SLIDE 12

Least squares: What could go wrong?!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y

SLIDE 13

Least squares: What could go wrong?!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y

What if d is bigger than n? Even if not?

SLIDE 14

What could go wrong?

Suppose d > n: What about n > d?

SLIDE 15

What could go wrong?

Suppose d > n: What about n > d?

◮ What happens if features are very correlated?

(e.g., rows/columns in our matrix are co-linear; see the sketch below.)
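
A small NumPy sketch (my own, not from the lecture) of both failure modes: when d > n, or when columns are co-linear, X⊤X is singular or badly conditioned, so the plain inverse in the estimator above cannot be trusted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more features than examples (d > n), so X^T X cannot have full rank d.
n, d = 5, 10
X_wide = rng.normal(size=(n, d))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))    # at most n = 5, not d = 10

# Case 2: co-linear columns make X^T X singular (huge condition number).
x1 = rng.normal(size=100)
X_colinear = np.column_stack([x1, 2 * x1, rng.normal(size=100)])
print(np.linalg.cond(X_colinear.T @ X_colinear))   # astronomically large
```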

SLIDE 16

linear system solving: scratch space

SLIDE 17

A fix: Regularization

◮ Regularize the optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² + λ‖w‖² = min_w (1/N) ‖Y − Xw‖² + λ‖w‖²

◮ This particular case: “Ridge” regression, also known as Tikhonov regularization.
◮ The solution is the ridge regression estimator:

w_ridge = ((1/N) X⊤X + λI)⁻¹ (1/N) X⊤Y
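
A minimal NumPy sketch of the ridge estimator above (my own illustration with made-up data). For any λ > 0 the matrix (1/N) X⊤X + λI is invertible, so the solution exists even in the d > n and co-linear cases from the previous slides:

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator: ((1/N) X^T X + lam * I)^{-1} (1/N) X^T Y."""
    N, d = X.shape
    A = X.T @ X / N + lam * np.eye(d)
    b = X.T @ Y / N
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
N, d = 5, 10                        # d > N: plain least squares would fail here
X = rng.normal(size=(N, d))
Y = rng.normal(size=N)
print(ridge(X, Y, lam=0.1))         # still a well-defined d-dimensional solution
```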

SLIDE 18

The “general” approach

◮ The regularized optimization problem:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n) + R(w)

◮ Penalize some w more than others.

Example: R(w) = ‖w‖².

How do we find a solution quickly?

SLIDE 19

Remember: convexity

SLIDE 20

Gradient Descent

◮ Want to solve:

min_z F(z)

◮ How should we update z?

SLIDE 21

Gradient Descent

Data: function F : R^d → R, number of iterations K, step sizes η(1), . . . , η(K)
Result: z ∈ R^d
initialize: z(0) = 0;
for k ∈ {1, . . . , K} do
    z(k) = z(k−1) − η(k) · ∇_z F(z(k−1));
end
return z(K);
Algorithm 1: GradientDescent
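
A direct Python transcription of Algorithm 1 (a sketch; the function and argument names are mine, and the gradient ∇F is passed in explicitly rather than derived from F):

```python
import numpy as np

def gradient_descent(grad_F, d, K, step_sizes):
    """Algorithm 1: z(k) = z(k-1) - eta(k) * grad F(z(k-1)), starting from z(0) = 0."""
    z = np.zeros(d)                               # initialize: z(0) = 0
    for k in range(K):
        z = z - step_sizes[k] * grad_F(z)         # one gradient step
    return z

# Example: minimize F(z) = ||z - c||^2, whose gradient is 2 (z - c).
c = np.array([3.0, -1.0])
z_hat = gradient_descent(lambda z: 2 * (z - c), d=2, K=100, step_sizes=[0.1] * 100)
print(z_hat)                                      # close to c = [3, -1]
```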

SLIDE 22

Gradient Descent: Convergence

◮ Let z∗ = argmin_z F(z) denote the global minimum.
◮ Let z(k) be our parameter after k updates.
◮ Thm: Suppose F is convex and “L-smooth”. Using a fixed step size η ≤ 1/L, we have:

F(z(k)) − F(z∗) ≤ ‖z(0) − z∗‖² / (η · k)

That is, the convergence rate is O(1/k).
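
A quick numerical check of this guarantee on the least squares objective (my own sketch, not course code; for F(z) = (1/N)‖Y − Xz‖² a valid smoothness constant is L = 2 λ_max(X⊤X)/N, and the fixed step size is taken as η = 1/L):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

F = lambda z: np.mean((Y - X @ z) ** 2)            # F(z) = (1/N) ||Y - Xz||^2
grad_F = lambda z: -2 * X.T @ (Y - X @ z) / N
L = 2 * np.linalg.eigvalsh(X.T @ X).max() / N      # smoothness constant of F
eta = 1 / L

z_star = np.linalg.solve(X.T @ X, X.T @ Y)         # exact minimizer
for K in (10, 100, 1000):
    z = np.zeros(d)
    for _ in range(K):
        z = z - eta * grad_F(z)
    print(K, F(z) - F(z_star))                     # suboptimality shrinks at least as fast as 1/K
```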

SLIDE 23

Smoothness and Gradient Descent Convergence

◮ Smooth functions: for all z, z′

‖∇F(z) − ∇F(z′)‖ ≤ L ‖z − z′‖

◮ Proof idea (the two steps are sketched below):

  1. If our gradient is large, we will make good progress decreasing our function value.
  2. If our gradient is small, our value must already be near the optimal value.
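
Filling in those two steps with the standard inequalities (my reconstruction; the slide left space for handwritten work):

1. From L-smoothness, a gradient step with η ≤ 1/L decreases the objective:
   F(z − η ∇F(z)) ≤ F(z) − (η/2) ‖∇F(z)‖²

2. From convexity and Cauchy–Schwarz:
   F(z) − F(z∗) ≤ ∇F(z) · (z − z∗) ≤ ‖∇F(z)‖ ‖z − z∗‖

So a large gradient forces progress, a small gradient forces near-optimality, and combining the two gives the O(1/k) rate.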
