SLIDE 1

Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

SLIDE 2

Review

SLIDE 3

Alternate View of PCA: Minimizing Reconstruction Error

Assume that the data are centered. Find a line which minimizes the squared reconstruction error.

SLIDE 5

Alternate View: Minimizing Reconstruction Error with K-dim subspace.

Equivalent (“dual”) formulation of PCA: find an “orthonormal basis” u1, u2, . . . , uK which minimizes the total reconstruction error on the data:

\[
\operatorname*{argmin}_{\text{orthonormal basis } u_1, \dots, u_K} \; \frac{1}{N} \sum_{i} \big\| x_i - \mathrm{Proj}_{u_1, \dots, u_K}(x_i) \big\|^2
\]

Recall that the projection of x onto an orthonormal basis u1, . . . , uK is:

\[
\mathrm{Proj}_{u_1, \dots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x) \, u_j
\]

The SVD “simultaneously” finds all u1, u2, . . . , uK.
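A quick sanity check of this formulation: the following NumPy sketch (my own, assuming centered data stored as the rows of X) computes the K-dimensional reconstruction error using the top-K right singular vectors from the SVD.

```python
import numpy as np

def pca_reconstruction_error(X, K):
    """Average squared reconstruction error of projecting the rows of X
    (assumed centered, shape N x d) onto the top-K right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U_K = Vt[:K]                      # rows are the orthonormal basis u_1, ..., u_K
    X_hat = X @ U_K.T @ U_K           # reconstructions Proj_{u_1,...,u_K}(x_i)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# The error shrinks as K grows and hits 0 at K = d:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                   # center the data
print([round(pca_reconstruction_error(X, K), 3) for K in range(6)])
```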

SLIDE 6

Projection and Reconstruction: the one dimensional case

◮ Take out the mean µ.
◮ Find the “top” eigenvector u of the covariance matrix.
◮ What are your projections?
◮ What are your reconstructions x̂i, where X = [x1 | x2 | · · · | xN]⊤?
◮ What is your reconstruction error of doing nothing (K = 0) versus using K = 1?

\[
\frac{1}{N} \sum_{i} \| x_i - \mu \|^2 = \qquad\qquad \frac{1}{N} \sum_{i} \| x_i - \hat{x}_i \|^2 =
\]

◮ Reduction in error by using a k-dim PCA projection:
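The following sketch (names and data are mine, purely for illustration) answers these questions numerically in the one-dimensional case:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data

mu = X.mean(axis=0)                     # take out the mean
Xc = X - mu
cov = Xc.T @ Xc / len(X)                # covariance matrix
_, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
u = eigvecs[:, -1]                      # "top" eigenvector

z = Xc @ u                              # projections (one scalar per point)
X_hat = mu + np.outer(z, u)             # reconstructions
err_k0 = np.mean(np.sum((X - mu) ** 2, axis=1))     # K = 0: just predict µ
err_k1 = np.mean(np.sum((X - X_hat) ** 2, axis=1))  # K = 1
# The reduction err_k0 - err_k1 equals the top eigenvalue of the covariance.
print(err_k0, err_k1, err_k0 - err_k1)
```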

SLIDE 7

PCA vs. Clustering

Summarize your data with fewer points or fewer dimensions?

SLIDE 8

Loss functions

SLIDE 9

Today

SLIDE 10

Perceptron

Perceptron Algorithm: A model and an algorithm, rolled into one. Isn’t there a more principled methodology to derive algorithms?

SLIDE 11

What we (“naively”) want:

“Minimize training-set error rate”:

\[
\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\big[ y_n (w \cdot x_n + b) \le 0 \big]
\]

where the summand is the zero-one loss on a point n. This problem is NP-hard, even for a (multiplicative) approximation.

[Plot: the zero-one loss as a function of the margin y · (w · x + b).]

Why is this loss function so unwieldy?
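One answer: the objective is trivial to evaluate but miserable to optimize, since it is piecewise constant in (w, b) and its gradient is zero almost everywhere. A minimal sketch (my own names) of evaluating it:

```python
import numpy as np

def zero_one_risk(w, b, X, y):
    """Training-set error rate: the fraction of points whose margin
    y_n * (w . x_n + b) is non-positive. X is (N, d); y is in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.mean(margins <= 0.0)
```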

SLIDE 12

Relax!

◮ The mis-classification optimization problem:

\[
\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\big[ y_n (w \cdot x_n) \le 0 \big]
\]

◮ Instead, let’s try to choose a “reasonable” loss function ℓ(yn, w · xn) and then try to solve the relaxation:

\[
\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)
\]
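To make the relaxation concrete, here is a minimal sketch of the relaxed objective with a pluggable loss ℓ (the helper names are mine; the two losses shown anticipate the slides that follow):

```python
import numpy as np

def surrogate_risk(w, X, y, loss):
    """Empirical surrogate risk: (1/N) * sum_n loss(y_n, w . x_n)."""
    return np.mean(loss(y, X @ w))

# Two candidate relaxations discussed below:
square_loss = lambda y, s: (y - s) ** 2
logistic_loss = lambda y, s: np.log1p(np.exp(-y * s))
```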

SLIDE 14

What is a good “relaxation”?

◮ We want minimizing our surrogate loss to help with minimizing the mis-classification loss.
◮ Idea: try to use a (sharp) upper bound of the zero-one loss by ℓ:

\[
\mathbf{1}\big[ y(w \cdot x) \le 0 \big] \le \ell(y, w \cdot x)
\]

◮ We want our relaxed optimization problem to be easy to solve.

What properties might we want for ℓ(·)?

◮ differentiable? sensitive to changes in w?
◮ convex?
SLIDE 15

The square loss! (and linear regression)

◮ The square loss: ℓ(y, w · x) = (y − w · x)².
◮ The relaxed optimization problem:

\[
\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2
\]

◮ Nice properties:
  ◮ For binary classification, it is an upper bound on the zero-one loss.
  ◮ It makes sense more generally, e.g. if we want to predict a real-valued y.
  ◮ We have a convex optimization problem.

◮ For classification, what is your decision rule using a w?

SLIDE 16

The square loss as an upper bound

◮ We have:

\[
\mathbf{1}\big[ y(w \cdot x) \le 0 \big] \le (y - w \cdot x)^2
\]

◮ Easy to see, by plotting:
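A throwaway matplotlib sketch of that plot (my reconstruction; with y ∈ {−1, +1}, the square loss (y − w · x)² equals (1 − m)² as a function of the margin m = y(w · x)):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 3, 400)
plt.plot(m, (m <= 0).astype(float), label="zero-one loss")
plt.plot(m, (1 - m) ** 2, label="square loss $(1 - m)^2$")
plt.xlabel("margin $y (w \\cdot x)$")
plt.ylabel("loss")
plt.legend()
plt.show()
```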

SLIDE 18

Remember this problem?

Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG

mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin

Input: a row in this table. Goal: predict whether mpg is < 23 (“bad” = 0) or above (“good” = 1) given the input row. Predicting a real y (often) makes more sense.
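A hedged setup sketch (the file name, column layout, and '?' missing-value convention are my assumptions about the UCI download, not something the slides specify):

```python
import pandas as pd

# Assumed layout of the UCI auto-mpg.data file: whitespace-separated,
# '?' for missing values, a trailing car-name string column.
cols = ["mpg", "cylinders", "displacement", "horsepower", "weight",
        "acceleration", "year", "origin", "name"]
df = pd.read_csv("auto-mpg.data", sep=r"\s+", names=cols,
                 na_values="?").dropna()

X = df[cols[1:8]].to_numpy(dtype=float)  # the input row, as numeric features
y_class = (df["mpg"] >= 23).astype(int)  # "good" = 1, "bad" = 0
y_real = df["mpg"].to_numpy()            # or keep the real-valued target
```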

SLIDE 19

A better (convex) upper bound

◮ The logistic loss:

\[
\ell_{\text{logistic}}(y, w \cdot x) = \log\big( 1 + \exp(-y \, (w \cdot x)) \big).
\]

◮ We have:

\[
\mathbf{1}\big[ y(w \cdot x) \le 0 \big] \le \text{constant} \cdot \ell_{\text{logistic}}(y, w \cdot x)
\]

◮ Again, easy to see, by plotting:
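Again, a quick plotting sketch; the choice of constant 1/log 2 below is mine (it makes the bound tight at margin 0; the slide leaves the constant unspecified):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-4, 4, 400)
plt.plot(m, (m <= 0).astype(float), label="zero-one loss")
plt.plot(m, np.log1p(np.exp(-m)) / np.log(2), label="logistic loss / log 2")
plt.xlabel("margin $y (w \\cdot x)$")
plt.ylabel("loss")
plt.legend()
plt.show()
```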

SLIDE 21

Least squares: let’s minimize it!

◮ The optimization problem:

\[
\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_{w} \; \frac{1}{N} \big\| Y - Xw \big\|^2
\]

where Y is an N-vector and X is our N × d data matrix.

◮ How do we interpret Xw?

The solution is the least squares estimator:

\[
w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y
\]
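A minimal NumPy sketch of the estimator (using lstsq rather than forming the inverse explicitly, which is the numerically safer route):

```python
import numpy as np

def least_squares(X, Y):
    """Minimize ||Y - Xw||^2. Equivalent to (X^T X)^{-1} X^T Y when
    X^T X is invertible, but computed without an explicit inverse."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w

# For classification with labels in {-1, +1}, predict with sign(w . x).
```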

SLIDE 22

Matrix calculus proof: scratch space
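One standard way to fill in the scratch space (my sketch of the usual steps, not the lecture's own): expand the squared norm and differentiate with respect to w,

\[
\nabla_w \, \| Y - Xw \|^2 = \nabla_w \big( Y^\top Y - 2\, w^\top X^\top Y + w^\top X^\top X w \big) = -2\, X^\top Y + 2\, X^\top X w .
\]

Setting the gradient to zero yields the normal equations X⊤Xw = X⊤Y, and hence w = (X⊤X)⁻¹X⊤Y whenever X⊤X is invertible.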

SLIDE 24

Remember your linear system solving!
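In practice this means treating the normal equations X⊤Xw = X⊤Y as a linear system rather than inverting X⊤X. A minimal sketch:

```python
import numpy as np

def least_squares_solve(X, Y):
    """Solve the normal equations (X^T X) w = X^T Y as a linear system;
    cheaper and more stable than computing the matrix inverse."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```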

SLIDE 25

Lots of questions:

◮ What could go wrong with least squares?
  ◮ Suppose we are in “high dimensions”: more dimensions than data points.
  ◮ Inductive bias: we need a way to control the complexity of the model.
◮ How do we minimize the (summed) logistic loss?
◮ Optimization: how do we do this all quickly?
