Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss



SLIDE 1

Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

SLIDE 2

PCA: continuing on...

SLIDE 3

Dimension of Greatest Variance

Assume that the data are centered, i.e., that mean({x_n}_{n=1}^N) = 0.


SLIDE 5

Projection into One Dimension

Let u be the dimension of greatest variance, where ‖u‖₂ = 1. Then p_n = x_n · u is the projection of the nth example onto u. Since the mean of the data is 0, the mean of p_1, . . . , p_N is also 0. This implies that the variance of p_1, . . . , p_N is (1/N) ∑_{n=1}^N p_n².

The u that gives the greatest variance, then, is:

argmax_u ∑_{n=1}^N (x_n · u)²

SLIDE 6

Finding the Maximum-Variance Direction

argmax_u ∑_{n=1}^N (x_n · u)²,  s.t. ‖u‖₂ = 1

(Why do we constrain u to have length 1?)

If we let X be the matrix whose rows are x_1⊤, x_2⊤, . . . , x_N⊤, then we want:

argmax_u ‖Xu‖₂²,  s.t. ‖u‖₂ = 1.

This is PCA in one dimension!
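This objective is easy to check numerically. Below is a minimal NumPy sketch (not from the slides; the data are synthetic) verifying that ∑_n (x_n · u)², ‖Xu‖₂², and u⊤(X⊤X)u are the same number for any unit vector u:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # rows are the examples x_n (N = 100, d = 3)
X = X - X.mean(axis=0)                 # center the data, as the slides assume
u = rng.normal(size=3)
u = u / np.linalg.norm(u)              # a candidate unit-length direction

sum_of_squares = np.sum((X @ u) ** 2)        # sum_n (x_n . u)^2
norm_form = np.linalg.norm(X @ u) ** 2       # ||Xu||_2^2
quadratic_form = u @ (X.T @ X) @ u           # u^T (X^T X) u

print(sum_of_squares, norm_form, quadratic_form)   # all three agree up to roundoff
```

(The same identity answers the review questions on the next slide: Xu is an N-vector whose nth component is x_n · u.)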

SLIDE 7

Linear algebra review: things to understand

◮ ‖x‖₂ is the Euclidean norm.
◮ What is the dimension of Xu?
◮ What is the i-th component of Xu?
◮ Also, note: ‖u‖₂² = u⊤u.
◮ So what is ‖Xu‖₂²?

SLIDE 8

Constrained Optimization

The blue lines represent contours: all points on a blue line have the same objective function value.

SLIDE 10

Deriving the Solution

Don’t panic.

argmax_u ‖Xu‖₂²,  s.t. ‖u‖₂ = 1

◮ The Lagrangian encoding of the problem moves the constraint into the objective:

max_u min_λ ‖Xu‖₂² − λ(‖u‖₂² − 1)  ⇒  min_λ max_u ‖Xu‖₂² − λ(‖u‖₂² − 1)

◮ Gradient (first derivatives with respect to u): 2X⊤Xu − 2λu
◮ Setting this equal to 0 leads to: λu = X⊤Xu
◮ You may recognize this as the definition of an eigenvector (u) and eigenvalue (λ) for the matrix X⊤X.
◮ We take the first (largest) eigenvalue.
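A small sketch of this conclusion (assuming NumPy and synthetic, centered data; not part of the original slides): the stationary points are eigenvectors of X⊤X, and the eigenvector with the largest eigenvalue gives the largest value of ‖Xu‖₂².

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                      # centered data

# Eigendecomposition of X^T X (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u = eigvecs[:, -1]                          # eigenvector for the largest eigenvalue

# The stationarity condition lambda * u = X^T X u holds:
print(np.allclose(eigvals[-1] * u, X.T @ X @ u))    # True

# And u beats randomly sampled unit vectors on the objective ||Xu||_2^2:
best = np.linalg.norm(X @ u) ** 2
samples = []
for _ in range(1000):
    v = rng.normal(size=4)
    v = v / np.linalg.norm(v)
    samples.append(np.linalg.norm(X @ v) ** 2)
print(best >= max(samples))                 # True
```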

SLIDE 11

Deriving the Solution: Scratch space

SLIDE 14

Variance in Multiple Dimensions

So far, we’ve projected each x_n into one dimension. To get a second direction v, we solve the same problem again, but this time with another constraint:

argmax_v ‖Xv‖₂²,  s.t. ‖v‖₂ = 1 and u · v = 0

(That is, we want a dimension that’s orthogonal to the u that we found earlier.) Following the same steps we had for u, the solution will be the second eigenvector of X⊤X.
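As a quick sanity check of this claim (a NumPy sketch on synthetic data; not from the slides): `np.linalg.eigh` returns orthonormal eigenvectors, so the second eigenvector automatically satisfies the constraint u · v = 0 while capturing the next-most variance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
u = eigvecs[:, -1]                           # direction of greatest variance
v = eigvecs[:, -2]                           # second direction

print(np.isclose(u @ v, 0.0))                                     # True: u . v = 0
print(np.linalg.norm(X @ u) ** 2 >= np.linalg.norm(X @ v) ** 2)   # True
```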

SLIDE 15

“Eigenfaces”

Fig. from https://github.com/AlexOuyang/RealTimeFaceRecognition

SLIDE 16

Principal Components Analysis

◮ Input: unlabeled data X = [x_1 | x_2 | · · · | x_N]⊤; dimensionality K < d
◮ Output: K-dimensional “subspace”
◮ Algorithm:
  1. Compute the mean µ.
  2. Compute the covariance matrix: Σ = (1/N) ∑_i (x_i − µ)(x_i − µ)⊤
  3. Let λ_1, . . . , λ_K be the top K eigenvalues of Σ and u_1, . . . , u_K be the corresponding eigenvectors.
◮ Let U = [u_1 | u_2 | · · · | u_K]. Return U.

You can read about many algorithms for finding eigendecompositions of a matrix.
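Below is a minimal NumPy implementation of the algorithm above (a sketch, not the course's reference code; the data matrix and choice of K are made up for illustration):

```python
import numpy as np

def pca(X, K):
    """PCA as on the slide: mean, covariance, then the top-K eigenvectors.

    X: (N, d) array whose rows are the examples x_i; K: target dimension, K < d.
    Returns (mu, U), where U is d x K with the top-K eigenvectors as its columns.
    """
    N, d = X.shape
    mu = X.mean(axis=0)                         # step 1: compute the mean
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N                     # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:K]         # step 3: top-K eigenvalues ...
    U = eigvecs[:, idx]                         # ... and their eigenvectors
    return mu, U

# Example usage on synthetic data:
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
mu, U = pca(X, K=2)
Z = (X - mu) @ U                                # K-dimensional representation
print(U.shape, Z.shape)                         # (10, 2) (500, 2)
```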

SLIDE 17

Alternate View of PCA: Minimizing Reconstruction Error

Assume that the data are centered. Find a line which minimizes the squared reconstruction error.

SLIDE 19

Projection and Reconstruction: the one-dimensional case

◮ Take out the mean µ.
◮ Find the “top” eigenvector u of the covariance matrix.
◮ What are your projections?
◮ What are your reconstructions, X̂ = [x̂_1 | x̂_2 | · · · | x̂_N]⊤?
◮ What is your reconstruction error?

(1/N) ∑_i ‖x_i − x̂_i‖² = ??
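One way to answer these questions concretely (a NumPy sketch under one common convention, where the reconstruction adds the mean back in; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
mu = X.mean(axis=0)
Xc = X - mu                                         # take out the mean

Sigma = (Xc.T @ Xc) / len(X)
eigvals, eigvecs = np.linalg.eigh(Sigma)
u = eigvecs[:, -1]                                  # "top" eigenvector of the covariance

p = Xc @ u                                          # projections p_n = (x_n - mu) . u
X_hat = mu + np.outer(p, u)                         # reconstructions x_hat_n = mu + p_n * u

error = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # (1/N) sum_i ||x_i - x_hat_i||^2
print(error)
```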

SLIDE 20

Alternate View: Minimizing Reconstruction Error with K-dim subspace.

Equivalent (“dual”) formulation of PCA: find an “orthonormal basis” u_1, u_2, . . . , u_K which minimizes the total reconstruction error on the data:

argmin_{orthonormal basis u_1, u_2, . . . , u_K} (1/N) ∑_i ‖x_i − Proj_{u_1,...,u_K}(x_i)‖²

Recall that the projection of x onto a K-orthonormal basis is:

Proj_{u_1,...,u_K}(x) = ∑_{j=1}^K (u_j · x) u_j

The SVD “simultaneously” finds all of u_1, u_2, . . . , u_K.
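A sketch of this dual view in code (assumptions: NumPy, centered synthetic data; `np.linalg.svd` returns the right singular vectors as the rows of `Vt`, and its first K rows play the role of u_1, . . . , u_K):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 8))
X = X - X.mean(axis=0)                              # centered data
K = 3

# The SVD "simultaneously" finds all the directions:
_, _, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:K]                                      # rows are u_1, ..., u_K (orthonormal)

def project(x, basis):
    # Proj_{u_1..u_K}(x) = sum_j (u_j . x) u_j
    return sum((u @ x) * u for u in basis)

recon_error = np.mean([np.sum((x - project(x, basis)) ** 2) for x in X])
print(recon_error)                                  # average squared reconstruction error
```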

SLIDE 21

Choosing K (Hyperparameter Tuning)

How do you select K for PCA? Read CIML (similar methods for K-means).

SLIDE 22

PCA and Clustering

There’s a unified view of both PCA and clustering.

◮ K-Means chooses cluster means so that the squared distances to the data are small.
◮ PCA chooses a basis so that the reconstruction error of the data is small.

Both attempt to find a “simple” way to summarize the data: fewer points or fewer dimensions. Both could be used to create new features for supervised learning.

SLIDE 23

Loss functions

SLIDE 24

Perceptron

A model and an algorithm, rolled into one. Model: f(x) = sign(w · x + b), a linear classifier, visualized by a (hopefully) separating hyperplane in feature space. Algorithm: PerceptronTrain, an error-driven, iterative updating algorithm.
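A short sketch of the error-driven loop (this follows the standard PerceptronTrain update as in CIML; the toy data and the number of passes are made up):

```python
import numpy as np

def perceptron_train(X, y, max_iter=10):
    """Error-driven perceptron: update w and b only on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n + b) <= 0:   # mistake (or exactly on the boundary)
                w = w + y_n * x_n
                b = b + y_n
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)              # f(x) = sign(w . x + b)

# Toy, linearly separable data:
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(predict(X, w, b))                    # matches y on this toy set
```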

SLIDE 26

A Different View of PerceptronTrain: Optimization

“Minimize training-set error rate”:

min_{w,b} (1/N) ∑_{n=1}^N 1[y_n · (w · x_n + b) ≤ 0]   (ǫ_train ≡ zero-one loss)

This problem is NP-hard; even trying to get a (multiplicative) approximation is NP-hard.

[Plot: zero-one loss as a function of the margin y · (w · x + b)]

SLIDE 28

A Different View of PerceptronTrain: Optimization

“Minimize training-set error rate”:

min_{w,b} (1/N) ∑_{n=1}^N 1[y_n · (w · x_n + b) ≤ 0]   (ǫ_train ≡ zero-one loss)

What the perceptron does:

min_{w,b} (1/N) ∑_{n=1}^N max(−y_n · (w · x_n + b), 0)   (the perceptron loss)

[Plot: zero-one loss and perceptron loss as functions of the margin y · (w · x + b)]
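A small sketch comparing the two objectives on the same toy data (NumPy; the data and the particular w, b are made up for illustration):

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)                   # fraction of training mistakes

def perceptron_loss(w, b, X, y):
    margins = y * (X @ w + b)
    return np.mean(np.maximum(-margins, 0.0))      # zero unless the margin is violated

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, -0.5]), 0.0

print(zero_one_loss(w, b, X, y))     # 0.5: two of the four points are misclassified
print(perceptron_loss(w, b, X, y))   # 0.375: average size of the margin violations
```

The perceptron loss is continuous in (w, b), unlike the zero-one loss.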

SLIDE 30

Smooth out the Loss?
