slide-1
SLIDE 1

Linear Models

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis

http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html Carlos Fernandez-Granda

slide-2
SLIDE 2

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-3
SLIDE 3

Regression

The aim is to learn a function h that relates

◮ a response or dependent variable y
◮ to several observed variables x1, x2, . . . , xp, known as covariates, features or independent variables

The response is assumed to be of the form y = h(x) + z, where x ∈ R^p contains the features and z is noise

slide-4
SLIDE 4

Linear regression

The regression function h is assumed to be linear:

    y(i) = x(i)^T β∗ + z(i),    1 ≤ i ≤ n

Our aim is to estimate β∗ ∈ R^p from the data

slide-5
SLIDE 5

Linear regression

In matrix form

    [ y(1) ]   [ x(1)_1  x(1)_2  · · ·  x(1)_p ] [ β∗_1 ]   [ z(1) ]
    [ y(2) ] = [ x(2)_1  x(2)_2  · · ·  x(2)_p ] [ β∗_2 ] + [ z(2) ]
    [ · · · ]   [  · · ·    · · ·    · · ·    · · ·  ] [ · · · ]   [ · · · ]
    [ y(n) ]   [ x(n)_1  x(n)_2  · · ·  x(n)_p ] [ β∗_p ]   [ z(n) ]

Equivalently,

    y = X β∗ + z

slide-6
SLIDE 6

Linear model for GDP

State            GDP (millions)    Population    Unemployment Rate
North Dakota         52 089           757 952          2.4
Alabama             204 861         4 863 300          3.8
Mississippi         107 680         2 988 726          5.2
Arkansas            120 689         2 988 248          3.5
Kansas              153 258         2 907 289          3.8
Georgia             525 360        10 310 371          4.5
Iowa                178 766         3 134 693          3.2
West Virginia        73 374         1 831 102          5.1
Kentucky            197 043         4 436 974          5.2
Tennessee              ???          6 651 194          3.0

slide-7
SLIDE 7

Centering

y_cent = ( −127 147, 25 625, −71 556, −58 547, −25 978, 346 124, −470, −105 862, 17 807 )

X_cent =
    [ −3 044 121   −1.7  ]
    [  1 061 227   −0.28 ]
    [   −813 346    1.1  ]
    [   −813 825   −0.58 ]
    [   −894 784   −0.28 ]
    [  6 508 298    0.42 ]
    [   −667 379   −0.88 ]
    [ −1 970 971    1.0  ]
    [    634 901    1.1  ]

av(y) = 179 236        av(X) = ( 3 802 073, 4.1 )

slide-8
SLIDE 8

Normalizing

y_norm = ( −0.321, 0.065, −0.180, −0.148, −0.065, 0.872, −0.001, −0.267, 0.045 )

X_norm =
    [ −0.394  −0.600 ]
    [  0.137  −0.099 ]
    [ −0.105   0.401 ]
    [ −0.105  −0.207 ]
    [ −0.116  −0.099 ]
    [  0.843   0.151 ]
    [ −0.086  −0.314 ]
    [ −0.255   0.366 ]
    [  0.082   0.401 ]

std(y) = 396 701        std(X) = ( 7 720 656, 2.80 )

slide-9
SLIDE 9

Linear model for GDP

Aim: find β ∈ R^2 such that y_norm ≈ X_norm β

The estimate for the GDP of Tennessee will be

    y_Ten = av(y) + std(y) ⟨ x_Ten_norm, β ⟩

where x_Ten_norm is centered using av(X) and normalized using std(X)
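
To make this recipe concrete, here is a small numpy sketch (my own variable names; the population and unemployment figures are the ones from the earlier table, and "std" is used as on the slides, i.e. the norm of each centered column) that centers and normalizes the training data, fits β by least squares, and forms the Tennessee estimate as described above.

    import numpy as np

    # Training data from the GDP table: population, unemployment rate, GDP (millions)
    X = np.array([[757952, 2.4], [4863300, 3.8], [2988726, 5.2], [2988248, 3.5],
                  [2907289, 3.8], [10310371, 4.5], [3134693, 3.2], [1831102, 5.1],
                  [4436974, 5.2]], dtype=float)
    y = np.array([52089, 204861, 107680, 120689, 153258, 525360, 178766, 73374, 197043], dtype=float)

    # Center, then scale by the norm of each centered column (the "std" on the slides)
    av_X, av_y = X.mean(axis=0), y.mean()
    std_X = np.linalg.norm(X - av_X, axis=0)
    std_y = np.linalg.norm(y - av_y)
    X_norm = (X - av_X) / std_X
    y_norm = (y - av_y) / std_y

    # Least-squares fit and prediction for Tennessee
    beta, *_ = np.linalg.lstsq(X_norm, y_norm, rcond=None)
    x_ten = (np.array([6651194, 3.0]) - av_X) / std_X
    y_ten = av_y + std_y * (x_ten @ beta)
    print(beta, y_ten)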

slide-10
SLIDE 10

Temperature predictor

A friend tells you: I found a cool way to predict the average daily temperature in New York: It’s just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it’s perfect!

slide-11
SLIDE 11

System of equations

A is n × p and full rank, and we consider the system A b = c in the unknown b

◮ If n < p the system is underdetermined: infinitely many solutions for any c (overfitting)
◮ If n = p the system is determined: a unique solution for any c (overfitting)
◮ If n > p the system is overdetermined: a solution exists only if c ∈ col(A) (if there is noise, no solution)
slide-12
SLIDE 12

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-13
SLIDE 13

Least squares

For fixed β we can evaluate the error using

    Σ_{i=1}^{n} ( y(i) − x(i)^T β )² = || y − X β ||²_2

The least-squares estimate β_LS minimizes this cost function:

    β_LS := arg min_β || y − X β ||_2 = ( X^T X )^{−1} X^T y    if X is full rank and n ≥ p
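
A quick numerical check of this closed form (a sketch with made-up data; np.linalg.lstsq is the numerically preferred route, the normal-equations formula is shown only to mirror the slide):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.standard_normal((n, p))
    beta_true = rng.standard_normal(p)
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # Closed form from the slide (valid when X has full column rank and n >= p)
    beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

    # Numerically preferred equivalent
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(beta_normal_eq, beta_lstsq))  # True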

slide-14
SLIDE 14

Least-squares fit

[Figure: scatter plot of the data in the (x, y) plane together with the least-squares fit]

slide-15
SLIDES 15-20

Least-squares solution

Let X = U S V^T and decompose

    y = U U^T y + ( I − U U^T ) y

By the Pythagorean theorem

    || y − X β ||²_2 = || ( I − U U^T ) y ||²_2 + || U U^T y − X β ||²_2

so that

    arg min_β || y − X β ||²_2 = arg min_β || U U^T y − X β ||²_2
                               = arg min_β || U U^T y − U S V^T β ||²_2
                               = arg min_β || U^T y − S V^T β ||²_2
                               = V S^{−1} U^T y
                               = ( X^T X )^{−1} X^T y
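
A small numpy check of the identity derived above, V S^{−1} U^T y = (X^T X)^{−1} X^T y (illustrative data; the reduced SVD is obtained with full_matrices=False):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 60, 4
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
    beta_svd = Vt.T @ ((U.T @ y) / s)                   # V S^{-1} U^T y
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)         # (X^T X)^{-1} X^T y
    print(np.allclose(beta_svd, beta_ne))               # True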

slide-21
SLIDE 21

Linear model for GDP

The least-squares estimate is

    β_LS = ( 1.019, −0.111 )

GDP is roughly proportional to the population; unemployment has a negative (linear) effect

slide-22
SLIDE 22

Linear model for GDP

State               GDP       Estimate
North Dakota       52 089      46 241
Alabama           204 861     239 165
Mississippi       107 680     119 005
Arkansas          120 689     145 712
Kansas            153 258     136 756
Georgia           525 360     513 343
Iowa              178 766     158 097
West Virginia      73 374      59 969
Kentucky          197 043     194 829
Tennessee         328 770     345 352

slide-23
SLIDE 23

Maximum temperatures in Oxford, UK

[Figure: monthly maximum temperatures (Celsius) in Oxford, UK, 1860-2000]

slide-24
SLIDE 24

Maximum temperatures in Oxford, UK

[Figure: monthly maximum temperatures (Celsius) in Oxford, UK, 1900-1905]

slide-25
SLIDE 25

Linear model

    y_t ≈ β_0 + β_1 cos( 2πt / 12 ) + β_2 sin( 2πt / 12 ) + β_3 t

where 1 ≤ t ≤ n is the time in months (n = 12 · 150)
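
A sketch of how such a model can be fit by least squares (synthetic temperatures stand in for the Oxford series here; with the real data one would replace y accordingly):

    import numpy as np

    n = 12 * 150                     # monthly data over 150 years
    t = np.arange(1, n + 1, dtype=float)

    # Synthetic stand-in for the temperature series: seasonal cycle + slow trend + noise
    y = 14 + 8 * np.cos(2 * np.pi * t / 12) + 0.0006 * t \
        + np.random.default_rng(2).standard_normal(n)

    # Design matrix: intercept, yearly cosine/sine, linear trend
    X = np.column_stack([np.ones(n), np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    trend_per_100_years = beta[3] * 12 * 100   # slope is per month
    print(beta, trend_per_100_years)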

slide-26
SLIDE 26

Model fitted by least squares

[Figure: data and fitted model, maximum temperatures, 1860-2000]

slide-27
SLIDE 27

Model fitted by least squares

[Figure: data and fitted model, maximum temperatures, 1900-1905]

slide-28
SLIDE 28

Model fitted by least squares

[Figure: data and fitted model, maximum temperatures, 1960-1965]

slide-29
SLIDE 29

Trend: Increase of 0.75 ◦C / 100 years (1.35 ◦F)

[Figure: data and fitted trend, maximum temperatures, 1860-2000]

slide-30
SLIDE 30

Model for minimum temperatures

[Figure: data and fitted model, minimum temperatures, 1860-2000]

slide-31
SLIDE 31

Model for minimum temperatures

[Figure: data and fitted model, minimum temperatures, 1900-1905]

slide-32
SLIDE 32

Model for minimum temperatures

[Figure: data and fitted model, minimum temperatures, 1960-1965]

slide-33
SLIDE 33

Trend: Increase of 0.88 ◦C / 100 years (1.58 ◦F)

[Figure: data and fitted trend, minimum temperatures, 1860-2000]

slide-34
SLIDE 34

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-35
SLIDE 35

Geometric interpretation

◮ Any vector X β is in the span of the columns of X
◮ The least-squares estimate is the closest vector to y that can be represented in this way
◮ This is the projection of y onto the column space of X:

    X β_LS = U S V^T V S^{−1} U^T y = U U^T y

slide-36
SLIDE 36

Geometric interpretation

slide-37
SLIDE 37

Face denoising

We denoise by projecting onto:

◮ S1: the span of the 9 images from the same subject
◮ S2: the span of the 360 images in the training set

Test error:

    || x − P_S1 y ||_2 / || x ||_2 = 0.114        || x − P_S2 y ||_2 / || x ||_2 = 0.078

slide-38
SLIDE 38

S1

S1 := span of the 9 images from the same subject [figure: the 9 face images]

slide-39
SLIDE 39

Denoising via projection onto S1

Decomposition into projections onto S1 and onto S1⊥:

    Signal x   =  (projection onto S1: 0.993)  +  (projection onto S1⊥: 0.114)
    Noise  z   =  (projection onto S1: 0.007)  +  (projection onto S1⊥: 0.150)
    Data   y   =  x + z
    Estimate   =  projection of y onto S1

[Figure: face images of each component]

slide-40
SLIDE 40

S2

S2 := span of the 360 images in the training set [figure: a sample of the training images]
slide-41
SLIDE 41

Denoising via projection onto S2

Decomposition into projections onto S2 and onto S2⊥:

    Signal x   =  (projection onto S2: 0.998)  +  (projection onto S2⊥: 0.063)
    Noise  z   =  (projection onto S2: 0.043)  +  (projection onto S2⊥: 0.144)
    Data   y   =  x + z
    Estimate   =  projection of y onto S2

[Figure: face images of each component]

slide-42
SLIDE 42

PS1 y and PS2 y

[Figure: the original image x, the estimate P_S1 y, and the estimate P_S2 y]

slide-43
SLIDE 43

Lessons of Face Denoising

What does the intuition we gained from face denoising tell us about linear regression?

slide-44
SLIDE 44

Lessons of Face Denoising

What does the intuition we gained from face denoising tell us about linear regression?

◮ More features = larger column space
◮ Larger column space = captures more of the true image
◮ Larger column space = captures more of the noise
◮ Balance between underfitting and overfitting

slide-45
SLIDE 45

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-46
SLIDE 46

Motivation

Model the data y1, . . . , yn as realizations of a set of random variables y1, . . . , yn

The joint pdf depends on a vector of parameters β:

    f_β ( y1, . . . , yn ) := f_{y1,...,yn} ( y1, . . . , yn )

is the probability density of y1, . . . , yn evaluated at the observed data

Idea: choose β such that this density is as high as possible

slide-47
SLIDE 47

Likelihood

The likelihood is the joint pdf

    L_{y1,...,yn} ( β ) := f_β ( y1, . . . , yn )

interpreted as a function of the parameters

The log-likelihood function is the logarithm of the likelihood, log L_{y1,...,yn} ( β )
slide-48
SLIDE 48

Maximum-likelihood estimator

The likelihood quantifies how likely the data are according to the model

Maximum-likelihood (ML) estimator:

    β_ML ( y1, . . . , yn ) := arg max_β L_{y1,...,yn} ( β )
                             = arg max_β log L_{y1,...,yn} ( β )

Maximizing the log-likelihood is equivalent, and often more convenient
slide-49
SLIDE 49

Probabilistic interpretation

We model the noise as an iid Gaussian random vector z whose entries have zero mean and variance σ²

The data are a realization of the random vector

    y := X β + z

y is Gaussian with mean X β and covariance matrix σ² I

slide-50
SLIDE 50

Likelihood

The joint pdf of y is f

y (

a) :=

n

  • i=1

1 √ 2πσ exp

  • − 1

2σ2

  • a[i] −
  • X

β

  • [i]

2 = 1

  • (2π)nσn exp
  • − 1

2σ2

  • a − X

β

  • 2

2

  • The likelihood is

L

y

  • β
  • =

1

  • (2π)n exp
  • −1

2

  • y − X

β

  • 2

2

slide-51
SLIDE 51

Maximum-likelihood estimate

The maximum-likelihood estimate is

    β_ML = arg max_β L_y ( β )
         = arg max_β log L_y ( β )
         = arg min_β || y − X β ||²_2
         = β_LS

slide-52
SLIDE 52

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-53
SLIDES 53-55

Estimation error

If the data are generated according to the linear model

    y := X β∗ + z

then

    β_LS − β∗ = ( X^T X )^{−1} X^T ( X β∗ + z ) − β∗
              = ( X^T X )^{−1} X^T z

as long as X is full rank

slide-56
SLIDES 56-58

LS estimator is unbiased

Assume the noise z is random and has zero mean; then

    E( β_LS − β∗ ) = ( X^T X )^{−1} X^T E( z ) = 0

The estimate is unbiased: its mean equals β∗

slide-59
SLIDE 59

Least-squares error

If the data are generated according to the linear model y := X β∗ + z, then

    || z ||_2 / σ_1 ≤ || β_LS − β∗ ||_2 ≤ || z ||_2 / σ_p

where σ_1 and σ_p are the largest and smallest singular values of X

slide-60
SLIDE 60

Least-squares error: Proof

The error is given by

    β_LS − β∗ = ( X^T X )^{−1} X^T z

How can we bound || ( X^T X )^{−1} X^T z ||_2 ?

slide-61
SLIDE 61

Singular values

The singular values of a matrix A ∈ R^{n×p} of rank p satisfy

    σ_1 = max_{ ||x||_2 = 1, x ∈ R^p } || A x ||_2
    σ_p = min_{ ||x||_2 = 1, x ∈ R^p } || A x ||_2

slide-62
SLIDE 62

Least-squares error

    β_LS − β∗ = V S^{−1} U^T z

The smallest and largest singular values of V S^{−1} U^T are 1/σ_1 and 1/σ_p, so

    || z ||_2 / σ_1 ≤ || V S^{−1} U^T z ||_2 ≤ || z ||_2 / σ_p

slide-63
SLIDE 63

Experiment

X_train, X_test, z_train and β∗ are sampled iid from a standard Gaussian; the data have 50 features

    y_train = X_train β∗ + z_train
    y_test  = X_test β∗            (no test noise)

We use y_train and X_train to compute β_LS, and report

    error_train = || X_train β_LS − y_train ||_2 / || y_train ||_2
    error_test  = || X_test β_LS − y_test ||_2 / || y_test ||_2
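
A sketch that reproduces this experiment (my own variable names; one random draw per training-set size n):

    import numpy as np

    rng = np.random.default_rng(3)
    p, n_test = 50, 1000
    beta = rng.standard_normal(p)

    for n in [50, 100, 200, 300, 400, 500]:
        X_train = rng.standard_normal((n, p))
        z_train = rng.standard_normal(n)
        y_train = X_train @ beta + z_train
        X_test = rng.standard_normal((n_test, p))
        y_test = X_test @ beta                      # no test noise

        beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
        err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
        print(n, round(err_train, 3), round(err_test, 3))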

slide-64
SLIDE 64

Experiment

[Figure: relative training and test error (ℓ2 norm) as a function of n, together with the training noise level]

slide-65
SLIDES 65-69

Experiment Questions

1. Can we approximate the relative noise level || z ||_2 / || y ||_2 ?
   || β∗ ||_2 ≈ √50,  || X_train β∗ ||_2 ≈ √(50 n),  || z_train ||_2 ≈ √n,  so the relative noise level is approximately 1/√51 ≈ 0.140

2. Why does the training error start at 0?
   When n = p = 50, X is square and invertible

3. Why does the relative training error converge to the noise level?
   || X_train β_LS − y_train ||_2 = || X_train ( β_LS − β∗ ) − z_train ||_2 and β_LS → β∗

4. Why does the relative test error converge to zero?
   We assumed no test noise, and β_LS → β∗

slide-70
SLIDE 70

Non-asymptotic bound

Let y := X β∗ + z, where the entries of X and z are iid standard Gaussian

The least-squares estimate satisfies

    √( (1 − ε) p ) / ( (1 + ε) √n )  ≤  || β_LS − β∗ ||_2  ≤  √( (1 + ε) p ) / ( (1 − ε) √n )

with probability at least 1 − 1/p − 2 exp( −p ε² / 8 ), as long as n ≥ 64 p log(12/ε) / ε²

slide-71
SLIDE 71

Proof

    || U^T z ||_2 / σ_1  ≤  || V S^{−1} U^T z ||_2  ≤  || U^T z ||_2 / σ_p

slide-72
SLIDE 72

Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0

    P( k (1 − ε) < || P_S z ||²_2 < k (1 + ε) ) ≥ 1 − 2 exp( − k ε² / 8 )

slide-73
SLIDE 73

Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0

    P( k (1 − ε) < || P_S z ||²_2 < k (1 + ε) ) ≥ 1 − 2 exp( − k ε² / 8 )

Consequence: with probability 1 − 2 exp( − p ε² / 8 )

    (1 − ε) p ≤ || U^T z ||²_2 ≤ (1 + ε) p

slide-74
SLIDE 74

Singular values of a Gaussian matrix

Let A be an n × k matrix with iid standard Gaussian entries, with n > k. For any fixed ε > 0, the singular values of A satisfy

    √n (1 − ε) ≤ σ_k ≤ σ_1 ≤ √n (1 + ε)

with probability at least 1 − 1/k, as long as n ≥ (64 k / ε²) log(12/ε)

slide-75
SLIDE 75

Proof

With probability 1 − 1/p

    √n (1 − ε) ≤ σ_p ≤ σ_1 ≤ √n (1 + ε)

as long as n ≥ 64 p log(12/ε) / ε²

slide-76
SLIDE 76

Experiment: || β∗ ||²_2 ≈ p

[Figure: relative coefficient error || β∗ − β_LS ||_2 / || β∗ ||_2 as a function of n, for p = 50, 100 and 200, together with a reference decay curve]

slide-77
SLIDE 77

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-78
SLIDE 78

Condition number

The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ_1/σ_p of its largest and smallest singular values. A matrix is ill conditioned if its condition number is large (it is almost rank deficient)

slide-79
SLIDE 79

Noise amplification

Let y := X β∗ + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp( − ε² / 8 )

    || β_LS − β∗ ||_2 ≥ √(1 − ε) / σ_p

where σ_p is the smallest singular value of X

slide-80
SLIDES 80-84

Proof

    || β_LS − β∗ ||²_2 = || V S^{−1} U^T z ||²_2
                       = || S^{−1} U^T z ||²_2            (V is orthogonal)
                       = Σ_{i=1}^{p} ( u_i^T z )² / σ_i²
                       ≥ ( u_p^T z )² / σ_p²

slide-85
SLIDES 85-86

Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0

    P( k (1 − ε) < || P_S z ||²_2 < k (1 + ε) ) ≥ 1 − 2 exp( − k ε² / 8 )

Consequence: with probability 1 − 2 exp( − ε² / 8 )

    ( u_p^T z )² ≥ 1 − ε

slide-87
SLIDE 87

Example

Let y := X β∗ + z, where

    X := [  0.212  −0.099
            0.605  −0.298
           −0.213   0.113
            0.589  −0.285
            0.016   0.006
            0.059   0.032 ]

    β∗ := ( 0.471, −1.191 )

    z := ( 0.066, −0.077, −0.010, −0.033, 0.010, 0.028 ),    || z ||_2 = 0.11

slide-88
SLIDE 88

Example

Condition number = 100

    X = U S V^T,  with

    U = [ −0.234   0.427
          −0.674  −0.202
           0.241   0.744
          −0.654   0.350
           0.017  −0.189
           0.067   0.257 ]

    S = diag( 1.00, 0.01 )

    V^T = [ −0.898   0.440
             0.440   0.898 ]

slide-89
SLIDES 89-94

Example

    β_LS − β∗ = V S^{−1} U^T z
              = V diag( 1.00, 100.00 ) U^T z
              = V ( 0.058, 3.004 )
              = ( 1.270, 2.723 )

so that

    || β_LS − β∗ ||_2 / || z ||_2 = 27.00
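
The numbers above can be checked directly; the following sketch reproduces the computation (matrix entries copied from the example slides, so the results match only up to rounding):

    import numpy as np

    X = np.array([[0.212, -0.099], [0.605, -0.298], [-0.213, 0.113],
                  [0.589, -0.285], [0.016, 0.006], [0.059, 0.032]])
    beta_true = np.array([0.471, -1.191])
    z = np.array([0.066, -0.077, -0.010, -0.033, 0.010, 0.028])
    y = X @ beta_true + z

    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    sing_vals = np.linalg.svd(X, compute_uv=False)
    print(sing_vals[0] / sing_vals[-1])                              # condition number, roughly 100
    print(np.linalg.norm(beta_ls - beta_true) / np.linalg.norm(z))   # error amplification, roughly 27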

slide-95
SLIDE 95

Multicollinearity

The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated.

For any X ∈ R^{n×p} with normalized columns, if two columns X_i and X_j, i ≠ j, satisfy ⟨ X_i, X_j ⟩ ≥ 1 − ε², then the smallest singular value satisfies σ_p ≤ ε

slide-96
SLIDE 96

Multicollinearity

The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated.

For any X ∈ R^{n×p} with normalized columns, if two columns X_i and X_j, i ≠ j, satisfy ⟨ X_i, X_j ⟩ ≥ 1 − ε², then the smallest singular value satisfies σ_p ≤ ε

Proof idea: consider || X ( e_i − e_j ) ||_2.

slide-97
SLIDE 97

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-98
SLIDE 98

Motivation

Avoid noise amplification due to multicollinearity

Problem: noise amplification blows up the coefficients
Solution: penalize large-norm solutions when fitting the model

Adding a penalty term that promotes a particular structure is called regularization

slide-99
SLIDES 99-103

Ridge regression

For a fixed regularization parameter λ > 0

    β_ridge := arg min_β || y − X β ||²_2 + λ || β ||²_2
             = ( X^T X + λ I )^{−1} X^T y

λ I increases the singular values of X^T X
When λ → 0, β_ridge → β_LS
When λ → ∞, β_ridge → 0
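
A sketch of the ridge closed form next to its augmented least-squares formulation, which is the one used in the proof on the next slides (illustrative data, my own variable names):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, lam = 30, 5, 0.5
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

    # Closed form from the slide
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Equivalent augmented least-squares formulation
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

    print(np.allclose(beta_ridge, beta_aug))  # True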

slide-104
SLIDES 104-106

Proof

β_ridge is the solution to a modified least-squares problem: stack X on top of √λ I and extend y with p zeros,

    β_ridge = arg min_β || ( y ; 0 ) − ( X ; √λ I ) β ||²_2
            = ( ( X ; √λ I )^T ( X ; √λ I ) )^{−1} ( X ; √λ I )^T ( y ; 0 )
            = ( X^T X + λ I )^{−1} X^T y

slide-107
SLIDES 107-112

Modified projection

    y_ridge := X β_ridge
             = X ( X^T X + λ I )^{−1} X^T y
             = U S V^T ( V S² V^T + λ V V^T )^{−1} V S U^T y
             = U S V^T V ( S² + λ I )^{−1} V^T V S U^T y
             = U S ( S² + λ I )^{−1} S U^T y
             = Σ_{i=1}^{p}  ( σ_i² / ( σ_i² + λ ) )  ⟨ y, u_i ⟩ u_i

The component of the data in the direction of u_i is shrunk by σ_i² / ( σ_i² + λ )

slide-113
SLIDES 113-115

Modified projection: Relation to PCA

The component of the data in the direction of u_i is shrunk by σ_i² / ( σ_i² + λ )

Instead of orthogonally projecting onto the column space of X, as in standard regression, we shrink and project

Which directions are shrunk the most? The directions in the data with the smallest variance

In PCA, we delete the directions with smallest variance (i.e., shrink them to zero). Ridge regression can be thought of as a continuous variant of performing regression on principal components

slide-116
SLIDE 116

Ridge-regression estimate

If y := X β∗ + z,

    β_ridge = V diag( σ_1²/(σ_1²+λ), σ_2²/(σ_2²+λ), . . . , σ_p²/(σ_p²+λ) ) V^T β∗
            + V diag( σ_1/(σ_1²+λ), σ_2/(σ_2²+λ), . . . , σ_p/(σ_p²+λ) ) U^T z

where X = U S V^T and σ_1, . . . , σ_p are the singular values

For comparison,

    β_LS = β∗ + V S^{−1} U^T z

slide-117
SLIDE 117

Bias-variance tradeoff

The error β_ridge − β∗ can be divided into two terms: bias (depends on β∗) and variance (depends on z)

The bias equals

    E( β_ridge − β∗ ) = − V diag( λ/(σ_1²+λ), λ/(σ_2²+λ), . . . , λ/(σ_p²+λ) ) V^T β∗

A larger λ increases the bias, but dampens the noise (decreases the variance)

slide-118
SLIDE 118

Example

Let y := X β∗ + z, where

    X := [  0.212  −0.099
            0.605  −0.298
           −0.213   0.113
            0.589  −0.285
            0.016   0.006
            0.059   0.032 ]

    β∗ := ( 0.471, −1.191 ),    z := ( 0.066, −0.077, −0.010, −0.033, 0.010, 0.028 ),    || z ||_2 = 0.11

slide-119
SLIDES 119-120

Example

    β_ridge − β∗ = − V diag( λ/(1+λ), λ/(0.01²+λ) ) V^T β∗ + V diag( 1/(1+λ), 0.01/(0.01²+λ) ) U^T z

Setting λ = 0.01,

    β_ridge − β∗ = − V diag( 0.01, 0.99 ) V^T β∗ + V diag( 0.99, 0.99 ) U^T z = ( 0.329, 0.823 )

slide-121
SLIDE 121

Example

Least-squares relative error: || β_LS − β∗ ||_2 / || z ||_2 = 27.00

Ridge relative error: || β_ridge − β∗ ||_2 / || z ||_2 = 7.96

slide-122
SLIDE 122

Example

[Figure: ridge coefficients and coefficient error as a function of the regularization parameter (log scale), compared with the least-squares fit]

slide-123
SLIDE 123

Maximum-a-posteriori estimator

Is there a probabilistic interpretation of ridge regression?

Bayesian viewpoint: β is modeled as random, not deterministic

The maximum-a-posteriori (MAP) estimator of β given y is

    β_MAP ( y ) := arg max_β f_{β | y} ( β | y )

where f_{β | y} is the conditional pdf of β given y

slide-124
SLIDE 124

Maximum-a-posteriori estimator

Let y ∈ R^n be a realization of

    y := X β + z

where β and z are iid Gaussian with mean zero and variances σ_1² and σ_2², respectively

If X ∈ R^{n×m} is known, then

    β_MAP = arg min_β || y − X β ||²_2 + λ || β ||²_2,    where λ := σ_2² / σ_1²

What does it mean if σ_1² is tiny or large? How about σ_2²?

slide-125
SLIDE 125

Problem

How do we calibrate the regularization parameter?

We cannot use the coefficient error (we don't know the true value!)
We cannot minimize it over the training data (why?)

Solution: check the fit on new data

slide-126
SLIDE 126

Cross validation

Given a set of examples ( y(1), x(1) ), ( y(2), x(2) ), . . . , ( y(n), x(n) ):

1. Partition the data into a training set X_train ∈ R^{n_train×p}, y_train ∈ R^{n_train} and a validation set X_val ∈ R^{n_val×p}, y_val ∈ R^{n_val}

2. Fit the model using the training set for every λ in a set Λ,

       β_ridge (λ) := arg min_β || y_train − X_train β ||²_2 + λ || β ||²_2

   and evaluate the fitting error on the validation set,

       err (λ) := || y_val − X_val β_ridge (λ) ||²_2

3. Choose the value of λ that minimizes the validation-set error,

       λ_cv := arg min_{λ ∈ Λ} err (λ)
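
A sketch of this procedure in numpy (hypothetical data and split; the grid Λ and the split sizes are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 60, 10
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)

    # 1. Split into training and validation sets
    X_train, y_train = X[:30], y[:30]
    X_val, y_val = X[30:], y[30:]

    # 2. Fit ridge regression for every lambda in a grid and record the validation error
    lambdas = 10.0 ** np.arange(-3, 4)
    errs = []
    for lam in lambdas:
        beta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p), X_train.T @ y_train)
        errs.append(np.linalg.norm(y_val - X_val @ beta) ** 2)

    # 3. Keep the lambda with the smallest validation error
    lam_cv = lambdas[int(np.argmin(errs))]
    print(lam_cv)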

slide-127
SLIDE 127

Prediction of house prices

Aim: predict the price of a house from

1. Area of the living room
2. Condition (integer between 1 and 5)
3. Grade (integer between 7 and 12)
4. Area of the house without the basement
5. Area of the basement
6. The year it was built
7. Latitude
8. Longitude
9. Average area of the living room of the houses within 15 blocks
slide-128
SLIDE 128

Prediction of house prices

Training data: 15 houses
Validation data: 15 houses
Test data: 15 houses

Condition number of the training-data feature matrix: 9.94

We evaluate the relative fit || y − X β_ridge ||_2 / || y ||_2

slide-129
SLIDE 129

Prediction of house prices

[Figure: ridge coefficients and the ℓ2-norm cost on the training and validation sets, as a function of the regularization parameter]

slide-130
SLIDE 130

Prediction of house prices

Best λ: 0.27
Validation set error: 0.672 (least squares: 0.906)
Test set error: 0.799 (least squares: 1.186)

slide-131
SLIDE 131

Training

[Figure: true versus estimated prices (dollars) on the training set, for least squares and ridge regression]

slide-132
SLIDE 132

Validation

[Figure: true versus estimated prices (dollars) on the validation set, for least squares and ridge regression]

slide-133
SLIDE 133

Test

[Figure: true versus estimated prices (dollars) on the test set, for least squares and ridge regression]

slide-134
SLIDE 134

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

slide-135
SLIDE 135

The classification problem

Goal: assign examples to one of several predefined categories

We have n examples of labels and corresponding features

    ( y(1), x(1) ), ( y(2), x(2) ), . . . , ( y(n), x(n) )

Here we consider only two categories: the labels are 0 or 1

slide-136
SLIDE 136

Logistic function

Smoothed version of the step function:

    g(t) := 1 / ( 1 + exp(−t) )

slide-137
SLIDE 137

Logistic function

[Figure: plot of the logistic function 1 / (1 + exp(−t))]

slide-138
SLIDE 138

Logistic regression

Generalized linear model: linear model + entrywise link function,

    y(i) ≈ g( β_0 + ⟨ x(i), β ⟩ )
slide-139
SLIDES 139-141

Maximum likelihood

If y(1), . . . , y(n) are independent samples of Bernoulli random variables with parameter

    p_{y(i)} (1) := g( ⟨ x(i), β ⟩ )

where x(1), . . . , x(n) ∈ R^p are known, the likelihood is

    L ( β ) := p_{y(1),...,y(n)} ( y(1), . . . , y(n) )
             = Π_{i=1}^{n} g( ⟨ x(i), β ⟩ )^{y(i)} ( 1 − g( ⟨ x(i), β ⟩ ) )^{1−y(i)}

and the ML estimate of β given y(1), . . . , y(n) is

    β_ML := arg max_β  Σ_{i=1}^{n}  y(i) log g( ⟨ x(i), β ⟩ ) + ( 1 − y(i) ) log( 1 − g( ⟨ x(i), β ⟩ ) )

slide-142
SLIDE 142

Logistic-regression estimator

    β_LR := arg max_β  Σ_{i=1}^{n}  y(i) log g( ⟨ x(i), β ⟩ ) + ( 1 − y(i) ) log( 1 − g( ⟨ x(i), β ⟩ ) )

For a new x, the logistic-regression prediction is

    y_LR := 1   if g( ⟨ x, β_LR ⟩ ) ≥ 1/2
            0   otherwise

g( ⟨ x, β_LR ⟩ ) can be interpreted as the probability that the label is 1
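
A sketch of fitting this estimator by maximizing the log-likelihood with plain gradient ascent (synthetic data, no intercept; scikit-learn's LogisticRegression solves a regularized version of the same problem):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    rng = np.random.default_rng(6)
    n, p = 200, 3
    X = rng.standard_normal((n, p))
    beta_true = np.array([2.0, -1.0, 0.5])
    y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

    # Gradient ascent on the log-likelihood; its gradient is X^T (y - g(X beta))
    beta = np.zeros(p)
    step = 0.1
    for _ in range(2000):
        beta += step * X.T @ (y - sigmoid(X @ beta)) / n

    pred = (sigmoid(X @ beta) >= 0.5).astype(float)
    print(beta, (pred == y).mean())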
slide-143
SLIDE 143

Iris data set

Aim: classify flowers using sepal width and length

Two species, 5 examples each:

◮ Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8
◮ Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8

Two new examples: (5.1, 3.5) and (5, 2)

slide-144
SLIDE 144

Iris data set

After centering and normalizing,

    β_LR = ( 32.1, −29.6 )    and    β_0 = 2.06

    i                            1       2       3       4       5
    x(i)[1]                   −0.12   −0.56   −0.36   −0.24    0.00
    x(i)[2]                    0.38   −0.09   −0.02    0.45    0.45
    ⟨x(i), β_LR⟩ + β_0        −12.9   −13.5    −8.9   −18.8   −11.0
    g( ⟨x(i), β_LR⟩ + β_0 )    0.00    0.00    0.00    0.00    0.00

    i                            6       7       8       9      10
    x(i)[1]                    0.33    0.00    0.53    0.25    0.17
    x(i)[2]                   −0.22   −0.22    0.05   −0.55   −0.22
    ⟨x(i), β_LR⟩ + β_0         19.1     8.7    17.7    26.3    13.9
    g( ⟨x(i), β_LR⟩ + β_0 )    1.00    1.00    1.00    1.00    1.00

slide-145
SLIDE 145

Iris data set

[Figure: logistic-regression probabilities over the (sepal length, sepal width) plane, with the setosa and versicolor examples and the two new points marked "???"]

slide-146
SLIDE 146

Iris data set

[Figure: logistic-regression probabilities for virginica versus versicolor, using sepal width and petal length]

slide-147
SLIDE 147

Digit classification

MNIST data. Aim: distinguish one digit from another

◮ x_i is an image of a 6 or a 9
◮ y_i = 1 if image i is a 6, y_i = 0 if it is a 9

2000 training examples and 2000 test examples, each half 6's and half 9's

Training error rate: 0.0; test error rate: 0.006

slide-148
SLIDE 148

Digit classification: β

[Figure: the coefficient vector β displayed as an image]

slide-149
SLIDE 149

Digit classification: True Positives

    β^T x       Probability of 6
    20.878           1.00
    18.217           1.00
    16.408           1.00

[Images of the corresponding digits]

slide-150
SLIDE 150

Digit classification: True Negatives

    β^T x       Probability of 6
    −14.71           0.00
    −15.829          0.00
    −17.02           0.00

[Images of the corresponding digits]

slide-151
SLIDE 151

Digit classification: False Positives

    β^T x       Probability of 6
     7.612           0.9995
     0.4341          0.606
     7.822484        0.9996

[Images of the corresponding digits]

slide-152
SLIDE 152

Digit classification: False Negatives

    β^T x       Probability of 6
    −5.984           0.0025
    −2.384           0.084
    −1.164           0.238

[Images of the corresponding digits]

slide-153
SLIDE 153

Digit Classification

This is a toy problem: distinguishing one digit from another is very easy; it is harder to classify any given digit. We used it to give insight into how logistic regression works.

It turns out that, on this simplified problem, a very simple choice of β gives good results. Can you guess it?

slide-154
SLIDE 154

Digit Classification

β = the average of the 6's minus the average of the 9's

Training error: 0.005; test error: 0.0035

[Figure: this β displayed as an image]