
SLIDE 1

5. Summary of linear regression so far
SLIDE 2

Main points

◮ Model/function/predictor class of linear regressors $x \mapsto w^T x$.
◮ ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
◮ ERM solution for least squares: pick $w$ satisfying $A^T A w = A^T b$, which is not unique; one unique choice is the ordinary least squares solution $A^+ b$.

SLIDE 3

Part 2 of linear regression lecture...

SLIDE 4

Recap on SVD. (A messy slide, I'm sorry.)

Suppose $0 \neq M \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) > 0$.

◮ "Decomposition form" thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^T$ with $s_1 \geq \cdots \geq s_r > 0$, and $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^T$, and in general $M^+ M = \sum_{i=1}^r v_i v_i^T \neq I$.
◮ "Factorization form" thin SVD: $M = U S V^T$ with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ having orthonormal columns (but $U U^T$ and $V V^T$ are not identity matrices in general), and $S = \mathrm{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V S^{-1} U^T$, and in general $M^+ M \neq I$ and $M M^+ \neq I$.
◮ Full SVD: $M = U_f S_f V_f^T$ with $U_f \in \mathbb{R}^{n \times n}$ and $V_f \in \mathbb{R}^{d \times d}$ orthonormal and full rank, so $U_f^T U_f$ and $V_f^T V_f$ are identity matrices, and $S_f \in \mathbb{R}^{n \times d}$ is zero everywhere except the first $r$ diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^T$, where $S_f^+$ is obtained by transposing $S_f$ and then inverting its nonzero entries, and in general $M^+ M \neq I$ and $M M^+ \neq I$.

Additional property: agreement with the eigendecompositions of $M M^T$ and $M^T M$. The "full SVD" adds columns to $U$ and $V$ which hit zeros of $S_f$ and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
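As a sanity check of these identities, here is a minimal numerical sketch (assuming `numpy`; the matrix `M` is a contrived rank-deficient example):

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-deficient matrix: n = 5, d = 4, rank r = 2.
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
r = np.linalg.matrix_rank(M)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T   # keep only the nonzero part

# "Decomposition form": M = sum_i s_i u_i v_i^T.
M_rebuilt = sum(sr[i] * np.outer(Ur[:, i], Vr[:, i]) for i in range(r))
print(np.allclose(M, M_rebuilt))            # True

# Pseudoinverse from the thin SVD agrees with numpy's pinv.
M_pinv = Vr @ np.diag(1.0 / sr) @ Ur.T
print(np.allclose(M_pinv, np.linalg.pinv(M)))  # True

# M^+ M = sum_i v_i v_i^T is a projection, not the identity, since r < d.
print(np.allclose(M_pinv @ M, np.eye(4)))   # False
```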

SLIDES 5-7

Recap on SVD, zero matrix case

Suppose $M = 0 \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) = 0$.

◮ In all types of SVD, $M^+$ is $M^T$ (another zero matrix).
◮ Technically speaking, $s$ is a singular value of $M$ iff there exist nonzero vectors $(u, v)$ with $Mv = su$ and $M^T u = sv$; the zero matrix therefore has no singular values (or left/right singular vectors).
◮ The "factorization form" thin SVD becomes a little messy.

SLIDE 8

6. More on the normal equations
SLIDES 9-13

Recall our matrix notation

Let labeled examples $((x_i, y_i))_{i=1}^n$ be given.

Define the $n \times d$ matrix $A$ and the $n \times 1$ column vector $b$ by
$$A := \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^T \rightarrow \\ \vdots \\ \leftarrow x_n^T \rightarrow \end{pmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Can write the empirical risk as
$$\hat{R}(w) = \frac{1}{n} \sum_{i=1}^n \left( y_i - x_i^T w \right)^2 = \|Aw - b\|_2^2.$$

Necessary condition for $w$ to be a minimizer of $\hat{R}$: $\nabla \hat{R}(w) = 0$, i.e., $w$ is a critical point of $\hat{R}$. This translates to
$$(A^T A) w = A^T b,$$
a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.
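A minimal sketch of the pipeline so far (assuming `numpy`; the data and the true coefficient vector are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))            # rows are the x_i
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                     # A as defined on the slide
b = y / np.sqrt(n)

# Solve the normal equations (A^T A) w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)  # fine here since A^T A is invertible

# Empirical risk two ways: average squared error, and ||Aw - b||_2^2.
risk_avg = np.mean((y - X @ w) ** 2)
risk_mat = np.sum((A @ w - b) ** 2)
print(np.isclose(risk_avg, risk_mat))  # True
```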

SLIDES 14-17

Normal equations imply optimality

Consider $w$ with $A^T A w = A^T y$, and any $w'$; then
$$\|Aw' - y\|^2 = \|Aw' - Aw + Aw - y\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^T (Aw - y) + \|Aw - y\|^2.$$
Since
$$(Aw' - Aw)^T (Aw - y) = (w' - w)^T (A^T A w - A^T y) = 0,$$
then $\|Aw' - y\|^2 = \|Aw' - Aw\|^2 + \|Aw - y\|^2 \geq \|Aw - y\|^2$. This means $w$ is optimal.

Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^T$,
$$\|Aw' - Aw\|^2 = (w' - w)^T (A^T A)(w' - w) = (w' - w)^T \left( \sum_{i=1}^r s_i^2 v_i v_i^T \right) (w' - w),$$
so $w'$ is also optimal iff $w' - w$ is in the right nullspace of $A$.

(We'll revisit all this with convexity later.)
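The nullspace statement can be checked numerically; a minimal sketch (assuming `numpy`, with a contrived rank-deficient `A`):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient A: d = 4 but rank 2, so minimizers are non-unique.
A = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 4))
y = rng.normal(size=20)

w = np.linalg.pinv(A) @ y              # one normal-equations solution

# A basis for the (right) nullspace of A, from the full SVD.
_, s, Vt = np.linalg.svd(A)
null = Vt[2:, :].T                     # columns v_3, v_4 (singular value ~ 0)

w2 = w + null @ np.array([3.0, -7.0])  # shift w inside the nullspace
print(np.linalg.norm(A @ w - y), np.linalg.norm(A @ w2 - y))  # equal risks
```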

SLIDES 18-24

Geometric interpretation of least squares ERM

Let $a_j \in \mathbb{R}^n$ be the $j$-th column of the matrix $A \in \mathbb{R}^{n \times d}$, so
$$A = \begin{pmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{pmatrix}.$$

Minimizing $\|Aw - b\|_2^2$ is the same as finding the vector $\hat{b} \in \mathrm{range}(A)$ closest to $b$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\mathrm{range}(A) = \{Aw : w \in \mathbb{R}^d\}$.

(Figure: $b$ and its orthogonal projection $\hat{b}$ onto the plane spanned by $a_1$ and $a_2$.)

◮ $\hat{b}$ is uniquely determined; indeed, $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^T b$.
◮ If $r = \mathrm{rank}(A) < d$, then there is more than one way to write $\hat{b}$ as a linear combination of $a_1, \ldots, a_d$, so the ERM solution is not unique.

To get $w$ from $\hat{b}$: solve the system of linear equations $Aw = \hat{b}$.
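A minimal sketch of the projection identity $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^T b$ (assuming `numpy`):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 4))   # rank 2
b = rng.normal(size=20)
r = np.linalg.matrix_rank(A)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
b_hat = A @ (np.linalg.pinv(A) @ b)                      # A A^+ b
proj = (U[:, :r] @ U[:, :r].T) @ b                       # sum_i u_i u_i^T b
print(np.allclose(b_hat, proj))                          # True

# Residual b - b_hat is orthogonal to range(A), i.e., to every column a_j.
print(np.allclose(A.T @ (b - b_hat), 0))                 # True
```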

SLIDE 25

7. Features
SLIDES 26-27

Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with $x \mapsto w^T \phi(x)$, where $\phi$ is a feature mapping.

Examples:

1. Non-linear transformations of existing variables: for $x \in \mathbb{R}$, $\phi(x) = \ln(1 + x)$.
2. Logical formulas of binary variables: for $x = (x_1, \ldots, x_d) \in \{0, 1\}^d$, $\phi(x) = (x_1 \wedge x_5 \wedge \neg x_{10}) \vee (\neg x_2 \wedge x_7)$.
3. Trigonometric expansion: for $x \in \mathbb{R}$, $\phi(x) = (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots)$.
4. Polynomial expansion: for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $\phi(x) = (1, x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d)$.
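As an illustration, here is a minimal sketch of ERM on top of a truncated trigonometric expansion (assuming `numpy`; the target function and truncation level `K` are arbitrary choices for the demo):

```python
import numpy as np

def phi_trig(x, K=3):
    """Truncated trig expansion (1, sin x, cos x, ..., sin Kx, cos Kx)."""
    feats = [np.ones_like(x)]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.stack(feats, axis=1)

rng = np.random.default_rng(4)
x = rng.uniform(0, 2 * np.pi, size=100)
y = np.sign(np.sin(x)) + 0.1 * rng.normal(size=100)  # a non-linear target

Phi = phi_trig(x)                       # n x (2K+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((Phi @ w - y) ** 2))      # empirical risk of the fit
```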

SLIDE 28

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

◮ Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature: $\phi(x) = (x_{\mathrm{temp}} - 98.6)^2$.
◮ What if you didn't know about this magic constant 98.6?
◮ Instead, use $\phi(x) = (1, x_{\mathrm{temp}}, x_{\mathrm{temp}}^2)$. Can learn coefficients $w$ such that $w^T \phi(x) = (x_{\mathrm{temp}} - 98.6)^2$, or any other quadratic polynomial in $x_{\mathrm{temp}}$ (which may be better!).
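Fitting the features $(1, x, x^2)$ against targets generated by $(x - 98.6)^2$ should recover $w \approx (98.6^2, -2 \cdot 98.6, 1)$; a minimal sketch (assuming `numpy`):

```python
import numpy as np

x = np.linspace(95.0, 104.0, 50)        # plausible temperatures (F)
y = (x - 98.6) ** 2                     # the physician's feature, as target

Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # approximately [98.6**2, -2*98.6, 1.0]
```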

SLIDES 29-31

Quadratic expansion

Quadratic function $f : \mathbb{R} \to \mathbb{R}$, $f(x) = ax^2 + bx + c$ for $a, b, c \in \mathbb{R}$. This can be written as a linear function of $\phi(x)$, where $\phi(x) := (1, x, x^2)$, since $f(x) = w^T \phi(x)$ where $w = (c, b, a)$.

For a multivariate quadratic function $f : \mathbb{R}^d \to \mathbb{R}$, use
$$\phi(x) := (\underbrace{1, x_1, \ldots, x_d}_{\text{linear terms}}, \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}}, \underbrace{x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d}_{\text{cross terms}}).$$
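One way to materialize this expansion is sketched below (assuming `numpy` and the standard library; `sklearn.preprocessing.PolynomialFeatures` with degree 2 produces the same terms):

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Expand one vector x in R^d into (1, linear, squared, cross terms)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

print(phi_quadratic([2.0, 3.0]))  # [1. 2. 3. 4. 9. 6.]
```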

SLIDES 32-33

Affine expansion and "Old Faithful"

Woodward needed an affine expansion for "Old Faithful" data: $\phi(x) := (1, x)$.

(Figure: scatter of duration of last eruption (1-6) vs. time until next eruption (20-100), with a fitted affine function.)

An affine function $f_{a,b} : \mathbb{R} \to \mathbb{R}$ for $a, b \in \mathbb{R}$, $f_{a,b}(x) = a + bx$, is a linear function $f_w$ of $\phi(x)$ for $w = (a, b)$. (This easily generalizes to multivariate affine functions.)

SLIDE 34

Final remarks on features

◮ "Feature engineering" can drastically change the power of a model.
◮ Some people consider it messy, unprincipled, pure "trial-and-error".
◮ Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).

SLIDES 35-36

8. Statistical view of least squares; maximum likelihood
SLIDES 37-42

Maximum likelihood estimation (MLE) refresher

Parametric statistical model: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, a collection of probability distributions for observed data.

◮ $\Theta$: parameter space.
◮ $\theta \in \Theta$: a particular parameter (or parameter vector).
◮ $P_\theta$: a particular probability distribution for observed data.

Likelihood of $\theta \in \Theta$ given observed data $x$: for discrete $X \sim P_\theta$ with probability mass function $p_\theta$, $L(\theta) := p_\theta(x)$; for continuous $X \sim P_\theta$ with probability density function $f_\theta$, $L(\theta) := f_\theta(x)$.

Maximum likelihood estimator (MLE): let $\hat{\theta}$ be the $\theta \in \Theta$ of highest likelihood given the observed data.
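As a toy instance of this refresher, consider estimating the mean of a Gaussian with known variance; a minimal sketch (assuming `numpy`; this particular example is not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # observed data, true mu = 2

# Model: P_theta = N(theta, 1), density f_theta(x) ∝ exp(-(x - theta)^2 / 2).
# Log-likelihood of theta given the whole sample, up to a constant:
def loglik(theta):
    return -0.5 * np.sum((x - theta) ** 2)

thetas = np.linspace(0.0, 4.0, 401)
theta_hat = thetas[np.argmax([loglik(t) for t in thetas])]
print(theta_hat, x.mean())  # grid MLE is close to the sample mean (the MLE)
```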

SLIDES 43-47

Distributions over labeled examples

$\mathcal{X}$: space of possible side-information (feature space).
$\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:

1. Marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. Conditional distribution $P_{Y|X=x}$ of $Y$ given $X = x$ for each $x \in \mathcal{X}$: $P_{Y|X=x}$ is a probability distribution on $\mathcal{Y}$.

This lecture: $\mathcal{Y} = \mathbb{R}$ (regression problems).

SLIDES 48-51

Optimal predictor

What function $f : \mathcal{X} \to \mathbb{R}$ has the smallest (squared loss) risk $R(f) := \mathbb{E}[(f(X) - Y)^2]$? Note: earlier we discussed empirical risk.

◮ Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{E}[(\hat{y} - Y)^2 \mid X = x]$ is the conditional mean $\mathbb{E}[Y \mid X = x]$.
◮ Therefore, the function $f^\star$ where $f^\star(x) = \mathbb{E}[Y \mid X = x]$ has the smallest risk.
◮ $f^\star$ is called the regression function or conditional mean function.

SLIDES 52-57

Linear regression models

When side-information is encoded as vectors of real numbers $x = (x_1, \ldots, x_d)$ (called features or variables), it is common to use a linear regression model, such as the following:
$$Y \mid X = x \sim N(x^T w, \sigma^2), \qquad x \in \mathbb{R}^d.$$

◮ Parameters: $w = (w_1, \ldots, w_d) \in \mathbb{R}^d$, $\sigma^2 > 0$.
◮ $X = (X_1, \ldots, X_d)$, a random vector (i.e., a vector of random variables).
◮ Conditional distribution of $Y$ given $X$ is normal.
◮ Marginal distribution of $X$ not specified.

In this model, the regression function $f^\star$ is a linear function $f_w : \mathbb{R}^d \to \mathbb{R}$,
$$f_w(x) = x^T w = \sum_{i=1}^d x_i w_i, \qquad x \in \mathbb{R}^d.$$
(We'll often refer to $f_w$ just by $w$.)

(Figure: data drawn from such a model in one dimension, with the regression function $f^\star$ overlaid; $x \in [-1, 1]$, $y \in [-5, 5]$.)
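A minimal sketch of sampling from this model (assuming `numpy`; the marginal of $X$ is chosen arbitrarily here, precisely because the model leaves it unspecified):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 1000
w_true, sigma = np.array([1.0, -2.0, 0.5]), 0.3

X = rng.uniform(-1, 1, size=(n, d))          # any marginal for X will do
Y = X @ w_true + sigma * rng.normal(size=n)  # Y | X=x ~ N(x^T w, sigma^2)

# The regression function f*(x) = x^T w is the conditional mean:
# the residuals Y - X w_true average out to roughly zero.
print(np.mean(Y - X @ w_true))
```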

SLIDES 58-61

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid, with
$$Y \mid X = x \sim N(x^T w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
(Traditional to study linear regression in the context of this model.)

Log-likelihood of $(w, \sigma^2)$, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$$\sum_{i=1}^n \left( -\frac{1}{2\sigma^2} (x_i^T w - y_i)^2 + \frac{1}{2} \ln \frac{1}{2\pi\sigma^2} \right) + \text{terms not involving } (w, \sigma^2).$$

The $w$ that maximizes the log-likelihood is also the $w$ that minimizes
$$\frac{1}{n} \sum_{i=1}^n (x_i^T w - y_i)^2.$$

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model...
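The equivalence can be seen numerically in one dimension: the grid maximizer of the log-likelihood matches the least squares solution. A minimal sketch (assuming `numpy`):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 1.5 * x + 0.5 * rng.normal(size=100)   # true w = 1.5, sigma = 0.5
sigma2 = 0.25

def loglik(w):   # Gaussian log-likelihood in w, with sigma^2 held fixed
    return np.sum(-(x * w - y) ** 2 / (2 * sigma2)
                  - 0.5 * np.log(2 * np.pi * sigma2))

ws = np.linspace(0.0, 3.0, 601)
w_mle = ws[np.argmax([loglik(w) for w in ws])]
w_ls = (x @ y) / (x @ x)                   # 1-d least squares solution
print(w_mle, w_ls)                         # agree up to the grid resolution
```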

SLIDES 62-65

Empirical distribution and empirical risk

The empirical distribution $P_n$ on $(x_1, y_1), \ldots, (x_n, y_n)$ has probability mass function $p_n$ given by
$$p_n((x, y)) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{(x, y) = (x_i, y_i)\}, \qquad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$$

Plug-in principle: The goal is to find a function $f$ that minimizes the (squared loss) risk $R(f) = \mathbb{E}[(f(X) - Y)^2]$. But we don't know the distribution $P$ of $(X, Y)$. Replace $P$ with $P_n$ to get the empirical (squared loss) risk $\hat{R}(f)$:
$$\hat{R}(f) := \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2.$$

("Plug-in principle" is used throughout statistics in this same way.)
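In code, the plug-in step is just an average over the sample; a minimal sketch (assuming `numpy`):

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Plug-in estimate of E[(f(X) - Y)^2] under the empirical distribution."""
    return np.mean((f(xs) - ys) ** 2)

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.1, 0.9, 2.2])
print(empirical_risk(lambda x: x, xs, ys))  # risk of the identity predictor
```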

SLIDES 66-69

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns
$$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \hat{R}(w),$$
which coincides with MLE under the basic linear regression model.

In general:
◮ MLE makes sense in the context of the statistical model for which it is derived.
◮ ERM makes sense in the context of the general iid model for supervised learning.

Further remarks:
◮ In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we "recover" a "true" parameter.
◮ In ERM, by default there is no assumption of a "true" parameter to recover. Useful examples: medical testing, gene expression, ...

SLIDES 70-73

Old Faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

(Figure: a timeline of eruption start times $a_0, a_1, \ldots, a_n$ and end times $b_0, b_1, \ldots, b_n$, with waiting times $Y_1, Y_2, \ldots$ between eruptions and the current time $t$.)

Statistical model (not just IID!): $Y_1, \ldots, Y_n, Y \sim_{\text{iid}} N(\mu, \sigma^2)$.
◮ Data: $Y_i := a_i - b_{i-1}$, $i = 1, \ldots, n$.
(Admittedly not a great model, since durations are non-negative.)

Task: At a later time $t$ (when an eruption ends), predict the time of the next eruption, $t + Y$.

For the linear regression model, we'll assume
$$Y \mid X = x \sim N(x^T w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
(This extends the model above if we add the "1" feature.)

SLIDE 74

9. Regularization and ridge regression
SLIDES 75-82

Inductive bias

Suppose the ERM solution is not unique. What should we do? One possible answer: pick the $w$ of shortest length.

◮ Fact: The shortest solution $\hat{w}$ to $(A^T A) w = A^T b$ is always unique.
◮ Obtain $\hat{w}$ via $\hat{w} = A^+ b$, where $A^+$ is the (Moore-Penrose) pseudoinverse of $A$.

Why should this be a good idea?

◮ The data does not give a reason to choose a shorter $w$ over a longer $w$.
◮ The preference for shorter $w$ is an inductive bias: it will work well for some problems (e.g., when the "true" $w^\star$ is short), not for others.

All learning algorithms encode some kind of inductive bias.
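A minimal sketch of the minimum-norm choice (assuming `numpy`; note that `np.linalg.lstsq` also returns the minimum-norm solution for rank-deficient systems):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 4))   # rank 2 < d = 4
b = rng.normal(size=20)

w_min = np.linalg.pinv(A) @ b                 # shortest normal-eq. solution

# Any other solution w_min + v (v in the nullspace of A) is strictly longer.
_, _, Vt = np.linalg.svd(A)
w_other = w_min + 5.0 * Vt[-1]                # shift along a nullspace vector
print(np.linalg.norm(w_min), np.linalg.norm(w_other))   # first is smaller
print(np.allclose(A @ w_min, A @ w_other))              # same predictions
```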

SLIDES 83-86

Example

ERM with a scaled trigonometric feature expansion:
$$\phi(x) = \left(1, \sin(x), \cos(x), \tfrac{1}{2}\sin(2x), \tfrac{1}{2}\cos(2x), \tfrac{1}{3}\sin(3x), \tfrac{1}{3}\cos(3x), \ldots \right).$$

(Figures: the same training data shown three times, over $x \in [0, 6]$ and $f(x) \in [-2.5, 2.5]$: alone, with some arbitrary ERM fit, and with the least $\ell_2$ norm ERM fit.)

It is not a given that the least norm ERM is better than the other ERM!

SLIDES 87-92

Regularized ERM

Combine the two concerns: For a given $\lambda \geq 0$, find the minimizer of
$$\hat{R}(w) + \lambda \|w\|_2^2$$
over $w \in \mathbb{R}^d$.

Fact: If $\lambda > 0$, then the solution is always unique (even if $n < d$)!

◮ This is called ridge regression. ($\lambda = 0$ is ERM / ordinary least squares.) Explicit solution: $(A^T A + \lambda I)^{-1} A^T b$.
◮ The parameter $\lambda$ controls how much attention is paid to the regularizer $\|w\|_2^2$ relative to the data fitting term $\hat{R}(w)$.
◮ Choose $\lambda$ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is $\ell_1$.
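A minimal sketch of the closed-form ridge solution (assuming `numpy`; libraries such as `sklearn.linear_model.Ridge` implement a variant with their own scaling conventions):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 10, 50                                   # n < d: plain ERM not unique
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def ridge(A, b, lam):
    """Unique minimizer of ||Aw - b||^2 + lam * ||w||^2 for lam > 0."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

for lam in [0.01, 0.1, 1.0]:
    w = ridge(A, b, lam)
    print(lam, np.linalg.norm(w))               # larger lambda, shorter w
```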
