- 5. Summary of linear regression so far
Main points
◮ Model/function/predictor class of linear regressors $x \mapsto w^\top x$.
◮ ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
◮ ERM solution for least squares: pick $w$ satisfying $A^\top A w = A^\top b$, which need not be unique; one unique choice is the ordinary least squares solution $A^+ b$.
18 / 94
Part 2 of linear regression lecture. . .
Recap on SVD. (A messy slide, I’m sorry.)
Suppose $0 \neq M \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) > 0$.
◮ "Decomposition form" thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$ with $s_1 \geq \cdots \geq s_r > 0$, and $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$; in general $M^+ M = \sum_{i=1}^r v_i v_i^\top \neq I$.
◮ "Factorization form" thin SVD: $M = U S V^\top$ with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ having orthonormal columns (so $U^\top U = V^\top V = I_r$, but $U U^\top$ and $V V^\top$ are not identity matrices in general), and $S = \mathrm{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V S^{-1} U^\top$, and in general $M^+ M \neq I$ and $M M^+ \neq I$.
◮ Full SVD: $M = U_f S_f V_f^\top$ with $U_f \in \mathbb{R}^{n \times n}$ and $V_f \in \mathbb{R}^{d \times d}$ orthonormal and full rank, so $U_f^\top U_f$ and $V_f^\top V_f$ are identity matrices, and $S_f \in \mathbb{R}^{n \times d}$ is zero everywhere except the first $r$ diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^\top$, where $S_f^+$ is obtained by transposing $S_f$ and then inverting its nonzero entries, and in general $M^+ M \neq I$ and $M M^+ \neq I$.
Additional property: agreement with the eigendecompositions of $M M^\top$ and $M^\top M$. The "full SVD" adds columns to $U$ and $V$ which hit zeros of $S$ and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94
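To make the recap concrete, here is a small numpy sketch (mine, not part of the slides) that checks the thin-SVD facts above on an arbitrary rank-deficient matrix; the sizes and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 6, 4, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))   # a rank-r matrix with r < d

# Factorization-form thin SVD: keep only the r strictly positive singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

M_plus = Vt.T @ np.diag(1.0 / s) @ U.T                   # pseudoinverse V S^{-1} U^T
print(np.allclose(M_plus, np.linalg.pinv(M)))            # True: matches numpy's pinv

# M^+ M equals sum_i v_i v_i^T (projection onto the row space), not the identity.
print(np.allclose(M_plus @ M, Vt.T @ Vt))                # True
print(np.allclose(M_plus @ M, np.eye(d)))                # False, since r < d
```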
Recap on SVD, zero matrix case
Suppose $0 = M \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) = 0$.
◮ In all types of SVD, $M^+$ is $M^\top$ (another zero matrix).
◮ Technically speaking, $s$ is a singular value of $M$ iff there exist nonzero vectors $(u, v)$ with $Mv = su$ and $M^\top u = sv$; the zero matrix therefore has no singular values (or left/right singular vectors).
◮ "Factorization form" thin SVD becomes a little messy.
20 / 94
- 6. More on the normal equations
Recall our matrix notation

Let labeled examples $((x_i, y_i))_{i=1}^n$ be given.

Define the $n \times d$ matrix $A$ and the $n \times 1$ column vector $b$ by
$$A := \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^\top \rightarrow \\ \vdots \\ \leftarrow x_n^\top \rightarrow \end{pmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Can write empirical risk as
$$\widehat{R}(w) = \frac{1}{n} \sum_{i=1}^n \big(y_i - x_i^\top w\big)^2 = \|Aw - b\|_2^2.$$

Necessary condition for $w$ to be a minimizer of $\widehat{R}$: $\nabla \widehat{R}(w) = 0$, i.e., $w$ is a critical point of $\widehat{R}$.

This translates to
$$(A^\top A)\, w = A^\top b,$$
a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.
21 / 94
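For concreteness, a short numpy sketch (my own; the synthetic data is purely illustrative) that forms $A$ and $b$ as above and checks that a solution of the normal equations gives the empirical risk in both forms.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))                    # rows are the x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                             # A := (1/sqrt(n)) * stacked x_i^T
b = y / np.sqrt(n)                             # b := (1/sqrt(n)) * (y_1, ..., y_n)

# Solve the normal equations (A^T A) w = A^T b (A^T A is invertible here).
w_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A.T @ A @ w_hat, A.T @ b))   # True

# Empirical risk: (1/n) * sum_i (y_i - x_i^T w)^2 equals ||A w - b||_2^2.
print(np.allclose(np.mean((y - X @ w_hat) ** 2), np.sum((A @ w_hat - b) ** 2)))  # True
```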
Normal equations imply optimality

Consider $w$ with $A^\top A w = A^\top y$, and any $w'$; then
$$\|Aw' - y\|^2 = \|Aw' - Aw + Aw - y\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^\top(Aw - y) + \|Aw - y\|^2.$$
Since
$$(Aw' - Aw)^\top(Aw - y) = (w' - w)^\top(A^\top A w - A^\top y) = 0,$$
we get $\|Aw' - y\|^2 = \|Aw' - Aw\|^2 + \|Aw - y\|^2$. This means $w$ is optimal.

Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$$\|Aw' - Aw\|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \Big(\sum_{i=1}^r s_i^2 v_i v_i^\top\Big)(w' - w),$$
so $w'$ is also optimal iff $w' - w$ is in the right nullspace of $A$.

(We'll revisit all this with convexity later.)
22 / 94
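A numerical illustration (not from the lecture) of both claims: every solution of the normal equations attains the same value of $\|Aw - y\|^2$, and two solutions differ by a vector in the right nullspace of $A$. The matrix below is made rank-deficient on purpose so the solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 20, 5, 3
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))    # rank r < d
y = rng.normal(size=n)

w = np.linalg.pinv(A) @ y                                # one solution of the normal equations
_, _, Vt = np.linalg.svd(A)
w_prime = w + 7.0 * Vt[-1]                               # shift by a right-nullspace direction of A

print(np.allclose(A.T @ A @ w_prime, A.T @ y))                                 # True: also a solution
print(np.allclose(np.sum((A @ w - y) ** 2), np.sum((A @ w_prime - y) ** 2)))   # True: same risk
```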
Geometric interpretation of least squares ERM

Let $a_j \in \mathbb{R}^n$ be the $j$-th column of the matrix $A \in \mathbb{R}^{n \times d}$, so
$$A = \begin{pmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{pmatrix}.$$

Minimizing $\|Aw - b\|_2^2$ is the same as finding the vector $\hat{b} \in \mathrm{range}(A)$ closest to $b$.

Solution $\hat{b}$ is the orthogonal projection of $b$ onto $\mathrm{range}(A) = \{Aw : w \in \mathbb{R}^d\}$.

[Figure: $b$ and its orthogonal projection $\hat{b}$ onto the plane spanned by $a_1$ and $a_2$.]

◮ $\hat{b}$ is uniquely determined; indeed, $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^\top b$.
◮ If $r = \mathrm{rank}(A) < d$, then there is more than one way to write $\hat{b}$ as a linear combination of $a_1, \ldots, a_d$. If $\mathrm{rank}(A) < d$, the ERM solution is not unique.

To get $w$ from $\hat{b}$: solve the system of linear equations $Aw = \hat{b}$.
23 / 94
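A small numpy sketch (my own) of the picture above: compute the projection $\hat{b} = AA^+ b$, check that the residual is orthogonal to $\mathrm{range}(A)$, and recover a $w$ with $Aw = \hat{b}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

b_hat = A @ np.linalg.pinv(A) @ b                 # orthogonal projection of b onto range(A)

print(np.allclose(A.T @ (b - b_hat), 0))          # True: residual is orthogonal to every column a_j
w, *_ = np.linalg.lstsq(A, b_hat, rcond=None)     # solve A w = b_hat
print(np.allclose(A @ w, b_hat))                  # True, since b_hat is in range(A)
```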
- 7. Features
Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features. Idea: predict with $x \mapsto w^\top \phi(x)$, where $\phi$ is a feature mapping.

Examples:
1. Non-linear transformations of existing variables: for $x \in \mathbb{R}$, $\phi(x) = \ln(1 + x)$.
2. Logical formula of binary variables: for $x = (x_1, \ldots, x_d) \in \{0,1\}^d$, $\phi(x) = (x_1 \wedge x_5 \wedge \neg x_{10}) \vee (\neg x_2 \wedge x_7)$.
3. Trigonometric expansion: for $x \in \mathbb{R}$, $\phi(x) = (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots)$.
4. Polynomial expansion: for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $\phi(x) = (1, x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d)$.
24 / 94
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
◮ A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature: $\phi(x) = (x_{\mathrm{temp}} - 98.6)^2$.
◮ What if you didn't know about this magic constant 98.6?
◮ Instead, use $\phi(x) = (1, x_{\mathrm{temp}}, x_{\mathrm{temp}}^2)$. We can learn coefficients $w$ such that $w^\top \phi(x) = (x_{\mathrm{temp}} - 98.6)^2$, or any other quadratic polynomial in $x_{\mathrm{temp}}$ (which may be better!).
25 / 94
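A quick numpy check (my own toy example, with noiseless data) that least squares over the feature map $\phi(x) = (1, x, x^2)$ recovers the coefficients of $(x - 98.6)^2 = 9721.96 - 197.2\,x + x^2$.

```python
import numpy as np

x = np.linspace(95.0, 104.0, 30)                     # hypothetical temperature readings
y = (x - 98.6) ** 2                                  # target given by the physician's feature

Phi = np.column_stack([np.ones_like(x), x, x ** 2])  # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                             # approximately [9721.96, -197.2, 1.0]
```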
Quadratic expansion

Quadratic function $f : \mathbb{R} \to \mathbb{R}$: $f(x) = ax^2 + bx + c$, $x \in \mathbb{R}$, for $a, b, c \in \mathbb{R}$.

This can be written as a linear function of $\phi(x)$, where $\phi(x) := (1, x, x^2)$, since $f(x) = w^\top \phi(x)$ where $w = (c, b, a)$.

For a multivariate quadratic function $f : \mathbb{R}^d \to \mathbb{R}$, use
$$\phi(x) := (\underbrace{1, x_1, \ldots, x_d}_{\text{linear terms}},\ \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}},\ \underbrace{x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d}_{\text{cross terms}}).$$
26 / 94
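A sketch (not from the slides) of the multivariate quadratic feature map; the ordering of the cross terms is one arbitrary but fixed choice.

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """phi(x) = (1, x_1..x_d, x_1^2..x_d^2, x_i*x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

print(quadratic_features([2.0, 3.0]))   # [1. 2. 3. 4. 9. 6.]
```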
Affine expansion and "Old Faithful"

Woodward needed an affine expansion for the "Old Faithful" data: $\phi(x) := (1, x)$.

[Figure: time until next eruption vs. duration of last eruption, with a fitted affine function.]

An affine function $f_{a,b} : \mathbb{R} \to \mathbb{R}$, $f_{a,b}(x) = a + bx$ for $a, b \in \mathbb{R}$, is a linear function $f_w$ of $\phi(x)$ for $w = (a, b)$. (This easily generalizes to multivariate affine functions.)
27 / 94
Final remarks on features
◮ "Feature engineering" can drastically change the power of a model.
◮ Some people consider it messy, unprincipled, pure "trial-and-error".
◮ Deep learning is sometimes touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).
28 / 94
- 8. Statistical view of least squares; maximum likelihood
Maximum likelihood estimation (MLE) refresher
Parametric statistical model: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, a collection of probability distributions for the observed data.
◮ $\Theta$: parameter space.
◮ $\theta \in \Theta$: a particular parameter (or parameter vector).
◮ $P_\theta$: a particular probability distribution for observed data.

Likelihood of $\theta \in \Theta$ given observed data $x$: for discrete $X \sim P_\theta$ with probability mass function $p_\theta$, $L(\theta) := p_\theta(x)$; for continuous $X \sim P_\theta$ with probability density function $f_\theta$, $L(\theta) := f_\theta(x)$.

Maximum likelihood estimator (MLE): let $\hat{\theta}$ be the $\theta \in \Theta$ of highest likelihood given the observed data.
29 / 94
Distributions over labeled examples
$\mathcal{X}$: space of possible side-information (feature space). $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:
1. Marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. Conditional distribution $P_{Y|X=x}$ of $Y$ given $X = x$ for each $x \in \mathcal{X}$: $P_{Y|X=x}$ is a probability distribution on $\mathcal{Y}$.

This lecture: $\mathcal{Y} = \mathbb{R}$ (regression problems).
30 / 94
Optimal predictor
What function $f : \mathcal{X} \to \mathbb{R}$ has the smallest (squared loss) risk $R(f) := \mathbb{E}[(f(X) - Y)^2]$? Note: earlier we discussed empirical risk.
◮ Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{E}[(\hat{y} - Y)^2 \mid X = x]$ is the conditional mean $\mathbb{E}[Y \mid X = x]$.
◮ Therefore, the function $f^\star : \mathcal{X} \to \mathbb{R}$ with $f^\star(x) = \mathbb{E}[Y \mid X = x]$ has the smallest risk.
◮ $f^\star$ is called the regression function or conditional mean function.
31 / 94
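A tiny Monte Carlo check (mine) of the first bullet: among constant predictions $\hat{y}$, the conditional mean minimizes the conditional risk. The toy conditional distribution here is $Y \mid X = x \sim \mathcal{N}(2x, 1)$ with $x = 1.5$, so $\mathbb{E}[Y \mid X = x] = 3$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = 1.5
y = 2 * x + rng.normal(size=100_000)                    # draws of Y given X = x

cs = np.linspace(0.0, 6.0, 601)                         # candidate constant predictions
risks = np.array([np.mean((c - y) ** 2) for c in cs])   # Monte Carlo conditional risks
print(cs[np.argmin(risks)], y.mean())                   # both approximately 3.0
```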
Linear regression models
When side-information is encoded as vectors of real numbers $x = (x_1, \ldots, x_d)$ (called features or variables), it is common to use a linear regression model, such as the following:
$$Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
◮ Parameters: $w = (w_1, \ldots, w_d) \in \mathbb{R}^d$, $\sigma^2 > 0$.
◮ $X = (X_1, \ldots, X_d)$, a random vector (i.e., a vector of random variables).
◮ Conditional distribution of $Y$ given $X$ is normal.
◮ Marginal distribution of $X$ not specified.

In this model, the regression function $f^\star$ is a linear function $f_w : \mathbb{R}^d \to \mathbb{R}$,
$$f_w(x) = x^\top w = \sum_{i=1}^d x_i w_i, \qquad x \in \mathbb{R}^d.$$
(We'll often refer to $f_w$ just by $w$.)

[Figure: a one-dimensional sample from such a model, with the regression function $f^\star$.]

32 / 94
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid, with
$$Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
(Traditional to study linear regression in the context of this model.)

Log-likelihood of $(w, \sigma^2)$, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$$\sum_{i=1}^n \left( -\frac{1}{2\sigma^2}\big(x_i^\top w - y_i\big)^2 + \frac{1}{2} \ln \frac{1}{2\pi\sigma^2} \right) + \big\{\text{terms not involving } (w, \sigma^2)\big\}.$$

The $w$ that maximizes the log-likelihood is also the $w$ that minimizes
$$\frac{1}{n} \sum_{i=1}^n \big(x_i^\top w - y_i\big)^2.$$

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model...
33 / 94
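A hedged numpy/scipy sketch (my own, with synthetic data) of the equivalence above: numerically maximizing the Gaussian log-likelihood over $(w, \sigma^2)$ returns the same $w$ as least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=n)

def neg_log_likelihood(params):
    w, log_sigma = params[:d], params[d]             # parameterize sigma > 0 via its log
    sigma2 = np.exp(2.0 * log_sigma)
    resid = X @ w - y
    return np.sum(resid ** 2 / (2.0 * sigma2) + 0.5 * np.log(2.0 * np.pi * sigma2))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d + 1)).x[:d]
w_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_lsq, atol=1e-3))          # True: MLE for w equals least squares
```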
Empirical distribution and empirical risk
The empirical distribution $P_n$ on $(x_1, y_1), \ldots, (x_n, y_n)$ has probability mass function $p_n$ given by
$$p_n((x, y)) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{(x, y) = (x_i, y_i)\}, \qquad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$$

Plug-in principle: the goal is to find a function $f$ that minimizes the (squared loss) risk $R(f) = \mathbb{E}[(f(X) - Y)^2]$. But we don't know the distribution $P$ of $(X, Y)$.

Replace $P$ with $P_n$ → empirical (squared loss) risk $\widehat{R}(f)$:
$$\widehat{R}(f) := \frac{1}{n} \sum_{i=1}^n \big(f(x_i) - y_i\big)^2.$$

("Plug-in principle" is used throughout statistics in this same way.)
34 / 94
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns
$$\hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \widehat{R}(w),$$
which coincides with the MLE under the basic linear regression model.

In general:
◮ MLE makes sense in the context of the statistical model for which it is derived.
◮ ERM makes sense in the context of the general iid model for supervised learning.

Further remarks:
◮ In MLE, we assume a model, and we not only maximize likelihood, but can try to argue that we "recover" a "true" parameter.
◮ In ERM, by default there is no assumption of a "true" parameter to recover. Useful examples: medical testing, gene expression, ...
35 / 94
Old faithful data under this least squares statistical model
Recall our data, consisting of historical records of eruptions:

[Figure: timeline of eruption start times $a_i$ and end times $b_i$; each $Y_i$ is the gap between the end of one eruption and the start of the next, up to the current time $t$.]

Statistical model (not just iid!): $Y_1, \ldots, Y_n, Y \sim_{\text{iid}} \mathcal{N}(\mu, \sigma^2)$.
◮ Data: $Y_i := a_i - b_{i-1}$, $i = 1, \ldots, n$.
(Admittedly not a great model, since durations are non-negative.)

Task: at a later time $t$ (when an eruption ends), predict the time of the next eruption, $t + Y$.

For the linear regression model, we'll assume $Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2)$, $x \in \mathbb{R}^d$. (This extends the model above if we add the "1" feature.)
36 / 94
- 9. Regularization and ridge regression
Inductive bias
Suppose the ERM solution is not unique. What should we do?

One possible answer: pick the $w$ of shortest length.
◮ Fact: the shortest solution $\hat{w}$ to $(A^\top A) w = A^\top b$ is always unique.
◮ Obtain $\hat{w}$ via $\hat{w} = A^+ b$, where $A^+$ is the (Moore-Penrose) pseudoinverse of $A$.

Why should this be a good idea?
◮ The data does not give a reason to choose a shorter $w$ over a longer $w$.
◮ The preference for shorter $w$ is an inductive bias: it will work well for some problems (e.g., when the "true" $w^\star$ is short), not for others.

All learning algorithms encode some kind of inductive bias.
37 / 94
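An illustrative numpy sketch (not from the slides): when $(A^\top A)w = A^\top b$ has many solutions, $A^+ b$ is the one of minimum Euclidean norm. The matrix is deliberately wide ($n < d$) so the ERM solution cannot be unique.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 8                                          # n < d: the ERM solution is not unique
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b                        # minimum-norm solution A^+ b
_, _, Vt = np.linalg.svd(A)
w_other = w_min + 3.0 * Vt[-1]                       # another ERM solution (null-space shift)

print(np.allclose(A @ w_min, A @ w_other))                  # True: identical predictions and risk
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))      # True: the pinv solution is shortest
```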
Example
ERM with scaled trigonometric feature expansion:
$$\phi(x) = \Big(1, \sin(x), \cos(x), \tfrac{1}{2}\sin(2x), \tfrac{1}{2}\cos(2x), \tfrac{1}{3}\sin(3x), \tfrac{1}{3}\cos(3x), \ldots\Big).$$

[Figure: training data, shown with an arbitrary ERM fit and with the least $\ell_2$ norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
38 / 94
Regularized ERM
Combine the two concerns: for a given $\lambda \geq 0$, find the minimizer of
$$\widehat{R}(w) + \lambda \|w\|_2^2$$
over $w \in \mathbb{R}^d$.

Fact: if $\lambda > 0$, then the solution is always unique (even if $n < d$)!
◮ This is called ridge regression. ($\lambda = 0$ is ERM / ordinary least squares.) Explicit solution: $(A^\top A + \lambda I)^{-1} A^\top b$.
◮ The parameter $\lambda$ controls how much attention is paid to the regularizer $\|w\|_2^2$ relative to the data-fitting term $\widehat{R}(w)$.
◮ Choose $\lambda$ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is $\ell_1$.
39 / 94
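A minimal numpy sketch (mine, not the lecture's code) of the closed form above: the regularized solution is unique even with $n < d$, and as $\lambda \to 0$ it approaches the minimum-norm ERM solution $A^+ b$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 10, 20, 0.1                       # n < d: unregularized ERM would not be unique
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
A, b = X / np.sqrt(n), y / np.sqrt(n)         # same scaling as in the slides

w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)   # (A^T A + lam I)^{-1} A^T b
print(w_ridge.shape)                          # (20,): a unique solution despite n < d

# As lam -> 0, ridge approaches the minimum-norm least squares solution A^+ b.
w_tiny = np.linalg.solve(A.T @ A + 1e-8 * np.eye(d), A.T @ b)
print(np.allclose(w_tiny, np.linalg.pinv(A) @ b, atol=1e-4))    # True
```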