Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (fall 2020)
Learning objectives: the linear model, evaluation criteria, how to find the best fit, the geometric interpretation, and the maximum likelihood interpretation.

Motivation
Example: the effect of income inequality on health and social problems.
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
Notation: vectors are assumed to be column vectors, $x = [x_1, x_2, \ldots, x_D]^\top$.
We assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$.
Each instance has D features indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n.
Stacking the instances as rows gives the design matrix

$$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & & & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$

each row is a datapoint, each column is a feature.
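As a concrete illustration of this layout (not from the slides), a minimal NumPy sketch; the variable names and values below are made up:

```python
import numpy as np

# toy design matrix: N = 4 instances (rows), D = 3 features (columns); values are made up
X = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.2],
              [2.1, 0.0, 1.0],
              [0.7, 0.9, 0.1]])
y = np.array([1.2, 0.7, 2.5, 0.4])   # one label per instance

N, D = X.shape        # N = 4, D = 3
x_2 = X[1]            # the instance x^(2) (0-based row index 1)
x_3_of_2 = X[1, 2]    # feature d = 3 of instance n = 2
```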
Example: microarray data. The design matrix X contains gene expression levels, with one row per patient (n) and one column per gene (d); the label $y^{(n)}$ for each patient can be {cancer / no cancer}.
Linear model: $f_w(x) = w_0 + w_1 x_1 + \cdots + w_D x_D = w^\top x$ (with a constant feature $x_0 = 1$ absorbed into x), where $w_0$ is the bias or intercept and $w = [w_0, w_1, \ldots, w_D]^\top$ are the model parameters or weights. For now the target y is a scalar; we will generalize to a vector of targets later.
We want to minimize a measure of the difference between the prediction $\hat{y}^{(n)} = f_w(x^{(n)})$ and the label $y^{(n)}$.

Squared error loss (a.k.a. L2 loss) for a single instance: $\frac{1}{2}\big(y^{(n)} - \hat{y}^{(n)}\big)^2$.

Sum-of-squared-errors cost function for the whole dataset: $J(w) = \frac{1}{2}\sum_{n=1}^{N} \big(y^{(n)} - w^\top x^{(n)}\big)^2$.

The factor $\frac{1}{2}$ is there for future convenience; the loss is defined for a single instance versus the cost for the whole dataset.
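A minimal NumPy sketch of this cost function (an illustration, not code from the course):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squared-errors cost J(w) = 1/2 * sum_n (y^(n) - w^T x^(n))^2."""
    residuals = y - X @ w           # vector of y^(n) - w^T x^(n), shape (N,)
    return 0.5 * np.sum(residuals ** 2)
```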
The best fit minimizes the cost: $w^* = \arg\min_w \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$.

[figure: the fitted line $w^*$ through the datapoints $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), (x^{(4)}, y^{(4)})$; the vertical distance of each point from the line, e.g. $y^{(3)} - w^{*\top} x^{(3)}$, is its residual]
Matrix-vector form: each prediction $w^\top x^{(n)}$ is a product of shapes $(1 \times D)(D \times 1) \in \mathbb{R}$; collecting all instances with the design matrix (shapes $N \times D$, $D \times 1$, $N \times 1$) gives

$$J(w) = \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 = \frac{1}{2}\,\lVert y - Xw \rVert_2^2$$

the squared L2 norm of the residual vector. The cost function is a smooth function of w, so we can find the minimum by setting partial derivatives to zero.
Simplest case, both x and w scalar (a single feature, no intercept): $J(w) = \frac{1}{2}\sum_n \big(y^{(n)} - w\,x^{(n)}\big)^2$. Setting $\frac{dJ}{dw} = \sum_n \big(w\,x^{(n)} - y^{(n)}\big)\,x^{(n)} = 0$ gives

$$w^* = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n \big(x^{(n)}\big)^2}$$

This is the global minimum because the cost is smooth and convex (more on convexity later).
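A sketch of this single-feature case on made-up data (names and values are illustrative only):

```python
import numpy as np

# single feature, no intercept; the data values are made up
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 4.2, 5.8])

# setting dJ/dw = 0 gives w* = (sum_n x^(n) y^(n)) / (sum_n (x^(n))^2)
w_star = np.sum(x * y) / np.sum(x ** 2)
y_hat = w_star * x        # predictions of the fitted line through the origin
```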
For a multivariate function $J(w_1, w_2)$ we use partial derivatives instead of the derivative:

$$\frac{\partial}{\partial w_2} J(w_1, w_2) = \lim_{\epsilon \to 0} \frac{J(w_1, w_2 + \epsilon) - J(w_1, w_2)}{\epsilon}$$

i.e., the derivative when the other variables are fixed. A critical point is where all partial derivatives are zero. The gradient is the vector of all partial derivatives, $\nabla J(w) = \big[\frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D}\big]^\top$.
The cost is a smooth and convex function of w, so we set each partial derivative to zero:

$$\frac{\partial J}{\partial w_d} = \frac{\partial}{\partial w_d}\, \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 = 0$$

Using the chain rule, $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w}\,\frac{\partial f_w}{\partial w_d}$, we get

$$\frac{\partial J}{\partial w_d} = \sum_n \big(w^\top x^{(n)} - y^{(n)}\big)\, x_d^{(n)} = 0$$
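In vectorized form this gradient is one line of NumPy; a sketch (not the course's code):

```python
import numpy as np

def sse_gradient(w, X, y):
    """Gradient of J: dJ/dw_d = sum_n (w^T x^(n) - y^(n)) x_d^(n), for all d at once."""
    return X.T @ (X @ w - y)   # shape (D,); setting this to zero gives the normal equation
```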
Matrix form (using the design matrix): the D conditions above become $X^\top (Xw - y) = 0$, where $X^\top$ is $D \times N$ and the residual $y - Xw$ is $N \times 1$; each row enforces one of the D equations.

Normal equation: $X^\top X w = X^\top y$, so called because for the optimal w the residual vector $y - Xw$ is normal (orthogonal) to the column space of the design matrix, e.g. to its 2nd column. This is a system of D linear equations.
This is a linear system $Aw = b$ with $A = X^\top X$ and $b = X^\top y$; equivalently, $X^\top (y - Xw) = 0$. We can get a closed-form solution:

$$w^* = (X^\top X)^{-1} X^\top y$$

with shapes $(D \times D)(D \times N)(N \times 1)$. The matrix $(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X, and $X (X^\top X)^{-1} X^\top$ is the projection matrix into the column space of X: it maps y to the fitted values, leaving the residual $y - Xw^*$ orthogonal to that space.
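A sketch of the closed-form solution on synthetic data; the D x D system is solved directly rather than by forming the inverse (all names and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # N = 20, D = 3 (synthetic data)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# normal equations (X^T X) w = X^T y, solved as a D x D linear system
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# the residual is normal (orthogonal) to the column space of X
print(X.T @ (y - X @ w_star))                        # ~ 0 up to floating-point error
```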
What if the covariance matrix $X^\top X$ is not invertible? Recall the eigenvalue decomposition of the covariance matrix, $X^\top X = Q \Lambda Q^\top$, so that

$$(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top, \qquad \Lambda^{-1} = \begin{bmatrix} \frac{1}{\lambda_1} & & \\ & \ddots & \\ & & \frac{1}{\lambda_D} \end{bmatrix}$$

This matrix is not well-defined when some eigenvalues are zero. That happens if features are completely correlated, or more generally if the features are not linearly independent: there exist some coefficients $\{\alpha_d\}$, not all zero, such that $\sum_d \alpha_d x_d = 0$.
Example: having a binary feature $x_1$ as well as its negation $x_2 = (1 - x_1)$; together with the constant feature these are linearly dependent.

Consequence: if $w^*$ satisfies the normal equation $X^\top (y - X w^*) = 0$, then $w^* + c\alpha$ is also a solution for any c, because $X(w^* + c\alpha) = X w^* + c\,X\alpha = X w^*$ (the linear dependence means $X\alpha = 0$). The solution is no longer unique.
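A small sketch of this degeneracy with made-up data: a binary feature next to its negation (plus a bias column) makes $X^\top X$ singular, and adding any multiple of $\alpha$ leaves the predictions unchanged:

```python
import numpy as np

x1 = np.array([0.0, 1.0, 1.0, 0.0])
X = np.column_stack([np.ones(4), x1, 1.0 - x1])    # columns: bias, x1, x2 = 1 - x1
y = np.array([0.2, 1.1, 0.9, 0.3])

print(np.linalg.matrix_rank(X.T @ X))              # 2 < 3, so X^T X is singular
w_star = np.linalg.pinv(X) @ y                     # pseudo-inverse picks one solution

alpha = np.array([-1.0, 1.0, 1.0])                 # X @ alpha = 0 (the linear dependence)
print(np.allclose(X @ (w_star + 5.0 * alpha), X @ w_star))   # True: same predictions
```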
Non-invertibility also happens when we have many features and $D \geq N$ (e.g., the microarray data, with far more genes (d) than patients (n)).

Computational complexity of $w^* = (X^\top X)^{-1} X^\top y$ (shapes $(D \times D)(D \times N)(N \times 1)$): computing $X^\top X$ costs $O(D^2 N)$, since its $D \times D$ elements each require N multiplications; the matrix inversion costs $O(D^3)$; computing $X^\top y$ costs $O(ND)$, D elements each using N operations. The total complexity for N > D is $O(ND^2 + D^3)$.

In practice we don't directly use matrix inversion (it is numerically unstable); however, other more stable solutions (e.g., Gaussian elimination) have similar complexity.
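In NumPy this typically means calling a least-squares routine rather than inverting anything explicitly; a sketch with synthetic shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                      # N = 100, D = 5 (synthetic)
y = rng.normal(size=100)

# solves min_w ||y - Xw||^2 via an orthogonal factorization (no explicit inverse)
w_star, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```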
Multiple targets: instead of $y \in \mathbb{R}^{N}$ we have $Y \in \mathbb{R}^{N \times D'}$. The solution becomes $W^* = (X^\top X)^{-1} X^\top Y$, with shapes $(D \times D)(D \times N)(N \times D')$, and the predictions $X W^*$ have shape $(N \times D)(D \times D') = N \times D'$. There is a different weight vector for each target: each column of Y is associated with a column of W.
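A sketch with D' = 2 targets on synthetic data; the same normal equations are solved once, with one weight column per target:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, D_prime = 50, 4, 2
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, D_prime))                  # one column of targets per output

W_star = np.linalg.solve(X.T @ X, X.T @ Y)         # D x D': one weight vector per target
Y_hat = X @ W_star                                 # N x D' predictions
```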
So far we learned a linear function $f_w(x) = \sum_d w_d x_d$; sometimes this may be too simplistic.

Idea: create new, more useful features out of the initial set of given features, e.g. $x_1^2$, $x_1 x_2$, $\log(x_1)$. How about $x_1 + 2x_3$? (A linear combination of existing features adds no expressive power, since the model is already linear in them.)

Let's denote the set of all features by $\phi_d(x)\ \forall d$, where each $\phi_d$ is a (nonlinear) feature. The model becomes $f_w(x) = \sum_d w_d\, \phi_d(x)$, so $\phi_d(x)$ is the new $x_d$ and the problem of linear regression doesn't change. The solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, with $\Phi$ replacing the design matrix X:

$$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$$
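A sketch of such a feature map; the particular choice of nonlinear features below is illustrative only:

```python
import numpy as np

def feature_map(X_raw):
    """Example nonlinear features built from two raw inputs x1, x2 (illustrative choice)."""
    x1, x2 = X_raw[:, 0], X_raw[:, 1]
    return np.column_stack([np.ones(len(X_raw)),   # bias
                            x1, x2,                # the original features
                            x1 ** 2, x1 * x2,      # new nonlinear features
                            np.log(x1)])

rng = np.random.default_rng(3)
X_raw = rng.uniform(0.1, 2.0, size=(40, 2))        # synthetic positive inputs
y = 1.0 + X_raw[:, 0] ** 2 - 0.5 * X_raw[:, 0] * X_raw[:, 1] + 0.05 * rng.normal(size=40)

Phi = feature_map(X_raw)                           # Phi plays the role of the design matrix
w_star = np.linalg.lstsq(Phi, y, rcond=None)[0]    # the least-squares problem is unchanged
```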
Common choices of bases: polynomial bases $\phi_k(x) = x^k$; Gaussian bases $\phi_k(x) = \exp\!\big(\!-\frac{(x - \mu_k)^2}{s^2}\big)$; sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$.

[figure: fits using Gaussian and sigmoid bases] The green curve (our fit) is the sum of the scaled bases $w_k\,\phi_k(x)$ plus the intercept; each basis is scaled by the corresponding weight. We are using a fixed standard deviation of s = 1.
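A sketch of regression with Gaussian bases (centers on a grid, s = 1), following the formula above; the data and grid are made up:

```python
import numpy as np

def gaussian_design(x, centers, s=1.0):
    """Columns: an intercept plus phi_k(x) = exp(-(x - mu_k)^2 / s^2) for each center mu_k."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / s ** 2)
    return np.column_stack([np.ones(len(x)), phi])

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=50)                # synthetic 1-D inputs
y = np.sin(x) + 0.1 * rng.normal(size=50)

centers = np.linspace(0.0, 10.0, 9)                # mu_k on a grid, fixed s = 1
Phi = gaussian_design(x, centers)
w_star = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_hat = Phi @ w_star     # the fit: the intercept plus the weighted sum of the scaled bases
```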
Probabilistic interpretation: given the dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, the idea is to learn a probabilistic model $p(y \mid x; w)$.

Consider $p(y \mid x; w)$ with the following form:

$$p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$$

and assume a fixed variance, say $\sigma^2 = 1$.

Q: how to fit the model? A: maximize the conditional likelihood!

image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/
Likelihood: $L(w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; w)$.

Log-likelihood: $\ell(w) = -\frac{1}{2\sigma^2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 + \text{constants}$.

Maximum-likelihood parameters:

$$w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$$

which is exactly linear least squares! Whenever we use the squared error loss, we are assuming Gaussian noise.
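A small numerical check of this equivalence on synthetic data (assuming $\sigma^2 = 1$ as above): the least-squares weights also maximize the Gaussian log-likelihood:

```python
import numpy as np

def log_likelihood(w, X, y, sigma2=1.0):
    """Conditional log-likelihood of y under y ~ N(w^T x, sigma^2)."""
    r = y - X @ w
    return -0.5 / sigma2 * np.sum(r ** 2) - 0.5 * len(y) * np.log(2 * np.pi * sigma2)

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=60)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]        # linear least-squares solution
# perturbing w_ls can only lower the log-likelihood, since the two objectives coincide
print(log_likelihood(w_ls, X, y) >= log_likelihood(w_ls + 0.1, X, y))   # True
```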
Summary: linear regression models the targets as a linear function of the features. We fit the model by minimizing the sum of squared errors, which has a direct closed-form solution with complexity $O(ND^2 + D^3)$ and a probabilistic (maximum-likelihood) interpretation. We can build more expressive models by using any number of nonlinear features.